
WPA11 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Use the .Rmd format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Commenting your code: Indicate the current assignment (e.g., WPA11), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA11_yymmdd.Rmd could look like this:

---
title: "Your Assignment Title (WPA11)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---

This file contains my solutions: 

# Exercise 1

To show and run R code in your document, use a code chunk (without the '#' symbols):

# ```{r, exercise_1, echo = TRUE, eval = TRUE}
# 
# v <- c(1, 2, 3) # some vector
# sum(v)
#     
# ```

More text and code chunks... 

[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
  3. Complete as many exercises as you can by Wednesday (23:59).

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias.


Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA11_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)

library(rmarkdown)
library(knitr)
library(yarrr)

# Store original par() settings:
opar <- par(no.readonly = TRUE) # stores only writable settings (avoids warnings when restoring)
# par(opar) # restores original (default) par settings later
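Following the template convention above, this chunk could be declared with a name and the include = FALSE option as follows (shown with '#' symbols so it is not evaluated here; the chunk name load_packages is just an example):

# ```{r, load_packages, include = FALSE}
# library(rmarkdown)
# library(knitr)
# library(yarrr)
# ```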

0d. Make sure that you can create an .html output file by “knitting” your current document.


A. Theoretical Background

In this WPA, we will use the R package FFTrees to make binary classification decisions with fast-and-frugal trees (FFTs). To understand what FFTs are and which task they solve, we first need to cover some theoretical background before plunging into the concrete details of how to construct and evaluate FFTs.

What are FFTs?

Many real-world problems call for rapid and robust binary classification decisions. We may want to predict whether a patient is in peril, whether an investment is profitable, or whether a new partnership is promising. Such classifications have important consequences, yet are typically made under time-pressure and conditions of high risk and uncertainty.

Decision trees are aids or tools for making decisions by answering a sequence of questions. Decision trees generally consist of a series of nodes, representing questions (or cues), branches, representing the answers to questions, and leaves, representing the resulting classification decisions. A fast-and-frugal tree (FFT) is a special case of a simple decision tree that makes binary classification decisions and contains exactly two branches at each node, where either one or both branches are leaves (see Martignon et al., 2008, for additional details). Thus, an FFT with n nodes classifies a case into one of two categories (e.g., dead or alive, yes or no, 1 or 0, signal or noise, etc.) by answering a series of 1 to n questions.

An example FFT

As an example, consider an FFT that can be used as a screening tool for clinical depression (developed by Jenny et al., 2013):

To use this particular FFT, you would start with the first question (the parent node): “Have you cried more than usual in the last week?” If the answer to this question is “no”, the person to be diagnosed is immediately classified as not having a clinically depressed mood – no further information is considered. By contrast, if the answer to the first question is “yes”, then the second question is asked, etc. Note that only the last node in the FFT (Question 4) contains two leaves, which correspond to the two possible decisions.
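To make this sequential exit structure concrete, here is a minimal sketch of such a four-question FFT as a plain R function. Only the first question (about crying) is given above; the cues q2 to q4 are hypothetical placeholders, and we assume, for illustration, that each intermediate “no” answer immediately exits to the “no depressed mood” classification:

# Minimal sketch of a 4-question FFT (q2 to q4 are hypothetical placeholders):
screen_mood <- function(cried_more, q2, q3, q4) {
  if (!cried_more) return("no depressed mood")       # exit at Question 1
  if (!q2) return("no depressed mood")               # exit at Question 2
  if (!q3) return("no depressed mood")               # exit at Question 3
  if (q4) "depressed mood" else "no depressed mood"  # Question 4: two leaves
}

screen_mood(cried_more = FALSE, q2 = TRUE, q3 = TRUE, q4 = TRUE)
#> [1] "no depressed mood"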

FFTs solve a seemingly complex task – using the available information to make good binary classification decisions – without being complex themselves. Whereas statistical classification algorithms – such as logistic regression – are immensely powerful under ideal conditions (e.g., lots of information, computational power, time, etc.), the actual decisions of experts and laypeople often require decision strategies that work rapidly and reliably on the basis of limited and noisy information. Additionally, statistical procedures often just yield a probability, rather than a decision outcome, and thus need to be supplemented by an additional decision rule (e.g., a probability threshold). By contrast, FFTs directly yield a decision and are “frugal” by considering only a (typically small) number of cues and by stopping the search for additional information as soon as a leaf (or exit branch) is reached. Thus, an FFT is non-compensatory, as any potentially contradictory information in subsequent cues is ignored.

Evaluating classification outcomes

To evaluate the performance of classification algorithms, we need to understand two things:

First, there is a difference between fitting an algorithm or a model to existing data and predicting new instances with this algorithm or model. Without going into detail, a complex model that can adapt itself flexibly to any kind of data tends to have an advantage in fitting, relative to a simpler model. However, when the complex model captures irrelevant noise (overfitting), this advantage can turn into an obstacle when predicting new cases (out-of-sample prediction). (See Gigerenzer & Brighton, 2009, for details on the bias-variance dilemma.) In practice, this means that we will typically split our data set into multiple subsets – at least one for fitting (called training below) and another one for predicting (testing) – when constructing and evaluating FFTs.

Second, binary classification decisions yield four possible outcomes. The corresponding 2 x 2 classification table (also known as a confusion matrix) is used to assess the performance of any classification model:

|             | Truth: 1        | Truth: 0                   |
|-------------|-----------------|----------------------------|
| Decision: 1 | \(hi\) (hit)    | \(fa\) (false alarm)       |
| Decision: 0 | \(mi\) (miss)   | \(cr\) (correct rejection) |

In this table, columns refer to the frequencies of true criterion values (with the truth being either 1 or 0), while rows refer to the frequencies of decisions made by the model (with decisions also being either 1 or 0). Note that there are two correct outcomes (cells \(hi\) and \(cr\) refer to hits and correct rejections) and two incorrect outcomes (cells \(mi\) and \(fa\) refer to the frequencies of misses and false alarms, respectively).

A variety of measures combine these four frequencies into indices of classification accuracy. Among the most prominent are the following measures of correct performance:

  • An algorithm’s hit rate (also called sensitivity or recall) is defined as \(HR = hi/(hi + mi)\) and represents the percentage of hits (or true positives) relative to the number of true criterion values.

  • An algorithm’s specificity (SPC) is defined as \(SPC = cr/(cr + fa)\) and represents the percentage of correct rejections (true negatives) relative to the number of false criterion values.

These two measures of accuracy are mirrored by two corresponding measures of error:

  • An algorithm’s miss rate is defined as \(MR = mi/(mi + hi)\) and represents the percentage of misses (or false negatives) relative to the number of true criterion values (with \(MR = 1 - HR\)).

  • An algorithm’s false alarm rate (FAR) is defined as \(FAR = fa/(fa + cr)\) and represents the percentage of false alarms (false positives) relative to the number of false criterion values (with \(FAR = 1 - SPC\)).

Aggregate measures that integrate the performance of all four cells include the following:

  • An algorithm’s accuracy can be calculated by dividing the sum of correctly classified cases by the sum of all cases: \((hi + cr)/(hi + mi + fa + cr)\).

  • An algorithm’s validity \(v\) subtracts FAR from HR: \(v = HR - FAR = HR + SPC - 1\).

Generally speaking, the goal of any classification algorithm is to maximize counts in cells \(hi\) and \(cr\) while minimizing counts in cells \(mi\) and \(fa\). Importantly, the interdependency of the four classification outcomes implies inevitable trade-offs. Whereas it is normally not possible to influence the criterion, shifting one’s decision threshold towards 1 (or “signal”) will increase both the number of hits and the number of false alarms, while simultaneously decreasing the number of correct rejections and misses. Shifting one’s decision threshold towards 0 (or “noise”) has the opposite effects. (Additional measures based on signal-detection theory – like d-prime and bias – and their relation to the structure of FFTs are discussed in Luan, Schooler, and Gigerenzer, 2011.)
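As a quick illustration, the following sketch computes these measures in R from the four frequencies of a classification table (the counts used here match Tree 1 (train) of the Titanic example in Section B):

# Frequencies of the four classification outcomes:
hi <- 171 # hits
mi <- 187 # misses
fa <-  63 # false alarms
cr <- 679 # correct rejections

HR  <- hi / (hi + mi)                  # hit rate (sensitivity): 0.48
SPC <- cr / (cr + fa)                  # specificity:            0.92
MR  <- mi / (mi + hi)                  # miss rate (1 - HR):     0.52
FAR <- fa / (fa + cr)                  # false alarm rate:       0.08
acc <- (hi + cr) / (hi + mi + fa + cr) # accuracy:               0.77
v   <- HR - FAR                        # validity:               0.39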

B. Tutorial

This section provides a brief overview of the main functions contained in the FFTrees package.

Step 0: Installing and loading the FFTrees package

Install the package:

install.packages("FFTrees") # installs the package

Once the package is installed, you can load it by calling:

library(FFTrees)            # loads the package

After loading the package, you can click on FFTrees in your package browser to view the package’s documentation pages. Calling FFTrees.guide() opens an overview of the package’s main functions.

Step 1: Preparing training and test data

The FFTrees package contains several datasets from the UCI Machine Learning Repository that you can use to construct and evaluate FFTs.

Inspect the titanic data set and split it into two subsets (for training and testing FFTs).

?titanic
dim(titanic)
#> [1] 2201    4
head(titanic)
#>   class   age  sex survived
#> 1 first adult male        1
#> 2 first adult male        1
#> 3 first adult male        1
#> 4 first adult male        1
#> 5 first adult male        1
#> 6 first adult male        1

str(titanic)
#> 'data.frame':    2201 obs. of  4 variables:
#>  $ class   : chr  "first" "first" "first" "first" ...
#>  $ age     : chr  "adult" "adult" "adult" "adult" ...
#>  $ sex     : chr  "male" "male" "male" "male" ...
#>  $ survived: int  1 1 1 1 1 1 1 1 1 1 ...

# Get some simple summary counts:
table(titanic$class, titanic$survived)
#>         
#>            0   1
#>   crew   673 212
#>   first  122 203
#>   second 167 118
#>   third  528 178
table(titanic$age, titanic$survived)
#>        
#>            0    1
#>   adult 1438  654
#>   child   52   57
table(titanic$sex, titanic$survived)
#>         
#>             0    1
#>   female  126  344
#>   male   1364  367
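Raw counts are easier to compare as proportions. As an optional extra, prop.table() normalizes each row of such a table:

# Optional: survival proportions by sex (proportions within rows):
round(prop.table(table(titanic$sex, titanic$survived), margin = 1), 2)
# Note: About 73% of females, but only 21% of males, survived.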

Let’s split the data set into two parts:

# Define data for training and testing: 
data <- titanic

n.cases <- nrow(data) # number of cases in data (population)
n.sample <- floor(.50 * n.cases) # sample size: 50% of all cases

set.seed(100) # for reproducible randomness
ix.train <- sort(x = sample(1:n.cases, size = n.sample, replace = FALSE), decreasing = FALSE) # draw random sample

titanic.train <- data[ix.train, ]  # data set for training tree
titanic.test  <- data[-ix.train, ] # data set for testing tree
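As a quick sanity check, the two subsets should jointly contain all original cases:

# Sanity check: both subsets partition the original data:
nrow(titanic.train) + nrow(titanic.test) == nrow(data)
#> [1] TRUE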

Step 2: Creating an FFTrees object

The FFTrees() function creates a new FFTrees object (internally represented as a list):

titanic.fft <- FFTrees(formula = survived ~ .,
                       data = titanic.train,
                       data.test = titanic.test)

typeof(titanic.fft)
#> [1] "list"

Step 3: Inspecting summary statistics

Inspecting the new object can be done by calling it directly or by applying summary() to it:

titanic.fft # shows the FFTrees object
#> FFT #1 predicts survived using 3 cues: {sex,class,age}
#> 
#> [1] If sex = {female}, predict True.
#> [2] If class != {first,second}, predict False.
#> [3] If age != {child}, predict False, otherwise, predict True.
#> 
#>                     train    test
#> cases       :n    1100.00 1101.00
#> speed       :mcu     1.95    1.95
#> frugality   :pci     0.51    0.51
#> accuracy    :acc     0.77    0.79
#> weighted    :wacc    0.70    0.73
#> sensitivity :sens    0.48    0.54
#> specificity :spec    0.92    0.92
#> 
#> pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4

summary(titanic.fft) # provides summaries of the FFTrees generated for training and test data
#> $train
#>   tree    n  hi  mi  fa  cr      sens      spec       ppv       npv
#> 1    1 1100 171 187  63 679 0.4776536 0.9150943 0.7307692 0.7840647
#> 2    2 1100 216 142 209 533 0.6033520 0.7183288 0.5082353 0.7896296
#> 3    3 1100 111 247  14 728 0.3100559 0.9811321 0.8880000 0.7466667
#>           far       acc      bacc      wacc       bpv    dprime      cost
#> 1 0.084905660 0.7727273 0.6963740 0.6963740 0.7574169 1.3167670 0.2272727
#> 2 0.281671159 0.6809091 0.6608404 0.6608404 0.6489325 0.8399171 0.3190909
#> 3 0.018867925 0.7627273 0.6455940 0.6455940 0.8173333 1.5820205 0.2372727
#>         pci      mcu
#> 1 0.5127273 1.949091
#> 2 0.3920455 2.431818
#> 3 0.6715909 1.313636
#>  [ reached getOption("max.print") -- omitted 1 row ]
#> 
#> $test
#>   tree    n  hi  mi  fa  cr      sens      spec       ppv       npv
#> 1    1 1101 189 164  63 685 0.5354108 0.9157754 0.7500000 0.8068316
#> 2    2 1101 228 125 224 524 0.6458924 0.7005348 0.5044248 0.8073960
#> 3    3 1101 137 216  20 728 0.3881020 0.9732620 0.8726115 0.7711864
#>          far       acc      bacc      wacc       bpv    dprime      cost
#> 1 0.08422460 0.7938238 0.7255931 0.7255931 0.7784158 1.4660825 0.2061762
#> 2 0.29946524 0.6830154 0.6732136 0.6732136 0.6559104 0.9001932 0.3169846
#> 3 0.02673797 0.7856494 0.6806820 0.6806820 0.8218990 1.6467882 0.2143506
#>         pci      mcu
#> 1 0.5124886 1.950045
#> 2 0.4030427 2.387829
#> 3 0.6718892 1.312443
#>  [ reached getOption("max.print") -- omitted 1 row ]

Step 4: Visualizing results

A key feature of the FFTrees package is that it makes it easy to create pleasing visualizations of FFTs. The following code shows the best training tree:

# (a) Plot best training tree:
plot(titanic.fft,
     data = "train",
     main = "Titanic (train)",
     decision.labels = c("dead", "alive")
     )

Note the various performance measures in the bottom panel. This particular tree achieves a high specificity of 92%, but its hit rate of 48% and its AUC (area under the curve) of 71% are less impressive.

If we now wanted to know how this tree performed in predicting new data, we simply need to replace the argument data = "train" with data = "test":

# (b) Plot best test tree:
plot(titanic.fft,
     data = "test",
     main = "Titanic (test)",
     decision.labels = c("dead", "alive")
     )

The showcues() function provides information about each cue: It plots, for each cue, the hit rate (\(HR\)) against the false alarm rate (\(FAR\)) of the best split that was found:

# (c) Plot cue accuracies (in ROC space):
showcues(titanic.fft,
         data = "test", 
         main = "Titanic cues"
         )

Any cue on the diagonal is essentially performing at chance level (with either a conservative or a liberal threshold for deciding that a signal is detected). The best possible cue performance would be at the top left corner (\(HR = 1\), \(FAR = 0\)).

Plotting another tree (specifically, tree number 3 of those shown in the ROC plot at the bottom right) is easy as well:

# Plot another test tree:
plot(titanic.fft,
     data = "test",
     tree = 3, 
     main = "Titanic (test)",
     decision.labels = c("dead", "alive")
     )

Note that the hit rate (HR) of this tree decreased even further (to 39%), but this corresponds to an increase in specificity (to 97%, or a further reduction in false alarms to 3%).

C. Exercises

Select one of the following datasets:

  • bank: Bank marketing data
  • contraceptive: Contraceptive use data
  • creditapproval: Credit approval data
  • forestfires: A forest fire dataset
  • voting: Voting data set
  • wine: Wine analysis dataset

and use it to complete the following exercises:

1. Data exploration: Provide a descriptive overview of the data (IVs and DVs).

2. Data preparation: Split your data into two subsets (for training and testing) in three different ways (a generic splitting sketch follows this list):

  1. training and testing sets of equal size (i.e., a 50%:50% split).
  2. a (relatively) large training set and a small testing set (75%:25%).
  3. a (relatively) small training set and a large testing set (25%:75%).
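One generic way to produce these splits is to wrap the sampling logic from Step 1 of the tutorial in a small helper function. This is only a sketch; split_data, df, train.prop, and mydata are hypothetical names:

# Sketch: split a data frame df into training and test subsets:
split_data <- function(df, train.prop, seed = 100) {
  set.seed(seed) # for reproducible randomness
  n.sample <- floor(train.prop * nrow(df)) # size of training set
  ix.train <- sample(1:nrow(df), size = n.sample, replace = FALSE)
  list(train = df[ix.train, ], test = df[-ix.train, ])
}

# Usage for the three required splits (mydata is your chosen data set):
# s1 <- split_data(mydata, train.prop = .50)
# s2 <- split_data(mydata, train.prop = .75)
# s3 <- split_data(mydata, train.prop = .25)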

3. FFT construction: Create FFT objects with the selected data sets.

4. FFT visualization: Plot training and test performance, as well as cue accuracy (for test performance).
How do the 3 different splits compare (in training vs. prediction)?

5. Error analysis: Identify the more relevant type of error (misses vs. false alarms) in your domain and try to minimize it. (See the hint below.)
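Hint: One way to shift the balance between misses and false alarms is the sens.w argument of FFTrees(), which appears in the pars line of the summary output in Section B and weights sensitivity relative to specificity. A sketch, reusing the Titanic split from Section B for concreteness:

# Sketch: weight sensitivity more heavily to reduce misses:
my.fft <- FFTrees(formula = survived ~ .,
                  data = titanic.train,
                  data.test = titanic.test,
                  sens.w = 0.8) # default: sens.w = 0.5 (see 'pars' above)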

Submission

That’s it – now it’s time to submit your assignment!

Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.


References

  • Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1), 107–143.

  • Jenny, M. A., Pachur, T., Williams, S. L., Becker, E., & Margraf, J. (2013). Simple rules for detecting depression. Journal of Applied Research in Memory and Cognition, 2(3), 149–157.

  • Luan, S., Schooler, L. J., & Gigerenzer, G. (2011). A signal-detection analysis of fast-and-frugal trees. Psychological Review, 118(2), 316–338.

  • Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352–361.

  • Phillips, N. D., Neth, H., Woike, J. K., & Gaissmaier, W. (2017). FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12(4), 344–368.


[WPA11.Rmd updated on 2018-01-22 16:40:32 by hn.]