Answers for WPA10 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Use the .Rmd Format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Commenting your code: Indicate the current assignment (e.g., WPA10), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA10_yymmdd.Rmd could look like this:

---
title: "Your Assignment Title (WPA10)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---

This file contains my solutions: 

# Exercise 1

To show and run R code in your document, use a code chunk (without the '#' symbols):

# ```{r, exercise_1, echo = TRUE, eval = TRUE}
# 
# v <- c(1, 2, 3) # some vector
# sum(v)
#     
# ```

More text and code chunks... 

[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->

  3. Complete as many exercises as you can by Wednesday (23:59).

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias.


General remarks and recommendations

In the following exercises, we will identify and address some typical problems that arise in the context of data cleaning. In the suggested solutions below, we favor explicitness over elegance and accuracy over speed. For most problems, R offers a vast variety of different solutions, some of which are faster, more frugal and more elegant than the ones shown here. But whenever dealing with data that was difficult or expensive to obtain, your primary focus should be on ensuring and preserving the integrity of the data. Understanding this priority will shift your concerns and working mode away from pure programming towards keeping valuable data safe and sound. Adopting the mindset of a data scientist results in a number of recommendations:

  • Play it safe: People (and yes, even those who qualify as data scientists) routinely make errors, and ideally learn from them. If you anticipate that things will get messed up somewhere, you can prepare for and recover from such events. Thus, always make and work with copies (e.g., of files of code and data, but also of data frames and variables within your code). This way, you will never alter or destroy the original if something goes wrong.

  • Scale it down: Rather than directly working with large data sets, consider pruning them by choosing just the parts of the data that you are currently working on (e.g., by selecting only the variables needed for solving a specific task). Even if you eventually aim to transform a huge data set, it is advisable to first develop and test your solution on some smaller parts or some dummy data that you can easily monitor, manipulate, and check.

  • Check all changes: Any transformations of and additions to the data should be followed by an immediate verification step. This typically involves two aspects: First, check whether the manipulation had the intended effect. Even if it did, also check for potential side effects (e.g., instances of TRUE turning into 1s, or NAs turning into 0s after certain operations)!

  • Effectiveness beats efficiency: Focus on getting things done, rather than always finding the fastest, most general, or pure and elegant solution. For instance, you know that R is much more suited for vector-based operations rather than loops. However, if a particular step is only needed once, it may be perfectly fine to use a loop to get the job done. Crucially, the most valuable resource to preserve here may be your personal time, rather than the speed of your code or your computer.
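
The kind of side effect mentioned under "Check all changes" can be made visible by a simple before/after comparison. The following is only a minimal sketch on made-up data (the vector passed is hypothetical and not part of the course files). It shows how replacing NA values by 0 achieves the intended effect, but also silently changes the type of the entire vector:

```r
# Hypothetical toy data (not from the course files):
passed <- c(TRUE, FALSE, NA, TRUE)

# Play it safe: work on a copy and keep the original untouched
passed.new <- passed
passed.new[is.na(passed.new)] <- 0  # intended effect: no more NA values

# Check the intended effect:
sum(is.na(passed.new))  # 0 missing values remain

# Check for side effects:
typeof(passed)      # "logical"
typeof(passed.new)  # "double": TRUE/FALSE silently turned into 1/0
```

As the original vector passed still exists, the change can be undone at any time.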

Even though the following exercises cover some ground, they only scratch the surface of data cleaning. This is necessarily so, as any new data set presents its own problems. Hadley Wickham quotes Tolstoy to express this in a very nice way:

Tidy datasets are all alike, but every messy dataset is messy in its own way. (In PDF, p. 2)

Fortunately, the R community is a rich, searchable and (sometimes) well-documented source of inspirations and solutions. For a more comprehensive treatment of these issues and additional pointers, see books like R for Data Science or explore the documentation of the tidyverse R package (which includes many useful packages, like dplyr and tidyr).

A. Data Cleaning

The process of data cleaning typically involves loading and checking the data, dealing with missing values, and initial screening of variables (e.g., for the distributions of values and potential outliers).

Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA10_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)

library(rmarkdown)
library(knitr)
library(yarrr)

# Store original par() settings:
opar <- par(no.readonly = TRUE) # only settable parameters (avoids warnings when restoring)
# par(opar) # restores original (default) par settings later

0d. Make sure that you can create an .html output-file by “knitting” your current document.

An online study

In this WPA, we will analyze data from a fictitious online experiment on the effects of virtual consumption on aesthetic appreciation and problem solving.

A total of 180 people were recruited online and instructed to imagine drinking either a cup of coffee or a pint of beer. After this manipulation of “virtual consumption” (or condition), they were confronted with two problems – an anagram puzzle and a decision making task (with the order of these tasks being counterbalanced) – and asked to evaluate a piece of art (on a Likert rating scale ranging from 1: ‘do not like at all’ to 10: ‘like very much’) and indicate their favorite type of music (out of a list containing six genres). The experiment also contained an attention check item between the problem solving and the aesthetic appreciation tasks. The study was concluded by a brief test that assessed each person’s level of numeracy (see details below).

In addition to variables corresponding to participants’ responses, the data contains meta-information on every experimental session, such as the ID of the final page fin.pg viewed by the participant and the total duration (in seconds) of his or her visit to the study server.

The (fictitious) data from this study is stored in a file that looks very similar to the raw data file of an actual online experiment. This means that the data is rectangular and tidy, which Hadley Wickham defines as each variable being stored in a column, each observation being stored as a row, and each observational unit being a table (see his pages or PDF for details). Much to the surprise (and dismay) of students in the social sciences, starting with a tidy file typically does not mean that the data can be analyzed immediately.

Instead, the process of data screening and cleaning takes substantial effort and time, and even relatively benign (rectangular) data files tend to require a lot of checking, recoding, re-checking, and reformatting before they can be analyzed and subjected to statistical tests.

Loading and checking data

1a. Our data is available online at http://Rpository.com/down/data/WPA10_exp.txt. Load this data into a data frame named raw.dat. (Hint: Note that the data file is tab-delimited and contains a header row. Also, include the option stringsAsFactors = FALSE to prevent character strings from being converted into factors.)

raw.dat <- read.table(file = "http://Rpository.com/down/data/WPA10_exp.txt",   # load online
                      # file = "data/WPA10_exp.txt", # load from local file. 
                      sep = "\t", header = TRUE, stringsAsFactors = FALSE)

Here’s how the first lines of the raw.dat should look:

head(raw.dat)
#>   initials          start.time sex age   fin.pg duration  task.1 accu.1
#> 1      W.J 2018-01-15 13:06:39   0  54 44832986      231 anagram   TRUE
#> 2      B.F 2018-01-15 18:44:48   f  77 44832986      329 anagram   TRUE
#> 3      H.W 2018-01-15 13:31:43   m  18 54852782      200 anagram  FALSE
#>   time.1   task.2 accu.2 time.2                   att.ch   cond b.rate
#> 1     30 decision      1     25  i read the insrctnions.   beer      5
#> 2     30 decision      1     48 I read the instructions. coffee    -77
#> 3     32 decision      1     36  Can I have your number?   beer      9
#>   c.rate b.like  c.like BNT.1 BNT.2a BNT.2b  BNT.3
#> 1    -77   Rock     -66  0.25  -77.0     10   0.50
#> 2      2    -66     Pop 30.00   40.0    -66 -77.00
#> 3    -77    Pop     -66  0.22    0.3    -66 -77.00
#>  [ reached getOption("max.print") -- omitted 3 rows ]

1b. Inspect some basic properties of the raw data:

  • How many rows and columns does raw.dat contain?
  • Are there any missing (NA) values?
  • What are the data types of the individual variables (columns)?

dim(raw.dat)        # number of rows (cases) and columns (variables)
#> [1] 180  22
sum(is.na(raw.dat)) # any missing values?
#> [1] 0
str(raw.dat)        # show structure of columns (variables)
#> 'data.frame':    180 obs. of  22 variables:
#>  $ initials  : chr  "W.J" "B.F" "H.W" "K.D" ...
#>  $ start.time: chr  "2018-01-15 13:06:39" "2018-01-15 18:44:48" "2018-01-15 13:31:43" "2018-01-15 14:04:32" ...
#>  $ sex       : chr  "0" "f" "m" "m" ...
#>  $ age       : int  54 77 18 23 48 60 51 72 65 42 ...
#>  $ fin.pg    : int  44832986 44832986 54852782 44832986 44832986 44832986 44832986 44832986 44832986 54852782 ...
#>  $ duration  : int  231 329 200 318 615 505 612 236 499 545 ...
#>  $ task.1    : chr  "anagram" "anagram" "anagram" "anagram" ...
#>  $ accu.1    : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
#>  $ time.1    : int  30 30 32 26 43 40 31 50 44 31 ...
#>  $ task.2    : chr  "decision" "decision" "decision" "decision" ...
#>  $ accu.2    : int  1 1 1 0 1 0 1 0 1 0 ...
#>  $ time.2    : int  25 48 36 54 22 37 39 43 24 55 ...
#>  $ att.ch    : chr  "i read the insrctnions." "I read the instructions." "Can I have your number?" "I redd the instruction." ...
#>  $ cond      : chr  "beer" "coffee" "beer" "coffee" ...
#>  $ b.rate    : int  5 -77 9 -77 9 -77 10 -77 5 -77 ...
#>  $ c.rate    : int  -77 2 -77 3 -77 3 -77 8 -77 7 ...
#>  $ b.like    : chr  "Rock" "-66" "Pop" "-66" ...
#>  $ c.like    : chr  "-66" "Pop" "-66" "Classic" ...
#>  $ BNT.1     : num  0.25 30 0.22 0.25 0.28 0.25 0.25 0.28 0.24 0.2 ...
#>  $ BNT.2a    : num  -77 40 0.3 -77 31 -77 -77 29 0.3 32 ...
#>  $ BNT.2b    : chr  "10" "-66" "-66" "10" ...
#>  $ BNT.3     : num  0.5 -77 -77 0.45 -77 -77 0.5 -77 -77 -77 ...

Recoding missing and erroneous values

2a. Copy raw.dat into another R object called dat. This way, you keep your original raw.dat safe in case you happen to corrupt the data you’re currently working with.

dat <- raw.dat

2b. It’s strange and suspicious that our data seems to contain no missing (NA) values. On closer inspection, some numeric variables contain frequent instances of -77 and some character variables contain frequent instances of "-66". Our web server uses exactly these values to indicate the absence of entries. Thus, recode all instances of these values in dat as NA. (Hint: Use logical indexing on the entire data frame to do this, rather than doing it for individual variables.)

# Note: 
dat[dat == NA] # yields all NAs (and a warning to use is.na() to check for NA values)
#>  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [24] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [47] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [70] NA NA NA NA NA NA
#>  [ reached getOption("max.print") -- omitted 3885 entries ]

dat[is.na(dat)] # suggests that there are NO (0) missing values
#> character(0)
sum(is.na(dat)) # suggests that there are NO (0) NA values, 
#> [1] 0
# but note that: 
table(dat$b.rate) # contains 90 instances of (the number) -77
#> 
#> -77   1   2   3   4   5   6   7   8   9  10 
#>  90   4   5   5  14   6   3  16   8  18  11
table(dat$b.like) # contains 90 instances of (the string) "-66"
#> 
#>     -66 Classic Country    Jazz     Pop     Rap    Rock 
#>      90      10      10      17      26       8      19

# Recode all instances of -77 and "-66" as NA: 
dat[dat == -77] <- NA   # recode (the number) -77 to NA
dat[dat == "-66"] <- NA # recode (the string) "-66" to NA

sum((is.na(dat))) # now finds many instances of NA
#> [1] 680
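
After recoding, it is worth checking which columns contain the new NA values. The following sketch does this on a small made-up data set that only mimics the columns b.rate and b.like. It also shows a possible alternative, namely declaring both codes as missing while reading the data via the na.strings argument of read.table(). This assumes that -77 and -66 never occur as valid values in any column:

```r
# Toy data in the same tab-delimited format (values are made up):
txt <- "b.rate\tb.like\n5\tRock\n-77\t-66\n9\tPop\n-77\tJazz"

# Option 1: read first, then recode by logical indexing (as above)
toy <- read.table(text = txt, sep = "\t", header = TRUE,
                  stringsAsFactors = FALSE)
toy[toy == -77]   <- NA
toy[toy == "-66"] <- NA

# Option 2: declare both codes as missing while reading the data
toy.2 <- read.table(text = txt, sep = "\t", header = TRUE,
                    na.strings = c("-77", "-66"),
                    stringsAsFactors = FALSE)

# Verify: count NA values per column and compare both results
colSums(is.na(toy))
all.equal(toy, toy.2)
```

Recoding after reading keeps the raw values visible for inspection, whereas na.strings is more compact when the missing-value codes are known in advance.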

3. Let’s check the plausibility of our subject-related variables and correct some obvious mistakes.

Inspect a histogram of the participants’ age distribution and determine the range of age values. Then set any implausible values to NA. (Hint: Assume that valid participants had to be adults from 18 to 99 years. Any values outside this range should be set to NA, but record how many people were recoded.)

# Check age values: 
sum(is.na(dat$age))          # shows 1 NA value
#> [1] 1
range(dat$age, na.rm = TRUE) # shows 2 implausible values
#> [1] -24 230
hist(dat$age, breaks = 50)   # shows a bi-modal distribution, but also 2 outliers: 

sum(dat$age < 18, na.rm = TRUE) # 1 person too young / invalid value
#> [1] 1
sum(dat$age > 99, na.rm = TRUE) # 1 person too old   / invalid value
#> [1] 1

dat$age[dat$age < 18] <- NA # recode (valid participants are at least 18 years old)
dat$age[dat$age > 99] <- NA # recode

# Check recoded age values:
sum(is.na(dat$age))          # shows 3 NA values
#> [1] 3
range(dat$age, na.rm = TRUE) # shows plausible range: 18-80 years
#> [1] 18 80
hist(dat$age, breaks = 10)   # shows plausible distribution
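
The hint asks us to record how many values were recoded. One way of doing this is to store the out-of-range cases in a logical vector before overwriting them. This is a sketch on made-up age values; the names age.lo, age.hi, out.of.range and n.recoded are hypothetical and not part of the exercise:

```r
# Made-up age values (two implausible entries and one NA):
age <- c(54, 77, -24, 230, NA, 18, 99)

age.lo <- 18  # youngest valid age
age.hi <- 99  # oldest valid age

# Flag values outside the valid range (existing NA values are not flagged):
out.of.range <- !is.na(age) & (age < age.lo | age > age.hi)

n.recoded <- sum(out.of.range)  # record how many values are recoded
age[out.of.range] <- NA

# Verify the result:
n.recoded                 # 2 values were recoded
sum(is.na(age))           # 3 NA values in total
range(age, na.rm = TRUE)  # 18 99
```

Defining the limits once (rather than typing them into every line) makes it harder for the check and the recoding step to drift apart.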

B. Data Wrangling

Data wrangling is the continuation of data cleaning. It typically involves rearranging existing variables (columns), defining new variables (e.g., filter variables or variables that compute some score from other variables), and defining and selecting subsets of cases (rows).

Renaming and recoding variables

4a. Rename the variable called sex to the more politically correct label of gender.

# Check current names:
names(dat)
#>  [1] "initials"   "start.time" "sex"        "age"        "fin.pg"    
#>  [6] "duration"   "task.1"     "accu.1"     "time.1"     "task.2"    
#> [11] "accu.2"     "time.2"     "att.ch"     "cond"       "b.rate"    
#> [16] "c.rate"     "b.like"     "c.like"     "BNT.1"      "BNT.2a"    
#> [21] "BNT.2b"     "BNT.3"

# Which column is named "sex"?
col.nr <- which(names(dat) == "sex")

# Rename the column "sex" to "gender":
names(dat)[col.nr] <- "gender"

# OR all in 1 step: 
names(dat)[names(dat) == "sex"] <- "gender" 

4b. Inspect the type of the new gender variable and create a table of the existing gender values. Then change any numeric values to the more informative character strings. Specifically, recode all values of 0 to “f” (for “female”), values of 1 to “m” (for “male”), and values of 3 to “other”. What is the resulting gender distribution in your data?

typeof(dat$gender) # is of type "character"
#> [1] "character"

# Check existing gender values:
table(dat$gender) # shows a mix of numeric and character data
#> 
#>     0     1     3     f     m other 
#>    28    27     1    69    54     1

# Recode all numeric characters as alphabetic labels:
dat$gender[dat$gender == "0"] <- "f"
dat$gender[dat$gender == "1"] <- "m"
dat$gender[dat$gender == "3"] <- "other"

table(dat$gender) # shows the new distribution
#> 
#>     f     m other 
#>    97    81     2

4c. Use the following recode.var() function to create a new variable gender.num that contains the same information as the character variable gender in numeric format. Add gender.num to dat. (Hint: Remember that new variables in R can be created by assigning them.)

# Create a new function called recode.var:
recode.var <- function(x,   # what vector do you want to recode?
                       old, # what values do you want to change?
                       new, # what should the new values be?
                       otherNA = TRUE, # should other values be converted to NA?
                       numeric = TRUE) { # should result be numeric?
  
  x.new <- x  # copy vector to x.new
  
  if (is.factor(x.new)) {x.new <- as.character(x.new)} # convert factors to character strings
 
  for (i in seq_along(old)) { # loop through all old values:
    x.new[x == old[i]] <- new[i]
    }
 
  if(otherNA) { # convert unspecified values to NA:
    x.new[(x %in% old) == FALSE] <- NA
    }
 
  if(numeric) {x.new <- as.numeric(x.new)} # convert vector to numeric values
 
  return(x.new)  # return new vector
}
gender.num <- recode.var(dat$gender, 
                         old = c("f", "m", "other"), 
                         new = c(0, 1, 2), 
                         otherNA = TRUE, numeric = TRUE)

table(gender.num) # to verify that distribution is identical
#> gender.num
#>  0  1  2 
#> 97 81  2

dat$gender.num <- gender.num # add new column to dat by assigning it

4d. Use the recode.var() function again to create a numeric variable cond.num that corresponds to the between-subjects condition indicated by cond (with beer being indicated by the value 1 and coffee being indicated by the value 2). Then add cond.num to a new column of dat.

table(dat$cond)
#> 
#>   beer coffee 
#>     90     90
dat$cond.num <- recode.var(dat$cond, 
                           old = c("beer", "coffee"), 
                           new = c(1, 2), 
                           otherNA = TRUE, numeric = TRUE)
table(dat$cond.num) # to verify that distribution is identical
#> 
#>  1  2 
#> 90 90
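
For simple one-to-one recodings, base R also allows using a named vector as a lookup table. The following is a sketch of this alternative (not a replacement for recode.var()) on made-up condition labels; the name lookup and the label "tea" are hypothetical:

```r
# Made-up condition labels (including one unexpected entry):
cond <- c("beer", "coffee", "coffee", "tea", "beer")

# Named vector: names are the old values, elements are the new values
lookup <- c(beer = 1, coffee = 2)

cond.num <- unname(lookup[cond])  # labels without a match yield NA
cond.num

# Verify: cross-tabulate old and new values
table(cond, cond.num, useNA = "ifany")
```

Like recode.var() with otherNA = TRUE, this approach turns all unspecified values into NA, so the cross-tabulation is a useful check for unexpected labels.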

Rearranging variables (columns)

5. Rearrange the columns of dat so that it more accurately reflects the procedure of the experiment. Specifically, the columns of dat should contain the following variables (from left to right):

  • the participant’s demographic information,
  • the participant’s experiment start time,
  • the attention check item (att.ch),
  • the between-subjects condition (in character and numeric format), followed by
  • the performance in the two problem solving tasks, and
  • the two aesthetic judgments. Finally,
  • the answers to the four numeracy questions (labeled as BNT) should appear, before noting
  • the ID of the final page fin.pg and the duration of the experiment (in seconds) as two final columns.

Store the revised order of variables in a new data frame called df.

dim(dat)   # check current dimensions
#> [1] 180  24
names(dat) # check current names and their indices
#>  [1] "initials"   "start.time" "gender"     "age"        "fin.pg"    
#>  [6] "duration"   "task.1"     "accu.1"     "time.1"     "task.2"    
#> [11] "accu.2"     "time.2"     "att.ch"     "cond"       "b.rate"    
#> [16] "c.rate"     "b.like"     "c.like"     "BNT.1"      "BNT.2a"    
#> [21] "BNT.2b"     "BNT.3"      "gender.num" "cond.num"

# List column indices for each part:
i.demo <- c(1, 3, 23, 4)
i.start <- 2
i.attc <- 13
i.cond <- c(14, 24)
i.prob <- 7:12   # 6 columns problem solving tasks
i.aest <- 15:18  # 4 columns aesthetic judgment
i.bnt <- 19:22   # 4 columns BNT
i.fin <- c(5, 6) # 2 final columns

# Combine parts into new order:
new.order <- c(i.demo, i.start, i.attc, i.cond, i.prob, i.aest, i.bnt, i.fin)
# new.order
length(new.order) == length(names(dat)) # ensure that no column is lost
#> [1] TRUE

# Rearrange columns and store result in new df:
df <- dat[ , new.order] # Copy to df to keep original dat

## Checks:
if (isTRUE(all.equal(dim(df), dim(dat)))) {  # if no row or column is lost:
  # dat <- df # overwrite original dat
  print("dat and df have equal dimensions.")
  }
#> [1] "dat and df have equal dimensions."

dim(df)   # check revised dimensions
#> [1] 180  24
names(df) # view revised names
#>  [1] "initials"   "gender"     "gender.num" "age"        "start.time"
#>  [6] "att.ch"     "cond"       "cond.num"   "task.1"     "accu.1"    
#> [11] "time.1"     "task.2"     "accu.2"     "time.2"     "b.rate"    
#> [16] "c.rate"     "b.like"     "c.like"     "BNT.1"      "BNT.2a"    
#> [21] "BNT.2b"     "BNT.3"      "fin.pg"     "duration"
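
Column indices are easy to mistype, and comparing lengths alone does not detect an index that was listed twice in place of another one. As a sketch of a more defensive variant, the new order can be stated by column names and checked before it is applied. The data frame toy and the vector new.names are made up for illustration:

```r
# Made-up data frame with a few of the columns discussed above:
toy <- data.frame(initials = c("A.B", "C.D"),
                  start.time = c("13:06", "18:44"),
                  gender = c("f", "m"),
                  age = c(54, 77),
                  stringsAsFactors = FALSE)

# Desired order, stated by name rather than by position:
new.names <- c("initials", "gender", "age", "start.time")

# Checks: every column appears exactly once in the new order
stopifnot(setequal(new.names, names(toy)),  # no column lost or invented
          anyDuplicated(new.names) == 0)    # no column listed twice

toy.2 <- toy[ , new.names]  # rearrange columns
names(toy.2)
```

Selecting by name also keeps working if columns are added to or removed from the data frame at an earlier point of the script, whereas fixed indices would then point to the wrong columns.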

Checkpoint 1

At this point, you have completed some basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…

Defining filter variables

6. Let’s define some useful filter variables and add them to df. Rather than removing participants from the data file, these variables can later be used to select different subsets of participants.

6a. The last page of the experiment had an ID of 44832986. Define a logical filter variable f.final that indicates whether the final page fin.pg that was seen by the participant matches this ID.

# Filter for fin.pg == "44832986":
f.final <- rep(NA, nrow(df)) # initialize filter variable
f.final[df$fin.pg == "44832986"] <- TRUE
f.final[df$fin.pg != "44832986"] <- FALSE

# Contrast distributions:
table(df$fin.pg)                  # without filter
#> 
#> 23978391 44832923 44832986 54334921 54852782 
#>        2        8      160        5        5
table(df$fin.pg[f.final == TRUE]) # with filter 
#> 
#> 44832986 
#>      160

6b. Theoretical considerations and your pretests indicate that a conscientious completion of this experiment takes at least 2 minutes and no longer than 10 minutes. Define a logical filter variable f.duration that indicates whether the duration of a participant falls within this range.

min.time <- 60 *  2 # min of  2 minutes
max.time <- 60 * 10 # max of 10 minutes

# Filter for duration within reasonable range:
f.duration <- rep(NA, nrow(df)) # initialize filter variable
f.duration[df$duration < min.time] <- FALSE
f.duration[df$duration > max.time] <- FALSE
f.duration[(df$duration >= min.time) & (df$duration <= max.time)] <- TRUE

# Contrast distributions without and with filter:
hist(df$duration[f.final == TRUE], breaks = 10) # distribution for people who reached final page

hist(df$duration[f.final == TRUE & f.duration == TRUE], breaks = 10) # same distribution with filter
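
The explicit initialization used above makes each step visible. As a more compact sketch, the same filters can be obtained directly from the comparisons, because a comparison already returns TRUE, FALSE, or NA for each element. The values below are made up for illustration:

```r
# Made-up values (including a missing duration):
fin.pg   <- c(44832986, 54852782, 44832986, 44832986)
duration <- c(231, 329, 27, NA)

min.time <- 60 *  2  # min of  2 minutes
max.time <- 60 * 10  # max of 10 minutes

# Each comparison yields one logical value per participant:
f.final    <- fin.pg == 44832986
f.duration <- (duration >= min.time) & (duration <= max.time)

f.final     # TRUE FALSE TRUE TRUE
f.duration  # TRUE TRUE FALSE NA
```

Note that a missing duration results in a filter value of NA (rather than FALSE), which corresponds to the behavior of the step-by-step version.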

6c. Your experiment included an attention check item that required participants to write the sentence “I read the instructions.” into an empty text field. Participants’ entries into this field are stored in the variable att.ch. Define a logical filter variable f.att that indicates whether a participant should be counted as having passed this test. (Hint: Inspect the range of existing answers and note that demanding the entry of the exact and entire phrase would only include a small minority of participants. Thus, consider allowing for a wider variety of answers by including all entries indicating that someone (a) noticed the test, and (b) provided some reasonable answer. Use the grep(pattern = "xyz", x, ...) function, which searches for a character sequence xyz in a character vector x.)

table(df$att.ch) # show answers
#> 
#>          Can I have your number?     I did read the instructions. 
#>                                2                               10 
#>                 I pay attention.              I read instrctions. 
#>                                9                               11 
#>          i read the insrctnions.           i read the instruction 
#>                                9                               10 
#>          I read the instruction.          I read the instructions 
#>                               16                               18 
#>         I READ THE INSTRUCTIONS!         I read the instructions. 
#>                               11                               25 
#>          I read the instructons.     I really rde the instrctions 
#>                                7                                3 
#>           I red the instructions          I redd the instruction. 
#>                                5                                8 
#>   My worker ID is 17856379947635                READ INSTRUCTIONS 
#>                                1                                9 
#>           Read the instructions. The instructions have been read. 
#>                               12                                5 
#>                         Whazzup?      Yes, I am paying attention. 
#>                                2                                5

# Filter for attention checks:
f.att <- rep(FALSE, nrow(df)) # initialize filter variable (to FALSE)
f.att[grep(pattern = "read", x = df$att.ch, ignore.case = TRUE, fixed = FALSE)] <- TRUE
f.att[grep(pattern = "the", x = df$att.ch, ignore.case = TRUE, fixed = FALSE)] <- TRUE
f.att[grep(pattern = "instruction", x = df$att.ch, ignore.case = TRUE, fixed = FALSE)] <- TRUE
f.att[grep(pattern = "attention", x = df$att.ch, ignore.case = TRUE, fixed = FALSE)] <- TRUE
# f.att[grep(pattern = "test", x = df$att.ch, ignore.case = TRUE, fixed = FALSE)] <- TRUE

# Show answers: 
table(df$att.ch[f.att == TRUE])  # for f.att == TRUE
#> 
#>     I did read the instructions.                 I pay attention. 
#>                               10                                9 
#>              I read instrctions.          i read the insrctnions. 
#>                               11                                9 
#>           i read the instruction          I read the instruction. 
#>                               10                               16 
#>          I read the instructions         I READ THE INSTRUCTIONS! 
#>                               18                               11 
#>         I read the instructions.          I read the instructons. 
#>                               25                                7 
#>     I really rde the instrctions           I red the instructions 
#>                                3                                5 
#>          I redd the instruction.                READ INSTRUCTIONS 
#>                                8                                9 
#>           Read the instructions. The instructions have been read. 
#>                               12                                5 
#>      Yes, I am paying attention. 
#>                                5
table(df$att.ch[f.att == FALSE]) # for f.att == FALSE
#> 
#>        Can I have your number? My worker ID is 17856379947635 
#>                              2                              1 
#>                       Whazzup? 
#>                              2
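
The four calls to grep() can also be combined into one. The following sketch uses grepl(), which returns one logical value per element (rather than the indices of matches), and a single regular expression in which the alternatives are separated by "|". The vector answers contains a few of the entries listed above plus one NA that was added for illustration:

```r
# A few of the answers listed above (plus one NA for illustration):
answers <- c("I read the instructions.", "READ INSTRUCTIONS",
             "I pay attention.", "Whazzup?",
             "Can I have your number?", NA)

# One pattern with four alternatives:
f.att <- grepl(pattern = "read|the|instruction|attention",
               x = answers, ignore.case = TRUE)

data.frame(answers, f.att)  # show answers and filter values side by side
```

Note that grepl() returns FALSE for NA entries, and that short patterns like "the" also match within longer words (e.g., in "other"). It therefore remains important to inspect the tables of accepted and rejected answers.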

6d. Add your filter variables to df (in case you haven’t already done so) and show how adding each filter (or any combination of them) reduces your participant sample.

# Add filters to df: 
df <- cbind(df, f.final, f.duration, f.att)
dim(df)
#> [1] 180  27

# Show consequences of filters on number of cases (participants):
# (a) without filter: 
nrow(df)
#> [1] 180

# (b) with 1 filter: 
nrow(df[df$f.final == TRUE, ])    
#> [1] 160
nrow(df[df$f.duration == TRUE, ])
#> [1] 162
nrow(df[df$f.att == TRUE, ])
#> [1] 173

# (c) with 2 filters: 
nrow(df[df$f.final == TRUE & df$f.duration == TRUE, ])    
#> [1] 143
nrow(df[df$f.duration == TRUE & df$f.att == TRUE, ])
#> [1] 156
nrow(df[df$f.final == TRUE & df$f.att == TRUE, ])
#> [1] 154

# (d) with 3 filters: 
nrow(df[df$f.final == TRUE & df$f.duration == TRUE & df$f.att == TRUE, ])
#> [1] 138

# OR: as a series of flat contingency tables:
with(df, ftable(f.final, f.duration, useNA = "ifany"))
#>         f.duration FALSE TRUE
#> f.final                      
#> FALSE                  1   19
#> TRUE                  17  143
with(df, ftable(f.final, f.att, useNA = "ifany"))
#>         f.att FALSE TRUE
#> f.final                 
#> FALSE             1   19
#> TRUE              6  154
with(df, ftable(f.duration, f.att, useNA = "ifany"))
#>            f.att FALSE TRUE
#> f.duration                 
#> FALSE                1   17
#> TRUE                 6  156

with(df, ftable(f.final, f.duration, f.att, useNA = "ifany"))
#>                    f.att FALSE TRUE
#> f.final f.duration                 
#> FALSE   FALSE                0    1
#>         TRUE                 1   18
#> TRUE    FALSE                1   16
#>         TRUE                 5  138
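
Counting cases with nrow() after subsetting works here because the tables above show no NA values in our filter variables. If a filter does contain NA values, subsetting rows adds a row of NAs for each of them and thus inflates the count. The following sketch on made-up data (the data frame toy is hypothetical) contrasts this with summing the logical condition:

```r
# Made-up filter variables (f.duration contains one NA):
toy <- data.frame(id = 1:5,
                  f.final    = c(TRUE, TRUE, FALSE, TRUE,  TRUE),
                  f.duration = c(TRUE, NA,   TRUE,  FALSE, TRUE))

# Subsetting rows by a condition that is NA adds a row of NAs:
nrow(toy[toy$f.duration == TRUE, ])        # 4 (3 matches plus 1 NA row)

# Summing the logical condition counts only definite matches:
sum(toy$f.duration == TRUE, na.rm = TRUE)  # 3

# Both filters combined:
sum(toy$f.final & toy$f.duration, na.rm = TRUE)  # 2
```

When in doubt, checking sum(is.na()) of each filter variable beforehand shows which of the two counts can be trusted.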

Defining categorical factors (bins)

7. Let’s split some continuous numerical variables into categorical factors. (Using the latter in your analyses will typically reduce statistical power, but may allow for simpler representations.)

7a. Define a factor dur.mins that classifies the duration of each participant into categories (bins) of full minutes. (Hint: Use the cut() function.)

# Determine number of bins needed:
range(df$duration, na.rm = TRUE)
#> [1]  27 649
max.dur <- max(df$duration, na.rm = TRUE)
n.bins <- ceiling(max.dur/60)
n.bins
#> [1] 11

# Classify duration into bins:
dur.mins <- cut(x = df$duration,
                breaks = seq(from = 0, to = (n.bins * 60), by = 60), # explicitly define n.bins of 60 seconds each
                labels = paste0(0:(n.bins - 1), "+"), # n.bins labels (from 0 to n.bins - 1)
                right = FALSE # to classify right break points into next category
                )

# Check result:
plot(y = df$duration, x = dur.mins)
for (i in 1:n.bins) {
  abline(h = i * 60, lty = 3)
}
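
To see what the option right = FALSE does at the break points, it helps to test cut() on a few values on and around them. The following sketch uses made-up durations and cross-checks the resulting bins against an integer division by 60 (the names dur.bins and full.min are hypothetical):

```r
# Made-up durations (in seconds) on and around the break points:
duration <- c(0, 59, 60, 119, 120, 179)

dur.bins <- cut(x = duration,
                breaks = seq(from = 0, to = 180, by = 60),
                labels = c("0+", "1+", "2+"),
                right = FALSE)  # intervals are [0, 60), [60, 120), [120, 180)

# Cross-check against the number of full minutes:
data.frame(duration, dur.bins, full.min = duration %/% 60)
```

With right = FALSE, a duration of exactly 60 seconds counts as one full minute ("1+"). With the default of right = TRUE it would fall into the first bin, and a duration of 0 would be classified as NA.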

7b. Define a factor age.group that classifies each participant into one of three categories (bins) depending on his or her age:

  1. Participants below the age of 30 should be classified as “junior”,
  2. participants from 30 to 59 years old should be classified as “mid-age”, and
  3. participants with an age of 60 or more years should be classified as “senior”.

(Hint: Use the cut() function with appropriate settings of the breaks and right options.)

age.group <- cut(x = df$age, 
                 breaks = c(0, 30, 60, 999), # explicitly define break points
                 labels = c("junior", "mid-age", "senior"), 
                 right = FALSE)

# Check result:
plot(y = df$age, x = age.group)
abline(h = 30, lty = 3)
abline(h = 60, lty = 3)

7c. How would an age.group.2 factor be defined if you used the cut() function with the option breaks = 3, rather than explicitly defining the specific age (in years) of each break point? Explain why this result occurs. (Hint: Compute the range of df$age and split it into 3 equal parts.)

age.group.2 <- cut(x = df$age, 
                   breaks = 3, # splits the range of x into 3 equal parts
                   labels = c("junior", "mid-age", "senior"), right = FALSE)

# Compute range:
min.age <- min(df$age, na.rm = TRUE)
max.age <- max(df$age, na.rm = TRUE)
cut.1 <- min.age + 1 * (max.age - min.age)/3
cut.2 <- min.age + 2 * (max.age - min.age)/3

# Check result:
plot(y = df$age, x = age.group.2)
abline(h = cut.1, lty = 3)
abline(h = cut.2, lty = 3)
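
The difference between both classifications becomes obvious when they are applied to the same values. The following sketch uses made-up ages that span the range of our cleaned data (18 to 80 years); the names labs, g.1 and g.2 are hypothetical:

```r
# Made-up ages spanning the range of the cleaned data (18 to 80):
age  <- c(18, 30, 45, 60, 80)
labs <- c("junior", "mid-age", "senior")

# Explicit break points (as in 7b):
g.1 <- cut(age, breaks = c(0, 30, 60, 999), labels = labs, right = FALSE)

# breaks = 3 divides the observed range into 3 intervals of equal width:
g.2 <- cut(age, breaks = 3, labels = labs, right = FALSE)

# Interior cut points implied by breaks = 3:
min(age) + (1:2) * diff(range(age)) / 3  # about 38.7 and 59.3

# Compare both classifications:
data.frame(age, g.1, g.2)  # age 30 is "mid-age" in g.1, but "junior" in g.2
```

As the break points of g.2 depend on the minimum and maximum of the observed values, the same person could be assigned to a different group whenever the sample changes.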

Handling (within- vs. between-subjects) factors

8. Separating combined columns: As the order of the anagram and decision making tasks was counterbalanced, the variables accu.1 and time.1 and accu.2 and time.2 in df currently mix information about the accuracy and solution times of both types of tasks. To analyze this data, we first need to disentangle these variables such that a participant’s responses (in terms of accuracy and time) to each type of task are stored in separate variables. Specifically, define the following variables:

  1. an.first should be a logical variable that is TRUE if and only if the anagram task was presented before the decision making task;
  2. an.accu should be a logical variable that contains the accuracy of the anagram task;
  3. an.time should be a numeric variable that contains the time of the anagram task;
  4. dm.accu should be a logical variable that contains the accuracy of the decision making task;
  5. dm.time should be a numeric variable that contains the time of the decision making task.
# head(df)

# (A) Check and select region of interest:

# Note that df$accu.2 is of type integer, rather than logical: 
typeof(df$accu.2)                  # "integer" (due to an -77/NA in the raw.data)
#> [1] "integer"
df$accu.2 <- as.logical(df$accu.2) # change to "logical" (1 := TRUE, 0 := FALSE)

i.t1  <- which(names(df) == "task.1")   # index of "task.1" column

dt <- df[, i.t1:(i.t1 + 5)] # copy 6 columns of interest into temporary data frame
head(dt)
#>    task.1 accu.1 time.1   task.2 accu.2 time.2
#> 1 anagram   TRUE     30 decision   TRUE     25
#> 2 anagram   TRUE     30 decision   TRUE     48
#> 3 anagram  FALSE     32 decision   TRUE     36
#> 4 anagram   TRUE     26 decision  FALSE     54
#> 5 anagram   TRUE     43 decision   TRUE     22
#> 6 anagram   TRUE     40 decision  FALSE     37
# str(dt)

# (B) Define new variables:

# Initialize and set new variables:
# 1: 
an.first <- rep(NA, nrow(dt))
an.first[dt$task.1 == "anagram"] <- TRUE
an.first[dt$task.1 == "decision"] <- FALSE

# 2: 
an.accu <- rep(NA, nrow(dt))
an.accu[dt$task.1 == "anagram"] <- dt$accu.1[dt$task.1 == "anagram"]
an.accu[dt$task.2 == "anagram"] <- dt$accu.2[dt$task.2 == "anagram"]

# 3: 
an.time <- rep(NA, nrow(dt))
an.time[dt$task.1 == "anagram"] <- dt$time.1[dt$task.1 == "anagram"]
an.time[dt$task.2 == "anagram"] <- dt$time.2[dt$task.2 == "anagram"]

# 4: 
dm.accu <- rep(NA, nrow(dt))
dm.accu[dt$task.1 == "decision"] <- dt$accu.1[dt$task.1 == "decision"]
dm.accu[dt$task.2 == "decision"] <- dt$accu.2[dt$task.2 == "decision"]

# 5: 
dm.time <- rep(NA, nrow(dt))
dm.time[dt$task.1 == "decision"] <- dt$time.1[dt$task.1 == "decision"]
dm.time[dt$task.2 == "decision"] <- dt$time.2[dt$task.2 == "decision"]

# (C) Add new variables to df:

# Combine new vectors to a data frame:
d.new <- data.frame(an.first, an.accu, an.time, dm.accu, dm.time)

## Combine original and new columns (to allow comparison):
# dy <- cbind(dt, d.new)
# head(dy)
# tail(dy)

# Insert d.new at the appropriate location of `df`:
dim(df)
#> [1] 180  27
i.t2  <- which(names(df) == "time.2") # index of "df$time.2" column
i.max <- ncol(df)                     # ncol of df

# Slice df into 2 parts:
left  <- df[ , 1:i.t2]
right <- df[ , (i.t2 + 1):i.max]

if (sum(names(df) == "an.first") == 0) { # "an.first" is not a column of df yet:
  df <- cbind(left, d.new, right)        # combine old and new parts of df
  }

dim(df)
#> [1] 180  32
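
A more compact alternative is possible if we assume that every participant completed exactly one anagram task and one decision making task. The sketch below uses ifelse() on a made-up data frame toy (not the study data); the names ending in .alt are hypothetical and only serve to avoid overwriting the variables defined above.

```r
# Hypothetical mini data set with counterbalanced task order (not the study data):
toy <- data.frame(task.1 = c("anagram", "decision", "anagram"),
                  accu.1 = c(TRUE, FALSE, FALSE),
                  time.1 = c(30, 41, 28),
                  task.2 = c("decision", "anagram", "decision"),
                  accu.2 = c(FALSE, TRUE, TRUE),
                  time.2 = c(25, 33, 52),
                  stringsAsFactors = FALSE)

first <- toy$task.1 == "anagram" # TRUE if the anagram task came first

# Pick the value from slot 1 or slot 2, depending on task order:
an.accu.alt <- ifelse(first, toy$accu.1, toy$accu.2)
an.time.alt <- ifelse(first, toy$time.1, toy$time.2)
dm.accu.alt <- ifelse(first, toy$accu.2, toy$accu.1)
dm.time.alt <- ifelse(first, toy$time.2, toy$time.1)

data.frame(first, an.accu.alt, an.time.alt, dm.accu.alt, dm.time.alt)
```

The explicit version above is longer, but it does not rely on the assumption that the task in the second slot is always the complement of the task in the first slot.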

9. Combining separated columns: An analogous problem occurs with the between-subjects factor cond. As every participant imagined either drinking beer or drinking coffee (but not both), each of the four variables b.rate, b.like, c.rate, and c.like is missing (NA) for the participants in the other condition. To analyze this data, we first need to combine the information that is currently distributed over two variables into one variable. Specifically, collect each participant’s aesthetic judgment and music preference in two variables:

  1. rate should be a numeric variable that contains all rating values; and
  2. like should be a character variable that contains all music preferences.
# head(df)

# (A) Check and select region of interest:

i.co  <- which(names(df) == "cond")   # index of "cond" column
i.br  <- which(names(df) == "b.rate") # index of "df$b.rate" column

dt <- df[ , c(i.co, i.br:(i.br + 3))] # copy 5 columns of interest into temporary data frame
head(dt)
#>     cond b.rate c.rate b.like  c.like
#> 1   beer      5     NA   Rock    <NA>
#> 2 coffee     NA      2   <NA>     Pop
#> 3   beer      9     NA    Pop    <NA>
#> 4 coffee     NA      3   <NA> Classic
#> 5   beer      9     NA   Jazz    <NA>
#> 6 coffee     NA      3   <NA>    Jazz

# (B) Define new variables:

# 1:
rate <- rep(NA, nrow(dt))
rate[dt$cond == "beer"]   <- dt$b.rate[dt$cond == "beer"]
rate[dt$cond == "coffee"] <- dt$c.rate[dt$cond == "coffee"]
# rate

# 1b: If (and only if) both columns are numeric AND exactly 1 of them is non-NA in each row, 
#     the following solution also works:
rate.alt <- rep(NA, nrow(dt))
# rate.alt <- rowSums(cbind(dt$b.rate, dt$c.rate), na.rm = TRUE) # would turn NA + NA into 0!
rate.alt <- rowMeans(cbind(dt$b.rate, dt$c.rate), na.rm = TRUE)  # sets mean(NA, NA) to NaN
# rate.alt
all.equal(rate, rate.alt) # check identity.
#> [1] TRUE

# 2:
like <- rep(NA, nrow(dt))
like[dt$cond == "beer"]   <- dt$b.like[dt$cond == "beer"]
like[dt$cond == "coffee"] <- dt$c.like[dt$cond == "coffee"]
# like

# (C) Add new variables to df:

# Combine new vectors to a data frame:
d.new <- data.frame(rate, like, stringsAsFactors = FALSE) # keep `like` as character (not factor)

# Insert d.new at the appropriate location of `df`:
dim(df)
#> [1] 180  32
i.cl  <- which(names(df) == "c.like") # index of "df$c.like" column
i.max <- ncol(df)                     # ncol of df

# Slice df into 2 parts:
left  <- df[ , 1:i.cl]
right <- df[ , (i.cl + 1):i.max]

if (sum(names(df) == "rate") == 0) { # "rate" is not a column of df yet:
  df <- cbind(left, d.new, right)    # combine old and new parts of df
  }

dim(df)
#> [1] 180  34
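
The slice-and-combine steps used in Exercises 8 and 9 can also be wrapped into a small function. The helper insert.after() below is a hypothetical sketch (it is not part of the original solution) that assumes unique column names. Unlike the code above, it also works when the anchor column is the last column of the data frame.

```r
# Hypothetical helper (not part of the original solution):
# Insert the columns of `new` into `data`, directly after the column named `after`.
insert.after <- function(data, new, after) {

  i <- which(names(data) == after) # position of the anchor column
  if (length(i) != 1) {
    stop("Column '", after, "' must occur exactly once in data.")
  }

  new <- as.data.frame(new, stringsAsFactors = FALSE)
  if (any(names(new) %in% names(data))) { # columns were added before:
    return(data)                          # leave data unchanged
  }

  n.old <- ncol(data)
  n.new <- ncol(new)
  both  <- cbind(data, new) # new columns are appended at the end

  # Reorder columns: left part, new columns, right part (may be empty):
  ix <- c(seq_len(i), n.old + seq_len(n.new), i + seq_len(n.old - i))
  both[ , ix, drop = FALSE]
}

# Usage on a small made-up data frame:
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
toy <- insert.after(toy, new = data.frame(x = 7:8), after = "b")
names(toy) # "a" "b" "x" "c"
```

Calling the function a second time with the same new columns returns the data unchanged, which mirrors the if () check used above.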

Checkpoint 2

If you got this far you’re doing great, but don’t give up just yet…

Evaluating test results

10. Let’s determine some additional information about each participant (i.e., a so-called trait factor) that can later be used to qualify our results by inter-individual differences. Here, we will use the four BNT columns to determine a numeracy score for each participant.

The Berlin Numeracy Test (BNT) is an adaptive test that uses two or three questions to categorize a person into one of four levels of numeracy. The reference paper for the test’s validation is:

  • Cokely, E.T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. (2012). Measuring risk literacy: The Berlin Numeracy Test. Judgment and Decision Making, 7, 25–47. (download PDF.)

See RiskLiteracy.org for an online version of the test and additional details on the underlying notion of numeracy.

The adaptive nature of the test means that not all participants need to answer all questions. Instead, the sequence of questions presented to a participant depends on his or her answers to previous questions. The five possible sequences (or paths) of questions and their corresponding classification (into a category from 1 to 4) are illustrated by Figure 1 (on p. 31 of Cokely et al., 2012): A red arrow below a question number indicates that the question was answered incorrectly, a green arrow below a question number indicates that the question was answered correctly. The bottom row shows that participants are classified, based on their responses to either two or three questions, into one of four levels of numeracy (i.e., obtain a score from 1 to 4, with 1 indicating the lowest and 4 indicating the highest level of numeracy).

The topic and correct answer of each question are presented in the following table:

Question       Topic          Correct answer
Question 1     choir          25%
Question 2a    5-sided die    30
Question 2b    loaded die     20
Question 3     mushrooms      50%

10a. Note that the entries in the four variables starting with BNT in df are somewhat inconsistent and need to be cleaned up prior to any analysis. Specifically, ensure that all variables are numeric and that all responses are within an appropriate range (i.e., 0–1 for Questions 1 and 3, and 1–100 for Questions 2a and 2b; see footnote 1).

head(df)
#>   initials gender gender.num age          start.time
#> 1      W.J      f          0  54 2018-01-15 13:06:39
#> 2      B.F      f          0  77 2018-01-15 18:44:48
#>                     att.ch   cond cond.num  task.1 accu.1 time.1   task.2
#> 1  i read the insrctnions.   beer        1 anagram   TRUE     30 decision
#> 2 I read the instructions. coffee        2 anagram   TRUE     30 decision
#>   accu.2 time.2 an.first an.accu an.time dm.accu dm.time b.rate c.rate
#> 1   TRUE     25     TRUE    TRUE      30    TRUE      25      5     NA
#> 2   TRUE     48     TRUE    TRUE      30    TRUE      48     NA      2
#>   b.like  c.like rate    like BNT.1 BNT.2a BNT.2b BNT.3   fin.pg duration
#> 1   Rock    <NA>    5    Rock  0.25     NA     10  0.50 44832986      231
#> 2   <NA>     Pop    2     Pop 30.00   40.0   <NA>    NA 44832986      329
#>   f.final f.duration f.att
#> 1    TRUE       TRUE  TRUE
#> 2    TRUE       TRUE  TRUE
#>  [ reached getOption("max.print") -- omitted 4 rows ]

# A) Note that df$BNT.2b is of type character, rather than numeric (double): 
typeof(df$BNT.2b) # "character" 
#> [1] "character"
table(df$BNT.2b)  # Note the word "twenty"!                
#> 
#>    0.2     10     15     18     19     20     21     22     25     30 
#>      7      6      8      1     12     34      7      4      7      6 
#> twenty 
#>      1
df$BNT.2b[df$BNT.2b == "twenty"] <- "20" # recode "twenty" as "20" (still character)
table(df$BNT.2b)  # No more "twenty".
#> 
#> 0.2  10  15  18  19  20  21  22  25  30 
#>   7   6   8   1  12  35   7   4   7   6

df$BNT.2b <- as.numeric(df$BNT.2b) # Change data type
typeof(df$BNT.2b) # "double" 
#> [1] "double"
table(df$BNT.2b)  # Numbers only, as it should be.
#> 
#> 0.2  10  15  18  19  20  21  22  25  30 
#>   7   6   8   1  12  35   7   4   7   6

# B) Recode number ranges:
table(df$BNT.1) # shows maxima at 0.25 (correct) and 25:
#> 
#>  0.1  0.2 0.22 0.24 0.25 0.26 0.28  0.3  0.5 0.52   10   20   25   30   50 
#>    5   13   11   11   65    2    8    2   10    1    2    6   28    3   11
# ASSUME that values above 1 were meant as PERCENTAGES (i.e., 25 means 25% and is correct):
i.g1 <- which(df$BNT.1 > 1) # indices of values > 1
df$BNT.1[i.g1] <- df$BNT.1[i.g1] / 100 # recode percentage as proportion
table(df$BNT.1) # all values now lie between 0 and 1
#> 
#>  0.1  0.2 0.22 0.24 0.25 0.26 0.28  0.3  0.5 0.52 
#>    7   19   11   11   93    2    8    5   21    1

table(df$BNT.2a) # shows maxima at 30 (correct) and 10 instances of 0.3:
#> 
#> 0.3  20  25  28  29  30  31  32  35  40 
#>  10   8   4   5  10  29   8   5   3   4
# ASSUME that value of 0.3 was meant as 30 (i.e., 0.3 is correct):
i.l1 <- which(df$BNT.2a == 0.3) # indices of values == 0.3
df$BNT.2a[i.l1] <- 30 # recode as 30
table(df$BNT.2a) # now has 1 < range < 100
#> 
#> 20 25 28 29 30 31 32 35 40 
#>  8  4  5 10 39  8  5  3  4

table(df$BNT.2b) # shows maxima at 20 (correct) and 7 instances of 0.2:
#> 
#> 0.2  10  15  18  19  20  21  22  25  30 
#>   7   6   8   1  12  35   7   4   7   6
# ASSUME that value of 0.2 was meant as 20 (i.e., 0.2 is correct):
i.l2 <- which(df$BNT.2b == 0.2) # indices of values == 0.2
df$BNT.2b[i.l2] <- 20 # recode as 20
table(df$BNT.2b) # now has 1 < range < 100
#> 
#> 10 15 18 19 20 21 22 25 30 
#>  6  8  1 12 42  7  4  7  6

table(df$BNT.3) # shows maxima at 0.50 (correct) and 50:
#> 
#> 0.25  0.3  0.4 0.45  0.5 0.55 0.75   25   50   75 
#>    4    2    2    1   23    3    1    9    5    3
# ASSUME that values above 1 were meant as PERCENTAGES (i.e., 50 means 50% and is correct):
i.g3 <- which(df$BNT.3 > 1) # indices of values > 1
df$BNT.3[i.g3] <- df$BNT.3[i.g3] / 100 # recode percentage as proportion
table(df$BNT.3) # all values now lie between 0 and 1
#> 
#> 0.25  0.3  0.4 0.45  0.5 0.55 0.75 
#>   13    2    2    1   28    3    4
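
Converting a character variable with as.numeric() turns every entry that is not a number into NA (with a warning), so entries like "twenty" are easily lost. The following sketch shows one way of listing such entries before converting. The vector raw is made up for illustration and is not part of the study data.

```r
# Hypothetical raw entries (not the study data):
raw <- c("20", "twenty", "0.2", NA, "19", "ca. 20")

num <- suppressWarnings(as.numeric(raw)) # non-numeric strings become NA
bad <- is.na(num) & !is.na(raw)          # NA introduced by the conversion only

raw[bad] # entries that need manual recoding: "twenty" and "ca. 20"
sum(bad) # number of entries that a direct conversion would lose
```

Entries that were already missing in raw are not flagged, as they do not result from the conversion.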

10b. Check for each case (participant) whether his or her entries in the four BNT variables of df are valid (i.e., can actually occur, given the logic of the test). This includes two conditions:

  1. the existing combination of answered questions is possible (irrespective of the answers), and
  2. the sequence of questions answered corresponds to a correct evaluation of previous answers.

Define a filter variable f.BNT.valid that is TRUE if and only if both conditions are met.

# head(df)

# (A) Check and select region of interest:
i.bnt1  <- which(names(df) == "BNT.1") # column index
i.bnt3  <- which(names(df) == "BNT.3") # column index

dt <- df[ , i.bnt1:i.bnt3] # copy 4 columns of interest into temporary data frame
head(dt)
#>   BNT.1 BNT.2a BNT.2b BNT.3
#> 1  0.25     NA     10  0.50
#> 2  0.30     40     NA    NA
#> 3  0.22     30     NA    NA
#> 4  0.25     NA     10  0.45
#> 5  0.28     31     NA    NA
#> 6  0.25     NA     20    NA


# (B) Define 2 functions on BNT data frame:
valid.pattern <- function(bnt) {
  
  v <- rep(FALSE, nrow(bnt)) # initialize all to FALSE
  
  if ( ncol(bnt) != 4) {
    stop("Error: BNT data must contain 4 questions (columns in df).")
  }
  
  for (i in 1:nrow(bnt)) {
    
    # Pattern 1: q1 and q2a are not NA, q2b and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == FALSE) & 
        (is.na(bnt[i, 3]) == TRUE) & (is.na(bnt[i, 4]) == TRUE)) { v[i] <- TRUE }
    
    # Pattern 2: q1 and q2b are not NA, q2a and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 3]) == FALSE) & 
        (is.na(bnt[i, 2]) == TRUE) & (is.na(bnt[i, 4]) == TRUE)) { v[i] <- TRUE }
    
    # Pattern 3: q1 and q2b and q3 are not NA, q2a is NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == TRUE) & 
        (is.na(bnt[i, 3]) == FALSE) & (is.na(bnt[i, 4]) == FALSE)) { v[i] <- TRUE }
  }
  
  return(v)
}

# corresponding filter:
f.valid.pattern <- valid.pattern(dt)
table(f.valid.pattern)
#> f.valid.pattern
#> FALSE  TRUE 
#>     5   175

valid.answers <- function(bnt) {
  
  v <- rep(FALSE, nrow(bnt)) # initialize all to FALSE
  
  if ( ncol(bnt) != 4) {
    stop("Error: BNT data must contain 4 questions (columns in df).")
  }
  
  for (i in 1:nrow(bnt)) {
    
    # Pattern 1: q1 and q2a are not NA, q2b and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == FALSE) & 
        (is.na(bnt[i, 3]) == TRUE) & (is.na(bnt[i, 4]) == TRUE) & 
        (bnt[i, 1] != 0.25)) # q1 was answered incorrectly: 
      { v[i] <- TRUE }
    
    # Pattern 2: q1 and q2b are not NA, q2a and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 3]) == FALSE) & 
        (is.na(bnt[i, 2]) == TRUE) & (is.na(bnt[i, 4]) == TRUE) &
        (bnt[i, 1] == 0.25) & (bnt[i, 3] == 20)) # q1 and q2b answered correctly:
    { v[i] <- TRUE }
    
    # Pattern 3: q1 and q2b and q3 are not NA, q2a is NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == TRUE) & 
        (is.na(bnt[i, 3]) == FALSE) & (is.na(bnt[i, 4]) == FALSE) &
        (bnt[i, 1] == 0.25) & (bnt[i, 3] != 20)) # q1 correct and q2b incorrect:
      { v[i] <- TRUE }
  }
  
  return(v)
}

# corresponding filter:
f.valid.answers <- valid.answers(dt)
table(f.valid.answers)
#> f.valid.answers
#> FALSE  TRUE 
#>     9   171

# (C) Filter:
f.BNT.valid <- rep(FALSE, nrow(dt))
f.BNT.valid <- f.valid.pattern & f.valid.answers

table(f.BNT.valid)
#> f.BNT.valid
#> FALSE  TRUE 
#>     9   171
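
The first condition (a possible combination of answered questions) can also be checked without a loop. The sketch below encodes which questions were answered in each row as a string of 0s and 1s and compares it with the three possible patterns. The data frame toy.bnt is made up for illustration and is not part of the study data.

```r
# Hypothetical BNT entries (not the study data):
toy.bnt <- data.frame(BNT.1  = c(0.25, 0.30, 0.25,   NA, 0.25),
                      BNT.2a = c(  NA,   40,   NA,   30,   30),
                      BNT.2b = c(  10,   NA,   20,   NA,   20),
                      BNT.3  = c(0.50,   NA,   NA,   NA,   NA))

answered <- !is.na(as.matrix(toy.bnt)) # TRUE where a question was answered

# Encode each row as a string of 0s and 1s (in the order q1, q2a, q2b, q3):
pattern <- apply(answered, 1, function(r) paste(as.integer(r), collapse = ""))

# The 3 possible patterns: q1 + q2a, q1 + q2b, q1 + q2b + q3
ok.pattern <- pattern %in% c("1100", "1010", "1011")

table(pattern) # also shows which invalid patterns occur
ok.pattern
```

A side benefit of this encoding is that table(pattern) lists the invalid combinations that occur, which helps when deciding how to treat them.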

10c. Define a variable BNT.score that contains the numeracy score (with a value of 1–4) of all participants for whom f.BNT.valid == TRUE and add it to df.

# (A) Extend previous function to include the validity check, 
#     but return numeric BNT score for valid cases:
score.BNT <- function(bnt) {
  
  s <- rep(NA, nrow(bnt)) # initialize all to NA
  
  if ( ncol(bnt) != 4) {
    stop("Error: BNT data must contain 4 questions (columns in df).")
  }
  
  for (i in 1:nrow(bnt)) {
    
    # Path 1: q1 and q2a are not NA, q2b and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == FALSE) & 
        (is.na(bnt[i, 3]) == TRUE) & (is.na(bnt[i, 4]) == TRUE) & 
        (bnt[i, 1] != 0.25) & # q1 was answered incorrectly & 
        (bnt[i, 2] != 30))    # q2a was answered incorrectly:
    { s[i] <- 1 } # Score is 1.
    
    # Path 2: q1 and q2a are not NA, q2b and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == FALSE) & 
        (is.na(bnt[i, 3]) == TRUE) & (is.na(bnt[i, 4]) == TRUE) & 
        (bnt[i, 1] != 0.25) & # q1 was answered incorrectly & 
        (bnt[i, 2] == 30))    # q2a was answered correctly:
    { s[i] <- 2 } # Score is 2.
    
    # Path 3: q1 and q2b are not NA, q2a and q3 are NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 3]) == FALSE) & 
        (is.na(bnt[i, 2]) == TRUE) & (is.na(bnt[i, 4]) == TRUE) &
        (bnt[i, 1] == 0.25) & (bnt[i, 3] == 20)) # q1 and q2b answered correctly:
    { s[i] <- 4 } # Score is 4.
    
    # Path 4: q1 and q2b and q3 are not NA, q2a is NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == TRUE) & 
        (is.na(bnt[i, 3]) == FALSE) & (is.na(bnt[i, 4]) == FALSE) &
        (bnt[i, 1] == 0.25) & (bnt[i, 3] != 20) & # q1 correct, q2b incorrect &
        (bnt[i, 4] != 0.50))                      # q3 incorrect:
      { s[i] <- 3 } # Score is 3.
    
    # Path 5: q1 and q2b and q3 are not NA, q2a is NA:
    if ((is.na(bnt[i, 1]) == FALSE) & (is.na(bnt[i, 2]) == TRUE) & 
        (is.na(bnt[i, 3]) == FALSE) & (is.na(bnt[i, 4]) == FALSE) &
        (bnt[i, 1] == 0.25) & (bnt[i, 3] != 20) & # q1 correct, q2b incorrect &
        (bnt[i, 4] == 0.50))                      # q3 correct:
      { s[i] <- 4 } # Score is 4.

  }
  
  return(s)
}


# (B) Use function to determine BNT scores:
BNT.score <- score.BNT(dt)
table(BNT.score) # distribution of scores
#> BNT.score
#>  1  2  3  4 
#> 46 38 23 64
sum(table(BNT.score)) # 171 non-NA cases
#> [1] 171

# Check:
BNT.score.v <- score.BNT(dt[which(f.BNT.valid), ]) # only considering f.BNT.valid cases
table(BNT.score.v) # same distribution for 171 cases
#> BNT.score.v
#>  1  2  3  4 
#> 46 38 23 64


# (C) Add new variables to df:

# Insert BNT.score at the appropriate location of `df`:
dim(df)
#> [1] 180  34
i.bnt3  <- which(names(df) == "BNT.3") # index of "BNT.3" column
i.max <- ncol(df)                      # ncol of df

# Slice df into 2 parts:
left  <- df[ , 1:i.bnt3]
right <- df[ ,  (i.bnt3 + 1):i.max]

if (sum(names(df) == "BNT.score") == 0) { # not a column of df yet:
  df <- cbind(left, BNT.score, right)    # combine old and new parts of df
  }

dim(df)
#> [1] 180  35
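
The five paths can alternatively be written as a decision tree for a single participant, which follows the adaptive logic of the test more directly. The function score.one() and the data frame toy.bnt below are hypothetical and only sketch this idea. The function assumes cleaned entries (as in 10a) and returns NA for every invalid case.

```r
# Hypothetical scoring function for ONE participant (not the original solution):
score.one <- function(q1, q2a, q2b, q3) {

  if (is.na(q1)) { return(NA_real_) } # q1 must always be answered

  if (q1 != 0.25) { # q1 incorrect: only q2a may follow
    if (is.na(q2a) || !is.na(q2b) || !is.na(q3)) { return(NA_real_) }
    return(if (q2a == 30) 2 else 1)
  }

  # q1 correct: q2b must follow (and q2a must be missing)
  if (!is.na(q2a) || is.na(q2b)) { return(NA_real_) }

  if (q2b == 20) { # q2b correct: test ends, q3 must be missing
    return(if (is.na(q3)) 4 else NA_real_)
  }

  if (is.na(q3)) { return(NA_real_) } # q2b incorrect: q3 is required
  if (q3 == 0.50) 4 else 3
}

# Hypothetical BNT entries (not the study data):
toy.bnt <- data.frame(BNT.1  = c(0.30, 0.22, 0.25, 0.25, 0.25, 0.25,   NA),
                      BNT.2a = c(  40,   30,   NA,   NA,   NA,   NA,   30),
                      BNT.2b = c(  NA,   NA,   20,   10,   10,   20,   NA),
                      BNT.3  = c(  NA,   NA,   NA, 0.45, 0.50, 0.50,   NA))

# Apply the function to each row:
toy.score <- mapply(score.one, toy.bnt$BNT.1, toy.bnt$BNT.2a,
                    toy.bnt$BNT.2b, toy.bnt$BNT.3)
toy.score # 1 2 4 3 4 NA NA
```

The last two rows are invalid (q3 answered although q2b was correct, and q1 missing) and therefore receive no score.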

Checkpoint 3

If you got this far you have turned a (relatively) tidy data file into a clean data file — well done! You are now in a position to begin actually analyzing the data…

C. Analyzing Data

11. The actual data analysis (in the sense of “doing statistics”) only starts when your data file is in good shape and contains all the variables required. Even at this point, it makes sense to explore and visualize key variables prior to applying any statistical tests.

11a. Explore and visualize some of the main dependent variables of this experiment. Do they vary as a function of the between-subjects factors of order or condition?

# Plot some variables to explore possible effects:
yarrr::pirateplot(formula = an.time ~ an.first + cond, data = df)

yarrr::pirateplot(formula = dm.time ~ an.first + cond, data = df)

yarrr::pirateplot(formula = rate ~ cond, data = df)

yarrr::pirateplot(formula = rate ~ BNT.score, data = df)

It looks like there might be some effects in the data – so let’s check them out…

11b. Apply some statistical tests to verify (or rather “attempt to falsify”) some hypothesized effects.

# [insert statistics here]
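
As one possible starting point, the effect of condition on the ratings could be examined with a two-sample t-test. The sketch below uses simulated data (the data frame sim, its group means and its standard deviations are made up for illustration), so its result says nothing about the actual study.

```r
# Simulated data (hypothetical values, not the study data):
set.seed(101)
sim <- data.frame(cond = rep(c("beer", "coffee"), each = 90),
                  rate = c(rnorm(90, mean = 6, sd = 2),
                           rnorm(90, mean = 4, sd = 2)))

# Welch two-sample t-test (the default of t.test, i.e., var.equal = FALSE):
tt <- t.test(rate ~ cond, data = sim)
tt

tt$estimate # group means
tt$p.value  # p-value of the test
```

With the cleaned data, the corresponding call would be t.test(rate ~ cond, data = df).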

That’s it – hope you enjoyed working on this assignment!


[WPA10_answers.Rmd updated on 2018-01-19 12:10:29 by hn.]


  1. We assume that participants were free to enter any kind of information in response to these questions. In a well-designed online study, the possible types and ranges of entries would have been restricted to exclude ambiguous and misleading answers.