
Answers for WPA10 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Use the .Rmd Format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Commenting your code: Indicate the current assignment (e.g., WPA10), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA10_yymmdd.Rmd could look like this:

---
title: "Your Assignment Title (WPA10)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---

This file contains my solutions: 

# Exercise 1

To show and run R code in your document, use a code chunk (without the '#' symbols):

# ```{r, exercise_1, echo = TRUE, eval = TRUE}
# 
# v <- c(1, 2, 3) # some vector
# sum(v)
#     
# ```

More text and code chunks... 

[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
  3. Complete as many exercises as you can by Wednesday (23:59).

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias.


General remarks and recommendations

In the following exercises, we will identify and address some typical problems that arise in the context of data cleaning. In the suggested solutions below, we favor explicitness over elegance and accuracy over speed. For most problems, R offers a vast variety of solutions, some of which are faster, more frugal, or more elegant than the ones shown here. But whenever dealing with data that was difficult or expensive to obtain, your primary focus should be on ensuring and preserving the integrity of the data. Understanding this priority will shift your concerns and working mode away from pure programming towards keeping valuable data safe and sound. Adopting the mindset of a data scientist results in a number of recommendations:

  • Play it safe: People (and yes, even those who qualify as data scientists) routinely make errors, and ideally learn from them. If you anticipate that things will get messed up somewhere, you can prepare for and recover from such events. Thus, always make and work with copies (e.g., of files of code and data, but also of data frames and variables within your code). This way, you will never alter or destroy the original if something goes wrong.

  • Scale it down: Rather than directly working with large data sets, consider pruning them by choosing just the parts of the data that you are currently working on (e.g., by selecting only the variables needed for solving a specific task). Even if you eventually aim to transform a huge data set, it is advisable to first develop and test your solution on some smaller parts or some dummy data that you can easily monitor, manipulate, and check.

  • Check all changes: Any transformations of and additions to the data should be followed by an immediate verification step. This typically involves two aspects: First, check whether the manipulation had the intended effect. Even if this is the case, always also check for potential side effects (e.g., instances of TRUE turning into 1, or NAs turning into 0 after certain operations)!

  • Effectiveness beats efficiency: Focus on getting things done, rather than always finding the fastest, most general, or pure and elegant solution. For instance, you know that R is much more suited for vector-based operations rather than loops. However, if a particular step is only needed once, it may be perfectly fine to use a loop to get the job done. Crucially, the most valuable resource to preserve here may be your personal time, rather than the speed of your code or your computer.

Even though the following exercises cover some ground, they only scratch the surface of data cleaning. This is necessarily so, as any new data set presents its own problems. Hadley Wickham quotes Tolstoy to express this nicely:

Tidy datasets are all alike, but every messy dataset is messy in its own way. (In PDF, p. 2)

Fortunately, the R community is a rich, searchable and (sometimes) well-documented source of inspirations and solutions. For a more comprehensive treatment of these issues and additional pointers, see books like R for Data Science or explore the documentation of the tidyverse R package (which includes many useful packages, like dplyr and tidyr).

A. Data Cleaning

The process of data cleaning typically involves loading and checking the data, dealing with missing values, and initial screening of variables (e.g., for the distributions of values and potential outliers).

Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA10_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)

library(rmarkdown)
library(knitr)
library(yarrr)

# Store original par() settings:
opar <- par()
# par(opar) # restores original (default) par settings later

0d. Make sure that you can create an .html output-file by “knitting” your current document.

An online study

In this WPA, we will analyze data from a fictitious online experiment on the effects of virtual consumption on aesthetic appreciation and problem solving.

A total of 180 people were recruited online and instructed to imagine drinking either a cup of coffee or a pint of beer. After this manipulation of “virtual consumption” (or condition), they were confronted with two problems – an anagram puzzle and a decision making task (with the order of these tasks being counterbalanced) – and asked to evaluate a piece of art (on a Likert rating scale ranging from 1: ‘do not like at all’ to 10: ‘like very much’) and indicate their favorite type of music (out of a list containing six genres). The experiment also contained an attention check item between the problem solving and the aesthetic appreciation tasks. The study was concluded by a brief test that assessed each person’s level of numeracy (see details below).

In addition to variables corresponding to participants’ responses, the data contains meta-information on every experimental session, such as the ID of the final page fin.pg viewed by the participant and the total duration (in seconds) of his or her visit to the study server.

The (fictitious) data from this study is stored in a file that looks very similar to the raw data file of an actual online experiment. This means that the data is rectangular and tidy, which Hadley Wickham defines as each variable being stored in a column, each observation being stored as a row, and each observational unit being a table (see his pages or PDF for details). Much to the surprise (and dismay) of students in the social sciences, starting with a tidy file typically does not mean that the data can be analyzed immediately.

Instead, the process of data screening and cleaning takes substantial effort and time, and even relatively benign (rectangular) data files tend to require substantial checking, recoding, re-checking, and reformatting before they can be analyzed and subjected to statistical tests.

Loading and checking data

1a. Our data is available online at http://Rpository.com/down/data/WPA10_exp.txt. Load this data into a data frame named raw.dat. (Hint: Note that the data file is tab-delimited and contains a header row. Also, include the option stringsAsFactors = FALSE if you want to avoid that character strings are interpreted as factors.)
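A minimal way of loading the file is sketched below (assuming the URL is reachable from your session):

```r
# Load the tab-delimited data file (with a header row) into a data frame:
raw.dat <- read.table("http://Rpository.com/down/data/WPA10_exp.txt",
                      sep = "\t",                # values are tab-delimited
                      header = TRUE,             # first row contains variable names
                      stringsAsFactors = FALSE)  # keep strings as characters
```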

Here’s how the first lines of the raw.dat should look:

head(raw.dat)
#>   initials          start.time sex age   fin.pg duration  task.1 accu.1
#> 1      W.J 2017-01-16 11:53:33   0  54 44832986      231 anagram   TRUE
#> 2      B.F 2017-01-16 17:31:42   f  77 44832986      329 anagram   TRUE
#> 3      H.W 2017-01-16 12:18:37   m  18 54852782      200 anagram  FALSE
#>   time.1   task.2 accu.2 time.2                   att.ch   cond b.rate
#> 1     30 decision      1     25  i read the insrctnions.   beer      5
#> 2     30 decision      1     48 I read the instructions. coffee    -77
#> 3     32 decision      1     36  Can I have your number?   beer      9
#>   c.rate b.like  c.like BNT.1 BNT.2a BNT.2b  BNT.3
#> 1    -77   Rock     -66  0.25  -77.0     10   0.50
#> 2      2    -66     Pop 30.00   40.0    -66 -77.00
#> 3    -77    Pop     -66  0.22    0.3    -66 -77.00
#>  [ reached getOption("max.print") -- omitted 3 rows ]

1b. Inspect some basic properties of the raw data:

  • How many rows and columns does raw.dat contain?
  • Are there any missing (NA) values?
  • Of which data types are the individual variables (columns)?
#> [1] 180  22
#> [1] 0
#> 'data.frame':    180 obs. of  22 variables:
#>  $ initials  : chr  "W.J" "B.F" "H.W" "K.D" ...
#>  $ start.time: chr  "2017-01-16 11:53:33" "2017-01-16 17:31:42" "2017-01-16 12:18:37" "2017-01-16 12:51:26" ...
#>  $ sex       : chr  "0" "f" "m" "m" ...
#>  $ age       : int  54 77 18 23 48 60 51 72 65 42 ...
#>  $ fin.pg    : int  44832986 44832986 54852782 44832986 44832986 44832986 44832986 44832986 44832986 54852782 ...
#>  $ duration  : int  231 329 200 318 615 505 612 236 499 545 ...
#>  $ task.1    : chr  "anagram" "anagram" "anagram" "anagram" ...
#>  $ accu.1    : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
#>  $ time.1    : int  30 30 32 26 43 40 31 50 44 31 ...
#>  $ task.2    : chr  "decision" "decision" "decision" "decision" ...
#>  $ accu.2    : int  1 1 1 0 1 0 1 0 1 0 ...
#>  $ time.2    : int  25 48 36 54 22 37 39 43 24 55 ...
#>  $ att.ch    : chr  "i read the insrctnions." "I read the instructions." "Can I have your number?" "I redd the instruction." ...
#>  $ cond      : chr  "beer" "coffee" "beer" "coffee" ...
#>  $ b.rate    : int  5 -77 9 -77 9 -77 10 -77 5 -77 ...
#>  $ c.rate    : int  -77 2 -77 3 -77 3 -77 8 -77 7 ...
#>  $ b.like    : chr  "Rock" "-66" "Pop" "-66" ...
#>  $ c.like    : chr  "-66" "Pop" "-66" "Classic" ...
#>  $ BNT.1     : num  0.25 30 0.22 0.25 0.28 0.25 0.25 0.28 0.24 0.2 ...
#>  $ BNT.2a    : num  -77 40 0.3 -77 31 -77 -77 29 0.3 32 ...
#>  $ BNT.2b    : chr  "10" "-66" "-66" "10" ...
#>  $ BNT.3     : num  0.5 -77 -77 0.45 -77 -77 0.5 -77 -77 -77 ...

Recoding missing and erroneous values

2a. Copy raw.dat into another R object called dat. This way, you keep your original raw.dat safe in case you happen to corrupt the data you’re currently working with.

2b. It’s strange and suspicious that our data seems to contain no missing (NA) values. On closer inspection, it turns out that some numeric variables contain frequent instances of -77 and some character variables contain frequent instances of "-66". These are exactly the values our web server uses to indicate the absence of entries. Thus, recode any missing values in dat as NA. (Hint: Use logical indexing on the entire data frame to do this, rather than doing it for individual variables.)
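One way of doing this uses logical indexing on the entire data frame (shown here on a small toy data frame, so the effect is easy to verify):

```r
# Toy data frame that mimics the server's missing-value codes:
dat <- data.frame(rate = c(5, -77, 9),
                  like = c("Rock", "-66", "Pop"),
                  stringsAsFactors = FALSE)

dat[dat == -77]   <- NA  # recode the numeric missing-value code
dat[dat == "-66"] <- NA  # recode the character missing-value code

sum(is.na(dat))  # => 2: both codes were replaced by NA
```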

#> [1] 0
#> 
#> -77   1   2   3   4   5   6   7   8   9  10 
#>  90   4   5   5  14   6   3  16   8  18  11
#> 
#>     -66 Classic Country    Jazz     Pop     Rap    Rock 
#>      90      10      10      17      26       8      19
#> [1] 680

3. Let’s check the plausibility of our subject-related variables and correct some obvious mistakes.

Inspect a histogram of the participants’ age distribution and determine the range of age values. Then set any implausible values to NA. (Hint: Assume that valid participants had to be adults from 18 to 99 years. Any values outside this range should be set to NA, but record how many people were recoded.)
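The recoding step could look like this (sketched on a toy vector; in the assignment, the values live in dat$age):

```r
age <- c(54, -24, 18, 230, 23)       # toy ages

hist(age)                            # inspect the distribution
range(age)                           # reveals clearly implausible values

implausible <- (age < 18 | age > 99) # flag values outside the valid range
sum(implausible)                     # record how many cases are recoded
age[implausible] <- NA               # set implausible ages to NA
range(age, na.rm = TRUE)             # => 18 54
```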

#> [1] 1
#> [1] -24 230

#> [1] 1
#> [1] 1
#> [1] 3
#> [1] 18 80

B. Data Wrangling

Data wrangling is the continuation of data cleaning. It typically involves rearranging existing variables (columns), defining new variables (e.g., filter variables or variables that compute some score from other variables), and defining and selecting subsets of cases (rows).

Renaming and recoding variables

4a. Rename the variable called sex to the more politically correct label of gender.

4b. Inspect the type of the new gender variable and create a table of the existing gender values. Then change any numeric values to the more informative character strings. Specifically, recode all values of 0 to “f” (for “female”), values of 1 to “m” (for “male”), and values of 3 to “other”. What is the resulting gender distribution in your data?
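A direct way to recode the numeric codes uses logical indexing (shown on a toy vector; the same logic applies to dat$gender):

```r
gender <- c("0", "f", "m", "1", "3", "f")  # toy gender values

gender[gender == "0"] <- "f"      # 0 denotes female
gender[gender == "1"] <- "m"      # 1 denotes male
gender[gender == "3"] <- "other"  # 3 denotes other
table(gender)                     # => f: 3, m: 2, other: 1
```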

#> [1] "character"
#> 
#>     0     1     3     f     m other 
#>    28    27     1    69    54     1
#> 
#>     f     m other 
#>    97    81     2

4c. Use the following recode.var() function to create a new variable gender.num that contains the same information as the character variable gender in numeric format. Add gender.num to dat. (Hint: Remember that new variables in R can be created by assigning them.)

# Create a new function called recode.var():
recode.var <- function(x,    # what vector do you want to recode?
                       old,  # what values do you want to change?
                       new,  # what should the new values be?
                       otherNA = TRUE,   # should other values be converted to NA?
                       numeric = TRUE) { # should result be numeric?
  
  x.new <- x  # copy vector to x.new
  
  if(class(x.new) == "factor") {x.new <- paste(x.new)} # convert factor to character
 
  for(i in 1:length(old)) { # loop through all old values:
    x.new[x == old[i]] <- new[i]
    }
 
  if(otherNA) { # convert unspecified values to NA:
    x.new[(x %in% old) == FALSE] <- NA
    }
 
  if(numeric) {x.new <- as.numeric(x.new)}  # convert vector to numeric values
 
  return(x.new)  # return new vector
}
#> gender.num
#>  0  1  2 
#> 97 81  2
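With recode.var() defined as above, the new numeric variable can be created and verified like this (a sketch on a toy vector; in the assignment, the input would be dat$gender and the result would be assigned to dat$gender.num):

```r
gender <- c("f", "m", "f", "other")  # toy gender values

gender.num <- recode.var(gender,
                         old = c("f", "m", "other"),
                         new = c(0, 1, 2))
gender.num  # => 0 1 0 2 (numeric)
```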

4d. Use the recode.var() function again to create a numeric variable cond.num that corresponds to the between-subjects condition indicated by cond (with beer being indicated by the value 1 and coffee being indicated by the value 2). Then add cond.num to a new column of dat.

#> 
#>   beer coffee 
#>     90     90
#> 
#>  1  2 
#> 90 90

Rearranging variables (columns)

5. Rearrange the columns of dat so that it more accurately reflects the procedure of the experiment. Specifically, the columns of dat should contain the following variables (from left to right):

  • the participant’s demographic information,
  • the participant’s experiment start time,
  • the attention check item (att.ch),
  • the between-subjects condition (in character and numeric format), followed by
  • the performance in the two problem solving tasks, and
  • the two aesthetic judgments. Finally,
  • the answers to the four numeracy questions (labeled as BNT) should appear, before noting
  • the ID of the final page fin.pg and duration of the experiment (in seconds) as the final columns.

Store the revised order of variables in a new data frame called df.
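One way to rearrange the columns is indexing by name (a sketch; the exact vector assumes that all variables created above exist in dat):

```r
# Select all columns of dat in the order reflecting the experimental procedure:
df <- dat[, c("initials", "gender", "gender.num", "age", "start.time",
              "att.ch", "cond", "cond.num",
              "task.1", "accu.1", "time.1", "task.2", "accu.2", "time.2",
              "b.rate", "c.rate", "b.like", "c.like",
              "BNT.1", "BNT.2a", "BNT.2b", "BNT.3",
              "fin.pg", "duration")]

all.equal(dim(dat), dim(df))  # same data, new column order
```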

#> [1] 180  24
#>  [1] "initials"   "start.time" "gender"     "age"        "fin.pg"    
#>  [6] "duration"   "task.1"     "accu.1"     "time.1"     "task.2"    
#> [11] "accu.2"     "time.2"     "att.ch"     "cond"       "b.rate"    
#> [16] "c.rate"     "b.like"     "c.like"     "BNT.1"      "BNT.2a"    
#> [21] "BNT.2b"     "BNT.3"      "gender.num" "cond.num"
#> [1] TRUE
#> [1] "dat and df have equal dimensions."
#> [1] 180  24
#>  [1] "initials"   "gender"     "gender.num" "age"        "start.time"
#>  [6] "att.ch"     "cond"       "cond.num"   "task.1"     "accu.1"    
#> [11] "time.1"     "task.2"     "accu.2"     "time.2"     "b.rate"    
#> [16] "c.rate"     "b.like"     "c.like"     "BNT.1"      "BNT.2a"    
#> [21] "BNT.2b"     "BNT.3"      "fin.pg"     "duration"

Checkpoint 1

At this point you have completed some basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…

Defining filter variables

6. Let’s define some useful filter variables and add them to df. Rather than removing participants from the data file, these variables can later be used to subset different samples of participants.

6a. The last page of the experiment had an ID of 44832986. Define a logical filter variable f.final that indicates whether the final page fin.pg that was seen by the participant matches this ID.

#> 
#> 23978391 44832923 44832986 54334921 54852782 
#>        2        8      160        5        5
#> 
#> 44832986 
#>      160

6b. Theoretical considerations and your pretests indicate that a conscientious completion of this experiment takes at least 2 minutes and no longer than 10 minutes. Define a logical filter variable f.duration that indicates whether the duration of a participant falls within this range.

6c. Your experiment included an attention check item that required participants to write the sentence “I read the instructions.” into an empty text field. Participants’ entries into this field are stored in the variable att.ch. Define a logical filter variable f.att that indicates whether a participant should be counted as having passed this test.

(Hint: Inspect the range of existing answers and note that demanding the entry of the exact and entire phrase would only include a small minority of participants. Thus, consider allowing for a wider variety of answers by including all entries indicating that someone (a) noticed the test, and (b) provided some reasonable answer. Use the grep(pattern = "xyz", x, ...) function to search for a character sequence xyz in a string x; its sibling grepl() returns a logical vector, which is exactly what a filter variable needs.)
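The three filters could be sketched as follows (on toy vectors; the grepl() pattern shown is only an assumption and must be tuned to the entries actually observed in att.ch):

```r
# Toy session data:
fin.pg   <- c(44832986, 54852782, 44832986)
duration <- c(231, 90, 400)
att.ch   <- c("I read the instructions.", "READ INSTRUCTIONS", "Whazzup?")

f.final    <- (fin.pg == 44832986)                 # reached the last page?
f.duration <- (duration >= 120 & duration <= 600)  # between 2 and 10 minutes?
f.att      <- grepl("instruct|instrct|attention",  # some reasonable answer?
                    att.ch, ignore.case = TRUE)

sum(f.final & f.duration & f.att)  # => 1 participant passes all filters
```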

#> 
#>          Can I have your number?     I did read the instructions. 
#>                                2                               10 
#>                 I pay attention.              I read instrctions. 
#>                                9                               11 
#>          i read the insrctnions.           i read the instruction 
#>                                9                               10 
#>          I read the instruction.          I read the instructions 
#>                               16                               18 
#>         I READ THE INSTRUCTIONS!         I read the instructions. 
#>                               11                               25 
#>          I read the instructons.     I really rde the instrctions 
#>                                7                                3 
#>           I red the instructions          I redd the instruction. 
#>                                5                                8 
#>   My worker ID is 17856379947635                READ INSTRUCTIONS 
#>                                1                                9 
#>           Read the instructions. The instructions have been read. 
#>                               12                                5 
#>                         Whazzup?      Yes, I am paying attention. 
#>                                2                                5
#> 
#>     I did read the instructions.                 I pay attention. 
#>                               10                                9 
#>              I read instrctions.          i read the insrctnions. 
#>                               11                                9 
#>           i read the instruction          I read the instruction. 
#>                               10                               16 
#>          I read the instructions         I READ THE INSTRUCTIONS! 
#>                               18                               11 
#>         I read the instructions.          I read the instructons. 
#>                               25                                7 
#>     I really rde the instrctions           I red the instructions 
#>                                3                                5 
#>          I redd the instruction.                READ INSTRUCTIONS 
#>                                8                                9 
#>           Read the instructions. The instructions have been read. 
#>                               12                                5 
#>      Yes, I am paying attention. 
#>                                5
#> 
#>        Can I have your number? My worker ID is 17856379947635 
#>                              2                              1 
#>                       Whazzup? 
#>                              2

6d. Add your filter variables to dat (in case you haven’t already done so) and show how adding each filter (or any combination of them) reduces your participant sample.

#> [1] 180  27
#> [1] 180
#> [1] 160
#> [1] 162
#> [1] 173
#> [1] 143
#> [1] 156
#> [1] 154
#> [1] 138
#>         f.duration FALSE TRUE
#> f.final                      
#> FALSE                  1   19
#> TRUE                  17  143
#>         f.att FALSE TRUE
#> f.final                 
#> FALSE             1   19
#> TRUE              6  154
#>            f.att FALSE TRUE
#> f.duration                 
#> FALSE                1   17
#> TRUE                 6  156
#>                    f.att FALSE TRUE
#> f.final f.duration                 
#> FALSE   FALSE                0    1
#>         TRUE                 1   18
#> TRUE    FALSE                1   16
#>         TRUE                 5  138

Defining categorical factors (bins)

7. Let’s split some continuous numerical variables into categorical factors. (Using the latter in your analyses will typically reduce statistical power, but may allow for simpler representations.)

7a. Define a factor dur.min that classifies the duration of each participant into categories (bins) of full minutes. (Hint: Use the cut() function.)
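A sketch using cut() on toy durations (in seconds), with breaks every 60 seconds:

```r
duration <- c(27, 231, 329, 615, 649)  # toy durations (in seconds)

dur.min <- cut(duration, breaks = seq(from = 0, to = 660, by = 60))
table(dur.min)  # each bin spans one full minute: (0,60], (60,120], ...
```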

#> [1]  27 649
#> [1] 11

7b. Define a factor age.group that classifies each participant into one of three categories (bins) depending on his or her age:

  1. Participants below the age of 30 should be classified as “junior”,
  2. participants from 30 to 59 years old should be classified as “mid-age”, and
  3. participants with an age of 60 or more years should be classified as “senior”.

(Hint: Use the cut() function with appropriate settings of the breaks and right options.)
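With explicit breaks, the classification could look like this (on toy ages; setting right = FALSE makes the intervals left-closed, so an age of exactly 30 or 60 lands in the higher category, as required):

```r
age <- c(18, 25, 30, 45, 60, 80)  # toy ages

age.group <- cut(age,
                 breaks = c(-Inf, 30, 60, Inf),
                 labels = c("junior", "mid-age", "senior"),
                 right = FALSE)  # [18,30) junior, [30,60) mid-age, [60,Inf) senior
table(age.group)  # => junior: 2, mid-age: 2, senior: 2
```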

7c. How would an age.group.2 factor be defined if you used the cut() function with the option breaks = 3, rather than explicitly defining the specific age (in years) of each break point? Explain why this result occurs. (Hint: Compute the range of df$age and split it into 3 equal parts.)

Handling (within- vs. between-subjects) factors

8. Separating combined columns: As the order of the anagram and decision making tasks was counterbalanced, the variables accu.1, time.1, accu.2, and time.2 in df currently mix information about the accuracy and solution times of both types of tasks. To analyze this data, we first need to disentangle these variables such that a participant’s responses (in terms of accuracy and time) to each type of task are stored in separate variables. Specifically, define the following variables:

  1. an.first should be a logical variable that is TRUE if and only if the anagram task was presented before the decision making task;
  2. an.accu should be a logical variable that contains the accuracy of the anagram task;
  3. an.time should be a numeric variable that contains the time of the anagram task;
  4. dm.accu should be a logical variable that contains the accuracy of the decision making task;
  5. dm.time should be a numeric variable that contains the time of the decision making task.
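The disentangling can be sketched with ifelse() on a toy data frame containing one row per task order:

```r
# Toy rows with counterbalanced task order:
d <- data.frame(task.1 = c("anagram", "decision"),
                accu.1 = c(TRUE, FALSE),
                time.1 = c(30, 41),
                task.2 = c("decision", "anagram"),
                accu.2 = c(TRUE, TRUE),
                time.2 = c(25, 38),
                stringsAsFactors = FALSE)

d$an.first <- (d$task.1 == "anagram")                 # anagram task came first?
d$an.accu  <- ifelse(d$an.first, d$accu.1, d$accu.2)  # anagram accuracy
d$an.time  <- ifelse(d$an.first, d$time.1, d$time.2)  # anagram time
d$dm.accu  <- ifelse(d$an.first, d$accu.2, d$accu.1)  # decision accuracy
d$dm.time  <- ifelse(d$an.first, d$time.2, d$time.1)  # decision time
```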
#> [1] "integer"
#>    task.1 accu.1 time.1   task.2 accu.2 time.2
#> 1 anagram   TRUE     30 decision   TRUE     25
#> 2 anagram   TRUE     30 decision   TRUE     48
#> 3 anagram  FALSE     32 decision   TRUE     36
#> 4 anagram   TRUE     26 decision  FALSE     54
#> 5 anagram   TRUE     43 decision   TRUE     22
#> 6 anagram   TRUE     40 decision  FALSE     37
#> [1] 180  27
#> [1] 180  32

9. Combining separated columns: An analogous problem occurs with the between-subjects factor cond. As every participant imagined either drinking beer or drinking coffee (but not both), the four variables b.rate, c.rate, b.like, and c.like are missing (NA) for every other participant. To analyze this data, we first need to combine the information that is currently distributed over two variables into one variable. Specifically, collect each participant’s aesthetic judgment and music preference in two variables:

  1. rate should be a numeric variable that contains all rating values; and
  2. like should be a character variable that contains all music preferences.
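Since exactly one of the two columns is NA for every participant, ifelse() on the condition can merge each pair (sketched on toy vectors):

```r
# Toy vectors (one value per participant):
cond   <- c("beer", "coffee", "beer")
b.rate <- c(5, NA, 9)
c.rate <- c(NA, 2, NA)
b.like <- c("Rock", NA, "Pop")
c.like <- c(NA, "Pop", NA)

rate <- ifelse(cond == "beer", b.rate, c.rate)  # => 5 2 9
like <- ifelse(cond == "beer", b.like, c.like)  # => "Rock" "Pop" "Pop"
```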
#>     cond b.rate c.rate b.like  c.like
#> 1   beer      5     NA   Rock    <NA>
#> 2 coffee     NA      2   <NA>     Pop
#> 3   beer      9     NA    Pop    <NA>
#> 4 coffee     NA      3   <NA> Classic
#> 5   beer      9     NA   Jazz    <NA>
#> 6 coffee     NA      3   <NA>    Jazz
#> [1] TRUE
#> [1] 180  32
#> [1] 180  34

Checkpoint 2

If you got this far you’re doing great, but don’t give up just yet…

Evaluating test results

10. Let’s determine some additional information about each participant (i.e., a so-called trait factor) that can later be used to qualify our results by inter-individual differences. Here, we will use the four BNT columns to determine a numeracy score for each participant.

The Berlin Numeracy Test (BNT) is an adaptive test that uses two or three questions to categorize a person into one of four levels of numeracy. The reference paper for the test’s validation is:

  • Cokely, E.T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. (2012). Measuring risk literacy: The Berlin Numeracy Test. Judgment and Decision Making, 7, 25–47. (download PDF.)

See RiskLiteracy.org for an online version of the test and additional details on the underlying notion of numeracy.

The adaptive nature of the test consists in the fact that not all participants need to answer all questions. Instead, the sequence of questions presented to a participant depends on his or her answers to previous questions. The five possible sequences (or paths) of questions and their corresponding classifications (into a category from 1 to 4) are illustrated by Figure 1 (on p. 31 of Cokely et al., 2012): A red arrow below a question number indicates that the question was answered incorrectly; a green arrow indicates that it was answered correctly. The bottom row shows that – based on their responses to either two or three questions – participants are classified into one of four levels of numeracy (i.e., they obtain a score from 1 to 4, with 1 indicating the lowest and 4 indicating the highest level of numeracy).

The topic and correct answer of each question are presented in the following table:

  Question      Topic         Correct answer
  Question 1:   choir         25%
  Question 2a:  5-sided die   30
  Question 2b:  loaded die    20
  Question 3:   mushrooms     50%

10a. Note that the entries to the four variables starting with BNT in df are somewhat inconsistent and need to be cleaned up prior to any analysis. Specifically, ensure that all variables are numeric and that all responses are within an appropriate range (i.e., 0–1 for Questions 1 and 3, and 1–100 for Questions 2a and 2b).1
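A sketch of such a clean-up for Question 1 (on toy entries; treating values above 1 as percentages to be rescaled is an assumption that should be checked against the actual responses):

```r
BNT.1 <- c("0.25", "25", "twenty", "0.3")     # toy raw entries

BNT.1 <- suppressWarnings(as.numeric(BNT.1))  # non-numeric "twenty" becomes NA
idx   <- which(BNT.1 > 1)                     # out-of-range (percentage?) entries
BNT.1[idx] <- BNT.1[idx] / 100                # rescale percentages to proportions
BNT.1  # => 0.25 0.25 NA 0.30
```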

#> [1] "character"
#> 
#>    0.2     10     15     18     19     20     21     22     25     30 
#>      7      6      8      1     12     34      7      4      7      6 
#> twenty 
#>      1
#> 
#> 0.2  10  15  18  19  20  21  22  25  30 
#>   7   6   8   1  12  35   7   4   7   6
#> [1] "double"
#> 
#> 0.2  10  15  18  19  20  21  22  25  30 
#>   7   6   8   1  12  35   7   4   7   6
#> 
#>  0.1  0.2 0.22 0.24 0.25 0.26 0.28  0.3  0.5 0.52   10   20   25   30   50 
#>    5   13   11   11   65    2    8    2   10    1    2    6   28    3   11
#> 
#>  0.1  0.2 0.22 0.24 0.25 0.26 0.28  0.3  0.5 0.52 
#>    7   19   11   11   93    2    8    5   21    1
#> 
#> 0.3  20  25  28  29  30  31  32  35  40 
#>  10   8   4   5  10  29   8   5   3   4
#> 
#> 20 25 28 29 30 31 32 35 40 
#>  8  4  5 10 39  8  5  3  4
#> 
#> 0.2  10  15  18  19  20  21  22  25  30 
#>   7   6   8   1  12  35   7   4   7   6
#> 
#> 10 15 18 19 20 21 22 25 30 
#>  6  8  1 12 42  7  4  7  6
#> 
#> 0.25  0.3  0.4 0.45  0.5 0.55 0.75   25   50   75 
#>    4    2    2    1   23    3    1    9    5    3
#> 
#> 0.25  0.3  0.4 0.45  0.5 0.55 0.75 
#>   13    2    2    1   28    3    4

10b. Check for each case (participant) whether his or her entries in the four BNT variables of df are valid (i.e., can actually occur, given the logic of the test). This includes two conditions:

  1. the existing combination of answered questions is possible (irrespective of the answers), and
  2. the sequence of questions answered corresponds to a correct evaluation of previous answers.

Define a filter variable f.BNT.valid that is TRUE if and only if both conditions are met.

#> f.valid.pattern
#> FALSE  TRUE 
#>     5   175
#> f.valid.answers
#> FALSE  TRUE 
#>     9   171
#> f.BNT.valid
#> FALSE  TRUE 
#>     9   171

10c. Define a variable BNT.score that contains the numeracy score (with a value of 1–4) of all participants for which f.BNT.valid == TRUE and add it to df.

#> BNT.score
#>  1  2  3  4 
#> 46 38 23 64
#> [1] 171
#> BNT.score.v
#>  1  2  3  4 
#> 46 38 23 64
#> [1] 180  34
#> [1] 180  35

Checkpoint 3

If you got this far you have turned a (relatively) tidy data file into a clean data file — well done! You are now in a position to begin actually analyzing the data…

C. Analyzing Data

11. The actual data analysis (in the sense of “doing statistics”) only starts when your data file is in good shape and contains all the variables required. Even at this point, it makes sense to explore and visualize key variables prior to applying any statistical tests.

11a. Explore and visualize some of the main dependent variables of this experiment. Do they vary as a function of the between-subjects factors of order or condition?
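For instance, the pirateplot() function of the yarrr package shows raw data, descriptive statistics, and inferential bands in a single display (sketched on simulated toy data; in the assignment, df with its rate and cond columns would be used instead):

```r
library(yarrr)

set.seed(10)  # toy data mimicking the rating variable:
d <- data.frame(cond = rep(c("beer", "coffee"), each = 20),
                rate = c(rnorm(20, mean = 6, sd = 2),
                         rnorm(20, mean = 5, sd = 2)))

pirateplot(formula = rate ~ cond, data = d,
           main = "Rating by condition")
```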

It looks like there might be some effects in the data – so let’s check them out…

11b. Apply some statistical tests to verify (or rather “attempt to falsify”) some hypothesized effects.

# [insert statistics here]
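For example, a hypothesized difference in ratings between the two conditions could be probed with a two-sample t-test (a sketch on simulated toy data; with the real data, the filter variables defined above should be applied first):

```r
set.seed(10)  # toy data:
d <- data.frame(cond = rep(c("beer", "coffee"), each = 20),
                rate = c(rnorm(20, mean = 6, sd = 2),
                         rnorm(20, mean = 5, sd = 2)))

t.test(rate ~ cond, data = d)  # compare mean ratings between conditions
```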

That’s it – hope you enjoyed working on this assignment!


[WPA10.Rmd updated on 2018-01-15 13:22:26 by hn.]


  1. We assume that participants were free to enter any kind of information in response to these questions. In a well-designed online study, the possible types and ranges of entries would have been restricted to exclude ambiguous and misleading answers.