WPA03 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats). A simple .Rmd template is provided here. Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.
Also enter the current assignment (e.g., WPA03), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.
Here is an example of how your file JamesJane_WPA03_171113.Rmd could look:
# Assignment: WPA 03
# Name: Jane, James
# Date: 2017 Nov 13
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# A. In Class
# Combining vectors to matrices and data frames:
# Exercise 1:
a <- letters[1:3] # define some vector
# ...
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review the basic concepts of the current chapters:
Combining vectors to matrices and data frames
1. Define these vectors in your R session to complete the following exercises:
# Define some vectors:
a <- letters[1:3]
b <- letters[4:6]
c <- letters[7:9]
x <- 1:3
y <- 4:6
z <- 7:9
1a. Define the following matrices without using the matrix()
command. (Hint: Use cbind()
and rbind()
to combine the vectors defined above.)
# m1:
#>      x   y   z   a   b   c
#> [1,] "1" "4" "7" "a" "d" "g"
#> [2,] "2" "5" "8" "b" "e" "h"
#> [3,] "3" "6" "9" "c" "f" "i"
# m2:
#>   [,1] [,2] [,3]
#> a "a"  "b"  "c"
#> b "d"  "e"  "f"
#> c "g"  "h"  "i"
#> x "1"  "2"  "3"
#> y "4"  "5"  "6"
#> z "7"  "8"  "9"
# m3:
#>   a   b   c
#> x "a" "d" "g" "1" "2" "3"
#> y "b" "e" "h" "4" "5" "6"
#> z "c" "f" "i" "7" "8" "9"
1b. What is the type of these three matrices? (Use the typeof()
function to compare the types of the vectors and matrices.)
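As a minimal sketch of how cbind(), rbind(), and typeof() behave, the following uses two hypothetical vectors (v.num and v.chr, which are not part of the exercise). It only illustrates the mechanics and is not meant as a solution:

```r
# Hypothetical example vectors (not the ones from the exercise):
v.num <- c(10, 20)      # a numeric vector
v.chr <- c("p", "q")    # a character vector

by.col <- cbind(v.num, v.chr)  # each vector becomes a column
by.row <- rbind(v.num, v.chr)  # each vector becomes a row

dim(by.col)        # number of rows and columns
colnames(by.col)   # column names are taken from the vector names
rownames(by.row)   # row names are taken from the vector names

typeof(v.num)      # type of the original numeric vector
typeof(by.col)     # type of the combined matrix
```

Because a matrix can only hold elements of one type, comparing the last two results shows what happens when vectors of different types are combined.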
1c. Re-create the matrices m1
and m2
without using the cbind()
and rbind()
commands and without using the vectors defined above. (Hint: Use matrix()
and define the data
of each matrix as a new vector. Note that the names of rows or columns may vary depending on your construction method, but ensure that all matrix elements match the original ones.)
1d. Re-create matrix m3
without using the rbind()
command. (Hint: Use matrix()
to create two 3x3 matrices and combine them.)
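The matrix() function fills a matrix from a single data vector. A small sketch with hypothetical data (the objects m.col, m.row, and m.wide are invented for illustration) shows the effect of the byrow argument and how two matrices with the same number of rows can be placed side by side:

```r
# Hypothetical data (not the exercise's matrices):
m.col <- matrix(data = 1:6, nrow = 2)                # filled column by column (default)
m.row <- matrix(data = 1:6, nrow = 2, byrow = TRUE)  # filled row by row

m.col[1, ]   # first row when filling by column: 1 3 5
m.row[1, ]   # first row when filling by row:    1 2 3

# Two matrices with the same number of rows can be combined by column:
m.wide <- cbind(m.col, m.row)
dim(m.wide)  # 2 rows, 6 columns
```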
2. This exercise addresses some details of the cbind()
and rbind()
functions:
2a. Assume that vector x
changed from 1:3
to 1:9
, but y
and z
remained unchanged. Predict, check and explain the result of rbind(x, y, z)
and cbind(x, y, z)
.
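When vectors of unequal length are combined, R recycles the shorter one. A minimal sketch with two hypothetical vectors (short and long, not the exercise's x, y, z):

```r
# Hypothetical vectors of unequal length:
short <- 1:2
long  <- 1:4

m <- cbind(short, long)  # 'short' is recycled to match the length of 'long'
m[, "short"]             # 1 2 1 2
nrow(m)                  # 4 (the length of the longest vector)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.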
2b. Describe the differences between cbind(a, z)
and data.frame(a, z)
(assuming the definitions above).
2c. Assuming the following definitions of df1
and df2
, describe and explain the results of cbind(df1, df2)
and rbind(df1, df2)
.
df1 <- data.frame(cbind(y, z))
# df1
df2 <- data.frame(cbind(y + 100, z + 100))
# df2
2d. Change (the column names of) df2
to make rbind(df1, df2)
possible. (Hint: Use the names()
function to get and set a data frame’s column names.)
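The names() function can both retrieve and assign column names. The following sketch uses a hypothetical data frame df.toy (not df1 or df2) and also shows how a single name can be changed by logical indexing:

```r
# Hypothetical data frame (not df1 or df2 from the exercise):
df.toy <- data.frame(V1 = 1:3, V2 = 4:6)

names(df.toy)                       # get the current column names
names(df.toy) <- c("low", "high")   # set all column names at once
names(df.toy)[2] <- "top"           # change only the second name (numeric index)

# Change a name by logical indexing (select the name that equals "low"):
names(df.toy)[names(df.toy) == "low"] <- "bottom"
names(df.toy)
```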
Working with data frames
3. In this exercise, we’ll explore the in-built data frame InsectSprays
.
3a. Familiarize yourself with this data frame by using the ?
, View()
, head()
, str()
and summary()
commands. (Hint: Copy the original InsectSprays
into a new data frame called df
to leave the original unchanged and facilitate your typing.)
3b. How many cases or observations (rows) and variables (columns) does this data frame contain?
3c. Compute the mean and the median of the count
variable. What do their values suggest about the distribution of count
values?
3d. How many cases were treated with each of the insect sprays?
3e. Add a new variable hi.avg
to df
that indicates whether a count
is higher than the average (or mean()
) count
.
3f. Change the name of the count
variable to insect.count
(by using logical indexing).
3g. Save the initial and the final 10 rows of df
into new data frames and combine them into a data frame first.final.10
.
3h. Save all cases that were treated with spray A
that are lower than the overall average and all cases that were treated with spray F
that are higher than the overall average in separate data frames (using the subset()
command). Which of them has a higher average insect count? (Hint: Proceed in multiple steps:
- create new subsets of df that contain the cases that you want to compare,
- compute and compare the means of these new subsets.)
3i. Is the average insect count of all cases that were treated with spray C
and are above average (within spray C) smaller or larger than the average insect count of all cases that were treated with spray F
and are below average (within spray F)? (Hint: Proceed in multiple steps:
- identify subsets of df that contain only cases of spray C and F,
- compute the averages of these subsets,
- create new subsets of df that contain the cases that you want to compare,
- compute and compare the means of these new subsets.)
All this can be done in far fewer lines of code, of course, but separating tasks into different steps (and corresponding objects) typically makes it clearer.
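The multi-step approach described in the hints above can be sketched on hypothetical toy data (the data frame toy and its columns group and score are invented here and are not InsectSprays):

```r
# Hypothetical toy data (not InsectSprays):
toy <- data.frame(group = c("A", "A", "B", "B", "B"),
                  score = c(2, 8, 4, 6, 10))

# Step 1: Create a subset containing only the cases of one group:
toy.B <- subset(toy, group == "B")

# Step 2: Compute a statistic within this subset:
mean.B <- mean(toy.B$score)

# Step 3: Select the cases of this group that lie above their group mean:
toy.B.hi <- subset(toy.B, score > mean.B)

# Step 4: Compute the mean of the new subset:
mean(toy.B.hi$score)
```

Storing each intermediate result in its own object makes it easy to inspect every step before moving on.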
Saving and loading data
4. Save your data frame df
as a comma-delimited text file named myInsectSprays.csv
(into a subfolder data
) and then re-load this file into a new data frame df2
.
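A minimal sketch of writing and re-reading a comma-delimited file, assuming a hypothetical data frame toy and a data subfolder inside a temporary directory (in your project, the folder would be relative to your working directory instead):

```r
# Hypothetical toy data:
toy <- data.frame(id = 1:3, grp = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Hypothetical location: a 'data' subfolder of a temporary directory
data.dir <- file.path(tempdir(), "data")
dir.create(data.dir, showWarnings = FALSE)   # create the folder if needed
csv.file <- file.path(data.dir, "toy.csv")

write.csv(toy, file = csv.file, row.names = FALSE)    # save without row names
toy2 <- read.csv(csv.file, stringsAsFactors = FALSE)  # re-load from the file

all.equal(toy, toy2)  # check that both data frames match
```

Setting row.names = FALSE avoids an extra column of row numbers appearing when the file is read back in.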
Checkpoint 1
At this point, you have completed all warm-up exercises. This is good, but please keep going…
B. At Home
Priming Study
In a provocative paper, Bargh, Chen and Burrows (1996) sought to test whether or not priming people with trait concepts would trigger trait-consistent behavior. In one study, they primed participants with either neutral words (e.g., bat, cookie, pen), or with words related to an elderly stereotype (e.g., wise, stubborn, old). They then, unbeknownst to the participants, used a stopwatch to record how long it took the participants to walk down a hallway at the conclusion of an experiment. They predicted that participants primed with words related to the elderly would walk slower than those primed with neutral words.
In the following, we will analyze fake data corresponding to this cover story.
Dataset description
Our simulated study data has 3 primary independent variables:
- prime: What kind of primes was the participant given? There were 2 conditions: neutral means neutral primes, elderly means elderly primes;
- prime.duration: For how much time (in minutes) were primes shown to participants? There were 4 conditions: 1, 5, 10, or 30 minutes;
- grandparents: Did the participant have a close relationship with their grandparents? yes means yes, no means no, none means that they had no relationship with their grandparents.
There was one primary dependent variable:
- walk: For how long (in seconds) did participants walk down the hallway?
There were 4 additional variables that characterise each participant:
- id: The order in which participants completed the study;
- age: Participants’ age;
- gender: Participants’ gender;
- attention: Was an attention check passed? 0 indicates a failed, 1 a passed attention check.
Project management
5a. Start a new R-project called RCourse
(or similar). Then (either within RStudio or in a file manager outside of R), navigate to the location of your RCourse
project, and add two folders named R
and data
.
5b. Open a new R script called LastFirst_WPA03_161114.R
(if First
is your first and Last
is your last name), and save it into the R
folder.
5c. When beginning a new project, it’s always a good idea to remove all previous assignments and objects from your workspace. (Hint: Check the help file of the rm()
function to obtain the correct command.)
rm(list = ls()) # clean all (without warning).
5d. A text file containing the data (called WPA03_priming.txt
) is available at http://Rpository.com/down/data/WPA03_priming.txt. Right-click the link and save the data file into the data
folder. (Note that this data file is not the original data, but freshly simulated data from 2017.)
5e. Use read.table()
to load the data into a new R object called priming
. (Hint: Note that the text file is tab-delimited and contains a header row, so be sure to include the sep = "\t"
and header = TRUE
arguments.)
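To illustrate the sep and header arguments without touching the course data, this sketch writes a tiny tab-delimited file to a temporary location and reads it back (the file contents and the object names txt.file and toy are hypothetical):

```r
# Write a small tab-delimited file with a header row to a temporary location:
txt.file <- tempfile(fileext = ".txt")
writeLines(c("name\tvalue", "a\t1", "b\t2"), con = txt.file)

# Read it back, stating the separator and the presence of a header row:
toy <- read.table(file = txt.file, sep = "\t", header = TRUE)
str(toy)  # inspect the structure of the result
```

Without header = TRUE, the first line of the file would be treated as data rather than as column names.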
Viewing and naming data
6a. Explore the data or specific variables using View()
, head()
, str()
, and summary()
. How many cases (rows) and variables (columns) does priming
contain?
6b. Obtain and study the names of priming
with names()
. Those aren’t very useful, are they? Change the names to more informative values. (Hint: Make your life easy by using the same naming scheme as in the dataset description above.)
Applying functions to variables
7a. What was the mean age of the participants?
7b. How many participants were there from each gender?
7c. What was the median walking time?
7d. What percentage of participants passed the attention check? (Hint: To calculate a percentage from a binary [0, 1] variable, use mean()
.)
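The reason this works is that the mean of a 0/1 variable equals the proportion of 1s. A short sketch with a hypothetical indicator vector passed:

```r
passed <- c(1, 0, 1, 1)   # hypothetical 0/1 indicator (3 of 4 values are 1)
mean(passed)              # proportion of 1s: 0.75
mean(passed) * 100        # percentage of 1s: 75
```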
7e. Walking time is currently in seconds. Add a new column to the data frame called walking.m
that shows the walking time in minutes rather than seconds (rounded to 2 decimal places).
Indexing and subsetting data frames
Hint: Many of the following problems are best solved by splitting your answers into two steps:
Step 1: Index or subset the original data and store it as a new object with a new name.
Step 2: Calculate the appropriate summary statistic using the new, subsetted object that you just created.
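The two steps can be sketched with logical indexing on hypothetical toy data (the data frame toy and its columns cond, age, and time are invented for illustration):

```r
# Hypothetical toy data (not the priming data):
toy <- data.frame(cond = c("x", "y", "x", "y"),
                  age  = c(20, 25, 30, 35),
                  time = c(5, 7, 9, 11))

# Step 1: Index the rows of interest and store them as new objects:
toy.x     <- toy[toy$cond == "x", ]                  # one condition
toy.y.old <- toy[toy$cond == "y" & toy$age > 30, ]   # two conditions joined by &

# Step 2: Compute the summary statistic on the new objects:
mean(toy.x$time)      # 7
mean(toy.y.old$time)  # 11
```

Note the comma inside the square brackets: the logical test selects rows, and leaving the column position empty keeps all columns.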
8a. What were the genders of the first 10 participants (i.e., the first 10 rows)?
8b. Show all the data for the 50th participant (row).
8c. What was the mean walking time for the elderly prime condition?
8d. What was the mean walking time for the neutral prime condition?
8e. What was the mean walking time for participants less than 23 years old?
8f. What was the mean walking time for females with a close relationship with their grandparents?
8g. What was the mean walking time for males over 24 years old without a close relationship with their grandparents?
Checkpoint 2
At this point you are doing very well. Try to hang on for a few more difficult tasks…
Creating new data frames
9a. Create a new data frame called priming.att
that only includes rows where participants passed the attention check. (Hint: Use logical indexing or subset()
.)
9b. Some of the data do not make sense and must be mistaken. For example, some walking times are negative, some prime values are incorrect, and some prime.duration
values were not part of the original study plan. This should be fixed before carrying out further analyses.
Create a new data frame called priming.c
(for ‘priming clean’) that only includes rows with valid values for each column. Do this by looking for strange values in each column, and by comparing them with the original dataset description. Additionally, only include participants who passed the attention check. Here’s a skeleton of how your code should look:
# Create priming.c, a subset of the original priming data:
# (Replace __ with the appropriate values.)
priming.c <- subset(x = priming,
                    subset = gender %in% c(__) &
                             age > __ &
                             attention == __ &
                             prime %in% __ &
                             prime.duration %in% __ &
                             grandparents %in% __ &
                             walk > __)
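The skeleton combines several logical tests with & and uses %in% to check whether a value belongs to a set of allowed values. A minimal sketch of the same pattern on hypothetical toy data (toy, valid.colors, and toy.c are invented and do not reveal the values to fill in above):

```r
# Hypothetical toy data containing a misspelled label and a negative value:
toy <- data.frame(color = c("red", "blue", "grren", "red"),
                  size  = c(1, -2, 3, 4))

valid.colors <- c("red", "blue")   # hypothetical set of allowed values

toy.c <- subset(x = toy,
                subset = color %in% valid.colors &   # keep only known colors
                         size > 0)                   # keep only positive sizes
nrow(toy.c)  # 2
```

A row is kept only if every test is TRUE, so one invalid value in any column is enough to exclude it.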
9c. How many participants gave valid data and passed the attention check? (Hint: Use the result from your previous answer.)
9d. Of those participants who gave valid data and passed the attention check, what was the mean walking time of those given the elderly prime and of those given the neutral prime? (Calculate these separately.)
Saving and loading data
10a. Save your two dataframe objects priming
and priming.c
in an .RData file called priming.RData
in the data
folder of your project.
10b. Save your priming.c
object as a tab-delimited text file called priming_clean.txt
in the data
folder of your project.
10c. Clean your workspace by running the appropriate rm()
command again.
10d. Re-load your two data frame objects using load()
.
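A small sketch of the save-remove-load cycle, using two hypothetical objects and a temporary file location. To keep the example harmless, it removes only these two objects rather than clearing the whole workspace:

```r
# Two hypothetical objects:
obj.1 <- 1:3
obj.2 <- data.frame(v = c("a", "b"))

rdata.file <- file.path(tempdir(), "toy.RData")  # hypothetical file location
save(obj.1, obj.2, file = rdata.file)            # store both objects in one file

rm(obj.1, obj.2)      # remove only these two objects
load(rdata.file)      # restore the objects under their original names

obj.1                 # available again
```

Unlike read.table(), load() does not need to be assigned to a name: the objects reappear under the names they had when they were saved.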
11. A colleague of yours asks for access to the data, but is only interested in the data from females who experienced the neutral prime.
11a. Create a dataframe called priming.f
that only includes these data. Additionally, do not include the id
column as this could be used to identify the participants.
11b. Save your priming.f
object as a tab-delimited text file called priming_females.txt
in the data folder of your project.
11c. Save your entire workspace to an .RData file called priming_ws.RData
in the data folder of your project.
Checkpoint 3
At this point you are doing great, well done! If you are curious, perhaps you also enjoy the following challenges?
C. Challenges
12. Use your cleaned dataframe (priming.c
) for the following exercises.
12a. Did the effect of priming condition (neutral vs. elderly) on walking times differ between the first 100 and the last 100 participants? (Hint: Given a total of \(n\) participants, you can find the id
of the \(100\)th and of the \((n-100)\)th participant. Then use these id
values to determine subsets of priming.c
.)
12b. Due to a computer error, the data from every participant with an even id number is invalid. Remove these data from your priming.c
dataframe.
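One way to distinguish even from odd numbers is the modulo operator %%, which returns the remainder of a division. A minimal sketch on a hypothetical vector of id values:

```r
id <- 1:6          # hypothetical id values
id %% 2            # remainder after division by 2: 1 0 1 0 1 0
id[id %% 2 == 1]   # odd values only: 1 3 5
```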
12c. Do you find evidence that a participant’s relationship with their grandparents affects how they responded to the primes?
That’s it – now it’s time to submit your assignment!
Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.
[WPA03.Rmd
updated on 2017-11-13 12:10:34 by hn.]