WPA09 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Use the .Rmd Format: Your WPAs should be written as scripts of commented code (as
.Rmd
files) and submitted as reproducible documents that combine text with code (in.html
or.pdf
formats).A simple
.Rmd
template is provided here.(Alternatively, open a plain R script and save it as
LastnameFirstname_WPA##_yymmdd.R
.)
Commening your code: Indicate the current assignment (e.g., WPA09), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file
LastFirst_WPA09_yymmdd.Rmd
could look like this:
---
title: "Your Assignment Title (WPA09)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---
This file contains my solutions:
# Exercise 1
To show and run R code in your document, use a code chunk (without the '#' symbols):
# ```{r, exercise_1, echo = TRUE, eval = TRUE}
#
# v <- c(1, 2, 3) # some vector
# sum(v)
#
# ```
More text and code chunks...
[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:
Preparations
0. The following steps prepare the current session by opening an R project, creating a new .Rmd
file, and compiling it into an .html
output file:
0a. Open your R project from last week (called RCourse
or something similar), which contains some files and at least two subfolders (data
and R
).
0b. Create a new R Markdown (.Rmd
) script and save it as LastFirst_WPA07_yymmdd.Rmd
(with an appropriate header) in your project directory.
0c. Insert a code chunk and load the rmarkdown
, knitr
and yarrr
packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library()
at the beginning of your document. Using the chunk option include = FALSE
evaluates the chunk, but does not show it or its outputs in the html output file.)
library(rmarkdown)
library(knitr)
library(yarrr)
# Store original par() settings:
opar <- par()
# par(opar) # restores original (default) par settings later
0d. Make sure that you can create an .html
output-file by “knitting” your current document.
Functions
1. Write a function called feed.me()
that takes a string food
as an argument, and returns the sentence “I love to eat ___!“. Test your function by running feed.me("apples")
, etc.
feed.me <- function(___) {
output <- paste0("I love to eat ", ___, "!")
print(___)
}
2a. Write a function called z.trans()
that takes a vector of numbers \(v\) as input and returns the z-transformed (or standardized) values as output. (Hint: Remember that z.v <- (v - mean(v)) / sd(v)
), but beware that \(v\) could contain NA values.)
2b. Demonstrate that z.trans()
works as expected by some example calls.
Loops
3a. Create a loop that prints the squares of integers from 1 to 10.
for(i in ___) {
square.i <- ___
print(square.i)
}
3b. Create a loop that counts down from an integer start
in steps of size step
until an integer lower.bound
is reached. Test your loop for start <- 1000
, lower.limit <- 900
and step <- 33
and step <- 50
. (Hint: If start
is the 1st element, the maximal number of steps is ceiling((start - lower.bound)/step)) + 1
. Use an if
statement to ensure that the current count is at least lower.bound
. Using a while
loop is easier than using a for
loop for this task, but using seq()
is even easier.)
Functions with loops
4. The so-called Fibonacci sequence (see Wikipedia entry) starts with 0, 1, 1, 2, 3, 5, ...
and is defined by the fact that every element (after the first two numbers) is the sum of the two preceding numbers (i.e., \(F_{n} = F_{n-2} + F_{n-1}\) for \(n > 2\)).
4a. Write a function Fibonacci(n)
that uses a loop to print the first \(n\) Fibonacci numbers. (Hint: Initialize the series with c(0, 1)
and use an if
statement to directly return the result for \(0 < n \leq 2\). Use a loop to compute additional numbers if \(n \geq 3\).)
4b. One major advantage of a general function is that it can be used for alternative tasks. Test your function by showing the results of Fibonacci(n)
for the numbers 1 and 3 as different first and second elements of the series and a series length n
of 15.
5. Functions on text (character objects): Let’s define some functions that count the occurrence of items in vectors or words in sentences.
5a. Create a function count()
that takes a vector \(v\) and a scalar \(x\) as arguments and returns how many times \(x\) appears as an element of \(v\). (Hint: Use logical indexing to avoid using a loop, but note that \(x\) could be NA and \(v\) could contain NA values.)
5b. Demonstrate that count()
works as expected by some example calls.
5c. Create a function count.char()
that takes a string of characters or word \(w\) and a string of characters or sentence \(s\) as arguments and returns how many times \(w\) appears in \(s\). (Hint: Use a loop over all possible positions of \(s\), and note that nchar()
returns the length of a character string and substr(s, start, stop)
returns the substring of \(s\) from position start
to position stop
.)
5d. Demonstrate that count.char()
works as expected by some example calls.
6. Functions for graphs: Let’s plot and write a useful function for a simple financial data set.
6a. We will use a data frame \(df.st\) in this exercise. The corresponding data (to be loaded into \(df.st\)) contains the closing values of two prominent stock indices for all trading days of 2016 and is available at http://Rpository.com/down/data/WPA09_st.txt. (Hint: Note that this text file is tab-delimited and contains a header row.)
df.st <- read.table(file = "http://Rpository.com/down/data/WPA09_st.txt", # from url
sep = "\t",
header = TRUE)
The first few lines of df.st
should look as follows:
head(df.st)
#> date dax dj
#> 1 04/01/2016 10283.44 17148.94
#> 2 05/01/2016 10310.10 17158.66
#> 3 06/01/2016 10214.02 16906.51
#> 4 07/01/2016 9979.85 16514.10
#> 5 08/01/2016 9849.34 16346.45
#> 6 11/01/2016 9825.07 16398.57
6b. Sliding average: Write a function slide.mean
that calculates the sliding averages of the most recent \(w\) values for a vector \(v\) (and ignore any NA values, but return NA if there are fewer than \(w\) preceding values). For instance, for v <- c(1, 2, 3, 1, 2, 3, NA, 2, 3)
the function call slide.mean(v, 3)
should return a vector NA NA 2.0 2.0 2.0 2.0 2.5 2.5 2.5
and slide.mean(v, 2)
should return a vector NA 1.5 2.5 2.0 1.5 2.5 3.0 2.0 2.5
.
6c. Write a function that plots all stock-index series in \(df.st\), as well as their sliding averages over all variables in (or columns of) a data frame \(df.st\) for a given window size \(w\). (Hint: First create a plot for specific instances of \(df.st\) and \(w\), then generalise this plot into a function.)
A specific plot:
The following plot visualizes the data contained in df.st
:
# (1) A specific plot:
w <- 100
colors <- yarrr::piratepal("basel")
# (a) determine ranges for axes:
all.values <- c(df.st$dax, df.st$dj) # to determine y-range
# df.st$date[1]
dates <- as.Date(df.st$date, "%d/%m/%Y") # determine dates
# dates[1]
# format(dates[1], format="%d %b %Y")
# (b) define plotting area:
plot(x = 1:nrow(df.st), type = "n",
ylim = range(all.values),
xaxt = "n", # suppress x axis
main = "Specific plot",
xlab = "Date", ylab = "Values")
axis.Date(side = 1, dates, format = "%d %b")
grid(col = "grey60")
# (c) add lines:
lines(df.st$dax, lty = 1, col = colors[1], lwd = 2)
sm.dax <- slide.mean(df.st$dax, w)
lines(sm.dax, lty = 1, col = colors[1], lwd = 1)
lines(df.st$dj, lty = 1, col = colors[2], lwd = 2)
sm.dj <- slide.mean(df.st$dj, w)
lines(sm.dj, lty = 1, col = colors[2], lwd = 1)
# (d) add legend:
legend("topleft", names(df.st[,-1]), lty = 1, lwd = 2, col = colors)
Generalization to a function:
The following code generalizes code into a function that returns a plot:
# (b) as function:
plot.ts <- function(df, w) {
m <- nrow(df)
n <- ncol(df)
colors <- yarrr::piratepal("basel")
plot(x = 1:m, type = "n", ylim = range(df),
main = "Time series",xlab = "Time", ylab = "Values")
grid(col = "grey60")
for (i in 1:n) { # for each column of df:
cur.ts <- df[ , i] # current time series
lines(cur.ts, lty = 1, col = colors[i], lwd = 2)
sl.mn <- slide.mean(cur.ts, w)
lines(sl.mn, lty = 1, col = colors[i], lwd = 1)
}
legend("topleft", names(df), lty = 1, lwd = 2, col = colors)
}
# check function:
plot.ts(df.ts, w = 100)
6d. Write a function plot.stocks()
that generalizes the above code by using the data set df
and a specific value of w
as input arguments. However, the function should still assume that the first column of df
contains date values and all subsequent columns contain stock index values.
Calling the function with df.st
as input data should yield the same plot again:
# Check function:
plot.stocks(df.st, w = 30)
Show that the plot.stocks()
function allows not only for different values of w
, but also for different data sets (e.g., containing a flexible number of stocks)!
Checkpoint 1
At this point you completed all basic exercises. This is good, but additional practice will deepen your understanding, so please keep carrying on…
B. At Home
Facebook attraction
In this part, we will re-use data from the ficticious study on attraction that we also used in WPA07. In this study, 1,000 heterosexual University students viewed the Facebook profile of another student (the “target” person) of the opposite sex. Based on a target person’s profile, each participant made three judgments about the target: their perceived intelligence, attractiveness, and dateability. The primary judgment of interest was the dateability rating indicating as how dateable the target person was perceived (ranging from a minimum value of 0 to a maximum value of 100).
Data description
The data file contains 1,000 rows and 10 columns. Here are the columns:
session
: The experiment session in which the study was run. There were 50 sessions in total.sex
: The sex of the target person (“m” vs. “f”).age
: The age of the target person (in years).haircolor
: The hair color of the target person.university
: The university that the target person attended.education
: The highest level of education obtained by the target person.shirtless
: Did the target person have a shirtless profile picture? (1.No vs. 2.Yes).intelligence
: As how intelligent do you rate this target? (1.Low, 2.Medium, 3.High).attractiveness
: As how physically attractive do you rate this target? (1.Low, 2.Medium, 3.High).dateability
: As how dateable do you rate this target person? (Scale from 0 to 100).
Data loading and exploration
7a. The data are still located in a tab-delimited text file at http://Rpository.com/down/data/WPA07_facebook.txt. Using read.table()
load this data into R as a new object called facebook
.
Here is how the first few rows of the data should look:
head(facebook)
#> session sex age haircolor university education shirtless
#> 1 1 m 23 brown 3.Goettingen 3.Masters 2.Yes
#> 2 1 m 19 blonde 2.Freiburg 1.HighSchool 1.No
#> 3 1 f 22 brown 2.Freiburg 2.Bachelors 2.Yes
#> 4 1 f 22 red 2.Freiburg 2.Bachelors 1.No
#> 5 1 m 23 brown 3.Goettingen 2.Bachelors 1.No
#> 6 1 m 26 blonde 2.Freiburg 3.Masters 2.Yes
#> intelligence attractiveness dateability
#> 1 1.low 3.high 15
#> 2 2.medium 2.medium 44
#> 3 1.low 2.medium 100
#> 4 2.medium 3.high 100
#> 5 2.medium 2.medium 63
#> 6 3.high 3.high 76
7b. Inspect the first few rows of the dataframe with the head()
function to make sure it loaded correctly. Using the str()
function, look at the structure of the dataframe to make sure everything looks ok.
#> 'data.frame': 1000 obs. of 10 variables:
#> $ session : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ sex : Factor w/ 2 levels "f","m": 2 2 1 1 2 2 1 2 1 1 ...
#> $ age : int 23 19 22 22 23 26 19 25 22 19 ...
#> $ haircolor : Factor w/ 3 levels "blonde","brown",..: 2 1 2 3 2 1 2 3 2 1 ...
#> $ university : Factor w/ 3 levels "1.Konstanz","2.Freiburg",..: 3 2 2 2 3 2 3 2 2 3 ...
#> $ education : Factor w/ 4 levels "1.HighSchool",..: 3 1 2 2 2 3 1 3 2 1 ...
#> $ shirtless : Factor w/ 2 levels "1.No","2.Yes": 2 1 2 1 1 2 1 2 2 1 ...
#> $ intelligence : Factor w/ 3 levels "1.low","2.medium",..: 1 2 1 2 2 3 1 3 2 2 ...
#> $ attractiveness: Factor w/ 3 levels "1.low","2.medium",..: 3 2 2 3 2 3 2 1 2 2 ...
#> $ dateability : int 15 44 100 100 63 76 61 26 76 40 ...
Custom functions
8a. Write a function called my.mean()
that takes a vector x
as an argument, and returns the mean of the vector x
. Don’t use the mean()
function! Use sum()
and length()
!
my.mean <- function(___) {
result <- sum(___) / length(___)
return(result)
}
8b. Try your my.mean()
function to calculate the mean dateability rating of participants in the facebook
dataset and compare the result to the built-in mean()
function to make sure you get the same result!
9a. Write a function called how.many.na()
that takes a vector x
as an argument, and returns the number of NA values found in the vector:
how.many.na <- function(x) {
output <- sum(is.na(___))
return(___)
}
9b. Test your how.many.na()
function on the age of participants in the facebook
dataframe and then on the vector x = c(4, 7, 3, NA, NA, 1)
.
10a. Create a function called my.plot()
that takes arguments x
and y
and returns a customised scatterplot with gridlines and a regression line:
my.plot <- function(x, y) {
plot(x = ___,
y = ___,
pch = ___, # check ?points to see the values of pch!
col = ___
)
grid() # add gridlines
# Add a regression line:
abline(lm(___ ~ ___),
col = ___
)
}
10b. Test your my.plot()
function on the age and dateability of participants in the facebook
dataset.
Custom loops
11. Using the following template, create a loop that calculates (and prints) the mean dateability rating of students from each university in the facebook
dataset.
for(university.i in c("1.Konstanz", "2.Freiburg", "3.Goettingen")) {
data.i <- facebook$dateability[facebook$university == ___]
output <- paste0("The mean dateability of targets at university ", ___, " is ", ___)
print(___)
}
12. Create a histogram of dateability
ratings for each level of intelligence
using the following structure:
par(mfrow = c(1, 3)) # set up 1 x 3 plotting grid
for(intelligence.i in c("1.low", "2.medium", "3.high")) {
hist(facebook$dateability[facebook$intelligence == ___],
main = ___,
xlab = ___,
col = ___
)
}
par(mfrow = c(1, 1)) # reset plotting grid
Checkpoint 2
If you got this far you’re doing great, but don’t give up just yet…
More functions
13a. Write a function called ttest.apa()
that takes a vector x
as an argument and returns an apa style conclusion from a one-sample test of x
.
ttest.apa <- function(x, mu) {
# Store the one-sample ttest in object a
a <- t.test(x = ___,
mu = ___)
df <- a$parameter # Get the degrees of freedom
test.stat <- ___ # Get the test statistic
p.value <- ___ # Get the p-value
# If the test is significant...
if(p.value <= ___) {
output <- paste0("The test is significant! t(",
df, ") = ", test.stat,
", p = ", p.value,
" (H0 = ", mu, ")")
}
# If the test is not significant...
if(p.value > ___) {
output <- ___
}
print(___)
}
13b. Test your ttest.apa()
function on the the dateability of participants in the facebook study. Specifically, test if their mean dateability rating is different from 50.
More loops
14. Let’s use a design matrix to compute some descriptive statistics for specific subsets of the facebook
data.
14a. Create an R object design.matrix
that contains rows for all combinations of 2 independent variables: gender
(with values of “f” or “m”) and haircolor
(with values of “blonde”, “brown”, or “red”). Include placeholders (with NA
values) for 3 dependent variables: the number of cases nr
, the mean age age.mn
, and the mean dateability dateability.mn
). (Hint: Use expand.grid()
function to initialize the design.matrix
, which should have \(2 \cdot\ 3 = 6\) rows and \(2 + 3 = 5\) columns.)
Here’s how your design.matrix
should look after its creation:
design.matrix
#> gender haircolor nr age.mn dateability.mn
#> 1 f blonde NA NA NA
#> 2 m blonde NA NA NA
#> 3 f brown NA NA NA
#> 4 m brown NA NA NA
#> 5 f red NA NA NA
#> 6 m red NA NA NA
14b. Use a loop through all rows of your design.matrix
that
- gets the factor (or independent) values from the current row,
- finds the corresponding subset in the
facebook
data, - computes the desired statistics (
nr
,age.mn
, anddateability.mn
), and - adds them into the appropriate columns of the current row of
design.matrix
.
Here’s how design.matrix
looks after this loop:
design.matrix
#> gender haircolor nr age.mn dateability.mn
#> 1 f blonde 130 22.24615 58.14615
#> 2 m blonde 120 22.22500 48.63333
#> 3 f brown 230 22.33478 60.19130
#> 4 m brown 270 22.40000 47.53333
#> 5 f red 119 22.57983 63.67227
#> 6 m red 131 22.13740 49.99237
14c. Use either dplyr
or an ANOVA to check your results.
15. The following dataframe survey
contains results from a survey of 5 participants. Each participant was asked 5 questions on a 1-10 Likert scale. As you can see, many of the responses are not valid integers from 1-10. Using a loop, create a new dataframe called survey.corrected
that converts all invalid values to NA.
survey <- data.frame("p" = c(1, 2, 3, 4, 5),
"q1" = c(5, 3, 6, 3, 5),
"q2" = c(-1, 4, 3, 6, 11),
"q3" = c(6, 22, 4, 6, -5),
"q4" = c(6, 3, 4, -2, 4),
"q5" = c(1, 1, 900, 1, 2))
Here’s a template for the desired code and loop:
survey.corrected <- survey # Copy survey
for(column.i in ___ ) {# Loop over columns
x <- ___ # Copy original column as x
x[x %in% ___ == FALSE] <- ___ # replace any value not in 1:10 by NA
survey.corrected[ , ___] <- ___ # Assign x back to survey.correced
}
Checkpoint 3
If you got this far you’re doing an amazing job — well done! Enjoy the following challenge…
C. Challenge: Simulation
16. Let’s conduct some simulations to address some deep problem (literally) of inferential statistics.
16a. What is the probability of getting a significant \(p\)-value if the null hypothesis is true? Test this by conducting the following simulation:
- Create a vector called
p.values
with \(N = 100\) NA values. - Draw a sample size of 10 from a normal distribution with mean \(= 0\) and standard deviation \(= 1\).
- Do a one-sample \(t\)-test testing if the mean of the distribution is different from 0. Save the \(p\)-value from this test in the 1st position of
p.values
. - Repeat these steps with a loop to fill all \(N\)
p.values
. - Create a histogram of
p.values
and calculate the proportion ofp-values
that are significant at the .05 level.
N <- 100
p.values <- rep(NA, ___)
for(i in ___) {
x <- rnorm(n = ___, mean = ___, sd = ___)
result <- t.test(___)$___
p.values[___] <- ___
}
16b. Create a function p.simulation
that generalizes the simulation from Exercise 16a. with 4 arguments: sim
: the number of simulations, sample.size
: the sample size used in each simulation, mu.true
: the true mean, and sd.true
: the true standard deviation. That is, p.simulation
should calculate sim
\(p\)-values testing whether sample.size
samples from a normal distribution with a true mean of mu.true
and true standard deviation of sd.true
is significantly different from 0. The function should return a vector of \(p\)-values.
Submission
That’s it – now it’s time to submit your assignment! –>
Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.