uni.kn.logo

WPA09 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Use the .Rmd Format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Commening your code: Indicate the current assignment (e.g., WPA09), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA09_yymmdd.Rmd could look like this:

---
title: "Your Assignment Title (WPA09)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---

This file contains my solutions: 

# Exercise 1

To show and run R code in your document, use a code chunk (without the '#' symbols):

# ```{r, exercise_1, echo = TRUE, eval = TRUE}
# 
# v <- c(1, 2, 3) # some vector
# sum(v)
#     
# ```

More text and code chunks... 

[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
  1. Complete as many exercises as you can by Wednesday (23:59).

  2. Submit your script or output file (including all code) to the appropriate folder on Ilias.


A. In Class

Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:

Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA07_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)

library(rmarkdown)
library(knitr)
library(yarrr)

# Store original par() settings:
opar <- par()
# par(opar) # restores original (default) par settings later

0d. Make sure that you can create an .html output-file by “knitting” your current document.

Functions

1. Write a function called feed.me() that takes a string food as an argument, and returns the sentence “I love to eat ___!“. Test your function by running feed.me("apples"), etc.

feed.me <- function(___) {
  
  output <- paste0("I love to eat ", ___, "!")
  
  print(___)
}

2a. Write a function called z.trans() that takes a vector of numbers \(v\) as input and returns the z-transformed (or standardized) values as output. (Hint: Remember that z.v <- (v - mean(v)) / sd(v)), but beware that \(v\) could contain NA values.)

2b. Demonstrate that z.trans() works as expected by some example calls.

Loops

3a. Create a loop that prints the squares of integers from 1 to 10.

for(i in ___) {
  
  square.i <- ___
  
  print(square.i)
  
}

3b. Create a loop that counts down from an integer start in steps of size step until an integer lower.bound is reached. Test your loop for start <- 1000, lower.limit <- 900 and step <- 33 and step <- 50. (Hint: If start is the 1st element, the maximal number of steps is ceiling((start - lower.bound)/step)) + 1. Use an if statement to ensure that the current count is at least lower.bound. Using a while loop is easier than using a for loop for this task, but using seq() is even easier.)

Functions with loops

4. The so-called Fibonacci sequence (see Wikipedia entry) starts with 0, 1, 1, 2, 3, 5, ... and is defined by the fact that every element (after the first two numbers) is the sum of the two preceding numbers (i.e., \(F_{n} = F_{n-2} + F_{n-1}\) for \(n > 2\)).

4a. Write a function Fibonacci(n) that uses a loop to print the first \(n\) Fibonacci numbers. (Hint: Initialize the series with c(0, 1) and use an if statement to directly return the result for \(0 < n \leq 2\). Use a loop to compute additional numbers if \(n \geq 3\).)

4b. One major advantage of a general function is that it can be used for alternative tasks. Test your function by showing the results of Fibonacci(n) for the numbers 1 and 3 as different first and second elements of the series and a series length n of 15.

5. Functions on text (character objects): Let’s define some functions that count the occurrence of items in vectors or words in sentences.

5a. Create a function count() that takes a vector \(v\) and a scalar \(x\) as arguments and returns how many times \(x\) appears as an element of \(v\). (Hint: Use logical indexing to avoid using a loop, but note that \(x\) could be NA and \(v\) could contain NA values.)

5b. Demonstrate that count() works as expected by some example calls.

5c. Create a function count.char() that takes a string of characters or word \(w\) and a string of characters or sentence \(s\) as arguments and returns how many times \(w\) appears in \(s\). (Hint: Use a loop over all possible positions of \(s\), and note that nchar() returns the length of a character string and substr(s, start, stop) returns the substring of \(s\) from position start to position stop.)

5d. Demonstrate that count.char() works as expected by some example calls.

6. Functions for graphs: Let’s plot and write a useful function for a simple financial data set.

6a. We will use a data frame \(df.st\) in this exercise. The corresponding data (to be loaded into \(df.st\)) contains the closing values of two prominent stock indices for all trading days of 2016 and is available at http://Rpository.com/down/data/WPA09_st.txt. (Hint: Note that this text file is tab-delimited and contains a header row.)

df.st <- read.table(file = "http://Rpository.com/down/data/WPA09_st.txt", # from url
                    sep = "\t", 
                    header = TRUE)

The first few lines of df.st should look as follows:

head(df.st)
#>           date      dax       dj
#> 1   04/01/2016 10283.44 17148.94
#> 2   05/01/2016 10310.10 17158.66
#> 3   06/01/2016 10214.02 16906.51
#> 4   07/01/2016  9979.85 16514.10
#> 5   08/01/2016  9849.34 16346.45
#> 6   11/01/2016  9825.07 16398.57

6b. Sliding average: Write a function slide.mean that calculates the sliding averages of the most recent \(w\) values for a vector \(v\) (and ignore any NA values, but return NA if there are fewer than \(w\) preceding values). For instance, for v <- c(1, 2, 3, 1, 2, 3, NA, 2, 3) the function call slide.mean(v, 3) should return a vector NA NA 2.0 2.0 2.0 2.0 2.5 2.5 2.5 and slide.mean(v, 2) should return a vector NA 1.5 2.5 2.0 1.5 2.5 3.0 2.0 2.5.

6c. Write a function that plots all stock-index series in \(df.st\), as well as their sliding averages over all variables in (or columns of) a data frame \(df.st\) for a given window size \(w\). (Hint: First create a plot for specific instances of \(df.st\) and \(w\), then generalise this plot into a function.)

A specific plot:

The following plot visualizes the data contained in df.st:

# (1) A specific plot:
w <- 100
colors <- yarrr::piratepal("basel")

# (a) determine ranges for axes:
all.values <- c(df.st$dax, df.st$dj) # to determine y-range

# df.st$date[1]
dates <- as.Date(df.st$date, "%d/%m/%Y") # determine dates
# dates[1]
# format(dates[1], format="%d %b %Y")

# (b) define plotting area:
plot(x = 1:nrow(df.st), type = "n",  
     ylim = range(all.values),  
     xaxt = "n", # suppress x axis  
     main = "Specific plot", 
     xlab = "Date", ylab = "Values")
axis.Date(side = 1, dates, format = "%d %b")
grid(col = "grey60")

# (c) add lines:
lines(df.st$dax, lty = 1, col = colors[1], lwd = 2)
sm.dax <- slide.mean(df.st$dax, w)
lines(sm.dax, lty = 1, col = colors[1], lwd = 1)

lines(df.st$dj, lty = 1, col = colors[2], lwd = 2)
sm.dj  <- slide.mean(df.st$dj, w)
lines(sm.dj, lty = 1, col = colors[2], lwd = 1)

# (d) add legend:
legend("topleft", names(df.st[,-1]), lty = 1, lwd = 2, col = colors)

Generalization to a function:

The following code generalizes code into a function that returns a plot:

# (b) as function:
plot.ts <- function(df, w) {
  
  m <- nrow(df)
  n <- ncol(df)
  
  colors <- yarrr::piratepal("basel")
  
  
  plot(x = 1:m, type = "n", ylim = range(df), 
       main = "Time series",xlab = "Time", ylab = "Values")

  grid(col = "grey60")
    
  for (i in 1:n) { # for each column of df:
    
    cur.ts <- df[ , i] # current time series
    
    lines(cur.ts, lty = 1, col = colors[i], lwd = 2)
    sl.mn <- slide.mean(cur.ts, w)
    lines(sl.mn, lty = 1, col = colors[i], lwd = 1)
    
    }

  legend("topleft", names(df), lty = 1, lwd = 2, col = colors)

}

# check function:
plot.ts(df.ts, w = 100)

6d. Write a function plot.stocks() that generalizes the above code by using the data set df and a specific value of w as input arguments. However, the function should still assume that the first column of df contains date values and all subsequent columns contain stock index values.

Calling the function with df.st as input data should yield the same plot again:

# Check function:
plot.stocks(df.st, w = 30)

Show that the plot.stocks() function allows not only for different values of w, but also for different data sets (e.g., containing a flexible number of stocks)!

Checkpoint 1

At this point you completed all basic exercises. This is good, but additional practice will deepen your understanding, so please keep carrying on…

B. At Home

Facebook attraction

In this part, we will re-use data from the ficticious study on attraction that we also used in WPA07. In this study, 1,000 heterosexual University students viewed the Facebook profile of another student (the “target” person) of the opposite sex. Based on a target person’s profile, each participant made three judgments about the target: their perceived intelligence, attractiveness, and dateability. The primary judgment of interest was the dateability rating indicating as how dateable the target person was perceived (ranging from a minimum value of 0 to a maximum value of 100).

Data description

The data file contains 1,000 rows and 10 columns. Here are the columns:

  1. session: The experiment session in which the study was run. There were 50 sessions in total.

  2. sex: The sex of the target person (“m” vs. “f”).

  3. age: The age of the target person (in years).

  4. haircolor: The hair color of the target person.

  5. university: The university that the target person attended.

  6. education: The highest level of education obtained by the target person.

  7. shirtless: Did the target person have a shirtless profile picture? (1.No vs. 2.Yes).

  8. intelligence: As how intelligent do you rate this target? (1.Low, 2.Medium, 3.High).

  9. attractiveness: As how physically attractive do you rate this target? (1.Low, 2.Medium, 3.High).

  10. dateability: As how dateable do you rate this target person? (Scale from 0 to 100).

Data loading and exploration

7a. The data are still located in a tab-delimited text file at http://Rpository.com/down/data/WPA07_facebook.txt. Using read.table() load this data into R as a new object called facebook.

Here is how the first few rows of the data should look:

head(facebook)
#>   session sex age haircolor   university    education shirtless
#> 1       1   m  23     brown 3.Goettingen    3.Masters     2.Yes
#> 2       1   m  19    blonde   2.Freiburg 1.HighSchool      1.No
#> 3       1   f  22     brown   2.Freiburg  2.Bachelors     2.Yes
#> 4       1   f  22       red   2.Freiburg  2.Bachelors      1.No
#> 5       1   m  23     brown 3.Goettingen  2.Bachelors      1.No
#> 6       1   m  26    blonde   2.Freiburg    3.Masters     2.Yes
#>   intelligence attractiveness dateability
#> 1        1.low         3.high          15
#> 2     2.medium       2.medium          44
#> 3        1.low       2.medium         100
#> 4     2.medium         3.high         100
#> 5     2.medium       2.medium          63
#> 6       3.high         3.high          76

7b. Inspect the first few rows of the dataframe with the head() function to make sure it loaded correctly. Using the str() function, look at the structure of the dataframe to make sure everything looks ok.

#> 'data.frame':    1000 obs. of  10 variables:
#>  $ session       : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ sex           : Factor w/ 2 levels "f","m": 2 2 1 1 2 2 1 2 1 1 ...
#>  $ age           : int  23 19 22 22 23 26 19 25 22 19 ...
#>  $ haircolor     : Factor w/ 3 levels "blonde","brown",..: 2 1 2 3 2 1 2 3 2 1 ...
#>  $ university    : Factor w/ 3 levels "1.Konstanz","2.Freiburg",..: 3 2 2 2 3 2 3 2 2 3 ...
#>  $ education     : Factor w/ 4 levels "1.HighSchool",..: 3 1 2 2 2 3 1 3 2 1 ...
#>  $ shirtless     : Factor w/ 2 levels "1.No","2.Yes": 2 1 2 1 1 2 1 2 2 1 ...
#>  $ intelligence  : Factor w/ 3 levels "1.low","2.medium",..: 1 2 1 2 2 3 1 3 2 2 ...
#>  $ attractiveness: Factor w/ 3 levels "1.low","2.medium",..: 3 2 2 3 2 3 2 1 2 2 ...
#>  $ dateability   : int  15 44 100 100 63 76 61 26 76 40 ...

Custom functions

8a. Write a function called my.mean() that takes a vector x as an argument, and returns the mean of the vector x. Don’t use the mean() function! Use sum() and length()!

my.mean <- function(___) {
  
  result <- sum(___) / length(___)
  
  return(result)
  
}

8b. Try your my.mean() function to calculate the mean dateability rating of participants in the facebook dataset and compare the result to the built-in mean() function to make sure you get the same result!

9a. Write a function called how.many.na() that takes a vector x as an argument, and returns the number of NA values found in the vector:

how.many.na <- function(x) {
  
  output <- sum(is.na(___))
  
  return(___)
}

9b. Test your how.many.na() function on the age of participants in the facebook dataframe and then on the vector x = c(4, 7, 3, NA, NA, 1).

10a. Create a function called my.plot() that takes arguments x and y and returns a customised scatterplot with gridlines and a regression line:

my.plot <- function(x, y) {
  
  plot(x = ___, 
       y = ___, 
       pch = ___, # check ?points to see the values of pch!
       col = ___
       )
  
  grid() # add gridlines
  
  # Add a regression line:
  abline(lm(___ ~ ___), 
         col = ___
         )
}

10b. Test your my.plot() function on the age and dateability of participants in the facebook dataset.

Custom loops

11. Using the following template, create a loop that calculates (and prints) the mean dateability rating of students from each university in the facebook dataset.

for(university.i in c("1.Konstanz", "2.Freiburg", "3.Goettingen")) {
  
  data.i <- facebook$dateability[facebook$university == ___]
  
  output <- paste0("The mean dateability of targets at university ", ___, " is ", ___)
  
  print(___)
  
}

12. Create a histogram of dateability ratings for each level of intelligence using the following structure:

par(mfrow = c(1, 3)) # set up 1 x 3 plotting grid

for(intelligence.i in c("1.low", "2.medium", "3.high")) {
  
  hist(facebook$dateability[facebook$intelligence == ___],
       main = ___,
       xlab = ___,
       col = ___
       )
  }

par(mfrow = c(1, 1)) # reset plotting grid

Checkpoint 2

If you got this far you’re doing great, but don’t give up just yet…

More functions

13a. Write a function called ttest.apa() that takes a vector x as an argument and returns an apa style conclusion from a one-sample test of x.

ttest.apa <- function(x, mu) {
  
   # Store the one-sample ttest in object a
  
  a <- t.test(x = ___, 
              mu = ___) 
  
  df <- a$parameter  # Get the degrees of freedom
  test.stat <- ___      # Get the test statistic
  p.value <- ___        # Get the p-value
  
  
  # If the test is significant...
  if(p.value <= ___) {
    
    output <- paste0("The test is significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
  }
  
  # If the test is not significant...
    if(p.value > ___) {
    
    output <- ___
    
  }
  
  print(___)  
}

13b. Test your ttest.apa() function on the the dateability of participants in the facebook study. Specifically, test if their mean dateability rating is different from 50.

More loops

14. Let’s use a design matrix to compute some descriptive statistics for specific subsets of the facebook data.

14a. Create an R object design.matrix that contains rows for all combinations of 2 independent variables: gender (with values of “f” or “m”) and haircolor (with values of “blonde”, “brown”, or “red”). Include placeholders (with NA values) for 3 dependent variables: the number of cases nr, the mean age age.mn, and the mean dateability dateability.mn). (Hint: Use expand.grid() function to initialize the design.matrix, which should have \(2 \cdot\ 3 = 6\) rows and \(2 + 3 = 5\) columns.)

Here’s how your design.matrix should look after its creation:

design.matrix
#>   gender haircolor nr age.mn dateability.mn
#> 1      f    blonde NA     NA             NA
#> 2      m    blonde NA     NA             NA
#> 3      f     brown NA     NA             NA
#> 4      m     brown NA     NA             NA
#> 5      f       red NA     NA             NA
#> 6      m       red NA     NA             NA

14b. Use a loop through all rows of your design.matrix that

  1. gets the factor (or independent) values from the current row,
  2. finds the corresponding subset in the facebook data,
  3. computes the desired statistics (nr, age.mn, and dateability.mn), and
  4. adds them into the appropriate columns of the current row of design.matrix.

Here’s how design.matrix looks after this loop:

design.matrix
#>   gender haircolor  nr   age.mn dateability.mn
#> 1      f    blonde 130 22.24615       58.14615
#> 2      m    blonde 120 22.22500       48.63333
#> 3      f     brown 230 22.33478       60.19130
#> 4      m     brown 270 22.40000       47.53333
#> 5      f       red 119 22.57983       63.67227
#> 6      m       red 131 22.13740       49.99237

14c. Use either dplyr or an ANOVA to check your results.

15. The following dataframe survey contains results from a survey of 5 participants. Each participant was asked 5 questions on a 1-10 Likert scale. As you can see, many of the responses are not valid integers from 1-10. Using a loop, create a new dataframe called survey.corrected that converts all invalid values to NA.

survey <- data.frame("p" = c(1, 2, 3, 4, 5),
                     "q1" = c(5, 3, 6, 3, 5),
                     "q2" = c(-1, 4, 3, 6, 11),
                     "q3" = c(6, 22, 4, 6, -5),
                     "q4" = c(6, 3, 4, -2, 4),
                     "q5" = c(1, 1, 900, 1, 2))

Here’s a template for the desired code and loop:

survey.corrected <- survey   # Copy survey

for(column.i in ___ ) {# Loop over columns
 
  x <- ___   # Copy original column as x
  x[x %in% ___ == FALSE] <- ___  # replace any value not in 1:10 by NA
  
  survey.corrected[ , ___] <- ___ # Assign x back to survey.correced
  
  }

Checkpoint 3

If you got this far you’re doing an amazing job — well done! Enjoy the following challenge…

C. Challenge: Simulation

16. Let’s conduct some simulations to address some deep problem (literally) of inferential statistics.

16a. What is the probability of getting a significant \(p\)-value if the null hypothesis is true? Test this by conducting the following simulation:

  • Create a vector called p.values with \(N = 100\) NA values.
  • Draw a sample size of 10 from a normal distribution with mean \(= 0\) and standard deviation \(= 1\).
  • Do a one-sample \(t\)-test testing if the mean of the distribution is different from 0. Save the \(p\)-value from this test in the 1st position of p.values.
  • Repeat these steps with a loop to fill all \(N\) p.values.
  • Create a histogram of p.values and calculate the proportion of p-values that are significant at the .05 level.
N <- 100

p.values <- rep(NA, ___)

for(i in ___) {

x <- rnorm(n = ___, mean = ___, sd = ___)

result <- t.test(___)$___

p.values[___] <- ___

}

16b. Create a function p.simulation that generalizes the simulation from Exercise 16a. with 4 arguments: sim: the number of simulations, sample.size: the sample size used in each simulation, mu.true: the true mean, and sd.true: the true standard deviation. That is, p.simulation should calculate sim \(p\)-values testing whether sample.size samples from a normal distribution with a true mean of mu.true and true standard deviation of sd.true is significantly different from 0. The function should return a vector of \(p\)-values.

Submission

That’s it – now it’s time to submit your assignment! –>

Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.