
Answers for WPA09 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Use the .Rmd Format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Commenting your code: Indicate the current assignment (e.g., WPA09), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA09_yymmdd.Rmd could look like this:

---
title: "Your Assignment Title (WPA09)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---

This file contains my solutions: 

# Exercise 1

To show and run R code in your document, use a code chunk (without the '#' symbols):

# ```{r, exercise_1, echo = TRUE, eval = TRUE}
# 
# v <- c(1, 2, 3) # some vector
# sum(v)
#     
# ```

More text and code chunks... 

[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
  3. Complete as many exercises as you can by Wednesday (23:59).

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias.


A. In Class

Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:

Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA09_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)

library(rmarkdown)
library(knitr)
library(yarrr)

# Store original par() settings:
opar <- par()
# par(opar) # restores original (default) par settings later
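
In the .Rmd source, this setup chunk can be named and hidden via its chunk header, for instance (shown with '#' symbols, as above; the chunk name setup is only an example):

# ```{r, setup, include = FALSE}
# library(rmarkdown)
# library(knitr)
# library(yarrr)
# ```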

0d. Make sure that you can create an .html output-file by “knitting” your current document.
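
Knitting can be triggered via the Knit button in RStudio or directly from the console, for instance (using the placeholder file name from 0b):

# Render the .Rmd file into its .html output file:
rmarkdown::render("LastFirst_WPA09_yymmdd.Rmd")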

Functions

1. Write a function called feed.me() that takes a string food as an argument, and returns the sentence “I love to eat ___!”. Test your function by running feed.me("apples"), etc.

feed.me <- function(___) {
  
  output <- paste0("I love to eat ", ___, "!")
  
  print(___)
}

A possible solution:

feed.me <- function(food) {
  
  output <- paste0("I love to eat ", food, "!")
  
  print(output)
  }

feed.me("apples")
#> [1] "I love to eat apples!"
feed.me("cake")
#> [1] "I love to eat cake!"

2a. Write a function called z.trans() that takes a vector of numbers \(v\) as input and returns the z-transformed (or standardized) values as output. (Hint: Remember that z.v <- (v - mean(v)) / sd(v), but beware that \(v\) could contain NA values.)

z.trans <- function(v) {

    if (!is.numeric(v)) {
      stop("v must be numeric.")
    }
  
    if (NA %in% v) {
      warning("v contains NA-values (ignored here).")
    }
  
    result <- (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
    result
}

2b. Demonstrate that z.trans() works as expected by some example calls.

z.trans(1:9)
#> [1] -1.4605935 -1.0954451 -0.7302967 -0.3651484  0.0000000  0.3651484
#> [7]  0.7302967  1.0954451  1.4605935

x <- c(1, NA, 2)
z.trans(x)
#> [1] -0.7071068         NA  0.7071068

z.trans(rep(1, 10)) # sd(rep(1, 10)) ==> 0!
#>  [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

iq <- rnorm(10, mean = 100, sd = 1)
mean(z.trans(iq)) # should approach 0:
#> [1] -5.447032e-15
sd(z.trans(iq)) # should approach 1: 
#> [1] 1

Loops

3a. Create a loop that prints the squares of integers from 1 to 10.

for(i in ___) {
  
  square.i <- ___
  
  print(square.i)
  
}

A possible solution:

for(i in 1:10) {
  
  square.i <- i^2
  
  print(square.i)
  }
#> [1] 1
#> [1] 4
#> [1] 9
#> [1] 16
#> [1] 25
#> [1] 36
#> [1] 49
#> [1] 64
#> [1] 81
#> [1] 100

3b. Create a loop that counts down from an integer start in steps of size step until an integer lower.bound is reached. Test your loop for start <- 1000, lower.bound <- 900 and step <- 33 and step <- 50. (Hint: If start is the 1st element, the maximal number of steps is ceiling((start - lower.bound)/step) + 1. Use an if statement to ensure that the current count is at least lower.bound. Using a while loop is easier than using a for loop for this task, but using seq() is even easier.)

# As functions:
# (a) using `for` loop:
countdown <- function(start, lower.bound, step) {
   count <- start
   for (i in 1:(ceiling((start - lower.bound)/step) + 1)) {
     if (count >= lower.bound) {
       print(count)
       count <- count - step
     }
   }
}

# (b) using `while` loop:
countdown.w <- function(start, lower.bound, step) {
  count <- start
  while (count >= lower.bound) {
    print(count)
    count <- count - step
  }
}

# (c) using seq():
countdown.s <- function(start, lower.bound, step) {
  counts <- seq(start, lower.bound, by = -step) # determine sequence of counts
  for (i in counts) {
    print(i)
  }
}

## Check functions:
# (a)
# countdown(1000, 900, 33)
# countdown(1000, 900, 50)
# (b)
# countdown.w(1000, 900, 33)
# countdown.w(1000, 900, 50)
# (c)
countdown.s(1000, 900, 33)
#> [1] 1000
#> [1] 967
#> [1] 934
#> [1] 901
countdown.s(1000, 900, 50)
#> [1] 1000
#> [1] 950
#> [1] 900

Functions with loops

4. The so-called Fibonacci sequence (see Wikipedia entry) starts with 0, 1, 1, 2, 3, 5, ... and is defined by the fact that every element (after the first two numbers) is the sum of the two preceding numbers (i.e., \(F_{n} = F_{n-2} + F_{n-1}\) for \(n > 2\)).

4a. Write a function Fibonacci(n) that uses a loop to print the first \(n\) Fibonacci numbers. (Hint: Initialize the series with c(0, 1) and use an if statement to directly return the result for \(0 < n \leq 2\). Use a loop to compute additional numbers if \(n \geq 3\).)

Fibonacci <- function(first.2 = c(0, 1), n = 2) {
  
  series <- first.2 # initialize series with its first.2 elements
  
  if (n < 1) { 
    warning("Warning: n must be positive.\n") 
    series <- NA
    }
  if (n > 0 & n < 3) { series <- series[1:n] }
  if (n > 2) {
  
    series <- c(series, rep(NA, n-2)) # initialize longer series (with n elements)
    
    for (i in 3:n) { # loop through elements 3 to n: 
      series[i] <- series[i - 2] + series[i - 1] # apply definition
    }
  }
  return(series)
}

# Check function:
Fibonacci() # using default arguments
#> [1] 0 1
Fibonacci(n = 5) # using 1st default argument
#> [1] 0 1 1 2 3
first.two <- c(0, 1)
Fibonacci(first.two, 0)
#> [1] NA
Fibonacci(first.two, 1)
#> [1] 0
Fibonacci(first.two, 15)
#>  [1]   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377

4b. One major advantage of a general function is that it can be used for alternative tasks. Test your function by computing a series of length n = 15 that uses the numbers 1 and 3 as its first two elements.

# Compute function for different first elements:
diff.first.two <- c(1, 3) # start with different elements
Fibonacci(diff.first.two, 15)
#>  [1]    1    3    4    7   11   18   29   47   76  123  199  322  521  843
#> [15] 1364

5. Functions on text (character objects): Let’s define some functions that count the occurrence of items in vectors or words in sentences.

5a. Create a function count() that takes a vector \(v\) and a scalar \(x\) as arguments and returns how many times \(x\) appears as an element of \(v\). (Hint: Use logical indexing to avoid using a loop, but note that \(x\) could be NA and \(v\) could contain NA values.)

count <- function(x, v) {
  result <- NA # initialize
  if (is.na(x)) {
    result <- sum(is.na(v))
  } else {
    result <- sum(x == v, na.rm = TRUE)
  }
  print(result)
}

5b. Demonstrate that count() works as expected by some example calls.

v <- c(NA, 1:2, NA, rep(1, 3), NA, 2:1)
v
#>  [1] NA  1  2 NA  1  1  1 NA  2  1
count(NA, v)
#> [1] 3
count(1, v)
#> [1] 5
count(2, v)
#> [1] 2
count(9, v)
#> [1] 0

5c. Create a function count.char() that takes a string of characters or word \(w\) and a string of characters or sentence \(s\) as arguments and returns how many times \(w\) appears in \(s\). (Hint: Use a loop over all possible positions of \(s\), and note that nchar() returns the length of a character string and substr(s, start, stop) returns the substring of \(s\) from position start to position stop.)

count.char <- function(w, s) {

  result <- 0 # initialize counter
  len.w <- 0
  len.s <- 0

  if (!is.character(w) | !is.character(s)) {
    warning("Warning: w and s need to be character strings.\n")
  } else { # get lengths of w and s:
    len.w <- nchar(w)
    len.s <- nchar(s)
  }

  if (len.s < len.w) {
    result <- 0
  } else {
    # loop over all possible positions in s:
    for (i in 1:(len.s - len.w + 1)) {
      if (tolower(substr(s, i, i + len.w - 1)) == tolower(w)) {
        result <- result + 1 # increment counter
      }
    }
  }

  print(paste0("Found ", result, " instances of ", w, " in '", s, "'."))
}

5d. Demonstrate that count.char() works as expected by some example calls.

s <- "An aardvark says: Aha!"

count.char(1, s)
#> [1] "Found 0 instances of 1 in 'An aardvark says: Aha!'."
count.char("!", s)
#> [1] "Found 1 instances of ! in 'An aardvark says: Aha!'."
count.char("a", s)
#> [1] "Found 7 instances of a in 'An aardvark says: Aha!'."
count.char("y", s)
#> [1] "Found 1 instances of y in 'An aardvark says: Aha!'."
count.char("z", s)
#> [1] "Found 0 instances of z in 'An aardvark says: Aha!'."
count.char("aha", s)
#> [1] "Found 1 instances of aha in 'An aardvark says: Aha!'."

6. Functions for graphs: Let’s plot and write a useful function for a simple financial data set.

6a. We will use a data frame \(df.st\) in this exercise. The corresponding data (to be loaded into \(df.st\)) contains the closing values of two prominent stock indices for all trading days of 2016 and is available at http://Rpository.com/down/data/WPA09_st.txt. (Hint: Note that this text file is tab-delimited and contains a header row.)

df.st <- read.table(file = "http://Rpository.com/down/data/WPA09_st.txt", # from url
                    sep = "\t", 
                    header = TRUE)

The first few lines of df.st should look as follows:

head(df.st)
#>           date      dax       dj
#> 1   04/01/2016 10283.44 17148.94
#> 2   05/01/2016 10310.10 17158.66
#> 3   06/01/2016 10214.02 16906.51
#> 4   07/01/2016  9979.85 16514.10
#> 5   08/01/2016  9849.34 16346.45
#> 6   11/01/2016  9825.07 16398.57

6b. Sliding average: Write a function slide.mean that calculates the sliding averages of the most recent \(w\) values of a vector \(v\) (ignoring any NA values, but returning NA if there are fewer than \(w\) preceding values). For instance, for v <- c(1, 2, 3, 1, 2, 3, NA, 2, 3) the function call slide.mean(v, 3) should return the vector NA NA 2.0 2.0 2.0 2.0 2.5 2.5 2.5 and slide.mean(v, 2) should return the vector NA 1.5 2.5 2.0 1.5 2.5 3.0 2.0 2.5.


slide.mean <- function(v, w) {
  
  n <- length(v)
  means <- rep(NA, length(v)) # initialize
  
  if (w > n) {
    
    warning("Warning: Window size w exceeds length(v)!\n") 
    return(means) 
    
  } else {
    
    for (i in w:n) {
      means[i] <- mean(v[(i-w+1):i], na.rm = TRUE)
    } 
    
    return(means)  
  }
}

# Check function:
v <- c(1, 2, 3, 1, 2, 3, NA, 2, 3)
slide.mean(v, 3)
#> [1]  NA  NA 2.0 2.0 2.0 2.0 2.5 2.5 2.5
slide.mean(v, 2)
#> [1]  NA 1.5 2.5 2.0 1.5 2.5 3.0 2.0 2.5
slide.mean(v, 10)
#> [1] NA NA NA NA NA NA NA NA NA

# Note: 
slide.mean(c(NA, 1, NA, NA), 2)
#> [1]  NA   1   1 NaN

6c. Create a plot that shows all stock-index series in (i.e., all stock columns of) the data frame \(df.st\), as well as their sliding averages, for a given window size \(w\). (Hint: First create a plot for specific instances of \(df.st\) and \(w\), then generalise this plot into a function.)

A specific plot:

w <- 200
colors <- yarrr::piratepal("basel")

# (a) specific plot:
plot(x = 1:nrow(df.st), type = "n", main = "Time series",
     ylim = range(df.st[ , -1]), # y-range over both stock columns (excluding the date column)
     xlab = "Time", ylab = "Values")
grid(col = "grey60")

# (b) add lines:
lines(df.st$dax, lty = 1, col = colors[1], lwd = 2)
sm.dax <- slide.mean(df.st$dax, w)
lines(sm.dax, lty = 1, col = colors[1], lwd = 1)

lines(df.st$dj, lty = 1, col = colors[2], lwd = 2)
sm.dj  <- slide.mean(df.st$dj, w)
lines(sm.dj, lty = 1, col = colors[2], lwd = 1)

# (c) add legend:
legend("topleft", names(df.st)[-1], lty = 1, lwd = 2, col = colors)

The following plot visualizes the data contained in df.st:

# (1) A specific plot:
w <- 100
colors <- yarrr::piratepal("basel")

# (a) determine ranges for axes:
all.values <- c(df.st$dax, df.st$dj) # to determine y-range

# df.st$date[1]
dates <- as.Date(df.st$date, "%d/%m/%Y") # determine dates
# dates[1]
# format(dates[1], format="%d %b %Y")

# (b) define plotting area:
plot(x = 1:nrow(df.st), type = "n",  
     ylim = range(all.values),  
     xaxt = "n", # suppress x axis  
     main = "Specific plot", 
     xlab = "Date", ylab = "Values")
axis.Date(side = 1, dates, format = "%d %b")
grid(col = "grey60")

# (c) add lines:
lines(df.st$dax, lty = 1, col = colors[1], lwd = 2)
sm.dax <- slide.mean(df.st$dax, w)
lines(sm.dax, lty = 1, col = colors[1], lwd = 1)

lines(df.st$dj, lty = 1, col = colors[2], lwd = 2)
sm.dj  <- slide.mean(df.st$dj, w)
lines(sm.dj, lty = 1, col = colors[2], lwd = 1)

# (d) add legend:
legend("topleft", names(df.st[,-1]), lty = 1, lwd = 2, col = colors)

Generalization to a function:

6d. Write a function plot.stocks() that generalizes the above code by using the data set df and a specific value of w as input arguments. However, the function should still assume that the first column of df contains date values and that all subsequent columns contain stock-index values.

A possible solution (generalizing the code above into a function that produces the plot):

# (2) As a function:
plot.stocks <- function(df, w) {
  
  m <- nrow(df)
  n <- ncol(df)
  stocks <- n - 1 
  colors <- yarrr::piratepal("basel")
  
  # (a) determine ranges for axes:
  all.values <- df[ , -1] # all columns except 1st (to determine y-range)
  dates <- as.Date(df[ , 1], "%d/%m/%Y") # determine dates by 1st column

  # (b) define plotting area:
  plot(x = 1:nrow(df), type = "n", 
       ylim = range(all.values), 
       xaxt = "n", # suppress x axis  
       main = "Generalized plot", 
       xlab = "Dates", ylab = "Values")
  
  axis.Date(side = 1, dates, format = "%d %b")
  
  grid(col = "grey60")
  
  # (c) add lines:  
  for (i in 1:stocks) { # for each stock column in df:
    
    cur.ts <- df[ , (i + 1)] # current series (starting in 2nd column):
    
    lines(cur.ts, lty = 1, col = colors[i], lwd = 2)
    sl.mn <- slide.mean(cur.ts, w)
    lines(sl.mn, lty = 1, col = colors[i], lwd = 1)
    
    }

  # (d) add legend:
  legend("topleft", names(df)[-1], lty = 1, lwd = 2, col = colors)

}

Calling the function with df.st as input data should yield the same kind of plot (here with a window size of w = 30):

# Check function:
plot.stocks(df.st, w = 30)

Note that the plot.stocks() function allows not only for different values of w, but also for different data sets (e.g., containing a flexible number of stocks):


# Compute standardized values (using the z.trans from above):
z.dax <- z.trans(df.st$dax)
z.dj  <- z.trans(df.st$dj)

df.new <- cbind(df.st, z.dax, z.dj) # append new columns to df.st
df.new <- df.new[ , c(1, 4, 5)]     # drop old dax and dj columns

# Show that function still works for this new data set:
plot.stocks(df.new, w = 50) 

Checkpoint 1

At this point, you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please keep going…

B. At Home

Facebook attraction

In this part, we will re-use data from the fictitious study on attraction that we also used in WPA07. In this study, 1,000 heterosexual university students viewed the Facebook profile of another student (the “target” person) of the opposite sex. Based on a target person’s profile, each participant made three judgments about the target: their perceived intelligence, attractiveness, and dateability. The primary judgment of interest was the dateability rating, which indicated how dateable the target person was perceived to be (ranging from a minimum value of 0 to a maximum value of 100).

Data description

The data file contains 1,000 rows and 10 columns. Here are the columns:

  1. session: The experiment session in which the study was run. There were 50 sessions in total.

  2. sex: The sex of the target person (“m” vs. “f”).

  3. age: The age of the target person (in years).

  4. haircolor: The hair color of the target person.

  5. university: The university that the target person attended.

  6. education: The highest level of education obtained by the target person.

  7. shirtless: Did the target person have a shirtless profile picture? (1.No vs. 2.Yes).

  8. intelligence: How intelligent do you rate this target? (1.Low, 2.Medium, 3.High).

  9. attractiveness: How physically attractive do you rate this target? (1.Low, 2.Medium, 3.High).

  10. dateability: How dateable do you rate this target person? (Scale from 0 to 100).

Data loading and exploration

7a. The data are still located in a tab-delimited text file at http://Rpository.com/down/data/WPA07_facebook.txt. Using read.table() load this data into R as a new object called facebook.
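
A possible solution (analogous to the read.table() call for df.st above, assuming a tab-delimited file with a header row):

facebook <- read.table(file = "http://Rpository.com/down/data/WPA07_facebook.txt", # from url
                       sep = "\t", 
                       header = TRUE)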

Here is how the first few rows of the data should look:

head(facebook)
#>   session sex age haircolor   university    education shirtless
#> 1       1   m  23     brown 3.Goettingen    3.Masters     2.Yes
#> 2       1   m  19    blonde   2.Freiburg 1.HighSchool      1.No
#> 3       1   f  22     brown   2.Freiburg  2.Bachelors     2.Yes
#> 4       1   f  22       red   2.Freiburg  2.Bachelors      1.No
#> 5       1   m  23     brown 3.Goettingen  2.Bachelors      1.No
#> 6       1   m  26    blonde   2.Freiburg    3.Masters     2.Yes
#>   intelligence attractiveness dateability
#> 1        1.low         3.high          15
#> 2     2.medium       2.medium          44
#> 3        1.low       2.medium         100
#> 4     2.medium         3.high         100
#> 5     2.medium       2.medium          63
#> 6       3.high         3.high          76

7b. Inspect the first few rows of the dataframe with the head() function to make sure it loaded correctly. Using the str() function, look at the structure of the dataframe to make sure everything looks ok.

str(facebook)
#> 'data.frame':    1000 obs. of  10 variables:
#>  $ session       : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ sex           : Factor w/ 2 levels "f","m": 2 2 1 1 2 2 1 2 1 1 ...
#>  $ age           : int  23 19 22 22 23 26 19 25 22 19 ...
#>  $ haircolor     : Factor w/ 3 levels "blonde","brown",..: 2 1 2 3 2 1 2 3 2 1 ...
#>  $ university    : Factor w/ 3 levels "1.Konstanz","2.Freiburg",..: 3 2 2 2 3 2 3 2 2 3 ...
#>  $ education     : Factor w/ 4 levels "1.HighSchool",..: 3 1 2 2 2 3 1 3 2 1 ...
#>  $ shirtless     : Factor w/ 2 levels "1.No","2.Yes": 2 1 2 1 1 2 1 2 2 1 ...
#>  $ intelligence  : Factor w/ 3 levels "1.low","2.medium",..: 1 2 1 2 2 3 1 3 2 2 ...
#>  $ attractiveness: Factor w/ 3 levels "1.low","2.medium",..: 3 2 2 3 2 3 2 1 2 2 ...
#>  $ dateability   : int  15 44 100 100 63 76 61 26 76 40 ...

Custom functions

8a. Write a function called my.mean() that takes a vector x as an argument, and returns the mean of the vector x. Don’t use the mean() function! Use sum() and length()!

my.mean <- function(___) {
  
  result <- sum(___) / length(___)
  
  return(result)
  
}

A possible solution:

my.mean <- function(x) {
  
  result <- sum(x) / length(x)
  
  return(result)
  
}

8b. Try your my.mean() function to calculate the mean dateability rating of participants in the facebook dataset and compare the result to the built-in mean() function to make sure you get the same result!

my.mean(facebook$dateability)
#> [1] 54.199

mean(facebook$dateability)
#> [1] 54.199

9a. Write a function called how.many.na() that takes a vector x as an argument, and returns the number of NA values found in the vector:

how.many.na <- function(x) {
  
  output <- sum(is.na(___))
  
  return(___)
}

A possible solution:

how.many.na <- function(x) {
  
  output <- sum(is.na(x))
  
  return(output)
}

9b. Test your how.many.na() function on the age of participants in the facebook dataframe and then on the vector x = c(4, 7, 3, NA, NA, 1).

how.many.na(facebook$age)
#> [1] 0

how.many.na(c(4, 7, 3, NA, NA, 1))
#> [1] 2

10a. Create a function called my.plot() that takes arguments x and y and returns a customised scatterplot with gridlines and a regression line:

my.plot <- function(x, y) {
  
  plot(x = ___, 
       y = ___, 
       pch = ___, # check ?points to see the values of pch!
       col = ___
       )
  
  grid() # add gridlines
  
  # Add a regression line:
  abline(lm(___ ~ ___), 
         col = ___
         )
}

A possible solution:

my.plot <- function(x, y) {
  
  plot(x = x, 
       y = y, 
       pch = 16, # check ?points to see the values of pch!
       col = yarrr::transparent("steelblue", trans.val = .6), 
       main = "My fancy scatterplot with regression line"
       )
  
  grid() # add gridlines
  
  # Add a regression line:
  abline(lm(y ~ x), 
         col = "red3",
         lty = 2,
         lwd = 2
         )
  }

10b. Test your my.plot() function on the age and dateability of participants in the facebook dataset.

my.plot(facebook$age, 
        facebook$dateability
        )

Custom loops

11. Using the following template, create a loop that calculates (and prints) the mean dateability rating of students from each university in the facebook dataset.

for(university.i in c("1.Konstanz", "2.Freiburg", "3.Goettingen")) {
  
  data.i <- facebook$dateability[facebook$university == ___]
  
  output <- paste0("The mean dateability of targets at university ", ___, " is ", ___)
  
  print(___)
  
}

A possible solution:

for(university.i in c("1.Konstanz", "2.Freiburg", "3.Goettingen")) {
  
  data.i <- facebook$dateability[facebook$university == university.i]
  
  output <- paste0("The mean dateability of targets at university ", university.i, " is ", round(mean(data.i), 2), ".")
  
  print(output)
  
}
#> [1] "The mean dateability of targets at university 1.Konstanz is 60.75."
#> [1] "The mean dateability of targets at university 2.Freiburg is 50.48."
#> [1] "The mean dateability of targets at university 3.Goettingen is 52.11."

12. Create a histogram of dateability ratings for each level of intelligence using the following structure:

par(mfrow = c(1, 3)) # set up 1 x 3 plotting grid

for(intelligence.i in c("1.low", "2.medium", "3.high")) {
  
  hist(facebook$dateability[facebook$intelligence == ___],
       main = ___,
       xlab = ___,
       col = ___
       )
  }

par(mfrow = c(1, 1)) # reset plotting grid

A possible solution:

par(mfrow = c(1, 3)) # set up 1 x 3 plotting grid

for(intelligence.i in c("1.low", "2.medium", "3.high")) {
  
  hist(facebook$dateability[facebook$intelligence == intelligence.i],
       main = intelligence.i,
       xlab = "Dateability",
       col = yarrr::piratepal("appletv")[as.numeric(substr(intelligence.i, 1, 1))]
       )
  }


par(mfrow = c(1, 1)) # reset plotting grid

Checkpoint 2

If you got this far, you're doing great, but don't give up just yet…

More functions

13a. Write a function called ttest.apa() that takes a vector x and a null-hypothesis mean mu as arguments, and returns an APA-style conclusion from a one-sample t-test of x.

ttest.apa <- function(x, mu) {
  
   # Store the one-sample ttest in object a
  
  a <- t.test(x = ___, 
              mu = ___) 
  
  df <- a$parameter  # Get the degrees of freedom
  test.stat <- ___      # Get the test statistic
  p.value <- ___        # Get the p-value
  
  
  # If the test is significant...
  if(p.value <= ___) {
    
    output <- paste0("The test is significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
  }
  
  # If the test is not significant...
    if(p.value > ___) {
    
    output <- ___
    
  }
  
  print(___)  
}

A possible solution:

ttest.apa <- function(x, mu) {
  
   # Store the one-sample ttest in object a
  
  a <- t.test(x = x, 
              mu = mu) 
  
  df <- a$parameter  # Get the degrees of freedom
  test.stat <- a$statistic      # Get the test statistic
  p.value <- a$p.value        # Get the p-value
  
  
  # If the test is significant...
  if(p.value <= .05) {
    
    output <- paste0("The test is significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
  }
  
  # If the test is not significant...
    if(p.value > .05) {
    
    output <- paste0("The test is not significant! t(",
                     df, ") = ", test.stat, 
                   ", p = ", p.value, 
                   " (H0 = ", mu, ")")
    
  }
  
  print(output)  
}

13b. Test your ttest.apa() function on the dateability of participants in the facebook study. Specifically, test if their mean dateability rating is different from 50.

ttest.apa(facebook$dateability, 50)
#> [1] "The test is significant! t(999) = 4.93261151453317, p = 9.49897937574552e-07 (H0 = 50)"

More loops

14. Let’s use a design matrix to compute some descriptive statistics for specific subsets of the facebook data.

14a. Create an R object design.matrix that contains rows for all combinations of 2 independent variables: gender (with values of “f” or “m”) and haircolor (with values of “blonde”, “brown”, or “red”). Include placeholders (with NA values) for 3 dependent variables: the number of cases nr, the mean age age.mn, and the mean dateability dateability.mn. (Hint: Use the expand.grid() function to initialize the design.matrix, which should have \(2 \cdot 3 = 6\) rows and \(2 + 3 = 5\) columns.)

design.matrix <- expand.grid(gender = c("f", "m"), haircolor = c("blonde", "brown", "red"), 
                             nr = NA, age.mn = NA, dateability.mn = NA)

Here’s how your design.matrix should look after its creation:

design.matrix
#>   gender haircolor nr age.mn dateability.mn
#> 1      f    blonde NA     NA             NA
#> 2      m    blonde NA     NA             NA
#> 3      f     brown NA     NA             NA
#> 4      m     brown NA     NA             NA
#> 5      f       red NA     NA             NA
#> 6      m       red NA     NA             NA

14b. Use a loop through all rows of your design.matrix that

  1. gets the factor (or independent) values from the current row,
  2. finds the corresponding subset in the facebook data,
  3. computes the desired statistics (nr, age.mn, and dateability.mn), and
  4. adds them into the appropriate columns of the current row of design.matrix.

for(row.i in 1:nrow(design.matrix)) { # for every row in design.matrix:

  # a) get factor values for current row:
  gender.i <- design.matrix$gender[row.i]
  color.i <- design.matrix$haircolor[row.i]

  # b) subset facebook data for current factor values:
  facebook.i <- subset(facebook, sex == gender.i & haircolor == color.i)

  # c) calculate desired statistics:
  nr.i <- nrow(facebook.i)
  age.mn.i <- mean(facebook.i$age)
  dateability.mn.i <- mean(facebook.i$dateability)
  
  # d) Add these statistics to row.i of design.matrix:
  design.matrix$nr[row.i] <- nr.i
  design.matrix$age.mn[row.i] <- age.mn.i
  design.matrix$dateability.mn[row.i] <- dateability.mn.i
  
} # next row.

Here’s how design.matrix looks after this loop:

design.matrix
#>   gender haircolor  nr   age.mn dateability.mn
#> 1      f    blonde 130 22.24615       58.14615
#> 2      m    blonde 120 22.22500       48.63333
#> 3      f     brown 230 22.33478       60.19130
#> 4      m     brown 270 22.40000       47.53333
#> 5      f       red 119 22.57983       63.67227
#> 6      m       red 131 22.13740       49.99237

14c. Use either dplyr or an ANOVA to check your results.

library(dplyr)

fb.agg <- facebook %>% 
  group_by(sex, haircolor) %>%
  summarise(nr = n(),
            age.mn = mean(age),
            db.mn = mean(dateability)
            )
fb.agg
#> # A tibble: 6 x 5
#> # Groups:   sex [?]
#>      sex haircolor    nr   age.mn    db.mn
#>   <fctr>    <fctr> <int>    <dbl>    <dbl>
#> 1      f    blonde   130 22.24615 58.14615
#> 2      f     brown   230 22.33478 60.19130
#> 3      f       red   119 22.57983 63.67227
#> 4      m    blonde   120 22.22500 48.63333
#> 5      m     brown   270 22.40000 47.53333
#> 6      m       red   131 22.13740 49.99237

Note: The results are identical (qed).
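
For the ANOVA route mentioned in 14c, a minimal sketch could look as follows (the aggregate() call is a base R analogue of the dplyr summary above):

# 2 (sex) x 3 (haircolor) ANOVA on dateability:
summary(aov(dateability ~ sex * haircolor, data = facebook))

# Corresponding cell means:
aggregate(cbind(age, dateability) ~ sex + haircolor, data = facebook, FUN = mean)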

15. The following dataframe survey contains results from a survey of 5 participants. Each participant was asked 5 questions on a 1-10 Likert scale. As you can see, many of the responses are not valid integers from 1-10. Using a loop, create a new dataframe called survey.corrected that converts all invalid values to NA.

survey <- data.frame("p" = c(1, 2, 3, 4, 5),
                     "q1" = c(5, 3, 6, 3, 5),
                     "q2" = c(-1, 4, 3, 6, 11),
                     "q3" = c(6, 22, 4, 6, -5),
                     "q4" = c(6, 3, 4, -2, 4),
                     "q5" = c(1, 1, 900, 1, 2))

Here’s a template for the desired code and loop:

survey.corrected <- survey   # Copy survey

for(column.i in ___ ) { # Loop over columns

  x <- ___   # Copy original column as x
  x[(x %in% ___) == FALSE] <- ___  # replace any value not in 1:10 by NA

  survey.corrected[ , ___] <- ___ # Assign x back to survey.corrected
  
  }

A possible solution:

survey.corrected <- survey    # Copy survey

for(column.i in 2:6 ) { # Loop over columns
 
  x <- survey[ , column.i]   # Copy original column as x
  x[(x %in% 1:10) == FALSE] <- NA # replace any value not in 1:10 by NA
  
  survey.corrected[ , column.i] <- x # Assign x back to survey.corrected
  
  }

survey.corrected
#>   p q1 q2 q3 q4 q5
#> 1 1  5 NA  6  6  1
#> 2 2  3  4 NA  3  1
#> 3 3  6  3  4  4 NA
#> 4 4  3  6  6 NA  1
#> 5 5  5 NA NA  4  2

Checkpoint 3

If you got this far, you're doing an amazing job — well done! Enjoy the following challenge…

C. Challenge: Simulation

16. Let’s conduct some simulations to address a deep problem (literally) of inferential statistics.

16a. What is the probability of getting a significant \(p\)-value if the null hypothesis is true? Test this by conducting the following simulation:

  • Create a vector called p.values with \(N = 100\) NA values.
  • Draw a sample of size 10 from a normal distribution with mean \(= 0\) and standard deviation \(= 1\).
  • Do a one-sample \(t\)-test testing if the mean of the distribution is different from 0. Save the \(p\)-value from this test in the 1st position of p.values.
  • Repeat these steps with a loop to fill all \(N\) p.values.
  • Create a histogram of p.values and calculate the proportion of p-values that are significant at the .05 level.

N <- 100

p.values <- rep(NA, ___)

for(i in ___) {

x <- rnorm(n = ___, mean = ___, sd = ___)

result <- t.test(___)$___

p.values[___] <- ___

}

A possible solution:

N <- 100

p.values <- rep(NA, N) # initialize vector for results

for(i in 1:N) {
  
  x <- rnorm(n = 10, mean = 0, sd = 1)
  
  result <- t.test(x)$p.value
  
  p.values[i] <- result
  
  }

# Examine resulting p.values:
mean(p.values < .05)
#> [1] 0.09
hist(p.values)

16b. Create a function p.simulation that generalizes the simulation from Exercise 16a with 4 arguments: sim: the number of simulations, sample.size: the sample size used in each simulation, mu.true: the true mean, and sd.true: the true standard deviation. That is, p.simulation should calculate sim \(p\)-values, each testing whether the mean of sample.size samples from a normal distribution with a true mean of mu.true and a true standard deviation of sd.true is significantly different from 0. The function should return a vector of \(p\)-values.

p.simulation <- function(sim, 
                         sample.size, 
                         mu.true, 
                         sd.true) {
  
  p.values <- rep(NA, sim)
  
  for(i in 1:sim) {
    
    x <- rnorm(n = sample.size, mean = mu.true, sd = sd.true)
    
    result <- t.test(x)$p.value
    
    p.values[i] <- result
    
  }
  
  return(p.values)
  
}

# Testing the function: 
p.vals <- p.simulation(sim = 100, sample.size = 20, mu.true = 0, sd.true = 1)
sum(p.vals < .05)
#> [1] 3
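
As p.simulation() is general, it can also be called with a non-zero true mean to see how often the test rejects the null hypothesis in that case (a quick sketch; results vary from run to run):

p.vals.2 <- p.simulation(sim = 100, sample.size = 20, mu.true = .5, sd.true = 1)
mean(p.vals.2 < .05) # proportion of significant p-values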

That’s it – hope you enjoyed working on this assignment!


[WPA09_answers.Rmd updated on 2018-01-15 13:25:47 by hn.]