uni.kn.logo

Answers for WPA05 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Also enter the current assignment (e.g., WPA05), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appopriate comments.

Here is an example how your file JillJack_WPA05_171127.R could look:

# Assignment: WPA 05
# Name: Jill, Jack
# Date: 2017 Nov 27
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# A. In Class

# Exercise 1: Getting started
# ...
  1. Complete as many exercises as you can by Wednesday (23:59).

  2. Submit your script or output file (including all code) to the appropriate folder on Ilias.


A. In Class

Here are some warm-up exercises that review some important skills from previous and practice the basic concepts of the current chapter:

Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA##_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. You need the latest version of the yarrr package (v0.1.2) in this WPA. Install the package from CRAN with install.packages() if you haven’t already done so.

0d. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document.)

library(rmarkdown)
library(knitr)
library(yarrr)

0e. Make sure that you can create an .html output-file by “knitting” your current document.

0f. When working with graphics, it is advisable to save the original par() settings (e.g., into an object opar), so that they can be restored later (by evaluating par(opar)):

opar <- par() # saves original (default) par settings
par(opar)  # restores original (default) par settings

Basic Plots

Aggregating and plotting data

1. This exercise practices previous skills (on loading and aggregating data) and combines them with a first plotting command:

1a. Reload the data set clothes.csv (described and used in WPA04) – which is available here – into a data frame called clothes.df. (Hint: Use the read.table() function for a comma-delimited file containing a header row.)

clothes.df <- read.table(file = "http://Rpository.com/down/data/WPA04_clothes.csv", # from url
                         #file = "data/WPA04_clothes.csv", # local file
                         header = TRUE, sep = ",")

1b. Calculate the average price for shoes and suits for men and women. (Hint: This can be achieved in 4 commands with appropriate subsets. Alternatively, use the deplyr package with appropriate filter and group settings for the type and group variables of clothes.df.)

# In 4 steps:
shoes.men <- with(clothes.df, mean(price[type == "shoes" & group == "men"]))
shoes.fem <- with(clothes.df, mean(price[type == "shoes" & group == "women"]))
suits.men <- with(clothes.df, mean(price[type == "suit" & group == "men"]))
suits.fem <- with(clothes.df, mean(price[type == "suit" & group == "women"]))

# In 1 step:
library("dplyr")
price.agg <- clothes.df %>%
  filter((type == "shoes" | type == "suit") & (group != "kids")) %>%
  group_by(type, group) %>%
  summarise(n = n(),
            price.mn = mean(price))
price.agg$price.mn
#> [1] 102.6150 112.7199 146.1280 126.4025

1c. Create a stacked barplot that shows the average price for shoes and suits for men and women. (Hint: You first need to store the 4 means computed in Exercise 1b. in a 2 x 2-matrix.)

# Represent 4 mean prices as a 2x2 matrix:
price.mtx <- cbind(price.agg[price.agg$type == "shoes", ]$price.mn, # mean prices for shoes
                   price.agg[price.agg$type == "suit", ]$price.mn)  # mean prices for suits
colnames(price.mtx) <- c("shoes", "suits")
rownames(price.mtx) <- c("men", "women")

# Draw barplot:
barplot(height = price.mtx, # use the matrix price.m as bar heights
        beside = TRUE,      # put bars next to each other 
        legend.text = TRUE, # add legend
        col = c(unikn.pal$seeblau4, unikn.pal$seeblau2), 
        main = "Prices of shoes and suits by gender", 
        ylab = "Mean price", 
        xlab = "Shoes and suits by gender",
        ylim = c(0, ceiling(max(price.mtx)) + 20)
        )

2. In Exercise 3c of WPA04 you compared the average recommended retail prices of different kinds (or type) of clothing items by the group for which they are designed. Calculate these means again (using the aggregate() function) and compare the result with a visualization of the raw data (using the yarrr package’s pirateplot() function).

aggregate(formula = price ~ group + type, FUN = mean, data = clothes.df)
#>    group  type     price
#> 1   kids dress  89.54882
#> 2  women dress 108.91830
#> 3   kids pants  67.90786
#> 4    men pants  87.67486
#> 5  women pants 101.74466
#> 6   kids shirt  69.69231
#> 7    men shirt  86.63246
#> 8  women shirt  93.33361
#> 9   kids shoes  91.36000
#> 10   men shoes 102.61500
#> 11 women shoes 112.71986
#> 12   men  suit 146.12800
#> 13 women  suit 126.40250

yarrr::pirateplot(formula = price ~ group + type,
           data = clothes.df,
           main = "Price of clothes by group and type",
           xlab = "Group and type",
           ylab = "Price",
           theme = 2,
           pal = "southpark",
           # pal = c(seeblau.col[2], seeblau.col[4], seeblau.col[1], seeblau.col[3]),
           back.col = unikn.pal[9],
           gl.col = unikn.pal[4],
           cex.lab = .9
           )

Curves

3. As we’ll be creating histograms and scatterplots below, let’s practice some function curves here. You want to plot some curvy functions, but an evil spirit messed up the following code to only show straight lines. (Hint: Modify the curve() functions to match the definitions in the legend. See the example code in YaRrr, Ch. 9, Figure 57.)

# Define plot area: 
plot(1, xlim = c(-5, 5), ylim = c(-5, 5), type = "n", 
     main = "Plotting function lines with CORRUPTED curves", ylab = "", xlab = "")
abline(h = 0) # x-axis
abline(v = 0) # y-axis

# Define colors:
col.vec <- c("blue3", "red3", "green3", "orange")

# Define 4 curves:
curve(expr = x*1, from = -5, to = 5, add = TRUE, lwd = 3, col = col.vec[1])
curve(expr = -1/3*x, from = 0, to = 5, add = TRUE, lwd = 3, col = col.vec[2])
curve(expr = 3*x/x, from = -5, to = 5, add = TRUE, lwd = 3, col = col.vec[3])

my.fun <- function(x) {return(dnorm(x, mean = 100, sd = 10))} 
curve(expr = my.fun, from = -5, to = 5, add = TRUE, lwd = 3, col = col.vec[4])

# Add legend:
legend("bottomright", 
       legend = c("x^3", "-x^(1/3)", "-sin(x)*3", "dnorm(x, 3, 1/3"), 
       col = col.vec[1:4], lwd = 3, lty = 1, cex = 1, bty = "n")

A possible solution

# Define plot area: 
plot(1, xlim = c(-5, 5), ylim = c(-5, 5), type = "n", 
     main = "Plotting function lines with CORRECTED curves", ylab = "", xlab = "")
abline(h = 0) # x-axis
abline(v = 0) # y-axis

# Define colors:
col.vec <- c("blue3", "red3", "green3", "orange")

# Define 4 curves:
curve(expr = x^3, from = -5, to = 5, add = TRUE, lwd = 3, col = col.vec[1])
curve(expr = -x^(1/3), from = 0, to = 5, add = TRUE, lwd = 3, col = col.vec[2])
curve(expr = -sin(x)*3, from = -5, to = 5, add = TRUE, lwd = 3, col = col.vec[3])

my.fun <- function(x) {return(dnorm(x, mean = 3, sd = 1/3))} 
curve(expr = my.fun, from = -5, to = 5, add = TRUE, lwd = 3, col = col.vec[4])

# Add legend:
legend("bottomright", 
       legend = c("x^3", "-x^(1/3)", "-sin(x)*3", "dnorm(x, 3, 1/3"), 
       col = col.vec[1:4], lwd = 3, lty = 1, cex = 1, bty = "n")

Plotting points and text

4. Assume you wanted to visualize that R can be used for Statistics, Graphics, and Simulations. The following plot provides a good start, but suggests that the three tasks are quite isolated:

par(mar = c(0, 0, 0, 0)) # remove margins
plot(1, xlim = c(0, 10), ylim = c(0, 10), 
     type = "n", xaxt = "n", yaxt = "n", xlab = "", ylab = "", bty = "n") # empty canvas

# Plot 3 points:
points(x = c(2, 5, 8), y = c(3, 3, 3), 
       pch = 21, cex = 40, 
       col = gray(.5, .5), 
       bg = c("yellow1", "red1", "green1")
       # bg = c(unikn.pal[1], unikn.pal[2], unikn.pal[3])
       )

# Add text labels:
text(5, 8, "R = ", cex = 2)
text(2, 3, "Statistics", cex = 1.5)
text(5, 3, "Graphics", cex = 1.5)
text(8, 3, "Simulations", cex = 1.5)


# par(opar) # restore default settings

Modify the code to convey a more integrated perspective on R. (Hint: Shift the positions of the points and labels so that the three domains overlap – like Venn diagrams – and use transparent colors.)

A possible solution

par(mar = c(0, 0, 0, 0)) # remove margins
plot(1, xlim = c(0, 10), ylim = c(0, 10), 
     type = "n", xaxt = "n", yaxt = "n", xlab = "", ylab = "", bty = "n") # empty canvas

x <- 5; y <- 6 # basic coordinate
x.dif <- 1; y.dif <- 2 # shift values

# Plot 3 points:
points(x = c(x, (x - x.dif), (x + x.dif)), y = c(y, (y - y.dif), (y - y.dif)),
       pch = 21, cex = 40, 
       col = gray(.5, .5), 
       bg = c(yarrr::transparent("yellow1", .8), 
              yarrr::transparent("red1", .8), 
              yarrr::transparent("green1", .8))
       # bg = c(unikn.pal.ty[1], 
       #        unikn.pal.ty[2], 
       #        unikn.pal.ty[3])
       )

# Text:
y2 <- y - 1.5 # lower y position of text labels

text(x, y2, "R = ", cex = 2)
text(x, y + 0.8 * y.dif, "Statistics", cex = 1.5)
text(x - 1.6 * x.dif, y2 - 0.8 * y.dif, "Graphics", cex = 1.5)
text(x + 1.6 * x.dif, y2 - 0.8 * y.dif, "Simulations", cex = 1.5)


# par(opar) # restore default settings

Colors

5. R is great for using colors to make graphic representations more interesting and more informative. Here are some exercises that practice some color settings. (Hint: The colors() command shows you the 657 colors that are predefined in R.)

5a. The famous Stroop task requires that participants name the colors in which various color names are printed, which is much more difficult than reading the series of color names that are printed in various colors (presumably as the reading of the color words is automatic and difficult to inhibit). The following code provides the basic ingredients for such a task, but the color names are too regular and all printed in white:1

par(mar = c(0, 0, 0, 0), bg = "gray20")
plot(1, xlim = c(0, 10), ylim = c(0, 10), type = "n", 
     xaxt = "n", yaxt = "n", xlab = "", ylab = "", bty = "n") # empty canvas

my.colors <- c("color1", "color2", "color3", "color4")

N <- 4*5
col.names <- rep(my.colors, 5)

text(2, 1,  col.names[1], col = "white", cex = 3)
text(4, 1,  col.names[2], col = "white", cex = 3)
text(6, 1,  col.names[3], col = "white", cex = 3)
text(8, 1,  col.names[4], col = "white", cex = 3)

text(2, 3,  col.names[5], col = "white", cex = 3)
text(4, 3,  col.names[6], col = "white", cex = 3)
text(6, 3,  col.names[7], col = "white", cex = 3)
text(8, 3,  col.names[8], col = "white", cex = 3)

text(2, 5,  col.names[9], col = "white", cex = 3)
text(4, 5, col.names[10], col = "white", cex = 3)
text(6, 5, col.names[11], col = "white", cex = 3)
text(8, 5, col.names[12], col = "white", cex = 3)

text(2, 7, col.names[13], col = "white", cex = 3)
text(4, 7, col.names[14], col = "white", cex = 3)
text(6, 7, col.names[15], col = "white", cex = 3)
text(8, 7, col.names[16], col = "white", cex = 3)

text(2, 9, col.names[17], col = "white", cex = 3)
text(4, 9, col.names[18], col = "white", cex = 3)
text(6, 9, col.names[19], col = "white", cex = 3)
text(8, 9, col.names[20], col = "white", cex = 3)

Modify this code into a Stroop task. (Hint: First change my.colors to use actual R colors, and then change the assignement of col.names so that it actually draws a sample of 20 color names from my.colors. Then draw a second random sample of 20 color values from my.colors and print the text labels in those colors.)

A possible solution

par(mar = c(0, 0, 0, 0), bg = "gray20")
plot(1, xlim = c(0, 10), ylim = c(0, 10), type = "n", 
     xaxt = "n", yaxt = "n", xlab = "", ylab = "", bty = "n") # empty canvas

my.colors <- c("red", "blue", "green", "yellow")
my.colors.de <- c("rot", "blau", "grün", "gelb") # for German version

N <- 4 * 5
set.seed(137) # reproducible randomness
col.names  <- sample(x = my.colors, size = N, replace = TRUE)
col.values <- sample(x = my.colors, size = N, replace = TRUE)

text(2, 1,  col.names[1], col = col.values[1], cex = 3)
text(4, 1,  col.names[2], col = col.values[2], cex = 3)
text(6, 1,  col.names[3], col = col.values[3], cex = 3)
text(8, 1,  col.names[4], col = col.values[4], cex = 3)

text(2, 3,  col.names[5], col = col.values[5], cex = 3)
text(4, 3,  col.names[6], col = col.values[6], cex = 3)
text(6, 3,  col.names[7], col = col.values[7], cex = 3)
text(8, 3,  col.names[8], col = col.values[8], cex = 3)

text(2, 5,  col.names[9],  col = col.values[9], cex = 3)
text(4, 5, col.names[10], col = col.values[10], cex = 3)
text(6, 5, col.names[11], col = col.values[11], cex = 3)
text(8, 5, col.names[12], col = col.values[12], cex = 3)

text(2, 7, col.names[13], col = col.values[13], cex = 3)
text(4, 7, col.names[14], col = col.values[14], cex = 3)
text(6, 7, col.names[15], col = col.values[15], cex = 3)
text(8, 7, col.names[16], col = col.values[16], cex = 3)

text(2, 9, col.names[17], col = col.values[17], cex = 3)
text(4, 9, col.names[18], col = col.values[18], cex = 3)
text(6, 9, col.names[19], col = col.values[19], cex = 3)
text(8, 9, col.names[20], col = col.values[20], cex = 3)


par(opar) # restore default settings

5b. One way to make some plots more informative is using different levels of a color and transparency. Define 10 different shades of gray and use them to randomly color the dots in the following scatterplot:

N <- 100
xs <- rnorm(N)      # N random x values
ys <- xs + rnorm(N) # N random y values

# (a) Black version:
plot(xs, ys, main = "(a) black in black", pch = 16, cex = 4)

(Hint: Use the gray() function to generate 10 different levels of gray and draw a sample from these shades to color each point of the plot.)

A possible solution

# Define and sample from 10 shades of gray():
shades <- 1:10/10 # 10 levels
shades.gray <- gray(level = shades, alpha = .8) # define 10 shades of slighly transparent gray()
gray.samples <- sample(x = shades.gray, N, replace = TRUE) # draw N samples

# (b) Shades of gray version:
plot(xs, ys, main = "(b) 10 shades of gray", pch = 16, cex = 4, col = gray.samples)

Checkpoint 1

At this point you completed all basic exercises. This is good, but additional practice will deepen your understanding, so please keep carrying on…

B. In Context: A JDM Study

It’s personal: The effect of personal value on utilitarian moral judgments

Abstract of Millars et al. (2016)

In this part, we will analyze data from:

  • Millar, C., Starmans, C., Fugelsang, J., & Friedman, O. (2016). It’s personal: The effect of personal value on utilitarian moral judgments. Judgment and Decision Making, 11(4), 326–331.

Here is the article’s abstract:

We investigated whether the personal importance of objects influences utilitarian decision-making in which damaging property is necessary to produce an overall positive outcome. In Experiment 1, participants judged saving five objects by destroying a sixth object to be less acceptable when the action required destroying the sixth object directly (rather than as a side-effect) and the objects were personally important (rather than unimportant). In Experiment 2, we demonstrated that utilitarian judgments were not influenced by the objects’ monetary worth. Together these findings suggest that personal importance underlies people’s sensitivity to damaging property as a means for utilitarian gains.

The full paper is available at the journal Judgment and Decision Making (in html and pdf format).

Data sets

The paper reports two studies:

  1. The data from Study 1 is available here and contains the following variables:
Variable Description
Acceptability.Score How acceptable is the action?
Important Were the objects important to the owner or not?
Direct Was the destruction of an object a means of saving others or a side-effect?
Cover.Story Was the object a poster or a clock?
Gender Participant gender
Age Participant Age
TopicCompQ Comprehension question 1
ExpensiveCompQ Comprehension question 2
ImportanceCompQ Comprehension question 3
Failed.controls. Did participant fail an attention check?
  1. The data from Study 2 is available here and contains the following variables:
Variable Description
Acceptability.Score How acceptable is the action?
Important Were the objects important to the owner or not?
Direct Was the destruction of an object a means of saving others or a side-effect?
Expensive Was the object expensive or not?
PreviousTrolley Did participants complete a trolley problem in the past?
Gender Participant gender
Age Participant Age
TopicCompQ Comprehension question 1
ExpensiveCompQ Comprehension question 2
ImportanceCompQ Comprehension question 3
Failed.controls. Did participant fail an attention check?

Data cleaning

6. A typical problem when working with real data is that the data is messy and requires some pre-processing before it can be analyzed. For our paper, the data of the two studies are stored in two .csv files:

  1. The data for Study 1 is at http://journal.sjdm.org/16/16428/expt1.csv
  2. The data for Study 2 is at http://journal.sjdm.org/16/16428/expt2.csv

6a. Load the data into R by using read.table() into new objects called study1 and study2. (Hint: You can either read the tables directly from the web, or first download the files to your data folder, and then load them. Either way, make sure to use the arguments sep = "," and header = TRUE.)

study1 <- read.table(file = "http://journal.sjdm.org/16/16428/expt1.csv", # from url
                     #file = "data/expt1.csv", # from local file
                     sep = ",", header = TRUE)

study2 <- read.table(file = "http://journal.sjdm.org/16/16428/expt2.csv", # from url
                     #file ="data/expt2.csv", # from local file
                     sep = ",", header = TRUE)

6b. Explore the structure of each data frame using str(). Do you notice something strange about the class of the Acceptability.Score column? Run a frequency table of the data with table() to see what’s going on.

table(study1$Acceptability.Score)
#> 
#>                   1 1-9 scale         2         3         4         5 
#>         2        26         1         4        48        21        67 
#>         6         7         8         9 
#>        53       107        27        38
# There is a value called "1-9 scale".

6c. The data frames contain some miscoded values. To see this, look at the last few rows of the files as follows:

tail(study1) # show last few rows
tail(study2)

6d. As you can see, there are some notes at the bottom of the files. We first have to fix this by manually going through each column and setting invalid values to NA. To do this, we’ll create a custom recoding function called recode.v that takes a vector as an argument, and returns a vector with old values recoded to new ones. Run the following code to create the recoding function recode.v:

# Just COPY, PASTE, and RUN this code:

# Create a new function called recode.v:
recode.v <- function(x,    # what vector do you want to recode?
                     old,  # what values do you want to change?
                     new,  # what should the new values be?
                     otherNA = TRUE,   # should other values be converted to NA?
                     numeric = TRUE) { # should result be numeric?
  
  x.new <- x  # copy vector to x.new
  
  if(class(x.new) == "factor") {x.new <- paste(x.new)} # remove factors
 
  for(i in 1:length(old)) { # loop through all old values:
    x.new[x == old[i]] <- new[i]
    }
 
 
  if(otherNA) { # convert unspecified values to NA:
    x.new[(x %in% old) == FALSE] <- NA
    }
 
  if(numeric) {x.new <- as.numeric(x.new)}  # convert vector to numeric values
 
  return(x.new)  # return new vector
}

Here’s how our new function recode.v() works. Run the following code and check the output:

# Just COPY, PASTE, and RUN this code:

# Recode a string vector of gender data to 0, 1:
recode.v(x = c("m", "male", "f", "female", "other", "mixed", NA),
         old = c("m", "male", "f", "female"),
         new = c(1, 1, 0, 0))
#> [1]  1  1  0  0 NA NA NA

6e. Now that we have defined recode.v, let’s apply it to all the data columns. Run the following code to clean the data:

# Just COPY, PASTE, and RUN this code:

## Study 1 cleaning: 
study1$Acceptability.Score <- recode.v(study1$Acceptability.Score, old = 1:9, new = 1:9)
study1$Important <- recode.v(study1$Important, old = c("1", "1-yes", "2", "2-no"), new = c(1, 1, 2, 2))
study1$Direct <- recode.v(study1$Direct, old = c("1", "1-yes", "2", "2-no"), new = c(1, 1, 2, 2))
study1$Cover.Story <- recode.v(study1$Cover.Story, old = c("1", "1-poster", "2", "2-clock"), new = c(1, 1, 2, 2))
study1$Gender <- recode.v(study1$Gender, old = c("1", "1-male", "2", "2-female"), new = c(1, 1, 2, 2))

## Study 2 cleaning:
names(study2)[1] <- "Acceptability.Score" # Variable name for study2 was not the same as study1!
study2$Acceptability.Score <- recode.v(study2$Acceptability.Score, old = 1:9, new = 1:9)
study2$Important <- recode.v(study2$Important, old = c("1", "1-yes", "2", "2-no"), new = c(1, 1, 2, 2))
study2$Direct <- recode.v(study2$Direct, old = c("1", "1-yes", "2", "2-no"), new = c(1, 1, 2, 2))
study2$Expensive <- recode.v(study2$Expensive, old = c("1", "1-yes", "2", "2-no"), new = c(1, 1, 2, 2))
study2$PreviousTrolley <- recode.v(study2$PreviousTrolley, 
                                   old = c("1", "1-yes", "2", "2-no"), 
                                   new = c(1, 1, 2, 2))
study2$Gender <- recode.v(study2$Gender, old = c("1", "1-male", "2", "2-female"), new = c(1, 1, 2, 2))
study2$TopicCompQ <- recode.v(study2$TopicCompQ, old = c("2", "4", "4-dolly/brakes"), new = c(2, 4, 4))
study2$ExpensiveCompQ <- recode.v(study2$ExpensiveCompQ, 
                                  old = c("1", "1-expensive", "2", "2-inexpensive"), 
                                  new = c(1, 1, 2, 2))
study2$ImportanceCompQ <- recode.v(study2$ExpensiveCompQ, c("1", "1-yes", "2", "2-no"), new = c(1, 1, 2, 2))

6f. Now examine the structure of the dataframes with str(), and the last few rows with tail(). Do they look ok now?

str(study1)
#> 'data.frame':    394 obs. of  10 variables:
#>  $ Acceptability.Score: num  3 3 4 7 4 1 1 1 1 1 ...
#>  $ Important          : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Direct             : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Cover.Story        : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Gender             : num  1 2 1 1 1 1 1 1 2 2 ...
#>  $ Age                : int  40 27 41 32 28 30 22 48 55 26 ...
#>  $ StoryCompQ         : int  4 4 4 4 4 4 4 4 4 4 ...
#>  $ Item.typeCompQ     : int  1 1 1 1 3 1 1 1 1 1 ...
#>  $ OwnerCompQ         : int  1 1 1 1 2 2 2 2 2 2 ...
#>  $ Failed.controls.   : int  1 1 1 1 1 NA NA NA NA NA ...
str(study2)
#> 'data.frame':    791 obs. of  11 variables:
#>  $ Acceptability.Score: num  3 7 7 7 7 7 7 7 9 5 ...
#>  $ Important          : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Direct             : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Expensive          : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ PreviousTrolley    : num  2 2 2 2 2 2 2 2 1 1 ...
#>  $ Gender             : num  2 2 1 2 2 2 1 1 1 2 ...
#>  $ Age                : int  25 30 30 45 23 33 30 26 28 24 ...
#>  $ TopicCompQ         : num  4 4 4 4 4 4 4 4 4 4 ...
#>  $ ExpensiveCompQ     : num  2 1 1 1 1 2 1 1 2 2 ...
#>  $ ImportanceCompQ    : num  2 1 1 1 1 2 1 1 2 2 ...
#>  $ Failed.controls.   : int  1 NA NA NA NA 1 NA NA 1 1 ...

Histograms

7a. Create a histogram of the acceptability scores from Study%nbsp;1. Add appropriate labels and colors as you see fit!

hist(study1$Acceptability.Score,
     main = "Study 1 (Millar et al., 2016)",
     xlab = "Acceptability score",
     col = unikn.pal[1])

7b. Now do the same for Study%nbsp;2 and add a dashed line at the mean of the distribution. (Hint: Use abline() or segments() to add a vertical line.)

hist(study2$Acceptability.Score,
     main = "Study 2 (Millar et al., 2016)",
     xlab = "Acceptability score",
     # xlim = c(1, 9),
     col = unikn.pal[1])

# Add vertical line at mean of the distribution:
abline(v = mean(study2$Acceptability.Score, na.rm = TRUE), 
       lty = 2, lwd = 2, col = unikn.pal[4])

Scatterplots

8a. Create a scatterplot showing the relationship between age and acceptability score in Study 1 and add appropriate labels!

# Create scatterplot:
plot(x = study1$Age,
     y = study1$Acceptability.Score,
     xlab = "Age",
     ylab = "Acceptability score",
     main = "Study 1 (Millar et al. 2016)")

8b. Now pimp the plot a bit (i.e., make it look prettier, but also more informative). Try changing the point types (e.g., pch = 16), point sizes (e.g., cex = 2), and colors (e.g., col = gray(.0, .2), or col = transparent("blue", .5))

# Create scatterplot:
plot(x = study1$Age,
     y = study1$Acceptability.Score,
     xlab = "Age",
     ylab = "Acceptability score",
     main = "Study 1 (Millar et al. 2016)",
     pch = 16,
     cex = 1.8,
     col = unikn.pal.ty[2] # OR: gray(0, .2)
     )

8c. Now add grid lines to your plot. (Hint: Just evaluate grid() after your plot.)

# Create scatterplot:
plot(x = study1$Age,
     y = study1$Acceptability.Score,
     xlab = "Age",
     ylab = "Acceptability score",
     main = "Study 1 (Millar et al. 2016)",
     pch = 16,
     cex = 1.8,
     col = unikn.pal.ty[2] # OR: gray(0, .2)
     )

grid(col = unikn.pal.ty[6]) # OR: grid()

8d. Now let’s add a regression line to the plot. Adding a regression line is easy. First, create a linear model object created with lm(). Then add the model to the plot with abline():

# Create scatterplot:
plot(x = study1$Age,
     y = study1$Acceptability.Score,
     xlab = "Age",
     ylab = "Acceptability score",
     main = "Study 1 (Millar et al. 2016)",
     pch = 16,
     cex = 1.8,
     col = unikn.pal.ty[2] # OR: gray(0, .2)
     )

grid(col = unikn.pal.ty[6]) # OR: grid()

# Create a linear regression model:
model <- lm(Acceptability.Score ~ Age, data = study1)

# Add the model to the plot:
abline(model,lwd = 2, col = "red")

8e. Now create the same scatterplot for the Study 2 data. Be sure to include the gridlines and a linear regression line.

# Create scatterplot:
plot(x = study2$Age,
     y = study2$Acceptability.Score,
     xlab = "Age",
     ylab = "Acceptability score",
     main = "Study 2 (Millar et al. 2016)",
     pch = 16,
     cex = 1.8,
     col = unikn.pal.ty[2] # OR: gray(0, .2)
     )

grid(col = unikn.pal.ty[6]) # OR: grid()

# Create a linear regression model and add it to the plot:
abline(lm(Acceptability.Score ~ Age, data = study2), lwd = 2, col = "red")

Barplots

9. The authors present the following barplot as Figure 1 (on p. 328). Figure 1 of Millars et al. (2016, p. 328)

Let’s try to represent the same data with a barplot in R!

9a. In order to create a stacked barplot with the barplot() function, we first need to create a matrix of values. Run the following code to calculate a matrix of mean acceptability scores as a function of Important and Direct with aggregate() and cbind():

# Just COPY, PASTE, and RUN this code:

# Compute group means of study1:
s1.as.means <- aggregate(Acceptability.Score ~ Important + Direct,
                         FUN = mean, 
                         data = study1)

# Create a matrix of group means:
s1.as.means.mtx <- cbind(s1.as.means[1:2, 3], s1.as.means[3:4, 3])
colnames(s1.as.means.mtx) <- c("Means", "Side effect")
rownames(s1.as.means.mtx) <- c("Important", "Unimportant")

9b. Now create a barplot from the Study 1 data by entering the correct arguments in the following code:

# fix this code by replacing ZZZ with the correct arguments:
barplot(height = ZZZ,
        beside = ZZZ, 
        legend.text = ZZZ, 
        ylim = c(ZZZ, ZZZ),
        ylab = "ZZZ",
        main = "ZZZ")

A possible solution

# corrected code:
barplot(height = s1.as.means.mtx,
        beside = TRUE, 
        legend.text = TRUE, 
        ylim = c(0, 9),
        ylab = "Acceptability",
        main = "Study 1",
        col = c(unikn.pal[4], unikn.pal[8])
        )

9c. Now, repeat the process to get a barplot for Study 2:


# Get group means and store as a matrix:
s2.as.means <- aggregate(Acceptability.Score ~ Important + Direct,
                         FUN = mean, 
                         data = study2)
s2.as.means.mtx <- cbind(s2.as.means[1:2, 3], s2.as.means[3:4, 3])
colnames(s2.as.means.mtx) <- c("Means", "Side effect")
rownames(s2.as.means.mtx) <- c("Important", "Unimportant")

barplot(height = s2.as.means.mtx,
        beside = TRUE, 
        legend.text = TRUE, 
        ylim = c(0, 9),
        ylab = "Acceptability",
        main = "Study 2",
        col = c(unikn.pal[4], unikn.pal[8])
        )

Pirateplot

10a. Create a pirateplot of the same data from Study 1. Add appropriate labels and feel free to use your favorite theme (1, 2, or 3), and color palette from the piratepal() function. (Hint: Run ?pirateplot for the help menu and piratepal("all") to see all color palettes.)

pirateplot(formula = Acceptability.Score ~ Important + Direct, 
           data = study1,
           pal = "basel",
           theme = 1,
           main = "Study 1",
           gl.col = unikn.pal[4],
           back.col = unikn.pal[9]
           )

10b. Now do the same for Study 2. Again, add appropriate labels (and feel free to use your favorite palette or theme).

pirateplot(formula = Acceptability.Score ~ Important + Direct, 
           data = study2,
           pal = "espresso",
           theme = 2,
           main = "Study 2",
           gl.col = unikn.pal[6],
           back.col = unikn.pal[1]
           )

Checkpoint 2:

If you got this far you’re doing great, but don’t give up just yet…

Histogram with 2 variables

11. Create the following overlapping histogram showing the distribution of acceptability scores between the two studies:

hist(study1$Acceptability.Score, 
     col = unikn.pal.ty[3],
     border = "white", 
     xlab = "Acceptability Score", 
     main = "Acceptability Score by Study",
     probability = FALSE)

hist(study2$Acceptability.Score, 
     col = unikn.pal.ty[7],
     border = "white", add = TRUE, probability = FALSE)

legend(x = 1, y = 100, 
       legend = c("Study 1", "Study 2"), 
       col = c(unikn.pal.ty[3], unikn.pal.ty[7]), 
       pch = c(15, 15), bty = "n")

Scatterplot with reference lines

12a. Add a new column to study2$income that shows income as a function of Age:

# Just COPY, PASTE, and RUN this code:

# Add a new column to Study 1 called Income: 
study1$Income <- study1$Age + rnorm(n = nrow(study1), mean = 10, sd = 15)

# Add a new column to Study 2 called Income:
study2$Income <- 40 + rnorm(n = nrow(study2), mean = 10, sd = 15)

12b. With these new variables, create a scatterplot that shows Income as a function of Age for both studies (by using the following template and adding pch values, color and/or transparency):

# Create a blank plot with labels:
plot(x = 1, 
     y = 1,
     xlim = c(18, 80),
     ylim = c(0, 100),
     xlab = "Age", 
     ylab = "Income",
     main = "Income by age",
     type = "n")

# Study 1:
points(x = study1$Age, 
       y = study1$Income
       )

abline(lm(Income ~ Age, data = study1), 
       lty = 2, lwd = 2)

# Study 2:
points(x = study2$Age, 
       study2$Income
       )

abline(lm(Income ~ Age, data = study2), 
       lty = 2, lwd = 2)

# Add legend:
legend(x = 72,
       y = 72,
       legend = c("Study 1", "Study 2")
       )

# Add gridlines:
grid()

A possible solution

# Create a blank plot with labels:
plot(x = 1, 
     y = 1,
     xlim = c(18, 80),
     ylim = c(0, 100),
     xlab = "Age", 
     ylab = "Income",
     main = "Income by age",
     type = "n")

# Study 1:
points(x = study1$Age, 
       y = study1$Income, 
       pch = 16, 
       col = transparent("red2", .5))

abline(lm(Income ~ Age, data = study1), 
       col = "red2", lty = 2, lwd = 2)

# Study 2:
points(x = study2$Age, 
       study2$Income, 
       pch = 16,
       col = unikn.pal.ty[3])

abline(lm(Income ~ Age, data = study2), 
       col = unikn.pal.ty[4], lty = 2, lwd = 2)

# Add gridlines:
grid(col = unikn.pal.ty[6])

# Add legend (after gridlines):
legend(#"bottomright",
       x = 74,
       y = 67,
       legend = c("Study 1", "Study 2"), 
       pch = 16, 
       col = c(transparent("red2", .5), unikn.pal.ty[3]),
       bty = "o"
       )

Checkpoint 3:

If you got this far you’re doing an amazing job – well done! Enjoy the following challenges…

C. Challenges

Pick a plot

13. For this exercise, you decide which type of plot you think best represents the data. The movies dataframe in the yarrr package contains data about the top 5,000 grossing movies of all time. (Hint: You can learn more about the data using the help menu ?movies.)

13a. Create a plot that shows the relationship between a movie’s release year and its running time. Customise the plot to make it look nice. Then add the names of the two longest movies (as text() labels).

plot(x = movies$year,
     y = movies$time,
     xlim = c(1935, 2020),
     ylim = c(50, 250),
     pch = 21,
     cex = 1, 
     col = unikn.pal.ty[7], # gray(0, .1),
     bg = unikn.pal.ty[2],
     col.axis = unikn.pal.ty[6],
     col.lab = unikn.pal.ty[6],
     xlab = "Year",
     ylab = "Time (in minutes)",
     main = "Movie times by year",
     col.main = unikn.pal.ty[6],
     xaxt = "n" # Turn off x axis
     )

# Add x axis:
axis(side = 1, at = seq(1940, 2020, by = 5), col.axis = unikn.pal.ty[6])

## Add linear regression line:
# model <- lm(time ~ year, data = movies)
# abline(model, lty = 3, col = gray(0, .5), lwd = 2)

grid(col = unikn.pal.ty[7])

# Determine name of longest movie:
# ix <- which(movies$time == max(movies$time, na.rm = TRUE))
# lm.name <- movies[ix, ]$name
# text(x = movies$year[ix]+4.5, y = movies$time[ix], labels = lm.name)

# Add labels for the two longest movies:
movies.by.length <- movies[order(movies$time, decreasing = TRUE), ] # sort movies by decreasing length
text(x = movies.by.length$year[1]+5, y = movies.by.length$time[1], 
     labels = movies.by.length$name[1], col = unikn.pal.ty[5], cex = .9) # longest movie
text(x = movies.by.length$year[2]+10, y = movies.by.length$time[2], 
     labels = movies.by.length$name[2], col = unikn.pal.ty[5], cex = .9) # 2nd longest movie

13b. Create a plot that shows the relationship between a movie’s budget and its revenue. Customise it and make it look nice!

plot(x = movies$budget,
     y = movies$revenue.all,
     pch = 16,
     cex = 1,
     col = piratepal("basel"),
     xlab = "Budget (in millions)",
     ylab = "Revenue (in millions)",
     main = "Movie revenue by budget"
     )

# Add linear regression line:
model <- lm(revenue.all ~ budget, data = movies)
abline(model, lty = 3, col = unikn.pal.ty[4], lwd = 2)

grid(col = unikn.pal.ty[7])

13c. Create a plot that shows the relationship between genre and time. Customise it and make it look nice! (Hint: You may notice that many of the times are equal to 0, try creating the plot after excluding these values using subset.)

pirateplot(formula = time ~ genre,
           data = subset(movies, 
                         time > 0 &    # Many movies had a time of 0
                        (genre %in% c("Reality", "Multiple Genres", "Concert/Performance")) == FALSE),  # These genres had almost no data
           main = "Movie running times by genre",
           xlab = "Genre",
           ylab = "Running Time (in minutes)"
           )


rgb(19, 72, 206, maxColorValue = 255)
#> [1] "#1348CE"

Color palette power

14. Let’s define our own color palette. The corporate design regulations of the University of Konstanz include a color scheme. On page 19 of the Corporate Design Manual, the Seeblau colors are defined as follows:

Seeblau color definitions at uni.kn

Define your own Seeblau color palette in R and use it to modify the following code (adapted from YaRrr, Ch. 10, Figure 60):

google.colors <- c(rgb(19, 72, 206, maxColorValue = 255), 
                   rgb(206, 45, 35, maxColorValue = 255), 
                   rgb(253, 172, 10, maxColorValue = 255), 
                   rgb(18, 140, 70, maxColorValue = 255))
par(mar = rep(0, 4))
plot(1, xlim = c(0, 7), ylim = c(0, 1), 
     xlab = "", ylab = "", xaxt = "n", yaxt = "n", type = "n", bty = "n")

points(1:6, rep(.5, 6), 
       pch = c(15, 16, 16, 17, 18, 15),
       col = google.colors[c(1, 2, 3, 1, 4, 2)], cex = 8)

text(3.5, .7, "Do you recognize these colors?", cex = 2)

A possible solution

# Define color palette:
seeblau.colors <- c(rgb(204, 238, 249, maxColorValue = 255), # seeblau.1
                    rgb(166, 225, 244, maxColorValue = 255), # seeblau.2 
                    rgb(89, 199, 235, maxColorValue = 255),  # seeblau.3
                    rgb(0, 169, 224, maxColorValue = 255),   # seeblau.4 
                    rgb(0, 0, 0, maxColorValue = 255),       #  5. black
                    gray(level = 0, alpha = .6),             #  6. gray 60% transparent
                    gray(level = 0, alpha = .4),             #  7. gray 40% transparent
                    gray(level = 0, alpha = .2),             #  8. gray 20% transparent
                    gray(level = 0, alpha = .1),             #  9. gray 10% transparent
                    rgb(255, 255, 255, maxColorValue = 255)  # 10. white
                    )

# Plot colors:
par(mar = rep(0, 4))
plot(1, xlim = c(0, 11), ylim = c(0, 3),
     xlab = "", ylab = "", xaxt = "n", yaxt = "n", type = "n", bty = "n")

# Using colors defined here: 
points(1:10, rep(1.6, 10),
       pch = c(17, 16, 15, 18, 24, 22, 23, 21, 22, 25),
       col = seeblau.colors[c(1, 2, 3, 4, 5, 5, 5, 5, 5, 5)],
       bg = seeblau.colors[c(NA, NA, NA, NA, 5, 6, 7, 8, 9, 10)], 
       cex = 6
       )

# Using unikn.pal palette (with transparency):
points(1:10, rep(1.2, 10),
      pch = c(17, 16, 15, 18, 24, 22, 23, 21, 22, 25),
      col = unikn.pal.ty[c(1, 2, 3, 4, 5, 5, 5, 5, 5, 5)],
      bg =  unikn.pal.ty[c(NA, NA, NA, NA, 5, 6, 7, 8, 9, 10)],
      cex = 6
      )

x.tx <- 5.5 # x of (middle of) text labels 
text(x.tx, 2.3, "Seeblau", cex = 2.5, col = seeblau.colors[4])
text(x.tx, 2.0, "der Universität Konstanz", cex = 2)

# Add a caption (to illustrate paste() function):
caption <- paste("Figure X: The", length(seeblau.colors), "Seeblau colors", 
                 "\nof the University of Konstanz.", 
                 "\n(defined in two different ways).", sep = " ")
text(x.tx, .8, caption, cex = 1, col = seeblau.colors[5])
segments(x0 = 2, y0 = 1, x1 = 9, y1 = 1, lty = 2, lw = 1.5, col = seeblau.colors[6])

Plot your own data

15. (Re-)create or improve a plot. Go to one of the following websites (or a similar one) and select a graph or statistic:

Now create a new and/or better graphical representation of those numbers, including a suitable graph title, labels, and legend. (Hint: Choose a plot that allows you to obtain the actual numbers for multiple categories and make sure to cite and include a link to the source of your data.)

A possible solution

Let’s re-create the following graphic, depicting causes of “unnatural deaths” in Germany from 1980 to 2014 in a bar chart — and contrasting it with the number of deaths due to “falling from ladders”:

# Data: 
causes <- c("Stromunfälle im Haushalt", "Feuerwerk", "Reitunfälle", "Vergiftung durch Nahrung", 
            "Kontakt mit giftigen Tiere und Pflanzen", "Angriffe durch nichtgiftige Tiere", 
            "Wetter und Naturkatastrophen", "Stürze von Leitern")
deaths.1 <- c(1110, 19, 691, 181, 708, 955, 391, 0)
deaths.2 <- c(0, 0, 0, 0, 0, 0, 0, 4939)
un.deaths <- data.frame(causes, deaths.1, deaths.2)
#names(un.deaths) <- c("Cause of death", "Unnatural deaths", "Ladder accidents")

# Colors:
color.vec <- c(yarrr::piratepal("info2")[1:7], unikn.pal.ty[4])

# As bar plot: 
barplot(height = cbind(un.deaths$deaths.1, un.deaths$deaths.2), 
        ylim = c(0, 6000),
        xlab = "Unnatural deaths vs. ladder accidents",
        ylab = "Number of deaths (in total)", 
        main = "Unnatural deaths in Germany from 1980 to 2014",
        col = color.vec #,
        # legend = un.deaths$causes
        )

# Add legend:
legend("topleft", 
       legend = un.deaths$causes,
       pch = 15, 
       col = color.vec, 
       cex = 1.0, 
       bty = "o")

From bar to pie

Pie charts are rarely more informative than bar charts, but can easily be done in R as well. To emphasize that the number of ladder accidents exceeds those by all other causes of “unnatural” deaths, the following chart states the percentages of all causes (relative to those considered here):

all.causes <- causes
all.deaths <- c(1110, 19, 691, 181, 708, 955, 391, 4939)
pct <- round(all.deaths/sum(all.deaths) * 100) # compute percentages
pcts <- paste(pct, "%", sep = "") # as percentage labels (with "%")
lbls <- paste(all.causes, "\n", pcts, sep = " ") # combine causes with percentage labels 

pie(all.deaths,
    labels = lbls, 
    main = "Unnatural deaths in Germany from 1980 to 2014", 
    col = color.vec)

Data sources for this plot:

  • Article on Spiegel Online, Nov. 21, 2016
  • Destatis, Statistisches Bundesamt

That’s it – hope you enjoyed working on this assignment!


[WPA05_answers.Rmd updated on 2017-11-30 19:00:57 by hn.]


  1. This code could be written more efficiently, but since we have not yet covered loops in R we’re using 20 text() commands instead.