Answers for WPA00 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.

Instructions

This is only a demo document that acts as an example for future WPAs. If you want, you can use this as a practice for the following weeks (WPA01 to WPA11). To complete and submit these exercises, please do the following:

Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats).
- A simple .Rmd template is provided here.
- Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.
Also enter the current assignment (e.g., WPA00), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appopriate comments.

Here is an example how your file JillsomeJack_WPA00_171023.Rmd could look:

# Assignment: WPA 00
# Name: Jane Jackson
# Date: 2017 October 23
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# Exercise 1: 

# Adding numbers: 
1 + 2

# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# Exercise 2: 

# Draw 100 samples from a standard normal distribution: 
x <- rnorm(100)

# Conduct a t-test on the sample:
t.test(x)

# etc. ...

Complete as many exercises as you can.
Submit your script or output file (including all code) to the appropriate folder on Ilias before the deadline (Wednesday, 23:59).

Jump In

Copy and paste each of the following code chunks into your editor and save the file as a .R or .Rmd document. It’s probably best to evaluate this code line-by-line — and try to roughly understand what’s going on. (Please don’t worry if you don’t understand something. You’ll learn the details over the next couple of weeks…)

Creating and evaluating objects

1. Let’s see how we interact with R by creating some simple objects and applying basic functions to them:

1a. R can be used to create (e.g., numeric) objects and evaluate them, as with any regular calculator:

a <- 1 # assigns "1" to an object a
b <- 2 # assigns "2" to an object b
a + b  # applies "+" to a and b, and prints the result
#> [1] 3
sum(a, b) # applies the function sum() to a and b, and prints the result
#> [1] 3

a <- 100 # to change an object, it must be re-assigned
a + b
#> [1] 102
sum(a, b)
#> [1] 102

# Note that evaluating  
# A + B  
# would yield an Error, as R is case-sensitive!

1b. Creating some vectors:

x <- 1:10 # creates a sequence of numbers (integers from 1 to 10) and assigns it to a vector x
y <- 10:1 
x + y # applies "+" to each element of the vectors x and y, and prints the result
#>  [1] 11 11 11 11 11 11 11 11 11 11

z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # a generic way to create a vector
w <- c("a", "vector", "of", "characters")
x == z # applies "==" to each element of the vectors x and z
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

x[3] # returns the 3rd element of x
#> [1] 3
y[3] # ...
#> [1] 8
w[3:4] # ...
#> [1] "of"         "characters"

x > 5 # applies "+" to each element of the vectors x and y, and prints the result 
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
x[x > 5] # ...
#> [1]  6  7  8  9 10

1c. Basic sampling:

sample(c(0, 1), size = 1) # randomly draws 1 sample from c(0, 1)
#> [1] 0

coin = c("heads", "tails") # defines the outcomes of a coin
# sample(coin, size = 10) # tries to randomly draws 10 samples from our coin (flip it 10 times), but ...
sample(coin, size = 10, replace = TRUE) # ... works!
#>  [1] "tails" "heads" "heads" "heads" "heads" "tails" "heads" "tails"
#>  [9] "heads" "heads"

# Randomly assigning students to groups:
n.students <- 16
n.groups <- 4
groups <- 1:n.groups
sample(groups, size = n.students, replace = TRUE)
#>  [1] 2 2 4 2 1 4 2 1 1 3 2 3 1 3 4 1

Installing and loading packages

R is not just a single program, but an entire universe of code from a community of developers (with all the benefits and costs of such diversity). Packages allow to import and use R code of other people.

2. Let’s install and load the yarrr package (by Nathaniel Phillips) that contains many datasets (like pirates) and functions (like pirateplot) which we’ll use throughout this course.

2a. Installing a package:

install.packages("yarrr") # installs a package

2b. Loading a (previously installed) package:

library("yarrr")          # loads a package

Exploring a dataset

3. The pirates dataset included in the yarrr package contains data from a survey of 1,000 pirates.

3a. Get basic information on the pirates dataset:

?pirates

3b. How many rows and columns are in this dataset?

nrow(pirates) # number of rows / cases
ncol(pirates) # number of columns / variables

3c. View the first few rows of the pirates dataset:

head(pirates)
#>   id    sex age height weight headband college tattoos tchests parrots
#> 1  1   male  28 173.11   70.5      yes   JSSFP       9       0       0
#> 2  2   male  31 209.25  105.6      yes   JSSFP       9      11       0
#> 3  3   male  26 169.95   77.1      yes    CCCC      10      10       1
#> 4  4 female  31 144.29   58.5       no   JSSFP       2       0       2
#>   favorite.pirate sword.type eyepatch sword.time beard.length
#> 1    Jack Sparrow    cutlass        1       0.58           16
#> 2    Jack Sparrow    cutlass        0       1.11           21
#> 3    Jack Sparrow    cutlass        1       1.44           19
#> 4    Jack Sparrow   scimitar        1      36.11            2
#>             fav.pixar grogg
#> 1      Monsters, Inc.    11
#> 2              WALL-E     9
#> 3          Inside Out     7
#> 4          Inside Out     9
#>  [ reached getOption("max.print") -- omitted 2 rows ]

3d. Show the structure of the pirates dataset:

str(pirates)
#> 'data.frame':    1000 obs. of  17 variables:
#>  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ sex            : chr  "male" "male" "male" "female" ...
#>  $ age            : num  28 31 26 31 41 26 31 31 28 30 ...
#>  $ height         : num  173 209 170 144 158 ...
#>  $ weight         : num  70.5 105.6 77.1 58.5 58.4 ...
#>  $ headband       : chr  "yes" "yes" "yes" "no" ...
#>  $ college        : chr  "JSSFP" "JSSFP" "CCCC" "JSSFP" ...
#>  $ tattoos        : num  9 9 10 2 9 7 9 5 12 12 ...
#>  $ tchests        : num  0 11 10 0 6 19 1 13 37 69 ...
#>  $ parrots        : num  0 0 1 2 4 0 7 7 2 4 ...
#>  $ favorite.pirate: chr  "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" "Jack Sparrow" ...
#>  $ sword.type     : chr  "cutlass" "cutlass" "cutlass" "scimitar" ...
#>  $ eyepatch       : num  1 0 1 1 1 1 0 1 0 1 ...
#>  $ sword.time     : num  0.58 1.11 1.44 36.11 0.11 ...
#>  $ beard.length   : num  16 21 19 2 0 17 1 1 1 25 ...
#>  $ fav.pixar      : chr  "Monsters, Inc." "WALL-E" "Inside Out" "Inside Out" ...
#>  $ grogg          : num  11 9 7 9 14 7 9 12 16 9 ...

3e. Show the entire dataset in a new window:

View(pirates)

Basic descriptives for numeric vectors

4. To obtain basic descriptives for numeric data, you can apply in-built R functions to numeric vectors.

4a. What is the mean age? Apply the function mean() to the vector pirates$age:

mean(pirates$age)
#> [1] 27.36

4b. What is the tallest pirate? Apply the function max() to the vector pirates$height:

max(pirates$height)
#> [1] 209.25

4c. What was the mean and median weight of the pirates?

Basic descriptives for non-numeric data

5. Non-numeric data are typically summarized in frequency tables:

5a. How many pirates are there of each sex?

table(pirates$sex)
#> 
#> female   male  other 
#>    464    490     46

5b. How many pirates are there of each age?

table(pirates$age)
#> 
#> 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 
#>  2  3  4  5 12  9 15 15 26 30 31 48 42 58 59 73 82 70 61 61 58 51 43 37 31 
#> 36 37 38 39 40 41 43 44 45 46 
#> 23 10 10 15  5  5  2  1  2  1

6. To collapse cases over 2 variables, you can use the aggregate() function:

6a. What is the mean age for each sex?

aggregate(formula = age ~ sex, 
          data = pirates,
          FUN = mean)
#>      sex      age
#> 1 female 29.92241
#> 2   male 24.96735
#> 3  other 27.00000

6b. What is the mean beard length (beard.length) for each sex?

aggregate(formula = beard.length ~ sex, 
          data = pirates,
          FUN = mean)
#>      sex beard.length
#> 1 female    0.3987069
#> 2   male   19.4163265
#> 3  other   14.8913043

6c. How many pirates are wearing a headband? What is the median age of pirates for each combination of sex and headband?

table(pirates$headband)
#> 
#>  no yes 
#> 113 887
aggregate(formula = age ~ sex + headband, 
          data = pirates,
          FUN = median)
#>      sex headband age
#> 1 female       no  31
#> 2   male       no  25
#> 3  other       no  25
#> 4 female      yes  30
#> 5   male      yes  25
#> 6  other      yes  27

Basic Plots

Let’s explore some basic plotting commands!

Histograms

7. Basic histograms show some variable’s distribution of values:

7a. What is the distribution of pirate ages?

hist(x = pirates$age)

7b. What is the distribution of (the number of) pirate tattoos?

7c. To get more fancy histograms you can set and customize many parameters:

ymax <- 200

hist(x = pirates$age,
     breaks = 15, 
     main = "Distribution of pirate ages",
     col = "skyblue",
     border = "white",
     xlab = "Age",
     ylim = c(0, ymax))

# Add the mean as a text label: 
text(x = mean(pirates$age), y = (ymax - 10), 
     labels = paste("Mean = ", round(mean(pirates$age), 1)))

# Add a vertical dashed line at the mean: 
segments(x0 = mean(pirates$age), y0 = 0, 
         x1 = mean(pirates$age), y1 = (ymax - 20), 
         col = gray(.2, .5),
         lwd = 3, 
         lty = 2)

7d. Combining multiple histograms:

## 2 overlapping histograms of pirate ages for females and males:

# (a) Start with the female data:
hist(x = pirates$age[pirates$sex == "female"],
     main = "Distribution of pirate ages by sex",
     col = transparent("orange3", .2),
     border = "white",
     xlab = "Age", 
     breaks = seq(0, 50, 2),
     probability = TRUE,
     ylab = "", 
     yaxt = "n")

# (b) add male data:
hist(x = pirates$age[pirates$sex == "male"],
     add = TRUE, 
     probability = TRUE, 
     border = "white",
     breaks = seq(0, 50, 2),
     col = transparent("steelblue3", .5))

# (c) add a legend: 
legend(x = 40, 
       y = .05,
       col = c("orange3", "steelblue3"),
       legend = c("female", "male"),
       pch = 16,
       bty = "n")

Scatterplots

8. Scatterplots show relations between 2 numeric variables:

8a. Basic scatterplot of height and weight of pirates:

## 6A: A simple scatterplot of pirate height and weight
plot(x = pirates$height,
     y = pirates$weight,
     xlab = "Height (cm)",
     ylab = "Weight (kg)")

8b. A fancier scatterplot of the same data with some additional arguments:

# Create main plot: 
plot(x = pirates$height, 
     y = pirates$weight,
     main = 'My first scatterplot of pirate data!',
     xlab = 'Height (in cm)',
     ylab = 'Weight (in kg)',
     pch = 16,    # filled circles
     col = gray(0, .1)) # transparent gray
     
# Add gridlines:
grid()

# Create a linear regression model:
model <- lm(formula = weight ~ height, 
            data = pirates)

# Add regression line to the plot:
abline(model,
       col = 'blue', lty = 2)

Color palettes

9. To obtain prettier colors, the yarrr package offers some pre-designed color palettes:

9a. Look at all the available palettes from piratepal():

piratepal()

9b. Look at some specific palette in more detail:

piratepal(palette = "google", plot.result = TRUE)

9c. Look at some other palettes in more detail:

piratepal(palette = "appletv", plot.result = TRUE)
piratepal(palette = "espresso", plot.result = TRUE)

9d. Using the pony palette in a fancy scatterplot of pirate height and weight:

my.cols <- piratepal(palette = "pony", 
                     trans = .2, 
                     length.out = nrow(pirates))

# Create the plot:
plot(x = pirates$height, y = pirates$weight,
     main = "Random scatterplot with My Little Pony Colors",
     xlab = "Pony height",
     ylab = "Pony weight",
     pch = 21,  # Round symbols with borders
     cex = 2,  # magnifying factor of plot text and symbols
     col = "white",  # white border
     bg = my.cols,   # random colors
     bty = "n"       # no plot border
     )

# Add gridlines:
grid()

Barplots

Barplots allow comparisons between categories of a variable.

10a. Calculate the mean height of people for each favorite.pirate:

# Calculate mean height for each favorite.pirate
pirate.heights <- aggregate(height ~ favorite.pirate,
                     data = pirates,
                     FUN = mean)

barplot(pirate.heights$height, 
        main = "Barplot of mean height by favorite pirate",
        names.arg = pirate.heights$favorite.pirate)

10b. The same barplot, but with additional customizations:

barplot(pirate.heights$height, 
        ylim = c(0, 200),
        ylab = "Pirate Height (in cm)",
        main = "Barplot of mean height by favorite pirate",
        names.arg = pirate.heights$favorite.pirate, 
        col = piratepal("basel", trans = .2))

abline(h = seq(0, 200, 25), lty = 3, lwd = c(1, .5))

Pirateplots

11.a A so-called pirateplot shows the raw values, means and distributions of a numeric variable (like height) by the levels of some categorical variables (like favorite.pirate):

pirateplot(formula = height ~ favorite.pirate,
           data = pirates,
           main = "Pirateplot of height by favorite pirate")

11b. Pirateplot of height by sex and eyepatch:

pirateplot(formula = height ~ sex + eyepatch,
           data = pirates,
           main = "Pirateplot of height by favorite pirate")

Statistical tests

As R started out as a statistical programming language, it is not surprising that it can do statistics as well…

Two sample hypothesis tests

12. A t-test compares two means.

For instance, do pirates with eyepatches have shorter or longer beards (beard.length) than those without eyepatches?

t.test(formula = beard.length ~ eyepatch, 
       data = pirates,
       alternative = 'two.sided')
#> 
#>  Welch Two Sample t-test
#> 
#> data:  beard.length by eyepatch
#> t = 1.4404, df = 674.27, p-value = 0.1502
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.3625525  2.3593174
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        11.04094        10.04255

13 A correlation test evaluates the relation between 2 numeric variables.

13a. For instance, is there a correlation between a pirate’s age and the number of parrots (s)he has?

cor.test(formula = ~ age + parrots,
         data = pirates)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  age and parrots
#> t = 6.1311, df = 998, p-value = 1.255e-09
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.1300655 0.2495678
#> sample estimates:
#>       cor 
#> 0.1905224

13b. Is there a correlation between a pirate’s weight and tattoos?

cor.test(formula = ~ weight + tattoos,
         data = pirates)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  weight and tattoos
#> t = -0.60608, df = 998, p-value = 0.5446
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.08107829  0.04286244
#> sample estimates:
#>         cor 
#> -0.01918162

ANOVA

An ANOVA compares the means of variables with 2 or more levels.

14a. ANOVA on beard.length as a function of sex and college education:

# a) run the ANOVA:
beard.aov <- aov(formula = beard.length ~ sex + college, 
                   data = pirates)

# b) print summary results:
summary(beard.aov)
#>              Df Sum Sq Mean Sq  F value Pr(>F)    
#> sex           2  87174   43587 2276.901 <2e-16 ***
#> college       1     12      12    0.641  0.424    
#> Residuals   996  19067      19                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14b. Post-hoc tests on the previous ANOVA:

TukeyHSD(beard.aov)
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = beard.length ~ sex + college, data = pirates)
#> 
#> $sex
#>                   diff       lwr       upr p adj
#> male-female  19.017620 18.352383 19.682856     0
#> other-female 14.492597 12.905125 16.080070     0
#> other-male   -4.525022 -6.108691 -2.941353     0
#> 
#> $college
#>                 diff       lwr      upr     p adj
#> JSSFP-CCCC 0.2204085 -0.351934 0.792751 0.4500087

Regression

A regression analysis predicts some numeric variable as a function of other variables.

15. Regression analysis showing if age, weight, and the number of tattoos predict how many treasure chests (tchests) a pirate has found:

# a) run the regression:
chests.lm <- lm(formula = tchests ~ age + weight + tattoos, 
                data = pirates)

# b) print summary results:
summary(chests.lm)
#> 
#> Call:
#> lm(formula = tchests ~ age + weight + tattoos, data = pirates)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -33.302 -15.832  -6.860   8.407 119.966 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  5.19084    7.18437   0.723     0.47    
#> age          0.78177    0.13438   5.818 8.03e-09 ***
#> weight      -0.09013    0.07183  -1.255     0.21    
#> tattoos      0.25398    0.22550   1.126     0.26    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 23.99 on 996 degrees of freedom
#> Multiple R-squared:  0.04056,    Adjusted R-squared:  0.03767 
#> F-statistic: 14.04 on 3 and 996 DF,  p-value: 5.751e-09

That’s it – hope you enjoyed your first glimpse into the R universe and are now eager to learn more about the details!

[WPA00_answers.Rmd updated on 2017-10-27 16:06:53 by hn.]

WPA00: Seeing what R can do (Answers)

Hansjörg Neth, SPDS, uni.kn

2017 October 26