
WPA00 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


Instructions

This is only a demo document that serves as an example for future WPAs. If you want, you can use it as practice for the following weeks (WPA01 to WPA11). To complete and submit these exercises, please do the following:

  1. Your WPAs can be written and submitted either as scripts of commented code (as .R or .Rmd files) or as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.

  2. Enter the current assignment (e.g., WPA00), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.

Here is an example of how your file JillsomeJack_WPA00_171023.Rmd could look:

# Assignment: WPA 00
# Name: Jane Jackson
# Date: 2017 October 23
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# Exercise 1: 

# Adding numbers: 
1 + 2

# ~~~~~~~~~~~~~~~~~~~~~~~~~~
# Exercise 2: 

# Draw 100 samples from a standard normal distribution: 
x <- rnorm(100)

# Conduct a t-test on the sample:
t.test(x)

# etc. ...
  3. Complete as many exercises as you can.

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias before the deadline (Wednesday, 23:59).


Jump In

Copy and paste each of the following code chunks into your editor (and save the file as a .R or .Rmd document):

Creating and evaluating objects

1. Let’s see how we interact with R by creating some simple objects and applying basic functions to them:

1a. R can be used to create (e.g., numeric) objects and evaluate them, as with any regular calculator:

a <- 1 # assigns "1" to an object a
b <- 2 # assigns "2" to an object b
a + b  # applies "+" to a and b, and prints the result
sum(a, b) # applies the function sum() to a and b, and prints the result

a <- 100 # to change an object, it must be re-assigned
a + b
sum(a, b)

# Note that evaluating  
# A + B  
# would yield an Error, as R is case-sensitive!

1b. Creating some vectors:

x <- 1:10 # creates a sequence of numbers (integers from 1 to 10) and assigns it to a vector x
y <- 10:1 
x + y # applies "+" to each element of the vectors x and y, and prints the result

z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # a generic way to create a vector
w <- c("a", "vector", "of", "characters")
x == z # applies "==" to each element of the vectors x and z

x[3] # returns the 3rd element of x
y[3] # ...
w[3:4] # ...

x > 5 # applies ">" to each element of the vector x, and prints the result (a vector of TRUE/FALSE values)
x[x > 5] # ...
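
Because a comparison like x > 5 returns a vector of TRUE/FALSE values, its result can also be counted or turned into positions. A minimal sketch using only base R (the object names big and pos are illustrative and not part of the original exercise):

```r
x <- 1:10

big <- x > 5       # logical vector: TRUE where the condition holds
sum(big)           # counts the TRUE values (TRUE counts as 1, FALSE as 0)
pos <- which(big)  # positions of the TRUE values
x[pos]             # selects the same elements as x[x > 5]
```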

1c. Basic sampling:

sample(c(0, 1), size = 1) # randomly draws 1 sample from c(0, 1)

coin <- c("heads", "tails") # defines the outcomes of a coin
# sample(coin, size = 10) # tries to randomly draw 10 samples from our coin (flip it 10 times), but fails: without replacement, we cannot draw more samples than there are outcomes
sample(coin, size = 10, replace = TRUE) # ... sampling with replacement works!

# Randomly assigning students to groups:
n.students <- 16
n.groups <- 4
groups <- 1:n.groups
sample(groups, size = n.students, replace = TRUE)
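
Note that sampling with replacement, as above, does not guarantee equally sized groups. If equal sizes are wanted, one possible approach (a sketch, not part of the original exercise) is to repeat each group label equally often and then shuffle the labels. The seed is arbitrary and only makes the random result reproducible; group.labels and assignment are illustrative names:

```r
n.students <- 16
n.groups <- 4

set.seed(123)  # arbitrary seed, only for reproducibility

# Repeat each group label 4 times (16 students / 4 groups):
group.labels <- rep(1:n.groups, each = n.students / n.groups)

# Shuffle the labels (sampling without replacement):
assignment <- sample(group.labels)

table(assignment)  # each group now occurs equally often
```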

Installing and loading packages

R is not just a single program, but an entire universe of code from a community of developers (with all the benefits and costs of such diversity). Packages allow you to import and use other people's R code.

2. Let’s install and load the yarrr package (by Nathaniel Phillips) that contains many datasets (like pirates) and functions (like pirateplot) which we’ll use throughout this course.

2a. Installing a package:

install.packages("yarrr") # installs a package

2b. Loading a (previously installed) package:

library("yarrr")          # loads a package
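
As a package only needs to be installed once, the installation step can be skipped when the package is already available. A minimal sketch using base R's requireNamespace() (the helper name is.installed is hypothetical; the install line is commented out, so nothing is downloaded when this code is run):

```r
# Hypothetical helper: TRUE if a package can be loaded, FALSE otherwise
is.installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is.installed("stats")  # checks a package that ships with R

# Install yarrr only if it is missing:
# if (!is.installed("yarrr")) install.packages("yarrr")
```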

Exploring a dataset

3. The pirates dataset included in the yarrr package contains data from a survey of 1,000 pirates.

3a. Get basic information on the pirates dataset:

?pirates

3b. How many rows and columns are in this dataset?

nrow(pirates) # number of rows / cases
ncol(pirates) # number of columns / variables

3c. View the first few rows of the pirates dataset:

head(pirates)

3d. Show the structure of the pirates dataset:

str(pirates)

3e. Show the entire dataset in a new window:

View(pirates)

Basic descriptives for numeric vectors

4. To obtain basic descriptives for numeric data, you can apply built-in R functions to numeric vectors.

4a. What is the mean age? Apply the function mean() to the vector pirates$age:

mean(pirates$age)

4b. How tall is the tallest pirate? Apply the function max() to the vector pirates$height:

4c. What are the mean and median weights of the pirates?
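
Other summary functions follow the same pattern as mean() in 4a. A sketch on a small made-up vector (the values are invented for illustration and are not taken from the pirates data):

```r
heights <- c(160, 172, 181, 168, 175)  # invented heights (in cm)

mean(heights)       # arithmetic mean
median(heights)     # middle value of the sorted vector
max(heights)        # largest value
which.max(heights)  # position of the largest value
```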

Basic descriptives for non-numeric data

5. Non-numeric data are typically summarized in frequency tables:

5a. How many pirates are there of each sex?

table(pirates$sex)

5b. How many pirates are there of each age?

6. To collapse cases over 2 variables, you can use the aggregate() function:

6a. What is the mean age for each sex?

aggregate(age ~ sex,  # formula is passed unnamed (its argument name changed in R 4.2.0)
          data = pirates,
          FUN = mean)

6b. What is the mean beard length (beard.length) for each sex?

6c. How many pirates are wearing a headband? What is the median age of pirates for each combination of sex and headband?

table(pirates$headband)
aggregate(age ~ sex + headband,  # formula is passed unnamed (its argument name changed in R 4.2.0)
          data = pirates,
          FUN = median)
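
To see what aggregate() returns, it helps to try it on a data frame that is small enough to check by hand. A sketch with invented values (toy and toy.means are illustrative names and not part of the course data):

```r
toy <- data.frame(sex = c("female", "female", "male", "male", "male"),
                  age = c(20, 30, 25, 35, 45))  # invented values

# Mean age for each level of sex:
toy.means <- aggregate(age ~ sex,
                       data = toy,
                       FUN = mean)

toy.means  # a data frame with one row per level of sex
```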

Basic Plots

Let’s explore some basic plotting commands!

Histograms

7. Basic histograms show some variable’s distribution of values:

7a. What is the distribution of pirate ages?

hist(x = pirates$age)

7b. What is the distribution of (the number of) pirate tattoos?

hist(x = pirates$tattoos)

7c. To get fancier histograms, you can set and customize many parameters:

ymax <- 200

hist(x = pirates$age,
     breaks = 15, 
     main = "Distribution of pirate ages",
     col = "skyblue",
     border = "white",
     xlab = "Age",
     ylim = c(0, ymax))

# Add the mean as a text label: 
text(x = mean(pirates$age), y = (ymax - 10), 
     labels = paste("Mean =", round(mean(pirates$age), 1))) # paste() already inserts a space between its arguments

# Add a vertical dashed line at the mean: 
segments(x0 = mean(pirates$age), y0 = 0, 
         x1 = mean(pirates$age), y1 = (ymax - 20), 
         col = gray(.2, .5),
         lwd = 3, 
         lty = 2)
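
hist() also returns the bin boundaries and counts that it computed. These can be inspected without drawing anything by setting plot = FALSE. A sketch on invented ages (not taken from the pirates data):

```r
ages <- c(21, 23, 24, 28, 31, 33, 35, 38, 42)  # invented values

h <- hist(x = ages,
          breaks = seq(20, 45, 5),
          plot = FALSE)  # compute the histogram, but do not draw it

h$breaks  # bin boundaries
h$counts  # number of values per bin
h$mids    # bin midpoints
```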

7d. Combining multiple histograms:

## 2 overlapping histograms of pirate ages for females and males:

# (a) Start with the female data:
hist(x = pirates$age[pirates$sex == "female"],
     main = "Distribution of pirate ages by sex",
     col = transparent("orange3", .2),
     border = "white",
     xlab = "Age", 
     breaks = seq(0, 50, 2),
     probability = TRUE,
     ylab = "", 
     yaxt = "n")

# (b) add male data:
hist(x = pirates$age[pirates$sex == "male"],
     add = TRUE, 
     probability = TRUE, 
     border = "white",
     breaks = seq(0, 50, 2),
     col = transparent("steelblue3", .5))

# (c) add a legend: 
legend(x = 40, 
       y = .05,
       col = c("orange3", "steelblue3"),
       legend = c("female", "male"),
       pch = 16,
       bty = "n")

Scatterplots

8. Scatterplots show relations between 2 numeric variables:

8a. Basic scatterplot of height and weight of pirates:

## 8a: A simple scatterplot of pirate height and weight
plot(x = pirates$height,
     y = pirates$weight,
     xlab = "Height (cm)",
     ylab = "Weight (kg)")

8b. A fancier scatterplot of the same data with some additional arguments:

# Create main plot: 
plot(x = pirates$height, 
     y = pirates$weight,
     main = 'My first scatterplot of pirate data!',
     xlab = 'Height (in cm)',
     ylab = 'Weight (in kg)',
     pch = 16,    # filled circles
     col = gray(0, .1)) # transparent gray
     
# Add gridlines:
grid()

# Create a linear regression model:
model <- lm(formula = weight ~ height, 
            data = pirates)

# Add regression line to the plot:
abline(model,
       col = 'blue', lty = 2)
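
The fitted model object can also be inspected directly: coef() returns the intercept and slope of the line that abline() draws. A sketch on invented data, constructed to be exactly linear so that the result is easy to verify (toy and toy.model are illustrative names):

```r
height <- c(150, 160, 170, 180, 190)  # invented heights (in cm)
weight <- 0.5 * height - 10           # exactly linear, for illustration only
toy <- data.frame(height, weight)

toy.model <- lm(formula = weight ~ height,
                data = toy)

coef(toy.model)  # intercept and slope of the fitted line
```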

Color palettes

9. To obtain prettier colors, the yarrr package offers some pre-designed color palettes:

9a. Look at all the available palettes from piratepal():

piratepal()

9b. Look at some specific palette in more detail:

piratepal(palette = "google", plot.result = TRUE)

9c. Look at some other palettes in more detail…

9d. Using the pony palette in a fancy scatterplot of pirate height and weight:

my.cols <- piratepal(palette = "pony", 
                     trans = .2, 
                     length.out = nrow(pirates))

# Create the plot:
plot(x = pirates$height, y = pirates$weight,
     main = "Random scatterplot with My Little Pony Colors",
     xlab = "Pony height",
     ylab = "Pony weight",
     pch = 21,  # Round symbols with borders
     cex = 2,  # magnifying factor of plot text and symbols
     col = "white",  # white border
     bg = my.cols,   # pony palette colors (repeated to cover all points)
     bty = "n"       # no plot border
     )

# Add gridlines:
grid()

Barplots

10. Barplots allow comparisons between categories of a variable:

10a. Calculate the mean height for each favorite.pirate and show the means in a barplot:

pirate.heights <- aggregate(height ~ favorite.pirate,
                     data = pirates,
                     FUN = mean)

barplot(pirate.heights$height, 
        main = "Barplot of mean height by favorite pirate",
        names.arg = pirate.heights$favorite.pirate)

10b. The same barplot, but with additional customizations:

barplot(pirate.heights$height, 
        ylim = c(0, 200),
        ylab = "Pirate Height (in cm)",
        main = "Barplot of mean height by favorite pirate",
        names.arg = pirate.heights$favorite.pirate, 
        col = piratepal("basel", trans = .2))

abline(h = seq(0, 200, 25), lty = 3, lwd = c(1, .5))

Pirateplots

11a. A so-called pirateplot shows the raw values, means, and distributions of a numeric variable (like height) by the levels of one or more categorical variables (like favorite.pirate):

pirateplot(formula = height ~ favorite.pirate,
           data = pirates,
           main = "Pirateplot of height by favorite pirate")

11b. Create a pirateplot of height by sex and eyepatch.

Statistical tests

As R started out as a statistical programming language, it is not surprising that it can do statistics as well…

Two sample hypothesis tests

12. A t-test compares two means.

For instance, do pirates with eyepatches have shorter or longer beards (beard.length) than those without eyepatches?

t.test(formula = beard.length ~ eyepatch, 
       data = pirates,
       alternative = 'two.sided')
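
The result of t.test() is a list whose elements (such as the group means and the p-value) can be accessed by name. A sketch on invented data (toy and toy.test are illustrative names; the values are not taken from the pirates data):

```r
toy <- data.frame(eyepatch = c(0, 0, 0, 0, 1, 1, 1, 1),
                  beard.length = c(10, 12, 11, 13, 15, 17, 16, 18))

toy.test <- t.test(beard.length ~ eyepatch,
                   data = toy)

toy.test$estimate  # the two group means
toy.test$p.value   # p-value of the two-sided test
```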

13. A correlation test evaluates the relation between 2 numeric variables.

13a. For instance, is there a correlation between a pirate’s age and the number of parrots (s)he has?

cor.test(formula = ~ age + parrots,
         data = pirates)

13b. Is there a correlation between a pirate’s weight and tattoos?
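
As with t.test(), the result of cor.test() can be stored and its elements accessed by name. A sketch on invented values that are small enough to check by hand (toy and toy.cor are illustrative names):

```r
toy <- data.frame(age = 1:5,
                  parrots = c(2, 4, 5, 4, 5))  # invented values

toy.cor <- cor.test(~ age + parrots,
                    data = toy)

toy.cor$estimate  # Pearson correlation coefficient
toy.cor$p.value   # p-value of the test
```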

ANOVA

An ANOVA compares the means of a numeric variable across the levels of one or more categorical variables (each with 2 or more levels).

14a. ANOVA on beard.length as a function of sex and college education:

# a) run the ANOVA:
beard.aov <- aov(formula = beard.length ~ sex + college, 
                   data = pirates)

# b) print summary results:
summary(beard.aov)

14b. Post-hoc tests on the previous ANOVA:

TukeyHSD(beard.aov)

Regression

A regression analysis predicts some numeric variable as a function of other variables.

15. Regression analysis showing if age, weight, and the number of tattoos predict how many treasure chests (tchests) a pirate has found:

# a) run the regression:
chests.lm <- lm(formula = tchests ~ age + weight + tattoos, 
                data = pirates)

# b) print summary results:
summary(chests.lm)
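
A fitted regression model can also be used to predict the outcome for new cases with predict(). A sketch on invented data with a single predictor, constructed so that tchests equals age minus 10 (toy, toy.lm, and pred are illustrative names):

```r
toy <- data.frame(age = c(20, 30, 40, 50),
                  tchests = c(10, 20, 30, 40))  # invented: tchests = age - 10

toy.lm <- lm(formula = tchests ~ age,
             data = toy)

# Predicted number of treasure chests for a (hypothetical) 35-year-old:
pred <- predict(toy.lm, newdata = data.frame(age = 35))
pred
```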

That’s it for today – hope you enjoyed your first glimpse into the R universe and are now eager to learn more about the details!


[WPA00_answers.Rmd updated on 2017-10-23 11:11:50 by hn.]