
WPA06 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.


To complete and submit these exercises, please remember and do the following:

  1. Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).

    • A simple .Rmd template is provided here.

    • (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)

  2. Also enter the current assignment (e.g., WPA06), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.

  3. Complete as many exercises as you can by Wednesday (23:59).

  4. Submit your script or output file (including all code) to the appropriate folder on Ilias.


A. In Class

Here are some warm-up exercises that review important skills from previous chapters and practice the basic concepts of the current chapter:

Preparations

0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA06_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. You need the latest version of the yarrr package (v0.1.2) in this WPA. Install the package from CRAN with install.packages() if you haven’t already done so.

0d. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document.)

library(rmarkdown)
library(knitr)
library(yarrr)

0e. Save the original graphic settings into an object opar. (Hint: This allows you to restore them later by evaluating par(opar). Use the options message = FALSE and warning = FALSE for this code chunk.)

opar <- par() # saves original (default) par settings
par(opar)  # restores original (default) par settings
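A note on the snippet above: par() also returns read-only parameters, so par(opar) emits warnings when it tries to restore them (hence the warning = FALSE hint). A minimal sketch of an alternative, assuming you prefer to avoid the warnings altogether, uses the no.readonly argument of par(). The mfrow and mar values are arbitrary illustrations:

```r
# Save only those graphical parameters that can be set again later:
opar <- par(no.readonly = TRUE)

# Change some settings (illustrative values only):
par(mfrow = c(1, 2), mar = c(4, 4, 1, 1))
# ... plotting code would go here ...

# Restore the saved settings:
par(opar)
```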

0f. Make sure that you can create an .html output-file by “knitting” your current document.

Testing statistical hypotheses

In the following exercises, you will test simple statistical hypotheses about differences or relationships between one or two samples. The type of test depends on the number of samples, as well as the measurement characteristics and distributions of the data.

Reporting test results

Please report your answers to all hypothesis test questions in proper American Pirate Association (APA) style.

  • Reporting test parameters:
    • chi-square test: \(\chi^{2}\)(df) = XXX, \(p\) = YYY
    • t-test: \(t\)(df) = XXX, \(p\) = YYY
    • correlation test: \(r\) = XXX, \(t\)(df) = YYY, \(p\) = ZZZZ
  • If a \(p\)-value is less than \(\alpha = .01\), write only \(p < .01^{*}\).
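The reporting rules above can be automated. The following is only a sketch: apa_t() is a hypothetical helper (it is not part of base R or the yarrr package), and rounding to two decimals is an assumption about the desired precision.

```r
# Hypothetical helper (not part of base R or yarrr):
# formats the result of t.test() following the reporting rules above.
apa_t <- function(test, alpha = .01) {
  stat <- sprintf("t(%.2f) = %.2f", test$parameter, test$statistic)
  p <- if (test$p.value < alpha) {
    paste0("p < ", sub("^0", "", format(alpha)), "*")  # e.g., "p < .01*"
  } else {
    sprintf("p = %.2f", test$p.value)
  }
  paste(stat, p, sep = ", ")
}

# Usage on the built-in sleep data (not one of the exercises):
apa_t(t.test(extra ~ group, data = sleep))
```

The same pattern could be adapted for correlation and chi-square tests by changing the label of the test statistic.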

Example

Here is a question, a statistical test, and an answer in the appropriate (APA) format:

Question: Do pirates with headbands have different numbers of tattoos than those not wearing headbands?

We can find out whether this is the case by using a 2-sample \(t\)-test on the pirates data set:

library(yarrr)  # for the pirates data set
t.test(tattoos ~ headband, data = pirates)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  tattoos by headband
#> t = -19.313, df = 146.73, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -5.878101 -4.786803
#> sample estimates:
#>  mean in group no mean in group yes 
#>          4.699115         10.031567

Answer: Pirates with headbands have significantly more tattoos on average than those who do not wear headbands (10.0 vs. 4.7, respectively): \(t(146.73) = -19.31\), \(p < .01^{*}\).

T-tests

1. The Nile data set available in R lists 100 measurements of the annual flow of the river Nile (in \(10^{8} m^{3}\)), at Aswan, from 1871 to 1970. Inspect this data set first.

str(Nile)
head(Nile)
length(Nile)
# ?Nile

1a. Does the annual flow differ from 1000 (\(10^{8} m^{3}\)) when considering the entire measurement period (1871–1970)? (Hint: Comparing the mean of an interval-scaled variable to a reference value calls for a 1-sample \(t\)-test.)

#> 
#>  One Sample t-test
#> 
#> data:  Nile
#> t = -4.7658, df = 99, p-value = 6.461e-06
#> alternative hypothesis: true mean is not equal to 1000
#> 95 percent confidence interval:
#>  885.7716 952.9284
#> sample estimates:
#> mean of x 
#>    919.35

Answer:

1b. Someone claims that (a) the flow before the year 1898 is higher than 1000 (\(10^{8} m^{3}\)) and (b) the flow from the year 1898 onwards is lower than 1000 (\(10^{8} m^{3}\)). Test both of these claims (with two 1-sample \(t\)-tests on parts of the data). (Hint: Determine the appropriate ranges of the Nile vector to conduct 1-sample \(t\)-tests on them.)
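A sketch of how ranges of a time series can be selected and tested, shown on the built-in LakeHuron series (annual lake levels in feet, 1875 to 1972) rather than on the Nile data. The cut-off year 1900 and the reference value 579 are arbitrary assumptions for illustration. Because the claims are directional, one option is a one-sided test via the alternative argument:

```r
# Select ranges of a time series by year with window():
early <- window(LakeHuron, end = 1899)    # 1875 to 1899
late  <- window(LakeHuron, start = 1900)  # 1900 to 1972

# One-sided 1-sample t-tests against an (arbitrary) reference value:
t.test(early, mu = 579, alternative = "greater")  # H1: mean level > 579
t.test(late,  mu = 579, alternative = "less")     # H1: mean level < 579
```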

Answer:

1c. Directly test whether the average annual flow before 1898 and from 1898 onwards differed systematically. (Hint: This requires a 2-sample \(t\)-test on the Nile ranges from above.)

Answer:

1d. Plot the annual Nile flow data, drawing a vertical line at the year 1898, and horizontal lines for the averages before 1898 and from the year 1898 onwards.
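A minimal sketch of such a plot, again on the built-in LakeHuron series instead of the Nile data. The cut-off year, labels, and colors are illustrative assumptions:

```r
cut_year <- 1900  # arbitrary cut-off for this illustration
early <- window(LakeHuron, end = cut_year - 1)
late  <- window(LakeHuron, start = cut_year)

# Plot the series and mark the cut-off year with a vertical line:
plot(LakeHuron, xlab = "Year", ylab = "Level (ft)", main = "Lake Huron levels")
abline(v = cut_year, lty = 2)

# Horizontal segments at the mean of each period:
segments(x0 = start(early)[1], y0 = mean(early), x1 = cut_year,
         col = "steelblue", lwd = 2)
segments(x0 = cut_year, y0 = mean(late), x1 = end(late)[1],
         col = "red4", lwd = 2)
```

Using segments() rather than abline(h = ...) keeps each mean line within its own period.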

Correlation tests

2. The mtcars data set available in R records fuel consumption (in mpg) and 10 further variables of 32 cars (from the 1974 issue of Motor Trend magazine). We’ll use this data to test some hypotheses about the correlation \(r\) between two numeric variables.

str(mtcars)
# head(mtcars)
# ?mtcars
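The exercises below refer to two notations for cor.test(). As a sketch on a different built-in data set (women, with columns height and weight), both notations describe the same test:

```r
# Variable (vector) notation: pass the two columns directly
r_vec <- cor.test(x = women$height, y = women$weight)

# Formula notation: a one-sided formula plus a data argument
r_frm <- cor.test(formula = ~ height + weight, data = women)

# Both notations yield the same correlation estimate:
all.equal(unname(r_vec$estimate), unname(r_frm$estimate))
```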

2a. Is there a systematic relationship between a car’s fuel consumption (mpg) and its gross horsepower (hp)? (Hint: Use a correlation test with the variable notation on columns of mtcars to test this relationship.)

Answer:

2b. Is there a systematic relationship between a car’s weight (wt) and its displacement (disp)? (Hint: Use a correlation test with the formula notation to test this relationship.)

Answer:

2c. Does the systematic relationship between a car’s fuel consumption (mpg) and its gross horsepower (hp) still hold when only considering cars with automatic transmission (am == 0) and more than four cylinders (cyl > 4)?

Answer:

2d. Plot the relationship of Exercise 2a. and add a linear regression line and a label stating the value of the correlation.
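One possible way to build such a plot, sketched on the built-in cars data (speed and stopping distance) rather than on mtcars. The label position and colors are arbitrary choices:

```r
r <- cor(cars$speed, cars$dist)  # correlation to display

# Scatterplot of the two variables:
plot(x = cars$speed, y = cars$dist, pch = 21, bg = "orange",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Add the linear regression line:
abline(lm(dist ~ speed, data = cars), col = "steelblue", lwd = 2)

# Add a label stating the value of the correlation:
text(x = 7, y = 100, labels = paste0("r = ", round(r, 2)))
```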

Chi-square tests

3. The mtcars data set can also be used to test some hypotheses about the equal distribution of (the frequency of) instances in categories (with \(\chi^{2}\)-tests).
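As a sketch of both kinds of \(\chi^{2}\)-test on a different built-in data set (HairEyeColor, a table of hair color, eye color, and sex of 592 students), a 1-dimensional table tests for equal frequencies, whereas a 2-dimensional table tests for independence:

```r
# Collapse the 3-dimensional table into the tables needed:
hair     <- margin.table(HairEyeColor, margin = 1)        # frequencies per hair color
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))  # hair x eye contingency table

# 1-sample test: are all hair colors equally frequent?
chi_one <- chisq.test(hair)

# 2-sample test: are hair color and eye color independent?
chi_two <- chisq.test(hair_eye)

chi_one
chi_two
```

With raw data in a data frame, the corresponding tables would come from table() on one or two columns.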

3a. What are the frequencies of cars with automatic vs. manual transmission (am)? What are the frequencies of cars with different numbers of cylinders (cyl)? Are both of these categories distributed equally or unequally? (Hints: Obtain two table()s and conduct two corresponding 1-sample \(\chi^{2}\)-tests to find out.)

Answer:

Answer:

3b. Do the frequencies of cars with automatic vs. manual transmission (am) vary as a function of the number of cylinders (cyl)? (Hints: Obtain a 2x3 table() and conduct a 2-sample \(\chi^{2}\)-test to find out.)

Answer:

Checkpoint 1

At this point you have completed all basic exercises. This is good, but practice will deepen your understanding, so please carry on…


B. At Home

A Student Survey

In this part, we will analyze data from a fictional survey of 100 students.

The data are located in a tab-delimited text file at http://Rpository.com/down/data/WPA06.txt.

Data description

The data file has 100 rows and 12 columns. The columns contain variables that are defined as follows:

student survey

  1. sex (string): A string indicating the sex of the participant (“m” = male, “f” = female).
  2. age (integer): An integer indicating the age of the participant.
  3. major (string): A string indicating the participant’s major.
  4. haircolor (string): The participant’s hair color.
  5. iq (integer): The participant’s score on an IQ test.
  6. country (string): The participant’s country of origin.
  7. logic (numeric): Amount of time it took for a participant to complete a logic problem (smaller is better).
  8. siblings (integer): The participant’s number of siblings.
  9. multitasking (integer): The participant’s score on a multitasking task (higher is better).
  10. partners (integer): The participant’s number of sexual partners (so far).
  11. marijuana (binary): Has the participant ever tried marijuana? (“0” = no, “1” = yes).
  12. risk (binary): Would the participant play a gamble with a 50% chance of losing EUR 20 and a 50% chance of earning EUR 20? (“0” means the participant would not play the gamble, “1” means he or she would agree to play).

Data loading and preparation

4a. The data file is available online at http://Rpository.com/down/data/WPA06.txt. Load this data into R as a new object called wpa6.df. (Hint: Use read.table() and note that the file contains a header row and is tab-delimited.)

# Read data from online source into a data frame wpa6.df:
wpa6.df <- read.table(file = "http://Rpository.com/down/data/WPA06.txt",
                      header = TRUE,        # there is a header row
                      sep = "\t")           # data are tab-delimited

4b. Save the data as a comma-delimited text file (entitled wpa6.txt) into your local data folder of your project’s working directory. (Hint: Use write.table() with appropriate arguments.)
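A sketch of the general pattern, using a small stand-in data frame and a temporary file so that it runs anywhere. The file name and location are hypothetical (adapt them to your own data folder), and row.names = FALSE is an assumption about the desired output:

```r
dat  <- head(mtcars)                         # small stand-in data frame
path <- file.path(tempdir(), "example.txt")  # hypothetical location

# Write a comma-delimited text file without row names:
write.table(dat, file = path, sep = ",", row.names = FALSE)

# Read the file back to verify that it was written correctly:
check <- read.table(path, header = TRUE, sep = ",")
dim(check)
```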

4c. Use the head(), str(), and View() functions to inspect the dataset and make sure that it was loaded correctly. If the data don’t look correct (i.e., if there isn’t a header row and 100 rows and 12 columns) you probably didn’t load them correctly.

T-test(s)

5. Let’s first test some simple hypotheses about the means of groups or populations.

5a. The average IQ in the general population is 100. Do the participants of our survey have an IQ different from the general population? (Hint: Answer this question with a one-sample t-test.)

Answer:

5b. A friend of your mother’s hairdresser’s cousin claims that students have 2.5 siblings on average. Test this claim on the data set with a one-sample t-test.

Answer:

5c. Do students that have smoked marijuana have different IQ levels than those who have never smoked marijuana? (Hint: Test this hypothesis with a two-sample t-test. You can either use the vector or the formula notation for a t-test.)
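The two notations mentioned in the hint can be sketched on the built-in sleep data (columns extra and group), which is not part of this survey. Both calls run the same Welch two-sample t-test:

```r
# Formula notation: outcome ~ grouping variable
t_frm <- t.test(extra ~ group, data = sleep)

# Vector notation: one vector of values per group
t_vec <- t.test(x = sleep$extra[sleep$group == "1"],
                y = sleep$extra[sleep$group == "2"])

# Both notations yield the same test statistic:
all.equal(unname(t_frm$statistic), unname(t_vec$statistic))
```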

Answer:

Correlation test(s)

6. Does some numeric variable vary as a function of another one?

6a. Do students with higher multitasking skills tend to have more romantic partners than those with lower multitasking skills? (Hint: Test this with a correlation test.)

Answer:

6b. Do people with higher IQs perform better (i.e., faster) on the logic test? (Hint: Answer this question with a correlation test.)

Answer:

Chi-square test(s)

7. Are the instances of categorical variables equally or unequally distributed?

7a. Are some majors more popular than others? (Hint: Answer this question with a one-sample chi-square test, but also check the frequencies.)

Answer:

7b. In general, were the students in this survey more likely to take a risk than not to take it? (Hint: Answer this question with a one-sample chi-square test, but also check the frequencies.)

Answer:

7c. Is there a relationship between hair color and students’ academic major? (Hint: Answer this with a two-sample chi-square test, but also check the frequencies.)

Answer:

You pick the test!

8. In the following exercises it’s up to you to select an appropriate test.

8a. Is there a relationship between whether a student has ever smoked marijuana and his/her decision to accept or reject the risky gamble?

Answer:

8b. Do males and females have different numbers of sexual partners on average?

Answer:

8c. Do males and females differ in how likely they are to have smoked marijuana?

Answer:

8d. Do people who have smoked marijuana have different logic scores on average than those who never have smoked marijuana?

Answer:

8e. Do people with higher IQ scores tend to perform better on the logic test than those with lower IQ scores?

Answer:

Checkpoint 2

If you got this far you’re doing great, but don’t give up just yet…

More complicated tests

9. The following exercises typically require you to first select an appropriate subset of the data.
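A sketch of three common ways to restrict a test to a subset, shown on the built-in iris data rather than on the survey. The formula methods of t.test() and cor.test() accept a subset argument. chisq.test() does not, so the data are subsetted before tabulating. The cut-off 5.8 is an arbitrary value for illustration:

```r
# t-test on two of the three species via the subset argument:
t_sub <- t.test(Sepal.Length ~ Species, data = iris,
                subset = Species %in% c("setosa", "versicolor"))

# Correlation test within a single species:
r_sub <- cor.test(~ Sepal.Length + Sepal.Width, data = iris,
                  subset = Species == "setosa")

# Chi-square test: subset the data first, then tabulate and test
long_sepals <- subset(iris, Sepal.Length > 5.8)
chi_sub <- chisq.test(table(long_sepals$Species))
```

Conditions can be combined with & (and) and | (or) when a subset depends on several variables.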

9a. Are Swiss students more likely than not to have tried marijuana? (Hint: Use a one-sample chi-square test on a subset of the data, but also check the frequencies.)

Answer:

9b. Does the relationship between IQ scores and multitasking performance differ between males and females? (Hint: Test this by conducting two separate tests: one for males and one for females. Do your conclusions differ?)

Answer:

9c. Do the IQ scores of people with brown hair differ from those with blonde hair? (Hint: This is a two-sample t-test that requires you to use the subset argument to tell R which two groups you want to compare.)

Answer:

9d. Only for men from Germany, is there a relationship between age and IQ?

Answer:

9e. Considering only people who chose the risky gamble, do people that have smoked marijuana have more sexual partners than those who have never smoked marijuana?

Answer:

9f. Considering only people who chose the risky gamble and have never tried marijuana, is there a relationship between IQ scores and performance on the logic test?

Answer:

Checkpoint 3

If you got this far you’re doing an amazing job — well done! Enjoy the following challenge…


C. Challenges

Anscombe’s quartet

10. In the next few exercises, we’ll explore Anscombe’s famous data quartet. This dataset illustrates the dangers of interpreting statistical tests (like a correlation test) without first plotting the data!

10a. Run the following code to create the anscombe.df dataframe. This dataframe contains 4 datasets (x1 and y1, x2 and y2, x3 and y3 and x4 and y4):

# Just COPY, PASTE, and RUN this code:
anscombe.df <- data.frame(x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 4.68),
                          x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
                          x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
                          x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                          y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

10b. Calculate the four correlations between \(x1\) and \(y1\), \(x2\) and \(y2\), \(x3\) and \(y3\), and between \(x4\) and \(y4\) (separately, i.e., what is the correlation between \(x1\) and \(y1\)? Next, what is the correlation between \(x2\) and \(y2\)?, etc.). What can you report about the correlation values for each test?
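One way to avoid repeating the same call four times is to loop over the column pairs. This sketch uses R's built-in anscombe data frame, which has the same column names (x1 to x4, y1 to y4) and should contain the same values as anscombe.df. The names set1 to set4 are made up for illustration:

```r
# Compute the correlation for each of the four (x, y) pairs:
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r) <- paste0("set", 1:4)

round(r, 3)
```

Replacing cor() with cor.test() inside the function would return the full test results instead of only the estimates.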

Answer:

10c. Now run the following code to generate a scatterplot of each data pair. What do you find?

# Just COPY, PASTE, and RUN this code:

# Plot the famous Anscombe quartet:

par(mfrow = c(2, 2)) # create a 2 x 2 plotting grid

for (i in 1:4) {   # Loop over the 4 datasets:
 
  # Assign x and y for current value of i
  if (i == 1) {x <- anscombe.df$x1
               y <- anscombe.df$y1} 
  
  if (i == 2) {x <- anscombe.df$x2
               y <- anscombe.df$y2} 
  
  if (i == 3) {x <- anscombe.df$x3
               y <- anscombe.df$y3} 
  
  if (i == 4) {x <- anscombe.df$x4
               y <- anscombe.df$y4} 

  # Create corresponding plot:
  plot(x = x, y = y, pch = 21, 
       main = paste("Anscombe", i, sep = " "), 
       bg = "orange", col = "red4", 
       xlim = c(0, 20), ylim = c(0, 15)
       )

  # Add regression line:
  abline(lm(y ~ x, data = data.frame(y, x)),
         col = "steelblue", lty = 3, lwd = 2)

  # Add correlation test text: 
  text(x = 3, y = 12,
       labels = paste0("cor = ", round(cor(x, y), 2)),
       col = "steelblue")
  
  }

par(mfrow = c(1, 1)) # reset plotting grid

Answer: Despite their similarities (in test results), the four data sets are very different.

Conclusion: What you can see in the four scatterplots is Anscombe’s famous quartet, a dataset designed to show how important it is to always plot your data before running a statistical test on them. You can learn more about this at the corresponding Wikipedia page.

Also, see this or this blog post on The Datasaurus Dozen for similar examples that illustrate the importance of visualizing data.

Submission

That’s it – now it’s time to submit your assignment!

Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.


[WPA06.Rmd updated on 2017-12-04 16:17:48 by hn.]