Answers for WPA06 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats). A simple .Rmd template is provided here. (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)
Also enter the current assignment (e.g., WPA06), your name, and the current date at the top of your document. When working on a task, always indicate which task you are answering with appropriate comments.
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review some important skills from previous chapters and practice the basic concepts of the current chapter:
Preparations
0. The following steps prepare the current session by opening an R project, creating a new .Rmd file, and compiling it into an .html output file:
0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).
0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA06_yymmdd.Rmd (with an appropriate header) in your project directory.
0c. You need the latest version of the yarrr package (v0.1.2) in this WPA. Install the package from CRAN with install.packages() if you haven’t already done so.
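For example, the following one-liner installs the package from CRAN (run it once in the console, rather than in a chunk that is evaluated on every knit):
install.packages("yarrr") # only needed once per machine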
0d. Insert a code chunk and load the rmarkdown, knitr, and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document.)
library(rmarkdown)
library(knitr)
library(yarrr)
0e. Save the original graphic settings into an object opar. (Hint: This allows you to restore them later by evaluating par(opar). Use the options message = FALSE and warning = FALSE for this code chunk.)
opar <- par() # saves original (default) par settings
par(opar) # restores original (default) par settings
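For instance, a chunk header along the following lines names the chunk and sets both options (the chunk name settings is just an illustrative choice):
```{r settings, message = FALSE, warning = FALSE}
opar <- par() # saves original (default) par settings
```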
0f. Make sure that you can create an .html output file by “knitting” your current document.
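Knitting is easiest via the Knit button in RStudio, but you can also render from the R console; a sketch (assuming the file name from 0b):
rmarkdown::render("LastFirst_WPA06_yymmdd.Rmd") # creates LastFirst_WPA06_yymmdd.html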
Testing statistical hypotheses
In the following exercises, you will test simple statistical hypotheses about differences or relationships between one or two samples. The type of test depends on the number of samples, as well as the measurement characteristics and distributions of the data.
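As a rough orientation (a schematic sketch that uses the mtcars data purely for illustration, and no substitute for checking test assumptions), these tests map onto data situations as follows:
# One interval-scaled sample vs. a fixed reference value: 1-sample t-test
t.test(mtcars$mpg, mu = 20)
# Two samples (a numeric variable split by a binary factor): 2-sample t-test
t.test(mpg ~ am, data = mtcars)
# Two numeric variables: correlation test
cor.test(~ mpg + hp, data = mtcars)
# Frequency counts in categories: chi-square test
chisq.test(table(mtcars$cyl))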
Reporting test results
Please report your answers to all hypothesis test questions in proper American Pirate Association (APA) style.
- Reporting test parameters:
- chi-square test: \(\chi^{2}\)(df) = XXX, \(p\) = YYY
- t-test: \(t\)(df) = XXX, \(p\) = YYY
- correlation test: \(r\) = XXX, \(t\)(df) = YYY, \(p\) = ZZZ
- If a \(p\)-value is less than \(\alpha = .01\), write only \(p < .01^{*}\).
Example
Here is a question, a statistical test, and an answer in the appropriate (APA) format:
Question: Do pirates with headbands have different numbers of tattoos than those not wearing headbands?
We can find out whether this is the case by using a 2-sample \(t\)-test on the pirates data set:
library(yarrr) # for the pirates data set
t.test(tattoos ~ headband, data = pirates)
#>
#> Welch Two Sample t-test
#>
#> data: tattoos by headband
#> t = -19.313, df = 146.73, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -5.878101 -4.786803
#> sample estimates:
#> mean in group no mean in group yes
#> 4.699115 10.031567
Answer: Pirates with headbands have significantly more tattoos on average than those who do not wear headbands (10.0 vs. 4.7, respectively): \(t(146.73) = -19.31\), \(p < .01^{*}\).
T-tests
1. The Nile data set available in R lists 100 measurements of the annual flow of the river Nile (in \(10^{8} m^{3}\)), at Aswan, from 1871 to 1970. Inspect this data set first.
str(Nile)
#> Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
# head(Nile)
# ?Nile
1a. Does the annual flow differ from 1000 (\(10^{8} m^{3}\)) when considering the entire measurement period (1871–1970)? (Hint: Comparing the mean of an interval-scaled variable to a reference value calls for a 1-sample \(t\)-test.)
# t-test 1a:
t.test(x = Nile,  # all data (1871 - 1970)
       mu = 1000) # Null hypothesis: Mean is 1000
#>
#> One Sample t-test
#>
#> data: Nile
#> t = -4.7658, df = 99, p-value = 6.461e-06
#> alternative hypothesis: true mean is not equal to 1000
#> 95 percent confidence interval:
#> 885.7716 952.9284
#> sample estimates:
#> mean of x
#> 919.35
Answer: Yes, the average annual flow of 919.4 is significantly lower than 1000 (\(10^{8} m^{3}\)): \(t(99) = -4.77\), \(p < .01^{*}\).
1b. Someone claims that (a) the flow before the year 1898 is higher than 1000 (\(10^{8} m^{3}\)) and (b) the flow from the year 1898 onwards is lower than 1000 (\(10^{8} m^{3}\)). Test both of these claims (with two 1-sample \(t\)-tests on parts of the data). (Hint: Determine the appropriate ranges of the Nile vector to conduct 1-sample \(t\)-tests on them.)
# At which position is the year 1898 in this vector?
years <- 1871:1970 # make a vector of the range of years

# (a) Using numerical indexing:
my.year <- 1898
ix.yr <- which(years == my.year) # at which position in years is 1898?
data.pre.change <- Nile[1:(ix.yr - 1)]
data.post.change <- Nile[ix.yr:length(Nile)]

# (b) Using logical indexing:
data.pre.change <- Nile[years < 1898]   # use Booleans to indicate the desired range
data.post.change <- Nile[years >= 1898]
# t-test 1b: Using data before 1898:
test.1 <- t.test(x = data.pre.change, # use only data before 1898
                 mu = 1000)           # Null hypothesis: Mean is 1000
test.1
#>
#> One Sample t-test
#>
#> data: data.pre.change
#> t = 3.689, df = 26, p-value = 0.001046
#> alternative hypothesis: true mean is not equal to 1000
#> 95 percent confidence interval:
#> 1043.247 1152.086
#> sample estimates:
#> mean of x
#> 1097.667
# t-test 1b: Using data from 1898 onwards:
test.2 <- t.test(x = data.post.change, # use only data from 1898 onwards
                 mu = 1000)            # Null hypothesis: Mean is 1000
test.2
#>
#> One Sample t-test
#>
#> data: data.post.change
#> t = -9.8383, df = 72, p-value = 5.854e-15
#> alternative hypothesis: true mean is not equal to 1000
#> 95 percent confidence interval:
#> 823.6923 883.1022
#> sample estimates:
#> mean of x
#> 853.3973
# To report results in APA format:
yarrr::apa(test.1)
#> [1] "mean = 1097.67, t(26) = 3.69, p < 0.01 (2-tailed)"
yarrr::apa(test.2)
#> [1] "mean = 853.4, t(72) = -9.84, p < 0.01 (2-tailed)"
Answer: Both claims are true:
Before the year 1898, the average annual flow of 1097.7 was significantly higher than 1000: \(t(26) = 3.69\), \(p < .01^{*}\).
From the year 1898 onwards, the average annual flow of 853.4 was significantly lower than 1000: \(t(72) = -9.84\), \(p < .01^{*}\).
1c. Directly test whether the average annual flow before 1898 and from 1898 onwards differed systematically. (Hint: This requires a 2-sample \(t\)-test on the Nile ranges from above.)
# t-test 1c:
test.3 <- t.test(x = data.pre.change,  # only data before 1898
                 y = data.post.change) # only data from 1898 onwards
test.3 # print the test results
#>
#> Welch Two Sample t-test
#>
#> data: data.pre.change and data.post.change
#> t = 8.0404, df = 43.506, p-value = 3.824e-10
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> 183.0224 305.5164
#> sample estimates:
#> mean of x mean of y
#> 1097.6667 853.3973
apa(test.3) # print results in APA format
#> [1] "mean difference = -244.27, t(43.51) = 8.04, p < 0.01 (2-tailed)"
Answer: Yes, the flow was significantly reduced (with the same means as reported above): \(t(43.5) = 8.04\), \(p < .01^{*}\).
1d. Plot the annual Nile flow data, drawing a vertical line at the year 1898, and horizontal lines for the averages before 1898 and from the year 1898 onwards.
plot(Nile, lwd = 2, col = "gray50")
mn.pre <- mean(data.pre.change)
mn.post <- mean(data.post.change)
abline(v = my.year, lty = 2, lwd = 2, col = "steelblue3")
abline(h = mn.pre, lty = 3, lwd = 2, col = "green3")
abline(h = mn.post, lty = 3, lwd = 2, col = "red3")
# Add text labels:
text(x = my.year + 4,
     y = max(Nile),
     labels = paste(my.year),
     col = "steelblue3") # match the color of the vertical line
text(x = 1886,
     y = max(Nile),
     labels = paste0("before ", my.year, ": ", round(mn.pre, 0)),
     col = "green3")
text(x = 1940,
     y = min(Nile) + 100,
     labels = paste0("after ", my.year, ": ", round(mn.post, 0)),
     col = "red3")
Correlation tests
2. The mtcars data set available in R records fuel consumption (in mpg) and 10 further variables of 32 cars (from the 1974 issue of Motor Trend magazine). We’ll use these data to test some hypotheses about the correlation \(r\) between two numeric variables.
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# head(mtcars)
# ?mtcars
2a. Is there a systematic relationship between a car’s fuel consumption (mpg) and its gross horsepower (hp)? (Hint: Use a correlation test with the vector notation on columns of mtcars to test this relationship.)
my.test <- cor.test(x = mtcars$hp,
                    y = mtcars$mpg)
my.test
#>
#> Pearson's product-moment correlation
#>
#> data: mtcars$hp and mtcars$mpg
#> t = -6.7424, df = 30, p-value = 1.788e-07
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.8852686 -0.5860994
#> sample estimates:
#> cor
#> -0.7761684
# Note that you can print the results in APA format with the apa() function of the `yarrr` package:
# library(yarrr)
yarrr::apa(my.test)
#> [1] "r = -0.78, t(30) = -6.74, p < 0.01 (2-tailed)"
Answer: Yes, there is a significant negative correlation between fuel consumption and horsepower: r = -0.78, t(30) = -6.74, p < 0.01 (2-tailed).
2b. Is there a systematic relationship between a car’s weight (wt) and its displacement (disp)? (Hint: Use a correlation test with the formula notation to test this relationship.)
cor.test(formula = ~ wt + disp,
         data = mtcars)
#>
#> Pearson's product-moment correlation
#>
#> data: wt and disp
#> t = 10.576, df = 30, p-value = 1.222e-11
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.7811586 0.9442902
#> sample estimates:
#> cor
#> 0.8879799
Answer: Yes, there is a significant positive correlation between a car’s weight and displacement: r = 0.89, t(30) = 10.58, p < 0.01 (2-tailed).
2c. Does the systematic relationship between a car’s fuel consumption (mpg) and its gross horsepower (hp) still hold when only considering cars with automatic transmission (am == 0) and more than four cylinders (cyl > 4)?
cor.test(formula = ~ hp + mpg,
         data = mtcars,
         subset = (am == 0 & cyl > 4))
#>
#> Pearson's product-moment correlation
#>
#> data: hp and mpg
#> t = -3.7496, df = 14, p-value = 0.002155
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.8909390 -0.3268242
#> sample estimates:
#> cor
#> -0.7078515
Answer: Yes, there is still a significant negative correlation between fuel consumption and horsepower for the subset of cars with automatic transmission and more than four cylinders: r = -0.71, t(14) = -3.75, p < 0.01 (2-tailed).
2d. Plot the relationship of Exercise 2a. and add a linear regression line and a label stating the value of the correlation.
plot(x = mtcars$hp,
     y = mtcars$mpg,
     pch = 21,
     bg = yarrr::transparent("orange", .50),
     col = yarrr::transparent("red4", .25),
     ylim = c(0, max(mtcars$mpg)),
     main = "MPG by horse power",
     xlab = "Horse power",
     ylab = "Fuel consumption (mpg)")
# Add regression line:
abline(lm(mtcars$mpg ~ mtcars$hp),
       col = "steelblue", lty = 3, lwd = 2)
# Add correlation test label:
text(x = 290, y = 8,
     labels = paste0("cor = ", round(cor(mtcars$mpg, mtcars$hp), 2)),
     col = "steelblue")
Chi-square tests
3. The mtcars data set can also be used to test some hypotheses about the equal distribution of (the frequency of) instances in categories (with \(\chi^{2}\)-tests).
3a. What are the frequencies of cars with automatic vs. manual transmission (am)? What are the frequencies of cars with different numbers of cylinders (cyl)? Are both of these categories distributed equally or unequally? (Hints: Obtain two table()s and conduct two corresponding 1-sample \(\chi^{2}\)-tests to find out.)
# 1 sample: transmission
t1a <- table(mtcars$am)
t1a
#>
#> 0 1
#> 19 13
# One-sample chi-square test:
chisq.test(x = t1a)
#>
#> Chi-squared test for given probabilities
#>
#> data: t1a
#> X-squared = 1.125, df = 1, p-value = 0.2888
Answer: There are 19 cars with automatic transmission and 13 cars with manual transmission, but we cannot reject the hypothesis that both types of transmission occur with equal frequency: \(\chi^{2}(1) = 1.13\), \(p = .29\).
# 1 sample: cylinders
t1b <- table(mtcars$cyl)
t1b
#>
#> 4 6 8
#> 11 7 14
# One-sample chi-square test:
chisq.test(x = t1b)
#>
#> Chi-squared test for given probabilities
#>
#> data: t1b
#> X-squared = 2.3125, df = 2, p-value = 0.3147
Answer: There are 11 (7, 14) cars with 4 (6, 8) cylinders, respectively, but we cannot reject the hypothesis that cars with these numbers of cylinders occur with equal frequency: \(\chi^{2}(2) = 2.31\), \(p = .31\).
3b. Do the frequencies of cars with automatic vs. manual transmission (am) vary as a function of the number of cylinders (cyl)? (Hints: Obtain a two-way table() and a corresponding 2-sample \(\chi^{2}\)-test to find out.)
# 2 samples:
t2 <- table(mtcars$am, mtcars$cyl)
t2
#>
#> 4 6 8
#> 0 3 4 12
#> 1 8 3 2
# Two-sample chi-square test:
chisq.test(x = t2)
#>
#> Pearson's Chi-squared test
#>
#> data: t2
#> X-squared = 8.7407, df = 2, p-value = 0.01265
Answer: The following table shows the frequency of cars by type of transmission and their number of cylinders:
| Transmission | 4 cyl | 6 cyl | 8 cyl |
|---|---|---|---|
| automatic (0) | 3 | 4 | 12 |
| manual (1) | 8 | 3 | 2 |
We can reject the hypothesis that this distribution is equal: \(\chi^{2}(2) = 8.74\), \(p < .05^{*}\) (or \(p = .013\)). It looks like cars with automatic transmission tend to have more cylinders, whereas cars with manual transmission tend to have fewer. (However, due to the low counts in some cells, this interpretation has to be treated with caution.)
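Given the small expected cell counts, one way to double-check this result is Fisher’s exact test (a sketch using base R’s fisher.test(), which does not rely on the \(\chi^{2}\) approximation):
fisher.test(t2) # exact test for count data in a two-way table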
Checkpoint 1
At this point you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…
B. At Home:
A Student Survey
In this part, we will analyze data from a fictional survey of 100 students.
The data are located in a tab-delimited text file at http://Rpository.com/down/data/WPA06.txt.
Data description
The data file has 100 rows and 12 columns. The columns contain variables that are defined as follows:
- sex (string): A string indicating the sex of the participant (“m” = male, “f” = female).
- age (integer): An integer indicating the age of the participant.
- major (string): A string indicating the participant’s major.
- haircolor (string): The participant’s hair color.
- iq (integer): The participant’s score on an IQ test.
- country (string): The participant’s country of origin.
- logic (numeric): Amount of time it took the participant to complete a logic problem (smaller is better).
- siblings (integer): The participant’s number of siblings.
- multitasking (integer): The participant’s score on a multitasking task (higher is better).
- partners (integer): The participant’s number of sexual partners (so far).
- marijuana (binary): Has the participant ever tried marijuana? (“0” = no, “1” = yes).
- risk (binary): Would the participant play a gamble with a 50% chance of losing EUR 20 and a 50% chance of earning EUR 20? (“0” means the participant would not play the gamble, “1” means he or she would agree to play).
Data loading and preparation
4a. The data file is available online at http://Rpository.com/down/data/WPA06.txt. Load this data into R as a new object called wpa6.df. (Hint: Use read.table() and note that the file contains a header row and is tab-delimited.)
# Read data into a data frame wpa6.df:
# (Alternative source: http://nathanieldphillips.com/wp-content/uploads/2016/11/wpa6.txt)
wpa6.df <- read.table(file = "http://Rpository.com/down/data/WPA06.txt",
                      header = TRUE, # there is a header row
                      sep = "\t")    # data are tab-delimited
4b. Save the data as a comma-delimited text file (entitled wpa6.txt) into your local data folder of your project’s working directory. (Hint: Use write.table() with appropriate arguments.)
# Write wpa6.df to a comma-delimited text file in your local data folder:
write.table(wpa6.df,                # object to be written
            file = "data/wpa6.txt", # save wpa6.txt into the data folder
            sep = ",")              # save in comma-delimited format
4c. Use the head(), str(), and View() functions to inspect the dataset and make sure that it was loaded correctly. If the data don’t look correct (i.e., if there isn’t a header row plus 100 rows and 12 columns), you probably didn’t load it correctly.
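For example (View() opens an interactive data viewer, so it is best left commented out when knitting):
head(wpa6.df) # shows the first 6 rows
str(wpa6.df)  # should report 100 obs. of 12 variables
# View(wpa6.df) # spreadsheet-like view (interactive use only)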
T-test(s)
5. Let’s first test some simple hypotheses about the means of groups or populations.
5a. The average IQ in the general population is 100. Do the participants of our survey have an IQ different from the general population? (Hint: Answer this question with a one-sample t-test.)
t.test(x = wpa6.df$iq,
       mu = 100)
#>
#> One Sample t-test
#>
#> data: wpa6.df$iq
#> t = 20.847, df = 99, p-value < 2.2e-16
#> alternative hypothesis: true mean is not equal to 100
#> 95 percent confidence interval:
#> 109.0935 111.0065
#> sample estimates:
#> mean of x
#> 110.05
Answer: Yes, participants’ IQ values differ from the general population. More specifically, they have significantly higher IQ values than the general population: \(t(99) = 20.85\), \(p < .01^{*}\).
5b. The friend of your mother’s hairdresser’s cousin claims that students have 2.5 siblings on average. Test this claim with a one-sample t-test.
t.test(x = wpa6.df$siblings,
       mu = 2.5)
#>
#> One Sample t-test
#>
#> data: wpa6.df$siblings
#> t = -0.9083, df = 99, p-value = 0.3659
#> alternative hypothesis: true mean is not equal to 2.5
#> 95 percent confidence interval:
#> 2.181545 2.618455
#> sample estimates:
#> mean of x
#> 2.4
Answer: The data do not reject the person’s claim — that is, the mean number of siblings of the students in our survey is not significantly different from 2.5: \(t(99) = -0.91\), \(p = 0.37\). (Note that this does not mean that the claim is confirmed.)
5c. Do students that have smoked marijuana have different IQ levels than those who have never smoked marijuana? (Hint: Test this hypothesis with a two-sample t-test. You can either use the vector or the formula notation for a t-test.)
t.test(formula = iq ~ marijuana,
       data = wpa6.df)
#>
#> Welch Two Sample t-test
#>
#> data: iq by marijuana
#> t = -0.72499, df = 96.078, p-value = 0.4702
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -2.610103 1.213545
#> sample estimates:
#> mean in group 0 mean in group 1
#> 109.6939 110.3922
Answer: We do not have evidence that IQ values vary as a function of marijuana use: \(t(96.08) = -0.72\), \(p = 0.47\).
Correlation test(s)
6. Does some numeric variable vary as a function of another one?
6a. Do students with higher multitasking skills tend to have more romantic partners than those with lower multitasking skills? (Hint: Test this with a correlation test.)
cor.test(formula = ~ multitasking + partners,
         data = wpa6.df,
         alternative = "greater")
#>
#> Pearson's product-moment correlation
#>
#> data: multitasking and partners
#> t = -0.79444, df = 98, p-value = 0.7856
#> alternative hypothesis: true correlation is greater than 0
#> 95 percent confidence interval:
#> -0.2422602 1.0000000
#> sample estimates:
#> cor
#> -0.07999296
Answer: We do not have evidence that people with higher multitasking skills tend to have more romantic partners than those with lower skills: \(t(98) = -0.79\), \(p = 0.79\).
6b. Do people with higher IQs perform better (i.e., faster) on the logic test? (Hint: Answer this question with a correlation test.)
cor.test(formula = ~ iq + logic,
         data = wpa6.df,
         alternative = "less")
#>
#> Pearson's product-moment correlation
#>
#> data: iq and logic
#> t = 1.2065, df = 98, p-value = 0.8847
#> alternative hypothesis: true correlation is less than 0
#> 95 percent confidence interval:
#> -1.0000000 0.2808334
#> sample estimates:
#> cor
#> 0.1209815
Answer: We do not have evidence that people with higher IQs perform faster on the logic test: \(t(98) = 1.21\), \(p = 0.88\).
Chi-square test(s)
7. Are the instances of categorical variables equally or unequally distributed?
7a. Are some majors more popular than others? (Hint: Answer this question with a one-sample chi-square test, but also check the frequencies.)
chisq.test(table(wpa6.df$major))
#>
#> Chi-squared test for given probabilities
#>
#> data: table(wpa6.df$major)
#> X-squared = 130.96, df = 3, p-value < 2.2e-16
table(wpa6.df$major)
#>
#> biology economics psychology sociology
#> 3 15 74 8
Answer: We do find significant evidence that some majors are more popular than others: \(\chi^{2}(3) = 130.96\), \(p < .01^{*}\). (Psychology seems to be the most popular major in our sample.)
7b. In general, were the students in this survey more likely to take a risk than not? (Hint: Answer this question with a one-sample chi-square test, but also check the frequencies.)
chisq.test(table(wpa6.df$risk))
#>
#> Chi-squared test for given probabilities
#>
#> data: table(wpa6.df$risk)
#> X-squared = 17.64, df = 1, p-value = 2.669e-05
# But look at the direction!!!
table(wpa6.df$risk)
#>
#> 0 1
#> 71 29
Answer: No! We do not find significant evidence that people are more likely to take a risk than not. In fact, we find evidence that they were significantly less likely to take the risk than to decline it: \(\chi^{2}(1) = 17.64\), \(p < .01^{*}\).
7c. Is there a relationship between hair color and students’ academic major? (Hint: Answer this with a two-sample chi-square test, but also check the frequencies.)
table(wpa6.df$major, wpa6.df$haircolor)
#>
#> black blonde brown red
#> biology 0 0 3 0
#> economics 2 5 7 1
#> psychology 15 20 31 8
#> sociology 1 3 4 0
chisq.test(table(wpa6.df$major, wpa6.df$haircolor))
#>
#> Pearson's Chi-squared test
#>
#> data: table(wpa6.df$major, wpa6.df$haircolor)
#> X-squared = 5.9227, df = 9, p-value = 0.7476
Answer: We do not find significant evidence for a relationship between hair color and major: \(\chi^{2}(9) = 5.92\), \(p = 0.75\).
You pick the test!
8. In the following exercises it’s up to you to select an appropriate test.
8a. Is there a relationship between whether a student has ever smoked marijuana and his/her decision to accept or reject the risky gamble?
chisq.test(table(wpa6.df$marijuana, wpa6.df$risk))
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(wpa6.df$marijuana, wpa6.df$risk)
#> X-squared = 0.56829, df = 1, p-value = 0.4509
Answer: We do not find significant evidence that marijuana use is related to selecting the risky gamble: \(\chi^{2}(1) = 0.57\), \(p = 0.45\).
8b. Do males and females have different numbers of sexual partners on average?
t.test(partners ~ sex,
       data = wpa6.df)
#>
#> Welch Two Sample t-test
#>
#> data: partners by sex
#> t = 0.9219, df = 93.765, p-value = 0.3589
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.7615073 2.0815073
#> sample estimates:
#> mean in group f mean in group m
#> 7.54 6.88
Answer: We do not find significant evidence that males and females have a different number of sexual partners on average: \(t(93.77) = 0.92\), \(p = 0.36\).
8c. Do males and females differ in how likely they are to have smoked marijuana?
chisq.test(table(wpa6.df$marijuana, wpa6.df$sex))
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(wpa6.df$marijuana, wpa6.df$sex)
#> X-squared = 0.16006, df = 1, p-value = 0.6891
Answer: We do not find significant evidence that males and females differ in how likely they are to have smoked marijuana: \(\chi^{2}(1) = 0.16\), \(p = 0.69\).
8d. Do people who have smoked marijuana have different logic scores on average than those who never have smoked marijuana?
t.test(logic ~ marijuana, data = wpa6.df)
#>
#> Welch Two Sample t-test
#>
#> data: logic by marijuana
#> t = 0.95997, df = 90.88, p-value = 0.3396
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.4373672 1.2554465
#> sample estimates:
#> mean in group 0 mean in group 1
#> 9.945510 9.536471
Answer: We do not find significant evidence that those who have smoked marijuana have different logic scores on average than those who have not smoked marijuana: \(t(90.88) = 0.96\), \(p = 0.34\).
8e. Do people with higher IQ scores tend to perform better on the logic test than those with lower IQ scores?
cor.test(~ iq + logic,
         data = wpa6.df,
         alternative = "less")
#>
#> Pearson's product-moment correlation
#>
#> data: iq and logic
#> t = 1.2065, df = 98, p-value = 0.8847
#> alternative hypothesis: true correlation is less than 0
#> 95 percent confidence interval:
#> -1.0000000 0.2808334
#> sample estimates:
#> cor
#> 0.1209815
Answer: We do not find significant evidence that people with higher IQ scores tend to perform better on the logic test than those with lower IQ scores: \(t(98) = 1.21\), \(p = 0.88\).
Checkpoint 2
If you got this far you’re doing great, but don’t give up just yet…
More complicated tests
9. The following exercises typically require you to first select an appropriate subset of the data.
9a. Are Swiss students more likely than not to have tried marijuana? (Hint: Use a one-sample chi-square test on a suitable subset of the data, but also check the frequencies.)
mar.swi <- wpa6.df$marijuana[wpa6.df$country == "switzerland"]
table(mar.swi) # to see frequencies
#> mar.swi
#> 0 1
#> 8 12
chisq.test(table(mar.swi)) # test the frequency table, not the raw 0/1 vector
#>
#> Chi-squared test for given probabilities
#>
#> data: table(mar.swi)
#> X-squared = 0.8, df = 1, p-value = 0.3711
Answer: Although more Swiss students in our sample have tried marijuana than not (12 vs. 8), this difference is not significant, so we do not find evidence that Swiss students were more likely than not to have tried marijuana: \(\chi^{2}(1) = 0.8\), \(p = .37\).
9b. Does the relationship between IQ scores and multitasking performance differ between males and females? (Hint: Test this by conducting two separate tests — one for males and one for females. Do your conclusions differ?)
# For males:
cor.test(formula = ~ iq + multitasking,
         data = subset(wpa6.df, sex == "m"))
#>
#> Pearson's product-moment correlation
#>
#> data: iq and multitasking
#> t = 0.40605, df = 48, p-value = 0.6865
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.2234787 0.3314582
#> sample estimates:
#> cor
#> 0.05850849
# For females:
cor.test(formula = ~ iq + multitasking,
         data = subset(wpa6.df, sex == "f"))
#>
#> Pearson's product-moment correlation
#>
#> data: iq and multitasking
#> t = -1.45, df = 48, p-value = 0.1536
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.45713553 0.07793781
#> sample estimates:
#> cor
#> -0.204854
Answer: While the raw correlations for males and females differ (\(r = .06\) vs. \(r = -.20\), respectively), we do not find a significant correlation between IQ scores and multitasking performance for either males or females: \(t(48) = .41\), \(p = 0.69\) and \(t(48) = -1.45\), \(p = 0.15\), respectively.
9c. Do the IQ scores of people with brown hair differ from those with blonde hair? (Hint: This is a two-sample t-test that requires you to use subset() to tell R which two groups you want to compare.)
t.test(iq ~ haircolor,
       data = subset(wpa6.df, haircolor %in% c("brown", "blonde")))
#>
#> Welch Two Sample t-test
#>
#> data: iq by haircolor
#> t = 0.50574, df = 59.053, p-value = 0.6149
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.757470 2.946358
#> sample estimates:
#> mean in group blonde mean in group brown
#> 109.7500 109.1556
Answer: We do not find a significant difference in IQ scores between blondes and brown-haired people: \(t(59.05) = 0.51\), \(p = 0.61\).
9d. Only for men from Germany, is there a relationship between age and IQ?
cor.test(formula = ~ age + iq,
         data = subset(wpa6.df, country == "germany"))
#>
#> Pearson's product-moment correlation
#>
#> data: age and iq
#> t = 1.1545, df = 58, p-value = 0.253
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.1081605 0.3890006
#> sample estimates:
#> cor
#> 0.1498806
Answer: We do not find a significant relationship between age and IQ for students from Germany: \(t(58) = 1.15\), \(p = .25\). (Note that the test above subsets on country only; to strictly consider only men, also filter on sex, e.g., subset(wpa6.df, country == "germany" & sex == "m"), and rerun the test.)
9e. Considering only people who chose the risky gamble, do people that have smoked marijuana have more sexual partners than those who have never smoked marijuana?
t.test(formula = partners ~ marijuana,
       data = subset(wpa6.df, risk == 1))
#>
#> Welch Two Sample t-test
#>
#> data: partners by marijuana
#> t = 0.87669, df = 25.529, p-value = 0.3888
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.307151 3.248327
#> sample estimates:
#> mean in group 0 mean in group 1
#> 7.500000 6.529412
Answer: No, for the people who chose the risky gamble, we do not find evidence for a relationship between marijuana use and their average number of sexual partners: \(t(25.53) = 0.88\), \(p = 0.39\).
9f. Considering only people who chose the risky gamble and have never tried marijuana, is there a relationship between IQ scores and performance on the logic test?
cor.test(formula = ~ iq + logic,
         data = subset(wpa6.df, risk == 1 & marijuana == 0))
#>
#> Pearson's product-moment correlation
#>
#> data: iq and logic
#> t = 1.4249, df = 10, p-value = 0.1846
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.2134024 0.7968450
#> sample estimates:
#> cor
#> 0.4108122
Answer: We do not find evidence for a relationship between IQ scores and logic test performance for those who chose the risky gamble and have not tried marijuana: \(t(10) = 1.42\), \(p = 0.18\).
Checkpoint 3
If you got this far you’re doing an amazing job — well done! Enjoy the following challenge…
C. Challenges
Anscombe’s quartet
10. In the next few exercises, we’ll explore Anscombe’s famous data quartet. This dataset illustrates the dangers of interpreting statistical tests (like a correlation test) without first plotting the data!
10a. Run the following code to create the anscombe.df dataframe. This dataframe contains 4 datasets (x1 and y1, x2 and y2, x3 and y3, and x4 and y4):
# Just COPY, PASTE, and RUN this code:
anscombe.df <- data.frame(x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 4.68),
                          x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
                          x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                          y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
                          x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                          y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
10b. Calculate the four correlations between \(x1\) and \(y1\), \(x2\) and \(y2\), \(x3\) and \(y3\), and \(x4\) and \(y4\) separately (i.e., what is the correlation between \(x1\) and \(y1\)? Next, what is the correlation between \(x2\) and \(y2\)? etc.). What can you report about the correlation values for each test?
cor.test(~x1 + y1, data = anscombe.df) # 1
#>
#> Pearson's product-moment correlation
#>
#> data: x1 and y1
#> t = 4.4844, df = 9, p-value = 0.001523
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4612713 0.9549196
#> sample estimates:
#> cor
#> 0.8311601
cor.test(~x2 + y2, data = anscombe.df) # 2
#>
#> Pearson's product-moment correlation
#>
#> data: x2 and y2
#> t = 4.2386, df = 9, p-value = 0.002179
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4239389 0.9506402
#> sample estimates:
#> cor
#> 0.8162365
cor.test(~x3 + y3, data = anscombe.df) # 3
#>
#> Pearson's product-moment correlation
#>
#> data: x3 and y3
#> t = 4.2394, df = 9, p-value = 0.002176
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4240623 0.9506547
#> sample estimates:
#> cor
#> 0.8162867
cor.test(~x4 + y4, data = anscombe.df) # 4
#>
#> Pearson's product-moment correlation
#>
#> data: x4 and y4
#> t = 4.243, df = 9, p-value = 0.002165
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.4246394 0.9507224
#> sample estimates:
#> cor
#> 0.8165214
Answer: The correlations are all statistically significant (\(p < .01^{*}\)) and all of almost the same magnitude (\(r \approx .82\) to \(.83\)).
10c. Now run the following code to generate a scatterplot of each data pair. What do you find?
# Just COPY, PASTE, and RUN this code:
# Plot the famous Anscombe quartet:
par(mfrow = c(2, 2)) # create a 2 x 2 plotting grid

for (i in 1:4) { # loop over the 4 datasets

  # Assign x and y for the current value of i:
  if (i == 1) {x <- anscombe.df$x1; y <- anscombe.df$y1}
  if (i == 2) {x <- anscombe.df$x2; y <- anscombe.df$y2}
  if (i == 3) {x <- anscombe.df$x3; y <- anscombe.df$y3}
  if (i == 4) {x <- anscombe.df$x4; y <- anscombe.df$y4}

  # Create the corresponding plot:
  plot(x = x, y = y, pch = 21,
       main = paste("Anscombe", i, sep = " "),
       bg = "orange", col = "red4",
       xlim = c(0, 20), ylim = c(0, 15))

  # Add regression line:
  abline(lm(y ~ x, data = data.frame(y, x)),
         col = "steelblue", lty = 3, lwd = 2)

  # Add correlation test text:
  text(x = 3, y = 12,
       labels = paste0("cor = ", round(cor(x, y), 2)),
       col = "steelblue")
}

par(mfrow = c(1, 1)) # reset plotting grid
Answer: Despite their statistical similarities (in test results), the four data sets are very different.
Conclusion: What you can see in the four scatterplots is the famous Anscombe’s quartet, a dataset designed to show how important it is to always plot your data before running a statistical test on it. You can learn more about it on the corresponding Wikipedia page.
Also, see blog posts on The Datasaurus Dozen for similar examples that illustrate the importance of visualizing data.
That’s it – hope you enjoyed working on this assignment!
[WPA06_answers.Rmd updated on 2017-12-07 21:34:54 by hn.]