Answers for WPA08 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Use the .Rmd format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats). A simple .Rmd template is provided here. (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)
Commenting your code: Indicate the current assignment (e.g., WPA08), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA08_yymmdd.Rmd could look like this:
---
title: "Your Assignment Title (WPA08)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---
This file contains my solutions to WPA08.
# Exercise 1
To show and run R code in your document, use a code chunk (without the '#' symbols):
# ```{r, exercise_1, echo = TRUE, eval = TRUE}
#
# v <- c(1, 2, 3) # some vector
# sum(v)
#
# ```
More text and code chunks...
[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:
Preparations
0. The following steps prepare the current session by creating a new .Rmd file and compiling it into an .html output file:
0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).
0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA08_yymmdd.Rmd (with an appropriate header) in your project directory.
0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)
library(rmarkdown)
library(knitr)
library(yarrr)
# Store original par() settings:
opar <- par()
# par(opar) # restores original (default) par settings later
0d. Make sure that you can create an .html output file by “knitting” your current document.
# Ok!
1. In this exercise, we will use some fictitious census data that lists some basic demographic variables – the gender, age, iq score, education (in both text and numeric formats), annual income, and self-reported happiness (happy, rated on a scale from 1 to 30) – for 200 quasi-representative respondents.
The data are stored in a comma-separated text file at http://Rpository.com/down/data/WPA08_census.txt.
Exploring data and comparing means
1a. Load the text file containing the data into R and assign it to a new object called census.dat. (Hint: Use the read.table() function with appropriate arguments. Note that the file is comma-delimited and contains a header with variable names.)
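The loading step itself is not shown below; a minimal solution sketch, assuming the comma-delimited file with a header row described in the hint, could look like this:

```r
# Load the comma-delimited census data (with a header row of variable names):
census.dat <- read.table("http://Rpository.com/down/data/WPA08_census.txt",
                         sep = ",",
                         header = TRUE)
```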
Here is how the first few rows of the census data should look:
head(census.dat)
#> gender age iq edu edu.num income happy
#> 1 m 34 122 2.Bachelors 3 55499 6
#> 2 f 86 113 2.Bachelors 3 26631 14
#> 3 m 60 97 1.HighSchool 2 73623 21
#> 4 f 54 105 1.HighSchool 2 61140 18
#> 5 f 46 96 2.Bachelors 3 41420 10
#> 6 m 80 95 4.PhD 5 48190 24
1b. Explore the data to make sure everything loaded and looks ok. (Hint: Use the head(), str(), and summary() functions.)
str(census.dat)
#> 'data.frame': 200 obs. of 7 variables:
#> $ gender : Factor w/ 2 levels "f","m": 2 1 2 1 1 2 1 1 1 2 ...
#> $ age : int 34 86 60 54 46 80 43 38 30 30 ...
#> $ iq : int 122 113 97 105 96 95 92 94 117 99 ...
#> $ edu : Factor w/ 5 levels "0.None","1.HighSchool",..: 3 3 2 2 3 5 3 3 4 4 ...
#> $ edu.num: int 3 3 2 2 3 5 3 3 4 4 ...
#> $ income : int 55499 26631 73623 61140 41420 48190 45500 35620 53420 57891 ...
#> $ happy : int 6 14 21 18 10 24 16 8 7 12 ...
summary(census.dat)
#> gender age iq edu edu.num
#> f: 97 Min. :18.00 Min. : 75.0 0.None :28 Min. :1.000
#> m:103 1st Qu.:34.00 1st Qu.: 94.0 1.HighSchool:49 1st Qu.:2.000
#> Median :51.00 Median :100.0 2.Bachelors :47 Median :3.000
#> Mean :52.21 Mean :100.4 3.Masters :58 Mean :2.945
#> 3rd Qu.:70.25 3rd Qu.:107.0 4.PhD :18 3rd Qu.:4.000
#> Max. :88.00 Max. :126.0 Max. :5.000
#> income happy
#> Min. :22277 Min. : 1.00
#> 1st Qu.:44085 1st Qu.:12.00
#> Median :51198 Median :17.00
#> Mean :51281 Mean :15.98
#> 3rd Qu.:59413 3rd Qu.:21.00
#> Max. :83030 Max. :29.00
1c. Do men and women earn the same or a different level of income on average? (Hint: Answer this question by conducting a t-test on appropriate subsets of the income variable.)
t.test(x = census.dat$income[census.dat$gender == "m"],
y = census.dat$income[census.dat$gender == "f"])
#>
#> Welch Two Sample t-test
#>
#> data: census.dat$income[census.dat$gender == "m"] and census.dat$income[census.dat$gender == "f"]
#> t = 4.3966, df = 194.44, p-value = 1.808e-05
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> 3758.426 9873.495
#> sample estimates:
#> mean of x mean of y
#> 54586.63 47770.67
Answer: Men earn significantly more than women: mean difference = 6815.96, t(194.44) = 4.4, p < .001 (2-tailed).
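The size and sign of this difference can be verified directly from the group means reported by the t-test (a quick sketch):

```r
# Difference of mean incomes (men minus women), matching the t-test estimates:
mean(census.dat$income[census.dat$gender == "m"]) -
  mean(census.dat$income[census.dat$gender == "f"])
# positive value (about 6816): men earn more on average
```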
1d. Do people with different levels of education show different levels of happiness on average? (Hint: Answer this question by conducting an ANOVA with the dependent variable happy and an independent variable that quantifies education (edu.num).)
# ANOVA:
happy.aov <- aov(formula = happy ~ edu.num,
data = census.dat)
summary(happy.aov)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> edu.num 1 35 34.55 1 0.319
#> Residuals 198 6842 34.56
# Plot:
yarrr::pirateplot(formula = happy ~ edu,
data = census.dat,
main = "Effect of education on happiness",
xlab = "education",
ylab = "happiness",
gl.col = "gray",
pal = "espresso"
)
Answer: A person’s level of education shows no systematic relationship with happiness: \(F(1, 198) = 1\), \(p = .32\).
Linear regression with 1 IV
2a. Test the hypothesis that more money makes people happier. (Hint: Conduct a linear regression of happy by income.)
c1.lm <- lm(formula = happy ~ income, data = census.dat)
summary(c1.lm)
#>
#> Call:
#> lm(formula = happy ~ income, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.988 -3.985 1.015 5.012 13.014
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.599e+01 1.907e+00 8.387 9.27e-15 ***
#> income -1.883e-07 3.629e-05 -0.005 0.996
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.893 on 198 degrees of freedom
#> Multiple R-squared: 1.36e-07, Adjusted R-squared: -0.00505
#> F-statistic: 2.693e-05 on 1 and 198 DF, p-value: 0.9959
Answer: We find no systematic relationship between income and happiness: \(p > .99\).
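To see how flat the estimated slope really is, one could also inspect its confidence interval (a sketch using the fitted model from above):

```r
# 95% confidence intervals for the regression coefficients:
confint(c1.lm)
# The interval for income includes 0, consistent with the non-significant slope.
```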
2b. Create a scatterplot that illustrates the relationship between income and happiness and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 2a.)
plot(x = census.dat$income,
y = census.dat$happy,
main = "Happiness by income",
xlab = "Income",
ylab = "Happiness"
)
abline(c1.lm, col = "steelblue3", lty = 1, lwd = 2)
3a. Does a person’s income depend on his or her age?
c2.lm <- lm(formula = income ~ age, data = census.dat)
summary(c2.lm)
#>
#> Call:
#> lm(formula = income ~ age, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -21359.1 -7474.4 -45.2 7117.8 27855.4
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 63093.17 2017.99 31.265 < 2e-16 ***
#> age -226.25 35.92 -6.298 1.91e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10530 on 198 degrees of freedom
#> Multiple R-squared: 0.1669, Adjusted R-squared: 0.1627
#> F-statistic: 39.67 on 1 and 198 DF, p-value: 1.907e-09
3b. Create a scatterplot that illustrates the relationship between income and age and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 3a.)
# Scatterplot:
plot(x = census.dat$age,
y = census.dat$income,
ylim = c(0, 90000),
main = "Income by age",
xlab = "Age",
ylab = "Income"
)
abline(c2.lm, col = "steelblue3", lty = 1, lwd = 2) # regression line
3c. Create 2 separate linear regressions to re-examine the influence of age on income for people up to 60 years vs. people older than 60 years. What do you conclude?
# (a) lm for people up to 60:
c3a.lm <- lm(formula = income ~ age,
data = subset(census.dat[census.dat$age < 61, ])
)
summary(c3a.lm)
#>
#> Call:
#> lm(formula = income ~ age, data = subset(census.dat[census.dat$age <
#> 61, ]))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -20211.3 -6812.5 751.3 6644.3 28529.1
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 47561.72 2965.95 16.036 < 2e-16 ***
#> age 198.26 74.56 2.659 0.00889 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9602 on 121 degrees of freedom
#> Multiple R-squared: 0.05522, Adjusted R-squared: 0.04741
#> F-statistic: 7.072 on 1 and 121 DF, p-value: 0.008892
# Alternative version:
c3a2.lm <- lm(formula = income ~ age,
data = census.dat,
subset = age < 61)
all.equal(c3a.lm$coefficients, c3a2.lm$coefficients) # test equality
#> [1] TRUE
# (b) lm for people above 60:
c3b.lm <- lm(formula = income ~ age,
data = subset(census.dat[census.dat$age > 60, ])
)
summary(c3b.lm)
#>
#> Call:
#> lm(formula = income ~ age, data = subset(census.dat[census.dat$age >
#> 60, ]))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14588.0 -5019.3 -473.2 3908.8 20449.7
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 121450.0 7842.8 15.485 < 2e-16 ***
#> age -1019.3 104.2 -9.786 4.83e-15 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7611 on 75 degrees of freedom
#> Multiple R-squared: 0.5608, Adjusted R-squared: 0.5549
#> F-statistic: 95.76 on 1 and 75 DF, p-value: 4.833e-15
# Alternative version:
c3b2.lm <- lm(formula = income ~ age,
data = census.dat,
subset = age > 60)
all.equal(c3b.lm$coefficients, c3b2.lm$coefficients) # test equality
#> [1] TRUE
Answer: Both relationships are significant, but in different directions: For people up to 60 years old, there is a positive relationship between age and income (\(b = 198.3\), \(p < .01\)). By contrast, for people older than 60 the relationship between age and income is stronger and negative (\(b = -1019.3\), \(p < .01\)). Overall, the relationship between age and income seems curvilinear, as illustrated by the following plot:
# Scatterplot:
plot(x = census.dat$age,
y = census.dat$income,
ylim = c(0, 90000),
main = "Income by age",
xlab = "Age",
ylab = "Income"
)
# Draw some lines:
abline(c2.lm, col = "steelblue3", lty = 1, lwd = 2) # regression line (full model from above)
abline(v = 61, col = "orange3", lty = 3) # vertical line
abline(c3a.lm, col = "green3", lty = 2, lwd = 2) # model for subset: age < 61
abline(c3b.lm, col = "red3", lty = 2, lwd = 2) # model for subset: age > 60
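The curvilinear pattern suggested by the two subset models could also be tested directly by adding a quadratic age term to the full model (a sketch, not required by the exercise):

```r
# Quadratic regression of income on age (I() protects the squared term in a formula):
c2q.lm <- lm(income ~ age + I(age^2), data = census.dat)
summary(c2q.lm)
# A significant negative coefficient on I(age^2) would indicate an inverted U-shape.
```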
Linear regression with multiple IVs
4a. Besides income, does someone’s happiness also depend on the person’s age, gender, IQ score, and education level? (Hint: Conduct a linear regression of happy by gender, age, iq, edu.num, and income to find out.)
c3.lm <- lm(formula = happy ~ gender + age + iq + edu.num + income,
data = census.dat)
summary(c3.lm)
#>
#> Call:
#> lm(formula = happy ~ gender + age + iq + edu.num + income, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.4380 -1.3659 0.0196 1.2650 5.6033
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.433e+01 1.876e+00 7.642 9.58e-13 ***
#> genderm -9.767e-01 3.276e-01 -2.981 0.00324 **
#> age 2.676e-01 8.277e-03 32.334 < 2e-16 ***
#> iq -2.334e-01 1.544e-02 -15.114 < 2e-16 ***
#> edu.num 1.146e-01 1.278e-01 0.897 0.37093
#> income 2.197e-04 1.558e-05 14.098 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.16 on 194 degrees of freedom
#> Multiple R-squared: 0.8683, Adjusted R-squared: 0.8649
#> F-statistic: 255.9 on 5 and 194 DF, p-value: < 2.2e-16
Answer: Happiness now seems to be significantly predicted by gender (women are happier), age (older persons are happier), iq (people with lower IQ values are happier), and income (people with higher income are happier).
4b. On which of the other predictor variables does someone’s income depend?
c4.lm <- lm(formula = income ~ gender + age + iq + edu.num + happy,
data = census.dat)
summary(c4.lm)
#>
#> Call:
#> lm(formula = income ~ gender + age + iq + edu.num + happy, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17837 -4398 -230 4729 17272
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6968.94 6909.87 -1.009 0.314
#> genderm 5872.76 999.67 5.875 1.81e-08 ***
#> age -732.54 42.70 -17.157 < 2e-16 ***
#> iq 577.57 61.02 9.466 < 2e-16 ***
#> edu.num -449.43 413.46 -1.087 0.278
#> happy 2303.50 163.39 14.098 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6996 on 194 degrees of freedom
#> Multiple R-squared: 0.64, Adjusted R-squared: 0.6307
#> F-statistic: 68.98 on 5 and 194 DF, p-value: < 2.2e-16
Answer: Significant predictors of income are gender (males have higher incomes), age (older people have lower incomes), iq (higher IQ values imply higher incomes), and happiness (happier people have higher incomes).
4c. Is the effect of income on happiness still significant when already accounting for the effects of the other systematic predictors? (Hint: Conduct 2 linear regression models on happy: one with all significant predictors excluding income and one with all significant predictors including income. Then use an ANOVA – via the anova() function – to contrast both models.)
# Model of significant predictors only:
c5a.lm <- lm(formula = happy ~ gender + age + iq + income,
data = census.dat)
summary(c5a.lm)
#>
#> Call:
#> lm(formula = happy ~ gender + age + iq + income, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.5455 -1.4779 -0.0165 1.3040 5.5916
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.478e+01 1.807e+00 8.180 3.59e-14 ***
#> genderm -9.699e-01 3.274e-01 -2.963 0.00343 **
#> age 2.667e-01 8.209e-03 32.488 < 2e-16 ***
#> iq -2.337e-01 1.543e-02 -15.147 < 2e-16 ***
#> income 2.191e-04 1.556e-05 14.079 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.159 on 195 degrees of freedom
#> Multiple R-squared: 0.8678, Adjusted R-squared: 0.8651
#> F-statistic: 320 on 4 and 195 DF, p-value: < 2.2e-16
# Model of significant predictors of happiness EXCLUDING income:
c5b.lm <- lm(formula = happy ~ gender + age + iq,
data = census.dat)
summary(c5b.lm)
#>
#> Call:
#> lm(formula = happy ~ gender + age + iq, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.1301 -2.2784 0.1363 2.1347 6.9248
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 26.03827 2.29567 11.342 <2e-16 ***
#> genderm 0.63534 0.43469 1.462 0.145
#> age 0.21577 0.01044 20.672 <2e-16 ***
#> iq -0.21565 0.02178 -9.903 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.059 on 196 degrees of freedom
#> Multiple R-squared: 0.7334, Adjusted R-squared: 0.7293
#> F-statistic: 179.7 on 3 and 196 DF, p-value: < 2.2e-16
# Note that gender is no longer significant!
# Compare the 2 models:
anova(c5a.lm, c5b.lm)
#> Analysis of Variance Table
#>
#> Model 1: happy ~ gender + age + iq + income
#> Model 2: happy ~ gender + age + iq
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 195 909.26
#> 2 196 1833.53 -1 -924.27 198.22 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: Yes, income remains a significant predictor of happiness when already accounting for the effects of gender, age, and iq.
Checkpoint 1
At this point you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…
B. At Home
Studying student performance
In this part, we will analyze a real data set from a study on student performance in two schools and two subjects (Math and Portuguese). The two data files come from the UCI Machine Learning repository at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two tab-delimited text files at:
http://Rpository.com/down/data/WPA08_studentmath.txt (the Math data), and
http://Rpository.com/down/data/WPA08_studentpor.txt (the Portuguese data).
Data description
Both data files contain 33 columns. Here are their definitions:
- school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
- sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
- age - student’s age (numeric: from 15 to 22)
- address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
- famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
- Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
- Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
- Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - in a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)
- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)
Loading and exploring data
5a. Load the two text files containing the data into R and assign them to new objects called student.math and student.por, respectively. (Hint: Use the read.table() function with appropriate arguments, given that both files are tab-delimited and contain a header row with variable names.)
student.math <- read.table("http://Rpository.com/down/data/WPA08_studentmath.txt",
sep = "\t",
header = TRUE
)
student.por <- read.table("http://Rpository.com/down/data/WPA08_studentpor.txt",
sep = "\t",
header = TRUE
)
Here is how the first few rows of the math data should look:
head(student.math)
#> school sex age address famsize Pstatus Medu Fedu Mjob Fjob
#> 1 GP F 18 U GT3 A 4 4 at_home teacher
#> 2 GP F 17 U GT3 T 1 1 at_home other
#> reason guardian traveltime studytime failures schoolsup famsup paid
#> 1 course mother 2 2 0 yes no no
#> 2 course father 1 2 0 no yes no
#> activities nursery higher internet romantic famrel freetime goout Dalc
#> 1 no yes yes no no 4 3 4 1
#> 2 no no yes yes no 5 3 3 1
#> Walc health absences G1 G2 G3
#> 1 1 3 6 5 6 6
#> 2 1 3 4 5 5 6
#> [ reached getOption("max.print") -- omitted 4 rows ]
5b. Look at the first few rows of both data frames with the head() function to make sure they were imported correctly.
# head(student.math) # (shown above)
head(student.por)
#> school sex age address famsize Pstatus Medu Fedu Mjob Fjob
#> 1 GP F 18 U GT3 A 4 4 at_home teacher
#> 2 GP F 17 U GT3 T 1 1 at_home other
#> reason guardian traveltime studytime failures schoolsup famsup paid
#> 1 course mother 2 2 0 yes no no
#> 2 course father 1 2 0 no yes no
#> activities nursery higher internet romantic famrel freetime goout Dalc
#> 1 no yes yes no no 4 3 4 1
#> 2 no no yes yes no 5 3 3 1
#> Walc health absences G1 G2 G3
#> 1 1 3 4 0 11 11
#> 2 1 3 2 9 11 11
#> [ reached getOption("max.print") -- omitted 4 rows ]
5c. Using the str() function, look at summary statistics for each column in the data frame. There should be 33 columns in each dataset. Make sure everything looks ok.
# str(student.math)
# str(student.por)
# Looks ok.
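A quick way to verify the promised 33 columns in both data frames (a sketch):

```r
# Check the dimensions of both datasets:
dim(student.math)  # rows and columns of the Math data
dim(student.por)   # rows and columns of the Portuguese data
ncol(student.math) == 33 && ncol(student.por) == 33  # both should have 33 columns
```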
Standard Regression with lm()
One IV
6a. For the math data, create a regression object called lm.G1.age predicting first period grade (G1) based on age.
lm.G1.age <- lm(G1 ~ age, data = student.math)
6b. How do you interpret the relationship between age and the first period grade G1?
summary(lm.G1.age)
#>
#> Call:
#> lm(formula = G1 ~ age, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.6915 -2.7749 -0.1916 2.3085 8.3085
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 13.6919 2.1926 6.245 1.1e-09 ***
#> age -0.1667 0.1309 -1.273 0.204
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.317 on 393 degrees of freedom
#> Multiple R-squared: 0.004106, Adjusted R-squared: 0.001572
#> F-statistic: 1.62 on 1 and 393 DF, p-value: 0.2038
Answer: There is a slight negative relationship between age and first period grade (\(b = -0.17\)), but the relationship is not significant (\(p = .20\)).
7a. For the math data, create a regression object called lm.G1.abs predicting first period grade (G1) based on absences.
lm.G1.abs <- lm(G1 ~ absences, data = student.math)
7b. How do you interpret the relationship between absences and the first period grade G1?
summary(lm.G1.abs)
#>
#> Call:
#> lm(formula = G1 ~ absences, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.8794 -2.9115 0.0177 2.2363 8.0692
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.98227 0.20539 53.470 <2e-16 ***
#> absences -0.01286 0.02091 -0.615 0.539
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.322 on 393 degrees of freedom
#> Multiple R-squared: 0.0009612, Adjusted R-squared: -0.001581
#> F-statistic: 0.3781 on 1 and 393 DF, p-value: 0.539
Answer: There is a very small negative relationship between absences and first period grade (\(b = -0.01\)), but the relationship is not significant (\(p = .54\)).
8a. For the math data, create a regression object called lm.G3.G1 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1).
lm.G3.G1 <- lm(G3 ~ G1, data = student.math)
8b. How do you interpret the relationship between the first and final period grades (G1 and G3)?
summary(lm.G3.G1)
#>
#> Call:
#> lm(formula = G3 ~ G1, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11.6223 -0.8348 0.3777 1.6965 5.0153
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.65280 0.47475 -3.481 0.000555 ***
#> G1 1.10626 0.04164 26.568 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.743 on 393 degrees of freedom
#> Multiple R-squared: 0.6424, Adjusted R-squared: 0.6414
#> F-statistic: 705.8 on 1 and 393 DF, p-value: < 2.2e-16
Answer: There is a strong positive relationship between first period grade and third period grade (\(b = 1.11\), \(p < .01\)).
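Since this is a simple regression with one predictor, the strength of the relationship can also be read off the correlation, which equals the square root of the reported R-squared (sqrt(0.6424) is about .80); a sketch:

```r
# Correlation between first and final period grades:
cor(student.math$G1, student.math$G3)
# about 0.80, i.e., the square root of the model's Multiple R-squared
```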
Adding a regression line to a scatterplot
9a. Create a scatterplot showing the relationship between the G1 and G3 grades for the math data.
9b. Add a regression line to the scatterplot from your regression object lm.G3.G1. (Hint: Use the abline() function.)
plot(x = student.math$G1,
y = student.math$G3,
xlab = "first grade G1",
ylab = "final grade G3",
main = "Predicting final grade (G3) by first grade (G1)"
)
abline(lm.G3.G1, col = "steelblue3", lty = 2, lwd = 2) # regression line
Checkpoint 2
If you got this far you’re doing very well, but don’t give up just yet…
Multiple IVs
10a. For the math data, create a regression object called lm.G3.mult predicting third period grade (G3) based on several other variables: sex, age, internet, and failures.
table(student.math$sex)
#>
#> F M
#> 208 187
lm.G3.mult <- lm(G3 ~ sex + age + internet + failures, data = student.math)
10b. How do you interpret the regression output? Which of the variables are significantly related to third period grade G3?
summary(lm.G3.mult)
#>
#> Call:
#> lm(formula = G3 ~ sex + age + internet + failures, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.2156 -1.9523 0.0965 3.0252 9.4370
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 13.9962 2.9808 4.695 3.69e-06 ***
#> sexM 1.0451 0.4282 2.441 0.0151 *
#> age -0.2407 0.1735 -1.388 0.1660
#> internetyes 0.7855 0.5761 1.364 0.1735
#> failures -2.1260 0.2966 -7.167 3.86e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.237 on 390 degrees of freedom
#> Multiple R-squared: 0.1533, Adjusted R-squared: 0.1446
#> F-statistic: 17.65 on 4 and 390 DF, p-value: 2.488e-13
Answer: A student’s gender and number of previous failures significantly predict third period grade: Male students perform better than female students, and the more past failures a student has the lower his or her grade.
10c. Create a new regression object called lm.G3.mult2 using the same variables as in Exercise 10a. (where you predicted third period grade G3 based on sex, age, internet, and failures), but now use the Portuguese dataset to fit the model.
lm.G3.mult2 <- lm(G3 ~ sex + age + internet + failures, data = student.por)
10d. What are the key differences between the beta values for the Portuguese dataset (lm.G3.mult2 from Exercise 10c.) and the math dataset (lm.G3.mult from Exercise 10a.)?
summary(lm.G3.mult2)
#>
#> Call:
#> lm(formula = G3 ~ sex + age + internet + failures, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.8941 -1.8345 0.0522 1.8807 7.8041
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11.61020 1.68101 6.907 1.19e-11 ***
#> sexM -0.71515 0.23625 -3.027 0.002568 **
#> age 0.01986 0.10031 0.198 0.843134
#> internetyes 0.92639 0.27508 3.368 0.000803 ***
#> failures -2.04819 0.20738 -9.877 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.936 on 644 degrees of freedom
#> Multiple R-squared: 0.1794, Adjusted R-squared: 0.1743
#> F-statistic: 35.19 on 4 and 644 DF, p-value: < 2.2e-16
Answer: In the Portuguese data set, male students do worse than female students, and having internet access actually helps performance. As before, more past failures coincide with lower grades.
Predicting values
11a. For the math dataset, create a regression object called lm.G1.all predicting a student’s first period grade (G1) based on all other variables in the dataset. (Hint: Use the notation formula = y ~ . to include all variables.)
lm.G1.all <- lm(G1 ~ ., data = student.math)
summary(lm.G1.all)
#>
#> Call:
#> lm(formula = G1 ~ ., data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.8473 -0.8875 0.0117 0.9788 6.9754
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.858e+00 1.781e+00 1.043 0.29771
#> schoolMS -2.294e-01 3.093e-01 -0.742 0.45880
#> sexM 1.875e-01 1.967e-01 0.953 0.34111
#> age 9.053e-02 8.512e-02 1.064 0.28826
#> addressU -1.679e-01 2.280e-01 -0.737 0.46188
#> famsizeLE3 -7.347e-03 1.910e-01 -0.038 0.96933
#> PstatusT 3.268e-01 2.823e-01 1.158 0.24776
#> Medu -1.216e-01 1.263e-01 -0.963 0.33629
#> Fedu 1.565e-01 1.083e-01 1.445 0.14944
#> Mjobhealth 2.252e-01 4.367e-01 0.516 0.60637
#> Mjobother -5.673e-01 2.781e-01 -2.040 0.04213 *
#> Mjobservices 7.223e-02 3.113e-01 0.232 0.81668
#> Mjobteacher -1.475e-01 4.057e-01 -0.364 0.71636
#> Fjobhealth -6.709e-01 5.606e-01 -1.197 0.23225
#> Fjobother -8.636e-01 3.991e-01 -2.164 0.03113 *
#> [ reached getOption("max.print") -- omitted 27 rows ]
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.601 on 353 degrees of freedom
#> Multiple R-squared: 0.7914, Adjusted R-squared: 0.7672
#> F-statistic: 32.67 on 41 and 353 DF, p-value: < 2.2e-16
11b. Save the fitted values from the lm.G1.all object as a vector called lm.G1.all.fitted. (Hint: A model’s fitted values are contained in a vector called model$fitted.values.)
lm.G1.all.fitted <- lm.G1.all$fitted.values
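The same values can also be obtained via predict() on the model object, which by default returns the fitted values for the original data (a sketch):

```r
# predict() without a newdata argument returns the model's fitted values:
all.equal(lm.G1.all.fitted, predict(lm.G1.all))  # should be TRUE
```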
11c. For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?
plot(x = student.math$G1,
y = lm.G1.all.fitted,
xlim = c(0, max(student.math$G1)),
ylim = c(0, max(lm.G1.all.fitted)),
xlab = "first grades (G1)",
ylab = "fitted first grades",
main = "Evaluating model fit"
)
abline(a = 0, b = 1, col = "steelblue3", lty = 2, lwd = 2) # diagonal: perfect fit
Answer: No, the model doesn’t seem to fit very well, as the fitted values substantially deviate from the diagonal (representing the actual values).
Checkpoint 3
If you got this far you’re doing a terrific job — well done! Enjoy the following challenge…
C. Challenge
Showing model parsimony
12a. Someone claims that a student’s final grade G3 in Portuguese is predicted by three independent variables: the number of past class failures, the desire to pursue higher education (higher), and extra educational support (schoolsup). Verify this hypothesis by testing an appropriate linear model.
lm.G3.3 <- lm(G3 ~ failures + higher + schoolsup, data = student.por)
summary(lm.G3.3)
#>
#> Call:
#> lm(formula = G3 ~ failures + higher + schoolsup, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.6559 -1.6559 0.2634 1.6040 6.3441
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.1322 0.3777 26.823 < 2e-16 ***
#> failures -1.7363 0.2003 -8.667 < 2e-16 ***
#> higheryes 2.5236 0.3867 6.526 1.36e-10 ***
#> schoolsupyes -0.9192 0.3701 -2.484 0.0132 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.876 on 645 degrees of freedom
#> Multiple R-squared: 0.2112, Adjusted R-squared: 0.2076
#> F-statistic: 57.58 on 3 and 645 DF, p-value: < 2.2e-16
Answer: The model supports the claim, as all three variables are significant predictors of the final grade G3. (However, the schoolsup variable is only weakly predictive: \(b = -.92\), \(p = .013\).)
12b. Create 3 separate linear models (lm.1, lm.2, and lm.3) that predict a student’s final grade G3 in Portuguese by 1, 2, or 3 of the predictors identified in Exercise 12a (starting with the most significant predictor). Then use 3 pairwise ANOVAs to show that none of the predictors is unnecessary (i.e., adding each predictor yields a significant improvement in the fit of the more complex model).
# Model 1: 1 IV
lm.1 <- lm(G3 ~ failures, data = student.por)
summary(lm.1)
#>
#> Call:
#> lm(formula = G3 ~ failures, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.3813 -1.3813 -0.2393 1.6187 6.9026
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.3813 0.1246 99.38 <2e-16 ***
#> failures -2.1419 0.1968 -10.88 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.973 on 647 degrees of freedom
#> Multiple R-squared: 0.1547, Adjusted R-squared: 0.1534
#> F-statistic: 118.4 on 1 and 647 DF, p-value: < 2.2e-16
# Model 2: 2 IVs
lm.2 <- lm(G3 ~ failures + higher, data = student.por)
summary(lm.2)
#>
#> Call:
#> lm(formula = G3 ~ failures + higher, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.5534 -1.5534 0.1963 1.4466 6.4466
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.1157 0.3792 26.677 < 2e-16 ***
#> failures -1.7497 0.2011 -8.702 < 2e-16 ***
#> higheryes 2.4377 0.3867 6.304 5.37e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.887 on 646 degrees of freedom
#> Multiple R-squared: 0.2037, Adjusted R-squared: 0.2012
#> F-statistic: 82.62 on 2 and 646 DF, p-value: < 2.2e-16
# Model 3: 3 IVs
lm.3 <- lm(G3 ~ failures + higher + schoolsup, data = student.por)
summary(lm.3)
#>
#> Call:
#> lm(formula = G3 ~ failures + higher + schoolsup, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.6559 -1.6559 0.2634 1.6040 6.3441
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.1322 0.3777 26.823 < 2e-16 ***
#> failures -1.7363 0.2003 -8.667 < 2e-16 ***
#> higheryes 2.5236 0.3867 6.526 1.36e-10 ***
#> schoolsupyes -0.9192 0.3701 -2.484 0.0132 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.876 on 645 degrees of freedom
#> Multiple R-squared: 0.2112, Adjusted R-squared: 0.2076
#> F-statistic: 57.58 on 3 and 645 DF, p-value: < 2.2e-16
# Compare Model 1 to Model 2:
anova(lm.1, lm.2)
#> Analysis of Variance Table
#>
#> Model 1: G3 ~ failures
#> Model 2: G3 ~ failures + higher
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 647 5717.0
#> 2 646 5385.7 1 331.34 39.744 5.365e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare Model 2 to Model 3:
anova(lm.2, lm.3)
#> Analysis of Variance Table
#>
#> Model 1: G3 ~ failures + higher
#> Model 2: G3 ~ failures + higher + schoolsup
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 646 5385.7
#> 2 645 5334.6 1 51.027 6.1696 0.01325 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare Model 1 to Model 3:
anova(lm.1, lm.3)
#> Analysis of Variance Table
#>
#> Model 1: G3 ~ failures
#> Model 2: G3 ~ failures + higher + schoolsup
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 647 5717.0
#> 2 645 5334.6 2 382.37 23.116 2.016e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The ANOVAs show that each additional predictor yields a significant improvement in model fit. (Note the similarity of the \(p\)-values to those of the coefficients in Exercise 12a.)
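Incidentally, the three pairwise calls can be collapsed into one, as anova() accepts several nested models and compares them sequentially. A minimal self-contained sketch with simulated data (the variable names here are made up, not taken from student.por):

```r
# Simulate a small dataset with two real predictors and one noise variable:
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 * df$x1 + df$x2 + rnorm(100)

# Fit a sequence of nested models:
m1 <- lm(y ~ x1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)
m3 <- lm(y ~ x1 + x2 + x3, data = df)

# One call yields a single table with both sequential comparisons:
anova(m1, m2, m3)
```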
12c. Are the three independent variables from Exercise 12a significant predictors of a student’s final grade G3 when also considering the first and second period grades (G1 and G2) as predictors?
lm.G3.5 <- lm(G3 ~ G1 + G2 + failures + higher + schoolsup, data = student.por)
summary(lm.G3.5)
#>
#> Call:
#> lm(formula = G3 ~ G1 + G2 + failures + higher + schoolsup, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.2097 -0.4459 -0.0911 0.6257 5.7478
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01735 0.26137 0.066 0.947097
#> G1 0.13397 0.03640 3.681 0.000252 ***
#> G2 0.88681 0.03407 26.028 < 2e-16 ***
#> failures -0.19468 0.09269 -2.100 0.036092 *
#> higheryes 0.17927 0.17594 1.019 0.308625
#> schoolsupyes -0.15214 0.16286 -0.934 0.350554
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.258 on 643 degrees of freedom
#> Multiple R-squared: 0.8494, Adjusted R-squared: 0.8483
#> F-statistic: 725.5 on 5 and 643 DF, p-value: < 2.2e-16
Answer: Only the number of past class failures remains a significant (negative) predictor (\(b = -.19\), \(p < .05\)).
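Since the three-predictor model from Exercise 12a (lm.G3.3) is nested within this five-predictor model, a model comparison can also quantify how much the grade predictors add. A sketch, assuming both model objects are still in the workspace:

```r
# Compare the nested models: does adding G1 and G2 improve the fit?
anova(lm.G3.3, lm.G3.5)
```

Given the jump in R-squared from .21 to .85, this comparison should show a highly significant improvement.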
That’s it – hope you enjoyed working on this assignment!
[WPA08_answers.Rmd updated on 2017-12-21 15:13:46 by hn.]