Answers for WPA08 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Use the .Rmd format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats). A simple .Rmd template is provided here. (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)
Commenting your code: Indicate the current assignment (e.g., WPA08), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA08_yymmdd.Rmd could look like this:
---
title: "Your Assignment Title (WPA08)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---
This file contains my solutions to WPA08.
# Exercise 1
To show and run R code in your document, use a code chunk (without the '#' symbols):
# ```{r, exercise_1, echo = TRUE, eval = TRUE}
#
# v <- c(1, 2, 3) # some vector
# sum(v)
#
# ```
More text and code chunks...
[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:
Preparations
0. The following steps prepare the current session by creating a new .Rmd file and compiling it into an .html output file:
0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).
0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA08_yymmdd.Rmd (with an appropriate header) in your project directory.
0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)
library(rmarkdown)
library(knitr)
library(yarrr)
# Store original par() settings:
opar <- par()
# par(opar) # restores original (default) par settings later
0d. Make sure that you can create an .html output file by “knitting” your current document.
# Ok!
1. In this exercise, we will use some fictitious census data that lists some basic demographic variables – the gender, age, iq score, education (in both text and numeric formats), annual income, and self-reported happiness (happy, rated on a scale from 1 to 30) – for 200 quasi-representative respondents.
The data are stored in a comma-separated text file at http://Rpository.com/down/data/WPA08_census.txt.
Exploring data and comparing means
1a. Load the text file containing the data into R and assign it to a new object called census.dat. (Hint: Use the read.table() function with appropriate arguments. Note that the file is comma-delimited and contains a header with variable names.)
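The loading step itself is not shown below; a minimal solution sketch, assuming the comma-delimited file with a header row described in the hint, could look like this:

```r
# Load the comma-delimited census data (with a header row of variable names):
census.dat <- read.table("http://Rpository.com/down/data/WPA08_census.txt",
                         sep = ",",
                         header = TRUE)
```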
Here is how the first few rows of the census data should look:
head(census.dat)
#> gender age iq edu edu.num income happy
#> 1 m 34 122 2.Bachelors 3 55499 6
#> 2 f 86 113 2.Bachelors 3 26631 14
#> 3 m 60 97 1.HighSchool 2 73623 21
#> 4 f 54 105 1.HighSchool 2 61140 18
#> 5 f 46 96 2.Bachelors 3 41420 10
#> 6 m 80 95 4.PhD 5 48190 24
1b. Explore the data to make sure everything loaded and looks ok. (Hint: Use the head(), str(), and summary() functions.)
str(census.dat)
#> 'data.frame': 200 obs. of 7 variables:
#> $ gender : Factor w/ 2 levels "f","m": 2 1 2 1 1 2 1 1 1 2 ...
#> $ age : int 34 86 60 54 46 80 43 38 30 30 ...
#> $ iq : int 122 113 97 105 96 95 92 94 117 99 ...
#> $ edu : Factor w/ 5 levels "0.None","1.HighSchool",..: 3 3 2 2 3 5 3 3 4 4 ...
#> $ edu.num: int 3 3 2 2 3 5 3 3 4 4 ...
#> $ income : int 55499 26631 73623 61140 41420 48190 45500 35620 53420 57891 ...
#> $ happy : int 6 14 21 18 10 24 16 8 7 12 ...
summary(census.dat)
#> gender age iq edu edu.num
#> f: 97 Min. :18.00 Min. : 75.0 0.None :28 Min. :1.000
#> m:103 1st Qu.:34.00 1st Qu.: 94.0 1.HighSchool:49 1st Qu.:2.000
#> Median :51.00 Median :100.0 2.Bachelors :47 Median :3.000
#> Mean :52.21 Mean :100.4 3.Masters :58 Mean :2.945
#> 3rd Qu.:70.25 3rd Qu.:107.0 4.PhD :18 3rd Qu.:4.000
#> Max. :88.00 Max. :126.0 Max. :5.000
#> income happy
#> Min. :22277 Min. : 1.00
#> 1st Qu.:44085 1st Qu.:12.00
#> Median :51198 Median :17.00
#> Mean :51281 Mean :15.98
#> 3rd Qu.:59413 3rd Qu.:21.00
#> Max. :83030 Max. :29.00
1c. Do men and women earn the same or a different level of income on average? (Hint: Answer this question by conducting a t-test on appropriate subsets of the income variable.)
t.test(x = census.dat$income[census.dat$gender == "m"],
y = census.dat$income[census.dat$gender == "f"])
#>
#> Welch Two Sample t-test
#>
#> data: census.dat$income[census.dat$gender == "m"] and census.dat$income[census.dat$gender == "f"]
#> t = 4.3966, df = 194.44, p-value = 1.808e-05
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> 3758.426 9873.495
#> sample estimates:
#> mean of x mean of y
#> 54586.63 47770.67
Answer: Men earn significantly more than women: mean difference = 6815.96, t(194.44) = 4.4, p < .001 (2-tailed).
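The size and sign of this difference can be verified directly from the group means reported by the t-test (a quick sketch):

```r
# Difference of mean incomes (men minus women), matching the t-test estimates:
mean(census.dat$income[census.dat$gender == "m"]) -
  mean(census.dat$income[census.dat$gender == "f"])
# positive value (about 6816): men earn more on average
```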
1d. Do people with different levels of education show different levels of happiness on average? (Hint: Answer this question by conducting an ANOVA with the dependent variable happy and an independent variable that quantifies education (edu.num).)
# ANOVA:
happy.aov <- aov(formula = happy ~ edu.num,
data = census.dat)
summary(happy.aov)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> edu.num 1 35 34.55 1 0.319
#> Residuals 198 6842 34.56
# Plot:
yarrr::pirateplot(formula = happy ~ edu,
data = census.dat,
main = "Effect of education on happiness",
xlab = "education",
ylab = "happiness",
gl.col = "gray",
pal = "espresso"
)
Answer: A person’s level of education shows no systematic relationship with happiness: \(F(1, 198) = 1\), \(p = .32\).
Linear regression with 1 IV
2a. Test the hypothesis that more money makes people happier. (Hint: Conduct a linear regression of happy by income.)
c1.lm <- lm(formula = happy ~ income, data = census.dat)
summary(c1.lm)
#>
#> Call:
#> lm(formula = happy ~ income, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.988 -3.985 1.015 5.012 13.014
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.599e+01 1.907e+00 8.387 9.27e-15 ***
#> income -1.883e-07 3.629e-05 -0.005 0.996
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.893 on 198 degrees of freedom
#> Multiple R-squared: 1.36e-07, Adjusted R-squared: -0.00505
#> F-statistic: 2.693e-05 on 1 and 198 DF, p-value: 0.9959
Answer: We find no systematic relationship between income and happiness: \(p > .99\).
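To see how flat the estimated slope really is, one could also inspect its confidence interval (a sketch using the fitted model from above):

```r
# 95% confidence intervals for the regression coefficients:
confint(c1.lm)
# The interval for income includes 0, consistent with the non-significant slope.
```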
2b. Create a scatterplot that illustrates the relationship between income and happiness and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 2a.)
plot(x = census.dat$income,
y = census.dat$happy,
main = "Happiness by income",
xlab = "Income",
ylab = "Happiness"
)
abline(c1.lm, col = "steelblue3", lty = 1, lwd = 2)
3a. Does a person’s income depend on his or her age?
c2.lm <- lm(formula = income ~ age, data = census.dat)
summary(c2.lm)
#>
#> Call:
#> lm(formula = income ~ age, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -21359.1 -7474.4 -45.2 7117.8 27855.4
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 63093.17 2017.99 31.265 < 2e-16 ***
#> age -226.25 35.92 -6.298 1.91e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10530 on 198 degrees of freedom
#> Multiple R-squared: 0.1669, Adjusted R-squared: 0.1627
#> F-statistic: 39.67 on 1 and 198 DF, p-value: 1.907e-09
3b. Create a scatterplot that illustrates the relationship between income and age and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 3a.)
# Scatterplot:
plot(x = census.dat$age,
y = census.dat$income,
ylim = c(0, 90000),
main = "Income by age",
xlab = "Age",
ylab = "Income"
)
abline(c2.lm, col = "steelblue3", lty = 1, lwd = 2) # regression line
3c. Create 2 separate linear regressions to re-examine the influence of age on income for people up to 60 years vs. people older than 60 years. What do you conclude?
# (a) lm for people up to 60:
c3a.lm <- lm(formula = income ~ age,
data = subset(census.dat[census.dat$age < 61, ])
)
summary(c3a.lm)
#>
#> Call:
#> lm(formula = income ~ age, data = subset(census.dat[census.dat$age <
#> 61, ]))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -20211.3 -6812.5 751.3 6644.3 28529.1
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 47561.72 2965.95 16.036 < 2e-16 ***
#> age 198.26 74.56 2.659 0.00889 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9602 on 121 degrees of freedom
#> Multiple R-squared: 0.05522, Adjusted R-squared: 0.04741
#> F-statistic: 7.072 on 1 and 121 DF, p-value: 0.008892
# Alternative version:
c3a2.lm <- lm(formula = income ~ age,
data = census.dat,
subset = age < 61)
all.equal(c3a.lm$coefficients, c3a2.lm$coefficients) # test equality
#> [1] TRUE
# (b) lm for people above 60:
c3b.lm <- lm(formula = income ~ age,
data = subset(census.dat[census.dat$age > 60, ])
)
summary(c3b.lm)
#>
#> Call:
#> lm(formula = income ~ age, data = subset(census.dat[census.dat$age >
#> 60, ]))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14588.0 -5019.3 -473.2 3908.8 20449.7
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 121450.0 7842.8 15.485 < 2e-16 ***
#> age -1019.3 104.2 -9.786 4.83e-15 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7611 on 75 degrees of freedom
#> Multiple R-squared: 0.5608, Adjusted R-squared: 0.5549
#> F-statistic: 95.76 on 1 and 75 DF, p-value: 4.833e-15
# Alternative version:
c3b2.lm <- lm(formula = income ~ age,
data = census.dat,
subset = age > 60)
all.equal(c3b.lm$coefficients, c3b2.lm$coefficients) # test equality
#> [1] TRUE
Answer: Both relationships are significant, but in different directions: For people up to 60 years old, there is a positive relationship between age and income (\(b = 198.3\), \(p < .01\)). By contrast, for people older than 60 the relationship between age and income is stronger and negative (\(b = -1019.3\), \(p < .01\)). Overall, the relationship between age and income seems curvilinear, as illustrated by the following plot:
# Scatterplot:
plot(x = census.dat$age,
y = census.dat$income,
ylim = c(0, 90000),
main = "Income by age",
xlab = "Age",
ylab = "Income"
)
# Draw some lines:
abline(c2.lm, col = "steelblue3", lty = 1, lwd = 2) # regression line (full model from above)
abline(v = 61, col = "orange3", lty = 3) # vertical line
abline(c3a.lm, col = "green3", lty = 2, lwd = 2) # model for subset: age < 61
abline(c3b.lm, col = "red3", lty = 2, lwd = 2) # model for subset: age > 60
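The curvilinear pattern suggested by the two subset models could also be tested directly by adding a quadratic age term to the full model (a sketch, not required by the exercise):

```r
# Quadratic regression of income on age (I() protects the squared term in a formula):
c2q.lm <- lm(income ~ age + I(age^2), data = census.dat)
summary(c2q.lm)
# A significant negative coefficient on I(age^2) would indicate an inverted U-shape.
```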
Linear regression with multiple IVs
4a. Besides income, does someone’s happiness also depend on the person’s age, gender, IQ score, and education level? (Hint: Conduct a linear regression of happy by gender, age, iq, edu.num, and income to find out.)
c3.lm <- lm(formula = happy ~ gender + age + iq + edu.num + income,
data = census.dat)
summary(c3.lm)
#>
#> Call:
#> lm(formula = happy ~ gender + age + iq + edu.num + income, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.4380 -1.3659 0.0196 1.2650 5.6033
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.433e+01 1.876e+00 7.642 9.58e-13 ***
#> genderm -9.767e-01 3.276e-01 -2.981 0.00324 **
#> age 2.676e-01 8.277e-03 32.334 < 2e-16 ***
#> iq -2.334e-01 1.544e-02 -15.114 < 2e-16 ***
#> edu.num 1.146e-01 1.278e-01 0.897 0.37093
#> income 2.197e-04 1.558e-05 14.098 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.16 on 194 degrees of freedom
#> Multiple R-squared: 0.8683, Adjusted R-squared: 0.8649
#> F-statistic: 255.9 on 5 and 194 DF, p-value: < 2.2e-16
Answer: Happiness now seems to be significantly predicted by gender (women are happier), age (older persons are happier), iq (people with lower IQ values are happier), and income (people with higher income are happier).
4b. On which of the other predictor variables does someone’s income depend?
c4.lm <- lm(formula = income ~ gender + age + iq + edu.num + happy,
data = census.dat)
summary(c4.lm)
#>
#> Call:
#> lm(formula = income ~ gender + age + iq + edu.num + happy, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17837 -4398 -230 4729 17272
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6968.94 6909.87 -1.009 0.314
#> genderm 5872.76 999.67 5.875 1.81e-08 ***
#> age -732.54 42.70 -17.157 < 2e-16 ***
#> iq 577.57 61.02 9.466 < 2e-16 ***
#> edu.num -449.43 413.46 -1.087 0.278
#> happy 2303.50 163.39 14.098 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6996 on 194 degrees of freedom
#> Multiple R-squared: 0.64, Adjusted R-squared: 0.6307
#> F-statistic: 68.98 on 5 and 194 DF, p-value: < 2.2e-16
Answer: Significant predictors of income are gender (males have higher incomes), age (older people have lower incomes), iq (higher IQ values imply higher incomes), and happiness (happier people have higher incomes).
4c. Is the effect of income on happiness still significant when already accounting for the effects of the other systematic predictors? (Hint: Conduct 2 linear regression models on happy: one with all significant predictors excluding income and one with all significant predictors including income. Then use an ANOVA – via the anova() function – to contrast both models.)
# Model of significant predictors only:
c5a.lm <- lm(formula = happy ~ gender + age + iq + income,
data = census.dat)
summary(c5a.lm)
#>
#> Call:
#> lm(formula = happy ~ gender + age + iq + income, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.5455 -1.4779 -0.0165 1.3040 5.5916
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.478e+01 1.807e+00 8.180 3.59e-14 ***
#> genderm -9.699e-01 3.274e-01 -2.963 0.00343 **
#> age 2.667e-01 8.209e-03 32.488 < 2e-16 ***
#> iq -2.337e-01 1.543e-02 -15.147 < 2e-16 ***
#> income 2.191e-04 1.556e-05 14.079 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.159 on 195 degrees of freedom
#> Multiple R-squared: 0.8678, Adjusted R-squared: 0.8651
#> F-statistic: 320 on 4 and 195 DF, p-value: < 2.2e-16
# Model of significant predictors of happiness EXCLUDING income:
c5b.lm <- lm(formula = happy ~ gender + age + iq,
data = census.dat)
summary(c5b.lm)
#>
#> Call:
#> lm(formula = happy ~ gender + age + iq, data = census.dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.1301 -2.2784 0.1363 2.1347 6.9248
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 26.03827 2.29567 11.342 <2e-16 ***
#> genderm 0.63534 0.43469 1.462 0.145
#> age 0.21577 0.01044 20.672 <2e-16 ***
#> iq -0.21565 0.02178 -9.903 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.059 on 196 degrees of freedom
#> Multiple R-squared: 0.7334, Adjusted R-squared: 0.7293
#> F-statistic: 179.7 on 3 and 196 DF, p-value: < 2.2e-16
# Note that gender is no longer significant!
# Compare the 2 models:
anova(c5a.lm, c5b.lm)
#> Analysis of Variance Table
#>
#> Model 1: happy ~ gender + age + iq + income
#> Model 2: happy ~ gender + age + iq
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 195 909.26
#> 2 196 1833.53 -1 -924.27 198.22 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: Yes, income remains a significant predictor of happiness when already accounting for the effects of gender, age, and iq.
Checkpoint 1
At this point you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please carry on…
B. At Home
Studying student performance
In this part, we will analyze a real data set from a study on student performance in two schools and two subjects (Math and Portuguese). The two data files come from the UCI Machine Learning repository at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two tab-delimited text files at:
http://Rpository.com/down/data/WPA08_studentmath.txt (the Math data), and
http://Rpository.com/down/data/WPA08_studentpor.txt (the Portuguese data).
Data description
Both data files contain 33 columns. Here are their definitions:
- school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
- sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
- age - student’s age (numeric: from 15 to 22)
- address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
- famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
- Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
- Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
- Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - in a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)
- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)
Loading and exploring data
5a. Load the two text files containing the data into R and assign them to new objects called student.math and student.por, respectively. (Hint: Use the read.table() function with appropriate arguments, given that both files are tab-delimited and contain a header row with variable names.)
student.math <- read.table("http://Rpository.com/down/data/WPA08_studentmath.txt",
sep = "\t",
header = TRUE
)
student.por <- read.table("http://Rpository.com/down/data/WPA08_studentpor.txt",
sep = "\t",
header = TRUE
)
Here is how the first few rows of the math data should look:
head(student.math)
#> school sex age address famsize Pstatus Medu Fedu Mjob Fjob
#> 1 GP F 18 U GT3 A 4 4 at_home teacher
#> 2 GP F 17 U GT3 T 1 1 at_home other
#> reason guardian traveltime studytime failures schoolsup famsup paid
#> 1 course mother 2 2 0 yes no no
#> 2 course father 1 2 0 no yes no
#> activities nursery higher internet romantic famrel freetime goout Dalc
#> 1 no yes yes no no 4 3 4 1
#> 2 no no yes yes no 5 3 3 1
#> Walc health absences G1 G2 G3
#> 1 1 3 6 5 6 6
#> 2 1 3 4 5 5 6
#> [ reached getOption("max.print") -- omitted 4 rows ]
5b. Look at the first few rows of both data frames with the head() function to make sure they were imported correctly.
# head(student.math) # (shown above)
head(student.por)
#> school sex age address famsize Pstatus Medu Fedu Mjob Fjob
#> 1 GP F 18 U GT3 A 4 4 at_home teacher
#> 2 GP F 17 U GT3 T 1 1 at_home other
#> reason guardian traveltime studytime failures schoolsup famsup paid
#> 1 course mother 2 2 0 yes no no
#> 2 course father 1 2 0 no yes no
#> activities nursery higher internet romantic famrel freetime goout Dalc
#> 1 no yes yes no no 4 3 4 1
#> 2 no no yes yes no 5 3 3 1
#> Walc health absences G1 G2 G3
#> 1 1 3 4 0 11 11
#> 2 1 3 2 9 11 11
#> [ reached getOption("max.print") -- omitted 4 rows ]
5c. Using the str() function, look at summary statistics for each column in the data frame. There should be 33 columns in each dataset. Make sure everything looks ok.
# str(student.math)
# str(student.por)
# Looks ok.
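A quick way to verify the promised 33 columns in both data frames (a sketch):

```r
# Check the dimensions of both datasets:
dim(student.math)  # rows and columns of the Math data
dim(student.por)   # rows and columns of the Portuguese data
ncol(student.math) == 33 && ncol(student.por) == 33  # both should have 33 columns
```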
Standard Regression with lm()
One IV
6a. For the math data, create a regression object called lm.G1.age predicting first period grade (G1) based on age.
lm.G1.age <- lm(G1 ~ age, data = student.math)
6b. How do you interpret the relationship between age and the first period grade G1?
summary(lm.G1.age)
#>
#> Call:
#> lm(formula = G1 ~ age, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.6915 -2.7749 -0.1916 2.3085 8.3085
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 13.6919 2.1926 6.245 1.1e-09 ***
#> age -0.1667 0.1309 -1.273 0.204
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.317 on 393 degrees of freedom
#> Multiple R-squared: 0.004106, Adjusted R-squared: 0.001572
#> F-statistic: 1.62 on 1 and 393 DF, p-value: 0.2038
Answer: There is a slight negative relationship between age and first period grade (\(b = -0.17\)), but the relationship is not significant (\(p = .20\)).
7a. For the math data, create a regression object called lm.G1.abs predicting first period grade (G1) based on absences.
lm.G1.abs <- lm(G1 ~ absences, data = student.math)
7b. How do you interpret the relationship between absences and the first period grade G1?
summary(lm.G1.abs)
#>
#> Call:
#> lm(formula = G1 ~ absences, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7.8794 -2.9115 0.0177 2.2363 8.0692
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.98227 0.20539 53.470 <2e-16 ***
#> absences -0.01286 0.02091 -0.615 0.539
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.322 on 393 degrees of freedom
#> Multiple R-squared: 0.0009612, Adjusted R-squared: -0.001581
#> F-statistic: 0.3781 on 1 and 393 DF, p-value: 0.539
Answer: There is a very small negative relationship between absences and first period grade (\(b = -0.01\)), but the relationship is not significant (\(p = .54\)).
8a. For the math data, create a regression object called lm.G3.G1 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1).
lm.G3.G1 <- lm(G3 ~ G1, data = student.math)
8b. How do you interpret the relationship between the first and final period grades (G1 and G3)?
summary(lm.G3.G1)
#>
#> Call:
#> lm(formula = G3 ~ G1, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11.6223 -0.8348 0.3777 1.6965 5.0153
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.65280 0.47475 -3.481 0.000555 ***
#> G1 1.10626 0.04164 26.568 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.743 on 393 degrees of freedom
#> Multiple R-squared: 0.6424, Adjusted R-squared: 0.6414
#> F-statistic: 705.8 on 1 and 393 DF, p-value: < 2.2e-16
Answer: There is a strong positive relationship between first period grade and third period grade (\(b = 1.11\), \(p < .01\)).
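Since this is a simple regression with one predictor, the strength of the relationship can also be read off the correlation, which equals the square root of the reported R-squared (sqrt(0.6424) is about .80); a sketch:

```r
# Correlation between first and final period grades:
cor(student.math$G1, student.math$G3)
# about 0.80, i.e., the square root of the model's Multiple R-squared
```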
Adding a regression line to a scatterplot
9a. Create a scatterplot showing the relationship between the G1 and G3 grades for the math data.
9b. Add a regression line to the scatterplot from your regression object lm.G3.G1. (Hint: Use the abline() function.)
plot(x = student.math$G1,
y = student.math$G3,
xlab = "first grade G1",
ylab = "final grade G3",
main = "Predicting final grade (G3) by first grade (G1)"
)
abline(lm.G3.G1, col = "steelblue3", lty = 2, lwd = 2) # regression line
Checkpoint 2
If you got this far you’re doing very well, but don’t give up just yet…
Multiple IVs
10a. For the math data, create a regression object called lm.G3.mult predicting third period grade (G3) based on several other variables: sex, age, internet, and failures.
table(student.math$sex)
#>
#> F M
#> 208 187
lm.G3.mult <- lm(G3 ~ sex + age + internet + failures, data = student.math)
10b. How do you interpret the regression output? Which of the variables are significantly related to third period grade G3?
summary(lm.G3.mult)
#>
#> Call:
#> lm(formula = G3 ~ sex + age + internet + failures, data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.2156 -1.9523 0.0965 3.0252 9.4370
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 13.9962 2.9808 4.695 3.69e-06 ***
#> sexM 1.0451 0.4282 2.441 0.0151 *
#> age -0.2407 0.1735 -1.388 0.1660
#> internetyes 0.7855 0.5761 1.364 0.1735
#> failures -2.1260 0.2966 -7.167 3.86e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.237 on 390 degrees of freedom
#> Multiple R-squared: 0.1533, Adjusted R-squared: 0.1446
#> F-statistic: 17.65 on 4 and 390 DF, p-value: 2.488e-13
Answer: A student’s gender and number of previous failures significantly predict third period grade: Male students perform better than female students, and the more past failures a student has the lower his or her grade.
10c. Create a new regression object called lm.G3.mult2 using the same variables as in Exercise 10a. (where you predicted third period grade G3 based on sex, age, internet, and failures), but now use the Portuguese dataset to fit the model.
lm.G3.mult2 <- lm(G3 ~ sex + age + internet + failures, data = student.por)
10d. What are the key differences between the beta values for the Portuguese dataset (lm.G3.mult2 from Exercise 10c.) and the math dataset (lm.G3.mult from Exercise 10a.)?
summary(lm.G3.mult2)
#>
#> Call:
#> lm(formula = G3 ~ sex + age + internet + failures, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.8941 -1.8345 0.0522 1.8807 7.8041
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 11.61020 1.68101 6.907 1.19e-11 ***
#> sexM -0.71515 0.23625 -3.027 0.002568 **
#> age 0.01986 0.10031 0.198 0.843134
#> internetyes 0.92639 0.27508 3.368 0.000803 ***
#> failures -2.04819 0.20738 -9.877 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.936 on 644 degrees of freedom
#> Multiple R-squared: 0.1794, Adjusted R-squared: 0.1743
#> F-statistic: 35.19 on 4 and 644 DF, p-value: < 2.2e-16
Answer: In the Portuguese data set, male students do worse than female students, and having internet access actually helps performance. As before, more past failures coincide with lower grades.
Predicting values
11a. For the math dataset, create a regression object called lm.G1.all predicting a student’s first period grade (G1) based on all other variables in the dataset. (Hint: Use the notation formula = y ~ . to include all variables.)
lm.G1.all <- lm(G1 ~ ., data = student.math)
summary(lm.G1.all)
#>
#> Call:
#> lm(formula = G1 ~ ., data = student.math)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.8473 -0.8875 0.0117 0.9788 6.9754
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.858e+00 1.781e+00 1.043 0.29771
#> schoolMS -2.294e-01 3.093e-01 -0.742 0.45880
#> sexM 1.875e-01 1.967e-01 0.953 0.34111
#> age 9.053e-02 8.512e-02 1.064 0.28826
#> addressU -1.679e-01 2.280e-01 -0.737 0.46188
#> famsizeLE3 -7.347e-03 1.910e-01 -0.038 0.96933
#> PstatusT 3.268e-01 2.823e-01 1.158 0.24776
#> Medu -1.216e-01 1.263e-01 -0.963 0.33629
#> Fedu 1.565e-01 1.083e-01 1.445 0.14944
#> Mjobhealth 2.252e-01 4.367e-01 0.516 0.60637
#> Mjobother -5.673e-01 2.781e-01 -2.040 0.04213 *
#> Mjobservices 7.223e-02 3.113e-01 0.232 0.81668
#> Mjobteacher -1.475e-01 4.057e-01 -0.364 0.71636
#> Fjobhealth -6.709e-01 5.606e-01 -1.197 0.23225
#> Fjobother -8.636e-01 3.991e-01 -2.164 0.03113 *
#> [ reached getOption("max.print") -- omitted 27 rows ]
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.601 on 353 degrees of freedom
#> Multiple R-squared: 0.7914, Adjusted R-squared: 0.7672
#> F-statistic: 32.67 on 41 and 353 DF, p-value: < 2.2e-16
11b. Save the fitted values from the lm.G1.all object as a vector called lm.G1.all.fitted. (Hint: A model’s fitted values are contained in a vector called model$fitted.values.)
lm.G1.all.fitted <- lm.G1.all$fitted.values
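The same values can also be obtained via predict() on the model object, which by default returns the fitted values for the original data (a sketch):

```r
# predict() without a newdata argument returns the model's fitted values:
all.equal(lm.G1.all.fitted, predict(lm.G1.all))  # should be TRUE
```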
11c. For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?
plot(x = student.math$G1,
y = lm.G1.all.fitted,
xlim = c(0, max(student.math$G1)),
ylim = c(0, max(lm.G1.all.fitted)),
xlab = "first grades (G1)",
ylab = "fitted first grades",
main = "Evaluating model fit"
)
abline(a = 0, b = 1, col = "steelblue3", lty = 2, lwd = 2) # diagonal: perfect fit
Answer: No, the model doesn’t seem to fit very well, as the fitted values substantially deviate from the diagonal (representing the actual values).
Checkpoint 3
If you got this far you’re doing a terrific job — well done! Enjoy the following challenge…
C. Challenge
Showing model parsimony
12a. Someone claims that a student’s final grade G3 in Portuguese is predicted by three independent variables: the number of past class failures, the desire to pursue higher education (higher), and extra educational support (schoolsup). Verify this hypothesis by testing an appropriate linear model.
lm.G3.3 <- lm(G3 ~ failures + higher + schoolsup, data = student.por)
summary(lm.G3.3)
#>
#> Call:
#> lm(formula = G3 ~ failures + higher + schoolsup, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.6559 -1.6559 0.2634 1.6040 6.3441
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.1322 0.3777 26.823 < 2e-16 ***
#> failures -1.7363 0.2003 -8.667 < 2e-16 ***
#> higheryes 2.5236 0.3867 6.526 1.36e-10 ***
#> schoolsupyes -0.9192 0.3701 -2.484 0.0132 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.876 on 645 degrees of freedom
#> Multiple R-squared: 0.2112, Adjusted R-squared: 0.2076
#> F-statistic: 57.58 on 3 and 645 DF, p-value: < 2.2e-16
Answer: The model supports the claim, as all three variables are significant predictors of the final grade G3. (However, the schoolsup variable is only weakly predictive: \(b = -.92\), \(p = .013\).)
12b. Create 3 separate linear models (lm.1, lm.2, and lm.3) that predict a student’s final grade G3 in Portuguese by 1, 2, or 3 of the predictors identified in Exercise 12a (starting with the most significant predictor). Then use 3 pairwise ANOVAs to show that none of the predictors is unnecessary (i.e., adding each predictor yields a significant improvement in the fit of the more complex model).
# Model 1: 1 IV
lm.1 <- lm(G3 ~ failures, data = student.por)
summary(lm.1)
#>
#> Call:
#> lm(formula = G3 ~ failures, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.3813 -1.3813 -0.2393 1.6187 6.9026
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 12.3813 0.1246 99.38 <2e-16 ***
#> failures -2.1419 0.1968 -10.88 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.973 on 647 degrees of freedom
#> Multiple R-squared: 0.1547, Adjusted R-squared: 0.1534
#> F-statistic: 118.4 on 1 and 647 DF, p-value: < 2.2e-16
# Model 2: 2 IVs
lm.2 <- lm(G3 ~ failures + higher, data = student.por)
summary(lm.2)
#>
#> Call:
#> lm(formula = G3 ~ failures + higher, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.5534 -1.5534 0.1963 1.4466 6.4466
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.1157 0.3792 26.677 < 2e-16 ***
#> failures -1.7497 0.2011 -8.702 < 2e-16 ***
#> higheryes 2.4377 0.3867 6.304 5.37e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.887 on 646 degrees of freedom
#> Multiple R-squared: 0.2037, Adjusted R-squared: 0.2012
#> F-statistic: 82.62 on 2 and 646 DF, p-value: < 2.2e-16
# Model 3: 3 IVs
lm.3 <- lm(G3 ~ failures + higher + schoolsup, data = student.por)
summary(lm.3)
#>
#> Call:
#> lm(formula = G3 ~ failures + higher + schoolsup, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.6559 -1.6559 0.2634 1.6040 6.3441
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10.1322 0.3777 26.823 < 2e-16 ***
#> failures -1.7363 0.2003 -8.667 < 2e-16 ***
#> higheryes 2.5236 0.3867 6.526 1.36e-10 ***
#> schoolsupyes -0.9192 0.3701 -2.484 0.0132 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.876 on 645 degrees of freedom
#> Multiple R-squared: 0.2112, Adjusted R-squared: 0.2076
#> F-statistic: 57.58 on 3 and 645 DF, p-value: < 2.2e-16
# Compare Model 1 to Model 2:
anova(lm.1, lm.2)
#> Analysis of Variance Table
#>
#> Model 1: G3 ~ failures
#> Model 2: G3 ~ failures + higher
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 647 5717.0
#> 2 646 5385.7 1 331.34 39.744 5.365e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare Model 2 to Model 3:
anova(lm.2, lm.3)
#> Analysis of Variance Table
#>
#> Model 1: G3 ~ failures + higher
#> Model 2: G3 ~ failures + higher + schoolsup
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 646 5385.7
#> 2 645 5334.6 1 51.027 6.1696 0.01325 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Compare Model 1 to Model 3:
anova(lm.1, lm.3)
#> Analysis of Variance Table
#>
#> Model 1: G3 ~ failures
#> Model 2: G3 ~ failures + higher + schoolsup
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 647 5717.0
#> 2 645 5334.6 2 382.37 23.116 2.016e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The ANOVAs show that each additional predictor yields a significant improvement in model fit. (Note the similarity of the \(p\)-values to those of the coefficients in Exercise 12a.)
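Incidentally, the three pairwise calls can be collapsed into one, as anova() accepts several nested models and compares them sequentially. A minimal self-contained sketch with simulated data (the variable names here are made up, not taken from student.por):

```r
# Simulate a small dataset with two real predictors and one noise variable:
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 * df$x1 + df$x2 + rnorm(100)

# Fit a sequence of nested models:
m1 <- lm(y ~ x1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)
m3 <- lm(y ~ x1 + x2 + x3, data = df)

# One call yields a single table with both sequential comparisons:
anova(m1, m2, m3)
```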
12c. Are the three independent variables from Exercise 12a significant predictors of a student’s final grade G3 when also considering the first and second period grades (G1 and G2) as predictors?
lm.G3.5 <- lm(G3 ~ G1 + G2 + failures + higher + schoolsup, data = student.por)
summary(lm.G3.5)
#>
#> Call:
#> lm(formula = G3 ~ G1 + G2 + failures + higher + schoolsup, data = student.por)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.2097 -0.4459 -0.0911 0.6257 5.7478
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01735 0.26137 0.066 0.947097
#> G1 0.13397 0.03640 3.681 0.000252 ***
#> G2 0.88681 0.03407 26.028 < 2e-16 ***
#> failures -0.19468 0.09269 -2.100 0.036092 *
#> higheryes 0.17927 0.17594 1.019 0.308625
#> schoolsupyes -0.15214 0.16286 -0.934 0.350554
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.258 on 643 degrees of freedom
#> Multiple R-squared: 0.8494, Adjusted R-squared: 0.8483
#> F-statistic: 725.5 on 5 and 643 DF, p-value: < 2.2e-16
Answer: Only the number of past class failures remains a significant (negative) predictor (\(b = -.19\), \(p < .05\)).
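Since the three-predictor model from Exercise 12a (lm.G3.3) is nested within this five-predictor model, a model comparison can also quantify how much the grade predictors add. A sketch, assuming both model objects are still in the workspace:

```r
# Compare the nested models: does adding G1 and G2 improve the fit?
anova(lm.G3.3, lm.G3.5)
```

Given the jump in R-squared from .21 to .85, this comparison should show a highly significant improvement.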
That’s it – hope you enjoyed working on this assignment!
[WPA08_answers.Rmd updated on 2017-12-21 15:13:46 by hn.]