WPA08 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.
To complete and submit these exercises, please remember and do the following:
Use the .Rmd format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats). A simple .Rmd template is provided here. (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)
Commenting your code: Indicate the current assignment (e.g., WPA08), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA08_yymmdd.Rmd could look like this:
---
title: "Your Assignment Title (WPA08)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---
This file contains my solutions to WPA08.
# Exercise 1
To show and run R code in your document, use a code chunk (without the '#' symbols):
# ```{r, exercise_1, echo = TRUE, eval = TRUE}
#
# v <- c(1, 2, 3) # some vector
# sum(v)
#
# ```
More text and code chunks...
[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->
Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.
A. In Class
Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:
Preparations
0. The following steps prepare the current session by creating a new .Rmd file and compiling it into an .html output file:
0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).
0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA08_yymmdd.Rmd (with an appropriate header) in your project directory.
0c. Insert a code chunk and load the rmarkdown, knitr, and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)
library(rmarkdown)
library(knitr)
library(yarrr)
# Store original par() settings (only those that can be reset):
opar <- par(no.readonly = TRUE)
# par(opar) # restores original (default) par settings later
0d. Make sure that you can create an .html output file by “knitting” your current document.
# Ok!
1. In this exercise, we will use some fictitious census data that lists some basic demographic variables for 200 quasi-representative respondents: gender, age, iq score, education (in both text and numeric formats), annual income, and self-reported happiness (happy, rated on a scale from 1 to 30).
The data are stored in a comma-separated text file at http://Rpository.com/down/data/WPA08_census.txt.
Exploring data and comparing means
1a. Load the text file containing the data into R and assign it to a new object called census.dat. (Hint: Use the read.table() function with appropriate arguments. Note that the file is comma-delimited and contains a header with variable names.)
Here is what the first few rows of the census data should look like:
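The hint above can be illustrated with a small, self-contained sketch. It does not use the census file: the temporary file and its three made-up rows (and the object name demo.dat) are assumptions for illustration only.

```r
# Write a tiny comma-delimited file with a header row (made-up values):
tmp <- tempfile(fileext = ".txt")
writeLines(c("id,group,score",
             "1,a,10",
             "2,b,12",
             "3,a,9"),
           con = tmp)

# header = TRUE : the first line contains the variable names
# sep = ","     : fields are separated by commas
demo.dat <- read.table(file = tmp, header = TRUE, sep = ",")

head(demo.dat)  # first rows
str(demo.dat)   # column types
```

Note that the file argument of read.table() can also be a complete URL, so the same call pattern should work for a file stored online.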
head(census.dat)
#> gender age iq edu edu.num income happy
#> 1 m 34 122 2.Bachelors 3 55499 6
#> 2 f 86 113 2.Bachelors 3 26631 14
#> 3 m 60 97 1.HighSchool 2 73623 21
#> 4 f 54 105 1.HighSchool 2 61140 18
#> 5 f 46 96 2.Bachelors 3 41420 10
#> 6 m 80 95 4.PhD 5 48190 24
1b. Explore the data to make sure everything loaded and looks ok. (Hint: Use the head(), str(), and summary() functions.)
1c. Do men and women earn the same or a different level of income on average? (Hint: Answer this question by conducting a t-test on appropriate subsets of the income variable.)
Answer: (…)
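As a sketch of the approach suggested in the hint for Exercise 1c, the following example compares two group means with t.test(). It uses R's built-in mtcars data (an assumption made here so that the example runs without the census file), with mpg as the outcome and am as the grouping variable.

```r
# Split the outcome variable into two subsets by a grouping variable:
mpg.auto   <- mtcars$mpg[mtcars$am == 0]  # cars with automatic transmission
mpg.manual <- mtcars$mpg[mtcars$am == 1]  # cars with manual transmission

# Two-sample t-test (R uses Welch's unequal-variance version by default):
tt <- t.test(x = mpg.auto, y = mpg.manual)
tt

# The same test could be written with the formula interface:
# t.test(mpg ~ am, data = mtcars)
```

The printed output reports the two group means, the t statistic, the degrees of freedom, and the p-value.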
1d. Do people with different levels of education show different levels of happiness on average? (Hint: Answer this question by conducting an ANOVA with happy as the dependent variable and an independent variable that quantifies education (edu.num).)
Answer: (…)
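A minimal sketch of a one-way ANOVA with aov(), again on the built-in mtcars data rather than the census data (an assumption for illustration). Here the numeric variable cyl is wrapped in factor() so that each level gets its own group mean; whether to treat a numeric code as a factor or as a linear predictor is a modelling choice that depends on the question.

```r
# One-way ANOVA: does mean mpg differ between cars with 4, 6, or 8 cylinders?
aov.cyl <- aov(mpg ~ factor(cyl), data = mtcars)

# The summary shows degrees of freedom, F value, and p-value:
summary(aov.cyl)
```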
Linear regression with 1 IV
2a. Test the hypothesis that more money makes people happier. (Hint: Conduct a linear regression of happy by income.)
Answer: (…)
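The general pattern for a linear regression with one independent variable can be sketched as follows. The example uses the built-in mtcars data (not the census data), and the object name lm.mpg.wt is chosen here for illustration.

```r
# Regress mpg (dependent variable) on wt (independent variable):
lm.mpg.wt <- lm(formula = mpg ~ wt, data = mtcars)

# Coefficients, standard errors, t values, p-values, and R-squared:
summary(lm.mpg.wt)

# Extract the estimated intercept and slope:
coef(lm.mpg.wt)

# Extract the p-value of the slope from the coefficient table:
p.wt <- summary(lm.mpg.wt)$coefficients["wt", "Pr(>|t|)"]
p.wt
```

The sign of the slope gives the direction of the relationship, and its p-value indicates whether the slope differs reliably from zero.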
2b. Create a scatterplot that illustrates the relationship between income and happiness and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 2a.)
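A sketch of how a scatterplot and a regression line fit together, using the built-in mtcars data as a stand-in (an assumption; labels and colors are arbitrary choices):

```r
# Fit the linear model first:
lm.mpg.wt <- lm(mpg ~ wt, data = mtcars)

# Scatterplot with the independent variable on the x-axis:
plot(x = mtcars$wt, y = mtcars$mpg,
     xlab = "Weight (wt)", ylab = "Miles per gallon (mpg)",
     main = "mpg by wt")

# abline() reads intercept and slope from the model object:
abline(lm.mpg.wt, col = "blue", lwd = 2)
```

The variable on the x-axis of the plot should be the predictor used in the model formula, otherwise the line will not match the points.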
3a. Does a person’s income depend on his or her age?
3b. Create a scatterplot that illustrates the relationship between income and age and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 3a.)
3c. Create 2 separate linear regressions to re-examine the influence of age on income for people up to 60 years vs. people older than 60 years. What do you conclude?
Answer: (…)
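Fitting the same model to two subsets of a data frame, as asked in Exercise 3c, might look like the following sketch. It uses the built-in mtcars data, and the cut-off value of 120 on hp is a hypothetical choice made only for this illustration.

```r
# Hypothetical cut-off, chosen only for illustration:
cutoff <- 120

# Fit the same model separately to each subset of rows:
lm.low  <- lm(mpg ~ wt, data = subset(mtcars, hp <= cutoff))
lm.high <- lm(mpg ~ wt, data = subset(mtcars, hp >  cutoff))

# Compare the slopes (and their p-values) across the two subsets:
summary(lm.low)$coefficients
summary(lm.high)$coefficients
```

If the slopes differ clearly between the subsets, a single regression line over the full range may be misleading.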
Linear regression with multiple IVs
4a. Does someone’s happiness depend not only on income, but also on a person’s age, gender, IQ score, and education level? (Hint: Conduct a linear regression of happy by gender, age, iq, edu.num, and income to find out.)
Answer: (…)
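A regression with several independent variables only requires joining the predictors with + in the formula. The sketch below uses the built-in mtcars data as a stand-in (an assumption); am is wrapped in factor() to show how a categorical predictor enters the model.

```r
# Multiple regression: numeric predictors (wt, hp) and a categorical one (am):
lm.mpg.mult <- lm(mpg ~ wt + hp + factor(am), data = mtcars)

# One row per coefficient; a factor with 2 levels adds 1 coefficient:
summary(lm.mpg.mult)$coefficients
```

Each coefficient describes the effect of its predictor while holding the other predictors in the model constant.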
4b. Which of the other predictor variables does someone’s income depend on?
Answer: (…)
4c. Is the effect of income
on happiness still significant when already accounting for the effects of the other systematic predictors? (Hint: Conduct 2 linear regression models on happy
: One with all significant predictors excluding income
and one with all significant predictors including income. Then use an ANOVA to contrast both models.)
Answer: (…)
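Contrasting two nested models with anova(), as suggested in the hint for Exercise 4c, can be sketched like this. The example uses the built-in mtcars data, and the model names lm.a and lm.b are chosen here for illustration.

```r
# Simpler model (without the predictor of interest):
lm.a <- lm(mpg ~ wt, data = mtcars)

# More complex model (adds the predictor of interest, here hp):
lm.b <- lm(mpg ~ wt + hp, data = mtcars)

# F-test: does the additional predictor significantly improve the fit?
cmp <- anova(lm.a, lm.b)
cmp
```

Both models must be fitted to the same rows of data, and the simpler model must contain a subset of the predictors of the more complex one.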
Checkpoint 1
At this point you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please keep going…
B. At Home
Studying student performance
In this part, we will analyze a real data set from a study on student performance in two schools and two classes (Math and Portuguese). The two data files come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two tab-delimited text files at:
http://Rpository.com/down/data/WPA08_studentmath.txt (the Math data), and
http://Rpository.com/down/data/WPA08_studentpor.txt (the Portuguese data).
Data description
Both data files contain 33 columns. Here are their definitions:
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - in a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
Loading and exploring data
5a. Load the two text files containing the data into R and assign them to new objects called student.math and student.por, respectively. (Hint: Use the read.table() function with appropriate arguments, given that both files are tab-delimited and contain a header row with variable names.)
Here is how the first few rows of the math data should look:
head(student.math)
#> school sex age address famsize Pstatus Medu Fedu Mjob Fjob
#> 1 GP F 18 U GT3 A 4 4 at_home teacher
#> 2 GP F 17 U GT3 T 1 1 at_home other
#> reason guardian traveltime studytime failures schoolsup famsup paid
#> 1 course mother 2 2 0 yes no no
#> 2 course father 1 2 0 no yes no
#> activities nursery higher internet romantic famrel freetime goout Dalc
#> 1 no yes yes no no 4 3 4 1
#> 2 no no yes yes no 5 3 3 1
#> Walc health absences G1 G2 G3
#> 1 1 3 6 5 6 6
#> 2 1 3 4 5 5 6
#> [ reached getOption("max.print") -- omitted 4 rows ]
5b. Look at the first few rows of both data frames with the head() function to make sure they were imported correctly.
# head(student.math) # (shown above)
head(student.por)
#> school sex age address famsize Pstatus Medu Fedu Mjob Fjob
#> 1 GP F 18 U GT3 A 4 4 at_home teacher
#> 2 GP F 17 U GT3 T 1 1 at_home other
#> reason guardian traveltime studytime failures schoolsup famsup paid
#> 1 course mother 2 2 0 yes no no
#> 2 course father 1 2 0 no yes no
#> activities nursery higher internet romantic famrel freetime goout Dalc
#> 1 no yes yes no no 4 3 4 1
#> 2 no no yes yes no 5 3 3 1
#> Walc health absences G1 G2 G3
#> 1 1 3 4 0 11 11
#> 2 1 3 2 9 11 11
#> [ reached getOption("max.print") -- omitted 4 rows ]
5c. Using the str() function, look at the structure (type and first values) of each column in both data frames. There should be 33 columns in each dataset. Make sure everything looks ok.
Standard Regression with lm()
One IV
6a. For the math data, create a regression object called lm.G1.age predicting first period grade (G1) based on age.
6b. How do you interpret the relationship between age and the first period grade G1?
Answer: (…)
7a. For the math data, create a regression object called lm.G1.abs predicting first period grade (G1) based on absences.
7b. How do you interpret the relationship between absences and the first period grade G1?
Answer: (…)
8a. For the math data, create a regression object called lm.G3.G1 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1).
8b. How do you interpret the relationship between the first and final period grades (G1 and G3)?
Answer: (…)
Adding a regression line to a scatterplot
9a. Create a scatterplot showing the relationship between the G1 and G3 grades for the math data.
9b. Add a regression line to the scatterplot from your regression object lm.G3.G1. (Hint: Use the abline() function.)
Checkpoint 2
If you got this far you’re doing very well, but don’t give up just yet…
Multiple IVs
10a. For the math data, create a regression object called lm.G3.mult predicting third period grade (G3) based on several other variables: sex, age, internet, and failures.
10b. How do you interpret the regression output? Which of the variables are significantly related to third period grade G3?
Answer: (…)
10c. Create a new regression object called lm.G3.mult2 using the same variables as in Exercise 10a (where you predicted third period grade G3 based on sex, age, internet, and failures), but now use the Portuguese dataset to fit the model.
10d. What are the key differences between the beta values for the Portuguese dataset (lm.G3.mult2 from Exercise 10c) and the math dataset (lm.G3.mult from Exercise 10a)?
Answer: (…)
Predicting values
11a. For the math dataset, create a regression object called lm.G1.all predicting a student’s first period grade (G1) based on all variables in the dataset. (Hint: Use the notation formula = y ~ . to include all variables.)
11b. Save the fitted values from the lm.G1.all object as a vector called lm.G1.all.fitted. (Hint: A model’s fitted values are contained in a vector called model$fitted.values.)
11c. For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?
Answer: (…)
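The two hints in Exercise 11 (the y ~ . notation and the fitted.values vector) can be sketched together on the built-in mtcars data. The data set and the object names lm.mpg.all and lm.mpg.all.fitted are assumptions made for this illustration.

```r
# "y ~ ." uses all other columns of the data frame as predictors:
lm.mpg.all <- lm(formula = mpg ~ ., data = mtcars)

# Fitted values: the model's prediction for each row of the data:
lm.mpg.all.fitted <- lm.mpg.all$fitted.values

# Plot observed against fitted values:
plot(x = mtcars$mpg, y = lm.mpg.all.fitted,
     xlab = "Observed mpg", ylab = "Fitted mpg",
     main = "Observed vs. fitted values")

# Reference line: points on this diagonal are fitted perfectly
abline(a = 0, b = 1, lty = 2)

# Correlation between observed and fitted values:
cor(mtcars$mpg, lm.mpg.all.fitted)
```

The closer the points lie to the dashed diagonal, the better the model reproduces the observed values.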
Checkpoint 3
If you got this far you’re doing a terrific job — well done! Enjoy the following challenge…
C. Challenge
Showing model parsimony
12a. Someone claims that a student’s final grade G3 in Portuguese is predicted by three independent variables: the number of past class failures, the desire to pursue higher education (higher), and extra educational support (schoolsup). Verify this hypothesis by testing an appropriate linear model.
Answer: (…)
12b. Create 3 separate linear models (lm.1, lm.2, and lm.3) that predict a student’s final grade G3 in Portuguese by 1, 2, or 3 of the predictors identified in Exercise 12a (starting with the most significant predictor). Then use 3 pairwise ANOVAs to show that none of the predictors is unnecessary (i.e., adding each predictor yields a significant improvement in the fit of the more complex model).
Answer: (…)
12c. Are the three independent variables from Exercise 12a significant predictors of a student’s final grade G3 when also considering the first and second period grades (G1 and G2) as predictors?
Answer: (…)