WPA08 of Basic data and decision analysis in R, taught at the University of Konstanz in Winter 2017/2018.

To complete and submit these exercises, please remember and do the following:

Use the .Rmd Format: Your WPAs should be written as scripts of commented code (as .Rmd files) and submitted as reproducible documents that combine text with code (in .html or .pdf formats).
- A simple .Rmd template is provided here.
- (Alternatively, open a plain R script and save it as LastnameFirstname_WPA##_yymmdd.R.)
Commenting your code: Indicate the current assignment (e.g., WPA08), your name, and the current date at the top of your document. Please always include appropriate comments with your code. For instance, your file LastFirst_WPA08_yymmdd.Rmd could look like this:

---
title: "Your Assignment Title (WPA08)"
author: "Your Name"
date: "Year Month Day"
output: html_document
---

This file contains my solutions to WPA08.

# Exercise 1

To show and run R code in your document, use a code chunk (without the '#' symbols):

# ```{r, exercise_1, echo = TRUE, eval = TRUE}
# 
# v <- c(1, 2, 3) # some vector
# sum(v)
#     
# ```

More text and code chunks... 

[Updated on `r Sys.Date()` by Your Name.]
<!-- End of document -->

Complete as many exercises as you can by Wednesday (23:59).
Submit your script or output file (including all code) to the appropriate folder on Ilias.

A. In Class

Here are some warm-up exercises that review important points from previous chapters and practice the basic concepts of the current topic:

Preparations

0. The following steps prepare the current session by creating a new .Rmd file, and compiling it into an .html output file:

0a. Open your R project from last week (called RCourse or something similar), which contains some files and at least two subfolders (data and R).

0b. Create a new R Markdown (.Rmd) script and save it as LastFirst_WPA08_yymmdd.Rmd (with an appropriate header) in your project directory.

0c. Insert a code chunk and load the rmarkdown, knitr and yarrr packages. (Hint: It’s always a good idea to name code chunks and load all required packages with library() at the beginning of your document. Using the chunk option include = FALSE evaluates the chunk, but does not show it or its outputs in the html output file.)

library(rmarkdown)
library(knitr)
library(yarrr)

# Store original par() settings (only those that can be reset):
opar <- par(no.readonly = TRUE)
# par(opar) # restores original (default) par settings later

0d. Make sure that you can create an .html output-file by “knitting” your current document.

# Ok!

1. In this exercise, we will use some fictitious census data that lists basic demographic variables for 200 quasi-representative respondents: gender, age, IQ score (iq), education (in both text and numeric formats: edu and edu.num), annual income, and self-reported happiness (happy, rated on a scale from 1 to 30).

The data are stored in a comma-separated text file in http://Rpository.com/down/data/WPA08_census.txt.

Exploring data and comparing means

1a. Load the text file containing the data into R and assign it to a new object called census.dat. (Hint: Use the read.table() function with appropriate arguments. Note that the file is comma-delimited and contains a header with variable names.)

Here is how the first few rows of the census data should look:

head(census.dat)
#>   gender age  iq          edu edu.num income happy
#> 1      m  34 122  2.Bachelors       3  55499     6
#> 2      f  86 113  2.Bachelors       3  26631    14
#> 3      m  60  97 1.HighSchool       2  73623    21
#> 4      f  54 105 1.HighSchool       2  61140    18
#> 5      f  46  96  2.Bachelors       3  41420    10
#> 6      m  80  95        4.PhD       5  48190    24
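
The import step in 1a can be sketched on a small stand-in file. The file contents and the names tmp and toy.dat below are hypothetical; only the read.table() arguments carry over. Since read.table() also accepts a URL as its file argument, the same call pattern should work for the address given above.

```r
# Hypothetical toy file: write a small comma-delimited table to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("id,group,score",
             "1,a,10",
             "2,b,12",
             "3,a,9"),
           con = tmp)

# sep = "," declares the delimiter; header = TRUE reads the first line as column names
toy.dat <- read.table(file = tmp, sep = ",", header = TRUE)

head(toy.dat)  # show the first rows
dim(toy.dat)   # number of rows and columns
```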

1b. Explore the data to make sure everything loaded and looks ok. (Hint: Use the head(), str(), summary() functions.)

1c. Do men and women earn the same or a different level of income on average? (Hint: Answer this question by conducting a t-test on appropriate subsets of the income variable.)
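
As a minimal sketch of the approach hinted at in 1c, the following compares two group means on simulated data. The data frame toy and its columns group and value are made up for illustration and are not part of the census data.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical toy data: a two-level grouping variable and a numeric outcome
toy <- data.frame(group = rep(c("a", "b"), each = 50),
                  value = c(rnorm(50, mean = 100, sd = 15),
                            rnorm(50, mean = 105, sd = 15)))

# Variant 1: pass two subsets of the outcome variable
t.1 <- t.test(x = toy$value[toy$group == "a"],
              y = toy$value[toy$group == "b"])

# Variant 2: formula notation (outcome ~ grouping variable)
t.2 <- t.test(value ~ group, data = toy)

# Both variants run Welch's t-test by default and yield the same p-value
t.1$p.value
t.2$p.value
```

Both variants test the same difference; which one reads better is largely a matter of taste.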

Answer: (…)

1d. Do people with different levels of education show different levels of happiness on average? (Hint: Answer this question by conducting an ANOVA with a dependent variable happy by an independent variable that quantifies education (edu.num).)
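
One detail worth knowing when running an ANOVA on a numerically coded predictor: aov() treats a numeric variable as a single linear predictor (1 degree of freedom), whereas wrapping it in factor() treats each value as a separate group. The sketch below illustrates this on simulated data; toy, level and score are hypothetical names.

```r
set.seed(2)  # make the simulated data reproducible

# Hypothetical toy data: a numerically coded level (1 to 4) and a numeric outcome
toy <- data.frame(level = rep(1:4, each = 25))
toy$score <- 10 + 2 * toy$level + rnorm(100, sd = 3)

aov.num <- aov(score ~ level, data = toy)          # numeric predictor: 1 df
aov.fac <- aov(score ~ factor(level), data = toy)  # categorical predictor: 3 df

summary(aov.num)
summary(aov.fac)
```

Comparing the Df column of both summaries shows which of the two readings a given model uses.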

Answer: (…)

Linear regression with 1 IV

2a. Test the hypothesis that more money makes people happier. (Hint: Conduct a linear regression of happy by income.)
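
A linear regression with one predictor can be sketched as follows. The data are simulated and the names toy, x, y and toy.lm are placeholders, so this shows the call pattern rather than a result for the census data.

```r
set.seed(3)  # make the simulated data reproducible

# Hypothetical toy data: outcome y depends linearly on predictor x, plus noise
toy <- data.frame(x = runif(100, min = 0, max = 10))
toy$y <- 5 + 0.8 * toy$x + rnorm(100, sd = 2)

# Fit the model: dependent variable ~ independent variable
toy.lm <- lm(formula = y ~ x, data = toy)

summary(toy.lm)        # full output (coefficients, R-squared, F-test)
coef(summary(toy.lm))  # coefficient table only: estimate, SE, t value, p value
```

The row for x in the coefficient table holds the slope estimate and the p-value that a directional hypothesis about x would be judged by.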

Answer: (…)

2b. Create a scatterplot that illustrates the relationship between income and happiness and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 2a.)

3a. Does a person’s income depend on his or her age?

3b. Create a scatterplot that illustrates the relationship between income and age and add a regression line to the scatterplot. (Hint: Use the abline() function on your linear model object from Exercise 3a.)

3c. Create 2 separate linear regressions to re-examine the influence of age on income for people up to 60 years vs. people older than 60 years. What do you conclude?
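
Fitting the same model to two subsets of a data frame can be sketched with subset(). The simulated data below are deliberately built so that the trend reverses at a cutoff of 60; the names toy, x, y, lm.low and lm.high are hypothetical.

```r
set.seed(4)  # make the simulated data reproducible

# Hypothetical toy data: y rises with x up to 60 and falls afterwards
toy <- data.frame(x = runif(200, min = 20, max = 90))
toy$y <- ifelse(toy$x <= 60,
                10 + 1.0 * toy$x,
                130 - 1.0 * toy$x) + rnorm(200, sd = 5)

# Fit one model per subset of cases
lm.low  <- lm(y ~ x, data = subset(toy, x <= 60))
lm.high <- lm(y ~ x, data = subset(toy, x > 60))

coef(lm.low)   # slope for x is positive in the lower range
coef(lm.high)  # slope for x is negative in the upper range
```

A single regression over the full range would average over both trends, which is why splitting the data can change the conclusion.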

Answer: (…)

Linear regression with multiple IVs

4a. Besides income, does someone’s happiness also depend on their age, gender, IQ score, and education level? (Hint: Conduct a linear regression of happy by gender, age, iq, edu.num, and income to find out.)

Answer: (…)

4b. Which of the other predictor variables does someone’s income depend on?

Answer: (…)

4c. Is the effect of income on happiness still significant when already accounting for the effects of the other systematic predictors? (Hint: Conduct 2 linear regression models on happy: One with all significant predictors excluding income and one with all significant predictors including income. Then use an ANOVA to contrast both models.)
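
Contrasting two nested models with anova() can be sketched as follows. The data are simulated, and toy, x1, x2, lm.small and lm.large are hypothetical names; the point is only the pattern of fitting a smaller and a larger model and comparing them.

```r
set.seed(5)  # make the simulated data reproducible

# Hypothetical toy data: y depends on both x1 and x2
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.8 * toy$x1 + 0.8 * toy$x2 + rnorm(n)

lm.small <- lm(y ~ x1,      data = toy)  # model without x2
lm.large <- lm(y ~ x1 + x2, data = toy)  # model with x2

# F-test: does adding x2 significantly reduce the residual sum of squares?
cmp <- anova(lm.small, lm.large)
cmp
```

The second row of the resulting table reports the F-test for the added predictor. The comparison is only meaningful if the smaller model is nested in the larger one and both are fit to the same cases.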

Answer: (…)

Checkpoint 1

At this point, you have completed all basic exercises. This is good, but additional practice will deepen your understanding, so please keep going…

B. At Home

Studying student performance

In this part, we will analyze a real data set from a study on student performance in two schools and two classes (Math and Portuguese). The two data files come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

Here is the data description (taken directly from the original website):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two tab-delimited text files at:

http://Rpository.com/down/data/WPA08_studentmath.txt (the Math data), and
http://Rpository.com/down/data/WPA08_studentpor.txt (the Portuguese data).

Data description

Both data files contain 33 columns. Here are their definitions:

school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - in a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

Loading and exploring data

5a. Load the two text files containing the data into R and assign them to new objects called student.math and student.por respectively. (Hint: Use the read.table() function with appropriate arguments, given that both files are tab-delimited and contain a header row with variable names.)

Here is how the first few rows of the math data should look:

head(student.math)
#>   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
#> 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
#> 2     GP   F  17       U     GT3       T    1    1  at_home    other
#>       reason guardian traveltime studytime failures schoolsup famsup paid
#> 1     course   mother          2         2        0       yes     no   no
#> 2     course   father          1         2        0        no    yes   no
#>   activities nursery higher internet romantic famrel freetime goout Dalc
#> 1         no     yes    yes       no       no      4        3     4    1
#> 2         no      no    yes      yes       no      5        3     3    1
#>   Walc health absences G1 G2 G3
#> 1    1      3        6  5  6  6
#> 2    1      3        4  5  5  6
#>  [ reached getOption("max.print") -- omitted 4 rows ]

5b. Look at the first few rows of both data frames with the head() function to make sure they were imported correctly.

# head(student.math) # (shown above)
head(student.por)
#>   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
#> 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
#> 2     GP   F  17       U     GT3       T    1    1  at_home    other
#>       reason guardian traveltime studytime failures schoolsup famsup paid
#> 1     course   mother          2         2        0       yes     no   no
#> 2     course   father          1         2        0        no    yes   no
#>   activities nursery higher internet romantic famrel freetime goout Dalc
#> 1         no     yes    yes       no       no      4        3     4    1
#> 2         no      no    yes      yes       no      5        3     3    1
#>   Walc health absences G1 G2 G3
#> 1    1      3        4  0 11 11
#> 2    1      3        2  9 11 11
#>  [ reached getOption("max.print") -- omitted 4 rows ]

5c. Using the str() function, look at the structure (type and first values) of each column in both data frames. There should be 33 columns in each dataset. Make sure everything looks ok.
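
The import and inspection steps of Exercise 5 can be sketched on a small tab-delimited stand-in file. The three rows below are toy values rather than the real student data, and tmp and toy.dat are hypothetical names.

```r
# Hypothetical toy file: a small tab-delimited table ("\t" is the tab character)
tmp <- tempfile(fileext = ".txt")
writeLines(c("school\tage\tG1",
             "GP\t18\t5",
             "MS\t17\t9",
             "GP\t16\t12"),
           con = tmp)

# sep = "\t" declares tabs as the delimiter; header = TRUE reads the column names
toy.dat <- read.table(file = tmp, sep = "\t", header = TRUE)

str(toy.dat)      # structure: type and first values of each column
summary(toy.dat)  # summary statistics per column
ncol(toy.dat)     # number of columns
```

Note that text columns are imported as character vectors in R 4.0.0 and later, but as factors in earlier versions (unless stringsAsFactors is set explicitly), so the str() output may differ between installations.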

Standard Regression with lm()

One IV

6a. For the math data, create a regression object called lm.G1.age predicting first period grade (G1) based on age.

6b. How do you interpret the relationship between age and the first period grade G1?

Answer: (…)

7a. For the math data, create a regression object called lm.G1.abs predicting first period grade (G1) based on absences.

7b. How do you interpret the relationship between absences and the first period grade G1?

Answer: (…)

8a. For the math data, create a regression object called lm.G3.G1 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1).

8b. How do you interpret the relationship between the first and final period grades (G1 and G3)?

Answer: (…)

Adding a regression line to a scatterplot

9a. Create a scatterplot showing the relationship between the G1 and G3 grades for the math data.

9b. Add a regression line to the scatterplot from your regression object lm.G3.G1. (Hint: Use the abline() function.)
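
Adding a regression line to a scatterplot can be sketched as follows, again on simulated data with hypothetical names (toy, x, y, toy.lm). When given a simple regression object, abline() reads the intercept and slope from the model.

```r
set.seed(6)  # make the simulated data reproducible

# Hypothetical toy data on a 0 to 20 scale
toy <- data.frame(x = sample(0:20, size = 80, replace = TRUE))
toy$y <- 2 + 0.9 * toy$x + rnorm(80, sd = 2)

toy.lm <- lm(y ~ x, data = toy)

# Scatterplot first, then the regression line on top of it
plot(x = toy$x, y = toy$y,
     xlab = "x", ylab = "y",
     main = "Toy data with regression line")
abline(toy.lm, col = "blue", lwd = 2)
```

Because abline() draws onto the current plot, it has to be called after plot() and, in an .Rmd file, within the same code chunk.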

Checkpoint 2

If you got this far you’re doing very well, but don’t give up just yet…

Multiple IVs

10a. For the math data, create a regression object called lm.G3.mult predicting third period grade (G3) based on several other variables: sex, age, internet, and failures.
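
A regression with several predictors, including a categorical one, can be sketched as below. The data are simulated, and toy, grp, x and y are hypothetical names. For a two-level text variable, lm() creates a dummy variable whose coefficient name combines the variable name with the non-reference level (here grpyes).

```r
set.seed(7)  # make the simulated data reproducible

# Hypothetical toy data: one binary text predictor and one numeric predictor
n <- 120
toy <- data.frame(grp = sample(c("no", "yes"), size = n, replace = TRUE),
                  x   = rnorm(n, mean = 17, sd = 1))
toy$y <- 10 + 2 * (toy$grp == "yes") - 0.5 * toy$x + rnorm(n)

# Several predictors are combined with + on the right-hand side of the formula
toy.lm <- lm(y ~ grp + x, data = toy)

coef(summary(toy.lm))  # one row per coefficient, including the dummy "grpyes"
```

The grpyes estimate is the expected difference between the "yes" and "no" groups when x is held constant; the reference level ("no") is absorbed into the intercept.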

10b. How do you interpret the regression output? Which of the variables are significantly related to third period grade G3?

Answer: (…)

10c. Create a new regression object called lm.G3.mult2 using the same variables as in Exercise 10a. (where you predicted third period grade G3 based on sex, age, internet, and failures), but now use the Portuguese dataset to fit the model.

10d. What are the key differences between the beta values for the Portuguese dataset (lm.G3.mult2 from Exercise 10c.) and the math dataset (lm.G3.mult from Exercise 10a.)?

Answer: (…)

Predicting values

11a. For the math dataset, create a regression object called lm.G1.all predicting a student’s first period grade (G1) based on all variables in the dataset. (Hint: Use the notation formula = y ~ . to include all variables.)

11b. Save the fitted values from the lm.G1.all object as a vector called lm.G1.all.fitted. (Hint: A model’s fitted values are contained in a vector called model$fitted.values.)
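
The y ~ . notation and the extraction of fitted values can be sketched on a small simulated data frame. All names here (toy, a, b, g, toy.lm.all, toy.fitted) are hypothetical.

```r
set.seed(8)  # make the simulated data reproducible

# Hypothetical toy data: two numeric predictors and one binary text predictor
n <- 60
toy <- data.frame(a = rnorm(n),
                  b = rnorm(n),
                  g = sample(c("u", "r"), size = n, replace = TRUE))
toy$y <- 1 + toy$a - toy$b + rnorm(n)

# The dot stands for "all other columns in data"
toy.lm.all <- lm(formula = y ~ ., data = toy)

# Extract the fitted values (one per case)
toy.fitted <- toy.lm.all$fitted.values

length(toy.fitted)                             # same as nrow(toy)
all.equal(toy.fitted, fitted(toy.lm.all))      # fitted() is the accessor equivalent
```

Because the dot includes every remaining column, any variable that should not serve as a predictor has to be removed from the data frame (or excluded in the formula) beforehand.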

11c. For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?
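
One way to judge a model's fit visually is to plot fitted against observed values and add the identity line, on which all points would fall if the fit were perfect. The sketch below uses simulated data with hypothetical names (toy, x, y, toy.lm).

```r
set.seed(9)  # make the simulated data reproducible

# Hypothetical toy data and model
toy <- data.frame(x = rnorm(80))
toy$y <- 3 + 2 * toy$x + rnorm(80)
toy.lm <- lm(y ~ x, data = toy)

# Observed values on the x-axis, fitted values on the y-axis
plot(x = toy$y, y = toy.lm$fitted.values,
     xlab = "Observed y", ylab = "Fitted y",
     main = "Observed vs. fitted values")
abline(a = 0, b = 1, lty = 2)  # identity line (intercept 0, slope 1)

# The squared correlation of observed and fitted values equals R-squared
cor(toy$y, toy.lm$fitted.values)^2
summary(toy.lm)$r.squared
```

The closer the points lie to the dashed line, the better the model reproduces the observed values.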

Answer: (…)

Checkpoint 3

If you got this far you’re doing a terrific job — well done! Enjoy the following challenge…

C. Challenge

Showing model parsimony

12a. Someone claims that a student’s final grade G3 in Portuguese is predicted by three independent variables: the number of past class failures (failures), the desire to pursue higher education (higher), and extra educational support (schoolsup). Verify this hypothesis by testing an appropriate linear model.

Answer: (…)

12b. Create 3 separate linear models (lm.1, lm.2, and lm.3) that predict a student’s final grade G3 in Portuguese by 1, 2, or 3 of the predictors identified in Exercise 12a. (starting with the most significant predictor). Then use 3 pairwise ANOVAs to show that none of the predictors is unnecessary (i.e., adding each predictor yields a significant improvement in the fit of the more complex model).

Answer: (…)

12c. Are the three independent variables from Exercise 12a. significant predictors of a student’s final grade G3 when also considering the first and second period grades (G1 and G2) as predictors?

Answer: (…)

Submission

That’s it – now it’s time to submit your assignment!

Save and submit your script or output file (including all code) to the appropriate folder on Ilias before midnight.

[WPA08.Rmd updated on 2017-12-18 13:15:39 by hn.]

WPA08: Statistics: Linear regression

Hansjörg Neth, SPDS, uni.kn

2017 Dec 18
