Chapter 3 of James, Witten, Hastie, & Tibshirani (2021). See https://www.statlearning.com/ for data and resources.
3 Linear Regression
Linear regression (LR) is a useful tool for predicting a quantitative response and a simple type of supervised learning. Understanding LR remains important, as newer methods of statistical learning can be seen as generalizations or extensions of LR.
Example
Reconsidering the Advertising data from Chapter 2:
library(ISLR2)

# Data:
# https://www.statlearning.com/resources-second-edition
data_url <- "https://www.statlearning.com/s/Advertising.csv"

wd <- getwd()  # current working directory (assumed: defined earlier in the original document)

if (wd == "/Users/hneth/Desktop/stuff/Dropbox/GitHub/predict/R"){ # in subdir R:
  data_file <- "./../data/_JamesEtAl2021_ISL2/data/Advertising.csv"  # up 1 level
} else {
  data_file <- "./data/_JamesEtAl2021_ISL2/data/Advertising.csv"
}

ad <- tibble::as_tibble(read.csv(data_file))
dim(ad)
## [1] 200 5
Figure 2.1 (p. 16) displayed sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media:
Recreating Figure 2.1:
opar <- par(mfrow = c(1, 3))  # save previous par() settings (restored below)
plot(ad$TV, ad$sales, main = "(a) sales by TV",
pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ TV, data = ad), lwd = 2, col = "deepskyblue")
plot(ad$radio, ad$sales, main = "(b) sales by radio",
pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ radio, data = ad), lwd = 2, col = "deeppink")
plot(ad$newspaper, ad$sales, main = "(c) sales by newspaper",
pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ newspaper, data = ad), lwd = 2, col = "gold")
par(opar) # restore original par
For an output variable \(Y\) and a set of \(p\) predictor variables \(X = (X_1,\ X_2,\ \ldots,\ X_p)\) we assume:
\[Y = f(X) + \epsilon\]
Here, \(f\) represents the systematic information, whereas \(\epsilon\) represents random error (a.k.a. noise), which is independent of \(X\) and has a mean of zero.
In simple LR, we assume that the function \(f\) can be approximated by a line (with two parameters, i.e., \(\beta_0 + \beta_1 X\)). Thus,
\[Y = \beta_0 + \beta_1 X + \epsilon\]
The error or noise term \(\epsilon\) is assumed to be independent of the predictor \(X\) and contains everything that we miss with the linear model \(f(X)\). For instance,
- the true relationship between \(X\) and \(Y\) may be non-linear,
- other variables beyond \(X\) may contribute to the variation of \(Y\), and
- our measurements may contain errors.
Questions
The LR paradigm allows addressing the following (types of) questions:
- Is there a relationship between predictors (advertising budgets) and outcome (sales)?
- How strong is the relationship?
- Which predictors matter? What are their individual contributions to the overall outcome?
- How strong is the relationship for each predictor (with or without considering the others)?
- Prediction: How accurately can we predict future outcomes?
- Inference: What is the form of the relationship (e.g., linear or non-linear)?
- Interactions: Are there synergy effects (interactions)?
These questions will be addressed for the Advertising data below (see Section 3.4).
3.1 Simple linear regression
Simple LR assumes a single predictor variable \(X\) and an approximately linear relationship between \(X\) and an outcome variable \(Y\):
\[Y \approx \beta_0 + \beta_1 X\]
When plotting outcome values \(Y\) as a function of predictor values \(X\), simple LR estimates two parameters (or coefficients) of a straight line: its intercept \(\beta_0\) (for \(X = 0\)) and its slope \(\beta_1\).
Estimating the coefficients
Varying the parameters for intercept and slope (i.e., \(\beta_0\) and \(\beta_1\)) yields different lines. Each line provides a specific model that predicts a particular outcome value \(\hat{Y}\) for any predictor value \(X\). (The hat symbols signal that we are estimating unknown population values from observed samples.) Determining how closely our predictions fit the data requires that we quantify the deviation between observed and predicted data. The criterion for measuring closeness or quantifying the degree of fit used here is least squares: We select \(\beta_0\) and \(\beta_1\) so that the residual sum of squares (RSS) of our predictions is minimized (the RSS is closely related to the mean squared error, MSE = RSS\(/n\)).
See Equation 3.4 (James et al., 2021, p. 62) for the least squares coefficient estimates for simple LR.
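The least squares estimates from Equation 3.4 are easy to verify directly (a minimal sketch in R, assuming the ad data loaded above; x, y, b0, and b1 are helper names introduced here):
x <- ad$TV
y <- ad$sales
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate (Eq. 3.4)
b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate (Eq. 3.4)
c(b0, b1)  # should match coef(lm(sales ~ TV, data = ad))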
Figure 3.1 (p. 62) shows the least squares fit for the regression of sales onto TV for the Advertising data. The fit is found by minimizing the residual sum of squares.
Recreating a version of Figure 3.1:
# Scatterplot: ----
plot(x = ad$TV, y = ad$sales,
main = "Sales by TV advertising budget", xlab = "Budget (in units of $1,000)", ylab = "Sales (in units of 1,000)",
ylim = c(0, 30),
pch = 20, cex = 1.5, col = adjustcolor("black", .50))
# Simple linear regression model: ----
slm_1 <- lm(sales ~ TV, data = ad)
abline(slm_1, lwd = 2, col = "deepskyblue")
# Summary: ----
summary(slm_1)
##
## Call:
## lm(formula = sales ~ TV, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
# Predictions: ----
y_hat <- predict(slm_1)
segments(x0 = ad$TV, y0 = ad$sales, y1 = y_hat,
         lwd = .8, col = "deeppink")
Assessing the accuracy of the coefficient estimates
In statistics, we are usually interested in population values (e.g., \(\mu\)), but only have access to samples (e.g., \(\hat{\mu}\)). Thus, we need to estimate the population parameters from samples.
In the context of simple LR, we are interested in the coefficients \(\beta_0\) and \(\beta_1\) of the population regression line. However, we need to estimate these unknown coefficients from our sample(s) using \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
True relation vs. least-square regression line
Figure 3.3 (p. 64) compares the true relationship between \(X\) and \(Y\) with the estimated coefficients of simple LR for a simulated data set:
Left: The red line represents the true relationship, \(Y = f(X) = 2 + 3X\), which is known as the population regression line. The blue line is the least squares line; it is the least squares estimate for \(f(X)\) based on the observed data, shown in black.
Right: The population regression line is again shown in red, and the least squares line in dark blue. In light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations. Each least squares line is different, but on average, the least squares lines are quite close to the population regression line.
Just like many sample means \(\hat{\mu}\) provide unbiased estimates of the true population mean \(\mu\), the estimated coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\) provide unbiased estimates of the true regression line in the population. When drawing many samples, the average of the estimates approximates the true population values.
Measures of deviation
To quantify the deviation between population values and sample-based estimates, we can compute standard errors (SE), residual standard errors (RSE), and confidence intervals (CI).
For simple LR, the 95%-confidence intervals for the population (or true) coefficients (\(\beta_0\) and \(\beta_1\)) can be computed from the estimated (or sample-based) coefficients (\(\hat{\beta_0}\) and \(\hat{\beta_1}\)) and their corresponding standard errors as:
\[ \hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0}) \] \[ \hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1}) \]
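In R, such intervals can be obtained from the coefficient table or, based on the exact \(t\)-distribution, via confint() (a sketch, assuming the slm_1 model fitted above; est and se are helper names introduced here):
# Approximate 95%-CIs from estimate +/- 2 * SE:
est <- coef(summary(slm_1))[ , "Estimate"]
se  <- coef(summary(slm_1))[ , "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)

# Exact 95%-CIs (based on the t-distribution):
confint(slm_1, level = 0.95)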
Hypothesis testing
Endowed with estimated coefficients and standard errors, we can perform hypothesis tests on the coefficients:
- \(H_a\): There is a relationship between \(X\) and \(Y\): \(\beta_1 \neq 0\)
- \(H_0\): There is no relationship between \(X\) and \(Y\): \(\beta_1 = 0\)
Using a \(t\)-distribution (to assess the likelihood of an estimated coefficient’s deviation from 0) yields a \(p\)-value: the probability of observing an effect at least this extreme if only chance were at work (i.e., \(P(\text{data} \mid H_0)\)). Note that this is not the probability that \(H_a\) is true!
For our model slm_1 (from above):
# Simple linear regression model: ----
slm_1 <- lm(sales ~ TV, data = ad)

# Summary of the slm_1 model (above):
summary(slm_1)
##
## Call:
## lm(formula = sales ~ TV, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
As the SE values of both coefficients are very small relative to their estimates, the estimated (or sample-based) coefficients (\(\hat{\beta_0}\) and \(\hat{\beta_1}\)) clearly deviate from 0.
This is tested by the \(t\)-statistic (and corresponding \(p\)-values).
Rejecting \(H_0\) for both \(\beta_0\) and \(\beta_1\) implies:
- \(\beta_0 \neq 0\): There are sales in the absence of TV advertising.
- \(\beta_1 \neq 0\): There is a relationship between TV advertising and sales.
Assessing the accuracy of the model
We want to quantify the extent to which a model fits the data. The quality of fit for a LR model is typically assessed using two related quantities: the residual standard error (RSE) and the \(R^2\) statistic.
Residual standard error (RSE)
The residual standard error (RSE) estimates the standard deviation of the irreducible error \(\epsilon\): It quantifies the extent to which our predictions would miss on average even if we knew the true LR line (i.e., the population values \(\beta_0\) and \(\beta_1\)). The RSE is based on the residual sum of squares (RSS, see the formulas on p. 69):
\[\text{RSS} = \sum_{i=1}^{n}{(y_i - \hat{y_i})^2}\]
and computed as:
\[\text{RSE} = \sqrt{\frac{1}{n-2} \cdot \text{RSS}}\]
As RSE quantifies the lack of fit of a model in units of \(Y\), lower values are better (ideally, \(RSE \rightarrow 0\)). In our above example model slm_1, \(RSE = 3.26\) and \(Y\) represents sales in units of 1,000. Thus, even if our model were perfectly correct, our predictions would deviate from the true values by approximately 3,260 units, on average.
Proportion of variance explained (\(R^2\))
Whereas RSE provides an absolute measure of model fit (in units of the criterion \(Y\)), the \(R^2\) statistic provides an alternative, relative measure of fit: the proportion of non-erroneous squared deviations, or the proportion of variance explained:
\[R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}}\] with
\[\text{TSS} = \sum_{i=1}^{n}{(y_i - \bar{y})^2}\]
(Note that TSS is a special case of RSS, with the mean \(\bar{y}\) as the only predicted value.)
In contrast to error measures, \(R^2\) ranges from \(0\) to \(1\), with higher values indicating better model fits (ideally, \(R^2 \rightarrow 1\)).
Given a model, \(R^2\) measures the proportion of variability in \(Y\) that can be explained by \(X\). An \(R^2\) value close to 1 indicates that a large proportion of the variability in the response variable is explained by the regression model.
Although \(R^2\) is always between 0 and 1, it can still be challenging to determine what constitutes a “good” \(R^2\) value. For domains in which the true model is highly linear and \(\epsilon\) is small, we can obtain \(R^2 \approx 1\). In the social sciences, relationships can be non-linear and subject to many other influences, so even a small \(R^2\) can be meaningful.
The \(R^2\) is related to the correlation \(r = \text{Cor}(X, Y)\). In simple LR (with only 1 predictor), \(R^2 = r^2\).
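A quick check of this identity in R (assuming the ad data and the slm_1 model from above):
cor(ad$TV, ad$sales)^2    # squared correlation r^2
summary(slm_1)$r.squared  # R^2 of the simple LR model (0.6119, see above)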
Accuracy measures in R
We can easily define R functions for the total sum of squares (TSS), residual sum of squares (RSS), the residual standard error (RSE), and the \(R^2\)-statistic, when providing both the true and predicted values as arguments:
# Measures of model accuracy:
# (see @JamesEtAl2021, p. 69ff.):

# Total sum of squares (TSS): ----
TSS <- function(true_vals, pred_vals = NA){

  # Special case (no predictions provided): use the mean as the only prediction:
  if (all(is.na(pred_vals))){
    pred_vals <- mean(true_vals)
  }

  sum((true_vals - pred_vals)^2)
}

# Residual sum of squares (RSS): ----
RSS <- function(true_vals, pred_vals){
  sum((true_vals - pred_vals)^2)
}

# Residual standard error (RSE): ----
RSE <- function(true_vals, pred_vals){
  n   <- length(true_vals)
  rss <- RSS(true_vals, pred_vals)  # defined above
  sqrt(1/(n - 2) * rss)
}

# R^2 statistic: ----
Rsq <- function(true_vals, pred_vals){
  tss <- TSS(true_vals)             # defined above
  rss <- RSS(true_vals, pred_vals)  # defined above
  (tss - rss)/tss
}
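As a quick sanity check (assuming the slm_1 model from above is still available), these functions should reproduce the values reported by summary(slm_1):
RSE(ad$sales, predict(slm_1))  # should be close to the reported RSE of 3.259
Rsq(ad$sales, predict(slm_1))  # should be close to the reported R-squared of 0.6119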
3.2 Multiple linear regression
Simple LR assumes only one predictor variable, but we often have more than one predictor.
Multiple single LR
We could consider running separate models for each predictor.
Re-creating Table 3.3 (p. 72): Multiple simple/single LR models for the Advertising data:
# Simple linear regression models: ----
slm_1 <- lm(sales ~ TV,        data = ad)
slm_2 <- lm(sales ~ radio,     data = ad)
slm_3 <- lm(sales ~ newspaper, data = ad)

# Summary of each model:
summary(slm_1)
##
## Call:
## lm(formula = sales ~ TV, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
summary(slm_2)
##
## Call:
## lm(formula = sales ~ radio, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7305 -2.1324 0.7707 2.7775 8.1810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31164 0.56290 16.542 <2e-16 ***
## radio 0.20250 0.02041 9.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
## F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
summary(slm_3)
##
## Call:
## lm(formula = sales ~ newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2272 -3.3873 -0.8392 3.5059 12.7751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.35141 0.62142 19.88 < 2e-16 ***
## newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
Three distinct simple/single LR models are unsatisfactory:
It remains unclear how we can predict the outcome variable based on a combination of predictor values.
Also, each simple/single LR model ignores (or rather tacitly incorporates) the effects of the other predictor variables.
Multiple LR
A better approach is to extend the simple LR model by incorporating multiple predictors in a single model. The following multiple linear regression (multiple LR) model uses a separate slope coefficient for each predictor \(p\):
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 +\ \ldots\ + \beta_p X_p + \epsilon \]
where \(X_j\) represents the \(j\)th predictor and the coefficient \(\beta_j\) quantifies the association between that predictor and the response.
We interpret \(\beta_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed.
Estimating the regression coefficients
# Multiple linear regression model (3 predictors): ----
mlm_1 <- lm(sales ~ TV + radio + newspaper, data = ad)

# Summary of multiple LR model:
summary(mlm_1)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Note that the simple and multiple regression coefficients can be quite different. Here, the coefficient for the newspaper predictor is no longer significant when accounting for the effects of TV and radio.
How can the effect of newspaper be significant in a simple/single LR, but non-significant in a multiple LR? The correlation between the radio and newspaper predictors is 0.35:
Re-creating Table 3.5 (p. 75): Correlation matrix for TV, radio, newspaper, and sales, for the Advertising data:
# Correlation matrix:
df <- ad[ , -1]
round(cor(x = df), 2)
## TV radio newspaper sales
## TV 1.00 0.05 0.06 0.78
## radio 0.05 1.00 0.35 0.58
## newspaper 0.06 0.35 1.00 0.23
## sales 0.78 0.58 0.23 1.00
Thus, the simple/single LR of sales by newspaper captures some of the variance due to the radio predictor. In other words, newspaper advertising is a surrogate for radio advertising, or the newspaper predictor gets some credit for the association between radio and sales. When incorporating both predictors in a single model, the effect of newspaper disappears:
# Multiple linear regression model (2 predictors): ----
mlm_2 <- lm(sales ~ radio + newspaper, data = ad)

# Summary mLR:
summary(mlm_2)
##
## Call:
## lm(formula = sales ~ radio + newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5289 -2.1449 0.7315 2.7657 7.9751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.188920 0.627672 14.640 <2e-16 ***
## radio 0.199045 0.021870 9.101 <2e-16 ***
## newspaper 0.006644 0.014909 0.446 0.656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.284 on 197 degrees of freedom
## Multiple R-squared: 0.3327, Adjusted R-squared: 0.3259
## F-statistic: 49.11 on 2 and 197 DF, p-value: < 2.2e-16
Example involving a common cause variable: Assume an association between the predictor ice cream sales and the outcome shark attacks on beaches. Once we incorporate the (common cause) predictor temperature, the association disappears.
Some important questions
When we perform multiple linear regression, we usually are interested in answering the following questions:
Is at least one of the predictors \(X_1,\ X_2,\ \ldots,\ X_p\) useful in predicting the response \(Y\)?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
ad 1. Overall effect vs. contributions of individual predictors
Is at least one of the predictors \(X_1,\ X_2,\ \ldots,\ X_p\) useful in predicting the response \(Y\)?
A large \(F\)-statistic (\(F > 1\), taking into account the \(df\) implied by \(n\) and \(p\)) implies that at least one of the predictors is successful in predicting the outcome/response variable.
In contrast to this overall effect, the \(t\)-statistic and the corresponding \(p\)-value of each coefficient informs about whether each predictor is related to the outcome/response when adjusting for the other predictors.
Note that \({t_{j}}^2 = F_j\) for the partial effect of each predictor \(X_j\) (see James et al., 2021, p. 77, Footnote 7).
Good question: Given individual \(p\)-values for each coefficient, why do we need the overall \(F\)-statistic?
Consider an example with many predictors (or multiple tests): For instance, assume a multiple LR model with \(p = 100\) predictors, but
\(H_0: \beta_1 = \beta_2 =\ \ldots\ = \beta_p = 0\) is true, so no variable is truly associated with the response. In this situation, about \(5\)% of the \(p\)-values associated with each coefficient will be below \(0.05\) by chance (and — simply by definition — “statistically significant”). In other words, we expect to see approximately five significant \(p\)-values in the absence of any true association between the predictors and the response. By contrast, the \(F\)-statistic does not suffer from the problem of multiple tests because it adjusts for the number of predictors \(p\).
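This multiple-testing point is easy to illustrate with a small simulation (a sketch with made-up noise data, not from the book):
set.seed(1)
n <- 200
p <- 100
X <- matrix(rnorm(n * p), nrow = n)  # 100 pure-noise predictors
y <- rnorm(n)                        # response unrelated to all predictors
fit <- lm(y ~ X)
p_vals <- summary(fit)$coefficients[-1, 4]  # per-coefficient p-values (drop intercept)
sum(p_vals < 0.05)                          # roughly 5 "significant" coefficients by chance
summary(fit)$fstatistic                     # the overall F-statistic remains unremarkable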
ad 2. Deciding on important variables
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection (see Chapter 6). With \(p\) predictors, we can create \(2^p\) different models containing subsets of predictor variables (not yet considering interactions). Various statistics allow judging the quality of a model (including AIC, BIC).
Three classical approaches for solving the variable selection task are:
Forward selection: Begin with the null model and add the predictor that results in a model with the lowest RSS. Proceed until some stopping criterion is reached.
Backward selection: Begin with the full model and remove the predictor that has the highest \(p\) value. Proceed until some stopping criterion is reached.
Mixed selection: Begin with forward selection, but also remove predictors when their \(p\)-values increase above some criterion. Continue until all variables remaining in the model have a sufficiently low \(p\)-value.
Note:
Backward selection cannot be used when \(p > n\).
Forward selection can always be used (even when \(p > n\)), but is a greedy approach (i.e., might include variables early that later become redundant). Mixed selection can remedy this.
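As a minimal sketch, R's step() function automates such stepwise searches (note that it uses AIC as its stopping criterion, rather than the raw RSS or \(p\)-values described above; null_model is a helper name introduced here):
# Forward selection from the null model (assuming the ad data from above):
null_model <- lm(sales ~ 1, data = ad)
step(null_model, scope = ~ TV + radio + newspaper, direction = "forward")

# Backward selection would start from the full model with direction = "backward";
# direction = "both" corresponds to mixed selection.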
ad 3. Model fit
How well does the model fit the data?
Two of the most common numerical measures of model fit are the residual standard error (RSE) and the fraction of variance explained (\(R^2\)).
Note:
Relation between \(R^2\) and correlation: In simple LR (with only 1 predictor), we had \(R^2 = r^2\). In multiple LR, we have \(R^2 = \text{Cor}(Y, \hat{Y})^2\) (i.e., multiple LR maximizes the correlation between response \(Y\) and the predictions of the fitted linear model \(\hat{Y}\)).
Number of predictors \(p\), \(R^2\), vs. model quality: \(R^2\) always increases when adding variables to a model, even if those variables are only weakly associated with the response. This is due to the fact that adding another variable always results in a decrease in the RSS on the training data, but not necessarily the testing data.
Graphical summaries can reveal problems with a model that are not visible from numerical statistics (e.g., non-linearity).
Example: Figure 3.5 (p. 81) shows a non-linear relationship in data for a multiple LR with 2 predictors. Such a non-linear pattern often suggests a synergy or interaction effect.
ad 4. Predictions
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
For a specific multiple LR model and a set of values for the predictors \(X_1, X_2,\ \ldots, X_p\), it is straightforward to predict the response \(\hat{Y}\). There are three sorts of uncertainty associated with this prediction:
Reducible error: The coefficients \(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_p}\) are only estimations for the true population values \(\beta_0, \beta_1,\ \ldots, \beta_p\).
Model bias: Even the best linear model \(f(X)\) need not be the true relationship (which may be non-linear).
Irreducible error: Even the true population model would be off by random error/noise \(\epsilon\).
Note the distinction between a confidence interval (for the average \(Y\) over many observations) and a prediction interval (for the uncertainty surrounding \(Y\) for a particular observation).
Confidence intervals express sampling uncertainty in quantities estimated from many data points. The more data, the less sampling uncertainty, and hence the thinner the interval.
Prediction intervals, on top of the sampling uncertainty, also express uncertainty around a single value, which makes them wider than the confidence intervals.
See blog post at Confidence intervals vs prediction intervals (2021-11-08).
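A minimal sketch in R (assuming the mlm_1 model from above; the budget values in new_budget are made up for illustration):
new_budget <- data.frame(TV = 100, radio = 20, newspaper = 30)

predict(mlm_1, newdata = new_budget, interval = "confidence")  # CI for the average response f(X)
predict(mlm_1, newdata = new_budget, interval = "prediction")  # PI for an individual response (wider)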
3.3 Other considerations in the regression model
The Credit dataset of ISLR2 contains both quantitative and qualitative predictors:
library(ISLR2)
library(tidyverse)

cred <- ISLR2::Credit
dim(cred)  # 400 x 11
## [1] 400 11
summary(cred)
## Income Limit Rating Cards
## Min. : 10.35 Min. : 855 Min. : 93.0 Min. :1.000
## 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2 1st Qu.:2.000
## Median : 33.12 Median : 4622 Median :344.0 Median :3.000
## Mean : 45.22 Mean : 4736 Mean :354.9 Mean :2.958
## 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2 3rd Qu.:4.000
## Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000
## Age Education Own Student Married Region
## Min. :23.00 Min. : 5.00 No :193 No :360 No :155 East : 99
## 1st Qu.:41.75 1st Qu.:11.00 Yes:207 Yes: 40 Yes:245 South:199
## Median :56.00 Median :14.00 West :102
## Mean :55.67 Mean :13.45
## 3rd Qu.:70.00 3rd Qu.:16.00
## Max. :98.00 Max. :20.00
## Balance
## Min. : 0.00
## 1st Qu.: 68.75
## Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
sub_cred <- cred %>%
  select(Balance, Age, Cards, Education, Income, Limit, Rating, Own)

plot(sub_cred, pch = 20, cex = .8, col = adjustcolor("steelblue", .20))
Qualitative predictors
Given a quantitative outcome/response variable, a binary predictor variable can be dummy coded as 0 vs. 1:
Example: Predicting Credit card balance based on demographic variables (quantitative and qualitative/categorical).
Re-creating Table 3.7 (p. 85): Least squares coefficient estimates associated with the regression of credit card Balance onto home ownership (binary predictor Own) in the Credit data set.
# Descriptive:
# table(cred$Own)
tb_own <- cred %>%
  group_by(Own) %>%
  summarise(n = n(),
            mn_balance = mean(Balance))

knitr::kable(tb_own, caption = "N and mean balance by home ownership.")
| Own | n   | mn_balance |
|-----|-----|------------|
| No  | 193 | 509.8031   |
| Yes | 207 | 529.5362   |
# library(kableExtra)
# kbl(tb_own, booktabs = TRUE, caption = "N and mean balance by home ownership.") %>%
# kable_styling(latex_options = "striped", full_width = FALSE)
# Box plot:
plot(x = cred$Own, y = cred$Balance,
main = "Balance by home ownership",
col = adjustcolor(c("deepskyblue", "deeppink"), .33))
# Linear model (with binary dummy variable):
mlm_2 <- lm(Balance ~ Own, data = cred)
summary(mlm_2)
##
## Call:
## lm(formula = Balance ~ Own, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## OwnYes 19.73 46.05 0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
Interpretation:
- intercept: Average credit card balance without home ownership (i.e., Own == No baseline): $509.80
- coefficient: Increase by $19.73 in case of home ownership (but the effect is small and non-significant).
Note that the interpretation of the coefficients depends on the way the variable (here: the factor Own) is coded. By contrast, the predictions remain the same, regardless of coding:
table(cred$Own)  # is a factor with 2 levels (1: No, 2: Yes)
##
##  No Yes
## 193 207

# Recode binary predictor:
not_Own <- (as.numeric(cred$Own) * -1) + 2  # reverse 0/1 values of variable
table(not_Own)
## not_Own
##   0   1
## 207 193

# Linear model (with binary dummy variable):
mlm_3 <- lm(Balance ~ not_Own, data = cred)
summary(mlm_3)
##
## Call:
## lm(formula = Balance ~ not_Own, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 529.54 31.99 16.554 <2e-16 ***
## not_Own -19.73 46.05 -0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
Interpretation:
Model mlm_3 contains the same results as mlm_2, but from the opposite perspective:
- intercept: Average credit card balance WITH home ownership (i.e., not_Own == 0 baseline): $529.54
- coefficient: Decrease by $19.73 in the ABSENCE of home ownership (but the effect is small and non-significant).
Qualitative predictors with more than two levels
When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables.
Example: For predicting Credit card balance by Region (i.e., a factor variable with 3 levels), we create and use two (binary) dummy variables:
table(cred$Region)  # is a factor with 3 levels (1: East, 2: South, 3: West)
##
##  East South  West
##    99   199   102

# Create 2 dummy variables:
cred$r_S <- rep(NA, nrow(cred))
cred$r_S <- ifelse(cred$Region == "South", 1, 0)
table(cred$r_S)
##
##   0   1
## 201 199

cred$r_W <- rep(NA, nrow(cred))
cred$r_W <- ifelse(cred$Region == "West", 1, 0)
table(cred$r_W)
##
##   0   1
## 298 102
Model mlm_4 (using the 2 dummy variables as binary predictors) re-creates the results of Table 3.8 (p. 86): Least squares coefficient estimates associated with the regression of Balance onto region in the Credit data set. The categorical Region variable (with 3 levels) is encoded via two dummy variables.
# Linear model (with 2 binary dummy variables):
mlm_4 <- lm(Balance ~ r_S + r_W, data = cred)
summary(mlm_4)
##
## Call:
## lm(formula = Balance ~ r_S + r_W, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## r_S -12.50 56.68 -0.221 0.826
## r_W -18.69 65.02 -0.287 0.774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
Interpretation:
- The level with no dummy variable (East in this example) is the baseline. Which level we choose as the baseline is arbitrary, but it matters for interpreting the coefficients.
- The coefficient estimates show that the differences between regions are small and non-significant.
- Overall, the low \(F\)-value (and its \(p\)-value of 0.96) indicates that we cannot reject the null hypothesis \(H_0\): \(\beta_1 = \beta_2 = 0\) (i.e., Region has no effect on Balance) with this model.
Note that we can achieve the same result with the original categorical variable Region (i.e., a factor with 3 levels):
# Linear model with a factor variable (3 levels):
mlm_5 <- lm(Balance ~ Region, data = cred)
summary(mlm_5)
##
## Call:
## lm(formula = Balance ~ Region, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## RegionSouth -12.50 56.68 -0.221 0.826
## RegionWest -18.69 65.02 -0.287 0.774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
Overall, using the dummy variable approach presents no difficulties when incorporating both quantitative and qualitative predictors.
Besides the dummy variable approach, there are alternative ways of coding qualitative variables. All of these approaches lead to equivalent model fits, but the coefficients have different values and interpretations, and are designed to measure particular contrasts.
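For instance, R's default treatment (dummy) coding of a factor can be inspected and swapped for an alternative coding (a sketch, assuming the cred data from above):
contrasts(cred$Region)  # default dummy/treatment coding (baseline: East)

# Same fit, but the coefficients measure different contrasts (sum/deviation coding):
summary(lm(Balance ~ Region, data = cred, contrasts = list(Region = "contr.sum")))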
Extensions of the linear model
The standard LR model provides interpretable results and works quite well on many real-world problems. However, its restrictive assumptions are often violated in practice. Two of the most important assumptions state that the effects of the predictors are additive and the relationship between the predictors and response is linear. Here are some classical approaches for extending the linear model:
1. Non-additive relationships
Removing the additive assumption of standard LR models: Including interaction effects.
Example: Two quantitative predictors
Figure 3.5 (p. 81) suggested a non-linear relationship in the Advertising data for a multiple LR with 2 predictors (TV and radio). Such a non-linear pattern often suggests a synergy or interaction effect between both predictors:
# Multiple linear regression model (2 predictors + interaction): ----
mlm_6 <- lm(sales ~ TV + radio + TV:radio, data = ad)
mlm_6 <- lm(sales ~ TV * radio, data = ad)  # same model (main effects + interaction)

# Summary mLR:
summary(mlm_6)
##
## Call:
## lm(formula = sales ~ TV * radio, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
# Without interaction: ----
mlm_x <- lm(sales ~ TV + radio, data = ad)
# summary(mlm_x)
Notes:
- Testing for interactions relaxes the additivity assumption insofar as multiplicative relations are included (see explanation and examples on p. 88).
- The new \(R^2 = 96.7\%\), whereas it was only \(R^2 = 89.6\%\) without the interaction. Thus, 68.3% of the remaining variance in sales was explained by including the interaction term.
- The hierarchical principle: If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant. (In mlm_6, both main effects and their interaction are significant.)
Example: One qualitative and one quantitative predictor
An interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation (p. 89). Consider the Credit data set from above, and suppose that we wish to predict Balance using the Income (quantitative) and Student (qualitative) variables:
# head(cred)

# (a) Linear model with a quantitative and qualitative predictor (NO interaction):
mlm_7a <- lm(Balance ~ Income + Student, data = cred)
summary(mlm_7a)
##
## Call:
## lm(formula = Balance ~ Income + Student, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -762.37 -331.38 -45.04 323.60 818.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211.1430 32.4572 6.505 2.34e-10 ***
## Income 5.9843 0.5566 10.751 < 2e-16 ***
## StudentYes 382.6705 65.3108 5.859 9.78e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.8 on 397 degrees of freedom
## Multiple R-squared: 0.2775, Adjusted R-squared: 0.2738
## F-statistic: 76.22 on 2 and 397 DF, p-value: < 2.2e-16
# (b) Linear model with a quantitative and qualitative predictor (WITH interaction):
mlm_7b <- lm(Balance ~ Income + Student + Income:Student, data = cred)
summary(mlm_7b)
##
## Call:
## lm(formula = Balance ~ Income + Student + Income:Student, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.39 -325.70 -41.13 321.65 814.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.6232 33.6984 5.953 5.79e-09 ***
## Income 6.2182 0.5921 10.502 < 2e-16 ***
## StudentYes 476.6758 104.3512 4.568 6.59e-06 ***
## Income:StudentYes -1.9992 1.7313 -1.155 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared: 0.2799, Adjusted R-squared: 0.2744
## F-statistic: 51.3 on 3 and 396 DF, p-value: < 2.2e-16
When expressed graphically, including the interaction results in a more flexible model that allows not only for different intercepts, but also different slopes for both qualitative groups:
Notes:
Whereas including only main effects allows for parallel lines with different intercepts (one for each level of the qualitative predictor), including an interaction term additionally allows for different slopes (i.e., different impacts of the quantitative predictor in different qualitative groups).
See the derivation of the line equations (i.e., values of intercept and slope) from the coefficients (James et al., 2021, p. 90).
# Scatterplot:
plot(x = cred$Income, y = cred$Balance,
main = "Balance by income (including interaction with student/no student)",
pch = 20, col = ifelse(cred$Student == "Yes", adjustcolor("deepskyblue", .33), adjustcolor("deeppink", .33)),
xlab = "Income", ylab = "Balance", xlim = c(0, 200), ylim = c(0, 1600))
# From model:
cf <- coefficients(mlm_7b)

# Line equations (with interaction of categorical predictor):
l_inc_no_stud <- function(x){ cf[1] + cf[2] * x }
l_inc_student <- function(x){ (cf[1] + cf[3]) + (cf[2] + cf[4]) * x }
curve(expr = l_inc_student, lwd = 2, col = "deepskyblue", add = TRUE)
curve(expr = l_inc_no_stud, lwd = 2, col = "deeppink", add = TRUE)
2. Non-linear relationships
Removing the linear assumption of standard LR models: Using polynomial regression.
The LR model assumes a linear relationship between the predictors and the response. When the true relationship between the response and the predictors is non-linear, a simple way to incorporate non-linear associations into a linear model is to include transformed versions of the predictors.
Extending the linear model to accommodate non-linear relationships is known as polynomial regression, as we include polynomial functions of the predictors in a LR model.
Example of polynomial regression: The points in Figure 3.8 (p. 91) seem to have a quadratic shape, suggesting that a model of the form
\(\text{mpg} = \beta_0 + (\beta_1 \cdot \text{hp}) + (\beta_2 \cdot \text{hp}^2) + \epsilon\)
may provide a better fit.
# Data:
Auto <- ISLR2::Auto

# Variables:
hp  <- Auto$horsepower  # predictor
mpg <- Auto$mpg         # outcome
# Scatterplot:
plot(x = hp, y = mpg,
main = "Auto data: mpg by horsepower", xlab = "horsepower", ylab = "mpg",
pch = 20, cex = 1, col = adjustcolor("black", .25))
# (a) Linear model with one predictor (hp):
mlm_8a <- lm(mpg ~ hp)
summary(mlm_8a)
##
## Call:
## lm(formula = mpg ~ hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## hp -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
abline(mlm_8a, lwd = 2, col = "deepskyblue")
# (b) Polynomial LR model with one predictor and its square (as 2nd predictor):
hp_sq  <- hp^2  # squared predictor
mlm_8b <- lm(mpg ~ hp + hp_sq)

mlm_8b <- lm(mpg ~ hp + I(hp^2))  # Note: I() inhibits the interpretation of ^
                                  # (which has special meaning in an R formula)
summary(mlm_8b)
##
## Call:
## lm(formula = mpg ~ hp + I(hp^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## hp -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(hp^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
curve(expr = coef(mlm_8b)[1] + coef(mlm_8b)[2] * x + coef(mlm_8b)[3] * x^2,
lwd = 2, col = "deeppink", add = TRUE)
Notes:
- The fit of the polynomial model (mlm_8b) is superior: \(R^2 = 0.686\), compared to \(0.605\) for the linear fit (mlm_8a). Also, the \(p\)-value for the coefficient of the quadratic term is highly significant.
- In this equation, the value of mpg is a non-linear function of hp. However, the model is still a linear model: a multiple LR model with two predictors \(X_1 = \text{hp}\) and \(X_2 = \text{hp}^2\).
- We could extend the polynomial LR approach by including terms for \(\text{hp}^3\), \(\text{hp}^4\), etc. Although the resulting curves would provide closer fits, they are also prone to overfitting the training data.
See Chapter 7 for further non-linear extensions of the linear model (e.g., basis functions, splines, local regression, generalized additive models).
Potential problems for LR
When we fit a linear regression model to a particular data set, the following problems can occur:
- Non-linearity of the response-predictor relationships.
- Correlation of error terms.
- Non-constant variance of error terms.
- Outliers.
- High-leverage points.
- Collinearity.
Detecting and solving these problems in practice is as much an art as a science. A brief summary of some key points:
ad 1. Non-linearity of the response-predictor relationships
A LR model assumes a linear (i.e., straight-line) relationship between predictors and response. If the true relationship is not linear, all conclusions drawn from a linear model are suspect.
Diagnosis:
A useful graphical tool for identifying non-linearity is the residual plot:
Re-creating Figure 3.9 (p. 93):
- Simple LR model: Plotting the residual errors \(e_i = y_i − \hat{y_i}\) as a function of the predictor \(x_i\):
# Data:
Auto <- ISLR2::Auto

# Variables:
hp  <- Auto$horsepower  # predictor
mpg <- Auto$mpg         # outcome

# (a) Linear model with one predictor (hp):
mlm_8a <- lm(mpg ~ hp)
# summary(mlm_8a)

# Scatterplot of residuals:
mpg_residuals <- mpg - predict(mlm_8a)
plot(x = hp, y = mpg_residuals,
main = "Residuals: mpg by horsepower", xlab = "horsepower", ylab = "mpg",
pch = 20, cex = 1, col = adjustcolor("black", .25))
abline(0, 0, lty = 2, lwd = .5)
# identify(x = hp, y = mpg_residuals, plot = TRUE)
# Note: Default residual plot in
# plot(mlm_8a)
- Polynomial LR model: Plotting residual errors \(e_i = y_i − \hat{y_i}\) vs. the predictor \(x_i\):
# (b) Polynomial LR model with one predictor and its square (as 2nd predictor):
hp_sq  <- hp^2  # squared predictor
mlm_8b <- lm(mpg ~ hp + hp_sq)
# summary(mlm_8b)

# Scatterplot of residuals:
mpg_residuals_2 <- mpg - predict(mlm_8b)
plot(x = hp, y = mpg_residuals_2,
main = "Residuals: mpg by horsepower + hp^2", xlab = "hp", ylab = "mpg",
pch = 20, cex = 1, col = adjustcolor("black", .25))
abline(0, 0, lty = 2, lwd = .5)
# identify(x = hp, y = mpg_residuals_2, plot = TRUE)
# Note: Default residual plot in
# plot(mlm_8b)
- Multiple LR model: Plotting the residual errors \(e_i = y_i − \hat{y_i}\) vs. the predicted (or fitted) values \(\hat{y_i}\).
Interpretation:
The presence of a pattern (e.g., U-shape) may indicate a problem with some aspect of the linear model.
If the linearity assumption is violated, a simple approach for improving a model can be to use a non-linear transformation of the predictor(s) (e.g., log(\(X\)), \(\sqrt{X}\), or \(X^2\)).
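A minimal sketch of such a transformation, re-using the hp and mpg variables from the Auto example above (mlm_8c is a helper name introduced here):
mlm_8c <- lm(mpg ~ log(hp))   # log-transformed predictor
summary(mlm_8c)$r.squared     # compare to the linear (0.606) and quadratic (0.688) fits above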
ad 2. Correlation of error terms
A LR model assumes that the error terms \(e_i = y_i − \hat{y_i}\) are uncorrelated. Violating this assumption results in unjustified confidence (e.g., confidence intervals that are narrower than they should be).
Examples:
If we accidentally used each of \(n\) data points twice (i.e., created a model that considered \(2n\) data points), our confidence intervals would be narrower by a factor of \(\sqrt{2}\), overstating the confidence actually warranted by the data.
Correlated error terms frequently occur in the context of time series data, which often exhibit positively correlated errors of adjacent data points.
Diagnosis:
Plot the residual errors from a LR model as a function of time (see Figure 3.10, p. 95):
Interpretation:
- When viewing residual errors as a function of time, we should not see any time-related trends or patterns. By contrast, tracking in the residuals (i.e., adjacent residuals having similar signs or values) suggest correlated error terms.
Examples beyond time series:
- Correlation between errors can occur when other relevant variables influence the relationship between predictor(s) and response variables. For instance, a study may aim to predict individuals’ height from their weight. If clusters in the data (e.g., families, communities, countries) contain individuals on similar diets, their error terms may be correlated. Avoiding such correlations is a critical aspect of (ensuring the internal validity of) experimental design.
ad 3. Non-constant variance of error terms
A LR model assumes that the error terms have a constant variance \(\text{Var}(\epsilon_i) = \sigma^2\) (a.k.a. homoscedasticity).
However, variances of the error terms may increase (or decrease) with the value of the response.
Diagnosis:
A funnel shape in the residual plot indicates non-constant variances in the errors (heteroscedasticity).
A concave transformation of the outcome/response variable \(Y\) (e.g., using \(\text{log}(Y)\) or \(\sqrt{Y}\)) can restore homoscedasticity (see Figure 3.11, p. 96):
- When the variance of response values \(Y_i\) is known, we can fit a LR model with weighted least squares, with weights proportional to the inverse variances. LR software usually allows for observation weights.
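A minimal sketch of weighted least squares with lm()'s weights argument, using simulated heteroscedastic data (not from the book; x, y, ols_fit, and wls_fit are helper names introduced here):
set.seed(2)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)    # error SD grows with x (heteroscedasticity)

ols_fit <- lm(y ~ x)                   # ordinary least squares (equal weights)
wls_fit <- lm(y ~ x, weights = 1/x^2)  # weights proportional to the inverse variances
summary(wls_fit)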
ad 4. Outliers
Definition: An outlier is a data point for which \(y_i\) differs substantially from the value \(\hat{y_i}\) predicted by the model (i.e., the value’s residual error is large). This can — but does not have to — indicate errors (e.g., in measuring or coding the original data) and should be investigated carefully.
Even when removing an outlier changes the model coefficients (and thus the best-fitting regression line) very little, doing so may dramatically improve the model fit (by reducing the \(RSE\) and increasing \(R^2\)).
Example of Figure 3.12 (p. 97):
Interpretation:
Left: The least squares regression line is shown in red, and the regression line after removing the outlier is shown in blue.
Center: The residual plot clearly identifies the outlier.
Right: The outlier has a studentized residual of \(+6\); typically we expect values between \(−3\) and \(+3\).
Diagnosis:
- Residual plots can be used to identify outliers. To standardize, we can plot studentized residuals that divide each residual error \(e_i\) by its estimated standard error. Observations whose studentized residuals are greater than \(|3|\) are possible outliers.
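A minimal sketch of this diagnostic in R (using the mlm_8a model from the Auto example above):
stud_res <- rstudent(mlm_8a)  # studentized residuals
which(abs(stud_res) > 3)      # observations flagged as possible outliers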
ad 5. High-leverage points
Definition: Whereas outliers are observations for which the response \(y_i\) is unusual given the predictor \(x_i\), observations with high leverage have an unusual value of the predictor \(x_i\) itself.
Example: Observation 41 in the left-hand panel of Figure 3.13 (p. 98) has high leverage, in that the predictor value for this observation is large relative to the other observations:
Interpretation:
Left: Observation 41 is a high leverage point, while 20 is (perhaps) an outlier. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed.
Center: The red observation is not unusual in terms of its \(X_1\) value or its \(X_2\) value, but still falls outside the bulk of the data, and hence has high leverage.
Right: Observation 41 has a high leverage and a high residual.
Overall, a few high leverage observations can have a sizable impact on the estimated regression line.
Diagnosis:
In order to quantify an observation’s leverage, we compute the leverage statistic: If a given observation has a leverage statistic that greatly exceeds \((p + 1)/n\), then we may suspect that the corresponding point has high leverage.
In Figure 3.13 (p. 98), observation 41 stands out as having a very high leverage statistic as well as a high studentized residual. In other words, it is an outlier as well as a high leverage observation. By contrast, observation 20 has low leverage (i.e., not much influence on the model/line of best fit).
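Computing the leverage statistic in R is straightforward (a sketch, using the mlm_8a model from the Auto example above; lev is a helper name introduced here):
lev <- hatvalues(mlm_8a)            # leverage h_i of each observation
p <- 1                              # number of predictors in mlm_8a
n <- length(mpg)
(p + 1)/n                           # average leverage
head(sort(lev, decreasing = TRUE))  # observations with the largest leverage values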
ad 6. Collinearity
Definition: Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
When two or more predictors are collinear (i.e., related to each other), it is difficult or impossible to separate their individual effects on the response. In other words, when predictors tend to increase or decrease together, their \(\beta\)-weights can fluctuate widely, as we cannot separately measure their relation to the response.
Example: Two scatterplots for predictors in the Credit dataset (without and with collinearity between predictors):
Interpretation:
- Left: A plot of age versus limit. These two variables are not collinear.
- Right: A plot of rating versus limit. There is high collinearity.
Example: Two contour plots of RSS (without and with collinearity between predictors):
Interpretation:
- Left: A contour plot of RSS for the regression of balance onto age and limit. In the absence of collinearity between predictors, the minimum value is well defined.
- Right: A contour plot of RSS for the regression of balance onto rating and limit. In the presence of collinearity between predictors, there are many pairs (\(\beta_\text{Limit}\), \(\beta_\text{Rating}\)) with a similar RSS value.
Consequences of collinearity:
Collinearity yields a great deal of uncertainty in the coefficient estimates.
The standard error of coefficient estimates \(\hat{\beta_j}\) grows, resulting in a declining \(t\)-statistic, and lower power (for rejecting \(H_0\)).
Example: See Table 3.11 (p. 102) for two models without vs. with collinearity.
Diagnosis:
To detect collinearity between pairs of predictors, inspect the correlation matrix of the predictors.
To detect multicollinearity (i.e., relationships between three or more predictors), compute the variance inflation factor (VIF) for each predictor. VIF values are \(\geq 1\): VIF\((\hat{\beta_j}) \geq 5\) suggest problematic amounts of collinearity.
Remedy:
Drop one or more problematic predictors from the model.
Combine collinear variables into a single predictor (e.g., the average of standardized versions of the individual predictors).
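To make the VIF diagnostic concrete, it can be computed by hand as \(1/(1 - R^2)\) from regressing one predictor on the remaining predictors (a sketch for the Credit predictors discussed above, assuming the cred data is loaded; r2_limit is a helper name introduced here):
r2_limit <- summary(lm(Limit ~ Age + Rating, data = cred))$r.squared
1 / (1 - r2_limit)   # VIF for Limit; values >= 5 suggest problematic collinearity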
3.4 The marketing plan
This chapter started by introducing the Advertising data. After our overview of LR, we can now address the questions (asked on p. 59f.):
- Is there a relationship between predictors (advertising budgets) and outcome (sales)?
Fit a multiple LR model (without interactions):
# Data:
dim(ad)
## [1] 200 5
head(ad)
## # A tibble: 6 × 5
## X TV radio newspaper sales
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 230. 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 152. 41.3 58.5 18.5
## 5 5 181. 10.8 58.4 12.9
## 6 6 8.7 48.9 75 7.2
m_1 <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(m_1)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Interpretation: Table 3.6 (p. 76): There clearly is a relation between advertising budget (mostly that of TV and radio) and sales.
- How strong is the relationship?
Summary of m_1 and Table 3.6 (p. 76): Quite strong (e.g., \(RSE = 1.69\), \(R^2 = 89.6\)%, \(F = 570\)).
- Which predictors matter? What are their individual contributions on the overall outcome?
Summary of m_1 and Table 3.4 (p. 74): The \(p\)-values suggest that only TV and radio are related to sales (see Chapter 6 for details).
- How strong is the relationship for each predictor (with or without considering the others)?
Summary of m_1 and Table 3.4 (p. 74) can be used to compute confidence intervals for the predictors: The confidence intervals for the coefficients of TV and radio exclude zero, whereas that of newspaper includes zero (i.e., its coefficient is not reliably different from zero).
Is there collinearity between predictors?
- Re-creating the correlation matrix of Table 3.5 (p. 75):
# Correlation matrix (for predictors):
pr <- ad[ , c(-1, -5)]
round(cor(x = pr), 2)
## TV radio newspaper
## TV 1.00 0.05 0.06
## radio 0.05 1.00 0.35
## newspaper 0.06 0.35 1.00
Only low to moderate correlation between predictors.
- Check the variance inflation factor (VIF) for each predictor:
# install.packages("car")  # if car has not been installed yet
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(m_1)
## TV radio newspaper
## 1.004611 1.144952 1.145187
No evidence for collinearity (as VIF values are close to 1, not exceeding 5).
- Prediction: How accurately can we predict future outcomes?
We can use the model m_1 to predict (or estimate) response values. The accuracy of such estimates depends on whether we wish to predict an individual response \(Y = f(X) + \epsilon\), or the average response \(f(X)\) (see Question 4 of Section 3.2.2, p. 81f.). If the former, we use a prediction interval; if the latter, we use a confidence interval. Prediction intervals are always wider than confidence intervals because they account for the uncertainty associated with the irreducible error \(\epsilon\).
- Inference: What is the form of the relationship (e.g., linear or non-linear)?
We can use some diagnostic plots:
opar <- par(mfrow = c(2, 2))  # save previous par() settings (restored below)
plot(m_1, pch = 20, col = adjustcolor("black", .25))
par(opar) # restore original par
Alternatively, we can plot the predicted values of m_1 against its residuals, as returned by the residuals() function:
plot(predict(m_1), residuals(m_1),
main = "Residual plot", pch = 20, col = adjustcolor("black", .25))
The function rstudent() returns the studentized residuals, which we can also plot against the fitted values:
plot(predict(m_1), rstudent(m_1),
main = "Studentized residual plot", pch = 20, col = adjustcolor("black", .25))
Interpretation: On the basis of the residual plots, there is some evidence of non-linearity. This could be addressed by transforming the predictors in the LR model in order to accommodate non-linear relationships (see Section 3.3.2).
Leverage statistics can be computed for any number of predictors using the hatvalues() function:
# Leverage statistics:
plot(hatvalues(m_1),
     main = "Leverage", pch = 20, col = adjustcolor("black", .25))

# Point with max leverage:
(max_lev <- which.max(hatvalues(m_1)))
## 17
## 17

# View observation (row) in data:
ad[max_lev, ]
## # A tibble: 1 × 5
##       X    TV radio newspaper sales
##   <int> <dbl> <dbl>     <dbl> <dbl>
## 1    17  67.8  36.6       114  12.5
- Interactions: Are there synergy effects (interactions)?
The standard LR model assumes an additive relation between predictors and response. An additive model is easy to interpret because the association between each predictor and the response is unrelated to the values of the other predictors. However, the additive assumption may be unrealistic for many datasets.
We just concluded that the Advertising data may be non-additive (see also Figure 3.5). In Section 3.3.2, we saw that including an interaction term in the regression model can accommodate non-additive relationships:
# Data:
dim(ad)
## [1] 200 5
head(ad)
## # A tibble: 6 × 5
## X TV radio newspaper sales
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 230. 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 152. 41.3 58.5 18.5
## 5 5 181. 10.8 58.4 12.9
## 6 6 8.7 48.9 75 7.2
# Alternative model: 2 predictors and their interaction:
m_2 <- lm(sales ~ TV + radio + TV:radio, data = ad)
m_2 <- lm(sales ~ TV * radio, data = ad)  # shorter version of same model
summary(m_2)
##
## Call:
## lm(formula = sales ~ TV * radio, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
Interpretation:
The small \(p\)-value associated with the interaction term indicates the presence of a non-additive relationship.
Including an interaction term in the model results in a substantial increase in \(R^2\), from around \(90\)% to about \(96.8\)%.
3.5 Comparison of linear regression with K-nearest neighbors
LR is an example of a parametric approach that assumes a linear functional form for \(f(X)\).
Advantages:
- easy to fit, few parameters
- easy to test for statistical significance
- easy to interpret (in LR)
Disadvantage:
- strong assumptions about the linear form of \(f(X)\)
By contrast, non-parametric methods do not explicitly assume a parametric form for \(f(X)\), and thereby provide a more flexible alternative to LR.
K-nearest neighbors regression (KNN regression):
Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\).
Interpretation:
Values for \(K\) in KNN regression determine the bias-variance tradeoff (see Chapter 2):
Left: \(K = 1\) results in a rough step function fit (perfect training fit, low bias, high variance)
Right: \(K = 9\) produces a much smoother fit (less perfect training fit, higher bias, lower variance)
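A minimal base-R sketch of KNN regression for a single predictor (the helper knn_reg() is made up here for illustration; it is not a function from the book or from any package):
# KNN regression estimate of f(x0) from training data (x_train, y_train):
knn_reg <- function(x0, x_train, y_train, k = 9) {
  d  <- abs(x_train - x0)   # distances to the prediction point x0
  nn <- order(d)[1:k]       # indices of the K nearest neighbors (N_0)
  mean(y_train[nn])         # average of their responses estimates f(x0)
}
# Example: predict sales for a TV budget of 100 with K = 1 vs. K = 9:
knn_reg(100, ad$TV, ad$sales, k = 1)
knn_reg(100, ad$TV, ad$sales, k = 9)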
Choosing LR vs. KNN regression
Which approach is better mostly depends on the true functional form of \(f(X)\). The closer it is to linear, the better (and easier) it is to fit an LR model. However, even for non-linear data, KNN may be inferior when it is prone to fitting noise.
Figures 3.17 to 3.19 (p. 108f.) compare LR vs. KNN regression for different functional forms \(f(X)\). Overall, LR is superior for linear and near-linear data, whereas KNN regression has clear advantages for highly non-linear data.
- Curse of dimensionality: When the number of observations relative to that of predictors is small, parametric methods will tend to outperform non-parametric approaches.
Even when the number of predictors is small, we might prefer LR to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of LR, we are willing to forego a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which \(p\)-values are available.
Interpretation:
Test \(MSE\) for LR (black dashed lines) and KNN (green curves) as the number of variables \(p\) increases. The true function is non–linear in the first variable (as in the lower panel in Figure 3.19), and does not depend on the additional variables.
LR performance deteriorates slowly in the presence of additional noise variables (i.e., is quite robust).
KNN regression performance degrades much more quickly as the value of \(p\) increases (i.e., is more sensitive to noise).
3.6 Lab: Linear regression
R packages / libraries
In R, the library()
function is used to load libraries (aka. packages, i.e., groups of functions and data sets) that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with any R distribution, but more exotic functions require additional libraries.
Loading packages
Here we load the MASS package, which is a very large collection of data sets and functions, and the ISLR2 package, which includes the data sets associated with this book:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
library(ISLR2)
If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer.
Installing packages
However, other packages, such as ISLR2, must be downloaded the first time they are used. This can be done directly from within R (using the buttons/menus of an IDE or the install.packages()
function). For example, on a Windows system, select the Install package
option under the Packages
tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR2")
.
An installation only needs to be done the first time you use a package. However, the library()
function must be called within each R session (to load the package).
Simple linear regression
The ISLR2 library contains the Boston
data set, which records medv
(median house value) for \(506\) census tracts in Boston. We will seek to predict medv
using \(12\) predictors such as
- rm (average number of rooms per house),
- age (average age of houses), and
- lstat (percent of households with low socioeconomic status).
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
To find out more about the data set, we can type ?Boston
.
We will start by using the lm()
function to fit a simple linear regression (simple LR) model, with medv
as the response and lstat
as the predictor. The basic syntax is lm(y ~ x, data = data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept:
lm.fit <- lm(medv ~ lstat)
## Error in eval(predvars, data, env): object 'medv' not found
The command causes an error because R does not know where to find the variables medv
and lstat
. The next line tells R that the variables refer to the Boston
data. If we attach Boston
, the first line works fine because R now recognizes the variables:
# 1. tell lm() to look for variables in data:
lm.fit <- lm(medv ~ lstat, data = Boston)

# 2. attach data to environment to make variables accessible:
attach(Boston)
lm.fit <- lm(medv ~ lstat)
If we type lm.fit
, some basic information about the model is printed. For more detailed information, we use summary(lm.fit)
. This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the model:
lm.fit
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We can use the names()
function in order to find out what other pieces of information are stored in lm.fit
. Although we can extract these quantities by name — e.g., lm.fit$coefficients
— it is safer to use the extractor functions like coef()
to access them:
names(lm.fit) # show parts (of list object)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit) # extract parts
## (Intercept) lstat
## 34.5538409 -0.9500494
In order to obtain a confidence interval for the coefficient estimates, we can use the confint()
command. (Type confint(lm.fit)
at the command line to obtain the confidence intervals.)
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
Predictions with confidence vs. prediction intervals
The predict()
function can be used to produce confidence intervals and prediction intervals for the prediction of medv
for a given value of lstat
:
# values of interest:
my_vals <- c(5, 10, 15)
# predictions with confidence intervals:
predict(lm.fit, data.frame(lstat = my_vals),
interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
# predictions with prediction intervals:
predict(lm.fit, data.frame(lstat = my_vals),
interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
For instance, the 95% confidence interval associated with a lstat
value of 10 is \((24.47, 25.63)\), and the 95% prediction interval is \((12.83, 37.28)\). As expected, the confidence and prediction intervals are centered around the same value (a predicted value of \(25.05\) for medv
when lstat
equals 10), but the prediction intervals are substantially wider.
Visualizations
We will now plot medv
and lstat
along with the least squares regression line using the plot()
and abline()
functions:
plot(x = lstat, y = medv,
main = "Scatterplot of response by predictor",
pch = 20, col = adjustcolor("black", .33))
abline(lm.fit, lwd = 2, col = "deepskyblue")
There is some evidence for non-linearity in the relationship between lstat
and medv
(see below).
Plotting lines
The abline()
function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a
and slope b
, we type abline(a, b)
. Below we experiment with some additional settings for plotting lines and points. The lwd = 3
argument causes the width of the regression line to be increased by a factor of 3; this also works for the plot()
and lines()
functions. We can also use the pch
option to create different plotting symbols:
# Basic plot:
plot(lstat, medv)
# Lines:
# abline(lm.fit, lwd = 3)
abline(lm.fit, lwd = 3, col = "deeppink")
# Point symbols and colors:
plot(lstat, medv,
pch = 20, col = adjustcolor("darkblue", alpha.f = .33))
# plot(lstat, medv, pch = 20)
plot(lstat, medv, pch = "+", col = adjustcolor("deeppink", alpha.f = .67))
# Note: Different pch values:
# plot(1:20, 1:20, pch = 1:20)
Diagnostic plots
Next we examine some diagnostic plots, several of which were discussed in Section 3.3.3. Four diagnostic plots are automatically produced by applying the plot()
function directly to the output object obtained by lm()
. This command will normally produce one plot at a time, and hitting Enter will generate the next plot. To view all four plots together, we can use the par()
and mfrow()
functions, which instruct R to split the display screen into separate panels, so that multiple plots can be viewed simultaneously. For example,par(mfrow = c(2, 2))
divides the plotting region into a \(2 \times 2\) grid of panels:
par(mfrow = c(2, 2))
# 4 standard diagnostic plots:
plot(lm.fit,
pch = 20, col = adjustcolor("black", .33))
par(opar) # restore original par
Residual plots
Alternatively, we can compute the residuals from a linear regression fit using the residuals()
function. The function rstudent()
will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
# Residual plot:
plot(x = predict(lm.fit), y = residuals(lm.fit),
main = "Residual plot",
pch = 20, col = adjustcolor("black", .33))
# Studentized residual plot:
plot(x = predict(lm.fit), y = rstudent(lm.fit),
main = "Studentized residuals",
pch = 20, col = adjustcolor("black", .33))
On the basis of the residual plots, there is some evidence of non-linearity.
Leverage statistics
Leverage statistics can be computed for any number of predictors using the hatvalues()
function:
# Compute leverage statistics:
hatvalues(lm.fit)[1:6] # first 6 values
## 1 2 3 4 5 6
## 0.004262518 0.002455527 0.004863680 0.005639779 0.004058706 0.004127513
# Plotting leverage for all values:
plot(hatvalues(lm.fit),
main = "Leverage diagnostics",
pch = 20, col = adjustcolor("black", .33))
# Identify max value:
which.max(hatvalues(lm.fit))
## 375
## 375
The which.max()
function identifies the index of the largest element of a vector. In this case, it tells us which observation has the maximum leverage statistic.
Multiple linear regression
In order to fit a multiple linear regression (multiple LR) model using least squares, we can use the same lm()
function. The syntax lm(y ~ x1 + x2 + x3)
is used to fit a model with three predictors, x1
, x2
, and x3
(as main effects). The summary()
function now outputs the regression coefficients for all the predictors:
# Multiple LR (with 2 predictors):
lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
The Boston
data set contains 12 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the dot (.
) in the following short-hand version:
# Include all main effects:
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can access the individual components of a summary object by name (type ?summary.lm
to see what is available). Hence summary(lm.fit)$r.sq
gives us the \(R^2\), and summary(lm.fit)$sigma
gives us the \(RSE\).
# Access individual components:
# ?summary.lm # Help on lm summary methods
summary(lm.fit)$r.sq # R^2
## [1] 0.7406427
summary(lm.fit)$adj.r.sq # adj. R^2
## [1] 0.7337897
summary(lm.fit)$sigma # RSE
## [1] 4.745298
Checking for collinearity
To check for potential collinearity between predictors, the vif()
function of the car package can be used to compute variance inflation factors.1 Most VIFs are low to moderate for these data:
# install.packages("car")  # if car has not been installed yet
library(car)
vif(lm.fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
Excluding variables and updating models
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age
has a high \(p\)-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age
:
lm.fit_2 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit_2)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Alternatively, the update()
function can be used on an existing model fit:
lm.fit_3 <- update(lm.fit, ~ . - age)
summary(lm.fit_3)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Interaction terms
It is easy to include interaction terms in a linear model using the lm()
function:
- The syntax lstat:black tells R to include an interaction term between lstat and black.
- The syntax lstat * age simultaneously includes terms for the main effects lstat and age, and the interaction term lstat × age as predictors; it is a shorthand for lstat + age + lstat:age.
# including only the interaction term:
summary(lm(medv ~ lstat:age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat:age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.347 -4.372 -1.534 1.914 27.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1588631 0.4828240 62.46 <2e-16 ***
## lstat:age -0.0077146 0.0003799 -20.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.827 on 504 degrees of freedom
## Multiple R-squared: 0.4501, Adjusted R-squared: 0.449
## F-statistic: 412.4 on 1 and 504 DF, p-value: < 2.2e-16
# including main effects and interaction:
summary(lm(medv ~ lstat * age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
Interpreting the significant interaction — here between two quantitative variables — can be challenging. (See Gelman & Hill, 2007, Sections 3.3 and 4.2.)
We can also pass in transformed versions of the predictors…
Non-linear transformations of the predictors
The lm()
function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a 2nd predictor \(X^2\) using I(X^2)
.
Note: The function I() is needed since the ^ symbol has a special meaning in an R formula object; wrapping the expression as I(lstat^2) inhibits this formula interpretation and allows the standard arithmetic usage in R, which is to raise lstat to the power 2.
We now perform a regression of medv
onto lstat
and lstat^2
:
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
Interpretation:
The near-zero \(p\)-value associated with the quadratic term suggests that it leads to an improved model.
Comparing models
We can use the anova()
function to quantify the extent to which the quadratic fit is superior to the linear fit:
lm.fit1 <- lm(medv ~ lstat)   # simple LR model
anova(lm.fit1, lm.fit2)       # comparing 2 models
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here Model 1 represents the linear submodel containing only one predictor, lstat
, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat
and lstat^2
. The anova()
function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the \(F\)-statistic is \(135\) and the associated \(p\)-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat
and lstat^2
is superior to the model that only contains the predictor lstat
. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv
and lstat
.
Diagnostic plots
To show diagnostic plots for a model lm.fit2
, we type:
opar <- par(no.readonly = TRUE)  # store original par
par(mfrow = c(2, 2))             # 2 rows and 2 cols
plot(lm.fit2,
pch = 20, col = adjustcolor("black", .25))
par(opar) # restore original par
Interpretation:
- When the
lstat^2
term is included in the model, there is little discernible pattern in the residuals.
Higher-order polynomial models
In order to create a cubic fit, we could include a predictor of the form I(X^3)
. However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly()
function to create the polynomial within lm()
. For example, the following command produces a fifth-order polynomial fit:
lm.fit5 <- lm(medv ~ poly(lstat, 5))
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.
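As a quick check of that claim (not part of the book's output; the model name lm.fit6 is made up here), one could fit a sixth-order polynomial and inspect the coefficient table for the highest-order term:
# Illustration only: is a 6th-order term still significant?
lm.fit6 <- lm(medv ~ poly(lstat, 6))
coef(summary(lm.fit6))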
By default, the poly()
function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly()
function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and \(p\)-values will differ). In order to obtain the raw polynomials from the poly()
function, the argument raw = TRUE
must be used.
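A quick way to verify this claim (assuming lm.fit5 from above; the name lm.fit5_raw is made up here):
# Refit with raw (non-orthogonal) polynomials:
lm.fit5_raw <- lm(medv ~ poly(lstat, 5, raw = TRUE))
# Coefficients differ, but the fitted values agree (up to numerical error):
all.equal(fitted(lm.fit5), fitted(lm.fit5_raw))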
Other transformations
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation:
summary(lm(medv ~ log(rm), data = Boston))
##
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.487 -2.875 -0.104 2.837 39.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.488 5.028 -15.21 <2e-16 ***
## log(rm) 54.055 2.739 19.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
## F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
(For details on linear and non-linear transformations, see Gelman & Hill, 2007, Chapter 4.)
Qualitative predictors
We will now examine the Carseats
data, which is part of the ISLR2 package. We will attempt to predict Sales
(child car seat sales) in \(400\) locations based on a number of predictors.
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
is.factor(Carseats$ShelveLoc)
## [1] TRUE
table(Carseats$ShelveLoc)
##
## Bad Good Medium
## 96 85 219
The Carseats
data includes qualitative predictors such as shelveloc
, an indicator of the quality of the shelving location — that is, the space within a store in which the car seat is displayed — at each location. The predictor shelveloc
takes on three possible values: Bad, Medium, and Good.
Given a qualitative variable such as shelveloc
(stored as a factor variable in R), R automatically generates dummy variables.
Below we fit a multiple regression model that includes some interaction terms.
# a qualitative variable (factor):
is.factor(Carseats$ShelveLoc)
## [1] TRUE
table(Carseats$ShelveLoc)
##
## Bad Good Medium
## 96 85 219
# Simple LR (using only the ShelveLoc factor as single predictor):
lm_1 <- lm(Sales ~ ShelveLoc, data = Carseats)
summary(lm_1)
##
## Call:
## lm(formula = Sales ~ ShelveLoc, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3066 -1.6282 -0.0416 1.5666 6.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5229 0.2388 23.131 < 2e-16 ***
## ShelveLocGood 4.6911 0.3484 13.464 < 2e-16 ***
## ShelveLocMedium 1.7837 0.2864 6.229 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared: 0.3172, Adjusted R-squared: 0.3138
## F-statistic: 92.23 on 2 and 397 DF, p-value: < 2.2e-16
# Multiple LR (including 2 interactions):
lm_2 <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm_2)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
Defining contrasts
The contrasts()
function returns the coding that R uses for the dummy variables:
# attach(Carseats)
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Use ?contrasts
to learn about other contrasts, and how to set them.
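For instance, sum-to-zero (deviation) coding could be set as follows; this is only an illustration (the lab keeps R's default treatment contrasts), and the copy cs_tmp is a made-up name:
cs_tmp <- Carseats                          # work on a copy to leave the data unchanged
contrasts(cs_tmp$ShelveLoc) <- contr.sum(3)
contrasts(cs_tmp$ShelveLoc)                 # inspect the new coding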
R has created a ShelveLocGood
dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium
dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables.
Interpretation: The fact that the coefficient for ShelveLocGood
in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium
has a smaller positive coefficient, indicating that a medium shelving location is associated with higher sales than a bad shelving location, but lower sales than a good shelving location.
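As a quick sanity check (assuming the factor-only model lm_1 fitted above), the group means of Sales reproduce the intercept (Bad) plus the corresponding dummy coefficients:
# Group means of Sales by shelving location:
tapply(Carseats$Sales, Carseats$ShelveLoc, mean)
# The same values from the lm_1 coefficients (Bad, Good, Medium):
unname(coef(lm_1)[1] + c(0, coef(lm_1)[2:3]))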
Writing R functions
As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function that reads in the ISLR2 and MASS libraries, called LoadLibraries()
.
Before we have created the function, R returns an error if we try to call it:
LoadLibraries
## Error in eval(expr, envir, enclos): object 'LoadLibraries' not found
LoadLibraries()
## Error in LoadLibraries(): could not find function "LoadLibraries"
We now create a new function. The body of a function is enclosed in curly brackets {}
. The {
symbol informs R that multiple commands are about to be entered. Hitting Enter after typing {
will cause R to print the +
symbol. We can then input as many commands as we wish, hitting Enter after each one. Finally the }
symbol informs R that no further commands will be entered.
LoadLibraries <- function() {
  library(ISLR2)
  library(MASS)
  print("The libraries have been loaded.")
}
Now if we type in LoadLibraries
(without parentheses), R will tell us what is in the function:
LoadLibraries
## function() {
## library(ISLR2)
## library(MASS)
## print("The libraries have been loaded.")
## }
If we call the function, the libraries are loaded in and the print statement is output:
LoadLibraries()
## [1] "The libraries have been loaded."
Note that R functions can also take and use arguments (both mandatory and optional). The names of such arguments would have to be supplied in the parentheses ()
after the function name (separated by commas).
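A minimal sketch of a function with one mandatory and one optional argument (the name add_pct and its behavior are made up for illustration):
# Increase x by pct percent (default: 10):
add_pct <- function(x, pct = 10) {
  x * (1 + pct / 100)
}
add_pct(200)            # uses the default pct = 10, returns 220
add_pct(200, pct = 50)  # returns 300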
3.7 Exercises
Conceptual
+++ here now +++
Applied
Links
Website on James et al. (2021): https://www.statlearning.com/
R packages ISLR at https://CRAN.R-project.org/package=ISLR and ISLR2 at https://CRAN.R-project.org/package=ISLR2
Data, figures, slides, and RMarkdown labs at https://www.statlearning.com/
[Last update of ISL_ch03.Rmd on 2022-06-27 by hn.]
References
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Retrieved from http://www.stat.columbia.edu/~gelman/arm/
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd edition). Retrieved from https://www.statlearning.com/
The car package is not part of the base R installation so it must be downloaded the first time you use it via the
install.packages()
function in R.↩︎