Chapter 3 of James, Witten, Hastie, & Tibshirani (2021). See https://www.statlearning.com/ for data and resources.
3 Linear Regression
Linear regression (LR) is a useful tool for predicting a quantitative response and a simple type of supervised learning. Understanding LR remains important, as newer methods of statistical learning can be seen as generalizations or extensions of LR.
Example
Reconsidering the Advertising data from Chapter 2:
library(ISLR2)

# Data:
# https://www.statlearning.com/resources-second-edition
data_url <- "https://www.statlearning.com/s/Advertising.csv"

wd <- getwd()  # current working directory (assumed: defined earlier in the original document)

if (wd == "/Users/hneth/Desktop/stuff/Dropbox/GitHub/predict/R"){ # in subdir R:
  data_file <- "./../data/_JamesEtAl2021_ISL2/data/Advertising.csv"  # up 1 level
} else {
  data_file <- "./data/_JamesEtAl2021_ISL2/data/Advertising.csv"
}

ad <- tibble::as_tibble(read.csv(data_file))
dim(ad)
## [1] 200 5
Figure 2.1 (p. 16) displayed sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media:
Recreating Figure 2.1:
opar <- par(mfrow = c(1, 3))  # save previous par() settings (restored below)
plot(ad$TV, ad$sales, main = "(a) sales by TV",
pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ TV, data = ad), lwd = 2, col = "deepskyblue")
plot(ad$radio, ad$sales, main = "(b) sales by radio",
pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ radio, data = ad), lwd = 2, col = "deeppink")
plot(ad$newspaper, ad$sales, main = "(c) sales by newspaper",
pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ newspaper, data = ad), lwd = 2, col = "gold")
par(opar) # restore original par
For an output variable \(Y\) and a set of \(p\) predictor variables \(X = (X_1,\ X_2,\ \ldots,\ X_p)\) we assume:
\[Y = f(X) + \epsilon\]
Here, \(f\) represents the systematic information, whereas \(\epsilon\) represents random error (a.k.a. noise), which is independent of \(X\) and has a mean of zero.
In simple LR, we assume that the function \(f\) can be approximated by a line (with two parameters, i.e., \(\beta_0 + \beta_1 X\)). Thus,
\[Y = \beta_0 + \beta_1 X + \epsilon\]
The error or noise term \(\epsilon\) is assumed to be independent of the predictor \(X\) and contains everything that we miss with the linear model \(f(X)\). For instance,
- the true relationship between \(X\) and \(Y\) may be non-linear,
- other variables beyond \(X\) may contribute to the variation of \(Y\), and
- our measurements may contain errors.
Questions
The LR paradigm allows addressing the following (types of) questions:
- Is there a relationship between predictors (advertising budgets) and outcome (sales)?
- How strong is the relationship?
- Which predictors matter? What are their individual contributions to the overall outcome?
- How strong is the relationship for each predictor (with or without considering the others)?
- Prediction: How accurately can we predict future outcomes?
- Inference: What is the form of the relationship (e.g., linear or non-linear)?
- Interactions: Are there synergy effects (interactions)?
These questions will be addressed for the Advertising data below (see Section 3.4).
3.1 Simple linear regression
Simple LR assumes a single predictor variable \(X\) and an approximately linear relationship between \(X\) and an outcome variable \(Y\):
\[Y \approx \beta_0 + \beta_1 X\]
When plotting outcome values \(Y\) as a function of predictor values \(X\), simple LR estimates two parameters (or coefficients) of a straight line: its intercept \(\beta_0\) (for \(X = 0\)) and its slope \(\beta_1\).
Estimating the coefficients
Varying the parameters for intercept and slope (i.e., \(\beta_0\) and \(\beta_1\)) yields different lines. Each line provides a specific model that predicts a particular outcome value \(\hat{Y}\) for any predictor value \(X\). (The hat symbols signal that we are estimating unknown population values from observed samples.) Determining how closely our predictions fit the data requires that we quantify the deviation between observed and predicted data. The criterion for measuring closeness or quantifying the degree of fit used here is least squares: We select \(\beta_0\) and \(\beta_1\) so that the residual sum of squares (RSS) of our predictions is minimized (the RSS is closely related to the mean squared error, MSE = RSS\(/n\)).
See Equation 3.4 (James et al., 2021, p. 62) for the least squares coefficient estimates for simple LR.
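The least squares estimates from Equation 3.4 are easy to verify directly (a minimal sketch in R, assuming the ad data loaded above; x, y, b0, and b1 are helper names introduced here):
x <- ad$TV
y <- ad$sales
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate (Eq. 3.4)
b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate (Eq. 3.4)
c(b0, b1)  # should match coef(lm(sales ~ TV, data = ad))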
Figure 3.1 (p. 62) shows the least squares fit for the regression of sales onto TV for the Advertising data. The fit is found by minimizing the residual sum of squares.
Recreating a version of Figure 3.1:
# Scatterplot: ----
plot(x = ad$TV, y = ad$sales,
main = "Sales by TV advertising budget", xlab = "Budget (in units of $1,000)", ylab = "Sales (in units of 1,000)",
ylim = c(0, 30),
pch = 20, cex = 1.5, col = adjustcolor("black", .50))
# Simple linear regression model: ----
slm_1 <- lm(sales ~ TV, data = ad)
abline(slm_1, lwd = 2, col = "deepskyblue")
# Summary: ----
summary(slm_1)
##
## Call:
## lm(formula = sales ~ TV, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
# Predictions: ----
y_hat <- predict(slm_1)
segments(x0 = ad$TV, y0 = ad$sales, y1 = y_hat,
         lwd = .8, col = "deeppink")
Assessing the accuracy of the coefficient estimates
In statistics, we are usually interested in population values (e.g., \(\mu\)), but only have access to samples (e.g., \(\hat{\mu}\)). Thus, we need to estimate the population parameters from samples.
In the context of simple LR, we are interested in the coefficients \(\beta_0\) and \(\beta_1\) of the population regression line. However, we need to estimate these unknown coefficients from our sample(s) using \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
True relation vs. least-square regression line
Figure 3.3 (p. 64) compares the true relationship between \(X\) and \(Y\) with the estimated coefficients of simple LR for a simulated data set:
Left: The red line represents the true relationship, \(Y = f(X) = 2 + 3X\), which is known as the population regression line. The blue line is the least squares line; it is the least squares estimate for \(f(X)\) based on the observed data, shown in black.
Right: The population regression line is again shown in red, and the least squares line in dark blue. In light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations. Each least squares line is different, but on average, the least squares lines are quite close to the population regression line.
Just like many sample means \(\hat{\mu}\) provide unbiased estimates of the true population mean \(\mu\), the estimated coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\) provide unbiased estimates of the true regression line in the population. When drawing many samples, the average of the estimates approximates the true population values.
Measures of deviation
To quantify the deviation between population values and sample-based estimates, we can compute standard errors (SE), residual standard errors (RSE), and confidence intervals (CI).
For simple LR, the 95%-confidence intervals for the population (or true) coefficients (\(\beta_0\) and \(\beta_1\)) can be computed from the estimated (or sample-based) coefficients (\(\hat{\beta_0}\) and \(\hat{\beta_1}\)) and their corresponding standard errors as:
\[ \hat{\beta_0} \pm 2 \cdot SE(\hat{\beta_0}) \] \[ \hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1}) \]
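In R, such intervals can be obtained from the coefficient table or, based on the exact \(t\)-distribution, via confint() (a sketch, assuming the slm_1 model fitted above; est and se are helper names introduced here):
# Approximate 95%-CIs from estimate +/- 2 * SE:
est <- coef(summary(slm_1))[ , "Estimate"]
se  <- coef(summary(slm_1))[ , "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)

# Exact 95%-CIs (based on the t-distribution):
confint(slm_1, level = 0.95)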
Hypothesis testing
Endowed with estimated coefficients and standard errors, we can perform hypothesis tests on the coefficients:
- \(H_a\): There is a relationship between \(X\) and \(Y\): \(\beta_1 \neq 0\)
- \(H_0\): There is no relationship between \(X\) and \(Y\): \(\beta_1 = 0\)
Using a \(t\)-distribution (to assess the likelihood of an estimated coefficient’s deviation from 0) yields a \(p\)-value: the probability of observing an effect at least this extreme if only chance were at work (i.e., \(P(\text{data} \mid H_0)\)). Note that this is not the probability that \(H_a\) is true!
For our model slm_1 (from above):
# Simple linear regression model: ----
slm_1 <- lm(sales ~ TV, data = ad)

# Summary of the slm_1 model (above):
summary(slm_1)
##
## Call:
## lm(formula = sales ~ TV, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
As the SE values of both coefficients are very small relative to their estimates, the estimated (or sample-based) coefficients (\(\hat{\beta_0}\) and \(\hat{\beta_1}\)) clearly deviate from 0.
This is tested by the \(t\)-statistic (and corresponding \(p\)-values).
Rejecting \(H_0\) for both \(\beta_0\) and \(\beta_1\) implies:
- \(\beta_0 \neq 0\): There are sales in the absence of TV advertising.
- \(\beta_1 \neq 0\): There is a relationship between TV advertising and sales.
Assessing the accuracy of the model
We want to quantify the extent to which a model fits the data. The quality of fit for a LR model is typically assessed using two related quantities: the residual standard error (RSE) and the \(R^2\) statistic.
Residual standard error (RSE)
The residual standard error (RSE) estimates the standard deviation of the irreducible error \(\epsilon\): It quantifies the extent to which our predictions would miss on average even if we knew the true LR line (i.e., the population values \(\beta_0\) and \(\beta_1\)). The RSE is based on the residual sum of squares (RSS, see the formulas on p. 69):
\[\text{RSS} = \sum_{i=1}^{n}{(y_i - \hat{y_i})^2}\]
and computed as:
\[\text{RSE} = \sqrt{\frac{1}{n-2} \cdot \text{RSS}}\]
As RSE quantifies the lack of fit of a model in units of \(Y\), lower values are better (ideally, \(RSE \rightarrow 0\)). In our above example model slm_1, \(RSE = 3.26\) and \(Y\) represents sales in units of 1,000. Thus, even if our model were perfectly correct, our predictions would deviate from the true values by approximately 3,260 units, on average.
Proportion of variance explained (\(R^2\))
Whereas RSE provides an absolute measure of model fit (in units of the criterion \(Y\)), the \(R^2\) statistic provides an alternative, relative measure of fit: the proportion of non-erroneous squared deviations, or the proportion of variance explained:
\[R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}}\] with
\[\text{TSS} = \sum_{i=1}^{n}{(y_i - \bar{y})^2}\]
(Note that TSS is a special case of RSS, with the mean \(\bar{y}\) as the only predicted value.)
In contrast to error measures, \(R^2\) ranges from \(0\) to \(1\), with higher values indicating better model fits (ideally, \(R^2 \rightarrow 1\)).
Given a model, \(R^2\) measures the proportion of variability in \(Y\) that can be explained by \(X\). An \(R^2\) value close to 1 indicates that a large proportion of the variability in the response variable is explained by the regression model.
Although \(R^2\) is always between 0 and 1, it can still be challenging to determine what constitutes a “good” \(R^2\) value. For domains in which the true model is highly linear and \(\epsilon\) is small, we can obtain \(R^2 \approx 1\). In the social sciences, relationships can be non-linear and subject to many other influences, so even a small \(R^2\) can be meaningful.
The \(R^2\) is related to the correlation \(r = \text{Cor}(X, Y)\). In simple LR (with only 1 predictor), \(R^2 = r^2\).
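A quick check of this identity in R (assuming the ad data and the slm_1 model from above):
cor(ad$TV, ad$sales)^2    # squared correlation r^2
summary(slm_1)$r.squared  # R^2 of the simple LR model (0.6119, see above)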
Accuracy measures in R
We can easily define R functions for the total sum of squares (TSS), residual sum of squares (RSS), the residual standard error (RSE), and the \(R^2\)-statistic, when providing both the true and predicted values as arguments:
# Measures of model accuracy:
# (see @JamesEtAl2021, p. 69ff.):

# Total sum of squares (TSS): ----
TSS <- function(true_vals, pred_vals = NA){

  # Special case (no predictions provided): use the mean as the only prediction:
  if (all(is.na(pred_vals))){
    pred_vals <- mean(true_vals)
  }

  sum((true_vals - pred_vals)^2)
}

# Residual sum of squares (RSS): ----
RSS <- function(true_vals, pred_vals){
  sum((true_vals - pred_vals)^2)
}

# Residual standard error (RSE): ----
RSE <- function(true_vals, pred_vals){
  n   <- length(true_vals)
  rss <- RSS(true_vals, pred_vals)  # defined above
  sqrt(1/(n - 2) * rss)
}

# R^2 statistic: ----
Rsq <- function(true_vals, pred_vals){
  tss <- TSS(true_vals)             # defined above
  rss <- RSS(true_vals, pred_vals)  # defined above
  (tss - rss)/tss
}
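As a quick sanity check (assuming the slm_1 model from above is still available), these functions should reproduce the values reported by summary(slm_1):
RSE(ad$sales, predict(slm_1))  # should be close to the reported RSE of 3.259
Rsq(ad$sales, predict(slm_1))  # should be close to the reported R-squared of 0.6119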
3.2 Multiple linear regression
Simple LR assumes only one predictor variable, but we often have more than one predictor.
Multiple single LR
We could consider running separate models for each predictor.
Re-creating Table 3.3 (p. 72): Multiple simple/single LR models for the Advertising data:
# Simple linear regression models: ----
slm_1 <- lm(sales ~ TV,        data = ad)
slm_2 <- lm(sales ~ radio,     data = ad)
slm_3 <- lm(sales ~ newspaper, data = ad)

# Summary of each model:
summary(slm_1)
##
## Call:
## lm(formula = sales ~ TV, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
summary(slm_2)
##
## Call:
## lm(formula = sales ~ radio, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7305 -2.1324 0.7707 2.7775 8.1810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31164 0.56290 16.542 <2e-16 ***
## radio 0.20250 0.02041 9.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3287
## F-statistic: 98.42 on 1 and 198 DF, p-value: < 2.2e-16
summary(slm_3)
##
## Call:
## lm(formula = sales ~ newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2272 -3.3873 -0.8392 3.5059 12.7751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.35141 0.62142 19.88 < 2e-16 ***
## newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
Three distinct simple/single LR models are unsatisfactory:
It remains unclear how we can predict the outcome variable based on a combination of predictor values.
Also, each simple/single LR model ignores (or rather tacitly incorporates) the effects of the other predictor variables.
Multiple LR
A better approach is to extend the simple LR model by incorporating multiple predictors in a single model. The following multiple linear regression (multiple LR) model uses a separate slope coefficient for each predictor \(p\):
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 +\ \ldots\ + \beta_p X_p + \epsilon \]
where \(X_j\) represents the \(j\)th predictor and the coefficient \(\beta_j\) quantifies the association between that predictor and the response.
We interpret \(\beta_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed.
Estimating the regression coefficients
# Multiple linear regression model (3 predictors): ----
mlm_1 <- lm(sales ~ TV + radio + newspaper, data = ad)

# Summary of multiple LR model:
summary(mlm_1)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Note that the simple and multiple regression coefficients can be quite different. Here, the coefficient for the newspaper predictor is no longer significant when accounting for the effects of TV and radio.
How can the effect of newspaper be significant in a simple/single LR, but non-significant in a multiple LR? The correlation between the radio and newspaper predictors is 0.35:
Re-creating Table 3.5 (p. 75): Correlation matrix for TV, radio, newspaper, and sales, for the Advertising data:
# Correlation matrix:
df <- ad[ , -1]
round(cor(x = df), 2)
## TV radio newspaper sales
## TV 1.00 0.05 0.06 0.78
## radio 0.05 1.00 0.35 0.58
## newspaper 0.06 0.35 1.00 0.23
## sales 0.78 0.58 0.23 1.00
Thus, the simple/single LR of sales by newspaper captures some of the variance due to the radio predictor. In other words, newspaper advertising is a surrogate for radio advertising, or the newspaper predictor gets some credit for the association between radio and sales. When incorporating both predictors in a single model, the effect of newspaper disappears:
# Multiple linear regression model (2 predictors): ----
mlm_2 <- lm(sales ~ radio + newspaper, data = ad)

# Summary mLR:
summary(mlm_2)
##
## Call:
## lm(formula = sales ~ radio + newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5289 -2.1449 0.7315 2.7657 7.9751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.188920 0.627672 14.640 <2e-16 ***
## radio 0.199045 0.021870 9.101 <2e-16 ***
## newspaper 0.006644 0.014909 0.446 0.656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.284 on 197 degrees of freedom
## Multiple R-squared: 0.3327, Adjusted R-squared: 0.3259
## F-statistic: 49.11 on 2 and 197 DF, p-value: < 2.2e-16
Example involving a common cause variable: Assume an association between the predictor ice cream sales and the outcome shark attacks on beaches. Once we incorporate the (common cause) predictor temperature, the association disappears.
Some important questions
When we perform multiple linear regression, we usually are interested in answering the following questions:
Is at least one of the predictors \(X_1,\ X_2,\ \ldots,\ X_p\) useful in predicting the response \(Y\)?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
ad 1. Overall effect vs. contributions of individual predictors
Is at least one of the predictors \(X_1,\ X_2,\ \ldots,\ X_p\) useful in predicting the response \(Y\)?
A large \(F\)-statistic (\(F > 1\), taking into account the \(df\) implied by \(n\) and \(p\)) implies that at least one of the predictors is successful in predicting the outcome/response variable.
In contrast to this overall effect, the \(t\)-statistic and the corresponding \(p\)-value of each coefficient informs about whether each predictor is related to the outcome/response when adjusting for the other predictors.
Note that \({t_{j}}^2 = F_j\) for the partial effect of each predictor \(X_j\) (see James et al., 2021, p. 77, Footnote 7).
Good question: Given individual \(p\)-values for each coefficient, why do we need the overall \(F\)-statistic?
Consider an example with many predictors (or multiple tests): For instance, assume a multiple LR model with \(p = 100\) predictors, but
\(H_0: \beta_1 = \beta_2 =\ \ldots\ = \beta_p = 0\) is true, so no variable is truly associated with the response. In this situation, about \(5\)% of the \(p\)-values associated with each coefficient will be below \(0.05\) by chance (and — simply by definition — “statistically significant”). In other words, we expect to see approximately five significant \(p\)-values in the absence of any true association between the predictors and the response. By contrast, the \(F\)-statistic does not suffer from the problem of multiple tests because it adjusts for the number of predictors \(p\).
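This multiple-testing point is easy to illustrate with a small simulation (a sketch with made-up noise data, not from the book):
set.seed(1)
n <- 200
p <- 100
X <- matrix(rnorm(n * p), nrow = n)  # 100 pure-noise predictors
y <- rnorm(n)                        # response unrelated to all predictors
fit <- lm(y ~ X)
p_vals <- summary(fit)$coefficients[-1, 4]  # per-coefficient p-values (drop intercept)
sum(p_vals < 0.05)                          # roughly 5 "significant" coefficients by chance
summary(fit)$fstatistic                     # the overall F-statistic remains unremarkable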
ad 2. Deciding on important variables
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection (see Chapter 6). With \(p\) predictors, we can create \(2^p\) different models containing subsets of predictor variables (not yet considering interactions). Various statistics allow judging the quality of a model (including AIC, BIC).
Three classical approaches for solving the variable selection task are:
Forward selection: Begin with the null model and add the predictor that results in a model with the lowest RSS. Proceed until some stopping criterion is reached.
Backward selection: Begin with the full model and remove the predictor that has the highest \(p\) value. Proceed until some stopping criterion is reached.
Mixed selection: Begin with forward selection, but also remove predictors when their \(p\)-values increase above some criterion. Continue until all variables remaining in the model have a sufficiently low \(p\)-value.
Note:
Backward selection cannot be used when \(p > n\).
Forward selection can always be used (even when \(p > n\)), but is a greedy approach (i.e., might include variables early that later become redundant). Mixed selection can remedy this.
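As a minimal sketch, R's step() function automates such stepwise searches (note that it uses AIC as its stopping criterion, rather than the raw RSS or \(p\)-values described above; null_model is a helper name introduced here):
# Forward selection from the null model (assuming the ad data from above):
null_model <- lm(sales ~ 1, data = ad)
step(null_model, scope = ~ TV + radio + newspaper, direction = "forward")

# Backward selection would start from the full model with direction = "backward";
# direction = "both" corresponds to mixed selection.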
ad 3. Model fit
How well does the model fit the data?
Two of the most common numerical measures of model fit are the residual standard error (RSE) and the fraction of variance explained (\(R^2\)).
Note:
Relation between \(R^2\) and correlation: In simple LR (with only 1 predictor), we had \(R^2 = r^2\). In multiple LR, we have \(R^2 = \text{Cor}(Y, \hat{Y})^2\) (i.e., multiple LR maximizes the correlation between response \(Y\) and the predictions of the fitted linear model \(\hat{Y}\)).
Number of predictors \(p\), \(R^2\), vs. model quality: \(R^2\) always increases when adding variables to a model, even if those variables are only weakly associated with the response. This is due to the fact that adding another variable always results in a decrease in the RSS on the training data, but not necessarily the testing data.
Graphical summaries can reveal problems with a model that are not visible from numerical statistics (e.g., non-linearity).
Example: Figure 3.5 (p. 81) shows a non-linear relationship in data for a multiple LR with 2 predictors. Such a non-linear pattern often suggests a synergy or interaction effect.
ad 4. Predictions
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
For a specific multiple LR model and a set of values for the predictors \(X_1, X_2,\ \ldots, X_p\), it is straightforward to predict the response \(\hat{Y}\). There are three sorts of uncertainty associated with this prediction:
Reducible error: The coefficients \(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_p}\) are only estimations for the true population values \(\beta_0, \beta_1,\ \ldots, \beta_p\).
Model bias: Even the best linear model \(f(X)\) need not be the true relationship (which may be non-linear).
Irreducible error: Even the true population model would be off by random error/noise \(\epsilon\).
Note the distinction between a confidence interval (for the average \(Y\) over many observations) and a prediction interval (for the uncertainty surrounding \(Y\) for a particular observation).
Confidence intervals express sampling uncertainty in quantities estimated from many data points. The more data, the less sampling uncertainty, and hence the thinner the interval.
Prediction intervals, on top of the sampling uncertainty, also express uncertainty around a single value, which makes them wider than the confidence intervals.
See blog post at Confidence intervals vs prediction intervals (2021-11-08).
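A minimal sketch in R (assuming the mlm_1 model from above; the budget values in new_budget are made up for illustration):
new_budget <- data.frame(TV = 100, radio = 20, newspaper = 30)

predict(mlm_1, newdata = new_budget, interval = "confidence")  # CI for the average response f(X)
predict(mlm_1, newdata = new_budget, interval = "prediction")  # PI for an individual response (wider)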
3.3 Other considerations in the regression model
The Credit dataset of ISLR2 contains both quantitative and qualitative predictors:
library(ISLR2)
library(tidyverse)

cred <- ISLR2::Credit
dim(cred)  # 400 x 11
## [1] 400 11
summary(cred)
## Income Limit Rating Cards
## Min. : 10.35 Min. : 855 Min. : 93.0 Min. :1.000
## 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2 1st Qu.:2.000
## Median : 33.12 Median : 4622 Median :344.0 Median :3.000
## Mean : 45.22 Mean : 4736 Mean :354.9 Mean :2.958
## 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2 3rd Qu.:4.000
## Max. :186.63 Max. :13913 Max. :982.0 Max. :9.000
## Age Education Own Student Married Region
## Min. :23.00 Min. : 5.00 No :193 No :360 No :155 East : 99
## 1st Qu.:41.75 1st Qu.:11.00 Yes:207 Yes: 40 Yes:245 South:199
## Median :56.00 Median :14.00 West :102
## Mean :55.67 Mean :13.45
## 3rd Qu.:70.00 3rd Qu.:16.00
## Max. :98.00 Max. :20.00
## Balance
## Min. : 0.00
## 1st Qu.: 68.75
## Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
sub_cred <- cred %>%
  select(Balance, Age, Cards, Education, Income, Limit, Rating, Own)

plot(sub_cred, pch = 20, cex = .8, col = adjustcolor("steelblue", .20))
Qualitative predictors
Given a quantitative outcome/response variable, a binary predictor variable can be dummy coded as 0 vs. 1:
Example: Predicting Credit card balance based on demographic variables (quantitative and qualitative/categorical).
Re-creating Table 3.7 (p. 85): Least squares coefficient estimates associated with the regression of credit card Balance onto home ownership (binary predictor Own) in the Credit data set.
# Descriptive:
# table(cred$Own)
tb_own <- cred %>%
  group_by(Own) %>%
  summarise(n = n(),
            mn_balance = mean(Balance))

knitr::kable(tb_own, caption = "N and mean balance by home ownership.")
| Own | n   | mn_balance |
|-----|-----|------------|
| No  | 193 | 509.8031   |
| Yes | 207 | 529.5362   |
# library(kableExtra)
# kbl(tb_own, booktabs = TRUE, caption = "N and mean balance by home ownership.") %>%
# kable_styling(latex_options = "striped", full_width = FALSE)
# Box plot:
plot(x = cred$Own, y = cred$Balance,
main = "Balance by home ownership",
col = adjustcolor(c("deepskyblue", "deeppink"), .33))
# Linear model (with binary dummy variable):
mlm_2 <- lm(Balance ~ Own, data = cred)
summary(mlm_2)
##
## Call:
## lm(formula = Balance ~ Own, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## OwnYes 19.73 46.05 0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
Interpretation:
- intercept: Average credit card balance without home ownership (i.e., Own == No baseline): $509.80
- coefficient: Increase by $19.73 in case of home ownership (but the effect is small and non-significant).
Note that the interpretation of the coefficients depends on the way the variable (here: the factor Own) is coded. By contrast, the predictions remain the same, regardless of coding:
table(cred$Own)  # is a factor with 2 levels (1: No, 2: Yes)
##
##  No Yes
## 193 207

# Recode binary predictor:
not_Own <- (as.numeric(cred$Own) * -1) + 2  # reverse 0/1 values of variable
table(not_Own)
## not_Own
##   0   1
## 207 193

# Linear model (with binary dummy variable):
mlm_3 <- lm(Balance ~ not_Own, data = cred)
summary(mlm_3)
##
## Call:
## lm(formula = Balance ~ not_Own, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 529.54 31.99 16.554 <2e-16 ***
## not_Own -19.73 46.05 -0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
Interpretation:
Model mlm_3 contains the same results as mlm_2, but from the opposite perspective:
- intercept: Average credit card balance WITH home ownership (i.e., not_Own == 0 baseline): $529.54
- coefficient: Decrease by $19.73 in the ABSENCE of home ownership (but the effect is small and non-significant).
Qualitative predictors with more than two levels
When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables.
Example: For predicting Credit card balance by Region (i.e., a factor variable with 3 levels), we create and use two (binary) dummy variables:
table(cred$Region)  # is a factor with 3 levels (1: East, 2: South, 3: West)
##
##  East South  West
##    99   199   102

# Create 2 dummy variables:
cred$r_S <- rep(NA, nrow(cred))
cred$r_S <- ifelse(cred$Region == "South", 1, 0)
table(cred$r_S)
##
##   0   1
## 201 199

cred$r_W <- rep(NA, nrow(cred))
cred$r_W <- ifelse(cred$Region == "West", 1, 0)
table(cred$r_W)
##
##   0   1
## 298 102
Model mlm_4 (using the 2 dummy variables as binary predictors) re-creates the results of Table 3.8 (p. 86): Least squares coefficient estimates associated with the regression of Balance onto region in the Credit data set. The categorical Region variable (with 3 levels) is encoded via two dummy variables.
# Linear model (with 2 binary dummy variables):
mlm_4 <- lm(Balance ~ r_S + r_W, data = cred)
summary(mlm_4)
##
## Call:
## lm(formula = Balance ~ r_S + r_W, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## r_S -12.50 56.68 -0.221 0.826
## r_W -18.69 65.02 -0.287 0.774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
Interpretation:
- The level with no dummy variable (East in this example) is the baseline. Which level we choose as the baseline is arbitrary, but it matters for interpreting the coefficients.
- The coefficient estimates show that the differences between regions are small and non-significant.
- Overall, the low \(F\)-value (and its \(p\)-value of 0.96) indicates that we cannot reject the null hypothesis \(H_0\): \(\beta_1 = \beta_2 = 0\) (i.e., Region has no effect on Balance) with this model.
Note that we can achieve the same result with the original categorical variable Region (i.e., a factor with 3 levels):
# Linear model with a factor variable (3 levels):
mlm_5 <- lm(Balance ~ Region, data = cred)
summary(mlm_5)
##
## Call:
## lm(formula = Balance ~ Region, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## RegionSouth -12.50 56.68 -0.221 0.826
## RegionWest -18.69 65.02 -0.287 0.774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
Overall, using the dummy variable approach presents no difficulties when incorporating both quantitative and qualitative predictors.
Besides the dummy variable approach, there are alternative ways of coding qualitative variables. All of these approaches lead to equivalent model fits, but the coefficients have different values and interpretations, and are designed to measure particular contrasts.
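For instance, R's default treatment (dummy) coding of a factor can be inspected and swapped for an alternative coding (a sketch, assuming the cred data from above):
contrasts(cred$Region)  # default dummy/treatment coding (baseline: East)

# Same fit, but the coefficients measure different contrasts (sum/deviation coding):
summary(lm(Balance ~ Region, data = cred, contrasts = list(Region = "contr.sum")))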
Extensions of the linear model
The standard LR model provides interpretable results and works quite well on many real-world problems. However, its restrictive assumptions are often violated in practice. Two of the most important assumptions state that the effects of the predictors are additive and the relationship between the predictors and response is linear. Here are some classical approaches for extending the linear model:
1. Non-additive relationships
Removing the additive assumption of standard LR models: Including interaction effects.
Example: Two quantitative predictors
Figure 3.5 (p. 81) suggested a non-linear relationship in the Advertising data for a multiple LR with 2 predictors (TV and radio). Such a non-linear pattern often suggests a synergy or interaction effect between both predictors:
# Multiple linear regression model (2 predictors + interaction): ----
mlm_6 <- lm(sales ~ TV + radio + TV:radio, data = ad)
mlm_6 <- lm(sales ~ TV * radio, data = ad)  # same model (main effects + interaction)

# Summary mLR:
summary(mlm_6)
##
## Call:
## lm(formula = sales ~ TV * radio, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
# Without interaction: ----
mlm_x <- lm(sales ~ TV + radio, data = ad)
# summary(mlm_x)
Notes:
- Testing for interactions relaxes the additivity assumption insofar as multiplicative relations are included (see explanation and examples on p. 88).
- The new \(R^2 = 96.7\%\), whereas it was only \(R^2 = 89.6\%\) without the interaction. Thus, 68.3% of the remaining variance in sales was explained by including the interaction term.
- The hierarchical principle: If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant. (In mlm_6, both main effects and their interaction are significant.)
Example: One qualitative and one quantitative predictor
An interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation (p. 89). Consider the Credit data set from above, and suppose that we wish to predict Balance using the Income (quantitative) and Student (qualitative) variables:
# head(cred)

# (a) Linear model with a quantitative and qualitative predictor (NO interaction):
mlm_7a <- lm(Balance ~ Income + Student, data = cred)
summary(mlm_7a)
##
## Call:
## lm(formula = Balance ~ Income + Student, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -762.37 -331.38 -45.04 323.60 818.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211.1430 32.4572 6.505 2.34e-10 ***
## Income 5.9843 0.5566 10.751 < 2e-16 ***
## StudentYes 382.6705 65.3108 5.859 9.78e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.8 on 397 degrees of freedom
## Multiple R-squared: 0.2775, Adjusted R-squared: 0.2738
## F-statistic: 76.22 on 2 and 397 DF, p-value: < 2.2e-16
# (b) Linear model with a quantitative and qualitative predictor (WITH interaction):
mlm_7b <- lm(Balance ~ Income + Student + Income:Student, data = cred)
summary(mlm_7b)
##
## Call:
## lm(formula = Balance ~ Income + Student + Income:Student, data = cred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.39 -325.70 -41.13 321.65 814.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.6232 33.6984 5.953 5.79e-09 ***
## Income 6.2182 0.5921 10.502 < 2e-16 ***
## StudentYes 476.6758 104.3512 4.568 6.59e-06 ***
## Income:StudentYes -1.9992 1.7313 -1.155 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared: 0.2799, Adjusted R-squared: 0.2744
## F-statistic: 51.3 on 3 and 396 DF, p-value: < 2.2e-16
When expressed graphically, including the interaction results in a more flexible model that allows not only for different intercepts, but also different slopes for both qualitative groups:
Notes:
Whereas including only main effects allows for parallel lines with different intercepts (one for each level of the qualitative predictor), including an interaction term additionally allows for different slopes (i.e., different impacts of the quantitative predictor in different qualitative groups).
See the derivation of the line equations (i.e., values of intercept and slope) from the coefficients (James et al., 2021, p. 90).
# Scatterplot:
plot(x = cred$Income, y = cred$Balance,
main = "Balance by income (including interaction with student/no student)",
pch = 20, col = ifelse(cred$Student == "Yes", adjustcolor("deepskyblue", .33), adjustcolor("deeppink", .33)),
xlab = "Income", ylab = "Balance", xlim = c(0, 200), ylim = c(0, 1600))
# From model:
cf <- coefficients(mlm_7b)

# Line equations (with interaction of categorical predictor):
l_inc_no_stud <- function(x){ cf[1] + cf[2] * x }
l_inc_student <- function(x){ (cf[1] + cf[3]) + (cf[2] + cf[4]) * x }
curve(expr = l_inc_student, lwd = 2, col = "deepskyblue", add = TRUE)
curve(expr = l_inc_no_stud, lwd = 2, col = "deeppink", add = TRUE)
2. Non-linear relationships
Removing the linear assumption of standard LR models: Using polynomial regression.
The LR model assumes a linear relationship between the predictors and the response. When the true relationship between the response and the predictors is non-linear, a simple way to incorporate non-linear associations into a linear model is to include transformed versions of the predictors.
Extending the linear model to accommodate non-linear relationships is known as polynomial regression, as we include polynomial functions of the predictors in a LR model.
Example of polynomial regression: The points in Figure 3.8 (p. 91) seem to have a quadratic shape, suggesting that a model of the form
\(\text{mpg} = \beta_0 + (\beta_1 \cdot \text{hp}) + (\beta_2 \cdot \text{hp}^2) + \epsilon\)
may provide a better fit.
# Data:
Auto <- ISLR2::Auto

# Variables:
hp  <- Auto$horsepower  # predictor
mpg <- Auto$mpg         # outcome
# Scatterplot:
plot(x = hp, y = mpg,
main = "Auto data: mpg by horsepower", xlab = "horsepower", ylab = "mpg",
pch = 20, cex = 1, col = adjustcolor("black", .25))
# (a) Linear model with one predictor (hp):
mlm_8a <- lm(mpg ~ hp)
summary(mlm_8a)
##
## Call:
## lm(formula = mpg ~ hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## hp -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
abline(mlm_8a, lwd = 2, col = "deepskyblue")
# (b) Polynomial LR model with one predictor and its square (as 2nd predictor):
hp_sq  <- hp^2  # squared predictor
mlm_8b <- lm(mpg ~ hp + hp_sq)

mlm_8b <- lm(mpg ~ hp + I(hp^2))  # Note: I() inhibits the interpretation of ^
                                  # (which has special meaning in an R formula)
summary(mlm_8b)
##
## Call:
## lm(formula = mpg ~ hp + I(hp^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## hp -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(hp^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
curve(expr = coef(mlm_8b)[1] + coef(mlm_8b)[2] * x + coef(mlm_8b)[3] * x^2,
lwd = 2, col = "deeppink", add = TRUE)
Notes:
- The fit of the polynomial model (mlm_8b) is superior: \(R^2 = 0.686\), compared to \(0.605\) for the linear fit (mlm_8a). Also, the \(p\)-value for the coefficient of the quadratic term is highly significant.
- In this equation, the value of mpg is a non-linear function of hp. However, the model is still a linear model: a multiple LR model with two predictors \(X_1 = \text{hp}\) and \(X_2 = \text{hp}^2\).
- We could extend the polynomial LR approach by including terms for \(\text{hp}^3\), \(\text{hp}^4\), etc. Although the resulting curves would provide closer fits, they are also prone to overfitting the training data.
See Chapter 7 for further non-linear extensions of the linear model (e.g., basis functions, splines, local regression, generalized additive models).
Potential problems for LR
When we fit a linear regression model to a particular data set, the following problems can occur:
- Non-linearity of the response-predictor relationships.
- Correlation of error terms.
- Non-constant variance of error terms.
- Outliers.
- High-leverage points.
- Collinearity.
Detecting and solving these problems in practice is as much an art as a science. A brief summary of some key points:
ad 1. Non-linearity of the response-predictor relationships
A LR model assumes a linear (i.e., straight-line) relationship between predictors and response. If the true relationship is not linear, all conclusions drawn from a linear model are suspect.
Diagnosis:
A useful graphical tool for identifying non-linearity is the residual plot:
Re-creating Figure 3.9 (p. 93):
- Simple LR model: Plotting the residual errors \(e_i = y_i − \hat{y_i}\) as a function of the predictor \(x_i\):
# Data:
Auto <- ISLR2::Auto

# Variables:
hp  <- Auto$horsepower  # predictor
mpg <- Auto$mpg         # outcome

# (a) Linear model with one predictor (hp):
mlm_8a <- lm(mpg ~ hp)
# summary(mlm_8a)

# Scatterplot of residuals:
mpg_residuals <- mpg - predict(mlm_8a)
plot(x = hp, y = mpg_residuals,
main = "Residuals: mpg by horsepower", xlab = "horsepower", ylab = "mpg",
pch = 20, cex = 1, col = adjustcolor("black", .25))
abline(0, 0, lty = 2, lwd = .5)
# identify(x = hp, y = mpg_residuals, plot = TRUE)
# Note: Default residual plot in
# plot(mlm_8a)
- Polynomial LR model: Plotting residual errors \(e_i = y_i − \hat{y_i}\) vs. the predictor \(x_i\):
# (b) Polynomial LR model with one predictor and its square (as 2nd predictor):
hp_sq  <- hp^2  # squared predictor
mlm_8b <- lm(mpg ~ hp + hp_sq)
# summary(mlm_8b)

# Scatterplot of residuals:
mpg_residuals_2 <- mpg - predict(mlm_8b)
plot(x = hp, y = mpg_residuals_2,
main = "Residuals: mpg by horsepower + hp^2", xlab = "hp", ylab = "mpg",
pch = 20, cex = 1, col = adjustcolor("black", .25))
abline(0, 0, lty = 2, lwd = .5)
# identify(x = hp, y = mpg_residuals_2, plot = TRUE)
# Note: Default residual plot in
# plot(mlm_8b)
- Multiple LR model: Plotting the residual errors \(e_i = y_i − \hat{y_i}\) vs. the predicted (or fitted) values \(\hat{y_i}\).
Interpretation:
The presence of a pattern (e.g., U-shape) may indicate a problem with some aspect of the linear model.
If the linearity assumption is violated, a simple approach for improving a model can be to use a non-linear transformation of the predictor(s) (e.g., log(\(X\)), \(\sqrt{X}\), or \(X^2\)).
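A minimal sketch of such a transformation, re-using the hp and mpg variables from the Auto example above (mlm_8c is a helper name introduced here):
mlm_8c <- lm(mpg ~ log(hp))   # log-transformed predictor
summary(mlm_8c)$r.squared     # compare to the linear (0.606) and quadratic (0.688) fits above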
ad 2. Correlation of error terms
A LR model assumes that the error terms \(e_i = y_i − \hat{y_i}\) are uncorrelated. Violating this assumption results in unjustified confidence (e.g., confidence intervals that are narrower than they should be).
Examples:
If we accidentally used each of \(n\) data points twice (i.e., created a model that considered \(2n\) data points), our confidence intervals would be narrower by a factor of \(\sqrt{2}\), overstating the confidence actually warranted by the data.
Correlated error terms frequently occur in the context of time series data, which often exhibit positively correlated errors of adjacent data points.
Diagnosis:
Plot the residual errors from a LR model as a function of time (see Figure 3.10, p. 95):
Interpretation:
- When viewing residual errors as a function of time, we should not see any time-related trends or patterns. By contrast, tracking in the residuals (i.e., adjacent residuals having similar signs or values) suggest correlated error terms.
Examples beyond time series:
- Correlation between errors can occur when other relevant variables influence the relationship between predictor(s) and response variables. For instance, a study may aim to predict individuals’ height from their weight. If clusters in the data (e.g., families, communities, countries) contain individuals on similar diets, their error terms may be correlated. Avoiding such correlations is a critical aspect of (ensuring the internal validity of) experimental design.
ad 3. Non-constant variance of error terms
A LR model assumes that the error terms have a constant variance \(\text{Var}(\epsilon_i) = \sigma^2\) (a.k.a. homoscedasticity).
However, variances of the error terms may increase (or decrease) with the value of the response.
Diagnosis:
A funnel shape in the residual plot indicates non-constant variances in the errors (heteroscedasticity).
A concave transformation of the outcome/response variable \(Y\) (e.g., using \(\text{log}(Y)\) or \(\sqrt{Y}\)) can restore homoscedasticity (see Figure 3.11, p. 96):
- When the variance of response values \(Y_i\) is known, we can fit a LR model with weighted least squares, with weights proportional to the inverse variances. LR software usually allows for observation weights.
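A minimal sketch of weighted least squares with lm()'s weights argument, using simulated heteroscedastic data (not from the book; x, y, ols_fit, and wls_fit are helper names introduced here):
set.seed(2)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)    # error SD grows with x (heteroscedasticity)

ols_fit <- lm(y ~ x)                   # ordinary least squares (equal weights)
wls_fit <- lm(y ~ x, weights = 1/x^2)  # weights proportional to the inverse variances
summary(wls_fit)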
ad 4. Outliers
Definition: An outlier is a data point for which \(y_i\) differs substantially from the value \(\hat{y_i}\) predicted by the model (i.e., the value’s residual error is large). This can — but does not have to — indicate errors (e.g., in measuring or coding the original data) and should be investigated carefully.
Even when removing an outlier changes the model coefficients (and thus the best-fitting regression line) very little, doing so may dramatically improve the model fit (by reducing the \(RSE\) and increasing \(R^2\)).
Example of Figure 3.12 (p. 97):
Interpretation:
Left: The least squares regression line is shown in red, and the regression line after removing the outlier is shown in blue.
Center: The residual plot clearly identifies the outlier.
Right: The outlier has a studentized residual of \(+6\); typically we expect values between \(−3\) and \(+3\).
Diagnosis:
- Residual plots can be used to identify outliers. To standardize, we can plot studentized residuals that divide each residual error \(e_i\) by its estimated standard error. Observations whose studentized residuals are greater than \(|3|\) are possible outliers.
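A minimal sketch of this diagnostic in R (using the mlm_8a model from the Auto example above):
stud_res <- rstudent(mlm_8a)  # studentized residuals
which(abs(stud_res) > 3)      # observations flagged as possible outliers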
ad 5. High-leverage points
Definition: Whereas outliers are observations for which the response \(y_i\) is unusual given the predictor \(x_i\), observations with high leverage have an unusual value of the predictor \(x_i\) itself.
Example: Observation 41 in the left-hand panel of Figure 3.13 (p. 98) has high leverage, in that the predictor value for this observation is large relative to the other observations:
Interpretation:
Left: Observation 41 is a high leverage point, while 20 is (perhaps) an outlier. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed.
Center: The red observation is not unusual in terms of its \(X_1\) value or its \(X_2\) value, but still falls outside the bulk of the data, and hence has high leverage.
Right: Observation 41 has a high leverage and a high residual.
Overall, a few high leverage observations can have a sizable impact on the estimated regression line.
Diagnosis:
In order to quantify an observation’s leverage, we compute the leverage statistic: If a given observation has a leverage statistic that greatly exceeds \((p + 1)/n\), then we may suspect that the corresponding point has high leverage.
In Figure 3.13 (p. 98), observation 41 stands out as having a very high leverage statistic as well as a high studentized residual. In other words, it is an outlier as well as a high leverage observation. By contrast, observation 20 has low leverage (i.e., not much influence on the model/line of best fit).
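Computing the leverage statistic in R is straightforward (a sketch, using the mlm_8a model from the Auto example above; lev is a helper name introduced here):
lev <- hatvalues(mlm_8a)            # leverage h_i of each observation
p <- 1                              # number of predictors in mlm_8a
n <- length(mpg)
(p + 1)/n                           # average leverage
head(sort(lev, decreasing = TRUE))  # observations with the largest leverage values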
ad 6. Collinearity
Definition: Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
When two or more predictors are collinear (i.e., related to each other), it is difficult or impossible to separate their individual effects on the response. In other words, when predictors tend to increase or decrease together, their \(\beta\)-weights can fluctuate widely, as we cannot separately measure their relation to the response.
Example: Two scatterplots for predictors in the Credit dataset (without and with collinearity between predictors):
Interpretation:
- Left: A plot of age versus limit. These two variables are not collinear.
- Right: A plot of rating versus limit. There is high collinearity.
Example: Two contour plots of RSS (without and with collinearity between predictors):
Interpretation:
- Left: A contour plot of RSS for the regression of balance onto age and limit. In the absence of collinearity between predictors, the minimum value is well defined.
- Right: A contour plot of RSS for the regression of balance onto rating and limit. In the presence of collinearity between predictors, there are many pairs (\(\beta_\text{Limit}\), \(\beta_\text{Rating}\)) with a similar RSS value.
Consequences of collinearity:
Collinearity yields a great deal of uncertainty in the coefficient estimates.
The standard error of coefficient estimates \(\hat{\beta_j}\) grows, resulting in a declining \(t\)-statistic, and lower power (for rejecting \(H_0\)).
Example: See Table 3.11 (p. 102) for two models without vs. with collinearity.
Diagnosis:
To detect collinearity between pairs of predictors, inspect the correlation matrix of the predictors.
To detect multicollinearity (i.e., relationships between three or more predictors), compute the variance inflation factor (VIF) for each predictor. VIF values are \(\geq 1\): VIF\((\hat{\beta_j}) \geq 5\) suggest problematic amounts of collinearity.
Remedy:
Drop one or more problematic predictors from the model.
Combine collinear variables into a single predictor (e.g., the average of standardized versions of the individual predictors).
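To make the VIF diagnostic concrete, it can be computed by hand as \(1/(1 - R^2)\) from regressing one predictor on the remaining predictors (a sketch for the Credit predictors discussed above, assuming the cred data is loaded; r2_limit is a helper name introduced here):
r2_limit <- summary(lm(Limit ~ Age + Rating, data = cred))$r.squared
1 / (1 - r2_limit)   # VIF for Limit; values >= 5 suggest problematic collinearity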
3.4 The marketing plan
This chapter started by introducing the Advertising data. After our overview of LR, we can now address the questions (asked on p. 59f.):
- Is there a relationship between predictors (advertising budgets) and outcome (sales)?
Fit a multiple LR model (without interactions):
# Data:
dim(ad)
## [1] 200 5
head(ad)
## # A tibble: 6 × 5
## X TV radio newspaper sales
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 230. 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 152. 41.3 58.5 18.5
## 5 5 181. 10.8 58.4 12.9
## 6 6 8.7 48.9 75 7.2
m_1 <- lm(sales ~ TV + radio + newspaper, data = ad)
summary(m_1)
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Interpretation: Table 3.6 (p. 76): There clearly is a relation between advertising budget (mostly that of TV and radio) and sales.
- How strong is the relationship?
Summary of m_1 and Table 3.6 (p. 76): Quite strong (e.g., \(RSE = 1.69\), \(R^2 = 89.6\)%, \(F = 570\)).
- Which predictors matter? What are their individual contributions on the overall outcome?
Summary of m_1 and Table 3.4 (p. 74): The \(p\)-values suggest that only TV and radio are related to sales (see Chapter 6 for details).
- How strong is the relationship for each predictor (with or without considering the others)?
Summary of m_1 and Table 3.4 (p. 74) can be used to compute confidence intervals for the predictors: The confidence intervals for the coefficients of TV and radio exclude zero, whereas that of newspaper includes zero (i.e., its coefficient is not reliably different from zero).
Is there collinearity between predictors?
- Re-creating the correlation matrix of Table 3.5 (p. 75):
# Correlation matrix (for predictors):
pr <- ad[ , c(-1, -5)]
round(cor(x = pr), 2)
## TV radio newspaper
## TV 1.00 0.05 0.06
## radio 0.05 1.00 0.35
## newspaper 0.06 0.35 1.00
Only low to moderate correlation between predictors.
- Check the variance inflation factor (VIF) for each predictor:
# install.packages("car")  # if car has not been installed yet
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(m_1)
## TV radio newspaper
## 1.004611 1.144952 1.145187
No evidence for collinearity (as VIF values are close to 1, not exceeding 5).
- Prediction: How accurately can we predict future outcomes?
We can use the model m_1 to predict (or estimate) response values. The accuracy of such estimates depends on whether we wish to predict an individual response \(Y = f(X) + \epsilon\), or the average response \(f(X)\) (see Question 4 of Section 3.2.2, p. 81f.). If the former, we use a prediction interval; if the latter, we use a confidence interval. Prediction intervals are always wider than confidence intervals because they account for the uncertainty associated with the irreducible error \(\epsilon\).
- Inference: What is the form of the relationship (e.g., linear or non-linear)?
We can use some diagnostic plots:
opar <- par(mfrow = c(2, 2))  # save previous par() settings (restored below)
plot(m_1, pch = 20, col = adjustcolor("black", .25))
par(opar) # restore original par
Alternatively, we can plot the predicted values of m_1 against its residuals, as returned by the residuals() function:
plot(predict(m_1), residuals(m_1),
main = "Residual plot", pch = 20, col = adjustcolor("black", .25))
The function rstudent() returns the studentized residuals, which we can also plot against the fitted values:
plot(predict(m_1), rstudent(m_1),
main = "Studentized residual plot", pch = 20, col = adjustcolor("black", .25))
Interpretation: On the basis of the residual plots, there is some evidence of non-linearity. This could be addressed by transforming the predictors in the LR model in order to accommodate non-linear relationships (see Section 3.3.2).
Leverage statistics can be computed for any number of predictors using the hatvalues() function:
# Leverage statistics:
plot(hatvalues(m_1),
     main = "Leverage", pch = 20, col = adjustcolor("black", .25))

# Point with max leverage:
(max_lev <- which.max(hatvalues(m_1)))
## 17
## 17

# View observation (row) in data:
ad[max_lev, ]
## # A tibble: 1 × 5
##       X    TV radio newspaper sales
##   <int> <dbl> <dbl>     <dbl> <dbl>
## 1    17  67.8  36.6       114  12.5
- Interactions: Are there synergy effects (interactions)?
The standard LR model assumes an additive relation between predictors and response. An additive model is easy to interpret because the association between each predictor and the response is unrelated to the values of the other predictors. However, the additive assumption may be unrealistic for many datasets.
We just concluded that the Advertising data may be non-additive (see also Figure 3.5). In Section 3.3.2, we saw that including an interaction term in the regression model can accommodate non-additive relationships:
# Data:
dim(ad)
## [1] 200 5
head(ad)
## # A tibble: 6 × 5
## X TV radio newspaper sales
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 230. 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 152. 41.3 58.5 18.5
## 5 5 181. 10.8 58.4 12.9
## 6 6 8.7 48.9 75 7.2
# Alternative model: 2 predictors and their interaction:
m_2 <- lm(sales ~ TV + radio + TV:radio, data = ad)
m_2 <- lm(sales ~ TV * radio, data = ad)  # shorter version of same model
summary(m_2)
##
## Call:
## lm(formula = sales ~ TV * radio, data = ad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
Interpretation:
The small \(p\)-value associated with the interaction term indicates the presence of a non-additive relationship.
Including an interaction term in the model results in a substantial increase in \(R^2\), from around \(90\)% to about \(96.8\)%.
3.5 Comparison of linear regression with K-nearest neighbors
LR is an example of a parametric approach that assumes a linear functional form for \(f(X)\).
Advantages:
- easy to fit, few parameters
- easy to test for statistical significance
- easy to interpret (in LR)
Disadvantage:
- strong assumptions about the linear form of \(f(X)\)
By contrast, non-parametric methods do not explicitly assume a parametric form for \(f(X)\), and thereby provide a more flexible alternative to LR.
K-nearest neighbors regression (KNN regression):
Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\).
Interpretation:
Values for \(K\) in KNN regression determine the bias-variance tradeoff (see Chapter 2):
Left: \(K = 1\) results in a rough step function fit (perfect training fit, low bias, high variance)
Right: \(K = 9\) produces a much smoother fit (less perfect training fit, higher bias, lower variance)
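A minimal base-R sketch of KNN regression for a single predictor (the helper knn_reg() is made up here for illustration; it is not a function from the book or from any package):
# KNN regression estimate of f(x0) from training data (x_train, y_train):
knn_reg <- function(x0, x_train, y_train, k = 9) {
  d  <- abs(x_train - x0)   # distances to the prediction point x0
  nn <- order(d)[1:k]       # indices of the K nearest neighbors (N_0)
  mean(y_train[nn])         # average of their responses estimates f(x0)
}
# Example: predict sales for a TV budget of 100 with K = 1 vs. K = 9:
knn_reg(100, ad$TV, ad$sales, k = 1)
knn_reg(100, ad$TV, ad$sales, k = 9)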
Choosing LR vs. KNN regression
Which approach is better mostly depends on the true functional form of \(f(X)\). The closer it is to linear, the better (and easier) it is to fit an LR model. However, even for non-linear data, KNN may be inferior when it is prone to fitting noise.
Figures 3.17 to 3.19 (p. 108f.) compare LR vs. KNN regression for different functional forms \(f(X)\). Overall, LR is superior for linear and near-linear data, whereas KNN regression has clear advantages for highly non-linear data.
- Curse of dimensionality: When the number of observations relative to that of predictors is small, parametric methods will tend to outperform non-parametric approaches.
Even when the number of predictors is small, we might prefer LR to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of LR, we are willing to forego a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which \(p\)-values are available.
Interpretation:
Test \(MSE\) for LR (black dashed lines) and KNN (green curves) as the number of variables \(p\) increases. The true function is non–linear in the first variable (as in the lower panel in Figure 3.19), and does not depend on the additional variables.
LR performance deteriorates slowly in the presence of additional noise variables (i.e., is quite robust).
KNN regression performance degrades much more quickly as the value of \(p\) increases (i.e., is more sensitive to noise).
3.6 Lab: Linear regression
R packages / libraries
In R, the library()
function is used to load libraries (aka. packages, i.e., groups of functions and data sets) that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with any R distribution, but more exotic functions require additional libraries.
Loading packages
Here we load the MASS package, which is a very large collection of data sets and functions, and the ISLR2 package, which includes the data sets associated with this book:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
library(ISLR2)
If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer.
Installing packages
However, other packages, such as ISLR2, must be downloaded the first time they are used. This can be done directly from within R (using the buttons/menus of an IDE or the install.packages()
function). For example, on a Windows system, select the Install package
option under the Packages
tab. After you select any mirror site, a list of available packages will appear. Simply select the package you wish to install and R will automatically download the package. Alternatively, this can be done at the R command line via install.packages("ISLR2")
.
An installation only needs to be done the first time you use a package. However, the library()
function must be called within each R session (to load the package).
Simple linear regression
The ISLR2 library contains the Boston
data set, which records medv
(median house value) for \(506\) census tracts in Boston. We will seek to predict medv
using \(12\) predictors such as
- rm (average number of rooms per house),
- age (average age of houses), and
- lstat (percent of households with low socioeconomic status).
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
To find out more about the data set, we can type ?Boston
.
We will start by using the lm()
function to fit a simple linear regression (simple LR) model, with medv
as the response and lstat
as the predictor. The basic syntax is lm(y ~ x, data = data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept:
lm.fit <- lm(medv ~ lstat)
## Error in eval(predvars, data, env): object 'medv' not found
The command causes an error because R does not know where to find the variables medv
and lstat
. The next line tells R that the variables refer to the Boston
data. If we attach Boston
, the first line works fine because R now recognizes the variables:
# 1. tell lm() to look for variables in data:
lm.fit <- lm(medv ~ lstat, data = Boston)

# 2. attach data to environment to make variables accessible:
attach(Boston)
lm.fit <- lm(medv ~ lstat)
If we type lm.fit
, some basic information about the model is printed. For more detailed information, we use summary(lm.fit)
. This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the model:
lm.fit
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We can use the names()
function in order to find out what other pieces of information are stored in lm.fit
. Although we can extract these quantities by name — e.g., lm.fit$coefficients
— it is safer to use the extractor functions like coef()
to access them:
names(lm.fit) # show parts (of list object)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit) # extract parts
## (Intercept) lstat
## 34.5538409 -0.9500494
In order to obtain a confidence interval for the coefficient estimates, we can use the confint()
command. (Type confint(lm.fit)
at the command line to obtain the confidence intervals.)
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
Predictions with confidence vs. prediction intervals
The predict()
function can be used to produce confidence intervals and prediction intervals for the prediction of medv
for a given value of lstat
:
# values of interest:
my_vals <- c(5, 10, 15)
# predictions with confidence intervals:
predict(lm.fit, data.frame(lstat = my_vals),
interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
# predictions with prediction intervals:
predict(lm.fit, data.frame(lstat = my_vals),
interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
For instance, the 95% confidence interval associated with a lstat
value of 10 is \((24.47, 25.63)\), and the 95% prediction interval is \((12.83, 37.28)\). As expected, the confidence and prediction intervals are centered around the same value (a predicted value of \(25.05\) for medv
when lstat
equals 10), but the prediction intervals are substantially wider.
Visualizations
We will now plot medv
and lstat
along with the least squares regression line using the plot()
and abline()
functions:
plot(x = lstat, y = medv,
main = "Scatterplot of response by predictor",
pch = 20, col = adjustcolor("black", .33))
abline(lm.fit, lwd = 2, col = "deepskyblue")
There is some evidence for non-linearity in the relationship between lstat
and medv
(see below).
Plotting lines
The abline()
function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a
and slope b
, we type abline(a, b)
. Below we experiment with some additional settings for plotting lines and points. The lwd = 3
argument causes the width of the regression line to be increased by a factor of 3; this also works for the plot()
and lines()
functions. We can also use the pch
option to create different plotting symbols:
# Basic plot:
plot(lstat, medv)
# Lines:
# abline(lm.fit, lwd = 3)
abline(lm.fit, lwd = 3, col = "deeppink")
# Point symbols and colors:
plot(lstat, medv,
pch = 20, col = adjustcolor("darkblue", alpha.f = .33))
# plot(lstat, medv, pch = 20)
plot(lstat, medv, pch = "+", col = adjustcolor("deeppink", alpha.f = .67))
# Note: Different pch values:
# plot(1:20, 1:20, pch = 1:20)
Diagnostic plots
Next we examine some diagnostic plots, several of which were discussed in Section 3.3.3. Four diagnostic plots are automatically produced by applying the plot()
function directly to the output object obtained by lm()
. This command will normally produce one plot at a time, and hitting Enter will generate the next plot. To view all four plots together, we can use the par()
and mfrow()
functions, which instruct R to split the display screen into separate panels, so that multiple plots can be viewed simultaneously. For example,par(mfrow = c(2, 2))
divides the plotting region into a \(2 \times 2\) grid of panels:
par(mfrow = c(2, 2))
# 4 standard diagnostic plots:
plot(lm.fit,
pch = 20, col = adjustcolor("black", .33))
par(opar) # restore original par
Residual plots
Alternatively, we can compute the residuals from a linear regression fit using the residuals()
function. The function rstudent()
will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
# Residual plot:
plot(x = predict(lm.fit), y = residuals(lm.fit),
main = "Residual plot",
pch = 20, col = adjustcolor("black", .33))
# Studentized residual plot:
plot(x = predict(lm.fit), y = rstudent(lm.fit),
main = "Studentized residuals",
pch = 20, col = adjustcolor("black", .33))
On the basis of the residual plots, there is some evidence of non-linearity.
Leverage statistics
Leverage statistics can be computed for any number of predictors using the hatvalues()
function:
# Compute leverage statistics:
hatvalues(lm.fit)[1:6] # first 6 values
## 1 2 3 4 5 6
## 0.004262518 0.002455527 0.004863680 0.005639779 0.004058706 0.004127513
# Plotting leverage for all values:
plot(hatvalues(lm.fit),
main = "Leverage diagnostics",
pch = 20, col = adjustcolor("black", .33))
# Identify max value:
which.max(hatvalues(lm.fit))
## 375
## 375
The which.max()
function identifies the index of the largest element of a vector. In this case, it tells us which observation has the maximum leverage statistic.
Multiple linear regression
In order to fit a multiple linear regression (multiple LR) model using least squares, we can use the same lm()
function. The syntax lm(y ~ x1 + x2 + x3)
is used to fit a model with three predictors, x1
, x2
, and x3
(as main effects). The summary()
function now outputs the regression coefficients for all the predictors:
# Multiple LR (with 2 predictors):
lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
The Boston
data set contains 12 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the dot (.
) in the following short-hand version:
# Include all main effects:
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
We can access the individual components of a summary object by name (type ?summary.lm
to see what is available). Hence summary(lm.fit)$r.sq
gives us the \(R^2\), and summary(lm.fit)$sigma
gives us the \(RSE\).
# Access individual components:
# ?summary.lm # Help on lm summary methods
summary(lm.fit)$r.sq # R^2
## [1] 0.7406427
summary(lm.fit)$adj.r.sq # adj. R^2
## [1] 0.7337897
summary(lm.fit)$sigma # RSE
## [1] 4.745298
Checking for collinearity
To check for potential collinearity between predictors, the vif()
function of the car package can be used to compute variance inflation factors.1 Most VIFs are low to moderate for these data:
# install.packages("car")  # if car has not been installed yet
library(car)
vif(lm.fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
Excluding variables and updating models
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age
has a high \(p\)-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age
:
lm.fit_2 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit_2)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Alternatively, the update()
function can be used on an existing model fit:
lm.fit_3 <- update(lm.fit, ~ . - age)
summary(lm.fit_3)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Interaction terms
It is easy to include interaction terms in a linear model using the lm()
function:
- The syntax lstat:black tells R to include an interaction term between lstat and black.
- The syntax lstat * age simultaneously includes terms for the main effects lstat and age, and the interaction term lstat × age as predictors; it is a shorthand for lstat + age + lstat:age.
# including only the interaction term:
summary(lm(medv ~ lstat:age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat:age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.347 -4.372 -1.534 1.914 27.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1588631 0.4828240 62.46 <2e-16 ***
## lstat:age -0.0077146 0.0003799 -20.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.827 on 504 degrees of freedom
## Multiple R-squared: 0.4501, Adjusted R-squared: 0.449
## F-statistic: 412.4 on 1 and 504 DF, p-value: < 2.2e-16
# including main effects and interaction:
summary(lm(medv ~ lstat * age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
Interpreting the significant interaction — here between two quantitative variables — can be challenging. (See Gelman & Hill, 2007, Sections 3.3 and 4.2.)
We can also pass in transformed versions of the predictors…
Non-linear transformations of the predictors
The lm()
function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a 2nd predictor \(X^2\) using I(X^2)
.
Note: The function I() is needed since the ^ symbol has a special meaning in an R formula object; wrapping the expression as I(lstat^2) inhibits this formula interpretation and allows the standard arithmetic usage in R, which is to raise lstat to the power 2.
We now perform a regression of medv
onto lstat
and lstat^2
:
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
Interpretation:
The near-zero \(p\)-value associated with the quadratic term suggests that it leads to an improved model.
Comparing models
We can use the anova()
function to quantify the extent to which the quadratic fit is superior to the linear fit:
lm.fit1 <- lm(medv ~ lstat)   # simple LR model
anova(lm.fit1, lm.fit2)       # comparing 2 models
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here Model 1 represents the linear submodel containing only one predictor, lstat
, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat
and lstat^2
. The anova()
function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the \(F\)-statistic is \(135\) and the associated \(p\)-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat
and lstat^2
is superior to the model that only contains the predictor lstat
. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv
and lstat
.
Diagnostic plots
To show diagnostic plots for a model lm.fit2
, we type:
opar <- par(no.readonly = TRUE)  # store original par
par(mfrow = c(2, 2))             # 2 rows and 2 cols
plot(lm.fit2,
pch = 20, col = adjustcolor("black", .25))
par(opar) # restore original par
Interpretation:
- When the
lstat^2
term is included in the model, there is little discernible pattern in the residuals.
Higher-order polynomial models
In order to create a cubic fit, we could include a predictor of the form I(X^3)
. However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly()
function to create the polynomial within lm()
. For example, the following command produces a fifth-order polynomial fit:
lm.fit5 <- lm(medv ~ poly(lstat, 5))
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.
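As a quick check of that claim (not part of the book's output; the model name lm.fit6 is made up here), one could fit a sixth-order polynomial and inspect the coefficient table for the highest-order term:
# Illustration only: is a 6th-order term still significant?
lm.fit6 <- lm(medv ~ poly(lstat, 6))
coef(summary(lm.fit6))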
By default, the poly()
function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly()
function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and \(p\)-values will differ). In order to obtain the raw polynomials from the poly()
function, the argument raw = TRUE
must be used.
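A quick way to verify this claim (assuming lm.fit5 from above; the name lm.fit5_raw is made up here):
# Refit with raw (non-orthogonal) polynomials:
lm.fit5_raw <- lm(medv ~ poly(lstat, 5, raw = TRUE))
# Coefficients differ, but the fitted values agree (up to numerical error):
all.equal(fitted(lm.fit5), fitted(lm.fit5_raw))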
Other transformations
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation:
summary(lm(medv ~ log(rm), data = Boston))
##
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.487 -2.875 -0.104 2.837 39.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.488 5.028 -15.21 <2e-16 ***
## log(rm) 54.055 2.739 19.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
## F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
(For details on linear and non-linear transformations, see Gelman & Hill, 2007, Chapter 4.)
Qualitative predictors
We will now examine the Carseats
data, which is part of the ISLR2 package. We will attempt to predict Sales
(child car seat sales) in \(400\) locations based on a number of predictors.
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
is.factor(Carseats$ShelveLoc)
## [1] TRUE
table(Carseats$ShelveLoc)
##
## Bad Good Medium
## 96 85 219
The Carseats
data includes qualitative predictors such as shelveloc
, an indicator of the quality of the shelving location — that is, the space within a store in which the car seat is displayed — at each location. The predictor shelveloc
takes on three possible values: Bad, Medium, and Good.
Given a qualitative variable such as shelveloc
(stored as a factor variable in R), R automatically generates dummy variables.
Below we fit a multiple regression model that includes some interaction terms.
# a qualitative variable (factor):
is.factor(Carseats$ShelveLoc)
## [1] TRUE
table(Carseats$ShelveLoc)
##
## Bad Good Medium
## 96 85 219
# Simple LR (using only the ShelveLoc factor as single predictor):
lm_1 <- lm(Sales ~ ShelveLoc, data = Carseats)
summary(lm_1)
##
## Call:
## lm(formula = Sales ~ ShelveLoc, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3066 -1.6282 -0.0416 1.5666 6.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5229 0.2388 23.131 < 2e-16 ***
## ShelveLocGood 4.6911 0.3484 13.464 < 2e-16 ***
## ShelveLocMedium 1.7837 0.2864 6.229 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared: 0.3172, Adjusted R-squared: 0.3138
## F-statistic: 92.23 on 2 and 397 DF, p-value: < 2.2e-16
# Multiple LR (including 2 interactions):
lm_2 <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm_2)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
Defining contrasts
The contrasts()
function returns the coding that R uses for the dummy variables:
# attach(Carseats)
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Use ?contrasts
to learn about other contrasts, and how to set them.
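For instance, sum-to-zero (deviation) coding could be set as follows; this is only an illustration (the lab keeps R's default treatment contrasts), and the copy cs_tmp is a made-up name:
cs_tmp <- Carseats                          # work on a copy to leave the data unchanged
contrasts(cs_tmp$ShelveLoc) <- contr.sum(3)
contrasts(cs_tmp$ShelveLoc)                 # inspect the new coding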
R has created a ShelveLocGood
dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium
dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables.
Interpretation: The fact that the coefficient for ShelveLocGood
in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium
has a smaller positive coefficient, indicating that a medium shelving location is associated with higher sales than a bad shelving location, but lower sales than a good shelving location.
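As a quick sanity check (assuming the factor-only model lm_1 fitted above), the group means of Sales reproduce the intercept (Bad) plus the corresponding dummy coefficients:
# Group means of Sales by shelving location:
tapply(Carseats$Sales, Carseats$ShelveLoc, mean)
# The same values from the lm_1 coefficients (Bad, Good, Medium):
unname(coef(lm_1)[1] + c(0, coef(lm_1)[2:3]))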
Writing R functions
As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function that reads in the ISLR2 and MASS libraries, called LoadLibraries()
.
Before we have created the function, R returns an error if we try to call it:
LoadLibraries
## Error in eval(expr, envir, enclos): object 'LoadLibraries' not found
LoadLibraries()
## Error in LoadLibraries(): could not find function "LoadLibraries"
We now create a new function. The body of a function is enclosed in curly brackets {}
. The {
symbol informs R that multiple commands are about to be entered. Hitting Enter after typing {
will cause R to print the +
symbol. We can then input as many commands as we wish, hitting Enter after each one. Finally the }
symbol informs R that no further commands will be entered.
LoadLibraries <- function() {
  library(ISLR2)
  library(MASS)
  print("The libraries have been loaded.")
}
Now if we type in LoadLibraries
(without parentheses), R will tell us what is in the function:
LoadLibraries
## function() {
## library(ISLR2)
## library(MASS)
## print("The libraries have been loaded.")
## }
If we call the function, the libraries are loaded in and the print statement is output:
LoadLibraries()
## [1] "The libraries have been loaded."
Note that R functions can also take and use arguments (both mandatory and optional). The names of such arguments would have to be supplied in the parentheses ()
after the function name (separated by commas).
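A minimal sketch of a function with one mandatory and one optional argument (the name add_pct and its behavior are made up for illustration):
# Increase x by pct percent (default: 10):
add_pct <- function(x, pct = 10) {
  x * (1 + pct / 100)
}
add_pct(200)            # uses the default pct = 10, returns 220
add_pct(200, pct = 50)  # returns 300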
3.7 Exercises
Conceptual
+++ here now +++
Applied
Links
Website on James et al. (2021): https://www.statlearning.com/
R packages ISLR at https://CRAN.R-project.org/package=ISLR and ISLR2 at https://CRAN.R-project.org/package=ISLR2
Data, figures, slides, and RMarkdown labs at https://www.statlearning.com/
[Last update of ISL_ch03.Rmd on 2022-06-27 by hn.]
References
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Retrieved from http://www.stat.columbia.edu/~gelman/arm/
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd edition). Retrieved from https://www.statlearning.com/
The car package is not part of the base R installation so it must be downloaded the first time you use it via the
install.packages()
function in R.↩︎