Chapter 4 of James, Witten, Hastie, & Tibshirani (2021). See https://www.statlearning.com/ for data and resources.

4 Classification

The LR model of Chapter 3 assumed that the response variable $Y$ is quantitative. But in many situations, the response variable is qualitative or categorical instead.

The process of predicting qualitative responses or assigning an observation to a category, group, or class, is known as categorization or classification. The corresponding task is known as categorizing or classifying an observation.

When methods used for classification predict a continuous probability that an observation belongs to each of the categories of a qualitative variable, they also behave like regression methods.

4.1 An overview of classification

Examples of classification tasks are ubiquitous. Examples include:

Medical diagnostics: Which of 3 conditions does someone have? Does a new case require emergency or regular treatment?
Financial settings: Is a credit card transaction legitimate or fraudulent?
Genetics: Which DNA sequences cause healthy vs. deleterious changes?

In all these cases, we can use training data to develop a classifier and need to evaluate its performance on different test data (i.e., data not used to train the classifier).

Example

Using the simulated Default data set:

library(ISLR2)

# Data:
df <- ISLR2::Default
dim(df)
## [1] 10000     4
head(df)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

Task: Explain or predict default (binary categories) as a function of student (binary categories), balance and income (continuous).

Figure 4.1 (p. 131) displays the relation between balance and income:

Figure 4.1a: Default data: Scatterplot.

Figure 4.1b: Default data: Boxplots for balance and income.

Recreating Figure 4.1:

par(mfrow = c(1, 3))

my_cols <- c("darkorange", "deepskyblue")

# a: Scatterplot
plot(x = df$balance, y = df$income, main = "(a) Income by balance",
     pch = ifelse(df$default == "Yes", 3, 1), cex = .8, 
     xlab = "Balance", ylab = "Income", 
     col = adjustcolor(ifelse(df$default == "Yes", my_cols[1], my_cols[2]), .5))

abline(lm(income ~ balance, data = df), lwd = 1, col = "firebrick")

# b: Boxplot 1
plot(x = df$default, y = df$balance, main = "(b) Balance by default",
     xlab = "Default", ylab = "Balance", 
     pch = 20, cex = 1.5, col = adjustcolor(rev(my_cols), .9))

# c: Boxplot 2
plot(x = df$default, y = df$income, main = "(c) Income by default",
     xlab = "Default", ylab = "Income", 
     pch = 20, cex = 1.5, col = adjustcolor(rev(my_cols), .9))

par(opar)  # restore original par

Figure 4.1: Credit default by balance and income.

Interpretation:

Figure 4.1a shows income and monthly credit card balance for a subset of 10,000 individuals. The individuals who defaulted in a given month are shown in orange, and those who did not in blue. (The overall default rate is about 3%.) It appears that individuals who defaulted tended to have higher credit card balances than those who did not.
Figures 4.1b and c show two pairs of boxplots: The first shows the distribution of balance split by the binary default variable; the second shows the corresponding boxplot for income. It seems that balance is a stronger predictor for the default response than income.

We want a model to predict the response variable default ($Y$) for any given value of balance ($X_1$) and income ($X_2$). But since the response $Y$ is qualitative (rather than quantitative), the simple or multiple LR models of Chapter 3 is inappropriate (see Section 4.2).

4.2 Why not LR?

LR is inappropriate for categorical / qualitative responses:

For qualitative responses with more than two classes, the numeric coding of categorical responses (e.g., 3 diseases) would imply some metric (e.g., order and distance). Alternative codings are arbitrary, but would yield different models.
For binary response variables, we could use the dummy variable approach (see Section 3.3.1). The predictions can be interpreted as probability estimates, but may be outside the $[0, 1]$ interval.

For binary qualitative responses, logistic regression (log R) is well-suited.

4.3 Logistic regression

Logistic regression predicts the conditional probability of a binary response variable (e.g., default) as a function of a continuous predictor variable (e.g., balance):

$p$(balance) = Pr(default = Yes | balance)

The logistic model

Conditional probability of a binary variable: We aim to model

\[p(X) = \text{Pr}(Y = 1\ |\ X)\]

Logistic function: To limit the range of predicted values $p(X)$ to the range $[0, 1]$, logistic regression uses the logistic function:

\[p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 - e^{\beta_0 + \beta_1 X}}\]

This function yields an S-shaped curve whose values can be interpreted as (conditional) probabilities.

Odds: From the logistic function, we get:

\[\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}\]

The left term $\frac{p(X)}{1 - p(X)}$ represents the odds, with values in the range $[0, \infty]$:

If the chances for some event are 1 in \5, the odds of this event are $\frac{1}{5 - 1} = \frac{1}{4}$.
If the chances for some event are 4 in 5, the odds of this event are $\frac{4}{5 - 4} = 4$.

Log odds or logit: Taking the logarithm on both sides, we get:

\[\text{log}(\frac{p(X)}{1 - p(X)}) = \beta_0 + \beta_1 X\]

The term on the left is called log odds or logit. Thus, the logit of the linear regression model is linear in $X$.

Note:

Interpreting $\beta_1$: In a LR model, $\beta_1$ represents the average change in $Y$ associated with a one-unit increase in $X$.
By contrast, increasing $X$ by one unit changes in a logistic regression model the log odds or logit by $\beta_1$, or — alternatively — multiplies the odds by $e^{\beta_1}$.
Variable changes of $Y$ by $X$: As the relationship between $X$ and $p(X)$ is a S-shaped curve, $\beta_1$ does not correspond to a constant change in $p(X)$ associated with a one-unit increase in $X$. Instead, the change in $p(X)$ changes based on a one-unit change in $X$ depends on the current value of $X$ (e.g., larger changes in the middle than for low and high values).

Estimating logistic regression coefficients

Rather than using the least square criterion of linear regression, logistic regression uses a maximum likelihood criterion to estimates its coefficients $\beta$. Essentially, the $\beta$-values are fitted to maximize the likelihood of correctly predicting the binary category of $Y$ in the training data (James et al., 2021, p. 135).

Re-creating Table 4.1 (p. 136): Coefficient estimates and related information that result from fitting a logistic regression model on the Default data in order to predict the probability of the response default = Yes by the continuous predictor balance.

The R function glm() can be used to fit many types of generalized linear models, including logistic regression. Its syntax is similar to lm(), except that we must pass in the argument family = binomial to run a logistic regression, rather than another type of generalized linear model:

# Data: 
# head(df)

# Logistic regression model (with 1 continuous predictor):
lrm_1 <- glm(default ~ balance, data = df, family = binomial)

# Inspect model:
# summary(lrm_1)
# coef(lrm_1)
summary(lrm_1)$coef
##                  Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance       0.005498917 0.0002203702  24.95309 1.976602e-137

Interpretation:

Significant coefficient $\hat{\beta_1} = 0.0055$ indicates a positive association between balance and default: An increase in balance is associated with an increase in the probability of default. Specifically, a one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
The $z$-statistic in logistic regression plays the same role as the $t$-statistic in linear regression.
The estimated intercept in logistic regression is typically not of interest. It adjusts the average fitted probabilities to the proportion (of positive cases) in the data (here, the overall default rate).

Making predictions

Quantitative predictors

Given the coefficients of a model, we can predict the probability of default for given credit card balance values:

# Inspect model:
# summary(lrm_1)
# coef(lrm_1)
summary(lrm_1)$coef
##                  Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance       0.005498917 0.0002203702  24.95309 1.976602e-137

# Predicting response values for new predictor values:
predict(lrm_1, newdata = data.frame(balance = c(1000, 2000)), type = "response")
##           1           2 
## 0.005752145 0.585769370

Interpretation:

The probability for default given a credit card balance of $1000 is low (0.6%), but increases to 58.6% for a credit card balance of $2000.

Qualitative predictors

Using qualitative predictors with a logistic regression model by a dummy variable approach:

Re-creating Table 4.2 (p. 137): Coefficient estimates and related information that result from fitting a logistic regression model on the Default data in order to predict the probability of the response default = Yes by the binary predictor student:

# Data: 
# head(df)

# Logistic regression model (with 1 binary predictor):
lrm_2 <- glm(default ~ student, data = df, family = binomial)

# Inspect model:
# summary(lrm_2)
# coef(lrm_2)
summary(lrm_2)$coef
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -3.5041278 0.07071301 -49.554219 0.0000000000
## studentYes   0.4048871 0.11501883   3.520181 0.0004312529

# Predicting response values for new predictor values:
predict(lrm_2, newdata = data.frame(student = c("Yes", "No")), type = "response")
##          1          2 
## 0.04313859 0.02919501

Interpretation:

The analysis for a binary predictor student internally creates a dummy variable that takes on a value of 1 for students and 0 for non-students.
The positive and significant coefficient indicates that students tend to have higher default probabilities than non-students.
The predicted (conditional) default probabilities for students are $\hat{\text{Pr}}(\text{default = Yes}\ |\ \text{student = Yes} ) = 4.3$% vs. $\hat{\text{Pr}}(\text{default = Yes}\ |\ \text{student = No} ) = 2.9$% for non-students.

Multiple logistic regression

In analogy to multiple LR (Chapter 3), we can predict a binary response variable by multiple predictors:

\[\text{log}(\frac{p(X)}{1 - p(X)}) = \beta_0 + \beta_1 X_1 + \ldots\ + \beta_p X_p \] which is the same as:

\[p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots\ + \beta_p X_p}}{1 - e^{\beta_0 + \beta_1 X_1 + \ldots\ + \beta_p X_p}}\]

Again, we can use maximum likelihood to estimate the $p$ coefficients $\beta_i$.

Re-creating Table 4.3 (p. 138): The coefficient estimates for a logistic regression model that uses 3 predictors balance, income (in thousands of dollars), and student status to predict the probability of default:

# Data: 
# head(df)

# Logistic regression model (with 3 continuous/binary predictors):
lrm_3 <- glm(default ~ balance + income + student, data = df, family = binomial)

# Inspect model:
# summary(lrm_3)
# coef(lrm_3)
summary(lrm_3)$coef
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.086905e+01 4.922555e-01 -22.080088 4.911280e-108
## balance      5.736505e-03 2.318945e-04  24.737563 4.219578e-135
## income       3.033450e-06 8.202615e-06   0.369815  7.115203e-01
## studentYes  -6.467758e-01 2.362525e-01  -2.737646  6.188063e-03

Interpretation:

The coefficients for balance and the dummy variable for student status are significant, indicating that these variables are associated with the probability of default.
Surprise: However, the coefficient for student status is negative, indicating that students are less likely to default than non-students.

The puzzle of confounding

The fact that the coefficient for the student variable was positive before (see model lrm_2 and Table 4.2 above), presents us with a puzzle: How is it possible for student status to be associated with an increase in probability of default in Table 4.2 and a decrease in probability of default in Table 4.3?

Figure 4.3 (p. 139) provides a graphical illustration of this apparent paradox:

Figure 4.3: Default data: (a) Probability to default by balance; (b) Confounding of balance by student status.

Interpretation:

The negative coefficient for student in the multiple logistic regression indicates that — for fixed values of balance and income — a student is less likely to default than a non-student. Indeed, we observe from the left-hand panel of Figure 4.3 that the student default rate is at or below that of the non-student default rate for every value of balance.
The horizontal dashed lines near the base of the plot show the default rates for students and non-students averaged over all values of balance and income: They indicate the opposite effect: The overall default rate for students is higher than that of non-students. Consequently, there is a positive coefficient for student in the single variable logistic regression (in Table 4.2).
Figure 4.3b (on the righ) provides an explanation for this apparent discrepancy: The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default.

This is an example of confounding: When predictors are correlated, the results from regressions with a single predictor may differ from the results using multiple predictors.

Conclusions (e.g., for credit card company, but also researchers): We must be careful in considering the associations between predictors. In the present case, we can conclude

Given no information about balance, a student is riskier than a non-student.
For the same credit card balance, a student is less risy than a non-student.

Prediction with multiple logistic regression

We can plug in the values of Table 4.3 into the equation for $\hat{p}(X)$ to make predictions for concrete values:

# Data: 
# head(df)

# Logistic regression model (with 3 continuous/binary predictors):
# lrm_3 <- glm(default ~ balance + income + student, data = df, family = binomial)

# Inspect model:
summary(lrm_3)$coef
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.086905e+01 4.922555e-01 -22.080088 4.911280e-108
## balance      5.736505e-03 2.318945e-04  24.737563 4.219578e-135
## income       3.033450e-06 8.202615e-06   0.369815  7.115203e-01
## studentYes  -6.467758e-01 2.362525e-01  -2.737646  6.188063e-03

# Predicting response values for new predictor values:
df_2_predict <- data.frame(income = 40, balance = 1500, student = c("Yes", "No"))
df_2_predict
##   income balance student
## 1     40    1500     Yes
## 2     40    1500      No

predict(lrm_3, newdata = df_2_predict, type = "response")
##          1          2 
## 0.05161531 0.09413452

Interpretation: Thus, for given values of balance and income, the default probability of a student is about half of a non-student.

Multinomial logistic regression

In multinomial logistic regression, the 2-class logistic regression approach is extended to settings of $K > 2$ classes.

This models the log odds of $K-1$ classes relative to a single class that serves as the baseline, and estimates coefficients for $K-1$ classes.

Alternatively, the softmax coding treats all $K$ classes symmetrically, rather than selecting a baseline class, and estimates coefficients for all $K$ classes.

4.4 Generative models for classification

Logistic regression directly models the conditional distribution of the response $Y$ , given (binary) predictor(s) $X$ — or the conditional probability $\text{Pr}(Y = k\ |\ X = x)$ — using the logistic function.

An alternative and less direct approach to estimating these probabilities models the distribution of the predictors $X$ separately in each of the response classes (i.e., for each value of $Y$) and uses Bayes’ theorem to flip these around into estimates for $\text{Pr}(Y = k\ |\ X = x)$. When the distribution of $X$ within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression.

Linear discriminant analysis for $p = 1$

Linear discriminant analysis for $p > 1$

Quadratic discriminant analysis

Naive Bayes

“Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.” (James et al., 2021, p. 155).

A comparison of classification methods

An analytical comparison

Analytic comparisons between LDA, QDA, and naive Bayes, as well as the non-parametric approach by KNN.

An empirical comparison

Empirical comparison between classification methods for 6 different classification scenarios.

“These six examples illustrate that no one method will dominate the others in every situation. When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA or naive Bayes may give better results.” (James et al., 2021, p. 163).

“Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.” (James et al., 2021, p. 163).

Generalized linear models

In Chapter 3, we assumed that the predictor variable is quantitative (continuous, regression). So far in Chapter 4, we assumed that the predictor variable is qualitative (categorical, classification). However, what can we do when the predictor variable is neither?

Example: Bikeshare data aims to predict bikers as counts (non-negative integer values, frequency counts).

# Inspect data:
dim(Bikeshare)
## [1] 8645   15
head(Bikeshare)
##   season mnth day hr holiday weekday workingday   weathersit temp  atemp  hum
## 1      1  Jan   1  0       0       6          0        clear 0.24 0.2879 0.81
## 2      1  Jan   1  1       0       6          0        clear 0.22 0.2727 0.80
## 3      1  Jan   1  2       0       6          0        clear 0.22 0.2727 0.80
## 4      1  Jan   1  3       0       6          0        clear 0.24 0.2879 0.75
## 5      1  Jan   1  4       0       6          0        clear 0.24 0.2879 0.75
## 6      1  Jan   1  5       0       6          0 cloudy/misty 0.24 0.2576 0.75
##   windspeed casual registered bikers
## 1    0.0000      3         13     16
## 2    0.0000      8         32     40
## 3    0.0000      5         27     32
## 4    0.0000      3         10     13
## 5    0.0000      0          1      1
## 6    0.0896      0          1      1
# hist(Bikeshare$bikers, col = "deepskyblue")

Linear regression on the Bikeshare data

We first conduct a linear regression on the Bikeshare data:

# GLM:
lm_4 <- glm(bikers ~ mnth + hr + workingday + temp + weathersit, data = Bikeshare)

summary(lm_4)
## 
## Call:
## glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -299.00   -45.70    -6.23    41.08   425.29  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -68.632      5.307 -12.932  < 2e-16 ***
## mnthFeb                      6.845      4.287   1.597 0.110398    
## mnthMarch                   16.551      4.301   3.848 0.000120 ***
## mnthApril                   41.425      4.972   8.331  < 2e-16 ***
## mnthMay                     72.557      5.641  12.862  < 2e-16 ***
## mnthJune                    67.819      6.544  10.364  < 2e-16 ***
## mnthJuly                    45.324      7.081   6.401 1.63e-10 ***
## mnthAug                     53.243      6.640   8.019 1.21e-15 ***
## mnthSept                    66.678      5.925  11.254  < 2e-16 ***
## mnthOct                     75.834      4.950  15.319  < 2e-16 ***
## mnthNov                     60.310      4.610  13.083  < 2e-16 ***
## mnthDec                     46.458      4.271  10.878  < 2e-16 ***
## hr1                        -14.579      5.699  -2.558 0.010536 *  
## hr2                        -21.579      5.733  -3.764 0.000168 ***
## hr3                        -31.141      5.778  -5.389 7.26e-08 ***
## hr4                        -36.908      5.802  -6.361 2.11e-10 ***
## hr5                        -24.135      5.737  -4.207 2.61e-05 ***
## hr6                         20.600      5.704   3.612 0.000306 ***
## hr7                        120.093      5.693  21.095  < 2e-16 ***
## hr8                        223.662      5.690  39.310  < 2e-16 ***
## hr9                        120.582      5.693  21.182  < 2e-16 ***
## hr10                        83.801      5.705  14.689  < 2e-16 ***
## hr11                       105.423      5.722  18.424  < 2e-16 ***
## hr12                       137.284      5.740  23.916  < 2e-16 ***
## hr13                       136.036      5.760  23.617  < 2e-16 ***
## hr14                       126.636      5.776  21.923  < 2e-16 ***
## hr15                       132.087      5.780  22.852  < 2e-16 ***
## hr16                       178.521      5.772  30.927  < 2e-16 ***
## hr17                       296.267      5.749  51.537  < 2e-16 ***
## hr18                       269.441      5.736  46.976  < 2e-16 ***
## hr19                       186.256      5.714  32.596  < 2e-16 ***
## hr20                       125.549      5.704  22.012  < 2e-16 ***
## hr21                        87.554      5.693  15.378  < 2e-16 ***
## hr22                        59.123      5.689  10.392  < 2e-16 ***
## hr23                        26.838      5.688   4.719 2.41e-06 ***
## workingday                   1.270      1.784   0.711 0.476810    
## temp                       157.209     10.261  15.321  < 2e-16 ***
## weathersitcloudy/misty     -12.890      1.964  -6.562 5.60e-11 ***
## weathersitlight rain/snow  -66.494      2.965 -22.425  < 2e-16 ***
## weathersitheavy rain/snow -109.745     76.667  -1.431 0.152341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5852.877)
## 
##     Null deviance: 154743728  on 8644  degrees of freedom
## Residual deviance:  50364008  on 8605  degrees of freedom
## AIC: 99568
## 
## Number of Fisher Scoring iterations: 2

# Fitted values:
fv <- lm_4$fitted.values
sum(fv < 0)/length(fv)  # proportion of fv < 0
## [1] 0.09635628

See Table 4.10 and Figure 4.13 (p. 165) for the resulting coefficients:

Figure 4.13: A least squares linear regression model was fit to predict bikers in the Bikeshare data set. Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight.

Interpretation:

On first inspection, the coefficients and trends appear reasonable (e.g., lower in winter and at night).
However, many strange values (e.g., $9.6$% of fitted values are negative).
Importantly, the assumption that the variance is constant is violated (see Figure 4.14): Times with low values of bikers have far lower variance than times with high numbers. Thus, the heteroscedasticity of the data questions the suitability of a LR model.

Log-transforming $Y$ as remedy?

Some of the problems that arise when fitting a LR model to the Bikeshare data can be overcome by transforming the response variable $Y$. Log-transforming the response to log($Y$) avoids the possibility of negative predictions, and it overcomes much of the heteroscedasticity in the untransformed data (see the right-hand panel of Figure 4.14). However, the coefficients of log-transformed models are harder to interpret and response values of $Y = 0$ are no longer possible.

Thus, fitting a LR to a transformation of the response may be an adequate approach for some count-valued data sets, but often leaves something to be desired. Next, we will see that a Poisson regression model provides a much more natural and elegant approach for this task.

Poisson Regression on the Bikeshare data

Poisson distribution

If a random variable $Y$ takes on non-negative integer values (i.e., $Y \in \{0, 1, 2, \ldots\}$, it follows a Poisson distribution when

\[ \text{Pr}(Y = k) = \frac{e^{-\lambda} \lambda^k}{k!} \text{ for } k = 1, 2, 3, \ldots\] Interestingly, $\lambda > 0$ is both the expected value $\text{E}(Y)$ and expected variance $\text{Var}(Y)$ for $k = 0, 1, 2, \ldots$. Thus, as the mean of $Y$ increases, so does its variance.

Poisson regression

Conducting a Poisson regression model on $log(Y)$:

# GLM: Poisson 
pm_1 <- glm(log(bikers) ~ mnth + hr + workingday + temp + weathersit, 
            data = Bikeshare, family = poisson)

# Note: Warnings due to non-integers!

summary(pm_1)

# Fitted values:
fv <- pm_1$fitted.values
sum(fv < 0)/length(fv)  # proportion of fv < 0

See Section 4.7.7: Poisson regression in lab (below).

+++ here now +++

Generalized linear models in greater generality

We have now discussed three types of regression models: linear regression (LR), logistic regression, and Poisson regression. They all model the mean of a response variable $Y$ as a function of predictor variables $X_1 \ldots X_p$. Their differences can be expressed by applying different link functions $\eta$ to the term $\beta_0 + \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$.

The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the exponential family. Other well-known members of this family are the exponential distribution, the Gamma distribution, and the negative binomial distribution. Any regression approach that follows this general recipe is known as a generalized linear model (GLM).

4.7 Lab: Classification methods

The Stock Market Data

We begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR2 library. This data set consists of percentage returns for the S&P 500 stock index over $1,250$ days, from the beginning of 2001 until the end of 2005.

For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question), and direction (whether the market was Up or Down on this date).

Our goal is to predict direction (a binary, qualitative response) using the other features.

library(ISLR2)

# Inspect data: 
dim(Smarket)  # 1250 x 9 
## [1] 1250    9
names(Smarket)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
# summary(Smarket)  # descriptive summaries of variables
# pairs(Smarket)    # graphical overview

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the direction variable is qualitative:

cor(Smarket)  # yields an error, due to a non-numeric variable
## Error in cor(Smarket): 'x' must be numeric

# Inspect data type of variables (columns):
(tb <- tibble::as_tibble(Smarket))  # 9-th variable is a factor
## # A tibble: 1,250 × 9
##     Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <fct>    
##  1  2001  0.381 -0.192 -2.62  -1.06   5.01    1.19  0.959 Up       
##  2  2001  0.959  0.381 -0.192 -2.62  -1.06    1.30  1.03  Up       
##  3  2001  1.03   0.959  0.381 -0.192 -2.62    1.41 -0.623 Down     
##  4  2001 -0.623  1.03   0.959  0.381 -0.192   1.28  0.614 Up       
##  5  2001  0.614 -0.623  1.03   0.959  0.381   1.21  0.213 Up       
##  6  2001  0.213  0.614 -0.623  1.03   0.959   1.35  1.39  Up       
##  7  2001  1.39   0.213  0.614 -0.623  1.03    1.44 -0.403 Down     
##  8  2001 -0.403  1.39   0.213  0.614 -0.623   1.41  0.027 Up       
##  9  2001  0.027 -0.403  1.39   0.213  0.614   1.16  1.30  Up       
## 10  2001  1.30   0.027 -0.403  1.39   0.213   1.23  0.287 Up       
## # … with 1,240 more rows
is.numeric(Smarket$Direction)
## [1] FALSE

# Correlation of numeric variables:
# cor(Smarket[, -9])  # excluding 9-th column/variable 
cor(Smarket[-9])    # (same, as Smarket is a list)
##              Year         Lag1         Lag2         Lag3         Lag4
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000

As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation (of $0.54$) is between Year and volume.

By plotting the data, which is ordered chronologically, we see that volume is increasing over time (as later data points were recorded at later times). In other words, the average number of shares traded daily increased from 2001 to 2005.

attach(Smarket)
plot(Volume, 
     pch = 20, cex = 1.5, col = adjustcolor("black", .25))
title("Smarket data: Volume over time (data points)", adj = 0)

Logistic regression

Next, we will fit a logistic regression model in order to predict direction using Lag1 through Lag5 and volume as predictors. The glm() function can be used to fit many types of generalized linear models, including logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model:

Logistic regression model

# Binomial/logistic regression model 1:
glm_logR <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)

# Summary:
summary(glm_logR)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Interpretation:

The smallest $p$-value here is associated with Lag1 (but is still $14.5$%). The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of $0.145$, the $p$-value is still relatively large, and so there is no clear evidence of a real association between lagone and direction.

We use the coef() function in order to access just the coefficients for the fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the $p$-values for the coefficients.

# Partial summaries: 
coef(glm_logR)               # only values of coefficients 
##  (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5 
## -0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068 
##       Volume 
##  0.135440659
summary(glm_logR)$coef       # only table of coefficients
##                 Estimate Std. Error    z value  Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3         0.011085108 0.04993854  0.2219750 0.8243333
## Lag4         0.009358938 0.04997413  0.1872757 0.8514445
## Lag5         0.010313068 0.04951146  0.2082966 0.8349974
## Volume       0.135440659 0.15835970  0.8552723 0.3924004
summary(glm_logR)$coef[, 4]  # only p-values (4th column)
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##   0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974 
##      Volume 
##   0.3924004

Model predictions

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type = "response" option tells R to output conditional probabilities of the form $P(Y = 1\ |\ X)$, as opposed to other information, such as the logit of the response. If no new data set is supplied to the predict() function, then the conditional probabilities are computed for the training data that was used to fit the logistic regression model. Here print only the first ten probabilities. We know that these values correspond to the probability of the market going Up, rather than Down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up:

# Predictions: 
glm_probs <- predict(glm_logR, type = "response")
glm_probs[1:10]  # first 10 values
##         1         2         3         4         5         6         7         8 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 
##         9        10 
## 0.5176135 0.4888378

# Check: Values of binary Direction levels: 
contrasts(Direction)  # 1 indicates "Up"
##      Up
## Down  0
## Up    1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels (i.e., either Up or Down). The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than $0.50$:

n_cases <- dim(Smarket)[1]

# Vector of predictions:
glm_pred <- rep(NA, n_cases)         # initialize predictions to NA (missing)
glm_pred[glm_probs <= .50] = "Down"  # change corresponding elements to "Down"
glm_pred[glm_probs >  .50] = "Up"    # change corresponding elements to "Up"

# Show predictions:
table(glm_pred)
## glm_pred
## Down   Up 
##  286  964

The first command creates a vector of 1,250 NA elements. The second and third lines transform those elements to Down or Up, based on whether the predicted probability of a market increase is lower or higher than a threshold value of $0.50$.

Confusion matrix

Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. By inputting two qualitative vectors, R will create a 2x2 table with counts of the number of times each combination occurred (e.g., predicted Up and market increased, predicted Up and the market decreased, etc.).

# 2x2 matrix/confusion table: 
table(Prediction = glm_pred, True_direction = Direction)
##           True_direction
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

# Accuracy: 
(145 + 507) / 1250  
## [1] 0.5216
mean(glm_pred == Direction)
## [1] 0.5216

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on $507$~days and that it would go down on $145$~days, for a total of $507 + 145 = 652$ correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market $52.2$,% of the time.

Interpretation:

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of $1,250$ observations. In other words, $100\% - 52.2\% = 47.8\%$, is the training error rate.

As we have seen previously, the training error rate is often overly optimistic — it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future (or unseen days) for which the market’s movements are unknown.

Training vs. testing data

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

# Splitting data into parts:
train <- (Year < 2005)  # create a Boolean/logical vector

Smarket_2005 <- Smarket[!train, ]    # test data frame
dim(Smarket_2005)                    # 252 x 9
## [1] 252   9
Direction_2005 <- Direction[!train]  # test data vector

The object train is a vector of $1,250$ elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are either TRUE or FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix (via logical indexing or subsetting in R). For instance, the command Smarket[train, ] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE. The negation symbol ! is used to reverse all of the elements (or truth values) of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train, ] yields a submatrix of the stock market data containing only the observations for which train is FALSE — that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set — that is, for the days in 2005.

# Binomial regression model 2 (using only train data subset):
glm_log <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, family = binomial, subset = train)

# Predictions for new data:
glm_probs <- predict(glm_log, newdata = Smarket_2005, type = "response")

Notice that we have trained and tested our model on two completely separate data sets (or rather disjunct subsets of the Smarket data): Training was performed using only the dates before 2005, and testing (or prediction) was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period:

n_pred <- dim(Smarket_2005)[1]
glm_pred <- rep(NA, n_pred)           # initialize vector
glm_pred[glm_probs <= .5] <- "Down"   # switch values with p <= .5  
glm_pred[glm_probs >  .5] <- "Up"     # switch values with p >  .5  

# 2x2 matrix/confusion table: 
table(glm_pred, Direction_2005)
##         Direction_2005
## glm_pred Down Up
##     Down   77 97
##     Up     34 44

# Accuracy:
mean(glm_pred == Direction_2005)  # prop correct cases/accuracy
## [1] 0.4801587
mean(glm_pred != Direction_2005)  # prop incorrect cases/error rate
## [1] 0.5198413

# Base rate/baseline for Up:
sum(Direction_2005 == "Up")/length(Direction_2005)
## [1] 0.5595238

The != notation means not equal to, and so the last command computes the test set error rate. The results are rather disappointing: The test error rate is $52$,%, which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days’ returns to predict future market performance.¹

Note also that the baseline for the market going Up in 2005 is $97 + 44 = 141$ out of 252 days. Hence, if we always predicted Up, we would be correct on $141/252 = 56.0$% of all days.

Improving the model?

We recall that the logistic regression model glm_log had very underwhelming $p$-values associated with all of the predictors, and that the smallest $p$-value, though not very small, corresponded to the Lag1 predictor variable. Perhaps by removing the variables that appear not to be helpful in predicting direction, we can obtain a more effective model? After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement.

Next, we refit the logistic regression model glm_log_2 by using only two predictors Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model:

# Binomial regression model 3 (using only 2 predictors and only training data subset):
glm_log_2 <- glm(Direction ~ Lag1 + Lag2, 
                 data = Smarket, family = binomial, subset = train)

# summary(glm_log_2)  # (coefficients still quite small...)

# Predictions: 
glm_probs <- predict(glm_log_2, newdata = Smarket_2005, type = "response")

glm_pred <- rep(NA, n_pred)   # initialize
glm_pred[glm_probs <= .5] <- "Down"  
glm_pred[glm_probs >  .5] <- "Up"   

# 2x2 matrix/confusion table: 
table(Prediction = glm_pred, Direction_2005)
##           Direction_2005
## Prediction Down  Up
##       Down   35  35
##       Up     76 106
mean(glm_pred == Direction_2005)  # accuracy 
## [1] 0.5595238

# But: Compare accuracy to 2 simpler/baseline strategies:
(35 + 106) / n_pred  # A. accuracy (diagonal cases/N) = 
## [1] 0.5595238
(35 + 106) / n_pred  # B. baseline of Up (right column/N)
## [1] 0.5595238
106 / (106 + 76)     # C. correct given predicting-Up
## [1] 0.5824176

Interpretation:

Now the results appear to be a little better: $56\%$ of the daily movements have been correctly predicted. It is worth noting that in this case, a much simpler strategy of predicting that the market will increase every day will also be correct $56\%$ of the time (i.e., the baseline for Up)! Hence, in terms of overall error rate, the logistic regression method is no better than the naive approach.

Moreover, the confusion matrix shows that on days when logistic regression predicts an increase in the market, always predicting Up has a $58\%$ accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.

Predicting particular values

Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and $-$0.8. We do this using the predict() function:

# Predicting new cases (for particular values): 
predict(glm_log_2,
        newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)),
        type = "response")
##         1         2 
## 0.4791462 0.4960939

Linear discriminant analysis (LDA)

Now we will perform a linear discriminant analysis (LDA) on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library. Notice that the syntax for the lda() function is identical to that of lm(), and to that of glm() except for the absence of the family option. We fit the model using only the observations before 2005:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
## The following object is masked from 'package:dplyr':
## 
##     select
lda_fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket,
    subset = train)
lda_fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
# plot(lda_fit)

The LDA output indicates that $\hat\pi_1 = 0.492$ and $\hat\pi_2 = 0.508$; in other words, $49.2$,% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$. These suggest that there is a tendency for the previous 2~days’ returns to be negative on days when the market increases, and a tendency for the previous days’ returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of lagone and lagtwo that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X = x$ in Equation (4.24). If $-0.642\cdot \text{lagone} - 0.514 \cdot \text{lagtwo}$ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.

The plot() function produces plots of the linear discriminants, obtained by computing $-0.642 \cdot \text{lagone} - 0.514 \cdot \text{lagtwo}$ for each of the training observations:

plot(lda_fit, col = "gold")

The Up and Down observations are displayed separately.

The predict() function returns a list with three elements:

lda_pred <- predict(lda_fit, Smarket_2005)
names(lda_pred)
## [1] "class"     "posterior" "x"

The first element, class, contains LDA’s predictions about the movement of the market. The second element, posterior, is a matrix whose $k$th column contains the posterior probability that the corresponding observation belongs to the $k$th class, computed from Equation (4.15). Finally, x contains the linear discriminants, described earlier.

As we observed in Section 4.5, the LDA and logistic regression predictions are almost identical:

lda_class <- lda_pred$class
table(lda_class, Direction_2005)
##          Direction_2005
## lda_class Down  Up
##      Down   35  35
##      Up     76 106
mean(lda_class == Direction_2005)
## [1] 0.5595238

Applying a $50$,% threshold to the posterior probabilities allows us to recreate the predictions contained in lda_pred$class:

sum(lda_pred$posterior[, 1] >= .5)
## [1] 70
sum(lda_pred$posterior[, 1] <  .5)
## [1] 182

Notice that the posterior probability output by the model corresponds to the probability that the market will decrease:

lda_pred$posterior[1:20, 1]
##       999      1000      1001      1002      1003      1004      1005      1006 
## 0.4901792 0.4792185 0.4668185 0.4740011 0.4927877 0.4938562 0.4951016 0.4872861 
##      1007      1008      1009      1010      1011      1012      1013      1014 
## 0.4907013 0.4844026 0.4906963 0.5119988 0.4895152 0.4706761 0.4744593 0.4799583 
##      1015      1016      1017      1018 
## 0.4935775 0.5030894 0.4978806 0.4886331
lda_class[1:20]  # first 20 predictions
##  [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up   Up   Up  
## [16] Up   Up   Down Up   Up  
## Levels: Down Up

If we wanted to use a posterior probability threshold other than $50$,% in order to make predictions, then we could easily do so. For instance, suppose that we wish to predict a market decrease only if we are very certain that the market will indeed decrease on that day — say, if the posterior probability is at least $90$,%:

sum(lda_pred$posterior[, 1] > .90)
## [1] 0

No days in 2005 meet that threshold! In fact, the greatest posterior probability of decrease in all of 2005 was $52.02$,%:

max(lda_pred$posterior[, 1])
## [1] 0.520235

Quadratic discriminant analysis (QDA)

We will now fit a quadratic discriminant analysis (QDA) model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda():

qda_fit <- qda(Direction ~ Lag1 + Lag2, 
               data = Smarket, subset = train)
qda_fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA:

qda_class <- predict(qda_fit, Smarket_2005)$class

table(qda_class, Direction_2005)
##          Direction_2005
## qda_class Down  Up
##      Down   30  20
##      Up     81 121
mean(qda_class == Direction_2005)
## [1] 0.5992063

Interestingly, the QDA predictions are accurate almost $60$,% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is known to be quite hard to model accurately. This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, we recommend evaluating this method’s performance on a larger test set before betting that this approach will consistently beat the market!

Naive Bayes (NBayes)

Next, we fit a naive Bayes (NBayes) model to the Smarket data. Naive Bayes is implemented in R using the naiveBayes() function, which is part of the e1071 library. The syntax is identical to that of lda() and qda(). By default, this implementation of the naive Bayes classifier models each quantitative feature using a Gaussian distribution. However, a kernel density method can also be used to estimate the distributions:

library(e1071)
nb_fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Smarket,
    subset = train)
nb_fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     Down       Up 
## 0.491984 0.508016 
## 
## Conditional probabilities:
##       Lag1
## Y             [,1]     [,2]
##   Down  0.04279022 1.227446
##   Up   -0.03954635 1.231668
## 
##       Lag2
## Y             [,1]     [,2]
##   Down  0.03389409 1.239191
##   Up   -0.03132544 1.220765

The output contains the estimated mean and standard deviation for each variable in each class. For example, the mean for lagone is $0.0428$ for Direction = Down, and the standard deviation is $1.23$. We can easily verify this:

mean(Lag1[train][Direction[train] == "Down"])
## [1] 0.04279022
sd(Lag1[train][Direction[train] == "Down"])
## [1] 1.227446

The predict() function is straightforward.

nb_class <- predict(nb_fit, Smarket_2005)
table(nb_class, Direction_2005)
##         Direction_2005
## nb_class Down  Up
##     Down   28  20
##     Up     83 121
mean(nb_class == Direction_2005)
## [1] 0.5912698

Naive Bayes performs very well on this data, with accurate predictions over $59\%$ of the time. This is slightly worse than QDA, but much better than LDA.

The predict() function can also generate estimates of the probability that each observation belongs to a particular class:

nb_pred <- predict(nb_fit, Smarket_2005, type = "raw")
nb_pred[1:5, ]
##           Down        Up
## [1,] 0.4873164 0.5126836
## [2,] 0.4762492 0.5237508
## [3,] 0.4653377 0.5346623
## [4,] 0.4748652 0.5251348
## [5,] 0.4901890 0.5098110

$K$-nearest neighbors (KNN)

We will now perform a $K$-nearest neighbors (KNN) analysis using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs:

A matrix containing the predictors associated with the training data, labeled train_X below.
A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test_X below.
A vector containing the class labels for the training observations, labeled train_Direction below.
A value for $K$, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set:

library(class)

train_X <- cbind(Lag1, Lag2)[ train, ]
test_X  <- cbind(Lag1, Lag2)[!train, ]

train_Direction <- Direction[train]

Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure the reproducibility of results:

set.seed(1)  # reproducible randomness

# Fit KNN model:
knn_pred <- knn(train_X, test_X, train_Direction, k = 1)

# Evaluation: 
table(knn_pred, Direction_2005)
##         Direction_2005
## knn_pred Down Up
##     Down   43 58
##     Up     68 83
(83 + 43)/252  # same as: 
## [1] 0.5
mean(knn_pred == Direction_2005)
## [1] 0.5

The results using $K = 1$ are not very good, since only $50$% of the observations are correctly predicted. Of course, it may be that $K = 1$ results in an overly flexible fit to the data. Below, we repeat the analysis using $K = 3$:

# Fit KNN model:
knn_pred <- knn(train_X, test_X, train_Direction, k = 3)

# Evaluation: 
table(knn_pred, Direction_2005)
##         Direction_2005
## knn_pred Down Up
##     Down   48 54
##     Up     63 87
(48 + 87)/252
## [1] 0.5357143
mean(knn_pred == Direction_2005)
## [1] 0.5357143

The results have improved slightly. But increasing $K$ further turns out to provide no further improvements. It appears that for this data, QDA provides the best results of the methods that we have examined so far.

KNN on `Caravan` data

KNN does not perform well on the Smarket data but it does often provide impressive results. As an example we will apply the KNN approach to the Caravan data set, which is part of the ISLR2 library. This data set includes $85$ predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only $6$,% of people purchased caravan insurance:

library(ISLR2)
dim(Caravan)  # 5822 86
## [1] 5822   86

attach(Caravan)

summary(Purchase)
##   No  Yes 
## 5474  348
348 / 5822
## [1] 0.05977327

Scaling/standardizing variables: Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale. For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of $\$1,000$ in salary is enormous compared to a difference of $50$ years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect. This is contrary to our intuition that a salary difference of $\$1,000$ is quite small compared to an age difference of $50$ years. Furthermore, the importance of scale to the KNN classifier leads to another issue: If we measured salary in Japanese yen, or if we measured age in minutes, then we’d get quite different classification results from what we get if these two variables are measured in dollars and years.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column $86$, because that is the qualitative Purchase variable:

standardized_X <- scale(Caravan[, -86])

# Comparing raw values with standardized values: 
var(Caravan[, 1])
## [1] 165.0378
var(Caravan[, 2])
## [1] 0.1647078

var(standardized_X[, 1])
## [1] 1
var(standardized_X[, 2])
## [1] 1

Now every column of standardized_X has a standard deviation of $1$ and a mean of $0$.

We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. The vector test is numeric, with values from $1$ through $1,000$. Typing standardized_X[test, ] yields the submatrix of the data containing the observations (rows) whose indices range from $1$ to $1,000$, whereas typing standardized_X[-test, ] yields the submatrix containing the observations (rows) whose indices do not range from $1$ to $1,000$.

# Prepare train vs. test data: 
test <- 1:1000

train_X <- standardized_X[-test, ]
test_X  <- standardized_X[ test, ]
train_Y <- Purchase[-test]
test_Y  <- Purchase[ test]

We fit a KNN model on the training data using $K = 1$, and evaluate its performance on the test data:

set.seed(1)  # reproducible randomness 

# Fit KNN model:
knn_pred <- knn(train_X, test_X, train_Y, k = 1)

# Evaluation:
table(knn_pred, test_Y)
##         test_Y
## knn_pred  No Yes
##      No  873  50
##      Yes  68   9
mean(test_Y != knn_pred)  # error rate
## [1] 0.118
mean(test_Y != "No")
## [1] 0.059

The KNN error rate on the 1,000 test observations is just under $12$%. At first glance, this may appear to be fairly good. However, since only $6$% of customers purchased insurance, we could get the error rate down to $6$% by always predicting No regardless of the values of the predictors!

Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only $6$%, which may be far too low given the costs involved. Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.

It turns out that KNN with $K = 1$ does far better than random guessing among the customers that are predicted to buy insurance. Among $77$ such customers, $9$, or $11.7$%, actually do purchase insurance. This is twice the rate that one would obtain from random guessing:

# Evaluation:
table(knn_pred, test_Y)
##         test_Y
## knn_pred  No Yes
##      No  873  50
##      Yes  68   9
9 / (68 + 9)
## [1] 0.1168831

Using $K = 3$, the success rate increases to $19$%, and with $K = 5$ the rate is $26.7$%. This is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!

# Fit KNN model:
knn_pred <- knn(train_X, test_X, train_Y, k = 3)
# Evaluation:
table(knn_pred, test_Y)
##         test_Y
## knn_pred  No Yes
##      No  920  54
##      Yes  21   5
5 / 26
## [1] 0.1923077

# Fit KNN model:
knn_pred <- knn(train_X, test_X, train_Y, k = 5)
# Evaluation:
table(knn_pred, test_Y)
##         test_Y
## knn_pred  No Yes
##      No  930  55
##      Yes  11   4
4 / 15
## [1] 0.2666667

However, while this strategy is cost-effective, it is worth noting that only 15 customers are predicted to purchase insurance using KNN with $K = 5$. In practice, the insurance company may wish to expend resources on convincing more than just 15 potential customers to buy insurance.

Comparing to a logistic-regresssion model

As a comparison, we can also fit a logistic regression model to the Caravan data:

# Fit logistic regression model: 
glm_fits <- glm(Purchase ~ ., data = Caravan, 
                family = binomial, subset = -test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Predicting probabilities and evaluation of categorical predictions:

# Predict probabilities:
glm_probs <- predict(glm_fits, Caravan[test, ], type = "response")

# Turn probabilities into categories:
glm_pred <- rep(NA, 1000)
glm_pred[glm_probs <= .50] <- "No"  # cut-off = .50
glm_pred[glm_probs >  .50] <- "Yes"

# Evaluation:
table(glm_pred, test_Y)
##         test_Y
## glm_pred  No Yes
##      No  934  59
##      Yes   7   0

If we use a value of $0.5$ as the predicted probability cut-off for the classifier, then we have a problem: Only seven of the test observations are predicted to purchase insurance. Even worse, we are wrong about all of these!

Different classification criterion: However, we are not required to use a cut-off of $0.5$. If we instead predict a purchase any time the predicted probability of purchase exceeds $0.25$, we get much better results:

# Turn probabilities into categories:
glm_pred <- rep(NA, 1000)
glm_pred[glm_probs <= .25] <- "No"  # cut-off = .25
glm_pred[glm_probs >  .25] <- "Yes"

# Evaluation:
table(glm_pred, test_Y)
##         test_Y
## glm_pred  No Yes
##      No  919  48
##      Yes  22  11
11 / (22 + 11)
## [1] 0.3333333

We predict that 33 people will purchase insurance, and we are correct for about $33$% of these people. This is over five times better than random guessing!

Poisson regression

Finally, we fit a Poisson regression model to the Bikeshare data set, which measures the number of bike rentals (bikers) per hour in Washington, DC. The data can be found in the ISLR2 library:

library(ISLR2)

attach(Bikeshare)

# Inspect data:
dim(Bikeshare)
## [1] 8645   15
names(Bikeshare)
##  [1] "season"     "mnth"       "day"        "hr"         "holiday"   
##  [6] "weekday"    "workingday" "weathersit" "temp"       "atemp"     
## [11] "hum"        "windspeed"  "casual"     "registered" "bikers"

# Quick plots:
hist(bikers)

plot(x = mnth, y = bikers, main = "Bikers by month")

plot(x = hr,   y = bikers, main = "Bikers by hour")

We begin by fitting a least squares linear regression model to the data:

# Fit linear regression model: 
mod_lm <- lm(bikers ~ mnth + hr + workingday + temp + weathersit,
             data = Bikeshare)

summary(mod_lm)
## 
## Call:
## lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.00  -45.70   -6.23   41.08  425.29 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -68.632      5.307 -12.932  < 2e-16 ***
## mnthFeb                      6.845      4.287   1.597 0.110398    
## mnthMarch                   16.551      4.301   3.848 0.000120 ***
## mnthApril                   41.425      4.972   8.331  < 2e-16 ***
## mnthMay                     72.557      5.641  12.862  < 2e-16 ***
## mnthJune                    67.819      6.544  10.364  < 2e-16 ***
## mnthJuly                    45.324      7.081   6.401 1.63e-10 ***
## mnthAug                     53.243      6.640   8.019 1.21e-15 ***
## mnthSept                    66.678      5.925  11.254  < 2e-16 ***
## mnthOct                     75.834      4.950  15.319  < 2e-16 ***
## mnthNov                     60.310      4.610  13.083  < 2e-16 ***
## mnthDec                     46.458      4.271  10.878  < 2e-16 ***
## hr1                        -14.579      5.699  -2.558 0.010536 *  
## hr2                        -21.579      5.733  -3.764 0.000168 ***
## hr3                        -31.141      5.778  -5.389 7.26e-08 ***
## hr4                        -36.908      5.802  -6.361 2.11e-10 ***
## hr5                        -24.135      5.737  -4.207 2.61e-05 ***
## hr6                         20.600      5.704   3.612 0.000306 ***
## hr7                        120.093      5.693  21.095  < 2e-16 ***
## hr8                        223.662      5.690  39.310  < 2e-16 ***
## hr9                        120.582      5.693  21.182  < 2e-16 ***
## hr10                        83.801      5.705  14.689  < 2e-16 ***
## hr11                       105.423      5.722  18.424  < 2e-16 ***
## hr12                       137.284      5.740  23.916  < 2e-16 ***
## hr13                       136.036      5.760  23.617  < 2e-16 ***
## hr14                       126.636      5.776  21.923  < 2e-16 ***
## hr15                       132.087      5.780  22.852  < 2e-16 ***
## hr16                       178.521      5.772  30.927  < 2e-16 ***
## hr17                       296.267      5.749  51.537  < 2e-16 ***
## hr18                       269.441      5.736  46.976  < 2e-16 ***
## hr19                       186.256      5.714  32.596  < 2e-16 ***
## hr20                       125.549      5.704  22.012  < 2e-16 ***
## hr21                        87.554      5.693  15.378  < 2e-16 ***
## hr22                        59.123      5.689  10.392  < 2e-16 ***
## hr23                        26.838      5.688   4.719 2.41e-06 ***
## workingday                   1.270      1.784   0.711 0.476810    
## temp                       157.209     10.261  15.321  < 2e-16 ***
## weathersitcloudy/misty     -12.890      1.964  -6.562 5.60e-11 ***
## weathersitlight rain/snow  -66.494      2.965 -22.425  < 2e-16 ***
## weathersitheavy rain/snow -109.745     76.667  -1.431 0.152341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 8605 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.6731 
## F-statistic: 457.3 on 39 and 8605 DF,  p-value: < 2.2e-16

Interpreting coefficients: Due to space constraints, we truncate the output of summary(mod_lm). In mod_lm, the first level of hr (0) and mnth (Jan) are treated as the baseline values, and so no coefficient estimates are provided for them: implicitly, their coefficient estimates are zero, and all other levels are measured relative to these baselines. For example, the Feb coefficient of $6.845$ signifies that, holding all other variables constant, there are on average about 7 more riders in February than in January. Similarly there are about 16.5 more riders in March than in January.

Coding contrasts for coefficients

The results seen in Section 4.6.1 used a slightly different coding of the variables hr and mnth, as follows:

# Define contrasts:
contrasts(Bikeshare$hr)   = contr.sum(24)
contrasts(Bikeshare$mnth) = contr.sum(12)

# Fit linear regression model: 
mod_lm2 <- lm(bikers ~ mnth + hr + workingday + temp + weathersit, 
              data = Bikeshare)

summary(mod_lm2)
## 
## Call:
## lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.00  -45.70   -6.23   41.08  425.29 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 73.5974     5.1322  14.340  < 2e-16 ***
## mnth1                      -46.0871     4.0855 -11.281  < 2e-16 ***
## mnth2                      -39.2419     3.5391 -11.088  < 2e-16 ***
## mnth3                      -29.5357     3.1552  -9.361  < 2e-16 ***
## mnth4                       -4.6622     2.7406  -1.701  0.08895 .  
## mnth5                       26.4700     2.8508   9.285  < 2e-16 ***
## mnth6                       21.7317     3.4651   6.272 3.75e-10 ***
## mnth7                       -0.7626     3.9084  -0.195  0.84530    
## mnth8                        7.1560     3.5347   2.024  0.04295 *  
## mnth9                       20.5912     3.0456   6.761 1.46e-11 ***
## mnth10                      29.7472     2.6995  11.019  < 2e-16 ***
## mnth11                      14.2229     2.8604   4.972 6.74e-07 ***
## hr1                        -96.1420     3.9554 -24.307  < 2e-16 ***
## hr2                       -110.7213     3.9662 -27.916  < 2e-16 ***
## hr3                       -117.7212     4.0165 -29.310  < 2e-16 ***
## hr4                       -127.2828     4.0808 -31.191  < 2e-16 ***
## hr5                       -133.0495     4.1168 -32.319  < 2e-16 ***
## hr6                       -120.2775     4.0370 -29.794  < 2e-16 ***
## hr7                        -75.5424     3.9916 -18.925  < 2e-16 ***
## hr8                         23.9511     3.9686   6.035 1.65e-09 ***
## hr9                        127.5199     3.9500  32.284  < 2e-16 ***
## hr10                        24.4399     3.9360   6.209 5.57e-10 ***
## hr11                       -12.3407     3.9361  -3.135  0.00172 ** 
## hr12                         9.2814     3.9447   2.353  0.01865 *  
## hr13                        41.1417     3.9571  10.397  < 2e-16 ***
## hr14                        39.8939     3.9750  10.036  < 2e-16 ***
## hr15                        30.4940     3.9910   7.641 2.39e-14 ***
## hr16                        35.9445     3.9949   8.998  < 2e-16 ***
## hr17                        82.3786     3.9883  20.655  < 2e-16 ***
## hr18                       200.1249     3.9638  50.488  < 2e-16 ***
## hr19                       173.2989     3.9561  43.806  < 2e-16 ***
## hr20                        90.1138     3.9400  22.872  < 2e-16 ***
## hr21                        29.4071     3.9362   7.471 8.74e-14 ***
## hr22                        -8.5883     3.9332  -2.184  0.02902 *  
## hr23                       -37.0194     3.9344  -9.409  < 2e-16 ***
## workingday                   1.2696     1.7845   0.711  0.47681    
## temp                       157.2094    10.2612  15.321  < 2e-16 ***
## weathersitcloudy/misty     -12.8903     1.9643  -6.562 5.60e-11 ***
## weathersitlight rain/snow  -66.4944     2.9652 -22.425  < 2e-16 ***
## weathersitheavy rain/snow -109.7446    76.6674  -1.431  0.15234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 8605 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.6731 
## F-statistic: 457.3 on 39 and 8605 DF,  p-value: < 2.2e-16

Interpreting coefficients: What is the difference between the two codings? In mod_lm2, a coefficient estimate is reported for all but the last level of hr and mnth. Importantly, in mod_lm2, the coefficient estimate for the last level of mnth is not zero: Instead, it equals the negative of the sum of the coefficient estimates for all of the other levels. Similarly, in mod_lm2, the coefficient estimate for the last level of hr is the negative of the sum of the coefficient estimates for all of the other levels. This means that the coefficients of hr and mnth in mod_lm2 will always sum to zero, and can be interpreted as the difference from the mean level. For example, the coefficient for January of $-46.087$ indicates that, holding all other variables constant, there are typically 46 fewer riders in January relative to the yearly average.

It is important to realize that the choice of coding really does not matter, provided that we interpret the model output correctly in light of the coding used. For example, we see that the predictions from the linear model are the same regardless of coding:

(sq_dif <- sum((predict(mod_lm) - predict(mod_lm2))^2))
## [1] 1.426274e-18
dplyr::near(sq_dif, 0)
## [1] TRUE

The sum of squared differences is (near) zero. We can also see this using the all.equal() function:

all.equal(predict(mod_lm), predict(mod_lm2))
## [1] TRUE

To reproduce the left-hand side of Figure 4.13, we must first obtain the coefficient estimates associated with mnth. The coefficients for January through November can be obtained directly from the mod_lm2 object. The coefficient for December must be explicitly computed as the negative sum of all the other months:

# Get and compute coefficients (for months):
coef_months <- c(coef(mod_lm2)[2:12], -sum(coef(mod_lm2)[2:12]))

# Get and compute coefficients (for hours):
coef_hours <- c(coef(mod_lm2)[13:35], -sum(coef(mod_lm2)[13:35]))

To make the plot, we manually label the $x$-axis with the names of the months:

opar <- par(no.readonly = TRUE)  # store original par
par(mfrow = c(1, 2))  # 1 row and 2 cols

# Figure 4.13a:
plot(coef_months, xlab = "Month", ylab = "Coefficient",
     xaxt = "n", col = "deepskyblue", pch = 19, type = "o")
# axis(side = 1, at = 1:12, 
#      labels = c("J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"))
# simpler: Use R constant month.abb
axis(side = 1, at = 1:12, labels = month.abb)

title("Figure 4.13a", adj = 0)

# Figure 4.13b:

plot(coef_hours, xlab = "Hour", ylab = "Coefficient",
     col = "deeppink", pch = 19, type = "o")
title("Figure 4.13b", adj = 0)


par(opar)  # restore original par

Reproducing the right-hand side of Figure 4.13 follows a similar process. (Here, we combined both in a single figure.)

Poisson regression model

Now, instead of fitting a LR model, we fit a Poisson regression model to the Bikeshare data. To do so, very little changes, except that we now use the function glm() with the argument family = poisson to specify that we wish to fit a Poisson regression model:

mod_pois <- glm(bikers ~ mnth + hr + workingday + temp + weathersit, 
                data = Bikeshare, family = poisson)
summary(mod_pois)
## 
## Call:
## glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     family = poisson, data = Bikeshare)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -20.7574   -3.3441   -0.6549    2.6999   21.9628  
## 
## Coefficients:
##                            Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                4.118245   0.006021  683.964  < 2e-16 ***
## mnth1                     -0.670170   0.005907 -113.445  < 2e-16 ***
## mnth2                     -0.444124   0.004860  -91.379  < 2e-16 ***
## mnth3                     -0.293733   0.004144  -70.886  < 2e-16 ***
## mnth4                      0.021523   0.003125    6.888 5.66e-12 ***
## mnth5                      0.240471   0.002916   82.462  < 2e-16 ***
## mnth6                      0.223235   0.003554   62.818  < 2e-16 ***
## mnth7                      0.103617   0.004125   25.121  < 2e-16 ***
## mnth8                      0.151171   0.003662   41.281  < 2e-16 ***
## mnth9                      0.233493   0.003102   75.281  < 2e-16 ***
## mnth10                     0.267573   0.002785   96.091  < 2e-16 ***
## mnth11                     0.150264   0.003180   47.248  < 2e-16 ***
## hr1                       -0.754386   0.007879  -95.744  < 2e-16 ***
## hr2                       -1.225979   0.009953 -123.173  < 2e-16 ***
## hr3                       -1.563147   0.011869 -131.702  < 2e-16 ***
## hr4                       -2.198304   0.016424 -133.846  < 2e-16 ***
## hr5                       -2.830484   0.022538 -125.586  < 2e-16 ***
## hr6                       -1.814657   0.013464 -134.775  < 2e-16 ***
## hr7                       -0.429888   0.006896  -62.341  < 2e-16 ***
## hr8                        0.575181   0.004406  130.544  < 2e-16 ***
## hr9                        1.076927   0.003563  302.220  < 2e-16 ***
## hr10                       0.581769   0.004286  135.727  < 2e-16 ***
## hr11                       0.336852   0.004720   71.372  < 2e-16 ***
## hr12                       0.494121   0.004392  112.494  < 2e-16 ***
## hr13                       0.679642   0.004069  167.040  < 2e-16 ***
## hr14                       0.673565   0.004089  164.722  < 2e-16 ***
## hr15                       0.624910   0.004178  149.570  < 2e-16 ***
## hr16                       0.653763   0.004132  158.205  < 2e-16 ***
## hr17                       0.874301   0.003784  231.040  < 2e-16 ***
## hr18                       1.294635   0.003254  397.848  < 2e-16 ***
## hr19                       1.212281   0.003321  365.084  < 2e-16 ***
## hr20                       0.914022   0.003700  247.065  < 2e-16 ***
## hr21                       0.616201   0.004191  147.045  < 2e-16 ***
## hr22                       0.364181   0.004659   78.173  < 2e-16 ***
## hr23                       0.117493   0.005225   22.488  < 2e-16 ***
## workingday                 0.014665   0.001955    7.502 6.27e-14 ***
## temp                       0.785292   0.011475   68.434  < 2e-16 ***
## weathersitcloudy/misty    -0.075231   0.002179  -34.528  < 2e-16 ***
## weathersitlight rain/snow -0.575800   0.004058 -141.905  < 2e-16 ***
## weathersitheavy rain/snow -0.926287   0.166782   -5.554 2.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1052921  on 8644  degrees of freedom
## Residual deviance:  228041  on 8605  degrees of freedom
## AIC: 281159
## 
## Number of Fisher Scoring iterations: 5

We can plot the coefficients associated with mnth and hr, in order to reproduce Figure 4.15:

opar <- par(no.readonly = TRUE)  # store original par
par(mfrow = c(1, 2))  # 1 row and 2 cols

# (a)
coef_months <- c(coef(mod_pois)[2:12], -sum(coef(mod_pois)[2:12]))
plot(coef_months, xlab = "Month", ylab = "Coefficient",
     xaxt = "n", col = "deepskyblue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = month.abb)
title("Figure 4.15a", adj = 0)

# (b)
coef_hours <- c(coef(mod_pois)[13:35], -sum(coef(mod_pois)[13:35]))
plot(coef_hours, xlab = "Hour", ylab = "Coefficient",
     col = "deeppink", pch = 19, type = "o")
title("Figure 4.15b", adj = 0)


par(opar)  # restore original par

We can once again use the predict() function to obtain the fitted values (predictions) from this Poisson regression model. However, we must use the argument type = "response" to specify that we want R to output $\exp(\hat\beta_0 + \hat\beta_1 X_1 + \ldots +\hat\beta_p X_p)$ rather than $\hat\beta_0 + \hat\beta_1 X_1 + \ldots + \hat\beta_p X_p$, which it will output by default.

Comparing predictions: The following plot compares the predictions from the LR model with those of the Poisson regression model:

plot(predict(mod_lm2), predict(mod_pois, type = "response"), 
     pch = 20, cex = 1, col = adjustcolor("black", .10))
abline(0, 1, col = "deeppink", lwd = 2)

The predictions from the Poisson regression model are correlated with those from the linear model; however, the former are non-negative. As a result the Poisson regression predictions tend to be larger than those from the linear model for either very low or very high levels of ridership.

Other family arguments

In this section, we used the glm() function with the argument family = poisson in order to perform a Poisson regression. Earlier in this lab we used the glm() function with family = binomial to perform logistic regression models. Other choices for the family argument can be used to fit other types of GLMs. For instance, family = Gamma fits a gamma regression model.

+++ here now +++

4.8 Exercises

Conceptual

Applied

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd edition). Retrieved from https://www.statlearning.com/

After all, if it were possible to do so, then the authors of this book and script would be out striking it rich rather than writing a statistics textbook.↩︎

ISL

hn

2022-06-27

4 Classification

4.1 An overview of classification

Example

4.2 Why not LR?

4.3 Logistic regression

The logistic model

Estimating logistic regression coefficients

Making predictions

Quantitative predictors

Qualitative predictors

Multiple logistic regression

The puzzle of confounding

Prediction with multiple logistic regression

Multinomial logistic regression

4.4 Generative models for classification

Linear discriminant analysis for \(p = 1\)

Linear discriminant analysis for \(p > 1\)

Quadratic discriminant analysis

Naive Bayes

A comparison of classification methods

An analytical comparison

An empirical comparison

Generalized linear models

Linear regression on the Bikeshare data

Log-transforming \(Y\) as remedy?

Poisson Regression on the Bikeshare data

Poisson distribution

Poisson regression

Generalized linear models in greater generality

4.7 Lab: Classification methods

The Stock Market Data

Logistic regression

Logistic regression model

Model predictions

Confusion matrix

Training vs. testing data

Improving the model?

Predicting particular values

Linear discriminant analysis (LDA)

Quadratic discriminant analysis (QDA)

Naive Bayes (NBayes)

\(K\)-nearest neighbors (KNN)

KNN on Caravan data

Comparing to a logistic-regresssion model

Poisson regression

Coding contrasts for coefficients

Poisson regression model

Other family arguments

4.8 Exercises

Conceptual

Applied

Links

References

KNN on `Caravan` data