ISL

hn

2022-05-16

Chapter 2 of James, Witten, Hastie, & Tibshirani (2021). See https://www.statlearning.com/ for data and resources.

2 Statistical Learning

2.1 What is statistical learning?

Example 1: Investigating the association between advertising and product sales (in Advertising data).

Note that budgets can be controlled and manipulated, whereas sales can only be measured. Specific goal: Predict sales on the basis of the three media budgets.

The Advertising data aims to predict sales (an outcome variable) based on advertising budgets (in thousands of dollars) for TV, radio, and newspaper media (i.e., input variables):

library(ISLR2)

# Data: 
# https://www.statlearning.com/resources-second-edition
data_url  <- "https://www.statlearning.com/s/Advertising.csv"

wd <- getwd()
if (wd == "/Users/hneth/Desktop/stuff/Dropbox/GitHub/predict/R"){ # in subdir R: 
  data_file <- "./../data/_JamesEtAl2021_ISL2/data/Advertising.csv"  # up 1 level
} else {
  data_file <- "./data/_JamesEtAl2021_ISL2/data/Advertising.csv" 
}

ad <- tibble::as_tibble(read.csv(data_file))
dim(ad)
## [1] 200   5

Figure 2.1 (p. 16) displays sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media:

Figure 2.1

Recreating Figure 2.1:

opar <- par(mfrow = c(1, 3))  # save original par settings

plot(ad$TV, ad$sales, main = "(a) sales by TV",
     pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ TV, data = ad), lwd = 2, col = "deepskyblue")

plot(ad$radio, ad$sales, main = "(b) sales by radio",
     pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ radio, data = ad), lwd = 2, col = "deeppink")

plot(ad$newspaper, ad$sales, main = "(c) sales by newspaper", 
     pch = 20, cex = 2, col = adjustcolor("black", .25))
abline(lm(sales ~ newspaper, data = ad), lwd = 2, col = "gold")

par(opar)  # restore original par

Figure 2.1

For an output variable \(Y\) and a set of \(p\) predictor variables \(X = (X_1,\ X_2,\ \ldots,\ X_p)\) we assume:

\[Y = f(X) + \epsilon\]

Here, \(f\) represents the systematic information that \(X\) provides about \(Y\), whereas \(\epsilon\) is a random error term (aka. noise) that is independent of \(X\) and has a mean of zero.
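
To make this setup concrete, here is a minimal sketch that simulates data from this model (the choice of \(f\) and the noise level are arbitrary assumptions):

set.seed(101)
n <- 100
x <- runif(n, min = 0, max = 10)       # predictor values
f <- function(x) { 20 + 3 * sqrt(x) }  # hypothetical systematic component f
eps <- rnorm(n, mean = 0, sd = 2)      # random error: independent of x, mean 0
y <- f(x) + eps                        # observed response

plot(x, y, pch = 20, col = adjustcolor("black", .50))
curve(f(x), add = TRUE, lwd = 2, col = "deepskyblue")  # the true (usually unknown) f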

Example 2: Investigating the association between income and years of education, seniority etc. (in Income1 data):

# Data: 
# https://www.statlearning.com/resources-second-edition
data_url  <- "https://www.statlearning.com/s/Income1.csv"
wd <- getwd()
if (wd == "/Users/hneth/Desktop/stuff/Dropbox/GitHub/predict/R"){ # in subdir R: 
  data_file <- "./../data/_JamesEtAl2021_ISL2/data/Income1.csv"  # up 1 level
} else {
  data_file <- "./data/_JamesEtAl2021_ISL2/data/Income1.csv" 
}

in_1 <- tibble::as_tibble(read.csv(data_file))
dim(in_1)
## [1] 30  3

Plotting Income as a function of years of Education (Figure 2.2, p. 17):

Figure 2.2

Recreating a version of Figure 2.2 with a linear regression model, rather than the true underlying non-linear model:

plot(in_1$Education, in_1$Income, 
     main = "Income by education", xlab = "Years of education", ylab = "Income", 
     ylim = c(0, 90), 
     pch = 20, cex = 2, col = adjustcolor("black", .75))

# Linear regression:
lm_1 <- lm(Income ~ Education, data = in_1) 
abline(lm_1, lwd = 2, col = "deepskyblue")

# Predictions:
y_hat <- predict(lm_1)
segments(x0 = in_1$Education, y0 = in_1$Income, y1 = y_hat,
         lwd = 2, col = "deeppink")

Figure 2.2: Income by years of education (with linear regression model).

Note that the values of the outcome variable can be predicted as a function of multiple variables (see Figures 2.3–2.6, James et al., 2021, p. 17ff.).

Figure 2.3

Statistical learning refers to a set of approaches for estimating \(f\).

Why estimate \(f\)?

Two main reasons for estimating \(f\) are prediction and inference:

  1. prediction: Given a set of inputs \(X\), we predict \(\hat{Y} = \hat{f}(X)\) based on a black-box estimate \(\hat{f}\). The accuracy of \(\hat{Y}\) as a prediction for \(Y\) depends on two quantities: The reducible error results from inadequacies in \(\hat{f}\) and can be reduced by using a better model; by contrast, the irreducible error results from the variability in \(Y\) due to \(\epsilon\), which cannot be captured by improving \(\hat{f}\). The irreducible error thus provides an upper bound on the accuracy of our prediction for \(Y\) (see Equation 2.3, James et al., 2021, p. 19, and the sketch after this list).

  2. inference: Rather than making predictions for \(Y\), we may want to understand the association between \(Y\) and \(X = (X_1,\ X_2,\ \ldots,\ X_p)\). In this situation we wish to estimate \(f\) to gain a deeper understanding of the form of the relationship (e.g., linear or non-linear), as well as the direction and magnitude of the association between the outcome and its predictors.
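
A minimal sketch of the two error components (assuming a known \(f\) and noise level): even the perfect estimate \(\hat{f} = f\) cannot push the test MSE below the irreducible error \(\text{Var}(\epsilon)\):

set.seed(102)
f <- function(x) { sin(2 * x) }     # assumed true f
x <- runif(10000)
y <- f(x) + rnorm(10000, sd = 0.5)  # Var(eps) = 0.25

f_hat_1 <- function(x) { 0 * x }    # crude estimate: always predict 0
f_hat_2 <- f                        # perfect estimate: f_hat = f

mean((y - f_hat_1(x))^2)  # reducible + irreducible error
mean((y - f_hat_2(x))^2)  # irreducible error only: close to 0.25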

In practice, we often aim for a combination of both prediction and inference (see examples in James et al., 2021, p. 19f.). We often have a trade-off that models need to negotiate between both goals: For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. By contrast, some highly non-linear approaches (covered in the later chapters of the book) can provide quite accurate predictions, but come at the expense of a less interpretable model for which inference is more challenging.

How do we estimate \(f\)?

Our general task is to apply a statistical learning method to training data for estimating an unknown function \(f\).

Linear and non-linear approaches for estimating \(f\) share certain characteristics.

Most statistical learning methods can be characterized as either parametric or non-parametric:

  1. parametric methods involve a two-step model-based approach:

    1. Make an assumption about the functional form or shape of \(f\) (e.g., a linear or polynomial equation).

    2. Use the training data to fit or train the model (e.g., minimizing squared deviations).

This model-based approach is called parametric as it reduces the problem of estimating \(f\) to estimating a set of parameters (e.g., the coefficients of an equation).

Example: Compare Figure 2.4 (p. 22: A linear regression model fit by least squares to the Income2 data) to the original data from Figure 2.3 (p. 18).

Figure 2.4

  • Advantage: It is easier to estimate parameters for a given form of \(f\) than to estimate the form of \(f\).

  • Cost: If the form is wrong, the fits and predictions of the parametric model will be poor.

Fitting more flexible models (i.e., varying both the form and the parameters) requires additional parameters and is likely to result in overfitting (i.e., fitting not just systematic information, but also random errors/noise).

  2. non-parametric methods make no explicit assumptions about the functional form of \(f\). Instead, they aim to estimate \(f\) by some model that fits the data as closely as possible (given some metric for quantifying deviations).

Example: Figure 2.5 (p. 23) and Figure 2.6 (p. 24) use a thin-plate spline to estimate \(f\) for the Income2 data (shown in Figure 2.3, p. 18). By varying the level of smoothness (i.e., a parameter of the spline), the fit can be increased, but this can result in overfitting.

Figure 2.5

Figure 2.6

Note an Erratum (James et al., 2021, p. 24):
“Figure 2.6 shows the same thin-plate spline fit using a lower level of smoothness, allowing for a rougher fit.”
should read:
“Figure 2.6 shows the same thin-plate spline fit using a lower level of smoothness, allowing for a closer fit.”
or more explicitly:
“Figure 2.6 shows the same thin-plate spline fit using a lower level of smoothness, allowing for rougher surface functions and closer fits.”

  • Advantage: By avoiding the limitation on a particular form for \(f\), a wider range of possible shapes for \(f\) can be fit.

  • Cost: More data is needed, as the range of potential models is far larger than when only estimating parameters.

The trade-off between prediction accuracy and model interpretability

Given the closer fit of more flexible models, we can ask: Why would we ever use a more restrictive method instead of a very flexible approach?

The preference for a more restrictive vs. more flexible model depends mostly on our goals:

  • If we are mainly interested in inference, then restrictive models are usually more interpretable. By contrast, flexible models can become opaque and difficult or impossible to understand.

  • If we are mainly interested in prediction, more flexible models usually show a closer fit to observed/training data. However, due to the risk of overfitting, this apparent advantage may not carry over to out-of-sample and out-of-population predictions.

See Figure 2.7 (p. 25) for a representation of the trade-off between flexibility and interpretability for different statistical learning methods (e.g., lasso, least squares, trees, deep learning). In general, as the flexibility of a method increases, its interpretability decreases.

Figure 2.7

Overall, when inference is the main goal, using simple and relatively inflexible models has clear advantages. When prediction is the main goal, more flexible models usually provide better fits to observed data than simpler models, but their benefits and costs for predicting new data become an empirical question.

Supervised versus unsupervised learning

Most statistical learning problems fall into one of two categories: supervised or unsupervised.

  1. supervised learning: For each observation of the predictor measurement(s) \(x_i\), \(i = 1,\ \ldots,\ n\) there is an associated response measurement \(y_i\).

Example: The majority of problems discussed here correspond to this setting: linear regression (Chapter 3), logistic regression (Chapter 4), as well as more modern approaches, such as GAM, boosting, and support vector machines, belong to the supervised learning domain.

  2. unsupervised learning: For every observation \(i = 1,\ \ldots,\ n\) we observe a vector of measurements \(x_i\) but no associated response \(y_i\). Thus, we cannot fit a model, because we lack a response variable that can supervise our analysis.

Example: In this more challenging setting, we seek to understand the relationships between the variables or between the observations (e.g., by cluster analysis, seeking to distinguish between groups of observations on the basis of input variables, see Figure 2.8, p. 27, and Chapter 12).

Figure 2.8

Note that hybrid problems are possible. Example: In a semi-supervised learning problem, we have a total of \(n\) observations. For a subset \(m < n\) of the observations, we have both predictor measurements and a response measurement. For the remaining \((n − m)\) observations, we have only predictor measurements but no response measurement.

Regression versus classification problems

Variables can be characterized as either quantitative or qualitative (aka. categorical):

  • Quantitative variables take on numerical values. Examples: A person’s age, height, or income, the value of a house, and the price of a stock.

  • Qualitative variables take on values in one of \(k\) different classes, or categories. Examples: A person’s marital status (married or not), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia).

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.

Note:

  • Mixed cases exist. For example, as logistic regression predicts binary outcomes, it is a classification method. But as it assigns continuous probabilities to outcomes, it can be thought of as a regression method as well.

  • Statistical learning methods are usually selected on the basis of whether the response (outcome) is quantitative or qualitative (e.g., we might use linear regression with quantitative responses and logistic regression with qualitative responses). However, whether the predictors (inputs) are qualitative or quantitative is generally less important (provided that they are coded appropriately, see Chapter 3).
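
To illustrate what coded appropriately means: a qualitative predictor stored as a factor is automatically expanded into dummy variables when fitting a model. A minimal sketch with made-up data:

df <- data.frame(y     = c(10, 12, 19, 25),
                 brand = factor(c("A", "A", "B", "C")))
model.matrix(y ~ brand, data = df)  # shows the dummy columns brandB and brandC
lm(y ~ brand, data = df)            # lm() applies this coding automatically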

2.2 Assessing model accuracy

How can we assess the quality of a model?

Measuring the quality of fit

Model accuracy in regression settings is typically quantified by the mean squared error (MSE) of model predictions, \(\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2\), computed (a) for training data vs. (b) for test data. Typically, we aim for the lowest test MSE (i.e., minimizing deviations for previously unseen test observations that were not used to train the model), that is, the lowest average squared prediction error.

Figure 2.9 (p. 31) illustrates the trade-off between accuracy in fitting vs. predicting by model complexity:

Figure 2.9

  • (a) fitting: More complex models (with many degrees of freedom) fit training data better (i.e., training MSE decreases monotonically as the degrees of freedom increase, though not necessarily linearly).

  • (b) predicting: Models of intermediate complexity predict new data better than very simple and very complex models (i.e., the test MSE forms a U-shaped curve over degrees of freedom). The increase in test MSE for complex models is due to overfitting the training data (i.e., a less flexible model would have yielded a smaller test MSE). A small simulation below illustrates this pattern.
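
A small simulation can reproduce this pattern (the true \(f\) is an arbitrary assumption; polynomial degree serves as a proxy for flexibility): training MSE falls monotonically, whereas test MSE typically traces a U-shape:

set.seed(103)
f <- function(x) { sin(2 * pi * x) }  # assumed true f
n <- 100
x_trn <- runif(n); y_trn <- f(x_trn) + rnorm(n, sd = 0.3)  # training data
x_tst <- runif(n); y_tst <- f(x_tst) + rnorm(n, sd = 0.3)  # test data

dfs <- 1:15  # polynomial degrees (flexibility)
mse <- sapply(dfs, function(d) {
  m <- lm(y_trn ~ poly(x_trn, d))
  y_hat_tst <- predict(m, newdata = data.frame(x_trn = x_tst))
  c(train = mean(resid(m)^2),
    test  = mean((y_tst - y_hat_tst)^2))
})

matplot(dfs, t(mse), type = "l", lty = 1, lwd = 2,
        xlab = "Flexibility (polynomial degree)", ylab = "MSE")
legend("topright", legend = rownames(mse), col = 1:2, lwd = 2)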

Contrast with Figure 2.10 (p. 33) that assumes a true relationship (or \(f\)) that is approximately linear (2 df):

Figure 2.10

  • (a) for fitting/training data, complex models are better (as always), but

  • (b) for predicting/test data, the simple linear model (with 2 df) is almost as good as the optimal model (of intermediate complexity).

Contrast with Figure 2.11 (p. 34) that assumes a true relationship (or \(f\)) that is curvilinear (with 5+ df):

Figure 2.11

  • Linear regression (with only 2 df) provides a poor fit to both training and test data.

Ideally, we aim for a simple model (i.e., a transparent model allowing for inference) that is also good at predicting new data. Many methods exist for negotiating the trade-off between model complexity and each model’s performance for fitting vs. predicting data. An important method for estimating test MSE using the training data is cross-validation (Chapter 5).

The bias-variance trade-off

The U-shape between model complexity and fit observed in the test MSE curves (see Figures 2.9 to 2.11, p. 31ff.) is the result of two competing properties of statistical learning methods.

For a given value \(x_0\), the expected test MSE at \(x_0\), \(\text{E}(y_0 - \hat{f}(x_0))^2\), can be decomposed into the sum of three fundamental quantities:

  1. the variance of \(\hat{f}(x_0)\),
  2. the squared bias \([\text{Bias}(\hat{f}(x_0))]^2\), and
  3. the variance of the error terms \(\text{Var}(\epsilon)\).

\[ \text{E}(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon) \]
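
The decomposition can be checked numerically by repeatedly drawing training sets from a known model and estimating each term at a fixed point \(x_0\) (a sketch with an assumed \(f\) and a deliberately biased linear model):

set.seed(104)
f <- function(x) { sin(2 * pi * x) }  # assumed true f
x0 <- 0.25; n <- 50; sigma <- 0.3

f_hat_x0 <- replicate(2000, {         # predictions at x0 across training sets
  x <- runif(n); y <- f(x) + rnorm(n, sd = sigma)
  m <- lm(y ~ x)                      # deliberately biased (linear) model
  predict(m, newdata = data.frame(x = x0))
})
y0 <- f(x0) + rnorm(2000, sd = sigma) # test responses at x0

c(variance = var(f_hat_x0),
  bias_sq  = (mean(f_hat_x0) - f(x0))^2,
  var_eps  = sigma^2,
  test_MSE = mean((y0 - f_hat_x0)^2)) # ~ variance + bias_sq + var_eps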

What do the terms variance and bias mean in the context of statistical modeling?

  • variance refers to the amount by which \(\hat{f}\) would change if we estimated it from a different training data set. When a model has high variance, small changes in the training data can yield large changes in \(\hat{f}\).

  • bias refers to a systematic error that is introduced by using a particular model or type of model. In many settings, bias results from approximating a real-life problem, which may be complicated (or depend on many influences), by a simpler model (that ignores a lot of information).

The bias-variance trade-off implies that

  • more flexible models generally exhibit higher variance, but result in lower bias.

  • By contrast, simpler models tend to have lower variance, but higher bias.

Example of variance decomposition: Figure 2.12 (p. 36) shows the squared bias (blue), variance (orange), \(\text{Var}(\epsilon)\) (dashed), and test MSE (red) for the three data sets in Figures 2.9–2.11.

Figure 2.12

Note:

  • To minimize the expected test error, we aim for a statistical learning method that simultaneously achieves low variance and low (squared) bias (on the test data). As achieving either criterion alone is easy (i.e., by using either a very simple or a very flexible model), but the two usually oppose each other, the challenge lies in finding a good compromise that satisfies both criteria simultaneously.

  • As squared errors and variance are non-negative, the expected test MSE can never lie below the irreducible error \(\text{Var}(\epsilon)\).

  • In a real-life situation, the true \(f\) is unobserved (i.e., only to be inferred). Thus, it is generally not possible to compute a model’s test MSE, bias, or variance. However, methods like cross-validation allow estimating the test MSE using the training data (see Chapter 5).

The classification setting

Whereas the quantification of model accuracy is mostly discussed in regression settings, the same issues arise for classification tasks. A simple way to quantify classification performance is to compute the proportion of erroneous classifications, \(\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)\) (i.e., the mean of an indicator variable that signals erroneous classifications), for the training and test data (i.e., training error rate vs. test error rate). A good classifier should have a small test error rate.
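
In R, this reduces to averaging an indicator vector of misclassifications (a sketch with made-up labels and predictions):

y     <- c("a", "b", "a", "a", "b")  # true classes
y_hat <- c("a", "b", "b", "a", "a")  # predicted classes
mean(y != y_hat)                     # error rate: 2 of 5 are misclassified
## [1] 0.4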

The Bayes classifier

Assign each observation with predictor values \(x_0\) to the category/class \(j\) for which the conditional probability \(\text{Pr}(Y = j\ |\ X = x_0)\) is maximized (i.e., to the most likely category, given our observations).

Binary case: Predict class 1 if \(\text{Pr}(Y = 1 \ |\ X = x_0) > 0.5\); otherwise, predict the other class.

Example of the Bayes decision boundary for a binary classification problem: Figure 2.13 (p. 38).

Figure 2.13

The Bayes classifier produces the lowest possible test error rate, the Bayes error rate; computing it requires access to the true conditional distribution of \(Y\) given \(X\). As the Bayes error rate is analogous to the irreducible error (see above), it provides an ideal or gold standard for classification performance.
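
A sketch of the Bayes classifier and its error rate on simulated data where the true conditional probability \(\text{Pr}(Y = 1\ |\ X = x)\) is known by construction (an assumption we never enjoy in practice):

set.seed(105)
p1 <- function(x) { 1 / (1 + exp(-4 * x)) }  # assumed true Pr(Y = 1 | X = x)
x <- rnorm(10000)
y <- rbinom(10000, size = 1, prob = p1(x))   # responses drawn from p1

y_bayes <- as.integer(p1(x) > 0.5)  # Bayes classifier: predict most likely class
mean(y != y_bayes)                  # estimate of the Bayes error rate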

K-nearest neighbors

In practice, the Bayes classifier serves as an unattainable gold standard against which to compare other methods. Many approaches attempt to estimate the conditional distribution of \(Y\) given \(X\), and classify a given observation to the class with highest estimated probability.

A simple heuristic for classifying a given observation to the class with highest estimated probability is the K-nearest neighbor (KNN) classifier.

Implementing KNN in three steps:

  1. Identify the \(K\) nearest points in the training data for a test observation \(x_0\) (i.e., \(\mathcal{N}_0\)).

  2. Estimate the conditional probability for class \(j\) as the fraction of points in \(\mathcal{N}_0\) whose response values \(y_i\) equal \(j\):

\[ \text{Pr}(Y = j\ |\ X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0}{I(y_i = j)}. \]

  3. Classify the test observation \(x_0\) to the class with the largest estimated probability (a runnable sketch follows this list).
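
A runnable sketch of these steps using the knn() function from the class package (the simulated data-generating process is an arbitrary assumption):

library(class)  # provides knn()
set.seed(106)
x_trn <- matrix(rnorm(200 * 2), ncol = 2)  # 2 predictors, 200 training points
y_trn <- factor(ifelse(x_trn[, 1] + x_trn[, 2] + rnorm(200) > 0, "A", "B"))
x_tst <- matrix(rnorm(100 * 2), ncol = 2)  # 100 test points

knn(train = x_trn, test = x_tst, cl = y_trn, k = 10)[1:5]  # first 5 predictions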

See Figures 2.14 and 2.15 (p. 40) for an illustrative example of the KNN approach:

Figure 2.14

Figure 2.15

When KNN is applied to all points in a problem space, the resulting KNN decision boundary can be quite close to the optimal Bayes decision boundary, despite the heuristic nature of the KNN classifier.

Figure 2.16 (p. 41) shows, for a simulated dataset, how the KNN decision boundary varies with the value of \(K\):

Figure 2.16

\(K = 1\) is overly flexible (high variance, low bias), whereas \(K = 100\) is too rigid (low variance, high bias). An intermediate value of \(K = 10\) works best by balancing flexibility (variance) and rigidity (bias) and approximates the ideal Bayes error rate.

Figure 2.17 (p. 42) illustrates the bias-variance trade-off for a KNN classifier with training and test data:

Figure 2.17

  • For training data, the fit increases (i.e., the training error rate decreases) as \(K \rightarrow 1\); for \(K = 1\), the training error rate is 0. The empirical curve need not be perfectly smooth, due to variance between different training sets.

  • For test data, the error rate first decreases (as the value of \(K\) decreases) up to an optimal point (where the test error rate comes close to the Bayes error rate) and then increases again (due to overfitting the training data); see the sketch below.
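
A sketch that traces both curves for simulated data (the data-generating process is an assumption; requires the class package):

library(class)
set.seed(107)
make_data <- function(n) {  # assumed non-linear class boundary
  x <- matrix(rnorm(n * 2), ncol = 2)
  y <- factor(ifelse(x[, 1]^2 + x[, 2] + rnorm(n, sd = .5) > 0, "A", "B"))
  list(x = x, y = y)
}
trn <- make_data(200); tst <- make_data(5000)

Ks  <- c(1, 2, 5, 10, 25, 50, 100)
err <- sapply(Ks, function(k) {
  c(train = mean(knn(trn$x, trn$x, trn$y, k = k) != trn$y),
    test  = mean(knn(trn$x, tst$x, trn$y, k = k) != tst$y)) })

matplot(1/Ks, t(err), type = "b", pch = 20, lty = 1, log = "x",
        xlab = "1/K (flexibility)", ylab = "Error rate")
legend("bottomright", legend = rownames(err), col = 1:2, lwd = 2)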

Conclusion

The bias-variance trade-off (and the characteristic U-shape in the test error by model flexibility) occurs in both regression and classification settings. Thus, choosing an appropriate level of flexibility is critical to the success of any statistical learning method. (See Chapter 5 for various methods for estimating test error rates and choosing the optimal level of flexibility for a given method.)

2.3 Lab: Introduction to R

In this lab, we will introduce some simple R commands. The best way to learn a new language is to try out the commands. R can be downloaded from

http://cran.r-project.org/

We recommend that you run R within an integrated development environment (IDE) such as RStudio, which can be freely downloaded from

http://rstudio.com

The RStudio website also provides a cloud-based version of R, which does not require installing any software.

Basic commands

R uses functions to perform operations. To run a function called funcname(), we type funcname(input1, input2), where the inputs (or arguments) input1 and input2 tell R how to run the function. A function can have any number of inputs. For example, to create a vector of numbers, we use the function c() (for concatenate). Any numbers inside the parentheses are joined together. The following command instructs R to join together the numbers 1, 3, 2, and 5, and to save them as a vector named x. When we type x, it gives us back the vector:

x <- c(1, 3, 2, 5)
x
## [1] 1 3 2 5

Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-:

x = c(1, 6, 2)
x
## [1] 1 6 2
y = c(1, 4, 3)

Hitting the up arrow multiple times will display the previous commands, which can then be edited. This is useful since one often wishes to repeat a similar command. In addition, typing ?funcname will always cause R to open a new help file window with additional information about the function funcname().

We can tell R to add two sets of numbers together. It will then add the first number from x to the first number from y, and so on. However, x and y should be the same length. We can check their length using the length() function:

length(x)
## [1] 3
length(y)
## [1] 3
x + y
## [1]  2 10  5

The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far. The rm() function can be used to delete any that we don’t want:

ls()
##  [1] "ad"        "data_file" "data_url"  "fileName"  "in_1"      "lm_1"     
##  [7] "opar"      "wd"        "x"         "y"         "y_hat"
rm(x, y)
ls()
## [1] "ad"        "data_file" "data_url"  "fileName"  "in_1"      "lm_1"     
## [7] "opar"      "wd"        "y_hat"

It’s also possible to remove all objects at once:

rm(list = ls())

The matrix() function can be used to create a matrix of numbers. Before we use the matrix() function, we can learn more about it:

?matrix

The help file reveals that the matrix() function takes a number of inputs, but for now we focus on the first three: the data (the entries in the matrix), the number of rows, and the number of columns. First, we create a simple matrix:

x <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Note that we could just as well omit typing data=, nrow=, and ncol= in the matrix() command above — that is, we could just type:

x <- matrix(c(1, 2, 3, 4), 2, 2)

and this would have the same effect. However, it can sometimes be useful to specify the names of the arguments passed in, since otherwise R will assume that the function arguments are passed into the function in the same order that is given in the function’s definition (or help file). As this example illustrates, by default R creates matrices by successively filling in columns. Alternatively, the byrow = TRUE option can be used to populate the matrix in order of the rows:

matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

Notice that in the above command we did not assign the matrix to a value such as x. In this case the matrix is printed to the screen but is not saved for future calculations. The sqrt() function returns the square root of each element of a vector or matrix. The command x^2 raises each element of x to the power 2; any powers are possible, including fractional or negative powers:

sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
x^2
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16

The rnorm() function generates a vector of random normal variables, with first argument n the sample size. Each time we call this function, we will get a different answer. Here we create two correlated sets of numbers, x and y, and use the cor() function to compute the correlation between them:

x <- rnorm(50)
y <- x + rnorm(50, mean = 50, sd = .1)
cor(x, y)
## [1] 0.9936505

By default, rnorm() creates standard normal random variables with a mean of \(0\) and a standard deviation of \(1\). However, the mean and standard deviation can be altered using the mean and sd arguments, as illustrated above. Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this. The set.seed() function takes an (arbitrary) integer argument:

set.seed(1303)
rnorm(50)
##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557

We use set.seed() throughout the labs whenever we perform calculations involving random quantities. In general this should allow the user to reproduce our results. However, as new versions of R become available, small discrepancies may arise between this book and the output from R.

The mean() and var() functions can be used to compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation. Or we can simply use the sd() function:

set.seed(3)
y <- rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768

Graphics

The plot() function is the primary way to plot data in R. For instance, plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the \(x\)-axis. To find out more information about the plot() function, type ?plot.

x <- rnorm(100)
y <- rnorm(100)

plot(x, y)


plot(x, y, xlab = "this is the x-axis",
     ylab = "this is the y-axis",
     main = "Plot of X vs Y")

We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a PDF file, we could use the pdf() function, and to create a JPEG file, we use the jpeg() function:

pdf("Figure.pdf")

plot(x, y, col = "green")
dev.off()

The function dev.off() indicates to R that we are done creating the plot. Alternatively, we can simply copy the plot window and paste it into an appropriate file type, such as a Word document.

Creating number sequences

The function seq() can be used to create a sequence of numbers. For instance, seq(a, b) makes a vector of integers between a and b. There are many other options: for instance, seq(0, 1, length = 10) makes a sequence of \(10\) numbers that are equally spaced between \(0\) and \(1\). Typing 3:11 is a shorthand for seq(3, 11) for integer arguments:

x <- seq(1, 10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(-pi, pi, length = 50)

3d-plots

We will now create some more sophisticated plots. The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:

  • A vector of the x values (the first dimension),
  • A vector of the y values (the second dimension), and
  • A matrix whose elements correspond to the z value (the third dimension) for each pair of (x, y) coordinates.

As with the plot() function, there are many other inputs that can be used to fine-tune the output of the contour() function. To learn more about these, take a look at the help file by typing ?contour.

y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))

contour(x, y, f)
contour(x, y, f, nlevels = 45, add = TRUE)


fa <- (f - t(f)) / 2

contour(x, y, fa, nlevels = 15)

The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts:

image(x, y, fa)

Alternatively, the persp() function can be used to produce three-dimensional plots. The arguments theta and phi control the angles at which the plot is viewed:

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 20)

persp(x, y, fa, theta = 30, phi = 70)

persp(x, y, fa, theta = 30, phi = 40)

Indexing data

We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A.

A <- matrix(1:16, 4, 4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

Then, typing

A[2, 3]
## [1] 10

will select the element corresponding to the second row and the third column. The first number after the open-bracket symbol [ always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time, by providing vectors as the indices:

A[c(1, 3), c(2, 4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3, 2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2, ]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[, 1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8

The last two examples include either no index for the columns or no index for the rows. These indicate that R should include all columns or all rows, respectively. R treats a single row or column of a matrix as a vector:

A[1, ]
## [1]  1  5  9 13

The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index:

A[-c(1, 3), ]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
A[-c(1, 3), -c(1, 3, 4)]
## [1] 6 8

The dim() function outputs the number of rows followed by the number of columns of a given matrix:

dim(A)
## [1] 4 4

Loading data

For most analyses, the first step involves importing a data set into R. The read.table() function is one of the primary ways to do this. The help file contains details about how to use this function. We can use the function write.table() to export data.

Before attempting to load a data set, we must make sure that R knows to search for the data in the proper directory. For example, on a Windows system one could select the directory using the Change dir ... option under the File menu. However, the details of how to do this depend on the operating system (e.g. Windows, Mac, Unix) that is being used, and so we do not give further details here.

Loading the Auto data

We begin by loading in the Auto data set. This data is part of the ISLR2 library, discussed in Chapter 3. To illustrate the read.table() function, we load it now from a text file, Auto.data, which you can find on the textbook website. The following command will load the Auto.data file into R and store it as an object called Auto, in a format referred to as a data frame. Once the data has been loaded, the View() function can be used to view it in a spreadsheet-like window.1 The head() function can also be used to view the first few rows of the data:

# Data: 
# https://www.statlearning.com/resources-second-edition
data_url  <- "https://www.statlearning.com/s/Auto.csv"

wd <- getwd()
if (wd == "/Users/hneth/Desktop/stuff/Dropbox/GitHub/predict/R"){ # in subdir R: 
  data_file <- "./../data/_JamesEtAl2021_ISL2/data/Auto.data"  # up 1 level
} else {
  data_file <- "./data/_JamesEtAl2021_ISL2/data/Auto.data" 
}

Auto <- read.table(data_file)
# View(Auto)
head(Auto)
##     V1        V2           V3         V4     V5           V6   V7     V8
## 1  mpg cylinders displacement horsepower weight acceleration year origin
## 2 18.0         8        307.0      130.0  3504.         12.0   70      1
## 3 15.0         8        350.0      165.0  3693.         11.5   70      1
## 4 18.0         8        318.0      150.0  3436.         11.0   70      1
## 5 16.0         8        304.0      150.0  3433.         12.0   70      1
## 6 17.0         8        302.0      140.0  3449.         10.5   70      1
##                          V9
## 1                      name
## 2 chevrolet chevelle malibu
## 3         buick skylark 320
## 4        plymouth satellite
## 5             amc rebel sst
## 6               ford torino

Note that Auto.data is simply a text file, which you could alternatively open on your computer using a standard text editor. It is often a good idea to view a data set using a text editor or other software such as Excel before loading it into R.

This particular data set has not been loaded correctly, because R has assumed that the variable names are part of the data and so has included them in the first row. The data set also includes a number of missing observations, indicated by a question mark ?. Missing values are a common occurrence in real data sets. Using the option header = T (or header = TRUE) in the read.table() function tells R that the first line of the file contains the variable names, and using the option na.strings tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix:

Auto <- read.table(data_file, header = TRUE, na.strings = "?", stringsAsFactors = TRUE)
# View(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

The stringsAsFactors = TRUE argument tells R that any variable containing character strings should be interpreted as a qualitative variable, and that each distinct character string represents a distinct level for that qualitative variable. An easy way to load data from Excel into R is to save it as a csv (comma-separated values) file, and then use the read.csv() function:

Auto <- read.csv(data_file, header = TRUE, na.strings = "?", stringsAsFactors = TRUE)

Inspecting data

After loading a new dataset, we typically inspect its dimensions (rows and columns), the absence or presence of missing values, and its variable names:

# View(Auto) # view data (in IDE only)
dim(Auto)    # print dimensions 
## [1] 397   9
Auto[1:4, ]  # print first 4 lines
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst

sum(is.na(Auto))  # number of missing (NA) values
## [1] 5

The dim() function tells us that the data has \(397\) observations, or rows, and nine variables, or columns. There are various ways to deal with the missing data (NA values). In this case, only five of the rows contain missing observations, so we choose to use the na.omit() function to simply remove these rows from the data:

Auto <- na.omit(Auto)
dim(Auto)
## [1] 392   9

Once the data are loaded correctly, we can use names() to check the variable names:

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"

Additional graphical and numerical summaries

We can use the plot() function to produce scatterplots of the quantitative variables. However, simply typing the variable names will produce an error message, because R does not know to look in the Auto data set for those variables:

plot(cylinders, mpg)
## Error in plot(cylinders, mpg): object 'cylinders' not found

To refer to a variable, we must type the data set and the variable name joined with a $ symbol. Alternatively, we can use the attach() function in order to tell R to make the variables in this data frame available by name:

plot(Auto$cylinders, Auto$mpg)


# alternatively: 
attach(Auto)
## The following object is masked from package:ggplot2:
## 
##     mpg
plot(cylinders, mpg)

The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, since there are only a small number of possible values for cylinders, one may prefer to treat it as a qualitative variable. The as.factor() function converts quantitative variables into qualitative variables:

cylinders <- as.factor(cylinders)

If the variable plotted on the \(x\)-axis is qualitative, then boxplots will automatically be produced by the plot() function. As usual, a number of options can be specified in order to customize the plots:

plot(cylinders, mpg)

plot(cylinders, mpg, col = "gold")

plot(cylinders, mpg, col = "gold", varwidth = TRUE)

plot(cylinders, mpg, col = "gold", varwidth = TRUE, horizontal = TRUE)

plot(cylinders, mpg, col = "gold", varwidth = TRUE, xlab = "cylinders", ylab = "MPG")

The hist() function can be used to plot a histogram. Note that col = 2 has the same effect as col = "red":

hist(mpg)

hist(mpg, col = 2)

hist(mpg, col = 2, breaks = 15)

The pairs() function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables. We can also produce scatterplots for just a subset of the variables:

pairs(Auto)

pairs(~ mpg + displacement + horsepower + weight + acceleration,
      data = Auto)

In conjunction with the plot() function, identify() provides a useful interactive method for identifying the value of a particular variable for points on a plot. We pass in three arguments to identify(): the \(x\)-axis variable, the \(y\)-axis variable, and the variable whose values we would like to see printed for each point. Then clicking one or more points in the plot and hitting Escape will cause R to print the values of the variable of interest. The numbers printed under the identify() function correspond to the rows for the selected points.

plot(horsepower, mpg)
identify(horsepower, mpg, name)

## integer(0)

The summary() function produces a numerical summary of each variable in a particular data set:

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

For qualitative variables such as name, R will list the number of observations that fall in each category. We can also produce a summary of just a single variable:

summary(mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60

Finishing R and command history

Once we have finished using R, we type q() in order to shut it down, or quit. When exiting R, we have the option to save the current environment so that all objects (such as data sets) that we have created in this R session will be available next time. Before exiting R, we may want to save a record of all of the commands that we typed in the most recent session; this can be accomplished using the savehistory() function. Next time we enter R, we can load that history using the loadhistory() function, if we wish.
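
For example (the file name is an arbitrary choice; these commands assume an interactive console session):

savehistory("session.Rhistory")  # save the command history to a file
loadhistory("session.Rhistory")  # restore it in a later session
q()                              # quit R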

2.4 Exercises

Conceptual

Exercise 1

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

  1. The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.
  2. The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.
  3. The relationship between the predictors and response is highly non-linear.
  4. The variance of the error terms, i.e. \(\sigma^2 = \text{Var}(\epsilon)\), is extremely high.

Exercise 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).

  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

  2. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

  3. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the %
    change in the British market, and the % change in the German market.

Exercise 3

We now revisit the bias-variance decomposition.

  1. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The \(x\)-axis should represent the amount of flexibility in the method, and the \(y\)-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

  2. Explain why each of the five curves has the shape displayed in part (a).

Exercise 4

You will now think of some real-life applications for statistical learning.

  1. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

  2. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

  3. Describe three real-life applications in which cluster analysis might be useful.

Exercise 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Exercise 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non- parametric approach)? What are its disadvantages?

Exercise 7

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  \(X_1\)  \(X_2\)  \(X_3\)  \(Y\)
  1      0       3        0      Red
  2      2       0        0      Red
  3      0       1        3      Red
  4      0       1        2      Green
  5     −1       0        1      Green
  6      1       1        1      Red

Suppose we wish to use this data set to make a prediction for \(Y\) when \(X_1 = X_2 = X_3 = 0\) using \(K\)-nearest neighbors (KNN).

  1. Compute the Euclidean distance between each observation and the test point, \(X_1 = X_2 = X_3 = 0\).

  2. What is our prediction with \(K = 1\)? Why?

  3. What is our prediction with \(K = 3\)? Why?

  4. If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for \(K\) to be large or small? Why?

Applied

Exercise 8

This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are

  • Private: Public/private indicator
  • Apps: Number of applications received
  • Accept: Number of applicants accepted
  • Enroll: Number of new students enrolled
  • Top10perc: New students from top 10% of high school class
  • Top25perc: New students from top 25% of high school class
  • F.Undergrad: Number of full-time undergraduates
  • P.Undergrad: Number of part-time undergraduates
  • Outstate: Out-of-state tuition
  • Room.Board: Room and board costs
  • Books: Estimated book costs
  • Personal: Estimated personal spending
  • PhD: Percent of faculty with Ph.D.’s
  • Terminal: Percent of faculty with terminal degree
  • S.F.Ratio: Student/faculty ratio
  • perc.alumni: Percent of alumni who donate
  • Expend: Instructional expenditure per student
  • Grad.Rate: Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.

  1. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

  2. Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

rownames(college) <- college[, 1] 
View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[, -1] 
View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

  3. Use the summary() function to produce a numerical summary of the variables in the data set.

  4. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[ , 1:10].

  5. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

  6. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes" 
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

  7. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

  8. Continue exploring the data, and provide a brief summary of what you discover.

Exercise 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

  1. Which of the predictors are quantitative, and which are qualitative?

  2. What is the range of each quantitative predictor? You can answer this using the range() function.

  3. What is the mean and standard deviation of each quantitative predictor?

  4. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

  5. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

  6. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Exercise 10

This exercise involves the Boston housing data set.

  1. To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. How many rows are in this data set? How many columns? What do the rows and columns represent?

  2. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

  3. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

  4. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

  5. How many of the census tracts in this data set bound the Charles river?

  6. What is the median pupil-teacher ratio among the towns in this data set?

  7. Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

  8. In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

Solutions to exercises

Conceptual

Exercise 1

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

The question asks how the flexibility of a learning method influences performance, when considering the interplay of sample size \(n\) and the number of predictors \(p\). Note that inflexible here implies simple or high bias. As the details of the method and performance are not specified, we assume that the primary interest is fitting existing training data (minimizing training MSE), rather than predicting new data (minimizing test MSE) or inference/understanding.

  1. The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.

Unless the data matches the form of a simple model (i.e., its bias happens to match the data), a flexible method/model will usually outperform an inflexible one (in training). In testing (predicting new data), however, flexible methods may be outperformed by simpler ones due to overfitting the training data.

  2. The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.

Here, both simple and flexible methods will be worse than in (a). However, the flexible method generally suffers more from the lack of data during training.

  3. The relationship between the predictors and response is highly non-linear.

More flexible methods can generally capture non-linear relationships more closely than inflexible ones. However, the impact of non-linearity on simple methods depends on their functional form. A very inflexible model can outperform a more flexible model when its bias happens to capture the real pattern.

  4. The variance of the error terms, i.e. \(\sigma^2 = \text{Var}(\epsilon)\), is extremely high.

The performance of both flexible and inflexible models will suffer when the irreducible error (or noise) increases. Any apparent benefit of more flexible methods (in training) will be due to overfitting and thus become a burden when predicting out-of-sample data (in testing).


References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd edition). Retrieved from https://www.statlearning.com/


  1. This function can sometimes be a bit finicky. If you have trouble using it, then try the head() function instead.↩︎