Readings: ISLR Sections 6.1 and 6.2

In our previous lecture, we discussed how to use cross-validation to choose among models. We concentrated on choosing among a family of linear regression models that varied in their orders, in the sense that they included as predictors all powers of the horsepower variable hp up to some maximum power to predict gas mileage. Hopefully it is clear how we could modify our approach to, say, choose which variables we do and don’t include in a model (e.g., as in the Pima diabetes data set we saw in discussion section).

Ultimately, our goal was to choose, from among a set of predictors that we could include in our model (e.g., powers of hp, in the case of the mtcars example we saw in the last lecture), which predictors to actually include in the model. This task is often called variable selection.

In this lecture, we’ll see a few different approaches to variable selection that do not rely on cross-validation. These alternative methods have the advantage of not trying to estimate the unknown model error on unseen data. On the other hand, these methods can be more computationally intensive and tend to come with fewer theoretical guarantees.

This is not to suggest, however, that the methods in this lecture are at odds with cross-validation. In actual research papers and in industry applications, you’ll often see both CV and some of the methods presented below used in tandem to select the best model for the job.

Learning objectives

After this lecture, you will be able to

Setup: linear regression and fitting

We will focus in this lecture on linear regression, bearing in mind that the ideas introduced here apply equally well to other regression and prediction methods (e.g., logistic regression). Let’s recall that multiple linear regression models a response \(Y \in \mathbb{R}\) as a linear (again, technically affine– linear plus an intercept!) function of a set of \(p\) predictors plus normal noise: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon. \] Here, \(\epsilon\) is mean-zero normal with unknown variance \(\sigma^2 > 0\), and the variables \(X_1,X_2,\dots,X_p\) are the predictors. We often refer to \(p\), the number of predictors, as the dimension of the problem, because the data (well, the vector of predictors, anyway), lies in \(p\)-dimensional space. Collecting the coefficients into a vector \((\beta_0,\beta_1,\dots,\beta_p) \in \mathbb{R}^{p+1}\) and creating a vector \(X=(1,X_1,X_2,\dots,X_p) \in \mathbb{R}^p\), we can write this more succinctly as (if you have not taken linear regression, you can safely ignore this, we’re just including it because it’s a common notation) \[ Y = \beta^T X + \epsilon. \] In multiple linear regression, we observe a collection of predictor-response pairs \((X_i,Y_i)\) for \(i=1,2,\dots,n\), with \[ X_i = (1,X_{i,1},X_{i,2},\dots,X_{i,p}) \in \mathbb{R}^{p+1}. \] Note that here we are including the intercept term \(1\) in the vector of predictors for ease of notation. This is a common notational choice, so we’re including it here to get you used to seeing this. Of course, this is not universal– it’s one of those conventions that you have to be careful of and check what you are reading.

Overview: variable selection

So we have \(p\) variables (plus an intercept term), and we want to select which ones to include in our model. There are many reasons to want to do this, but let’s just highlight three of them:

  • If there are many “useless” variables (i.e., ones that are not good predictors of the response), then including them in the model can make our predictions less accurate. Thus, we would like to proactively identify which variables are not useful, and avoid including them in the model in the first place.
  • A model with fewer variables is simpler, and we like simple models! Explaining, say, heart attack risk as a function of two or three factors is a lot easier to use than a model that uses ten or twenty factors.
  • If the number of predictors \(p\) is too large (say, larger than \(n\)– a common occurrence in genomic studies, for example), our estimates of the coefficients are very unstable. Variable selection and related tools give us a way to introduce stability in the form of regularization, which we will talk about below.

Best subset selection

So, we have \(p\) predictor variables available to us, and we want to choose which of them to actually include in our model.

Well, the most obvious solution is to just try all possible combinations of features, train a model using each combination, and keep the best one (measured by, say, residual sum of squares).

This would have an obvious drawback: we have already seen that we can trivially improve the RSS of our model by adding variables. So the models that include more variables would do better, even if those variables did not actually lead to better model error on unseen data.

The solution to this is to do the following:

  1. For each \(k=1,2,\dots,p\), for every set of \(k\) different variables, fit a model and keep the model that best fits the data (measured by RSS). Call this model \(M_k\).
  2. Use CV (or some other tool like AIC or adjusted \(R^2\), which we’ll discuss below) to select among the models \(M_1,M_2,\dots,M_p\).

This is called best subset selection. It is implemented in R in, for example, the leaps library the function regsubsets, which gets called in more or less the same way as lm. See (here)[https://cran.r-project.org/web/packages/leaps/index.html] for documentation if you’re interested.

There is one rather glaring problem with best subset selection, though:

Question: if there are \(p\) predictors, how many models does best subset selection fit before it makes a decision?

So once \(p\) is even moderately large, best subset selection is computationally expensive, and we need to do something a little more clever.

Stepwise selection

So best subset selection is expensive because we have to try every possible model, and then choose among the best “size-\(k\)” model for each \(k=1,2,\dots,p\). How might we cut down on the computational expense?

Stepwise selection methods avoid exhaustively checking all \(2^p\) possible models by starting with a particular model and adding or removing one variable at a time (i.e., in “steps”).

The important part is in how we decide which predictor to add or remove from the model at a particular time.

Forward stepwise selection

The most obvious (to me, anyway) way to avoid checking every possible model is to start with a “null” model (i.e., no predictors, just an intercept term), then repeatedly add the “best” predictor not already in the model. That is,

  1. Start by fitting the “null” model, with just an intercept term. Call it \(M_0\).
  2. For each \(k=1,2,\dots,p\), among the \(p-k\) predictors not already in the model, add the one that yields the biggest improvement in RSS. Call this model, which includes \(k\) predictors and the intercept term, \(M_k\).
  3. Use CV or some other method (e.g., an information criterion, which we will discuss soon) to choose among \(M_0,M_1,M_2,\dots,M_p\).

The important thing is that in Step 2 above, for each \(k=1,2,\dots,p\), we need to fit \(p-k\) different models. Thus, in total (i.e., summing over \(k=0,1,2,\dots,p\)), we end up fitting \[ 1+ \sum_{k=0}^{p-1} (p-k) = 1+p^2 - \frac{(p-1)p}{2} = 1+\frac{ 2p^2 - p^2 + p }{2} = 1+ \frac{ p(p+1)}{2} \] different models. To get a sense of what a big improvement this is, when \(p\) is large, this right-hand side is approximately \(p^2/2\). Compare that with \(2^p\), which is a MUCH larger number. For example, when \(p=10\), \(2^{10} \approx 1000\), while \(10^2/2 \approx 50\). When \(p=20\), \(2^{20} \approx 1,000,000\) while \(20^2/2 \approx 200\).

Of course, the drawback is that forward stepwise selection might “miss” the optimal model, since it does not exhaustively fit every possible model the way that best subset selection does.

Backward stepwise selection

Well, if we can do forward stepwise selection, why not go backwards?

In backward stepwise selection, we start with the full model (i.e., a model with all \(p\) predictors), and iteratively remove one predictor at a time, always removing the predictor that decreases RSS the least.

Just like forward stepwise regression, this decreases the number of models we have to fit from \(2^p\) to something more like (approximately) \(p^2/2\).

Cautionary note: backward selection will only work if the number of observations \(n\) is larger than \(p\). If \(n < p\), the “full” model cannot be fit, because we have an overdetermined system of linear equations\(n\) equations in \(p\) unknowns, and \(p > n\). This is a setting where regularization can help a lot (see below), but the details are best left to your regression course(s).

Hybrid approaches: the best of both worlds?

It is outside the scope of this course, but there do exist stepwise selection methods that try to combine forward and backward stepwise selection. For example, we can alternately add and remove variables as needed. This can be helpful when, for example, a predictor is useful “early” in the selection process, but becomes a less useful predictor once other predictors have been included.

You can learn more about these selection methods in a regression course. If you think these methods may be useful in your group project, refer to ISLR for more information or post on the discussion board.

Shrinkage and Regularization

The variable selection methods we just discussed involved trying out different subsets of the predictors and seeing how the model performance changed as a result.

Let’s consider an alternative approach. What if instead of trying lots of different models with different numbers of predictors, we went ahead and fit a model with all \(p\) available predictors, but we modify our loss function in such a way that we will set the coefficients of “unhelpful” predictors to zero? This is usually called “shrinkage”, because we shrink the coefficients toward zero. You will also often hear the term regularization, which is popular in machine learning, and means more or less the same thing.

Let’s briefly discuss two such methods, undoubtedly two of the most important tools in the statistical toolbox: ridge regression and the LASSO.

Ridge regression

By now you are bored to death of seeing the linear regression least squares objective, but here it is again: \[ \sum_{i=1}^n \left( Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{i,j} \right)^2 \] Here we are assuming that we have \(p\) predictors, so each \((X_i,Y_i)\) pair has a vector of predictors \(X_i = (X_{i,1},X_{i,2},\dots,X_{i,p}) \in \mathbb{R}^p\) and response \(Y_i \in \mathbb{R}\).

Remember, we’re trying to minimize this RSS by choosing the coefficients \(\beta_j\), \(j=0,1,2,\dots,p\) in a clever way.

Ridge regression shrinks these estimated coefficients toward zero by changing the loss slightly. Instead of minimizing the RSS alone, we add a penalty term: \[ \sum_{i=1}^n \left( Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{i,j} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \operatorname{RSS} + \lambda \sum_{j=1}^p \beta_j^2 \] where \(\lambda \ge 0\) is a tuning parameter (which we have to choose– more on that soon).

Our cost function now has two different terms:

  1. Our old friend RSS, which encourages us to choose coefficients that reproduce the observed responses accurately
  2. The shrinkage penalty \(\lambda \sum_{j=1}^p \beta_j^2\), which encourages us to choose all our coefficients (other than \(\beta_0\)) equal to zero. That is, it shrinks our solution toward the origin.

The tuning parameter \(\lambda\) controls how much we care about this shrinkage penalty compared to the RSS term. When \(\lambda\) is big, we “pay” more for large coefficients, so we will prefer coefficients closer to zero. When \(\lambda=0\), we recover plain old least squares regression.

For each value of \(\lambda\) that we choose, we get a different solution to our (regularized) regression, say, \(\hat{\beta}^{(\lambda)}\). In this sense, whereas least squares linear regression gives us just one solution \(\hat{\beta}\), shrinkage methods give us a whole family of solutions, corresponding to different choices of \(\lambda\).

For this reason, choosing the tuning parameter \(\lambda\) is crucial, but we will have only a little to say about this matter, owing to time constraints. Luckily, you already know a family of methods for choosing \(\lambda\)– cross validation is a very common appraoch!

Ridge regression on the mtcars data set

Let’s try this out on the mtcars data set, trying to predict mpg from all the of the available predictors, this time. One thing to bear in mind: the data set is only 32 observations, so our fits are going to be a little unstable (but this is precisely why we use regularization!).

names(mtcars);
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Ridge regression is available in the MASS library in R.

library(MASS);

lambda_vals <- c(0,1,2,5,10,20,50,100,200,500); # Choose lambdas to try.
# lm.ridge needs:
# 1) a model (mpg~. says to model mpg as an intercept
#         plus a coefficient for every other variable in the data frame)
# 2) a data set (mtcars, of course)
# 3) a value for lambda. lambda=0 is the default,
#         and recovers classic linear regression.
#         But we can also pass a whole vector of lambdas, like we are about to do,
#         and lm.ridge will fit a separate model for each.
# See ?lm.ridge for details.
ridge_models <- lm.ridge(mpg~., mtcars, lambda=lambda_vals);

# Naively plotting this object shows us how the different coefficients
# change as lambda changes.
plot( ridge_models );

Each line in the above plot represents the coefficient of one of our predictors. The x-axis is our choice of \(\lambda\) (lambda in the code) and the y-axis is the actual value of the coefficients.

Actually extracting those predictor labels to make a legend for this plot is annoying, and beside the point– refer to the documentation in ?lm.ridge). The important point is that as we change \(\lambda\), the coefficients change. Generally speaking, as \(\lambda\) gets bigger, more coefficients are closer to zero.

Indeed, if we make \(\lambda\) big enough, all of the coefficients will be zero (except the intercept, because it isn’t multiplied by \(\lambda\) in the loss). That’s shrinkage!

Just as a sanity check, let’s fit plain old linear regression and verify that the coefficients with \(\lambda=0\) match.

lm_sanity_check <- lm(mpg~., mtcars);
lm_sanity_check$coefficients
## (Intercept)         cyl        disp          hp        drat          wt 
## 12.30337416 -0.11144048  0.01333524 -0.02148212  0.78711097 -3.71530393 
##        qsec          vs          am        gear        carb 
##  0.82104075  0.31776281  2.52022689  0.65541302 -0.19941925

And compare that with

head( coef( ridge_models), 1 ); # the first row is the lambda=0.0 setting.
##                     cyl       disp          hp     drat        wt      qsec
##   0 12.30337 -0.1114405 0.01333524 -0.02148212 0.787111 -3.715304 0.8210407
##            vs       am     gear       carb
##   0 0.3177628 2.520227 0.655413 -0.1994193

They’re the same, up to several digits of precision, anyway. Good!

Shrinkage and RSS

Now, for each value of \(\lambda\), we get a different fitted model. How do these different models do in terms of their fit (as measured by RSS)?

Well, annoyingly, the object returned by lm.ridge does not include a residuals attribute the same way that the lm object does:

mean( lm_sanity_check$residuals^2 );
## [1] 4.609201

More annoyingly still, the object returned by lm.ridge also does not include a predict method, so we can just call something like predict( model, data) the way we would with the output of lm:

mean( (predict( lm_sanity_check, mtcars) - mtcars$mpg )^2 )
## [1] 4.609201

So, we have to roll our own predict/residuals computation. This is going to be a bit complicated, but it’s worth the detour to get some programming practice.

Our ridge regression model has coefficients: one set of coefficients for each valu of lambda that we passed in.

length( lambda_vals )
## [1] 10

Those estimated coefficients are stored in a matrix. Each column of this matrix corresponds to a coefficient (including the intercept, the first column. Each row corresponds to one \(\lambda\) value.

coef( ridge_models )
##                     cyl          disp           hp      drat         wt
##   0 12.30337 -0.1114405  0.0133352399 -0.021482119 0.7871110 -3.7153039
##   1 16.53766 -0.1624028  0.0023330776 -0.014934856 0.9246313 -2.4611460
##   2 18.55460 -0.2212878 -0.0007347273 -0.013481440 0.9597173 -2.0619305
##   5 20.72198 -0.3079969 -0.0036206803 -0.012460649 1.0060841 -1.6219933
##  10 21.27391 -0.3563891 -0.0048700433 -0.011966312 1.0409643 -1.3618221
##  20 20.88322 -0.3792044 -0.0054324669 -0.011248991 1.0545238 -1.1389693
##  50 19.85752 -0.3614183 -0.0052401829 -0.009609992 0.9828882 -0.8673112
## 100 19.37410 -0.3092845 -0.0044742424 -0.007814946 0.8353896 -0.6656981
## 200 19.29829 -0.2337711 -0.0033702943 -0.005732613 0.6286426 -0.4706006
## 500 19.54099 -0.1331434 -0.0019134597 -0.003201881 0.3567082 -0.2559721
##           qsec        vs        am      gear        carb
##   0 0.82104075 0.3177628 2.5202269 0.6554130 -0.19941925
##   1 0.49258752 0.3746517 2.3083758 0.6857159 -0.57579125
##   2 0.36772540 0.4389536 2.1835666 0.6545252 -0.64772938
##   5 0.23220486 0.5749017 1.9622086 0.5933014 -0.65548632
##  10 0.17615173 0.7007040 1.7537869 0.5573491 -0.59737336
##  20 0.15680180 0.8158626 1.5168704 0.5362044 -0.50497523
##  50 0.15120048 0.8706103 1.1764833 0.4942313 -0.36858373
## 100 0.13669747 0.7898063 0.9064651 0.4225302 -0.27224119
## 200 0.10798351 0.6187171 0.6407444 0.3195920 -0.18719020
## 500 0.06363418 0.3611011 0.3479026 0.1819390 -0.09983672

So we can pick out the coefficients associated to a particular lambda value by taking the corresponding row of this matrix. For example, \(\lambda = 5\) is in the 4-th row of the matrix:

cat(paste0("The 4-th lambda value is: ", lambda_vals[4]) );
## The 4-th lambda value is: 5
coef( ridge_models )[4,]; # Pick out the 4-th row. these are coefs when lambda=5.
##                     cyl        disp          hp        drat          wt 
## 20.72198423 -0.30799694 -0.00362068 -0.01246065  1.00608409 -1.62199325 
##        qsec          vs          am        gear        carb 
##  0.23220486  0.57490168  1.96220860  0.59330141 -0.65548632

Now, to get our prediction from these coefficients, we have to multiply each predictor by its coefficient and add the intercept term. Equivalently, we can think of adding an extra predictor that is just \(1\) for every observation. Something like \[ \beta_0 + \sum_{j=1}^p \beta_j X_{i,j} = \sum_{j=0}^p \beta_j X_{i,j}, \] As an aside, for those that have taken linear algebra, you should be looking at that and thinking “that’s just an inner product!” \[ \beta^T X_i = \sum_{j=0}^p \beta_j X_{i,j}. \]

So let’s modify the mtcars data to make that all easy.

# The mpg column of mtcars needs to get removed (it is the outcome,
# not a predictor), so we drop it-- it's the column numbered 1.
# And we're using cbind to add a column of 1s with column name const.
mtc_predictors <- cbind(const=1,mtcars[,-c(1)]);
head(mtc_predictors);
##                   const cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4             1   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag         1   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710            1   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive        1   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout     1   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant               1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, to make a prediction on, say, the Datsun 710 observation, we need to multiply each predictor (including the const column) by its coefficient, and sum up the total. Again, something like \[ \sum_{j=0}^p \beta_j X_{i,j}, \] where \(X_{i,0}=1\) is the extra constant term that we tacked on.

So to get the prediction for a particular observation (again, say, the Datsun 710 row in mtcars), we need to make this sum (i.e., inner product) between that row of the predictors matrix and the vector of coefficients.

beta5 <- coef( ridge_models )[4,]; # 4th row was for lambda=5.
datsun <- mtc_predictors['Datsun 710',]
sum( beta5*datsun )
## [1] 26.62668

As a sanity check, let’s verify that the 1-th row, which is \(\lambda=0\), agrees with our linear model’s prediction.

beta0 <- coef( ridge_models )[1,]; # 1st row was for lambda=0, i.e., plain old LR.
datsun <- mtc_predictors['Datsun 710',]
sum( beta0*datsun );
## [1] 26.25064

and compare with

predict( lm_sanity_check, datsun )
## Datsun 710 
##   26.25064

Okay, but to compute the RSS of our model we want to compute predictions for all 32 of our observations in the mtcars data set. And we want to compute those predictions for each of our different choices of \(\lambda\).

We’re going to get those predictions in a matrix. If you haven’t taken linear algebra, don’t let the word scare you. In this context, it’s enough to just think of a matrix as a big box of numbers.

Now, we currently have two boxes of numbers. One is mtc_predictors. Each row is an observation (so there are 32 rows), and each row has 11 entries, corresponding to the intercept term and ten additional predictors.

dim( mtc_predictors )
## [1] 32 11

The other box of numbers is our collection of coefficients. One row for each of the models we fit (i.e., \(\lambda\) values), and one column for each predictor.

beta_matrix <- coef( ridge_models )
dim(beta_matrix);
## [1] 10 11
beta_matrix
##                     cyl          disp           hp      drat         wt
##   0 12.30337 -0.1114405  0.0133352399 -0.021482119 0.7871110 -3.7153039
##   1 16.53766 -0.1624028  0.0023330776 -0.014934856 0.9246313 -2.4611460
##   2 18.55460 -0.2212878 -0.0007347273 -0.013481440 0.9597173 -2.0619305
##   5 20.72198 -0.3079969 -0.0036206803 -0.012460649 1.0060841 -1.6219933
##  10 21.27391 -0.3563891 -0.0048700433 -0.011966312 1.0409643 -1.3618221
##  20 20.88322 -0.3792044 -0.0054324669 -0.011248991 1.0545238 -1.1389693
##  50 19.85752 -0.3614183 -0.0052401829 -0.009609992 0.9828882 -0.8673112
## 100 19.37410 -0.3092845 -0.0044742424 -0.007814946 0.8353896 -0.6656981
## 200 19.29829 -0.2337711 -0.0033702943 -0.005732613 0.6286426 -0.4706006
## 500 19.54099 -0.1331434 -0.0019134597 -0.003201881 0.3567082 -0.2559721
##           qsec        vs        am      gear        carb
##   0 0.82104075 0.3177628 2.5202269 0.6554130 -0.19941925
##   1 0.49258752 0.3746517 2.3083758 0.6857159 -0.57579125
##   2 0.36772540 0.4389536 2.1835666 0.6545252 -0.64772938
##   5 0.23220486 0.5749017 1.9622086 0.5933014 -0.65548632
##  10 0.17615173 0.7007040 1.7537869 0.5573491 -0.59737336
##  20 0.15680180 0.8158626 1.5168704 0.5362044 -0.50497523
##  50 0.15120048 0.8706103 1.1764833 0.4942313 -0.36858373
## 100 0.13669747 0.7898063 0.9064651 0.4225302 -0.27224119
## 200 0.10798351 0.6187171 0.6407444 0.3195920 -0.18719020
## 500 0.06363418 0.3611011 0.3479026 0.1819390 -0.09983672

Once again, each column corresponds to one of 11 predictors (the intercept term and ten non-trivial predictors), and the rows correspond to the different choice of \(\lambda\).

So, for each value of \(\lambda\) (i.e., each row of \(\beta\)), and each row of mtc_predictors (i.e., each observation in the data set), we want to sum up the products of the coefficients with their corresponding predictors.

We are going to make a new matrix, whose rows correspond to the 32 data observations and whose columns correspond to different choices of \(\lambda\). We need to use some basic matrix algebra to construct that. Let’s do the computation, then unpack it.

mtc_mx <- as.matrix( mtc_predictors );
cat('Dimensions of the predictors matrix: ');
## Dimensions of the predictors matrix:
cat(dim(mtc_mx)) 
## 32 11
beta_mx <- coef( ridge_models );
cat('Dimensions of the coefficients matrix: ');
## Dimensions of the coefficients matrix:
cat( dim(beta_mx) );
## 10 11
# Now compute the appropriate matrix product.
# We want to rows indexed by observations
# and the columns indexed by lambdas.
# That requires transposing the coefficients matrix, whose original form
# has rows indexed by lambda and columns indexed by the predictors.
# We transpose a matrix in R with t( ).
obs_by_lambda_predictions <- as.matrix( mtc_mx ) %*% t( beta_mx );
obs_by_lambda_predictions
##                            0        1        2        5       10       20
## Mazda RX4           22.59951 22.30763 22.23051 22.13369 22.02501 21.85270
## Mazda RX4 Wag       22.11189 21.95588 21.91065 21.85012 21.77639 21.65007
## Datsun 710          26.25064 26.61821 26.68382 26.62668 26.42251 26.04170
## Hornet 4 Drive      21.23740 20.88953 20.78456 20.66661 20.59049 20.52290
## Hornet Sportabout   17.69343 17.20040 17.01742 16.78452 16.64525 16.59645
## Valiant             20.38304 20.37257 20.35076 20.31018 20.28168 20.26423
## Duster 360          14.38626 14.15765 14.13388 14.17680 14.29042 14.52925
## Merc 240D           22.49601 22.68287 22.80572 23.00581 23.14734 23.21667
## Merc 230            24.41909 23.91587 23.73479 23.58564 23.58592 23.62033
## Merc 280            18.69903 19.10420 19.31008 19.67422 20.00760 20.33982
## Merc 280C           19.19165 19.39975 19.53072 19.81354 20.11329 20.43390
## Merc 450SE          14.17216 14.91618 15.12809 15.35748 15.52382 15.75031
## Merc 450SL          15.59957 15.85149 15.90269 15.95540 16.02207 16.16892
## Merc 450SLC         15.74222 15.92546 15.94668 15.96718 16.02444 16.17470
## Cadillac Fleetwood  12.03401 11.67687 11.64501 11.75998 12.02127 12.49759
## Lincoln Continental 10.93644 11.05719 11.16858 11.42987 11.76777 12.30084
## Chrysler Imperial   10.49363 10.99657 11.21759 11.58203 11.96222 12.51055
## Fiat 128            27.77291 27.88472 27.85376 27.69494 27.44262 27.01867
## Honda Civic         29.89674 29.26876 29.06962 28.80821 28.54153 28.10115
## Toyota Corolla      29.51237 29.12150 28.91791 28.56765 28.21014 27.70198
## Toyota Corona       23.64310 23.78667 23.85479 23.91651 23.89758 23.77409
## Dodge Challenger    16.94305 16.84439 16.79091 16.69114 16.60761 16.57993
## AMC Javelin         17.73218 17.59335 17.50887 17.37192 17.27326 17.23149
## Camaro Z28          13.30602 13.73881 13.92543 14.19839 14.43699 14.75698
## Pontiac Firebird    16.69168 16.24701 16.09680 15.91932 15.83133 15.84875
## Fiat X1-9           28.29347 28.25684 28.19035 27.99133 27.70163 27.22949
## Porsche 914-2       26.15295 26.45050 26.49502 26.40196 26.15825 25.72990
## Lotus Europa        27.63627 27.46919 27.38886 27.19338 26.92047 26.48469
## Ford Pantera L      18.87004 18.79097 18.67829 18.47524 18.33544 18.26423
## Ferrari Dino        19.69383 19.73505 19.79328 19.91247 20.01800 20.11261
## Maserati Bora       13.94112 13.74682 13.72632 13.83990 14.10487 14.56029
## Volvo 142E          24.36827 24.93714 25.10820 25.23790 25.21281 25.03481
##                           50      100      200      500
## Mazda RX4           21.52222 21.21442 20.89232 20.52801
## Mazda RX4 Wag       21.38573 21.12122 20.83279 20.49838
## Datsun 710          25.19341 24.25687 23.15477 21.80461
## Hornet 4 Drive      20.44289 20.37973 20.30828 20.21609
## Hornet Sportabout   16.82946 17.31265 17.99524 18.89895
## Valiant             20.25478 20.24265 20.21593 20.16802
## Duster 360          15.18740 16.02340 17.06870 18.38818
## Merc 240D           23.04180 22.62447 22.01421 21.19076
## Merc 230            23.45483 23.00816 22.32148 21.37321
## Merc 280            20.63826 20.68452 20.59553 20.39937
## Merc 280C           20.72898 20.76654 20.66032 20.43755
## Merc 450SE          16.28647 16.94379 17.75743 18.77860
## Merc 450SL          16.61160 17.19747 17.93903 18.87836
## Merc 450SLC         16.62871 17.21886 17.95869 18.89101
## Cadillac Fleetwood  13.57618 14.77514 16.18498 17.90822
## Lincoln Continental 13.43666 14.67145 16.11294 17.86941
## Chrysler Imperial   13.63141 14.83376 16.23293 17.93646
## Fiat 128            26.06659 24.98855 23.70223 22.11461
## Honda Civic         27.04746 25.80880 24.31247 22.45784
## Toyota Corolla      26.63521 25.44909 24.03979 22.30309
## Toyota Corona       23.35933 22.81202 22.11937 21.23728
## Dodge Challenger    16.81441 17.29638 17.98109 18.89022
## AMC Javelin         17.40983 17.80019 18.35988 19.10525
## Camaro Z28          15.45171 16.26403 17.25580 18.49633
## Pontiac Firebird    16.20433 16.80970 17.62907 18.69568
## Fiat X1-9           26.20867 25.08570 23.76438 22.14560
## Porsche 914-2       24.84061 23.92121 22.88154 21.63991
## Lotus Europa        25.55719 24.54521 23.36083 21.91658
## Ford Pantera L      18.35395 18.59356 18.94851 19.43374
## Ferrari Dino        20.18278 20.18662 20.16519 20.13150
## Maserati Bora       15.45915 16.35277 17.35633 18.56415
## Volvo 142E          24.45802 23.71104 22.77794 21.60303

So this matrix has rows indexed by observations (i.e., cars) and columns indexed by choices of \(\lambda\). So the \((i,j)\) entry of this matrix is the prediction made for the \(i\)-th car by the model with the \(j\)-th lambda value.

We are now ready (finally!) to compute the mean squared residuals for these different choices of \(\lambda\). We just need to

  1. Compute the errors between these predictions and the true mpg values for the cars
  2. Square those errors.
  3. Sum along the columns (because each column corresponds to a different choice of \(\lambda\), and hence a different fitted model).
errors <- mtcars$mpg - obs_by_lambda_predictions;
# Just to check, each column of errors should be length-32, because we have
# 32 data points in the mtcars data set.
# And there should be 32 columns, one for each of our ten lambda values.
dim( errors );
## [1] 32 10

So we’re going to square those errors and take a mean along each column

# We're going to squares the entries of errors,
# then take a mean along the columns (that's the 2 argument to apply)
RSS_by_model <- apply( errors^2, 2, FUN=mean);
RSS_by_model
##         0         1         2         5        10        20        50       100 
##  4.609201  4.724319  4.816983  4.986377  5.192462  5.590303  6.937583  9.407707 
##       200       500 
## 13.821594 21.567053

This is easier to see in a plot– and we’ll put the \(\lambda\) values on a log-scale, because the lambda_vals vector spans multiple orders of magnitude.

plot( log(lambda_vals), RSS_by_model, type='b', lwd=2)

Let’s unpack this. We have the smallest RSS when \(\lambda = 0\), and RSS increases as \(\lambda\) increases. This is exactly what we expect. Recall that our loss function is \[ \sum_{i=1}^n \left( Y_i - \sum_{j=0}^p \beta_j X_{i,j} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \] As \(\lambda\) gets bigger, we pay a bigger and bigger penalty for making coefficients non-zero. Thus, as \(\lambda\) get bigger, it becomes “harder” or “more expensive” to make the coefficients take the values that would make the RSS term smaller. As a result, for larger values of \(\lambda\), the RSS of our solution is larger.

Why is ridge regression helpful?

Well, the short answer is that ridge regression (and other shrinkage methods) prevents over-fitting. \(\lambda\) makes it more expensive to simply choose whatever coefficients we please, which in turn prevents us from over-fitting to the data.

In essence, this is the bias-variance tradeoff again! As \(\lambda\) increases, our freedom to choose the coefficients becomes more constrained, and the variance decreases (and the bias increases).

Here’s an example from ISLR.

MSE (pink), squared bias (black) and variance (green), estimated from performance on previously unseen data, as a function of \lambda (ISLR fig. 6.5)

MSE (pink), squared bias (black) and variance (green), estimated from performance on previously unseen data, as a function of \(\lambda\) (ISLR fig. 6.5)

Notice that the variance decreases as \(\lambda\) increases, while squares bias decreases, but there is a “sweet spot” that minimizes the MSE. The whole point of model selection (CV, AIC, ridge regression, etc.) is to find this sweet spot (or spot close to it).

The LASSO

Now, there’s one issue with ridge regression, which becomes evident when we compare it with subset selection methods. Except when \(\lambda\) is truly huge (i.e., infinite), ridge regression fits a model that still has all of the coefficients in it (i.e., all of the coefficients are nonzero, though perhaps small). Said another way, we haven’t simplified the model in the sense of reducing the number of predictors or only keeping the “useful” predictors around.

This isn’t a problem for prediction. After all, more predictors often make prediction better, especially when we have regularization to prevent us from over-fitting.

But this is a problem if our goal is to simplify our model by selecting only some predictors to include in our model. One way to do this would be to make it so that coefficients that aren’t “pulling their weight” in the sense of helping our prediction error will be set to zero. This is precisely the goal of the LASSO.

Note: I (Keith) usually write LASSO in all caps because it is an acronym for least absolute shrinkage and selection operator, but it is so wildly popular in statistics that we will often just write Lasso or lasso to refer to it. For those not familiar, a lasso is one of these.

The LASSO looks a lot like ridge regression, except that the penalty is slightly different: \[ \sum_{i=1}^n \left( Y_i - \sum_{j=0}^p \beta_j X_{i,j} \right)^2 + \lambda \sum_{j=1}^p \left| \beta_j \right| = \operatorname{RSS} + \lambda \sum_{j=1}^p \left| \beta_j \right| \]

The penalty term now involves a sum of absolute values of the coefficients instead of a sum of squares.

The interesting thing is that this small change has a big effect on what our estimated coefficients look like. The LASSO penalty encourages coefficients to be set precisely equal to zero if they aren’t useful predictors (i.e., if they do not help to decrease the RSS).

There is an interesting geometric reason for this, though it is outside the scope of the course. See the end of Section 6.6.2 in ISLR.

The important point is that the LASSO performs variable selection for us by setting many coefficients to zero.

The glmnet package has a very good LASSO implementation. This is generally a very useful package for doing all kinds of different penalized regression, and you’ll likely use it extensively in your regression course(s). You’ll get a bit of practice with it in discussion section.

How to choose \(\lambda\)? CV to the rescue!

A natural question in both ridge and the LASSO concerns how we should choose the term \(\lambda\) that controls the “strength” of the penalty term.

We said a few lectures ago that CV was useful beyond just variable selection, and here’s the payoff.

CV is also a great way to choose \(\lambda\) in these kinds of penalized problems. We choose different values of \(\lambda\), fit the corresponding models, and use CV to select among them!

Section 6.2.3 has a more detailed discussion of this, using \(10\)-fold CV to compare different choices of \(\lambda\).