XKCD comic

Refresher: the LASSO

In lecture, we discussed the LASSO, which is a technique for regularizing linear regression while simultaneously doing variable selection.

Recall that under the LASSO, we try to choose our coefficients \(\beta_0,\beta_1,\beta_2,\dots,\beta_p\) to minimize \[ \sum_{i=1}^n \left( Y_i - \beta^T X_i \right)^2 + \lambda \sum_{j=1}^p |\beta_j|, \] where \(\lambda \ge 0\) controls the amount of regularization (i.e., how much we care about the second term) and we are assuming that the vectors of predictors \(X_1,X_2,\dots,X_n \in \mathbb{R}^{p+1}\) include an entry equal to 1 to account for the intercept term: \[ X_i = \left( 1, X_{i,1}, X_{i,2}, \dots, X_{i,p} \right)^T \in \mathbb{R}^{p+1}. \]

The LASSO objective (an objective is just a quantity that we want to optimize; in this case, we are trying to minimize the LASSO cost) is very similar to that of ridge regression, which we discussed at length in lecture and which tries to minimize \[ \operatorname{RSS} + \lambda \sum_{j=1}^p \beta_j^2, \] where \(\operatorname{RSS}\) is the same residual sum of squares as in the first term above.

The difference is that ridge regression penalizes the sum of squares of the coefficients, while the LASSO penalizes the sum of their absolute values. It turns out that this small change has a surprisingly large effect on the solution that we find.
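
Before moving on, it may help to see the two objectives written as code. Below is a minimal sketch (not something we will use for fitting; glmnet handles all of this internally), with made-up function names, assuming X is an \(n \times (p+1)\) matrix whose first column is all 1s and beta is a vector of length \(p+1\):

# Illustrative sketch only: compute the two penalized objectives by hand.
# X: n-by-(p+1) matrix of predictors (first column all 1s for the intercept),
# y: length-n response vector, beta: length-(p+1) coefficients, lambda >= 0.
lasso_cost <- function( X, y, beta, lambda ) {
  rss <- sum( (y - X %*% beta)^2 );
  return( rss + lambda*sum( abs(beta[-1]) ) ); # penalty excludes the intercept
}
ridge_cost <- function( X, y, beta, lambda ) {
  rss <- sum( (y - X %*% beta)^2 );
  return( rss + lambda*sum( beta[-1]^2 ) );    # same RSS, squared penalty instead
}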

Specifically, as discussed in lecture, the LASSO penalty actually encourages us to set “unhelpful” coefficients exactly to zero. This is in contrast to ridge regression, which “encourages” coefficients that do not help much with prediction to be close to zero; if you examine our worked example from lecture, you’ll see that even for large values of \(\lambda\), the estimated ridge coefficients are small but not equal to zero.

In lecture, we waved our hands at the LASSO, and we mentioned that glmnet is a popular package for fitting the LASSO, but we didn’t dive into implementation like we did with ridge regression. So let’s get a bit of practice with glmnet now.

mtcars yet again

We’ve been using the mtcars data set as a running example in lecture for our regression problems, so why stop now? Let’s load the mtcars data set and just remind ourselves of the variables involved.

data(mtcars);
head(mtcars);
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In the event that this output doesn’t fit on one screen, here’s the full list of variables:

names(mtcars);
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Our goal is still to predict the fuel efficiency (mpg) from all of the other variables, this time using the LASSO.

Installation and overview of glmnet

To start, we need to install glmnet. So let’s do that.

# Uncomment this line and run it if you need to install glmnet.
#install.packages('glmnet');

If you run into weird errors about clang or anything like that, you might need to try including the dependencies=TRUE argument in install.packages. Also, if your version of R is too old (something like older than 3.4, if I remember correctly), you may need to update R.
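
For reference, that fallback would look something like this (only run it if the plain install above fails):

# Uncomment and run only if the plain install fails; dependencies=TRUE tells R
# to also install the packages that glmnet depends on.
#install.packages('glmnet', dependencies=TRUE);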

Once you have glmnet installed, let’s have a look at the documentation.

library(glmnet);
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## Loaded glmnet 4.1-3
?glmnet

Read over the help documentation. Note that the documentation refers to “generalized linear model via penalized maximum likelihood”. This is a fancy way of saying that glmnet covers a much wider array of regression problems than just linear regression with the LASSO. We’ll return to that below.

Skipping to the “Arguments” section of the documentation, we see that we need to first specify the input data matrix x (i.e., the matrix of predictors) and the responses y.

For our data, the y argument will be the column of car mpg values, and the matrix x will be everything else.

We also need to specify a family. This tells glmnet what kind of regression we want to do (e.g., linear vs logistic).

Skipping down a few arguments, we see that there are some arguments related to specifying the \(\lambda\) parameter that controls the amount of regularization. We’ll just set \(\lambda\) by hand, using the lambda parameter, but you can see that there are other options in the documentation.

Reading through more of the docs, you’ll see that there are plenty more arguments available to modify the behavior of our fitting procedure. We’ll leave most of those for another day, when you know more about the nuts and bolts of regression and optimization.

However, there is one especially important argument for our purposes: the documentation describes the alpha parameter as “The elasticnet mixing parameter, with \(0 \le \alpha \le 1\).” We need to set \(\alpha = 1\) to get the LASSO.
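
Putting these arguments together, the fits below will follow roughly this template (x and y here are just placeholders for a predictor matrix and a response vector, not objects we have created yet):

# Template only (not run): family="gaussian" means ordinary linear regression,
# alpha=1 selects the LASSO penalty, and lambda sets the amount of regularization.
#fit <- glmnet(x, y, family="gaussian", alpha=1, lambda=1);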

Fitting a model

So let’s get down to it and fit a model, already!

From reading the documentation, we see that we need to split the mtcars data set into a column of responses and… everything else. Recall that mpg is the first column:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

So let’s split it out.

y_mtc <- mtcars[,1]; # Grab just the first column
# ... and the predictors are everything else.
x_mtc <- mtcars[, -c(1)];
# Just to verify:
head(x_mtc);
##                   cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s start with a sanity check. We’ll fit the LASSO with \(\lambda=0\). This is just linear regression, so the predictions should be the same (up to rounding errors) as those from the built-in linear regression in lm.

# Remember, alpha=1 for the LASSO.
mtc_lasso_lambda0 <- glmnet(x_mtc, y_mtc, alpha = 1, lambda=0);

Note: if you get an error along the lines of missing functions in Rcpp, try running install.packages('Rcpp'); library(Rcpp). This will reinstall the Rcpp package and should correct the issue. This was a common problem for students earlier in the semester, so you are unlikely to encounter it now, but just in case.

Let’s extract the coefficients just to see what they’re like.

coef( mtc_lasso_lambda0 )
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) 12.19850081
## cyl         -0.09882217
## disp         0.01307841
## hp          -0.02142912
## drat         0.79812453
## wt          -3.68926778
## qsec         0.81769993
## vs           0.32109677
## am           2.51824708
## gear         0.66755681
## carb        -0.21040602

Now, let’s fit linear regression and compare the coefficients.

# Fit linear regression to the mtcars data set and extract the coefficients.
mtc_vanilla_lm <- lm( mpg ~ ., mtcars )
coef(mtc_vanilla_lm)
## (Intercept)         cyl        disp          hp        drat          wt 
## 12.30337416 -0.11144048  0.01333524 -0.02148212  0.78711097 -3.71530393 
##        qsec          vs          am        gear        carb 
##  0.82104075  0.31776281  2.52022689  0.65541302 -0.19941925
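
Scrolling between the two printouts is a bit tedious, so here is one quick way to line the two sets of coefficients up side by side (as.matrix converts glmnet’s sparse-matrix output into an ordinary matrix so that cbind behaves):

# Put the two coefficient vectors side by side for easier comparison.
# as.matrix() turns the sparse glmnet coefficients into an ordinary matrix.
cbind( as.matrix( coef(mtc_lasso_lambda0) ), coef(mtc_vanilla_lm) );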

The coefficients are largely the same, though they often agree only to one or two decimal places. This has a lot to do with the fact that the scalings of the variables in the mtcars data set are very different. Let’s compare the model predictions. Note that this is still just prediction on the training set. We are doing this not to evaluate model fit, but just to verify that the two models we fit are reasonably similar (as they should be, because both are actually the same linear regression model with no regularization).

# Note that the lasso model object requires its input prediction examples
# to be in a matrix, so we are obliging it with as.matrix().
predict( mtc_lasso_lambda0, newx=as.matrix(x_mtc) )
##                           s0
## Mazda RX4           22.59390
## Mazda RX4 Wag       22.11105
## Datsun 710          26.25301
## Hornet 4 Drive      21.22927
## Hornet Sportabout   17.68818
## Valiant             20.38336
## Duster 360          14.37072
## Merc 240D           22.49403
## Merc 230            24.41218
## Merc 280            18.71290
## Merc 280C           19.20352
## Merc 450SE          14.19206
## Merc 450SL          15.60995
## Merc 450SLC         15.75257
## Cadillac Fleetwood  12.02110
## Lincoln Continental 10.93297
## Chrysler Imperial   10.49791
## Fiat 128            27.77790
## Honda Civic         29.88808
## Toyota Corolla      29.50987
## Toyota Corona       23.62986
## Dodge Challenger    16.94555
## AMC Javelin         17.73892
## Camaro Z28          13.30725
## Pontiac Firebird    16.68582
## Fiat X1-9           28.29339
## Porsche 914-2       26.15796
## Lotus Europa        27.62799
## Ford Pantera L      18.88532
## Ferrari Dino        19.68972
## Maserati Bora       13.93165
## Volvo 142E          24.37203

And compare with

predict( mtc_vanilla_lm, mtcars)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##            22.59951            22.11189            26.25064            21.23740 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##            17.69343            20.38304            14.38626            22.49601 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##            24.41909            18.69903            19.19165            14.17216 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##            15.59957            15.74222            12.03401            10.93644 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            10.49363            27.77291            29.89674            29.51237 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##            23.64310            16.94305            17.73218            13.30602 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##            16.69168            28.29347            26.15295            27.63627 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            18.87004            19.69383            13.94112            24.36827

Okay, it’s a little annoying to scroll back and forth, but you’ll see that the predictions agree quite well.
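
If you would rather not compare by eye, you can also just compute how far apart the two sets of fitted values are. Based on the outputs above, the largest disagreement should be on the order of a few hundredths of an mpg (the exact number may vary with your glmnet version):

# Maximum absolute disagreement between the two models' fitted values.
lasso_preds <- predict( mtc_lasso_lambda0, newx=as.matrix(x_mtc) );
lm_preds <- predict( mtc_vanilla_lm, mtcars );
max( abs( as.numeric(lasso_preds) - as.numeric(lm_preds) ) );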

Adding some regularization

All right, now modify the above code to run the LASSO with \(\lambda=1\) and extract the coefficients. Compare the coefficients with those returned by linear regression. Any observations? You may find it useful to plot the two sets of coefficients against one another.

#TODO: code goes here.

TODO: discussion/observations go here.


Sending coefficients to zero

Okay, now try running the same LASSO code with increasing values of \(\lambda\). What happens as you increase \(\lambda\)?

Try a few different choices of \(\lambda\) and compare the resulting coefficients. Note that you can pass a vector of values to the lambda argument in glmnet to use multiple values for \(\lambda\) at once.
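
To get you started, here is the general pattern for fitting several values of \(\lambda\) in one call (the particular values below are arbitrary; choosing informative ones is part of the exercise):

# Fit the LASSO at several (arbitrary) values of lambda in a single call.
# Each column of the resulting coefficient matrix corresponds to one lambda.
mtc_lasso_path <- glmnet( x_mtc, y_mtc, alpha=1, lambda=c(5, 2, 1, 0.5, 0.1) );
coef( mtc_lasso_path );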

Question 1: What is (approximately) the smallest value of \(\lambda\) for which all of the coefficients get set to zero?

#TODO: code goes here.

TODO: discussion/observations go here.


Question 2: When \(\lambda = 0\), we saw that all of the coefficients were non-zero, and we know that as \(\lambda\) increases, the LASSO penalizes non-zero coefficients more and more. So there should exist a point, as \(\lambda\) increases from \(0\), at which the first coefficient gets set to zero. What is (approximately) this value of \(\lambda\)?

#TODO: code goes here.

TODO: discussion/observations go here.