Parts of these notes adapted from here, here, here, here, here.

Part 1: Simple regression (1 predictor)

Brief review

Model equation

A simple linear model has an outcome, \(y\), and one predictor, \(x\). It is defined by the following equation.

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \]

where \(i = 1, \ldots, n\). This equation represents the true, theoretical, equation, with slope \(\beta_1\) and intercept \(\beta_0\)

The subscript in this equation, \(i\), indexes the \(n\) observations in the dataset. (Think of \(i\) as a row number.) The equation can be read as follows: the value of \(i^{th}\) outcome variable, \(y_i\), is defined by an intercept, \(\beta_0\), plus a slope, \(\beta_1\), multiplied by the \(i^{th}\) predictor variable, \(x_i\). These elements define the systematic or deterministic portion of the model. However, because the world is uncertain, containing randomness, we know that the model will be wrong (as George Box said). To fully describe the data we need an error term, \(\epsilon_i\), which is also indexed by row. The error term is the stochastic portion of the model. \(\epsilon_i\) measures the distance between the fitted or expected values of the model—calculated from the deterministic portion of the model—and the actual values.

The errors in a linear model—also known as model residuals—are the part of the data that remains unexplained by the deterministic portion of the model. One of the key assumptions of a linear model is that the residuals are normally distributed with mean = 0 and variance = \(\sigma^2\), which we denote as

\[ \epsilon_i\sim N(0,\sigma^2) \]

Loss function

For ordinary least squares regression, we have the following loss function (or error function).

\[ \operatorname{RSS} = \sum_{i=1}^n ((\beta_0 + \beta_1x_i) - y_i)^2 = \sum_{i=1}^n (\hat{y}_i - y_i)^2 \]

where \(\hat{y_i}=\beta_0+\beta_1x\) is the predicted \(y_i\) based on the coefficients \(\beta_0,\beta_1\). This function represents something we want to minimize. The higher the loss, the worse the estimate; the lower the loss, the better the estimate.

Note: since the residuals can be positive or negative, so if we simply add the residuals up we might be canceling out some of them. So instead of minimizing the sum of the residuals, we usually choose to square the residuals and minimize the sum of squares of the residuals. (Mathematically it becomes easier to work with the square than the absolute value).

This is not the only possible loss function, there are other ones! However, for simple linear regression, this is the simplest possible loss function to work with, and it’s also the most commonly used (kind of by default) so we’re using it too here.

When we fit a line of regression through the data, we are trying to find estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) such that we minimize the loss function. I.e., our goal is to find

\[ \hat{\beta_0},\hat{\beta_1} = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^n ((\beta_0 + \beta_1x_i) - y_i)^2 \]

The math for this is actually surprisingly not that difficult, so we show the derivation below.

Derivation of estimates

We can do this by taking the partial derivative with respect to \(\hat{\beta_0}\) and \(\hat{\beta_1}\), and setting them both to 0. First, we define the following variables to simplify notation:

\[\begin{align} \text{Define } \bar{y} &:= \frac{1}{n}\sum_i^n y_i \\ \text{Define } \bar{x} &;= \frac{1}{n}\sum_i^n x_i \\ \text{Define } Z &:= \sum_i\left( y_i - \beta_0 - \beta_1 x_i \right)^2 \\ \end{align}\]

Then we take the partial derivative with respect to \(\beta_0\), solve for this \(\beta_0\), then substitute it into the partial derivative with respect to \(\beta_1\): \[\begin{align} \text{Partial deriative w.r.t. } \beta_0 : \; \; \frac{\partial Z}{\partial \beta_0} &= \sum_i^n -2 \left(y_i - \beta_0 - \beta_1x_i \right) \\ \text{Setting the derivative to 0 and solving: } \; \; ~~~~ \hat{\beta_0} &= \frac{1}{n}\sum_i^n y_i - \frac{1}{n}\sum_i^n\hat{\beta_1}x_i \\ \implies ~~~~~ \hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x} \\ \text{Partial deriative w.r.t. } \beta_1 : \; \;\frac{\partial Z}{\partial \beta_1} &= \sum_i^n -2x_i \left( y_i - \beta_0 - \beta_1x_i \right) \end{align}\] \[\begin{align} \text{Setting the derivative to 0 and substituting $\hat{\beta_1}$, we have: } & \\ \sum_i^n x_i y_i - \sum_i^n (\bar{y}-\hat{\beta_1}\bar{x})x_i - \sum_i^n\hat{\beta_1}x_i^2 &= 0 \\ \sum_i^n x_i y_i - \bar{y} \sum_i^n x_i + \hat{\beta_1} \left(\bar{x} \sum_i^n x_i - \sum_i^n x_i^2 \right) &= 0 \\ \hat{\beta_1} &= \frac{\sum_i^n x_i y_i - \bar{y}\sum_i^n x_i }{ \sum_i^n x_i^2 - \bar{x} \sum_i^n x_i } \\ &= \frac{\sum_i^n x_i y_i - n\bar{x}\bar{y}}{ \sum_i^n x_i^2 - n\bar{x}^2} \\ \text{simplifying: } \; \; \hat{\beta_1} &= \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{ \sum_i^n (x_i - \bar{x})^2 } \end{align}\]

And we end up with the final OLS solution:

\[\begin{align} \hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x} \\ \hat{\beta_1} &= \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{ \sum_i^n (x_i - \bar{x})^2 } = \frac{Cov(x,y)}{Var(x)} \end{align}\]

Then, we can use these to find our predicted \(\hat{y_i}\), i.e. the \(y\) values on the line.

\[ \hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i \]

as well as our model residuals \(\hat{\epsilon}_i\)

\[ \hat{\epsilon_i}=y_i-\hat{y_i} \]

From this, we also get for free an estimate of the variance of the residuals \(\sigma^2\), which happens to be very useful in computing other statistics. The reason is that the larger the residuals’ variance, the less precisely we can estimate our regression coefficients, which should make a lot of sense.

\[ \hat{\sigma}^2=\text{mean squared error}=\frac{SSE}{n-2}=\frac1{n-2}\sum_i(y_i-\hat{y_i})^2 \]

We can also easily derive the variance of the slope. First, observe that

\[\begin{align} \sum_i (x_i - \bar{x})\bar{y} &= \bar{y}\sum_i (x_i - \bar{x})\\ &= \bar{y}\left(\left(\sum_i x_i\right) - n\bar{x}\right)\\ &= \bar{y}\left(n\bar{x} - n\bar{x}\right)\\ &= 0 \end{align}\]

This means that

\[\begin{align} \sum_i (x_i - \bar{x})(y_i - \bar{y}) &= \sum_i (x_i - \bar{x})y_i - \sum_i (x_i - \bar{x})\bar{y}\\ &= \sum_i (x_i - \bar{x})y_i\\ &= \sum_i (x_i - \bar{x})(\beta_0 + \beta_1x_i + \epsilon_i )\\ \end{align}\]

Using this, we can easily derive \(\text{Var}(\hat{\beta_1})\) as follows:

\[\begin{align} \text{Var}(\hat{\beta_1}) & = \text{Var} \left(\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \right) \\ &= \text{Var} \left(\frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1x_i + \epsilon_i )}{\sum_i (x_i - \bar{x})^2} \right), \;\;\;\text{substituting in the above} \\ &= \text{Var} \left(\frac{\sum_i (x_i - \bar{x})\epsilon_i}{\sum_i (x_i - \bar{x})^2} \right), \;\;\;\text{noting only $\epsilon_i$ is a random variable} \\ &= \frac{\sum_i (x_i - \bar{x})^2\text{Var}(\epsilon_i)}{\left(\sum_i (x_i - \bar{x})^2\right)^2} , \;\;\;\text{independence of } \epsilon_i \text{ and, Var}(cX)=c^2\text{Var}(X) \\ &= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} \\ \end{align}\]

Model diagnostics

At this point it’s important to run some diagnostics on our model to check if our assumptions are well satisfied.

Assumptions like independence and data actually having a theoretical linear relationship are usually not things you can really check in the data, so we tend to hope they are true. You can usually determine based on the way the data was gathered and by looking at a plot of the data to determine if there are major violations of these two assumptions, but generally speaking, these may be harder to assess.

That usually leaves us with 3 main things to check: - residuals have mean 0 - residuals have constant variance - residuals are normally distributed

1. Residual vs. Fitted plot

The ideal case

Let’s begin by looking at the Residual-Fitted plot coming from a linear model that is fit to data that perfectly satisfies all the of the standard assumptions of linear regression. What are those assumptions? In the ideal case, we expect the \(i\)th data point to be generated as:

\[y_i = \beta_0 + \beta_1x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\]

where \(\epsilon_i\) is an “error” term with mean 0 and some variance \(\sigma^2\).

To create an example of this type of model, let’s generate data according to

\[y_i = 3 + 0.1 x + \epsilon_i,\]

for \(i = 1, 2, \ldots, 1000\), where the \(\epsilon_i\) are independent Normal\((0,sd = 3)\) variables (with standard deviation 3).

Here’s code to generate this data and then regress y on x.

library(ggplot2)

n <- 1000      # sample size
x <- runif(n, min = 0, max = 100)
y.good <- 3 + 0.1 * x + rnorm(n, sd = 3)

# Scatterplot of the data with regression line overlaid
qplot(x, y.good, ylab = "y", main = "Ideal regression setup") + stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# Run regression and display residual-fitted plot
lm.good <- lm(y.good ~ x)
plot(lm.good, which = 1)  

The scatterplot shows the perfect setup for a linear regression: The data appear to be well modeled by a linear relationship between \(y\) and \(x\), and the points appear to be randomly spread out about the line, with no discernible non-linear trends or indications of non-constant variance.

When we look at the diagnostic plots, we’ll see perfect behavior. The quantities that enter into the diagnostic plot are:

  • Fitted values: \(\hat y_i = \hat \beta_0 + \hat \beta_1x_{1i} + \cdot \hat\beta_p x_{pi}\)
    • Here, \(\hat \beta_j\) is the estimated value of the coefficient for variable \(j\)
  • Residuals: \(r_i = y_i - \hat y_i\)

You can think of the residuals \(r_i\) as being estimates of the error terms \(\epsilon_i\). So anytime we’re looking at a plot that involves residuals, we’re doing so because we’re trying to assess whether some assumption about the errors \(\epsilon_i\) appears to hold in our data.

Looking at the Residuals vs Fitted plot (showing \(r_i\) on the y-axis and \(\hat y_i\) on the x-axis), we see that the red line (which is just a scatterplot smoother, showing the average value of the residuals at each value of fitted value) is perfectly flat. This tells us that there is no discernible non-linear trend to the residuals. Furthermore, the residuals appear to be equally variable across the entire range of fitted values. There is no indication of non-constant variance.

# Display scale-location plot
plot(lm.good, which = 3)  

The scale-location plot is a more sensitive approach to looking for deviations from the constant variance assumption. If you see significant trends in the red line on this plot, it tells you that the residuals (and hence errors) have non-constant variance. That is, the assumption that all the \(\epsilon_i\) have the same variance \(\sigma^2\) is not true. When you see a flat line like what’s shown above, it means your errors have constant variance, like we want to see.

Constructing your own Residual vs Fitted plot

Here’s a better version of the default plot.

# Plot model residuals on y axis, fitted values on x axis
# Add red trend curve with better choice of smoothing bandwidth
qplot(y = lm.curved$residuals, x = lm.curved$fitted.values,
      ylab = "Residuals", xlab = "Fitted values", 
      main = "The Do-it-yourself Residuals vs. Fitted plot") +
  stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Non-constant variance

In this example we’ll generate data where the error variance increases with \(x\). Our model will be:

\[ y_i = 3 + 0.2x_i + \epsilon_i \] where \[\epsilon_i \sim N(0, 9(1 + x/25)^2)\].

y.increasing <- 3 + 0.2 * x + (1 + x / 25) * rnorm(n, sd = 3)

# Produce scatterplot of y vs x
qplot(x, y.increasing, ylab = "y")

Here’s what the Residual vs. Fitted plot looks like in this case.

lm.increasing <- lm(y.increasing ~ x)
plot(lm.increasing, which = 1)

If you look at this plot, you’ll see that there’s a clear “funneling” phenomenon. The distribution of the residuals is quite well concentrated around 0 for small fitted values, but they get more and more spread out as the fitted values increase. This is an instance of “increasing variance”. Here’s what the scale-location plot looks like in this example:

plot(lm.increasing, which = 3)

Note the clear upward slope in the red trend line. This tells us we have non-constant variance.

The standard linear regression assumption is that the variance is constant across the entire range. When this assumption isn’t valid, such as in this example, we shouldn’t believe our confidence intervals, prediction bands, or the p-values in our regression.

Normal QQ plot

The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. If the residuals look far from normal we may be in trouble. In particular, if the residual tend to be larger in magnitude than what we would expect from the normal distribution, then our p-values and confidence intervals may be too optimisitic. i.e., we may fail to adequately account for the full variability of the data.

The ideal case

First, here’s an example of a Normal QQ plot that’s as perfect as it gets. This comes from the ideal simulation setting in the previous section. The residuals here are a perfect match to the diagonal line. These residuals look to be normally distributed.

plot(lm.good, which = 2)

Lighter tails

In the next example, we see a QQ plot where the residuals deviate from the diagonal line in both the upper and lower tail. This plot indicated that the tails are ‘lighter’ (have smaller values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a “flatter” line than than the diagonal.

plot(lm.curved, which = 2)

Heavier tails

In this final example, we see a QQ plot where the residuals deviate from the diagonal line in both the upper and lower tail. Unlike the previous plot, in this case we see that the tails are observed to be ‘heavier’ (have larger values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a “steeper” line than the diagonal.

plot(lm.increasing, which = 2)

Bonus: Outliers and the Residuals vs Leverage plot

There’s no single accepted definition for what consitutes an outlier. One possible definition is that an outlier is any point that isn’t approximated well by the model (has a large residual) and which significantly influences model fit (has large leverage). This is where the Residuals vs Leverage plot comes in.

The ideal case

Let’s look at our ideal setting once again. The plot below is a great example of a Residuals vs Leverage plot in which we see no evidence of outliers. Those “Cook’s distance” dashed curves don’t even appear on the plot. None of the points come close to having both high residual and leverage.

plot(lm.good, which = 5)

An example with possible outliers
set.seed(12345)
y.corrupted <- y.good[1:100]
x.corrupted <- x[1:100]

# Randomly select 10 points to corrupt
to.corrupt <- sample(1:length(x.corrupted), 10)
y.corrupted[to.corrupt] <- - 1.5 * y.corrupted[to.corrupt] + 3 * rt(10, df = 3)
x.corrupted[to.corrupt] <- x.corrupted[to.corrupt] * 2.5
# Fit regression and display diagnostic plot
lm.corrupted <- lm(y.corrupted ~ x.corrupted)

plot(lm.corrupted, which = 5)

In this plot we see that there are several points that have high residual and high leverage. The points that lie close to or outside of the dashed red curves are worth investigating further.

Can’t we just use scatterplots?

All of the examples above were generated by considering the regression of a single outcome variable on a single covariate. In this case, we could’ve diagnosed most of the violations of model assumptions just by looking at the x-y scatterplot. The reason for using diagnostic plots is that most regressions we run aren’t so simple. Most regressions use many variables (tens or hundreds of variables), and in those cases there isn’t a good way of visualizing all of the data. Residuals, fitted values and leverage are all quantities that can be computed and plotted regardless of how many variables are included in the model. Thus diagnostics such as the Residual vs. Fitted plot, Normal QQ plot and Residual vs. Leverage plot can help us even when we have complicated models.

R^2, confidence, hypothesis

Continue on page 36 of Devore


Part 2: Multiple regression

Natural way to extend simple linear prediction: add more predictors! For each observation \(i\), consider the following model equation specification:

\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2}+...+\beta_px_{ip}+\epsilon_i \]

Where \(\beta_k\) is the \(k\)-th coefficient and \(x_{ik}\) is the \(i\)-th observation of the \(k\)-th predictor variable.

We can then again represent the loss function as

\[ RSS = \sum_{i=1}^n ((\beta_0 + \beta_1x_{i1}+...+\beta_nx_{np}) - y_i)^2 \]

Alternatively, we can use a matrix representation. Let

\[ y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix} \quad X=\begin{bmatrix} 1&x_{11}&x_{12}&\cdots&x_{1p}\\ 1&x_{21}&x_{22}&\cdots&x_{2p}\\ \cdots&\cdots&\vdots&\ddots&\cdots\\ 1&x_{n1}&x_{n2}&\cdots&x_{np}\\ \end{bmatrix} \quad b=\begin{bmatrix}\beta_0\\\beta_1\\\beta_2\\\vdots\\\beta_n\end{bmatrix} \quad e=\begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix} \]

Then we can represent out model equation as

\[ y=Xb+e \]

We can then represent our loss function as

\[\begin{align*} e'e &= (y-Xb)'(y-Xb) \\ &=y'y-b'X'y-y'Xb+b'X'Xb \\ &=y'y-2b'X'y+b'X'Xb \\ \end{align*}\]

Where \(e'\) is the transpose of \(e\). Then, we can take a derivative of this with respect to \(b\) and set to 0 again to solve for \(\hat{b}\)

\[ \frac{\partial e'e}{\partial b}\bigg|_{b=\hat{b}}=-2X'y+2X'X\hat{b}=0 \]

We can rearrange to become

\[ (X'X)\hat{b}=X'y \]

Solving for \(\hat{b}\) gives us

\[ \hat{b} = (X'X)^{-1}X'y \]

You can also easily derive the variance of the coefficients:

\[\begin{align*} \operatorname{Var}(\hat{b}) &=\operatorname{Var}((X'X)^{-1}X'y)\\ &=\operatorname{Var}((X'X)^{-1}X'e)\\ &=((X'X)^{-1}X')\cdot\operatorname{Var}(e)\cdot((X'X)^{-1}X')'\\ &=\sigma^2((X'X)^{-1}X')\cdot I\cdot((X'X)^{-1}X')'\\ &=\sigma^2((X'X)^{-1}X')((X'X)^{-1}X')'\\ &=\sigma^2(X'X)^{-1}(X'X)(X'X)^{-1}\\ &=\sigma^2(X'X)^{-1} \end{align*}\]

This process is exactly the same as 1 predictor regression, except we’re fitting a hyperplane through our data.


Continue on page 23 of ISLR

Advanced R formulae

Other examples of interaction plots: here and here


Logistic regression


Chapter 4 of ISLR

Example:

admit = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admit$rank <- factor(admit$rank)

head(admit)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
summary(admit)
##      admit             gre             gpa        rank   
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   1: 61  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   2:151  
##  Median :0.0000   Median :580.0   Median :3.395   3:121  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   4: 67  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670          
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000
logit.admit = glm(admit~gre+gpa+rank, data=admit, family="binomial")
summary(logit.admit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = admit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
confint(logit.admit)
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605
# by default, gives you type="link"
pred.link = predict(logit.admit, newdata=data.frame(gre=800, gpa=4, rank="1"))
pred.link
##        1 
## 1.037712
# to give response, use type="response"
predict(logit.admit, newdata=data.frame(gre=800, gpa=4, rank="1"), type="response")
##         1 
## 0.7384082
# relationship between two
1/(1/exp(pred.link)+1)
##         1 
## 0.7384082
# predict completely manually
coef(logit.admit)
##  (Intercept)          gre          gpa        rank2        rank3        rank4 
## -3.989979073  0.002264426  0.804037549 -0.675442928 -1.340203916 -1.551463677
1/(1/exp(sum(coef(logit.admit) * c(1,800,4,0,0,0)))+1)
## [1] 0.7384082
# another example
predict(logit.admit, newdata=data.frame(gre=750, gpa=3.7, rank="2"), type="response")
##         1 
## 0.5019591
1/(1/exp(sum(coef(logit.admit) * c(1,750,3.7,1,0,0)))+1)
## [1] 0.5019591