Parts of these notes adapted from here, here, here, here, here.
A simple linear model has an outcome, \(y\), and one predictor, \(x\). It is defined by the following equation.
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \]
where \(i = 1, \ldots, n\). This is the true, theoretical equation, with intercept \(\beta_0\) and slope \(\beta_1\).
The subscript in this equation, \(i\), indexes the \(n\) observations in the dataset. (Think of \(i\) as a row number.) The equation can be read as follows: the value of the \(i^{th}\) outcome variable, \(y_i\), is given by an intercept, \(\beta_0\), plus a slope, \(\beta_1\), multiplied by the \(i^{th}\) predictor variable, \(x_i\). These elements define the systematic or deterministic portion of the model. However, because the world is uncertain and contains randomness, we know that the model will be wrong (as George Box said). To fully describe the data we need an error term, \(\epsilon_i\), which is also indexed by row. The error term is the stochastic portion of the model: \(\epsilon_i\) measures the distance between the fitted or expected values of the model (calculated from the deterministic portion) and the actual observed values.
The errors in a linear model—also known as model residuals—are the part of the data that remains unexplained by the deterministic portion of the model. One of the key assumptions of a linear model is that the residuals are normally distributed with mean = 0 and variance = \(\sigma^2\), which we denote as
\[ \epsilon_i\sim N(0,\sigma^2) \]
For ordinary least squares regression, we have the following loss function (or error function).
\[ \operatorname{RSS} = \sum_{i=1}^n ((\beta_0 + \beta_1x_i) - y_i)^2 = \sum_{i=1}^n (\hat{y}_i - y_i)^2 \]
where \(\hat{y_i}=\beta_0+\beta_1x_i\) is the predicted value of \(y_i\) based on the coefficients \(\beta_0,\beta_1\). This function represents something we want to minimize. The higher the loss, the worse the estimate; the lower the loss, the better the estimate.
Note: the residuals can be positive or negative, so if we simply added them up they could cancel each other out. Instead of minimizing the sum of the residuals, we therefore square the residuals and minimize the sum of squares of the residuals. (Mathematically, the square is easier to work with than the absolute value.)
This is not the only possible loss function; others exist. However, for simple linear regression this is the simplest one to work with, and it is also the most commonly used (more or less by default), so we use it here too.
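As a quick illustration of the loss function, here is a small R sketch on made-up toy data (the values are purely illustrative) that evaluates the RSS for two candidate lines:

# Toy data, for illustration only
x.toy <- c(1, 2, 3, 4, 5)
y.toy <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# RSS for a candidate intercept b0 and slope b1
rss <- function(b0, b1) sum((b0 + b1 * x.toy - y.toy)^2)

rss(0, 2)   # close to the trend in the toy data: RSS is about 0.11
rss(5, -1)  # a poor guess: RSS is about 168.5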
When we fit a line of regression through the data, we are trying to find estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) such that we minimize the loss function. I.e., our goal is to find
\[ \hat{\beta_0},\hat{\beta_1} = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^n ((\beta_0 + \beta_1x_i) - y_i)^2 \]
The math for this is actually not that difficult, so we show the derivation below.
We can do this by taking the partial derivatives with respect to \(\beta_0\) and \(\beta_1\), and setting them both to 0. First, we define the following quantities to simplify notation:
\[\begin{align} \text{Define } \bar{y} &:= \frac{1}{n}\sum_i^n y_i \\ \text{Define } \bar{x} &:= \frac{1}{n}\sum_i^n x_i \\ \text{Define } Z &:= \sum_i\left( y_i - \beta_0 - \beta_1 x_i \right)^2 \end{align}\]
Then we take the partial derivative with respect to \(\beta_0\), solve for \(\beta_0\), and then substitute it into the partial derivative with respect to \(\beta_1\): \[\begin{align} \text{Partial derivative w.r.t. } \beta_0 : \; \; \frac{\partial Z}{\partial \beta_0} &= \sum_i^n -2 \left(y_i - \beta_0 - \beta_1x_i \right) \\ \text{Setting the derivative to 0 and solving: } \; \; ~~~~ \hat{\beta_0} &= \frac{1}{n}\sum_i^n y_i - \frac{1}{n}\sum_i^n\hat{\beta_1}x_i \\ \implies ~~~~~ \hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x} \\ \text{Partial derivative w.r.t. } \beta_1 : \; \;\frac{\partial Z}{\partial \beta_1} &= \sum_i^n -2x_i \left( y_i - \beta_0 - \beta_1x_i \right) \end{align}\] \[\begin{align} \text{Setting the derivative to 0 and substituting $\hat{\beta_0}$, we have: } & \\ \sum_i^n x_i y_i - \sum_i^n (\bar{y}-\hat{\beta_1}\bar{x})x_i - \sum_i^n\hat{\beta_1}x_i^2 &= 0 \\ \sum_i^n x_i y_i - \bar{y} \sum_i^n x_i + \hat{\beta_1} \left(\bar{x} \sum_i^n x_i - \sum_i^n x_i^2 \right) &= 0 \\ \hat{\beta_1} &= \frac{\sum_i^n x_i y_i - \bar{y}\sum_i^n x_i }{ \sum_i^n x_i^2 - \bar{x} \sum_i^n x_i } \\ &= \frac{\sum_i^n x_i y_i - n\bar{x}\bar{y}}{ \sum_i^n x_i^2 - n\bar{x}^2} \\ \text{simplifying: } \; \; \hat{\beta_1} &= \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{ \sum_i^n (x_i - \bar{x})^2 } \end{align}\]
And we end up with the final OLS solution:
\[\begin{align} \hat{\beta_0} &= \bar{y} - \hat{\beta_1} \bar{x} \\ \hat{\beta_1} &= \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{ \sum_i^n (x_i - \bar{x})^2 } = \frac{Cov(x,y)}{Var(x)} \end{align}\]
Then, we can use these to find our predicted \(\hat{y_i}\), i.e. the \(y\) values on the line.
\[ \hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i \]
as well as our model residuals \(\hat{\epsilon}_i\)
\[ \hat{\epsilon_i}=y_i-\hat{y_i} \]
From this, we also get essentially for free an estimate of the residual variance \(\sigma^2\), which turns out to be very useful for computing other statistics: the larger the residual variance, the less precisely we can estimate our regression coefficients, which should make intuitive sense.
\[ \hat{\sigma}^2=\text{mean squared error}=\frac{SSE}{n-2}=\frac1{n-2}\sum_i(y_i-\hat{y_i})^2 \]
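To make these formulas concrete, here is a small R sketch (on simulated data, so the specific numbers are only illustrative) that computes \(\hat{\beta_0}\), \(\hat{\beta_1}\), and \(\hat{\sigma}^2\) by hand and checks them against lm():

# Simulate illustrative data
set.seed(1)
n.sim <- 50
x.sim <- runif(n.sim, 0, 10)
y.sim <- 2 + 0.5 * x.sim + rnorm(n.sim, sd = 1)

# Closed-form OLS estimates
b1.hat <- sum((x.sim - mean(x.sim)) * (y.sim - mean(y.sim))) /
  sum((x.sim - mean(x.sim))^2)
b0.hat <- mean(y.sim) - b1.hat * mean(x.sim)

# Fitted values, residuals, and the residual variance estimate
resid.hat <- y.sim - (b0.hat + b1.hat * x.sim)
sigma2.hat <- sum(resid.hat^2) / (n.sim - 2)

# Compare with lm(): these should agree
fit.sim <- lm(y.sim ~ x.sim)
coef(fit.sim)     # should match c(b0.hat, b1.hat)
sigma(fit.sim)^2  # should match sigma2.hat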
We can also easily derive the variance of the slope. First, observe that
\[\begin{align} \sum_i (x_i - \bar{x})\bar{y} &= \bar{y}\sum_i (x_i - \bar{x})\\ &= \bar{y}\left(\left(\sum_i x_i\right) - n\bar{x}\right)\\ &= \bar{y}\left(n\bar{x} - n\bar{x}\right)\\ &= 0 \end{align}\]
This means that
\[\begin{align} \sum_i (x_i - \bar{x})(y_i - \bar{y}) &= \sum_i (x_i - \bar{x})y_i - \sum_i (x_i - \bar{x})\bar{y}\\ &= \sum_i (x_i - \bar{x})y_i\\ &= \sum_i (x_i - \bar{x})(\beta_0 + \beta_1x_i + \epsilon_i )\\ \end{align}\]
Using this, we can easily derive \(\text{Var}(\hat{\beta_1})\) as follows:
\[\begin{align} \text{Var}(\hat{\beta_1}) & = \text{Var} \left(\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \right) \\ &= \text{Var} \left(\frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1x_i + \epsilon_i )}{\sum_i (x_i - \bar{x})^2} \right), \;\;\;\text{substituting in the above} \\ &= \text{Var} \left(\frac{\sum_i (x_i - \bar{x})\epsilon_i}{\sum_i (x_i - \bar{x})^2} \right), \;\;\;\text{noting only $\epsilon_i$ is a random variable} \\ &= \frac{\sum_i (x_i - \bar{x})^2\text{Var}(\epsilon_i)}{\left(\sum_i (x_i - \bar{x})^2\right)^2} , \;\;\;\text{by independence of the } \epsilon_i \text{ and Var}(cX)=c^2\text{Var}(X) \\ &= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} \end{align}\]
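The same kind of numerical check works for the variance formula: plug \(\hat{\sigma}^2\) into \(\sigma^2/\sum_i(x_i-\bar{x})^2\) and compare with the standard error that summary() reports (continuing the simulated example above; the numbers are illustrative only):

# Reuse the simulated fit from the sketch above
se.b1.by.hand <- sqrt(sigma(fit.sim)^2 / sum((x.sim - mean(x.sim))^2))
se.b1.lm <- summary(fit.sim)$coefficients["x.sim", "Std. Error"]
c(se.b1.by.hand, se.b1.lm)  # the two values should agree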
At this point it’s important to run some diagnostics on our model to check if our assumptions are well satisfied.
Assumptions like independence, and the data actually having an underlying linear relationship, are usually not things you can directly check in the data, so we tend to hope they are true. You can often judge from the way the data were gathered, and from a plot of the data, whether there are major violations of these two assumptions, but generally speaking they are harder to assess.
That usually leaves us with three main things to check:
- residuals have mean 0
- residuals have constant variance
- residuals are normally distributed
Let’s begin by looking at the Residuals vs. Fitted plot coming from a linear model that is fit to data that perfectly satisfies all of the standard assumptions of linear regression. What are those assumptions? In the ideal case, we expect the \(i\)th data point to be generated as:
\[y_i = \beta_0 + \beta_1x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i\]
where \(\epsilon_i\) is an “error” term with mean 0 and some variance \(\sigma^2\).
To create an example of this type of model, let’s generate data according to
\[y_i = 3 + 0.1 x_i + \epsilon_i,\]
for \(i = 1, 2, \ldots, 1000\), where the \(\epsilon_i\) are independent Normal random variables with mean 0 and standard deviation 3.
Here’s code to generate this data and then regress y on x.
library(ggplot2)
n <- 1000 # sample size
x <- runif(n, min = 0, max = 100)
y.good <- 3 + 0.1 * x + rnorm(n, sd = 3)
# Scatterplot of the data with regression line overlaid
qplot(x, y.good, ylab = "y", main = "Ideal regression setup") + stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# Run regression and display residual-fitted plot
lm.good <- lm(y.good ~ x)
plot(lm.good, which = 1)
The scatterplot shows the perfect setup for a linear regression: The data appear to be well modeled by a linear relationship between \(y\) and \(x\), and the points appear to be randomly spread out about the line, with no discernible non-linear trends or indications of non-constant variance.
When we look at the diagnostic plots, we’ll see perfect behavior. The quantities that enter into the diagnostic plots are the fitted values, \(\hat y_i\), and the residuals, \(r_i = y_i - \hat y_i\).
You can think of the residuals \(r_i\) as being estimates of the error terms \(\epsilon_i\). So anytime we’re looking at a plot that involves residuals, we’re doing so because we’re trying to assess whether some assumption about the errors \(\epsilon_i\) appears to hold in our data.
Looking at the Residuals vs Fitted plot (showing \(r_i\) on the y-axis and \(\hat y_i\) on the x-axis), we see that the red line (which is just a scatterplot smoother, showing the average value of the residuals at each fitted value) is perfectly flat. This tells us that there is no discernible non-linear trend in the residuals. Furthermore, the residuals appear to be equally variable across the entire range of fitted values. There is no indication of non-constant variance.
# Display scale-location plot
plot(lm.good, which = 3)
The scale-location plot is a more sensitive approach to looking for deviations from the constant variance assumption. If you see significant trends in the red line on this plot, it tells you that the residuals (and hence errors) have non-constant variance. That is, the assumption that all the \(\epsilon_i\) have the same variance \(\sigma^2\) is not true. When you see a flat line like what’s shown above, it means your errors have constant variance, like we want to see.
Here’s an example where we have non-linear trends in the data. This example is constructed to mimic seasonal data.
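The code that generated this seasonal example isn’t included above. Purely as an illustration (the formula below is an assumption, not the original recipe), data like this could be simulated by adding a periodic component to a linear trend, reusing the x and n defined earlier:

# Hypothetical seasonal data: linear trend plus a periodic component
# (assumed form for illustration; the original generating code is not shown)
y.curved <- 3 + 0.1 * x + 4 * sin(x / 5) + rnorm(n, sd = 2)

# Scatterplot with the linear fit (blue, with its grey confidence band)
# and a flexible loess fit (red) overlaid
qplot(x, y.curved, ylab = "y", main = "Seasonal (non-linear) trend") +
  stat_smooth(method = "lm") +
  stat_smooth(method = "loess", span = 0.2, colour = "red", se = FALSE)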
The blue line shows the model fit. The red curve is a non-linear fit that does a better job of modelling the average value of \(y\) at each value of \(x\). Note that the linear model fails to capture the clear non-linear trend that’s present in the data. This causes tremendous problems for our inference. Look at the grey confidence band that surrounds the regression line. If the standard linear regression assumptions were satisfied, this band would, with high probability, contain the average value of \(y\) at each value of \(x\); that is, the grey band around the blue line should mostly contain the red curve. This obviously does not happen: the red curve is almost always far outside the grey band around the blue regression line.
Take-away: When one or more of the model assumptions underlying the linear model is violated, we can no longer believe our inferential procedures. E.g., our confidence intervals and p-values may no longer be reliable.
Here’s what the Residuals vs. Fitted plot looks like for this model.
lm.curved <- lm(y.curved ~ x)
plot(lm.curved, which = 1)
Visually, we see a clear trend in the residuals. They have a periodic trend. Unfortunately, the scatterplot smoother that’s used to construct the red line isn’t doing a good job here. This is a case where the choice of neighbourhood size (how many points go into calculating the local average) is too large to capture the trend that we visually observe. Don’t always trust that red curve.
Here’s a better version of the default plot.
# Plot model residuals on y axis, fitted values on x axis
# Add red trend curve with better choice of smoothing bandwidth
qplot(y = lm.curved$residuals, x = lm.curved$fitted.values,
ylab = "Residuals", xlab = "Fitted values",
main = "The Do-it-yourself Residuals vs. Fitted plot") +
stat_smooth(method = "loess", span = 0.1, colour = I("red"), se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
In this example we’ll generate data where the error variance increases with \(x\). Our model will be:
\[ y_i = 3 + 0.2x_i + \epsilon_i, \] where \[ \epsilon_i \sim N\left(0,\; 9(1 + x_i/25)^2\right). \]
y.increasing <- 3 + 0.2 * x + (1 + x / 25) * rnorm(n, sd = 3)
# Produce scatterplot of y vs x
qplot(x, y.increasing, ylab = "y")
Here’s what the Residual vs. Fitted plot looks like in this case.
lm.increasing <- lm(y.increasing ~ x)
plot(lm.increasing, which = 1)
If you look at this plot, you’ll see that there’s a clear “funneling” phenomenon. The distribution of the residuals is quite well concentrated around 0 for small fitted values, but they get more and more spread out as the fitted values increase. This is an instance of “increasing variance”. Here’s what the scale-location plot looks like in this example:
plot(lm.increasing, which = 3)
Note the clear upward slope in the red trend line. This tells us we have non-constant variance.
The standard linear regression assumption is that the variance is constant across the entire range. When this assumption isn’t valid, such as in this example, we shouldn’t believe our confidence intervals, prediction bands, or the p-values in our regression.
The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. If the residuals look far from normal, we may be in trouble. In particular, if the residuals tend to be larger in magnitude than what we would expect from the normal distribution, then our p-values and confidence intervals may be too optimistic. That is, we may fail to adequately account for the full variability of the data.
First, here’s an example of a Normal QQ plot that’s as perfect as it gets. This comes from the ideal simulation setting in the previous section. The residuals here are a perfect match to the diagonal line. These residuals look to be normally distributed.
plot(lm.good, which = 2)
In the next example, we see a QQ plot where the residuals deviate from the diagonal line in both the upper and lower tail. This plot indicates that the tails are ‘lighter’ (have smaller values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a “flatter” line than the diagonal.
plot(lm.curved, which = 2)
In this final example, we see a QQ plot where the residuals deviate from the diagonal line in both the upper and lower tail. Unlike the previous plot, in this case we see that the tails are observed to be ‘heavier’ (have larger values) than what we would expect under the standard modeling assumptions. This is indicated by the points forming a “steeper” line than the diagonal.
plot(lm.increasing, which = 2)
There’s no single accepted definition for what constitutes an outlier. One possible definition is that an outlier is any point that isn’t approximated well by the model (has a large residual) and that significantly influences the model fit (has high leverage). This is where the Residuals vs Leverage plot comes in.
Let’s look at our ideal setting once again. The plot below is a great example of a Residuals vs Leverage plot in which we see no evidence of outliers. Those “Cook’s distance” dashed curves don’t even appear on the plot. None of the points come close to having both high residual and leverage.
plot(lm.good, which = 5)
set.seed(12345)
y.corrupted <- y.good[1:100]
x.corrupted <- x[1:100]
# Randomly select 10 points to corrupt
to.corrupt <- sample(1:length(x.corrupted), 10)
y.corrupted[to.corrupt] <- - 1.5 * y.corrupted[to.corrupt] + 3 * rt(10, df = 3)
x.corrupted[to.corrupt] <- x.corrupted[to.corrupt] * 2.5
# Fit regression and display diagnostic plot
lm.corrupted <- lm(y.corrupted ~ x.corrupted)
plot(lm.corrupted, which = 5)
In this plot we see that there are several points that have high residual and high leverage. The points that lie close to or outside of the dashed red curves are worth investigating further.
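If you want the underlying numbers rather than just the plot, the leverages and Cook’s distances can be extracted directly (shown here for the corrupted fit above):

# Leverage (hat values) and Cook's distance for each observation
lev <- hatvalues(lm.corrupted)
cook <- cooks.distance(lm.corrupted)

# Flag observations with large Cook's distance; 4/n is one common
# rule of thumb (a convention, not a hard cutoff)
which(cook > 4 / length(cook))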
All of the examples above were generated by considering the regression of a single outcome variable on a single covariate. In this case, we could’ve diagnosed most of the violations of model assumptions just by looking at the x-y scatterplot. The reason for using diagnostic plots is that most regressions we run aren’t so simple. Most regressions use many variables (tens or hundreds of variables), and in those cases there isn’t a good way of visualizing all of the data. Residuals, fitted values and leverage are all quantities that can be computed and plotted regardless of how many variables are included in the model. Thus diagnostics such as the Residual vs. Fitted plot, Normal QQ plot and Residual vs. Leverage plot can help us even when we have complicated models.
A natural way to extend simple linear regression is to add more predictors. For each observation \(i\), consider the following model specification:
\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2}+...+\beta_px_{ip}+\epsilon_i \]
Where \(\beta_k\) is the \(k\)-th coefficient and \(x_{ik}\) is the \(i\)-th observation of the \(k\)-th predictor variable.
We can then again represent the loss function as
\[ RSS = \sum_{i=1}^n ((\beta_0 + \beta_1x_{i1}+\cdots+\beta_px_{ip}) - y_i)^2 \]
Alternatively, we can use a matrix representation. Let
\[ y=\begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix} \quad X=\begin{bmatrix} 1&x_{11}&x_{12}&\cdots&x_{1p}\\ 1&x_{21}&x_{22}&\cdots&x_{2p}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&x_{n2}&\cdots&x_{np}\\ \end{bmatrix} \quad b=\begin{bmatrix}\beta_0\\\beta_1\\\beta_2\\\vdots\\\beta_p\end{bmatrix} \quad e=\begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix} \]
Then we can represent our model equation as
\[ y=Xb+e \]
We can then represent our loss function as
\[\begin{align*} e'e &= (y-Xb)'(y-Xb) \\ &=y'y-b'X'y-y'Xb+b'X'Xb \\ &=y'y-2b'X'y+b'X'Xb \\ \end{align*}\]
where \(e'\) denotes the transpose of \(e\). Then we take the derivative of this with respect to \(b\) and set it to 0 again to solve for \(\hat{b}\):
\[ \frac{\partial e'e}{\partial b}\bigg|_{b=\hat{b}}=-2X'y+2X'X\hat{b}=0 \]
We can rearrange to become
\[ (X'X)\hat{b}=X'y \]
Solving for \(\hat{b}\) gives us
\[ \hat{b} = (X'X)^{-1}X'y \]
You can also derive the variance of the coefficients in much the same way (using \(y = Xb + e\) and noting that \(Xb\) is non-random, so only \(e\) contributes variance):
\[\begin{align*} \operatorname{Var}(\hat{b}) &=\operatorname{Var}((X'X)^{-1}X'y)\\ &=\operatorname{Var}((X'X)^{-1}X'e)\\ &=((X'X)^{-1}X')\cdot\operatorname{Var}(e)\cdot((X'X)^{-1}X')'\\ &=\sigma^2((X'X)^{-1}X')\cdot I\cdot((X'X)^{-1}X')'\\ &=\sigma^2((X'X)^{-1}X')((X'X)^{-1}X')'\\ &=\sigma^2(X'X)^{-1}(X'X)(X'X)^{-1}\\ &=\sigma^2(X'X)^{-1} \end{align*}\]
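As a sanity check, the matrix formulas can be verified numerically in R (simulated data; the names and numbers are illustrative only):

# Simulated data with two predictors, for illustration only
set.seed(3)
n.mat <- 100
x1 <- rnorm(n.mat)
x2 <- rnorm(n.mat)
y.mat <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n.mat)

# Design matrix with a leading column of 1s for the intercept
X <- cbind(1, x1, x2)

# Closed-form OLS: b.hat = (X'X)^{-1} X'y
b.hat <- solve(t(X) %*% X) %*% t(X) %*% y.mat

# Compare with lm(), including the coefficient covariance sigma^2 (X'X)^{-1}
fit.mat <- lm(y.mat ~ x1 + x2)
cbind(b.hat, coef(fit.mat))           # the two columns should match
sigma(fit.mat)^2 * solve(t(X) %*% X)  # should match vcov(fit.mat)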
This process is exactly the same as one-predictor regression, except that we’re now fitting a hyperplane through our data.
Other examples of interaction plots: here and here
Example: a logistic regression on the UCLA graduate admissions dataset.
admit = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admit$rank <- factor(admit$rank)
head(admit)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
summary(admit)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 1: 61
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 2:151
## Median :0.0000 Median :580.0 Median :3.395 3:121
## Mean :0.3175 Mean :587.7 Mean :3.390 4: 67
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :1.0000 Max. :800.0 Max. :4.000
logit.admit = glm(admit~gre+gpa+rank, data=admit, family="binomial")
summary(logit.admit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = admit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
confint(logit.admit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre 0.0001375921 0.004435874
## gpa 0.1602959439 1.464142727
## rank2 -1.3008888002 -0.056745722
## rank3 -2.0276713127 -0.670372346
## rank4 -2.4000265384 -0.753542605
# by default, gives you type="link"
pred.link = predict(logit.admit, newdata=data.frame(gre=800, gpa=4, rank="1"))
pred.link
## 1
## 1.037712
# to give response, use type="response"
predict(logit.admit, newdata=data.frame(gre=800, gpa=4, rank="1"), type="response")
## 1
## 0.7384082
# relationship between two
1/(1/exp(pred.link)+1)
## 1
## 0.7384082
# predict completely manually
coef(logit.admit)
## (Intercept) gre gpa rank2 rank3 rank4
## -3.989979073 0.002264426 0.804037549 -0.675442928 -1.340203916 -1.551463677
1/(1/exp(sum(coef(logit.admit) * c(1,800,4,0,0,0)))+1)
## [1] 0.7384082
# another example
predict(logit.admit, newdata=data.frame(gre=750, gpa=3.7, rank="2"), type="response")
## 1
## 0.5019591
1/(1/exp(sum(coef(logit.admit) * c(1,750,3.7,1,0,0)))+1)
## [1] 0.5019591