These notes on prediction are based on Karl Rohe’s.
In a prediction problem, you are given data pairs \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\) and you want to use \(X_i\) to predict \(Y_i\). We typically imagine \(X_i\) as containing several values (i.e. it is a “vector”).
There are two types of prediction problems: continuous \(Y_i\) and discrete \(Y_i\). For example, you might want to predict tomorrow’s price for asset \(i\) using data that is available today. So, you develop a historical training set, where the information on asset \(i\) from one day is contained in \(X_i\) and that asset’s price for the next day is contained in \(Y_i\). Here, the stock price is continuous.
Alternatively, you might only be interested in knowing whether you should buy, sell, or hold the stock. So, you develop a historical training set where \(X_i\) contains the information that is available on one day. Then, you develop “labels” \(Y_i\) using data from the next day that say whether you should have bought, sold, or held the asset. Here, the label (buy, sell, hold) is discrete.
We will often call continuous outcomes “regression” and discrete outcomes “classification”.
Alternatively, perhaps we want to make predictions about the 2020 election. You could try to predict who is going to win (classification) or the number of delegates/votes that the Republicans receive (regression).
In the cases above, there are two natural versions of the same problem (one is regression and one is classification). However, many classification problems do not have an analogous regression problem. For example, in the handwritten digit example in Chapter 1 of ISLR, \(X_i\) is an image of a handwritten digit and \(Y_i\) is a label that says whether the digit is 0, 1, 2, 3, 4,… , or 9.
We are going to imagine two broad approaches to regression and classification.
Chapter 2 in ISLR provides a broad overview of prediction. In the previous weeks of this course, Monte Carlo provided the basic computational tool; we were always working to get the problem stated as something that we could solve with Monte Carlo. Now, the basic computational tool is numerical optimization. We will not write code to do optimization. Instead, we will see optimization problems multiple times; optimization is often used to define our various techniques.
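We won’t write optimization code ourselves, but to illustrate what it means for an optimization problem to define a technique, here is a minimal sketch on simulated data (not part of the original notes): the least-squares line is simply the intercept and slope that minimize the sum of squared errors, and a general-purpose optimizer recovers essentially the same coefficients that lm() reports.
# illustration only: least squares as an optimization problem (simulated data)
set.seed(1)
x = runif(50, 0, 10)
y = 2 + 3*x + rnorm(50)
# sum of squared errors for a candidate (intercept, slope)
sse = function(beta) sum((y - (beta[1] + beta[2]*x))^2)
# minimize the SSE numerically, then compare with lm()'s coefficients
optim(c(0, 0), sse)$par
coef(lm(y ~ x))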
You are encouraged to follow along here with the demo! The data below comes from real sources.
In the 1920s, Thomas Midgley Jr. discovered that adding tetraethyllead to gasoline decreased engine knocking (i.e., fuel igniting incorrectly in the engine, which may damage it). He won the prestigious 1923 Nichols Medal for his discovery. There were other, safer additives, but because he and General Motors owned a patent on tetraethyllead, they marketed it as the best option.
Later, in the 1950s to ’70s, people began to realize that the increased levels of lead in the atmosphere, due to all the leaded gasoline being burned, were causing widespread lead poisoning, with symptoms ranging from depression, loss of appetite, and amnesia to anemia, insomnia, slurred speech, and cognitive impairment such as decreased concentration and loss of short-term memory. In the 1980s its usage was slowly phased out [1].
Later, in the late 1990s to early 2000s, researchers also realized that levels of lead in the atmosphere correlated strongly with rates of violent crime later in life [2, 3]. This study was first conducted in the US, but it was soon replicated in other countries, and the same results were found over and over again.
Let’s look at a dataset that contains atmospheric lead content levels and aggravated assault rates for several cities in the US and see if we can build a simple linear regression model to effectively explain the trend and make predictions. The data we are using comes from.
For simplicity, let’s focus on the city of Atlanta. First, we plot the data. The variables here are metric tons of lead in the atmosphere (independent/explanatory) and the aggravated assault rate per million 22 years later (dependent/response).
library(tidyverse)
lead = read_csv("https://raw.githubusercontent.com/kdlevin-uwstat/STAT340-Fall2021/93c315a378f3fe421e7e2217f9b039471d313741/lecs/06/lead.csv")
head(lead)
## # A tibble: 6 x 3
## city air.pb.metric.tons aggr.assault.per.million
## <chr> <dbl> <dbl>
## 1 Atlanta 421 1029
## 2 Atlanta 429 937
## 3 Atlanta 444 887
## 4 Atlanta 457 533
## 5 Atlanta 461 1012
## 6 Atlanta 454 848
atlanta_lead = lead %>% filter(city == "Atlanta")
ggplot(atlanta_lead,aes(x=air.pb.metric.tons, y=aggr.assault.per.million)) +
geom_point() + labs(x="Lead in air (metric tons)",
y="Aggravated assault rate (per million)",
title="Violent crime v. atmospheric lead (22 year lag)")
The data is a little wider at one end than the other, but the trend does appear to be pretty linear. We can use the geom_smooth() function to get ggplot to plot the line of best fit for us before we fit it manually ourselves.
ggplot(atlanta_lead,aes(x=air.pb.metric.tons,y=aggr.assault.per.million)) +
geom_point() + geom_smooth(method="lm", formula="y~x", se=F) +
labs(x="Lead in air (metric tons)",
y="Aggravated assault rate (per million)",
title="Violent crime v. atmospheric lead (22 year lag)")
To build a linear model, the syntax is as simple as lm(y~x, data=df). Running summary() on the model object gives us a variety of useful summary statistics and other information about the model.
atlanta_lead_lm = lm(aggr.assault.per.million ~ air.pb.metric.tons, data=atlanta_lead)
summary(atlanta_lead_lm)
##
## Call:
## lm(formula = aggr.assault.per.million ~ air.pb.metric.tons, data = atlanta_lead)
##
## Residuals:
## Min 1Q Median 3Q Max
## -356.36 -84.55 6.89 122.93 382.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.94276 80.46409 1.342 0.189
## air.pb.metric.tons 1.40375 0.08112 17.305 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 180.6 on 34 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.895
## F-statistic: 299.4 on 1 and 34 DF, p-value: < 2.2e-16
You can also access key properties of the model using certain built-in functions.
# gets fitted y-values (points on line of best fit)
fitted(atlanta_lead_lm)
## 1 2 3 4 5 6 7 8
## 698.9200 710.1499 731.2061 749.4548 755.0698 745.2436 787.3560 836.4871
## 9 10 11 12 13 14 15 16
## 849.1208 990.8992 1038.6266 1096.1802 1118.6401 1169.1750 1183.2125 1187.4237
## 17 18 19 20 21 22 23 24
## 1194.4424 1197.2499 1302.5309 1351.6620 1414.8306 1529.9378 1597.3176 1689.9649
## 25 26 27 28 29 30 31 32
## 1757.3447 1782.6122 1852.7995 1879.4707 1991.7704 2000.1929 2101.2626 2185.4874
## 33 34 35 36
## 2209.3511 2231.8110 2217.7735 2236.0222
# gets residuals (difference between actual y and fitted y)
resids = residuals(atlanta_lead_lm) # can also use resid()
resids
## 1 2 3 4 5 6
## 330.080025 226.850054 155.793859 -216.454844 256.930171 102.756395
## 7 8 9 10 11 12
## -356.355996 -149.487118 382.879164 -273.899218 -268.626594 -280.180195
## 13 14 15 16 17 18
## 175.359863 -294.175006 23.787530 109.576291 -97.442440 -80.249933
## 19 20 21 22 23 24
## -8.530910 45.337967 -43.830619 116.062179 -55.317646 -70.964906
## 25 26 27 28 29 30
## -129.344731 14.387834 -58.799484 143.529335 148.229626 9.807148
## 31 32 33 34 35 36
## 199.737410 -64.487372 16.648940 14.188998 -27.773538 3.977759
# gets coefficients of model
coefs = coefficients(atlanta_lead_lm) # can also use coef()
coefs
## (Intercept) air.pb.metric.tons
## 107.942757 1.403746
We can plot our model just to make sure it looks correct and appropriate.
ggplot(atlanta_lead,aes(x=air.pb.metric.tons,y=aggr.assault.per.million)) +
geom_point() +
geom_abline(slope=coefs[2], intercept=coefs[1], color="red") +
labs(x="Lead in air (metric tons)",
y="Aggravated assault rate (per million)",
title="Violent crime v. atmospheric lead (22 year lag)")
A good way of seeing if our model is appropriate is by examining the residuals. Remember, we want to check that the residuals look normal, have mean 0, and have constant variance. Most of these look pretty good; there’s some change in the variance across the dataset, but it’s not too large.
par(mfrow=c(1,3))
plot(atlanta_lead_lm,which=1:2,ask=F)
hist(resids)
Note that the summary tells us our slope estimate is significant. Under the hood, this is a \(t\)-test of whether the slope is significantly different from 0; in this case it is. Note that this is, strictly speaking, correlation, NOT causation. However, it’s a very suspicious correlation and warrants further study and analysis. We can also get confidence intervals for our coefficients.
confint(atlanta_lead_lm, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -55.579958 271.465472
## air.pb.metric.tons 1.238891 1.568602
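As a quick sanity check (a small addition, not in the original demo), both the slope’s \(t\)-test and these intervals can be reproduced by hand from the coefficient table: the \(t\) statistic is the estimate divided by its standard error, and the 95% interval is the estimate plus or minus a \(t\) critical value times the standard error, using the 34 residual degrees of freedom reported in the summary.
# reproduce the slope t-test and confidence interval by hand
ctab = coef(summary(atlanta_lead_lm))      # table of estimates and standard errors
est = ctab["air.pb.metric.tons", "Estimate"]
se = ctab["air.pb.metric.tons", "Std. Error"]
est / se                                   # t statistic (compare to "t value" above)
2 * pt(-abs(est / se), df = 34)            # two-sided p-value
est + c(-1, 1) * qt(0.975, df = 34) * se   # 95% CI, matching confint() above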
At the very least, lead in air is a useful predictor for levels of violent crime 22 years later. For example, suppose tomorrow, a chemical company near Atlanta has an explosion, and some more lead is released into the atmosphere. New measurements of lead are found to be 1300 metric tons. What would you predict the approximate aggravated assault rate in 2043 to be?
predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=1300))
## 1
## 1932.813
Suppose the lead factory continues to release more lead into the atmosphere, and next year, the levels are measured to be 2000 metric tons. Can we use our model to predict what aggravated assault rates in 2044 might look like?
predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=2000))
## 1
## 2915.435
These predictions have to be treated more carefully, since they are outside the range of our data. They may be reliable, but they may not be. The reliability of a prediction usually decreases the further away it is from your data.
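One way to see this (a small addition to the demo) is to ask predict() for prediction intervals via the interval argument: the interval for 2000 metric tons should come out noticeably wider than the one for 1300, reflecting the extra uncertainty far from the observed data.
# prediction intervals widen as we move away from the observed data
predict(atlanta_lead_lm,
        newdata=data.frame(air.pb.metric.tons=c(1300, 2000)),
        interval="prediction", level=0.95)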
Example: logistic regression. This is very similar to simple linear regression, but applied to predicting categories. See ISLR Chapter 4.
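To give a flavor of what that looks like in R, here is a minimal sketch (not part of the original notes): turn the assault rate into a binary label, say whether it exceeds 1500 per million (an arbitrary threshold chosen only for illustration), and fit a logistic regression with glm() and family=binomial. With such a strong trend and so few points, glm() may warn about fitted probabilities near 0 or 1, which is itself informative.
# illustration only: logistic regression on a made-up binary label
atlanta_lead_binary = atlanta_lead %>%
  mutate(high.assault = as.integer(aggr.assault.per.million > 1500))
assault_glm = glm(high.assault ~ air.pb.metric.tons,
                  family=binomial, data=atlanta_lead_binary)
# predicted probability of a "high assault" observation at 1300 metric tons of lead
predict(assault_glm, newdata=data.frame(air.pb.metric.tons=1300), type="response")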