These notes on prediction are based on Karl Rohe’s notes.

Prediction

In a prediction problem, you are given data pairs \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\) and you want to use \(X_i\) to predict \(Y_i\). We typically imagine \(X_i\) as containing several values (i.e. it is a “vector”).

There are two types of prediction problems: continuous \(Y_i\) and discrete \(Y_i\). For example, you might want to predict tomorrow’s price for asset \(i\) using data that is available today. So, you develop a historical training set, where the information on asset \(i\) from one day is contained in \(X_i\) and that asset’s price for the next day is contained in \(Y_i\). Here, the stock price is continuous.

Alternatively, you might only be interested in knowing whether you should buy, sell, or hold the stock. So, you develop a historical training set where \(X_i\) contains the information that is available on one day. Then, you develop “labels” \(Y_i\) using data from the next day that say whether you should have bought, sold, or held the asset. Here, the label (buy, sell, hold) is discrete.

We will often refer to problems with continuous outcomes as “regression” and problems with discrete outcomes as “classification”.

Alternatively, perhaps we want to make predictions about the 2020 election. You could try to predict who is going to win (classification) or the number of delegates/votes that the Republicans receive (regression).

In the cases above, there are two natural versions of the same problem (one regression and one classification). However, many classification problems do not have an analogous regression problem. For example, in the handwritten digit example in Chapter 1 of ISLR, \(X_i\) is an image of a handwritten digit and \(Y_i\) is a label that says which digit (0, 1, 2, …, or 9) it is.

We are going to imagine two broad approaches to regression and classification.

  1. Model-based approaches parameterize the distribution of \(Y_i\) given \(X_i\). That is, we imagine \(Y_i\) being a random variable that follows some distribution, and that distribution is somehow parameterized by the variables in \(X_i\) (see the example just after this list).
  2. Black-box approaches are defined algorithmically.
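
For instance, the simple linear regression model we fit below is model-based: we assume

\[ Y_i \mid X_i \sim N(\beta_0 + \beta_1 X_i, \, \sigma^2), \]

so the distribution of \(Y_i\) is parameterized by \(X_i\) through the coefficients \(\beta_0\) and \(\beta_1\). A black-box method, by contrast, is specified only by the algorithm that produces its predictions.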

Chapter 2 in ISLR provides a broad overview of prediction. In the previous weeks of this course, Monte Carlo provided the basic computational tool; we were always working to state the problem as something that we could solve with Monte Carlo. Now, the basic computational tool is numerical optimization. We will not write code to do optimization ourselves. Instead, we will see optimization problems multiple times; they are often used to define our various techniques.
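
For example, the simple linear regression fit we compute below is defined by the least squares optimization problem

\[ (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n \left( Y_i - \beta_0 - \beta_1 X_i \right)^2, \]

which lm() solves for us: we state the problem, and the software does the optimizing.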

Regression

Example: Simple Linear Regression (SLR)

Demo - Lead poisoning and violent crimes

You are encouraged to follow along here with the demo! The data below comes from real sources.

In the 1920s, Thomas Midgley Jr. discovered that adding tetraethyllead to gasoline decreased engine knocking (i.e., fuel igniting incorrectly in the engine, which may damage it). He won the prestigious 1923 Nichols Medal for his discovery. There were other, safer additives, but because he and General Motors held a patent on tetraethyllead, they marketed it as the best option.

Later, in the 1950s to 70s, people began to realize that the increased levels of lead in the atmosphere, due to all the leaded gasoline being burned, were causing widespread lead poisoning, with symptoms ranging from depression, loss of appetite, and amnesia to anemia, insomnia, slurred speech, and cognitive impairment such as decreased concentration and loss of short-term memory. In the 1980s, its usage was slowly phased out 1.

Later, in the late 1990s to early 2000s, researchers also realized that levels of lead in the atmosphere correlated strongly with rates of violent crime later in life 2 3. This study was first conducted in the US, but it was soon replicated in other countries, and the same results were found again and again.

Let’s look at a dataset that contains atmospheric lead levels and aggravated assault rates for several cities in the US, and see if we can build a simple linear regression model to effectively explain the trend and make predictions. The data we are using comes from.

For simplicity, let’s focus on the city of Atlanta. First, we plot the data. The variables here are metric tons of lead in the atmosphere (independent/explanatory) and the aggravated assault rate per million 22 years later (dependent/response).

library(tidyverse)

lead = read_csv("https://raw.githubusercontent.com/kdlevin-uwstat/STAT340-Fall2021/93c315a378f3fe421e7e2217f9b039471d313741/lecs/06/lead.csv")

head(lead)
## # A tibble: 6 x 3
##   city    air.pb.metric.tons aggr.assault.per.million
##   <chr>                <dbl>                    <dbl>
## 1 Atlanta                421                     1029
## 2 Atlanta                429                      937
## 3 Atlanta                444                      887
## 4 Atlanta                457                      533
## 5 Atlanta                461                     1012
## 6 Atlanta                454                      848
atlanta_lead = lead %>% filter(city == "Atlanta")

ggplot(atlanta_lead,aes(x=air.pb.metric.tons, y=aggr.assault.per.million)) + 
  geom_point() + labs(x="Lead in air (metric tons)",
                      y="Aggravated assault rate (per million)",
                      title="Violent crime v. atmospheric lead (22 year lag)")

The data is a little wider at one end than the other, but the trend does appear to be pretty linear. We can use the geom_smooth() function to get ggplot to plot the line of best fit for us before we fit it manually ourselves.

ggplot(atlanta_lead,aes(x=air.pb.metric.tons,y=aggr.assault.per.million)) + 
  geom_point() + geom_smooth(method="lm", formula="y~x", se=F) + 
  labs(x="Lead in air (metric tons)",
       y="Aggravated assault rate (per million)",
       title="Violent crime v. atmospheric lead (22 year lag)")

To build a linear model, the syntax is as simple as lm(y~x, data=df). Running summary() on the model object gives us a variety of useful summary statistics and other information about the model.

atlanta_lead_lm = lm(aggr.assault.per.million ~ air.pb.metric.tons, data=atlanta_lead)
summary(atlanta_lead_lm)
## 
## Call:
## lm(formula = aggr.assault.per.million ~ air.pb.metric.tons, data = atlanta_lead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -356.36  -84.55    6.89  122.93  382.88 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        107.94276   80.46409   1.342    0.189    
## air.pb.metric.tons   1.40375    0.08112  17.305   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 180.6 on 34 degrees of freedom
## Multiple R-squared:  0.898,  Adjusted R-squared:  0.895 
## F-statistic: 299.4 on 1 and 34 DF,  p-value: < 2.2e-16

You can also access key properties of the model using certain built-in functions:

# gets fitted y-values (points on line of best fit)
fitted(atlanta_lead_lm)
##         1         2         3         4         5         6         7         8 
##  698.9200  710.1499  731.2061  749.4548  755.0698  745.2436  787.3560  836.4871 
##         9        10        11        12        13        14        15        16 
##  849.1208  990.8992 1038.6266 1096.1802 1118.6401 1169.1750 1183.2125 1187.4237 
##        17        18        19        20        21        22        23        24 
## 1194.4424 1197.2499 1302.5309 1351.6620 1414.8306 1529.9378 1597.3176 1689.9649 
##        25        26        27        28        29        30        31        32 
## 1757.3447 1782.6122 1852.7995 1879.4707 1991.7704 2000.1929 2101.2626 2185.4874 
##        33        34        35        36 
## 2209.3511 2231.8110 2217.7735 2236.0222
# gets residuals (difference between actual y and fitted y)
resids = residuals(atlanta_lead_lm)    # can also use resid()
resids
##           1           2           3           4           5           6 
##  330.080025  226.850054  155.793859 -216.454844  256.930171  102.756395 
##           7           8           9          10          11          12 
## -356.355996 -149.487118  382.879164 -273.899218 -268.626594 -280.180195 
##          13          14          15          16          17          18 
##  175.359863 -294.175006   23.787530  109.576291  -97.442440  -80.249933 
##          19          20          21          22          23          24 
##   -8.530910   45.337967  -43.830619  116.062179  -55.317646  -70.964906 
##          25          26          27          28          29          30 
## -129.344731   14.387834  -58.799484  143.529335  148.229626    9.807148 
##          31          32          33          34          35          36 
##  199.737410  -64.487372   16.648940   14.188998  -27.773538    3.977759
# gets coefficients of model
coefs = coefficients(atlanta_lead_lm)  # can also use coef()
coefs
##        (Intercept) air.pb.metric.tons 
##         107.942757           1.403746

We can plot our model just to make sure it looks correct and appropriate.

ggplot(atlanta_lead,aes(x=air.pb.metric.tons,y=aggr.assault.per.million)) + 
  geom_point() + 
  geom_abline(slope=coefs[2], intercept=coefs[1], color="red") + 
  labs(x="Lead in air (metric tons)",
       y="Aggravated assault rate (per million)",
       title="Violent crime v. atmospheric lead (22 year lag)")

A good way of checking whether our model is appropriate is to examine the residuals. Remember, we want to check that the residuals look normal, have mean 0, and have constant variance. Most of these look pretty good here; there’s some change in the variance across the dataset, but it’s not too large.

# residuals vs. fitted and normal Q-Q diagnostic plots, plus a histogram of residuals
par(mfrow=c(1,3))
plot(atlanta_lead_lm, which=1:2, ask=F)
hist(resids)

Note the summary tells us that our slope estimate is significant. Under the hood, this is a \(t\)-test of whether the slope is significantly different from 0; in this case, it is. Note that this is, strictly speaking, correlation, NOT causation. However, it’s a very suspicious correlation and warrants further studies and analysis.
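
To see where that \(t\) value comes from, we can recompute it from the coefficient table returned by summary(); a quick sketch:

# the t value reported by summary() is just Estimate / Std. Error
ctable = coef(summary(atlanta_lead_lm))
ctable["air.pb.metric.tons", "Estimate"] / ctable["air.pb.metric.tons", "Std. Error"]
# ≈ 17.3, matching the t value in the summary output above

We can also get confidence intervals for our coefficients.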

confint(atlanta_lead_lm, level=0.95)
##                         2.5 %     97.5 %
## (Intercept)        -55.579958 271.465472
## air.pb.metric.tons   1.238891   1.568602

At the very least, lead in air is a useful predictor for levels of violent crime 22 years later. For example, suppose tomorrow, a chemical company near Atlanta has an explosion, and some more lead is released into the atmosphere. New measurements of lead are found to be 1300 metric tons. What would you predict the approximate aggravated assault rate in 2043 to be?

predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=1300))
##        1 
## 1932.813
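
There is nothing mysterious here: predict() simply evaluates the fitted line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) at \(x = 1300\), which we can verify with the coefficients we extracted earlier:

# intercept + slope * 1300 reproduces the prediction above
unname(coefs[1] + coefs[2]*1300)
# ≈ 1932.8, matching predict()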

Suppose the lead factory continues to release more lead into the atmosphere, and next year, the levels are measured to be 2000 metric tons. Can we use our model to predict what aggravated assault rates in 2044 might look like?

predict(atlanta_lead_lm, newdata=data.frame(air.pb.metric.tons=2000))
##        1 
## 2915.435

These predictions have to be treated more carefully, since they are outside the range of our data; this is called extrapolation. They may be reliable, but they also may not be. The reliability of a prediction usually decreases the further it is from your data.
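
One way to make this extra uncertainty explicit is to ask predict() for a prediction interval rather than a single number. A sketch (note that the interval computation still assumes the linear model holds at these lead levels, which is exactly what extrapolation calls into question):

# 95% prediction intervals for both scenarios above
predict(atlanta_lead_lm,
        newdata=data.frame(air.pb.metric.tons=c(1300, 2000)),
        interval="prediction", level=0.95)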

Classification example

Example: logistic regression. It is very similar to simple linear regression, but instead of the mean of a continuous outcome, it models the probability of a discrete category. See ISLR Chapter 4.
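
As a preview, here is a minimal sketch using the Atlanta data from above, with a purely illustrative binary label (whether the assault rate exceeded 1500 per million). In R, logistic regression is fit with glm() rather than lm(); because these two variables are so strongly related, the classes are nearly separated and glm() may warn, but the point here is just the syntax:

# illustrative binary label: did the assault rate exceed 1500 per million?
atlanta_lead$high_crime = as.numeric(atlanta_lead$aggr.assault.per.million > 1500)

# logistic regression models P(Y = 1 | X); note family=binomial
crime_glm = glm(high_crime ~ air.pb.metric.tons, data=atlanta_lead, family=binomial)

# predicted probability of a "high crime" year at 1300 metric tons of lead
predict(crime_glm, newdata=data.frame(air.pb.metric.tons=1300), type="response")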


  1. https://archive.org/details/toxictruthscient00denw_0/page/210/mode/2up

  2. https://ir.lawnet.fordham.edu/ulj/vol20/iss3/1

  3. https://doi.org/10.1016/j.envres.2007.02.008