Let’s get some practice working with logistic regression by exploring the famous Pima diabetes data set. This data set consists of biometric data collected from women of Pima Indian descent living in Arizona, including whether or not each woman had diabetes. The data set is included in the MASS
library (this library should already be installed in your R setup, but if not, you should install it):
library(MASS);
head(Pima.te)
## npreg glu bp skin bmi ped age type
## 1 6 148 72 35 33.6 0.627 50 Yes
## 2 1 85 66 29 26.6 0.351 31 No
## 3 1 89 66 23 28.1 0.167 21 No
## 4 3 78 50 32 31.0 0.248 26 Yes
## 5 2 197 70 45 30.5 0.158 53 Yes
## 6 5 166 72 19 25.8 0.587 51 Yes
Important note: the MASS
library actually includes three different versions of this data set. We will work with the Pima.te
data frame and the Pima.tr
data frame in this discussion section. We will use Pima.te
to fit our models and Pima.tr
to evaluate them.
Take a few minutes to read the documentation surrounding this data set, so that you understand the sampled population, the meaning of the different variables, etc.
?Pima.te
Our goal in this discussion section is to fit logistic regression to this data set so that we can predict, given biometric data, whether or not a patient is likely to have diabetes. The type
column of the Pima.te
data frame (and the Pima.tr
data frame ) encodes whether a patient has diabetes (coded as Yes
) or not (coded as No
), but logistic regression requires that we have binary outcomes. Let’s create a new column of this data frame called diabetes
that takes the value 1
if type
is Yes
and takes the value 0
otherwise.
Pima.te$diabetes <- ifelse( Pima.te$type=='Yes', 1, 0);
head(Pima.te)
## npreg glu bp skin bmi ped age type diabetes
## 1 6 148 72 35 33.6 0.627 50 Yes 1
## 2 1 85 66 29 26.6 0.351 31 No 0
## 3 1 89 66 23 28.1 0.167 21 No 0
## 4 3 78 50 32 31.0 0.248 26 Yes 1
## 5 2 197 70 45 30.5 0.158 53 Yes 1
## 6 5 166 72 19 25.8 0.587 51 Yes 1
And do the same for Pima.tr
data frame.
#TODO: code goes here to add a `diabetes` column to Pima.tr.
Having done that, there remains the question of which variables to include in our model. This is a model selection problem.
One way to see which variables are predictive and which aren’t is to, well, look.
Let’s try plotting some of the above variables, breaking them out by whether or not each patient tested positive for diabetes.
pp <- ggplot(Pima.te, aes(x=as.factor(diabetes), group=diabetes, y=glu)) + geom_boxplot();
pp
This plot certainly suggests that there is a difference in glu
between diabetic and non-diabetic patients. This is in keeping with the known relationship between blood glucose levels and diabetes.
Choose (at least) two more predictor variables from among the predictors in the Pima.te
data set and make boxplots like the one above. Do these variables appear to be predictive or not? Discuss.
# TODO: code goes here.
TODO: discussion/explanation goes here.
Let’s first try fitting a logistic regression using each of the variables npreg
, glu
, bp
, skin
, bmi
, ped
and age
to predict the diabetes
outcome. Then, having fit a model, we’ll evaluate it on another set of patient data.
We’ll give an example for you to work from.
lr_glu <- glm( diabetes ~ 1 + glu, data=Pima.te, family=binomial );
lr_glu
##
## Call: glm(formula = diabetes ~ 1 + glu, family = binomial, data = Pima.te)
##
## Coefficients:
## (Intercept) glu
## -5.94681 0.04242
##
## Degrees of Freedom: 331 Total (i.e. Null); 330 Residual
## Null Deviance: 420.3
## Residual Deviance: 326 AIC: 330
Note the AIC entry above– AIC stands for Akaike Information Criterion. It can be useful in model selection, and we will discuss it in more detail in coming lectures. For now, you can ignore it– just note that it’s included in your model output.
Note the coefficient of glu
above. Is it positive or negative? Is it large (in absolute value)? How might you interpret this?
In the case of linear regression, the coefficients essentially capture the (expected) change in our outcome based on a unit change in the predictor. The story is a bit more complicated in the case of logistic regression. Refering back to our lecture notes on logistic regression, give a brief (two or three sentences should be plenty) description of how we should interpret this coefficient.
TODO: discussion/explanation goes here.
Now, we would like to know how good our model actually is. That is, how well our model does at predicting diabetes in other patients. But if we evalute our model on the data that we just used to train it, well, of course it will do well! This idea is well known in machine learning– you shouldn’t “test” your model on the “train” data.
Lucky for us, there the Pima.tr
data set contains another collection of patient data from the same study. We can use our trained model to predict the diabetes outcome based on the predictors, and that will give us a sense of how well our trained model does at predicting on data it hasn’t seen before.
This idea of applying our model to data it hasn’t seen before is the motivation for cross-validation (e.g., leave-one-out CV and K-fold CV), which we are discussing in class this week. In CV, we separate our data into “training” data and a “validation” set. In the case of the Pima data, we don’t need to use CV, because we have a separate set of patient data just kind of lying around for us to use!
For example, let’s predict the outcome for the first patient:
# type='response' tells R to output p(X), i.e., the (estimated) logistic function
# evaluated on the given data.
# newdata=Pima.tr[1,] tells R to apply our trained model lr_glu
# to the predictors (in this case, just `glu`) in the first row of Pima.tr.
predict(lr_glu, type='response', newdata = Pima.tr[1,] )
## 1
## 0.09123867
Okay, but this outputs a probability. So this says that our model thinks that the probability that this patient has diabetes is about 0.09, which is pretty small! And we can check that indeed this patient doesn’t have diabetes:
Pima.tr[1,]$diabetes
## NULL
So let’s say that for a given patient, if our model outputs a probability larger than \(0.5\), we will predict that the patient has diabetes, and we will predict that she doesn’t, otherwise.
probabilities <- predict(lr_glu, type='response', newdata = Pima.tr );
predictions <- ifelse(probabilities>0.5, 1, 0);
# Now, how many predictions did we get right?
# Count up how many of our predicted diabetes statuses agree with the true labels,
# and divide by the total number of patients to get fraction of correct labels.
sum(predictions==Pima.tr$diabetes)/nrow(Pima.tr)
## [1] 0
Not bad based on a single variable!
_Side note: 74% accuracy isn’t actually as impressive as you might think. Suppose that we just always guess that a patient doesn’t have diabetes. Then our accuracy would be:
sum(Pima.tr$diabetes==0)/nrow(Pima.tr)
## [1] 0
This is an important thing to keep in mind when working on classification and prediction problems! If you have a rare outcome, if can be easy to convince yourself that you are doing a very good job of prediction, when in fact you are performing near chance.
Of course, in this setting, 0.74 and 0.66 are certainly not the same– out model has surely learned something useful, this is just meant as a caution for future work (e.g., in your group projects!).
Pick another predictor variable from the Pima.te
data set and repeat the analysis above– fit a model on Pima.te
and evaluate its prediction accuracy on the Pima.tr
data set.
TODO: code and explanation/discussion go here. ***
Now, just like in multiple linear regression, it seems like that we can do better by including multiple variables in our model. Choose two variables in the Pima.te
data set and fit models for them. Then, fit a third model that predicts an outcome based on an intercept term and both of your variables. So, for example, if you have trained models with formulas diabetes ~ 1 + glu
and diabetes ~ 1 + age
, try training a third model with the formual diabetes ~ 1 + glu + age
. Having trained these three models on the Pima.te
model, evaluate them on the Pima.tr
data. Compare their performances and discuss.
TODO: code and explanation/discussion go here. ***