Relevant to our discussion of exploratory data analysis on Friday: [XKCD comic]

Problem 1: PCA and clustering

In lecture, we saw that PCA and \(k\)-means are both useful ways to identify structure in data. Unsurprisingly, a common workflow is to apply PCA to reduce the dimension of a data set, and then apply \(k\)-means to the result to identify clusters.

Let’s see that in action on the iris data set that we played around with in lecture.

Part a: Reading and plotting the data

Following the code from lecture, import the iris data set with data(iris) and create a pairs plot. Hint: it might be easiest to separate the first four columns of the data frame from the fifth; recall that the first four columns are numeric, and the last one contains the species labels.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# TODO: code goes here.

# Create a pairs plot with the points colored according to species, like in the lecture code.
# You have permission to simply copy the code from lecture, if you wish.
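
If you want a starting point, here is a minimal sketch using base R's pairs function. The object names (iris.meas, iris.species) and the color choices are mine, not from the lecture code, so adapt as needed.

# Separate the numeric measurements (columns 1-4) from the species labels (column 5).
iris.meas <- iris[, 1:4]
iris.species <- iris[, 5]

# Pairs plot of the four numeric variables, with points colored by species.
pairs(iris.meas, col = c("red", "green3", "blue")[iris.species], pch = 19)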

Part b: Applying PCA

Examining the pairs plot, try to predict what the principal directions will be. Here’s one that stands out to me: look at the relationship between the Petal.Length and Petal.Width variables. Anything else?


TODO: write any observations/predictions here.


Now, apply PCA to the data. Again, you may copy-paste code from the lecture notes if you like.

# TODO: code goes here.
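
One possible sketch (not necessarily identical to the lecture code), reusing the iris.meas object from the sketch in Part a: prcomp computes the principal components, and whether to standardize the variables first is a choice, so match whatever the lecture code did.

# PCA on the four numeric columns.
# scale. = TRUE standardizes each variable before computing the components;
# drop it if the lecture code worked with the raw measurements.
iris.pca <- prcomp(iris.meas, scale. = TRUE)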

Examine the summary of PCA (in particular, the loadings). Do they agree with your predictions? Discuss. You may also find it helpful to look at the biplot of the data.
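
For example, continuing with the (assumed) iris.pca object from the sketch above:

summary(iris.pca)   # proportion of variance explained by each component
iris.pca$rotation   # the loadings, i.e., the principal component directions
biplot(iris.pca)    # observations and loadings plotted in the first two components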

Part c: Projecting the data

We have applied PCA to our data in R, and we got back four principal component directions. If we have \(d\)-dimensional data, then we can only ever get \(d\) principal components, but the whole point of PCA is that we can often represent the data quite well with fewer than \(d\) principal components.

This is an example of data compression (indeed, dimensionality reduction is a kind of data compression). We have data that is high-dimensional and thus takes up a lot of memory, so it is expensive to store, even more expensive to process with our ML algorithm of choice, and even more expensive than that to send over the internet.

One solution is to… well, throw out some dimensions! PCA shows us how to throw away dimensions in the “best” way (i.e., by minimizing the reconstruction error, as we discussed in lecture).

So let’s suppose (again, this is a silly example, but just go with it!) that we can only afford to store two dimensions rather than the four dimensions in our original iris data.

Above, we computed the principal components of the data, and to minimize the reconstruction error, we want to keep the top two components (i.e., the first two columns).

The x attribute of the PCA output stores the actual PCA projections of the data.

Extract the first two dimensions of this representation, corresponding to the first two principal components. That is, extract what the two-dimensional PCA representation of our data would be. Plot this two-dimensional representation, coloring the points in your plot according to the species labels.

Hint: the \(k\)-th column of the x attribute corresponds to the projection of the data onto the \(k\)-th principal component direction. So you want to extract the first two columns of the x attribute.

# TODO: code goes here.
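
A minimal sketch, again assuming the iris.pca and iris.species objects defined in the earlier sketches:

# The first two columns of the x attribute are the projections onto PC1 and PC2.
iris.pca2d <- iris.pca$x[, 1:2]

# Plot the two-dimensional representation, coloring points by species.
plot(iris.pca2d, col = c("red", "green3", "blue")[iris.species],
     pch = 19, xlab = "PC1", ylab = "PC2")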

Part d: Clustering after PCA?

A common pipeline is to apply PCA to a high-dimensional data set to get something that is easier to work with and/or visualize, and then apply our clustering method of choice to the PCA outputs to identify clusters.

The intuition is that once PCA has identified the dimensions that carry most of the information, clustering will be able to identify the cluster structure present in those dimensions.

Note: it is possible that at the time you are doing this discussion section, we have not yet discussed clustering in class (depending on how quickly we get through the material!). For now, it suffices to think of clustering as a way of assigning a set of observations to a set of \(k\) different groups. Ideally, observations assigned to the same cluster (i.e., group) should tend to be more similar to one another than they are to observations assigned to other clusters.

Examine your plot created in Part c above. Do you see any obvious cluster structure? That is, are there obvious groups of observations that tend to “clump together”?

How does the structure you see here compare with that observed in the data as a whole (refer back to your pairs plot created in Part a)? Discuss.


TODO: any observations, here.


Part e (optional): Clustering with \(k\)-means

If we have covered \(k\)-means in lecture, or if you are familiar with it from elsewhere, try applying \(k\)-means to the full data and to the two-dimensional PCA representation constructed in Part c.

How do the resulting clusterings compare?

By inspection, does it look like one or the other is doing a better job of recovering the true species labels?
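
If you would like a starting point, here is one possible sketch using the built-in kmeans function with \(k = 3\) (one cluster per species). It assumes the iris.meas and iris.pca2d objects from the earlier sketches; note that the cluster labels returned by kmeans are arbitrary, so compare memberships (e.g., via a table) rather than raw label values.

# k-means uses random initialization, so fix a seed for reproducibility.
set.seed(1)
km.full <- kmeans(iris.meas, centers = 3, nstart = 20)    # k-means on the full 4-dimensional data
km.pca2d <- kmeans(iris.pca2d, centers = 3, nstart = 20)  # k-means on the 2-dimensional PCA representation

# Cross-tabulate each clustering against the true species labels.
table(km.full$cluster, iris$Species)
table(km.pca2d$cluster, iris$Species)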

There are actually quantitative ways to assess how well two clusterings (e.g., the true species labels and our \(k\)-means clustering) agree. For example, there is the Adjusted Rand Index (ARI; see Wikipedia or the mclust package in R).

Unfortunately, giving this a thorough discussion would take us rather far afield. If time allows, we will come back to it later in the semester. Meanwhile, if you’re curious, you can try installing the mclust package and using its adjustedRandIndex function to measure how well your clustering captured the true cluster structure.
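
For example, assuming the km.full and km.pca2d objects from the sketch in Part e, something along these lines should work once mclust is installed:

# install.packages("mclust")  # uncomment if mclust is not yet installed
library(mclust)
adjustedRandIndex(km.full$cluster, iris$Species)   # ARI: full-data clustering vs. true species
adjustedRandIndex(km.pca2d$cluster, iris$Species)  # ARI: PCA-based clustering vs. true species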