Saturday, March 29, 2014

Unsupervised Statistical Learning in R

Principal Components

We will use the USArrests data (which is built into R).

dimnames(USArrests)
## [[1]]
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"       
## 
## [[2]]
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, var)
##   Murder  Assault UrbanPop     Rape 
##    18.97  6945.17   209.52    87.73

We see that Assault has a much larger variance than the other variables. It would dominate the principal components, so we standardize the variables when we perform PCA.
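
As a quick check (not part of the original notes), we can run PCA without scaling and look at the loadings; because of its large variance, Assault dominates the first component:

# PCA on the raw (unscaled) variables -- Assault's variance (~6945)
# dwarfs the others, so it dominates PC1
pca.unscaled = prcomp(USArrests, scale = FALSE)
pca.unscaled$rotation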

pca.out = prcomp(USArrests, scale = TRUE)
pca.out
## Standard deviations:
## [1] 1.5749 0.9949 0.5971 0.4164
## 
## Rotation:
##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
names(pca.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
biplot(pca.out, scale = 0)

[Plot: biplot of the first two principal components of the USArrests data]
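
A small addition, not in the original notes: the proportion of variance explained by each component can be computed from the standard deviations that prcomp returns:

# proportion of variance explained by each principal component
pve = pca.out$sdev^2 / sum(pca.out$sdev^2)
pve
plot(pve, type = "b", xlab = "Principal Component", ylab = "Proportion of Variance Explained")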

K-Means Clustering

K-means works in any dimension, but it is most fun to demonstrate in two, because we can plot pictures. Let's make some data with clusters. We do this by shifting the means of the points around.

set.seed(101)
x = matrix(rnorm(100 * 2), 100, 2)
xmean = matrix(rnorm(8, sd = 4), 4, 2)
which = sample(1:4, 100, replace = TRUE)
x = x + xmean[which, ]
plot(x, col = which, pch = 19)

[Plot: simulated two-dimensional data colored by true cluster assignment]

We know the “true” cluster IDs, but we won't tell that to the kmeans algorithm.

km.out = kmeans(x, 4, nstart = 15)
km.out
## K-means clustering with 4 clusters of sizes 21, 30, 32, 17
## 
## Cluster means:
##      [,1]    [,2]
## 1 -3.1069  1.1213
## 2  1.7226 -0.2585
## 3 -5.5818  3.3685
## 4 -0.6148  4.8861
## 
## Clustering vector:
##   [1] 2 3 3 4 1 1 4 3 2 3 2 1 1 3 1 1 2 3 3 2 2 3 1 3 1 1 2 2 3 1 1 4 3 1 3
##  [36] 3 1 2 2 3 2 2 3 3 1 3 1 3 4 2 1 2 2 4 3 3 2 2 3 2 1 2 3 4 2 4 3 4 4 2
##  [71] 2 4 3 2 3 4 4 2 2 1 2 4 4 3 3 2 3 3 1 2 3 2 4 4 4 2 3 3 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 30.83 54.48 71.98 21.05
##  (between_SS / total_SS =  87.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"
plot(x, col = km.out$cluster, cex = 2, pch = 1, lwd = 2)  # open circles: k-means assignments
points(x, col = which, pch = 19)  # filled dots: true cluster IDs
points(x, col = c(4, 3, 2, 1)[which], pch = 19)  # recolor the true IDs so they line up with the k-means colors

[Plot: k-means assignments (open circles) overlaid on the true cluster colors (filled dots)]
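
Here we know K = 4 because we generated the data. As an aside (not in the original notes), a common way to choose K on real data is to look at how the total within-cluster sum of squares drops as K increases:

# total within-cluster sum of squares for K = 1, ..., 8
wss = sapply(1:8, function(k) kmeans(x, k, nstart = 15)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster sum of squares")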

Hierarchical Clustering

We will use the same data and apply hierarchical clustering.

hc.complete = hclust(dist(x), method = "complete")
plot(hc.complete)

[Plot: dendrogram from complete-linkage clustering]

hc.single = hclust(dist(x), method = "single")
plot(hc.single)

[Plot: dendrogram from single-linkage clustering]

hc.average = hclust(dist(x), method = "average")
plot(hc.average)

[Plot: dendrogram from average-linkage clustering]

Let's compare this with the actual clusters in the data. We will use the function cutree to cut the tree so that it yields 4 clusters. This produces a vector of numbers from 1 to 4, saying which cluster each observation falls in. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and it's a little too complicated for this demonstration.

We can use table to see how well they match:

hc.cut = cutree(hc.complete, 4)
table(hc.cut, which)
##       which
## hc.cut  1  2  3  4
##      1  0  0 30  0
##      2  1 31  0  2
##      3 17  0  0  0
##      4  0  0  0 19
table(hc.cut, km.out$cluster)
##       
## hc.cut  1  2  3  4
##      1  0 30  0  0
##      2  2  0 32  0
##      3  0  0  0 17
##      4 19  0  0  0
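
As an aside (not in the original notes), the same kind of table can be made for the other linkages; single linkage in particular tends to produce unbalanced clusters:

table(cutree(hc.single, 4), which)   # single linkage often gives a few large clusters plus singletons
table(cutree(hc.average, 4), which)  # average linkage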

Or we can use the true group memberships as labels for the leaves of the dendrogram:

plot(hc.complete, labels = which)

[Plot: complete-linkage dendrogram with leaves labeled by true cluster membership]
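
For the colored dendrogram leaves mentioned earlier, a minimal sketch, assuming the dendextend package is installed (not part of the original demonstration), would look roughly like this:

library(dendextend)                  # assumed to be installed; not used in the original notes
dend = as.dendrogram(hc.complete)
dend = color_branches(dend, k = 4)   # color branches by the 4-cluster cut
dend = color_labels(dend, k = 4)     # color the leaf labels to match
plot(dend)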


Credit

Please note, this material is extracted from the online Statistical Learning course at Stanford University by Prof. T. Hastie and Prof. R. Tibshirani. It is intended only as a quick reference for R and statistical learning. Please visit the course page for more information and materials.