stat 408 - statistical learning clustering€¦ · statistical learning clustering unsupervised...
TRANSCRIPT
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning STAT 408 - Statistical Learning
Clustering
April 3, 2018
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Unsupervised Learning
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Supervised vs. Unsupervised Learning
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Supervised
1
1
1
11
11
11
1
11
1
1
11
1
11
1 1
1
11
1
1
1
11
1
1
1
1
1
11
1
1
1
1
1
1
1
1
1 1
1
1
11
00
0
0
00
0
0
00
0
00
00
0
00
00
0
0
0
0 00
0
0
0
0
000
00
0
00
0
00
0
0
00
0
0
0
00
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Unsupervised - How many clusters?
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Unsupervised
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
k-means clustering
3
3
3
33
33
33
3
33
3
3
33
3
33
3 3
3
33
3
2
2
22
2
2
2
2
2
22
2
2
2
2
2
2
2
2
2 2
2
2
22
11
1
1
11
1
1
11
1
11
11
1
31
11
3
2
1
1 11
1
3
1
3
111
11
1
11
1
11
1
1
11
1
1
1
31
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
k-means clustering
## K-means clustering with 3 clusters of sizes 44, 26, 30#### Cluster means:## [,1] [,2]## 1 0.7693055 0.6026478## 2 0.1329025 0.8018608## 3 0.2844651 0.1830810#### Clustering vector:## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1## [71] 3 2 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1#### Within cluster sum of squares by cluster:## [1] 1.915896 1.175133 1.417926## (between_SS / total_SS = 75.2 %)#### Available components:#### [1] "cluster" "centers" "totss" "withinss"## [5] "tot.withinss" "betweenss" "size" "iter"## [9] "ifault"
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
k-means clustering - code
km <- kmeans(combined, 3)plot(combined,type='n',axes=F, xlab='',ylab='')box()points(combined,pch=as.character(km$cluster),
col=c(rep('dodgerblue',25),rep('forestgreen',25),rep('firebrick4',50)))
draw.circle(.31,-0.1,.335, border='dodgerblue')draw.circle(.79,.65,.3, border='firebrick4')draw.circle(.14,1.05,.3, border='forestgreen')
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Hierarchical clustering
7 22 25 5 16 9 23 2 4 13 18 11 12 19 24 1 15 14 20 27 44 33 72 47 50 43 28 42 36 35 48 34 41 40 38 49 37 45 31 39 26 46 29 30 32 56 88 61 94 52 51 53 55 60 66 59 90 75 58 74 95 93 70 79 69 91 73 87 96 54 84 77 63 81 82 89 83 57 76 97 65 68 85 62 64 71 99 80 100 67 86 92 98 6 10 3 8 17 21 78
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Cluster Dendrogram
hclust (*, "complete")dist(combined)
Hei
ght
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Hierarchical clustering - with 3clusters
1
1
2
11
21
21
2
11
1
1
11
2
11
1 2
1
11
1
3
3
33
3
3
3
3
3
33
3
3
3
3
3
3
3
3
3 3
3
3
33
22
2
2
22
2
2
22
2
22
22
2
22
22
2
3
2
2 22
2
2
2
2
222
22
2
22
2
22
2
2
22
2
2
2
22
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Hierarchical clustering - with 4clusters
1
1
2
11
21
21
2
11
1
1
11
2
11
1 2
1
11
1
3
3
33
3
3
3
3
3
33
3
3
3
3
3
3
3
3
3 3
3
3
33
44
4
2
44
2
4
44
4
22
22
4
22
44
2
3
4
4 42
2
2
4
2
222
22
2
44
2
44
2
4
44
4
2
2
22
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Hierarchical clustering - code
hc <- hclust(dist(combined))plot(hc, hang=-1)plot(combined,type='n',axes=F, xlab='',ylab='')box()points(combined,pch=as.character(cutree(hc,4)),
col=c(rep('dodgerblue',25),rep('forestgreen',25),rep('firebrick4',50)))
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
How to choose the number of clusters?
Given these plots that we have seen, how do we choose theappropriate number of clusters?
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
How to choose the number of clusters?- Scree plot
2 4 6 8 10 12 14
510
15
Number of Clusters
With
in g
roup
s su
m o
f squ
ares
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Scree plot - code
wss <- rep(0,15)for (i in 1:15) {
wss[i] <- sum(kmeans(combined,centers=i)$withinss)}plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Data with more than 2 dimensions
warm-blooded can fly vertebrate endangered have hair
ant No No No No Nobee No Yes No No Yescat Yes No Yes No Yescow Yes No Yes No Yesduc Yes Yes Yes No Noeag Yes Yes Yes Yes Noele Yes No Yes Yes Nofly No Yes No No Nofro No No Yes Yes Nolio Yes No Yes Yes Yesliz No No Yes No Nolob No No No No Norab Yes No Yes No Yesspi No No No No Yeswha Yes No Yes Yes No
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Multidimensional Scaling
ant
bee
catcow
duc
eag
elefly
fro
lio
liz
lob
rab
spi
wha
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
MDS - Code
animals <- cluster::animals
colnames(animals) <- c("warm-blooded","can fly","vertebrate","endangered","live in groups","have hair")
animals.cluster <- animals[,-(5)]animals.cluster <- animals.cluster[-c(4,5,12,16,18),]animals.cluster[10,4] <- 2animals.cluster[14,4] <- 1
d <- dist(animals.cluster)fit <- cmdscale(d, k=2)fit.jitter <- fit + runif(nrow(fit*2),-.15,.15)plot(fit.jitter[,1], fit.jitter[,2], xlab="", ylab="", main="", type="n",axes=F)box()text(fit.jitter[,1], fit.jitter[,2], labels = row.names(animals.cluster), cex=1.3)
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Hierarchical Clustering of Animals
fly ant
lob
bee
spi
liz fro
ele
wha lio ra
b
cat
cow
duc
eag
0.0
0.5
1.0
1.5
2.0
Cluster Dendrogram
hclust (*, "complete")dist(animals.cluster)
Hei
ght
STAT 408 -StatisticalLearning
Clustering
UnsupervisedLearning
Lecture Exercise: Clustering ZooAnimals
Use the dataset create below for the following questions.
zoo.data <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/ZooClean.csv')rownames(zoo.data) <- zoo.data[,1]zoo.data <- zoo.data[,-1]
Use multidimensional scaling to visualize the data in two dimensions.What are two animals that are very similar and two that are very different?Create a hierachical clustering object for this dataset.
Why are a leopard and raccoon clustered together for any cluster size?
Now add colors corresponding to four different clusters to your MDS plot.
Interpret what each of the four clusters correspond to.