stat 408 - statistical learning clustering€¦ · statistical learning clustering unsupervised...

21
STAT 408 - Statistical Learning Clustering Unsupervised Learning STAT 408 - Statistical Learning Clustering April 3, 2018

Upload: others

Post on 08-Aug-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning STAT 408 - Statistical Learning

Clustering

April 3, 2018

Page 2: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Unsupervised Learning

Page 3: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Supervised vs. Unsupervised Learning

Page 4: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Supervised

1

1

1

11

11

11

1

11

1

1

11

1

11

1 1

1

11

1

1

1

11

1

1

1

1

1

11

1

1

1

1

1

1

1

1

1 1

1

1

11

00

0

0

00

0

0

00

0

00

00

0

00

00

0

0

0

0 00

0

0

0

0

000

00

0

00

0

00

0

0

00

0

0

0

00

Page 5: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Unsupervised - How many clusters?

Page 6: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Unsupervised

Page 7: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

k-means clustering

3

3

3

33

33

33

3

33

3

3

33

3

33

3 3

3

33

3

2

2

22

2

2

2

2

2

22

2

2

2

2

2

2

2

2

2 2

2

2

22

11

1

1

11

1

1

11

1

11

11

1

31

11

3

2

1

1 11

1

3

1

3

111

11

1

11

1

11

1

1

11

1

1

1

31

Page 8: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

k-means clustering

## K-means clustering with 3 clusters of sizes 44, 26, 30#### Cluster means:## [,1] [,2]## 1 0.7693055 0.6026478## 2 0.1329025 0.8018608## 3 0.2844651 0.1830810#### Clustering vector:## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1## [71] 3 2 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1#### Within cluster sum of squares by cluster:## [1] 1.915896 1.175133 1.417926## (between_SS / total_SS = 75.2 %)#### Available components:#### [1] "cluster" "centers" "totss" "withinss"## [5] "tot.withinss" "betweenss" "size" "iter"## [9] "ifault"

Page 9: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

k-means clustering - code

km <- kmeans(combined, 3)plot(combined,type='n',axes=F, xlab='',ylab='')box()points(combined,pch=as.character(km$cluster),

col=c(rep('dodgerblue',25),rep('forestgreen',25),rep('firebrick4',50)))

draw.circle(.31,-0.1,.335, border='dodgerblue')draw.circle(.79,.65,.3, border='firebrick4')draw.circle(.14,1.05,.3, border='forestgreen')

Page 10: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Hierarchical clustering

7 22 25 5 16 9 23 2 4 13 18 11 12 19 24 1 15 14 20 27 44 33 72 47 50 43 28 42 36 35 48 34 41 40 38 49 37 45 31 39 26 46 29 30 32 56 88 61 94 52 51 53 55 60 66 59 90 75 58 74 95 93 70 79 69 91 73 87 96 54 84 77 63 81 82 89 83 57 76 97 65 68 85 62 64 71 99 80 100 67 86 92 98 6 10 3 8 17 21 78

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Cluster Dendrogram

hclust (*, "complete")dist(combined)

Hei

ght

Page 11: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Hierarchical clustering - with 3clusters

1

1

2

11

21

21

2

11

1

1

11

2

11

1 2

1

11

1

3

3

33

3

3

3

3

3

33

3

3

3

3

3

3

3

3

3 3

3

3

33

22

2

2

22

2

2

22

2

22

22

2

22

22

2

3

2

2 22

2

2

2

2

222

22

2

22

2

22

2

2

22

2

2

2

22

Page 12: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Hierarchical clustering - with 4clusters

1

1

2

11

21

21

2

11

1

1

11

2

11

1 2

1

11

1

3

3

33

3

3

3

3

3

33

3

3

3

3

3

3

3

3

3 3

3

3

33

44

4

2

44

2

4

44

4

22

22

4

22

44

2

3

4

4 42

2

2

4

2

222

22

2

44

2

44

2

4

44

4

2

2

22

Page 13: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Hierarchical clustering - code

hc <- hclust(dist(combined))plot(hc, hang=-1)plot(combined,type='n',axes=F, xlab='',ylab='')box()points(combined,pch=as.character(cutree(hc,4)),

col=c(rep('dodgerblue',25),rep('forestgreen',25),rep('firebrick4',50)))

Page 14: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

How to choose the number of clusters?

Given these plots that we have seen, how do we choose theappropriate number of clusters?

Page 15: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

How to choose the number of clusters?- Scree plot

2 4 6 8 10 12 14

510

15

Number of Clusters

With

in g

roup

s su

m o

f squ

ares

Page 16: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Scree plot - code

wss <- rep(0,15)for (i in 1:15) {

wss[i] <- sum(kmeans(combined,centers=i)$withinss)}plot(1:15, wss, type="b", xlab="Number of Clusters",

ylab="Within groups sum of squares")

Page 17: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Data with more than 2 dimensions

warm-blooded can fly vertebrate endangered have hair

ant No No No No Nobee No Yes No No Yescat Yes No Yes No Yescow Yes No Yes No Yesduc Yes Yes Yes No Noeag Yes Yes Yes Yes Noele Yes No Yes Yes Nofly No Yes No No Nofro No No Yes Yes Nolio Yes No Yes Yes Yesliz No No Yes No Nolob No No No No Norab Yes No Yes No Yesspi No No No No Yeswha Yes No Yes Yes No

Page 18: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Multidimensional Scaling

ant

bee

catcow

duc

eag

elefly

fro

lio

liz

lob

rab

spi

wha

Page 19: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

MDS - Code

animals <- cluster::animals

colnames(animals) <- c("warm-blooded","can fly","vertebrate","endangered","live in groups","have hair")

animals.cluster <- animals[,-(5)]animals.cluster <- animals.cluster[-c(4,5,12,16,18),]animals.cluster[10,4] <- 2animals.cluster[14,4] <- 1

d <- dist(animals.cluster)fit <- cmdscale(d, k=2)fit.jitter <- fit + runif(nrow(fit*2),-.15,.15)plot(fit.jitter[,1], fit.jitter[,2], xlab="", ylab="", main="", type="n",axes=F)box()text(fit.jitter[,1], fit.jitter[,2], labels = row.names(animals.cluster), cex=1.3)

Page 20: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Hierarchical Clustering of Animals

fly ant

lob

bee

spi

liz fro

ele

wha lio ra

b

cat

cow

duc

eag

0.0

0.5

1.0

1.5

2.0

Cluster Dendrogram

hclust (*, "complete")dist(animals.cluster)

Hei

ght

Page 21: STAT 408 - Statistical Learning Clustering€¦ · Statistical Learning Clustering Unsupervised Learning k-means clustering ## K-means clustering with 3 clusters of sizes 44, 26,

STAT 408 -StatisticalLearning

Clustering

UnsupervisedLearning

Lecture Exercise: Clustering ZooAnimals

Use the dataset create below for the following questions.

zoo.data <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/ZooClean.csv')rownames(zoo.data) <- zoo.data[,1]zoo.data <- zoo.data[,-1]

Use multidimensional scaling to visualize the data in two dimensions.What are two animals that are very similar and two that are very different?Create a hierachical clustering object for this dataset.

Why are a leopard and raccoon clustered together for any cluster size?

Now add colors corresponding to four different clusters to your MDS plot.

Interpret what each of the four clusters correspond to.