Post on 26-Jun-2020

UNSUPERVISED LEARNING IN R

Introduction to hierarchical clustering

Unsupervised Learning in R

Hierarchical clustering

● Number of clusters is not known ahead of time

● Two kinds: bottom-up and top-down; this course covers bottom-up

Hierarchical clustering

Simple Example

5 Clusters: each point is a cluster

4 Clusters

3 Clusters

2 Clusters

1 Cluster
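The merge sequence from 5 clusters down to 1 can be sketched in a few lines of R. This is a minimal illustration only: the five points are made up (the slides do not give coordinates), complete linkage is assumed because it is the default in hclust(), and the naive loop is a teaching version rather than how hclust() works internally.

```r
# Five hypothetical 2-D points (the slides do not give coordinates)
pts <- matrix(c(0, 0,
                0, 1,
                4, 0,
                4, 1,
                9, 9), ncol = 2, byrow = TRUE)

d <- as.matrix(dist(pts))                 # pairwise Euclidean distances
clusters <- as.list(seq_len(nrow(pts)))   # start: every point is its own cluster
n_clusters <- length(clusters)            # cluster count after each step
heights <- numeric(0)                     # distance at which each merge happens

while (length(clusters) > 1) {
  best <- c(NA_integer_, NA_integer_)
  best_dist <- Inf
  # Find the pair of clusters with the smallest complete-linkage distance
  for (i in seq_len(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      d_ij <- max(d[clusters[[i]], clusters[[j]]])
      if (d_ij < best_dist) {
        best_dist <- d_ij
        best <- c(i, j)
      }
    }
  }
  # Merge the closest pair into one cluster
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
  n_clusters <- c(n_clusters, length(clusters))
  heights <- c(heights, best_dist)
}

n_clusters  # 5 4 3 2 1
```

The recorded heights should match hclust(dist(pts))$height, which is one way to check the sketch against the built-in function.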

Hierarchical clustering in R

> # Calculates dissimilarity as Euclidean distance between observations
> dist_matrix <- dist(x)

> # Returns hierarchical clustering model
> hclust(d = dist_matrix)

Call:
hclust(d = dist_matrix)

Cluster method   : complete
Distance         : euclidean
Number of objects: 50

x is a data matrix
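A self-contained version of the two calls above might look like the following. The data here are an assumption: x is filled with random numbers (50 observations, 2 features) purely so the snippet runs on its own.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- matrix(rnorm(100), ncol = 2)    # hypothetical data: 50 observations, 2 features

dist_matrix <- dist(x)               # "dist" object; Euclidean distance by default
hc <- hclust(d = dist_matrix)        # complete linkage by default

hc$method                            # "complete"
hc$dist.method                       # "euclidean"
length(hc$order)                     # 50, one entry per observation
```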

Let’s practice!

Selecting number of clusters

Interpreting results

> # Create hierarchical cluster model: hclust.out
> hclust.out <- hclust(dist(x))
> # Inspect the result
> summary(hclust.out)
            Length Class  Mode
merge       98     -none- numeric
height      49     -none- numeric
order       50     -none- numeric
labels       0     -none- NULL
method       1     -none- character
call         2     -none- call
dist.method  1     -none- character

The summary() output isn’t particularly useful for interpreting the model
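Looking at the components directly can be more informative than summary(). The sketch below assumes random stand-in data for x (50 observations), which is why merge has 49 rows: one per merge step.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- matrix(rnorm(100), ncol = 2)    # hypothetical data: 50 observations, 2 features
hclust.out <- hclust(dist(x))

dim(hclust.out$merge)        # 49 rows (merge steps) by 2 columns
head(hclust.out$merge, 3)    # negative entries are single observations,
                             # positive entries refer to earlier merge steps
head(hclust.out$height, 3)   # distance at which each merge happened
```

The 98 reported by summary() for merge is simply 49 rows times 2 columns.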

Dendrogram

● Tree-shaped structure used to interpret hierarchical clustering models

[Figure: dendrogram; vertical axis: height]

Dendrogram plotting in R

> # Draws a dendrogram
> plot(hclust.out)
> abline(h = 6, col = "red")

[Figure: dendrogram; vertical axis: height, with ticks at 0, 2, 4, 6]
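A runnable variant of the plotting calls is sketched here. Two things are assumptions: x is random stand-in data, and the line height is computed from the tree (midway between the last two merges) instead of the fixed h = 6 from the slide, since 6 may lie above the whole tree for other data.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- matrix(rnorm(100), ncol = 2)    # hypothetical data: 50 observations, 2 features
hclust.out <- hclust(dist(x))

# Pick a height between the last two merges, so the line crosses two branches
top_two <- tail(hclust.out$height, 2)
h_line <- mean(top_two)

plot(hclust.out, labels = FALSE, main = "Dendrogram (hypothetical data)")
abline(h = h_line, col = "red")      # horizontal line marking the cut height
```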

Tree 'cutting' in R

● Need to cut the tree to get cluster assignments

> # Cut by height h
> cutree(hclust.out, h = 6)
 [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
[32] 3 3 3 3 4 4 4 4 4 4 4 4 4 4 2 4 2 4 4

> # Cut by number of clusters k
> cutree(hclust.out, k = 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
[32] 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
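The two ways of cutting are related: cutting at a height that lies between the last two merges should give the same assignments as asking for k = 2. The following sketch shows this on made-up data (two groups of 20 points); the data and the choice of cut height are assumptions for illustration.

```r
set.seed(42)                                      # arbitrary seed
# Hypothetical data: two groups of 20 observations, 2 features each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
hclust.out <- hclust(dist(x))

by_k <- cutree(hclust.out, k = 2)                 # cut by number of clusters
h_cut <- mean(tail(hclust.out$height, 2))         # height between the last two merges
by_h <- cutree(hclust.out, h = h_cut)             # cut by height

table(by_k)                                       # cluster sizes
all(by_k == by_h)                                 # TRUE: both cuts agree
```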

Let’s practice!

Clustering linkage and practical matters

Linking clusters in hierarchical clustering

● How is distance between clusters determined? Rules?

● Four methods to determine which clusters should be linked

● Complete: computes pairwise dissimilarity between all observations in cluster 1 and cluster 2, and uses the largest of these dissimilarities

● Single: same as above but uses the smallest of the dissimilarities

● Average: same as above but uses the average of the dissimilarities

● Centroid: finds the centroid of cluster 1 and the centroid of cluster 2, and uses the dissimilarity between the two centroids
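The four rules can be worked out by hand for two tiny clusters. The points below are hypothetical and chosen so the arithmetic is easy to follow; Euclidean distance is assumed as the dissimilarity.

```r
# Two hypothetical clusters of two points each
cluster1 <- matrix(c(0, 0,
                     1, 0), ncol = 2, byrow = TRUE)
cluster2 <- matrix(c(4, 0,
                     6, 0), ncol = 2, byrow = TRUE)

# Pairwise Euclidean distances between points of cluster 1 (rows) and cluster 2 (columns)
all_d <- as.matrix(dist(rbind(cluster1, cluster2)))
cross <- all_d[1:2, 3:4]

complete_link <- max(cross)     # 6: largest pairwise distance
single_link   <- min(cross)     # 3: smallest pairwise distance
average_link  <- mean(cross)    # 4.5: average of the four distances (4, 6, 3, 5)

# Centroid: distance between the two cluster means, (0.5, 0) and (5, 0)
centroid_link <- sqrt(sum((colMeans(cluster1) - colMeans(cluster2))^2))  # 4.5
```

Average and centroid linkage happen to coincide here because all points lie on one line; in general they differ.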

Linking methods: complete and average

Linking method: single

Linking method: centroid

Linkage in R

> # Fitting hierarchical clustering models using different methods
> # (d is a distance matrix, e.g. from dist(x))
> hclust.complete <- hclust(d, method = "complete")
> hclust.average <- hclust(d, method = "average")
> hclust.single <- hclust(d, method = "single")
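One way to compare the linkage methods is to cut each tree into the same number of clusters and look at the cluster sizes. This sketch assumes random stand-in data and an arbitrary choice of k = 3; neither comes from the slides.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- matrix(rnorm(100), ncol = 2)    # hypothetical data: 50 observations, 2 features
d <- dist(x)

methods <- c("complete", "average", "single")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# Cluster sizes (largest first) when each tree is cut into 3 clusters
sizes <- sapply(fits, function(fit) {
  sort(tabulate(cutree(fit, k = 3), nbins = 3), decreasing = TRUE)
})
sizes                                # one column per linkage method
```

Single linkage often produces one very large cluster plus a few tiny ones, while complete and average linkage tend to give more balanced sizes, though this depends on the data.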

Practical matters

● Data on different scales can cause undesirable results in clustering methods

● Solution is to scale data so that features have the same mean and standard deviation

● Subtract the mean of a feature from all observations

● Divide each feature by the standard deviation of the feature

● Normalized features have a mean of zero and a standard deviation of one
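The two steps (subtract the mean, divide by the standard deviation) can be written out explicitly and compared with scale(). The data and column names below are made up for the illustration.

```r
set.seed(1)                                       # arbitrary seed
# Hypothetical data: two features on very different scales
x <- cbind(a = rnorm(50, mean = 10, sd = 5),
           b = rnorm(50, mean = -3, sd = 0.5))

centered <- sweep(x, 2, colMeans(x), "-")         # subtract each column's mean
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")  # divide by each column's sd

# Same values as scale(x); only the extra attributes that scale() stores differ
all.equal(manual, scale(x), check.attributes = FALSE)  # TRUE
```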

Practical matters

> # Check if scaling is necessary
> colMeans(x)
[1] -0.1337828  0.0594019

> apply(x, 2, sd)
[1] 1.974376 2.112357

> # Produce new matrix with columns of mean of 0 and sd of 1
> scaled_x <- scale(x)

> colMeans(scaled_x)
[1] 2.775558e-17 3.330669e-17

> apply(scaled_x, 2, sd)
[1] 1 1

x is a data matrix

Normalized features have mean of 0 (here, zero up to floating-point error) and standard deviation of 1
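As a sketch of why this check matters, the toy example below shows the nearest neighbour of one observation changing once the features are scaled. The four observations (age in years, income in dollars) and the helper nearest() are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data: age in years, income in dollars
x <- cbind(age    = c(25, 27, 60, 62),
           income = c(50000, 50300, 50100, 53000))
rownames(x) <- c("A", "B", "C", "D")

# Hypothetical helper: name of the observation closest to observation i
nearest <- function(d, i) {
  d <- as.matrix(d)
  d[i, i] <- Inf               # ignore the zero distance to itself
  names(which.min(d[i, ]))
}

nearest(dist(x), "A")          # "C": income differences dominate the raw distances
nearest(dist(scale(x)), "A")   # "B": after scaling, age contributes as well
```

On the raw data, A (age 25) is matched with C (age 60) because their incomes differ by only 100 dollars; after scaling, A is matched with B, who is close in both age and income.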

Let’s practice!

Review of hierarchical clustering

Hierarchical clustering review

> # Fitting various hierarchical clustering models
> # (d is a distance matrix, e.g. from dist(x))
> hclust.complete <- hclust(d, method = "complete")
> hclust.average <- hclust(d, method = "average")
> hclust.single <- hclust(d, method = "single")

Linking methods: complete and average

Hierarchical clustering review

[Figure: dendrogram; vertical axis: height]

Hierarchical clustering review

> # Scale the data
> pokemon.scaled <- scale(pokemon)

> # Create hierarchical and k-means clustering models
> hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
> km.pokemon <- kmeans(pokemon.scaled, centers = 3, nstart = 20, iter.max = 50)

> # Compare results of the models
> cut.pokemon <- cutree(hclust.pokemon, k = 3)
> table(km.pokemon$cluster, cut.pokemon)
   cut.pokemon
      1   2   3
  1 242   1   0
  2 342   1   0
  3 204   9   1
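The same comparison can be tried without the pokemon data set, which is not included here. The sketch below substitutes made-up data (three groups of 30 observations); the object names ending in .toy are assumptions chosen to mirror the slide.

```r
set.seed(123)                                     # arbitrary seed
# Hypothetical stand-in for the pokemon data: 3 groups of 30 observations, 2 features
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

toy.scaled <- scale(toy)                          # scale before clustering

hclust.toy <- hclust(dist(toy.scaled), method = "complete")
km.toy <- kmeans(toy.scaled, centers = 3, nstart = 20, iter.max = 50)

cut.toy <- cutree(hclust.toy, k = 3)
comparison <- table(kmeans = km.toy$cluster, hclust = cut.toy)
comparison                                        # rows: k-means, columns: hierarchical
```

Cluster labels are arbitrary in both methods, so agreement shows up as each row having one dominant column, not necessarily as a filled diagonal.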

Let’s practice!