this is your presentation (or slidedoc) titleย ยท clustering is the process of finding these...

35
CSE/STAT 416 Clustering Vinitra Swamy University of Washington July 27, 2020

Upload: others

Post on 15-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

CSE/STAT 416Clustering

Vinitra SwamyUniversity of WashingtonJuly 27, 2020

Page 2: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Recap Last week, we introduced document retrieval and the algorithms associated with it.

We discussed the k-Nearest Neighbor algorithm and ways of representing text documents / measuring their distances.

We talked about how to use k-NN for classification/regression. We extended this to include other notions of distances by using weighted k-NN and kernel regression.

We also talked about how to efficiently find the set of nearest neighbors using Locality Sensitive Hashing (LSH).

2

Page 3: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Clustering

3

SPORTS WORLD NEWS

Page 4: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Recommending News

User preferences are important to learn, but can be challenging to do in practice.

โ–ช People have complicated preferences

โ–ช Topics arenโ€™t always clearly defined

4

I donโ€™t just like sports!

00.10.20.30.40.50.6

Sport

s

World

New

s

Entert

ainm

ent

Scie

nce

Cluster 1

Cluster 3 Cluster 4

Cluster 2Use feedback to

learn user preferences over topics

Page 5: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Labeled Data What if the labels are known? Given labeled training data

Can do multi-class classification methods to predict label

5

Training set of labeled docs

SPORTS WORLD NEWS

ENTERTAINMENT SCIENCE

Example ofsupervised learning

SPORTS

WORLD NEWS

ENTERTAINMENT

SCIENCE TECHNOLOGY

Page 6: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Unsupervised Learning

โ–ช In many real world contexts, there arenโ€™t clearly defined labels so we wonโ€™t be able to do classification

โ–ช We will need to come up with methods that uncover structure from the (unlabeled) input data ๐‘‹.

โ–ช Clustering is an automatic process of trying to find related groups within the given dataset.

6

๐ผ๐‘›๐‘๐‘ข๐‘ก: ๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘› ๐‘‚๐‘ข๐‘ก๐‘๐‘ข๐‘ก: ๐‘ง1, ๐‘ง2, โ€ฆ , ๐‘ง๐‘›

Page 7: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Define Clusters

In their simplest form, a cluster is defined by

โ–ช The location of its center (centroid)

โ–ช Shape and size of its spread

Clustering is the process of finding these clusters and assigning each example to a particular cluster.

โ–ช ๐‘ฅ๐‘– gets assigned ๐‘ง๐‘– โˆˆ [1, 2,โ€ฆ , ๐‘˜]

โ–ช Usually based on closest centroid

Will define some kind of score for a clustering that determines how good the assignments are

โ–ช Based on distance of assignedexamples to each cluster

7

Page 8: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

When This Works

Clustering is easy when distance captures the clusters

Ground Truth (not visible) Given Data

8

Page 9: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Not Always Easy

There are many clusters that are harder to learn with this setup

โ–ช Distance does not determine clusters

9

Page 10: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

k-means

10

Page 11: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Algorithm Will define the Score for assigning a point to a cluster is ๐‘†๐‘๐‘œ๐‘Ÿ๐‘’ ๐‘ฅ๐‘– , ๐œ‡๐‘— = ๐‘‘๐‘–๐‘ ๐‘ก(๐‘ฅ๐‘– , ๐œ‡๐‘—)

Lower score => Better Clustering

k-means Algorithm at a glance

Step 0: Initialize cluster centers

Repeat until convergence:

Step 1: Assign each example to its closest cluster centroid

Step 2: Update the centroids to be the average of all the pointsassigned to that cluster

11

Page 12: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Step 0 Start by choosing the initial cluster centroids

โ–ช A common default choice is to choose centroids at random

โ–ช Will see later that there are smarter ways of initializing

12

๐œ‡1

๐œ‡2

๐œ‡3

Page 13: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Step 1 Assign each example to its closest cluster centroid

๐‘ง๐‘– โ† argmin๐‘—โˆˆ[๐‘˜]

๐œ‡๐‘— โˆ’ ๐‘ฅ๐‘–2

13

Page 14: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Step 2 Update the centroids to be the mean of all the points assigned to that cluster.

๐œ‡๐‘— โ†1

๐‘›๐‘—

๐‘–:๐‘ง๐‘–=๐‘—

๐‘ฅ๐‘–

Computes center of mass for cluster!

14

Page 15: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Repeat Repeat Steps 1 and 2 until convergence

Will it converge? Yes!

What will it converge to?

โ–ช Global optimum

โ–ช Local optimum

โ–ช Neither

15

Page 16: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Brain Break

16

Page 17: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Local OptimaWhat does it mean for something to converge to a local optima?

โ–ช Initial settings will greatly impact results!

17

Page 18: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Smart Initializing w/ k-means++

Making sure the initialized centroids are โ€œgoodโ€ is critical to finding quality local optima. Our purely random approach was wasteful since itโ€™s very possible that initial centroids start close together.

Idea: Try to select a set of points farther away from each other.

k-means++ does a slightly smarter random initialization

1. Choose first cluster ๐œ‡1 from the data uniformly at random

2. For the current set of centroids (starting with just ๐œ‡1), compute the distance between each datapoint and its closest centroid

3. Choose a new centroid from the remaining data points with probability of ๐‘ฅ๐‘– being chosen proportional to ๐‘‘ ๐‘ฅ๐‘–

2

4. Repeat 2 and 3 until we have selected ๐‘˜ centroids18

Page 19: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

k-means++ Example

Start by picking a point at random

Then pick points proportional to their distances to their centroids

This tries to maximize the spread of the centroids!

19

Page 20: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

k-means++Pros / Cons

Pros

โ–ช Improves quality of local minima

โ–ช Faster convergence to local minima

Cons

โ–ช Computationally more expensive at beginning when compared to simple random initialization

20

Page 21: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Assessing Performance

21

Page 22: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Which Cluster?Which clustering would I prefer?

k-means is trying to optimize the heterogeneity objective

๐‘—=1

๐‘˜

๐‘–:๐‘ง๐‘–=๐‘—

๐œ‡๐‘— โˆ’ ๐‘ฅ๐‘–2

2

22

Page 23: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Coordinate Descent

k-means is trying to minimize the heterogeneity objective

argmin๐œ‡1,โ€ฆ,๐œ‡๐‘˜, ๐‘ง1,โ€ฆ, ๐‘ง๐‘›

๐‘—=1

๐‘˜

๐‘–:๐‘ง๐‘–=๐‘—

๐œ‡๐‘— โˆ’ ๐‘ฅ๐‘–2

2

Step 0: Initialize cluster centers

Repeat until convergence:

Step 1: Assign each example to its closest cluster centroid

Step 2: Update the centroids to be the mean of all the pointsassigned to that cluster

Coordinate Descent alternates how it updates parameters to find minima. On each of iteration of Step 1 and Step 2, heterogeneity decreases or stays the same.

=> Will converge in finite time23

Page 24: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Think

pollev.com/cse416

Consider trying k-means with different values of k. Which of the following graphs shows how the globally optimal heterogeneity changes for each value of k?

24

1 min

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

Page 25: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Think

pollev.com/cse416

Consider trying k-means with different values of k. Which of the following graphs shows how the globally optimal heterogeneity changes for each value of k?

25

2.5 min

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

Page 26: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

How to Choose k?

No right answer! Depends on your application.

โ–ช General, look for the โ€œelbowโ€ in the graph

26

# of clusters k

Low

est

po

ssib

lecl

ust

er

het

ero

gen

eit

y

Page 27: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Brain Break

27

Page 28: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Other Applications

28

Page 29: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Clustering Images

29

Page 30: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Result Diversity

30

โ€ข Search terms can have multiple meanings

โ€ข Example: โ€œcardinalโ€

โ€ข Use clustering to structure output

Page 31: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Share Information

31

Page 32: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Curse of Dimensionality

32

Page 33: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

High Dimensions

Methods like k-NN and k-means that rely on computing distances start to struggle in high dimensions.

As the number of dimensions grow, the data gets sparser!

Need more data to make sure you cover all the space in high dim.33

Page 34: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Even Weirder

Itโ€™s believable with more dimensions the data becomes more sparse, but whatโ€™s even weirder is the sparsity is not uniform!

As ๐ท increases, the โ€œmassโ€ of the space goes towards the corners.

โ–ช Most of the points arenโ€™t in the center.

โ–ช Your nearest neighbors start looking like your farthest neighbors!

34

Page 35: THIS IS YOUR PRESENTATION (OR SLIDEDOC) TITLEย ยท Clustering is the process of finding these clusters and assigning each example to a particular cluster. ๐‘ฅ gets assigned ๐‘ง โˆˆ[1,2,โ€ฆ,๐‘˜]

Practicalities Have you pay attention to the number of dimensions

โ–ช Very tricky if ๐‘› < ๐ท

โ–ช Can run into some strange results if ๐ท is very large

Later, we will talk about ways of trying to do dimensionality reduction in order to reduce the number of dimensions here.

35