
Clustering

Lecturer: Dr. Bo Yuan
E-mail: [email protected]

Overview

- Partitioning Methods: K-Means, Sequential Leader
- Model Based Methods
- Density Based Methods
- Hierarchical Methods

What is cluster analysis?

Finding groups of objects:
- Objects similar to each other are in the same group
- Objects are different from those in other groups

Unsupervised learning: no labels, data driven.

Clusters

[Figures: example clusters, illustrating intra-cluster similarity and inter-cluster separation]

Applications of Clustering

- Marketing: finding groups of customers with similar behaviours
- Biology: finding groups of animals or plants with similar features
- Bioinformatics: clustering of microarray data, genes and sequences
- Earthquake Studies: clustering observed earthquake epicenters to identify dangerous zones
- WWW: clustering weblog data to discover groups of similar access patterns
- Social Networks: discovering groups of individuals with close internal friendships

Earthquakes

[Figure: map of clustered earthquake epicenters]

Image Segmentation

[Figure: image segmentation example]

The Big Picture

[Figure]

Requirements

- Scalability
- Ability to deal with different types of attributes
- Ability to discover clusters with arbitrary shape
- Minimal requirements for domain knowledge
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Incorporation of user-defined constraints
- Interpretability and usability

Practical Considerations

[Figure]

Normalization or Not

[Figure: the effect of normalizing attributes before clustering]

Evaluation

A common criterion is the within-cluster sum of squared errors, for cluster centres $m_1, \ldots, m_c$:

$$J(m_1,\ldots,m_c)=\sum_{i=1}^{c}\sum_{x\in D_i}\lVert x-m_i\rVert^{2}$$

[Figure: two candidate clusterings of the same data compared]

Evaluation

[Figure]

The Influence of Outliers

[Figure: with K = 2, a single outlier pulls a centroid away from the rest of its cluster]

K-Means

[Figures: three slides stepping through K-Means iterations on a 2D dataset]

K-Means

1. Determine the value of K.
2. Choose K cluster centres randomly.
3. Assign each data point to its closest centroid.
4. Use the mean of each cluster to update each centroid.
5. Repeat until no assignment changes.
6. Return the K centroids.

Reference: J. MacQueen (1967). "Some Methods for Classification and Analysis of Multivariate Observations." Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297.
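The steps above translate directly into a short NumPy sketch. This is a minimal illustration, not the lecture's reference code; the function name, the random seed, and the empty-cluster guard are our own choices, and Euclidean distance is assumed.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K initial centres randomly from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: assign each data point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: use the mean of each cluster to update each centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:        # keep the old centre if a cluster goes empty
                new_centroids[j] = members.mean(axis=0)
        # Step 5: repeat until no centroid moves any more.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels            # Step 6: return the K centroids
```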

Comments on K-Means

Pros:
- Simple, and works well for regular disjoint clusters
- Converges relatively fast
- Relatively efficient and scalable: O(tkn), where t = number of iterations, k = number of centroids, n = number of data points

Cons:
- Need to specify the value of K in advance (difficult; domain knowledge may help)
- May converge to local optima (in practice, try different initial centroids)
- May be sensitive to noisy data and outliers (mean of data points ...)
- Not suitable for clusters of non-convex shapes

The Influence of Initial Centroids

[Figures: two slides showing different initialisations leading to different final clusterings]

The K-Medoids Method

The basic idea is to use real data points as centres.

1. Determine the value of K in advance.
2. Randomly select K points as medoids.
3. Assign each data point to the closest medoid.
4. Calculate the cost of the configuration, J.
5. For each medoid m and each non-medoid point o: swap m and o and calculate the new cost of the configuration, J'.
6. If the cost of the best new configuration J' is lower than J, make the corresponding swap and repeat the above steps; otherwise, terminate the procedure.

A code sketch follows the example below.

The K-Medoids Method

[Figure: a swap example comparing a configuration with cost 20 against one with cost 26]
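Below is a hedged PAM-style sketch of the swap procedure described above; the quadratic search over (medoid, non-medoid) pairs mirrors the loop on the slide. Function and variable names are illustrative, and Euclidean distance is assumed.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """PAM-style K-Medoids sketch: centres are actual data points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        # J: total distance of every point to its closest medoid.
        return dist[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        # For each medoid slot i and each non-medoid o, try the swap; keep it if J' < J.
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels
```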

Sequential Leader Clustering

A very efficient clustering algorithm: a single pass with no iteration, time complexity O(nk), and no need to specify K in advance.

1. Choose a cluster threshold value.
2. For every new data point, compute the distance between it and every cluster's centre.
3. If the distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre.
4. Otherwise, create a new cluster with the new data point as its centre.

Clustering results may be influenced by the order in which the data points arrive.
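A minimal single-pass sketch of the procedure, assuming Euclidean distance; the function name and the mean-based centre update are our own choices.

```python
import numpy as np

def sequential_leader(X, threshold):
    """Single-pass leader clustering: one scan over the data, no iteration."""
    centres, members = [], []
    for x in X:
        if centres:
            d = np.linalg.norm(np.array(centres) - x, axis=1)
            j = d.argmin()
            if d[j] < threshold:
                # Close enough: join cluster j and re-compute its centre.
                members[j].append(x)
                centres[j] = np.mean(members[j], axis=0)
                continue
        # No existing centre is within the threshold: start a new cluster at x.
        centres.append(x.astype(float))
        members.append([x])
    return centres, members
```

Note that the threshold plays the role of K: a smaller threshold yields more clusters.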

Silhouette

A method for the interpretation and validation of clusters of data: a succinct graphical representation of how well each data point lies within its cluster, compared to other clusters.

- a(i): the average dissimilarity of i with all other points in the same cluster
- b(i): the lowest average dissimilarity of i to any other cluster

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}$$
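A direct NumPy transcription of the definition above (illustrative only; it assumes Euclidean dissimilarity and at least two clusters):

```python
import numpy as np

def silhouette(X, labels):
    """s(i) for every point, computed straight from a(i) and b(i)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        # a(i): average dissimilarity to the other members of i's own cluster.
        a = dist[i, same].mean() if same.any() else 0.0
        # b(i): lowest average dissimilarity to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```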

Silhouette

[Figure: silhouette plot (x-axis: Silhouette Value, from about -0.2 to 1; y-axis: Cluster) alongside the clustered 2D data]

Gaussian Mixture

A single Gaussian component:

$$g(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

A mixture of n components, with non-negative weights that sum to one:

$$f(x)=\sum_{i=1}^{n}\alpha_{i}\,g_{i}(x;\mu_{i},\sigma_{i}),\qquad \alpha_{i}\ge 0,\quad \sum_{i=1}^{n}\alpha_{i}=1$$
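For concreteness, a few lines evaluating the mixture density f(x) above (names are illustrative; 1D components assumed):

```python
import numpy as np

def mixture_pdf(x, mu, sigma, alpha):
    """f(x) = sum_i alpha_i * g(x; mu_i, sigma_i) for 1D Gaussian components."""
    g = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return (alpha * g).sum(axis=1)
```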

Clustering by Mixture Models

[Figure]

K-Means Revisited

$$\theta=\big((x_{1},y_{1}),\,(x_{2},y_{2})\big)\qquad\text{model parameters (the cluster centres)}$$

$$Z=(\text{Cluster 1},\,\text{Cluster 2})\qquad\text{latent parameters (the cluster assignments)}$$

Expectation Maximization

[Figure slides: the E-step / M-step iteration]

EM: Gaussian Mixture

Notation:
- m: the number of data points
- n: the number of mixture components
- z_ij: whether instance i is generated by the jth Gaussian

E-step, the expected membership of instance i in component j:

$$E[z_{ij}]=\frac{p(x=x_{i}\mid\mu=\mu_{j})}{\sum_{k=1}^{n}p(x=x_{i}\mid\mu=\mu_{k})}=\frac{e^{-(x_{i}-\mu_{j})^{2}/2\sigma^{2}}}{\sum_{k=1}^{n}e^{-(x_{i}-\mu_{k})^{2}/2\sigma^{2}}}$$

M-step, re-estimating the means and mixture weights:

$$\mu_{j}=\frac{\sum_{i=1}^{m}E[z_{ij}]\,x_{i}}{\sum_{i=1}^{m}E[z_{ij}]},\qquad \alpha_{j}=\frac{1}{m}\sum_{i=1}^{m}E[z_{ij}]$$
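Putting the two steps together, here is a compact sketch for a 1D mixture with a shared, fixed variance (a simplification matching the formulas above; all names are illustrative):

```python
import numpy as np

def em_gmm_1d(x, n_components, n_iter=50, seed=0):
    """EM for a 1D Gaussian mixture with a shared, fixed sigma^2."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=n_components, replace=False).astype(float)
    sigma2 = x.var()                        # fixed variance, for simplicity
    alpha = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: E[z_ij] proportional to alpha_j * exp(-(x_i - mu_j)^2 / 2 sigma^2).
        lik = alpha * np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2))
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means and mixture weights.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        alpha = resp.mean(axis=0)
    return mu, alpha, resp
```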

Density Based Methods

- Generate clusters of arbitrary shapes
- Robust against noise
- No K value required in advance
- Somewhat similar to human vision

DBSCAN

Density-Based Spatial Clustering of Applications with Noise.

- Density: the number of points within a specified radius
- Core point: a point with high density
- Border point: a point with low density, but in the neighbourhood of a core point
- Noise point: neither a core point nor a border point

[Figure: example marking a core point, a border point, and a noise point]

DBSCAN

[Figure: p and q directly density reachable; p and q density reachable through a chain of core points; p and q density connected via a point o]

DBSCAN

A cluster is defined as a maximal set of density connected points:

1. Start from a randomly selected unseen point P.
2. If P is a core point, build a cluster by gradually adding all points that are density reachable to the current point set.
3. Noise points are discarded (left unlabelled).
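A minimal sketch of this expansion procedure. The parameter names eps and min_pts are the conventional ones but are not given on the slides; Euclidean distance is assumed.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch; labels: -1 = noise, 0..k-1 = cluster ids."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbours[p]) < min_pts:
            continue                          # not core; stays noise unless claimed later
        labels[p] = cluster                   # P is a core point: start a new cluster
        queue = list(neighbours[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster           # density reachable: join the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbours[q]) >= min_pts:
                    queue.extend(neighbours[q])   # q is also core: expand through it
        cluster += 1
    return labels
```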

Hierarchical Clustering

Produces a set of nested, tree-like clusters:
- Can be visualized as a dendrogram
- A clustering is obtained by cutting the dendrogram at the desired level
- No need to specify K in advance
- May correspond to meaningful taxonomies

Dinosaur Family Tree

[Figure: a dinosaur family tree as an example of a meaningful hierarchy]

Agglomerative Methods

Bottom-up method:

1. Assign each data point to its own cluster.
2. Calculate the proximity matrix.
3. Merge the pair of closest clusters.
4. Repeat until only a single cluster remains.

How is the distance between clusters calculated?
- Single link: the minimum distance between points
- Complete link: the maximum distance between points

A code sketch follows below.
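A naive sketch of the loop above, parameterised by the linkage rule (illustrative only; real implementations update the proximity matrix incrementally rather than rescanning all pairs):

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Naive bottom-up clustering; returns the sequence of merges."""
    clusters = {i: [i] for i in range(len(X))}
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    merges = []
    while len(clusters) > 1:
        best, pair = np.inf, None
        ids = list(clusters)
        # Find the pair of closest clusters under the chosen linkage.
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = dist[np.ix_(clusters[ids[a]], clusters[ids[b]])]
                d = d.min() if linkage == "single" else d.max()   # single vs complete link
                if d < best:
                    best, pair = d, (ids[a], ids[b])
        i, j = pair
        merges.append((i, j, best))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return merges
```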

Example

Single-link agglomerative clustering on a distance matrix of six Italian cities:

        BA   FI   MI   NA   RM   TO
BA       0  662  877  255  412  996
FI     662    0  295  468  268  400
MI     877  295    0  754  564  138
NA     255  468  754    0  219  869
RM     412  268  564  219    0  669
TO     996  400  138  869  669    0

Example

MI and TO are the closest pair (138), so they are merged into MI/TO:

        BA   FI  MI/TO   NA   RM
BA       0  662   877   255  412
FI     662    0   295   468  268
MI/TO  877  295     0   754  564
NA     255  468   754     0  219
RM     412  268   564   219    0

Next, NA and RM are merged (219):

        BA   FI  MI/TO  NA/RM
BA       0  662   877    255
FI     662    0   295    268
MI/TO  877  295     0    564
NA/RM  255  268   564      0

Example

BA then joins NA/RM (255):

           BA/NA/RM   FI  MI/TO
BA/NA/RM          0  268    564
FI              268    0    295
MI/TO           564  295      0

Finally FI joins BA/NA/RM (268), leaving two clusters at distance 295:

              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM             0    295
MI/TO                 295      0
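The same merge sequence can be reproduced with SciPy's hierarchical clustering, shown here as a cross-check of the worked example (plotting the dendrogram additionally requires matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
])
# Single-link agglomerative clustering on the condensed distance matrix.
Z = linkage(squareform(D), method="single")
print(Z)                      # merge order: (MI, TO) at 138, (NA, RM) at 219, ...
dendrogram(Z, labels=cities)  # visualize the hierarchy
```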

Min vs Max

[Figure: single link (min) and complete link (max) producing different clusterings of the same data]

Reading Materials

Text books:
- Richard O. Duda et al., Pattern Classification, Chapter 10. John Wiley & Sons.
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8. Morgan Kaufmann.

Survey papers:
- A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review." ACM Computing Surveys, Vol. 31(3), pp. 264-323.
- R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms." IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.
- A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means." Pattern Recognition Letters, Vol. 31, pp. 651-666.

Online tutorials:
- http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/
- http://www.autonlab.org/tutorials/kmeans.html
- http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial/

Review

- What is clustering?
- What are the two categories of clustering methods?
- How does the K-Means algorithm work?
- What are the major issues of K-Means?
- How to control the number of clusters in Sequential Leader Clustering?
- How to use Gaussian mixture models for clustering?
- What are the main advantages of density based methods?
- What is the core idea of DBSCAN?
- What is the general procedure of hierarchical clustering?
- Which clustering methods do not require K as an input?

Next Week's Class Talk

Volunteers are required for next week's class talk.

Topic: Affinity Propagation
- Science 315, 972-976, 2007
- Clustering by passing messages between points
- http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks
- Science 344, 1492-1496, 2014
- Cluster centers have higher density than their neighbors
- Cluster centers are distant from other points with higher densities

Length: 20 minutes, plus question time.

Assignment

Topic: Clustering Techniques and Applications

Techniques:
- K-Means
- Another clustering method, for comparison

Task 1: 2D artificial datasets
- Demonstrate the influence of data patterns
- Demonstrate the influence of algorithm factors

Task 2: Image segmentation
- Gray vs. colour

Deliverables:
- Reports (experiment specification, algorithm parameters, in-depth analysis)
- Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 2: Clustering

Overview

Partitioning Methods K-Means Sequential Leader Model Based Methods Density Based Methods

Hierarchical Methods

2

What is cluster analysis

Finding groups of objects Objects similar to each other are in the same group Objects are different from those in other groups

Unsupervised Learning No labels Data driven

3

Clusters

4

Inter-Cluster

Intra-Cluster

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 3: Clustering

What is cluster analysis

Finding groups of objects Objects similar to each other are in the same group Objects are different from those in other groups

Unsupervised Learning No labels Data driven

3

Clusters

4

Inter-Cluster

Intra-Cluster

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 4: Clustering

Clusters

4

Inter-Cluster

Intra-Cluster

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 5: Clustering

Clusters

5

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 6: Clustering

Applications of Clustering

Marketing Finding groups of customers with similar behaviours

Biology Finding groups of animals or plants with similar features

Bioinformatics Clustering of microarray data genes and sequences

Earthquake Studies Clustering observed earthquake epicenters to identify dangerous zones

WWW Clustering weblog data to discover groups of similar access patterns

Social Networks Discovering groups of individuals with close friendships internally

6

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density based methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as the input?

46

Next Weekrsquos Class Talk

Volunteers are required for next week's class talk.

Topic: Affinity Propagation. Science, 315, 972–976, 2007. Clustering by passing messages between points. http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks. Science, 344, 1492–1496, 2014. Cluster centers have a higher density than their neighbors, and are distant from any points of higher density.

Length: 20 minutes plus question time.

47

Assignment

Topic: Clustering Techniques and Applications

Techniques: K-Means, plus another clustering method of your choice for comparison

Task 1: 2D Artificial Datasets, to demonstrate the influence of data patterns and of algorithm factors

Task 2: Image Segmentation, gray vs. colour

Deliverables: Reports (experiment specification, algorithm parameters, in-depth analysis) and Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

48

Page 7: Clustering

Earthquakes

7

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 8: Clustering

Image Segmentation

8

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 9: Clustering

The Big Picture

9

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 10: Clustering

Requirements

Scalability

Ability to deal with different types of attributes

Ability to discover clusters with arbitrary shape

Minimum requirements for domain knowledge

Ability to deal with noise and outliers

Insensitivity to order of input records

Incorporation of user-defined constraints

Interpretability and usability

10

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 11: Clustering

Practical Considerations

11

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 12: Clustering

Normalization or Not

12

Evaluation

13

ii Dxi

i

c

i Dxie x

nmmxJ 1

1

2

VS

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

24

Cost = 20 vs. Cost = 26
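A minimal sketch of the configuration cost J being compared above, assuming Euclidean distance and NumPy (both are assumptions; the slides do not prescribe either):

import numpy as np

def configuration_cost(X, medoid_idx):
    # J: sum of distances from every data point to its closest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

# Compare two candidate configurations; a swap is kept only if the cost drops.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [8., 8.], [9., 8.]])
print(configuration_cost(X, [0, 3]), configuration_cost(X, [1, 3]))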

Sequential Leader Clustering

A very efficient clustering algorithm: no iteration, time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point, compute the distance between it and every cluster's centre

If the smallest such distance is below the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre

Otherwise, create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25
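A compact sketch of the procedure above (Euclidean distance and the incremental mean update are assumptions; variable names are illustrative):

import numpy as np

def sequential_leader(points, threshold):
    centres, counts, labels = [], [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        d = [np.linalg.norm(x - c) for c in centres]
        j = int(np.argmin(d)) if d else -1
        if j >= 0 and d[j] < threshold:
            counts[j] += 1                              # join the nearest cluster
            centres[j] += (x - centres[j]) / counts[j]  # re-compute its centre
            labels.append(j)
        else:
            centres.append(x.copy())                    # start a new cluster
            counts.append(1)
            labels.append(len(centres) - 1)
    return centres, labels

# Different input orders can give different clusterings.
print(sequential_leader([[0, 0], [0.4, 0], [5, 5], [0.2, 0.1]], threshold=1.0)[1])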

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i): the average dissimilarity of point i to all other points in the same cluster

b(i): the lowest average dissimilarity of point i to the points of any other cluster

26

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}$$
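A small hand-rolled sketch of this definition (it assumes every cluster has at least two points; library routines would normally be used):

import numpy as np

def silhouette(i, X, labels):
    d = np.linalg.norm(X - X[i], axis=1)          # distances from point i
    others = np.arange(len(X)) != i
    a = d[(labels == labels[i]) & others].mean()  # a(i): own-cluster dissimilarity
    b = min(d[labels == c].mean()                 # b(i): best other cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
print([round(silhouette(i, X, labels), 3) for i in range(4)])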

Silhouette

27

[Figure: silhouette plot of a two-cluster example, showing each point's silhouette value (roughly -0.2 to 1) grouped by cluster, together with a scatter plot of the underlying 2D data.]

Gaussian Mixture

28

$$g(x;\ \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

$$f(x) = \sum_{i=1}^{n} \alpha_i\, g(x;\ \mu_i, \sigma_i), \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i = 1$$
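As a quick illustration of these two formulas (the weights, means, and variances below are made-up numbers):

import numpy as np

def gaussian(x, mu, sigma):
    # g(x; mu, sigma): the normal density
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_pdf(x, alphas, mus, sigmas):
    # f(x) = sum_i alpha_i * g(x; mu_i, sigma_i), with the alphas summing to 1
    return sum(a * gaussian(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

print(mixture_pdf(0.5, [0.3, 0.7], [0.0, 2.0], [1.0, 0.5]))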

Clustering by Mixture Models

29

K-Means Revisited

30

$$\theta = \big((x_1, y_1),\ (x_2, y_2)\big)$$: model parameters (the cluster centres)

$$Z = (\text{Cluster 1},\ \text{Cluster 2})$$: latent parameters (the cluster assignments)

Expectation Maximization

31

32

EM Gaussian Mixture

33

Notation: m is the number of data points, n is the number of mixture components, and z_ij indicates whether instance i is generated by the j-th Gaussian.

E-step: compute the expected responsibilities (assuming a shared variance $\sigma^2$):

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{k=1}^{n} p(x = x_i \mid \mu = \mu_k)} = \frac{e^{-(x_i - \mu_j)^2 / 2\sigma^2}}{\sum_{k=1}^{n} e^{-(x_i - \mu_k)^2 / 2\sigma^2}}$$

M-step: re-estimate the means and the mixing weights:

$$\mu_j = \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}, \qquad \alpha_j = \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]$$
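A minimal EM sketch mirroring the E-step and M-step above for 1-D data with a fixed, shared variance (the fixed sigma and the initialisation are simplifying assumptions, and this version also weights the E-step by the current mixing weights, a small extension of the formula above):

import numpy as np

def em_gmm_1d(x, n_components, sigma=1.0, n_iter=50, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, n_components, replace=False)   # initial means
    alpha = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities E[z_ij], one row per data point
        r = alpha * np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and mixing weights
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        alpha = r.mean(axis=0)
    return mu, alpha

mu, alpha = em_gmm_1d([0.1, -0.2, 0.3, 4.9, 5.2, 5.1], n_components=2)
print(mu.round(2), alpha.round(2))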

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density: the number of points within a specified radius

Core Point: a point with high density

Border Point: a point with low density, but in the neighbourhood of a core point

Noise Point: neither a core point nor a border point

35

[Figure: an example neighbourhood illustrating a core point, a border point, and a noise point.]

DBSCAN

36

[Figure: q is directly density reachable from core point p; q is density reachable from p via a chain of directly density reachable points; p and q are density connected through a common point o.]

DBSCAN

A cluster is defined as a maximal set of density connected points. Start from a randomly selected unseen point P. If P is a core point, build a cluster by gradually adding all points that are density reachable from the current point set. Noise points are discarded (unlabelled).

37
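A naive O(n²) sketch of this procedure (the parameter names eps and min_pts are conventional choices, not from the slides):

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neigh = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)            # -1 marks noise (unlabelled)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p] or len(neigh[p]) < min_pts:
            continue                   # only an unseen core point seeds a cluster
        stack = [p]
        while stack:                   # grow the cluster via density reachability
            q = stack.pop()
            if visited[q]:
                continue
            visited[q] = True
            labels[q] = cluster
            if len(neigh[q]) >= min_pts:    # only core points expand the frontier
                stack.extend(neigh[q])
        cluster += 1
    return labels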

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram: a clustering is obtained by cutting it at the desired level. No need to specify K in advance. May correspond to meaningful taxonomies.

38
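In practice the dendrogram and its cut are one call away in SciPy (using SciPy here is an assumption; the lecture does not prescribe a library):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)
Z = linkage(pdist(X), method="single")            # bottom-up merge history
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
tree = dendrogram(Z, no_plot=True)                # tree layout (plot with matplotlib)
print(labels)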

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to its own cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters?

Single Link: minimum distance between points

Complete Link: maximum distance between points

40
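A hand-rolled single-link sketch over a precomputed distance matrix; fed the city distances from the example below, it reproduces the merge sequence shown there (the printing is illustrative):

import numpy as np

def single_link(dist, names):
    # Repeatedly merge the closest pair of clusters (single link: minimum distance).
    dist = dist.astype(float)
    np.fill_diagonal(dist, np.inf)
    names = list(names)
    while len(names) > 1:
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        i, j = min(i, j), max(i, j)
        print(f"merge {names[i]} + {names[j]} at {dist[i, j]:.0f}")
        merged = np.minimum(dist[i], dist[j])   # single-link distance update
        dist[i, :] = merged
        dist[:, i] = merged
        dist[i, i] = np.inf
        dist = np.delete(np.delete(dist, j, axis=0), j, axis=1)
        names[i] = names[i] + "/" + names.pop(j)

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]])
single_link(D, cities)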

Example

41

Single Link

      BA   FI   MI   NA   RM   TO
BA     0  662  877  255  412  996
FI   662    0  295  468  268  400
MI   877  295    0  754  564  138
NA   255  468  754    0  219  869
RM   412  268  564  219    0  669
TO   996  400  138  869  669    0

Example

42

After merging MI and TO:

       BA   FI  MI/TO  NA   RM
BA      0  662   877  255  412
FI    662    0   295  468  268
MI/TO 877  295     0  754  564
NA    255  468   754    0  219
RM    412  268   564  219    0

After merging NA and RM:

       BA   FI  MI/TO NA/RM
BA      0  662   877   255
FI    662    0   295   268
MI/TO 877  295     0   564
NA/RM 255  268   564     0

Example

43

After merging BA with NA/RM:

          BA/NA/RM   FI  MI/TO
BA/NA/RM        0   268   564
FI            268     0   295
MI/TO         564   295     0

After merging FI with BA/NA/RM:

             BA/FI/NA/RM  MI/TO
BA/FI/NA/RM            0    295
MI/TO                295      0

Min vs Max

44

Reading Materials

Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.

Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review". ACM Computing Surveys, Vol. 31(3), pp. 264-323.
R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.
A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means". Pattern Recognition Letters, Vol. 31, pp. 651-666.

Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial/

45

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density based methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as input?

46

Next Week's Class Talk

Volunteers are required for next week's class talk.

Topic: Affinity Propagation. Science, 315, 972–976, 2007. Clustering by passing messages between points. http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks. Science, 344, 1492–1496, 2014. Cluster centers have higher density than their neighbors, and are relatively distant from other points with higher densities.

Length: 20 minutes, plus question time

47

Assignment

Topic: Clustering Techniques and Applications

Techniques: K-Means, and another clustering method for comparison

Task 1: 2D artificial datasets, to demonstrate the influence of data patterns and of algorithm factors

Task 2: image segmentation, gray vs. colour

Deliverables: report (experiment specification, algorithm parameters, in-depth analysis) and code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

48

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 14: Clustering

Evaluation

14

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 15: Clustering

The Influence of Outliers

15

outlier

K=2

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 16: Clustering

K-Means

16

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 17: Clustering

K-Means

17

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 18: Clustering

K-Means

18

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 19: Clustering

K-Means

Determine the value of K

Choose K cluster centres randomly

Each data point is assigned to its closest centroid

Use the mean of each cluster to update each centroid

Repeat until no more new assignment

Return the K centroids

Reference J MacQueen (1967) Some Methods for Classification and Analysis of

Multivariate Observations Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability vol1 pp 281-297

19

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density based methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as the input?

46

Next Week's Class Talk

Volunteers are required for next week's class talk.

Topic 1: Affinity Propagation (Science 315, 972-976, 2007)
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic 2: Clustering by Fast Search and Find of Density Peaks (Science 344, 1492-1496, 2014)
Cluster centers have a higher density than their neighbors
Cluster centers are distant from other points with higher densities

Length: 20 minutes plus question time

47

Assignment

Topic: Clustering Techniques and Applications

Techniques: K-Means, plus another clustering method for comparison

Task 1: 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors

Task 2: Image Segmentation (Gray vs. Colour)

Deliverables:
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15

48
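As a possible starting point for Task 2 (an assumption, not a prescribed design: scikit-learn is available and the image is an H x W x 3 NumPy array):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment(image, k):
    """Cluster pixel colours with K-Means; returns a segment id per pixel."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)             # one row per pixel
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pixels)
    return labels.reshape(h, w)
```

A grayscale image can be handled the same way by reshaping its intensities into a single column.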

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 20: Clustering

Comments on K-Means

Pros Simple and works well for regular disjoint clusters Converges relatively fast Relatively efficient and scalable O(tkn)

bull t iteration k number of centroids n number of data points

Cons Need to specify the value of K in advance

bull Difficult and domain knowledge may help May converge to local optima

bull In practice try different initial centroids May be sensitive to noisy data and outliers

bull Mean of data points hellip Not suitable for clusters of

bull Non-convex shapes

20

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 21: Clustering

The Influence of Initial Centroids

21

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 22: Clustering

The Influence of Initial Centroids

22

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 23: Clustering

The K-Medoids Method

The basic idea is to use real data points as centres

Determine the value of K in advance

Randomly select K points as medoids

Assign each data point to the closest medoid

Calculate the cost of the configuration J

For each medoid m For each non-medoid point o Swap m and o and calculate the new cost of configuration Jprime

If the cost of the best new configuration J is lower than J make the corresponding swap and repeat the above steps

Otherwise terminate the procedure23

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 24: Clustering

The K-Medoids Method

24

Cost =20 Cost =26

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 25: Clustering

Sequential Leader Clustering

A very efficient clustering algorithm No iteration Time complexity O(nk)

No need to specify K in advance

Choose a cluster threshold value

For every new data point Compute the distance between the new data point and every clusters centre

If the distance is smaller than the chosen threshold assign the new data point to the corresponding cluster and re-compute cluster centre

Otherwise create a new cluster with the new data point as its centre

Clustering results may be influenced by the sequence of data points

25

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 26: Clustering

Silhouette

A method of interpretation and validation of clusters of data

A succinct graphical representation of how well each data point lies within its cluster compared to other clusters

a(i) average dissimilarity of i with all other points in the same cluster

b(i) the lowest average dissimilarity of i to other clusters

26

)()(max)()()(iaibiaibis

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 27: Clustering

Silhouette

27

-02 0 02 04 06 08 1

1

2

Silhouette Value

Clu

ster

-3 -2 -1 0 1 2 3 4-3

-2

-1

0

1

2

3

4

Gaussian Mixture

28

)2()(

2

22

21)(

xexg

1amp0)()(1

i

ii

n

iiii xgxf

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic: Affinity Propagation
Science, 315: 972-976, 2007
Clustering by passing messages between points
http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic: Clustering by Fast Search and Find of Density Peaks
Science, 344: 1492-1496, 2014
Cluster centers have higher density than their neighbors
Cluster centers are distant from other points with higher densities

Length: 20 minutes plus question time

47

Assignment

Topic: Clustering Techniques and Applications

Techniques: K-Means, plus another clustering method for comparison

Task 1: 2D Artificial Datasets
To demonstrate the influence of data patterns
To demonstrate the influence of algorithm factors

Task 2: Image Segmentation
Gray vs. Colour

Deliverables:
Reports (experiment specification, algorithm parameters, in-depth analysis)
Code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15%

48



Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 29: Clustering

Clustering by Mixture Models

29

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 30: Clustering

K-Means Revisited

30

120579=(1199091 1199101 ) (1199092 119910 2)

119885=119862119897119906119904119905119890119903 1 119862119897119906119904119905119890119903 2

model parameters

latent parameters

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 31: Clustering

Expectation Maximization

31

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 32: Clustering

32

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 33: Clustering

EM Gaussian Mixture

33

Gaussianjth by the generated is i instancer whethecomponents mixture ofnumber the

points data ofnumber the

ijznm

n

kk

x

j

x

n

kkki

jjiij

ki

ji

e

e

xxp

xxpzE

1

)(2

1

)(2

1

1

22

22

)|(

)|(][

m

iij

m

iiij

j

zE

xzE

1

1

][

][

m

iijj zE

m 1

][1

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 34: Clustering

Density Based Methods

Generate clusters of arbitrary shapes

Robust against noise

No K value required in advance

Somewhat similar to human vision

34

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 35: Clustering

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

Density number of points within a specified radius

Core Point points with high density

Border Point points with low density but in the neighbourhood of a core point

Noise Point neither a core point nor a border point

35

Core Point

Noise Point

Border Point

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 36: Clustering

DBSCAN

36

p

q

directly density reachable

p

q

density reachable

o

qp

density connected

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 37: Clustering

DBSCAN

A cluster is defined as the maximal set of density connected points Start from a randomly selected unseen point P If P is a core point build a cluster by gradually adding all points that are

density reachable to the current point set Noise points are discarded (unlabelled)

37

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example

43

BANARM FI MITO

BANARM 0 268 564FI 268 0 295

MITO 564 295 0

BAFINARM MITO

BAFINARM 0 295

MITO 295 0

Min vs Max

44

Reading Materials

Text Books Richard O Duda et al Pattern Classification Chapter 10 John Wiley amp Sons J Han and M Kamber Data Mining Concepts and Techniques Chapter 8

Morgan Kaufmann

Survey Papers A K Jain M N Murty and P J Flynn (1999) ldquoData Clustering A Reviewrdquo ACM

Computing Surveys Vol 31(3) pp 264-323 R Xu and D Wunsch (2005) ldquoSurvey of Clustering Algorithmsrdquo IEEE Transactions

on Neural Networks Vol 16(3) pp 645-678 A K Jain (2010) ldquoData Clustering 50 Years Beyond K-Meansrdquo Pattern

Recognition Letters Vol 31 pp 651-666

Online Tutorials httphomedeipolimiitmatteuccClusteringtutorial_html httpwwwautonlaborgtutorialskmeanshtml httpusersinformatikuni-hallede~hinneburClusterTutorial

45

Review

What is clustering

What are the two categories of clustering methods

How does the K-Means algorithm work

What are the major issues of K-Means

How to control the number of clusters in Sequential Leader Clustering

How to use Gaussian mixture models for clustering

What are the main advantages of density methods

What is the core idea of DBSCAN

What is the general procedure of hierarchical clustering

Which clustering methods do not require K as the input

46

Next Weekrsquos Class Talk

Volunteers are required for next weekrsquos class talk

Topic Affinity Propagation Science 315 972ndash976 2007 Clustering by passing messages between points httpwwwpsitorontoeduindexphpq=affinity20propagation

Topic Clustering by Fast Search and Find of Density Peaks Science 344 1492ndash1496 2014 Cluster centers higher density than neighbors Cluster centers distant from others points with higher densities

Length 20 minutes plus question time

47

Assignment

Topic Clustering Techniques and Applications

Techniques K-Means Another clustering method for comparison

Task 1 2D Artificial Datasets To demonstrate the influence of data patterns To demonstrate the influence of algorithm factors

Task 2 Image Segmentation Gray vs Colour

Deliverables Reports (experiment specification algorithm parameters in-depth analysis) Code (any programming language with detailed comments)

Due Sunday 28 December

Credit 1548

  • Clustering
  • Overview
  • What is cluster analysis
  • Clusters
  • Clusters (2)
  • Applications of Clustering
  • Earthquakes
  • Image Segmentation
  • The Big Picture
  • Requirements
  • Practical Considerations
  • Normalization or Not
  • Evaluation
  • Evaluation (2)
  • The Influence of Outliers
  • K-Means
  • K-Means (2)
  • K-Means (3)
  • K-Means (4)
  • Comments on K-Means
  • The Influence of Initial Centroids
  • The Influence of Initial Centroids (2)
  • The K-Medoids Method
  • The K-Medoids Method (2)
  • Sequential Leader Clustering
  • Silhouette
  • Silhouette (2)
  • Gaussian Mixture
  • Clustering by Mixture Models
  • K-Means Revisited
  • Expectation Maximization
  • Slide 32
  • EM Gaussian Mixture
  • Density Based Methods
  • DBSCAN
  • DBSCAN (2)
  • DBSCAN (3)
  • Hierarchical Clustering
  • Dinosaur Family Tree
  • Agglomerative Methods
  • Example
  • Example (2)
  • Example (3)
  • Min vs Max
  • Reading Materials
  • Review
  • Next Weekrsquos Class Talk
  • Assignment
Page 38: Clustering

Hierarchical Clustering

Produce a set of nested tree-like clusters

Can be visualized as a dendrogram Clustering is obtained by cutting at desired level No need to specify K in advance May correspond to meaningful taxonomies

38

Dinosaur Family Tree

39

Agglomerative Methods

Bottom-up Method

Assign each data point to a cluster

Calculate the proximity matrix

Merge the pair of closest clusters

Repeat until only a single cluster remains

How to calculate the distance between clusters

Single Link Minimum distance between points

Complete Link Maximum distance between points

40

Example

41

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0Single Link

Example

42

BA FI MITO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MITO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

BA FI MITO NARM

BA 0 662 877 255FI 662 0 295 268

MITO 877 295 0 564

NARM 255 268 564 0

Example (continued)

BA then joins NA/RM (distance 255):

           BA/NA/RM   FI  MI/TO
BA/NA/RM         0   268    564
FI             268     0    295
MI/TO          564   295      0

FI joins next (distance 268), and the final merge happens at distance 295:

              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM            0    295
MI/TO                295      0

Min vs Max

Single link (min) tends to produce elongated, chained clusters, while complete link (max) favours compact, roughly spherical clusters.

Reading Materials

Text Books:
Richard O. Duda et al., Pattern Classification, Chapter 10, John Wiley & Sons.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.

Survey Papers:
A. K. Jain, M. N. Murty and P. J. Flynn (1999). "Data Clustering: A Review". ACM Computing Surveys, Vol. 31(3), pp. 264-323.
R. Xu and D. Wunsch (2005). "Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.
A. K. Jain (2010). "Data Clustering: 50 Years Beyond K-Means". Pattern Recognition Letters, Vol. 31, pp. 651-666.

Online Tutorials:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
http://www.autonlab.org/tutorials/kmeans.html
http://users.informatik.uni-halle.de/~hinneburg/ClusterTutorial/

Review

What is clustering?

What are the two categories of clustering methods?

How does the K-Means algorithm work?

What are the major issues of K-Means?

How to control the number of clusters in Sequential Leader Clustering?

How to use Gaussian mixture models for clustering?

What are the main advantages of density methods?

What is the core idea of DBSCAN?

What is the general procedure of hierarchical clustering?

Which clustering methods do not require K as the input?

Next Week's Class Talk

Volunteers are required for next week's class talk.

Topic 1: Affinity Propagation ("Clustering by Passing Messages Between Data Points", Science, 315, pp. 972-976, 2007). http://www.psi.toronto.edu/index.php?q=affinity%20propagation

Topic 2: Clustering by Fast Search and Find of Density Peaks (Science, 344, pp. 1492-1496, 2014). Key idea: cluster centers have higher density than their neighbors and are relatively distant from any points of higher density.

Length: 20 minutes plus question time.

Assignment

Topic: Clustering Techniques and Applications

Techniques: K-Means, plus another clustering method of your choice for comparison

Task 1: 2D artificial datasets, to demonstrate the influence of data patterns and of algorithm factors

Task 2: Image segmentation, gray vs. colour (a starter sketch follows)

Deliverables: a report (experiment specification, algorithm parameters, in-depth analysis) and code (any programming language, with detailed comments)

Due: Sunday, 28 December

Credit: 15
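For Task 2, one possible starting point (my sketch, not part of the assignment hand-out; it assumes scikit-learn and Pillow are installed, and the file name input.jpg is a placeholder) is to cluster pixel colours with K-Means and repaint each pixel with its cluster centre:

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

img = np.asarray(Image.open("input.jpg"))       # H x W x 3 RGB image
pixels = img.reshape(-1, 3).astype(float)        # one row per pixel

k = 5                                            # number of segments to try
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster centre to visualise the segments.
segmented = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray(segmented.astype(np.uint8)).save("segmented.jpg")

Varying k and the initialisation, and comparing the result against the second clustering method, covers the "algorithm factors" part of the analysis.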
