1 lecture 10 clustering. 2 preview introduction partitioning methods hierarchical methods...

38
1 Lecture 10 Clustering

Upload: julius-elliott

Post on 30-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

1

Lecture 10 Clustering

Page 2: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

2

Preview

Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Page 3: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

4

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

Urban planning: Identifying groups of houses according to their house type, value, and geographical location

Seismology: Observed earth quake epicenters should be clustered along continent faults

Page 4: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

5

What Is a Good Clustering?

A good clustering method will produce clusters with High intra-class similarity Low inter-class similarity

Precise definition of clustering quality is difficult Application-dependent Ultimately subjective

Page 5: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

6

Requirements for Clustering in Data Mining

Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal domain knowledge required to

determine input parameters Ability to deal with noise and outliers Insensitivity to order of input records Robustness wrt high dimensionality Incorporation of user-specified constraints Interpretability and usability

Page 6: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

7

Similarity and Dissimilarity Between Objects

Euclidean distance (p = 2):

Properties of a metric d(i,j): d(i,j) 0 d(i,i) = 0 d(i,j) = d(j,i) d(i,j) d(i,k) + d(k,j)

)||...|||(|),( 22

22

2

11 pp jx

ix

jx

ix

jx

ixjid

Page 7: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

8

Major Clustering Approaches

Partitioning: Construct various partitions and then

evaluate them by some criterion

Hierarchical: Create a hierarchical decomposition of

the set of objects using some criterion

Model-based: Hypothesize a model for each cluster

and find best fit of models to data

Density-based: Guided by connectivity and density

functions

Page 8: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

9

Partitioning Algorithms

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids

algorithms k-means (MacQueen, 1967): Each cluster is

represented by the center of the cluster k-medoids or PAM (Partition around medoids)

(Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

Page 9: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

10

K-Means Clustering

Given k, the k-means algorithm consists of four steps: Select initial centroids at random. Assign each object to the cluster with

the nearest centroid. Compute each centroid as the mean

of the objects assigned to it. Repeat previous 2 steps until no

change.

Page 10: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

k-Means [MacQueen ’67]

Page 11: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 12: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 13: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 14: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 15: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 16: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 17: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 18: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Another example (k=3)

19

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 6

Page 19: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

20

K-Means Clustering (contd.)

Example

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

Page 20: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

21

Comments on the K-Means Method

Strengths Relatively efficient: O(tkn), where n is # objects, k is

# clusters, and t is # iterations. Normally, k, t << n.

Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms

Weaknesses Applicable only when mean is defined (what about

categorical data?) Need to specify k, the number of clusters, in advance Trouble with noisy data and outliers Not suitable to discover clusters with non-convex

shapes

Page 21: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

22

Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

Step 0 Step 1 Step 2 Step 3 Step 4

b

d

c

e

a a b

d e

c d e

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0

agglomerative(AGNES)

divisive(DIANA)

Page 22: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Simple example

1 3 2 5 4 60

0.05

0.1

0.15

0.2

1

2

3

4

5

6

1

23 4

5

Page 23: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Cut-off point

Page 24: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Single link example

Page 25: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 26: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 27: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 28: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 29: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 30: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 31: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 32: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods
Page 33: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

close?

Inter-Cluster similarity?

Page 34: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

MIN(Single-link)

Page 35: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

MAX(complete-link)

Page 36: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Distance Between Centroids

Page 37: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Group Average(average-link)

Page 38: 1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

What is: density-based problem?

Classes of arbitrary spatial shapes

Desired properties: good efficiency (large databases) minimal domain knowledge (determine

parameters)

Intuition: notion of density