
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Clustering

COMP 790-90 Research Seminar, BCB 713 Module

Spring 2011, Wei Wang

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications


Hierarchical Clustering

Group data objects into a tree of clusters.

[Figure: agglomerative clustering (AGNES) proceeds bottom-up from Step 0 to Step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab and cde into abcde; divisive clustering (DIANA) reads the same diagram top-down, splitting abcde back into single objects from Step 4 down to Step 0.]


AGNES (Agglomerative Nesting)

Initially, each object is its own cluster; clusters are merged step by step until all objects form a single cluster.

Single-link approach: each cluster is represented by all of its objects, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.
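The merging loop above can be sketched in a few lines of pure Python. This is an illustrative toy for 1-D points, not the lecture's implementation; the function names are mine.

```python
# Toy AGNES with single-link merging on 1-D points (illustrative sketch).

def single_link_distance(c1, c2):
    """Cluster similarity = distance of the closest pair across clusters."""
    return min(abs(p - q) for p in c1 for q in c2)

def agnes(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # each object starts as its own cluster
    while len(clusters) > num_clusters:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agnes([1, 2, 9, 10, 25], 2))  # [[1, 2, 9, 10], [25]]
```

The pairwise search makes each merge O(n^2), which foreshadows the scalability discussion later in the lecture.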


Dendrogram

Shows how clusters are merged hierarchically

Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)

A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster
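For single-link clustering, cutting the dendrogram at height t is equivalent to taking the connected components of the graph that links any two objects at distance at most t. A minimal sketch of that equivalence (my own helper names, 1-D points for simplicity):

```python
# Cutting a single-link dendrogram at height t = connected components
# of the "distance <= t" graph (illustrative sketch).

def cut_single_link(points, t):
    """Return the clusters obtained by cutting the dendrogram at height t."""
    n = len(points)
    parent = list(range(n))  # union-find over object indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(points[i] - points[j]) <= t:
                parent[find(i)] = find(j)  # union: same component below the cut

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())

print(cut_single_link([1, 2, 9, 10, 25], 3))  # [[1, 2], [9, 10], [25]]
```

Raising t merges components, which corresponds to cutting the dendrogram higher up.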


DIANA (DIvisive ANAlysis)

Initially, all objects are in one cluster

Clusters are split step by step until each cluster contains only one object

[Figure: three scatter plots on 0-10 axes, illustrating how DIANA splits the full data set into successively smaller clusters.]
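One split of the classic DIANA procedure can be sketched as follows: seed a "splinter group" with the object most dissimilar to the rest, then move over every object that is closer on average to the splinter group than to the remainder. This is a simplified illustration (my own function names, 1-D points); the full algorithm repeats the split on the cluster with the largest diameter.

```python
# One DIANA splitting step on 1-D points (simplified sketch).

def avg_dist(p, group):
    """Average dissimilarity of point p to a group of points."""
    return sum(abs(p - q) for q in group) / len(group)

def diana_split(cluster):
    """Split a cluster into (splinter group, remainder)."""
    rest = list(cluster)
    # seed the splinter with the object most dissimilar to the others
    seed = max(rest, key=lambda p: avg_dist(p, [q for q in rest if q != p]))
    rest.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for p in list(rest):
            # move p over if it is closer on average to the splinter group
            if avg_dist(p, splinter) < avg_dist(p, [q for q in rest if q != p]):
                rest.remove(p)
                splinter.append(p)
                moved = True
    return splinter, rest

print(diana_split([1, 2, 9, 10, 25]))  # ([25], [1, 2, 9, 10])
```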


Distance Measures

Minimum distance: d_min(C_p, C_q) = min_{i in C_p, j in C_q} d(i, j)

Maximum distance: d_max(C_p, C_q) = max_{i in C_p, j in C_q} d(i, j)

Mean distance: d_mean(C_p, C_q) = d(m_p, m_q)

Average distance: d_avg(C_p, C_q) = (1 / (n_p n_q)) * sum_{i in C_p} sum_{j in C_q} d(i, j)

where C is a cluster, m is the mean of a cluster, and n is the number of objects in a cluster.
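The four measures are easy to state directly in code. A pure-Python sketch for 1-D clusters (the helper names are mine, not from the lecture):

```python
# The four inter-cluster distance measures, for 1-D clusters.

def d_min(cp, cq):
    """Minimum distance: closest pair across the two clusters."""
    return min(abs(p - q) for p in cp for q in cq)

def d_max(cp, cq):
    """Maximum distance: farthest pair across the two clusters."""
    return max(abs(p - q) for p in cp for q in cq)

def d_mean(cp, cq):
    """Mean distance: distance between the two cluster means."""
    return abs(sum(cp) / len(cp) - sum(cq) / len(cq))

def d_avg(cp, cq):
    """Average distance: mean pairwise distance across the clusters."""
    return sum(abs(p - q) for p in cp for q in cq) / (len(cp) * len(cq))

cp, cq = [1, 2, 3], [7, 9]
print(d_min(cp, cq), d_max(cp, cq), d_mean(cp, cq), d_avg(cp, cq))
# 4 8 6.0 6.0
```

Using d_min as the merge criterion gives single link (as in AGNES above); d_max gives complete link.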


Challenges of Hierarchical Clustering Methods

Hard to choose merge/split points: merging and splitting are never undone, so these decisions are critical

Do not scale well: O(n^2). What is the bottleneck when the data can't fit in memory?

Integrating hierarchical clustering with other techniques: BIRCH, CURE, CHAMELEON, ROCK


BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies

CF (Clustering Feature) tree: a hierarchical data structure summarizing object info

Clustering the objects reduces to clustering the leaf nodes of the CF tree


Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: number of data points

LS: linear sum of the N points, sum_{i=1}^{N} X_i

SS: square sum of the N points, sum_{i=1}^{N} X_i^2

[Figure: the five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on 0-10 axes.]

Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))
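The slide's example CF can be reproduced directly. A minimal sketch for 2-D points, following the (N, LS, SS) layout above (the function name is mine):

```python
# Computing the CF vector (N, LS, SS) for a set of 2-D points.

def clustering_feature(points):
    """Return (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return (n, ls, ss)

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))  # (5, (16, 30), (54, 190))
```

From these three moments one can recover the cluster mean (LS / N) and radius/diameter without revisiting the raw points, which is what makes the CF summary useful.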


CF-tree in BIRCH

Clustering feature: summarizes the statistics for a subcluster (the 0th, 1st, and 2nd moments of the subcluster)

Registers the crucial measurements for computing clusters and utilizes storage efficiently

A CF tree is a height-balanced tree storing the clustering features for a hierarchical clustering

A nonleaf node in the tree has descendants, or "children"

The nonleaf nodes store the sums of the CFs of their children
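The last point relies on the additivity of CF vectors: a parent entry is simply the componentwise sum of its children's CFs. A minimal sketch (my own function name), reusing the five points from the previous slide split into two subclusters:

```python
# CF additivity: a nonleaf entry is the componentwise sum of its children's CFs.

def merge_cf(cf1, cf2):
    """Add two CF vectors (N, LS, SS) componentwise."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (
        n1 + n2,
        tuple(a + b for a, b in zip(ls1, ls2)),
        tuple(a + b for a, b in zip(ss1, ss2)),
    )

cf_a = (3, (9, 15), (29, 77))   # CF of {(3,4), (2,6), (4,5)}
cf_b = (2, (7, 15), (25, 113))  # CF of {(4,7), (3,8)}
print(merge_cf(cf_a, cf_b))     # (5, (16, 30), (54, 190))
```

Additivity is why BIRCH can maintain the tree incrementally: inserting a point only requires adding its CF along one root-to-leaf path.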


CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 ... CF6, each with a child pointer; a nonleaf node holds entries CF1 ... CF5 with child pointers; leaf nodes hold CF entries and are chained together by prev/next pointers.]


Parameters of a CF-tree

Branching factor: the maximum number of children

Threshold: max diameter of sub-clusters stored at the leaf nodes


BIRCH Clustering

Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree


Pros & Cons of BIRCH

Linear scalability: produces a good clustering with a single scan

Quality can be further improved by a few additional scans

Can handle only numeric data

Sensitive to the order of the data records


Drawbacks of Square Error Based Methods

One representative per cluster: good only for convex-shaped clusters of similar size and density

The number of clusters is a parameter k: good only if k can be reasonably estimated


CURE: the Ideas

Each cluster has c representatives: choose c well-scattered points in the cluster and shrink them toward the mean of the cluster by a fraction α. The representatives capture the physical shape and geometry of the cluster.

Merge the closest two clusters: the distance between two clusters is the distance between their two closest representatives.
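The shrinking step can be sketched in a few lines. This is an illustrative 2-D sketch with my own names; for simplicity it pulls each representative toward the representatives' own mean, standing in for the cluster mean.

```python
# CURE-style shrinking of representatives toward the mean by fraction alpha
# (illustrative sketch; alpha is CURE's shrinking factor).

def shrink_representatives(reps, alpha):
    """Move each 2-D representative a fraction alpha toward the mean."""
    n = len(reps)
    mean = (sum(x for x, _ in reps) / n, sum(y for _, y in reps) / n)
    return [
        (x + alpha * (mean[0] - x), y + alpha * (mean[1] - y))
        for x, y in reps
    ]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
print(shrink_representatives(reps, 0.5))
# [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]
```

Shrinking dampens the effect of outliers among the scattered points: alpha = 1 collapses all representatives to the mean (centroid-based behavior), while alpha = 0 leaves them untouched.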