The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Clustering
COMP 790-90 Research SeminarBCB 713 Module
Spring 2011Wei Wang
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Hierarchical Clustering
Group data objects into a tree of clusters
[Figure: agglomerative clustering (AGNES) merges the singleton clusters a, b, c, d, e bottom-up (a b, d e, c d e, a b c d e) over Steps 0 to 4; divisive clustering (DIANA) performs the same splits top-down, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Initially, each object forms its own cluster; clusters are then merged step by step until all objects form a single cluster
Single-link approach: each cluster is represented by all of its objects, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
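As a concrete illustration of the single-link merging loop above, here is a minimal sketch of AGNES in Python. It is not the course's code; the point set and the absolute-difference metric are invented for the example, and the naive all-pairs search is O(n^3):

```python
from itertools import combinations

def single_link(c1, c2, d):
    # Single-link: cluster similarity is the distance of the closest
    # pair of points drawn from the two different clusters.
    return min(d(p, q) for p in c1 for q in c2)

def agnes(points, d):
    """Naive AGNES: start with singleton clusters and repeatedly merge
    the closest pair until one cluster remains. Returns the sequence of
    merged clusters (the merge history)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]],
                                              clusters[ij[1]], d))
        merged = sorted(clusters[i] + clusters[j])
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

euclid = lambda p, q: abs(p - q)
print(agnes([1.0, 1.2, 5.0, 5.1, 9.0], euclid))
# → [[5.0, 5.1], [1.0, 1.2], [1.0, 1.2, 5.0, 5.1], [1.0, 1.2, 5.0, 5.1, 9.0]]
```

The merge history is exactly the bottom-up sequence a dendrogram records.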
Dendrogram
Shows how clusters are merged hierarchically
Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster
DIANA (DIvisive ANAlysis)
Initially, all objects are in one cluster
Clusters are split step by step until each cluster contains only one object
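One DIANA split can be sketched as follows. This assumes the classic splinter-group heuristic (the object with the largest average dissimilarity to the rest seeds a new group, and objects closer on average to the splinter group move over); the sample points are invented:

```python
def diana_split(cluster, d):
    """One DIANA split: the object with the largest average distance to
    the rest seeds a splinter group; objects whose average distance to
    the splinter group is smaller than to the remainder move over."""
    avg = lambda p, grp: sum(d(p, q) for q in grp) / len(grp) if grp else 0.0
    seed = max(cluster, key=lambda p: avg(p, [q for q in cluster if q != p]))
    splinter = [seed]
    rest = [p for p in cluster if p != seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for p in list(rest):
            others = [q for q in rest if q != p]
            if avg(p, splinter) < avg(p, others):
                rest.remove(p)
                splinter.append(p)
                moved = True
    return splinter, rest

euclid = lambda p, q: abs(p - q)
print(diana_split([1.0, 1.2, 1.1, 8.0, 8.2], euclid))
# → ([8.2, 8.0], [1.0, 1.2, 1.1])
```

Applying this split recursively to the larger piece until every cluster is a singleton yields the full top-down hierarchy.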
[Figure: three scatter plots (axes 0 to 10) showing a single cluster being split step by step into smaller clusters.]
Distance Measures
Minimum distance: d_min(C_p, C_q) = min_{i ∈ C_p, j ∈ C_q} d(i, j)
Maximum distance: d_max(C_p, C_q) = max_{i ∈ C_p, j ∈ C_q} d(i, j)
Mean distance: d_mean(C_p, C_q) = d(m_p, m_q)
Average distance: d_avg(C_p, C_q) = (1 / (n_p n_q)) Σ_{i ∈ C_p} Σ_{j ∈ C_q} d(i, j)
m: mean of a cluster; C: a cluster; n: the number of objects in a cluster
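The four measures translate directly into code. A small sketch, using 1-D points and absolute difference as d (both my choices for illustration, not from the slides):

```python
def d(p, q):
    # Base distance between two points (1-D absolute difference here).
    return abs(p - q)

def d_min(Cp, Cq):
    # Minimum distance: closest pair across the two clusters.
    return min(d(p, q) for p in Cp for q in Cq)

def d_max(Cp, Cq):
    # Maximum distance: farthest pair across the two clusters.
    return max(d(p, q) for p in Cp for q in Cq)

def d_mean(Cp, Cq):
    # Mean distance: distance between the cluster means m_p and m_q.
    return d(sum(Cp) / len(Cp), sum(Cq) / len(Cq))

def d_avg(Cp, Cq):
    # Average distance: average over all n_p * n_q cross-cluster pairs.
    return sum(d(p, q) for p in Cp for q in Cq) / (len(Cp) * len(Cq))

Cp, Cq = [1.0, 2.0], [4.0, 6.0]
print(d_min(Cp, Cq), d_max(Cp, Cq), d_mean(Cp, Cq), d_avg(Cp, Cq))
# → 2.0 5.0 3.5 3.5
```

d_min is the single-link measure used by AGNES; d_max gives complete-link behavior.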
Challenges of Hierarchical Clustering Methods
Hard to choose merge/split points
Merging/splitting can never be undone, so these decisions are critical
Do not scale well: O(n²) time. What is the bottleneck when the data can't fit in memory?
Remedy: integrate hierarchical clustering with other techniques
BIRCH, CURE, CHAMELEON, ROCK
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
Clustering the objects then reduces to clustering the leaf nodes of the CF tree
Clustering Feature vector: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the points, Σ_{i=1}^{N} X_i
SS: square sum of the points, Σ_{i=1}^{N} X_i²
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
[Figure: scatter plot (axes 0 to 10) of the five points]
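The example CF can be checked mechanically. A minimal sketch (the function name is mine):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): the count, the per-dimension linear sum, and
    the per-dimension sum of squares of the points in a subcluster."""
    N = len(points)
    dims = len(points[0])
    LS = tuple(sum(p[k] for p in points) for k in range(dims))
    SS = tuple(sum(p[k] ** 2 for p in points) for k in range(dims))
    return (N, LS, SS)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # → (5, (16, 30), (54, 190))
```

From these three moments alone one can recover the centroid (LS/N) and the radius/diameter of the subcluster, which is why BIRCH never needs to keep the raw points.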
CF-tree in BIRCH
Clustering feature: summarizes the statistics for a subcluster: the 0th, 1st, and 2nd moments of the subcluster
Registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
A nonleaf node in the tree has descendants, or "children"
A nonleaf node stores the sums of the CFs of its children
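Storing sums works because CF vectors are additive: a parent entry is the componentwise sum of its children's CFs. A sketch (the two child CFs are invented example values):

```python
def merge_cf(cf1, cf2):
    """CF additivity: the CF of a merged subcluster is the componentwise
    sum of the two CFs, so a nonleaf node can summarize its children
    without ever touching the raw points."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf_a = (2, (5, 10), (13, 52))    # a subcluster of 2 points
cf_b = (3, (11, 20), (41, 138))  # a subcluster of 3 points
print(merge_cf(cf_a, cf_b))      # → (5, (16, 30), (54, 190))
```

The same addition is what happens when a new point is absorbed into a leaf entry: the point's own CF (1, X, X²) is added in.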
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and the nonleaf nodes hold entries (CF_i, child_i); the leaf nodes hold CF entries and are chained together by prev/next pointers.]
Parameters of a CF-tree
Branching factor: the maximum number of children of a nonleaf node
Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
BIRCH Clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
Linear scalability
Good clustering with a single scan
Quality can be further improved by a few additional scans
Can handle only numeric data
Sensitive to the order of the data records
Drawbacks of Square Error Based Methods
One representative per cluster: good only for convex-shaped clusters of similar size and density
The number of clusters, k, is an input parameter: good only if k can be reasonably estimated
CURE: the Ideas
Each cluster has c representatives
Choose c well-scattered points in the cluster, then shrink them towards the mean of the cluster by a fraction α
The representatives capture the physical shape and geometry of the cluster
Merge the closest two clusters
Distance between two clusters: the distance between their two closest representatives
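The shrinking step can be sketched as follows. The cluster mean, shrink fraction, and sample points are illustrative only; CURE additionally has its own rule for picking the well-scattered representatives, which this sketch takes as given:

```python
def shrink_representatives(reps, cluster_mean, alpha):
    """CURE: move each well-scattered representative toward the cluster
    mean by a fraction alpha. alpha = 0 keeps the scattered points as
    they are; alpha = 1 collapses them all onto the mean (degenerating
    to one representative per cluster, as in square-error methods)."""
    dims = len(cluster_mean)
    return [tuple(r[k] + alpha * (cluster_mean[k] - r[k]) for k in range(dims))
            for r in reps]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
print(shrink_representatives(reps, (2.0, 2.0), 0.5))
# → [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]
```

Shrinking dampens the influence of outliers chosen as representatives, since points far from the mean move the most in absolute terms.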