The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Clustering
COMP 790-90 Research SeminarBCB 713 Module
Spring 2011Wei Wang
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Hierarchical Clustering
Group data objects into a tree of clusters
[Figure: agglomerative clustering (AGNES) merges the singleton clusters a, b, c, d, e bottom-up (a b, d e, c d e, a b c d e) over Steps 0 to 4; divisive clustering (DIANA) performs the same splits top-down, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Initially, each object forms its own cluster; clusters are then merged step by step until all objects form a single cluster
Single-link approach: each cluster is represented by all of its objects, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
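As a concrete illustration of the single-link merging loop above, here is a minimal sketch of AGNES in Python. It is not the course's code; the point set and the absolute-difference metric are invented for the example, and the naive all-pairs search is O(n^3):

```python
from itertools import combinations

def single_link(c1, c2, d):
    # Single-link: cluster similarity is the distance of the closest
    # pair of points drawn from the two different clusters.
    return min(d(p, q) for p in c1 for q in c2)

def agnes(points, d):
    """Naive AGNES: start with singleton clusters and repeatedly merge
    the closest pair until one cluster remains. Returns the sequence of
    merged clusters (the merge history)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]],
                                              clusters[ij[1]], d))
        merged = sorted(clusters[i] + clusters[j])
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

euclid = lambda p, q: abs(p - q)
print(agnes([1.0, 1.2, 5.0, 5.1, 9.0], euclid))
# → [[5.0, 5.1], [1.0, 1.2], [1.0, 1.2, 5.0, 5.1], [1.0, 1.2, 5.0, 5.1, 9.0]]
```

The merge history is exactly the bottom-up sequence a dendrogram records.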
Dendrogram
Shows how clusters are merged hierarchically
Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster
DIANA (DIvisive ANAlysis)
Initially, all objects are in one cluster
Clusters are split step by step until each cluster contains only one object
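One DIANA split can be sketched as follows. This assumes the classic splinter-group heuristic (the object with the largest average dissimilarity to the rest seeds a new group, and objects closer on average to the splinter group move over); the sample points are invented:

```python
def diana_split(cluster, d):
    """One DIANA split: the object with the largest average distance to
    the rest seeds a splinter group; objects whose average distance to
    the splinter group is smaller than to the remainder move over."""
    avg = lambda p, grp: sum(d(p, q) for q in grp) / len(grp) if grp else 0.0
    seed = max(cluster, key=lambda p: avg(p, [q for q in cluster if q != p]))
    splinter = [seed]
    rest = [p for p in cluster if p != seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for p in list(rest):
            others = [q for q in rest if q != p]
            if avg(p, splinter) < avg(p, others):
                rest.remove(p)
                splinter.append(p)
                moved = True
    return splinter, rest

euclid = lambda p, q: abs(p - q)
print(diana_split([1.0, 1.2, 1.1, 8.0, 8.2], euclid))
# → ([8.2, 8.0], [1.0, 1.2, 1.1])
```

Applying this split recursively to the larger piece until every cluster is a singleton yields the full top-down hierarchy.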
[Figure: three scatter plots (axes 0 to 10) showing a single cluster being split step by step into smaller clusters.]
Distance Measures
Minimum distance: d_min(C_p, C_q) = min_{i ∈ C_p, j ∈ C_q} d(i, j)
Maximum distance: d_max(C_p, C_q) = max_{i ∈ C_p, j ∈ C_q} d(i, j)
Mean distance: d_mean(C_p, C_q) = d(m_p, m_q)
Average distance: d_avg(C_p, C_q) = (1 / (n_p n_q)) Σ_{i ∈ C_p} Σ_{j ∈ C_q} d(i, j)
m: mean of a cluster; C: a cluster; n: the number of objects in a cluster
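The four measures translate directly into code. A small sketch, using 1-D points and absolute difference as d (both my choices for illustration, not from the slides):

```python
def d(p, q):
    # Base distance between two points (1-D absolute difference here).
    return abs(p - q)

def d_min(Cp, Cq):
    # Minimum distance: closest pair across the two clusters.
    return min(d(p, q) for p in Cp for q in Cq)

def d_max(Cp, Cq):
    # Maximum distance: farthest pair across the two clusters.
    return max(d(p, q) for p in Cp for q in Cq)

def d_mean(Cp, Cq):
    # Mean distance: distance between the cluster means m_p and m_q.
    return d(sum(Cp) / len(Cp), sum(Cq) / len(Cq))

def d_avg(Cp, Cq):
    # Average distance: average over all n_p * n_q cross-cluster pairs.
    return sum(d(p, q) for p in Cp for q in Cq) / (len(Cp) * len(Cq))

Cp, Cq = [1.0, 2.0], [4.0, 6.0]
print(d_min(Cp, Cq), d_max(Cp, Cq), d_mean(Cp, Cq), d_avg(Cp, Cq))
# → 2.0 5.0 3.5 3.5
```

d_min is the single-link measure used by AGNES; d_max gives complete-link behavior.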
Challenges of Hierarchical Clustering Methods
Hard to choose merge/split points
Merging/splitting can never be undone, so these decisions are critical
Do not scale well: O(n²) time. What is the bottleneck when the data can't fit in memory?
Remedy: integrate hierarchical clustering with other techniques
BIRCH, CURE, CHAMELEON, ROCK
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
Clustering the objects then reduces to clustering the leaf nodes of the CF tree
Clustering Feature vector: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the points, Σ_{i=1}^{N} X_i
SS: square sum of the points, Σ_{i=1}^{N} X_i²
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
[Figure: scatter plot (axes 0 to 10) of the five points]
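The example CF can be checked mechanically. A minimal sketch (the function name is mine):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): the count, the per-dimension linear sum, and
    the per-dimension sum of squares of the points in a subcluster."""
    N = len(points)
    dims = len(points[0])
    LS = tuple(sum(p[k] for p in points) for k in range(dims))
    SS = tuple(sum(p[k] ** 2 for p in points) for k in range(dims))
    return (N, LS, SS)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # → (5, (16, 30), (54, 190))
```

From these three moments alone one can recover the centroid (LS/N) and the radius/diameter of the subcluster, which is why BIRCH never needs to keep the raw points.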
CF-tree in BIRCH
Clustering feature: summarizes the statistics for a subcluster: the 0th, 1st, and 2nd moments of the subcluster
Registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
A nonleaf node in the tree has descendants, or "children"
A nonleaf node stores the sums of the CFs of its children
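Storing sums works because CF vectors are additive: a parent entry is the componentwise sum of its children's CFs. A sketch (the two child CFs are invented example values):

```python
def merge_cf(cf1, cf2):
    """CF additivity: the CF of a merged subcluster is the componentwise
    sum of the two CFs, so a nonleaf node can summarize its children
    without ever touching the raw points."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf_a = (2, (5, 10), (13, 52))    # a subcluster of 2 points
cf_b = (3, (11, 20), (41, 138))  # a subcluster of 3 points
print(merge_cf(cf_a, cf_b))      # → (5, (16, 30), (54, 190))
```

The same addition is what happens when a new point is absorbed into a leaf entry: the point's own CF (1, X, X²) is added in.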
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and the nonleaf nodes hold entries (CF_i, child_i); the leaf nodes hold CF entries and are chained together by prev/next pointers.]
Parameters of a CF-tree
Branching factor: the maximum number of children of a nonleaf node
Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
BIRCH Clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
Linear scalability
Good clustering with a single scan
Quality can be further improved by a few additional scans
Can handle only numeric data
Sensitive to the order of the data records
Drawbacks of Square Error Based Methods
One representative per cluster: good only for convex-shaped clusters of similar size and density
The number of clusters, k, is an input parameter: good only if k can be reasonably estimated
CURE: the Ideas
Each cluster has c representatives
Choose c well-scattered points in the cluster, then shrink them towards the mean of the cluster by a fraction α
The representatives capture the physical shape and geometry of the cluster
Merge the closest two clusters
Distance between two clusters: the distance between their two closest representatives
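The shrinking step can be sketched as follows. The cluster mean, shrink fraction, and sample points are illustrative only; CURE additionally has its own rule for picking the well-scattered representatives, which this sketch takes as given:

```python
def shrink_representatives(reps, cluster_mean, alpha):
    """CURE: move each well-scattered representative toward the cluster
    mean by a fraction alpha. alpha = 0 keeps the scattered points as
    they are; alpha = 1 collapses them all onto the mean (degenerating
    to one representative per cluster, as in square-error methods)."""
    dims = len(cluster_mean)
    return [tuple(r[k] + alpha * (cluster_mean[k] - r[k]) for k in range(dims))
            for r in reps]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
print(shrink_representatives(reps, (2.0, 2.0), 0.5))
# → [(1.0, 1.0), (3.0, 1.0), (2.0, 4.0)]
```

Shrinking dampens the influence of outliers chosen as representatives, since points far from the mean move the most in absolute terms.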