Clustering… in General
In vector space, clusters are vectors found within a threshold distance of a cluster vector, with different techniques for determining the cluster vector and the threshold.
Clustering is unsupervised pattern classification. Unsupervised means there is no correct answer or feedback. Patterns are typically samples of feature vectors or matrices. Classification means collecting the samples into groups of similar members.
Clustering Decisions
- Pattern representation: feature selection (e.g., stop word removal, stemming), number of categories
- Pattern proximity: distance measure on pairs of patterns
- Grouping: characteristics of clusters (e.g., fuzzy, hierarchical)
Clustering algorithms embody different assumptions about these decisions and the form of clusters.
Formal Definitions
A feature vector x is a single datum of d measurements. Hard clustering techniques assign a single class label to each x; cluster memberships are mutually exclusive. Fuzzy clustering techniques assign a fractional degree of membership in each label to each x.
Proximity Measures
Generally, use Euclidean distance or mean squared distance. In IR, use a similarity measure from retrieval (e.g., the cosine measure on TFIDF vectors).
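As a quick illustration (a minimal sketch, not from the original slides; numpy assumed, and the two document vectors are hypothetical):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical TFIDF vectors for two documents
doc1 = np.array([0.0, 0.3, 0.7, 0.1])
doc2 = np.array([0.2, 0.0, 0.5, 0.4])
print(cosine_similarity(doc1, doc2))   # higher value = more similar
print(np.linalg.norm(doc1 - doc2))     # Euclidean distance, for comparison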
[Jain, Murty & Flynn] Taxonomy of Clustering
- Hierarchical: single-link, complete-link, HAC
- Partitional: square error (k-means), graph theoretic, mixture resolving (expectation maximization), mode seeking
Clustering Issues
- Agglomerative: begin with each sample in its own cluster and merge
- Divisive: begin with a single cluster and split
- Hard: mutually exclusive cluster membership
- Fuzzy: degrees of membership in clusters
- Deterministic: same clustering on every run
- Stochastic: randomized; results can vary across runs
- Incremental: samples may be added to clusters over time
- Batch: clusters created over the entire sample space
Hierarchical Algorithms
Produce a hierarchy of classes (a taxonomy), from singleton clusters up to a single all-inclusive cluster. Select a level for extracting the cluster set. The representation is a dendrogram.
[Dendrogram: documents D1–D4 begin as singleton clusters C1–C4; C1 and C3 merge at similarity 0.99 into C1,3; C2 joins at 0.29 (C1,3,2); C4 joins at 0.00 (C1,3,2,4).]
Complete-Link Revisited
Used to create a statistical thesaurus. Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample.
2. Find the two clusters with the lowest distance.
3. Merge the two clusters and add the merge to the hierarchy.
4. Repeat from 2 until a termination criterion is met or all clusters have merged.
Single-Link
Like complete-link, except it uses the minimum of the distances between all pairs of samples in the two clusters (complete-link uses the maximum). Single-link has a chaining effect that produces elongated clusters, but it can construct more complex shapes.
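A sketch of both linkages using SciPy (assuming SciPy is available; the points are the eight samples from the proximity-matrix example below):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                   [21, 27], [23, 32], [29, 26], [33, 21]])

# 'single' merges on the minimum pairwise distance between clusters,
# 'complete' on the maximum
for method in ("single", "complete"):
    Z = linkage(points, method=method, metric="euclidean")
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)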
Example: Plot
[Scatter plot of the sample points; both axes run from 0 to 50.]
Example: Proximity Matrix
Euclidean distances between the sample points (upper triangle shown; the matrix is symmetric):

        21,15  26,25  29,22  31,15  21,27  23,32  29,26  33,21
21,15     0    11.2   10.6   10.0   12.0   17.1   13.6   13.4
26,25            0     4.2   11.1    5.4    7.6    3.2    8.1
29,22                   0     7.3    9.4   11.7    4.0    4.1
31,15                          0    15.6   18.8   11.2    6.3
21,27                                 0     5.4    8.1   13.4
23,32                                        0     8.5   14.9
29,26                                               0     6.4
33,21                                                      0
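The matrix above can be reproduced with a short numpy computation (a sketch; the coordinates are taken from the row headers):

import numpy as np

pts = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                [21, 27], [23, 32], [29, 26], [33, 21]])
# pairwise Euclidean distances, rounded to one decimal as in the table
diff = pts[:, None, :] - pts[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 1))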
Complete-Link Solution
[Dendrogram over the 16 example points (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with internal merge nodes labeled C1–C15.]
Single-Link Solution
[Dendrogram over the same 16 points; the merge order (nodes C1–C15) differs from the complete-link solution, reflecting single-link's chaining.]
Hierarchical Agglomerative Clustering (HAC)
Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters.
2. Merge the most similar pair of clusters and update the proximity matrix.
3. Repeat 2 until all clusters are merged.
The variants differ in how the proximity matrix is updated, which makes it possible to combine the benefits of the single-link and complete-link algorithms.
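A minimal sketch of this generic loop (plain numpy; combine=max gives complete-link and combine=min gives single-link, and other update rules can be substituted; D can be the proximity matrix computed earlier):

import numpy as np

def hac(D, combine=max):
    # generic HAC over an n x n distance matrix D;
    # `combine` controls how merged-cluster proximities are updated
    D = D.astype(float)
    np.fill_diagonal(D, np.inf)
    active = list(range(len(D)))       # indices of live clusters
    merges = []
    while len(active) > 1:
        # 1. find the closest pair of live clusters
        sub = D[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[i], active[j]
        merges.append((a, b, D[a, b]))
        # 2. update proximities of the merged cluster (kept in slot a)
        for k in active:
            if k not in (a, b):
                D[a, k] = D[k, a] = combine(D[a, k], D[b, k])
        active.remove(b)
    return merges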
HAC for IR
Intra-cluster similarity:

Sim(X) = \sum_{d \in S} \cos(d, c), \qquad c = \frac{1}{|S|} \sum_{d \in S} d

where S is the set of TFIDF vectors for the documents in cluster X, c is the centroid of cluster X, and d is a document.
Proximity is the similarity of all documents to the cluster centroid.
Select the pair of clusters that produces the smallest decrease in similarity, i.e., if merge(X, Y) => Z, choose the pair maximizing Sim(Z) - (Sim(X) + Sim(Y)).
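A minimal sketch of this criterion (numpy assumed; the clusters are hypothetical arrays of TFIDF row vectors):

import numpy as np

def sim(cluster):
    # Sim(X): summed cosine similarity of each document to the centroid
    c = cluster.mean(axis=0)
    c = c / np.linalg.norm(c)
    normed = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    return float((normed @ c).sum())

def merge_score(X, Y):
    # HAC for IR picks the pair with the largest value of this score,
    # i.e., the merge causing the smallest decrease in similarity
    Z = np.vstack([X, Y])
    return sim(Z) - (sim(X) + sim(Y))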
HAC for IR - Alternatives
Centroid similarity: cosine similarity between the centroids of the two clusters:

Sim(X, Y) = \cos(c_X, c_Y), \qquad c_X = \frac{1}{|X|} \sum_{d \in X} d

UPGMA: average pairwise cosine similarity between the two clusters:

Sim(X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{d_1 \in X} \sum_{d_2 \in Y} \cos(d_1, d_2)
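Sketches of both alternatives (numpy assumed; X and Y are hypothetical arrays of TFIDF row vectors):

import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def centroid_sim(X, Y):
    # cosine similarity between the two cluster centroids
    return float(_unit(X.mean(axis=0)) @ _unit(Y.mean(axis=0)))

def upgma_sim(X, Y):
    # average pairwise cosine similarity between the clusters
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float((Xn @ Yn.T).sum() / (len(X) * len(Y)))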
Partitional Algorithms
Produce a set of unrelated clusters. Issues:
- How many clusters are enough?
- How do we search the space of possible partitions?
- What is an appropriate clustering criterion?
k-Means
The number of clusters is set by the user to be k. Non-deterministic. The clustering criterion is squared error:

e^2(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2

where S is the document set, L is a clustering, K is the number of clusters, x_i^{(j)} is the i-th document in the j-th cluster, and c_j is the centroid of the j-th cluster.
k-Means Clustering Algorithm
1. Randomly select k samples as cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute centroids.
4. If the convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to 2.
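A minimal k-means sketch following these steps (numpy assumed; the sample points are from the earlier example and k is arbitrary):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly select k samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. assign each pattern to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # 4. converged: no change in cluster composition
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. recompute centroids of non-empty clusters
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def squared_error(X, labels, centroids):
    # the criterion e^2(S, L) from the previous slide
    return float(((X - centroids[labels]) ** 2).sum())

points = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                   [21, 27], [23, 32], [29, 26], [33, 21]])
labels, cents = kmeans(points, k=3)
print(labels, squared_error(points, labels, cents))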
Example: k-Means Solutions
[Scatter plot of k-means solutions on the same sample points; both axes run from 0 to 50.]
k-Means Sensitivity to Initialization
[Figure: seven points A–G clustered with K=3. The red solution started with centroids A, D, F; the yellow solution started with A, B, C.]
k-Means for IR
- Update centroids incrementally.
- Calculate the centroid as with the hierarchical methods.
- Can be refined into a divisive hierarchical method: start with a single cluster and repeatedly split with k-means until the k clusters with the highest summed similarities are formed (bisecting k-means).
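A sketch of bisecting k-means built on the kmeans function above (splitting the largest cluster is an assumption here; other selection rules, such as splitting the least cohesive cluster, are also used):

import numpy as np

def bisecting_kmeans(X, k, seed=0):
    # start with every point in one cluster; repeatedly split one
    # cluster in two with k-means until k clusters exist
    clusters = [np.arange(len(X))]          # each entry indexes into X
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                # split the largest cluster
        if len(idx) < 2:                    # nothing left to split
            break
        labels, _ = kmeans(X[idx], 2, seed=seed)
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters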
Other Types of Clustering Algorithms
- Graph theoretic: construct the minimal spanning tree and delete the edges with the largest lengths.
- Expectation Maximization (EM): assume the clusters are drawn from distributions; use maximum likelihood to estimate the parameters of the distributions.
- Nearest neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
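A sketch of the graph-theoretic approach (assuming SciPy; the sample points are a hypothetical subset of the earlier example):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(points, n_clusters):
    # build the MST over pairwise distances, delete the
    # n_clusters - 1 longest edges, and return the components
    D = squareform(pdist(points))
    mst = minimum_spanning_tree(D).toarray()
    edges = np.sort(mst[mst > 0])
    if n_clusters > 1:
        threshold = edges[-(n_clusters - 1)]
        mst[mst >= threshold] = 0           # remove the longest edges
    _, labels = connected_components(csgraph=mst, directed=False)
    return labels

pts = np.array([[1, 28], [4, 9], [9, 16], [13, 18], [35, 35], [42, 45]])
print(mst_clusters(pts, 2))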
Comparison of Clustering Algorithms [Steinbach et al.]
- Implemented 3 versions of HAC and 2 versions of k-means.
- Compared performance on documents hand-labelled as relevant to one of a set of classes.
- Used well-known data sets (TREC).
- Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better when considered over many runs.
M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 2000.
Evaluation Metrics 1
Evaluation: how do we measure cluster quality? Entropy:

E_j = -\sum_i p_{ij} \log(p_{ij}), \qquad E_{CS} = \sum_{j=1}^{m} \frac{n_j \cdot E_j}{n}

where p_ij is the probability that a member of cluster j belongs to class i, n_j is the size of cluster j, m is the number of clusters, n is the number of docs, and CS is a clustering solution.
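A sketch of the entropy computation (numpy assumed; the class and cluster label arrays are hypothetical integer ids):

import numpy as np

def entropy_of_solution(classes, clusters):
    # E_CS = sum_j (n_j / n) * E_j with E_j = -sum_i p_ij log p_ij
    n = len(classes)
    total = 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        n_j = len(members)
        p = np.bincount(members) / n_j      # p_ij over classes i
        p = p[p > 0]                        # skip classes absent from j
        E_j = -(p * np.log(p)).sum()
        total += n_j * E_j / n
    return total

classes  = np.array([0, 0, 1, 1, 2])   # hand-labelled classes
clusters = np.array([0, 0, 0, 1, 1])   # a clustering solution
print(entropy_of_solution(classes, clusters))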
Comparison Measure 2
F measure: combines precision and recall. Treat each cluster as the result of a query and each class as the relevant set of docs.

Recall(i, j) = n_{ij} / n_i, \qquad Precision(i, j) = n_{ij} / n_j

F(i, j) = \frac{2 \cdot Recall(i, j) \cdot Precision(i, j)}{Precision(i, j) + Recall(i, j)}

F = \sum_i \frac{n_i}{n} \max_j [F(i, j)]

where n_ij is the number of members of class i in cluster j, n_j is the number in cluster j, n_i is the number in class i, and n is the number of docs.
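A sketch of the overall F measure (numpy assumed; the same hypothetical label arrays as in the entropy example):

import numpy as np

def f_measure(classes, clusters):
    # F = sum_i (n_i / n) * max_j F(i, j)
    n = len(classes)
    total = 0.0
    for i in np.unique(classes):
        in_class = classes == i
        n_i = in_class.sum()
        best = 0.0
        for j in np.unique(clusters):
            in_cluster = clusters == j
            n_ij = (in_class & in_cluster).sum()
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / in_cluster.sum()
            best = max(best, 2 * recall * precision / (recall + precision))
        total += n_i / n * best
    return total

classes  = np.array([0, 0, 1, 1, 2])
clusters = np.array([0, 0, 0, 1, 1])
print(f_measure(classes, clusters))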