Clustering… in General
In vector space, clusters are vectors found within a threshold distance of a cluster vector, with different techniques for determining the cluster vector and the threshold.
Clustering is unsupervised pattern classification. Unsupervised means there is no correct answer or feedback. Patterns are typically samples of feature vectors or matrices. Classification means collecting the samples into groups of similar members.
Clustering Decisions
- Pattern representation: feature selection (e.g., stop word removal, stemming), number of categories
- Pattern proximity: distance measure on pairs of patterns
- Grouping: characteristics of clusters (e.g., fuzzy, hierarchical)
Clustering algorithms embody different assumptions about these decisions and the form of clusters.
Formal Definitions
A feature vector x is a single datum of d measurements. Hard clustering techniques assign a single class label to each x; cluster memberships are mutually exclusive. Fuzzy clustering techniques assign a fractional degree of membership in each label to each x.
Proximity Measures
Generally, use Euclidean distance or mean squared distance. In IR, use a similarity measure from retrieval (e.g., the cosine measure on TFIDF vectors).
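As a quick illustration (a minimal sketch, not from the original slides; numpy assumed, and the two document vectors are hypothetical):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical TFIDF vectors for two documents
doc1 = np.array([0.0, 0.3, 0.7, 0.1])
doc2 = np.array([0.2, 0.0, 0.5, 0.4])
print(cosine_similarity(doc1, doc2))   # higher value = more similar
print(np.linalg.norm(doc1 - doc2))     # Euclidean distance, for comparison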
[Jain, Murty & Flynn] Taxonomy of Clustering
- Hierarchical: single-link, complete-link, HAC
- Partitional: square error (k-means), graph theoretic, mixture resolving (expectation maximization), mode seeking
Clustering Issues
- Agglomerative: begin with each sample in its own cluster and merge
- Divisive: begin with a single cluster and split
- Hard: mutually exclusive cluster membership
- Fuzzy: degrees of membership in clusters
- Deterministic: same clustering on every run
- Stochastic: randomized; results can vary across runs
- Incremental: samples may be added to clusters over time
- Batch: clusters created over the entire sample space
Hierarchical Algorithms
Produce a hierarchy of classes (a taxonomy), from singleton clusters up to a single all-inclusive cluster. Select a level for extracting the cluster set. The representation is a dendrogram.
[Dendrogram: documents D1–D4 begin as singleton clusters C1–C4; C1 and C3 merge at similarity 0.99 into C1,3; C2 joins at 0.29 (C1,3,2); C4 joins at 0.00 (C1,3,2,4).]
Complete-Link Revisited
Used to create a statistical thesaurus. Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample.
2. Find the two clusters with the lowest distance.
3. Merge the two clusters and add the merge to the hierarchy.
4. Repeat from 2 until a termination criterion is met or all clusters have merged.
Single-Link
Like complete-link, except it uses the minimum of the distances between all pairs of samples in the two clusters (complete-link uses the maximum). Single-link has a chaining effect that produces elongated clusters, but it can construct more complex shapes.
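A sketch of both linkages using SciPy (assuming SciPy is available; the points are the eight samples from the proximity-matrix example below):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                   [21, 27], [23, 32], [29, 26], [33, 21]])

# 'single' merges on the minimum pairwise distance between clusters,
# 'complete' on the maximum
for method in ("single", "complete"):
    Z = linkage(points, method=method, metric="euclidean")
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)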
Example: Plot
[Scatter plot of the sample points; both axes run from 0 to 50.]
Example: Proximity Matrix
Euclidean distances between the sample points (upper triangle shown; the matrix is symmetric):

        21,15  26,25  29,22  31,15  21,27  23,32  29,26  33,21
21,15     0    11.2   10.6   10.0   12.0   17.1   13.6   13.4
26,25            0     4.2   11.1    5.4    7.6    3.2    8.1
29,22                   0     7.3    9.4   11.7    4.0    4.1
31,15                          0    15.6   18.8   11.2    6.3
21,27                                 0     5.4    8.1   13.4
23,32                                        0     8.5   14.9
29,26                                               0     6.4
33,21                                                      0
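The matrix above can be reproduced with a short numpy computation (a sketch; the coordinates are taken from the row headers):

import numpy as np

pts = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                [21, 27], [23, 32], [29, 26], [33, 21]])
# pairwise Euclidean distances, rounded to one decimal as in the table
diff = pts[:, None, :] - pts[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 1))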
Complete-Link Solution
[Dendrogram over the 16 example points (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with internal merge nodes labeled C1–C15.]
Single-Link Solution
[Dendrogram over the same 16 points; the merge order (nodes C1–C15) differs from the complete-link solution, reflecting single-link's chaining.]
Hierarchical Agglomerative Clustering (HAC)
Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters.
2. Merge the most similar pair of clusters and update the proximity matrix.
3. Repeat 2 until all clusters are merged.
The variants differ in how the proximity matrix is updated, which makes it possible to combine the benefits of the single-link and complete-link algorithms.
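A minimal sketch of this generic loop (plain numpy; combine=max gives complete-link and combine=min gives single-link, and other update rules can be substituted; D can be the proximity matrix computed earlier):

import numpy as np

def hac(D, combine=max):
    # generic HAC over an n x n distance matrix D;
    # `combine` controls how merged-cluster proximities are updated
    D = D.astype(float)
    np.fill_diagonal(D, np.inf)
    active = list(range(len(D)))       # indices of live clusters
    merges = []
    while len(active) > 1:
        # 1. find the closest pair of live clusters
        sub = D[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[i], active[j]
        merges.append((a, b, D[a, b]))
        # 2. update proximities of the merged cluster (kept in slot a)
        for k in active:
            if k not in (a, b):
                D[a, k] = D[k, a] = combine(D[a, k], D[b, k])
        active.remove(b)
    return merges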
HAC for IR
Intra-cluster similarity:

Sim(X) = \sum_{d \in S} \cos(d, c), \qquad c = \frac{1}{|S|} \sum_{d \in S} d

where S is the set of TFIDF vectors for the documents in cluster X, c is the centroid of cluster X, and d is a document.
Proximity is the similarity of all documents to the cluster centroid.
Select the pair of clusters that produces the smallest decrease in similarity, i.e., if merge(X, Y) => Z, choose the pair maximizing Sim(Z) - (Sim(X) + Sim(Y)).
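A minimal sketch of this criterion (numpy assumed; the clusters are hypothetical arrays of TFIDF row vectors):

import numpy as np

def sim(cluster):
    # Sim(X): summed cosine similarity of each document to the centroid
    c = cluster.mean(axis=0)
    c = c / np.linalg.norm(c)
    normed = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    return float((normed @ c).sum())

def merge_score(X, Y):
    # HAC for IR picks the pair with the largest value of this score,
    # i.e., the merge causing the smallest decrease in similarity
    Z = np.vstack([X, Y])
    return sim(Z) - (sim(X) + sim(Y))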
HAC for IR - Alternatives
Centroid similarity: cosine similarity between the centroids of the two clusters:

Sim(X, Y) = \cos(c_X, c_Y), \qquad c_X = \frac{1}{|X|} \sum_{d \in X} d

UPGMA: average pairwise cosine similarity between the two clusters:

Sim(X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{d_1 \in X} \sum_{d_2 \in Y} \cos(d_1, d_2)
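Sketches of both alternatives (numpy assumed; X and Y are hypothetical arrays of TFIDF row vectors):

import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def centroid_sim(X, Y):
    # cosine similarity between the two cluster centroids
    return float(_unit(X.mean(axis=0)) @ _unit(Y.mean(axis=0)))

def upgma_sim(X, Y):
    # average pairwise cosine similarity between the clusters
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float((Xn @ Yn.T).sum() / (len(X) * len(Y)))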
Partitional Algorithms
Produce a set of unrelated clusters. Issues:
- How many clusters are enough?
- How do we search the space of possible partitions?
- What is an appropriate clustering criterion?
k-Means
The number of clusters is set by the user to be k. Non-deterministic. The clustering criterion is squared error:

e^2(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2

where S is the document set, L is a clustering, K is the number of clusters, x_i^{(j)} is the i-th document in the j-th cluster, and c_j is the centroid of the j-th cluster.
k-Means Clustering Algorithm
1. Randomly select k samples as cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute centroids.
4. If the convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to 2.
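A minimal k-means sketch following these steps (numpy assumed; the sample points are from the earlier example and k is arbitrary):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly select k samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. assign each pattern to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # 4. converged: no change in cluster composition
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. recompute centroids of non-empty clusters
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def squared_error(X, labels, centroids):
    # the criterion e^2(S, L) from the previous slide
    return float(((X - centroids[labels]) ** 2).sum())

points = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                   [21, 27], [23, 32], [29, 26], [33, 21]])
labels, cents = kmeans(points, k=3)
print(labels, squared_error(points, labels, cents))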
Example: k-Means Solutions
[Scatter plot of k-means solutions on the same sample points; both axes run from 0 to 50.]
k-Means Sensitivity to Initialization
[Figure: seven points A–G clustered with K=3. The red solution started with centroids A, D, F; the yellow solution started with A, B, C.]
k-Means for IR
- Update centroids incrementally.
- Calculate the centroid as with the hierarchical methods.
- Can be refined into a divisive hierarchical method: start with a single cluster and repeatedly split with k-means until the k clusters with the highest summed similarities are formed (bisecting k-means).
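A sketch of bisecting k-means built on the kmeans function above (splitting the largest cluster is an assumption here; other selection rules, such as splitting the least cohesive cluster, are also used):

import numpy as np

def bisecting_kmeans(X, k, seed=0):
    # start with every point in one cluster; repeatedly split one
    # cluster in two with k-means until k clusters exist
    clusters = [np.arange(len(X))]          # each entry indexes into X
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                # split the largest cluster
        if len(idx) < 2:                    # nothing left to split
            break
        labels, _ = kmeans(X[idx], 2, seed=seed)
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters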
Other Types of Clustering Algorithms
- Graph theoretic: construct the minimal spanning tree and delete the edges with the largest lengths.
- Expectation Maximization (EM): assume the clusters are drawn from distributions; use maximum likelihood to estimate the parameters of the distributions.
- Nearest neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
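A sketch of the graph-theoretic approach (assuming SciPy; the sample points are a hypothetical subset of the earlier example):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(points, n_clusters):
    # build the MST over pairwise distances, delete the
    # n_clusters - 1 longest edges, and return the components
    D = squareform(pdist(points))
    mst = minimum_spanning_tree(D).toarray()
    edges = np.sort(mst[mst > 0])
    if n_clusters > 1:
        threshold = edges[-(n_clusters - 1)]
        mst[mst >= threshold] = 0           # remove the longest edges
    _, labels = connected_components(csgraph=mst, directed=False)
    return labels

pts = np.array([[1, 28], [4, 9], [9, 16], [13, 18], [35, 35], [42, 45]])
print(mst_clusters(pts, 2))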
Comparison of Clustering Algorithms [Steinbach et al.]
- Implemented 3 versions of HAC and 2 versions of k-means.
- Compared performance on documents hand-labelled as relevant to one of a set of classes.
- Used well-known data sets (TREC).
- Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better when considered over many runs.
M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 2000.
Evaluation Metrics 1
Evaluation: how do we measure cluster quality? Entropy:

E_j = -\sum_i p_{ij} \log(p_{ij}), \qquad E_{CS} = \sum_{j=1}^{m} \frac{n_j \cdot E_j}{n}

where p_ij is the probability that a member of cluster j belongs to class i, n_j is the size of cluster j, m is the number of clusters, n is the number of docs, and CS is a clustering solution.
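A sketch of the entropy computation (numpy assumed; the class and cluster label arrays are hypothetical integer ids):

import numpy as np

def entropy_of_solution(classes, clusters):
    # E_CS = sum_j (n_j / n) * E_j with E_j = -sum_i p_ij log p_ij
    n = len(classes)
    total = 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        n_j = len(members)
        p = np.bincount(members) / n_j      # p_ij over classes i
        p = p[p > 0]                        # skip classes absent from j
        E_j = -(p * np.log(p)).sum()
        total += n_j * E_j / n
    return total

classes  = np.array([0, 0, 1, 1, 2])   # hand-labelled classes
clusters = np.array([0, 0, 0, 1, 1])   # a clustering solution
print(entropy_of_solution(classes, clusters))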
Comparison Measure 2
F measure: combines precision and recall. Treat each cluster as the result of a query and each class as the relevant set of docs.

Recall(i, j) = n_{ij} / n_i, \qquad Precision(i, j) = n_{ij} / n_j

F(i, j) = \frac{2 \cdot Recall(i, j) \cdot Precision(i, j)}{Precision(i, j) + Recall(i, j)}

F = \sum_i \frac{n_i}{n} \max_j [F(i, j)]

where n_ij is the number of members of class i in cluster j, n_j is the number in cluster j, n_i is the number in class i, and n is the number of docs.
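A sketch of the overall F measure (numpy assumed; the same hypothetical label arrays as in the entropy example):

import numpy as np

def f_measure(classes, clusters):
    # F = sum_i (n_i / n) * max_j F(i, j)
    n = len(classes)
    total = 0.0
    for i in np.unique(classes):
        in_class = classes == i
        n_i = in_class.sum()
        best = 0.0
        for j in np.unique(clusters):
            in_cluster = clusters == j
            n_ij = (in_class & in_cluster).sum()
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / in_cluster.sum()
            best = max(best, 2 * recall * precision / (recall + precision))
        total += n_i / n * best
    return total

classes  = np.array([0, 0, 1, 1, 2])
clusters = np.array([0, 0, 0, 1, 1])
print(f_measure(classes, clusters))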