Clustering CE-324: Modern Information Retrieval Sharif University of Technology
M. Soleymani, Fall 2018
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
What is clustering?
} Clustering: grouping a set of objects into clusters of similar objects
} Docs within a cluster should be similar.
} Docs from different clusters should be dissimilar.
} The commonest form of unsupervised learning
} Unsupervised learning: learning from raw data, as opposed to supervised data where a classification of examples is given
} A common and important task that finds many applications in IR and other places
Ch. 16
2
A data set with clear cluster structure
} How would you design an algorithm for finding the three clusters in this case?
Ch. 16
3
Applications of clustering in IR
} For better navigation of search results
} Effective “user recall” will be higher
} Whole corpus analysis/navigation
} Better user interface: search without typing
} For improving recall in search applications
} Better search results (like pseudo RF)
} For speeding up vector space retrieval
} Cluster-based retrieval gives faster search
Sec. 16.1
4
Applications of clustering in IR
5
Search result clustering
6
7
yippy.com – grouping search results
Clustering the collection
8
} Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
} Users may prefer browsing over searching when they are unsure about which terms to use
} Well suited to a collection of news stories
} News reading is not really search, but rather a process of selecting a subset of stories about recent events
Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering
[Figure: a fragment of the Yahoo! directory rooted at www.yahoo.com/Science — top-level categories agriculture, biology, physics, CS, space (and about 30 more), with subcategories such as agronomy, crops, dairy, forestry; botany, cell, evolution; magnetism, relativity; AI, HCI, courses; craft, missions]
9
Google News: automatic clustering gives an effective news presentation metaphor
10
11
To improve efficiency and effectiveness of search system
12
} Improve language modeling: replacing the collection model used for smoothing by a model derived from the doc’s cluster
} Clustering can speed-up search (via an inexact algorithm)
} Clustering can improve recall
For improving search recall
} Cluster hypothesis: Docs in the same cluster behave similarly with respect to relevance to information needs
} Therefore, to improve search recall:
} Cluster docs in the corpus a priori
} When a query matches a doc 𝑑, also return other docs in the cluster containing 𝑑 (sketched below)
} Query car: also return docs containing automobile
} Because clustering grouped together docs containing car with those containing automobile.
Why might this happen?
Sec. 16.1
13
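As an aside (not part of the original slides), here is a minimal Python sketch of the cluster-based recall expansion described above; the data structures `doc_to_cluster` and `cluster_to_docs` are hypothetical placeholders for a precomputed clustering of the corpus.

```python
def expand_with_clusters(matching_docs, doc_to_cluster, cluster_to_docs):
    """Return the matching docs plus all other docs from the clusters they belong to."""
    expanded = set(matching_docs)
    for d in matching_docs:
        expanded.update(cluster_to_docs[doc_to_cluster[d]])
    return expanded

# Toy example: the query "car" matches doc 3; doc 7 (say, about "automobile")
# was clustered together with doc 3, so it is returned as well.
doc_to_cluster = {3: 0, 7: 0, 5: 1}
cluster_to_docs = {0: [3, 7], 1: [5]}
print(expand_with_clusters([3], doc_to_cluster, cluster_to_docs))  # {3, 7}
```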
Issues for clustering
} Representation for clustering
} Doc representation: Vector space? Normalization? (Centroids aren’t length normalized)
} Need a notion of similarity/distance
} How many clusters?
} Fixed a priori? Completely data driven?
} Avoid “trivial” clusters - too large or small
} Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much
Sec. 16.2
14
Notion of similarity/distance
} Ideal: semantic similarity.
} Practical: term-statistical similarity
} We will use cosine similarity.
} For many algorithms, it is easier to think in terms of a distance (rather than similarity)
} We will mostly speak of Euclidean distance, but real implementations use cosine similarity
15
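A small sketch (added here, not from the slides) of the cosine similarity used throughout; docs are represented as sparse term-weight dictionaries, which is just one possible representation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

d1 = {"car": 0.8, "engine": 0.6}
d2 = {"automobile": 0.7, "engine": 0.7}
print(cosine(d1, d2))  # the two docs overlap only on "engine"
```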
Clustering algorithms categorization
} Flat algorithms (e.g., k-means)
} Usually start with a random (partial) partitioning
} Refine it iteratively
} Hierarchical algorithms
} Bottom-up, agglomerative
} Top-down, divisive
16
Hard vs. soft clustering
} Hard clustering: Each doc belongs to exactly one cluster
} More common and easier to do
} Soft clustering: A doc can belong to more than one cluster.
17
Partitioning algorithms
} Construct a partition of 𝑁 docs into 𝐾 clusters
} Given: a set of docs and the number 𝐾
} Find: a partition of the docs into 𝐾 clusters that optimizes the chosen partitioning criterion
} Finding a global optimum is intractable for many clustering objective functions
} Effective heuristic methods: K-means and K-medoids algorithms
18
K-means
} Assumes docs are real-valued vectors $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}$.
} Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:
$\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\boldsymbol{x}^{(i)} \in \mathcal{C}_j} \boldsymbol{x}^{(i)}$
} K-means cost function:
$J(\mathcal{C}) = \sum_{j=1}^{K} \sum_{\boldsymbol{x}^{(i)} \in \mathcal{C}_j} \left\| \boldsymbol{x}^{(i)} - \boldsymbol{\mu}_j \right\|^2$
Sec. 16.4
19
𝒞 = {𝒞1, 𝒞2, … , 𝒞𝐾}
𝒞𝑗: the set of data points assigned to the j-th cluster
K-means algorithm
Select K random points {𝝁1, 𝝁2, …, 𝝁K} as the clusters’ initial centroids.
Until clustering converges (or another stopping criterion holds):
  For each doc $\boldsymbol{x}^{(i)}$: assign $\boldsymbol{x}^{(i)}$ to the cluster $\mathcal{C}_j$ such that $dist(\boldsymbol{x}^{(i)}, \boldsymbol{\mu}_j)$ is minimal.
  For each cluster $\mathcal{C}_j$: $\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\boldsymbol{x}^{(i)} \in \mathcal{C}_j} \boldsymbol{x}^{(i)}$
Sec. 16.4
Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be in terms of similarities)
20
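A runnable sketch of the K-means loop above, assuming dense vectors and Euclidean distance (a simplification; a real IR implementation would use sparse tf-idf vectors and cosine similarity):

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Plain K-means: X is a list of equal-length lists of floats."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, k)]   # K random points as initial centroids
    assign = None
    for _ in range(iters):
        # Reassignment: each doc goes to the cluster with the closest centroid.
        new_assign = [
            min(range(k),
                key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[j])))
            for x in X
        ]
        if new_assign == assign:                      # partition unchanged => converged
            break
        assign = new_assign
        # Recomputation: centroid = mean of the points assigned to the cluster.
        for j in range(k):
            members = [x for x, a in zip(X, assign) if a == j]
            if members:                               # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]]
print(kmeans(X, 2))
```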
21
[Figure from Bishop: K-means iterations illustrated on a 2-D data set]
22
Termination conditions
} Several possibilities for the termination condition, e.g.,
} A fixed number of iterations
} Doc partition unchanged
} 𝐽 < 𝜃: cost function falls below a threshold
} ∆𝐽 < 𝜃: the decrease in the cost function (in two successive iterations) falls below a threshold
Sec. 16.4
23
Convergence of K-means
} The K-means algorithm always reaches a fixed point in which clusters don’t change.
} We must use tie-breaking when a sample has the same distance from two or more clusters (e.g., by assigning it to the lower-index cluster)
Sec. 16.4
24
K-means decreases 𝐽(𝒞) in each iteration (before convergence)
} First, reassignment monotonically decreases 𝐽(𝒞) since each vector is assigned to the closest centroid.
} Second, recomputation monotonically decreases each $\sum_{i \in \mathcal{C}_k} \| \boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k \|^2$:
} $\sum_{i \in \mathcal{C}_k} \| \boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k \|^2$ reaches its minimum for $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{i \in \mathcal{C}_k} \boldsymbol{x}^{(i)}$
} K-means typically converges quickly
Sec. 16.4
25
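To see why the second point holds (a standard one-line derivation, added here for completeness), set the gradient of the per-cluster sum to zero:

$\frac{\partial}{\partial \boldsymbol{\mu}_k} \sum_{i \in \mathcal{C}_k} \left\| \boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k \right\|^2 = -2 \sum_{i \in \mathcal{C}_k} \left( \boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k \right) = \boldsymbol{0} \;\Longrightarrow\; \boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{i \in \mathcal{C}_k} \boldsymbol{x}^{(i)}$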
Time complexity of K-means
} Computing the distance between two docs: 𝑂(𝑀)
} 𝑀 is the dimensionality of the vectors.
} Reassigning clusters: 𝑂(𝐾𝑁) distance computations ⇒ 𝑂(𝐾𝑁𝑀).
} Computing centroids: Each doc gets added once to some centroid: 𝑂(𝑁𝑀).
} Assume these two steps are each done once for 𝐼 iterations: 𝑂(𝐼𝐾𝑁𝑀).
Sec. 16.4
26
Seed choice
} Results can vary based on the random selection of initial centroids.
} Some initializations give a poor convergence rate, or convergence to a sub-optimal clustering
} Exclude outliers from the seed set
} Try out multiple starting points and choose the clustering with the lowest cost (sketched below)
} Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
} Obtain seeds from another method such as hierarchical clustering
Example showing sensitivity to seeds: if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
Sec. 16.4
27
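A small sketch (not from the slides) of the "multiple starting points" heuristic, reusing the earlier kmeans() sketch and keeping the run with the lowest cost J:

```python
def kmeans_cost(X, assign, centroids):
    """J: sum of squared distances of each point to its cluster centroid."""
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[a]))
        for x, a in zip(X, assign)
    )

def kmeans_restarts(X, k, restarts=10):
    """Run K-means from several random seeds; keep the clustering with the lowest J."""
    best = None
    for seed in range(restarts):
        assign, centroids = kmeans(X, k, seed=seed)
        cost = kmeans_cost(X, assign, centroids)
        if best is None or cost < best[0]:
            best = (cost, assign, centroids)
    return best
```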
K-means issues, variations, etc.
} Computes the centroid only after all points are re-assigned
} Instead, we can re-compute the centroid after every assignment
} This can improve the speed of convergence of K-means
} Assumes clusters are spherical in vector space
} Sensitive to coordinate changes, weighting, etc.
} Disjoint and exhaustive
} Doesn’t have a notion of “outliers” by default
} But outlier filtering can be added
Sec. 16.4
Dhillon et al. ICDM 2002 – a variation to fix some issues with small document clusters
28
How many clusters?
} Number of clusters 𝐾 is given
} Partition 𝑁 docs into a predetermined number of clusters
} Finding the “right” number is part of the problem
} Given docs, partition them into an “appropriate” number of subsets.
} E.g., for query results - the ideal value of K is not known up front - though the UI may impose limits.
29
How many clusters?
[Figure: the same data set clustered into two, four, and six clusters]
30
Selecting k
31
K not specified in advance
} Tradeoff between having better focus within each cluster and having too many clusters
} Solve an optimization problem: penalize having lots of clusters
} Application dependent, e.g., a compressed summary of the search results list.
$k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$
32
$J_{\min}(k)$: the minimum value of $J(\{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_k\})$ obtained in, e.g., 100 runs of k-means (with different initializations)
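A sketch of this model-selection rule, with $J_{\min}(k)$ approximated by the best of a few restarts (using the kmeans() and kmeans_cost() sketches above); the penalty weight lam is application dependent and the default below is only illustrative:

```python
def select_k(X, k_max, lam=1.0, restarts=5):
    """Pick k* = argmin_k [ J_min(k) + lam * k ] over k = 1 .. k_max."""
    scores = {}
    for k in range(1, k_max + 1):
        j_min = min(kmeans_cost(X, *kmeans(X, k, seed=s)) for s in range(restarts))
        scores[k] = j_min + lam * k
    return min(scores, key=scores.get), scores
```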
Penalize lots of clusters
} Benefit for a doc: cosine similarity to its centroid
} Total Benefit: sum of the individual doc Benefits.
} Why is there always a clustering of Total Benefit 𝑁? (Put each doc in its own cluster: its similarity to its centroid is then 1.)
} For each cluster, we have a Cost C.
} For K clusters, the Total Cost is KC.
} Value of a clustering = Total Benefit - Total Cost.
} Find the clustering of highest value, over all choices of K.
} Total Benefit increases with increasing K.
} But we can stop when it doesn’t increase by “much”. The Cost term enforces this.
33
What is a good clustering?
} Internal criterion:
} intra-class (that is, intra-cluster) similarity is high
} inter-class similarity is low
} The measured quality of a clustering depends on both the doc representation and the similarity measure
Sec. 16.3
34
External criteria for clustering quality
} Quality: ability to discover some or all of the patterns in gold-standard data
} Assesses a clustering with respect to ground truth … requires labeled data
} Gold-standard classes $C_1, \dots, C_J$, while the clustering produces K clusters $\omega_1, \omega_2, \dots, \omega_K$.
Sec. 16.3
35
External criteria
} Purity
} Rand Index
} F measure
36
Rand Index (RI)

Number of point pairs             | Same cluster in clustering | Different clusters in clustering
Same class in ground truth        | TP = 20                    | FN = 24
Different classes in ground truth | FP = 20                    | TN = 72

Sec. 16.3
RI measures agreement over pairwise decisions: here RI = (20 + 72) / 136 ≈ 0.68
37
Rand index and F-measure

$RI = \frac{TP + TN}{TP + FP + FN + TN}$

Compare with standard Precision and Recall:

$P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$

People also define and use a cluster F-measure, which is probably a better measure.
Sec. 16.3
38
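Putting the numbers from the contingency table above through these formulas (a worked check, added here):

```python
TP, FN = 20, 24   # pairs with the same gold class: same cluster / different clusters
FP, TN = 20, 72   # pairs with different gold classes
ri = (TP + TN) / (TP + FP + FN + TN)                 # (20 + 72) / 136
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # F_beta with beta = 1
print(round(ri, 2), round(precision, 2), round(recall, 2), round(f1, 2))  # 0.68 0.5 0.45 0.48
```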
External evaluation of cluster quality
} Purity: ratio between the size of the dominant class in cluster $\omega_i$ and the size of cluster $\omega_i$

$\text{purity}(\omega_i) = \frac{1}{|\omega_i|} \max_{j=1,\dots,J} |\omega_i \cap C_j|$

$\text{TotalPurity}(\omega_1, \dots, \omega_K) = \frac{1}{N} \sum_{i=1}^{K} \max_{j=1,\dots,J} |\omega_i \cap C_j| = \sum_{i=1}^{K} \text{purity}(\omega_i) \times \frac{|\omega_i|}{N}$
} Biased because having 𝑁 clusters maximizes purity
Sec. 16.3
39
[Figure: 17 points in three clusters, each point labeled with one of three classes; per-class counts are (5, 1, 0) in Cluster I, (1, 4, 1) in Cluster II, and (2, 0, 3) in Cluster III]
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
Purity example
Sec. 16.3
40
Total Purity = 1/17 (5+4+3) = 12/17
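The same computation in a few lines of Python (the class labels below are just placeholders for the three classes in the figure):

```python
clusters = [
    {"c1": 5, "c2": 1, "c3": 0},   # Cluster I   (6 points)
    {"c1": 1, "c2": 4, "c3": 1},   # Cluster II  (6 points)
    {"c1": 2, "c2": 0, "c3": 3},   # Cluster III (5 points)
]
n = sum(sum(c.values()) for c in clusters)                          # 17 points in total
purities = [max(c.values()) / sum(c.values()) for c in clusters]    # 5/6, 4/6, 3/5
total_purity = sum(max(c.values()) for c in clusters) / n           # (5 + 4 + 3) / 17
print(purities, total_purity)
```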
Hierarchical clustering
} Build a tree-based hierarchical taxonomy (dendrogram) from a set of docs.
} Outputs a hierarchy, a structure that is more informative
} Does not require a pre-specified number of clusters
} But hierarchical methods have lower efficiency.
[Example taxonomy: animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
Ch. 17
41
Hierarchical clustering: Approaches
42
} Divisive approach: recursive application of a partitional clustering algorithm.
} Agglomerative approach:
} treat each doc as a singleton cluster at the outset
} then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster
Hierarchical clustering
43
Dendrogram: hierarchical clustering
} Each merge is represented by a horizontal line
} The y-coordinate of the horizontal line is the similarity of the two clusters that were merged
} A clustering is obtained by cutting the dendrogram at a desired level:
} Each connected component forms a cluster.
44
Hierarchical Agglomerative Clustering (HAC)
} Starts with each doc in a separate cluster
} then repeatedly joins the closest pair of clusters, until there is only one cluster.
} The history of merging forms a binary tree or hierarchy.
Sec. 17.1
Note: the resulting clusters are still “hard” and induce a partition
45
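A naive, runnable sketch of HAC with single-link similarity (phrased here with Euclidean distances instead of similarities; names are illustrative, and the cost matches the "naive O(N³)" bound discussed later):

```python
import itertools, math

def naive_hac(points):
    """Single-link HAC: repeatedly merge the closest pair; return the merge history."""
    clusters = {i: [i] for i in range(len(points))}   # cluster id -> member point indices
    merges = []                                       # (id_a, id_b, merge distance)
    next_id = len(points)
    while len(clusters) > 1:
        # Single-link: cluster distance = min over cross-cluster point pairs.
        (a, b), d = min(
            (((ca, cb), min(math.dist(points[i], points[j])
                            for i in clusters[ca] for j in clusters[cb]))
             for ca, cb in itertools.combinations(clusters, 2)),
            key=lambda pair_dist: pair_dist[1],
        )
        merges.append((a, b, d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

pts = [(0, 0), (0, 1), (4, 0), (4, 1), (10, 10)]
print(naive_hac(pts))
```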
Example: Hierarchical Agglomerative Clustering (HAC)
46
[Figure: a dendrogram built by HAC over seven points labeled 1–7]
47
How to cut the hierarchy?
48
} Cut at a prespecified level of similarity
} Cut where the gap between two successive combination similarities is largest (sketched below).
} Prespecify the number of clusters K and select the cutting point that produces K clusters.
} Find the number of clusters as $k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$
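A sketch of the "largest gap" rule applied to the merge history produced by the naive_hac() sketch above (that sketch records distances rather than similarities, so the cut goes just before the largest jump in merge distance):

```python
def merges_before_largest_gap(merges):
    """Number of merges to keep: stop just before the largest jump in merge distance."""
    dists = [d for _, _, d in merges]
    gaps = [dists[i + 1] - dists[i] for i in range(len(dists) - 1)]
    return gaps.index(max(gaps)) + 1   # keeping this many merges leaves N - cut clusters

merges = naive_hac(pts)
cut = merges_before_largest_gap(merges)
print(cut, "merges kept ->", len(pts) - cut, "clusters")
```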
Closest pair of clusters
} Many variants for defining the closest pair of clusters
} Single-link: similarity of the most similar pair
} Complete-link: similarity of the “furthest” points, the least similar pair
} Centroid: clusters whose centroids (centers of gravity) are the most similar
} Average-link: average similarity between pairs of elements
Sec. 17.2
49
Distances between cluster pairs
50
[Figure: illustration of the four inter-cluster distances — single-link, complete-link, centroid, average-link]
Single link
} Use the maximum similarity of pairs:
$sim(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} sim(x, y)$
} Can result in “straggly” (long and thin) clusters, due to the chaining effect.
} After merging 𝐶𝑖 and 𝐶𝑗, the similarity of the resulting cluster to another cluster 𝐶𝑘:
$sim(C_i \cup C_j, C_k) = \max\bigl(sim(C_i, C_k),\, sim(C_j, C_k)\bigr)$
Sec. 17.2
51
Single link: example
Sec. 17.2
52
Complete link
} Use the minimum similarity of pairs:
$sim(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} sim(x, y)$
} Makes “tighter,” spherical clusters that are typically preferable.
} After merging 𝐶𝑖 and 𝐶𝑗, the similarity of the resulting cluster to another cluster 𝐶𝑘:
$sim(C_i \cup C_j, C_k) = \min\bigl(sim(C_i, C_k),\, sim(C_j, C_k)\bigr)$
Sec. 17.2
53
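The two merge-update rules side by side, as a tiny illustration added here; it is this constant-time update that makes the efficient HAC implementations discussed later possible:

```python
def merged_sim_single(sim_ik, sim_jk):
    """Single-link: sim(Ci ∪ Cj, Ck) = max of the old similarities."""
    return max(sim_ik, sim_jk)

def merged_sim_complete(sim_ik, sim_jk):
    """Complete-link: sim(Ci ∪ Cj, Ck) = min of the old similarities."""
    return min(sim_ik, sim_jk)

print(merged_sim_single(0.8, 0.3), merged_sim_complete(0.8, 0.3))  # 0.8 0.3
```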
Complete link: example
Sec. 17.2
54
Graph-theoretic interpretations
55
} $s_k$: similarity of the two clusters merged in step k
} $G(s_k)$: graph that links all data points with a similarity of at least $s_k$.
} The clusters after step k in:
} single-link clustering are the connected components of $G(s_k)$
} complete-link clustering are maximal cliques of $G(s_k)$.
Noise
56
Group average
} Similarity of two clusters = average similarity of all pairs within the merged cluster.
} Compromise between single and complete link.
$sim(C_i, C_j) = \frac{1}{|C_i \cup C_j| \, (|C_i \cup C_j| - 1)} \sum_{\vec{x} \in C_i \cup C_j} \; \sum_{\substack{\vec{y} \in C_i \cup C_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$
Sec. 17.3
57
Computing group average similarity
} When we use the dot product as the similarity measure, we can compute the similarity of two clusters in constant time if we maintain the sum of vectors in each cluster:
$\vec{s}_j = \sum_{\vec{x} \in C_j} \vec{x}$
} Compute the similarity of clusters as:
$sim(C_i, C_j) = \frac{(\vec{s}_i + \vec{s}_j) \cdot (\vec{s}_i + \vec{s}_j) - (|C_i| + |C_j|)}{(|C_i| + |C_j|)(|C_i| + |C_j| - 1)}$
Sec. 17.3
58
(assuming the doc vectors have length one)
GAAC requires:
(i) documents represented as vectors
(ii) length normalization of vectors, so that self-similarities are 1.0
(iii) the dot product as the similarity measure between vectors and sums of vectors.
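A small sketch of this constant-time computation (using NumPy, assumed available; the toy vectors below are unit length, as GAAC requires):

```python
import numpy as np

def group_average_sim(sum_i, sum_j, n_i, n_j):
    """Group-average similarity from maintained sum vectors of unit-length docs."""
    s = sum_i + sum_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))

c_i = [np.array([1.0, 0.0]), np.array([0.8, 0.6])]   # cluster i: two unit vectors
c_j = [np.array([0.6, 0.8])]                          # cluster j: one unit vector
print(group_average_sim(sum(c_i), sum(c_j), len(c_i), len(c_j)))
```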
Centroid Clustering
59
Inversions in centroid clustering
60
Computational complexity
} First iteration: all HAC methods need to compute the similarity of all pairs of instances, O(N^2) similarity computations.
} N-1 merging iterations: compute the distance between the most recently created cluster and all other existing ones.
} In order to maintain an overall O(N^2) performance, computing the similarity to each other cluster must be done in constant time.
} With priority queues, the closest pair can be found in O(N log N) per iteration.
} Thus, the overall complexity is O(N^3) if done naively, or O(N^2 log N) if done more cleverly using priority queues.
61
62
Cluster labeling
63
} Cluster-internal labeling: depends on the cluster itself
} Title of the doc closest to the centroid
} Titles are easier to read than a list of terms
} However, a single doc is unlikely to be representative of all docs in a cluster
} A list of terms with high weights in the cluster centroid
} Differential cluster labeling: comparing the distribution of terms in one cluster with that of other clusters.
} We can use measures such as “Mutual Information”:
$I(C_k; X_i) = \sum_{x_i \in \{0,1\}} \sum_{c_k \in \{0,1\}} P(x_i, c_k) \log \frac{P(x_i, c_k)}{P(x_i)\, P(c_k)}$
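A sketch of scoring one term for one cluster with this mutual information (the counts below are hypothetical: n11 = docs in the cluster that contain the term, n10 = docs outside the cluster that contain it, and so on):

```python
import math

def term_cluster_mi(n11, n10, n01, n00):
    """I(C; X) for binary 'doc contains term' (X) and 'doc is in cluster' (C)."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_xc, n_x, n_c in [
        (n11, n11 + n10, n11 + n01),   # term present, doc in cluster
        (n10, n11 + n10, n10 + n00),   # term present, doc not in cluster
        (n01, n01 + n00, n11 + n01),   # term absent,  doc in cluster
        (n00, n01 + n00, n10 + n00),   # term absent,  doc not in cluster
    ]:
        if n_xc:
            mi += (n_xc / n) * math.log((n * n_xc) / (n_x * n_c))
    return mi

print(term_cluster_mi(n11=40, n10=10, n01=60, n00=890))
```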
Cluster Labeling
64
Final word and resources
} In clustering, clusters are inferred from the data without human input (unsupervised learning)
} However, in practice, it’s a bit less clear
} there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs, . . .
} Resources
} IIR 16 except 16.5
} IIR 17 except 17.5
65