Clustering
COSC 526 Class 12
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov
Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), Tan, Steinbach and Kumar
2
Assignment 1: Your first hand at random walks
• Write-up is here…
• Pair up and do the assignment
– It helps to work in small teams
– Maximize your productivity
• Most of the assignment and its notes are in the handouts (class web-page)
3
Clustering: Basics…
4
Clustering
• Finding groups of items (or objects) in a dataset such that items within a group are related to one another and different from items in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
5
Applications
• Grouping regions together based on precipitation
• Grouping genes together based on expression patterns in cells
• Finding ensembles of folded/unfolded protein structures
6
What is not clustering?
• Supervised classification
– uses class label information
• Simple segmentation
– Dividing students into different registration groups (alphabetically, by major, etc.)
• Results of a query
– Grouping is the result of an external specification
• Graph partitioning
– Related, but the areas are not identical…
Take Home Message:
• Clustering of data is essentially driven by the data at hand!!
• Meaning or interpretation of the clusters should be driven by the data!!
7
Constitution of a cluster can be ambiguous
• How to decide between 8 clusters and 2 clusters?
8
Types of Clustering
• Partitional Clustering
– A division of data into non-overlapping subsets (clusters) such that each data point is in exactly one subset
• Hierarchical Clustering
– A set of nested clusters organized as a hierarchical tree
[Figure: partition and dendrogram over points p1–p6]
9
Other types of distinctions…
• Exclusive vs. Non-exclusive:
– In non-exclusive clustering, points may belong to multiple clusters
• Fuzzy vs. Non-fuzzy:
– In fuzzy clustering, a point belongs to every cluster with a weight between 0 and 1
– Similar to probabilistic clustering
• Partial vs. Complete:
– We may want to cluster only some of the data
• Heterogeneous vs. Homogeneous:
– Clusters of widely different sizes…
10
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
Well-separated: a set of points such that any point in a cluster is closer to every other point in the cluster than to any point not in the cluster
11
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
Center-based: a cluster is a set of objects such that each object is closer to the center of its own cluster (called the centroid) than to the center of any other cluster…
12
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
Contiguous (nearest neighbor or transitive): each point in a cluster is closer to at least one other point in its cluster than to any point outside it…
13
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
Density-based: a cluster is a dense region of points separated from other dense regions by low-density regions. Used when clusters are irregular and when noise/outliers are present
14
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
Property/conceptual: find clusters that share a common property or representation (e.g., taste, smell, …)
15
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
• Find clusters based on minimizing or maximizing an objective function
• Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters with the objective function
– An NP-hard problem
• Global vs. Local Objectives:
– Hierarchical clustering algorithms typically have local objectives
– Partitional algorithms typically have global objectives
16
More on objective functions… (1)
• Objective functions tend to map the clustering problem to a different domain and solve a related problem:
– E.g., defining a proximity matrix as a weighted graph
– Clustering is then equivalent to breaking the graph into connected components
– Minimize the edge weight between clusters and maximize the edge weight within clusters
17
More on objective functions… (2)
• Best clustering usually minimizes/maximizes an objective function
• Mixture models assume that the data is a mixture of a number of parametric statistical distributions (e.g., Gaussians)
18
Characteristics of input data
• Type of proximity or density measure
– a derived measure, but one that is central to clustering
• Sparseness:
– Dictates type of similarity
– Adds to efficiency
• Attribute Type:
– Dictates type of similarity
• Type of data:
– Dictates type of similarity
• Dimensionality
• Noise and outliers
• Type of distribution
19
Clustering Algorithms:
• K-means Clustering
• Hierarchical Clustering
• Density-based Clustering
20
K-means Clustering
• Partitional clustering:
– Each cluster is associated with a centroid
– Each point is assigned to the cluster with the closest centroid
– We need to specify the total number of clusters, k, as one of the inputs
• Simple algorithm:

K-means Algorithm
1: Select K points as the initial centroids
2: repeat
3: Form K clusters by assigning each point to the closest centroid
4: Recompute the centroid of each cluster
5: until centroids don't change
21
K-means Clustering
• Initial centroids are chosen randomly:
– resulting clusters can vary depending on how you started
• The centroid is the mean of the points in the cluster
• "Closeness" is usually measured by Euclidean distance
• K-means will typically converge quickly
– Points stop changing assignments
– Another stopping criterion: only a few points change clusters
• Time complexity: O(nKId)
– n: number of points; K: number of clusters
– I: number of iterations; d: number of attributes
22
K-means example
23
How to initialize (seed) K-means?
• If there are K "real" clusters, the chance of selecting one initial centroid from each cluster is small
– Chance is relatively small when K is large
– If the clusters have the same size, m, the probability is P = K!/K^K
– If K = 10, P = 0.00036 (really small!!)
• The choice of initial centroids can have a deep impact on how the clusters are determined…
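The slide's figure can be checked directly, assuming K equal-size clusters and uniformly random seed selection:

```python
from math import factorial

# For K equal-size clusters (m points each), the chance that K uniformly
# chosen seeds land one per cluster is K! * m^K / (K*m)^K = K! / K^K
# (the cluster size m cancels out).
def prob_one_seed_per_cluster(k: int) -> float:
    return factorial(k) / k**k

print(prob_one_seed_per_cluster(10))  # ≈ 0.00036, matching the slide
```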
24
Choosing K
25
What are the solutions for this problem?
• Multiple runs!!– Usually helps
• Sample the points so that you can guesstimate the number of clusters– Depends on how we have sampled
– Or we have sampled outliers in the data
• Select more than the k number of centroids and then select k among these centroids– Choose widely separated k centroids
26
How to evaluate k-means clusters
• Most common measure is the sum of squared errors (SSE): the sum, over all clusters, of the squared distance of each point to its cluster centroid
• Given two clustering outputs from k-means, we can choose the one with the lower error
• Only compare clusterings with the same K
• Important side note: K-means is a heuristic for minimizing SSE
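The SSE measure can be computed directly (a small helper, assuming Euclidean distance and NumPy arrays):

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared Euclidean distances from each point to the
    centroid of the cluster it is assigned to."""
    return float(sum(((X[labels == i] - c) ** 2).sum()
                     for i, c in enumerate(centers)))
```

Given two k-means runs with the same K, keep the one with the smaller `sse` value.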
27
Pre-processing and Post-processing
• Pre-processing:
– normalize the data (e.g., scale the data to unit standard deviation)
– eliminate outliers
• Post-processing:
– Eliminate small clusters that may represent outliers
– Split clusters that have a high SSE
– Merge clusters that have a low SSE
28
Limitations of using K-means
• K-means can have problems when the data has:
– clusters of different sizes
– clusters of different densities
– non-globular cluster shapes
– outliers!
29
How does this scale… (for MapReduce)
In the map step:
• Read the cluster centers into memory from a SequenceFile
• Iterate over each cluster center for each input key/value pair
• Measure the distances and save the nearest center, i.e., the one with the lowest distance to the vector
• Write the cluster center with its vector to the filesystem
In the reduce step (we get the associated vectors for each center):
• Iterate over each value vector and calculate the average vector (sum each vector and divide each component by the number of vectors received)
• This is the new center; save it into a SequenceFile
• Check the convergence between the cluster center stored in the key object and the new center
• If they are not equal, increment an update counter
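The two steps above can be sketched as plain functions (a simplified stand-in: a real Hadoop job would read/write SequenceFiles and use framework counters, and the function names here are illustrative):

```python
import numpy as np

def kmeans_map(vector, centers):
    """Map step: find the nearest center for one input vector and
    emit (center index, vector)."""
    distances = [float(np.linalg.norm(vector - c)) for c in centers]
    return int(np.argmin(distances)), vector

def kmeans_reduce(center_idx, vectors, old_centers, tol=1e-6):
    """Reduce step: average the vectors assigned to one center (sum each
    component, divide by the count), then check convergence against the
    old center; the flag plays the role of the update counter."""
    new_center = np.mean(vectors, axis=0)
    updated = not np.allclose(new_center, old_centers[center_idx], atol=tol)
    return new_center, updated
```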
30
Making k-means streaming
• Two broad approaches:
– Solving k-means as the data arrives:
• Guha, Mishra, Motwani, O'Callaghan (2001)
• Charikar, O'Callaghan, and Panigrahy (2003)
• Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku (2011)
– Solving k-means using weighted coresets:
• Select a small sample of points that are weighted
• Weights are chosen such that the k-means solution on the subset is similar to that on the original dataset
31
Fast Streaming K-means
Shindler, Wong, Meyerson, NIPS (2011)
Shindler, NIPS presentation (2011)
32
Fast Streaming K-means
• Intuition on why this works: the probability that point x starts a new cluster is proportional to its distance from the nearest "mean"
– referred to as a "facility" here
• Costliest step: measuring δ
– Use approximate nearest-neighbor algorithms
• Space complexity: Ω(k log n)
– You are only storing neighborhood information
– Use hashing and metric embeddings (not discussed)
• Time complexity: o(nk)

Shindler, Wong, Meyerson, NIPS (2011)
33
Hierarchical Clustering
34
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be conveniently visualized as a dendrogram:
– a tree-like representation that records the sequences of merges and splits
35
Types of Hierarchical Clustering
• Agglomerative Clustering:
– Start with each point as an individual cluster (leaf)
– At each step, merge the closest pair of clusters until one cluster (or k clusters) remains
• Divisive Clustering:
– Start with one, all-inclusive cluster
– At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical clustering:
– uses a similarity or distance matrix
– merges or splits one cluster at a time
36
Agglomerative Clustering
• One of the more popular algorithms
• Basic algorithm is straightforward

Agglomerative Clustering Algorithm
1: Compute the distance matrix
2: Let each data point be a cluster
3: repeat
4: Merge the two closest clusters
5: Update the distance matrix
6: until only a single cluster remains

Key operation is the computation of the proximity of two clusters → different approaches to defining the distance between clusters distinguish the different algorithms
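The basic algorithm can be rendered directly in Python, assuming a precomputed distance matrix (an unoptimized sketch, not a reference implementation; single and complete linkage are shown):

```python
def agglomerative(D, linkage="single"):
    """Naive agglomerative clustering on a precomputed distance matrix D.
    Returns the sequence of (cluster, cluster) merges."""
    clusters = {i: [i] for i in range(len(D))}
    merges = []

    def cluster_dist(a, b):
        # distance between two clusters under the chosen linkage
        dists = [D[i][j] for i in clusters[a] for j in clusters[b]]
        return min(dists) if linkage == "single" else max(dists)

    while len(clusters) > 1:
        # find the two closest clusters, merge them, repeat
        a, b = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda ab: cluster_dist(*ab))
        merges.append((a, b))
        clusters[a] += clusters.pop(b)
    return merges
```

Scanning all cluster pairs at every step is what gives the O(N³) behavior discussed on the complexity slide.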
37
Starting Situation
• Start with clusters of individual data points and a distance matrix
[Figure: distance matrix over points p1, p2, p3, p4, p5, …]
38
Next step: Group points…
• After merging a few of these data points
[Figure: clusters C1–C5 and the corresponding distance matrix]
39
Next step: Merge clusters…
• After merging a few of these data points
[Figure: clusters C1–C5 and the distance matrix after merging]
40
How to merge and update the distance matrix?
• Measures of similarity:
– Min
– Max
– Group average
– Distance between centroids
– Other methods driven by an objective function
• How do these choices affect the clustering process?
41
Defining inter-cluster similarity
• Min (single link)
• Max (complete link)
• Group Average (average link)
• Distance between centroids
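The four linkage definitions can be written compactly (an illustrative sketch assuming clusters stored as NumPy arrays of points and Euclidean distance):

```python
import numpy as np

def pair_dists(A, B):
    """All pairwise Euclidean distances between clusters A and B."""
    return [float(np.linalg.norm(a - b)) for a in A for b in B]

def single_link(A, B):      # Min (single link)
    return min(pair_dists(A, B))

def complete_link(A, B):    # Max (complete link)
    return max(pair_dists(A, B))

def average_link(A, B):     # Group average (average link / UPGMA)
    return sum(pair_dists(A, B)) / (len(A) * len(B))

def centroid_link(A, B):    # Distance between centroids
    return float(np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0)))
```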
42
Single Link
• Can handle non-spherical/non-convex clusters
43
Complete Link Clustering
• Better suited for datasets with noise
• Tends to form smaller clusters
• Biased toward more globular clusters
44
Average link / Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
• Compromise between single and complete linkage
• Works generally well in practice
45
How do we say when two clusterings are similar?
• Ward's method
– Similarity of two clusters is based on the increase in SSE when the two clusters are merged
• Advantages:
– Less susceptible to errors/outliers in the data
– Hierarchical analog of K-means
– Can be used to initialize K-means
• Disadvantage:
– Biased toward more globular clusters
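Ward's merge criterion can be sketched as follows (an illustrative helper, not the course's code; it measures the SSE increase that merging two clusters would cause):

```python
import numpy as np

def cluster_sse(C):
    """SSE of one cluster: squared distances to its own centroid."""
    C = np.asarray(C, dtype=float)
    return float(((C - C.mean(axis=0)) ** 2).sum())

def ward_distance(A, B):
    """Ward's linkage: the increase in total SSE if A and B were merged."""
    merged = np.vstack([A, B])
    return cluster_sse(merged) - cluster_sse(A) - cluster_sse(B)
```

At each agglomerative step, Ward's method merges the pair of clusters with the smallest `ward_distance`, which is why it behaves like a hierarchical analog of K-means.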
46
Space and Time Complexity
• Space Complexity: O(N²)
– N is the number of data points
– N² entries in the distance matrix
• Time Complexity: O(N³)
– Many cases: N steps for tree construction, and at each step the distance matrix with O(N²) entries must be updated
– Complexity can be reduced to O(N² log N) in some cases
47
Let’s talk about Scaling!
• A specific type of hierarchical clustering algorithm:
– UPGMA (average linkage)
– Most widely used in the bioinformatics literature
• However, impractical for scaling to an entire genome!
– Need the whole distance/dissimilarity matrix in memory (N²)!
– How can we exploit sparsity?
48
Problem of interest…
• Given a large number of sequences, we have a way to determine how similar two or more sequences are
• We have a pairwise dissimilarity matrix
• Build a hierarchical clustering routine for understanding how proteins (or other bio-molecules) have evolved
49
The problem with UPGMA: distance matrix computation is expensive
• We are computing the arithmetic mean of distances between the sequences
• This is not defined when we have sparse inputs
• The triangle inequality is not satisfied, based on how we have defined the way clusters are built…
50
Strategy to scale up this for Big Data
• Two aspects to handle:
– Missing edges
– Sparsity in the distance matrix
• Use a detection threshold ψ for missing edge data:
– We are completing "missing" values in D using ψ!
51
Sparse UPGMA: Speeding up
• Space: O(E) (note E << N²)
• Time: O(E log V)
• Still expensive, for E can be arbitrarily large!
• How do we deal with this?
52
Streaming for Sparsity: Multi-round Memory Constrained (MC-UPGMA)
• Two components needed:
– Memory-constrained clustering unit:
• Holds only the subset of the edges E that needs to be processed in the current round
– Memory-constrained merging unit:
• Ensures we get only valid edges
• Space is only O(N), depending on how many sequences we have to load at any given time…
• Time: O(E log V)
53
Limitations of Hierarchical Clustering
• Greedy: once we make a merging decision, it usually cannot be undone
– Or it can be expensive to undo
– Methods exist to alter this
• No global objective function is being minimized or maximized
• Different schemes of hierarchical clustering have limitations:
– Sensitivity to noise and outliers
– Difficulty in handling clusters of different shapes
– Chaining, breaking of clusters…
54
Density-based Spatial Clustering of Applications with Noise (DBSCAN)
55
Preliminaries
• Density is defined as the number of points within a radius ε
– In this example, density = 9
• A core point has at least a specified number of points (minPts) within ε
– These points are in the interior of a cluster
• A border point has fewer than minPts within ε, but is in the vicinity of a core point
• A noise point is any point that is neither a core point nor a border point

[Figure: ε-neighborhoods with minPts = 4, showing a core point, a border point, and a noise point]
56
DBSCAN Algorithm
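The algorithm figure is not reproduced in this transcript; as a sketch, the core/border/noise labeling that DBSCAN builds clusters from can be written as follows (brute-force distance matrix for clarity; real implementations use spatial indexes, and the function name is illustrative):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point core / border / noise per the DBSCAN definitions.
    Density of a point = number of points within radius eps (self included)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")   # within eps of some core point
        else:
            labels.append("noise")
    return labels
```

DBSCAN then forms clusters by connecting core points within ε of each other and attaching border points to them; noise points are discarded.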
57
Illustration of DBSCAN: Assignment of Core, Border and Noise Points
58
DBSCAN: Finding Clusters
59
Advantages and Limitations
• Resistant to noise
• Can handle clusters of different sizes and shapes
• Eps and MinPts are dependent on each other
– Can be difficult to specify
• Clusters of different densities within the same dataset can be difficult to find
60
Advantages and Limitations
• Difficulty with varying-density data
• Difficulty with high-dimensional data
61
How to determine Eps and MinPoints
• For points within a cluster, the distance to the kth nearest neighbor is roughly the same
• Noise points are farther from their kth nearest neighbor in general
• So, plot the sorted distance of every point to its kth nearest neighbor
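The heuristic above can be sketched as a small helper (assuming NumPy arrays; a brute-force distance matrix stands in for a proper nearest-neighbor index):

```python
import numpy as np

def kth_nn_distances(X, k):
    """Distance from every point to its k-th nearest neighbor, sorted in
    decreasing order; the 'knee' of this curve is a candidate Eps."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    D.sort(axis=1)                 # column 0 is the self-distance (0)
    return np.sort(D[:, k])[::-1]  # k-th NN distance, largest first
```

Noise points appear at the left of the sorted curve (large kth-NN distance), cluster points at the right; Eps is read off near the bend between them.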
62
How do we validate clusters?
63
Cluster validity
• For supervised learning:
– we had a class label,
– which meant we could measure how good our training and testing errors were
– Metrics: Accuracy, Precision, Recall
• For clustering:
– How do we measure the "goodness" of the resulting clusters?
64
Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some
65
Different aspects of validating clusters
• Determine the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
• External Validation: compare the results of a cluster analysis to externally known class labels (ground truth)
• Internal Validation: evaluate how well the results of a cluster analysis fit the data without reference to external information
• Compare clusterings to determine which is better
• Determine the 'correct' number of clusters
66
Measures of cluster validity
• External Index: used to measure the extent to which cluster labels match externally supplied class labels
– Entropy, Purity, Rand Index
• Internal Index: used to measure the goodness of a clustering structure without respect to external information
– Sum of Squared Error (SSE), Silhouette coefficient
• Relative Index: used to compare two different clusterings or clusters
– Often an external or internal index is used for this function, e.g., SSE or entropy
67
Measuring Cluster Validation with Correlation
• Proximity matrix vs. incidence matrix:
– The incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, and 0 otherwise
• Compute the correlation between the two matrices:
– Only n(n-1)/2 entries need to be computed (the matrices are symmetric)
– High correlation indicates that points in the same cluster are close to each other
• Not suited for density-based clustering
68
Another approach: use similarity matrix for cluster validation
69
Internal Measures: SSE
• SSE is also a good measure for understanding how good the clustering is
– Lower SSE ⇒ better clustering
• Can be used to estimate the number of clusters (look for the "elbow" in the plot of SSE vs. K)
70
More on Clustering a little later…
• We will discuss other forms of clustering in the following classes
• Next class:
– please bring your brief write-up on the two papers