
Page 1:

Clustering

Francisco Moreno
Extracts from Mining of Massive Datasets
Rajaraman, Leskovec & Ullman

Page 2:

Introduction

• Clustering is the process of examining a collection of “points,” and grouping the points into “clusters” according to some distance measure.

• The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

Page 3:

Introduction

• The idea is to offer methods for discovering clusters in data, particularly in situations where the data is very large.

• There are two major approaches to clustering:
  – hierarchical
  – point-assignment

Page 4:

Introduction

• A dataset suitable for clustering is a collection of points, which are objects belonging to some space.

• All spaces for which we can perform a clustering have a distance measure, giving a distance between any two points in the space.

Page 5:

Introduction

• The common Euclidean distance (the square root of the sum of the squares of the differences between the coordinates of the points in each dimension) serves for all Euclidean spaces, although there are other options for distance measures in Euclidean spaces, including the Manhattan distance (the sum of the absolute differences of the coordinates).
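For concreteness, here is a minimal Python sketch of both measures (the function names are ours, not from the slides):

    import math

    def euclidean(p, q):
        # Square root of the sum of squared coordinate differences.
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def manhattan(p, q):
        # Sum of the absolute coordinate differences.
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    # The two measures can disagree about how far apart points are:
    print(euclidean((0, 0), (3, 4)))   # 5.0
    print(manhattan((0, 0), (3, 4)))   # 7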

Page 6:

Clustering Example

[Figure: clustering of dogs by height and weight]

Page 7:

Introduction

• However, modern clustering problems are not so simple.

• They may involve Euclidean spaces of very high dimension, or spaces that are not Euclidean at all. For example, it is challenging to cluster documents by their topic.

Page 8:

Introduction

• Requirements for a function on pairs of points to be a distance measure:
  – Distances are always nonnegative.
  – The distance between a point and itself is 0.
  – Distance is symmetric.
  – Distances obey the triangle inequality: the distance from x to y to z is never less than the distance going from x to z directly.

• Thus, in Euclidean geometry, the shortest distance between two points is a straight line. A quick spot-check of these requirements follows.
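A spot-check of the four requirements for the euclidean function sketched above (the sample points are ours):

    # Spot-check the distance-measure requirements on sample points.
    x, y, z = (0, 0), (3, 4), (6, 0)
    assert euclidean(x, y) >= 0                                   # nonnegative
    assert euclidean(x, x) == 0                                   # point to itself is 0
    assert euclidean(x, y) == euclidean(y, x)                     # symmetric
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality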

Page 9:

Clustering strategies

1. Hierarchical algorithms: they start with each point in its own cluster.
   – Clusters are combined based on their “closeness,” using one of many possible definitions of “close.”
   – Combination stops when further combination leads to clusters that are undesirable for one of several reasons. For example, we may stop when we have a predetermined number of clusters.

Page 10:

Clustering strategies

2. Point-assignment algorithms: each point is assigned to the cluster into which it best fits.
   – The process is normally preceded by a short phase in which initial clusters are estimated.
   – Variations allow occasional combining or splitting of clusters, or may allow points to be unassigned if they are outliers (points too far from any of the current clusters). A sketch of the basic assignment step follows.
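A minimal sketch of the core assignment step, assuming clusters are represented by centroids and reusing the euclidean function above; the outlier threshold is an illustrative parameter, not from the slides:

    def assign_points(points, centroids, outlier_limit=None):
        # Assign each point to the nearest centroid; optionally leave
        # outliers (points too far from every centroid) unassigned.
        clusters = {i: [] for i in range(len(centroids))}
        outliers = []
        for p in points:
            dists = [euclidean(p, c) for c in centroids]
            best = min(range(len(centroids)), key=lambda i: dists[i])
            if outlier_limit is not None and dists[best] > outlier_limit:
                outliers.append(p)
            else:
                clusters[best].append(p)
        return clusters, outliers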

Page 11:

Clustering strategies

We must consider whether the algorithm assumes a Euclidean space, or whether it works for an arbitrary distance measure:
  – In a Euclidean space it is possible to summarize a collection of points by their centroid (the average of the points).
  – In a non-Euclidean space, there is no notion of a centroid, and we are forced to develop another way to summarize clusters.

Page 12:

Centroid example

[Figure: a set of points with their centroid marked]

Page 13:

Centroid (xc, yc) of a set of n points {(x1, y1), (x2, y2), …, (xn, yn)}:

    xc = (x1 + x2 + … + xn) / n
    yc = (y1 + y2 + … + yn) / n
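The same computation in Python, for points of any dimension:

    def centroid(points):
        # Coordinate-wise average of a non-empty list of points.
        n = len(points)
        return tuple(sum(coords) / n for coords in zip(*points))

    print(centroid([(11, 4), (12, 3)]))  # (11.5, 3.5)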

Page 14:

Hierarchical clustering

• Any hierarchical clustering algorithm works as follows:
  – We begin with every point in its own cluster.
  – As time goes on, larger clusters will be constructed by combining two smaller clusters, and we have to decide in advance:

Page 15:

Hierarchical clustering

1. How will clusters be represented?
2. How will we choose which two clusters to merge?
3. When will we stop combining clusters?

Page 16:

Hierarchical clustering

• Once we have answers to these questions, the algorithm can be described succinctly as:

Hierarchical clustering algorithm
WHILE it is not time to stop DO
    pick the “best” two clusters to merge;
    combine those two clusters into one cluster;
END;
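As a concrete illustration, here is a direct, unoptimized Python rendering of that loop (a sketch, not the book's code). It uses the centroid-distance merge rule and a target number of clusters k as the stopping condition, both discussed below; euclidean and centroid are the helpers sketched earlier, and ties are broken by enumeration order rather than arbitrarily:

    def hierarchical_cluster(points, k):
        # Start with every point in its own cluster.
        clusters = [[p] for p in points]
        while len(clusters) > k:  # "it is not time to stop"
            # Pick the "best" two clusters: the pair with the closest centroids.
            i, j = min(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda ab: euclidean(centroid(clusters[ab[0]]),
                                                centroid(clusters[ab[1]])))
            # Combine those two clusters into one cluster.
            clusters[i] += clusters.pop(j)
        return clusters

    # Run on the nine points named in the example that follows:
    pts = [(10, 5), (11, 4), (12, 3), (4, 8), (4, 10), (6, 8),
           (2, 2), (3, 4), (9, 3)]
    print(hierarchical_cluster(pts, 3))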

Page 17:

Hierarchical clustering

• To begin, we shall assume the space is Euclidean. That allows us to represent a cluster by its centroid, the average of the points in the cluster.

• Note that in a cluster of one point, that point is the centroid, so we can initialize the clusters straightforwardly.

Page 18:

Hierarchical clustering

• We can then use the merging rule that the distance between any two clusters is the Euclidean distance between their centroids, and we should pick the two clusters at the shortest distance.

Page 19:

Hierarchical clustering

[Figure: two clusters, A and B, with the distance between their centroids marked]

Page 20:

Example

• Consider the following points.
• These points live in a 2-dimensional Euclidean space.
• Initially, each point is in a cluster by itself and is the centroid of that cluster.

Page 21:

Example

Page 22:

Example

• Among all the pairs of points, there are two pairs that are closest: (10,5) and (11,4), or (11,4) and (12,3). Each pair is at distance √2.

• Let us break ties arbitrarily and decide to combine (11,4) with (12,3).

• The result is shown in the following figure, including the centroid of the new cluster, which is at (11.5, 3.5).

Page 23:

Example

[Figure: (11,4) and (12,3) combined into a new cluster, with its centroid (11.5, 3.5) marked]

Page 24:

Example

• One might think that (10,5) gets combined with the new cluster next, since it is so close to (11,4). But our distance rule requires us to compare only cluster centroids, and the distance from (10,5) to the centroid of the new cluster is √((11.5 − 10)² + (3.5 − 5)²) = 1.5√2 ≈ 2.121, which is slightly greater than 2.

Page 25:

Example

• Thus, now the two closest clusters are those of the points (4,8) and (4,10). We combine them into one cluster with centroid (4,9).

Page 26:

Example

[Figure: (4,8) and (4,10) combined into a new cluster with centroid (4,9); the earlier cluster and its centroid (11.5, 3.5) are also shown]

Page 27:

Example

• At this point, the two closest centroids are (10,5) and (11.5, 3.5), so we combine these two clusters. The result is a cluster of three points: (10,5), (11,4), and (12,3). The centroid of this cluster is ((10+11+12)/3, (5+4+3)/3) = (11,4), which happens to be one of the points of the cluster, but that situation is coincidental. The state of the clusters is shown in the following figure.

Page 28:

Example

[Figure: (10,5) joined to the cluster {(11,4), (12,3)}; the resulting three-point cluster has centroid (11,4)]

Page 29:

Example

• Now, there are several pairs of centroids at distance √5, and these are the closest centroids:

a) (6,8) is combined with the cluster of two elements having centroid (4,9). New centroid: (4.7, 8.7).
b) (2,2) is combined with (3,4). New centroid: (2.5, 3).
c) (9,3) is combined with the cluster of three elements having centroid (11,4). New centroid: (10.5, 3.8).

Page 30:

Example

[Figure: the three new clusters from merges a), b), and c), each with its centroid marked]

Page 31:

Example

• We can continue to combine clusters further…
• Now we will discuss stopping rules, that is: when should we stop?

Page 32:

Stopping the Clustering Process

There are several approaches we might use to stop the clustering process:

a) We could be told, or have a belief, about how many clusters there are in the data. For example, if we are told that the data about dogs is taken from Chihuahuas, Dachshunds, and Beagles, then we know to stop when there are three clusters left.

Page 33:

Stopping the Clustering Process

b) We could stop combining when, at some point, the best combination of existing clusters produces a cluster that is inadequate. For example, we could require that in any cluster the distance between the centroid and the points be no greater than some limit.

Page 34:

Stopping the Clustering Process

c) We could continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points. Instead, we could return the tree representing the way in which all the points were combined. For example, the tree could represent the likely order in which two species branched from a common ancestor.

Page 35:

One Cluster

Page 36:

Hierarchical Clustering

The basic algorithm for hierarchical clustering is not very efficient: at each step, we must compute the distances between each pair of clusters in order to find the best merger. The initial step takes O(n²) time, and subsequent steps take time proportional to (n − 1)², (n − 2)², … The sum of squares up to n is O(n³), so this algorithm is cubic.

Page 37:

Hierarchical Clustering

• A somewhat more efficient algorithm can be implemented (with overall running time of O(n² log n)), but it still puts a limit on how large n can be before the algorithm becomes infeasible. One such implementation, based on a priority queue of pair distances, is sketched below.
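A sketch of one standard way to reach roughly O(n² log n) (our code, not the book's), reusing the euclidean and centroid helpers from earlier: keep all pair distances in a priority queue and discard stale entries lazily. Building the queue costs O(n²), and each of the O(n) merges does O(n log n) of queue work:

    import heapq

    def hierarchical_cluster_pq(points, k):
        # Cluster ids are never reused, so an id absent from `clusters`
        # marks a heap entry as stale.
        clusters = {i: [p] for i, p in enumerate(points)}
        heap = [(euclidean(points[a], points[b]), a, b)
                for a in range(len(points)) for b in range(a + 1, len(points))]
        heapq.heapify(heap)
        next_id = len(points)
        while len(clusters) > k:
            _, a, b = heapq.heappop(heap)
            if a not in clusters or b not in clusters:
                continue  # stale entry: one side was merged away earlier
            merged = clusters.pop(a) + clusters.pop(b)
            c = centroid(merged)
            # Push distances from the new cluster to all survivors.
            for other, members in clusters.items():
                heapq.heappush(heap, (euclidean(c, centroid(members)), next_id, other))
            clusters[next_id] = merged
            next_id += 1
        return list(clusters.values())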

Page 38:

Alternative Rules for Controlling Hierarchical Clustering

We have seen one rule for picking the best clusters to merge: find the pair with the smallest distance between their centroids. Some other options are:

a) Take the distance between two clusters to be the minimum of the distances between any two points, one chosen from each cluster. For example, in the following figure:

Page 39:

Alternative Rules for Controlling Hierarchical Clustering

[Figure: the next clusters to be merged under this rule; the closest pair of points, one from each cluster, is at distance √2]

Page 40:

Alternative Rules for Controlling Hierarchical Clustering

We would next choose to merge the point (10,5) (a cluster of a single point) with the cluster of two points, since (10,5) is at distance √2 from (11,4), and no other pair of unclustered points (that is, clusters of a single point) is that close.

Page 41:

Alternative Rules for Controlling Hierarchical Clustering

b) Take the distance between two clusters to be the average distance of all pairs of points, one from each cluster.

c) Combine the two clusters whose resulting cluster has the lowest radius. The radius of a cluster is the maximum distance from the centroid to any point of the cluster.

Page 42:

Alternative Rules for Controlling Hierarchical Clustering

d) Combine the two clusters whose resulting cluster has the lowest diameter. The diameter of a cluster is the maximum distance between any two points of the cluster. Sketches of these four measures follow.
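Sketches of the four measures in Python (our helper names, reusing euclidean and centroid from earlier):

    from itertools import combinations

    def min_distance(c1, c2):
        # Rule a): closest pair of points, one from each cluster.
        return min(euclidean(p, q) for p in c1 for q in c2)

    def avg_distance(c1, c2):
        # Rule b): average over all pairs, one point from each cluster.
        return sum(euclidean(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

    def radius(c):
        # Rule c): maximum distance from the centroid to any point.
        ctr = centroid(c)
        return max(euclidean(ctr, p) for p in c)

    def diameter(c):
        # Rule d): maximum distance between any two points of the cluster.
        return max((euclidean(p, q) for p, q in combinations(c, 2)), default=0.0)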

Page 43:

Alternative Rules for Stopping the Clustering Process

Based on the previous ideas, here are some other options to stop the merging process:

a) Stop if the diameter (or radius) of the cluster that results from the best merger exceeds a threshold.

b) Stop if the density of the cluster that results from the best merger is below some threshold. The density can be defined in many different ways; roughly, it should be the number of cluster points per unit volume of the cluster.

Page 44:

Alternative Rules for Stopping the Clustering Process

c) Stop when there is evidence that the next pair of clusters to be combined yields a “bad” cluster. For example, we could track the average diameter of the clusters; if combining two clusters makes that average jump suddenly, then those clusters do not deserve to be combined. A sketch of this test follows.
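A minimal sketch of that test, reusing the diameter helper above; the jump factor is an illustrative parameter, not from the text:

    def average_diameter(clusters):
        return sum(diameter(c) for c in clusters) / len(clusters)

    def bad_merge(before, after, factor=1.5):
        # Flag the merge if the average cluster diameter suddenly jumps.
        return average_diameter(after) > factor * average_diameter(before)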

Page 45:

Hierarchical Clustering in a Non-Euclidean Space

• When the space is non-Euclidean, we need a distance measure that is computed directly from the points, such as the Jaccard distance or edit distance, among others.

• A problem arises here: when we need to represent a cluster, we cannot replace a collection of points by their centroid.

Page 46:

Jaccard Coefficient

The Jaccard coefficient measures the similarity between sample sets i and j:

    JC(i, j) = c / (a + b + c)

where:
• c is the number of elements common to i and j
• a is the number of elements exclusive to i
• b is the number of elements exclusive to j

• If i and j share all their elements, JC(i, j) = 1.
• If i and j share no elements, JC(i, j) = 0.

A set-based sketch follows.
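With Python sets, the coefficient is a one-liner; since clustering needs a distance, one commonly uses the Jaccard distance 1 − JC (our addition, not stated on the slide):

    def jaccard(i, j):
        # c / (a + b + c) is exactly |intersection| / |union|.
        return len(i & j) / len(i | j)

    s, t = {"a", "b", "c"}, {"b", "c", "d"}
    print(jaccard(s, t))      # 0.5: two shared elements out of four distinct
    print(1 - jaccard(s, t))  # Jaccard distance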

Page 47:

Edit Distance

• Informally, the edit distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

Page 48:

Example

For example, the edit distance between "kitten" and "sitting" is three, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
• kitten → sitten (substitution of "s" for "k")
• sitten → sittin (substitution of "i" for "e")
• sittin → sitting (insertion of "g" at the end)
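A standard dynamic-programming sketch of this definition:

    def edit_distance(s, t):
        # Classic dynamic program: d[i][j] = edits to turn s[:i] into t[:j].
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    print(edit_distance("kitten", "sitting"))  # 3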

Page 49:

Hierarchical Clustering in a Non-Euclidean Space

• Given that we cannot combine points in a cluster when the space is non-Euclidean, our only choice is to pick one of the points of the cluster itself to represent the cluster.

• Ideally, this point is close to all the points of the cluster, so it in some sense lies in the “center.”

• We call the representative point the clustroid.

Page 50:

Hierarchical Clustering in a Non-Euclidean Space

Common choices include selecting as the clustroid the point that minimizes:
• the sum of the distances to the other points in the cluster
• the maximum distance to another point in the cluster
• the sum of the squares of the distances to the other points in the cluster

A sketch follows.
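A sketch that picks the clustroid under any of the three criteria, given an arbitrary distance function (the criterion names are ours):

    def clustroid(points, dist, criterion="sum"):
        # Return the point minimizing the chosen aggregate of its
        # distances to the other points in the cluster.
        def score(p):
            ds = [dist(p, q) for q in points if q != p]
            if criterion == "sum":
                return sum(ds)
            if criterion == "max":
                return max(ds)
            return sum(d * d for d in ds)  # "sumsq": sum of squared distances
        return min(points, key=score)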

Page 51:

Example

• Suppose we are using edit distance, and a cluster consists of the four points abcd, aecdb, abecb, and ecdab. Their distances are found in the following table:

Page 52:

Example

            abcd   abecb   aecdb   ecdab
    abcd      0      3       3       5
    abecb     3      0       2       4
    aecdb     3      2       0       2
    ecdab     5      4       2       0

Page 53:

Example
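Applying the three criteria to the distances in the table above (values computed from the table):

    Point    Sum    Max    Sum of squares
    abcd      11     5          43
    abecb      9     4          29
    aecdb      7     3          17
    ecdab     11     5          45

The same result from the clustroid sketch above, feeding it the table's distances directly. (Note: the table's values follow the book's variant of edit distance, which allows only insertions and deletions; recomputing them with the substitution-based definition would shrink some entries.)

    # Distances copied from the table above.
    table = {("abcd", "abecb"): 3, ("abcd", "aecdb"): 3, ("abcd", "ecdab"): 5,
             ("abecb", "aecdb"): 2, ("abecb", "ecdab"): 4, ("aecdb", "ecdab"): 2}

    def table_dist(p, q):
        return 0 if p == q else table.get((p, q), table.get((q, p)))

    pts = ["abcd", "abecb", "aecdb", "ecdab"]
    for crit in ("sum", "max", "sumsq"):
        print(crit, clustroid(pts, table_dist, crit))  # aecdb every time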

Page 54:

Example

• We can see from these measurements that whichever of the three criteria we choose, aecdb will be selected as the clustroid.

• In general, different criteria could yield different clustroids.