Clustering: Distance Based Methods
Lecture 14 (and final), November 30, 2005, Multivariate Analysis


Page 1: Clustering

Distance Based Methods

Lecture 14 (and final)

November 30, 2005

Multivariate Analysis

Page 2: Clustering

Lecture #14 - 11/30/2005

Today’s Lecture

Measures of distance (similarity and dissimilarity).

Multivariate methods that use distances (dissimilarities) and similarities.

Hierarchical clustering algorithms.

Multidimensional scaling.

K-means clustering.

Final exam handed out (due Thursday, December 15th, 5pm by hand, 11:59:59pm by email).

Page 3: Clustering

Lecture #14 - 11/30/2005

Clustering Definitions

Clustering objects (or observations) can provide detail regarding the nature and structure of data.

Clustering is distinct from classification.

Classification pertains to a known number of groups, with the objective being to assign new observations to these groups.

Cluster analysis, in contrast, is a technique in which no assumptions are made concerning the number of groups or the group structure.

Page 4: Clustering

Lecture #14 - 11/30/2005

Magnitude of the Problem

Clustering algorithms make use of measures of similarity (or, alternatively, dissimilarity) to define and group variables or observations.

Clustering presents a host of technical problems. For a reasonably sized data set with n objects (either variables or individuals), the number of ways of grouping n objects into k groups is:

$$\frac{1}{k!}\sum_{j=0}^{k}\binom{k}{j}(-1)^{k-j}\, j^{\,n}$$

For example, there are over four trillion ways that 25 objects can be clustered into 4 groups – which solution is best?
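To make the size of the problem concrete, here is a minimal Python sketch of the counting formula above; the function name and the use of Python are my own illustration, not part of the lecture.

from math import comb, factorial

def n_groupings(n, k):
    # Number of ways to sort n objects into k non-empty groups:
    # (1/k!) * sum_{j=0}^{k} C(k, j) * (-1)^(k-j) * j^n
    total = sum(comb(k, j) * (-1) ** (k - j) * j ** n for j in range(k + 1))
    return total // factorial(k)

print(n_groupings(25, 4))  # 46771289738810, about 4.7e13 -- indeed over four trillion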

Page 5: Clustering

Lecture #14 - 11/30/2005

Numerical Problems

In theory, one way to find the best solution is to try each possible grouping of all of the objects, an optimization process called integer programming. Such an approach is difficult, if not impossible, given the state of today’s computers (although computers are catching up with such problems).

Rather than using such brute-force methods, a set of heuristics has been developed to allow for fast clustering of objects into groups.

Such methods are called heuristics because they do not guarantee that the solution will be optimal (best), only that the solution will be better than most.

Page 6: Clustering

Lecture #14 - 11/30/2005

Clustering Heuristic Inputs

The inputs into clustering heuristics are in the form of measures of similarities or dissimilarities. The result of the heuristic depends in large part on the measure of similarity/dissimilarity used by the procedure.

Because of this, we will start this lecture by examining measures of distance (used synonymously with similarity/dissimilarity).

Page 7: Clustering

Lecture #14 - 11/30/2005

Measures of Distance

Care must be taken in choosing the metric by which similarity is quantified. Important considerations include:

The nature of the variables (e.g., discrete, continuous, binary).

Scales of measurement (nominal, ordinal, interval, or ratio).

The nature of the matter under study.

Page 8: Clustering

Lecture #14 - 11/30/2005

Clustering Variables vs. Clustering Observations

When variables are to be clustered, commonly used measures of similarity include correlation coefficients (or similar measures for non-continuous variables).

When observations are to be clustered, distance metrics are often used. The sketch below illustrates the first case.
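As a small illustration of the variables-versus-observations distinction, the Python sketch below turns a correlation matrix into dissimilarities for clustering variables; the random data and the 1 - |r| conversion are my own assumptions (one common choice), not something prescribed in the lecture.

import numpy as np

# Hypothetical data: rows are observations, columns are the variables to be clustered
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

R = np.corrcoef(X, rowvar=False)  # correlations = similarities between variables
D = 1 - np.abs(R)                 # one common way to turn similarities into dissimilarities
print(np.round(D, 2))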

Page 9: Clustering

Lecture #14 - 11/30/2005

Euclidean Distance

Euclidean distance is a frequent choice of a distance metric:

$$d(\mathbf{x},\mathbf{y}) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_p-y_p)^2} = \sqrt{(\mathbf{x}-\mathbf{y})'(\mathbf{x}-\mathbf{y})}$$

This distance is often chosen because it represents an understandable metric. But sometimes it may not be the best choice.

Page 10: Clustering

Lecture #14 - 11/30/2005

Euclidean Distance?

Imagine I wanted to know how many miles it was from my old house in Sacramento to Lombard Street in San Francisco…

Knowing how far it was on a straight line would not do me too much good, particularly with the number of one-way streets that exist in San Francisco.

Page 11: Clustering

Lecture #14 - 11/30/2005

Other Distance Metrics

Other popular distance metrics include the Minkowski metric:

$$d(\mathbf{x},\mathbf{y}) = \left[\sum_{i=1}^{p} \lvert x_i - y_i \rvert^{m}\right]^{1/m}$$

The key to this metric is the choice of m:

If m = 2, this provides the Euclidean distance.

If m = 1, this provides the “city-block” distance.
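A minimal Python sketch of the Minkowski metric above; the vectors and the function name are hypothetical and only meant to show the role of m.

import numpy as np

def minkowski(x, y, m=2):
    # Minkowski distance: m = 2 gives the Euclidean distance, m = 1 the city-block distance
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (np.abs(x - y) ** m).sum() ** (1.0 / m)

x, y = [1.0, 4.0, 2.0], [3.0, 1.0, 2.0]
print(minkowski(x, y, m=2))  # Euclidean: sqrt(4 + 9 + 0), about 3.61
print(minkowski(x, y, m=1))  # city-block: 2 + 3 + 0 = 5.0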

Page 12: Clustering

Lecture #14 - 11/30/2005

Preferred Distance Properties

It is often desirable to use a distance metric that meets the following properties:

d(P,Q) = d(Q,P)
d(P,Q) > 0 if P ≠ Q
d(P,Q) = 0 if P = Q
d(P,Q) ≤ d(P,R) + d(R,Q)

The Euclidean and Minkowski metrics satisfy these properties.

Page 13: Clustering

Lecture #14 - 11/30/2005

Triangle Inequality

The fourth property is called the triangle inequality, which often gets violated by non-routine measures of distance.

This inequality can be shown by the following triangle (with lines representing Euclidean distances):

[Figure: a triangle with vertices P, Q, and R, with sides labeled d(P,Q), d(P,R), and d(R,Q), illustrating d(P,Q) ≤ d(P,R) + d(R,Q).]

Page 14: Clustering

Lecture #14 - 11/30/2005

Binary Variables

In the case of binary-valued variables (variables that have a 0/1 coding), many other distance metrics may be defined.

The squared Euclidean distance provides a count of the number of mismatched values. In the example shown on the slide, d(i,k) = 2. This count is sometimes called the Hamming distance.
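A small Python sketch of this count for two hypothetical binary profiles (the 0/1 vectors below are invented for illustration; they are not the slide's example):

import numpy as np

i = np.array([1, 0, 1, 1, 0])  # hypothetical observation i
k = np.array([1, 1, 1, 0, 0])  # hypothetical observation k

mismatches = int((i != k).sum())               # Hamming distance: count of mismatches
squared_euclidean = int(((i - k) ** 2).sum())  # identical to the mismatch count for 0/1 data
print(mismatches, squared_euclidean)           # 2 2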

Page 15: Clustering

Lecture #14 - 11/30/2005

Other Binary Distance Measures

There are a number of other ways to define the distance between a set of binary variables (as shown on p. 674). Most of these measures reflect the varied importance placed on differing cells in a 2 x 2 table.

Page 16: Clustering

Lecture #14 - 11/30/2005

General Distance Measure Properties

Distance measures that are monotonically related (that is, that order the object distances in the same way) will provide identical results from clustering heuristics.

Many times this will only be an issue if the distance measure is for binary variables or the distance measure does not satisfy the triangle inequality.

Page 17: Clustering

Lecture #14 - 11/30/2005

Hierarchical Clustering Methods

Because of the large number of possible clustering solutions, searching through all combinations is not feasible.

Hierarchical clustering techniques proceed by taking a set of objects and merging or splitting one set at a time.

Two types of hierarchical clustering methods exist:

Agglomerative hierarchical methods.

Divisive hierarchical methods.

Page 18: Clustering

Lecture #14 - 11/30/2005

Agglomerative Clustering Methods

Agglomerative clustering methods start first with the individual objects.

Initially, each object is its own cluster.

The most similar objects are then grouped together into a single cluster (with two objects).

We will find that what we mean by “similar” changes depending on the method.

Page 19: Clustering

Lecture #14 - 11/30/2005

Agglomerative Clustering Methods

The remaining steps involve merging the clusters according to the similarity or dissimilarity of the objects within the cluster to those outside of the cluster.

The method concludes when all objects are part of a single cluster.

Page 20: Clustering

Lecture #14 - 11/30/2005

Divisive Clustering Methods

Divisive hierarchical methods work in the opposite direction, beginning with a single cluster containing all n objects.

The large cluster is then divided into two subgroups where the objects in opposing groups are relatively distant from each other.

Page 21: Clustering

Lecture #14 - 11/30/2005

Divisive Clustering Methods

The process continues similarly until there are as many clusters as there are objects.

While it is briefly discussed in the chapter, we will not discuss divisive clustering methods in detail in this class.

Page 22: Clustering

Lecture #14 - 11/30/2005

To Summarize

So you can see that we have this idea of steps:

At each step two clusters combine to form one (Agglomerative)

OR…

At each step a cluster is divided into two new clusters (Divisive)

Page 23: Clustering

Lecture #14 - 11/30/2005

Methods for Viewing Clusters

As you can imagine, the hierarchical clustering methods form a large number of clusters sequentially.

One of the most frequently used tools to view the clusters (and the level at which they were formed) is the dendrogram.

A dendrogram is a graph that describes the differing hierarchical clusters and the distance at which each is formed.

Page 24: Clustering

Lecture #14 - 11/30/2005

Example Data Set #1

To demonstrate several of the hierarchical clustering methods, an example data set is used.

Data come from a 1991 study by the economic research department of the Union Bank of Switzerland representing economic conditions of 48 cities around the world. Three variables were collected:

Average working hours for 12 occupations.

Price of 112 goods and services excluding rent.

Index of net hourly earnings in 12 occupations.

Page 25: Clustering

Lecture #14 - 11/30/2005

Example of a Dendrogram

This is an example of agglomerative clustering, where the cities start off in their own groups and are then combined.

For example, here Amsterdam and Brussels were combined to form a group.

Notice that we do not specify the number of groups; but if we know how many we want, we simply go to the step where there are that many groups.

The cities are listed along one axis of the dendrogram; where the lines connect is where those two previous groups were joined.

Page 26: Clustering

Lecture #14 - 11/30/2005

Similarity?

So, we mentioned that the most similar objects are grouped together into a single cluster (with two objects).

The next question is how we measure similarity between clusters. More specifically, how do we redefine it when a cluster contains a combination of old clusters?

We find that there are several ways to define “similar,” and each way defines a new method of clustering.

Page 27: Clustering

Lecture #14 - 11/30/2005

Agglomerative Methods

Next we discuss several different ways to complete agglomerative hierarchical clustering:

Single linkage.
Complete linkage.
Average linkage.
Centroid.
Median.
Ward's method.

Page 28: Clustering

Lecture #14 - 11/30/2005

Agglomerative Methods

In describing these…

I will give the definition of what “similar” means when clusters are combined.

For the first two, I will give detailed examples.

Page 29: Clustering

Lecture #14 - 11/30/2005

Example Distance Matrix

The example will be based on the distance matrix below

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

Page 30: Clustering

Lecture #14 - 11/30/2005

Single Linkage

The single linkage method of clustering combines clusters by finding the “nearest neighbor”: the distance between two clusters is the smallest distance between any observation in one cluster and any observation in the other.

[Figure: Cluster 1 = {1, 2} and Cluster 2 = {3, 4, 5}, with a line connecting the two closest observations.]

Notice that this is equal to the minimum distance between any observation in cluster one and any observation in cluster two.

Page 31: Clustering

Lecture #14 - 11/30/2005

Single Linkage

So the distance between any two clusters is:

$$D(A,B) = \min\{\, d(\mathbf{y}_i, \mathbf{y}_j) \,\}\quad \text{for all } \mathbf{y}_i \in A \text{ and } \mathbf{y}_j \in B$$

[Figure: the two clusters again, with the single linkage distance drawn between the closest pair. Notice any other distance is longer.]

So how would we do this using our distance matrix?

Page 32: Clustering

Lecture #14 - 11/30/2005

Single Linkage Example

The first step in the process is to determine the two elements with the smallest distance and combine them into a single cluster.

Here, the two objects that are most similar are objects 3 and 5. We will now combine these into a new cluster and compute the distance from that cluster to the remaining clusters (objects) via the single linkage rule.

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

Page 33: Clustering

Lecture #14 - 11/30/2005

Single Linkage Example

The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5. The distance from the (35) cluster to the remaining objects is given below:

d(35)1 = min{d31, d51} = min{3, 11} = 3
d(35)2 = min{d32, d52} = min{7, 10} = 7
d(35)4 = min{d34, d54} = min{9, 8} = 8

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

These are the distances of 3 and 5 with object 1. Our rule says that the distance of our new cluster with 1 is equal to the minimum of these two values: 3.

These are the distances of 3 and 5 with object 2. Our rule says that the distance of our new cluster with 2 is equal to the minimum of these two values: 7.

In equation form, our new distances are the d(35)j values given above.

Page 34: Clustering

Lecture #14 - 11/30/2005

Single Linkage Example

Using the distance values, we now consolidate our table so that (35) is now a single row/column. The distance from the (35) cluster to the remaining objects is given below:

d(35)1 = min{d31, d51} = min{3, 11} = 3
d(35)2 = min{d32, d52} = min{7, 10} = 7
d(35)4 = min{d34, d54} = min{9, 8} = 8

(35) 1 2 4

(35) 0

1 3 0

2 7 9 0

4 8 6 5 0

Page 35: Clustering

Lecture #14 - 11/30/2005

Single Linkage Example

We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between object 1 and cluster (35). Therefore, object 1 joins cluster (35), creating cluster (135).

(35) 1 2 4

(35) 0 3 7 8

1 3 0 9 6

2 7 9 0 5

4 8 6 5 0

The distance from cluster (135) to the other clusters is then computed:

d(135)2 = min{d(35)2, d12} = min{7, 9} = 7
d(135)4 = min{d(35)4, d14} = min{8, 6} = 6

Page 36: Clustering

Lecture #14 - 11/30/2005

Single Linkage Example

Using the distance values, we now consolidate our table so that (135) is now a single row/column. The distance from the (135) cluster to the remaining objects is given below:

(135) 2 4

(135) 0

2 7 0

4 6 5 0

d(135)2 = min{d(35)2, d12} = min{7, 9} = 7
d(135)4 = min{d(35)4, d14} = min{8, 6} = 6

Page 37: Clustering

Lecture #14 - 11/30/2005

Single Linkage Example

We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between object 2 and object 4. These two objects are joined to form cluster (24). The distance from (24) to (135) is then computed:

d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6

The final cluster (12345) is formed at a distance of 6 (see the sketch after the matrix below for a check of these merge heights).

(135) 2 4

(135) 0

2 7 0

4 6 5 0
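The same merge sequence can be checked in a few lines of Python using scipy's hierarchical clustering routines; this is an illustrative cross-check of the worked example, not the SAS code used later in the lecture.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The lecture's 5-object distance matrix
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

# linkage() expects the condensed (upper-triangle) form of the distance matrix
Z = linkage(squareform(D), method='single')
print(Z[:, 2])  # merge distances: 2, 3, 5, 6, matching the worked example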

Page 38: Clustering

Lecture #14 - 11/30/2005

The Dendrogram

For example, here is where 3 and 5 joined

Page 39: Clustering

Lecture #14 - 11/30/2005

Complete Linkage

The complete linkage method of clustering combines clusters by finding the “farthest neighbor”: the distance between two clusters is the largest distance between any observation in one cluster and any observation in the other. This ensures that all objects in a cluster are within some maximum distance of each other.

[Figure: Cluster 1 = {1, 2} and Cluster 2 = {3, 4, 5}, with a line connecting the two farthest observations.]

Page 40: Clustering

Lecture #14 - 11/30/2005

Complete Linkage

So the distance between any two clusters is:

$$D(A,B) = \max\{\, d(\mathbf{y}_i, \mathbf{y}_j) \,\}\quad \text{for all } \mathbf{y}_i \in A \text{ and } \mathbf{y}_j \in B$$

[Figure: the two clusters again, with the complete linkage distance drawn between the farthest pair. Notice any other distance is shorter.]

So how would we do this using our distance matrix?

Page 41: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

To demonstrate complete linkage in action, consider the five-object distance matrix.

The first step in the process is to determine the two elements with the smallest distance, and combine them into a single cluster.

Here, the two objects that are most similar are objects 3 and 5.

We will now combine these into a new cluster, and compute the distance from that cluster to the remaining clusters (objects) via the complete linkage rule.

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

Page 42: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5.

The distance from the (35) cluster to the remaining objects is given below:

d(35)1 = max{d31, d51} = max{3, 11} = 11
d(35)2 = max{d32, d52} = max{7, 10} = 10
d(35)4 = max{d34, d54} = max{9, 8} = 9

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

Page 43: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

Using the distance values, we now consolidate our table so that (35) is now a single row/column.

The distance from the (35) cluster to the remaining objects is given below:

d(35)1 = max{d31, d51} = max{3, 11} = 11
d(35)2 = max{d32, d52} = max{7, 10} = 10
d(35)4 = max{d34, d54} = max{9, 8} = 9

(35) 1 2 4

(35) 0

1 11 0

2 10 9 0

4 9 6 5 0

Notice our new computed distances with (35).

Page 44: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

We now repeat the process by finding the smallest distance within the set of remaining clusters.

The smallest distance is between object 2 and object 4. Therefore, they form cluster (24).

The distance from cluster (24) to the other clusters is then computed:

d(24)(35) = max{d2(35), d4(35)} = max{10, 9} = 10
d(24)1 = max{d21, d41} = max{9, 6} = 9

(35) 1 2 4

(35) 0 11 10 9

1 11 0 9 6

2 10 9 0 5

4 9 6 5 0

So now we use our rule to combine 2 and 4.

Page 45: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

Using the distance values, we now consolidate our table so that (24) is now a single row/column.

The distance from the (24) cluster to the remaining objects is given below:

d(24)(35) = max{d2(35), d4(35)} = max{10, 9} = 10
d(24)1 = max{d21, d41} = max{9, 6} = 9

(35) (24) 1

(35) 0 10 11

(24) 10 0 9

1 11 9 0

Notice our 10 and 9

Page 46: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

We now repeat the process by finding the smallest distance within the set of remaining clusters. The smallest distance is between cluster (24) and object 1. These two are joined to form cluster (124). The distance from (124) to (35) is then computed:

d(35)(124) = max{d1(35), d(24)(35)} = max{11, 10} = 11

The final cluster (12345) is formed at a distance of 11 (a scipy check follows the matrix below).

(35) (24) 1

(35) 0 10 11

(24) 10 0 9

1 11 9 0
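As with single linkage, the complete linkage merge heights can be cross-checked with scipy; again this is only an illustrative check of the worked example, using the same 5-object matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

Z = linkage(squareform(D), method='complete')
print(Z[:, 2])  # merge distances: 2, 5, 9, 11, matching the worked example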

Page 47: Clustering

Lecture #14 - 11/30/2005

The Dendrogram

Page 48: Clustering

Lecture #14 - 11/30/2005

Average Linkage

The average linkage method proceeds similarly to the single and complete linkage methods, except that the distance between two clusters is now the average of the distances between all pairs of objects, one from each cluster.

In practice, the average linkage method will produce very similar results to the complete linkage method.

Page 49: Clustering

Lecture #14 - 11/30/2005

So…

To summarize, we have explained three of the methods used to combine groups.

Notice that once objects are in the same group, they cannot be separated.

Also, as before with factor analysis, the method you use is largely up to you.

Page 50: Clustering

Lecture #14 - 11/30/2005

Example Distance Matrix

To further demonstrate the agglomerative methods, imagine we had the following distance matrix for five objects:

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

Page 51: Clustering

Lecture #14 - 11/30/2005

Linkage Methods for Agglomerative Clustering

We will discuss three techniques for agglomerative clustering:

1. Single linkage.
2. Complete linkage.
3. Average linkage.

Each of these techniques provides an assignment rule that determines when an observation becomes part of a cluster.

Page 52: Clustering

Lecture #14 - 11/30/2005

Cluster Analysis in SAS

To perform hierarchical clustering in SAS, proc cluster is used.

Prior to using proc cluster, the data must be converted from the raw observations into a distance matrix.

SAS provides a built-in macro to create distance matrices that are then used as input into the proc cluster routine.

To produce dendrograms in SAS, proc tree is used.

Page 53: Clustering

Lecture #14 - 11/30/2005

Distance Macro

Prior to specifying the distance macro, a pair of ‘%include’ statements must be placed (preferably at the beginning of the syntax file):

%include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas';
%include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas';

The generic syntax for the distance-creating macro is to use %distance() – all arguments are optional.

%distance(data=cities, id=city, options=nomiss, out=distcity, shape=square, method=euclid, var=work--salary);

For more information on the distance macro, go to:

http://support.sas.com/ctx/samples/index.jsp?sid=475
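For readers without SAS, here is a rough Python analogue of what the %distance call produces; the two data rows are taken from the SAS data step on the next slide, while the rest (variable names, use of scipy) is my own illustration.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# work, price, salary for two of the cities listed in the SAS example
X = np.array([
    [1714, 65.6, 49.0],   # Amsterdam
    [1792, 53.8, 30.4],   # Athens
])

# Square Euclidean distance matrix, analogous to out=distcity, shape=square, method=euclid
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 1))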

Page 54: Clustering

Lecture #14 - 11/30/2005

Distance Macro Example

%include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas';
%include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas';

title '1991 City Data';

data cities;
  input city $ work price salary;
  cards;
Amsterdam 1714 65.6 49.0
Athens 1792 53.8 30.4
...
;

%distance(data=cities, id=city, options=nomiss, out=distcity,
          shape=square, method=euclid, var=work--salary);

Page 55: Clustering

Lecture #14 - 11/30/2005

Proc Cluster

The proc cluster guide can be found at:
http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap23/index.htm

The generic syntax is:

proc cluster data=dataset method=single;
  id idvariable;
run;

Page 56: Clustering

Lecture #14 - 11/30/2005

Same Example – in SAS

*SAS Example #1;
%include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\xmacro.sas';
%include 'C:\Program Files\SAS Institute\SAS\V8\stat\sample\distnew.sas';

title 'small distance matrix';
data dist(type=distance);
  input object1-object5 name $;
  cards;
0 . . . . object1
9 0 . . . object2
3 7 0 . . object3
6 5 9 0 . object4
11 10 2 8 0 object5
;
run;

title 'Single Link';
proc cluster data=dist method=single nonorm;
  id name;
run;
proc tree horizontal spaces=2;
  id name;
run;

The CLUSTER Procedure
Single Linkage Cluster Analysis
Mean Distance Between Observations = 7

Cluster History
                                     Norm Min
NCL  --Clusters Joined---    FREQ        Dist    Tie
  4  object3    object5         2      0.2857
  3  object1    CL4             3      0.4286
  2  object2    object4         2      0.7143
  1  CL3        CL2             5      0.8571

Page 57: Clustering

Lecture #14 - 11/30/2005

Dendrogram for Single Link Clustering Method

Page 58: Clustering

Lecture #14 - 11/30/2005

Complete Linkage Example

The shaded rows/columns are the portions of the table with the distances from an object to object 3 or object 5. The distance from the (35) cluster to the remaining objects is given below:

d(35)1 = max{d31, d51} = max{3, 11} = 11
d(35)2 = max{d32, d52} = max{7, 10} = 10
d(35)4 = max{d34, d54} = max{9, 8} = 9

1 2 3 4 5

1 0

2 9 0

3 3 7 0

4 6 5 9 0

5 11 10 2 8 0

Page 59: Clustering

Lecture #14 - 11/30/2005

Same Example – in SAS

title 'Complete Link';
proc cluster data=dist method=complete nonorm;
  id name;
run;
proc tree horizontal spaces=2;
  id name;
run;

The CLUSTER Procedure
Complete Linkage Cluster Analysis

Cluster History
                                      Max
NCL  --Clusters Joined---    FREQ    Dist    Tie
  4  object3    object5         2       2
  3  object2    object4         2       5
  2  object1    CL3             3       9
  1  CL2        CL4             5      11

Page 60: Clustering

Lecture #14 - 11/30/2005

Dendrogram for Complete Link Clustering Method

Page 61: Clustering

Lecture #14 - 11/30/2005

Average Linkage

The average linkage method proceeds similarly to the single and complete linkage methods, except that the distance between two clusters is now the average of the distances between all pairs of objects, one from each cluster.

In practice, the average linkage method will produce very similar results to the complete linkage method.

In the interests of time, I will forgo the gory details and get right into the SAS code…

Page 62: Clustering

Lecture #14 - 11/30/2005

Average Link Example – in SAS

title 'Average Link';
proc cluster data=dist method=average nonorm;
  id name;
run;
proc tree horizontal spaces=2;
  id name;
run;

The CLUSTER Procedure
Average Linkage Cluster Analysis

Cluster History
                                      RMS
NCL  --Clusters Joined---    FREQ    Dist    Tie
  4  object3    object5         2        2
  3  object2    object4         2        5
  2  object1    CL3             3   7.6485
  1  CL2        CL4             5   8.4063

Page 63: Clustering

Lecture #14 - 11/30/2005

Dendrogram for Average Link Clustering Method

Page 64: Clustering

Lecture #14 - 11/30/2005

Clustering Cutoffs

In hierarchical clustering, the final decision is what distance (cutoff) to use to define the final set of clusters. Many people choose a cutoff based on one of two criteria:

1. There is a large “gap” between agglomeration steps.
2. The clusters represent sets of objects that are consistent with preconceived theories.

To demonstrate, let’s pull out the economic data from the cities of the world.

Page 65: Clustering

Lecture #14 - 11/30/2005

City Data Example

For this data set, we will use the average linkage method of agglomeration. This corresponds to SAS Example #2 in the code…

%distance(data=cities, id=city, options=nomiss, out=distcity,
          shape=square, method=euclid, var=work--salary);
proc cluster data=distcity method=average;
  id city;
run;
proc tree horizontal spaces=2;
  id city;
run;

Page 66: Clustering

Lecture #14 - 11/30/2005

City Data Example

The CLUSTER Procedure
Average Linkage Cluster Analysis
Root-Mean-Square Distance Between Observations = 250.8616

Cluster History
                                      Norm RMS
NCL  --Clusters Joined---    FREQ         Dist    Tie
 45  Caracas     Singpore       2       0.0251
 44  London      Paris          2       0.0298
 43  Milan       Vienna         2       0.0331
 42  Amsterda    Brussels       2       0.0409
 41  Lisbon      Rio_de_J       2       0.0586
 40  Mexico_C    Nairobi        2       0.059
 39  Copenhag    Madrid         2       0.0593
 38  Geneva      Zurich         2       0.0636
 37  Bogota      Kuala_Lu       2       0.0641
 36  Frankfur    Sydney         2       0.0804
 35  Nicosia     Seoul          2       0.0814
 34  Dublin      CL44           3       0.0822
 33  Chicago     New_York       2       0.0824
 32  Johannes    CL40           3       0.0834

Page 67: Clustering

Lecture #14 - 11/30/2005

City Data Dendrogram

Page 68: Clustering

Lecture #14 - 11/30/2005

K-means Clustering

Another procedure used to cluster objects is the k-means clustering method.

K-means initially (and randomly) assigns objects to a total of k clusters.

The method then goes through each object and reassigns it to the cluster whose mean (centroid) is closest.

The mean (centroid) of the receiving cluster is then updated to reflect the new observation.

This process is repeated until very little change in the clusters occurs.

Page 69: Clustering

Lecture #14 - 11/30/2005

K-means Clustering

With K-means, the user does not have to choose a distance value as a final cutoff criterion, as in hierarchical clustering. K-means is, however, sensitive to local optima:

Solutions will depend upon starting values.

The method needs to be re-run multiple times to find the best solution (see the sketch below).
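A hedged scikit-learn sketch of the same idea (not the lecture's SAS approach); the n_init argument re-runs the algorithm from several random starts and keeps the best solution, which addresses the local-optimum issue just noted. The five rows are city values that appear in the SAS listings; treating them as the whole data set is purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# work, price, salary for a handful of cities
X = np.array([
    [1714, 65.6, 49.0],   # Amsterdam
    [1792, 53.8, 30.4],   # Athens
    [1583, 115.5, 63.7],
    [1978, 71.9, 46.3],
    [2375, 63.8, 27.8],
])

km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each row
print(km.cluster_centers_)  # cluster means (centroids)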

Page 70: Clustering

Lecture #14 - 11/30/2005

K-means Example (SAS Example #3)

To do K-means clustering in SAS, the fastclus procedure is used. The guide for the fastclus procedure can be found at:

http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap27/index.htm

We will attempt to use proc fastclus to cluster our city data set into three clusters.

Page 71: Clustering

Lecture #14 - 11/30/2005

K-means Example

*SAS Example #3 - K-means with Cities;
proc fastclus data=cities maxc=3 maxiter=1000 out=clus;
  var work price salary;
run;
proc sort data=clus;
  by cluster;
run;
proc print data=clus;
  var city cluster;
run;

Page 72: Clustering

Lecture #14 - 11/30/2005

1991 City Data

The FASTCLUS Procedure
Replace=FULL  Radius=0  Maxclusters=3  Maxiter=1000  Converge=0.02

Initial Seeds
Cluster            work           price          salary
--------------------------------------------------------
      1     1583.000000      115.500000       63.700000
      2     1978.000000       71.900000       46.300000
      3     2375.000000       63.800000       27.800000

Minimum Distance Between Initial Seeds = 397.5133

Iteration History
                          Relative Change in Cluster Seeds
Iteration    Criterion          1          2          3
---------------------------------------------------------
        1      75.0340     0.3370     0.0896     0.1411
        2      48.8119     0.0639     0.0304     0.2519
        3      43.4215     0.0118     0.0161          0

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 43.3177

Cluster Summary
                         RMS Std      Maximum Distance      Nearest     Distance Between
Cluster    Frequency     Deviation    from Seed to Obs      Cluster     Cluster Centroids
------------------------------------------------------------------------------------------
      1           23       39.4432               163.8            2                 222.2
      2           18       47.4112               119.2            1                 222.2
      3            5       58.8981               154.3            2                 259.8

Page 73: Clustering

Lecture #14 - 11/30/2005

Statistics for Variables
Variable     Total STD    Within STD    R-Square    RSQ/(1-RSQ)
----------------------------------------------------------------
work         174.34255      70.90538    0.841945       5.326923
price         21.38918      20.69266    0.105665       0.118149
salary        24.75770      23.79629    0.117217       0.132782
OVER-ALL     102.41381      44.80336    0.817123       4.468145

Pseudo F Statistic = 96.07

Approximate Expected Over-All R-Squared = 0.87227
Cubic Clustering Criterion = -2.025
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster            work           price          salary
--------------------------------------------------------
      1     1740.826087       75.739130       45.369565
      2     1962.777778       67.394444       38.216667
      3     2221.400000       53.900000       17.540000

Cluster Standard Deviations
Cluster            work           price          salary
--------------------------------------------------------
      1     62.80246968     20.12462849     17.83641410
      2     72.77568724     21.58764549     31.32299005
      3     99.21844587     19.87171356     12.95272172

Page 74: Clustering

Lecture #14 - 11/30/2005

Cluster membership from proc print (city, CLUSTER):

Cluster 1: Amsterda, Athens, Brussels, Copenhag, Dublin, Dusseldo, Frankfur, Helsinki, Lagos, Lisbon, London, Luxembou, Madrid, Milan, Montreal, Nicosia, Oslo, Paris, Rio_de_J, Seoul, Stockhol, Sydney, Vienna

Cluster 2: Bombay, Buenos_A, Caracas, Chicago, Geneva, Houston, Johannes, Los_Ange, Mexico_C, Nairobi, New_York, Panama, Sao_Paul, Singpore, Tel_Aviv, Tokyo, Toronto, Zurich

Cluster 3: Bogota, Hong_Kon, Kuala_Lu, Manila, Taipei

Page 75: Clustering

Lecture #14 - 11/30/2005

Multidimensional Scaling

Up to this point, we have been working with methods for taking distances and then clustering objects. But does using distances remind you of anything?

Have you ever been in a really long car ride, and had nothing better to look at than an atlas?

Did you ever notice that the back of the atlas provides a nice distance matrix to look at?

Those distances come from the proximity of cities on the map…

Page 76: Clustering

Lecture #14 - 11/30/2005

Multidimensional Scaling

Multidimensional scaling is the process by which a map is drawn from a set of distances.

Before we get into the specifics of the procedure, let’s begin with an example.

Let’s take a state (that is a favorite of mine) that you may or may not be very familiar with… Nevada.

Page 77: Clustering

Lecture #14 - 11/30/2005

Multidimensional Scaling

You may be familiar with several cities in NV…

Las Vegas
Reno
Carson City

But have you heard of…

Lovelock?
Fernley?
Winnemucca?

Page 78: Clustering

Lecture #14 - 11/30/2005

Mapping Nevada

From the State of Nevada website, I was able to download a mileage chart for the distance between cities in the Great State of Nevada.

Using this chart and MDS, we will now draw a map of the Silver State with the cities in their proper locations…

Page 79: Clustering

Lecture #14 - 11/30/2005

Nevada…from MDS vs. from a Map

Page 80: Clustering

Lecture #14 - 11/30/2005

How MDS Works

Multidimensional scaling attempts to recreate the distances in the observed distance matrix with a set of coordinates in a smaller number of dimensions.

The algorithm works by selecting a set of values for the fitted distances, then computing an index of how badly those distances fit, called Stress:

$$\text{Stress}(q) = \left\{ \frac{\sum_{i<k}\left( d_{ik}^{(q)} - \hat{d}_{ik}^{(q)} \right)^{2}}{\sum_{i<k}\left[ d_{ik}^{(q)} \right]^{2}} \right\}^{1/2}$$
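A minimal numpy sketch of the Stress index defined above, assuming d_obs holds the observed distances and d_fit the distances reproduced by the fitted configuration (both as square symmetric matrices); this is my own illustration, not SAS's internal computation.

import numpy as np

def stress(d_obs, d_fit):
    # Only the i < k pairs (upper triangle) enter the sums, as in the formula above
    iu = np.triu_indices_from(d_obs, k=1)
    num = ((d_obs[iu] - d_fit[iu]) ** 2).sum()
    den = (d_obs[iu] ** 2).sum()
    return np.sqrt(num / den)

d_obs = np.array([[0., 9., 3.], [9., 0., 7.], [3., 7., 0.]])
d_fit = np.array([[0., 8., 4.], [8., 0., 7.], [4., 7., 0.]])
print(round(stress(d_obs, d_fit), 3))  # about 0.12 for these toy matrices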

Page 81: Clustering

Lecture #14 - 11/30/2005

How MDS Works (Continued)

Based on an optimization procedure, new values for the distances are selected, and the new value for Stress is computed.

The algorithm exits when values of Stress do not change by large amounts.

Note that this procedure is also subject to local optima – multiple runs of the procedure may be necessary.

Page 82: Clustering

Lecture #14 - 11/30/2005

Types of MDS

Although this is a gross simplification of the methods used for MDS, there are generally two techniques used to scale distances:

Metric MDS – uses the distances (magnitudes) to obtain the solution.

Nonmetric MDS – uses only the ordinal (rank) information of the objects' distances to obtain the solution.
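For comparison with the SAS runs shown later, here is an illustrative scikit-learn sketch of both variants on a toy precomputed dissimilarity matrix; the matrix and parameter choices are assumptions, not the lecture's settings.

import numpy as np
from sklearn.manifold import MDS

D = np.array([[0., 9., 3.],
              [9., 0., 7.],
              [3., 7., 0.]])

# metric=True uses the magnitudes of the dissimilarities;
# metric=False (nonmetric MDS) uses only their rank order.
metric_coords = MDS(n_components=2, dissimilarity='precomputed',
                    metric=True, random_state=0).fit_transform(D)
nonmetric_coords = MDS(n_components=2, dissimilarity='precomputed',
                       metric=False, random_state=0).fit_transform(D)
print(metric_coords)
print(nonmetric_coords)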

Page 83: Clustering

Lecture #14 - 11/30/2005

MDS of Paired Comparison Data

In graduate school, I once went through a phase where I was into NASCAR. A friend of mine and I would always argue about whether or not NASCAR was a sport. As graduate students in Quantitative Psychology, we were trained to settle such arguments the old-fashioned way: by applying advanced quantitative methods to data sets.

Page 84: Clustering

Lecture #14 - 11/30/2005

MDS of Sports Perceptions

So, I set out to understand a bit more about how auto racing in general is perceived relative to other “sports.”

I created a questionnaire asking people to judge the similarity of two sports at a time, across 21 different sports.

I had eight of my closest friends complete the survey (which was quite lengthy because of all of the pairs of sports needing to be compared).

Page 85: Clustering

Lecture #14 - 11/30/2005

MDS in SAS

To do MDS in SAS, we will use proc mds. The online user guide for proc mds can be found at:
http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap40/index.htm

This is also SAS Example #5 in the example code file online.

Page 86: Clustering

Lecture #14 - 11/30/2005

MDS of Sports Perceptions

proc mds data=sports level=ordinal coef=diagonal dimension=2 pfinal
         out=out outres=res;
run;

title1 'Plot of configuration';
%plotit(data=out(where=(_type_='CONFIG')), datatype=mds,
        labelvar=_name_, vtoh=1.75);
%plotit(data=out(where=(_type_='CONFIG')), datatype=mds,
        plotvars=dim1 dim2, labelvar=_name_, vtoh=1.75);
run;

Page 87: Clustering

Lecture #14 - 11/30/2005

Multidimensional Scaling: Data=WORK.SPORTS.DATA Shape=TRIANGLE Condition=MATRIX
Level=ORDINAL Coef=DIAGONAL Dimension=2 Formula=1 Fit=1
Mconverge=0.01 Gconverge=0.01 Maxiter=100 Over=2 Ridge=0.0001

Iteration History (Badness-of-Fit Criterion and Convergence Measures)
Iteration    Type        Criterion   Change in Criterion   Monotone   Gradient
--------------------------------------------------------------------------------
        0    Initial        0.4480                     .          .          .
        1    Monotone       0.2841                0.1638     0.3576     0.9220
        2    Lev-Mar        0.2230                0.0612          .          .
        3    Monotone       0.2094                0.0136     0.0704     0.8708
        4    Lev-Mar        0.1411                0.0683          .          .
        5    Monotone       0.1322              0.008879     0.0481     0.7882
        6    Lev-Mar        0.1022                0.0300          .          .
        7    Monotone       0.0904                0.0118     0.0438     0.6619
        8    Lev-Mar        0.0768                0.0136          .          .
        9    Monotone       0.0735              0.003282     0.0185     0.5160
       10    Lev-Mar        0.0684              0.005102          .          .
       11    Monotone       0.0674              0.001066   0.009985     0.4212
       12    Lev-Mar        0.0644              0.002934          .     0.3329
      ...    (iterations 13 through 80, all Lev-Mar steps, with the criterion
              decreasing gradually from 0.0637 to 0.0373)
       81    Lev-Mar        0.0373             5.147E-10          .   0.000382

Convergence criteria are satisfied.

Page 88: Clustering

Lecture #14 - 11/30/2005

Multidimensional Scaling: Data=WORK.SPORTS.DATA Shape=TRIANGLE Condition=MATRIX
Level=ORDINAL Coef=DIAGONAL Dimension=2 Formula=1 Fit=1

Configuration
                    Dim1     Dim2
Auto_Racing         2.06     2.50
Baseball            0.14     0.37
Basketball         -0.72     0.15
Biking              1.56     1.14
Bowling            -0.22    -1.27
Boxing              0.60    -1.27
Card_Games         -1.16     0.68
Chess              -0.75     0.27
Football           -1.46    -0.10
Golf                0.55    -1.06
Hockey             -1.27     0.64
Martial_Arts        0.15    -1.96
Pool               -0.15    -0.34
Skiing              1.24    -0.23
Soccer             -1.47     0.29
Swimming            0.06     0.66
Tennis             -0.91    -0.67
Track               0.81     1.20
Volleyball         -0.12    -0.49
Weightlifting       1.07    -0.52

Dimension Coefficients
_MATRIX_       1       2
       1    0.97    1.03

              Number of             Badness-of-                  Uncorrected
              Nonmissing                    Fit     Distance        Distance
_MATRIX_            Data   Weight    Criterion   Correlation     Correlation
       1              45     1.00         0.04          1.00            1.00

Page 89: Clustering

Lecture #14 - 11/30/2005

MDS Plot

Page 90: Clustering

Lecture #14 - 11/30/2005

More MDS

As alluded to earlier, there are multiple methods for performing MDS.

There are ways of examining MDS for each individual (Individual Differences Scaling, INDSCAL).

There are ways of doing MDS on categorical data (Correspondence Analysis).

Depending on the level of need, I am thinking about creating an MDS course where we can go into MDS in great detail.

Page 91: Clustering

Lecture #14 - 11/30/2005

Wrapping Up

Today we covered methods for clustering and for displaying distances.

Again, we only scratched the surface of the methods presented today.

For further information, please take my clustering seminar next semester.

KU is going bowling!

Page 92: Clustering

Lecture #14 - 11/30/2005

Next Time…

I will be gone next week, but will be holding office hours on Monday 12/12 from 1-4pm and on Tuesday 12/13 from 1-4pm (and by appointment).

Hopefully our paths will cross again in the future.

It has been a joy to be your instructor, and I hope you found this class to be everything you expected.