Lecture 3 - Cluster Analysis


  • Slide 1

    Cluster Analysis

    Determining the similarity between data points

    Hierarchical Clustering (Ward's method)

    K-means Clustering

    Fuzzy c-means Clustering

  • Slide 2

    Regression Clustering

    How can we analyse these two different types of data distributions?

  • Slide 3

    Cluster analysis provides a method to separate data points into groups (so-called clusters) which have similar properties.

    Cluster analysis is a multivariate technique, so it can work in large numbers of dimensions, where each dimension represents one property of the data point.

    If we plot points in space, what can we use as a good measure of how similar or different two points are?

    A natural measure of the similarity of points in space is the distance which separates them.

  • Slide 4

    A natural measure of the similarity of points in space is the distance which separates them.

    For two points (x1,y1) and (x2,y2), the Euclidean distance (r) is:

    r = sqrt((x1 - x2)^2 + (y1 - y2)^2)

  • Slide 5

    Euclidean distances can be calculated for any number of dimensions. For example, in a system with 4 dimensions (w, x, y, z):

    r = sqrt((w1 - w2)^2 + (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

    If we have two data points (pt1 & pt2) as column vectors, where each value represents the location in a given dimension, we could find the Euclidean distance between them by:

    function r=euclid(pt1,pt2)
    %find the difference in all dimensions
    difference=pt1-pt2;
    %square the differences
    difference=difference.^2;
    %sum the squared differences across all dimensions
    total=sum(difference);
    %take the square-root to find the Euclidean distance
    r=sqrt(total);
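    As a quick check (an illustrative example, not from the slides), the points (0,0) and (3,4) should be separated by a distance of 5:

    >> pt1=[0;0];
    >> pt2=[3;4];
    >> r=euclid(pt1,pt2) Returns r = 5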

  • Slide 6

    Manhattan (Street Block) Distance

    Instead of taking the shortest distance between two points (x1,y1) and (x2,y2), i.e. the Euclidean distance r = sqrt((x1 - x2)^2 + (y1 - y2)^2), we sum their separation in each dimension:

    r = |x1 - x2| + |y1 - y2|

  • Slide 7

    Again, Manhattan distances can be calculated for any number of dimensions. For example, in a system with 4 dimensions (w, x, y, z):

    r = |w1 - w2| + |x1 - x2| + |y1 - y2| + |z1 - z2|

    If we have two data points (pt1 & pt2) as column vectors, where each value represents the location in a given dimension, we could find the Manhattan distance between them by:

    function r=manhattan(pt1,pt2)
    %find the difference in all dimensions
    difference=pt1-pt2;
    %square the differences
    difference=difference.^2;
    %take the square-root to find the absolute distance in each dimension
    difference=sqrt(difference);
    %sum the absolute distances across all dimensions
    r=sum(difference);
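    As a quick check (an illustrative example, not from the slides), the same points (0,0) and (3,4) give a Manhattan distance of 7:

    >> r=manhattan([0;0],[3;4]) Returns r = 7, larger than the Euclidean distance of 5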

  • Slide 8

    Data normalisation

    In cases where the numbers of one variable are much larger than those of the other variables, they can dominate the distance measure. For example, in a 4-dimensional analysis of humans we could have: height, foot length, finger length and ear width.

    In this case it is clear that the absolute values of height are much larger than the other variables, therefore they will dominate the distance and the similarity measure.

    To avoid this effect each variable is normalised before the analysis. This means setting the mean of each variable to 0 and the standard deviation to 1, which is called standardizing the data, or taking the z-score.

  • Slide 9

    Data normalisation

    We can standardize data (set the mean to 0 and the standard deviation to 1) very simply in Matlab:

    >> x=rand(100,1); Generate 100 random numbers to act as our data

    >> mean(x) Find the mean (should be ~0.5)

    >> std(x) Find the S.D. (should be ~0.29)

    >> xn=x-mean(x); Form variable xn, which is x minus the mean of x

    >> mean(xn) Find the mean of the adjusted values (~0.0)

    >> xn=xn./std(x); Adjust the S.D. of xn by dividing by the S.D. of x

    >> std(xn) Find the S.D. of the adjusted values (1.0)

    The same procedure can be performed using the zscore function:

    >> x = rand(100,1);
    >> xn = zscore(x);
    >> mean(xn)
    >> std(xn)

  • Slide 10

    Cluster analysis, also called segmentation analysis or taxonomy analysis, is a way to create groups of objects, or clusters, in such a way that the properties of objects in the same cluster are very similar and the properties of objects in different clusters are quite distinct.

    Cluster analysis works best with normally distributed data. This is not a strict assumption but, for example, log-normal variables should be transformed to normal variables.

    In most cases every parameter is standardized (mean of 0, standard deviation of 1) to have equal weight in the analysis.


  • Slides 12-21

    Hierarchical Clustering (Ward's method)

    Join the two closest points and define a new point at their centre.

    Find the next two closest points and define a new point at their centre.

    Repeat, finding the closest centre points and defining a new point at their centre, until all the points have been joined.

    We can represent the sequence of links and their size using a dendrogram.

  • Slide 22

    Hierarchical Clustering (Ward's method)

    Exercise


  • Slide 24

    Hierarchical Clustering (Ward's method)

    We can now use Matlab to check your result for the previous exercise.

    Set up the X and Y points:
    >> clear all
    >> close all
    >> X=[-0.4326 , -1.6656 , 0.1253 , 0.2877 , -1.1465 , 1.1909]';
    >> Y=[1.1892 , -0.0376 , 0.3273 , 0.1746 , -0.1867 , 0.7258]';

    Form a matrix by combining X and Y:
    >> input=[X,Y]

    Set up a series of labels for the points, A to F:
    >> labels={'A';'B';'C';'D';'E';'F'};

  • Slide 25

    Hierarchical Clustering (Ward's method)

    To perform the clustering we first need to find the Euclidean distances between the data points using the pdist function:

    >> p=pdist(input,'euclidean')

    Given the distances, the Ward links between the points can be calculated:

    >> L=linkage(p,'ward')

    Finally, we produce the dendrogram plot to show the links and add the labels:

    >> dendrogram(L,'Labels',labels)


  • Slide 27

    Hierarchical Clustering of Iris data

    Sepals & Petals:

    Length (cm)

    Width (cm)

    Iris versicolor

    R.A. Fisher (1936), "The use of multiple measurements in taxonomic problems", Annals of Eugenics 7, 179-188.

    E. Anderson (1935), "The irises of the Gaspé Peninsula", Bulletin of the American Iris Society 59, 2-5.

  • Slide 28

    Hierarchical Clustering of Iris data

    >> clear all Clear the memory
    >> close all Close all the figure windows
    >> load iris_clusters Load the data into Matlab

    We must combine all four variables into a 30 x 4 matrix:
    >> input=[petal_length,petal_width,sepal_length,sepal_width];

    Next, standardize the data using the zscore function:
    >> input=zscore(input);

    Call the clustering functions with the input data:

    >> p=pdist(input,'euclidean');
    >> L=linkage(p,'ward');
    >> dendrogram(L)

  • Slide 29

    There is a total of 30 irises, from 2 different species. Use hierarchical clustering to determine which flowers belong to which species.
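    If you would like a hard assignment of each iris to a group, rather than reading the groups off the dendrogram, the Statistics Toolbox cluster function can cut the tree produced by linkage into a chosen number of groups (a sketch, assuming the linkage output L from the previous slide):

    >> T = cluster(L,'maxclust',2); Assign each of the 30 irises to one of 2 groups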

  • Slide 30

    A disadvantage of the dendrogram is that if we have a lot of samples it quickly becomes difficult to interpret. This is shown when we repeat the analysis for 100 iris specimens.

  • Slide 31

    k-means clustering method

    Deterministic (k-means) method: find an arrangement of cluster centres and associate every sample with one of these clusters. The best cluster solution minimises the total distance between the data points and their assigned cluster centre.

    Procedure:
    Choose the number of clusters for the analysis.
    Assume starting positions for the cluster centres.
    Attribute every sample to its nearest cluster centre.
    Recalculate the position of the cluster centres until a minimum total distance is obtained.
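    A minimal sketch of this procedure in MATLAB (illustrative only; it assumes a standardized data matrix input with one row per sample, as used later in this lecture):

    >> k=2; Choose the number of clusters
    >> [cluster,cc]=kmeans(input,k); cluster gives the assignment of each sample and cc the cluster centres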

  • Slide 32

    For this 2-dimensional data set, it is clear there are 2 clusters. It is therefore simple to determine the cluster centres and assign each point to a cluster.

  • Slide 33

    The cluster centres are the points in space which minimise the distance between themselves and the data points assigned to that cluster. Each data point can only belong to one cluster (so-called hard clustering).

    Cluster 1: low Z1, high Z2
    Cluster 2: high Z1, low Z2

    The characteristics of a cluster are given by the location of the cluster centre in space.

  • Slide 34

    The cluster centres have locations with the same number of dimensions as the original data. Therefore, to obtain a clear plot of the clusters in two dimensions, we must choose 2 variables which show a good separation between the clusters (red points show the cluster centres).

    Cluster 1 (small width, small length)
    Cluster 2 (large width, large length)

  • Slide 35

    We can also calculate a 3 cluster model, but it is clear that we have one data cluster which has been split into two parts. Generally, you should try to keep the number of clusters low and make sure there is a physical interpretation that explains what process each cluster represents.

  • Slide 36

    What input data should you use?

    There are two key ideas that you should consider when constructing the input data set for cluster analysis.

    The properties of a cluster are determined from its position within the data space. Therefore, if we are to understand the results of the cluster analysis, we must have a clear understanding of each of the input variables. Don't include variables which you don't understand.

    Don't over-represent any one process or property of the data set. If I have 9 input variables which represent sediment transport mechanisms and only 1 which represents source area, then my analysis will be biased toward a separation based only on transport.

    Think carefully about your input data and what it represents; don't include parameters in the analysis without a clear reason to do so.

  • Slide 37

    How many clusters should be included in a model?

    We can form an idea of how many clusters should be included in a model using the so-called silhouette plot.

    The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.

    +1 : points are very distant from their neighbouring clusters.
    0 : points are not distinctly in one cluster or another.
    -1 : points are probably assigned to the wrong cluster.

    So we should select a model with a number of clusters that produces a silhouette plot with values close to +1 (see the sketch below).
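    A minimal MATLAB sketch of checking one candidate model in this way (illustrative only; input is a standardized data matrix and 4 clusters is just an example):

    >> [cluster,cc]=kmeans(input,4);
    >> s=silhouette(input,cluster); One silhouette value per data point
    >> mean(s) A single summary value for this choice of the number of clusters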

  • Slide 38

    Let's take the example of a data set which clearly has 4 clusters.

  • Slide 39

    Applying a 2 cluster model

    Mean silhouette value = 0.64

    Clearly the model is not complex enough, and the dispersion in Cluster 1 is demonstrated by the low silhouette values of some of the points.

  • Slide 40

    Applying a 3 cluster model

    Mean silhouette value = 0.74

    This model is better; clusters 2 & 3 are well defined, and this is demonstrated by their high silhouette values. Cluster 1 is still too dispersed, and this is shown by some low silhouette values.

  • Slide 41

    Applying a 4 cluster model

    Mean silhouette value = 0.97

    This is the optimum model: all the points have high silhouette values and give a mean value of 0.97.

  • Slide 42

    Applying a 5 cluster model

    Mean silhouette value = 0.87

    This model is overly complex. One of the data clusters has been split to form two clusters, and this is demonstrated by the low silhouette values for Clusters 2 & 3. The mean silhouette value has decreased to 0.87.

  • Slide 43

    Applying a 6 cluster model

    Mean silhouette value = 0.91

    Again this model is overly complex. One of the data clusters has been split to form three clusters, and this is demonstrated by the low silhouette values for Clusters 1, 3 & 6.

  • Slide 44

    We can find the optimum number of clusters by plotting the mean silhouette value for each of the models. A 4 cluster model is optimum.

    Rocks Dataset

  • Slide 45

    134 Portuguese rocks with the following properties:

    Physical-mechanical tests
    RMCS Compression breaking load, norm DIN 52105/E226 (kg/cm2)
    RCSG Compression breaking load after freezing tests, norm DIN 52105/E226 (kg/cm2)
    RMFX Bending strength, norm DIN 52112 (kg/cm2)
    MVAP Volumetric weight, norm DIN 52102 (kg/m3)
    AAPN Water absorption at N.P. conditions, norm DIN 52103 (%)
    PAOA Apparent porosity, norm LNEC E-216-1968 (%)
    CDLT Thermal linear expansion coefficient (x10^-6/°C)
    RDES Abrasion test, NP-309 (mm)
    RCHQ Impact test: minimum fall height (cm)

    Chemical analysis
    SiO2 ... TiO2 : oxides (%)

    Source: IGM - Instituto Geológico-Mineiro, Porto, Portugal; collected by J. Góis, Dep. Engenharia de Minas, Faculdade de Engenharia, Universidade do Porto, Porto, Portugal.

    Includes: granites, diorites, marbles, slates, limestones and breccias.

    K-means clustering of the Rocks data set

  • Slide 46

    Load the data into Matlab (18 parameters and 134 samples):

    >> clear all
    >> load rocks

    Next, standardize the input data using the zscore function:
    >> input=zscore(input);

    Test the mean silhouette value for different numbers of clusters. First form the cluster model, for example a 3 cluster model:

    >> [cluster,cc] = kmeans(input,3);
    >> s=mean(silhouette(input,cluster))

    cluster shows which data point is assigned to which cluster and cc gives the cluster centres. Try different numbers of clusters, testing for the best mean silhouette value (the highest value of s); a sketch of a loop that does this is given below.
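    A minimal sketch of that search (illustrative only; the variable names and the range of cluster numbers tried are just examples):

    >> s=zeros(5,1);
    >> for k=2:6
           [cluster,cc]=kmeans(input,k);
           s(k-1)=mean(silhouette(input,cluster));
       end
    >> plot(2:6,s,'-o') Plot the mean silhouette value against the number of clusters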


  • Slide 47

    We can find the optimum number of clusters by plotting the mean silhouette value for each of the models. A 4 cluster model is optimum.

    You may get a different result; for example, 3 clusters may appear to be better than 4. How could this be?

  • Slide 48

    K-means clustering of the Rocks data set

    Once you have decided how many clusters to use, form the model:

    >> [cluster,cc] = kmeans(input,4); This example is for 4 clusters

    Now we need to see which rocks belong to which clusters and their various classes. We can do this using a function written specifically for this data set:

    >> cluster_rocks(cluster)

    The function contains all the rock sample classes, so it will tell you which classes have been placed in which groups. The classifications will be written to the screen.

  • Slide 49

    -------------------
    Cluster 1 contains:
    Number of granites = 0
    Number of diorites = 0
    Number of marbles = 0
    Number of slates = 0
    Number of limestones = 9
    Number of breccias = 2
    -------------------
    Cluster 2 contains:
    Number of granites = 32
    Number of diorites = 9
    Number of marbles = 0
    Number of slates = 1
    Number of limestones = 0
    Number of breccias = 0
    -------------------
    Cluster 3 contains:
    Number of granites = 0
    Number of diorites = 1
    Number of marbles = 0
    Number of slates = 6
    Number of limestones = 0
    Number of breccias = 0
    -------------------
    Cluster 4 contains:
    Number of granites = 0
    Number of diorites = 0
    Number of marbles = 51
    Number of slates = 0
    Number of limestones = 19
    Number of breccias = 4
    -------------------

    The occurrence of different rock classes in each cluster tells us what the clusters may represent.

    We can also look at the cluster centres to understand how the data is split.

  • Slide 50

    Composition of the cluster centres (%)

               SiO2  Al2O3  Fe2O3  MnO  CaO  MgO  Na2O  K2O  TiO2
    Cluster 2   68     15      3    0    2    2     4    4     0
    Cluster 4    2      1      0    0   53    1     0    0     0

    (Cluster 2 contains mainly granites and diorites; Cluster 4 contains mainly marbles and limestones; see the previous slide.)

    Fuzzy Clustering

  • Slide 51

    In k-means clustering each sample belongs to only one cluster. With fuzzy clustering each sample can belong to more than one cluster. How much a sample belongs to a cluster is defined by the membership.

    Membership is related to the distance (similarity) between a sample and a given cluster centre; the memberships of each sample always sum to 1.

               Membership   Membership
               Cluster 1    Cluster 2   Conclusion
    Sample 1   0.99         0.01        Extremely similar to CC1 and dissimilar to CC2.
    Sample 2   0.01         0.99        Extremely dissimilar to CC1 and similar to CC2.
    Sample 3   0.9          0.1         More similar to CC1 than CC2.
    Sample 4   0.4          0.6         Slightly more similar to CC2 than CC1.
    Sample 5   0.5          0.5         Equally similar to CC1 & CC2.

    Note: these definitions are my own and do not represent an accepted classification scheme.
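    A quick MATLAB sanity check on a membership matrix (illustrative only; it assumes mem has one row per sample and one column per cluster, as returned by the fuzzycm call used later in this lecture):

    >> sum(mem,2) Every row should sum to 1
    >> [~,hard]=max(mem,[],2); A hard assignment: the cluster with the largest membership for each sample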

  • Slide 52

    Fuzzy Clustering

    Included in the calculation of the fuzzy clusters is the so-called fuzzy exponent (q). This value describes how fuzzy the model is.

    q = 1 : there is no fuzziness and each sample is assigned to only one cluster (i.e. it has one membership of 1 and the rest are zero). This is the same as traditional k-means clustering.

    q = ∞ : there is no separation in the model, with all the samples belonging equally to all the clusters.

    Normally, q is set between 1.5 and 3, depending on the problem you are studying. Many geoscience studies have obtained meaningful results using q = 1.5.

    Geology of a rocky bank

  • Slide 53

    Sediment grab samples (GR) were taken from the New Zealand Star Bank (SE Australia). The sediment composition (grains > 1 mm) was analysed for a number of different components.


  • Slide 55

    Grab sample locations on Star Bank

  • Slide 56

    Fuzzy clustering of the Star Bank data set

  • Slide 57

    Now we'll perform the fuzzy clustering and plot the membership results on a series of new images. First we must zscore the compositional data:

    >> input=zscore(input);

    Calculate a 2 cluster model; mem gives the memberships and cc the cluster centres:

    >> [mem,cc]=fuzzycm(input,2);

    Now we'll make a new image and then plot point size as a function of membership to cluster 1 (large points mean a higher membership):

    >> figure
    >> image(Fig)
    >> set(gca,'visible','off') Removes the axes
    >> hold on
    >> for i=1:length(mem)
           plot(samples(i,1),samples(i,2),'ok','markerfacecolor','g','markersize',mem(i,1).*12);
       end

    Membership to cluster 1 (granite outcrop)

  • Slide 58

    The samples on the granite outcrops (GR7, GR12) have a strong membership to this cluster. Notice that the samples close to the outcrops also have a reasonably high membership to this cluster.

  • Slide 59

    Fuzzy clustering of the Star Bank data set

    Now we'll make a new image and then plot point size as a function of membership to cluster 2 (large points mean a higher membership):

    >> figure
    >> image(Fig)
    >> set(gca,'visible','off') Removes the axes
    >> hold on
    >> for i=1:length(mem)
           plot(samples(i,1),samples(i,2),'ok','markerfacecolor','c','markersize',mem(i,2).*12);
       end

    Membership to cluster 2 (sediment)

  • Slide 60

    The samples away from the outcrops have high memberships to this cluster. Notice that samples closer to the outcrops have lower memberships, and we see transitional cases (e.g. GR13) which belong to both clusters.

    Important points to consider when performing cluster analysis.

  • Slide 61

    Outliers can have a strong influence on cluster analysis, so you should test for any outliers before you begin.
