Lecture 3 - Cluster Analysis
TRANSCRIPT
8/2/2019
Cluster Analysis
Determining the similarity between data points
Hierarchical Clustering (Wards method)
K-means Clustering
Fuzzy c-means Clustering
Regression Clustering
How can we analyse these two different types of data distributions?
Cluster analysis provides a method to separate data points into
groups (so-called clusters) which have similar properties.
Cluster analysis is a multivariate technique, so it can work in large numbers of dimensions, where each dimension represents one property of the data point.
If we plot points in space, what can we use as a good measure of how similar or different two points are?
A natural measure of the similarity of points in space is the distance which separates them.
A natural measure of the similarity of points in
space is the distance which separates them.
The Euclidean distance (r) between points (x1,y1) and (x2,y2) is:

r = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
Euclidean distances can be calculated for any number of dimensions. For example, in a system with 4 dimensions w, x, y, z:

r = sqrt( (w2 - w1)^2 + (x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2 )
If we have two data points (pt1 & pt2) as column vectors, where each
value represents the location in a given dimension, we could find the Euclidean distance between them by:
function r=euclid(pt1,pt2)
%find difference in all dimensions
difference=pt1-pt2;
%square the differences
difference=difference.^2;
%sum the differences across all dimensions
total=sum(difference);
%take the square-root to find the Euclidean distance
r=sqrt(total);
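As a cross-check, the same calculation can be sketched in Python with NumPy (an illustrative translation of the Matlab function above, not part of the course code):

```python
import numpy as np

def euclid(pt1, pt2):
    # find the difference in all dimensions
    difference = np.asarray(pt1, dtype=float) - np.asarray(pt2, dtype=float)
    # square, sum across dimensions, then take the square root
    return float(np.sqrt(np.sum(difference ** 2)))

# the classic 3-4-5 triangle: the distance between (0,0) and (3,4) is 5
print(euclid([0, 0], [3, 4]))  # -> 5.0
```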
Manhattan (Street Block) Distance
Instead of taking the shortest distance between the two points
(i.e. the Euclidean distance), we sum their separation in each dimension. For points (x1,y1) and (x2,y2):

r = |x2 - x1| + |y2 - y1|
r = |w2 - w1| + |x2 - x1| + |y2 - y1| + |z2 - z1|
If we have two data points (pt1 & pt2) as column vectors, where each
value represents the location in a given dimension, we could find the Manhattan distance between them by:
function r=manhattan(pt1,pt2)
%find difference in all dimensions
difference=pt1-pt2;
%square the differences
difference=difference.^2;
%take the square-root to find the absolute distance in each dimension
difference=sqrt(difference);
%sum the distances across all dimensions
r=sum(difference);
Again, Manhattan distances can be calculated for any number of dimensions. For example, in a system with 4 dimensions w, x, y, z.
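The same sum-of-absolute-separations can be sketched in Python with NumPy (illustrative only, using abs() directly rather than the square-then-square-root trick of the Matlab function):

```python
import numpy as np

def manhattan(pt1, pt2):
    # absolute separation in each dimension, summed across all dimensions
    difference = np.abs(np.asarray(pt1, dtype=float) - np.asarray(pt2, dtype=float))
    return float(np.sum(difference))

# same points as the Euclidean example: the Manhattan distance is 3 + 4 = 7
print(manhattan([0, 0], [3, 4]))  # -> 7.0
```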
Data normalisation
In cases where the numbers of one variable are much larger than the other variables, they can dominate the distance measure. For example, in a 4D analysis of humans we could have: height, foot length, finger length and ear width.
In this case it is clear that the absolute values of height are much larger than the other variables, therefore they will dominate the distance and the similarity measure.
To avoid this effect each variable is normalised before the analysis. This means setting the mean of each variable to 0 and the standard deviation to 1, which is called standardizing the data, or taking the z-score.
Data normalisation
We can standardize data (set the mean to 0 and the standard deviation to 1) very simply in Matlab:
>> x=rand(100,1);   Generate 100 random numbers to act as our data
>> mean(x)          Find the mean (should be ~0.5)
>> std(x)           Find the S.D. (should be ~0.29)
>> xn=x-mean(x);    Form variable xn which is x minus the mean of x
>> mean(xn)         Find the mean of the adjusted values (~0.0)
>> xn=xn./std(x);   Adjust the S.D. of xn, divide by the S.D. of x
>> std(xn)          Find the S.D. of the adjusted values (1.0)
This procedure can be performed using the zscore function:
>> x = rand(100,1);
>> xn = zscore(x);
>> mean(xn)
>> std(xn)
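The same standardisation can be sketched in Python with NumPy (illustrative only; note that Matlab's std and zscore use the sample (n-1) normalisation, so ddof=1 is used here to match):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(100)                  # 100 random numbers as our data

xn = (x - x.mean()) / x.std(ddof=1)  # set the mean to 0 and the S.D. to 1

print(xn.mean())        # ~0.0 (zero to numerical precision)
print(xn.std(ddof=1))   # 1.0
```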
Cluster analysis, also called segmentation analysis or
taxonomy analysis, is a way to create groups of objects, or clusters, in such a way that the properties of objects in the same cluster are very similar and the properties of objects in different clusters are quite distinct.
Cluster analysis works best with normally distributed data. This is not a strict assumption, but, for example, log-normal variables should be transformed to normal variables.
In most cases every parameter is standardized (mean of 0, standard deviation of 1) to have equal weight in the analysis.
Hierarchical Clustering (Ward's method)
Join the two closest points and define a new point at their centre.
Find the next two closest points and define a new point at their centre.
Find the closest centre points and define a new point at their centre.
We can represent the sequence of links and their size using a dendrogram.
Hierarchical Clustering (Ward's method)
Exercise
Hierarchical Clustering (Ward's method)
We can now use Matlab to check your result for the previous exercise.
Set up the X and Y points
>> clear all
>> close all
>> X=[-0.4326 , -1.6656 , 0.1253 , 0.2877 , -1.1465 , 1.1909]';
>> Y=[1.1892 , -0.0376 , 0.3273 , 0.1746 , -0.1867 , 0.7258]';
Form a matrix by combining X and Y
>> input=[X,Y]
Set up a series of labels for the points between A and F
>> labels={'A';'B';'C';'D';'E';'F'};
Hierarchical Clustering (Ward's method)
To perform the clustering we first need to find the Euclidean distance between the data points using the pdist function.
>> p=pdist(input,'euclidean')
Given the distances, the Ward links between the points can be calculated:
>> L=linkage(p,'ward')
Finally, we produce the dendrogram plot to show the links and add the labels:
>> dendro(L,labels)
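For readers working in Python, the equivalent steps are available in SciPy (an illustrative sketch, assuming SciPy is installed; scipy.cluster.hierarchy.dendrogram would draw the plot itself):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = [-0.4326, -1.6656, 0.1253, 0.2877, -1.1465, 1.1909]
Y = [1.1892, -0.0376, 0.3273, 0.1746, -0.1867, 0.7258]
data = np.column_stack([X, Y])

p = pdist(data, metric='euclidean')  # pairwise Euclidean distances
L = linkage(p, method='ward')        # Ward links between the points

# instead of reading the dendrogram by eye, cut the tree into 2 clusters
groups = fcluster(L, t=2, criterion='maxclust')
print(groups)
```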
Hierarchical Clustering of Iris data
Sepals & Petals: length (cm) and width (cm). (Pictured: Iris versicolor.)
R.A. Fisher (1936), "The use of multiple measurements in taxonomic problems", Annals of Eugenics 7, 179-188.
E. Anderson (1935), "The irises of the Gaspé peninsula", Bulletin of the American Iris Society 59, 2-5.
Hierarchical Clustering of Iris data
>> clear all           clear the memory
>> close all           close all the figure windows
>> load iris_clusters  load the data into matlab
We must combine all four variables into a 30 x 4 matrix
>> input=[petal_length,petal_width,sepal_length,sepal_width];
Next, standardize the data using the zscore function
>> input=zscore(input);
Call the clustering function with the input data
>> p=pdist(input,'euclidean');
>> L=linkage(p,'ward')
>> dendro(L)
Total of 30 irises, from 2 different species. Use hierarchical clustering to determine which flowers belong to which species.
A disadvantage of the dendrogram is that if we have a lot of samples it quickly becomes difficult to interpret. This is shown when we repeat the analysis for 100 iris specimens.
k-means clustering method
Deterministic (k-means) method:
Find an arrangement of cluster centres and associate every sample to one of these clusters.
The best cluster solution minimises the total distance between the data points and their assigned cluster centre.
Procedure:
Choose the number of clusters for the analysis.
Assume starting positions for the cluster centres.
Attribute every sample to its nearest cluster centre.
Recalculate the positions of the cluster centres until a minimum total distance is obtained.
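The procedure above can be sketched as a minimal Lloyd's-algorithm loop in Python with NumPy (illustrative only: for simplicity the first k samples are used as starting centres, whereas real implementations use random or k-means++ starts):

```python
import numpy as np

def kmeans(data, k, n_iter=100):
    data = np.asarray(data, dtype=float)
    # 1. assume starting positions for the cluster centres (first k samples)
    centres = data[:k].copy()
    for _ in range(n_iter):
        # 2. attribute every sample to its nearest cluster centre
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. recalculate each centre as the mean of its assigned samples
        new_centres = np.array([data[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):  # minimum total distance reached
            break
        centres = new_centres
    return assign, centres

# two tight groups of points end up in separate clusters
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assign, centres = kmeans(pts, 2)
print(assign)  # -> [0 0 0 1 1 1]
```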
For this 2-dimensional data set, it is clear there are 2 clusters. It is therefore simple to determine cluster centres and assign each point to a cluster.
The cluster centres are the points in space which minimise the distance between themselves and the data points assigned to that cluster. Each data point can only belong to one cluster (so-called hard clustering).
The characteristics of a cluster are given by the location of the cluster centre in space, e.g. Cluster 1 (low Z1, high Z2) and Cluster 2 (high Z1, low Z2).
The cluster centres have locations with the same number of dimensions as the original data. Therefore, to obtain a clear plot of the clusters in two dimensions, we must choose 2 variables which show a good separation between the clusters (red points show the cluster centres): Cluster 1 (small width, small length) and Cluster 2 (large width, large length).
We can also calculate a 3 cluster model, but it is clear that we have one data cluster which has been split into two parts. Generally, you should try to keep the number of clusters low and make sure there is a physical interpretation that explains what process each cluster represents.
What input data should you use?
There are two key ideas that you should consider when constructing the input data set for cluster analysis.
The properties of a cluster are determined from its position within the data space. Therefore, if we are to understand the results of the cluster analysis, we must have a clear understanding of each of the input variables. Don't include variables which you don't understand.
Don't over-represent any one process or property of the data set. If I have 9 input variables which represent sediment transport mechanisms and only 1 which represents source area, then my analysis will be biased toward a separation based only on transport.
Think carefully about your input data and what it represents; don't include parameters in the analysis without a clear reason to do so.
How many clusters should be included in a model?
We can form an idea of how many clusters should be included in a model using the so-called silhouette plot.
The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.
+1 : points are very distant from their neighbouring clusters.
0 : points are not distinctly in one cluster or another.
-1 : points are probably assigned to the wrong cluster.
So we should select a model with a number of clusters that produces a silhouette plot with values close to +1.
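The silhouette calculation itself can be sketched in Python with NumPy (illustrative only; Matlab's silhouette function and scikit-learn's silhouette_score do this for you):

```python
import numpy as np

def mean_silhouette(data, labels):
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    d = np.linalg.norm(data[:, None] - data[None, :], axis=2)  # pairwise distances
    n = len(data)
    values = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = d[i, own].mean()              # mean distance to points in the same cluster
        b = min(d[i, labels == c].mean()  # mean distance to the nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        values.append((b - a) / max(a, b))
    return float(np.mean(values))

# two tight, well-separated groups give a mean value close to +1
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(mean_silhouette(pts, [0, 0, 0, 1, 1, 1]))
```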
Let's take the example of a data set which clearly has 4 clusters.
Applying a 2 cluster model (mean silhouette value = 0.64)
Clearly the model is not complex enough, and the dispersion in Cluster 1 is demonstrated by the low silhouette values of some of the points.
Applying a 3 cluster model (mean silhouette value = 0.74)
This model is better; clusters 2 & 3 are well defined and this is demonstrated by their high silhouette values. Cluster 1 is still too dispersed and this is shown by some low silhouette values.
Applying a 4 cluster model (mean silhouette value = 0.97)
This is the optimum model; all the points have high silhouette values and give a mean value of 0.97.
Applying a 5 cluster model (mean silhouette value = 0.87)
This model is overly complex. One of the data clusters has been split to form two clusters and this is demonstrated by the low silhouette values for Clusters 2 & 3. The mean silhouette value has decreased to 0.87.
Applying a 6 cluster model (mean silhouette value = 0.91)
Again this model is overly complex. One of the data clusters has been split to form three clusters and this is demonstrated by the low silhouette values for Clusters 1, 3 & 6.
We can find the optimum number of clusters by plotting the mean silhouette value for each of the models. A 4 cluster model is optimum.
Rocks Dataset
134 Portuguese rocks with the following properties:
Physical-mechanical tests
RMCS  Compression breaking load, norm DIN 52105/E226 (kg/cm2)
RCSG  Compression breaking load after freezing tests, norm DIN 52105/E226 (kg/cm2)
RMFX  Bending strength, norm DIN 52112 (kg/cm2)
MVAP  Volumetric weight, norm DIN 52102 (kg/m3)
AAPN  Water absorption at N.P. conditions, norm DIN 52103 (%)
PAOA  Apparent porosity, norm LNEC E-216-1968 (%)
CDLT  Thermal linear expansion coefficient (x10^-6/°C)
RDES  Abrasion test, NP-309 (mm)
RCHQ  Impact test: minimum fall height (cm)
Chemical analysis
SiO2 ... TiO2 : oxides (%)
Source: IGM - Instituto Geológico-Mineiro, Porto, Portugal; collected by J. Góis, Dep. Engenharia de Minas, Faculdade de Engenharia, Universidade do Porto, Porto, Portugal.
Includes: granites, diorites, marbles, slates, limestones and breccias.
K-means clustering of the Rocks data set
Load the data into Matlab (18 parameters and 134 samples)
>> clear all
>> load rocks
Next, standardize the input data using the zscore function
>> input=zscore(input);
Test the mean silhouette value for different numbers of clusters.
First form the cluster model, for example, a 3 cluster model:
>> [cluster,cc] = kmeans(input,3);
>> s=mean(silhouette(input,cluster))
cluster shows which data point is assigned to which cluster and cc gives the cluster centres. Try different numbers of clusters, testing for the best mean silhouette value (the highest value of s).
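The same try-several-k-and-compare loop can be sketched in Python (an illustrative sketch assuming scikit-learn is available and using synthetic stand-in data, since the rocks data set is course material; the Matlab kmeans/silhouette calls play the same roles):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# synthetic stand-in data: four well-separated blobs of 25 points each
blob_centres = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
data = np.vstack([c + rng.normal(0, 0.5, size=(25, 2)) for c in blob_centres])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)  # mean silhouette value

best_k = max(scores, key=scores.get)
print(best_k)  # the 4-blob structure should give the highest mean silhouette at k=4
```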
We can find the optimum number of clusters by plotting the mean silhouette value for each of the models. A 4 cluster model is optimum.
You may get a different result; for example, 3 clusters may appear to be better than 4. How could this be?
K-means clustering of the Rocks data set
Once you have decided how many clusters to use, form the model:
>> [cluster,cc] = kmeans(input,4);   this example is for 4 clusters
Now we need to see which rocks belong to which clusters and their various classes. We can do this using a function written specifically for this data set:
>> cluster_rocks(cluster)
The function contains all the rock sample classes, so it will tell you which classes have been placed in which groups. The classifications will be written to the screen.
Cluster 1 contains: 0 granites, 0 diorites, 0 marbles, 0 slates, 9 limestones, 2 breccias.
Cluster 2 contains: 32 granites, 9 diorites, 0 marbles, 1 slate, 0 limestones, 0 breccias.
Cluster 3 contains: 0 granites, 1 diorite, 0 marbles, 6 slates, 0 limestones, 0 breccias.
Cluster 4 contains: 0 granites, 0 diorites, 51 marbles, 0 slates, 19 limestones, 4 breccias.
The occurrence of different rock classes in each cluster tells us what the clusters may represent.
We can also look at the cluster centres to understand how the data is split.
Composition of the cluster centres (%):

           SiO2  Al2O3  Fe2O3  MnO  CaO  MgO  Na2O  K2O  TiO2
Cluster 2    68     15      3    0    2    2     4    4     0
Cluster 4     2      1      0    0   53    1     0    0     0
Fuzzy Clustering
In K-means clustering each sample belongs to only one cluster.
With fuzzy clustering each sample can belong to more than one cluster. How much a sample belongs to a cluster is defined by its membership.
Membership is related to the distance (similarity) between a sample and a given cluster centre; memberships always sum to 1.
Sample    Cluster 1 membership  Cluster 2 membership  Conclusion
Sample 1  0.99                  0.01                  Extremely similar to CC1 and dissimilar to CC2.
Sample 2  0.01                  0.99                  Extremely dissimilar to CC1 and similar to CC2.
Sample 3  0.9                   0.1                   More similar to CC1 than CC2.
Sample 4  0.4                   0.6                   Slightly more similar to CC2 than CC1.
Sample 5  0.5                   0.5                   Equally similar to CC1 & CC2.
Note: these definitions are my own and do not represent an accepted classification scheme.
Fuzzy Clustering
Included in the calculation of the fuzzy clusters is the so-called fuzzy exponent (q). This value describes how fuzzy the model is.
q = 1: there is no fuzziness and each sample is assigned to only one cluster (i.e. it has one membership of 1 and the rest are zero). This is the same as traditional K-means clustering.
q → ∞: there is no separation in the model, with all the samples belonging equally to all the clusters.
Normally, q is set between 1.5 and 3 depending on the problem you are studying. Many geoscience studies have obtained meaningful results using q = 1.5.
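The standard fuzzy c-means membership formula, u[i][j] = 1 / sum over k of (d(i,j)/d(i,k))^(2/(q-1)) where d is the sample-to-centre distance, can be sketched in Python with NumPy (illustrative only, computing memberships for fixed centres rather than running the full iteration):

```python
import numpy as np

def fcm_memberships(data, centres, q=1.5):
    data = np.asarray(data, dtype=float)
    centres = np.asarray(centres, dtype=float)
    d = np.linalg.norm(data[:, None] - centres[None, :], axis=2)  # sample-to-centre distances
    d = np.maximum(d, 1e-12)  # guard against a sample sitting exactly on a centre
    ratio = d[:, :, None] / d[:, None, :]
    u = 1.0 / np.sum(ratio ** (2.0 / (q - 1.0)), axis=2)
    return u  # each row sums to 1

centres = [[0.0, 0.0], [5.0, 0.0]]
samples = [[0.0, 0.0], [5.0, 0.0], [2.5, 0.0]]
u = fcm_memberships(samples, centres)
print(u.round(2))  # samples on a centre -> ~1/0; the midway sample -> 0.5/0.5
```

Setting q very large drives every membership toward 1/c regardless of distance, matching the q → ∞ limit described above.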
Geology of a rocky bank
Sediment grab samples (GR) were taken from the New Zealand Star Bank (SE Australia). Sediment composition (grains > 1 mm) was analysed for a number of different components.
Grab sample locations on Star Bank
Fuzzy clustering of the Star Bank data set
Now we'll perform the fuzzy clustering and plot the membership results on a series of new images. First we must zscore the compositional data
>> input=zscore(input);
Calculate a 2 cluster model; mem gives the memberships and cc the cluster centres
>> [mem,cc]=fuzzycm(input,2);
Now we'll make a new image and then plot point size as a function of membership to cluster 1 (large points mean a higher membership).
>> figure
>> image(Fig)
>> set(gca,'visible','off')    removes the axes
>> hold on
>> for i=1:length(mem)
plot(samples(i,1),samples(i,2),'ok','markerfacecolor','g','markersize',mem(i,1).*12);
end
Membership to cluster 1 (granite outcrop)
The samples on the granite outcrops (GR7, GR12) have a strong membership to this cluster. Notice that the samples close to the outcrops also have a reasonably high membership to this cluster.
Fuzzy clustering of the Star Bank data set
Now we'll make a new image and then plot point size as a function of membership to cluster 2 (large points mean a higher membership).
>> figure
>> image(Fig)
>> set(gca,'visible','off')    removes the axes
>> hold on
>> for i=1:length(mem)
plot(samples(i,1),samples(i,2),'ok','markerfacecolor','c','markersize',mem(i,2).*12);
end
Membership to cluster 2 (sediment)
The samples away from the outcrops have high memberships to this cluster. Notice that samples closer to the outcrops have lower memberships, and we see transitional cases (e.g. GR13) which belong to both clusters.
Important points to consider when performing cluster analysis.
Outliers can have a strong influence on cluster analysis, so you should
test for any outliers before you begin.