Lecture 3 - Cluster Analysis


  • Slide 1

    Cluster Analysis

    Determining the similarity between data points

    Hierarchical Clustering (Ward's method)

    K-means Clustering

    Fuzzy c-means Clustering

  • Slide 2

    Regression Clustering

    How can we analyse these two different types of data distributions?

  • Slide 3

    Cluster analysis provides a method to separate data points into groups (so-called clusters) which have similar properties.

    Cluster analysis is a multivariate technique, so it can work in large numbers of dimensions, where each dimension represents one property of the data point.

    If we plot points in space, what can we use as a good measure of how similar or different two points are?

    A natural measure of the similarity of points in space is the distance which separates them.

  • Slide 4

    A natural measure of the similarity of points in space is the distance which separates them.

    For two points (x1,y1) and (x2,y2), the Euclidean distance (r) is:

    r = sqrt((x1 - x2)^2 + (y1 - y2)^2)

  • Slide 5

    Euclidean distances can be calculated for any number of dimensions. For example, in a system with 4 dimensions (w, x, y, z):

    r = sqrt((w1 - w2)^2 + (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

    If we have two data points (pt1 & pt2) as column vectors, where each value represents the location in a given dimension, we could find the Euclidean distance between them by:

    function r=euclid(pt1,pt2)
    %find the difference in all dimensions
    difference=pt1-pt2;
    %square the differences
    difference=difference.^2;
    %sum the squared differences across all dimensions
    total=sum(difference);
    %take the square-root to find the Euclidean distance
    r=sqrt(total);
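    As a quick check (an illustrative example, not from the slides), the points (0,0) and (3,4) should be separated by a distance of 5:

    >> pt1=[0;0];
    >> pt2=[3;4];
    >> r=euclid(pt1,pt2) Returns r = 5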

  • Slide 6

    Manhattan (Street Block) Distance

    Instead of taking the shortest distance between two points (x1,y1) and (x2,y2), i.e. the Euclidean distance r = sqrt((x1 - x2)^2 + (y1 - y2)^2), we sum their separation in each dimension:

    r = |x1 - x2| + |y1 - y2|

  • Slide 7

    Again, Manhattan distances can be calculated for any number of dimensions. For example, in a system with 4 dimensions (w, x, y, z):

    r = |w1 - w2| + |x1 - x2| + |y1 - y2| + |z1 - z2|

    If we have two data points (pt1 & pt2) as column vectors, where each value represents the location in a given dimension, we could find the Manhattan distance between them by:

    function r=manhattan(pt1,pt2)
    %find the difference in all dimensions
    difference=pt1-pt2;
    %square the differences
    difference=difference.^2;
    %take the square-root to find the absolute distance in each dimension
    difference=sqrt(difference);
    %sum the absolute distances across all dimensions
    r=sum(difference);
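    As a quick check (an illustrative example, not from the slides), the same points (0,0) and (3,4) give a Manhattan distance of 7:

    >> r=manhattan([0;0],[3;4]) Returns r = 7, larger than the Euclidean distance of 5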

  • Slide 8

    Data normalisation

    In cases where the numbers of one variable are much larger than those of the other variables, they can dominate the distance measure. For example, in a 4-dimensional analysis of humans we could have: height, foot length, finger length and ear width.

    In this case it is clear that the absolute values of height are much larger than the other variables, therefore they will dominate the distance and the similarity measure.

    To avoid this effect each variable is normalised before the analysis. This means setting the mean of each variable to 0 and the standard deviation to 1, which is called standardizing the data, or taking the z-score.

  • Slide 9

    Data normalisation

    We can standardize data (set the mean to 0 and the standard deviation to 1) very simply in Matlab:

    >> x=rand(100,1); Generate 100 random numbers to act as our data

    >> mean(x) Find the mean (should be ~0.5)

    >> std(x) Find the S.D. (should be ~0.29)

    >> xn=x-mean(x); Form variable xn, which is x minus the mean of x

    >> mean(xn) Find the mean of the adjusted values (~0.0)

    >> xn=xn./std(x); Adjust the S.D. of xn by dividing by the S.D. of x

    >> std(xn) Find the S.D. of the adjusted values (1.0)

    The same procedure can be performed using the zscore function:

    >> x = rand(100,1);
    >> xn = zscore(x);
    >> mean(xn)
    >> std(xn)

  • Slide 10

    Cluster analysis, also called segmentation analysis or taxonomy analysis, is a way to create groups of objects, or clusters, in such a way that the properties of objects in the same cluster are very similar and the properties of objects in different clusters are quite distinct.

    Cluster analysis works best with normally distributed data. This is not a strict assumption but, for example, log-normal variables should be transformed to normal variables.

    In most cases every parameter is standardized (mean of 0, standard deviation of 1) to have equal weight in the analysis.


  • Slides 12-21

    Hierarchical Clustering (Ward's method)

    Join the two closest points and define a new point at their centre.

    Find the next two closest points and define a new point at their centre.

    Repeat, finding the closest centre points and defining a new point at their centre, until all the points have been joined.

    We can represent the sequence of links and their size using a dendrogram.

  • Slide 22

    Hierarchical Clustering (Ward's method)

    Exercise


  • Slide 24

    Hierarchical Clustering (Ward's method)

    We can now use Matlab to check your result for the previous exercise.

    Set up the X and Y points:
    >> clear all
    >> close all
    >> X=[-0.4326 , -1.6656 , 0.1253 , 0.2877 , -1.1465 , 1.1909]';
    >> Y=[1.1892 , -0.0376 , 0.3273 , 0.1746 , -0.1867 , 0.7258]';

    Form a matrix by combining X and Y:
    >> input=[X,Y]

    Set up a series of labels for the points, A to F:
    >> labels={'A';'B';'C';'D';'E';'F'};

  • Slide 25

    Hierarchical Clustering (Ward's method)

    To perform the clustering we first need to find the Euclidean distances between the data points using the pdist function:

    >> p=pdist(input,'euclidean')

    Given the distances, the Ward links between the points can be calculated:

    >> L=linkage(p,'ward')

    Finally, we produce the dendrogram plot to show the links and add the labels:

    >> dendrogram(L,'Labels',labels)


  • Slide 27

    Hierarchical Clustering of Iris data

    Sepals & Petals:

    Length (cm)

    Width (cm)

    Iris versicolor

    R.A. Fisher (1936), "The use of multiple measurements in taxonomic problems", Annals of Eugenics 7, 179-188.

    E. Anderson (1935), "The irises of the Gaspé Peninsula", Bulletin of the American Iris Society 59, 2-5.

  • Slide 28

    Hierarchical Clustering of Iris data

    >> clear all Clear the memory
    >> close all Close all the figure windows
    >> load iris_clusters Load the data into Matlab

    We must combine all four variables into a 30 x 4 matrix:
    >> input=[petal_length,petal_width,sepal_length,sepal_width];

    Next, standardize the data using the zscore function:
    >> input=zscore(input);

    Call the clustering functions with the input data:

    >> p=pdist(input,'euclidean');
    >> L=linkage(p,'ward');
    >> dendrogram(L)

  • Slide 29

    There is a total of 30 irises, from 2 different species. Use hierarchical clustering to determine which flowers belong to which species.
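    If you would like a hard assignment of each iris to a group, rather than reading the groups off the dendrogram, the Statistics Toolbox cluster function can cut the tree produced by linkage into a chosen number of groups (a sketch, assuming the linkage output L from the previous slide):

    >> T = cluster(L,'maxclust',2); Assign each of the 30 irises to one of 2 groups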

  • Slide 30

    A disadvantage of the dendrogram is that if we have a lot of samples it quickly becomes difficult to interpret. This is shown when we repeat the analysis for 100 iris specimens.

  • Slide 31

    k-means clustering method

    Deterministic (k-means) method: find an arrangement of cluster centres and associate every sample with one of these clusters. The best cluster solution minimises the total distance between the data points and their assigned cluster centre.

    Procedure:
    Choose the number of clusters for the analysis.
    Assume starting positions for the cluster centres.
    Attribute every sample to its nearest cluster centre.
    Recalculate the position of the cluster centres until a minimum total distance is obtained.
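    A minimal sketch of this procedure in MATLAB (illustrative only; it assumes a standardized data matrix input with one row per sample, as used later in this lecture):

    >> k=2; Choose the number of clusters
    >> [cluster,cc]=kmeans(input,k); cluster gives the assignment of each sample and cc the cluster centres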

  • Slide 32

    For this 2-dimensional data set, it is clear there are 2 clusters. It is therefore simple to determine the cluster centres and assign each point to a cluster.

  • Slide 33

    The cluster centres are the points in space which minimise the distance between themselves and the data points assigned to that cluster. Each data point can only belong to one cluster (so-called hard clustering).

    Cluster 1: low Z1, high Z2
    Cluster 2: high Z1, low Z2

    The characteristics of a cluster are given by the location of the cluster centre in space.

  • Slide 34

    The cluster centres have locations with the same number of dimensions as the original data. Therefore, to obtain a clear plot of the clusters in two dimensions, we must choose 2 variables which show a good separation between the clusters (red points show the cluster centres).

    Cluster 1 (small width, small length)
    Cluster 2 (large width, large length)

  • Slide 35

    We can also calculate a 3 cluster model, but it is clear that we have one data cluster which has been split into two parts. Generally, you should try to keep the number of clusters low and make sure there is a physical interpretation that explains what process each cluster represents.

  • Slide 36

    What input data should you use?

    There are two key ideas that you should consider when constructing the input data set for cluster analysis.

    The properties of a cluster are determined from its position within the data space. Therefore, if we are to understand the results of the cluster analysis, we must have a clear understanding of each of the input variables. Don't include variables which you don't understand.

    Don't over-represent any one process or property of the data set. If I have 9 input variables which represent sediment transport mechanisms and only 1 which represents source area, then my analysis will be biased toward a separation based only on transport.

    Think carefully about your input data and what it represents; don't include parameters in the analysis without a clear reason to do so.

  • Slide 37

    How many clusters should be included in a model?

    We can form an idea of how many clusters should be included in a model using the so-called silhouette plot.

    The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.

    +1 : points are very distant from their neighbouring clusters.
    0 : points are not distinctly in one cluster or another.
    -1 : points are probably assigned to the wrong cluster.

    So we should select a model with a number of clusters that produces a silhouette plot with values close to +1 (see the sketch below).
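    A minimal MATLAB sketch of checking one candidate model in this way (illustrative only; input is a standardized data matrix and 4 clusters is just an example):

    >> [cluster,cc]=kmeans(input,4);
    >> s=silhouette(input,cluster); One silhouette value per data point
    >> mean(s) A single summary value for this choice of the number of clusters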

  • Slide 38

    Let's take the example of a data set which clearly has 4 clusters.

  • Slide 39

    Applying a 2 cluster model

    Mean silhouette value = 0.64

    Clearly the model is not complex enough, and the dispersion in Cluster 1 is demonstrated by the low silhouette values of some of the points.

  • Slide 40

    Applying a 3 cluster model

    Mean silhouette value = 0.74

    This model is better; clusters 2 & 3 are well defined, and this is demonstrated by their high silhouette values. Cluster 1 is still too dispersed, and this is shown by some low silhouette values.

  • Slide 41

    Applying a 4 cluster model

    Mean silhouette value = 0.97

    This is the optimum model: all the points have high silhouette values and give a mean value of 0.97.

  • Slide 42

    Applying a 5 cluster model

    Mean silhouette value = 0.87

    This model is overly complex. One of the data clusters has been split to form two clusters, and this is demonstrated by the low silhouette values for Clusters 2 & 3. The mean silhouette value has decreased to 0.87.

  • Slide 43

    Applying a 6 cluster model

    Mean silhouette value = 0.91

    Again this model is overly complex. One of the data clusters has been split to form three clusters, and this is demonstrated by the low silhouette values for Clusters 1, 3 & 6.

  • Slide 44

    We can find the optimum number of clusters by plotting the mean silhouette value for each of the models. A 4 cluster model is optimum.

    Rocks Dataset

  • Slide 45

    134 Portuguese rocks with the following properties:

    Physical-mechanical tests
    RMCS Compression breaking load, norm DIN 52105/E226 (kg/cm2)
    RCSG Compression breaking load after freezing tests, norm DIN 52105/E226 (kg/cm2)
    RMFX Bending strength, norm DIN 52112 (kg/cm2)
    MVAP Volumetric weight, norm DIN 52102 (kg/m3)
    AAPN Water absorption at N.P. conditions, norm DIN 52103 (%)
    PAOA Apparent porosity, norm LNEC E-216-1968 (%)
    CDLT Thermal linear expansion coefficient (x10^-6/°C)
    RDES Abrasion test, NP-309 (mm)
    RCHQ Impact test: minimum fall height (cm)

    Chemical analysis
    SiO2 ... TiO2 : oxides (%)

    Source: IGM - Instituto Geológico-Mineiro, Porto, Portugal; collected by J. Góis, Dep. Engenharia de Minas, Faculdade de Engenharia, Universidade do Porto, Porto, Portugal.

    Includes: granites, diorites, marbles, slates, limestones and breccias.

    K-means clustering of the Rocks data set

  • Slide 46

    Load the data into Matlab (18 parameters and 134 samples):

    >> clear all
    >> load rocks

    Next, standardize the input data using the zscore function:
    >> input=zscore(input);

    Test the mean silhouette value for different numbers of clusters. First form the cluster model, for example a 3 cluster model:

    >> [cluster,cc] = kmeans(input,3);
    >> s=mean(silhouette(input,cluster))

    cluster shows which data point is assigned to which cluster and cc gives the cluster centres. Try different numbers of clusters, testing for the best mean silhouette value (the highest value of s); a sketch of a loop that does this is given below.
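    A minimal sketch of that search (illustrative only; the variable names and the range of cluster numbers tried are just examples):

    >> s=zeros(5,1);
    >> for k=2:6
           [cluster,cc]=kmeans(input,k);
           s(k-1)=mean(silhouette(input,cluster));
       end
    >> plot(2:6,s,'-o') Plot the mean silhouette value against the number of clusters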


  • Slide 47

    We can find the optimum number of clusters by plotting the mean silhouette value for each of the models. A 4 cluster model is optimum.

    You may get a different result; for example, 3 clusters may appear to be better than 4. How could this be?

  • Slide 48

    K-means clustering of the Rocks data set

    Once you have decided how many clusters to use, form the model:

    >> [cluster,cc] = kmeans(input,4); This example is for 4 clusters

    Now we need to see which rocks belong to which clusters and their various classes. We can do this using a function written specifically for this data set:

    >> cluster_rocks(cluster)

    The function contains all the rock sample classes, so it will tell you which classes have been placed in which groups. The classifications will be written to the screen.

  • Slide 49

    -------------------
    Cluster 1 contains:
    Number of granites = 0
    Number of diorites = 0
    Number of marbles = 0
    Number of slates = 0
    Number of limestones = 9
    Number of breccias = 2
    -------------------
    Cluster 2 contains:
    Number of granites = 32
    Number of diorites = 9
    Number of marbles = 0
    Number of slates = 1
    Number of limestones = 0
    Number of breccias = 0
    -------------------
    Cluster 3 contains:
    Number of granites = 0
    Number of diorites = 1
    Number of marbles = 0
    Number of slates = 6
    Number of limestones = 0
    Number of breccias = 0
    -------------------
    Cluster 4 contains:
    Number of granites = 0
    Number of diorites = 0
    Number of marbles = 51
    Number of slates = 0
    Number of limestones = 19
    Number of breccias = 4
    -------------------

    The occurrence of different rock classes in each cluster tells us what the clusters may represent.

    We can also look at the cluster centres to understand how the data is split.

  • Slide 50

    Composition of the cluster centres (%)

               SiO2  Al2O3  Fe2O3  MnO  CaO  MgO  Na2O  K2O  TiO2
    Cluster 2   68     15      3    0    2    2     4    4     0
    Cluster 4    2      1      0    0   53    1     0    0     0

    (Cluster 2 contains mainly granites and diorites; Cluster 4 contains mainly marbles and limestones; see the previous slide.)

    Fuzzy Clustering

  • Slide 51

    In k-means clustering each sample belongs to only one cluster. With fuzzy clustering each sample can belong to more than one cluster. How much a sample belongs to a cluster is defined by the membership.

    Membership is related to the distance (similarity) between a sample and a given cluster centre; the memberships of each sample always sum to 1.

               Membership   Membership
               Cluster 1    Cluster 2   Conclusion
    Sample 1   0.99         0.01        Extremely similar to CC1 and dissimilar to CC2.
    Sample 2   0.01         0.99        Extremely dissimilar to CC1 and similar to CC2.
    Sample 3   0.9          0.1         More similar to CC1 than CC2.
    Sample 4   0.4          0.6         Slightly more similar to CC2 than CC1.
    Sample 5   0.5          0.5         Equally similar to CC1 & CC2.

    Note: these definitions are my own and do not represent an accepted classification scheme.
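    A quick MATLAB sanity check on a membership matrix (illustrative only; it assumes mem has one row per sample and one column per cluster, as returned by the fuzzycm call used later in this lecture):

    >> sum(mem,2) Every row should sum to 1
    >> [~,hard]=max(mem,[],2); A hard assignment: the cluster with the largest membership for each sample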

  • Slide 52

    Fuzzy Clustering

    Included in the calculation of the fuzzy clusters is the so-called fuzzy exponent (q). This value describes how fuzzy the model is.

    q = 1 : there is no fuzziness and each sample is assigned to only one cluster (i.e. it has one membership of 1 and the rest are zero). This is the same as traditional k-means clustering.

    q = ∞ : there is no separation in the model, with all the samples belonging equally to all the clusters.

    Normally, q is set between 1.5 and 3, depending on the problem you are studying. Many geoscience studies have obtained meaningful results using q = 1.5.

    Geology of a rocky bank

  • Slide 53

    Sediment grab samples (GR) were taken from the New Zealand Star Bank (SE Australia). The sediment composition (grains > 1 mm) was analysed for a number of different components.


  • Slide 55

    Grab sample locations on Star Bank

  • Slide 56

    Fuzzy clustering of the Star Bank data set

  • Slide 57

    Now we'll perform the fuzzy clustering and plot the membership results on a series of new images. First we must zscore the compositional data:

    >> input=zscore(input);

    Calculate a 2 cluster model; mem gives the memberships and cc the cluster centres:

    >> [mem,cc]=fuzzycm(input,2);

    Now we'll make a new image and then plot point size as a function of membership to cluster 1 (large points mean a higher membership):

    >> figure
    >> image(Fig)
    >> set(gca,'visible','off') Removes the axes
    >> hold on
    >> for i=1:length(mem)
           plot(samples(i,1),samples(i,2),'ok','markerfacecolor','g','markersize',mem(i,1).*12);
       end

    Membership to cluster 1 (granite outcrop)

  • Slide 58

    The samples on the granite outcrops (GR7, GR12) have a strong membership to this cluster. Notice that the samples close to the outcrops also have a reasonably high membership to this cluster.

  • Slide 59

    Fuzzy clustering of the Star Bank data set

    Now we'll make a new image and then plot point size as a function of membership to cluster 2 (large points mean a higher membership):

    >> figure
    >> image(Fig)
    >> set(gca,'visible','off') Removes the axes
    >> hold on
    >> for i=1:length(mem)
           plot(samples(i,1),samples(i,2),'ok','markerfacecolor','c','markersize',mem(i,2).*12);
       end

    Membership to cluster 2 (sediment)

  • Slide 60

    The samples away from the outcrops have high memberships to this cluster. Notice that samples closer to the outcrops have lower memberships, and we see transitional cases (e.g. GR13) which belong to both clusters.

    Important points to consider when performing cluster analysis.

  • Slide 61

    Outliers can have a strong influence on cluster analysis, so you should test for any outliers before you begin.
