Contents: Introduction · Distance Measures · Formation of Groups (Clusters) in CA · How to obtain the number of clusters? · Further Analysis of the obtained clusters · Useful R-Commands
Cluster Analysis
Janette Walde
janette.walde@uibk.ac.at
Department of Statistics, University of Innsbruck
Janette Walde Cluster Analysis
Outline I
1 Introduction: Problems; Idea of Cluster Analysis
2 Distance Measures: General Comments; Properties of Distance Measures; Distance Measures for Interval Scale Data; Distance (Similarity) Measures for Binary Variables
3 Formation of Groups (Clusters) in CA: Hierarchical Agglomerative CA Methods
Outline II
3 Formation of Groups (Clusters) in CA (cont.): Partitional Clustering: k-means Clustering
4 How to obtain the number of clusters?
5 Further Analysis of the obtained clusters
6 Useful R-Commands
Problems · Idea of Cluster Analysis
Problems/Questions
Cluster analysis was developed in taxonomy. The aim was originally to get away from the high degree of subjectivity that arose when single taxonomists performed a grouping.
Clustering is used to build groups of genes with related expression patterns (co-expressed genes): "In analyzing DNA microarray gene-expression data, a major role has been played by various cluster-analysis techniques, most notably by hierarchical clustering, K-means clustering and self-organizing maps. These clustering techniques contribute significantly to our understanding of the underlying biological phenomena." Genome Biology 2002, 3(2):research0009.1–0009.8.
Problems/Questions, cont.
In plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities of organisms in heterogeneous environments; in plant systematics, to generate artificial phylogenies or clusters of organisms at the species, genus or higher level that share a number of attributes.
CA might be used to classify regions based on vegetation communities or abundances of species.
We may identify various stakeholder groups of a national park by conducting a survey questioning farmers, visitors, inhabitants of the municipalities ...
Idea of Cluster analysis
Clusters are formed numerically, on the basis of distance measures. Interpretation of the clusters is the final step.
The resulting clusters should exhibit high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity.
The interpretation of clusters has to take into account (needs to be consistent with) the mathematical procedures.
Idea of Cluster analysis, Cont.
Comparing clusters (of cases) with respect to additional variables (i.e. variables which have not been considered in the formation of the clusters) can help in the interpretation of clusters (subsequent to CA).
CA itself is usually not combined with the calculation of statistical significance.
CA, like factor analysis, serves data reduction purposes.
Cluster analysis of cases
Cluster analysis evaluates the similarity of cases (e.g. persons, products, areas, other entities) with respect to a defined set of variables.
Cases are grouped into clusters on the basis of their similarities. Similar cases shall be assigned to the same cluster; dissimilar cases shall be assigned to different clusters.
The number of clusters to be formed can be defined in advance, or certain criteria are defined and applied to the data.
Cluster analysis of variables
When variables are clustered, the similarity of the variables is evaluated with respect to the similarity of the values which a predefined set of persons have on these variables.
Similar variables shall be assigned to the same cluster of variables; dissimilar variables shall be assigned to different clusters.
The number of clusters to be formed can be defined in advance, or certain criteria are defined and applied to the data.
Clustering cases or variables?
Whether cases or variables should be clustered depends on the research question you have.
Mathematically, there is no fundamental difference between clustering cases and clustering variables.
Clustering variables starts with the transposed data matrix, as compared to the matrix you start with when clustering cases.
Similarities of cases w.r.t. their values on certain variables (clustering cases) versus similarities of variables w.r.t. the values of certain cases on these variables (clustering variables).
General Comments · Properties of Distance Measures · Distance Measures for Interval Scale Data · Distance (Similarity) Measures for Binary Variables
Distance measures in CA I
The definition of the distance measure, and deciding whether to use standardized or raw data, are fundamental steps in CA.
Measures of distance (dissimilarity versus similarity/proximity) → recursively, the most similar clusters are unified.
Distance measures in CA II
In most CA approaches, the first round of a sequential clustering procedure assumes no pre-existing clusters of cases. Instead, in the first round each case is regarded as a cluster → agglomerative methods.
Accordingly, in the first round of a sequential CA procedure the distance is measured between all cases; that is, if n cases are included, the number of distances to be calculated equals n·(n−1)/2.
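As a quick sanity check, the count n·(n−1)/2 of pairwise distances can be computed directly. A minimal Python sketch (the function name is ours, not from the slides; the course itself works in R):

```python
# Illustrative sketch: with n cases, agglomerative CA starts from the full
# matrix of pairwise distances, of which there are n*(n-1)/2 distinct entries
# (the lower triangle of the n x n distance matrix).
def n_pairwise_distances(n: int) -> int:
    return n * (n - 1) // 2

# e.g. 10 cases require n_pairwise_distances(10) == 45 distances
```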
Distance measures in CA III
The scale level of your data can limit the number of distance measures which make sense.
If a variable is dichotomous, two cases can only be identical to or different from each other.
If variables are ordinal scaled (ranks), then the distances between values are difficult to interpret.
→ It is problematic to use distance measures which are appropriate for interval scale data for ordinal scaled data.
→ A possibility is splitting the ranks at the median and thus recoding the variables (e.g. 0 = below median; 1 = above median) to subsequently apply a distance measure for dichotomous variables.
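The median-split recoding suggested above can be sketched as follows. This is an illustrative Python snippet (the helper name, and the convention of coding values at or above the median as 1, are our assumptions, not from the slides):

```python
# Hypothetical sketch: recode an ordinal variable by a median split so that
# distance measures for dichotomous variables become applicable.
# Convention assumed here: values below the median -> 0, otherwise -> 1.
def median_split(ranks):
    s = sorted(ranks)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [0 if r < med else 1 for r in ranks]

# e.g. median_split([1, 2, 3, 4, 5]) -> [0, 0, 1, 1, 1]
```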
Distance measures in CA V
If a variable is nominal scaled with more than two levels (e.g. nationality; region of residence), then dummy coding can transform it into dichotomous variables.
If your variables are interval scaled (quantitative variables), then distances can be properly interpreted.
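Dummy coding of a multi-level nominal variable can be sketched like this (illustrative Python; in R this is typically handled via `factor` variables and `model.matrix`):

```python
# Illustrative sketch: one-hot (dummy) coding turns a nominal variable with
# more than two levels into a set of dichotomous 0/1 variables, one per level.
def dummy_code(values, levels):
    return [[int(v == lev) for lev in levels] for v in values]

# e.g. dummy_code(["AT", "DE", "AT"], ["AT", "DE", "IT"])
#      -> [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```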
General properties of a distance measure
d(x, y) ≥ 0: the distance is never negative
d(x, y) = 0 if x = y
d(x, y) = d(y, x): symmetry
d(x, y) ≤ d(x, z) + d(z, y): triangle inequality
Distance measures for interval scale data
Cases/Variables  V1  V2  V3  V4  V5  V6  V7
      1           5   2   3   0   1   0   1
      2           4   4   3   3   1   1   1
      3          10   7   8   5   6   5   6
Different aspects of similarity (distance) can be focused on:
Cases 1 and 2 are similar regarding the absolute values of the variables → most distance measures (e.g. Euclidean distance) focus on this aspect.
Cases 1 and 3 ...
Distance measures for interval scale data
Cases/Variables  V1  V2  V3  V4  V5  V6  V7
      1           5   2   3   0   1   0   1
      2           4   4   3   3   1   1   1
      3          10   7   8   5   6   5   6
Cases 1 and 3 are similar with respect to the profile (increase and decrease / relative values) over the variables.
→ To focus on this aspect, the product-moment correlation can be selected as similarity measure!
Distance measures for interval scale data
Cases/Variables  V1  V2  V3  V4  V5  V6  V7
      1           5   2   3   0   1   0   1
      2           4   4   3   3   1   1   1
      3          10   7   8   5   6   5   6
Cases 1 and 3 are similar with respect to the profile over the variables.
→ After standardizing the variables (e.g. z-transforming), conventional distance measures (e.g. Euclidean distance) also focus on this latter aspect.
Euclidean distance
The Euclidean distance between two cases A and B is the "straight line" between the two cases. Assume A(x_A, y_A), B(x_B, y_B); then
d_Euclidean(A, B) = √((x_A − x_B)² + (y_A − y_B)²).
The value of the Euclidean distance depends on the scale of the variables. → Standardize the variables!
Generally, for two cases with variable vectors x and y: d(x, y) = √(Σ_{i=1}^k (x_i − y_i)²), where k is the number of variables.
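The general formula transcribes directly. An illustrative Python sketch (in R one would simply use `dist()`, whose default method is Euclidean):

```python
from math import sqrt

# Euclidean distance between two cases with k variables:
# d(x, y) = sqrt(sum_i (x_i - y_i)^2)
def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# For cases 1 and 2 of the slide's table the distance is sqrt(15):
# euclidean([5, 2, 3, 0, 1, 0, 1], [4, 4, 3, 3, 1, 1, 1])
```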
Minkowski Metric
The general basis of both the city-block distance and the Euclidean distance (ordinary distance, Pythagorean metric) are the so-called Minkowski metrics, corresponding to the formula
d(x, y) = (Σ_{i=1}^k |x_i − y_i|^r)^(1/r),
where r is called the Minkowski constant.
In the city-block metric r = 1; in the Euclidean distance r = 2. These distance measures can be calculated for any number of variables (dimensions).
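The whole Minkowski family fits in a few lines. An illustrative Python sketch (in R, `dist(x, method = "minkowski", p = r)` is the counterpart):

```python
# Minkowski metric d(x, y) = (sum_i |x_i - y_i|^r)^(1/r);
# r = 1 gives the city-block distance, r = 2 the Euclidean distance.
def minkowski(x, y, r):
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

def supremum(x, y):
    # limit r -> infinity: only the biggest coordinate difference matters
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# e.g. for x = (0, 0), y = (3, 4):
# city-block 7, Euclidean 5, supremum 4
```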
Minkowski Metric, Cont.
High values of r increase the weight of large distances relative to small ones.
Dominance (supremum) metric: r → ∞; thus only the biggest difference matters.
For applying Minkowski metrics, the scales of the variables should be identical. Otherwise, a standardization (e.g. z-transformation of all variables) must be performed.
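The z-transformation mentioned here can be sketched as follows (illustrative Python using the sample standard deviation; in R, `scale()` does this per column):

```python
from math import sqrt

# Hedged sketch: z-transform one variable to mean 0 and standard deviation 1
# before applying a Minkowski-type metric, as recommended when variables are
# measured on different scales.
def z_transform(values):
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample sd
    return [(v - mean) / sd for v in values]

# e.g. z_transform([2, 4, 6]) -> [-1.0, 0.0, 1.0]
```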
Cosine
The similarity between two cases x and y is defined as:
d(x, y) = Σ_{i=1}^k x_i y_i / √(Σ_i x_i² · Σ_i y_i²)
The direction of the variable vectors is decisive, not their length: the cosine of the "angle" between the two vectors is computed.
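A direct transcription of the cosine measure, as an illustrative Python sketch (not from the slides):

```python
from math import sqrt

# Cosine measure: sum_i x_i*y_i / sqrt(sum_i x_i^2 * sum_i y_i^2).
# Only the direction of the vectors matters, not their length.
def cosine_similarity(x, y):
    num = sum(xi * yi for xi, yi in zip(x, y))
    return num / sqrt(sum(xi * xi for xi in x) * sum(yi * yi for yi in y))

# Orthogonal vectors give 0; vectors of different length but the same
# direction give 1, illustrating the length invariance.
```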
Pearson Correlation
Correlation coefficient between x and y:
r_{x,y} = Σ_{i=1}^k (x_i − x̄)(y_i − ȳ) / [Σ_{i=1}^k (x_i − x̄)² · Σ_{i=1}^k (y_i − ȳ)²]^(1/2)
Measure of similarity.
Note: Clustering variables using the Pearson correlation as distance measure is similar to factor analysis, as both are closely related to the correlation matrix of the involved variables and both aim to identify similarities between these variables.
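The profile argument from the earlier table can be checked numerically: case 3 equals case 1 shifted by a constant, so their correlation is exactly 1. An illustrative Python sketch (in R, `cor()`):

```python
# Pearson product-moment correlation, used here as a similarity measure.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sum((xi - mx) ** 2 for xi in x) *
           sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

# Cases 1 and 3 from the slide's table: identical profiles (case 3 is
# case 1 plus 5 on every variable), hence correlation 1.
```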
Distance measures for binary variables
Cases/Variables  V1  V2  V3  V4  V5  V6  V7
      1           1   1   1   0   1   0   1
      2           0   1   0   0   0   1   1
      3           1   1   0   0   0   0   0
Even for binary data, manifold distance (similarity) measures exist, e.g.:
1 Simple Matching Coefficient (SMC) = matches / number of paired variables.
→ e.g. SMC(case 1, case 2) = 3/7; SMC(case 2, case 3) = 4/7.
Distance measures for binary variables
Cases/Variables  V1  V2  V3  V4  V5  V6  V7
      1           1   1   1   0   1   0   1
      2           0   1   0   0   0   1   1
      3           1   1   0   0   0   0   0
1 Simple Matching Coefficient.
2 Jaccard Matching Coefficient (JMC) = matches with characteristic present (= 1) / number of paired variables where the characteristic is present (= 1) at least once.
→ e.g. JMC(case 1, case 2) = 2/6; JMC(case 2, case 3) = 1/4.
Distance measures for binary variables
Cases/Variables  V1  V2  V3  V4  V5  V6  V7
      1           1   1   1   0   1   0   1
      2           0   1   0   0   0   1   1
      3           1   1   0   0   0   0   0
1 Simple Matching Coefficient.
2 Jaccard Matching Coefficient.
3 Phi coefficient → the product-moment correlation formula applied to binary data coded 0 and 1.
In these three methods, the distance measure is defined by d = (1 − similarity measure).
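The SMC and JMC values from the slides can be reproduced directly. An illustrative Python sketch (not from the slides; the course itself uses R, e.g. `dist(x, method = "binary")` for the Jaccard-based distance):

```python
# Simple matching and Jaccard coefficients for binary vectors;
# the corresponding distance is d = 1 - similarity.
def smc(x, y):
    matches = sum(xi == yi for xi, yi in zip(x, y))
    return matches / len(x)

def jaccard(x, y):
    both = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    at_least_one = sum(xi == 1 or yi == 1 for xi, yi in zip(x, y))
    return both / at_least_one

case1 = [1, 1, 1, 0, 1, 0, 1]
case2 = [0, 1, 0, 0, 0, 1, 1]
case3 = [1, 1, 0, 0, 0, 0, 0]
# smc(case1, case2) == 3/7, smc(case2, case3) == 4/7
# jaccard(case1, case2) == 2/6, jaccard(case2, case3) == 1/4
```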
Comments regarding distance measures
Specific distance measures for nominal scaled variables that are not binary, as well as distance measures for ordinal scaled variables, are available.
Act with caution if variables with different levels of measurement are used in combination!
Analogously to the distances between cases, distances between variables can also be calculated.
Hierarchical Agglomerative CA Methods · Partitional Clustering: k-means Clustering
Formation of groups (clusters) in CA
Different methods can be used to form clusters.
Hierarchical agglomerative CA methods start with the finest partitioning (i.e. each case [variable] forms one cluster) and sequentially reduce the number of clusters by 1 through unifying two clusters.
Divisive methods, which start with one super-cluster, are very uncommon and not treated here.
Formation of clusters in CA, Cont.
In hierarchical CA methods, clusters are sequentially enlarged. Once an element is included in a cluster, it will not be excluded from that cluster.
In non-hierarchical CA methods, elements can sequentially be included in and excluded from different clusters in order to identify the best cluster structure.
→ Thus, they are more flexible, and more complex.
→ The k-means method is the most frequently applied non-hierarchical CA method.
Algorithms for establishing clusters
Hierarchical agglomerative CA methods
Different hierarchical CA methods use different criteria for selecting which two clusters will be unified in the next step.
Hierarchical methods are, for example, Single Linkage, Complete Linkage, Average Linkage, the [weighted/unweighted] Centroid method, and the Ward method.
Hierarchical agglomerative CA methods, Cont.
Even if the same distance measure is used, these methods result in different distances between clusters → i.e. different distance matrices (but only if at least one of the clusters contains more than one case).
The Ward method is a somewhat different approach, as it uses a sum-of-squares criterion and unifies the two clusters whose fusion results in the smallest increase of error variance.
The procedure of hierarchical methods
1 Start with the finest grouping, i.e. each case is a cluster.
2 Calculation of the distance matrix.
3 The two clusters with the lowest distance are identified and unified into a common cluster.
4 A new distance matrix for the reduced number of clusters is calculated (where the two formerly separate clusters are now considered as one cluster).
5 The two clusters with the lowest distance in the reduced distance matrix are unified.
The procedure of hierarchical methods, Cont.
... Recursion to step 4, etc., until all cases are unified in a single super-cluster or some predefined criterion for stopping the agglomeration schedule is met.
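The procedure above can be sketched as a small loop with a pluggable linkage rule. This is an illustrative Python sketch, not the course's implementation (in R one would call `hclust()` on a `dist` object); single linkage is shown, and swapping `min` for `max` in the marked line yields complete linkage:

```python
# Minimal agglomerative clustering sketch: start with singleton clusters,
# repeatedly merge the two closest clusters until n_clusters remain.
def agglomerate(points, n_clusters, dist):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: smallest pairwise distance between members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # unify the two clusters
        del clusters[j]
    return clusters

# e.g. four 1-D points split cleanly into two clusters:
# agglomerate([0.0, 0.1, 5.0, 5.2], 2, lambda a, b: abs(a - b))
```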
Single Linkage
1. Single Linkage calculates the distance D(R; P&Q) between a (divisible) cluster containing the two cases P, Q, and the indivisible cluster R as:
D(R; P&Q) = min{D(R; P), D(R; Q)}
Here, the smallest distance between the cluster P&Q and the cluster R determines the distance between the two clusters.
Single Linkage, Cont.
The method is also known as Nearest Neighbor, as those two clusters which have the nearest neighbors (i.e. between which the pair with the smallest distance between two objects of the two clusters exists) are unified.
This results in a tendency towards the formation oflarge clusters.
Complete Linkage
2. Complete Linkage calculates the distance D(R; P&Q) between a (divisible) cluster containing the two cases P, Q, and the indivisible cluster R as:
D(R; P&Q) = max{D(R; P), D(R; Q)}
Here, the largest distance between the cluster P&Q and the cluster R determines the distance between the two clusters.
Complete Linkage, Cont.
The method is also known as Furthest Neighbor, as the two clusters whose furthest neighbors (i.e. the pair with the largest distance between two objects of the two clusters) are least distant are unified.
This results in a tendency towards the formation ofsmall clusters.
Average Linkage between groups
3. Average Linkage between groups calculates the distance D(R; P&Q) between a (divisible) cluster containing the two cases P, Q, and the indivisible cluster R as:
D(R; P&Q) = mean(D(R; P), D(R; Q))
Here, the average of the pairwise distances between all the pairs formed by objects of both clusters determines the effective distance between the two clusters.
Average Linkage between groups, Cont.
3. Average Linkage: D(R; P&Q) = mean(D(R; P), D(R; Q))
→ The two clusters with the smallest average between-group distance are unified.
→ This method is frequently used.
Average Linkage within groups
4. Average Linkage within groups calculates the mean distance D(R; P&Q) between all cases in the cluster to be formed out of the divisible cluster and the indivisible cluster R:
D(R; P&Q) = mean(D(R; P), D(R; Q), D(P; Q))
Here, the average of the pairwise distances between all the objects in the new cluster is calculated for each possible cluster fusion. The two clusters having the lowest average within-cluster distance are unified.
Cluster centroids
For calculating the distances between two large clusters, the four previous methods work analogously to our example of an indivisible cluster R and a composed cluster with two cases (P, Q).
In the four methods described so far, the individual objects of the (non-elementary) composed clusters are considered when calculating the distance between two clusters. The subsequent methods focus on the two cluster centroids instead.
Unweighted Centroid Method
5. The unweighted Centroid method (= Median method) calculates the distance between clusters as the distance between the centroids of the clusters.
In principle, the position of the centroid of a cluster is defined, for each of the variables considered in the CA, by the average value of the objects forming the cluster. In the case of an elementary cluster with only one object, this object represents the centroid.
Unweighted Centroid Method, Cont.
However, in the unweighted Centroid method (= Median method), when calculating the distance between two clusters only the two centroids are considered instead of the single objects within the two clusters.
→ The centroid of the cluster resulting from the fusion of two clusters is the mean of the two cluster centroids.
→ Large and small clusters are not weighted differently when they are unified.
Weighted Centroid Method
In the unweighted Centroid method, the sizes of the two sub-clusters that are unified are ignored.
→ The resulting centroid is usually different from the centroid of all elementary objects (i.e. cases) contained in the unified cluster (it is rather the centroid of the two cluster centroids).
⇓
6. The weighted Centroid method considers the differences in size (number of objects) of the two original sub-clusters.
Weighted Centroid Method, Cont.
Thus, in the case of the fusion of a small and a large cluster, the centroid of the resulting cluster is closer to the centroid of the large cluster than to the centroid of the small cluster.
Ward Method (Min. Variance Method)
7. The idea has much in common with an ANOVA: the two clusters whose fusion leads to the smallest increase in the error sum of squares (ESS) are sequentially unified.
Let X_ijk denote the value of variable k in observation j belonging to cluster i:
ESS(X) = Σ_i Σ_j Σ_k (X_ijk − X̄_i·k)² = Σ_clusters Σ_cases (X_ij − centroid_i)²
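The ESS criterion can be computed directly from a candidate partition. An illustrative Python sketch (not from the slides; in R, `hclust(..., method = "ward.D2")` applies the Ward criterion):

```python
# Error sum of squares (ESS) as used by the Ward method: for each cluster,
# sum the squared deviations of its members from the cluster centroid over
# all variables, then sum over clusters. Each case is a tuple of k values.
def ess(clusters):
    total = 0.0
    for cluster in clusters:
        k = len(cluster[0])  # number of variables
        centroid = [sum(p[v] for p in cluster) / len(cluster) for v in range(k)]
        total += sum((p[v] - centroid[v]) ** 2
                     for p in cluster for v in range(k))
    return total

# Two tight clusters give ESS 0; merging distant cases increases ESS,
# which is exactly what the Ward criterion penalizes.
```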
Ward Procedure I
1 At the beginning, each case is a cluster.
2 In the first step of the algorithm, n − 1 clusters are formed: one of size two and the remaining ones of size 1. The error sum of squares is computed for each possible merge. The pair of sample units that yields the smallest ESS forms the first cluster.
Ward Procedure II
3 Then, in the second step of the algorithm, n − 2 clusters are formed from the n − 1 clusters defined in step 2. These may include two clusters of size 2, or a single cluster of size 3 including the two items clustered in step 1. Again, the value of ESS is minimized.
4 Thus, at each step of the algorithm, clusters or observations are combined in such a way as to minimize the increase in the error sum of squares.
Ward Procedure III
5. The algorithm stops when all sample units are combined into a single large cluster of size n.
Ward method, Cont.
The Ward method is frequently applied.

It tends to produce homogeneous clusters.

It has a tendency to generate clusters of similar size (i.e. with similar numbers of cases).

However, the Ward method's tendency towards equally sized clusters can be an advantage (e.g. if homogeneous cluster sizes are desired) as well as a disadvantage (e.g. if unbalanced cluster sizes better reflect reality). In the latter case, average linkage or the (weighted) centroid method may produce better results.
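A minimal sketch of the Ward method in R, under two assumptions not stated in the slides: the built-in `USArrests` data set is used as example input, and the method is named `"ward.D2"` (the name used by `hclust` in current R releases, which expects Euclidean distances).

```r
# Ward (minimum-variance) clustering on a built-in example data set
d    <- dist(scale(USArrests))        # Euclidean distances on standardized data
ward <- hclust(d, method = "ward.D2") # Ward's minimum-variance criterion
plot(ward, hang = -1)                 # dendrogram of the fusion sequence
cutree(ward, k = 4)                   # cut the tree into four clusters
```

Standardizing with `scale` first is a common precaution so that variables on large scales do not dominate the distances.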
Partitional clustering methods
A partitioning into G clusters is given (a predefined clustering solution can be used as the starting point).

The number of clusters remains constant.

The partitioning is suboptimal, and the algorithm tries to improve it.

Iterative rearrangements improve the partitioning.

The methods differ in the criteria used to measure this improvement and in the rules for the rearrangements.
k-means clustering
Follows the idea of an ANOVA. The clusters are established in order to maximize the F-statistic: the ratio of the variance between clusters to the variance within the clusters is maximized.

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large data sets.
k-means clustering-algorithm
Randomly generate G clusters/centroids.

Compute the centroid of each cluster.

Calculate, for each case, the distance to the centroid of each cluster.

Move each case into the cluster with the smallest distance.

Repeat the above steps until no change in the cluster assignment occurs.
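The iteration above is what R's built-in `kmeans` function implements. A short sketch follows; the `USArrests` example data, the choice of three centers, and `nstart = 25` (multiple random starts to avoid poor local optima) are illustrative assumptions, not prescriptions from the slides.

```r
# k-means on standardized example data
set.seed(42)                          # reproducible random initialization
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
km$cluster                            # cluster membership of each case
km$betweenss / km$tot.withinss        # between/within ratio the method drives up
```

Because the result depends on the random starting centroids, `nstart > 1` reruns the algorithm from several starts and keeps the best solution.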
Dendrogram
The results of a CA can be depicted in a dendrogram. It shows the sequence in which the clusters have been formed and the increases in within-cluster distances that are connected with the formation of ever larger clusters (i.e., the increase of the error variance in the case of the Ward method).
Dendrogram, Cont.
Together with the classification of the objects, the dendrogram is the most interesting output of a CA.

It can be used to judge the homogeneity of the clusters.

It can also be used to define the number of clusters that should be used for the final classification of the objects.
Dendrogram, Cont.

[Figure: example dendrogram, not reproduced in this transcript.]
Classification
If the cases in the data set are assigned to clusters (clustering of cases), then the number of the cluster in which each case is included can be saved as a new variable.

The number of clusters to be formed can be predefined, or a range of solutions can be generated.

Subsequent analyses comparing the different clusters with respect to variables of interest become possible.
Classification, Cont.
If clusters differ significantly with respect to variables of interest that are rather independent of the variables that have been used for clustering the cases, then the meaningfulness of the clusters and the usefulness of the clustering procedure become evident.
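This validation step can be sketched in R as follows. Everything here is an illustrative assumption: clusters are built from three of the `USArrests` variables, and the held-out variable `UrbanPop` (not used for clustering) is then compared across clusters with a one-way ANOVA.

```r
# Cluster on some variables, then compare clusters on a variable held out
d   <- dist(scale(USArrests[, c("Murder", "Assault", "Rape")]))
hc  <- hclust(d, method = "ward.D2")
dat <- data.frame(USArrests, cluster = factor(cutree(hc, k = 3)))

# Do the clusters differ on the held-out variable?
summary(aov(UrbanPop ~ cluster, data = dat))
```

A significant difference on such an external variable supports the substantive meaning of the clustering; a null result does not invalidate it, but weakens the case.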
Useful R-commands I
Hierarchical clustering (function hclust) is in standard R and available without loading any specific libraries. Hierarchical clustering needs dissimilarities as its input. Standard R has function dist to calculate many dissimilarity functions, but for community data we may prefer the vegan function vegdist, which offers ecologically useful dissimilarity indices. The default index is Bray-Curtis:

$$d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}$$

d <- dist(data) or d <- vegdist(data)
Useful R-commands II
The ecologically useful indices in vegan have an upper limit of 1 for completely different sites (no shared species), and use simple differences of abundances. In contrast, the standard Euclidean distance has no upper limit, but varies with the sum of total abundances of the compared sites when there are no shared species, and uses squares of differences of abundances. There are many other ecologically useful indices in vegdist, but Bray-Curtis is usually not a bad choice.
Useful R-commands III
Single linkage method: singlink <- hclust(d, method = "single"), with the dendrogram plot(singlink)

Complete linkage: complink <- hclust(d, method = "complete") and plot(complink, hang = -1)

Average linkage method: averlink <- hclust(d, method = "aver") with plot(averlink, hang = -1)
Useful R-commands IV
The fixed classification can be visually demonstrated with the rect.hclust function: plot(singlink, hang = -1) with rect.hclust(singlink, 3)

We can extract the classification at a certain level using function cutree: cl <- cutree(complink, 3). This gives a numeric classification vector of cluster identities.
Useful R-commands V
We can tabulate the numbers of observations in each cluster: table(cl).

We can compare two clustering schemes by cross-tabulation: table(cl, cutree(singlink, 3)).
Also available: library(cluster).