An Introduction to Cluster Analysis
TRANSCRIPT
What can you say about the figure?
[Figure: scatter plot of signal C against signal T]
• ≈ 1500 subjects
• Two measurements per subject
[Figure: the same scatter plot of signal C against signal T, now with three groups labeled CC, TT, and CT]
Cluster Analysis
• Seeks rules to group data
– Large between-cluster differences
– Small within-cluster differences
• Exploratory
• Aims to understand/learn the unknown substructure of multivariate data
Cluster Analysis vs. Classification

Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• “Unsupervised” learning
• Goal: find unknown structures

Classification:
• The labels for the training data are known
• The number of classes is known
• “Supervised” learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes
The Iris Data
• It was made famous by R.A. Fisher
• A famous data set that has been widely used in textbooks
• Four features:
– sepal length in cm
– sepal width in cm
– petal length in cm
– petal width in cm
The Iris Data
• Three types:
– Setosa
– Versicolor
– Virginica
The Iris Data

Iris Setosa:
       Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]      5.1       3.5       1.4       0.2
 [2,]      4.9       3.0       1.4       0.2
 [3,]      4.7       3.2       1.3       0.2
 [4,]      4.6       3.1       1.5       0.2
 [5,]      5.0       3.6       1.4       0.2
 [6,]      5.4       3.9       1.7       0.4
 [7,]      4.6       3.4       1.4       0.3
 [8,]      5.0       3.4       1.5       0.2
 [9,]      4.4       2.9       1.4       0.2
  …         …         …         …         …
[45,]      5.1       3.8       1.9       0.4
[46,]      4.8       3.0       1.4       0.3
[47,]      5.1       3.8       1.6       0.2
[48,]      4.6       3.2       1.4       0.2
[49,]      5.3       3.7       1.5       0.2
[50,]      5.0       3.3       1.4       0.2

Iris Virginica:
       Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]      6.3       3.3       6.0       2.5
 [2,]      5.8       2.7       5.1       1.9
 [3,]      7.1       3.0       5.9       2.1
 [4,]      6.3       2.9       5.6       1.8
 [5,]      6.5       3.0       5.8       2.2
 [6,]      7.6       3.0       6.6       2.1
 [7,]      4.9       2.5       4.5       1.7
 [8,]      7.3       2.9       6.3       1.8
 [9,]      6.7       2.5       5.8       1.8
  …         …         …         …         …
[45,]      6.7       3.3       5.7       2.5
[46,]      6.7       3.0       5.2       2.3
[47,]      6.3       2.5       5.0       1.9
[48,]      6.5       3.0       5.2       2.0
[49,]      6.2       3.4       5.4       2.3
[50,]      5.9       3.0       5.1       1.8

Iris Versicolor:
       Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]      7.0       3.2       4.7       1.4
 [2,]      6.4       3.2       4.5       1.5
 [3,]      6.9       3.1       4.9       1.5
 [4,]      5.5       2.3       4.0       1.3
 [5,]      6.5       2.8       4.6       1.5
 [6,]      5.7       2.8       4.5       1.3
 [7,]      6.3       3.3       4.7       1.6
 [8,]      4.9       2.4       3.3       1.0
 [9,]      6.6       2.9       4.6       1.3
  …         …         …         …         …
[45,]      5.6       2.7       4.2       1.3
[46,]      5.7       3.0       4.2       1.2
[47,]      5.7       2.9       4.2       1.3
[48,]      6.2       2.9       4.3       1.3
[49,]      5.1       2.5       3.0       1.1
[50,]      5.7       2.8       4.1       1.3
The Iris Data
[Figure: pairwise scatter plots of Sepal L., Sepal W., Petal L., and Petal W., by species (Setosa, Versicolor, Virginica)]
Clustering Methods
• Model-free:
– Nonhierarchical clustering, e.g., K-means
– Hierarchical clustering, based on similarity measures
• Model-based clustering
Model-Free Clustering, Nonhierarchical Clustering: K-Means
K-Means
• Assign each observation to the cluster with the nearest mean
• “Nearest” is usually defined based on Euclidean distance
K-Means: Algorithm
• Step 0: Preprocess the data; standardize the data if appropriate
• Step 1: Partition the observations into K initial clusters
• Step 2:
– 2.a (update step): Calculate the centroids
– 2.b (assignment step): Assign each observation to its nearest cluster
• Repeat step 2 until there are no more changes in the assignments
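The steps above can be sketched in a few lines of NumPy (a minimal illustration written for this section, not code from the lecture; the function and variable names are my own):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternate centroid updates and nearest-centroid assignment."""
    rng = np.random.default_rng(seed)
    # Step 1: K randomly chosen observations serve as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step (2.b): nearest centroid in Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no more changes in assignments
        labels = new_labels
        # Update step (2.a): each centroid becomes the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Toy demo: two well-separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centroids = kmeans(X, K=2)
```

On data this well separated, the two recovered clusters coincide with the two generating groups.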
[Figure from “An Introduction to Statistical Learning”]
Remarks
• Before convergence, each step is guaranteed to decrease the within-cluster sum-of-squares objective
• The algorithm converges within a finite number of steps, but possibly only to a local minimum
• Run the algorithm with several different random initial values and keep the best solution
Different Initial Values
[Figure from “An Introduction to Statistical Learning”: K-means results under different initial values]
Example: Cluster Analysis of Iris Data (Petal L & W)
• Pretend that the iris types of the observations are unknown => cluster analysis
• As an example, and for illustration purposes, we will use petal length and width
• Choose K = 3
• K-means
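One way to run this example (a sketch using scikit-learn and its bundled copy of the iris data; the lecture itself does not prescribe a particular implementation):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]   # columns 2 and 3: petal length and petal width

# K = 3 clusters; n_init restarts K-means from several random initial values
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # the three centroids in (petal length, width) space
print(km.labels_[:10])       # cluster assignments for the first ten flowers
```

Because the true species labels are held back, the fitted `labels_` can afterwards be compared against them to judge how well the clusters recover the three types.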
K-Means Clustering: Iris (Petal L & W)
[Figure: animation of K-means on petal length and width, iterations 1 through 9, with points labeled Setosa, Versicolor, Virginica. Note: the animation in the figure doesn’t work appropriately on Mac.]
Model-Free Clustering: Hierarchical Clustering
Hierarchical Clustering
• The number of clusters is not required
• Gives a tree-based representation of the observations: a dendrogram
• Each leaf represents an observation
• Leaves similar to each other are fused into branches
• Leaves/branches similar to each other are fused into larger branches
• …
[Figure: dendrogram of the iris data; the leaves group into Setosa, Virginica, and Versicolor]
Hierarchical Clustering
• To grow a tree, we need to define dissimilarities (distances) between leaves/branches:
– Two leaves: easy; one can use a dissimilarity measure
– A leaf and a branch: there are different options
– Two branches: similar to “a leaf and a branch”, there are different options
Distance between Two Branches/Clusters
• Single linkage
• Complete linkage
• Average linkage
• Many other options!
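These linkage options correspond directly to the `method` argument of SciPy's hierarchical-clustering routines (an illustrative sketch with made-up points; the lecture names no particular library):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up points forming two tight groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # grow the tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
    print(method, labels)
```

On clearly separated data like this, all three linkages agree; they differ mainly in how they treat elongated or noisy clusters.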
Model-Based Clustering
Model-Based Clustering: Mixture Model
• Consider a random variable X. We say it follows a mixture of K distributions if its density can be represented using K component densities:

f(x) = p_1 f_1(x) + p_2 f_2(x) + … + p_K f_K(x)

• The weights p_k, k = 1, …, K, are nonnegative numbers and they add up to 1
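As a concrete check of the definition, here is a toy example constructed for this section, with K = 2 Gaussian components and weights 0.4 and 0.6 (all values are hypothetical):

```python
from scipy.stats import norm

# Hypothetical mixture: f(x) = 0.4 * N(1, 1) density + 0.6 * N(3, 1) density
weights = [0.4, 0.6]
components = [norm(loc=1.0, scale=1.0), norm(loc=3.0, scale=1.0)]

def mixture_pdf(x):
    # f(x) = sum_k p_k * f_k(x); the weights are nonnegative and sum to 1
    return sum(p * f.pdf(x) for p, f in zip(weights, components))

density = mixture_pdf(2.0)   # mixture density at x = 2
```

Because the weights sum to 1 and each component is a density, `mixture_pdf` is itself a valid probability density.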
Cluster Analysis Based on Mixture Model
• I present a frequentist version:
– Choose an appropriate model, e.g., a Gaussian mixture model with K = 2 clusters
– Write down the likelihood function
– Find the maximum likelihood estimate of the parameters
– Calculate Pr(cluster k | observation x_i) for i = 1, …, n, k = 1, 2
The Maximum Likelihood Estimate (MLE) of the Parameters
• An easy-to-implement algorithm to find the MLEs is the Expectation-Maximization (EM) algorithm
• Initialize the parameters
• E step: calculate a “conditional” expectation
– “Conditional” means conditional on the current estimate of the parameters
– This step involves calculating Pr(cluster k | observation i, current parameter estimates), k = 1, …, K, i = 1, …, n
– This step is similar to the assignment step in a K-means algorithm
The Maximum Likelihood Estimate (MLE) of the Parameters
• M step: find the set of parameter values that maximizes the conditional expectation calculated in the E step; this step updates the parameter values
• Repeat the E and M steps until convergence
EM vs. K-Means

EM:
• Step 1: initialization
• E step: calculate conditional probabilities
• M step: find optimal values for the parameters
• Repeat the E and M steps until convergence
• Allows clusters to have different shapes

K-Means:
• Step 1: initialization
• Step 2a: guess cluster membership
• Step 2b: find cluster centers
• Repeat 2a–2b until convergence
Example: Gaussian Mixture Model
• Observed data (simulated from two normal distributions):
0.37 1.18 0.16 2.60 1.33 0.18 1.49 1.74 3.58 2.69
4.51 3.39 2.38 0.79 4.12 2.96 2.98 3.94 3.82 3.59
• Assuming K = 2
• Parameters: μ1, μ0, σ1, σ0, p
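A from-scratch EM fit to these 20 observations, under the stated two-component Gaussian model (my own implementation, not the lecture's code; parameter names follow the slide, with p taken as the weight of component 1):

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.37, 1.18, 0.16, 2.60, 1.33, 0.18, 1.49, 1.74, 3.58, 2.69,
              4.51, 3.39, 2.38, 0.79, 4.12, 2.96, 2.98, 3.94, 3.82, 3.59])

# Initial guesses for mu0, mu1, sigma0, sigma1, p
mu0, mu1, s0, s1, p = 0.5, 3.5, 1.0, 1.0, 0.5

for _ in range(200):
    # E step: r_i = Pr(component 1 | x_i, current parameters)
    d0 = (1 - p) * norm.pdf(x, mu0, s0)
    d1 = p * norm.pdf(x, mu1, s1)
    r = d1 / (d0 + d1)
    # M step: weighted maximum likelihood updates
    p = r.mean()
    mu0 = np.average(x, weights=1 - r)
    mu1 = np.average(x, weights=r)
    s0 = np.sqrt(np.average((x - mu0) ** 2, weights=1 - r))
    s1 = np.sqrt(np.average((x - mu1) ** 2, weights=r))

print(mu0, mu1, s0, s1, p)   # fitted means, standard deviations, and mixing weight
```

With only 20 points the estimates are noisy, but the two fitted means should bracket the two generating groups rather than collapse onto each other.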
Example: Simulated Data
[Figure: animation of the EM fit; Group 1 ~ N(1, 1), Group 2 ~ N(3, 1). Note: the animation in the figure doesn’t work appropriately on Mac.]
Example: Cluster Analysis of Iris Data Using Petal Length
[Figure: animation of the mixture-model fit, with clusters labeled Setosa and Versicolor. Note: the animation in the figure doesn’t work appropriately on Mac.]
R Package: MCLUST
• Developed by Adrian Raftery and colleagues
• Gaussian mixture models
• EM
• Clustering, classification, density estimation
• Please try it out!
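MCLUST itself is an R package; for readers working in Python, scikit-learn's GaussianMixture covers a similar Gaussian-mixture-plus-EM workflow (a rough analogue suggested here, not a substitute for MCLUST's full model-selection machinery):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data   # all four iris features

# Fit a 3-component Gaussian mixture by EM
gm = GaussianMixture(n_components=3, covariance_type="full",
                     random_state=0).fit(X)

labels = gm.predict(X)        # hard cluster assignments
probs = gm.predict_proba(X)   # Pr(cluster k | observation), as in the recipe above
bic = gm.bic(X)               # BIC, a criterion mclust also uses to compare models
```

Trying several values of `n_components` and `covariance_type` and comparing BIC values loosely mirrors how mclust searches over models.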
Cluster Analysis for Multidimensional Data
Multidimensional Data
• Human faces, images
• 3D objects
• Text documents
• Brain imaging
Whole Brain Connectivity
[Figure: whole-brain connectivity matrices for subjects 1–4 across sessions task 1–3 and rest 1–3]
Brain Connectivity vs. Fingerprint
[Figure: connectivity matrices indexed by subject ID across sessions task 1–3 and rest 1–3]
Some Technical Details
?