An Introduction to Cluster Analysis
TRANSCRIPT
What can you say about the figure?
[Figure: scatter plot of signal C against signal T]
• ≈ 1500 subjects
• Two measurements per subject
[Figure: the same scatter plot of signal C against signal T, now with three groups labeled CC, TT, and CT]
Cluster Analysis
• Seeks rules to group data
– Large between-cluster differences
– Small within-cluster differences
• Exploratory
• Aims to understand/learn the unknown substructure of multivariate data
Cluster Analysis vs. Classification

Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• “Unsupervised” learning
• Goal: find unknown structures

Classification:
• The labels for the training data are known
• The number of classes is known
• “Supervised” learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes
The Iris Data
• It was made famous by R.A. Fisher
• A famous data set that has been widely used in textbooks
• Four features:
– sepal length in cm
– sepal width in cm
– petal length in cm
– petal width in cm
The Iris Data
• Three types:
– Setosa
– Versicolor
– Virginica
The Iris Data

Iris Setosa:
       Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]      5.1       3.5       1.4       0.2
 [2,]      4.9       3.0       1.4       0.2
 [3,]      4.7       3.2       1.3       0.2
 [4,]      4.6       3.1       1.5       0.2
 [5,]      5.0       3.6       1.4       0.2
 [6,]      5.4       3.9       1.7       0.4
 [7,]      4.6       3.4       1.4       0.3
 [8,]      5.0       3.4       1.5       0.2
 [9,]      4.4       2.9       1.4       0.2
  …         …         …         …         …
[45,]      5.1       3.8       1.9       0.4
[46,]      4.8       3.0       1.4       0.3
[47,]      5.1       3.8       1.6       0.2
[48,]      4.6       3.2       1.4       0.2
[49,]      5.3       3.7       1.5       0.2
[50,]      5.0       3.3       1.4       0.2

Iris Virginica:
       Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]      6.3       3.3       6.0       2.5
 [2,]      5.8       2.7       5.1       1.9
 [3,]      7.1       3.0       5.9       2.1
 [4,]      6.3       2.9       5.6       1.8
 [5,]      6.5       3.0       5.8       2.2
 [6,]      7.6       3.0       6.6       2.1
 [7,]      4.9       2.5       4.5       1.7
 [8,]      7.3       2.9       6.3       1.8
 [9,]      6.7       2.5       5.8       1.8
  …         …         …         …         …
[45,]      6.7       3.3       5.7       2.5
[46,]      6.7       3.0       5.2       2.3
[47,]      6.3       2.5       5.0       1.9
[48,]      6.5       3.0       5.2       2.0
[49,]      6.2       3.4       5.4       2.3
[50,]      5.9       3.0       5.1       1.8

Iris Versicolor:
       Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]      7.0       3.2       4.7       1.4
 [2,]      6.4       3.2       4.5       1.5
 [3,]      6.9       3.1       4.9       1.5
 [4,]      5.5       2.3       4.0       1.3
 [5,]      6.5       2.8       4.6       1.5
 [6,]      5.7       2.8       4.5       1.3
 [7,]      6.3       3.3       4.7       1.6
 [8,]      4.9       2.4       3.3       1.0
 [9,]      6.6       2.9       4.6       1.3
  …         …         …         …         …
[45,]      5.6       2.7       4.2       1.3
[46,]      5.7       3.0       4.2       1.2
[47,]      5.7       2.9       4.2       1.3
[48,]      6.2       2.9       4.3       1.3
[49,]      5.1       2.5       3.0       1.1
[50,]      5.7       2.8       4.1       1.3
The Iris Data
[Figure: pairwise scatter plots of Sepal L., Sepal W., Petal L., and Petal W., by species (Setosa, Versicolor, Virginica)]
Clustering Methods
• Model-free:
– Nonhierarchical clustering, e.g., K-means
– Hierarchical clustering, based on similarity measures
• Model-based clustering
Model-Free Clustering, Nonhierarchical Clustering: K-Means
K-Means
• Assign each observation to the cluster with the nearest mean
• “Nearest” is usually defined based on Euclidean distance
K-Means: Algorithm
• Step 0: Preprocess the data; standardize the data if appropriate
• Step 1: Partition the observations into K initial clusters
• Step 2:
– 2.a (update step): Calculate the centroids
– 2.b (assignment step): Assign each observation to its nearest cluster
• Repeat step 2 until there are no more changes in the assignments
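The steps above can be sketched in a few lines of NumPy (a minimal illustration written for this section, not code from the lecture; the function and variable names are my own):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternate centroid updates and nearest-centroid assignment."""
    rng = np.random.default_rng(seed)
    # Step 1: K randomly chosen observations serve as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step (2.b): nearest centroid in Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no more changes in assignments
        labels = new_labels
        # Update step (2.a): each centroid becomes the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Toy demo: two well-separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centroids = kmeans(X, K=2)
```

On data this well separated, the two recovered clusters coincide with the two generating groups.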
[Figure from “An Introduction to Statistical Learning”]
Remarks
• Before convergence, each step is guaranteed to decrease the within-cluster sum-of-squares objective
• The algorithm converges within a finite number of steps, but possibly only to a local minimum
• Run the algorithm with several different random initial values and keep the best solution
Different Initial Values
[Figure from “An Introduction to Statistical Learning”: K-means results under different initial values]
Example: Cluster Analysis of Iris Data (Petal L & W)
• Pretend that the iris types of the observations are unknown => cluster analysis
• As an example, and for illustration purposes, we will use petal length and width
• Choose K = 3
• K-means
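One way to run this example (a sketch using scikit-learn and its bundled copy of the iris data; the lecture itself does not prescribe a particular implementation):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]   # columns 2 and 3: petal length and petal width

# K = 3 clusters; n_init restarts K-means from several random initial values
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # the three centroids in (petal length, width) space
print(km.labels_[:10])       # cluster assignments for the first ten flowers
```

Because the true species labels are held back, the fitted `labels_` can afterwards be compared against them to judge how well the clusters recover the three types.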
K-Means Clustering: Iris (Petal L & W)
[Figure: animation of K-means on petal length and width, iterations 1 through 9, with points labeled Setosa, Versicolor, Virginica. Note: the animation in the figure doesn’t work appropriately on Mac.]
Model-Free Clustering: Hierarchical Clustering
Hierarchical Clustering
• The number of clusters is not required
• Gives a tree-based representation of the observations: a dendrogram
• Each leaf represents an observation
• Leaves similar to each other are fused into branches
• Leaves/branches similar to each other are fused into larger branches
• …
[Figure: dendrogram of the iris data; the leaves group into Setosa, Virginica, and Versicolor]
Hierarchical Clustering
• To grow a tree, we need to define dissimilarities (distances) between leaves/branches:
– Two leaves: easy; one can use a dissimilarity measure
– A leaf and a branch: there are different options
– Two branches: similar to “a leaf and a branch”, there are different options
Distance between Two Branches/Clusters
• Single linkage
• Complete linkage
• Average linkage
• Many other options!
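These linkage options correspond directly to the `method` argument of SciPy's hierarchical-clustering routines (an illustrative sketch with made-up points; the lecture names no particular library):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up points forming two tight groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # grow the tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
    print(method, labels)
```

On clearly separated data like this, all three linkages agree; they differ mainly in how they treat elongated or noisy clusters.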
Model-Based Clustering
Model-Based Clustering: Mixture Model
• Consider a random variable X. We say it follows a mixture of K distributions if its density can be represented using K component densities:

f(x) = p_1 f_1(x) + p_2 f_2(x) + … + p_K f_K(x)

• The weights p_k, k = 1, …, K, are nonnegative numbers and they add up to 1
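As a concrete check of the definition, here is a toy example constructed for this section, with K = 2 Gaussian components and weights 0.4 and 0.6 (all values are hypothetical):

```python
from scipy.stats import norm

# Hypothetical mixture: f(x) = 0.4 * N(1, 1) density + 0.6 * N(3, 1) density
weights = [0.4, 0.6]
components = [norm(loc=1.0, scale=1.0), norm(loc=3.0, scale=1.0)]

def mixture_pdf(x):
    # f(x) = sum_k p_k * f_k(x); the weights are nonnegative and sum to 1
    return sum(p * f.pdf(x) for p, f in zip(weights, components))

density = mixture_pdf(2.0)   # mixture density at x = 2
```

Because the weights sum to 1 and each component is a density, `mixture_pdf` is itself a valid probability density.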
Cluster Analysis Based on Mixture Model
• I present a frequentist version:
– Choose an appropriate model, e.g., a Gaussian mixture model with K = 2 clusters
– Write down the likelihood function
– Find the maximum likelihood estimate of the parameters
– Calculate Pr(cluster k | observation x_i) for i = 1, …, n, k = 1, 2
The Maximum Likelihood Estimate (MLE) of the Parameters
• An easy-to-implement algorithm to find the MLEs is the Expectation-Maximization (EM) algorithm
• Initialize the parameters
• E step: calculate a “conditional” expectation
– “Conditional” means conditional on the current estimate of the parameters
– This step involves calculating Pr(cluster k | observation i, current parameter estimates), k = 1, …, K, i = 1, …, n
– This step is similar to the assignment step in a K-means algorithm
The Maximum Likelihood Estimate (MLE) of the Parameters
• M step: find the set of parameter values that maximizes the conditional expectation calculated in the E step; this step updates the parameter values
• Repeat the E and M steps until convergence
EM vs. K-Means

EM:
• Step 1: initialization
• E step: calculate conditional probabilities
• M step: find optimal values for the parameters
• Repeat the E and M steps until convergence
• Allows clusters to have different shapes

K-Means:
• Step 1: initialization
• Step 2a: guess cluster membership
• Step 2b: find cluster centers
• Repeat 2a–2b until convergence
Example: Gaussian Mixture Model
• Observed data (simulated from two normal distributions):
0.37 1.18 0.16 2.60 1.33 0.18 1.49 1.74 3.58 2.69
4.51 3.39 2.38 0.79 4.12 2.96 2.98 3.94 3.82 3.59
• Assuming K = 2
• Parameters: μ1, μ0, σ1, σ0, p
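A from-scratch EM fit to these 20 observations, under the stated two-component Gaussian model (my own implementation, not the lecture's code; parameter names follow the slide, with p taken as the weight of component 1):

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.37, 1.18, 0.16, 2.60, 1.33, 0.18, 1.49, 1.74, 3.58, 2.69,
              4.51, 3.39, 2.38, 0.79, 4.12, 2.96, 2.98, 3.94, 3.82, 3.59])

# Initial guesses for mu0, mu1, sigma0, sigma1, p
mu0, mu1, s0, s1, p = 0.5, 3.5, 1.0, 1.0, 0.5

for _ in range(200):
    # E step: r_i = Pr(component 1 | x_i, current parameters)
    d0 = (1 - p) * norm.pdf(x, mu0, s0)
    d1 = p * norm.pdf(x, mu1, s1)
    r = d1 / (d0 + d1)
    # M step: weighted maximum likelihood updates
    p = r.mean()
    mu0 = np.average(x, weights=1 - r)
    mu1 = np.average(x, weights=r)
    s0 = np.sqrt(np.average((x - mu0) ** 2, weights=1 - r))
    s1 = np.sqrt(np.average((x - mu1) ** 2, weights=r))

print(mu0, mu1, s0, s1, p)   # fitted means, standard deviations, and mixing weight
```

With only 20 points the estimates are noisy, but the two fitted means should bracket the two generating groups rather than collapse onto each other.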
Example: Simulated Data
[Figure: animation of the EM fit; Group 1 ~ N(1, 1), Group 2 ~ N(3, 1). Note: the animation in the figure doesn’t work appropriately on Mac.]
Example: Cluster Analysis of Iris Data Using Petal Length
[Figure: animation of the mixture-model fit, with clusters labeled Setosa and Versicolor. Note: the animation in the figure doesn’t work appropriately on Mac.]
R Package: MCLUST
• Developed by Adrian Raftery and colleagues
• Gaussian mixture models
• EM
• Clustering, classification, density estimation
• Please try it out!
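MCLUST itself is an R package; for readers working in Python, scikit-learn's GaussianMixture covers a similar Gaussian-mixture-plus-EM workflow (a rough analogue suggested here, not a substitute for MCLUST's full model-selection machinery):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data   # all four iris features

# Fit a 3-component Gaussian mixture by EM
gm = GaussianMixture(n_components=3, covariance_type="full",
                     random_state=0).fit(X)

labels = gm.predict(X)        # hard cluster assignments
probs = gm.predict_proba(X)   # Pr(cluster k | observation), as in the recipe above
bic = gm.bic(X)               # BIC, a criterion mclust also uses to compare models
```

Trying several values of `n_components` and `covariance_type` and comparing BIC values loosely mirrors how mclust searches over models.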
Cluster Analysis for Multidimensional Data
Multidimensional Data
• Human faces, images
• 3D objects
• Text documents
• Brain imaging
Whole Brain Connectivity
[Figure: whole-brain connectivity matrices for subjects 1–4 across sessions task 1–3 and rest 1–3]
Brain Connectivity vs. Fingerprint
[Figure: connectivity matrices indexed by subject ID across sessions task 1–3 and rest 1–3]
Some Technical Details
?