uva cs 4501: machine learning lecture 19: unsupervised ... · unsupervised learning = learning from...

56
UVA CS 4501: Machine Learning Lecture 19: Unsupervised Clustering (I) Dr. Yanjun Qi University of Virginia Department of Computer Science 11/29/18 Dr. Yanjun Qi / UVA CS 1

Upload: others

Post on 16-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

UVACS4501:MachineLearning

Lecture19: UnsupervisedClustering(I)

Dr.Yanjun Qi

UniversityofVirginiaDepartmentofComputerScience

11/29/18

Dr.YanjunQi/UVACS

1

Page 2: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Wherearewe?èmajorsectionsofthiscourse

q Regression(supervised)q Classification(supervised)

q Featureselection

q Unsupervisedmodelsq DimensionReduction(PCA)q Clustering(K-means,GMM/EM,Hierarchical)

q Learningtheoryq Graphicalmodels

11/29/18

Dr.YanjunQi/UVACS

2

Page 3: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

AnunlabeledDatasetX

• Data/points/instances/examples/samples/records:[rows]• Features/attributes/dimensions/independentvariables/covariates/predictors/regressors:[columns]

11/29/18

Dr.YanjunQi/UVACS

a data matrix of n observations on p variables x1,x2,…xp

Unsupervisedlearning =learningfromraw(unlabeled,unannotated,etc)data,asopposedtosuperviseddatawherelabelofexamplesisgiven

3

Page 4: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Today:Whatisclustering?

• Arethereany“groups”?• Whatiseachgroup?• Howmany?• Howtoidentifythem?

4

Page 5: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

• Find groups (clusters) of data points such that data points in a group will be similar (or related) to one another and different from (or unrelated to) the data points in other groups

Whatisclustering?

Inter-cluster distances are maximized

Intra-cluster distances are minimized

11/29/18

Dr.YanjunQi/UVACS

5

Page 6: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Whatisclustering?• Clustering:theprocessofgroupingasetofobjectsintoclassesofsimilarobjects– highintra-classsimilarity– lowinter-classsimilarity– Itisthecommonestformofunsupervisedlearning

• AcommonandimportanttaskthatfindsmanyapplicationsinScience,Engineering,informationScience,andotherplaces,e.g.

• Groupgenesthatperformthesamefunction• Groupindividualsthathassimilarpoliticalview• Categorizedocumentsofsimilartopics• Idealitysimilarobjectsfrompictures

6

Page 7: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Whatisclustering?• Clustering:theprocessofgroupingasetofobjectsintoclassesofsimilarobjects– highintra-classsimilarity– lowinter-classsimilarity– Itisthecommonestformofunsupervisedlearning

• AcommonandimportanttaskthatfindsmanyapplicationsinScience,Engineering,informationScience,andotherplaces,e.g.

• Groupgenesthatperformthesamefunction• Groupindividualsthathassimilarpoliticalview• Categorizedocumentsofsimilartopics• Idealitysimilarobjectsfrompictures

7

Page 8: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

ToyExamples

• People

• Images

• Language

• species

8

Page 9: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Application (I): Search

Result Clustering

11/29/18 9

Dr.YanjunQi/UVACS

Page 10: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Application (II): Navigation

11/29/18 10

Dr.YanjunQi/UVACS

Page 11: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Issuesforclustering• Whatisanaturalgroupingamongtheseobjects?

– Definitionof"groupness"• Whatmakesobjects“related”?

– Definitionof"similarity/distance"• Representation forobjects

– Vectorspace?Normalization?• Howmanyclusters?

– Fixedapriori?– Completelydatadriven?

• Avoid“trivial” clusters- toolargeorsmall• ClusteringAlgorithms

– Partitional algorithms– Hierarchicalalgorithms

• Formal foundationandconvergence11

Page 12: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

TodayRoadmap:clustering

§ Definitionof"groupness”§ Definitionof"similarity/distance"§ Representationforobjects§ Howmanyclusters?§ ClusteringAlgorithms

§ Partitional algorithms§ Hierarchicalalgorithms

§ Formalfoundationandconvergence12

Page 13: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Whatisanaturalgroupingamongtheseobjects?

13

Page 14: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Anotherexample:clusteringissubjective

A

B

A

B

A

B

A

B A

B

A

B

TwopossibleSolutions…

11/29/18 Dr.YanjunQi/UVACS 14

Page 15: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

TodayRoadmap:clustering

§ Definitionof"groupness”§ Definitionof"similarity/distance"§ Representationforobjects§ Howmanyclusters?§ ClusteringAlgorithms

§ Partitional algorithms§ Hierarchicalalgorithms

§ Formalfoundationandconvergence15

Page 16: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

WhatisSimilarity?

• Therealmeaningofsimilarityisaphilosophicalquestion.Wewilltakeamorepragmaticapproach

• Dependsonrepresentationandalgorithm.Formanyrep./alg.,easiertothinkintermsofadistance(ratherthansimilarity)betweenvectors.

Hardtodefine!Butweknowitwhenweseeit

16

Page 17: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Whatpropertiesshouldadistancemeasurehave?

• D(A,B)=D(B,A) Symmetry

• D(A,A)=0 ConstancyofSelf-Similarity

• D(A,B)=0IIf A=B PositivitySeparation

• D(A,B)<= D(A,C)+D(B,C) TriangularInequality

17

Page 18: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

• D(A,B)=D(B,A) Symmetry– Otherwiseyoucouldclaim"AlexlookslikeBob,butBoblooksnothing

likeAlex"

• D(A,A)=0 ConstancyofSelf-Similarity– Otherwiseyoucouldclaim"AlexlooksmorelikeBob,thanBobdoes"

• D(A,B)=0IIf A=B PositivitySeparation– Otherwisethereareobjectsinyourworldthataredifferent,butyou

cannottellapart.

• D(A,B)<= D(A,C)+D(B,C) TriangularInequality– Otherwiseyoucouldclaim"AlexisverylikeBob,andAlexisverylike

Carl,butBobisveryunlikeCarl"

Intuitionsbehinddesirablepropertiesofdistancemeasure

18

Page 19: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

DistanceMeasures:MinkowskiMetric

• Supposetwoobjectx andy bothhavepfeatures

• TheMinkowski metricisdefinedby

• MostCommonMinkowskiMetrics

!!d(x , y)= |xi− yi

i=1

p

∑ |rr

!!

x = (x1 ,x2 ,!,xp)y = ( y1 , y2 ,!, yp)

1,r =2(Euclideandistance)d(x , y)= |xi− yii=1

p

∑ |22

2,r =1(Manhattandistance)d(x , y)= |xi− yii=1

p

∑ |

3,r = +∞("sup"distance)d(x , y)=max1≤i≤p

|xi− yi |19

Page 20: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

.},{max :distance sup"" :3. :distanceManhattan :2

. :distanceEuclidean :1

434734

5342 22

==+

=+

AnExample

4

3

x

y

20

Page 21: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

.},{max :distance sup"" :3. :distanceManhattan :2

. :distanceEuclidean :1

434734

5342 22

==+

=+

AnExample

4

3

x

y

21

Page 22: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

11011111100001110100111001001001101716151413121110987654321

GeneBGeneA

. :Distance Hamming 5141001 =+=+ )#()#(

• ManhattandistanceiscalledHammingdistancewhenallfeaturesarebinaryordiscrete.

– E.g.,GeneExpressionLevelsUnder17Conditions(1-High,0-Low)

Hammingdistance:discretefeatures

!!d(x , y)= |xi− yi

i=1

p

∑ |

22

Page 23: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

EditDistance:Agenerictechniqueformeasuringsimilarity

• Tomeasurethesimilaritybetweentwoobjects,transformoneoftheobjectsintotheother,andmeasurehowmucheffortittook.Themeasureofeffortbecomesthedistancemeasure.

ThedistancebetweenPattyandSelma.Changedresscolor,1pointChangeearringshape,1pointChangehairpart,1point

D(Patty,Selma)=3

ThedistancebetweenMargeandSelma.Changedresscolor,1pointAddearrings,1pointDecreaseheight,1pointTakeupsmoking,1pointLoseweight,1point

D(Marge,Selma)=5

ThisiscalledtheEditdistanceortheTransformationdistance23

SelmaPattyMarge

Page 24: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

• Pearsoncorrelationcoefficient

• Specialcase:cosinedistance11/29/18

Dr.YanjunQi/UVACS

. and where

)()(

))((),(

∑∑

∑ ∑

==

= =

=

==

−×−

−−=

p

iip

p

iip

p

i

p

iii

p

iii

yyxx

yyxx

yyxxyxs

1

1

1

1

1 1

22

1

1≤),( yxs

SimilarityMeasures:CorrelationCoefficient

yxyxyxs !!

!!

⋅⋅=),(

• Measuring thelinearcorrelationbetweentwosequences,xandy,

• givingavaluebetween+1and−1inclusive,where1istotalpositivecorrelation,0isnocorrelation,and−1istotalnegativecorrelation.

Correlationisunit independent

24

Page 25: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

SimilarityMeasures:e.g.,CorrelationCoefficientontimeseriessamples

Time

Gene A

Gene B

Gene A

Time

Gene B

Expression LevelExpression Level

Expression Level

Time

Gene A

Gene B

25

Correlationisunitindependent;

Ifyouscaleoneoftheobjectstentimes,youwillgetdifferenteuclidean distancesandsamecorrelationdistances.

Page 26: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

TodayRoadmap:clustering

§ Definitionof"groupness”§ Definitionof"similarity/distance"§ Representationforobjects§ Howmanyclusters?§ ClusteringAlgorithms

§ Partitional algorithms§ Hierarchicalalgorithms

§ Formalfoundationandconvergence26

Page 27: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

ClusteringAlgorithms

• Partitional algorithms– Usuallystartwitharandom(partial)partitioning

– Refineititeratively• Kmeansclustering• Mixture-Modelbasedclustering

• Hierarchicalalgorithms– Bottom-up,agglomerative– Top-down,divisive

27

Page 28: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

ClusteringAlgorithms

• Partitional algorithms– Usuallystartwitharandom(partial)partitioning

– Refineititeratively• Kmeansclustering• Mixture-Modelbasedclustering

• Hierarchicalalgorithms– Bottom-up,agglomerative– Top-down,divisive

28

Page 29: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

TodayRoadmap:clustering

§ Definitionof"groupness”§ Definitionof"similarity/distance"§ Representationforobjects§ Howmanyclusters?§ ClusteringAlgorithms

§ Partitional algorithms§ Hierarchicalalgorithms

§ Formalfoundationandconvergence29

Page 30: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

HierarchicalClustering• Buildatree-basedhierarchicaltaxonomy(dendrogram)fromasetofobjects,e.g.organisms,documents.

• Notethathierarchiesarecommonlyusedtoorganizeinformation,forexampleinawebportal.– Yahoo!hierarchyismanuallycreated,wewillfocusonautomaticcreationofhierarchies

Withbackbone Withoutbackbone

30

Page 31: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

31

(How-to) Hierarchical Clustering

• Given:asetofobjectsandthepairwisedistancematrix

• Find:atreethatoptimallyhierarchicalclusteringobjects?– Globallyoptimal:exhaustivelyenumeratealltree– Effectiveheuristicmethods:

Dr.YanjunQi/UVACS

Page 32: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

(How-to) Hierarchical ClusteringThe number of dendrograms with n leafs

= (2n -3)!/[(2(n -2)) (n -2)!]

Number Number of Possibleof Leafs Dendrograms2 13 34 155 105... …10 34,459,425

Bottom-Up (agglomerative):Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Clustering:theprocessofgrouping asetofobjectsintoclassesofsimilarobjectsè

high intra-classsimilaritylowinter-classsimilarity

11/29/18

Dr.YanjunQi/UVACS

32

Page 33: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

(How-to) Hierarchical ClusteringThe number of dendrograms with n leafs

= (2n -3)!/[(2(n -2)) (n -2)!]

Number Number of Possibleof Leafs Dendrograms2 13 34 155 105... …10 34,459,425

Bottom-Up (agglomerative):Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Clustering:theprocessofgrouping asetofobjectsintoclassesofsimilarobjectsè

high intra-classsimilaritylowinter-classsimilarity

Agreedylocal

optimalsolution

11/29/18

Dr.YanjunQi/UVACS

33

Page 34: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

34

(How-to) Hierarchical ClusteringThe number of dendrograms with n leafs

= (2n -3)!/[(2(n -2)) (n -2)!]

Number Number of Possibleof Leafs Dendrograms2 13 34 155 105... …10 34,459,425

Bottom-Up (agglomerative):Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Clustering:theprocessofgrouping asetofobjectsintoclassesofsimilarobjectsè

high intra-classsimilaritylowinter-classsimilarity

Agreedylocal

optimalsolution

Page 35: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

0 8 8 7 7

0 2 4 4

0 3 3

0 1

0

D( , ) = 8D( , ) = 1

We begin with a distance matrix which contains the distances between every pair of objects in our database.

11/29/18

Dr.YanjunQi/UVACS

35

Page 36: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

11/29/18

Dr.YanjunQi/UVACS

36

Page 37: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

11/29/18

Dr.YanjunQi/UVACS

37

Page 38: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

0 8 8 7 7

0 2 4 4

0 3 3

0 1

0

D( , ) = 8D( , ) = 1

We begin with a distance matrix which contains the distances between every pair of objects in our database.

11/29/18

Dr.YanjunQi/UVACS

38

Page 39: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Consider all possible merges…

Choose the best…

11/29/18

Dr.YanjunQi/UVACS

39

Page 40: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Consider all possible merges…

Choose the best…But how do we compute distances

between clusters rather than objects?

11/29/18

Dr.YanjunQi/UVACS

40

Page 41: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

Howtodecidethedistancesbetweenclusters?

• Single-Link– NearestNeighbor:theirclosestmembers.

• Complete-Link– FurthestNeighbor:theirfurthestmembers.

• Average:– averageofallcross-clusterpairs.

41

Page 42: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Computing distance between clusters: Single Link

• cluster distance = distance of two closest members in each class

- Potentially long and skinny clusters

11/29/18

Dr.YanjunQi/UVACS

42

Page 43: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Computing distance between clusters: : Complete Link

• cluster distance = distance of two farthest members

+ tight clusters

11/29/18

Dr.YanjunQi/UVACS

43

Page 44: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Computing distance between clusters: Average Link

• cluster distance = average distance of all pairs

the most widely used measure

Robust against noise

11/29/18

Dr.YanjunQi/UVACS

44

Page 45: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

1234

5

11/29/18

Dr.YanjunQi/UVACS

45

Page 46: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

1234

5

11/29/18

Dr.YanjunQi/UVACS

46

Page 47: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

1234

5

11/29/18

Dr.YanjunQi/UVACS

47

Page 48: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Example: single link

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

⎥⎥⎥⎥

⎢⎢⎢⎢

0458079

030

543)2,1(

543)2,1(

1234

5

8}8,9min{},min{9}9,10min{},min{3}3,6min{},min{

5,25,15),2,1(

4,24,14),2,1(

3,23,13),2,1(

======

===

ddddddddd

11/29/18

Dr.YanjunQi/UVACS

48

Page 49: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Example: single link

⎥⎥⎥

⎢⎢⎢

04507

0

54)3,2,1(

54)3,2,1(

1234

5

⎥⎥⎥⎥

⎢⎢⎢⎢

0458079

030

543)2,1(

543)2,1(

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

5}5,8min{},min{7}7,9min{},min{

5,35),2,1(5),3,2,1(

4,34),2,1(4),3,2,1(

======

dddddd

11/29/18

Dr.YanjunQi/UVACS

49

Page 50: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Example: single link

⎥⎥⎥

⎢⎢⎢

04507

0

54)3,2,1(

54)3,2,1(

1234

5

⎥⎥⎥⎥

⎢⎢⎢⎢

0458079

030

543)2,1(

543)2,1(

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

0458907910

03602

0

54321

54321

5},min{ 5),3,2,1(4),3,2,1()5,4(),3,2,1( == ddd

11/29/18

Dr.YanjunQi/UVACS

50

Page 51: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

29 2 6 11 9 17 10 13 24 25 26 20 22 30 27 1 3 8 4 12 5 14 23 15 16 18 19 21 28 7

1

2

3

4

5

6

7

Average linkage

Single linkage

Height represents distance between objects / clusters

Partitionsbycuttingthedendrogram atadesiredlevel:eachconnectedcomponent formsacluster.

11/29/18

Dr.YanjunQi/UVACS

51

Page 52: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

HierarchicalClustering• Bottom-UpAgglomerativeClustering

– Startswitheachobjectinaseparatecluster– thenrepeatedlyjoinstheclosest pairofclusters,– untilthereisonlyonecluster.

Thehistoryofmergingformsabinarytreeorhierarchy(dendrogram)

• Top-Downdivisive– Startingwithallthedatainasinglecluster,– Considereverypossiblewaytodividetheclusterintotwo.Choose

thebestdivision– Andrecursivelyoperateonbothsides.

52

Page 53: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

11/29/18

Dr.YanjunQi/UVACS

ComputationalComplexity

• Inthefirstiteration,allHACmethodsneedtocomputesimilarityofallpairsofn individualinstanceswhichisO(n2p).

• Ineachofthesubsequentn−2mergingiterations,computethedistancebetweenthemostrecentlycreatedclusterandallotherexistingclusters.

• Forthesubsequentsteps,inordertomaintainanoverallO(n2)performance,computingsimilaritytoeachotherclustermustbedoneinconstanttime.ElseO(n2 logn)orO(n3)ifdonenaively

53

Page 54: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

SummaryofHierarchalClusteringMethods

• Noneedtospecifythenumberofclustersinadvance.

• Hierarchicalstructuremapsnicelyontohumanintuitionforsomedomains

• Theydonotscalewell:timecomplexityofatleastO(n2),wheren isthenumberoftotalobjects.

• Likeanyheuristicsearchalgorithms,localoptimaareaproblem.

• Interpretationofresultsis(very)subjective.

11/29/18

Dr.YanjunQi/UVACS

54

Page 55: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

Hierarchical Clustering

Clustering

Choice of distance metrics

No clearly defined loss

greedy bottom-up (or top-down)

Dendrogram(tree)

Task

Representation

Score Function

Search/Optimization

Models, Parameters

11/29/18 55

Dr.YanjunQi/UVACS

Page 56: UVA CS 4501: Machine Learning Lecture 19: Unsupervised ... · Unsupervised learning = learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data ... Intuitions

References

q Hastie,Trevor,etal.Theelementsofstatisticallearning.Vol.2.No.1.NewYork:Springer,2009.

q BigthankstoProf.EricXing@CMUforallowingmetoreusesomeofhisslides

q BigthankstoProf.Ziv Bar-Joseph@CMUforallowingmetoreusesomeofhisslides

11/29/18

Dr.YanjunQi/UVACS

56