Stats 170A: Project in Data Science
Exploratory Data Analysis: Clustering Algorithms
Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine
Stats 170AB, Winter 2018

Assignment 5
Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note: due before class (by 2pm)
Questions?

What is Exploratory Data Analysis?
• EDA = {visualization, clustering, dimension reduction, …}
• For small numbers of variables, EDA = visualization
• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at
• Today’s lecture:
  – Finish up visualization
  – Overview of clustering algorithms

Tufte’s Principles of Visualization
Graphical excellence…
  – is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design
  – consists of complex ideas communicated with clarity, precision and efficiency
  – is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
  – requires telling the truth about the data

Different Ways of Presenting the Same Data
From Karl Broman, via www.cs.princeton.edu/

Principle of Proportional Ink (or How to Lie with Visualization)

Potentially Misleading Scales on the X-axis

Example: Visualization of Napoleon’s 1812 March
Illustrates size of army, direction, location, temperature, date… all on one chart

Data Journalism
From New York Times, Feb 2 2018

Exploratory Data Analysis: Clustering

Example: Clustering Vectors in a 2-Dimensional Space
Each point (or 2d vector) represents a document
[Figure: scatter plot of points with axes x1 and x2]

Example: Possible Clusters
[Figure: points grouped into Cluster 1 and Cluster 2, axes x1 and x2]

Example: How many Clusters?
[Figure: points grouped into Cluster 1, Cluster 2, and Cluster 3, axes x1 and x2]

Cluster Structure in Real-World Data
≈1500 subjects
Two measurements per subject
[Figure: scatter plot of signal T (x-axis, 0.0 to 1.5) versus signal C (y-axis, 0.0 to 1.0); a later version of the plot carries the labels CC, CT, and TT]
Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine

Issues in Clustering
• Representation
  – How do we represent our examples as data vectors?
• Distance
  – How do we want to define distance between vectors?
• Algorithm
  – What type of algorithm do we want to use to search for clusters?
  – What is the time and space complexity of the algorithm?
• Number of Clusters
  – How many clusters do we want?
No “right” answer to these questions in general… it depends on the application

Cluster Analysis vs Classification
Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• “Unsupervised” learning
• Goal: find unknown structures
Classification:
• The labels for training data are known
• The number of classes is known
• “Supervised” learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes

Clustering: The K-Means Algorithm

Notation
N documents
Represent each document as a vector of T terms (e.g., counts or tf-idf)
The vector for the $i$th document is: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{iT})$, $i = 1, \ldots, N$
Document-Term matrix
• $x_{ij}$ is the entry in the $i$th row, $j$th column
• columns correspond to terms
• rows correspond to documents
We can think of our documents as being in a T-dimensional space, with clusters as “clouds of points”
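
To make the notation concrete, here is a minimal sketch of a count-based document-term matrix. The toy corpus, the whitespace tokenizer, and the function name `document_term_matrix` are illustrative assumptions, not part of the lecture; a real pipeline would likely use tf-idf weights and a proper tokenizer.

```python
from collections import Counter


def document_term_matrix(docs):
    """Build a count-based document-term matrix.

    Returns (vocab, matrix) where matrix[i][j] is the number of times
    term vocab[j] occurs in document i (0-based indices).
    """
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    # Columns correspond to terms: one column per distinct term, sorted.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:  # rows correspond to documents
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical toy corpus with N = 3 documents.
docs = [
    "data science project",
    "clustering data",
    "science of clustering data data",
]
vocab, X = document_term_matrix(docs)
N, T = len(X), len(vocab)
print(vocab)
for row in X:
    print(row)
```

Each row of `X` is one document vector $x_i$ of length T, so the N documents can be treated as N points in a T-dimensional space.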

The K-Means Clustering Algorithm
Input:
  – N vectors $x_1, \ldots, x_N$ of dimension D
  – K = number of clusters (K > 1)
Output:
  – K cluster centers, $c_1, \ldots, c_K$, each center is a vector of dimension D
  – (Equivalently) A list of cluster assignments (values 1 to K) for each of the N input vectors
Note: In K-means each input vector $x$ is assigned to one and only one cluster $k$, or cluster center $c_k$
The K-means algorithm partitions the N data vectors into K disjoint groups

Example of K-Means Output with 2 Clusters
[Figure: points in the (x1, x2) plane grouped into Cluster 1 and Cluster 2, with cluster centers c1 and c2]
Blue circles are examples of documents
Red circles are examples of cluster centers

Squared Error Distance
Consider two vectors each with T components (i.e., dimension T):
$x = (x_1, x_2, \ldots, x_T)$
$y = (y_1, y_2, \ldots, y_T)$
A common distance metric is squared error distance:
$d_E(x, y) = \sum_{j=1}^{T} (x_j - y_j)^2$
In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
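
As a small worked example, the squared error distance can be computed directly from its definition. The function name and the two sample vectors are assumptions chosen for illustration.

```python
import math


def squared_error_distance(x, y):
    """Sum over components j of (x_j - y_j)^2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))


# Hypothetical 2-dimensional vectors.
x = (1.0, 2.0)
y = (4.0, 6.0)
d = squared_error_distance(x, y)  # (1-4)^2 + (2-6)^2 = 9 + 16
euclidean = math.sqrt(d)          # the usual spatial distance
print(d, euclidean)
```

Because the square root is monotonic, ranking candidates by squared error gives the same nearest neighbor as ranking by Euclidean distance, which is why the square root can be skipped when only comparisons matter.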

Squared Errors and Cluster Centers
• Squared error (distance) between a data point $x$ and a cluster center $c$:
  $d[x, c] = \sum_{j} (x_j - c_j)^2$
  Sum is over the D components/dimensions of the vectors
• Total squared error between a cluster center $c_k$ and all $N_k$ points assigned to that cluster:
  $S_k = \sum_{i} d[x_i, c_k]$
  Sum is over the $N_k$ points assigned to cluster $k$
• Total squared error summed across K clusters:
  $\mathrm{SSE} = \sum_{k} S_k$
  Sum is over the K clusters
Distance defined as squared Euclidean distance
[Figure: Cluster 1 with its center c1]
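
The three nested sums can be sketched as follows. The points, centers, and assignments are made-up values for illustration, and clusters are indexed from 0 here rather than from 1 as on the slides.

```python
def squared_error(x, c):
    """d[x, c]: sum over the D components of (x_j - c_j)^2."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def total_sse(points, centers, assignments):
    """Return ([S_0, ..., S_{K-1}], SSE) for a given clustering."""
    per_cluster = [0.0] * len(centers)
    for x, k in zip(points, assignments):
        # S_k accumulates the squared error of every point assigned to k.
        per_cluster[k] += squared_error(x, centers[k])
    return per_cluster, sum(per_cluster)


# Hypothetical example: 4 points, K = 2 clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]
assignments = [0, 0, 1, 1]
S, sse = total_sse(points, centers, assignments)
print(S, sse)
```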

K-means Objective Function
• K-means: minimize the total squared error, i.e., find the K cluster centers $c_k$, and assignments, that minimize
  $\mathrm{SSE} = \sum_{k} S_k = \sum_{k} \left( \sum_{i} d[x_i, c_k] \right)$
• K-means seeks to minimize SSE, i.e., find the cluster centers such that the sum-squared-error is smallest
  – will place cluster centers strategically to “cover” data
  – similar to data compression (in fact used in data compression algorithms)

K-Means Algorithm
• Random initialization
  – Select the initial K centers randomly from the N input vectors
  – Or, assign each of the N vectors randomly to one of the K clusters
• Iterate:
  – Assignment step: assign each of the N input vectors to its closest mean
  – Update the mean vectors (K of them): compute updated centers, the average value of the vectors assigned to $k$
    New $c_k = \frac{1}{N_k} \sum_{i} x_i$
    Sum is over the $N_k$ points assigned to cluster $k$
• Convergence:
  – Did any points get reassigned?
    • Yes: return to Iterate step
    • No: terminate
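
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function names, the optional `init_centers` argument (added so the example is reproducible), the `max_iter` cap, and the rule of leaving an empty cluster's center unchanged are all choices made here.

```python
import random


def squared_error(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def closest_center(x, centers):
    # Index of the nearest center; ties go to the lowest index.
    return min(range(len(centers)), key=lambda k: squared_error(x, centers[k]))


def mean_vector(vectors):
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def kmeans(points, k, init_centers=None, seed=0, max_iter=100):
    if init_centers is None:
        # Random initialization: pick K of the N input vectors.
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(init_centers)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its closest mean.
        new_assignments = [closest_center(x, centers) for x in points]
        if new_assignments == assignments:
            break  # no point was reassigned: converged
        assignments = new_assignments
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_vector(members)
    return centers, assignments


# Hypothetical data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0), (10.0, 10.0)]
centers, assignments = kmeans(points, 2, init_centers=[(0.0, 0.0), (1.0, 1.0)])
print(centers, assignments)
```

Starting from two centers inside the same group, the first pass splits that group unevenly, and the second pass recovers the two natural groups.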

Pseudocode for the K-means Algorithm
From Chapter 16 in Manning, Raghavan, and Schütze

Example of K-Means Clustering
[Figure sequence: scatter plots of Dimension 1 versus Dimension 2 showing the original data and the clustering after successive iterations]
• Original data
• Iteration 1: mean squared error = 3.45
• Iteration 2: mean squared error = 1.93
• Iteration 3: mean squared error = 1.25
• Iteration 5: mean squared error = 1.21 (converged)

K-means
1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center finds the centroid of the points it owns
5. New Centers => new boundaries
6. Repeat until no change
Figure/slide from Andrew Moore, CMU

The Iris Data
• Introduced by R. A. Fisher (1936), using measurements collected by Edgar Anderson
• A famous early data set in multivariate data analysis
• Four features:
  – sepal length in cm
  – sepal width in cm
  – petal length in cm
  – petal width in cm
• Three different species
  – Setosa
  – Versicolor
  – Virginica

K-Means Clustering on the Iris Data

K-Means for Image Compression

An Example of Data where K-Means does not work well
[Figure panels: Ideal Clustering of Data in 2 Dimensions; K-means Clustering Result, K=2]

From: http://scikit-learn.org/stable/modules/clustering.html#

Properties of the K-Means Algorithm
• Time complexity?
  – N = number of data points, K = number of clusters, D = dimension of data points (number of variables)
  – O(NKD) in time per iteration
  – This is good: linear time in each input parameter
• Does K-means always find a global minimum, i.e., the set of K centers that minimize the SSE?
  – No: always converges to *some* local minimum, but not necessarily the best
  – Depends on the starting point chosen
  – Can prove that SSE on each iteration must either
    • Decrease, or
    • Not change (in which case we have converged)
  – [Think about how you might prove this]
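
Both properties can be observed on a small example. The sketch below records the SSE after every update step and runs K-means from three hand-picked starting points. The data set, the starting points, and the names (`kmeans_with_trace`, `starts`) are assumptions made for illustration; keeping the best of several starts is a common practical remedy rather than something prescribed on the slide.

```python
def squared_error(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def sse(points, centers, assignments):
    return sum(squared_error(x, centers[a]) for x, a in zip(points, assignments))


def kmeans_with_trace(points, init_centers, max_iter=100):
    """K-means from given initial centers; also returns the SSE per iteration."""
    centers = list(init_centers)
    k = len(centers)
    assignments, trace = None, []
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda j: squared_error(x, centers[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # converged: nothing was reassigned
        assignments = new_assignments
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
        trace.append(sse(points, centers, assignments))
    return centers, assignments, trace


# Hypothetical data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0), (10.0, 10.0)]
starts = {
    "good": [(0.0, 0.0), (9.0, 9.0)],  # one center in each group
    "slow": [(0.0, 0.0), (1.0, 1.0)],  # both centers in the same group
    "bad": [(0.0, 1.0), (1.0, 0.0)],   # symmetric start that gets stuck
}
results = {name: kmeans_with_trace(points, init) for name, init in starts.items()}
for name, (_, _, trace) in results.items():
    print(name, [round(s, 2) for s in trace])

# Keep the run with the lowest final SSE.
best = min(results, key=lambda name: results[name][2][-1])
print("best start:", best)
```

On this data the "bad" start converges immediately to a clustering that mixes the two groups, with a far larger SSE than the other two runs, even though no single reassignment can improve it.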

Summary of K-means
• Input:
  – N vectors
• Output: K clusters
  – Each cluster represented by a cluster mean (a vector)
  – Assigns each data point to its closest cluster center
• Strengths
  – Fast: time complexity is O(NDK), i.e., linear time in N, D, K
  – Simple to implement
• Weaknesses:
  – Not guaranteed to find the best solution (the global minimum of SSE)
  – Assumes a fixed K, number of clusters
  – Uses Euclidean distance, which is not necessarily ideal

Number of Clusters?
• Generally no “right” answer… it depends on the application
• We can think of clustering as a type of data compression technique:
  – As K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error
  – But this does not mean larger K is always better… the larger the value of K the harder it is for humans to understand the clustering results
• Options?
  – Pick a value of K based on intuition/heuristics, e.g., relatively small K (e.g., K = 5 or 10) if we are showing the results to a human
  – Evaluate different values of K if we have some ground truth for evaluation and select the best value of K using the task-specific evaluation measure
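
The compression view can be checked numerically by running K-means for every K from 1 to N and recording the final SSE. This is a sketch on made-up data with a compact K-means written for the purpose (names such as `kmeans_sse` are assumptions). Since each run only reaches a local minimum, the SSE values need not fall strictly at every step, but two facts always hold: no K gives a higher SSE than K = 1, and K = N on distinct points gives zero.

```python
import random


def squared_error(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def kmeans_sse(points, k, seed=0, max_iter=100):
    """Run K-means from a random start and return the final SSE."""
    centers = random.Random(seed).sample(points, k)
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda j: squared_error(x, centers[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(squared_error(x, centers[a]) for x, a in zip(points, assignments))


# Hypothetical data: six distinct points in two loose groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, len(points) + 1)}
for k, value in sse_by_k.items():
    print(f"K={k}: SSE={value:.3f}")
```

With K = N every point is its own cluster, which is perfect "compression" of the error but tells a human reader nothing, matching the caution above about large K.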

Hierarchical Clustering Algorithms

[Figure with groups labeled Setosa, Virginica, and Versicolor]

Hierarchical Clustering
• The number of clusters is not required
• Gives a tree-based representation of observations: a dendrogram
• Each leaf represents an observation
• Leaves most similar to each other are merged
• Internal nodes most similar to each other are merged
• Process continues recursively until all nodes are merged at the root node

Basic Concept of Hierarchical Clustering
[Figure: data points a, b, c, d, e merged over Step 0 to Step 4, successively forming the clusters {a, b}, {d, e}, {c, d, e}, and {a, b, c, d, e}]
Merge data points, and then clusters, in a bottom-up fashion, until all data points are in 1 cluster.
Requires that we can define distance/similarity between sets of points
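
The bottom-up merging can be sketched in a few lines. The 1-dimensional coordinates assigned to a through e are hypothetical, chosen only so that the merge order matches the figure, and single linkage is used as one possible definition of distance between sets of points.

```python
# Hypothetical 1-D coordinates for the five points in the figure.
coords = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.5, "e": 9.0}


def single_link(c1, c2, coords):
    """Distance between clusters = distance between their closest members."""
    return min(abs(coords[p] - coords[q]) for p in c1 for q in c2)


def agglomerate(coords, linkage=single_link):
    """Merge the two closest clusters until one remains.

    Returns the list of (merged_cluster, merge_distance) in merge order.
    """
    clusters = [(name,) for name in sorted(coords)]  # Step 0: singletons
    merges = []
    while len(clusters) > 1:
        # Examine every pair of current clusters and pick the closest.
        pairs = [
            (linkage(clusters[i], clusters[j], coords), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        dist, i, j = min(pairs)
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append((merged, dist))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


merges = agglomerate(coords)
for step, (cluster, dist) in enumerate(merges, start=1):
    print(f"Step {step}: merge into {cluster} at distance {dist}")
```

The sequence of merges and their distances is exactly the information a dendrogram displays; cutting the tree at a chosen distance yields a flat clustering.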

Simple Example of Hierarchical Clustering
[Figure: axes Dimension 1 and Dimension 2]

Complete-link clustering of Reuters news stories
Figure from Chapter 17 of Manning, Raghavan, and Schütze

Distance between Two Branches/Clusters
• Single linkage
• Complete linkage
• Average linkage
• Many other options
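
The slide only names the three linkage rules; under their standard definitions they differ in how the pairwise distances between two clusters are summarized (minimum, maximum, or mean). The sketch below assumes Euclidean distance between points and uses made-up clusters `A` and `B`.

```python
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(A, B):
    """Distance between the closest pair, one point from each cluster."""
    return min(euclidean(p, q) for p, q in product(A, B))


def complete_linkage(A, B):
    """Distance between the farthest pair, one point from each cluster."""
    return max(euclidean(p, q) for p, q in product(A, B))


def average_linkage(A, B):
    """Mean distance over all pairs, one point from each cluster."""
    return sum(euclidean(p, q) for p, q in product(A, B)) / (len(A) * len(B))


# Hypothetical clusters on a line; pairwise distances are 3, 5, 2, 4.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
```

Single linkage tends to chain clusters together through nearby points, while complete linkage favors compact clusters, so the choice can change the resulting tree.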

Complexity of Hierarchical Clustering
• Time complexity (N = number of docs, T = dimensionality)
  – Time to compute all pairwise distances: $O(N^2 T)$
  – Time to create the tree: $O(N^3)$
  – Overall time complexity is $O(N^3 + N^2 T)$
• Space complexity = $O(N^2)$
• This is a significant weakness of hierarchical clustering: scales poorly in N
  – One practical option is to first run K-means with (e.g.) K = 20 or 100 or 500 clusters and then “cluster the clusters” from K-means
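
A quick back-of-the-envelope calculation shows why the quadratic space cost matters. The sketch assumes each distance is stored once (the matrix is symmetric) as an 8-byte float; both the storage layout and the function names are assumptions for illustration.

```python
def num_pairwise_distances(n):
    """Number of distinct unordered pairs among n points: n(n-1)/2."""
    return n * (n - 1) // 2


def distance_matrix_bytes(n, bytes_per_value=8):
    """Memory for one stored distance per pair (assumes 8-byte floats)."""
    return num_pairwise_distances(n) * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gigabytes = distance_matrix_bytes(n) / 1e9
    print(f"N={n:>7}: {num_pairwise_distances(n):>13,} pairs, about {gigabytes:.3f} GB")
```

Reducing N to a few hundred K-means centers first, as suggested above, keeps the pairwise table small enough to handle easily.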

Automatically Clustering Languages in Linguistics

Hierarchical Clustering based on user votes for favorite beers
Based on centroid method
From data.ranker.com

“Heat-Map” Representation (human data)

Discovering Structure from a Heat Map of Brain Network Data
From https://seaborn.pydata.org/examples/structured_heatmap.html

Summary of Clustering Algorithms
• Used for exploring data
  – Can answer questions such as “are there subgroups?”
• Different clustering algorithms
  – K-means
    • Simple, fast, easy to interpret
    • Tends to find “circular clusters”, can fail on complex structure
    • Number of clusters K is fixed ahead of time
  – Hierarchical agglomerative clustering
    • Produces a tree of clusters (dendrogram)
    • Number of clusters is not fixed
    • Computational complexity is high, does not scale well to large N
• Clustering is useful for exploration… but one should be careful
  – No “gold standard” to compare it to
  – Many different methods… can give different results

Assignment 5
Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note change: due before class (by 2pm)