cs 5614: (big) data management systemspeople.cs.vt.edu/badityap/classes/cs5614-spr17/lectures/...bfr...
TRANSCRIPT
![Page 1: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/1.jpg)
CS5614:(Big)DataManagementSystems
B.AdityaPrakashLecture#16:Clustering
![Page 2: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/2.jpg)
HighDimensionalData
§ Givenacloudofdatapointswewanttounderstanditsstructure
VTCS5614 2Prakash2017
![Page 3: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/3.jpg)
3
TheProblemofClustering§ Givenasetofpoints,withanoDonofdistancebetweenpoints,groupthepointsintosomenumberofclusters,sothat– Membersofaclusterareclose/similartoeachother– Membersofdifferentclustersaredissimilar
§ Usually:– Pointsareinahigh-dimensionalspace– Similarityisdefinedusingadistancemeasure
• Euclidean,Cosine,Jaccard,editdistance,…
VTCS5614Prakash2017
![Page 4: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/4.jpg)
4
Example:Clusters&Outliers
x x x x x x x x x x x x x
x x
x xx x x x
x x x x
x x x x
x x x x x x x x x
x
x
x
VTCS5614
x x x x x x x x x x x x x
x x
x xx x x x
x x x x
x x x x
x x x x x x x x x
x Outlier Cluster
Prakash2017
![Page 5: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/5.jpg)
Clusteringisahardproblem!
VTCS5614 5Prakash2017
![Page 6: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/6.jpg)
6
Whyisithard?
§ Clusteringintwodimensionslookseasy§ Clusteringsmallamountsofdatalookseasy§ Andinmostcases,looksarenotdeceiving
§ ManyapplicaDonsinvolvenot2,but10or10,000dimensions
§ High-dimensionalspaceslookdifferent:Almostallpairsofpointsareataboutthesamedistance
VTCS5614Prakash2017
![Page 7: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/7.jpg)
ClusteringProblem:Galaxies§ Acatalogof2billion“skyobjects”representsobjectsby
theirradiaWonin7dimensions(frequencybands)§ Problem:Clusterintosimilarobjects,e.g.,galaxies,nearby
stars,quasars,etc.§ SloanDigitalSkySurvey
VTCS5614 7Prakash2017
![Page 8: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/8.jpg)
ClusteringProblem:MusicCDs§ IntuiWvely:Musicdividesintocategories,andcustomerspreferafewcategories– Butwhatarecategoriesreally?
§ RepresentaCDbyasetofcustomerswhoboughtit:
§ SimilarCDshavesimilarsetsofcustomers,andvice-versa
8VTCS5614Prakash2017
![Page 9: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/9.jpg)
ClusteringProblem:MusicCDsSpaceofallCDs:§ Thinkofaspacewithonedim.foreachcustomer– Valuesinadimensionmaybe0or1only– ACDisapointinthisspace(x1,x2,…,xk),wherexi=1ifftheithcustomerboughttheCD
§ ForAmazon,thedimensionistensofmillions
§ Task:FindclustersofsimilarCDsVTCS5614 9Prakash2017
![Page 10: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/10.jpg)
ClusteringProblem:Documents
Findingtopics:§ Representadocumentbyavector(x1,x2,…,xk),wherexi=1ifftheithword(insomeorder)appearsinthedocument– Itactuallydoesn’tmaberifkisinfinite;i.e.,wedon’tlimitthesetofwords
§ Documentswithsimilarsetsofwordsmaybeaboutthesametopic
10VTCS5614Prakash2017
![Page 11: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/11.jpg)
Cosine,Jaccard,andEuclidean§ AswithCDswehaveachoicewhenwethinkofdocumentsassetsofwordsorshingles:– Setsasvectors:Measuresimilaritybythecosinedistance
– Setsassets:MeasuresimilaritybytheJaccarddistance
– Setsaspoints:MeasuresimilaritybyEuclideandistance
VTCS5614 11Prakash2017
![Page 12: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/12.jpg)
12
Overview:MethodsofClustering
§ Hierarchical:– AgglomeraWve(bobomup):
• IniDally,eachpointisacluster• Repeatedlycombinethetwo“nearest”clustersintoone
– Divisive(topdown):• Startwithoneclusterandrecursivelysplitit
§ Pointassignment:– Maintainasetofclusters– Pointsbelongto“nearest”cluster
VTCS5614Prakash2017
![Page 13: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/13.jpg)
HierarchicalClustering
§ KeyoperaWon:Repeatedlycombinetwonearestclusters
§ ThreeimportantquesWons:– 1)Howdoyourepresentaclusterofmorethanonepoint?
– 2)Howdoyoudeterminethe“nearness”ofclusters?
– 3)Whentostopcombiningclusters?
VTCS5614 13Prakash2017
![Page 14: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/14.jpg)
HierarchicalClustering§ KeyoperaWon:Repeatedlycombinetwonearestclusters
§ (1)Howtorepresentaclusterofmanypoints?– Keyproblem:Asyoumergeclusters,howdoyourepresentthe“locaDon”ofeachcluster,totellwhichpairofclustersisclosest?
§ Euclideancase:eachclusterhasacentroid=averageofits(data)points
§ (2)Howtodetermine“nearness”ofclusters?– Measureclusterdistancesbydistancesofcentroids
VTCS5614 14Prakash2017
![Page 15: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/15.jpg)
Example:Hierarchicalclustering
Prakash2017 VTCS5614 15
![Page 16: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/16.jpg)
AndintheNon-EuclideanCase?WhatabouttheNon-Euclideancase?§ Theonly“locaDons”wecantalkaboutarethepointsthemselves– i.e.,thereisno“average”oftwopoints
§ Approach1:– (1)Howtorepresentaclusterofmanypoints?clustroid=(data)point“closest”tootherpoints
– (2)Howdoyoudeterminethe“nearness”ofclusters?Treatclustroidasifitwerecentroid,whencompuDnginter-clusterdistances
16VTCS5614Prakash2017
![Page 17: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/17.jpg)
“Closest”Point?§ (1)Howtorepresentaclusterofmanypoints?clustroid=point“closest”tootherpoints
§ Possiblemeaningsof“closest”:– Smallestmaximumdistancetootherpoints– Smallestaveragedistancetootherpoints– Smallestsumofsquaresofdistancestootherpoints
• FordistancemetricdclustroidcofclusterCis:
VTCS5614 17
∑∈Cxc
cxd 2),(min
Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point. Clustroid is an existing (data)point that is “closest” to all other points in the cluster.
X
Cluster on 3 datapoints
Centroid
Clustroid
Datapoint
Prakash2017
![Page 18: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/18.jpg)
Defining“Nearness”ofClusters
§ (2)Howdoyoudeterminethe“nearness”ofclusters?– Approach2:Interclusterdistance=minimumofthedistancesbetweenanytwopoints,onefromeachcluster
– Approach3:PickanoDonof“cohesion”ofclusters,e.g.,maximumdistancefromtheclustroid
• Mergeclusterswhoseunionismostcohesive
18VTCS5614Prakash2017
![Page 19: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/19.jpg)
Cohesion
§ Approach3.1:Usethediameterofthemergedcluster=maximumdistancebetweenpointsinthecluster
§ Approach3.2:Usetheaveragedistancebetweenpointsinthecluster
§ Approach3.3:Useadensity-basedapproach– Takethediameteroravg.distance,e.g.,anddividebythenumberofpointsinthecluster
VTCS5614 19Prakash2017
![Page 20: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/20.jpg)
ImplementaWon
§ NaïveimplementaWonofhierarchicalclustering:– Ateachstep,computepairwisedistancesbetweenallpairsofclusters,thenmerge
– O(N3)
§ CarefulimplementaDonusingpriorityqueuecanreduceDmetoO(N2logN)– SWlltooexpensiveforreallybigdatasetsthatdonotfitinmemory
VTCS5614 20Prakash2017
![Page 21: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/21.jpg)
K-MEANSCLUSTERING
Prakash2017 VTCS5614 21
![Page 22: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/22.jpg)
k–meansAlgorithm(s)
§ AssumesEuclideanspace/distance
§ Startbypickingk,thenumberofclusters
§ IniDalizeclustersbypickingonepointpercluster– Example:Pickonepointatrandom,thenk-1otherpoints,eachasfarawayaspossiblefromthepreviouspoints
22VTCS5614Prakash2017
![Page 23: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/23.jpg)
PopulaWngClusters§ 1)Foreachpoint,placeitintheclusterwhosecurrentcentroiditisnearest
§ 2)Alerallpointsareassigned,updatethelocaDonsofcentroidsofthekclusters
§ 3)Reassignallpointstotheirclosestcentroid– SomeDmesmovespointsbetweenclusters
§ Repeat2and3unWlconvergence– Convergence:Pointsdon’tmovebetweenclustersandcentroidsstabilize
VTCS5614 23Prakash2017
![Page 24: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/24.jpg)
Example:AssigningClusters
VTCS5614 24
x
x
x
x
x
x
x x
x … data point … centroid
x
x
x
Clusters after round 1 Prakash2017
![Page 25: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/25.jpg)
Example:AssigningClusters
VTCS5614 25
x
x
x
x
x
x
x x
x … data point … centroid
x
x
x
Clusters after round 2 Prakash2017
![Page 26: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/26.jpg)
Example:AssigningClusters
VTCS5614 26
x
x
x
x
x
x
x x
x … data point … centroid
x
x
x
Clusters at the end Prakash2017
![Page 27: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/27.jpg)
Gegngthekright
Howtoselectk?§ Trydifferentk,lookingatthechangeintheaveragedistancetocentroidaskincreases
§ AveragefallsrapidlyunDlrightk,thenchangeslible
27
k
Average distance to
centroid
Best value of k
VTCS5614Prakash2017
![Page 28: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/28.jpg)
Example:Pickingk
VTCS5614 28
x x x x x x x x x x x x x
x x
x xx x x x
x x x x
x x x x
x x x x x x x x x
x
x
x
Too few; many long distances to centroid.
Prakash2017
![Page 29: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/29.jpg)
Example:Pickingk
VTCS5614 29
x x x x x x x x x x x x x
x x
x xx x x x
x x x x
x x x x
x x x x x x x x x
x
x
x
Just right; distances rather short.
Prakash2017
![Page 30: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/30.jpg)
Example:Pickingk
VTCS5614 30
x x x x x x x x x x x x x
x x
x xx x x x
x x x x
x x x x
x x x x x x x x x
x
x
x
Too many; little improvement in average distance.
Prakash2017
![Page 31: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/31.jpg)
THEBFRALGORITHM
Extensionofk-meanstolargedata
Prakash2017 VTCS5614 31
![Page 32: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/32.jpg)
BFRAlgorithm§ BFR[Bradley-Fayyad-Reina]isavariantofk-meansdesignedtohandleverylarge(disk-resident)datasets
§ AssumesthatclustersarenormallydistributedaroundacentroidinaEuclideanspace– StandarddeviaDonsindifferentdimensionsmayvary
• Clustersareaxis-alignedellipses
§ Efficientwaytosummarizeclusters(wantmemoryrequiredO(clusters)andnotO(data))
32VTCS5614Prakash2017
![Page 33: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/33.jpg)
BFRAlgorithm§ Pointsarereadfromdiskonemain-memory-fullataDme
§ MostpointsfrompreviousmemoryloadsaresummarizedbysimplestaWsWcs
§ Tobegin,fromtheiniDalloadweselecttheiniDalkcentroidsbysomesensibleapproach:– Takekrandompoints– TakeasmallrandomsampleandclusteropDmally– Takeasample;pickarandompoint,andthenk–1morepoints,eachasfarfromthepreviouslyselectedpointsaspossible
33VTCS5614Prakash2017
![Page 34: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/34.jpg)
ThreeClassesofPoints3setsofpointswhichwekeeptrackof:§ Discardset(DS):
– Pointscloseenoughtoacentroidtobesummarized
§ Compressionset(CS):– GroupsofpointsthatareclosetogetherbutnotclosetoanyexisDngcentroid
– Thesepointsaresummarized,butnotassignedtoacluster
§ Retainedset(RS):– IsolatedpointswaiDngtobeassignedtoacompressionset
VTCS5614 34Prakash2017
![Page 35: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/35.jpg)
BFR:“Galaxies”Picture
VTCS5614 35
A cluster. Its points are in the DS.
The centroid
Compressed sets. Their points are in the CS.
Points in the RS
Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
Prakash2017
![Page 36: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/36.jpg)
SummarizingSetsofPointsForeachcluster,thediscardset(DS)issummarizedby:§ Thenumberofpoints,N§ ThevectorSUM,whoseithcomponentisthesumofthecoordinatesofthepointsintheithdimension
§ ThevectorSUMSQ:ithcomponent=sumofsquaresofcoordinatesinithdimension
VTCS5614
36A cluster. All its points are in the DS. The centroid
Prakash2017
![Page 37: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/37.jpg)
SummarizingPoints:Comments
§ 2d+1valuesrepresentanysizecluster– d=numberofdimensions
§ Averageineachdimension(thecentroid)canbecalculatedasSUMi/N– SUMi=ithcomponentofSUM
§ Varianceofacluster’sdiscardsetindimensioniis:(SUMSQi/N)–(SUMi/N)2– AndstandarddeviaDonisthesquarerootofthat
§ Nextstep:Actualclustering
37VTCS5614
Note: Dropping the “axis-aligned” clusters assumption would require storing full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big! Prakash2017
![Page 38: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/38.jpg)
The“Memory-Load”ofPoints
Processingthe“Memory-Load”ofpoints(1):§ 1)Findthosepointsthatare“sufficientlyclose”toaclustercentroidandaddthosepointstothatclusterandtheDS– Thesepointsaresoclosetothecentroidthattheycanbesummarizedandthendiscarded
§ 2)Useanymain-memoryclusteringalgorithmtoclustertheremainingpointsandtheoldRS– ClustersgototheCS;outlyingpointstotheRS
VTCS5614 38
Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
Prakash2017
![Page 39: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/39.jpg)
The“Memory-Load”ofPointsProcessingthe“Memory-Load”ofpoints(2):§ 3)DSset:AdjuststaDsDcsoftheclusterstoaccountforthenewpoints– AddNs,SUMs,SUMSQs
§ 4)ConsidermergingcompressedsetsintheCS
§ 5)Ifthisisthelastround,mergeallcompressedsetsintheCSandallRSpointsintotheirnearestcluster
VTCS5614 39
Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
Prakash2017
![Page 40: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/40.jpg)
BFR:“Galaxies”Picture
VTCS5614 40
A cluster. Its points are in the DS.
The centroid
Compressed sets. Their points are in the CS.
Points in the RS
Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
Prakash2017
![Page 41: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/41.jpg)
AFewDetails…
§ Q1)Howdowedecideifapointis“closeenough”toaclusterthatwewilladdthepointtothatcluster?
§ Q2)Howdowedecidewhethertwocompressedsets(CS)deservetobecombinedintoone?
41VTCS5614Prakash2017
![Page 42: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/42.jpg)
HowCloseisCloseEnough?
§ Q1)Weneedawaytodecidewhethertoputanewpointintoacluster(anddiscard)
§ BFRsuggeststwoways:– TheMahalanobisdistanceislessthanathreshold– Highlikelihoodofthepointbelongingtocurrentlynearestcentroid
VTCS5614 42Prakash2017
![Page 43: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/43.jpg)
MahalanobisDistance
VTCS5614 43Prakash2017
![Page 44: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/44.jpg)
MahalanobisDistance
§ Ifclustersarenormallydistributedinddimensions,thenalertransformaDon,onestandarddeviaDon=√𝒅 – i.e.,68%ofthepointsoftheclusterwillhaveaMahalanobisdistance<√𝒅
§ AcceptapointforaclusterifitsM.D.is<somethreshold,e.g.2standarddeviaDons
44VTCS5614Prakash2017
![Page 45: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/45.jpg)
Picture:EqualM.D.Regions§ Euclideanvs.Mahalanobisdistance
VTCS5614 45
Contours of equidistant points from the origin
Uniformly distributed points, Euclidean distance
Normally distributed points, Euclidean distance
Normally distributed points, Mahalanobis distance
Prakash2017
![Page 46: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/46.jpg)
Should2CSclustersbecombined?
Q2)Should2CSsubclustersbecombined?§ Computethevarianceofthecombinedsubcluster
– N,SUM,andSUMSQallowustomakethatcalculaDonquickly
§ Combineifthecombinedvarianceisbelowsomethreshold
§ ManyalternaWves:Treatdimensionsdifferently,considerdensity
VTCS5614 46Prakash2017
![Page 47: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/47.jpg)
THECUREALGORITHM
Extensionofk-meanstoclustersofarbitraryshapes
Prakash2017 VTCS5614 47
![Page 48: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/48.jpg)
TheCUREAlgorithm§ ProblemwithBFR/k-means:
– Assumesclustersarenormallydistributedineachdimension
– Andaxesarefixed–ellipsesatananglearenotOK
§ CURE(ClusteringUsingREpresentaWves):– AssumesaEuclideandistance– Allowsclusterstoassumeanyshape– UsesacollecWonofrepresentaWvepointstorepresentclusters
48
Vs.
VTCS5614Prakash2017
![Page 49: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/49.jpg)
Example:UniversitySalaries
49
e e
e
e
e e
e
e e
e
e
h
h
h
h
h
h
h h
h
h
h
h h
salary
age
VTCS5614Prakash2017
![Page 50: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/50.jpg)
StarWngCURE2Passalgorithm.Pass1:§ 0)Pickarandomsampleofpointsthatfitinmainmemory
§ 1)IniWalclusters:– Clusterthesepointshierarchically–groupnearestpoints/clusters
§ 2)PickrepresentaWvepoints:– Foreachcluster,pickasampleofpoints,asdispersedaspossible
– Fromthesample,pickrepresentaDvesbymovingthem(say)20%towardthecentroidofthecluster
VTCS5614 50Prakash2017
![Page 51: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/51.jpg)
Example:IniWalClusters
51
e e
e
e
e e
e
e e
e
e
h
h
h
h
h
h
h h
h
h
h
h h
salary
age
VTCS5614Prakash2017
![Page 52: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/52.jpg)
Example:PickDispersedPoints
52
e e
e
e
e e
e
e e
e
e
h
h
h
h
h
h
h h
h
h
h
h h
salary
age
Pick (say) 4 remote points for each cluster.
VTCS5614Prakash2017
![Page 53: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/53.jpg)
Example:PickDispersedPoints
53
e e
e
e
e e
e
e e
e
e
h
h
h
h
h
h
h h
h
h
h
h h
salary
age
Move points (say) 20% toward the centroid.
VTCS5614Prakash2017
![Page 54: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/54.jpg)
FinishingCURE
Pass2:§ Now,rescanthewholedatasetandvisiteachpointpinthedataset
§ Placeitinthe“closestcluster”– NormaldefiniDonof“closest”:FindtheclosestrepresentaDvetopandassignittorepresentaDve’scluster
54VTCS5614
p
Prakash2017
![Page 55: CS 5614: (Big) Data Management Systemspeople.cs.vt.edu/badityap/classes/cs5614-Spr17/lectures/...BFR Algorithm Points are read from disk one main-memory-full at a me Most points from](https://reader036.vdocuments.us/reader036/viewer/2022081521/5e4d28b95e5e2c28282f0283/html5/thumbnails/55.jpg)
Summary§ Clustering:Givenasetofpoints,withanoDonofdistancebetweenpoints,groupthepointsintosomenumberofclusters
§ Algorithms:– AgglomeraDvehierarchicalclustering:
• Centroidandclustroid– k-means:
• IniDalizaDon,pickingk– BFR– CURE
VTCS5614 55Prakash2017