mining%of%massive%datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... ·...

13
Mining of Massive Datasets Leskovec, Rajaraman, and Ullman Stanford University

Upload: others

Post on 04-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

Mining  of  Massive  Datasets  Leskovec,  Rajaraman,  and  Ullman  Stanford  University  

Page 2: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

¡  Assumes  Euclidean  space/distance  

¡  Start  by  picking  k,  the  number  of  clusters  

¡  Ini<alize  clusters  by  picking  one  point  per  cluster  §  For  the  moment,  assume  we  pick  the  k  points  at  random  

2  

Page 3: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

¡  1)  For  each  point,  place  it  in  the  cluster  whose  current  centroid  it  is  nearest  

¡  2)  AAer  all  points  are  assigned,  update  the  loca<ons  of  centroids  of  the  k  clusters  

¡  3)  Reassign  all  points  to  their  closest  centroid  §  Some<mes  moves  points  between  clusters  

¡  Repeat  2  and  3  un.l  convergence  §  Convergence:  Points  don’t  move  between  clusters  and  centroids  stabilize  

3  

Page 4: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

4  

x

x

x

x

x

x

x x

x … data point … centroid

x

x

x

Round 1

Page 5: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

5  

x

x

x

x

x

x

x x

x … data point … centroid

x

x

x

Round 2

Page 6: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

6  

x

x

x

x

x

x

x x

x … data point … centroid

x

x

x

Round 3

Page 7: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

How  to  select  k?  ¡  Try  different  k,  looking  at  the  change  in  the  average  distance  to  centroid,  as  k  increases.  

 

7  

Page 8: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

8  

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Too few; many long distances to centroid.

Page 9: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

9  

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Just right; distances rather short.

Page 10: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

10  

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Too many; little improvement in average distance.

Page 11: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

Average  falls  rapidly  un<l  right  k,  then  falls  much  more  slowly  

11  

k

Average distance to

centroid

Best value of k

Page 12: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

¡  Approach  1:  Sampling  §  Cluster  a  sample  of  the  data  using  hierarchical  clustering,  to  obtain  k  clusters  

§  Pick  a  point  from  each  cluster  (e.g.  point  closest  to  centroid)  

§  Sample  fits  in  main  memory  

¡  Approach  2:  Pick  “dispersed”  set  of  points  §  Pick  first  point  at  random  §  Pick  the  next  point  to  be  the  one  whose  minimum  distance  from  the  selected  points  is  as  large  as  possible  

§  Repeat  un<l  we  have  k  points  12  

Page 13: Mining%of%Massive%Datasets%lig-membres.imag.fr/leroyv/wp-content/uploads/sites/125/... · 2015-11-26 · 1) For&each&point,&place&itin&the&cluster&whose& currentcentroid&itis&nearest!

¡  In  each  round,  we  have  to  examine  each  input  point  exactly  once  to  find  closest  centroid  

¡  Each  round  is  O(kN)  for  N  points,  k  clusters  

¡  But  the  number  of  rounds  to  convergence  can  be  very  large!  

¡  Can  we  cluster  in  a  single  pass  over  the  data?  

13