graphlab dunning-clustering
DESCRIPTION
Talk on the upcoming Mahout nearest neighbor framework focussing particularly on the k-means acceleration provided by the streaming k-means implementation.
TRANSCRIPT
1©MapR Technologies - Confidential
Large-scale Single-pass k-Means Clustering at Scale
Ted Dunning
Large-scale Single-pass k-Means Clustering
Large-scale k-Means Clustering
Goals
Cluster very large data sets
Facilitate large nearest neighbor search
Allow very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data
Based on Mahout Math
Runs on Hadoop (really MapR) cluster
FAST – cluster tens of millions in minutes
Non-goals
Use map-reduce (but it is there)
Minimize the number of clusters
Support metrics other than L2
Anti-goals
Multiple passes over original data
Scale as O(k n)
Why?
K-nearest Neighbor with Super Fast k-means
What’s that?
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
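The two steps named on this slide (find the k nearest training examples, average their target values) can be sketched in a few lines. This is a brute-force illustration in Python, not the Mahout code from the talk; the function name `knn_predict` and the toy data are made up for the example.

```python
import heapq
import math

def knn_predict(query, examples, targets, k):
    """Average the targets of the k training examples nearest to `query` (L2 distance)."""
    # One distance per training example: this full scan per query is what
    # makes the brute-force approach slow once both sides grow large.
    scored = ((math.dist(query, x), y) for x, y in zip(examples, targets))
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical toy data: three examples near the origin, one far away.
examples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
targets = [1.0, 2.0, 3.0, 100.0]
prediction = knn_predict((0.1, 0.1), examples, targets, k=3)
```

Every query scans every training example, so the cost grows with queries x examples; that product is what the prototype numbers on this slide describe.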
How We Did It
2 week hackathon with 6 developers from customer bank
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~1,000,000x speedup
– well, really only 100-1000x after basic hygiene
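One way to read the ~1,000,000x figure is as the ratio of query-by-example pairs between the target workload and the prototype, using the sizes quoted earlier in the talk (3K x 200K versus 20M x 25M in the same time). This reading is an assumption; the slides do not spell out the derivation.

```python
# Distance computations implied by each workload (queries x examples).
prototype_pairs = 3_000 * 200_000            # 6e8 pairs, took hours
target_pairs = 20_000_000 * 25_000_000       # 5e14 pairs, wanted in the same time

# Required speedup if the time budget stays the same.
speedup = target_pairs / prototype_pairs
print(f"required speedup is about {speedup:,.0f}x")
```

The ratio is a little under one million, which is consistent with the round number on the slide.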
What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
Projection Search
java.util.TreeSet!
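The slide gives only the hint that a sorted set is the core data structure. Below is a minimal sketch of how such a projection search might work, under assumptions not stated in the talk: points are projected onto a few random directions, each projection is kept in sorted order (a sorted Python list with `bisect` stands in for Java's TreeSet), and a query examines only its neighbors in each ordering before ranking them by true L2 distance. The class name `ProjectionSearchSketch` and the parameters `num_projections` and `search_size` are hypothetical, not Mahout's API.

```python
import bisect
import math
import random

class ProjectionSearchSketch:
    """Approximate nearest-neighbor search over sorted 1-D random projections."""

    def __init__(self, points, num_projections=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        # Random directions; each one induces a 1-D ordering of the points.
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)]
                           for _ in range(num_projections)]
        # Per direction: sorted (projection value, point index) pairs.
        self.sorted_projections = [
            sorted((self._dot(d, p), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def search(self, query, k, search_size=10):
        """Return indices of up to k candidate points, nearest first."""
        candidates = set()
        for d, proj in zip(self.directions, self.sorted_projections):
            # Locate the query in this ordering, then take neighbors on both sides.
            pos = bisect.bisect_left(proj, (self._dot(d, query), -1))
            lo, hi = max(0, pos - search_size), min(len(proj), pos + search_size)
            candidates.update(i for _, i in proj[lo:hi])
        # Rank the small candidate set by true L2 distance.
        ranked = sorted(candidates, key=lambda i: math.dist(query, self.points[i]))
        return ranked[:k]
```

Points that are close in the original space are also close in every projection, so a few orderings with a modest `search_size` tend to surface the true neighbors; the result is approximate unless the window covers all points.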
How Many Projections?
K-means Search
Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
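The idea on this slide can be sketched as follows. This is an illustration in Python, not the KmeansSearch class from the talk: the centroids are assumed to be given, each cluster is searched by brute force rather than by a nested Searcher, and the names `KmeansSearchSketch` and `num_probes` are hypothetical.

```python
import math

class KmeansSearchSketch:
    """Nearest-neighbor search that looks only inside the closest clusters."""

    def __init__(self, points, centroids):
        self.points = points
        self.centroids = centroids
        # Pre-clustering step: bucket each point under its nearest centroid.
        self.members = [[] for _ in centroids]
        for i, p in enumerate(points):
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            self.members[nearest].append(i)

    def search(self, query, k, num_probes=2):
        """Return indices of up to k points, nearest first."""
        # Visit only the few clusters whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: math.dist(query, self.centroids[c]))
        candidates = [i for c in order[:num_probes] for i in self.members[c]]
        # Brute force inside the chosen clusters; a nested searcher could go here.
        candidates.sort(key=lambda i: math.dist(query, self.points[i]))
        return candidates[:k]

# Hypothetical toy data: two well-separated groups and their rough centers.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
searcher = KmeansSearchSketch(points, centroids=[(0.3, 0.3), (10.3, 10.3)])
result = searcher.search((0.2, 0.1), k=2, num_probes=1)
```

Probing more clusters trades speed for recall: a true neighbor that sits just across a cluster boundary is found only if its cluster is among those probed.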
But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads)
– Very parallelizable
Basic Method
Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
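The second step depends on each surrogate centroid carrying the weight of the points it absorbed. A minimal sketch of an in-memory weighted k-means (plain Lloyd iterations in Python) is given here as an illustration; the talk does not say which in-memory algorithm is used, and the function name `weighted_kmeans` and the toy input are assumptions.

```python
import math
import random

def weighted_kmeans(weighted_points, k, iterations=10, seed=0):
    """Lloyd's algorithm where each input point carries a weight."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centers = [list(p) for p, _ in rng.sample(weighted_points, k)]
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        totals = [0.0] * k
        for point, weight in weighted_points:
            # Assign to the nearest center, counting the point `weight` times.
            j = min(range(k), key=lambda c: math.dist(centers[c], point))
            totals[j] += weight
            for d, value in enumerate(point):
                sums[j][d] += weight * value
        for j in range(k):
            if totals[j] > 0:  # an empty cluster keeps its previous center
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers

# Hypothetical output of the first pass: (centroid position, weight) pairs.
surrogate = [((0, 0), 3), ((2, 0), 1), ((10, 10), 2), ((10, 12), 2)]
centers = weighted_kmeans(surrogate, k=2)
```

Because the input is the small set of weighted centroids rather than the original data, this step fits in memory and can afford multiple iterations.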
Algorithmic Details
Foreach data point x_n
    compute distance to nearest centroid, ∂
    sample u uniformly from [0, 1]
    if u > ∂/ß add x_n to nearest centroid
    else create new centroid at x_n
    if number of centroids > 10 log n
        recursively cluster centroids
        set ß = 1.5 ß if number of centroids did not decrease
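The loop above can be sketched in runnable form. This is a simplified Python illustration of the slide's pseudocode, not Mahout's StreamingKmeans: the nearest centroid is found by brute force, weights are ignored when deciding whether to open a new centroid, the recursive step is a single non-collapsing pass over the shuffled centroids, and the names `one_pass` and `weighted_merge` are hypothetical.

```python
import math
import random

def weighted_merge(centroid, point, weight):
    """Fold a weighted point into a [position, weight] centroid in place."""
    position, total = centroid
    new_total = total + weight
    centroid[0] = [(c * total + p * weight) / new_total
                   for c, p in zip(position, point)]
    centroid[1] = new_total

def one_pass(weighted_points, beta, rng, collapse=True):
    """Single pass over (point, weight) pairs. Returns (centroids, beta)."""
    centroids = []  # list of [position, weight]
    n = 0
    for point, weight in weighted_points:
        n += 1
        if not centroids:
            centroids.append([list(point), weight])
            continue
        # Distance to the nearest centroid (brute force in this sketch).
        nearest = min(centroids, key=lambda c: math.dist(c[0], point))
        distance = math.dist(nearest[0], point)
        if rng.random() > distance / beta:
            weighted_merge(nearest, point, weight)   # close: absorb the point
        else:
            centroids.append([list(point), weight])  # far: open a new centroid
        if collapse and len(centroids) > 10 * math.log(n):
            before = len(centroids)
            rng.shuffle(centroids)
            # Cluster the centroids themselves, treating each as a weighted point.
            centroids, _ = one_pass([(p, w) for p, w in centroids],
                                    beta, rng, collapse=False)
            if len(centroids) >= before:
                beta *= 1.5  # no reduction: loosen the distance threshold
    return centroids, beta

# Hypothetical data: 2000 points drawn around three centers.
rng = random.Random(42)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = []
for _ in range(2000):
    cx, cy = rng.choice(blobs)
    data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), 1))
centroids, beta = one_pass(data, beta=1.0, rng=rng)
```

The centroid weights always sum to the number of input points, so the output can be fed directly to a weighted clustering step. The starting value of ß is a guess here; the loop raises it whenever collapsing fails to shrink the centroid set.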
How It Works
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
Parallel Speedup?
✓
Warning, Recursive Descent
Inner loop requires finding nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
Map-reduce implementation nearly trivial
Map: rough-cluster input data, output ß, weighted centroids
Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
Combiner possible, but essentially never important
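The map and reduce roles described here can be sketched without Hadoop. In this Python illustration the map-side clustering is replaced by a deterministic stand-in (absorb a point if it lies within ß of the nearest centroid, otherwise open a new centroid), and the reducer merges by re-clustering the weighted centroids with a growing ß. The function names `rough_cluster`, `mapper` and `reducer` and the parameter `max_centroids` are hypothetical.

```python
import math

def rough_cluster(weighted_points, beta):
    """Stand-in for the map-side clustering of (point, weight) pairs."""
    centroids = []  # list of [position, weight]
    for point, weight in weighted_points:
        nearest = min(centroids, key=lambda c: math.dist(c[0], point), default=None)
        if nearest is not None and math.dist(nearest[0], point) <= beta:
            total = nearest[1] + weight
            # Weighted mean of the old centroid and the incoming point.
            nearest[0] = [(c * nearest[1] + p * weight) / total
                          for c, p in zip(nearest[0], point)]
            nearest[1] = total
        else:
            centroids.append([list(point), weight])
    return centroids

def mapper(shard, beta):
    # Emits ß together with the weighted centroids for one input split.
    return beta, rough_cluster([(p, 1) for p in shard], beta)

def reducer(mapped, max_centroids):
    # A single reducer sees every mapper's centroids.
    beta = max(b for b, _ in mapped)
    centroids = [c for _, cs in mapped for c in cs]
    while len(centroids) > max_centroids:
        beta *= 1.5  # too many centroids: merge them with a looser threshold
        centroids = rough_cluster([(p, w) for p, w in centroids], beta)
    return centroids

# Hypothetical input splits.
shards = [[(0.0, 0.0), (0.1, 0.0)],
          [(10.0, 10.0), (10.1, 10.0)],
          [(0.0, 0.1), (10.0, 10.1)]]
mapped = [mapper(shard, beta=1.0) for shard in shards]
merged = reducer(mapped, max_centroids=2)
```

Since each centroid carries its weight, merging centroids from different splits gives the same weighted means as clustering the union, which is why a single reducer over the small centroid sets suffices.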
Contact:
– [email protected]
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-mlconf
Hash tags: #mlconf #mahout #mapr