Boston Hug by Ted Dunning 2012
DESCRIPTION
Describes the state of Apache Mahout with special focus on the upcoming k-nearest neighbor and k-means clustering algorithms.
TRANSCRIPT
![Page 1: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/1.jpg)
1©MapR Technologies - Confidential
Mahout, New and Improved
Now with Super Fast Clustering
![Page 2: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/2.jpg)
Agenda
What happened in Mahout 0.7
– less bloat
– simpler structure
– general cleanup
![Page 3: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/3.jpg)
To Cut Out Bloat
![Page 4: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/4.jpg)
![Page 6: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/6.jpg)
Bloat is Leaving in 0.7
Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?
Examples
– old LDA
– old Naïve Bayes
– genetic algorithms
If you care, get on the mailing list
– oops, too late since 0.7 is already released
![Page 7: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/7.jpg)
Integration of Collections
![Page 8: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/8.jpg)
Nobody Cares about Collections
We need it, math is built on it
Pull it into math
Broke the build (battle of the code expanders)
Fixed now (thanks to Grant)
![Page 9: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/9.jpg)
Pig Vector
![Page 11: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/11.jpg)
What is it?
Supports Pig access to Mahout functions
So far text vectorization
And classification
And model saving
Kind of works (see pigML from Twitter for better function)
![Page 12: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/12.jpg)
Compile and Install
Start by compiling and installing Mahout in your local repository:
cd ~/Apache
git clone https://github.com/apache/mahout.git
cd mahout
mvn install -DskipTests
Then do the same with pig-vector:
cd ~/Apache
git clone [email protected]:tdunning/pig-vector.git
cd pig-vector
mvn package
![Page 13: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/13.jpg)
Tokenize and Vectorize Text
Tokenization is done using a text encoder, which is defined with
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass, together with their data types
Example:
define EncodeVector
org.apache.mahout.pig.encoders.EncodeVector
('10','x+y+1', 'x:numeric, y:word, z:text');
You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
![Page 15: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/15.jpg)
The Formula
Not normal arithmetic
Describes which variables to use, whether offset is included
Also describes which interactions to use
– but that doesn’t do anything yet!
![Page 16: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/16.jpg)
Load and Encode Data
Load the data
a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
as (x1:int, x2:int, x3:int);
And encode it
b = foreach a generate 1 as key, EncodeVector(*) as v;
Note that the true meaning of * is very subtle
Now store it
store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');
![Page 17: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/17.jpg)
Train a Model
Pass previously encoded data to a sequential model trainer
define train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
Note that the argument is a string with its own syntax
![Page 18: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/18.jpg)
Reservations and Qualms
Pig-vector isn’t done
And it is ugly
And it doesn’t quite work
And it is hard to build
But there seems to be promise
![Page 19: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/19.jpg)
Potential
Add Naïve Bayes Model?
Somehow simplify the syntax?
Try a recent version of elephant-bird?
Switch to pigML?
![Page 20: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/20.jpg)
Large-scale k-Means Clustering
![Page 21: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/21.jpg)
Goals
Cluster very large data sets
Facilitate large nearest neighbor search
Allow very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data
Based on Mahout Math
Runs on Hadoop (really MapR) cluster
FAST – cluster tens of millions in minutes
![Page 22: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/22.jpg)
Non-goals
Use map-reduce (but it is there)
Minimize the number of clusters
Support metrics other than L2
![Page 23: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/23.jpg)
Anti-goals
Multiple passes over original data
Scale as O(k n)
![Page 24: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/24.jpg)
Why?
![Page 25: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/25.jpg)
K-nearest Neighbor with Super Fast k-means
![Page 26: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/26.jpg)
What’s that?
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you have few knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results, not just single nearest
Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
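To make the cost concrete: brute-force search computes one distance per (query, example) pair, so 3K x 200K is 6 x 10^8 distances, while 20M x 25M is 5 x 10^14. A minimal Python sketch of that brute-force baseline follows. The function name knn_predict and the toy data are illustrative assumptions, not code from the prototype.

```python
import heapq
import math


def knn_predict(query, examples, k):
    """Average the target of the k training examples nearest to query.

    examples is a list of (vector, target) pairs; distance is Euclidean (L2).
    Every query scans every example, which is what makes this approach slow.
    """
    nearest = heapq.nsmallest(k, examples, key=lambda ex: math.dist(query, ex[0]))
    return sum(target for _, target in nearest) / len(nearest)


# Toy data (hypothetical): two points near the origin, two far away.
train = [
    ((0.0, 0.0), 1.0),
    ((0.0, 1.0), 2.0),
    ((5.0, 5.0), 10.0),
    ((6.0, 5.0), 12.0),
]
print(knn_predict((0.1, 0.2), train, k=2))
```

The searchers described later in the talk aim to replace the full scan inside nsmallest with a much smaller candidate set.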
![Page 27: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/27.jpg)
Modeling with k-nearest Neighbors
![Page 28: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/28.jpg)
Subject to Some Limits
![Page 29: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/29.jpg)
Log Transform Improves Things
![Page 30: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/30.jpg)
Neighbors Depend on Good Presentation
![Page 31: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/31.jpg)
How We Did It
2 week hackathon with 6 developers from MapR customer
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on Github
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~ 1,000,000 x speedup
![Page 32: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/32.jpg)
How We Did It
2 week hackathon with 6 developers from customer bank
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on Github
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
![Page 33: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/33.jpg)
What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
Searcher interface
– Brute, ProjectionSearch, KmeansSearch, LshSearch
Super-fast clustering
– Kmeans, StreamingKmeans
![Page 34: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/34.jpg)
Projection Search
java.util.TreeSet!
![Page 35: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/35.jpg)
Projection Search
Projection onto a line provides a total order on data
Nearby points stay nearby
Some other points also wind up close
Search points just before or just after the query point
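A minimal Python sketch of the idea, not Mahout's ProjectionSearch itself. A sorted list plus bisect stands in for the TreeSet, and the class name, the Gaussian random directions, the number of projections, and the window size are all assumptions chosen for illustration.

```python
import bisect
import math
import random


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search using random 1-D projections."""

    def __init__(self, points, num_projections=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        # Random directions; each one induces a total order on the data.
        self.directions = [
            [rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_projections)
        ]
        # Per direction: (projected value, point index), sorted by value.
        self.orders = [
            sorted((self._project(p, d), i) for i, p in enumerate(points))
            for d in self.directions
        ]

    @staticmethod
    def _project(point, direction):
        return sum(a * b for a, b in zip(point, direction))

    def search(self, query, k, window=8):
        """Return indices of up to k candidates closest to query."""
        candidates = set()
        for direction, order in zip(self.directions, self.orders):
            pos = bisect.bisect_left(order, (self._project(query, direction), -1))
            lo, hi = max(0, pos - window), min(len(order), pos + window)
            # Points just before and just after the query in this ordering.
            candidates.update(i for _, i in order[lo:hi])
        # Re-rank the small candidate set by true L2 distance.
        ranked = sorted(candidates, key=lambda i: (math.dist(query, self.points[i]), i))
        return ranked[:k]
```

More projections or a wider window raise the chance that the true neighbors are among the candidates, at the price of more distance computations per query.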
![Page 36: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/36.jpg)
How Many Projections?
![Page 37: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/37.jpg)
K-means Search
Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
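The two steps can be sketched in Python as follows. This is an illustration under assumptions, not Mahout's KmeansSearch: the pre-clustering uses a few plain Lloyd iterations with a simplistic deterministic seeding, each cluster is searched by brute force where the slides suggest another Searcher, and the names lloyd, KmeansSearchSketch, and probes are hypothetical.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (L2)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def lloyd(points, k, iterations=10):
    """Plain k-means; evenly spaced points serve as the initial centers."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


class KmeansSearchSketch:
    def __init__(self, points, k):
        # Step 1: pre-cluster the data and remember each cluster's members.
        self.centers = lloyd(points, k)
        self.members = [[] for _ in self.centers]
        for p in points:
            self.members[nearest(p, self.centers)].append(p)

    def search(self, query, n, probes=2):
        # Step 2: look only inside the few clusters nearest to the query.
        order = sorted(
            range(len(self.centers)),
            key=lambda j: math.dist(query, self.centers[j]),
        )
        candidates = [p for j in order[:probes] for p in self.members[j]]
        return sorted(candidates, key=lambda p: math.dist(query, p))[:n]
```

Probing more than one cluster guards against queries that fall near a cluster boundary.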
![Page 38: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/38.jpg)
![Page 39: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/39.jpg)
![Page 40: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/40.jpg)
![Page 41: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/41.jpg)
![Page 42: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/42.jpg)
![Page 43: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/43.jpg)
But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads on one node)
– Very parallelizable
![Page 44: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/44.jpg)
Basic Method
Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
![Page 45: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/45.jpg)
Algorithmic Details
Foreach data point x_n
    compute distance to nearest centroid, ∂
    sample u
    if u > ∂/ß add to nearest centroid
    else create new centroid
    if number of centroids > k log n
        recursively cluster centroids
        set ß = 1.5 ß if number of centroids did not decrease
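The loop above can be sketched in Python roughly as follows. Several details are assumptions not stated on the slide: u is drawn uniformly from [0, 1), ß starts at 1.0, "recursively cluster centroids" is read as feeding the weighted centroids back through the same merge-or-create rule, and the threshold is floored at k so that log 1 = 0 does not trigger a collapse on the first point. The names absorb and streaming_kmeans are hypothetical.

```python
import math
import random


def absorb(centroids, vector, weight, beta, rng):
    """Merge one weighted point into its nearest centroid, or start a new one."""
    if centroids:
        j = min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i][0]))
        center, w = centroids[j]
        distance = math.dist(vector, center)
        if rng.random() > distance / beta:
            # Close enough: fold the point into the weighted mean.
            total = w + weight
            merged = [(c * w + v * weight) / total for c, v in zip(center, vector)]
            centroids[j] = (merged, total)
            return
    centroids.append((list(vector), weight))


def streaming_kmeans(points, k, beta=1.0, seed=0):
    """One pass over points; returns (weighted centroids, final beta)."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(points, start=1):
        absorb(centroids, x, 1.0, beta, rng)
        # Floor at k is an added guard (assumption), since log(1) == 0.
        if len(centroids) > max(k, k * math.log(n)):
            before = len(centroids)
            old, centroids = centroids, []
            for vector, weight in old:
                absorb(centroids, vector, weight, beta, rng)
            if len(centroids) >= before:
                beta *= 1.5  # clustering did not shrink: loosen the threshold
    return centroids, beta
```

Each returned centroid carries the number of original points it absorbed, which is the weight a later in-memory clustering step would need.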
![Page 46: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/46.jpg)
How It Works
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
![Page 47: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/47.jpg)
Parallel Speedup?
✓
![Page 50: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/50.jpg)
Warning, Recursive Descent
Inner loop requires finding nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
Empirically, projection search beats 64 bit LSH by a bit
– More optimization may change this story
![Page 51: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/51.jpg)
Moving to Ultra Mega Super Scale
Map-reduce implementation nearly trivial
Map: rough-cluster input data, output ß, weighted centroids
Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
Combiner possible, but not important
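The shape of that job can be imitated in a single Python process. This is a sketch under loud assumptions, not the Hadoop implementation: a greedy distance-threshold pass stands in for the streaming k-means map step (so no ß is emitted), the reducer runs weighted Lloyd iterations seeded with the heaviest centroids, and every function name and the radius parameter are hypothetical.

```python
import math


def rough_cluster(split, radius):
    """Map step stand-in: one greedy pass producing (center, weight) pairs."""
    centroids = []
    for x in split:
        for j, (c, w) in enumerate(centroids):
            if math.dist(x, c) <= radius:
                merged = tuple((ci * w + xi) / (w + 1) for ci, xi in zip(c, x))
                centroids[j] = (merged, w + 1)
                break
        else:
            centroids.append((tuple(x), 1))
    return centroids


def weighted_kmeans(centroids, k, iterations=10):
    """Reduce step: in-memory k-means over weighted centroids."""
    heaviest = sorted(centroids, key=lambda cw: -cw[1])
    centers = [c for c, _ in heaviest[:k]]
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        weights = [0.0] * len(centers)
        for c, w in centroids:
            j = min(range(len(centers)), key=lambda i: math.dist(c, centers[i]))
            weights[j] += w
            for d, value in enumerate(c):
                sums[j][d] += w * value
        centers = [
            tuple(s / wt for s in sums[j]) if wt else centers[j]
            for j, wt in enumerate(weights)
        ]
    return centers


def map_reduce_cluster(splits, k, radius=1.0):
    mapped = [rough_cluster(split, radius) for split in splits]  # independent mappers
    gathered = [cw for part in mapped for cw in part]            # single reducer input
    return weighted_kmeans(gathered, k)
```

Only the weighted centroids cross from map to reduce, which is why a single reducer can hold everything in memory.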
![Page 52: Boston Hug by Ted Dunning 2012](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b757b04a795905078b459f/html5/thumbnails/52.jpg)
Contact:
– [email protected]
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-boston-2012-07
Hash tags: #boston-hug #mahout #mapr