CMU Lecture on Hadoop Performance
DESCRIPTION
A lecture by Ted Dunning describing several ways to make Hadoop programs go faster.
TRANSCRIPT
Hadoop Performance
Agenda
What is performance? Optimization?
Case 1: Aggregation
Case 2: Recommendations
Case 3: Clustering
Case 4: Matrix decomposition
What is Performance?
Is doing something faster better?
Is it the right task?
Do you have a wide enough view?
What is the right performance metric?
Aggregation
Word-count and friends
– How many times did X occur?
– How many unique X’s occurred?
Associative metrics permit decomposition
– Partial sums and grand totals, for example
– Use combiners (sketch below)
– Use high-resolution aggregates to compute low-resolution aggregates
Rank-based statistics do not permit decomposition
– Avoid them
– Use approximations
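The combiner bullet above is the canonical Hadoop example. Here is a minimal word-count sketch in Java (not code from the talk; the class names are illustrative) in which the reducer class doubles as the combiner, which is valid precisely because integer addition is associative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);              // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                        // partial sums roll up into grand totals
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the key line: aggregate map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner runs on map output before the shuffle, so most of the (word, 1) pairs never cross the network.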
Inside Map-Reduce
[Diagram: the classic word-count pipeline: Input → Map → Combine → Shuffle and sort → Reduce → Output]
Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax"
Map output: the, 1   time, 1   has, 1   come, 1   …
After shuffle and sort: come, [3,2,1]   has, [1,5,2]   the, [1,2,1]   time, [10,1,3]   …
Reduce output: come, 6   has, 8   the, 4   time, 14   …
Don’t Do This
[Diagram: Daily, Weekly, and Monthly aggregates each computed directly from the Raw data, re-reading the big raw input three times]
Do This Instead
[Diagram: Raw → Daily → Weekly → Monthly, each aggregate computed from the shorter-term aggregate before it]
Aggregation
First rule:
– Don’t read the big input multiple times
– Compute longer-term aggregates from short-term aggregates
Second rule:
– Don’t read the big input multiple times
– Compute multiple windowed aggregates at the same time (sketch below)
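A toy sketch of both rules (plain Java; the hard-coded daily counts stand in for the daily aggregate files): one pass over the daily aggregates produces the weekly and monthly windows together, and the raw input is never re-read.

```java
import java.util.Map;
import java.util.TreeMap;

public class Rollup {
  public static void main(String[] args) {
    // dayIndex -> count; in practice these come from the daily aggregates, not raw data
    Map<Integer, Long> daily = new TreeMap<>();
    for (int day = 0; day < 60; day++) {
      daily.put(day, 100L + day);
    }

    Map<Integer, Long> weekly = new TreeMap<>();
    Map<Integer, Long> monthly = new TreeMap<>();
    for (Map.Entry<Integer, Long> e : daily.entrySet()) {
      weekly.merge(e.getKey() / 7, e.getValue(), Long::sum);    // seven daily sums make a week
      monthly.merge(e.getKey() / 30, e.getValue(), Long::sum);  // both windows in the same pass
    }
    System.out.println("weekly  = " + weekly);
    System.out.println("monthly = " + monthly);
  }
}
```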
Rank Statistics Can Be Tamed
Approximate quartiles are easily computed
– (but sorted data is evil)
Approximate unique counts are easily computed
– use a Bloom filter and extrapolate from the number of set bits
– use multiple filters at different down-sample rates
Approximate high or low quantiles are easily computed
– keep the largest 1000 elements (sketch below)
– keep the largest 1000 elements from 10x down-sampled data
– and so on
Approximate top-40 is also possible
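For the "keep the largest 1000 elements" idea, a minimal sketch (the stream and constants are invented for illustration): a bounded min-heap retains the top k of a stream in one pass with O(k) memory, and the smallest retained element approximates a high quantile.

```java
import java.util.PriorityQueue;
import java.util.Random;

public class TopK {
  public static void main(String[] args) {
    final int k = 1000;
    PriorityQueue<Double> heap = new PriorityQueue<>(k);   // min-heap of the current top k
    Random rand = new Random(42);
    long n = 1_000_000;
    for (long i = 0; i < n; i++) {
      double x = rand.nextDouble();
      if (heap.size() < k) {
        heap.add(x);
      } else if (x > heap.peek()) {
        heap.poll();                                       // evict the smallest of the top k
        heap.add(x);
      }
    }
    // the smallest retained value sits near the (1 - k/n) quantile, here the 99.9th percentile
    System.out.printf("approximate 99.9th percentile = %.6f%n", heap.peek());
  }
}
```

Running the same loop over 10x down-sampled data, as the slide suggests, extends the trick to lower quantiles.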
Recommendations
Common patterns in the past may predict common patterns in the future
People who bought item x also bought item y
But also, people who bought Chinese food in the past, …
Or people in SoMa really liked this restaurant in the past
People who bought …
Key operation is counting the number of people who bought x and y
– for all x’s and all y’s
The raw problem appears to be O(N³)
At the least, it is O(k_max²)
– for the most prolific user, there are k² pairs to count
– k_max can be near N
Scalable problems must be O(N)
But …
What do we learn from users who buy everything?
– they have no discrimination
– they are often the QA team
– they tell us nothing
What do we learn from items bought by everybody?
– the dual of omnivorous buyers
– these are often teaser items
– they tell us nothing
Also …
What would you learn about a user from purchases
– 1 … 20?
– 21 … 100?
– 101 … 1000?
– 1001 … ∞?
What about learning about an item?
– how many people do we need to see before we understand the item?
So …
Cheat!
Downsample every user to at most 1000 interactions (sketch below)
– most recent
– most rare
– random selection
– whatever is easiest
Now k_max ≤ 1000
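A minimal sketch of the "random selection: whatever is easiest" option (the cap of 1000 comes from the slide; the synthetic history is invented):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Downsample {
  private static final int MAX_INTERACTIONS = 1000;

  static List<String> downsample(List<String> items, Random rand) {
    if (items.size() <= MAX_INTERACTIONS) {
      return items;                                  // most users are untouched
    }
    List<String> copy = new ArrayList<>(items);
    Collections.shuffle(copy, rand);                 // random selection: whatever is easiest
    return copy.subList(0, MAX_INTERACTIONS);
  }

  public static void main(String[] args) {
    Random rand = new Random(42);
    List<String> history = new ArrayList<>();
    for (int i = 0; i < 50_000; i++) {               // a QA-team-sized purchase history
      history.add("item-" + rand.nextInt(10_000));
    }
    List<String> sample = downsample(history, rand);
    System.out.println("kept " + sample.size() + " of " + history.size() + " interactions");
  }
}
```

After the cap, the worst user contributes at most 1000² cooccurrence pairs instead of k_max².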
The Fundamental Things Apply
Don’t read the raw data repeatedly
Sessionize and denormalize per hour/day/week
– that is, group by user
– expand items with categories and content descriptors if feasible
Feed all down-stream processing in one pass
– baby join to item characteristics
– downsample
– count grand totals
– compute cooccurrences
Deployment Matters, Too
For the restaurant case, basic recommendation info includes:
– user × merchant histories
– user × cuisine histories
– top local restaurants by anomalous repeat visits
– restaurant × indicator-merchant cooccurrence matrix
– restaurant × indicator-cuisine cooccurrence matrix
These can all be stored and accessed using text retrieval techniques
Fast deployment using mirrors and NFS (not standard Hadoop)
Non-Traditional Deployment Demo
DEMO
EM Algorithms
Start with random model estimates
Use the model estimates to classify examples
Use the classified examples to find maximum likelihood estimates
Use the model estimates to classify examples
Use the classified examples to find maximum likelihood estimates
… and so on …
K-means as EM Algorithm
Assign a random seed to each cluster
Assign points to the nearest cluster
Move each cluster to the average of its contained points
Assign points to the nearest cluster
… and so on …
K-means as Map-Reduce
Assignment of points to cluster is trivially parallel
Computation of new clusters is also parallel
Moving clusters to the averages of their points is ideal for map-reduce (see the sketch below)
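To make the map-reduce shape concrete, a single-process k-means sketch (not the talk's code): the "map" assigns each point to its nearest centroid, and the "reduce" averages the assigned points. The full scan per iteration in main is exactly the cost the next slide complains about.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansStep {
  static double dist2(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  static double[][] iterate(double[][] points, double[][] centroids) {
    int k = centroids.length, d = centroids[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];
    for (double[] p : points) {              // "map": assign point to the nearest centroid
      int best = 0;
      for (int c = 1; c < k; c++) {
        if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
      }
      counts[best]++;                        // "combine": accumulate partial sums and counts
      for (int i = 0; i < d; i++) sums[best][i] += p[i];
    }
    for (int c = 0; c < k; c++) {            // "reduce": new centroid = mean of its points
      if (counts[c] > 0) {
        for (int i = 0; i < d; i++) sums[c][i] /= counts[c];
      } else {
        sums[c] = centroids[c].clone();      // leave an empty cluster where it was
      }
    }
    return sums;
  }

  public static void main(String[] args) {
    Random rand = new Random(42);
    double[][] points = new double[1000][2];
    for (double[] p : points) { p[0] = rand.nextGaussian(); p[1] = rand.nextGaussian(); }
    double[][] centroids = { points[0].clone(), points[1].clone(), points[2].clone() };
    for (int iter = 0; iter < 10; iter++) {  // every iteration re-reads all the data
      centroids = iterate(points, centroids);
    }
    System.out.println(Arrays.deepToString(centroids));
  }
}
```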
But …
With map-reduce, iteration is evil
Starting a program can take 10-30s
Saving data to disk and then immediately reading from disk is silly
Input might even fit in cluster memory
Fix #1
Don’t do that! Use Spark
– in-memory, interactive map-reduce
– 100x to 1000x faster
– must fit in memory
Use Giraph
– BSP programming model rather than map-reduce
– essentially map-reduce-reduce-reduce…
Use GraphLab
– like BSP without the speed brakes
– 100x faster
Fix #2
Use a sketch-based algorithm
Do one pass over the data to compute sketch of the data
Cluster the sketch
Done, with good theoretical bounds on accuracy
Speedup of 3000x or more
An Example
The Problem
Spirals are a classic “counter” example for k-means
Classic low-dimensional manifold with added noise
But clustering still makes modeling work well
An Example
The Cluster Proximity Features
Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– significant error that can be decreased (to a point) by increasing the number of clusters
Or by the proximity to the 2 nearest clusters (2 × 4.3 bits + 1 sign bit + 2 proximities; sketch below)
– error is negligible
– unwinds the data into a simple representation
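A hedged sketch of the proximity encoding (my reading of the slide; the sign bit, which would record which side of the two centroids the point falls on, is omitted here):

```java
public class ProximityFeatures {
  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }

  // encode a point as {nearest cluster id, second id, distance to each}
  static double[] encode(double[] p, double[][] centroids) {
    int best = -1, second = -1;
    double bestD = Double.MAX_VALUE, secondD = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double d = distance(p, centroids[c]);
      if (d < bestD) {
        second = best; secondD = bestD;      // old winner becomes the runner-up
        best = c; bestD = d;
      } else if (d < secondD) {
        second = c; secondD = d;
      }
    }
    return new double[] { best, second, bestD, secondD };
  }

  public static void main(String[] args) {
    double[][] centroids = { {0, 0}, {5, 0}, {0, 5}, {5, 5} };
    double[] f = encode(new double[] {1, 1}, centroids);
    System.out.printf("nearest=%d second=%d d1=%.2f d2=%.2f%n",
        (int) f[0], (int) f[1], f[2], f[3]);
  }
}
```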
Lots of Clusters Are Fine
Surrogate Method
Start with sloppy clustering into κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Cluster the surrogate data using ball k-means
Results are provably good for highly clusterable data
The sloppy clustering is on-line
The surrogate can be kept in memory
The ball k-means pass can be done at any time
Algorithm Costs
O(k d) per point per iteration for Lloyd’s algorithm
Number of iterations not well known
More than log n iterations is a reasonable assumption
– so O(k d log n) per point overall
Algorithm Costs
Surrogate methods
– fast, sloppy, single-pass clustering with κ = k log n
– fast, sloppy search for the nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of the κ weighted centroids:
  O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
  O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice
Algorithm Costs
How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– Lloyd’s: k d log n = 2000 × 10 × 26 ≈ 500,000 operations per point
– sketch: d (log k + log log n) = 10 × (11 + 5) = 160 operations per point
– roughly 3,000 times faster is a bona fide big deal
Pragmatics
But this requires a fast search internally
Have to cluster on the fly for the sketch
Have to guarantee sketch quality
Previous methods had very high complexity
How It Works
For each point (sketch below):
– find the approximately nearest centroid (distance = d)
– if d > threshold, create a new centroid
– else if u > d/threshold (u a random draw), create a new cluster
– else add the point to the nearest centroid
If the number of centroids exceeds κ ≈ C log N:
– recursively cluster the centroids with a higher threshold
The result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
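A simplified, single-machine reading of the rule above (assumptions: u is a uniform random draw, so far-away points seed new centroids with probability proportional to their distance, and a linear scan stands in for the fast approximate search the previous slide calls for):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketch {
  static final Random RAND = new Random(42);

  static class Centroid {
    final double[] mean;
    double weight;
    Centroid(double[] p, double w) { mean = p.clone(); weight = w; }
    void add(double[] p, double w) {           // weighted running mean update
      weight += w;
      for (int i = 0; i < mean.length; i++) {
        mean[i] += (p[i] - mean[i]) * w / weight;
      }
    }
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }

  static List<Centroid> sketch(List<double[]> points, int kappa, double threshold) {
    List<Centroid> centroids = new ArrayList<>();
    for (double[] p : points) {
      Centroid nearest = null;                 // linear scan; real code uses a fast
      double d = Double.MAX_VALUE;             // approximate nearest-centroid search
      for (Centroid c : centroids) {
        double dc = distance(p, c.mean);
        if (dc < d) { d = dc; nearest = c; }
      }
      // far points, and near points with probability ~ d/threshold, seed new centroids
      if (nearest == null || d > threshold || RAND.nextDouble() < d / threshold) {
        centroids.add(new Centroid(p, 1));
      } else {
        nearest.add(p, 1);
      }
      if (centroids.size() > kappa) {          // too many centroids: raise the
        threshold *= 1.5;                      // threshold and re-cluster them
        List<Centroid> old = centroids;
        centroids = new ArrayList<>();
        for (Centroid c : old) {               // same rule, applied to weighted centroids
          Centroid near = null;
          double dd = Double.MAX_VALUE;
          for (Centroid c2 : centroids) {
            double dc = distance(c.mean, c2.mean);
            if (dc < dd) { dd = dc; near = c2; }
          }
          if (near == null || dd > threshold || RAND.nextDouble() < dd / threshold) {
            centroids.add(new Centroid(c.mean, c.weight));
          } else {
            near.add(c.mean, c.weight);
          }
        }
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    List<double[]> points = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      int cluster = RAND.nextInt(10);
      points.add(new double[] { cluster * 10 + RAND.nextGaussian(),
                                cluster * 3 + RAND.nextGaussian() });
    }
    List<Centroid> result = sketch(points, 200, 1.0);
    System.out.println(result.size() + " weighted centroids summarize 100000 points");
  }
}
```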
Matrix Decomposition
Big matrices can often be compressed
Often used in recommendations
[Diagram: a large matrix approximated as the product of much smaller factor matrices]
Nearest Neighbor
Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
Fast search algorithms work up to dimension 50–100, but not above that
Random Projections
Many problems in high dimension can be reduced to low dimension
Reductions with good distance approximation are available
Surprisingly, these methods can be done using random vectors
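A Johnson-Lindenstrauss-style sketch of that claim (the dimensions and counts are invented): project 1000-dimensional vectors down to 50 with a random Gaussian matrix and measure the worst pairwise distance distortion. Scaling by 1/sqrt(k) makes squared distances unbiased.

```java
import java.util.Random;

public class RandomProjection {
  public static void main(String[] args) {
    Random rand = new Random(42);
    int d = 1000, k = 50, n = 100;
    double[][] proj = new double[k][d];
    for (double[] row : proj) {                       // random Gaussian projection matrix
      for (int j = 0; j < d; j++) row[j] = rand.nextGaussian() / Math.sqrt(k);
    }
    double[][] x = new double[n][d];
    for (double[] v : x) for (int j = 0; j < d; j++) v[j] = rand.nextGaussian();

    double[][] y = new double[n][k];                  // project every vector to k dims
    for (int i = 0; i < n; i++) {
      for (int r = 0; r < k; r++) {
        double s = 0;
        for (int j = 0; j < d; j++) s += proj[r][j] * x[i][j];
        y[i][r] = s;
      }
    }
    double maxErr = 0;                                // worst relative distance distortion
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        double ratio = dist(y[i], y[j]) / dist(x[i], x[j]);
        maxErr = Math.max(maxErr, Math.abs(1 - ratio));
      }
    }
    System.out.printf("worst relative distance distortion: %.3f%n", maxErr);
  }

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
    return Math.sqrt(s);
  }
}
```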
Fundamental Trick
Random orthogonal projection preserves action of A
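The slide's formula did not survive transcription. A standard statement of this trick, in the style of randomized matrix decomposition (Halko, Martinsson, and Tropp), is shown below; it may not be the exact equation from the slide:

```latex
% A Gaussian test matrix \Omega samples the column space of A; the
% orthonormal basis Q of the sample then nearly reproduces the action of A.
\Omega \in \mathbb{R}^{n \times (k+p)}, \quad \Omega_{ij} \sim \mathcal{N}(0,1),
\qquad Y = A\,\Omega, \qquad Q = \operatorname{orth}(Y),
\qquad A \approx Q\,Q^{\top} A .
```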
Projection Search
[Figure: points projected onto a random direction; the projected values induce a total ordering!]
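A sketch of projection search under assumed details: sort every vector by its projection onto one random direction, binary-search the query's projection, and compute exact distances only in a small window around that position. Real implementations use several projections and merge the candidate sets.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class ProjectionSearch {
  public static void main(String[] args) {
    Random rand = new Random(42);
    int n = 10_000, d = 20;
    double[][] data = new double[n][d];
    for (double[] v : data) for (int j = 0; j < d; j++) v[j] = rand.nextGaussian();

    double[] direction = new double[d];              // one random projection direction
    for (int j = 0; j < d; j++) direction[j] = rand.nextGaussian();

    double[] proj = new double[n];
    Integer[] order = new Integer[n];                // indices sorted by projection value
    for (int i = 0; i < n; i++) {
      order[i] = i;
      for (int j = 0; j < d; j++) proj[i] += direction[j] * data[i][j];
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> proj[i]));

    double[] q = data[123];                          // query: a point from the data set
    double qp = 0;
    for (int j = 0; j < d; j++) qp += direction[j] * q[j];
    int lo = 0, hi = n;
    while (lo < hi) {                                // binary search in projection order
      int mid = (lo + hi) / 2;
      if (proj[order[mid]] < qp) lo = mid + 1; else hi = mid;
    }
    int from = Math.max(0, lo - 50), to = Math.min(n, lo + 50);
    int best = -1;
    double bestD = Double.MAX_VALUE;                 // exact distances only in the window
    for (int i = from; i < to; i++) {
      double s = 0;
      double[] v = data[order[i]];
      for (int j = 0; j < d; j++) { double t = v[j] - q[j]; s += t * t; }
      if (s < bestD) { bestD = s; best = order[i]; }
    }
    System.out.println("nearest candidate to point 123 is point " + best);
  }
}
```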
LSH Bit-match Versus Cosine
But How?
Summary
Don’t repeat big scans
– cascade aggregations
– compute several aggregates at once
Use approximate measures for rank statistics
Downsample where appropriate
Use non-traditional deployment
Use sketches
Use random projections
Contact Me!
We’re hiring at MapR in US and Europe
Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
Get the code at https://github.com/tdunning
Contact me at [email protected] or @ted_dunning