CMU Lecture on Hadoop Performance
DESCRIPTION
A lecture by Ted Dunning describing several ways to make Hadoop programs go faster.
TRANSCRIPT
Hadoop Performance
Agenda
What is performance? Optimization?
Case 1: Aggregation
Case 2: Recommendations
Case 3: Clustering
Case 4: Matrix decomposition
What is Performance?
Is doing something faster better?
Is it the right task?
Do you have a wide enough view?
What is the right performance metric?
Aggregation
Word-count and friends
– How many times did X occur?
– How many unique X’s occurred?
Associative metrics permit decomposition
– Partial sums and grand totals, for example
– Use combiners (sketch below)
– Use high-resolution aggregates to compute low-resolution aggregates
Rank-based statistics do not permit decomposition
– Avoid them
– Use approximations
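The combiner bullet above is the canonical Hadoop example. Here is a minimal word-count sketch in Java (not code from the talk; the class names are illustrative) in which the reducer class doubles as the combiner, which is valid precisely because integer addition is associative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);              // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                        // partial sums roll up into grand totals
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the key line: aggregate map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner runs on map output before the shuffle, so most of the (word, 1) pairs never cross the network.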
Inside Map-Reduce
[Diagram: the classic word-count pipeline: Input → Map → Combine → Shuffle and sort → Reduce → Output]
Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax"
Map output: the, 1   time, 1   has, 1   come, 1   …
After shuffle and sort: come, [3,2,1]   has, [1,5,2]   the, [1,2,1]   time, [10,1,3]   …
Reduce output: come, 6   has, 8   the, 4   time, 14   …
Don’t Do This
[Diagram: Daily, Weekly, and Monthly aggregates each computed directly from the Raw data, re-reading the big raw input three times]
Do This Instead
[Diagram: Raw → Daily → Weekly → Monthly, each aggregate computed from the shorter-term aggregate before it]
Aggregation
First rule:
– Don’t read the big input multiple times
– Compute longer-term aggregates from short-term aggregates
Second rule:
– Don’t read the big input multiple times
– Compute multiple windowed aggregates at the same time (sketch below)
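A toy sketch of both rules (plain Java; the hard-coded daily counts stand in for the daily aggregate files): one pass over the daily aggregates produces the weekly and monthly windows together, and the raw input is never re-read.

```java
import java.util.Map;
import java.util.TreeMap;

public class Rollup {
  public static void main(String[] args) {
    // dayIndex -> count; in practice these come from the daily aggregates, not raw data
    Map<Integer, Long> daily = new TreeMap<>();
    for (int day = 0; day < 60; day++) {
      daily.put(day, 100L + day);
    }

    Map<Integer, Long> weekly = new TreeMap<>();
    Map<Integer, Long> monthly = new TreeMap<>();
    for (Map.Entry<Integer, Long> e : daily.entrySet()) {
      weekly.merge(e.getKey() / 7, e.getValue(), Long::sum);    // seven daily sums make a week
      monthly.merge(e.getKey() / 30, e.getValue(), Long::sum);  // both windows in the same pass
    }
    System.out.println("weekly  = " + weekly);
    System.out.println("monthly = " + monthly);
  }
}
```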
Rank Statistics Can Be Tamed
Approximate quartiles are easily computed
– (but sorted data is evil)
Approximate unique counts are easily computed
– use a Bloom filter and extrapolate from the number of set bits
– use multiple filters at different down-sample rates
Approximate high or low quantiles are easily computed
– keep the largest 1000 elements (sketch below)
– keep the largest 1000 elements from 10x down-sampled data
– and so on
Approximate top-40 is also possible
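For the "keep the largest 1000 elements" idea, a minimal sketch (the stream and constants are invented for illustration): a bounded min-heap retains the top k of a stream in one pass with O(k) memory, and the smallest retained element approximates a high quantile.

```java
import java.util.PriorityQueue;
import java.util.Random;

public class TopK {
  public static void main(String[] args) {
    final int k = 1000;
    PriorityQueue<Double> heap = new PriorityQueue<>(k);   // min-heap of the current top k
    Random rand = new Random(42);
    long n = 1_000_000;
    for (long i = 0; i < n; i++) {
      double x = rand.nextDouble();
      if (heap.size() < k) {
        heap.add(x);
      } else if (x > heap.peek()) {
        heap.poll();                                       // evict the smallest of the top k
        heap.add(x);
      }
    }
    // the smallest retained value sits near the (1 - k/n) quantile, here the 99.9th percentile
    System.out.printf("approximate 99.9th percentile = %.6f%n", heap.peek());
  }
}
```

Running the same loop over 10x down-sampled data, as the slide suggests, extends the trick to lower quantiles.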
Recommendations
Common patterns in the past may predict common patterns in the future
People who bought item x also bought item y
But also, people who bought Chinese food in the past, …
Or people in SoMa really liked this restaurant in the past
People who bought …
Key operation is counting the number of people who bought x and y
– for all x’s and all y’s
The raw problem appears to be O(N³)
At the least, it is O(k_max²)
– for the most prolific user, there are k² pairs to count
– k_max can be near N
Scalable problems must be O(N)
But …
What do we learn from users who buy everything?
– they have no discrimination
– they are often the QA team
– they tell us nothing
What do we learn from items bought by everybody?
– the dual of omnivorous buyers
– these are often teaser items
– they tell us nothing
Also …
What would you learn about a user from purchases
– 1 … 20?
– 21 … 100?
– 101 … 1000?
– 1001 … ∞?
What about learning about an item?
– how many people do we need to see before we understand the item?
So …
Cheat!
Downsample every user to at most 1000 interactions (sketch below)
– most recent
– most rare
– random selection
– whatever is easiest
Now k_max ≤ 1000
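A minimal sketch of the "random selection: whatever is easiest" option (the cap of 1000 comes from the slide; the synthetic history is invented):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Downsample {
  private static final int MAX_INTERACTIONS = 1000;

  static List<String> downsample(List<String> items, Random rand) {
    if (items.size() <= MAX_INTERACTIONS) {
      return items;                                  // most users are untouched
    }
    List<String> copy = new ArrayList<>(items);
    Collections.shuffle(copy, rand);                 // random selection: whatever is easiest
    return copy.subList(0, MAX_INTERACTIONS);
  }

  public static void main(String[] args) {
    Random rand = new Random(42);
    List<String> history = new ArrayList<>();
    for (int i = 0; i < 50_000; i++) {               // a QA-team-sized purchase history
      history.add("item-" + rand.nextInt(10_000));
    }
    List<String> sample = downsample(history, rand);
    System.out.println("kept " + sample.size() + " of " + history.size() + " interactions");
  }
}
```

After the cap, the worst user contributes at most 1000² cooccurrence pairs instead of k_max².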
The Fundamental Things Apply
Don’t read the raw data repeatedly
Sessionize and denormalize per hour/day/week
– that is, group by user
– expand items with categories and content descriptors if feasible
Feed all down-stream processing in one pass
– baby join to item characteristics
– downsample
– count grand totals
– compute cooccurrences
Deployment Matters, Too
For the restaurant case, basic recommendation info includes:
– user × merchant histories
– user × cuisine histories
– top local restaurants by anomalous repeat visits
– restaurant × indicator-merchant cooccurrence matrix
– restaurant × indicator-cuisine cooccurrence matrix
These can all be stored and accessed using text retrieval techniques
Fast deployment using mirrors and NFS (not standard Hadoop)
Non-Traditional Deployment Demo
DEMO
EM Algorithms
Start with random model estimates
Use the model estimates to classify examples
Use the classified examples to find maximum likelihood estimates
Use the model estimates to classify examples
Use the classified examples to find maximum likelihood estimates
… and so on …
K-means as EM Algorithm
Assign a random seed to each cluster
Assign points to the nearest cluster
Move each cluster to the average of its contained points
Assign points to the nearest cluster
… and so on …
K-means as Map-Reduce
Assignment of points to cluster is trivially parallel
Computation of new clusters is also parallel
Moving clusters to the averages of their points is ideal for map-reduce (see the sketch below)
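To make the map-reduce shape concrete, a single-process k-means sketch (not the talk's code): the "map" assigns each point to its nearest centroid, and the "reduce" averages the assigned points. The full scan per iteration in main is exactly the cost the next slide complains about.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansStep {
  static double dist2(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  static double[][] iterate(double[][] points, double[][] centroids) {
    int k = centroids.length, d = centroids[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];
    for (double[] p : points) {              // "map": assign point to the nearest centroid
      int best = 0;
      for (int c = 1; c < k; c++) {
        if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
      }
      counts[best]++;                        // "combine": accumulate partial sums and counts
      for (int i = 0; i < d; i++) sums[best][i] += p[i];
    }
    for (int c = 0; c < k; c++) {            // "reduce": new centroid = mean of its points
      if (counts[c] > 0) {
        for (int i = 0; i < d; i++) sums[c][i] /= counts[c];
      } else {
        sums[c] = centroids[c].clone();      // leave an empty cluster where it was
      }
    }
    return sums;
  }

  public static void main(String[] args) {
    Random rand = new Random(42);
    double[][] points = new double[1000][2];
    for (double[] p : points) { p[0] = rand.nextGaussian(); p[1] = rand.nextGaussian(); }
    double[][] centroids = { points[0].clone(), points[1].clone(), points[2].clone() };
    for (int iter = 0; iter < 10; iter++) {  // every iteration re-reads all the data
      centroids = iterate(points, centroids);
    }
    System.out.println(Arrays.deepToString(centroids));
  }
}
```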
But …
With map-reduce, iteration is evil
Starting a program can take 10-30s
Saving data to disk and then immediately reading from disk is silly
Input might even fit in cluster memory
Fix #1
Don’t do that! Use Spark
– in-memory, interactive map-reduce
– 100x to 1000x faster
– must fit in memory
Use Giraph
– BSP programming model rather than map-reduce
– essentially map-reduce-reduce-reduce…
Use GraphLab
– like BSP without the speed brakes
– 100x faster
Fix #2
Use a sketch-based algorithm
Do one pass over the data to compute sketch of the data
Cluster the sketch
Done, with good theoretical bounds on accuracy
Speedup of 3000x or more
An Example
The Problem
Spirals are a classic “counter” example for k-means
Classic low-dimensional manifold with added noise
But clustering still makes modeling work well
An Example
The Cluster Proximity Features
Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– significant error that can be decreased (to a point) by increasing the number of clusters
Or by the proximity to the 2 nearest clusters (2 × 4.3 bits + 1 sign bit + 2 proximities; sketch below)
– error is negligible
– unwinds the data into a simple representation
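A hedged sketch of the proximity encoding (my reading of the slide; the sign bit, which would record which side of the two centroids the point falls on, is omitted here):

```java
public class ProximityFeatures {
  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }

  // encode a point as {nearest cluster id, second id, distance to each}
  static double[] encode(double[] p, double[][] centroids) {
    int best = -1, second = -1;
    double bestD = Double.MAX_VALUE, secondD = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double d = distance(p, centroids[c]);
      if (d < bestD) {
        second = best; secondD = bestD;      // old winner becomes the runner-up
        best = c; bestD = d;
      } else if (d < secondD) {
        second = c; secondD = d;
      }
    }
    return new double[] { best, second, bestD, secondD };
  }

  public static void main(String[] args) {
    double[][] centroids = { {0, 0}, {5, 0}, {0, 5}, {5, 5} };
    double[] f = encode(new double[] {1, 1}, centroids);
    System.out.printf("nearest=%d second=%d d1=%.2f d2=%.2f%n",
        (int) f[0], (int) f[1], f[2], f[3]);
  }
}
```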
Lots of Clusters Are Fine
Surrogate Method
Start with sloppy clustering into κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Cluster the surrogate data using ball k-means
Results are provably good for highly clusterable data
The sloppy clustering is on-line
The surrogate can be kept in memory
The ball k-means pass can be done at any time
Algorithm Costs
O(k d) per point per iteration for Lloyd’s algorithm
Number of iterations not well known
More than log n iterations is a reasonable assumption
– so O(k d log n) per point overall
Algorithm Costs
Surrogate methods
– fast, sloppy, single-pass clustering with κ = k log n
– fast, sloppy search for the nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of the κ weighted centroids:
  O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
  O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice
Algorithm Costs
How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– Lloyd’s: k d log n = 2000 × 10 × 26 ≈ 500,000 operations per point
– sketch: d (log k + log log n) = 10 × (11 + 5) = 160 operations per point
– roughly 3,000 times faster is a bona fide big deal
Pragmatics
But this requires a fast search internally
Have to cluster on the fly for the sketch
Have to guarantee sketch quality
Previous methods had very high complexity
How It Works
For each point (sketch below):
– find the approximately nearest centroid (distance = d)
– if d > threshold, create a new centroid
– else if u > d/threshold (u a random draw), create a new cluster
– else add the point to the nearest centroid
If the number of centroids exceeds κ ≈ C log N:
– recursively cluster the centroids with a higher threshold
The result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
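A simplified, single-machine reading of the rule above (assumptions: u is a uniform random draw, so far-away points seed new centroids with probability proportional to their distance, and a linear scan stands in for the fast approximate search the previous slide calls for):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketch {
  static final Random RAND = new Random(42);

  static class Centroid {
    final double[] mean;
    double weight;
    Centroid(double[] p, double w) { mean = p.clone(); weight = w; }
    void add(double[] p, double w) {           // weighted running mean update
      weight += w;
      for (int i = 0; i < mean.length; i++) {
        mean[i] += (p[i] - mean[i]) * w / weight;
      }
    }
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return Math.sqrt(s);
  }

  static List<Centroid> sketch(List<double[]> points, int kappa, double threshold) {
    List<Centroid> centroids = new ArrayList<>();
    for (double[] p : points) {
      Centroid nearest = null;                 // linear scan; real code uses a fast
      double d = Double.MAX_VALUE;             // approximate nearest-centroid search
      for (Centroid c : centroids) {
        double dc = distance(p, c.mean);
        if (dc < d) { d = dc; nearest = c; }
      }
      // far points, and near points with probability ~ d/threshold, seed new centroids
      if (nearest == null || d > threshold || RAND.nextDouble() < d / threshold) {
        centroids.add(new Centroid(p, 1));
      } else {
        nearest.add(p, 1);
      }
      if (centroids.size() > kappa) {          // too many centroids: raise the
        threshold *= 1.5;                      // threshold and re-cluster them
        List<Centroid> old = centroids;
        centroids = new ArrayList<>();
        for (Centroid c : old) {               // same rule, applied to weighted centroids
          Centroid near = null;
          double dd = Double.MAX_VALUE;
          for (Centroid c2 : centroids) {
            double dc = distance(c.mean, c2.mean);
            if (dc < dd) { dd = dc; near = c2; }
          }
          if (near == null || dd > threshold || RAND.nextDouble() < dd / threshold) {
            centroids.add(new Centroid(c.mean, c.weight));
          } else {
            near.add(c.mean, c.weight);
          }
        }
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    List<double[]> points = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      int cluster = RAND.nextInt(10);
      points.add(new double[] { cluster * 10 + RAND.nextGaussian(),
                                cluster * 3 + RAND.nextGaussian() });
    }
    List<Centroid> result = sketch(points, 200, 1.0);
    System.out.println(result.size() + " weighted centroids summarize 100000 points");
  }
}
```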
Matrix Decomposition
Big matrices can often be compressed
Often used in recommendations
[Diagram: a large matrix approximated as the product of much smaller factor matrices]
Nearest Neighbor
Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy
Fast search algorithms work up to dimension 50–100, but not above that
Random Projections
Many problems in high dimension can be reduced to low dimension
Reductions with good distance approximation are available
Surprisingly, these methods can be done using random vectors
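A Johnson-Lindenstrauss-style sketch of that claim (the dimensions and counts are invented): project 1000-dimensional vectors down to 50 with a random Gaussian matrix and measure the worst pairwise distance distortion. Scaling by 1/sqrt(k) makes squared distances unbiased.

```java
import java.util.Random;

public class RandomProjection {
  public static void main(String[] args) {
    Random rand = new Random(42);
    int d = 1000, k = 50, n = 100;
    double[][] proj = new double[k][d];
    for (double[] row : proj) {                       // random Gaussian projection matrix
      for (int j = 0; j < d; j++) row[j] = rand.nextGaussian() / Math.sqrt(k);
    }
    double[][] x = new double[n][d];
    for (double[] v : x) for (int j = 0; j < d; j++) v[j] = rand.nextGaussian();

    double[][] y = new double[n][k];                  // project every vector to k dims
    for (int i = 0; i < n; i++) {
      for (int r = 0; r < k; r++) {
        double s = 0;
        for (int j = 0; j < d; j++) s += proj[r][j] * x[i][j];
        y[i][r] = s;
      }
    }
    double maxErr = 0;                                // worst relative distance distortion
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        double ratio = dist(y[i], y[j]) / dist(x[i], x[j]);
        maxErr = Math.max(maxErr, Math.abs(1 - ratio));
      }
    }
    System.out.printf("worst relative distance distortion: %.3f%n", maxErr);
  }

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
    return Math.sqrt(s);
  }
}
```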
Fundamental Trick
Random orthogonal projection preserves action of A
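The slide's formula did not survive transcription. A standard statement of this trick, in the style of randomized matrix decomposition (Halko, Martinsson, and Tropp), is shown below; it may not be the exact equation from the slide:

```latex
% A Gaussian test matrix \Omega samples the column space of A; the
% orthonormal basis Q of the sample then nearly reproduces the action of A.
\Omega \in \mathbb{R}^{n \times (k+p)}, \quad \Omega_{ij} \sim \mathcal{N}(0,1),
\qquad Y = A\,\Omega, \qquad Q = \operatorname{orth}(Y),
\qquad A \approx Q\,Q^{\top} A .
```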
Projection Search
[Figure: points projected onto a random direction; the projected values induce a total ordering!]
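A sketch of projection search under assumed details: sort every vector by its projection onto one random direction, binary-search the query's projection, and compute exact distances only in a small window around that position. Real implementations use several projections and merge the candidate sets.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class ProjectionSearch {
  public static void main(String[] args) {
    Random rand = new Random(42);
    int n = 10_000, d = 20;
    double[][] data = new double[n][d];
    for (double[] v : data) for (int j = 0; j < d; j++) v[j] = rand.nextGaussian();

    double[] direction = new double[d];              // one random projection direction
    for (int j = 0; j < d; j++) direction[j] = rand.nextGaussian();

    double[] proj = new double[n];
    Integer[] order = new Integer[n];                // indices sorted by projection value
    for (int i = 0; i < n; i++) {
      order[i] = i;
      for (int j = 0; j < d; j++) proj[i] += direction[j] * data[i][j];
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> proj[i]));

    double[] q = data[123];                          // query: a point from the data set
    double qp = 0;
    for (int j = 0; j < d; j++) qp += direction[j] * q[j];
    int lo = 0, hi = n;
    while (lo < hi) {                                // binary search in projection order
      int mid = (lo + hi) / 2;
      if (proj[order[mid]] < qp) lo = mid + 1; else hi = mid;
    }
    int from = Math.max(0, lo - 50), to = Math.min(n, lo + 50);
    int best = -1;
    double bestD = Double.MAX_VALUE;                 // exact distances only in the window
    for (int i = from; i < to; i++) {
      double s = 0;
      double[] v = data[order[i]];
      for (int j = 0; j < d; j++) { double t = v[j] - q[j]; s += t * t; }
      if (s < bestD) { bestD = s; best = order[i]; }
    }
    System.out.println("nearest candidate to point 123 is point " + best);
  }
}
```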
LSH Bit-match Versus Cosine
But How?
Summary
Don’t repeat big scans
– cascade aggregations
– compute several aggregates at once
Use approximate measures for rank statistics
Downsample where appropriate
Use non-traditional deployment
Use sketches
Use random projections
Contact Me!
We’re hiring at MapR in US and Europe
Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12
Get the code at https://github.com/tdunning
Contact me at [email protected] or @ted_dunning