CMU Lecture on Hadoop Performance


DESCRIPTION

A lecture by Ted Dunning describing several ways to make Hadoop programs go faster.

TRANSCRIPT

1©MapR Technologies - Confidential

Hadoop Performance

2©MapR Technologies - Confidential

Agenda

What is performance? Optimization?
Case 1: Aggregation
Case 2: Recommendations
Case 3: Clustering
Case 4: Matrix decomposition

3©MapR Technologies - Confidential

What is Performance?

Is doing something faster better?

Is it the right task?

Do you have a wide enough view?

What is the right performance metric?

4©MapR Technologies - Confidential

Aggregation

Word-count and friends– How many times did X occur?– How many unique X’s occurred?

Associative metrics permit decomposition– Partial sums and grand totals for example– Use combiners– Use high resolution aggregates to compute low resolution aggregates

Rank-based statistics do not permit decomposition– Avoid them– Use approximations

5©MapR Technologies - Confidential

Inside Map-Reduce

[Diagram: Input → Map → Combine → Shuffle and sort → Reduce → Output, illustrated with word counting. The input line “The time has come,” the Walrus said, “To talk of many things: Of shoes—and ships—and sealing-wax” is mapped to pairs (the, 1), (time, 1), (has, 1), (come, 1), …; the shuffle groups partial counts such as come, [3, 2, 1] and has, [1, 5, 2]; the reduce step emits totals such as come, 6; has, 8; the, 4; time, 14.]
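The combine step in this diagram is what makes associative aggregates cheap: partial sums are computed map-side before anything crosses the network. A minimal sketch of the standard Hadoop word count, with the reducer reused as the combiner; this is illustrative boilerplate, not code from the talk.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Summing is associative, so the same class works as combiner and reducer.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // map-side partial sums before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```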

6©MapR Technologies - Confidential

Don’t Do This

[Diagram: Raw → Daily, Raw → Weekly, Raw → Monthly; each aggregate is computed by re-reading the raw data.]

7©MapR Technologies - Confidential

Do This Instead

[Diagram: Raw → Daily → Weekly → Monthly; each longer-term aggregate is computed from the shorter-term one.]

8©MapR Technologies - Confidential

Aggregation

First rule:– Don’t read the big input multiple times– Compute longer term aggregates from short term aggregates

Second rule:– Don’t read the big input multiple times– Compute multiple windowed aggregates at the same time
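A minimal plain-Java sketch of the cascade: read the raw events once to build daily totals, then derive weekly and monthly totals from the daily ones instead of re-reading the raw input. The Event record and field names are illustrative assumptions.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.TemporalAdjusters;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CascadedAggregates {
  // Illustrative raw event: a timestamped count.
  record Event(LocalDate date, long count) {}

  public static void main(String[] args) {
    List<Event> rawEvents = List.of(
        new Event(LocalDate.of(2012, 10, 1), 3),
        new Event(LocalDate.of(2012, 10, 1), 2),
        new Event(LocalDate.of(2012, 10, 8), 7),
        new Event(LocalDate.of(2012, 11, 2), 1));

    // One pass over the raw data: daily totals.
    Map<LocalDate, Long> daily = new TreeMap<>();
    for (Event e : rawEvents) {
      daily.merge(e.date(), e.count(), Long::sum);
    }

    // Weekly and monthly totals come from the daily aggregates, not from raw data.
    Map<LocalDate, Long> weekly = new TreeMap<>();
    Map<YearMonth, Long> monthly = new TreeMap<>();
    for (Map.Entry<LocalDate, Long> d : daily.entrySet()) {
      LocalDate weekStart = d.getKey().with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
      weekly.merge(weekStart, d.getValue(), Long::sum);
      monthly.merge(YearMonth.from(d.getKey()), d.getValue(), Long::sum);
    }

    System.out.println("daily   = " + daily);
    System.out.println("weekly  = " + weekly);
    System.out.println("monthly = " + monthly);
  }
}
```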

9©MapR Technologies - Confidential

Rank Statistics Can Be Tamed

Approximate quartiles are easily computed– (but sorted data is evil)

Approximate unique counts are easily computed– use Bloom filter and extrapolate from number of set bits– use multiple filters at different down-sample rates

Approximate high or low quantiles are easily computed– keep the largest 1000 elements– keep the largest 1000 elements from 10x down-sampled data– and so on

Approximate top-40 also possible
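A small sketch of the Bloom-filter idea from the bullet above: insert items into a bit array and extrapolate the unique count from the fraction of set bits, using the standard estimate n ≈ -(m/k) ln(1 - X/m) for m bits, k hash functions, and X set bits. The hash construction and sizes here are illustrative assumptions.

```java
import java.util.BitSet;

public class BloomCardinality {
  private final BitSet bits;
  private final int m;   // number of bits
  private final int k;   // number of hash functions

  BloomCardinality(int m, int k) {
    this.m = m;
    this.k = k;
    this.bits = new BitSet(m);
  }

  void add(String item) {
    // Derive k bit positions from two base hashes (double hashing).
    int h1 = item.hashCode();
    int h2 = Integer.reverse(h1) ^ 0x5bd1e995;
    for (int i = 0; i < k; i++) {
      bits.set(Math.floorMod(h1 + i * h2, m));
    }
  }

  double estimatedUniqueCount() {
    // Extrapolate the number of distinct insertions from the number of set bits.
    int x = bits.cardinality();
    return -((double) m / k) * Math.log(1.0 - (double) x / m);
  }

  public static void main(String[] args) {
    BloomCardinality sketch = new BloomCardinality(1 << 20, 4);
    for (int i = 0; i < 100_000; i++) {
      sketch.add("user-" + (i % 60_000));   // 60,000 distinct values
    }
    System.out.printf("estimated uniques ~ %.0f%n", sketch.estimatedUniqueCount());
  }
}
```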

10©MapR Technologies - Confidential

Recommendations

Common patterns in the past may predict common patterns in the future

People who bought item x also bought item y

But also, people who bought Chinese food in the past, …

Or people in SoMa really liked this restaurant in the past

11©MapR Technologies - Confidential

People who bought …

Key operation is counting number of people who bought x and y– for all x’s and all y’s

The raw problem appears to be O(N^3)

At the least, O(k_max^2)– for the most prolific user, there are k_max^2 pairs to count– k_max can be near N

Scalable problems must be O(N)
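To make the cost concrete, here is a tiny plain-Java sketch of the core counting step: every pair of items in a user's history gets a count, which is why a user with k items contributes k(k-1)/2 ≈ k² pairs. The data layout and names are illustrative, not from the lecture.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceCount {
  public static void main(String[] args) {
    // Each inner list is the set of items one user interacted with (illustrative data).
    List<List<String>> userHistories = List.of(
        List.of("x", "y", "z"),
        List.of("x", "y"),
        List.of("y", "z"));

    Map<String, Integer> pairCounts = new HashMap<>();
    for (List<String> items : userHistories) {
      // k items produce k * (k - 1) / 2 unordered pairs; this loop is the O(k_max^2) part.
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          String a = items.get(i);
          String b = items.get(j);
          String key = a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
          pairCounts.merge(key, 1, Integer::sum);
        }
      }
    }
    // e.g. x,y -> 2; x,z -> 1; y,z -> 2
    pairCounts.forEach((pair, n) -> System.out.println(pair.replace('\t', ',') + " -> " + n));
  }
}
```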

12©MapR Technologies - Confidential

But …

What do we learn from users who buy everything– they have no discrimination– they are often the QA team– they tell us nothing

What do we learn from items bought by everybody– the dual of omnivorous buyers– these are often teaser items– they tell us nothing

13©MapR Technologies - Confidential

Also …

What would you learn about a user from purchases– 1 … 20?– 21 … 100?– 101 … 1000?– 1001 … ∞?

What about learning about an item?– how many people do we need to see before we understand the item?

14©MapR Technologies - Confidential

So …

Cheat!

Downsample every user to at most 1000 interactions– most recent– most rare– random selection– whatever is easiest

Now k_max ≤ 1000
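One easy way to impose the cap is reservoir sampling: a single pass over a user's interactions keeps a uniform random sample of at most 1000 of them. A minimal sketch; the limit and method names are illustrative choices, and keeping the most recent or rarest interactions would be drop-in alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UserDownsampler {
  private static final int MAX_INTERACTIONS = 1000;

  /** Returns a uniform random sample of at most MAX_INTERACTIONS items in one pass. */
  static <T> List<T> downsample(Iterable<T> interactions, Random rand) {
    List<T> reservoir = new ArrayList<>(MAX_INTERACTIONS);
    long seen = 0;
    for (T item : interactions) {
      seen++;
      if (reservoir.size() < MAX_INTERACTIONS) {
        reservoir.add(item);
      } else {
        // Replace an existing element with probability MAX_INTERACTIONS / seen.
        long slot = (long) (rand.nextDouble() * seen);
        if (slot < MAX_INTERACTIONS) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> history = new ArrayList<>();
    for (int i = 0; i < 50_000; i++) {
      history.add(i);
    }
    List<Integer> sample = downsample(history, new Random(42));
    System.out.println("kept " + sample.size() + " of " + history.size() + " interactions");
  }
}
```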

15©MapR Technologies - Confidential

The Fundamental Things Apply

Don’t read the raw data repeatedly

Sessionize and denormalize per hour/day/week– that is, group by user– expand items with categories and content descriptors if feasible

Feed all down-stream processing in one pass– baby join to item characteristics – downsample– count grand totals– compute cooccurrences

16©MapR Technologies - Confidential

Deployment Matters, Too

For restaurant case, basic recommendation info includes:– user x merchant histories– user x cuisine histories– top local restaurant by anomalous repeat visits– restaurant x indicator merchant cooccurrence matrix– restaurant x indicator cuisine cooccurrence matrix

These can all be stored and accessed using text retrieval techniques

Fast deployment using mirrors and NFS (not standard Hadoop)

17©MapR Technologies - Confidential

Non-Traditional Deployment Demo

DEMO

18©MapR Technologies - Confidential

EM Algorithms

Start with random model estimates
Use model estimates to classify examples
Use classified examples to find maximum-probability model estimates
Use model estimates to classify examples
Use classified examples to find maximum-probability model estimates
… and so on …

19©MapR Technologies - Confidential

K-means as EM Algorithm

Assign a random seed to each cluster
Assign points to nearest cluster
Move cluster to average of contained points
Assign points to nearest cluster

… and so on …
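A compact single-machine sketch of the iteration just described (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat. Dimensions, counts, and names are illustrative.

```java
import java.util.Arrays;
import java.util.Random;

public class LloydKMeans {
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static double[][] cluster(double[][] points, int k, int iterations, Random rand) {
    int d = points[0].length;
    // Seed each cluster with a randomly chosen point.
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[rand.nextInt(points.length)].clone();
    }
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][d];
      int[] counts = new int[k];
      // Assignment step: each point goes to its nearest centroid.
      for (double[] p : points) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = squaredDistance(p, centroids[c]);
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        counts[best]++;
        for (int i = 0; i < d; i++) {
          sums[best][i] += p[i];
        }
      }
      // Update step: move each centroid to the mean of its points.
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int i = 0; i < d; i++) {
            centroids[c][i] = sums[c][i] / counts[c];
          }
        }
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    Random rand = new Random(1);
    double[][] points = new double[500][2];
    for (double[] p : points) {
      p[0] = rand.nextGaussian() + (rand.nextBoolean() ? 5 : 0);
      p[1] = rand.nextGaussian();
    }
    System.out.println(Arrays.deepToString(cluster(points, 2, 10, rand)));
  }
}
```

The assignment step is independent per point and the update step is a grouped average, which is why the next slide maps the iteration so directly onto map-reduce.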

20©MapR Technologies - Confidential

K-means as Map-Reduce

Assignment of points to cluster is trivially parallel

Computation of new clusters is also parallel

Moving points to averages is ideal for map-reduce

21©MapR Technologies - Confidential

But …

With map-reduce, iteration is evil

Starting a program can take 10-30s

Saving data to disk and then immediately reading from disk is silly

Input might even fit in cluster memory

22©MapR Technologies - Confidential

Fix #1

Don’t do that! Use Spark– in memory interactive map-reduce– 100x to 1000x faster– must fit in memory

Use Giraph– BSP programming model rather than map-reduce– essentially map-reduce-reduce-reduce…

Use GraphLab– Like BSP without the speed brakes– 100x faster

23©MapR Technologies - Confidential

Fix #2

Use a sketch-based algorithm

Do one pass over the data to compute a sketch of the data

Cluster the sketch

Done, with good theoretical bounds on accuracy

Speedup of 3000x or more

24©MapR Technologies - Confidential

An Example

25©MapR Technologies - Confidential

The Problem

Spirals are a classic “counter” example for k-means
Classic low dimensional manifold with added noise

But clustering still makes modeling work well

26©MapR Technologies - Confidential

An Example

27©MapR Technologies - Confidential

An Example

28©MapR Technologies - Confidential

The Cluster Proximity Features

Every point can be described by the nearest cluster– 4.3 bits per point in this case– Significant error that can be decreased (to a point) by increasing number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)– Error is negligible– Unwinds the data into a simple representation
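A small sketch of how such features could be computed: for each point, find the two nearest centroids and emit their ids, the two proximities, and a sign bit. The feature layout, and in particular the choice of sign (which side of the nearest centroid the point falls on relative to the second), are illustrative assumptions rather than the lecture's exact definition.

```java
public class ClusterProximityFeatures {
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Encodes a point as (nearest id, second id, two proximities, sign bit). */
  static double[] encode(double[] point, double[][] centroids) {
    int first = 0, second = 1;
    double d1 = Double.MAX_VALUE, d2 = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double d = distance(point, centroids[c]);
      if (d < d1) {
        second = first; d2 = d1;
        first = c; d1 = d;
      } else if (d < d2) {
        second = c; d2 = d;
      }
    }
    // Illustrative sign feature: does the point lie toward the second-nearest
    // centroid (positive) or away from it (negative)?
    double dot = 0;
    for (int i = 0; i < point.length; i++) {
      dot += (point[i] - centroids[first][i]) * (centroids[second][i] - centroids[first][i]);
    }
    return new double[] {first, second, d1, d2, dot >= 0 ? 1 : -1};
  }

  public static void main(String[] args) {
    double[][] centroids = {{0, 0}, {4, 0}, {0, 4}};
    double[] features = encode(new double[] {1.0, 0.5}, centroids);
    System.out.println(java.util.Arrays.toString(features));
  }
}
```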

29©MapR Technologies - Confidential

Lots of Clusters Are Fine

30©MapR Technologies - Confidential

Surrogate Method

Start with sloppy clustering into κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Cluster surrogate data using ball k-means
Results are provably good for highly clusterable data
Sloppy clustering is on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time

31©MapR Technologies - Confidential

Algorithm Costs

O(k d log n) per point per iteration for Lloyd’s algorithm
Number of iterations not well known
Iterations > log n is a reasonable assumption

32©MapR Technologies - Confidential

Algorithm Costs

Surrogate methods– fast, sloppy single-pass clustering with κ = k log n– fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point– fast, in-memory, high-quality clustering of κ weighted centroids

O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log κ log k) for larger k, looser quality

– result is k high-quality centroids
• Even the sloppy clusters may suffice

33©MapR Technologies - Confidential

Algorithm Costs

How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 500,000
– log k + log log n = 11 + 5 = 17
– 30,000 times faster is a bona fide big deal

34©MapR Technologies - Confidential

Pragmatics

But this requires a fast search internally
Have to cluster on the fly for sketch
Have to guarantee sketch quality
Previous methods had very high complexity

35©MapR Technologies - Confidential

How It Works

For each point– Find approximately nearest centroid (distance = d)– If (d > threshold) new centroid– Else if (u > d/threshold) new cluster– Else add to nearest centroid

If centroids > κ ≈ C log N– Recursively cluster centroids with higher threshold

Result is large set of centroids– these provide approximation of original distribution– we can cluster centroids to get a close approximation of clustering original– or we can just use the result directly
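A minimal single-machine sketch in the spirit of the steps above. A point either becomes a new centroid (always when it is farther than the threshold, and otherwise with probability proportional to d/threshold, one common convention for this style of online clustering) or is folded into its nearest centroid as a weighted average; when more than κ centroids accumulate, the threshold is raised and the existing centroids are re-clustered. Threshold values, the growth factor, and names are illustrative assumptions, not the lecture's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketchClusterer {
  // A centroid is the weighted mean of the points folded into it.
  static final class Centroid {
    double[] mean;
    double weight;
    Centroid(double[] point, double weight) {
      this.mean = point.clone();
      this.weight = weight;
    }
    void add(double[] point, double w) {
      for (int i = 0; i < mean.length; i++) {
        mean[i] = (mean[i] * weight + point[i] * w) / (weight + w);
      }
      weight += w;
    }
  }

  private final int kappa;      // target sketch size, roughly k log n
  private double threshold;     // distance scale controlling centroid creation
  private final Random rand = new Random();
  private List<Centroid> centroids = new ArrayList<>();

  StreamingSketchClusterer(int kappa, double initialThreshold) {
    this.kappa = kappa;
    this.threshold = initialThreshold;
  }

  void add(double[] point, double weight) {
    Centroid nearest = null;
    double d = Double.MAX_VALUE;
    for (Centroid c : centroids) {
      double dist = distance(point, c.mean);
      if (dist < d) {
        d = dist;
        nearest = c;
      }
    }
    // New centroid if the point is far away, or with probability ~ d / threshold.
    if (nearest == null || d > threshold || rand.nextDouble() < d / threshold) {
      centroids.add(new Centroid(point, weight));
    } else {
      nearest.add(point, weight);
    }
    if (centroids.size() > kappa) {
      collapse();
    }
  }

  /** Raise the threshold and re-cluster the current centroids into a smaller sketch. */
  private void collapse() {
    threshold *= 1.5;
    List<Centroid> old = centroids;
    centroids = new ArrayList<>();
    for (Centroid c : old) {
      add(c.mean, c.weight);
    }
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    StreamingSketchClusterer sketch = new StreamingSketchClusterer(200, 0.5);
    Random r = new Random(7);
    for (int i = 0; i < 100_000; i++) {
      double cx = (i % 5) * 10;   // five well-separated groups
      sketch.add(new double[] {cx + r.nextGaussian(), r.nextGaussian()}, 1);
    }
    System.out.println("sketch size: " + sketch.centroids.size());
  }
}
```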

36©MapR Technologies - Confidential

Matrix Decomposition

Big matrices can often be compressed

Often used in recommendations

[Diagram: a large matrix expressed as the product of two much smaller matrices]

37©MapR Technologies - Confidential

Nearest Neighbor

Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy

Fast search algorithms work up to dimension 50-100, but don't work above that

38©MapR Technologies - Confidential

Random Projections

Many problems in high dimension can be reduced to low dimension

Reductions with good distance approximation are available

Surprisingly, these methods can be done using random vectors
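A small sketch of the trick: multiply each high-dimensional vector by a fixed random Gaussian matrix scaled by 1/sqrt(k) to get a k-dimensional vector whose pairwise distances are approximately preserved (the Johnson-Lindenstrauss style of reduction the slide alludes to). Sizes and the scaling choice are illustrative.

```java
import java.util.Random;

public class RandomProjection {
  /** Builds a k x d random Gaussian projection matrix scaled by 1/sqrt(k). */
  static double[][] projectionMatrix(int k, int d, Random rand) {
    double[][] r = new double[k][d];
    double scale = 1.0 / Math.sqrt(k);
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < d; j++) {
        r[i][j] = rand.nextGaussian() * scale;
      }
    }
    return r;
  }

  static double[] project(double[] x, double[][] r) {
    double[] y = new double[r.length];
    for (int i = 0; i < r.length; i++) {
      double sum = 0;
      for (int j = 0; j < x.length; j++) {
        sum += r[i][j] * x[j];
      }
      y[i] = sum;
    }
    return y;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    Random rand = new Random(0);
    int d = 10_000, k = 100;
    double[][] r = projectionMatrix(k, d, rand);

    // Two random high-dimensional points; their distance should survive projection.
    double[] a = new double[d], b = new double[d];
    for (int i = 0; i < d; i++) {
      a[i] = rand.nextGaussian();
      b[i] = rand.nextGaussian();
    }
    System.out.printf("original distance  %.2f%n", distance(a, b));
    System.out.printf("projected distance %.2f%n", distance(project(a, r), project(b, r)));
  }
}
```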

39©MapR Technologies - Confidential

Fundamental Trick

Random orthogonal projection preserves action of A
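One standard way to make this concrete (stated here as background, not as the lecture's exact formulation): take a random Gaussian test matrix Ω, form Y = AΩ, and let Q be an orthonormal basis for the columns of Y; then A ≈ Q Qᵀ A. The small matrix Qᵀ A can be decomposed cheaply, and that yields an approximate decomposition of A itself.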

40©MapR Technologies - Confidential

Projection Search

[Diagram: projecting the points onto one direction gives a total ordering of the points along that direction]
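A tiny sketch of how that ordering is used: project every point onto one random direction, sort by projected value, then examine only the points whose projections fall near the query's projection. The window size and names are illustrative; practical systems combine several projections.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class ProjectionSearch {
  static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    Random rand = new Random(3);
    int n = 10_000, d = 50;
    double[][] points = new double[n][d];
    for (double[] p : points) {
      for (int i = 0; i < d; i++) {
        p[i] = rand.nextGaussian();
      }
    }

    // One random direction induces a total ordering of all points.
    double[] direction = new double[d];
    for (int i = 0; i < d; i++) {
      direction[i] = rand.nextGaussian();
    }
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> dot(points[i], direction)));

    // To search, locate the query's projection and scan a small window around it.
    double[] query = points[1234];
    double q = dot(query, direction);
    int lo = 0, hi = n;
    while (lo < hi) {
      int mid = (lo + hi) / 2;
      if (dot(points[order[mid]], direction) < q) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int i = Math.max(0, lo - 100); i < Math.min(n, lo + 100); i++) {
      int candidate = order[i];
      double dist = 0;
      for (int j = 0; j < d; j++) {
        double diff = points[candidate][j] - query[j];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = candidate;
      }
    }
    System.out.println("approximate nearest neighbour of point 1234: " + best);
  }
}
```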

41©MapR Technologies - Confidential

LSH Bit-match Versus Cosine

42©MapR Technologies - Confidential

But How?

43©MapR Technologies - Confidential

Summary

Don’t repeat big scans– Cascade aggregations– Compute several aggregates at once

Use approximate measures for rank statistics
Downsample where appropriate
Use non-traditional deployment
Use sketches
Use random projections

44©MapR Technologies - Confidential

Contact Me!

We’re hiring at MapR in US and Europe

Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12

Get the code at https://github.com/tdunning

Contact me at tdunning@maprtech.com or @ted_dunning
