CMU Lecture on Hadoop Performance


DESCRIPTION

A lecture by Ted Dunning describing several ways to make Hadoop programs go faster.

TRANSCRIPT

Page 1: CMU Lecture on Hadoop Performance


Hadoop Performance

Page 2: CMU Lecture on Hadoop Performance


Agenda

What is performance? Optimization?
Case 1: Aggregation
Case 2: Recommendations
Case 3: Clustering
Case 4: Matrix decomposition

Page 3: CMU Lecture on Hadoop Performance


What is Performance?

Is doing something faster better?

Is it the right task?

Do you have a wide enough view?

What is the right performance metric?

Page 4: CMU Lecture on Hadoop Performance


Aggregation

Word-count and friends
– How many times did X occur?
– How many unique X’s occurred?

Associative metrics permit decomposition
– Partial sums and grand totals, for example
– Use combiners
– Use high resolution aggregates to compute low resolution aggregates

Rank-based statistics do not permit decomposition
– Avoid them
– Use approximations

Page 5: CMU Lecture on Hadoop Performance


Inside Map-Reduce

[Figure: word count as a map-reduce dataflow]

Input → Map → Combine → Shuffle and sort → Reduce → Output

Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax"

Map output: the, 1; time, 1; has, 1; come, 1; …

After combine, shuffle, and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …

Reduce output: come, 6; has, 8; the, 4; time, 14; …
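
Below is a minimal sketch of this job in the standard Hadoop Java API. The summing reducer doubles as the combiner, which is what produces the map-side partial sums shown in the figure; summing is associative, so combining early changes nothing in the final totals.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every token in the input
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts; the same class serves as the combiner,
      // computing partial sums before the shuffle
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the combine stage in the figure
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }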

Page 6: CMU Lecture on Hadoop Performance


Don’t Do This

[Figure: Raw → Daily, Raw → Weekly, Raw → Monthly (each aggregate computed directly from the raw data, which is scanned once per aggregate)]

Page 7: CMU Lecture on Hadoop Performance


Do This Instead

[Figure: Raw → Daily → Weekly → Monthly (the raw data is read once; each longer-term aggregate is computed from the shorter-term one)]

Page 8: CMU Lecture on Hadoop Performance


Aggregation

First rule:
– Don’t read the big input multiple times
– Compute longer term aggregates from short term aggregates

Second rule:
– Don’t read the big input multiple times
– Compute multiple windowed aggregates at the same time
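
A toy illustration of the cascade, with hypothetical in-memory arrays standing in for the hourly/daily/weekly/monthly jobs: because sums are associative, each coarser aggregate is computed from the next finer one, and the raw data is scanned exactly once.

    public class Cascade {
      public static void main(String[] args) {
        double[] hourly = new double[24 * 28];     // stand-in for the one raw scan
        for (int i = 0; i < hourly.length; i++) {
          hourly[i] = Math.random();               // hypothetical raw values
        }
        double[] daily = rollup(hourly, 24);       // 24 hours per day
        double[] weekly = rollup(daily, 7);        // 7 days per week
        double[] monthly = rollup(weekly, 4);      // ~4 weeks per month
        System.out.printf("%d days, %d weeks, %d months%n",
            daily.length, weekly.length, monthly.length);
      }

      // Sum groups of n consecutive values; rolling up partial sums
      // gives the same totals as re-reading the raw data would
      static double[] rollup(double[] fine, int n) {
        double[] coarse = new double[fine.length / n];
        for (int i = 0; i < coarse.length; i++) {
          for (int j = 0; j < n; j++) {
            coarse[i] += fine[i * n + j];
          }
        }
        return coarse;
      }
    }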

Page 9: CMU Lecture on Hadoop Performance


Rank Statistics Can Be Tamed

Approximate quartiles are easily computed
– (but sorted data is evil)

Approximate unique counts are easily computed
– use a Bloom filter and extrapolate from the number of set bits (sketched below)
– use multiple filters at different down-sample rates

Approximate high or low quantiles are easily computed
– keep the largest 1000 elements
– keep the largest 1000 elements from 10x down-sampled data
– and so on

Approximate top-40 is also possible
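
A sketch of the Bloom-filter trick for unique counts, using one hash function (k = 1) and arbitrary sizes, both assumptions here. Bit collisions are expected; the estimate inverts the expected fill rate 1 - exp(-n/m) to recover n.

    import java.util.BitSet;

    public class UniqueEstimate {
      public static void main(String[] args) {
        int m = 1 << 20;                     // bit-array size (arbitrary)
        BitSet bits = new BitSet(m);
        int trueUnique = 300_000;
        for (int i = 0; i < 3 * trueUnique; i++) {
          long item = i % trueUnique;        // every item occurs three times
          // multiply-shift hash: the top 20 bits of the mixed value pick a bit
          int bit = Long.hashCode(item * 0x9E3779B97F4A7C15L) >>> 12;
          bits.set(bit);
        }
        double fill = (double) bits.cardinality() / m;
        // with one hash, E[fill] = 1 - exp(-n/m), so invert to estimate n
        double estimate = -m * Math.log(1 - fill);
        System.out.printf("true uniques %d, estimated %.0f%n", trueUnique, estimate);
      }
    }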

Page 10: CMU Lecture on Hadoop Performance


Recommendations

Common patterns in the past may predict common patterns in the future

People who bought item x also bought item y

But also, people who bought Chinese food in the past, …

Or people in SoMa really liked this restaurant in the past

Page 11: CMU Lecture on Hadoop Performance


People who bought …

Key operation is counting the number of people who bought x and y
– for all x’s and all y’s

The raw problem appears to be O(N^3)

At the least, O(k_max^2)
– for the most prolific user, there are k^2 pairs to count
– k_max can be near N

Scalable problems must be O(N)
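
To make the k^2 term concrete, here is the inner counting step as a toy in-memory stand-in for what really runs as a map-reduce over user histories: each user with k items contributes k(k-1)/2 pairs, so one very prolific user can dominate the cost.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PairCount {
      static Map<String, Integer> countPairs(List<List<String>> userHistories) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> items : userHistories) {
          for (int i = 0; i < items.size(); i++) {       // k^2 work per user
            for (int j = i + 1; j < items.size(); j++) {
              String pair = items.get(i) + "\t" + items.get(j);
              counts.merge(pair, 1, Integer::sum);
            }
          }
        }
        return counts;
      }
    }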

Page 12: CMU Lecture on Hadoop Performance


But …

What do we learn from users who buy everything?
– they have no discrimination
– they are often the QA team
– they tell us nothing

What do we learn from items bought by everybody?
– the dual of omnivorous buyers
– these are often teaser items
– they tell us nothing

Page 13: CMU Lecture on Hadoop Performance


Also …

What would you learn about a user from purchases
– 1 … 20?
– 21 … 100?
– 101 … 1000?
– 1001 … ∞?

What about learning about an item?
– how many people do we need to see before we understand the item?

Page 14: CMU Lecture on Hadoop Performance


So …

Cheat!

Downsample every user to at most 1000 interactions
– most recent
– most rare
– random selection (sketched below)
– whatever is easiest

Now k_max ≤ 1000
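
The "random selection" option is a single pass of reservoir sampling; a minimal sketch, where LIMIT is the 1000-interaction cap:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class Downsample {
      static final int LIMIT = 1000;

      // Keep a uniform random sample of at most LIMIT interactions per
      // user, in one pass and with bounded memory
      static <T> List<T> sample(Iterable<T> interactions, Random rand) {
        List<T> reservoir = new ArrayList<>(LIMIT);
        int seen = 0;
        for (T item : interactions) {
          seen++;
          if (reservoir.size() < LIMIT) {
            reservoir.add(item);
          } else {
            int slot = rand.nextInt(seen);   // keep item with probability LIMIT/seen
            if (slot < LIMIT) {
              reservoir.set(slot, item);
            }
          }
        }
        return reservoir;   // now k <= LIMIT, so pair counting costs at most LIMIT^2
      }
    }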

Page 15: CMU Lecture on Hadoop Performance


The Fundamental Things Apply

Don’t read the raw data repeatedly

Sessionize and denormalize per hour/day/week
– that is, group by user
– expand items with categories and content descriptors if feasible

Feed all down-stream processing in one pass
– baby join to item characteristics
– downsample
– count grand totals
– compute cooccurrences

Page 16: CMU Lecture on Hadoop Performance


Deployment Matters, Too

For the restaurant case, basic recommendation info includes:
– user x merchant histories
– user x cuisine histories
– top local restaurants by anomalous repeat visits
– restaurant x indicator merchant cooccurrence matrix
– restaurant x indicator cuisine cooccurrence matrix

These can all be stored and accessed using text retrieval techniques

Fast deployment using mirrors and NFS (not standard Hadoop)

Page 17: CMU Lecture on Hadoop Performance


Non-Traditional Deployment Demo

DEMO

Page 18: CMU Lecture on Hadoop Performance


EM Algorithms

Start with random model estimates
Use the model estimates to classify examples
Use the classified examples to find maximum likelihood model estimates
Use the model estimates to classify examples
Use the classified examples to find maximum likelihood model estimates
… and so on …

Page 19: CMU Lecture on Hadoop Performance


K-means as EM Algorithm

Assign a random seed to each cluster
Assign points to nearest cluster
Move cluster to average of contained points
Assign points to nearest cluster

… and so on …

Page 20: CMU Lecture on Hadoop Performance


K-means as Map-Reduce

Assignment of points to cluster is trivially parallel

Computation of new clusters is also parallel

Moving points to averages is ideal for map-reduce
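
A sketch of one iteration in that style, with plain Java standing in for the mapper and reducer (this is not the Mahout implementation): the assignment loop is the map side, the averaging loop is the reduce side.

    public class KMeansIteration {
      // map side: each point finds its nearest centroid (trivially parallel)
      static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
          double d = 0;
          for (int i = 0; i < p.length; i++) {
            double diff = p[i] - centroids[c][i];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
      }

      // reduce side: each new centroid is the average of its assigned points
      static double[][] iterate(double[][] points, double[][] centroids) {
        double[][] sums = new double[centroids.length][centroids[0].length];
        int[] counts = new int[centroids.length];
        for (double[] p : points) {
          int c = nearest(p, centroids);
          counts[c]++;
          for (int i = 0; i < p.length; i++) sums[c][i] += p[i];
        }
        for (int c = 0; c < sums.length; c++) {
          for (int i = 0; i < sums[c].length; i++) {
            sums[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centroids[c][i];
          }
        }
        return sums;   // run repeatedly until the centroids stop moving
      }
    }

Each call to iterate corresponds to one map-reduce job, which is exactly why the iteration costs on the next slide matter.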

Page 21: CMU Lecture on Hadoop Performance


But …

With map-reduce, iteration is evil

Starting a program can take 10-30s

Saving data to disk and then immediately reading from disk is silly

Input might even fit in cluster memory

Page 22: CMU Lecture on Hadoop Performance


Fix #1

Don’t do that! Use Spark
– in-memory interactive map-reduce
– 100x to 1000x faster
– must fit in memory

Use Giraph
– BSP programming model rather than map-reduce
– essentially map-reduce-reduce-reduce…

Use GraphLab
– like BSP without the speed brakes
– 100x faster

Page 23: CMU Lecture on Hadoop Performance


Fix #2

Use a sketch-based algorithm

Do one pass over the data to compute a sketch of the data

Cluster the sketch

Done, with good theoretical bounds on accuracy

Speedup of 3000x or more

Page 24: CMU Lecture on Hadoop Performance


An Example

Page 25: CMU Lecture on Hadoop Performance


The Problem

Spirals are a classic “counter” example for k-means
A classic low dimensional manifold with added noise

But clustering still makes modeling work well

Page 26: CMU Lecture on Hadoop Performance


An Example

Page 27: CMU Lecture on Hadoop Performance


An Example

Page 28: CMU Lecture on Hadoop Performance


The Cluster Proximity Features

Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing the number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
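
A concrete version of that representation as a hypothetical helper (not from the deck): 4.3 bits corresponds to about 2^4.3 ≈ 20 clusters, and the two proximities carry most of the remaining signal.

    public class ProximityFeatures {
      // Encode a point as (id1, d1, id2, d2): its two nearest centroids
      // and the distances to them
      static double[] features(double[] p, double[][] centroids) {
        int id1 = -1, id2 = -1;
        double d1 = Double.POSITIVE_INFINITY, d2 = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
          double s = 0;
          for (int i = 0; i < p.length; i++) {
            double diff = p[i] - centroids[c][i];
            s += diff * diff;
          }
          double d = Math.sqrt(s);
          if (d < d1) { id2 = id1; d2 = d1; id1 = c; d1 = d; }
          else if (d < d2) { id2 = c; d2 = d; }
        }
        return new double[] {id1, d1, id2, d2};
      }
    }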

Page 29: CMU Lecture on Hadoop Performance


Lots of Clusters Are Fine

Page 30: CMU Lecture on Hadoop Performance


Surrogate Method

Start with sloppy clustering into κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Cluster surrogate data using ball k-means
Results are provably good for highly clusterable data
Sloppy clustering is on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time

Page 31: CMU Lecture on Hadoop Performance


Algorithm Costs

O(k d) per point per iteration for Lloyd’s algorithm
Number of iterations not well known
Iterations > log n is a reasonable assumption, giving O(k d log n) per point overall

Page 32: CMU Lecture on Hadoop Performance


Algorithm Costs

Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
  O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
  O(κ d log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy clusters may suffice

Page 33: CMU Lecture on Hadoop Performance


Algorithm Costs

How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 500,000
– log k + log log n = 11 + 5 = 17
– 30,000 times faster is a bona fide big deal

Page 34: CMU Lecture on Hadoop Performance


Pragmatics

But this requires a fast search internally
Have to cluster on the fly for the sketch
Have to guarantee sketch quality
Previous methods had very high complexity

Page 35: CMU Lecture on Hadoop Performance


How It Works

For each point (see the code sketch below)
– find the approximately nearest centroid (at distance d)
– if (d > threshold), start a new centroid
– else if (u < d/threshold) for uniform random u, start a new cluster
– else add the point to the nearest centroid

If the number of centroids grows beyond κ ≈ C log N
– recursively cluster the centroids with a higher threshold

The result is a large set of centroids
– these provide an approximation of the original distribution
– we can cluster the centroids to get a close approximation of clustering the original data
– or we can just use the result directly
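
A sketch of this loop in Java, modeled on (but much simpler than) the Mahout streaming k-means code; the 1.5x threshold growth factor and the exact new-cluster rule are assumptions here.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class StreamingSketch {
      final List<double[]> centroids = new ArrayList<>();
      final List<Integer> weights = new ArrayList<>();
      final int kappa;          // centroid budget, roughly C log N
      double threshold;
      final Random rand = new Random();

      StreamingSketch(double threshold, int kappa) {
        this.threshold = threshold;
        this.kappa = kappa;
      }

      void add(double[] p) {
        int c = nearest(p);
        double d = (c < 0) ? Double.POSITIVE_INFINITY : distance(p, centroids.get(c));
        // far points always start a centroid; near points start one with
        // probability d / threshold, so dense regions collapse together
        if (d > threshold || rand.nextDouble() < d / threshold) {
          centroids.add(p.clone());
          weights.add(1);
        } else {
          double[] ctr = centroids.get(c);     // fold into the nearest centroid
          int w = weights.get(c);
          for (int i = 0; i < ctr.length; i++) {
            ctr[i] = (ctr[i] * w + p[i]) / (w + 1);
          }
          weights.set(c, w + 1);
        }
        if (centroids.size() > kappa) {
          threshold *= 1.5;   // the full algorithm recursively reclusters the
                              // centroids with this higher threshold
        }
      }

      int nearest(double[] p) {   // a real implementation uses approximate fast search
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.size(); c++) {
          double d = distance(p, centroids.get(c));
          if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
      }

      static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
      }
    }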

Page 36: CMU Lecture on Hadoop Performance


Matrix Decomposition

Many big matrices can be compressed

Often used in recommendations

[Figure: a big matrix expressed as the product of much smaller matrices]

Page 37: CMU Lecture on Hadoop Performance


Nearest Neighbor

Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy

Fast search algorithms work up to dimension 50-100, but don’t work above that

Page 38: CMU Lecture on Hadoop Performance


Random Projections

Many problems in high dimension can be reduced to low dimension

Reductions with good distance approximation are available

Surprisingly, these methods can be done using random vectors
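
A small demonstration of the random-vector method: project two 10,000-dimensional vectors down to 100 dimensions with a random Gaussian matrix and compare distances. With the 1/sqrt(k) scaling, the projected distance is an approximately unbiased estimate of the original (the Johnson-Lindenstrauss effect); the dimensions chosen here are arbitrary.

    import java.util.Random;

    public class RandomProjection {
      public static void main(String[] args) {
        int d = 10_000, k = 100;              // original and reduced dimension
        Random rand = new Random(1);
        double[][] proj = new double[k][d];
        double scale = 1.0 / Math.sqrt(k);    // preserves expected distances
        for (double[] row : proj)
          for (int j = 0; j < d; j++) row[j] = rand.nextGaussian() * scale;

        double[] x = randomVector(d, rand), y = randomVector(d, rand);
        System.out.printf("original distance  %.3f%n", dist(x, y));
        System.out.printf("projected distance %.3f%n", dist(apply(proj, x), apply(proj, y)));
      }

      static double[] randomVector(int d, Random rand) {
        double[] v = new double[d];
        for (int i = 0; i < d; i++) v[i] = rand.nextGaussian();
        return v;
      }

      static double[] apply(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
          for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
      }

      static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
      }
    }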

Page 39: CMU Lecture on Hadoop Performance


Fundamental Trick

Random orthogonal projection preserves the action of A
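
In symbols, one standard realization of this trick is the randomized decomposition of Halko, Martinsson, and Tropp; here Ω is an n x (k+p) random Gaussian matrix, k the target rank, and p a small oversampling, all assumptions not stated on the slide:

    Y = A\Omega                              % sample the range of A
    QR = Y                                   % thin QR; Q has orthonormal columns
    A \approx QQ^{\top}A                     % the projection preserves the action of A
    B = Q^{\top}A = \hat{U}\Sigma V^{\top}   % small, cheap SVD of B
    A \approx (Q\hat{U})\,\Sigma\,V^{\top}   % decomposition of the big matrix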

Page 40: CMU Lecture on Hadoop Performance


Projection Search

[Figure: points projected onto a vector; the projection induces a total ordering!]

Page 41: CMU Lecture on Hadoop Performance


LSH Bit-match Versus Cosine

Page 42: CMU Lecture on Hadoop Performance


But How?

Page 43: CMU Lecture on Hadoop Performance


Summary

Don’t repeat big scans
– Cascade aggregations
– Compute several aggregates at once

Use approximate measures for rank statistics
Downsample where appropriate
Use non-traditional deployment
Use sketches
Use random projections

Page 44: CMU Lecture on Hadoop Performance


Contact Me!

We’re hiring at MapR in US and Europe

Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12

Get the code at https://github.com/tdunning

Contact me at [email protected] or @ted_dunning