CMU Lecture on Hadoop Performance


DESCRIPTION

A lecture by Ted Dunning describing several ways to make Hadoop programs go faster.

TRANSCRIPT

1©MapR Technologies - Confidential

Hadoop Performance

2©MapR Technologies - Confidential

Agenda

What is performance? Optimization?
Case 1: Aggregation
Case 2: Recommendations
Case 3: Clustering
Case 4: Matrix decomposition

3©MapR Technologies - Confidential

What is Performance?

Is doing something faster better?

Is it the right task?

Do you have a wide enough view?

What is the right performance metric?

4©MapR Technologies - Confidential

Aggregation

Word-count and friends– How many times did X occur?– How many unique X’s occurred?

Associative metrics permit decomposition– Partial sums and grand totals for example– Use combiners– Use high resolution aggregates to compute low resolution aggregates

Rank-based statistics do not permit decomposition– Avoid them– Use approximations

5©MapR Technologies - Confidential

Inside Map-Reduce

[Diagram: Input → Map → Combine → Shuffle and sort → Reduce → Output, illustrated with word counting. The input line “The time has come,” the Walrus said, “To talk of many things: Of shoes—and ships—and sealing-wax” is mapped to pairs (the, 1), (time, 1), (has, 1), (come, 1), …; the shuffle groups partial counts such as come, [3, 2, 1] and has, [1, 5, 2]; the reduce step emits totals such as come, 6; has, 8; the, 4; time, 14.]
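The combine step in this diagram is what makes associative aggregates cheap: partial sums are computed map-side before anything crosses the network. A minimal sketch of the standard Hadoop word count, with the reducer reused as the combiner; this is illustrative boilerplate, not code from the talk.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Summing is associative, so the same class works as combiner and reducer.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // map-side partial sums before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```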

6©MapR Technologies - Confidential

Don’t Do This

[Diagram: Raw → Daily, Raw → Weekly, Raw → Monthly; each aggregate is computed by re-reading the raw data.]

7©MapR Technologies - Confidential

Do This Instead

[Diagram: Raw → Daily → Weekly → Monthly; each longer-term aggregate is computed from the shorter-term one.]

8©MapR Technologies - Confidential

Aggregation

First rule:– Don’t read the big input multiple times– Compute longer term aggregates from short term aggregates

Second rule:– Don’t read the big input multiple times– Compute multiple windowed aggregates at the same time
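A minimal plain-Java sketch of the cascade: read the raw events once to build daily totals, then derive weekly and monthly totals from the daily ones instead of re-reading the raw input. The Event record and field names are illustrative assumptions.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.TemporalAdjusters;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CascadedAggregates {
  // Illustrative raw event: a timestamped count.
  record Event(LocalDate date, long count) {}

  public static void main(String[] args) {
    List<Event> rawEvents = List.of(
        new Event(LocalDate.of(2012, 10, 1), 3),
        new Event(LocalDate.of(2012, 10, 1), 2),
        new Event(LocalDate.of(2012, 10, 8), 7),
        new Event(LocalDate.of(2012, 11, 2), 1));

    // One pass over the raw data: daily totals.
    Map<LocalDate, Long> daily = new TreeMap<>();
    for (Event e : rawEvents) {
      daily.merge(e.date(), e.count(), Long::sum);
    }

    // Weekly and monthly totals come from the daily aggregates, not from raw data.
    Map<LocalDate, Long> weekly = new TreeMap<>();
    Map<YearMonth, Long> monthly = new TreeMap<>();
    for (Map.Entry<LocalDate, Long> d : daily.entrySet()) {
      LocalDate weekStart = d.getKey().with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
      weekly.merge(weekStart, d.getValue(), Long::sum);
      monthly.merge(YearMonth.from(d.getKey()), d.getValue(), Long::sum);
    }

    System.out.println("daily   = " + daily);
    System.out.println("weekly  = " + weekly);
    System.out.println("monthly = " + monthly);
  }
}
```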

9©MapR Technologies - Confidential

Rank Statistics Can Be Tamed

Approximate quartiles are easily computed– (but sorted data is evil)

Approximate unique counts are easily computed– use Bloom filter and extrapolate from number of set bits– use multiple filters at different down-sample rates

Approximate high or low quantiles are easily computed– keep the largest 1000 elements– keep the largest 1000 elements from 10x down-sampled data– and so on

Approximate top-40 also possible
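A small sketch of the Bloom-filter idea from the bullet above: insert items into a bit array and extrapolate the unique count from the fraction of set bits, using the standard estimate n ≈ -(m/k) ln(1 - X/m) for m bits, k hash functions, and X set bits. The hash construction and sizes here are illustrative assumptions.

```java
import java.util.BitSet;

public class BloomCardinality {
  private final BitSet bits;
  private final int m;   // number of bits
  private final int k;   // number of hash functions

  BloomCardinality(int m, int k) {
    this.m = m;
    this.k = k;
    this.bits = new BitSet(m);
  }

  void add(String item) {
    // Derive k bit positions from two base hashes (double hashing).
    int h1 = item.hashCode();
    int h2 = Integer.reverse(h1) ^ 0x5bd1e995;
    for (int i = 0; i < k; i++) {
      bits.set(Math.floorMod(h1 + i * h2, m));
    }
  }

  double estimatedUniqueCount() {
    // Extrapolate the number of distinct insertions from the number of set bits.
    int x = bits.cardinality();
    return -((double) m / k) * Math.log(1.0 - (double) x / m);
  }

  public static void main(String[] args) {
    BloomCardinality sketch = new BloomCardinality(1 << 20, 4);
    for (int i = 0; i < 100_000; i++) {
      sketch.add("user-" + (i % 60_000));   // 60,000 distinct values
    }
    System.out.printf("estimated uniques ~ %.0f%n", sketch.estimatedUniqueCount());
  }
}
```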

10©MapR Technologies - Confidential

Recommendations

Common patterns in the past may predict common patterns in the future

People who bought item x also bought item y

But also, people who bought Chinese food in the past, …

Or people in SoMa really liked this restaurant in the past

11©MapR Technologies - Confidential

People who bought …

Key operation is counting number of people who bought x and y– for all x’s and all y’s

The raw problem appears to be O(N^3)

At the least, O(k_max^2)– for the most prolific user, there are k_max^2 pairs to count– k_max can be near N

Scalable problems must be O(N)
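To make the cost concrete, here is a tiny plain-Java sketch of the core counting step: every pair of items in a user's history gets a count, which is why a user with k items contributes k(k-1)/2 ≈ k² pairs. The data layout and names are illustrative, not from the lecture.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceCount {
  public static void main(String[] args) {
    // Each inner list is the set of items one user interacted with (illustrative data).
    List<List<String>> userHistories = List.of(
        List.of("x", "y", "z"),
        List.of("x", "y"),
        List.of("y", "z"));

    Map<String, Integer> pairCounts = new HashMap<>();
    for (List<String> items : userHistories) {
      // k items produce k * (k - 1) / 2 unordered pairs; this loop is the O(k_max^2) part.
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          String a = items.get(i);
          String b = items.get(j);
          String key = a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
          pairCounts.merge(key, 1, Integer::sum);
        }
      }
    }
    // e.g. x,y -> 2; x,z -> 1; y,z -> 2
    pairCounts.forEach((pair, n) -> System.out.println(pair.replace('\t', ',') + " -> " + n));
  }
}
```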

12©MapR Technologies - Confidential

But …

What do we learn from users who buy everything– they have no discrimination– they are often the QA team– they tell us nothing

What do we learn from items bought by everybody– the dual of omnivorous buyers– these are often teaser items– they tell us nothing

13©MapR Technologies - Confidential

Also …

What would you learn about a user from purchases– 1 … 20?– 21 … 100?– 101 … 1000?– 1001 … ∞?

What about learning about an item?– how many people do we need to see before we understand the item?

14©MapR Technologies - Confidential

So …

Cheat!

Downsample every user to at most 1000 interactions– most recent– most rare– random selection– whatever is easiest

Now k_max ≤ 1000
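One easy way to impose the cap is reservoir sampling: a single pass over a user's interactions keeps a uniform random sample of at most 1000 of them. A minimal sketch; the limit and method names are illustrative choices, and keeping the most recent or rarest interactions would be drop-in alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UserDownsampler {
  private static final int MAX_INTERACTIONS = 1000;

  /** Returns a uniform random sample of at most MAX_INTERACTIONS items in one pass. */
  static <T> List<T> downsample(Iterable<T> interactions, Random rand) {
    List<T> reservoir = new ArrayList<>(MAX_INTERACTIONS);
    long seen = 0;
    for (T item : interactions) {
      seen++;
      if (reservoir.size() < MAX_INTERACTIONS) {
        reservoir.add(item);
      } else {
        // Replace an existing element with probability MAX_INTERACTIONS / seen.
        long slot = (long) (rand.nextDouble() * seen);
        if (slot < MAX_INTERACTIONS) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> history = new ArrayList<>();
    for (int i = 0; i < 50_000; i++) {
      history.add(i);
    }
    List<Integer> sample = downsample(history, new Random(42));
    System.out.println("kept " + sample.size() + " of " + history.size() + " interactions");
  }
}
```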

15©MapR Technologies - Confidential

The Fundamental Things Apply

Don’t read the raw data repeatedly

Sessionize and denormalize per hour/day/week– that is, group by user– expand items with categories and content descriptors if feasible

Feed all down-stream processing in one pass– baby join to item characteristics – downsample– count grand totals– compute cooccurrences

16©MapR Technologies - Confidential

Deployment Matters, Too

For restaurant case, basic recommendation info includes:– user x merchant histories– user x cuisine histories– top local restaurant by anomalous repeat visits– restaurant x indicator merchant cooccurrence matrix– restaurant x indicator cuisine cooccurrence matrix

These can all be stored and accessed using text retrieval techniques

Fast deployment using mirrors and NFS (not standard Hadoop)

17©MapR Technologies - Confidential

Non-Traditional Deployment Demo

DEMO

18©MapR Technologies - Confidential

EM Algorithms

Start with random model estimates
Use model estimates to classify examples
Use classified examples to find maximum-probability model estimates
Use model estimates to classify examples
Use classified examples to find maximum-probability model estimates
… and so on …

19©MapR Technologies - Confidential

K-means as EM Algorithm

Assign a random seed to each cluster
Assign points to nearest cluster
Move cluster to average of contained points
Assign points to nearest cluster

… and so on …
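A compact single-machine sketch of the iteration just described (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat. Dimensions, counts, and names are illustrative.

```java
import java.util.Arrays;
import java.util.Random;

public class LloydKMeans {
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static double[][] cluster(double[][] points, int k, int iterations, Random rand) {
    int d = points[0].length;
    // Seed each cluster with a randomly chosen point.
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[rand.nextInt(points.length)].clone();
    }
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][d];
      int[] counts = new int[k];
      // Assignment step: each point goes to its nearest centroid.
      for (double[] p : points) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = squaredDistance(p, centroids[c]);
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        counts[best]++;
        for (int i = 0; i < d; i++) {
          sums[best][i] += p[i];
        }
      }
      // Update step: move each centroid to the mean of its points.
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int i = 0; i < d; i++) {
            centroids[c][i] = sums[c][i] / counts[c];
          }
        }
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    Random rand = new Random(1);
    double[][] points = new double[500][2];
    for (double[] p : points) {
      p[0] = rand.nextGaussian() + (rand.nextBoolean() ? 5 : 0);
      p[1] = rand.nextGaussian();
    }
    System.out.println(Arrays.deepToString(cluster(points, 2, 10, rand)));
  }
}
```

The assignment step is independent per point and the update step is a grouped average, which is why the next slide maps the iteration so directly onto map-reduce.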

20©MapR Technologies - Confidential

K-means as Map-Reduce

Assignment of points to cluster is trivially parallel

Computation of new clusters is also parallel

Moving points to averages is ideal for map-reduce

21©MapR Technologies - Confidential

But …

With map-reduce, iteration is evil

Starting a program can take 10-30s

Saving data to disk and then immediately reading from disk is silly

Input might even fit in cluster memory

22©MapR Technologies - Confidential

Fix #1

Don’t do that! Use Spark– in memory interactive map-reduce– 100x to 1000x faster– must fit in memory

Use Giraph– BSP programming model rather than map-reduce– essentially map-reduce-reduce-reduce…

Use GraphLab– Like BSP without the speed brakes– 100x faster

23©MapR Technologies - Confidential

Fix #2

Use a sketch-based algorithm

Do one pass over the data to compute a sketch of the data

Cluster the sketch

Done, with good theoretical bounds on accuracy

Speedup of 3000x or more

24©MapR Technologies - Confidential

An Example

25©MapR Technologies - Confidential

The Problem

Spirals are a classic “counter” example for k-means
Classic low dimensional manifold with added noise

But clustering still makes modeling work well

26©MapR Technologies - Confidential

An Example

27©MapR Technologies - Confidential

An Example

28©MapR Technologies - Confidential

The Cluster Proximity Features

Every point can be described by the nearest cluster– 4.3 bits per point in this case– Significant error that can be decreased (to a point) by increasing number of clusters

Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)– Error is negligible– Unwinds the data into a simple representation
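A small sketch of how such features could be computed: for each point, find the two nearest centroids and emit their ids, the two proximities, and a sign bit. The feature layout, and in particular the choice of sign (which side of the nearest centroid the point falls on relative to the second), are illustrative assumptions rather than the lecture's exact definition.

```java
public class ClusterProximityFeatures {
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Encodes a point as (nearest id, second id, two proximities, sign bit). */
  static double[] encode(double[] point, double[][] centroids) {
    int first = 0, second = 1;
    double d1 = Double.MAX_VALUE, d2 = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double d = distance(point, centroids[c]);
      if (d < d1) {
        second = first; d2 = d1;
        first = c; d1 = d;
      } else if (d < d2) {
        second = c; d2 = d;
      }
    }
    // Illustrative sign feature: does the point lie toward the second-nearest
    // centroid (positive) or away from it (negative)?
    double dot = 0;
    for (int i = 0; i < point.length; i++) {
      dot += (point[i] - centroids[first][i]) * (centroids[second][i] - centroids[first][i]);
    }
    return new double[] {first, second, d1, d2, dot >= 0 ? 1 : -1};
  }

  public static void main(String[] args) {
    double[][] centroids = {{0, 0}, {4, 0}, {0, 4}};
    double[] features = encode(new double[] {1.0, 0.5}, centroids);
    System.out.println(java.util.Arrays.toString(features));
  }
}
```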

29©MapR Technologies - Confidential

Lots of Clusters Are Fine

30©MapR Technologies - Confidential

Surrogate Method

Start with sloppy clustering into κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Cluster surrogate data using ball k-means
Results are provably good for highly clusterable data
Sloppy clustering is on-line
Surrogate can be kept in memory
Ball k-means pass can be done at any time

31©MapR Technologies - Confidential

Algorithm Costs

O(k d log n) per point per iteration for Lloyd’s algorithm
Number of iterations not well known
Iterations > log n is a reasonable assumption

32©MapR Technologies - Confidential

Algorithm Costs

Surrogate methods– fast, sloppy single-pass clustering with κ = k log n– fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point– fast, in-memory, high-quality clustering of κ weighted centroids

O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log κ log k) for larger k, looser quality

– result is k high-quality centroids
• Even the sloppy clusters may suffice

33©MapR Technologies - Confidential

Algorithm Costs

How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 500,000
– log k + log log n = 11 + 5 = 17
– 30,000 times faster is a bona fide big deal

34©MapR Technologies - Confidential

Pragmatics

But this requires a fast search internally
Have to cluster on the fly for sketch
Have to guarantee sketch quality
Previous methods had very high complexity

35©MapR Technologies - Confidential

How It Works

For each point– Find approximately nearest centroid (distance = d)– If (d > threshold) new centroid– Else if (u > d/threshold) new cluster– Else add to nearest centroid

If centroids > κ ≈ C log N– Recursively cluster centroids with higher threshold

Result is large set of centroids– these provide approximation of original distribution– we can cluster centroids to get a close approximation of clustering original– or we can just use the result directly
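A minimal single-machine sketch in the spirit of the steps above. A point either becomes a new centroid (always when it is farther than the threshold, and otherwise with probability proportional to d/threshold, one common convention for this style of online clustering) or is folded into its nearest centroid as a weighted average; when more than κ centroids accumulate, the threshold is raised and the existing centroids are re-clustered. Threshold values, the growth factor, and names are illustrative assumptions, not the lecture's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketchClusterer {
  // A centroid is the weighted mean of the points folded into it.
  static final class Centroid {
    double[] mean;
    double weight;
    Centroid(double[] point, double weight) {
      this.mean = point.clone();
      this.weight = weight;
    }
    void add(double[] point, double w) {
      for (int i = 0; i < mean.length; i++) {
        mean[i] = (mean[i] * weight + point[i] * w) / (weight + w);
      }
      weight += w;
    }
  }

  private final int kappa;      // target sketch size, roughly k log n
  private double threshold;     // distance scale controlling centroid creation
  private final Random rand = new Random();
  private List<Centroid> centroids = new ArrayList<>();

  StreamingSketchClusterer(int kappa, double initialThreshold) {
    this.kappa = kappa;
    this.threshold = initialThreshold;
  }

  void add(double[] point, double weight) {
    Centroid nearest = null;
    double d = Double.MAX_VALUE;
    for (Centroid c : centroids) {
      double dist = distance(point, c.mean);
      if (dist < d) {
        d = dist;
        nearest = c;
      }
    }
    // New centroid if the point is far away, or with probability ~ d / threshold.
    if (nearest == null || d > threshold || rand.nextDouble() < d / threshold) {
      centroids.add(new Centroid(point, weight));
    } else {
      nearest.add(point, weight);
    }
    if (centroids.size() > kappa) {
      collapse();
    }
  }

  /** Raise the threshold and re-cluster the current centroids into a smaller sketch. */
  private void collapse() {
    threshold *= 1.5;
    List<Centroid> old = centroids;
    centroids = new ArrayList<>();
    for (Centroid c : old) {
      add(c.mean, c.weight);
    }
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    StreamingSketchClusterer sketch = new StreamingSketchClusterer(200, 0.5);
    Random r = new Random(7);
    for (int i = 0; i < 100_000; i++) {
      double cx = (i % 5) * 10;   // five well-separated groups
      sketch.add(new double[] {cx + r.nextGaussian(), r.nextGaussian()}, 1);
    }
    System.out.println("sketch size: " + sketch.centroids.size());
  }
}
```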

36©MapR Technologies - Confidential

Matrix Decomposition

Big matrices can often be compressed

Often used in recommendations

[Diagram: a large matrix expressed as the product of two much smaller matrices]

37©MapR Technologies - Confidential

Nearest Neighbor

Very high dimensional vectors can be compressed to 10-100 dimensions with little loss of accuracy

Fast search algorithms work up to dimension 50-100, but don't work above that

38©MapR Technologies - Confidential

Random Projections

Many problems in high dimension can be reduced to low dimension

Reductions with good distance approximation are available

Surprisingly, these methods can be done using random vectors
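A small sketch of the trick: multiply each high-dimensional vector by a fixed random Gaussian matrix scaled by 1/sqrt(k) to get a k-dimensional vector whose pairwise distances are approximately preserved (the Johnson-Lindenstrauss style of reduction the slide alludes to). Sizes and the scaling choice are illustrative.

```java
import java.util.Random;

public class RandomProjection {
  /** Builds a k x d random Gaussian projection matrix scaled by 1/sqrt(k). */
  static double[][] projectionMatrix(int k, int d, Random rand) {
    double[][] r = new double[k][d];
    double scale = 1.0 / Math.sqrt(k);
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < d; j++) {
        r[i][j] = rand.nextGaussian() * scale;
      }
    }
    return r;
  }

  static double[] project(double[] x, double[][] r) {
    double[] y = new double[r.length];
    for (int i = 0; i < r.length; i++) {
      double sum = 0;
      for (int j = 0; j < x.length; j++) {
        sum += r[i][j] * x[j];
      }
      y[i] = sum;
    }
    return y;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    Random rand = new Random(0);
    int d = 10_000, k = 100;
    double[][] r = projectionMatrix(k, d, rand);

    // Two random high-dimensional points; their distance should survive projection.
    double[] a = new double[d], b = new double[d];
    for (int i = 0; i < d; i++) {
      a[i] = rand.nextGaussian();
      b[i] = rand.nextGaussian();
    }
    System.out.printf("original distance  %.2f%n", distance(a, b));
    System.out.printf("projected distance %.2f%n", distance(project(a, r), project(b, r)));
  }
}
```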

39©MapR Technologies - Confidential

Fundamental Trick

Random orthogonal projection preserves action of A
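One standard way to make this concrete (stated here as background, not as the lecture's exact formulation): take a random Gaussian test matrix Ω, form Y = AΩ, and let Q be an orthonormal basis for the columns of Y; then A ≈ Q Qᵀ A. The small matrix Qᵀ A can be decomposed cheaply, and that yields an approximate decomposition of A itself.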

40©MapR Technologies - Confidential

Projection Search

[Diagram: projecting the points onto one direction gives a total ordering of the points along that direction]
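A tiny sketch of how that ordering is used: project every point onto one random direction, sort by projected value, then examine only the points whose projections fall near the query's projection. The window size and names are illustrative; practical systems combine several projections.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class ProjectionSearch {
  static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    Random rand = new Random(3);
    int n = 10_000, d = 50;
    double[][] points = new double[n][d];
    for (double[] p : points) {
      for (int i = 0; i < d; i++) {
        p[i] = rand.nextGaussian();
      }
    }

    // One random direction induces a total ordering of all points.
    double[] direction = new double[d];
    for (int i = 0; i < d; i++) {
      direction[i] = rand.nextGaussian();
    }
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> dot(points[i], direction)));

    // To search, locate the query's projection and scan a small window around it.
    double[] query = points[1234];
    double q = dot(query, direction);
    int lo = 0, hi = n;
    while (lo < hi) {
      int mid = (lo + hi) / 2;
      if (dot(points[order[mid]], direction) < q) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int i = Math.max(0, lo - 100); i < Math.min(n, lo + 100); i++) {
      int candidate = order[i];
      double dist = 0;
      for (int j = 0; j < d; j++) {
        double diff = points[candidate][j] - query[j];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = candidate;
      }
    }
    System.out.println("approximate nearest neighbour of point 1234: " + best);
  }
}
```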

41©MapR Technologies - Confidential

LSH Bit-match Versus Cosine

42©MapR Technologies - Confidential

But How?

43©MapR Technologies - Confidential

Summary

Don’t repeat big scans– Cascade aggregations– Compute several aggregates at once

Use approximate measures for rank statistics
Downsample where appropriate
Use non-traditional deployment
Use sketches
Use random projections

44©MapR Technologies - Confidential

Contact Me!

We’re hiring at MapR in US and Europe

Come get the slides at http://www.mapr.com/company/events/cmu-hadoop-performance-11-1-12

Get the code at https://github.com/tdunning

Contact me at tdunning@maprtech.com or @ted_dunning
