TRANSCRIPT
Reining in the Outliers in Map-Reduce Clusters using Mantri
Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris @ Microsoft
Presenter: Weiyue Xu
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Credits
• Modified version of:
– http://www.cs.duke.edu/courses/fall10/cps296.2/lectures/
– research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
Outline
• Introduction
• Cause of the Outlier
• Mantri
• Evaluation
• Discussion and Conclusion
[Figure: log(size of dataset), from GB (10^9) through TB (10^12) and PB (10^15) to EB (10^18), vs. log(size of cluster), from 1 to 10^5; HPC and parallel databases occupy small clusters, MapReduce occupies large ones]
MapReduce
• Decouples customized data operations from mechanisms to scale
• Widely used
– Cosmos (based on SVC's Dryad) + Scope @ Bing
– MapReduce @ Google
– Hadoop inside Yahoo! and on Amazon's Cloud (AWS)
e.g., the Internet, click logs, bio/genomic data
An Example
Goal: Find frequent search queries to Bing
What the user says:
SELECT Query, COUNT(*) AS Freq
FROM QueryTable
HAVING Freq > X
How it works:
[Diagram: the job manager assigns work and gets progress; file blocks 0–3 are read by map tasks, whose output is a local write; reduce tasks produce output blocks 0–1]
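The query above can be sketched as a toy map/reduce in plain Python. This is an illustrative sketch, not Cosmos/Scope code; `map_phase`, `reduce_phase`, and `threshold` are hypothetical names.

```python
from collections import defaultdict

def map_phase(query_log):
    """Map: emit a (query, 1) pair for every query in the input block."""
    for query in query_log:
        yield (query, 1)

def reduce_phase(pairs, threshold):
    """Reduce: sum counts per query, keep those with Freq > threshold."""
    counts = defaultdict(int)
    for query, one in pairs:
        counts[query] += one
    return {q: c for q, c in counts.items() if c > threshold}

log = ["bing", "maps", "bing", "bing", "news"]
frequent = reduce_phase(map_phase(log), threshold=2)
# frequent == {"bing": 3}
```

In the real system each map task would process one file block and each reduce task one key range, but the per-key logic is the same.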
Outliers slow down map-reduce jobs
[Figure: phases analyzed, by count – Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K; barriers at phase boundaries; file system]
We find that tackling outliers can speed up jobs while using resources efficiently:
• Quicker response improves productivity
• Predictability supports SLAs
• Better resource utilization
From a phase to a job
•A job may have many phases
•An outlier in an early phase has a cumulative effect
•Data loss may cause multi-phase recompute outliers
Delay due to a recompute readily cascades
Why outliers?
[Diagram: map → sort → reduce pipeline; delay due to a recompute]
Due to unavailable input, tasks have to be recomputed.
Outlier
• Stragglers: tasks that take 1.5 times the median task in that phase
• Recomputes: tasks that are re-run because their output was lost
50% of phases have >10% stragglers and no recomputes; 10% of the stragglers take >10x longer.
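The straggler definition above is easy to state in code. A minimal sketch; `find_stragglers` is a hypothetical helper, not Mantri code.

```python
import statistics

def find_stragglers(durations, factor=1.5):
    """Flag tasks that run longer than `factor` x the phase median
    (the paper's straggler definition)."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]

phase = [10, 11, 9, 10, 40, 12]   # one task is ~4x the median
print(find_stragglers(phase))      # -> [4]
```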
Frequency of Outliers
Cost of Outliers
At the median, jobs are slowed down by 35% due to outliers.
Previous solution
• The original MapReduce paper observed the problem but did not solve it in depth
• Current schemes (e.g. Hadoop, LATE) duplicate long-running tasks based on some metrics
• Drawbacks:
– Some duplicates may be unnecessary
– They use extra resources
– Placement may be a problem
What this Paper is About
• Identify fundamental causes of outliers
• Mantri: a cause- and resource-aware mitigation scheme
– Case-by-case analysis: takes distinct actions based on cause
– Considers the opportunity cost of actions
• Results from a production deployment
Causes of Outliers
• Data Skew: data size varies across tasks in a phase.
[Diagram: map outputs of uneven size feeding reduce tasks]
Uneven placement is typical in production; reduce tasks are placed at the first available slots.
Causes of Outliers
• Crossrack Traffic
[Diagram: reduce tasks reading map output across racks]
– 70% of cross-rack traffic is reduce traffic
– Reduce reads from every map; tasks in a spot with a slow network run slower
– Tasks compete for the network among themselves
– 50% of phases take 62% longer to finish than with ideal placement
• Bad and Busy Machines
– 50% of recomputes happen on 5% of the machines
– Recomputes increase resource usage
• Outliers cluster by time – resource contention might be the cause
• Recomputes cluster by machine – data loss may cause multiple recomputes
Mantri
• Cause-aware and resource-aware
• Fixes each problem with a different strategy
• Runtime = f(input, network, dataToProcess, ...)
Mantri [Avoid Recomputations]
• Idea: Replicate intermediate data; use the copy if the original is unavailable
• Challenge: What data to replicate?
• Insight: Compare the cost to recompute vs. the cost to replicate
[Diagram: task M1 feeding task M2]
t_redo = r2 · (t2 + t1_redo)
where t = predicted runtime of a task and r = predicted probability of recomputation at the machine
• The cost to recompute depends on data-loss probabilities and time taken, and recursively looks at prior phases.
• Replicate iff t_redo > t_rep: Mantri preferentially acts on more costly inputs.
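The recursive recompute cost and the replication decision can be sketched as follows. This is an illustrative sketch of the slide's formula, not Mantri's implementation; function names are hypothetical.

```python
def recompute_cost(tasks):
    """tasks: list of (t, r) pairs in phase order, where t is the predicted
    runtime and r the predicted probability of recomputation at that machine.
    Redoing phase i may require recursively redoing its inputs:
    t_redo(i) = r_i * (t_i + t_redo(i-1))."""
    t_redo = 0.0
    for t, r in tasks:
        t_redo = r * (t + t_redo)   # e.g. t2_redo = r2 * (t2 + t1_redo)
    return t_redo

def should_replicate(tasks, t_rep):
    """Replicate intermediate output iff the expected recompute cost
    exceeds the cost t_rep of replicating it now."""
    return recompute_cost(tasks) > t_rep
```

For example, a cheap-to-replicate output on a flaky machine (high r) is replicated, while a stable machine's output is not.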
Mantri [Network Aware Placement]
• Idea: Avoid hot-spots; keep traffic on a link proportional to its bandwidth
• Challenges: Global co-ordination, congestion detection
• Insights:
– Local control is a good approximation (each job balances its own traffic)
– Link utilizations average out over long tasks and are steady over short tasks
If rack i has d_i of map output and needs d_i^u and d_i^v for upload and download, while b_i^u and b_i^v are the available bandwidths, place an a_i fraction of the reduces on each rack so that the slowest link's transfer time is minimized.
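The per-job optimization can be sketched for two racks with a simple grid search. This is a toy sketch under assumed semantics (rack i uploads the map output destined elsewhere and downloads the share a_i of the other racks' output); it is not Mantri's solver.

```python
def placement_fractions(d, bu, bv, steps=1000):
    """Grid-search the fraction a of reduces on rack 0 (1-a on rack 1)
    that minimizes the slowest link's transfer time.
    d[i]: map output on rack i; bu[i]/bv[i]: uplink/downlink bandwidth."""
    total = sum(d)
    best_a, best_t = 0.0, float("inf")
    for k in range(steps + 1):
        a = [k / steps, 1 - k / steps]
        t = max(
            max(d[i] * (1 - a[i]) / bu[i],      # upload: data leaving rack i
                a[i] * (total - d[i]) / bv[i])  # download: data arriving
            for i in range(2)
        )
        if t < best_t:
            best_a, best_t = a[0], t
    return best_a, best_t
```

With symmetric data and bandwidths the search returns an even split, as expected; skewed bandwidths shift reduces toward the better-connected rack.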
• Data Skew: about 25% of outliers occur due to more dataToProcess (workload imbalance)
• Ignoring these is better than duplicating them (the state-of-the-art approach)
Mantri [Data Aware Task Ordering]
• Problem: Workload imbalance causes tasks to straggle
• Idea: Restarting outliers that are lengthy is counter-productive
• Insights:
– Build an estimator T ~ f(dataToProcess) and schedule tasks in descending order of dataToProcess
– Theorem [Graham, 1969]: scheduling tasks with the longest processing time first is at most 33% worse than the optimal schedule
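The longest-processing-time-first rule is a classic greedy schedule; a minimal sketch (hypothetical helper, using dataToProcess as the time proxy, as the slide does):

```python
import heapq

def lpt_schedule(data_sizes, n_slots):
    """Longest-processing-time first: sort tasks by dataToProcess
    descending, always give the next task to the least-loaded slot.
    Graham (1969): makespan is at most 4/3 of optimal."""
    loads = [(0.0, s) for s in range(n_slots)]   # (current load, slot id)
    heapq.heapify(loads)
    assignment = [None] * len(data_sizes)
    for task in sorted(range(len(data_sizes)),
                       key=lambda i: data_sizes[i], reverse=True):
        load, slot = heapq.heappop(loads)
        assignment[task] = slot
        heapq.heappush(loads, (load + data_sizes[task], slot))
    makespan = max(load for load, _ in loads)
    return assignment, makespan
```

For sizes [7, 5, 3, 3, 2] on two slots, LPT reaches the optimal makespan of 10; scheduling in ascending order can do worse.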
Mantri [Resource Aware Restart]
• Problem: 25% of outliers remain, likely due to contention at the machine
• Idea: Restart tasks elsewhere in the cluster ASAP
• Challenge: Restart or duplicate?
[Diagram (a)–(c): a running task with remaining time t_rem vs. a potential restart that would take t_new]
• Save time and resources iff P(c · t_rem > (c+1) · t_new) > δ, where c is the number of copies running
• If there is pending work, duplicate iff it saves both time and resources; else duplicate if the expected savings are high
• Continuously observe and kill wasteful copies
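The restart test above can be sketched with a Monte-Carlo estimate over the predicted runtime distributions. A hypothetical sketch: the sampling approach, δ, and trial count are illustrative, not Mantri's actual estimator.

```python
import random

def should_restart(trem_samples, tnew_samples, c, delta=0.25,
                   trials=10000, seed=0):
    """Estimate P(c * t_rem > (c+1) * t_new) by sampling from the two
    predicted-runtime distributions; restart iff it exceeds delta.
    c = number of copies of the task currently running."""
    rng = random.Random(seed)
    hits = sum(
        c * rng.choice(trem_samples) > (c + 1) * rng.choice(tnew_samples)
        for _ in range(trials)
    )
    return hits / trials > delta
```

A task predicted to need far more remaining time than a fresh copy is restarted; one that is nearly done is left alone.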
Summary
• Reduce recomputation: preferentially replicate costly-to-recompute tasks
• Poor network: each job locally avoids network hot-spots
• DataToProcess: schedule in descending order of data size
• Bad machines: quarantine persistently faulty machines
• Others: restart or duplicate tasks, cognizant of resource cost
Evaluation Methodology
• Mantri run on production clusters
• Baseline is results from Dryad
• Use trace-driven simulations to compare with other systems
Comparing Jobs in the Wild
With and without Mantri, for one month of jobs in the Bing production cluster: 340 jobs that each repeated at least five times during May 25–28 (release) vs. Apr 1–30 (pre-release)
In Production, Restarts…
In Trace-Replay Simulations, Restarts…
[Figures: CDF of % cluster resources]
Protecting Against Recomputes
[Figure: CDF of % cluster resources]
Conclusion
• Outliers are a significant problem
• They arise from many causes
• Mantri's cause- and resource-aware mitigation outperforms prior schemes
Discussion
• Mantri does a case-by-case analysis for each cause; what if the causes are inter-dependent?
Questions or Comments?
Thanks!
Estimation of t_rem and t_new
Estimation of t_rem
• d: input data size
• d_read: the amount of data read so far
Estimation of t_new
• processRate: estimated from all tasks in the phase
• locationFactor: machine performance
• d: input size
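These estimators can be sketched as follows, assuming a linear progress model (time scales with data still to be read). The function names and default arguments are illustrative, not the paper's exact formulas.

```python
def estimate_trem(t_elapsed, d, d_read, t_wrapup=0.0):
    """Remaining time ~ elapsed time scaled by the data still to be read,
    plus a wrap-up term, assuming progress is linear in bytes read."""
    return t_elapsed * (d - d_read) / max(d_read, 1) + t_wrapup

def estimate_tnew(process_rate, location_factor, d, sched_lag=0.0):
    """Fresh-copy runtime from the phase-wide process rate (seconds per
    byte), a per-machine location factor, and the input size."""
    return process_rate * location_factor * d + sched_lag
```

For example, a task that has read half its input in 10s is predicted to need roughly another 10s, while a fresh copy's estimate depends only on the phase rate and the target machine.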