1
Reining in the Outliers in MapReduce Jobs
using MantriGanesh Ananthanarayanan†, Srikanth Kandula*,
Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris*
† UC Berkeley * Microsoft
2
MapReduce JobsBasis of analytics in modern Internet
services◦E.g., Dryad, Hadoop
Job {Phase} {Task}
Graph flow consists of pipelines as well as strict blocks
3
Example Dryad Job Graph
EXTRACT
AGGREGATE_PARTITION
FULL_AGGREGATE
PROCESS
COMBINE
PROCESS
Distr. File System
Distr. File System
Phase
PipelineBlocked untilinput is done
Map.1
Reduce.1
Map.2
Reduce.2
Join
EXTRACT
AGGREGATE_PARTITION
FULL_AGGREGATE
Distr. File System
4
Log Analysis from ProductionLogs from production cluster with
thousands of machines, sampled over six months
10,000+ jobs, 80PB of data, 4PB network transfers◦Task-level details◦Production and experimental jobs
5
Outliers hurt!Tasks that run longer than the rest in the
phase
Median phase has 10% outliers, running for >10x longer
Slow down jobs by 35% at median
Operational Inefficiency◦Unpredictability in completion times affect SLAs◦Hurts development productivity◦Wastes compute-cycles
6
Why do outliers occur?
Mantri: A system that mitigates outliers based on root-cause
analysis
Input Unavaila
ble
Read Input
Execute
Network Congesti
on
Local Contentio
n
Workload
Imbalance
7
Mantri’s Outlier MitigationAvoid Recomputation
Network-aware Task Placement
Duplicate Outliers
Cognizant of Workload Imbalance
Recomputes: Illustration(a) Barrier phases (b) Cascading
Recomputes
InflationIdeal
Actual
Inflation
Ideal
Actual
Recompute task Normal task 8
9
What causes recomputes? [1]
Faulty machines◦Bad disks, non-persistent hardware
quirks
(4%)
Set of faulty machines varies with time, not constant
10
What causes recomputes? [2]
Transient machine load◦Recomputes correlate with machine
load◦Requests for data access dropped
11
Replicate costly outputs
Task1
Task 2
Task 3 MR3
MR2
((MR3*(1-MR2)) * T3 (MR3 * MR2) (T3+T2)
+Replicate (TRep)
TRep < TRecomp
REPLICATE
TRecomp =
MR: Recompute Probability of a machine
Recompute only Task3 or both
Task3 as well as Task2
12
Transient Failure CausesRecomputes manifest in clutchesMachine prone to cause
recomputes till the problem is fixed◦Load abates, critical process restart
etc.
Clue: At least r recomputes within t time window on a machine
13
Speculative RecomputesAnticipatorily recompute tasks
whose outputs are unread
SpeculativeRecompute
SpeculativeRecompute
(Read Fail)
Unread Data
Task
Input Data
14
Mantri’s Outlier MitigationAvoid Recomputation
◦Preferential Replication + Speculative Recomp.
Network-aware Task Placement
Duplicate Outliers
Cognizant of Workload Imbalance
Reduce TasksTasks access output of tasks from
previous phasesReduce phase (74% of total
traffic)
Reduce
Map
Network
Local
Outlier!15
Distr. File System
16
Variable Congestion
Reduce taskMap outputRack
Smart placement smoothens hotspots
17
Traffic-based Allotment
For every rack:◦d : data◦u : available uplink bandwidth ◦v : available downlink bandwidth
Goal: Minimize phase completion time
Solve for task allocation fractions, ai
18
Local Control is a good approx.
Let rack i have ai fraction of tasks◦Time uploading, Tu = di (1 - ai) / ui
◦Time downloading, Td = (D – di) ai / vi
Timei = max {Tu , Td}
Goal: Minimize phase completion timeFor every rack:◦d : data, D: data over all racks◦u : available uplink bandwidth ◦v : available downlink bandwidth
Link utilizations average out in long term, are steady on the short term
19
Mantri’s Outlier MitigationAvoid Recomputation
◦Preferential Replication + Speculative Recomp.
Network-aware Task Placement◦Traffic on link proportional to bandwidth
Duplicate Outliers
Cognizant of Workload Imbalance
20
Contentions cause outliersTasks contend for local resources
◦Processor, memory etc.
Duplicate tasks elsewhere in the cluster◦Current schemes duplicate towards end of
the phase (e.g., LATE [OSDI 2008])
Duplicate outlier or schedule pending task?
21
Resource-Aware Restart
Running task Potential restart (tnew) now time
trem Save time and resources:P(c tnew < (c + 1) trem)
Continuously observe and kill wasteful copies
22
Mantri’s Outlier MitigationAvoid Recomputation
◦Preferential Replication + Speculative Recomp.
Network-aware Task Placement◦Traffic on link proportional to bandwidth
Duplicate Outliers◦Resource-Aware Restart
Cognizant of Workload Imbalance
23
Workload ImbalanceA quarter of the outlier tasks
have more data to process◦Unequal key partitions for reduce
tasksIgnoring these better than
duplication
Schedule tasks in descending order of data to process◦Time α (Data to Process)◦[Graham ‘69] At worse, 33% of
optimal
24
Mantri’s Outlier MitigationAvoid Recomputation
◦Preferential Replication + Speculative Recomp.
Network-aware Task Placement◦Traffic on link proportional to bandwidth
Duplicate Outliers◦Resource-Aware Restart
Cognizant of Workload Imbalance◦Schedule in descending order of size
Proactive
Reactive
Predict to act early
Be resource-aware
Act based on the cause
25
ResultsDeployed in production Bing
clusters
Trace-driven simulations◦Mimic workflow, failures, data skew◦Compare with existing and idealized
schemes
26
Jobs in the Wild
Act Early: Duplicates issued when task 42% done (77% for Dryad)
Light: Issues fewer copies (.47X as many as Dryad)
Accurate: 2.8x higher success rate of copies
Jobs faster by 32% at median, consuming lesser resources
27
Recomputation AvoidanceEliminates most recomputes with minimal extra resources
(Replication + Speculation) work well in tandem
28
Network-Aware Placement
Mantri well-approximates the ideal
Bandwidth approximations
29
SummaryFrom measurements in a production
cluster, ◦Outliers are a significant problem◦Are due to an interplay between storage,
network and map-reduce
Mantri, a cause-, resource-aware mitigation
Deployment shows encouraging results
“Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010