Download - Reining in the Outliers in MapReduce Jobs using Mantri

1

Reining in the Outliers in MapReduce Jobs

using MantriGanesh Ananthanarayanan†, Srikanth Kandula*,

Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris*

† UC Berkeley * Microsoft

2

MapReduce JobsBasis of analytics in modern Internet

services◦E.g., Dryad, Hadoop

Job {Phase} {Task}

Graph flow consists of pipelines as well as strict blocks

3

Example Dryad Job Graph

EXTRACT

AGGREGATE_PARTITION

FULL_AGGREGATE

PROCESS

COMBINE

PROCESS

Distr. File System

Distr. File System

Phase

PipelineBlocked untilinput is done

Map.1

Reduce.1

Map.2

Reduce.2

Join

EXTRACT

AGGREGATE_PARTITION

FULL_AGGREGATE

Distr. File System

4

Log Analysis from ProductionLogs from production cluster with

thousands of machines, sampled over six months

10,000+ jobs, 80PB of data, 4PB network transfers◦Task-level details◦Production and experimental jobs

5

Outliers hurt!Tasks that run longer than the rest in the

phase

Median phase has 10% outliers, running for >10x longer

Slow down jobs by 35% at median

Operational Inefficiency◦Unpredictability in completion times affect SLAs◦Hurts development productivity◦Wastes compute-cycles

6

Why do outliers occur?

Mantri: A system that mitigates outliers based on root-cause

analysis

Input Unavaila

ble

Read Input

Execute

Network Congesti

on

Local Contentio

n

Workload

Imbalance

7

Mantri’s Outlier MitigationAvoid Recomputation

Network-aware Task Placement

Duplicate Outliers

Cognizant of Workload Imbalance

Recomputes: Illustration(a) Barrier phases (b) Cascading

Recomputes

InflationIdeal

Actual

Inflation

Ideal

Actual

Recompute task Normal task 8

9

What causes recomputes? [1]

Faulty machines◦Bad disks, non-persistent hardware

quirks

(4%)

Set of faulty machines varies with time, not constant

10

What causes recomputes? [2]

Transient machine load◦Recomputes correlate with machine

load◦Requests for data access dropped

11

Replicate costly outputs

Task1

Task 2

Task 3 MR3

MR2

((MR3*(1-MR2)) * T3 (MR3 * MR2) (T3+T2)

+Replicate (TRep)

TRep < TRecomp

REPLICATE

TRecomp =

MR: Recompute Probability of a machine

Recompute only Task3 or both

Task3 as well as Task2

12

Transient Failure CausesRecomputes manifest in clutchesMachine prone to cause

recomputes till the problem is fixed◦Load abates, critical process restart

etc.

Clue: At least r recomputes within t time window on a machine

13

Speculative RecomputesAnticipatorily recompute tasks

whose outputs are unread

SpeculativeRecompute

SpeculativeRecompute

(Read Fail)

Unread Data

Task

Input Data

14


◦Preferential Replication + Speculative Recomp.

Network-aware Task Placement

Duplicate Outliers


Reduce TasksTasks access output of tasks from

previous phasesReduce phase (74% of total

traffic)

Reduce

Map

Network

Local

Outlier!15

Distr. File System

16

Variable Congestion

Reduce taskMap outputRack

Smart placement smoothens hotspots

17

Traffic-based Allotment

For every rack:◦d : data◦u : available uplink bandwidth ◦v : available downlink bandwidth

Goal: Minimize phase completion time

Solve for task allocation fractions, ai

18

Local Control is a good approx.

Let rack i have ai fraction of tasks◦Time uploading, Tu = di (1 - ai) / ui

◦Time downloading, Td = (D – di) ai / vi

Timei = max {Tu , Td}

Goal: Minimize phase completion timeFor every rack:◦d : data, D: data over all racks◦u : available uplink bandwidth ◦v : available downlink bandwidth

Link utilizations average out in long term, are steady on the short term

19



Network-aware Task Placement◦Traffic on link proportional to bandwidth

Duplicate Outliers


20

Contentions cause outliersTasks contend for local resources

◦Processor, memory etc.

Duplicate tasks elsewhere in the cluster◦Current schemes duplicate towards end of

the phase (e.g., LATE [OSDI 2008])

Duplicate outlier or schedule pending task?

21

Resource-Aware Restart

Running task Potential restart (tnew) now time

trem Save time and resources:P(c tnew < (c + 1) trem)

Continuously observe and kill wasteful copies

22




Duplicate Outliers◦Resource-Aware Restart


23

Workload ImbalanceA quarter of the outlier tasks

have more data to process◦Unequal key partitions for reduce

tasksIgnoring these better than

duplication

Schedule tasks in descending order of data to process◦Time α (Data to Process)◦[Graham ‘69] At worse, 33% of

optimal

24




Duplicate Outliers◦Resource-Aware Restart

Cognizant of Workload Imbalance◦Schedule in descending order of size

Proactive

Reactive

Predict to act early

Be resource-aware

Act based on the cause

25

ResultsDeployed in production Bing

clusters

Trace-driven simulations◦Mimic workflow, failures, data skew◦Compare with existing and idealized

schemes

26

Jobs in the Wild

Act Early: Duplicates issued when task 42% done (77% for Dryad)

Light: Issues fewer copies (.47X as many as Dryad)

Accurate: 2.8x higher success rate of copies

Jobs faster by 32% at median, consuming lesser resources

27

Recomputation AvoidanceEliminates most recomputes with minimal extra resources

(Replication + Speculation) work well in tandem

28

Network-Aware Placement

Mantri well-approximates the ideal

Bandwidth approximations

29

SummaryFrom measurements in a production

cluster, ◦Outliers are a significant problem◦Are due to an interplay between storage,

network and map-reduce

Mantri, a cause-, resource-aware mitigation

Deployment shows encouraging results

“Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010

Download - Reining in the Outliers in MapReduce Jobs using Mantri

Top Related