starfish: a self-tuning system for big data analytics

Starfish: A Self-tuning System for Big Data Analytics

Herodotos Herodotou,Harold Lim, Fei Dong, Shivnath Babu

Duke University

2

Analysis in the Big Data Era

9/26/2011

Massive Data

DataAnalysi

s

Insight

Key to Success = Timely and Cost-Effective Analysis

Starfish

3

Hadoop MapReduce EcosystemPopular solution to Big Data Analytics

9/26/2011

MapReduce Execution Engine

Distributed File System

Hadoop

Java / C++ / R / Python OozieHivePig Elastic

MapReduceJaql

HBase

Starfish

4

Practitioners of Big Data AnalyticsWho are the users?

Data analysts, statisticians, computational scientists…Researchers, developers, testers…You!

Who performs setup and tuning?The users!Usually lack expertise to tune the system

9/26/2011 Starfish

5

Tuning ChallengesHeavy use of programming languages for

MapReduce programs (e.g., Java/python)

Data loaded/accessed as opaque files

Large space of tuning choices

Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)

Terabyte-scale data cycles

9/26/2011 Starfish

6

Our goal: Provide good performance automatically

Starfish: Self-tuning System

9/26/2011

MapReduce Execution Engine

Distributed File System

Hadoop

Java / C++ / R / Python OozieHivePig Elastic

MapReduceJaql

HBase

StarfishAnalytics System

Starfish

7

What are the Tuning Problems?

9/26/2011

Job-level MapReduce

configuration

Workload management

Datalayout tuning

Cluster sizing

Workflow optimization

J1 J2

J3

J4

Starfish

8

Starfish’s Core Approach to Tuning

9/26/2011

1) if Δ(conf. parameters) then what …? 2) if Δ(data properties) then what …? 3) if Δ(cluster properties) then what …?

ProfilerCollects concisesummaries of

execution

What-if EngineEstimates impact of hypothetical

changes on execution

OptimizersSearch through space of tuning choices

Job

WorkflowWorkload

Data layout

Cluster

Starfish

Starfish Architecture

9/26/2011 9

Profiler What-if EngineWorkflow Optimizer

Workload Optimizer Elastisizer

Job Optimizer

Data ManagerMetadata

Mgr.Intermediate

Data Mgr.Data Layout & Storage Mgr.

Starfish

10

MapReduce Job Execution

9/26/2011

split 0 map out 0reduce

Two Map Waves One Reduce Wave

split 2 map

split 1 map split 3 map Out 1reduce

job j = < program p, data d, resources r, configuration c >

Starfish

11

What Controls MR Job Execution?

Space of configuration choices:Number of map tasksNumber of reduce tasksPartitioning of map outputs to reduce tasksMemory allocation to task-level buffersMultiphase external sorting in the tasksWhether output data from tasks should be compressedWhether combine function should be used

9/26/2011

job j = < program p, data d, resources r, configuration c >

Starfish

12

Effect of Configuration Settings

Use defaults or set manually (rules-of-thumb)Rules-of-thumb may not suffice

9/26/2011

Two-dimensional projection of a multi-dimensional surface(Word Co-occurrence MapReduce Program)

Rules-of-thumb settings

Starfish

13

MapReduce Job Tuning in a NutshellGoal:

Challenges: p is an arbitrary MapReduce program; c is high-dimensional; …

9/26/2011

),,,(minarg crdpFcSc

opt

),,,( crdpFperf

Profiler

What-if Engine

Optimizer

Runs p to collect a job profile (concise execution summary) of <p,d1,r1,c1> Given profile of <p,d1,r1,c1>, estimates virtual profile for <p,d2,r2,c2>Enumerates and searches through the optimization space S efficiently

Starfish

14

Job ProfileConcise representation of program execution as a jobRecords information at the level of “task phases”Generated by Profiler through measurement or by the

What-if Engine through estimation

9/26/2011

Memory Buffer

Merge

Sort,[Combine],[Compress]

Serialize,Partitionmap

Merge

split

DFS

SpillCollectMapReadStarfish

15

Job Profile FieldsDataflow: amount of data flowing through task phasesMap output bytes

Number of spills

Number of records in buffer per spill

9/26/2011

Costs: execution times at the level of task phasesRead phase time in the map task

Map phase time in the map task

Spill phase time in the map task

Dataflow Statistics: statistical information about dataflowWidth of input key-value pairs

Map selectivity in terms of records

Map output compression ratio

Cost Statistics: statistical information about resource costsI/O cost for reading from local disk per byte

CPU cost for executing the Mapper per record

CPU cost for uncompressing the input per byte

Starfish

16

Generating Profiles by MeasurementGoals

Have zero overhead when profiling is turned offRequire no modifications to HadoopSupport unmodified MapReduce programs written in

Java or Hadoop Streaming/Pipes (Python/Ruby/C++)

Approach: Dynamic (on-demand) instrumentationEvent-condition-action rules are specified (in Java)Leads to run-time instrumentation of Hadoop internalsMonitors task phases of MapReduce job executionWe currently use Btrace (Hadoop internals are in Java)

9/26/2011 Starfish

17

Generating Profiles by Measurement

9/26/2011

split 0 map out 0reduce

split 1 map

raw data

raw data

raw data

map profile

reduce profile

job profile

Use of Sampling• Profile fewer tasks• Execute fewer tasks

JVM = Java Virtual Machine, ECA = Event-Condition-Action

JVM JVM

JVM

Enable Profiling

ECA rules

Starfish

18

What-if Engine

Job Oracle

Virtual Job Profile for <p, d2, r2, c2>

What-if Engine

9/26/2011

Task Scheduler Simulator

JobProfile

<p, d1, r1, c1>

Properties of Hypothetical job

Input DataProperties

<d2>

ClusterResources

<r2>

ConfigurationSettings

<c2>

Possibly Hypothetical

Starfish

19

Virtual Profile Estimation

9/26/2011

Given profile for job j = <p, d1, r1, c1> estimate profile for job j' = <p, d2, r2, c2>

(Virtual) Profile for j'DataflowStatistics

Dataflow

CostStatistics

Costs

Profile for jInput

Data d2

Confi-guration

c2

Resourcesr2

Costs

White-box Models

CostStatisticsRelative

Black-boxModels

Dataflow

White-box Models

DataflowStatistics

CardinalityModels

Starfish

20

Job Optimizer

9/26/2011

Best Configuration Settings <copt> for <p, d2, r2>

Subspace Enumeration

Recursive Random Search

Just-in-Time Optimizer

JobProfile

<p, d1, r1, c1>

Input DataProperties

<d2>

ClusterResources

<r2>

What-ifcalls

Starfish

21

Workflow Optimization Space

9/26/2011

Job-level Configuration

Dataset-level Configuration

Physical

Optimization Space

Logical

Join Selection

Partition Function Selection

Vertical Packing

Inter-job Inter-job

Starfish

22

Optimizations on TF-IDF Workflow

9/26/2011

LogicalOptimization

M1R1

…D0 <{D},{W}>

J1

D1

D2

D4

M2R2

…<{D, W},{f}>

J2

…<{D},{W, f, c}>

J3, J4

…<{W},{D, t}>

Partition:{D}Sort: {D,W}

M1R1M2R2

…D0 <{D},{W}>

J1, J2

D2

D4

M3R3M4

…<{D},{W, f, c}>

J3, J4

…<{W},{D, t}>

PhysicalOptimization

Reducers= 50Compress = offMemory = 400…

…

…

…

Reducers= 20Compress = onMemory = 300…

LegendD = docname f = frequencyW = word c = countt = TF-IDF

M3R3M4

Starfish

23

New ChallengesWhat-if challenges:

Support concurrent job execution

Estimate intermediate data properties

Optimization challengesInteractions across jobsExtended optimization spaceFind good configuration

settings for individual jobs

9/26/2011

J1 J2

J3

J4

Workflow

Starfish

24

Cluster Sizing ProblemUse-cases for cluster sizing

Tuning the cluster size for elastic workloadsWorkload transitioning from development cluster to

production clusterMulti-objective cluster provisioning

GoalDetermine cluster resources & job-level configuration

parameters to meet workload requirements

9/26/2011 Starfish

25

Multi-objective Cluster Provisioning

9/26/2011

m1.small m1.large m1.xlarge c1.medium c1.xlarge0

200400600800

1,0001,200

Run

ning

Tim

e (m

in)

m1.small m1.large m1.xlarge c1.medium c1.xlarge0.002.004.006.008.00

10.00

EC2 Instance Type

Cos

t ($)

Cloud enables users to provision clusters in minutes

Starfish

Experimental Evaluation

9/26/2011 26

Starfish (versions 0.1, 0.2) to manage Hadoop on EC2Different scenarios: Cluster × Workload × Data

EC2 Node Type

CPU: EC2 units

Mem I/O Perf. Cost /hour

#Maps /node

#Reds/node

MaxMem /task

m1.small 1 (1 x 1) 1.7 GB moderate $0.085 2 1 300 MB

m1.large 4 (2 x 2) 7.5 GB high $0.34 3 2 1024 MB

m1.xlarge 8 (4 x 2) 15 GB high $0.68 4 4 1536 MB

c1.medium 5 (2 x 2.5) 1.7 GB moderate $0.17 2 2 300 MB

c1.xlarge 20 (8 x 2.5) 7 GB high $0.68 8 6 400 MB

cc1.4xlarge 33.5 (8) 23 GB very high $1.60 8 6 1536 MB

Starfish

Experimental Evaluation

9/26/2011 27

Starfish (versions 0.1, 0.2) to manage Hadoop on EC2Different scenarios: Cluster × Workload × Data

Abbr. MapReduce Program Domain Dataset

CO Word Co-occurrence Natural Lang Proc. Wikipedia (10GB – 22GB)

WC WordCount Text Analytics Wikipedia (30GB – 1TB)

TS TeraSort Business Analytics TeraGen (30GB – 1TB)

LG LinkGraph Graph Processing Wikipedia (compressed ~6x)

JO Join Business Analytics TPC-H (30GB – 1TB)

TF Term Freq. - Inverse Document Freq.

Information Retrieval Wikipedia (30GB – 1TB)

Starfish

28

Job Optimizer Evaluation

9/26/2011

Hadoop cluster: 30 nodes, m1.xlargeData sizes: 60-180 GB

TS WC LG JO TF CO0

10

20

30

40

50

60

Default Set-tingsRule-based OptimizerCost-based Optimizer

MapReduce Programs

Spee

dup

over

Def

ault

Starfish

29

Estimates from the What-if Engine

9/26/2011

Hadoop cluster: 16 nodes, c1.mediumMapReduce Program: Word Co-occurrenceData set: 10 GB Wikipedia

True surface Estimated surface

Starfish

30

Profiling Overhead Vs. Benefit

9/26/2011

1 5 10 20 40 60 80 1000

5

10

15

20

25

30

35

Percent of Tasks Profiled

Perc

ent O

verh

ead

over

Job

R

unni

ng T

ime

with

Pro

filin

g T

urne

d O

ff

1 5 10 20 40 60 80 1000.0

0.5

1.0

1.5

2.0

2.5

Percent of Tasks Profiled

Spee

dup

over

Job

run

w

ith R

BO

Set

tings

Hadoop cluster: 16 nodes, c1.mediumMapReduce Program: Word Co-occurrenceData set: 10 GB Wikipedia

Starfish

31

Multi-objective Cluster Provisioning

9/26/2011

m1.small m1.large m1.xlarge c1.medium c1.xlarge0

200400600800

1,0001,200

ActualPredicted

Run

ning

Tim

e (m

in)

m1.small m1.large m1.xlarge c1.medium c1.xlarge0.002.004.006.008.00

10.00

ActualPredicted

EC2 Instance Type for Target Cluster

Cos

t ($)

Instance Type for Source Cluster: m1.large

Starfish

32

More info: www.cs.duke.edu/starfish

9/26/2011

Job-level MapReduce

configuration

Workflow optimization

Workload management

Datalayout tuning

Cluster sizing

J1 J2

J3

J4

Starfish

starfish: a self-tuning system for big data analytics

Documents

data d

data analysts

data properties

javapython data

intermediate data mgr

big data analytics starfish

taskswhether output

tuning levels