1

A Hadoop MapReduce Performance Prediction Method

Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#

* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France

+ Ecole Centrale de Paris, France

# Beihang University, Beijing, China

2

Background

• Hadoop MapReduce

[Figure: Hadoop MapReduce job data flow: input data (HDFS), split, Map tasks, (Key, Value) pairs in Partition 1 and Partition 2, Reduce tasks.]

3

Background

• Hadoop
– Many steps within the Map stage and the Reduce stage
– Different steps may consume different types of resources

[Figure: steps of a Map task: Read, Map, Sort, Merge, Output.]

4

Motivation

• Problems

– Scheduling: no consideration of the execution time or of the different types of resources consumed
– Parameter tuning: numerous parameters, and the default values are not optimal

[Figure: two CPU-intensive jobs submitted to Hadoop (scheduling); a job run on Hadoop with the default configuration (parameter tuning).]

5

Motivation

• Solution: predict the performance of Hadoop jobs, which addresses both problems
– Scheduling: no consideration of the execution time or of the different types of resources consumed
– Parameter tuning: numerous parameters, and the default values are not optimal

6

Related Work

• Existing Prediction Method 1: Black Box Based

[Figure: black-box pipeline with Job Features, Hadoop, Statistical/Learning Models and Execution Time.]

– Lack of analysis of Hadoop
– Hard to choose

7

Related Work

• Existing Prediction Method 2: Cost Model Based

[Figure: the job features are fed to the cost functions, which give the execution time.]

F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)

– Difficult to ensure accuracy
– Lots of concurrent processes
– Hard to divide stages

[Figure: Hadoop steps: Read, map, Output, … Read … reduce, Output.]

8

Related Work

• A Brief Summary of Existing Prediction Methods

Black Box
– Advantages: simple and effective; high accuracy; high isomorphism
– Shortcomings: lack of job feature extraction; lack of analysis; hard to divide each step and resource

Cost Model
– Advantages: detailed analysis of Hadoop processing; flexible division (stage, resource); multiple prediction
– Shortcomings: lack of job feature extraction; a lot of concurrency, hard to model; better for theoretical analysis, not suitable for prediction

o Simple prediction
o Lack of job (jar package + data) analysis

9

Goal

• Design a Hadoop MapReduce performance prediction system to:
- Predict the job's consumption of various types of resources (CPU, disk I/O, network)
- Predict the execution time of the Map phase and the Reduce phase

[Figure: a Job is fed to the Prediction System, which outputs the Map and Reduce execution times and the CPU, disk and network occupation times.]

10

Design - 1

• Cost Model

[Figure: system design, step 1: the Job is fed to the Cost Model.]

11

Cost Model [1]

• Analysis of Map
- Model the consumption of resources (CPU, disk, network)
- Each stage involves only one type of resource

[1] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for hadoop mapreduce,” in CLUSTER Workshops, 2012, pp. 231–239.

[Figure: Map task stages, each labeled by the resource it uses (CPU, Disk, Net): Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort in Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization.]

12

Cost Model [1]

• Cost Function Parameters Analysis
– Type One: Constants
• Hadoop system consumption, initialization consumption
– Type Two: Job-related Parameters
• Map function computational complexity, Map input records
– Type Three: Parameters defined by the Cost Model
• Sorting coefficient, complexity factor

13

Parameters Collection

• Type One and Type Three
– Type One: run empty map tasks and calculate the system consumption from the logs
– Type Three: extract the sort part from the Hadoop source code and sort a certain number of records
• Type Two
– Run a new job and analyze its log
• High latency
• Large overhead
– Job Analyzer: sample the data and analyze only the behavior of the map and reduce functions
• Almost no latency
• Very low extra overhead

14

Job Analyzer - Implementation

– Hadoop virtual execution environment
• Accepts the job Jar file and input data
– Sampling Module
• Samples the input data by a certain percentage (less than 5%)
– MR Module
• Instantiates the user job's classes using Java reflection
– Analyze Module
• Input data (amount and number of records)
• Relative computational complexity
• Data conversion rates (output/input)

[Figure: the Jar file and input data enter the Hadoop virtual execution environment (Sampling Module, MR Module, Analyze Module), which outputs the job features.]
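The sampling and analysis steps can be illustrated with a small sketch. It is written in Python rather than against the Hadoop Java API, and the function names, the 4% sampling rate and the toy word-count map function are all assumptions made for illustration. It covers only the record counts and the data conversion rate, not the relative computational complexity.

```python
import random

def analyze_map_sample(records, map_fn, sample_rate=0.04, seed=0):
    """Run map_fn on a random sample of records and summarize its behavior."""
    rng = random.Random(seed)
    sample = [r for r in records if rng.random() < sample_rate]
    if not sample:
        # Fall back to the first record so that tiny inputs are still analyzed.
        sample = records[:1]
    in_bytes = sum(len(r.encode("utf-8")) for r in sample)
    out_pairs = [pair for r in sample for pair in map_fn(r)]
    out_bytes = sum(
        len(str(k).encode("utf-8")) + len(str(v).encode("utf-8"))
        for k, v in out_pairs
    )
    return {
        "sampled_records": len(sample),
        "input_bytes": in_bytes,
        "output_records": len(out_pairs),
        # Data conversion rate: output size divided by input size.
        "conversion_rate": out_bytes / in_bytes if in_bytes else 0.0,
    }

def word_count_map(line):
    """Toy map function: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

# Synthetic input: 1000 identical log-like records.
records = ["the quick brown fox"] * 1000
features = analyze_map_sample(records, word_count_map)
```

Because every record goes through the same map function, the conversion rate measured on the sample is a reasonable estimate for the whole input when the records are similar.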

15

Job Analyzer - Feasibility

– Data similarity: logs have a uniform format
– Execution similarity: each record is processed repeatedly by the same map and reduce functions

[Figure: MapReduce data flow: input data, split, Map tasks, Reduce tasks.]

16

Design - 2

• Parameters Collection

[Figure: system design, step 2: the Job Analyzer collects the Type Two parameters and the Static Parameters Collection Module collects the Type One and Type Three parameters for the Cost Model.]

17

Prediction Model

• Problem Analysis
- Many steps run concurrently, so the total time cannot be obtained by adding up the time of each part

[Figure: Map task stages labeled by resource (CPU, Disk, Net): Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort in Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization.]

18

Prediction Model

• Main Factors (according to the performance model)
- Map Stage

[Figure: Map task stages annotated with the main factors.]

– The amount of input data
– The number of input records (N)
– N log N
– The complexity of the Map function
– The conversion rate of the Map data

$$
T_{map} = \alpha_0 + \alpha_1 \cdot \text{MapInput} + \alpha_2 \cdot N + \alpha_3 \cdot N \log N + \alpha_4 \cdot (\text{complexity of the map function}) + \alpha_5 \cdot (\text{conversion rate of the map data})
$$
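As a rough illustration of how the map-stage equation could be evaluated, the sketch below builds the feature vector and takes its dot product with the coefficients. The function names, the natural-log base and the coefficient values are assumptions made for illustration; in the method itself the coefficients come from linear regression on history logs.

```python
import math

def map_features(map_input, n_records, complexity, conversion_rate):
    """Feature vector matching the terms of the T_map equation, in order."""
    # Natural log is an assumption; another base only rescales alpha_3.
    n_log_n = n_records * math.log(n_records) if n_records > 0 else 0.0
    return [1.0, map_input, n_records, n_log_n, complexity, conversion_rate]

def predict_t_map(alphas, features):
    """T_map as the dot product of alpha_0..alpha_5 with the features."""
    if len(alphas) != len(features):
        raise ValueError("expected one coefficient per feature")
    return sum(a * x for a, x in zip(alphas, features))

# Hypothetical coefficients, chosen only to make the example easy to check.
alphas = [100.0, 0.0, 2.0, 0.0, 0.0, 0.0]
features = map_features(
    map_input=64e6, n_records=1000, complexity=1.5, conversion_rate=0.8
)
t_map = predict_t_map(alphas, features)  # 100 + 2 * 1000
```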

19

Prediction Model

• Experimental Analysis
– Test 4 kinds of jobs (0-10000 records)
– Extract the features for linear regression
– Calculate the coefficient of determination (R²)

Jobs   Dedup    WordCount   Project   Grep     Total
R²     0.9982   0.9992      0.9991    0.9949   0.6157
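The gap between the per-job values and the Total column can be reproduced on synthetic data. The sketch below is a simplified single-feature version (time against number of records only) and the data are invented for illustration, not the measurements behind the table. Each job kind is perfectly linear on its own, but pooling two kinds with different slopes lowers R² sharply.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(xs, ys):
    """R^2 = 1 - SS_res / SS_tot for the least-squares line through the data."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Synthetic data: two job kinds, each linear in the number of records
# but with different slopes.
records = [1, 2, 3, 4, 5]
job_a = [2.0 * n for n in records]
job_b = [10.0 * n for n in records]

r2_a = r_squared(records, job_a)
r2_b = r_squared(records, job_b)
r2_pooled = r_squared(records + records, job_a + job_b)
```

Here `r2_a` and `r2_b` are essentially 1, while `r2_pooled` is well below 0.5, which mirrors the pattern in the table.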

20

Prediction Model

[Figure: execution time of Map versus number of records (0 to 9000) for Dedup, Grep, Project and WordCount.]

- Very good linear relationship within the same kind of job.
- But no linear relationship across different kinds of jobs.

21

Find the nearest jobs!

• Instance-Based Linear Regression
– Find the samples in the history logs that are nearest to the job to be predicted
– "Nearest" means similar jobs (top K nearest, with K = 10%-15%)
– Run a linear regression on the samples found
– Calculate the predicted value
• Nearest:
– The weighted distance between job features (weight w)
– High contribution to job classification:
• map/reduce complexity, map/reduce data conversion rate
– Low contribution to job classification:
• data amount, number of records
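A minimal sketch of this instance-based regression follows. Several choices are assumptions rather than details from the slides: the distance is a weighted Euclidean distance, the history holds (features, time) pairs with features (complexity, number of records), and the local fit drops features that are constant among the neighbors because the intercept absorbs them. All names and weights are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_and_predict(samples, query, ridge=1e-9):
    """Least-squares fit on (features, time) samples, evaluated at query."""
    feats = [f for f, _ in samples]
    times = [t for _, t in samples]
    m = len(samples)
    mean_t = sum(times) / m
    cols = list(zip(*feats))
    # Features that are constant among the neighbors are absorbed by the intercept.
    keep = [j for j, c in enumerate(cols) if max(c) > min(c)]
    if not keep:
        return mean_t
    means = {j: sum(cols[j]) / m for j in keep}
    spans = {j: max(cols[j]) - min(cols[j]) for j in keep}
    Z = [[(f[j] - means[j]) / spans[j] for j in keep] for f in feats]
    p = len(keep)
    # Normal equations on centered, scaled features; a tiny ridge term
    # keeps the system solvable if two features are collinear.
    A = [[sum(z[i] * z[j] for z in Z) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    rhs = [sum(z[i] * (t - mean_t) for z, t in zip(Z, times)) for i in range(p)]
    coef = solve(A, rhs)
    zq = [(query[j] - means[j]) / spans[j] for j in keep]
    return mean_t + sum(c * z for c, z in zip(coef, zq))

def predict_time(history, query, weights, k_fraction=0.12):
    """Predict from the top K nearest history samples (K as a fraction)."""
    k = max(2, math.ceil(k_fraction * len(history)))
    nearest = sorted(
        history, key=lambda s: weighted_distance(s[0], query, weights)
    )[:k]
    return fit_and_predict(nearest, query)

# Synthetic history: two job kinds with different complexity and slope.
sizes = range(1000, 11000, 1000)
history = [((1.0, n), 2.0 * n) for n in sizes] + [((5.0, n), 10.0 * n) for n in sizes]
# High weight on complexity, low weight on the number of records.
weights = (100.0, 1e-6)
predicted = predict_time(history, (5.0, 5500), weights)
```

With these weights the neighbors all come from the job kind whose complexity matches the query, so the local fit recovers that kind's slope and `predicted` lands close to 10 * 5500.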

22

Prediction Module

• Procedure

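[Figure: prediction procedure in seven numbered steps, linking the Cost Model, the Main Factors, the Job Features, the search for the nearest samples, the Prediction Function and the Prediction Results.]

$$
T_{map} = \alpha_0 + \alpha_1 \cdot \text{MapInput} + \alpha_2 \cdot N + \alpha_3 \cdot N \log N + \alpha_4 \cdot (\text{complexity of the map function}) + \alpha_5 \cdot (\text{conversion rate of the map data})
$$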

23

Prediction Module

• Procedure

[Figure: prediction procedure with the Training Set, Find-Neighbor Module, Cost Model, Prediction Function and Prediction Results.]

24

Design - 3

• Parameters Collection

[Figure: system design, step 3: the Job Analyzer (Type Two parameters) and the Static Parameters Collection Module (Type One and Type Three parameters) feed the Cost Model; the Prediction Module outputs the Map and Reduce execution times.]

25

Experiments

• Task Execution Time (Error Rate)
– K=12%, with a different w for each feature
– K=12%, with the same w for each feature
– K=25%, with a different w for each feature
– 4 kinds of jobs, 64 MB to 8 GB

[Figure: error rate (%) per job ID (1 to 40) for Reduce tasks and for Map tasks, with series k=12%, k=25% and k=12%, w=1.]
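The slides do not define the error rate explicitly. A common definition, assumed here, is the absolute difference between predicted and measured time divided by the measured time, as a percentage. The values below are invented for illustration.

```python
def error_rate(predicted, actual):
    """Relative prediction error as a percentage of the measured time."""
    if actual <= 0:
        raise ValueError("actual time must be positive")
    return abs(predicted - actual) / actual * 100.0

# Hypothetical (predicted, measured) task times in milliseconds.
results = [(110.0, 100.0), (45.0, 50.0), (200.0, 200.0)]
rates = [error_rate(p, a) for p, a in results]
mean_rate = sum(rates) / len(rates)
```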

26

Conclusion

• Job Analyzer:
– Analyze the job Jar and input file
– Collect parameters
• Prediction Module:
– Find the main factors
– Propose a linear equation
– Job classification
– Multiple prediction

27

Thank you!

Questions?

28

Cost Model [1]

• Analysis of Reduce
- Model the consumption of resources (CPU, disk, network)
- Each stage involves only one type of resource

[Figure: Reduce task stages, each labeled by the resource it uses (CPU, Disk, Net): Initiation, Read Data, Network Transfer, Create Object, Reduce Function, Merge Sort, Read/Write Disk, Network, Write Disk, Serialization, Deserialization.]

29

Prediction Model

• Main Factors (according to the performance model)
- Reduce Stage

[Figure: Reduce task stages annotated with the main factors.]

– The amount of input data
– The number of input records (N)
– N log N
– The complexity of the Reduce function
– The conversion rate of the Map data
– The conversion rate of the Reduce data

$$
T_{reduce} = \beta_0 + \beta_1 \cdot \text{MapInput} + \beta_2 \cdot N + \beta_3 \cdot N \log N + \beta_4 \cdot (\text{complexity of the reduce function}) + \beta_5 \cdot (\text{conversion rate of the map data}) + \beta_6 \cdot (\text{conversion rate of the reduce data})
$$
