Page 1: A Hadoop MapReduce Performance Prediction Method

A Hadoop MapReduce Performance Prediction Method

Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#

* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France

+ Ecole Centrale de Paris, France

# Beihang University, Beijing, China

Page 2: A Hadoop MapReduce Performance Prediction Method

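Background

• Hadoop MapReduce

[Figure: data flow of a MapReduce job. Input data stored in HDFS is split; each split is processed by a Map task; the map output (Key, Value) pairs are divided into partitions (Partition 1, Partition 2) that are consumed by the Reduce tasks.]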
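The data flow on this slide can be sketched in a few lines of plain Python. This is only an illustration of the split / map / partition / reduce sequence, not Hadoop code: the word-count job, the function names (`map_fn`, `reduce_fn`, `partition`, `run_job`) and the partitioning rule are all assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(line: str) -> Iterable[Tuple[str, int]]:
    """Hypothetical map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Hypothetical reduce function: sum the counts collected for one key."""
    return key, sum(values)


def partition(key: str, num_partitions: int) -> int:
    """Assign a key to a partition (simple deterministic rule, an assumption)."""
    return sum(ord(c) for c in key) % num_partitions


def run_job(lines: List[str], num_splits: int = 4,
            num_partitions: int = 2) -> Dict[str, int]:
    # Split: divide the input records among the map tasks.
    splits = [lines[i::num_splits] for i in range(num_splits)]

    # Map: each split is processed independently; every (key, value)
    # pair is routed to the partition that owns its key.
    partitions: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_partitions)
    ]
    for split in splits:
        for line in split:
            for key, value in map_fn(line):
                partitions[partition(key, num_partitions)][key].append(value)

    # Reduce: one reduce task per partition, keys processed in sorted order.
    result: Dict[str, int] = {}
    for part in partitions:
        for key in sorted(part):
            out_key, out_value = reduce_fn(key, part[key])
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

Everything here runs in one process; in Hadoop the map and reduce tasks run on different nodes and the partitions travel over the network, which is where the resource costs discussed later come from.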

Page 3: A Hadoop MapReduce Performance Prediction Method

Background

• Hadoop
  – Many steps within the Map stage and the Reduce stage
  – Different steps may consume different types of resources

[Figure: steps inside a Map task: READ, MAP, SORT, MERGE, OUTPUT]

Page 4: A Hadoop MapReduce Performance Prediction Method

Motivation

• Problems
  – Scheduling: no consideration of the execution time or of the different types of resources consumed
  – Parameter tuning: numerous parameters, and the default values are not optimal

[Figure: illustrations of the two problems. Labels: Hadoop, CPU Intensive (twice), Job, Default Conf]

Page 5: A Hadoop MapReduce Performance Prediction Method

Motivation

• Solution
  – Predict the performance of Hadoop jobs

• Problems addressed
  – Scheduling: no consideration of the execution time or of the different types of resources consumed
  – Parameter tuning: numerous parameters, and the default values are not optimal

Page 6: A Hadoop MapReduce Performance Prediction Method

Related Work

• Existing Prediction Method 1: Black Box Based
  – Job features → statistical/learning models → execution time, with Hadoop treated as a black box
  – Lacks analysis of Hadoop
  – Hard to choose

Page 7: A Hadoop MapReduce Performance Prediction Method

Related Work

• Existing Prediction Method 2: Cost Model Based
  – Job features → cost functions → execution time
      F(map) = f(read, map, sort, spill, merge, write)
      F(reduce) = f(read, write, merge, reduce, write)
  – Difficult to ensure accuracy
      • Lots of concurrent processes
      • Hard to divide stages

[Figure: Hadoop processing pipeline: Read, map, Output … Read … reduce, Output]

Page 8: A Hadoop MapReduce Performance Prediction Method

Related Work

• A Brief Summary of Existing Prediction Methods

              | Black Box                              | Cost Model
Advantages    | Simple and effective                   | Detailed analysis of Hadoop processing
              | High accuracy                          | Flexible division (stage, resource)
              | High isomorphism                       | Multiple prediction
Shortcomings  | Lack of job feature extraction         | Lack of job feature extraction
              | Lack of analysis                       | A lot of concurrency, hard to model
              | Hard to divide each step and resource  | Better for theoretical analysis, not suitable for prediction

• Simple prediction
• Lack of analysis of jobs (jar package + data)

Page 9: A Hadoop MapReduce Performance Prediction Method

Goal

• Design a Hadoop MapReduce performance prediction system to:
  – Predict the job's consumption of various types of resources (CPU, disk I/O, network)
  – Predict the execution time of the Map phase and the Reduce phase

[Figure: Job → Prediction System → Map execution time, Reduce execution time; CPU occupation time, disk occupation time, network occupation time]

Page 10: A Hadoop MapReduce Performance Prediction Method

Design - 1

• Cost Model

[Figure: Job → COST MODEL; outputs shown: Map execution time, Reduce execution time; CPU occupation time, disk occupation time, network occupation time]

Page 11: A Hadoop MapReduce Performance Prediction Method

Cost Model [1]

• Analysis of Map
  – Models the consumption of each resource (CPU, disk, network)
  – Each stage involves only one type of resource

[Figure: stages of a Map task, each marked as CPU, disk or network: Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort In Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization]

[1] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for Hadoop MapReduce,” in CLUSTER Workshops, 2012, pp. 231–239.

Page 12: A Hadoop MapReduce Performance Prediction Method

Cost Model [1]

• Cost Function Parameters Analysis
  – Type One: Constants
      • Hadoop system consumption, initialization consumption
  – Type Two: Job-related parameters
      • Computational complexity of the map function, number of map input records
  – Type Three: Parameters defined by the cost model
      • Sorting coefficient, complexity factor

Page 13: A Hadoop MapReduce Performance Prediction Method

Parameters Collection

• Type One and Type Three
  – Type One: run empty map tasks and calculate the system consumption from the logs
  – Type Three: extract the sort part from the Hadoop source code and sort a certain number of records

• Type Two
  – Option 1: run a new job and analyze its log
      • High latency
      • Large overhead
  – Option 2 (Job Analyzer): sample the data and analyze only the behavior of the map function and the reduce function
      • Almost no latency
      • Very low extra overhead
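As a rough illustration of how a Type Three parameter such as the sorting coefficient could be measured, the sketch below times a sort of a fixed number of records and normalizes by N*log2(N). It is an assumption-laden stand-in: it uses Python's built-in sort rather than the sort code extracted from Hadoop, and the function name `estimate_sort_coefficient` and the choice of log base are hypothetical.

```python
import math
import random
import time


def estimate_sort_coefficient(num_records: int, repeats: int = 3,
                              seed: int = 0) -> float:
    """Return the measured seconds per N*log2(N) unit for sorting N random keys.

    Python's list sort stands in for Hadoop's sort implementation (assumption).
    """
    if num_records < 2:
        raise ValueError("num_records must be at least 2")
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(repeats):
        records = [rng.random() for _ in range(num_records)]
        start = time.perf_counter()
        records.sort()
        # Keep the fastest run to reduce the influence of system noise.
        best = min(best, time.perf_counter() - start)
    return best / (num_records * math.log2(num_records))


if __name__ == "__main__":
    print(estimate_sort_coefficient(100_000))
```

Because the coefficient depends only on the machine and the sort code, not on the job, it can be measured once and reused, which matches the slide's grouping of it with the static parameters.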

Page 14: A Hadoop MapReduce Performance Prediction Method

Job Analyzer - Implementation

• Hadoop virtual execution environment
  – Accepts the job JAR file and the input data
• Sampling Module
  – Samples the input data by a certain percentage (less than 5%)
• MR Module
  – Instantiates the user job's class using Java reflection
• Analyze Module
  – Input data (amount and number)
  – Relative computational complexity
  – Data conversion rate (output/input)

[Figure: JAR file + input data → Hadoop virtual execution environment (Sampling Module, MR Module, Analyze Module) → job features]
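The sampling and analysis steps can be sketched as follows. This is a simplified Python stand-in, not the Java implementation described on the slide: the function `analyze_map`, the feature names, and the use of seconds per record as a proxy for computational complexity are assumptions made for illustration.

```python
import random
import time
from typing import Callable, Dict, Iterable, List, Tuple

KeyValue = Tuple[str, str]


def analyze_map(records: List[str],
                map_fn: Callable[[str], Iterable[KeyValue]],
                sample_rate: float = 0.04,
                seed: int = 0) -> Dict[str, float]:
    """Run map_fn on a small random sample and return simple job features."""
    rng = random.Random(seed)
    # Sampling Module: keep each record with probability sample_rate (< 5%).
    sample = [r for r in records if rng.random() < sample_rate]
    if not sample:
        raise ValueError("empty sample; raise sample_rate or supply more records")

    in_bytes = sum(len(r.encode("utf-8")) for r in sample)
    out_bytes = 0
    # MR Module: apply the user's map function to every sampled record.
    # The timed region also includes the byte counting, so the timing is
    # only a rough upper bound on the cost of map_fn itself.
    start = time.perf_counter()
    for record in sample:
        for key, value in map_fn(record):
            out_bytes += len(key.encode("utf-8")) + len(value.encode("utf-8"))
    elapsed = time.perf_counter() - start

    # Analyze Module: derive the features from the sampled run.
    return {
        "sampled_records": float(len(sample)),
        "sampled_bytes": float(in_bytes),
        "conversion_rate": out_bytes / in_bytes,        # output / input
        "seconds_per_record": elapsed / len(sample),    # complexity proxy
    }


def word_count_map(line: str) -> List[KeyValue]:
    """Hypothetical user map function used only for the demonstration."""
    return [(word, "1") for word in line.split()]


if __name__ == "__main__":
    print(analyze_map(["a b c"] * 1000, word_count_map))
```

Only the user's map function is exercised, so no cluster job has to be launched; this is the source of the low latency and low overhead claimed for the sampling approach.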

Page 15: A Hadoop MapReduce Performance Prediction Method

Job Analyzer - Feasibility

  – Data similarity: logs have a uniform format
  – Execution similarity: each record is processed repeatedly by the same map and reduce functions

[Figure: input data → split → Map tasks → Reduce tasks]

Page 16: A Hadoop MapReduce Performance Prediction Method

Design - 2

• Parameters Collection
  – Job Analyzer: collects parameters of Type Two
  – Static Parameters Collection Module: collects parameters of Type One and Type Three

[Figure: both collection modules feed the COST MODEL; outputs shown: Map execution time, Reduce execution time; CPU occupation time, disk occupation time, network occupation time]

Page 17: A Hadoop MapReduce Performance Prediction Method

Prediction Model

• Problem Analysis
  – Many steps run concurrently, so the total time cannot be obtained by adding up the time of each part

[Figure: stages of a Map task, each marked as CPU, disk or network: Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort In Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization]

Page 18: A Hadoop MapReduce Performance Prediction Method

Prediction Model

• Main Factors (according to the performance model)
  – Map Stage

[Figure: stages of a Map task: Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort In Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization]

Tmap = α0 + α1*MapInput + α2*N + α3*N*Log(N) + α4*(The complexity of map function) + α5*(The conversion rate of map data)

Main factors:
  – The amount of input data (MapInput)
  – The number of input records (N)
  – N*Log(N)
  – The complexity of the Map function
  – The conversion rate of Map data
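To make the Tmap equation concrete, the sketch below builds the vector of regressors for one job and evaluates the linear form for a given set of coefficients. The function names and the choice of the natural logarithm are assumptions; the slides do not state the log base, and the coefficient values used in the demonstration are placeholders, not fitted values.

```python
import math
from typing import List, Sequence


def map_features(map_input: float, num_records: int,
                 complexity: float, conversion_rate: float) -> List[float]:
    """Return [1, MapInput, N, N*log(N), complexity, conversion rate].

    The leading 1 multiplies the constant term alpha_0.
    """
    n = float(num_records)
    n_log_n = n * math.log(n) if n > 0 else 0.0  # natural log (assumption)
    return [1.0, map_input, n, n_log_n, complexity, conversion_rate]


def predict_map_time(alpha: Sequence[float], features: Sequence[float]) -> float:
    """Evaluate Tmap = sum(alpha_i * feature_i)."""
    if len(alpha) != len(features):
        raise ValueError("alpha and features must have the same length")
    return sum(a * x for a, x in zip(alpha, features))


if __name__ == "__main__":
    # Placeholder coefficients, for demonstration only.
    alpha = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    print(predict_map_time(alpha, map_features(10.0, 1, 2.0, 0.5)))
```

Writing the model as a dot product makes it linear in the coefficients, so they can be estimated by ordinary least squares even though the N*Log(N) term is nonlinear in N.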

Page 19: A Hadoop MapReduce Performance Prediction Method

Prediction Model

• Experimental Analysis
  – Test 4 kinds of jobs (0-10000 records)
  – Extract the features for linear regression
  – Calculate the coefficient of determination (R²)

Jobs   Dedup    WordCount   Project   Grep     Total
R²     0.9982   0.9992      0.9991    0.9949   0.6157

Page 20: A Hadoop MapReduce Performance Prediction Method

Prediction Model

[Figure: execution time of Map versus number of records (0 to 9000) for the Dedup, Grep, Project and WordCount jobs]

  – Very good linear relationship within the same kind of job.
  – But no linear relationship across different kinds of jobs.
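The contrast between per-job and pooled fits can be reproduced with a small least-squares sketch. The data below is synthetic and purely illustrative (two made-up job kinds with different per-record costs); it is not the measured data behind the slide, and the helper names are assumptions.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Coefficient of determination of the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic example: two hypothetical job kinds, each exactly linear in the
# number of records but with very different slopes.
records: List[float] = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
job_a = [2.0 * n for n in records]
job_b = [50.0 * n for n in records]

r2_a = r_squared(records, job_a)
r2_b = r_squared(records, job_b)
r2_pooled = r_squared(records + records, job_a + job_b)

if __name__ == "__main__":
    print(r2_a, r2_b, r2_pooled)
```

Each job kind alone is fitted almost perfectly, while a single line through both kinds fits poorly, which is the pattern the R² table and the plot describe.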

Page 21: A Hadoop MapReduce Performance Prediction Method

Find the nearest jobs!

• Instance-Based Linear Regression
  – Find the samples in the history logs that are nearest to the job to be predicted
  – "Nearest" means similar jobs (top K nearest, with K = 10%-15%)
  – Run a linear regression on the samples found
  – Calculate the predicted value

• Nearest:
  – The weighted distance between job features (weight w)
  – High contribution to job classification:
      • map/reduce complexity, map/reduce data conversion rate
  – Low contribution to job classification:
      • data amount, number of records
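A minimal sketch of instance-based linear regression is given below. Several things in it are assumptions rather than the authors' implementation: the feature order, the weighted Euclidean distance, the weight values, and, for brevity, a local regression on a single predictor (the number of records) instead of the full set of main factors.

```python
import math
from typing import List, Sequence, Tuple

# One history sample: (features, observed map time).
# Assumed feature order: [complexity, conversion_rate, data_amount, num_records]
Sample = Tuple[Sequence[float], float]


def weighted_distance(a: Sequence[float], b: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def nearest_samples(history: List[Sample], query: Sequence[float],
                    weights: Sequence[float],
                    k_fraction: float = 0.12) -> List[Sample]:
    """Return the top K nearest samples, with K given as a fraction of the history."""
    k = max(2, math.ceil(k_fraction * len(history)))
    ranked = sorted(history,
                    key=lambda s: weighted_distance(s[0], query, weights))
    return ranked[:k]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line; falls back to the mean when x has no spread."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def predict_from_neighbors(history: List[Sample], query: Sequence[float],
                           weights: Sequence[float],
                           k_fraction: float = 0.12,
                           records_index: int = 3) -> float:
    """Fit a line on the nearest samples only and evaluate it at the query."""
    neighbors = nearest_samples(history, query, weights, k_fraction)
    xs = [features[records_index] for features, _ in neighbors]
    ys = [observed for _, observed in neighbors]
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * query[records_index]


if __name__ == "__main__":
    # Synthetic history: two hypothetical job kinds with different costs.
    sizes = [1000.0 * i for i in range(1, 11)]
    history: List[Sample] = (
        [([1.0, 0.5, 100.0 * n, n], 2.0 * n) for n in sizes]
        + [([5.0, 2.0, 100.0 * n, n], 50.0 * n) for n in sizes]
    )
    # High weights for complexity and conversion rate, low for size features.
    weights = [1.0, 1.0, 1e-12, 1e-8]
    print(predict_from_neighbors(history, [5.0, 2.0, 550000.0, 5500.0], weights))
```

Giving the size features a very small weight means that neighbors are selected mainly by job kind, so the local regression only sees samples that share the same linear relationship.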

Page 22: A Hadoop MapReduce Performance Prediction Method

Prediction Module

• Procedure

[Figure: seven numbered steps linking the Cost Model, the Main Factors, the Job Features, the search for the nearest samples, the Prediction Function and the Prediction Results]

Tmap = α0 + α1*MapInput + α2*N + α3*N*Log(N) + α4*(The complexity of map function) + α5*(The conversion rate of map data)

Page 23: A Hadoop MapReduce Performance Prediction Method

Prediction Module

• Procedure

[Figure: components of the procedure: Training Set, Find-Neighbor Module, Cost Model, Prediction Function, Prediction Results]

Page 24: A Hadoop MapReduce Performance Prediction Method

Design - 3

• Parameters Collection
  – Job Analyzer: collects parameters of Type Two
  – Static Parameters Collection Module: collects parameters of Type One and Type Three

[Figure: both collection modules feed the COST MODEL, which outputs the CPU, disk and network occupation times; the Prediction Module outputs the Map execution time and the Reduce execution time]

Page 25: A Hadoop MapReduce Performance Prediction Method

Experiments

• Task Execution Time (Error Rate)
  – K=12%, with a different w for each feature
  – K=12%, with the same w for each feature
  – K=25%, with a different w for each feature
  – 4 kinds of jobs, 64M-8G

[Figure: two charts of error rate (%) versus job ID (1 to 40), one for Reduce tasks (y-axis 0 to 180) and one for Map tasks (y-axis 0 to 90); series: k=12%, k=25%, and k=12% with w=1]
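The slides do not define the error rate. A common choice, assumed here, is the relative error of the predicted time with respect to the measured time, expressed as a percentage; the function name is hypothetical.

```python
def error_rate_percent(predicted: float, actual: float) -> float:
    """Relative prediction error in percent (assumed definition)."""
    if actual <= 0:
        raise ValueError("actual execution time must be positive")
    return abs(predicted - actual) / actual * 100.0


if __name__ == "__main__":
    print(error_rate_percent(110.0, 100.0))
```

Under this definition a large overestimate yields an error rate above 100%, which would be consistent with the Reduce chart's axis extending to 180.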

Page 26: A Hadoop MapReduce Performance Prediction Method

Conclusion

• Job Analyzer:
  – Analyzes the job JAR and the input file
  – Collects parameters

• Prediction Module:
  – Finds the main factors
  – Proposes a linear equation
  – Job classification
  – Multiple prediction

Page 27: A Hadoop MapReduce Performance Prediction Method

Thank you!

Questions?

Page 28: A Hadoop MapReduce Performance Prediction Method

Cost Model [1]

• Analysis of Reduce
  – Models the consumption of each resource (CPU, disk, network)
  – Each stage involves only one type of resource

[Figure: stages of a Reduce task, each marked as CPU, disk or network: Initiation, Read Data, Network Transfer, Create Object, Reduce Function, Merge Sort, Read/Write Disk, Network, Write Disk, Serialization, Deserialization]

Page 29: A Hadoop MapReduce Performance Prediction Method

Prediction Model

• Main Factors (according to the performance model)
  – Reduce Stage

[Figure: stages of a Reduce task: Initiation, Read Data, Network Transfer, Create Object, Reduce Function, Merge Sort, Read/Write Disk, Network, Write Disk, Serialization, Deserialization]

Treduce = β0 + β1*MapInput + β2*N + β3*N*Log(N) + β4*(The complexity of Reduce function) + β5*(The conversion rate of Map data) + β6*(The conversion rate of Reduce data)

Main factors:
  – The amount of input data (MapInput)
  – The number of input records (N)
  – N*Log(N)
  – The complexity of the Reduce function
  – The conversion rate of Map data
  – The conversion rate of Reduce data