performance issues on hadoop clusters

Performance Issues on Hadoop Clusters Jiong Xie Advisor: Dr. Xiao Qin Committee Members: Dr. Cheryl Seals Dr. Dean Hendrix University Reader: Dr. Fa Foster Dai 06/18/22 1

Upload: xiao-qin

Post on 10-May-2015


DESCRIPTION

The MapReduce model has become an important parallel processing model for large-scale data-intensive applications like data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely applied to support cluster computing jobs requiring low response time. The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most map tasks can quickly access their local data. Network delays due to data movement at run time have been ignored in recent Hadoop research. Unfortunately, both the homogeneity and data locality assumptions in Hadoop are optimistic at best and unachievable at worst, potentially introducing performance problems in virtualized data centers. We show in this dissertation that ignoring the data-locality issue in heterogeneous cluster computing environments can noticeably reduce the performance of Hadoop. If network delays are not considered, the performance of Hadoop clusters can be significantly degraded. In this dissertation, we address the problem of how to place data across nodes so that each node has a balanced data processing load. Apart from the data placement issue, we also design a prefetching and predictive scheduling mechanism to help Hadoop load data from local or remote disks into main memory. To avoid network congestion, we propose a preshuffling algorithm to preprocess intermediate data between the map and reduce stages, thereby increasing the throughput of Hadoop clusters. Given a data-intensive application running on a Hadoop cluster, our data placement, prefetching, and preshuffling schemes adaptively balance the tasks and amount of data to achieve improved data-processing performance. Experimental results on real data-intensive applications show that our design can noticeably improve the performance of Hadoop clusters.
In summary, this dissertation describes three practical approaches to improving the performance of Hadoop clusters, and explores the idea of integrating prefetching and preshuffling in the native Hadoop system.

TRANSCRIPT

Page 1: Performance Issues on Hadoop Clusters

Performance Issues on Hadoop Clusters

Jiong Xie

Advisor: Dr. Xiao Qin

Committee Members:

Dr. Cheryl Seals

Dr. Dean Hendrix

University Reader:

Dr. Fa Foster Dai

04/11/23 1

Page 2: Performance Issues on Hadoop Clusters

Overview of My Research

Themes: data locality, data movement, data shuffling

• Data placement on heterogeneous clusters [HCW 10]

• Prefetching data from disk to memory [Submit to IPDPS]

• Reducing network congestion [To Be Submitted]

Page 3: Performance Issues on Hadoop Clusters

Data-Intensive Applications


Page 4: Performance Issues on Hadoop Clusters

Data-Intensive Applications (cont.)


Page 5: Performance Issues on Hadoop Clusters

Background

• MapReduce programming model is growing in popularity

• Hadoop is used by Yahoo, Facebook, Amazon.


Page 6: Performance Issues on Hadoop Clusters

Hadoop Overview--MapReduce Running System

(J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150)

Page 7: Performance Issues on Hadoop Clusters

Hadoop Distributed File System


(http://lucene.apache.org/hadoop)

Page 8: Performance Issues on Hadoop Clusters

Motivations

• MapReduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates

Page 9: Performance Issues on Hadoop Clusters

Existing Hadoop Clusters

• Observation 1: Cluster nodes are dedicated
– Data locality issues
– Data transfer time

• Observation 2: The number of nodes is increased
– Scalability issues
– Shuffling overhead goes up

Page 10: Performance Issues on Hadoop Clusters

Proposed Solutions

P1: Data placement

P2: Prefetching

P3: Preshuffling

[Figure: Input, Map, Reduce, and Output stages, annotated with P1, P2, and P3]

Page 11: Performance Issues on Hadoop Clusters

Solutions

• P1: Data placement
– Offline, distributed data, heterogeneous node

• P2: Prefetching
– Online, data preloading

• P3: Preshuffling
– Intermediate data movement, reducing traffic

Page 12: Performance Issues on Hadoop Clusters

Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters

Page 13: Performance Issues on Hadoop Clusters

Motivational Example

• Node A (fast): 1 task/min

• Node B (slow): 2x slower

• Node C (slowest): 3x slower

[Figure: per-node processing timeline; axis: Time (min)]

Page 14: Performance Issues on Hadoop Clusters

The Native Strategy

[Figure: timelines of Node A, Node B, and Node C under the native strategy; bar labels: 3 tasks, 2 tasks, 6 tasks; legend: Loading, Transferring, Processing; axis: Time (min)]

Page 15: Performance Issues on Hadoop Clusters

Our Solution--Reducing data transfer time

[Figure: timelines of Node A’, Node B’, and Node C’ compared with Node A; bar labels: 3 tasks, 2 tasks, 6 tasks; legend: Loading, Transferring, Processing; axis: Time (min)]

Page 16: Performance Issues on Hadoop Clusters

Challenges

• Does distribution strategy depend on applications?

• Initialization of data distribution

• The data skew problems
– New data arrival
– Data deletion
– Data updating
– New joining nodes

Page 17: Performance Issues on Hadoop Clusters

Measure Computing Ratios

• Computing ratio

• Fast machines process large data sets

[Figure: timelines of Node A (1 task/min), Node B (2x slower), and Node C (3x slower); axis: Time]

Page 18: Performance Issues on Hadoop Clusters

Measuring Computing Ratios

Node    Response time (s)  Ratio  # of File Fragments  Speed
Node A  10                 1      6                    Fastest
Node B  20                 2      3                    Average
Node C  30                 3      2                    Slowest

1. Run an application, collect response time

2. Set ratio of a node offering the shortest response time as 1

3. Normalize ratios of other nodes

4. Calculate the least common multiple of these ratios

5. Determine the amount of data processed by each node
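
The five steps above can be sketched in a few lines of Python. This is a minimal sketch, not the dissertation's code: the function names are hypothetical, and it assumes every response time is a whole multiple of the fastest one, as in the table.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def fragments_per_node(response_times):
    """Map each node to the number of file fragments it should process."""
    # Step 2: the node with the shortest response time gets ratio 1.
    fastest = min(response_times.values())
    # Step 3: normalize the other nodes (assumes whole-number multiples).
    ratios = {node: t // fastest for node, t in response_times.items()}
    # Step 4: least common multiple of all ratios.
    common = reduce(lcm, ratios.values())
    # Step 5: a node with ratio r receives common / r fragments.
    return {node: common // r for node, r in ratios.items()}


# Step 1: response times (seconds) measured by running an application.
response_times = {"Node A": 10, "Node B": 20, "Node C": 30}
fragments = fragments_per_node(response_times)
print(fragments)  # {'Node A': 6, 'Node B': 3, 'Node C': 2}
```

With 6, 3, and 2 fragments, each node needs the same time to finish its share, which is the balance the ratios are meant to achieve.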

Page 19: Performance Issues on Hadoop Clusters

Initialize Data Distribution

• Input files split into 64 MB blocks

• Round-Robin data distribution algorithm

[Figure: the Namenode distributes the blocks of File1 (1-9, a-c) to Datanodes A, B, and C; portions 3:2:1]
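
A weighted round-robin placement can be sketched as follows. This is an illustration under assumptions: the function name and input format are hypothetical, and the 3:2:1 portions are assumed to apply to nodes A, B, and C in that order, which the slide does not state explicitly.

```python
from itertools import cycle


def place_blocks(blocks, portions):
    """Assign blocks to nodes in weighted round-robin order.

    `portions` maps node name -> integer weight (hypothetical input format).
    """
    placement = {node: [] for node in portions}
    # One round of the schedule visits each node `weight` times.
    schedule = [node for node, weight in portions.items() for _ in range(weight)]
    # Walk the blocks in order, cycling through the schedule.
    for block, node in zip(blocks, cycle(schedule)):
        placement[node].append(block)
    return placement


blocks = list("123456789abc")  # the twelve blocks shown in the figure
placement = place_blocks(blocks, {"A": 3, "B": 2, "C": 1})
for node, held in placement.items():
    print(node, len(held), held)
```

Twelve blocks split 6:4:2, so the fastest node holds three times as much data as the slowest.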

Page 20: Performance Issues on Hadoop Clusters

Data Redistribution

1. Get network topology, ratio, and utilization

2. Build and sort two lists: under-utilized node list L1 and over-utilized node list L2

3. Select the source and destination node from the lists.

4. Transfer data

5. Repeat steps 3 and 4 until the list is empty.

[Figure: the Namenode moves blocks (1-9, a-c) among Datanodes A, B, and C using lists L1 and L2; portion 3:2:1]
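
The redistribution loop can be sketched as below. This is a simplified illustration, not the dissertation's algorithm: it ignores network topology, takes the target block counts as given, and always pairs the most over-utilized node with the most under-utilized one. All names are hypothetical.

```python
def redistribute(current, target):
    """Return (source, destination, n_blocks) transfers that reach `target`.

    `current` and `target` map node -> block count and have equal totals.
    """
    # Step 2: build the two lists as [amount, node] entries.
    over = [[current[n] - target[n], n] for n in current if current[n] > target[n]]
    under = [[target[n] - current[n], n] for n in current if current[n] < target[n]]
    transfers = []
    # Step 5: repeat until one of the lists is empty.
    while over and under:
        over.sort(reverse=True)
        under.sort(reverse=True)
        # Step 3: pick the largest surplus and the largest deficit.
        src, dst = over[0], under[0]
        moved = min(src[0], dst[0])
        # Step 4: record the transfer and update both entries.
        transfers.append((src[1], dst[1], moved))
        src[0] -= moved
        dst[0] -= moved
        over = [entry for entry in over if entry[0] > 0]
        under = [entry for entry in under if entry[0] > 0]
    return transfers


current = {"A": 2, "B": 3, "C": 7}
target = {"A": 6, "B": 4, "C": 2}
transfers = redistribute(current, target)
print(transfers)  # [('C', 'A', 4), ('C', 'B', 1)]
```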

Page 21: Performance Issues on Hadoop Clusters

Experimental Environment

Five nodes in a Hadoop heterogeneous cluster

Node    CPU Model         CPU (Hz)  L1 Cache (KB)
Node A  Intel Core 2 Duo  2*1G=2G   204
Node B  Intel Celeron     2.8G      256
Node C  Intel Pentium 3   1.2G      256
Node D  Intel Pentium 3   1.2G      256
Node E  Intel Pentium 3   1.2G      256

Page 22: Performance Issues on Hadoop Clusters

Benchmarks

• Grep: a tool searching for a regular expression in a text file

• WordCount: a program used to count words in a text file

• Sort: a program used to list the inputs in sorted order.
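
The three benchmarks can be illustrated with a tiny in-memory map/shuffle/reduce loop. This is a teaching sketch in plain Python, not Hadoop code; `run_mapreduce` and the map and reduce functions are hypothetical names, and Sort is modeled as an identity map that relies on the key ordering applied during the shuffle.

```python
import re
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group values by key (shuffle), then reduce, all in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Keys are visited in sorted order, as in the shuffle phase.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def wordcount_map(line):
    """WordCount: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def make_grep_map(pattern):
    """Grep: emit the line itself when it matches the regular expression."""
    regex = re.compile(pattern)

    def grep_map(line):
        if regex.search(line):
            yield line, 1

    return grep_map


def sum_reduce(key, values):
    return sum(values)


lines = ["reduce tasks follow map tasks", "hadoop runs map tasks"]
word_counts = run_mapreduce(lines, wordcount_map, sum_reduce)
grep_hits = run_mapreduce(lines, make_grep_map(r"had\w+"), sum_reduce)
sorted_lines = list(run_mapreduce(lines, lambda line: [(line, 1)], sum_reduce))
print(word_counts, grep_hits, sorted_lines, sep="\n")
```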


Page 23: Performance Issues on Hadoop Clusters

Response Time of Grep and WordCount in Each Node

• Computing ratio is application dependent

• Computing ratio is data size independent

Page 24: Performance Issues on Hadoop Clusters

Computing Ratio for Two Applications

Computing ratios of the five nodes with respect to the Grep and WordCount applications

Computing Node  Ratios for Grep  Ratios for WordCount
Node A          1                1
Node B          2                2
Node C          3.3              5
Node D          3.3              5
Node E          3.3              5

Page 25: Performance Issues on Hadoop Clusters

Six Data Placement Decisions


Page 26: Performance Issues on Hadoop Clusters

Impact of data placement on performance of Grep


Page 27: Performance Issues on Hadoop Clusters

Impact of data placement on performance of WordCount


Page 28: Performance Issues on Hadoop Clusters

Summary of Data Placement

P1: Data Placement Strategy

• Motivation: Fast machines process large data sets

• Problem: Data locality issue in heterogeneous clusters

• Contributions: Distribute data according to computing capability
– Measure computing ratio
– Initialize data placement
– Redistribution

Page 29: Performance Issues on Hadoop Clusters

Predictive Scheduling and Prefetching for Hadoop clusters


Page 30: Performance Issues on Hadoop Clusters

Prefetching

• Goal: Improving performance

• Approach
– Best effort to guarantee data locality
– Keeping data close to computing nodes
– Reducing the CPU stall time

Page 31: Performance Issues on Hadoop Clusters

Challenges

• What to prefetch?

• How to prefetch?

• What is the size of blocks to be prefetched?


Page 32: Performance Issues on Hadoop Clusters

Dataflow in Hadoop

1. Submit job

2. Schedule

3. Read input

4. Run map

5. Heartbeat

6. Next task

7. Read new file

[Figure: dataflow among HDFS (Block 1, Block 2), map tasks, local file systems (Local FS), and reduce tasks]

Page 33: Performance Issues on Hadoop Clusters

Dataflow in Hadoop

1. Submit job

2. Schedule + more task + meta data

3. Read input

4. Run map

5. Heartbeat

5.1. Read new file

6. Next task

[Figure: dataflow among HDFS (Block 1, Block 2), map tasks, local file systems (Local FS), and reduce tasks, with the new file read at step 5.1, before the next task is assigned]
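
The idea of reading the next input while the current map task is still running can be sketched with a worker thread and a bounded in-memory buffer. This is a single-process illustration under assumptions, not Hadoop code: `load_block` is a hypothetical stand-in for a disk read, and `depth` is an assumed limit on how many blocks are preloaded.

```python
import queue
import threading
import time


def load_block(block_id):
    """Hypothetical stand-in for reading a block from a local or remote disk."""
    time.sleep(0.01)  # simulated I/O delay
    return f"data-{block_id}"


def prefetcher(block_ids, buffer):
    """Worker thread: loads upcoming blocks into memory ahead of the map task."""
    for block_id in block_ids:
        # put() blocks while the buffer is full, which bounds memory use.
        buffer.put((block_id, load_block(block_id)))
    buffer.put(None)  # sentinel: no more blocks


def run_map_tasks(block_ids, depth=2):
    """Process blocks in order while the worker preloads the next ones."""
    buffer = queue.Queue(maxsize=depth)
    worker = threading.Thread(target=prefetcher, args=(block_ids, buffer))
    worker.start()
    results = []
    while (item := buffer.get()) is not None:
        block_id, data = item
        results.append((block_id, data.upper()))  # the "map" work
    worker.join()
    return results


results = run_map_tasks([1, 2, 3, 4])
print(results)
```

While the map loop processes one block, the worker is already loading the next, so the map task spends less time stalled on I/O.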

Page 34: Performance Issues on Hadoop Clusters

Prefetching Processing


Page 35: Performance Issues on Hadoop Clusters

Software Architecture


Page 36: Performance Issues on Hadoop Clusters

Grep Performance

• 1G: 9.5%

• 2G: 8.5%

Page 37: Performance Issues on Hadoop Clusters

WordCount Performance

• 1G: 8.9%

• 2G: 8.1%

Page 38: Performance Issues on Hadoop Clusters

Large/Small file in a node

• 9.1% Grep, 8.3% WordCount

• 18% Grep, 24% WordCount

Page 39: Performance Issues on Hadoop Clusters

Experiment Setting


Page 40: Performance Issues on Hadoop Clusters

Large/Small file in cluster


Page 41: Performance Issues on Hadoop Clusters

Summary

P2: Predictive Scheduler and Prefetching

• Goal: Moving data before tasks are assigned

• Problem: Synchronizing tasks and data

• Contributions: Preloading the required data before the task is assigned
– Predictive scheduler
– Prefetching mechanism
– Worker thread

Page 42: Performance Issues on Hadoop Clusters

Adaptive Preshuffling in Hadoop clusters


Page 43: Performance Issues on Hadoop Clusters

Preshuffling

• Observation 1: Too much data moves from map workers to reduce workers
– Solution 1: Map nodes apply pre-shuffling functions to their local output

• Observation 2: No reduce can start until a map is complete.
– Solution 2: Intermediate data is pipelined between mappers and reducers.
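
One simple reading of Solution 1 is combiner-style local aggregation: pairs that share a key are merged on the map node before anything crosses the network. The sketch below assumes that reading; the slides do not define the pre-shuffling function, so the dissertation's actual function may differ, and the function names are hypothetical.

```python
from collections import Counter


def map_words(block):
    """Map step: emit (word, 1) for every word in a block of text."""
    return [(word, 1) for word in block.split()]


def preshuffle(pairs):
    """Merge pairs with the same key on the map node before sending them."""
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return sorted(combined.items())


block = "to be or not to be"
raw = map_words(block)
local = preshuffle(raw)
print(len(raw), "pairs before,", len(local), "pairs after")
print(local)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Here six intermediate pairs shrink to four, and the saving grows with the number of repeated keys in a block.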

Page 44: Performance Issues on Hadoop Clusters

Preshuffling

• Goal: Minimize data shuffle during Reduce

• Approach
– Pipeline
– Overlap between map and data movement
– Group map and reduce

• Challenges
– Synchronize map and reduce
– Data locality

Page 45: Performance Issues on Hadoop Clusters

Dataflow in Hadoop

Map-side steps:

1. Submit job

2. Schedule

3. Read input

4. Run map

5. Heartbeat

6. Next task

Shuffle and reduce steps:

2. New task

3. Request data (HTTP GET)

4. Send data

5. Write data

[Figure: dataflow among HDFS (Block 1, Block 2), map tasks, local file systems (Local FS), reduce tasks, and HDFS output]

Page 46: Performance Issues on Hadoop Clusters

PreShuffle

[Figure: data requests between map tasks and reduce tasks]

Page 47: Performance Issues on Hadoop Clusters

In-memory buffer


Page 48: Performance Issues on Hadoop Clusters

Pipelining – A new design

[Figure: pipelined dataflow among HDFS (Block 1, Block 2), map tasks, and reduce tasks]

Page 49: Performance Issues on Hadoop Clusters

WordCount Performance


230 seconds vs 180 seconds

Page 50: Performance Issues on Hadoop Clusters

WordCount Performance


Page 51: Performance Issues on Hadoop Clusters

Sort Performance

Page 52: Performance Issues on Hadoop Clusters

Summary

P3: Preshuffling

• Goal: Minimize data shuffling during the Reduce

• Problem: Task distribution and synchronization

• Contributions: Preshuffling algorithm
– Push data instead of the traditional pull
– In-memory buffer
– Pipeline
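
The push, in-memory buffer, and pipeline ideas can be combined in a small threaded sketch. It is an illustration under assumptions, not the dissertation's implementation: mappers push each pair into a bounded queue as soon as it is produced, and a single reducer consumes pairs while the mappers are still running. All names and the `buffer_size` parameter are hypothetical.

```python
import queue
import threading
from collections import Counter


def mapper(lines, buffer):
    """Push each intermediate pair to the reducer as soon as it is produced."""
    for line in lines:
        for word in line.split():
            buffer.put((word, 1))  # blocks while the in-memory buffer is full
    buffer.put(None)  # sentinel: this mapper is done


def reducer(buffer, n_mappers, totals):
    """Consume pairs as they arrive instead of waiting for all maps to end."""
    finished = 0
    while finished < n_mappers:
        item = buffer.get()
        if item is None:
            finished += 1
            continue
        word, count = item
        totals[word] += count


def run_pipeline(splits, buffer_size=8):
    """Run one mapper thread per split and one reducer thread, pipelined."""
    buffer = queue.Queue(maxsize=buffer_size)
    totals = Counter()
    reduce_thread = threading.Thread(target=reducer, args=(buffer, len(splits), totals))
    map_threads = [threading.Thread(target=mapper, args=(s, buffer)) for s in splits]
    reduce_thread.start()
    for thread in map_threads:
        thread.start()
    for thread in map_threads:
        thread.join()
    reduce_thread.join()
    return dict(totals)


totals = run_pipeline([["a b a"], ["b c"]])
print(totals)
```

The bounded queue plays the role of the in-memory buffer: a mapper that runs ahead of the reducer simply waits instead of spilling intermediate data to disk.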

Page 53: Performance Issues on Hadoop Clusters

Conclusion

• P1: Data placement
– Offline, distributed data, heterogeneous node

• P2: Prefetching
– Online, data preloading, single node

• P3: Preshuffling
– Intermediate data movement, reducing traffic

[Figure: Input, Map, Reduce, and Output stages, annotated with P1, P2, and P3]

Page 54: Performance Issues on Hadoop Clusters

Future Work

• Extend Pipelining
– Implement the pipelining design

• Small files issue
– Har file
– Sequence file
– CombineFileInputFormat

• Extend Data placement

Page 55: Performance Issues on Hadoop Clusters

Thanks! And Questions?

Page 56: Performance Issues on Hadoop Clusters

Run Time affected by Network Condition


Experiment result conducted by Yixian Yang

Page 57: Performance Issues on Hadoop Clusters

Traffic Volume affected by Network Condition


Experiment result conducted by Yixian Yang