Profiling the Network Performance of Hadoop Jobs
Team: Pramod Biligiri & Sayed Asad Ali
Talk Outline
● Introduction to the problem
● What is Hadoop?
● Hadoop's MapReduce Framework
● Shuffle as a Bottleneck
● Experimental Setup
● Choice of Benchmarks
● Terasort Discussion
● Ranked Inverted Index Discussion
● Summary and Future Work
Introduction to the problem
Reproduce existing results which show that the "Network" is the bottleneck in shuffle-intensive Hadoop jobs.
What is Hadoop?
A framework for distributed processing of large data sets across clusters of computers using simple programming models, based on Google's MapReduce.
Distinct Features:
● Designed for commodity hardware
● Highly fault-tolerant
● Horizontally scalable
● Push computation to data
MapReduce
● MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
● Programming Model
  ○ For each input record, generate (key, value) pairs
  ○ Apply a reduce operation to all values corresponding to the same key
Hadoop's MapReduce Framework
1. Prepare the Map() input
2. Run the user-provided Map() code
3. "Shuffle" the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
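The five steps above can be sketched in miniature with a word count (a single-process Python sketch for illustration only; in real Hadoop, map and reduce tasks run on separate nodes and the shuffle moves data between them over the network):

```python
from collections import defaultdict

def map_fn(record):
    # Step 2: user-provided Map() - emit (key, value) pairs per input record
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Step 4: user-provided Reduce() - combine all values for one key
    return (key, sum(values))

def run_job(records):
    # Step 3: the "shuffle" - group map output by key before reducing.
    # On a cluster, this grouping is what crosses the network.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Step 5: produce the final output
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["the quick fox", "the lazy dog"]))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```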
MapReduce Flow
Shuffle!
Shuffle as a Bottleneck?
"On average, the shuffle phase accounts for 33% of the running time in these jobs. In addition, in 26% of the jobs with reduce tasks, shuffles account for more than 50% of the running time, and in 16% of jobs, they account for more than 70% of the running time. This confirms widely reported results that the network is a bottleneck in MapReduce."
source : Managing Data Transfers in Computer Clusters with Orchestra, Mosharaf Chowdhury et al.
Chosen Benchmarks
● Terasort
● Ranked Inverted Index
Experimental Setups

Config   | Instance type | Memory | CPU                                            | Elastic Compute Units | Disk       | Network performance
Config 1 | m1.large      | 7.5 GB | 64-bit                                         | 4                     | 2 x 420 GB | Moderate
Config 2 | m1.xlarge     | 15 GB  | 64-bit                                         | 8                     | 4 x 420 GB | High
SDSC     | custom        | 8 GB   | 64-bit, Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores | -                     | 2 x 1.5 TB | 1 Gb/s
Network Performance of EMR - Conflicting Values!
Source 1: measured with AppNeta pathtest, average: 753 Mb/s
http://www.appneta.com/resources/pathtest-download.html
Source 2: "The available bandwidth is still 1 Gb/s, confirming anecdotal evidence that EC2 has full bisection bandwidth."
source : Opening Up Black Box Networks with CloudTalk, Costin Raiciu et al.
Source 3: "The median TCP/UDP throughput of medium instances are both close to 760 Mb/s."
source : The Impact of Virtualization on Network Performance of Amazon EC2 Data Center, Guohui Wang et al.
Why Terasort?
● Popular benchmark for Hadoop
● Shipped with most Hadoop distributions
● Utilizes all aspects of the cluster: CPU, network, disk and memory
● Large amount of data to shuffle (240 GB)
● Representative of real-world workloads
“This data shuffle pattern arises in large scale sorts, merges and join operations in the data center. We chose this test because, in our interactions with application developers, we learned that many use such operations with caution, because the operations are highly expensive in today’s data center network.”
source : VL2: A Scalable and Flexible Data Center Network - A. Greenberg et al.
Terasort - How it works:
● Sorts 1 terabyte of data.
● Each data item is 100 bytes in size.
● The first 10 bytes of a data item constitute its sort key.
● Format of input data:
  ○ <key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
    ■ key: random characters from ASCII 32-126
    ■ rowid: an integer
    ■ filler: random characters from the set A-Z
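That record layout can be sketched in a few lines of Python (an illustrative sketch of the format described above, not the actual TeraGen code; the zero-padding of rowid is an assumption for fixed width):

```python
import random

KEY_LEN, ROWID_LEN, FILLER_LEN = 10, 10, 78

def make_record(rowid):
    # key: 10 random printable characters from ASCII 32-126
    key = "".join(chr(random.randint(32, 126)) for _ in range(KEY_LEN))
    # rowid: an integer, padded here to fill its 10-byte field
    rowid_field = str(rowid).zfill(ROWID_LEN)
    # filler: 78 random characters from the set A-Z
    filler = "".join(random.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
                     for _ in range(FILLER_LEN))
    return key + rowid_field + filler + "\r\n"

def sort_key(record):
    # Terasort orders records by the first 10 bytes only
    return record[:KEY_LEN]

rec = make_record(42)
assert len(rec) == 100  # 10 + 10 + 78 + 2 trailing bytes
```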
Terasort - How it works:
Map: partition input keys into different buckets
  (leverage Hadoop's default sorting of Map output)
Reduce: collect outputs from different maps
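The bucketing step can be illustrated as a range partitioner (a Python sketch, not Hadoop's actual TotalOrderPartitioner; the split points below are invented for illustration, whereas Terasort samples them from the input). Because each bucket holds a disjoint key range, once the framework sorts within each bucket, concatenating the bucket outputs in order yields a globally sorted result:

```python
import bisect

# Hypothetical split points; Terasort derives these by sampling the input
SPLIT_POINTS = ["F", "M", "S"]  # 4 buckets: < F, F..M, M..S, >= S

def partition(key):
    # Bucket index = number of split points <= key
    return bisect.bisect_right(SPLIT_POINTS, key)

buckets = {i: [] for i in range(len(SPLIT_POINTS) + 1)}
for key in ["Quick", "Alpha", "Zulu", "Golf", "Mike"]:
    buckets[partition(key)].append(key)

# The framework sorts within each bucket (Hadoop's default sort of Map
# output); concatenating buckets 0..n then yields a total order.
globally_sorted = [k for i in sorted(buckets) for k in sorted(buckets[i])]
assert globally_sorted == sorted(["Quick", "Alpha", "Zulu", "Golf", "Mike"])
```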
Results
Comparison of Terasort on different configurations

Instance type                 | Total job time (min) | Map time (min) | Reduce time (min) | Shuffle average time (min) | Shuffle time %
Config 1: m1.large (RAM 7.5 GB) | 205 | 84 | 205 | 60 | 29.3
SDSC: custom (RAM 8 GB)         | 166 | 60 | 90  | 36 | 21.7
Config 2: m1.xlarge (RAM 15 GB) | 86  | 40 | 75  | 22 | 25.5
CDF of data transferred over the network during the lifetime of the job
[Figure: CDF annotated where the Map phase ends, Shuffle starts, and Shuffle ends (Reduce nearly done), and where Map outputs are sorted locally on the node.]
[Figure: Network transfer rate on nodes; the network link is saturated.]
[Figure: Disk I/O during sorting of map outputs (blue: read, red: write).]
[Figure: CPU utilisation.]
[Figure: Memory statistics.]
Why Ranked Inverted Index?
● For a given text corpus, it generates, for each word, the list of documents containing that word, in decreasing order of frequency:
  word -> (count1 | file1), (count2 | file2), ...
  where count1 > count2 > ...
● A ranked inverted index is often used in text processing and information retrieval tasks
● Mentioned in the Tarazu paper as a shuffle-heavy workload
source : Tarazu: Optimizing MapReduce On Heterogeneous Clusters, Faraz Ahmad et al.
Ranked Inverted Index - How it works:
Map input: (word | filename) -> count
Map output: word -> (filename, count)
Reduce output: word -> (count1 | file1), (count2 | file2), ...
It involves a sort of the values on the reduce side.
(Note that the Map input is the output of another MapReduce job called sequence-count.)
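That map/reduce pipeline can be sketched as follows (a single-process Python illustration, not the benchmark's actual Java implementation; the sample (word, file, count) tuples stand in for the output of the sequence-count job):

```python
from collections import defaultdict

def map_fn(word, filename, count):
    # Map output: word -> (filename, count)
    return (word, (filename, count))

def reduce_fn(word, values):
    # Sort the (filename, count) pairs by count, descending.
    # This reduce-side sort is part of what makes the job shuffle-heavy.
    ranked = sorted(values, key=lambda fc: fc[1], reverse=True)
    return (word, [(count, fname) for fname, count in ranked])

# Stand-in for the output of the prior "sequence-count" job
seq_counts = [("hadoop", "a.txt", 3), ("hadoop", "b.txt", 7),
              ("sort", "a.txt", 1)]

groups = defaultdict(list)
for word, fname, count in seq_counts:
    k, v = map_fn(word, fname, count)
    groups[k].append(v)

index = dict(reduce_fn(w, vs) for w, vs in groups.items())
print(index["hadoop"])  # [(7, 'b.txt'), (3, 'a.txt')]
```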
Experimental Results of Ranked Inverted Index

Instance type                   | Total job time (min) | Map time (min) | Reduce time (min) | Shuffle average time (min) | Shuffle time %
Config 1: m1.large (RAM 7.5 GB) | 12 | 5.5 | 11.5 | 3.5 | 27.14
Input data set: 40 GB, ftp://ftp.ecn.purdue.edu/fahmad/rankedinvindex_40GB.tar.bz2
CDF of data transferred over the network during the lifetime of the job
[Figure: CDF annotated where the Map phase ends, Shuffle starts, and Shuffle ends (Reduce nearly done).]
Replicating results to 3 Nodes
[Figure: Network transfer rate on nodes; the network link is saturated.]
[Figure: Disk I/O (blue: read, red: write).]
[Figure: CPU utilisation.]
[Figure: Memory statistics.]
Summary
● Shuffle can constitute a significant fraction of the total job runtime
● It is worth investing in good network connectivity for a compute cluster
Stuff that doesn't add up!
● Why does the peak network bandwidth for Ranked Inverted Index overshoot the 1 Gb/s mark?
● Why is the sort phase of Ranked Inverted Index so short?
Future Work
● How does changing the various parameters make a difference? e.g. io.sort.mb, io.sort.factor, fs.inmemory.size.mb
● Effect of Combiners?
● Varying the number of Map tasks and Reduce tasks
● How many Map tasks are rack-local or machine-local?
● Investigate the unresolved issues
● Lack of precise information about "topology" and "network bandwidth" for EMR clusters
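As a starting point for that parameter tuning, the sort-related settings named above live in mapred-site.xml under their Hadoop 1.x names (a hedged sketch; the values shown are the stock defaults to vary from, not recommendations):

```xml
<!-- mapred-site.xml (Hadoop 1.x) - illustrative values only -->
<configuration>
  <property>
    <name>io.sort.mb</name>
    <value>100</value>   <!-- buffer size (MB) used when sorting map output -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>10</value>    <!-- number of streams merged at once during sorting -->
  </property>
</configuration>
```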
Q & A
Thank you!
Standard Test Results

Benchmark             | Input size | Run time on Hadoop (min) | Shuffle volume | Critical path
tera-sort             | 300 | 2353 | 200 | Shuffle
ranked-inverted-index | 205 | 2322 | 219 | Shuffle