Profiling the Network Performance of Hadoop Jobs
Team: Pramod Biligiri & Sayed Asad Ali
Talk Outline
● Introduction to the problem
● What is Hadoop?
● Hadoop's MapReduce Framework
● Shuffle as a Bottleneck
● Experimental Setup
● Choice of Benchmarks
● Terasort Discussion
● Ranked Inverted Index Discussion
● Summary and Future Work
Introduction to the problem
Reproduce existing results which show that the "Network" is the bottleneck in shuffle-intensive Hadoop jobs.
What is Hadoop?
A framework for distributed processing of large data sets across clusters of computers using simple programming models, based on Google's MapReduce.
Distinct Features:
● Designed for commodity hardware
● Highly fault-tolerant
● Horizontally scalable
● Push computation to data
MapReduce
● MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
● Programming Model
  ○ For each input record, generate (key, value) pairs
  ○ Apply a reduce operation to all values corresponding to the same key
Hadoop's MapReduce Framework
1. Prepare the Map() input
2. Run the user-provided Map() code
3. "Shuffle" the Map output to the Reduce processors
4. Run the user-provided Reduce() code
5. Produce the final output
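The five steps above can be sketched in miniature with a word count (a single-process Python sketch for illustration only; in real Hadoop, map and reduce tasks run on separate nodes and the shuffle moves data between them over the network):

```python
from collections import defaultdict

def map_fn(record):
    # Step 2: user-provided Map() - emit (key, value) pairs per input record
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Step 4: user-provided Reduce() - combine all values for one key
    return (key, sum(values))

def run_job(records):
    # Step 3: the "shuffle" - group map output by key before reducing.
    # On a cluster, this grouping is what crosses the network.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Step 5: produce the final output
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["the quick fox", "the lazy dog"]))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```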
MapReduce Flow
Shuffle!
Shuffle as a Bottleneck?
"On average, the shuffle phase accounts for 33% of the running time in these jobs. In addition, in 26% of the jobs with reduce tasks, shuffles account for more than 50% of the running time, and in 16% of jobs, they account for more than 70% of the running time. This confirms widely reported results that the network is a bottleneck in MapReduce."
source : Managing Data Transfers in Computer Clusters with Orchestra, Mosharaf Chowdhury et al.
Chosen Benchmarks
● Terasort
● Ranked Inverted Index
Experimental Setups

Config   | Instance type | Memory | CPU                                            | Elastic Compute Units | Disk       | Network performance
Config 1 | m1.large      | 7.5 GB | 64-bit                                         | 4                     | 2 x 420 GB | Moderate
Config 2 | m1.xlarge     | 15 GB  | 64-bit                                         | 8                     | 4 x 420 GB | High
SDSC     | custom        | 8 GB   | 64-bit, Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores | -                     | 2 x 1.5 TB | 1 Gb/s
Network Performance of EMR - Conflicting Values!
Source 1: measured with AppNeta pathtest, average: 753 Mb/s
http://www.appneta.com/resources/pathtest-download.html
Source 2: "The available bandwidth is still 1 Gb/s, confirming anecdotal evidence that EC2 has full bisection bandwidth."
source : Opening Up Black Box Networks with CloudTalk, Costin Raiciu et al.
Source 3: "The median TCP/UDP throughput of medium instances are both close to 760 Mb/s."
source : The Impact of Virtualization on Network Performance of Amazon EC2 Data Center, Guohui Wang et al.
Why Terasort?
● Popular benchmark for Hadoop
● Shipped with most Hadoop distributions
● Utilizes all aspects of the cluster: CPU, network, disk and memory
● Large amount of data to shuffle (240 GB)
● Representative of real-world workloads
“This data shuffle pattern arises in large scale sorts, merges and join operations in the data center. We chose this test because, in our interactions with application developers, we learned that many use such operations with caution, because the operations are highly expensive in today’s data center network.”
source : VL2: A Scalable and Flexible Data Center Network - A. Greenberg et al.
Terasort - How it works:
● Sorts 1 terabyte of data.
● Each data item is 100 bytes in size.
● The first 10 bytes of a data item constitute its sort key.
● Format of input data:
  ○ <key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
    ■ key: random characters from ASCII 32-126
    ■ rowid: an integer
    ■ filler: random characters from the set A-Z
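That record layout can be sketched in a few lines of Python (an illustrative sketch of the format described above, not the actual TeraGen code; the zero-padding of rowid is an assumption for fixed width):

```python
import random

KEY_LEN, ROWID_LEN, FILLER_LEN = 10, 10, 78

def make_record(rowid):
    # key: 10 random printable characters from ASCII 32-126
    key = "".join(chr(random.randint(32, 126)) for _ in range(KEY_LEN))
    # rowid: an integer, padded here to fill its 10-byte field
    rowid_field = str(rowid).zfill(ROWID_LEN)
    # filler: 78 random characters from the set A-Z
    filler = "".join(random.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
                     for _ in range(FILLER_LEN))
    return key + rowid_field + filler + "\r\n"

def sort_key(record):
    # Terasort orders records by the first 10 bytes only
    return record[:KEY_LEN]

rec = make_record(42)
assert len(rec) == 100  # 10 + 10 + 78 + 2 trailing bytes
```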
Terasort - How it works:
Map: partition input keys into different buckets
  (leverage Hadoop's default sorting of Map output)
Reduce: collect outputs from different maps
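The bucketing step can be illustrated as a range partitioner (a Python sketch, not Hadoop's actual TotalOrderPartitioner; the split points below are invented for illustration, whereas Terasort samples them from the input). Because each bucket holds a disjoint key range, once the framework sorts within each bucket, concatenating the bucket outputs in order yields a globally sorted result:

```python
import bisect

# Hypothetical split points; Terasort derives these by sampling the input
SPLIT_POINTS = ["F", "M", "S"]  # 4 buckets: < F, F..M, M..S, >= S

def partition(key):
    # Bucket index = number of split points <= key
    return bisect.bisect_right(SPLIT_POINTS, key)

buckets = {i: [] for i in range(len(SPLIT_POINTS) + 1)}
for key in ["Quick", "Alpha", "Zulu", "Golf", "Mike"]:
    buckets[partition(key)].append(key)

# The framework sorts within each bucket (Hadoop's default sort of Map
# output); concatenating buckets 0..n then yields a total order.
globally_sorted = [k for i in sorted(buckets) for k in sorted(buckets[i])]
assert globally_sorted == sorted(["Quick", "Alpha", "Zulu", "Golf", "Mike"])
```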
Results
Comparison of Terasort on different configurations

Instance type                 | Total job time (min) | Map time (min) | Reduce time (min) | Shuffle average time (min) | Shuffle time %
Config 1: m1.large (RAM 7.5 GB) | 205 | 84 | 205 | 60 | 29.3
SDSC: custom (RAM 8 GB)         | 166 | 60 | 90  | 36 | 21.7
Config 2: m1.xlarge (RAM 15 GB) | 86  | 40 | 75  | 22 | 25.5
CDF of data transferred over the network during the lifetime of the job
[Figure: CDF annotated where the Map phase ends, Shuffle starts, and Shuffle ends (Reduce nearly done), and where Map outputs are sorted locally on the node.]
[Figure: Network transfer rate on nodes; the network link is saturated.]
[Figure: Disk I/O during sorting of map outputs (blue: read, red: write).]
[Figure: CPU utilisation.]
[Figure: Memory statistics.]
Why Ranked Inverted Index?
● For a given text corpus, it generates, for each word, the list of documents containing that word, in decreasing order of frequency:
  word -> (count1 | file1), (count2 | file2), ...
  where count1 > count2 > ...
● A ranked inverted index is often used in text processing and information retrieval tasks
● Mentioned in the Tarazu paper as a shuffle-heavy workload
source : Tarazu: Optimizing MapReduce On Heterogeneous Clusters, Faraz Ahmad et al.
Ranked Inverted Index - How it works:
Map input: (word | filename) -> count
Map output: word -> (filename, count)
Reduce output: word -> (count1 | file1), (count2 | file2), ...
It involves a sort of the values on the reduce side.
(Note that the Map input is the output of another MapReduce job called sequence-count.)
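That map/reduce pipeline can be sketched as follows (a single-process Python illustration, not the benchmark's actual Java implementation; the sample (word, file, count) tuples stand in for the output of the sequence-count job):

```python
from collections import defaultdict

def map_fn(word, filename, count):
    # Map output: word -> (filename, count)
    return (word, (filename, count))

def reduce_fn(word, values):
    # Sort the (filename, count) pairs by count, descending.
    # This reduce-side sort is part of what makes the job shuffle-heavy.
    ranked = sorted(values, key=lambda fc: fc[1], reverse=True)
    return (word, [(count, fname) for fname, count in ranked])

# Stand-in for the output of the prior "sequence-count" job
seq_counts = [("hadoop", "a.txt", 3), ("hadoop", "b.txt", 7),
              ("sort", "a.txt", 1)]

groups = defaultdict(list)
for word, fname, count in seq_counts:
    k, v = map_fn(word, fname, count)
    groups[k].append(v)

index = dict(reduce_fn(w, vs) for w, vs in groups.items())
print(index["hadoop"])  # [(7, 'b.txt'), (3, 'a.txt')]
```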
Experimental Results of Ranked Inverted Index

Instance type                   | Total job time (min) | Map time (min) | Reduce time (min) | Shuffle average time (min) | Shuffle time %
Config 1: m1.large (RAM 7.5 GB) | 12 | 5.5 | 11.5 | 3.5 | 27.14
Input data set: 40 GB, ftp://ftp.ecn.purdue.edu/fahmad/rankedinvindex_40GB.tar.bz2
CDF of data transferred over the network during the lifetime of the job
[Figure: CDF annotated where the Map phase ends, Shuffle starts, and Shuffle ends (Reduce nearly done).]
Replicating results to 3 Nodes
[Figure: Network transfer rate on nodes; the network link is saturated.]
[Figure: Disk I/O (blue: read, red: write).]
[Figure: CPU utilisation.]
[Figure: Memory statistics.]
Summary
● Shuffle can constitute a significant fraction of the total job runtime
● It is worth investing in good network connectivity for a compute cluster
Stuff that doesn't add up!
● Why does the peak network bandwidth for Ranked Inverted Index overshoot the 1 Gb/s mark?
● Why is the sort phase of Ranked Inverted Index so short?
Future Work
● How does changing the various parameters make a difference? e.g. io.sort.mb, io.sort.factor, fs.inmemory.size.mb
● Effect of Combiners?
● Varying the number of Map tasks and Reduce tasks
● How many Map tasks are rack-local or machine-local?
● Investigate the unresolved issues
● Lack of precise information about "topology" and "network bandwidth" for EMR clusters
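As a starting point for that parameter tuning, the sort-related settings named above live in mapred-site.xml under their Hadoop 1.x names (a hedged sketch; the values shown are the stock defaults to vary from, not recommendations):

```xml
<!-- mapred-site.xml (Hadoop 1.x) - illustrative values only -->
<configuration>
  <property>
    <name>io.sort.mb</name>
    <value>100</value>   <!-- buffer size (MB) used when sorting map output -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>10</value>    <!-- number of streams merged at once during sorting -->
  </property>
</configuration>
```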
Q & A
Thank you!
Standard Test Results

Benchmark             | Input size | Run time on Hadoop (min) | Shuffle volume | Critical path
tera-sort             | 300 | 2353 | 200 | Shuffle
ranked-inverted-index | 205 | 2322 | 219 | Shuffle