eneryg efficiency for mapreduce workloads: an indepth study boliang feng renmin university of china...

Eneryg Efficiency for MapReduce Workloads: An Indepth Study

Boliang Feng

Renmin University of [email protected], Dec 19 2012

@ ADC 2012

Outline

Why Energy?

Factors Affecting Energy Efficiency of MapReduce

Experimental Design

Analysis of Result

Key Finding and Recommendations

Conclusion and Future Work

Why energy?

Cooling

Cost

Enviormental Effect

Perfomance

Data Center: some numbers

The data center in Dallas, Oregon: ~50 MW Average electricity consumption in USA: ~900kwh/month/family, or 1.25KW

Power consumption is the major cost and constraint of data center

About 7000 data centers in USA

In US the data centers accounted for roughly 61 billion kWh (1.5% of the total U.S. electricity consumption) in 2006 (EPA 2007) The number is expected to be doubled by 2011

Green Computing in Cloud

Physical construction & Chip level

system software Virtiual Datacenter OS IBM: Power-Aware Request Distribution

Cluster level view Dynamic Resource Configuration Workload distribution …

Green application DBMS, MapReduce

Industrial standard Green Grid, PUE(Power Usage Effectiveness) , DiCE

Why MapReduce?

MapReduce & Hadoop MapReduce popular, fashionable distributed processing model for

parallel computing in data centers. Hadoop is an open-source implementation of MapReduce

New Challenges Little attention in the design of MapReduce platforms Perform automatic parallelization and distribution of computations MapReduce incorporates mechanisms to be resilient to failures

Aim to Answer 2 Questions

We aim to address the following two questions:

Which factors affect the cluster-wise energy efficiency of a MapReduce platform?

Is there any opportunity to perform tradeoff?

Factors Affect Energy Efficiency Identify 4 factors that affect the energy efficiency of

MapReduce: CPU intensiveness, I/O intensiveness Factors of the underlying distributed file system replica

factor as well as the file block size

Questions 1

Is there any opportunity? Identify four typical workloads of MapReduce that present

different kinds of application scenarios TextWrite, WordCount, GrepSearch, Terasort

Measuring the energy consumption with varied disparate cluster scales and other related factors

Questions 2

Energy Consumption Model for MapReduce

CPU intensiveness, I/O intensiveness

Replica Factors of the underlying distributed file system as well as the file block size

Factors Affecting Energy Efficiency ofMapReduce

Metric Power, Time, Energy Energy Efficiency (EE)

Cluster Setup 2.4GHZ Intel Core Duo processor, 4GB RAM, 1000Mbps NetCard Hadoop-0.20.2

Experiment Design

Workloads: TextWrite: Writes a large unsorted random sequence of words from a

word-list. Network-intensive, map only job WordCount: Map-only CPU-intensive job. Matching regular expressions

from input files. High CPU utilization in map stage GrepSearch: Balance between CPU-intensive(map stage) and I/O

intensive jobs(reduce stage). High map/reduce ratio Terasort: Sorting the official input datasets. CPU bound in map stage

and I/O bound in reduce stage. Low map/reduce ratio

Experiment Design

Varied cluster parameters: Cluster size: 2~6 nodes Replica factor: 1~5 replicas Block size: 16MB~1GB Data size: 5~20GB

Experiment Design

Analysis of Results

We run this four workloads with varied workload size, cluster scale, replica factor and block size

TextWrite(1)

It has almost a linear growth of both latency and energy consumption with the replica factor increasing

TextWrite(2)

With more nodes: great improvement of performance, 51.3%. Energy decreased from 71Wh to 49Wh

More nodes means the increase of power. But the response time reduction can trade-off the energy consumption

TextWrite(3)

Small block size enables more tasks to be processed in parallel.

When larger than 64MB, the parallelism of the system is reduced, so that energy consumption increases significantly

GrepSearch(1)

Higher degree of replica factor means more choices for tasks assignment, improving the load balancing of the system

HDFS replica placement policy not only improve data reliability, availability, but also improve the parallelism of the GrepSearch workload

GrepSearch(2)

More nodes means more resources and better performance

When the workload size is as small, the initial cost can not be amortized, and resources are sufficient. Thus, there is no obvious energy saving with the increase of the cluster size

GrepSearch(3)

With workload size increasing, the value of E-E is reduced

Small block size means large overhead on job initialization

Well-tuned block size can obtain energy saving by as much as 36.8%

TeraSort(1)

Can not see significant changes of performance with varied replica factors

More replicas would improve the load balance of map tasks

But would be large data transfer in the shuffle stage, affecting the progress of the whole job

TeraSort(2)

With varied cluster size and workload size

Larger workload increase the job runtime, and more IT components will lead to more energy waste

Add more servers provide higher I/O throughput achieves energy consumption reduction by 20.2%

TeraSort(3)

Fig.3.c shows the value of E-E with varied cluster size and workload size

Small block size means fine-grained input splits which will improve the performance of reduce sort

Big block size means less data shuffle

WordCount(1)

It employs almost 95% of the potential CPU

Increasing the degree of replica factor improve performance, parallelism

Large replica factors implies more opportunities for load balance

WordCount(2)

With more nodes added in the cluster, both response time and energy consumption decrease by 56.8% and 18.2% with 20GB data set

WordCount(3)

Small block size brings out high cost for tasks initialization

A large block size such as 1GB will make negative effects on parallelism of the system

Key Findings

With well-tuned system parameters and adaptive resource configurations, MapReduce cluster can achieve both performance improvement and good energy saving simultaneously in some instances

That is surprisingly contrast to previous works on cluster-level energy conservation.

Recommendations

For CPU intensive and high map/reduce ratio workloads, appropriate number of servers should be provided based on the workload size, ensuring load balance and adequate CPU resource

For I/O intensive workloads, fine-grained input splits is effective for shuffle and reduce stages, which is more energy efficient on condition that the initialization cost can be amortized

Improved data partitioning algorithms in map stage and content-ware reduce tasks scheduling strategies are key areas for energy efficiency, where refinements and improvements are needed.

Conclusion

We identified four factors that affect the energy efficiency of MapReduce based on the cluster

We chose four typical workloads of MapReduce and measured the energy consumption with varied disparate cluster scales and related factors

MapReduce cluster can achieve both significant energy saving and performance improvement simultaneously in some instances

Future Work

Verify the results of this paper in a larger size of clusters

More benchmarks of MapReduce should be introduced

Investigate the effects of changing other parameters: parallel reduce copies, memory limit, and file buffer size

Thank you Q&A

eneryg efficiency for mapreduce workloads: an indepth study boliang feng renmin university of china...

Documents