
The Future of MapReduce

In Data-Intensive Computing

Author: Thang M. Le [email protected]


1. INTRODUCTION

Collecting and processing data are driving forces behind business advances. However, building a system to leverage these forces is a challenging task. Without a proper infrastructure in place, companies can neither manage nor keep up with the exponential growth of their data. In the midst of the search for a standard model for data-intensive computing, the MapReduce framework [2] arises as a simplified

distributed computing model which can effectively process data on a large cluster of commodity

machines. MapReduce enjoys widespread adoption thanks to its simplicity and efficiency. Within

Google, the implementation of MapReduce has consistently produced pleasant results for various

applications ranging from extracting properties of web pages to indexing documents for Google web

search service [2]. Yahoo! has developed Hadoop, an open source implementation of the MapReduce

architecture, for their data cloud computing. The Open Science Data Cloud maintained by the Open

Cloud Consortium integrates Hadoop into its software stack for managing, analyzing and sharing

scientific data [6]. The Phoenix project [10] adopts MapReduce to build a scalable, high-performance system capable of processing large datasets on shared-memory machines. More and more companies are planning to build their data-analysis systems on the MapReduce architecture.

Recently, there have been concerns regarding the efficiency of MapReduce systems. Direct comparisons

between Hadoop and various parallel database management systems (DBMSs) have been carried out [8][9]. All results indicate that parallel DBMSs perform substantially faster than Hadoop, by a factor of 3 to 6. On August

8, 2009, another benchmark was performed using the Terabyte Sort benchmark on the LexisNexis High-Performance Computing Cluster (HPCC) and Hadoop (release 0.19). In this benchmark, the HPCC system needs a total of 6 minutes and 27 seconds to complete the sorting test while Hadoop

requires 25 minutes 28 seconds [7]. We carefully revisited the MapReduce model to get a better understanding of some of the questionable design decisions which we think are the culprits behind the differences

in performance. We also reviewed the above benchmark tests and various comparisons between parallel

DBMSs and MapReduce. We came to the conclusion that the benchmarks might indicate differences in performance between the MapReduce architecture and the approach taken by commercial parallel DBMSs. However, these differences complement the overall reliability of the MapReduce system, which is much more important when operating on a large cluster of thousands of commodity machines where "component failures are the norm rather than the exception" [5]. In such an environment, MapReduce has shown its fast-recovery advantage through a sorting benchmark performed with the deaths of 200 out of 1,746 workers: the sort program still completed in 933 seconds, compared with 891 seconds when there was no failure [2]. In contrast, we have not seen this type of benchmark test for parallel DBMSs. Without statistics on how well a parallel DBMS reacts to large-scale machine failures, there is not enough evidence for a fair comparison between MapReduce and traditional parallel DBMSs. To further clarify our

conclusion, we analyze the MapReduce model and compare it with the approach taken by parallel

DBMSs in section 2. We also discuss various optimization capabilities that exist in the MapReduce model, which developers should be aware of in order to speed up MapReduce executions. Even though we believe the MapReduce model is the right approach toward data-intensive computing, designers and developers still need to build enhancements for optimization. In section 3, we provide further discussions and

improvements which we think would significantly contribute to the overall performance of MapReduce

systems.


2. A REVIEW OF THE MAPREDUCE MODEL

The MapReduce programming model expresses an operation as two abstract functions: map and reduce.

A map function operates directly on input data and generates intermediate results in the form of key/value pairs (which might contain duplicate keys). The intermediate results are then passed to the reduce

function to produce a new set of key/value pairs (no duplicate keys) which can be the final output or the input for the next map function. With this flexibility, developers can easily use MapReduce in one execution or chain a list of executions for complex analytics [9]. The consistent key/value format for input and output, together with the ability to chain MapReduce executions, is a strength of the MapReduce programming model over SQL-like languages.
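To make this contract concrete, here is a minimal word-count sketch in Python (the canonical example from [2]). The function names and the single-process driver are our own illustration, not part of any particular framework:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(filename, contents):
    # Emit an intermediate (key, value) pair per word;
    # duplicate keys are allowed at this stage.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Collapse all values for one key into a single output pair.
    yield (key, sum(values))

def run_mapreduce(inputs):
    # Shuffle phase: sort and group intermediate pairs by key.
    intermediate = sorted(
        (pair for name, data in inputs for pair in map_fn(name, data)),
        key=itemgetter(0))
    output = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            output[k] = v
    return output

print(run_mapreduce([("a.txt", "the quick fox"), ("b.txt", "the lazy dog")]))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```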

A map function and its associated reduce function have a producer-consumer relationship and are executed in sequential order. To exploit parallelism, MapReduce further assumes:

Assumption 1: Map functions can be applied to any subset of input data items independently.

Assumption 2: Reduce functions can be applied to pairs of intermediate results in any order.

Assumption 1 allows MapReduce to split a lengthy execution into smaller executions. Assumption 2 lets MapReduce perform tasks concurrently without worrying about synchronization. Together, these let MapReduce split a lengthy execution into smaller tasks and then execute those tasks on many machines to achieve a high degree of concurrency. Because of this unique way of performing tasks, we relate MapReduce to distributed computing rather than parallel computing.
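A minimal sketch of how these two assumptions translate into concurrency, using Python's standard process pool; the one-line-per-chunk split and the use of Counter are illustrative choices, not part of the model:

```python
from concurrent.futures import ProcessPoolExecutor
from collections import Counter

def map_chunk(chunk):
    # Assumption 1: each chunk can be mapped independently.
    return Counter(word for line in chunk for word in line.split())

def merge(a, b):
    # Assumption 2: merging partial results is order-insensitive,
    # because Counter addition is commutative and associative.
    return a + b

if __name__ == "__main__":
    lines = ["the quick fox", "the lazy dog", "the fox"]
    chunks = [lines[i:i + 1] for i in range(len(lines))]  # one line per task
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))  # tasks run concurrently
    result = Counter()
    for p in partials:  # any merge order yields the same totals
        result = merge(result, p)
    print(result)
```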

To maximize efficiency, MapReduce introduces:

Master node: one master node is responsible for assigning map tasks to map workers and reduce tasks to reduce workers, coordinating reduce workers, maintaining various states for each task, and taking action when workers become unavailable.

Worker: a machine which executes map tasks or reduce tasks.

Map task: there are M map tasks, which result from splitting the input files. Each map task will be assigned to a worker (map worker) for execution. The intermediate results will be hashed using hash(key) mod R and kept in memory. Periodically, the buffer will be written out into R regions on the local disk of the map worker (see the partitioning sketch after this list).

Reduce task: there are R reduce tasks. Each reduce task will be assigned to a worker (reduce worker). When a map task is completed, the map worker registers its regions with the master node. In response, the master node pushes the updated information to all current reduce workers. Upon receiving the notification, a reduce worker performs a remote procedure call (RPC) to read the local data stored on the map worker. The reduce worker then sorts this data by the intermediate keys and passes the sorted results to the user-defined reduce function.
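The hash partitioning mentioned above can be sketched as follows. R and the use of CRC32 are illustrative; [2] describes the default partitioning function as hash(key) mod R, with the option for users to supply their own:

```python
import zlib

R = 4  # number of reduce tasks, chosen by the user

def partition(key: str, r: int = R) -> int:
    # Deterministic hash so the same key always lands in the same region;
    # zlib.crc32 avoids Python's per-process hash randomization.
    return zlib.crc32(key.encode()) % r

# Every intermediate pair for a given key goes to the same region,
# hence to the same reduce worker.
regions = {i: [] for i in range(R)}
for key, value in [("the", 1), ("fox", 1), ("the", 1)]:
    regions[partition(key)].append((key, value))
print(regions)
```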

At its base, the MapReduce system embraces a shared-nothing hardware design [3]. This is also the

architecture used by many parallel DBMSs. The alternatives are shared-memory and shared-disks. In the

shared-nothing paradigm, machines work on their own memory and disks. They exchange data with each

other through message-based communication over a high-speed network. Reference [3] provides full details on the advantages of using the shared-nothing design in parallel DBMSs. With the shared-nothing design, a parallel DBMS naturally performs computation right at the node storing the data. This follows the


important principle “Move the code to the data” in data-intensive computing [7], which immediately

results in minimum network traffic. In contrast, the MapReduce model does not get this optimal locality for free. Given the assumption that the contents of input files are stored locally on the workers which make up a MapReduce cluster, the MapReduce model requires complex scheduling logic to correctly assign a map task processing certain input files to a worker which has those files stored locally. If such scheduling is not possible, the logic should attempt to assign the map task to a worker near the one that has the data in its local storage. This uncertainty in the MapReduce model might cause unexpected performance. Future improvements should allow MapReduce systems to directly inherit optimal task locality from the MapReduce model with less implementation effort.
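A toy sketch of the scheduling logic described above: prefer a worker that already holds the split locally, and fall back to any idle worker otherwise. The data structures are hypothetical; production schedulers (e.g., Hadoop's) also consider rack locality as an intermediate tier:

```python
def assign_map_task(split_locations, idle_workers):
    """Pick a worker for a map task, where split_locations is the set of
    workers holding the split's data locally and idle_workers is the set
    of workers currently free.

    Preference order, per the "move the code to the data" principle:
    1. a data-local idle worker; 2. any idle worker (remote read).
    """
    local = split_locations & idle_workers
    if local:
        return next(iter(local)), "data-local"
    if idle_workers:
        return next(iter(idle_workers)), "remote"
    return None, "wait"  # no idle worker; defer the task

# Hypothetical cluster state: the split is replicated on w1 and w3.
worker, kind = assign_map_task({"w1", "w3"}, {"w2", "w3"})
print(worker, kind)  # w3 data-local
```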

MapReduce differentiates a map function from a reduce function. It further distinguishes a map worker

from a reduce worker. Executing a map function on one machine and a reduce function on another

machine requires to move local data stored in the map worker to the reduce worker. The amount of data

transferred between workers is the main overhead in MapReduce [9]. In parallel DBMSs, exchanged data

is minimized thanks to built-in query optimizers. For example, if a query aggregates matched items after a sequential-scan match, the size of the "push-back" data is negligible compared with the amount of intermediate results that have to be pulled by reduce workers in MapReduce. To compensate for this issue, MapReduce introduces the Combiner function to perform a partial merge on intermediate results prior to the completion of a map task. Combiner functions essentially play the role of query optimizers in MapReduce systems.
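For the word-count example, a combiner can be sketched as below: it runs on the map worker and pre-aggregates duplicate keys before anything crosses the network. Note that a combiner is only safe for operations that can be applied to partial data, as counting (being associative and commutative) is:

```python
from collections import defaultdict

def combine(intermediate_pairs):
    # Runs on the map worker: merge duplicate keys locally, so only one
    # (word, partial_count) pair per distinct word is shipped to reducers.
    partial = defaultdict(int)
    for word, count in intermediate_pairs:
        partial[word] += count
    return list(partial.items())

pairs = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
print(combine(pairs))  # [('the', 3), ('fox', 1)] -- four pairs shrink to two
```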

Separating a map worker from a reduce worker provides various advantages rather than drawbacks. The biggest benefit is that MapReduce systems can achieve fast recovery,

dynamic load balancing and efficiency [2]. In an environment where hardware failures are common,

failure resilience and fast recovery are vital to the design of distributed systems. In MapReduce,

completed map tasks of a failed worker are required to be re-executed since the intermediate data stored

locally in the failed worker are not accessible. In contrast, completed reduce tasks of a failed worker are

already stored in the global file system, hence no further work is needed. For each execution, a master node keeps track of O(M) states of map tasks and O(R) states of reduce tasks [2]. Maintaining these states at all times allows the master node to effectively determine and reschedule completed map tasks for re-execution. Such fine-grained recovery is not possible in parallel DBMSs. Periodically, a map worker writes its memory buffer to local disk for two reasons. First, intermediate results are presumably much larger than the memory capacity of a map worker, so writing out to local disk is unavoidable. In the same manner, we suspect parallel DBMSs also perform various I/O operations internally. Second, writing in-memory data to disk in R regions expedites concurrent data transfer over the network. Such an advantage

relies on a high performance distributed file system with support for parallel I/O operations. Google File

System [5] is an example of this type of distributed file system.
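A sketch of the periodic spill described above: when the in-memory buffer fills, its pairs are partitioned into R region files on the map worker's local disk. The threshold trigger is omitted, and the file naming and per-region sort are our illustrative assumptions:

```python
import os
import tempfile
import zlib

R = 4  # number of reduce tasks / regions

def spill(buffer, spill_dir, spill_id):
    # Partition buffered pairs into R regions, sort each region by key,
    # and write one file per region so reducers can fetch them in parallel.
    regions = [[] for _ in range(R)]
    for key, value in buffer:
        regions[zlib.crc32(key.encode()) % R].append((key, value))
    for r, pairs in enumerate(regions):
        path = os.path.join(spill_dir, f"spill{spill_id}-region{r}.txt")
        with open(path, "w") as f:
            for key, value in sorted(pairs):
                f.write(f"{key}\t{value}\n")

spill_dir = tempfile.mkdtemp()
spill([("the", 1), ("fox", 1), ("dog", 1)], spill_dir, spill_id=0)
print(sorted(os.listdir(spill_dir)))  # one file per region
```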

3. DISCUSSIONS AND IMPROVEMENTS

There are "hot spots" in the MapReduce model where a large number of tasks are executed concurrently: master nodes and map workers. With M map tasks and R reduce tasks, a master node has to perform O(M + R) scheduling jobs and maintain O(M × R) states. Memory consumption is

not an issue but the high degree of concurrency being placed on the master node is a big concern.

Running on a large cluster of thousands of machines, the master node frequently needs to simultaneously process thousands of update requests from thousands of map workers that have just finished their map tasks. This high level of parallel processing substantially impacts efficiency. Load balancing should be in place to


distribute computations across master nodes. In a similar manner, map workers face the same problem. Once a master node notifies reduce workers of a completed map task, these reduce workers independently pull data from the local disk of the map worker that completed the task. This imposes "back-pressure" on the running system [9]. MapReduce relies on its underlying distributed file system to balance requests across a map worker's replicas.
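One way to realize the suggested load balancing is to shard the master's bookkeeping by task id across several master nodes, so that no single node absorbs every status update. This is our illustration of the idea, not a feature of the original design [2]:

```python
import zlib

class ShardedMaster:
    """Spread task-state bookkeeping over several master shards
    (a sketch of the idea, not a complete design)."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, task_id):
        # Hash-partition task ids so each update touches exactly one shard.
        return self.shards[zlib.crc32(task_id.encode()) % len(self.shards)]

    def update(self, task_id, state):
        self._shard_for(task_id)[task_id] = state

    def state(self, task_id):
        return self._shard_for(task_id).get(task_id, "unknown")

m = ShardedMaster(num_shards=3)
m.update("map-0017", "completed")
print(m.state("map-0017"))  # completed
```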

Even though the MapReduce paradigm contributes to a scalable, high-performance distributed system applicable to data-intensive computing, the model receives many complaints for taking a "brute force" approach to every problem. Working with unstructured data is a main reason. How much difference does operating on structured data make? A trivial example: searching for an item in an unsorted list requires O(n) time, while the same operation on a sorted list costs O(log n). In a case where the list is petabytes in size, an O(log n) running time is attractive. Can MapReduce achieve a better running time than O(n)?

The answer is simply no. This explains why HPCC and commercial parallel DBMSs take the lead over MapReduce systems in practice. The picture is not all that bleak, though, since the door to efficiency is wide open. MapReduce maintains a strict one-time pass over the data content. In return, MapReduce users should define how much data the system needs to parse. A naïve way is to process all existing files. However, if metadata are built in place for guidance, the number of input files can be reduced substantially. Metadata are extensively utilized in parallel DBMSs and in the HPCC system [7]. For this reason, not only does a MapReduce system have to do its best, its users are also responsible for doing their best in providing an optimal set of input files. Where appropriate, MapReduce should work with structured

data stored in a distributed storage system such as BigTable [1] rather than a generic distributed file

system such as GFS.
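A sketch of the metadata-driven pruning argued for above: if each input file carries a small summary (here a min/max key range, our illustrative choice), the driver can skip files that cannot contain matches instead of mapping over everything:

```python
# Hypothetical per-file metadata: the key range covered by each file.
catalog = {
    "part-0000": ("aardvark", "kiwi"),
    "part-0001": ("koala", "panda"),
    "part-0002": ("parrot", "zebra"),
}

def prune_inputs(catalog, key):
    # Keep only files whose [min, max] key range can contain `key`;
    # every other file is skipped without being read.
    return [f for f, (lo, hi) in catalog.items() if lo <= key <= hi]

print(prune_inputs(catalog, "lemur"))  # ['part-0001']
```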

4. CONCLUSION

MapReduce is well suited for processing data on large-scale clusters. Its strong support for data-intensive applications has its roots in the programming model built on map and reduce functions, which can be executed concurrently across machines in a cluster. The model has many advantages over traditional parallel DBMSs, including a more powerful programming model than SQL, better fault tolerance, and fast recovery. The performance differences mentioned above are mainly due to the gap in optimization between a specific implementation and highly optimized, commercial-grade parallel DBMSs. Many optimizations can be introduced at a later time when the system is more

mature. Future improvements on MapReduce should focus on the ease of achieving optimal task locality

and lowering the degree of concurrency at master nodes and map workers.

REFERENCES

[1] F. Chang, J. Dean, S. Ghemawat, et al. Bigtable: A Distributed Storage System for Structured Data.
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf

[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/mapreduce-osdi04.pdf

[3] D. J. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, 35(6):85-98, 1992.

[5] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System.


http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/gfs-sosp2003.pdf

[6] An Overview of the Open Science Data Cloud. HPDC '10: Proceedings of the 19th ACM International

Symposium on High Performance Distributed Computing, 2010.

[7] Anthony M. Middleton. Data-Intensive Technologies for Cloud Computing. Handbook of Cloud Computing.

Springer, 2010.

[8] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden and M. Stonebraker. A Comparison of

Approaches to Large-Scale Data Analysis. SIGMOD '09: Proceedings of the 35th SIGMOD international conference

on Management of data, pages 165-178, 2009.

[9] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo and A. Rasin. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, 53(1):64-71, 2010.

[10] R. M. Yoo, A. Romano, and C. Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System.
http://mapreduce.stanford.edu/

