MapReduce: A Flexible Data Processing Tool



72 | Communications of the ACM | January 2010 | Vol. 53 | No. 1

contributed articles

DOI:10.1145/1629175.1629198

MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

BY JEFFREY DEAN AND SANJAY GHEMAWAT

MapReduce: A Flexible Data Processing Tool

[Illustration by Marius Watz]

MAPREDUCE IS A programming model for processing and generating large data sets.4 Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce has been used extensively outside of Google by a number of organizations.10,11

To help illustrate the MapReduce programming model, consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code like the following pseudocode:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word.

MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing required inter-machine communication. MapReduce allows programmers with no experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A typical MapReduce computation processes many terabytes of data on hundreds or thousands of machines. Programmers find the system easy to use, and more than 100,000 MapReduce jobs are executed on Google's clusters every day.
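The pseudocode above can also be exercised end to end as a small single-process simulation. The sketch below (plain Java, with an in-memory shuffle step standing in for the distributed runtime) runs the same map and reduce logic:

```java
import java.util.*;

public class WordCount {
    // Simulates the map phase: emit (word, 1) for each word in a document.
    static List<Map.Entry<String, Integer>> map(String docName, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Simulates the reduce phase: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int v : counts) result += v;
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc1", "the quick brown fox",
            "doc2", "the lazy dog and the fox");
        // Shuffle: group intermediate values by key, as the runtime would.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> d : docs.entrySet())
            for (Map.Entry<String, Integer> kv : map(d.getKey(), d.getValue()))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        // prints, among other lines: "fox\t2" and "the\t3"
    }
}
```

The TreeMap plays the role of the runtime's group-by-key step; a real MapReduce execution partitions this work across many machines.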

Compared to Parallel Databases

The query languages built into parallel database systems are also used to


express the type of computations supported by MapReduce. A 2009 paper by Andrew Pavlo et al. (referred to here as "the comparison paper"13) compared the performance of MapReduce and parallel databases. It evaluated the open source Hadoop implementation10 of the MapReduce programming model, DBMS-X (an unidentified commercial database system), and Vertica (a column-store database system from a company co-founded by one of the authors of the comparison paper). Earlier blog posts by some of the paper's authors characterized MapReduce as "a major step backwards."5,6 In this article, we address several misconceptions about MapReduce in these three publications:

MapReduce cannot use indices and implies a full scan of all input data;

MapReduce input and outputs are always simple files in a file system; and

MapReduce requires the use of inefficient textual data formats.

We also discuss other important issues:

MapReduce is storage-system independent and can process data without first requiring it to be loaded into a database. In many cases, it is possible to run 50 or more separate MapReduce analyses in complete passes over the data before it is possible to load the data into a database and complete a single analysis;

Complicated transformations are often easier to express in MapReduce than in SQL; and

Many conclusions in the comparison paper were based on implementation and evaluation shortcomings not fundamental to the MapReduce model; we discuss these shortcomings later in this article.

We encourage readers to read the original MapReduce paper4 and the comparison paper13 for more context.

Heterogeneous Systems

Many production environments contain a mix of storage systems. Customer data may be stored in a relational database, and user requests may be logged to a file system. Furthermore, as such environments evolve, data may migrate to new storage systems. MapReduce provides a simple model for analyzing data in such heterogeneous systems. End users can extend MapReduce to support a new storage system by defining simple reader and writer implementations that operate on the storage system. Examples of supported storage systems are files stored in distributed file systems,7 database query results,2,9 data stored in Bigtable,3 and structured input files (such as B-trees). A single MapReduce operation easily processes and combines data from a variety of storage systems.

Now consider a system in which a parallel DBMS is used to perform all data analysis. The input to such analysis must first be copied into the parallel DBMS. This loading phase is inconvenient. It may also be unacceptably slow, especially if the data will be analyzed only once or twice after being loaded. For example, consider a batch-oriented Web-crawling-and-indexing system that fetches a set of Web pages and generates an inverted index. It seems awkward and inefficient to load the set of fetched pages into a database just so they can be read through once to generate an inverted index. Even if the cost of loading the input into a parallel DBMS is acceptable, we still need an appropriate loading tool. Here is another place MapReduce can be used; instead of writing a custom loader with its own ad hoc parallelization and fault-tolerance support, a simple MapReduce program can be written to load the data into the parallel DBMS.

Indices

The comparison paper incorrectly said that MapReduce cannot take advantage of pregenerated indices, leading to skewed benchmark results in the paper. For example, consider a large data set partitioned into a collection of nondistributed databases, perhaps using a hash function. An index can be added to each database, and the result of running a database query using this index can be used as an input to MapReduce. If the data is stored in D database partitions, we will run D database queries that will become the D inputs to the MapReduce execution. Indeed, some of the authors of Pavlo et al. have pursued this approach in their more recent work.11

Another example of the use of indices is a MapReduce that reads from Bigtable. If the data needed maps to a sub-range of the Bigtable row space, we would need to read only that sub-range instead of scanning the entire Bigtable. Furthermore, like Vertica and other column-store databases, we will read data only from the columns needed for this analysis, since Bigtable can store data segregated by columns.

Yet another example is the processing of log data within a certain date range; see the Join task discussion in the comparison paper, where the Hadoop benchmark reads through 155 million records to process the 134,000 records that fall within the date range of interest. Nearly every logging system we are familiar with rolls over to a new log file periodically and embeds the rollover time in the name of each log file. Therefore, we can easily run a MapReduce operation over just the log files that may potentially overlap the specified date range, instead of reading all log files.

Complicated Functions

Map and Reduce functions are often fairly simple and have straightforward SQL equivalents. However, in many cases, especially for Map functions, the function is too complicated to be expressed easily in a SQL query, as in the following examples:

Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document;

Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth;

Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries;

Processing all road segments in the world and rendering map tile images that display these segments for Google Maps; and

Fault-tolerant parallel execution of programs written in higher-level languages (such as Sawzall14 and Pig Latin12) across a collection of input data.

Conceptually, such user-defined functions (UDFs) can be combined with SQL queries, but the experience reported in the comparison paper indicates that UDF support is either buggy (in DBMS-X) or missing (in Vertica). These concerns may go away over the long term, but for now, MapReduce is a better framework for doing more complicated tasks (such as those listed earlier) than the selection and aggregation that are SQL's forte.
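The reader/writer extension point described under Heterogeneous Systems can be sketched as follows. The interface and class names here are hypothetical, not Google's actual API; the point is only that MapReduce needs nothing more from a storage system than an iterator of records:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical reader abstraction: any storage system that can yield
// key/value records can feed a MapReduce.
interface RecordReader extends Iterator<Map.Entry<String, String>> {}

// A trivial reader over an in-memory map, standing in for a file-backed
// (or database-backed, or Bigtable-backed) implementation.
class FileReader implements RecordReader {
    private final Iterator<Map.Entry<String, String>> it;
    FileReader(Map<String, String> contents) { it = contents.entrySet().iterator(); }
    public boolean hasNext() { return it.hasNext(); }
    public Map.Entry<String, String> next() { return it.next(); }
}

public class HeterogeneousInput {
    // A single run can consume records from several readers in turn,
    // regardless of which storage system each reader wraps.
    static int countRecords(List<RecordReader> readers) {
        int n = 0;
        for (RecordReader r : readers) while (r.hasNext()) { r.next(); n++; }
        return n;
    }

    public static void main(String[] args) {
        RecordReader files = new FileReader(Map.of("log1", "GET /", "log2", "GET /x"));
        RecordReader db = new FileReader(Map.of("row42", "pagerank=7")); // stand-in for a DB-backed reader
        System.out.println(countRecords(List.of(files, db))); // prints 3
    }
}
```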

Structured Data and Schemas

Pavlo et al. did raise a good point that schemas are helpful in allowing multiple applications to share the same data. For example, consider the following schema from the comparison paper:

CREATE TABLE Rankings (
  pageURL VARCHAR(100) PRIMARY KEY,
  pageRank INT,
  avgDuration INT );

The corresponding Hadoop benchmarks in the comparison paper used an inefficient and fragile textual format with different attributes separated by vertical bar characters:

137|http://www.somehost.com/index.html|602

In contrast to ad hoc, inefficient formats, virtually all MapReduce operations at Google read and write data in the Protocol Buffer format.8 A high-level language describes the input and output types, and compiler-generated code is used to hide the details of encoding/decoding from application code. The corresponding protocol buffer description for the Rankings data would be:

message Rankings {
  required string pageurl = 1;
  required int32 pagerank = 2;
  required int32 avgduration = 3;
}

The following Map function fragment processes a Rankings record:

Rankings r = new Rankings();
r.parseFrom(value);
if (r.getPagerank() > 10) { ... }

The protocol buffer framework allows types to be upgraded (in constrained ways) without requiring existing applications to be changed (or even recompiled or rebuilt). This level of schema support has proved sufficient for allowing thousands of Google engineers to share the same evolving data types.
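As an illustration of such a constrained upgrade, a later revision of the Rankings message could append a new optional field with a fresh tag number (the field name below is invented for the example); binaries built against the old definition simply skip the unknown field when decoding:

```proto
message Rankings {
  required string pageurl = 1;
  required int32 pagerank = 2;
  required int32 avgduration = 3;
  // Hypothetical later addition: a new tag number, marked optional,
  // so existing readers and writers keep working unchanged.
  optional int32 avgclicks = 4;
}
```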

Furthermore, the implementation of protocol buffers uses an optimized binary representation that is more compact and much faster to encode and decode than the textual formats used by the Hadoop benchmarks in the comparison paper. For example, the automatically generated code to parse a Rankings protocol buffer record runs in 20 nanoseconds per record as compared to the 1,731 nanoseconds required per record to parse the textual input format used in the Hadoop benchmark mentioned earlier. These measurements were obtained on a JVM running on a 2.4GHz Intel Core-2 Duo. The Java code fragments used for the benchmark runs were:

// Fragment 1: protocol buffer parsing
for (int i = 0; i < numIterations; i++) {
  rankings.parseFrom(value);
  pagerank = rankings.getPagerank();
}

// Fragment 2: text format parsing (extracted from Benchmark1.java
// from the source code posted by Pavlo et al.)
for (int i = 0; i < numIterations; i++) {
  String data[] = value.toString().split("\\|");
  pagerank = Integer.valueOf(data[0]);
}

Given the factor of an 80-fold difference in this record-parsing benchmark, we suspect the absolute numbers for the Hadoop benchmarks in the comparison paper are inflated and cannot be used to reach conclusions about fundamental differences in the performance of MapReduce and parallel DBMS.
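The shape of this gap is easy to reproduce in spirit. The sketch below is not Protocol Buffers itself; it contrasts the benchmark's per-record string splitting with a toy fixed-width binary decode, so absolute numbers will differ from the figures above and will vary by machine:

```java
import java.nio.ByteBuffer;

public class ParseBench {
    public static void main(String[] args) {
        int iters = 100_000;
        String text = "137|http://www.somehost.com/index.html|602";
        // A binary encoding comparable in spirit to a protocol buffer field:
        // a fixed-width int, decoded without any string manipulation.
        byte[] binary = ByteBuffer.allocate(4).putInt(137).array();

        long t0 = System.nanoTime();
        int pagerank = 0;
        for (int i = 0; i < iters; i++)
            pagerank = Integer.valueOf(text.split("\\|")[0]); // text-format parse
        long textNs = (System.nanoTime() - t0) / iters;

        t0 = System.nanoTime();
        for (int i = 0; i < iters; i++)
            pagerank = ByteBuffer.wrap(binary).getInt();      // binary decode
        long binNs = (System.nanoTime() - t0) / iters;

        System.out.println("text: ~" + textNs + " ns/record, binary: ~" + binNs + " ns/record");
    }
}
```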

Fault Tolerance

The MapReduce implementation uses a pull model for moving data between mappers and reducers, as opposed to a push model where mappers write directly to reducers. Pavlo et al. correctly pointed out that the pull model can result in the creation of many small files and many disk seeks to move data between mappers and reducers.

"MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis."


Implementation tricks like batching, sorting, and grouping of intermediate data and smart scheduling of reads are used by Google's MapReduce implementation to mitigate these costs.

MapReduce implementations tend not to use a push model due to the fault-tolerance properties required by Google's developers. Most MapReduce executions over large data sets encounter at least a few failures; apart from hardware and software problems, Google's cluster scheduling system can preempt MapReduce tasks by killing them to make room for higher-priority tasks. In a push model, failure of a reducer would force re-execution of all Map tasks.

We suspect that as data sets grow larger, analyses will require more computation, and fault tolerance will become more important. There are already more than a dozen distinct data sets at Google more than 1PB in size and dozens more hundreds of TBs in size that are processed daily using MapReduce. Outside of Google, many users listed on the Hadoop users list11 are handling data sets of multiple hundreds of terabytes or more. Clearly, as data sets continue to grow, more users will need a fault-tolerant system like MapReduce that can be used to process these large data sets efficiently and effectively.

Performance

Pavlo et al. compared the performance of the Hadoop MapReduce implementation to two database implementations; here, we discuss the performance differences of the various systems:

Engineering considerations. Startup overhead and sequential scanning speed are indicators of maturity of implementation and engineering tradeoffs, not fundamental differences in programming models. These differences are certainly important but can be addressed in a variety of ways. For example, startup overhead can be addressed by keeping worker processes live, waiting for the next MapReduce invocation, an optimization added more than a year ago to Google's MapReduce implementation.

Google has also addressed sequential scanning performance with a variety of performance optimizations by, for example, using an efficient binary-encoding format for structured data (protocol buffers) instead of inefficient textual formats.

Reading unnecessary data. The comparison paper says, "MR is always forced to start a query with a scan of the entire input file." MapReduce does not require a full scan over the data; it requires only an implementation of its input interface to yield a set of records that match some input specification. Examples of input specifications are:

All records in a set of files;

All records with a visit-date in the range [2000-01-15..2000-01-22]; and

All data in Bigtable table T whose language column is Turkish.

The input may require a full scan over a set of files, as Pavlo et al. suggested, but alternate implementations are often used. For example, the input may be a database with an index that provides efficient filtering or an indexed file structure (such as daily log files used for efficient date-based filtering of log data).

This mistaken assumption about MapReduce affects three of the five benchmarks in the comparison paper (the selection, aggregation, and join tasks) and invalidates the conclusions in the paper about the relative performance of MapReduce and parallel databases.

Merging results. The measurements of Hadoop in all five benchmarks in the comparison paper included the cost of a final phase to merge the results of the initial MapReduce into one file. In practice, this merging is unnecessary, since the next consumer of MapReduce output is usually another MapReduce that can easily operate over the set of files produced by the first MapReduce, instead of requiring a single merged input. Even if the consumer is not another MapReduce, the reducer processes in the initial MapReduce can write directly to a merged destination (such as a Bigtable or parallel database table).

Data loading. The DBMS measurements in the comparison paper demonstrated the high cost of loading input data into a database before it is analyzed. For many of the benchmarks in the comparison paper, the time needed to load the input data into a parallel database is five to 50 times the time needed to analyze the data via Hadoop. Put another way, for some of the benchmarks, starting with data in a collection of files on disk, it is possible to run 50 separate MapReduce analyses over the data before it is possible to load the data into a database and complete a single analysis. Long load times may not matter if many queries will be run on the data after loading, but this is often not the case; data sets are often generated, processed once or twice, and then discarded. For example, the Web-search index-building system described in the MapReduce paper4 is a sequence of MapReduce phases where the output of most phases is consumed by one or two subsequent MapReduce phases.
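The date-range input specification discussed under "Reading unnecessary data" can be sketched as a filename filter. The naming scheme here (a date embedded as "requests-YYYY-MM-DD.log") is an assumption for the example, mirroring the log-rollover convention described earlier:

```java
import java.time.LocalDate;
import java.util.*;

public class LogFileFilter {
    // Returns only the log files whose embedded rollover date falls in
    // [start, end], so a MapReduce run can skip files that cannot
    // contain matching records instead of scanning every file.
    static List<String> select(List<String> files, LocalDate start, LocalDate end) {
        List<String> keep = new ArrayList<>();
        for (String f : files) {
            String stamp = f.replaceAll("^.*-(\\d{4}-\\d{2}-\\d{2})\\.log$", "$1");
            if (stamp.equals(f)) continue; // name embeds no date; conservatively skip
            LocalDate d = LocalDate.parse(stamp);
            if (!d.isBefore(start) && !d.isAfter(end)) keep.add(f);
        }
        return keep;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "requests-2000-01-14.log",
            "requests-2000-01-17.log",
            "requests-2000-01-23.log");
        System.out.println(select(files,
            LocalDate.parse("2000-01-15"), LocalDate.parse("2000-01-22")));
        // prints [requests-2000-01-17.log]
    }
}
```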

Conclusion

The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems. In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis. However, a few useful lessons can be drawn from this discussion:

Startup latency. MapReduce implementations should strive to reduce startup latency by using techniques like worker processes that are reused across different invocations;

Data shuffling. Careful attention must be paid to the implementation of the data-shuffling phase to avoid generating O(M*R) seeks in a MapReduce with M map tasks and R reduce tasks;

Textual formats. MapReduce users should avoid using inefficient textual formats;

Natural indices. MapReduce users should take advantage of natural indices (such as timestamps in log file names) whenever possible; and

Unmerged output. Most MapReduce output should be left unmerged, since there is no benefit to merging if the next consumer is another MapReduce program.

MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogeneous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.
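The data-shuffling lesson above is easy to quantify. With illustrative (not measured) task counts, a naive pull-model shuffle that writes one intermediate file per (map, reduce) pair generates M*R files and a comparable number of seeks, while batching and sorting intermediate data reduces this to one file per mapper:

```java
public class ShuffleSeeks {
    public static void main(String[] args) {
        long m = 1000, r = 500; // illustrative map and reduce task counts
        // Naive pull-model shuffle: one intermediate file per (map, reduce) pair.
        long naiveFiles = m * r;
        // With batching and sorting, each mapper can instead write one sorted,
        // indexed file that every reducer reads its key range from.
        long batchedFiles = m;
        System.out.println(naiveFiles + " vs " + batchedFiles); // prints 500000 vs 1000
    }
}
```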

References

1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., Rasin, A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the Conference on Very Large Databases (Lyon, France, 2009); http://db.cs.yale.edu/hadoopdb/

2. Aster Data Systems, Inc. In-Database MapReduce for Rich Analytics; http://www.asterdata.com/product/mapreduce.php

3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating System Design and Implementation (Seattle, WA, Nov. 6–8). Usenix Association, 2006; http://labs.google.com/papers/bigtable.html

4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (San Francisco, CA, Dec. 6–8). Usenix Association, 2004; http://labs.google.com/papers/mapreduce.html

5. DeWitt, D. and Stonebraker, M. MapReduce: A Major Step Backwards blogpost; http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/

6. DeWitt, D. and Stonebraker, M. MapReduce II blogpost; http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/

7. Ghemawat, S., Gobioff, H., Leung, S.-T. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (Lake George, NY, Oct. 19–22). ACM Press, New York, 2003; http://labs.google.com/papers/gfs.html

8. Google. Protocol Buffers: Google's Data Interchange Format. Documentation and open source release; http://code.google.com/p/protobuf/

9. Greenplum. Greenplum MapReduce: Bringing Next-Generation Analytics Technology to the Enterprise; http://www.greenplum.com/resources/mapreduce/

10. Hadoop. Documentation and open source release; http://hadoop.apache.org/core/

11. Hadoop. Users list; http://wiki.apache.org/hadoop/PoweredBy

12. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Auckland, New Zealand, June 2008); http://hadoop.apache.org/pig/

13. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference (Providence, RI, June 29–July 2). ACM Press, New York, 2009; http://database.cs.brown.edu/projects/mapreduce-vs-dbms/

14. Pike, R., Dorward, S., Griesemer, R., Quinlan, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13, 4, 227–298; http://labs.google.com/papers/sawzall.html

Jeffrey Dean ([email protected]) is a Google Fellow in the Systems Infrastructure Group at Google, Mountain View, CA.

Sanjay Ghemawat ([email protected]) is a Google Fellow in the Systems Infrastructure Group at Google, Mountain View, CA.

© 2010 ACM 0001-0782/10/0100 $10.00
