Lecture 11 Notes: MapReduce and Hadoop


MapReduce

From Wikipedia, the free encyclopedia


Contents

1 Logical view
    1.1 Example
2 Dataflow
    2.1 Input reader
    2.2 Map function
    2.3 Partition function
    2.4 Comparison function
    2.5 Reduce function
    2.6 Output writer
3 Distribution and reliability
4 Uses
5 Implementations
6 References
7 External links
    7.1 Papers

MapReduce is a software framework introduced by Google to support parallel computations over large (multiple petabyte[1]) data sets on clusters of computers. This framework is largely taken from map and reduce functions commonly used in functional programming,[2] although the actual semantics of the framework are not the same.[3]

MapReduce implementations have been written in C++, Java, Python and other languages.

Logical view

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

    Map(k1,v1) -> list(k2,v2)

The map function is applied in parallel to every item in the input dataset. This produces a list of (k2, v2) pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each one of the different generated keys.

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

    Reduce(k2, list(v2)) -> list(v2)

Each Reduce call typically produces either one value v2 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
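To make the grouping step concrete, here is a minimal single-process sketch of this logical view in Python. This is not Google's implementation; the function name and the in-memory dictionary are illustrative only:

    from collections import defaultdict

    def map_reduce(inputs, mapper, reducer):
        # Map phase: apply mapper to every (k1, v1) pair; each call may
        # yield any number of intermediate (k2, v2) pairs.
        groups = defaultdict(list)
        for k1, v1 in inputs:
            for k2, v2 in mapper(k1, v1):
                groups[k2].append(v2)  # group values by intermediate key
        # Reduce phase: apply reducer to each (k2, list(v2)) group and
        # collect whatever values it yields.
        results = []
        for k2, values in groups.items():
            results.extend(reducer(k2, values))
        return results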

Example

The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

    map(String name, String document):
      // key: document name
      // value: document contents
      for each word w in document:
        EmitIntermediate(w, 1);

    reduce(String word, Iterator partialCounts):
      // key: a word
      // values: a list of aggregated partial counts
      int result = 0;
      for each v in partialCounts:
        result += ParseInt(v);
      Emit(result);

Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, thus this function just needs to sum all of its input values to find the total appearances of that word.
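Assuming a grouping driver like the sketch above, the pseudocode translates directly to Python; wc_map and wc_reduce are our illustrative names, not part of any framework:

    def wc_map(name, document):
        # key: document name, value: document contents
        for word in document.split():
            yield word, 1

    def wc_reduce(word, partial_counts):
        # key: a word, values: the aggregated partial counts for it
        yield word, sum(partial_counts)

    # e.g. map_reduce([("doc1", "to be or not to be")], wc_map, wc_reduce)
    # -> [("to", 2), ("be", 2), ("or", 1), ("not", 1)]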

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

- an input reader
- a Map function
- a partition function
- a compare function
- a Reduce function
- an output writer

Input reader

The input reader divides the input into 16 MB to 128 MB splits and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system like Google File System) and generates key/value pairs.

A common example will read a directory full of text files and return each line as a record.
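A line-oriented reader of that kind might look like the following sketch; the directory layout and the (filename, line) record shape are our assumptions, not part of any specific framework:

    import os

    def input_reader(directory):
        # Yield one (filename, line) record per line of each text file.
        for name in sorted(os.listdir(directory)):
            with open(os.path.join(directory, name)) as f:
                for line in f:
                    yield name, line.rstrip("\n")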

Map function

Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

If the application is doing a word count, the map function would break the line into words and output the word as the key and "1" as the value.

Partition function

The output of all of the maps is allocated to particular reduces by the application's partition function. The partition function is given the key and the number of reduces and returns the index of the desired reduce.

A typical default is to hash the key and take it modulo the number of reduces.
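That default fits in one line; a sketch, with Python's built-in hash standing in for whatever hash function a real framework would use:

    def partition(key, num_reduces):
        # Default partition function: hash the key, modulo the number
        # of reduces, giving the index of the target reduce.
        return hash(key) % num_reduces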

Comparison function

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.
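In other words, each reduce worker sorts the pairs it has pulled before they are grouped and handed to reduce. A sketch, assuming a case-insensitive key ordering as the application-supplied comparison:

    # pairs: the (key, value) records this reduce pulled from the map nodes
    pairs = [("Banana", 1), ("apple", 2), ("banana", 3)]
    pairs.sort(key=lambda kv: kv[0].lower())  # application's comparison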

Reduce function

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output zero or more key/value pairs.

In the word count example, the reduce function takes the input values, sums them and generates a single output of the word and the final sum.


Output writer

The output writer writes the output of the reduce to stable storage, usually a distributed file system, such as Google File System.

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are no parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or in the same rack as the node holding the data being operated on; this property is desirable as it conserves bandwidth across the backbone network of the datacenter.

Implementations may not be highly available; in Hadoop, for example, the NameNode is a single point of failure for the distributed filesystem, and if the JobTracker fails, all outstanding work is lost.

Uses

MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses.[4]

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reduces.

David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared-nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is.[5] They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation".[5] MapReduce advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real-world use[citation needed]. MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems.[6]

Implementations

- The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
- The Hadoop project is a free open source Java MapReduce implementation.
- Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
- Phoenix [1] is a shared-memory implementation of MapReduce implemented in C.
- MapReduce has also been implemented for the Cell Broadband Engine, also in C. [2]
- MapReduce has been implemented on NVIDIA GPUs (graphics processors) using CUDA [3].
- Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
- CouchDB uses a MapReduce framework for defining views over distributed documents.
- Skynet is an open source Ruby implementation of Google's MapReduce framework.
- Disco is an open source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
- Aster Data Systems nCluster In-Database MapReduce implements MapReduce inside the database.

References

Specific references:

1. ^ Google spotlights data center inner workings | Tech news blog - CNET News.com
2. ^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." - "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
3. ^ "Google's MapReduce Programming Model -- Revisited" paper by Ralf Lammel; from Microsoft
4. ^ "How Google Works". baselinemag.com. "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes."
5. ^ a b David DeWitt; Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. Retrieved on 2008-08-27.
6. ^ David DeWitt; Michael Stonebraker. "MapReduce II". databasecolumn.com. Retrieved on 2008-08-27.

General references:

Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.


MapReduce: A major step backwards
By David DeWitt on January 17, 2008

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]

On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.

For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshmen how to program using the MapReduce framework.

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

1. A giant step backward in the programming paradigm for large-scale data-intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well-known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMS
5. Incompatible with all of the tools DBMS users have come to depend on

First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.

What is MapReduce?

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes, called map and reduce, plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
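A sketch of that split step (the value of M and the use of Python's hash are illustrative; a real map instance would spill each bucket to its local disk as the bucket fills):

    M = 4                             # number of reduce instances (illustrative)
    buckets = [[] for _ in range(M)]

    def split(key, data):
        # Deterministic split function: route each map output record to
        # one of M disjoint buckets based only on its key.
        buckets[hash(key) % M].append((key, data))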


In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of the N nodes, for a total of N*M files: F_{i,j}, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, R_j, 1 ≤ j ≤ M. The input for each reduce instance R_j consists of the files F_{i,j}, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

We now turn to the five concerns we have with this computing paradigm.

1. MapReduce is a step backwards in database access

As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968:

- Schemas are good.
- Separation of the schema from the application is good.
- High-level access languages are good.

MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.

It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.

During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:

- By stating what you want, rather than presenting an algorithm for how to get it (relational view)
- By presenting an algorithm for data access (Codasyl view)

The result is now ancient history, but the entire world saw the value of high-level languages and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.

MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.

Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}), different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.

2. MapReduce is a poor implementation

All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.

MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.

One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built, including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.

There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.

One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database Systems: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.

There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.

3. MapReduce is not novel

The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7] Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash-based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.


MapReduce II
By David DeWitt on January 25, 2008

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]

Last week's MapReduce post attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters.

Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use.

The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.

So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog:

1. MapReduce is not a database system, so don't judge it as one
2. MapReduce has excellent scalability; the proof is Google's use
3. MapReduce is cheap and databases are expensive
4. We are the old guard trying to defend our turf/legacy from the young turks

Feedback No. 1: MapReduce is not a database system, so don't judge it as one

It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that, like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information.

To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form:

    Rankings (pageURL, pageRank)

Records in the second file have the form:

    UserVisits (sourceIPAddr, destinationURL, date, adRevenue)

Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"


This question is a little tricky to answer in MapReduce because it consumes two data sets rather than one, and it requires a "join" of the two datasets to find pairs of Rankings and UserVisits records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.

Phase 1

This phase filters UserVisits records that are outside the desired date range and then "joins" the qualifying records with records from the Rankings file.

Map program: The map program scans through UserVisits and Rankings records. Each UserVisits record is filtered on the date range specification. Qualifying records are emitted with composite keys of the form (destinationURL, T1), where T1 is a tag indicating a UserVisits record. Rankings records are emitted with composite keys of the form (pageURL, T2), where T2 is a tag indicating a Rankings record. Output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.

Reduce program: The input to the reduce program is a single sorted run of records in URL order. For each unique URL, the program splits the incoming records into two sets (one for Rankings records and one for UserVisits records) using the tag component of the composite key. To complete the join, reduce finds all matching pairs of records of the two sets. Output records are in the form Temp1 (sourceIPAddr, pageURL, pageRank, adRevenue).

The reduce program must be capable of handling the case in which one or both of these sets with the same URL are too large to fit into memory and must be materialized on disk. Since access to these sets is through an iterator, a straightforward implementation will result in what is termed a nested-loops join. This join algorithm is known to have very bad I/O characteristics, as the "inner" set is scanned once for each record of the "outer" set.
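A sketch of that Phase 1 reduce in Python, under our reading of the post (record and tag names are ours, not the authors'): records arrive as a single run sorted by URL, each tagged "T1" (UserVisits) or "T2" (Rankings), and each URL's two sets are joined with exactly the nested loops described above:

    from itertools import groupby

    def phase1_reduce(sorted_records):
        # sorted_records: (url, tag, payload) tuples in URL order, where the
        # tag marks UserVisits ("T1") versus Rankings ("T2") records.
        for url, run in groupby(sorted_records, key=lambda r: r[0]):
            visits, rankings = [], []
            for _, tag, payload in run:
                (visits if tag == "T1" else rankings).append(payload)
            # Nested-loops join of the two sets for this URL; emits
            # Temp1-style (sourceIPAddr, pageURL, pageRank, adRevenue).
            for source_ip, ad_revenue in visits:
                for page_rank in rankings:
                    yield source_ip, url, page_rank, ad_revenue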

Phase 2

This phase computes the total ad revenue and average page rank for each source IP address.

Map program: Scan Temp1 using the identity function on sourceIPAddr.

Reduce program: The reduce program makes a linear pass over the data. For each sourceIPAddr, it will sum the ad revenue and compute the average page rank, retaining the one with the maximum total ad revenue. Each reduce worker then outputs a single record of the form Temp2 (sourceIPAddr, total_adRevenue, average_pageRank).

Phase 3

Map program: The program uses a single map worker that scans Temp2 and outputs the record with the maximum value for total_adRevenue.

We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements to do the same:

Q1

    Select as Temp sourceIPAddr, avg(pageRank) as avgPR, sum(adRevenue) as adTotal
    From Rankings, UserVisits
    Where Rankings.pageURL = UserVisits.destinationURL
      and date > "Jan 14" and date < "Jan 23"
    Group by sourceIPAddr

Q2

    Select sourceIPAddr, adTotal, avgPR
    From Temp
    Where adTotal = max(adTotal)
No matter what you think of SQL, eight lines of code is almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce.

Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction.

We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. All of these should be able to execute this program with the same or better scalability and performance than MapReduce.

Here's how these capabilities could benefit MapReduce:

    1. Indexing. The filter (date > "Jan 14" and date < "Jan 23") condition can be executed

    by using a B-tree index on the date attribute of the UserVisits table, avoiding asequential scan of the entire table.

    2. Data movement. When you load files into a distributed file system prior to running

    MapReduce, data items are typically assigned to blocks/partitions in sequential order.

As records are loaded into a table in a parallel database system, it is standard practice to apply a hash function to an attribute value to determine which node the record

    should be stored on (the same basic idea as is used to determine which reduce worker

    should get an output record from a map instance). For example, records being loaded


    into the Rankings and UserVisits tables might be mapped to a node by hashing on the

    pageURL and destinationURL attributes, respectively. If loaded this way, the join of

    Rankings and UserVisits in Q1 above would be performed completely locally with

absolutely no data movement between nodes. Furthermore, as result records from the join are materialized, they will be pipelined directly into a local aggregate computation without first being written to disk. This local aggregate operator will partially compute the two aggregates (sum and average) concurrently (what is called a combiner in MapReduce terminology; see the sketch after this list). These partial aggregates are then repartitioned by hashing on the sourceIPAddr to produce the final results for Q1.

    It is certainly the case that you could do the same thing in MapReduce by using

    hashing to map records to chunks of the file and then modifying the MapReduce

program to exploit the knowledge of how the data was loaded. But in a database, physical data independence happens automatically. When Q1 is "compiled," the query

    optimizer will extract partitioning information about the two tables from the schema.

    It will then generate the correct query plan based on this partitioning information

(e.g., maybe Rankings is hash partitioned on pageURL but UserVisits is hash partitioned on sourceIPAddr). This happens transparently to any user (modulo

    changes in response time) who submits a query involving a join of the two tables.

3. Column representation. Many queries access only a subset of the fields of the input files. A column store reads only the fields a query actually needs; the remaining fields never have to be fetched from disk.

4. Push, not pull. MapReduce relies on the materialization of the output files from the map phase on disk for fault tolerance. Parallel database systems instead push the intermediate files directly to the receiving (i.e., reduce) nodes, avoiding writing the intermediate results and then reading them back as they are pulled by the reduce computation. Materializing to disk gives MapReduce far superior fault tolerance, but at the expense of additional I/Os.
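    As promised in point 2 above, here is a minimal Java sketch of a combiner-style partial aggregate (the Partial type is hypothetical, for illustration only). The key detail is that an average cannot be combined directly: each partial result must carry a (sum, count) pair so that the final average can still be computed after the partials are repartitioned and merged.

    class CombinerSketch {
        // Hypothetical partial-aggregate type: carries (sum, count) so that the
        // average can still be computed after partials are merged
        record Partial(double revenueSum, long rankSum, long count) {
            Partial merge(Partial other) {
                return new Partial(revenueSum + other.revenueSum,
                                   rankSum + other.rankSum,
                                   count + other.count);
            }
            double totalRevenue() { return revenueSum; }
            double avgPageRank()  { return (double) rankSum / count; }
        }

        // Each node folds its local join results into one Partial per sourceIPAddr;
        // the Partials (not the raw records) are then repartitioned by sourceIPAddr
        // and merged to produce the final sum and average.
        static Partial accumulate(Partial acc, int pageRank, double adRevenue) {
            return acc.merge(new Partial(adRevenue, pageRank, 1));
        }
    }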

    In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage,

    depending on the selectivity of the query, the width of the input records to the map computation, and the

    size of the output files from the map phase. As such, we believe that 10 to 100 parallel database nodes

    can do the work of 1,000 MapReduce nodes.

To further illustrate our point, suppose you have a more general filter, F, a more general group_by

function, G, and a more general Reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:

Select R(T)
From T
Where F(T)
Group_by G(T)


    F, R, and G can be written in a general-purpose language like C or C++. A SQL engine, extended with

user-defined functions and aggregates, has nearly all -- if not all -- of the generality of MapReduce.

As such, we claim that most things that are possible in MapReduce are also possible in a SQL engine.

    Hence, it is exactly appropriate to compare the two approaches. We are working on a more complete

paper that demonstrates the relative performance and relative programming effort between the two approaches, so stay tuned.

    Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use

Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes it must scale well. Having started benchmarking

    database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating

    the scalability of any system for data intensive applications.

    Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across

    100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce

computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those

    1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on

10x the amount of data using 10x more hardware in the same 5 minutes. Linear scaleup is the gold standard for measuring the scalability of data intensive applications. As far as we are aware, there are no

    published papers that study the scalability of MapReduce in a controlled scientific fashion. MapReduce

    may indeed scale linearly, but we have not seen published evidence of this.

    Feedback No. 3: MapReduce is cheap and databases are expensive

    Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to

    roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial

vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction.

Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have

    increased the scale of these open source systems by adding parallel computing extensions.

A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that

    MapReduce applications are targeting. However, there is considerable ground between the relational data

model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a

    data model and accessing that data through a declarative query language. We don't care what that

language or model is. Pig, for example, employs a nested relational model, which gives developers more flexibility than a traditional 1NF (first normal form) model allows.


    Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks

    Since both of us are among the "gray beards" and have been on this earth about 2 Giga-seconds, we have

    seen a lot of ideas come and go. We are constantly struck by the following two observations:

    How insular computer science is. The propagation of ideas from sub-discipline to sub-discipline

    is very slow and sketchy. Most of us are content to do our own thing, rather than learn what other

    sub-disciplines have to offer.

How little knowledge is passed from generation to generation. In a recent paper entitled "What goes around comes around" (M. Stonebraker and J. Hellerstein, Readings in Database Systems, 4th edition, MIT Press, 2004), one of us noted that many current database ideas were tried a quarter of a century ago and discarded. However, such hard-won wisdom does not seem to be passed down from the "gray beards" to the "young turks." The young turks and the gray beards aren't usually, and shouldn't be, adversaries.

    Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback,

    comments and alternative viewpoints.


    Serial vs. Parallel Programming

In the early days of computing, programs were serial; that is, a program consisted of a sequence of instructions, where each instruction executed one after the other. It ran from start to finish on a single

    processor.

Parallel programming developed as a means of improving performance and efficiency. In a parallel program, the processing is broken up into parts, each of which can be executed concurrently. The instructions from each part run simultaneously on different CPUs. These CPUs can exist on a single

    machine, or they can be CPUs in a set of computers connected via a network.

    Not only are parallel programs faster, they can also be used to solve problems on large datasets using

non-local resources. When you have a set of computers connected on a network, you have a vast pool of CPUs, and you often have the ability to read and write very large files (assuming a distributed file system

    is also in place).

    The Basics

The first step in building a parallel program is identifying sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently. Sometimes it's just not possible. Consider a

    Fibonacci function:

F(k+2) = F(k) + F(k+1)

A function to compute this, based on the form above, cannot be "parallelized" because each computed value depends on previously computed values.

    A common situation is having a large amount of consistent data which must be processed. If the data can

    be decomposed into equal-size partitions, we can devise a parallel solution. Consider a huge array which

    can be broken up into sub-arrays.

If the same processing is required for each array element, with no dependencies in the computations, and no communication required between tasks, we have an ideal parallel computing opportunity. Here is a common implementation technique called master/worker.

The MASTER:

initializes the array and splits it up according to the number of available WORKERS
sends each WORKER its subarray
receives the results from each WORKER

The WORKER:

receives the subarray from the MASTER
performs processing on the subarray
returns results to MASTER

This model implements static load balancing, which is commonly used if all tasks are performing the same amount of work on identical machines. In general, load balancing refers to techniques which try to spread tasks among the processors in a parallel system to avoid some processors being idle while others have tasks queueing up for execution.


A static load balancer allocates processes to processors once, up front, taking no account of the current network load. Dynamic algorithms are more flexible, though more computationally expensive, and give some consideration to the network load before allocating a new process to a processor.

As an example of the MASTER/WORKER technique, consider one of the methods for approximating pi. The first step is to inscribe a circle inside a square:

The area of the square, denoted As, is (2r)^2 = 4r^2. The area of the circle, denoted Ac, is pi * r^2. So:

pi = Ac / r^2
As = 4r^2
r^2 = As / 4
pi = 4 * Ac / As

The reason we are doing all this algebraic manipulation is that we can parallelize this method in the

    following way.

1. Randomly generate points in the square
2. Count the number of generated points that are both in the circle and in the square
3. r = the number of points in the circle divided by the number of points in the square
4. PI = 4 * r

    And here is how we parallelize it:

NUMPOINTS = 100000; // some large number - the bigger, the closer the approximation
p = number of WORKERS;
numPerWorker = NUMPOINTS / p;
countCircle = 0;    // one of these for each WORKER

// each WORKER does the following:
for (i = 0; i < numPerWorker; i++) {
    generate 2 random numbers that lie inside the square;
    xcoord = first random number;
    ycoord = second random number;
    if (xcoord, ycoord) lies inside the circle
        countCircle++;
}

MASTER:
receives from WORKERS their countCircle values
computes PI from these values: PI = 4.0 * countCircle / NUMPOINTS;
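For readers who want to run this, here is a straightforward Java translation of the pseudocode. It is a sketch that samples a quarter circle inside the unit square, which yields the same pi/4 area ratio as the inscribed circle above:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

public class MonteCarloPi {
    public static void main(String[] args) throws Exception {
        final int NUMPOINTS = 1_000_000;   // the bigger, the closer the approximation
        final int WORKERS = 4;
        final int perWorker = NUMPOINTS / WORKERS;

        // MASTER: start one task per WORKER
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < WORKERS; w++) {
            partials.add(pool.submit((Callable<Long>) () -> {
                long countCircle = 0;
                for (int i = 0; i < perWorker; i++) {
                    // random point in the unit square [0,1) x [0,1)
                    double x = ThreadLocalRandom.current().nextDouble();
                    double y = ThreadLocalRandom.current().nextDouble();
                    if (x * x + y * y <= 1.0) {  // inside the quarter circle
                        countCircle++;
                    }
                }
                return countCircle;
            }));
        }

        // MASTER: receive the countCircle values and compute PI
        long countCircle = 0;
        for (Future<Long> f : partials) {
            countCircle += f.get();
        }
        pool.shutdown();
        System.out.println("PI ~ " + 4.0 * countCircle / NUMPOINTS);
    }
}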

    What is MapReduce?

Now that we have seen some basic examples of parallel programming, we can look at the MapReduce programming model. This model derives from the map and reduce combinators of a functional language like Lisp.

In Lisp, a map takes as input a function and a sequence of values. It then applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation. For example, it can use "+" to add up all the elements in the sequence.

MapReduce is inspired by these concepts. It was developed within Google as a mechanism for processing large amounts of raw data, for example, crawled documents or web request logs. This data is so large, it


    must be distributed across thousands of machines in order to be processed in a reasonable time. This

    distribution implies parallel computing since the same computations are performed on each CPU, but

    with a different dataset. MapReduce is an abstraction that allows Google engineers to perform simple

computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance.

    Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate

key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

    The reduce function, also written by the user, accepts an intermediate key I and a set of values for that

    key. It merges together these values to form a possibly smaller set of values. [1]

    Consider the problem of counting the number of occurrences of each word in a large collection of

    documents:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result)); [1]

    The map function emits each word plus an associated count of occurrences ("1" in this example). The

    reduce function sums together all the counts emitted for a particular word.
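The same shape can be simulated on a single machine in plain Java. The sketch below is not the MapReduce library; the grouping map simply stands in for the shuffle that the real infrastructure performs between the map and reduce phases:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    public static Map<String, Integer> wordCount(List<String> documents) {
        // "map" + shuffle: collect the intermediate (word, 1) pairs, grouped by word
        Map<String, List<Integer>> intermediate = new HashMap<>();
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) {
                intermediate.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // "reduce": sum the counts emitted for each word
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : intermediate.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints counts such as the=3, quick=1, fox=1, lazy=1, dog=1, end=1
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog the end")));
    }
}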

    MapReduce Execution Overview

    The Map invocations are distributed across multiple machines by automatically partitioning the input

data into a set of M splits or shards. The input shards can be processed in parallel on different machines.

Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.
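In Java, such a partitioning function can be sketched in a single method (illustrative only; real implementations differ in detail). Because every map worker applies the same function, all intermediate values for a given key land at the same reduce worker:

class PartitionSketch {
    // hash(key) mod R: decides which of the R reduce workers receives a key
    static int partitionFor(String key, int R) {
        // mask off the sign bit so the result is always in [0, R)
        return (key.hashCode() & Integer.MAX_VALUE) % R;
    }
}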

The illustration below shows the overall flow of a MapReduce operation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the illustration correspond to the numbers in the list below).

    1. The MapReduce library in the user program first shards the input files into M pieces of typically

    16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a

    cluster of machines.

    2. One of the copies of the program is special: the master. The rest are workers that are assigned

    work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle

    workers and assigns each one a map task or a reduce task.

    3. A worker who is assigned a map task reads the contents of the corresponding input shard. It parses


    key/value pairs out of the input data and passes each pair to the user-defined Map function. The

    intermediate key/value pairs produced by the Map function are buffered in memory.

    4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the

partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

    5. When a reduce worker is notified by the master about these locations, it uses remote procedure

    calls to read the buffered data from the local disks of the map workers. When a reduce worker has

read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an

    external sort is used.

    6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key

encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this

    reduce partition.

    7. When all map tasks and reduce tasks have been completed, the master wakes up the user program.

    At this point, the MapReduce call in the user program returns back to the user code.

    After successful completion, the output of the MapReduce execution is available in the R output files. [1]

To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker

    are reset back to their initial idle state, and therefore become eligible for scheduling on other workers.

Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

    Completed map tasks are re-executed when failure occurs because their output is stored on the local

    disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-

executed since their output is stored in a global file system.

    MapReduce Examples

    Here are a few simple examples of interesting programs that can be easily expressed as MapReduce

    computations.

    Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an

    identity function that just copies the supplied intermediate data to the output.
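A plain-Java sketch of the distributed grep logic follows (the method signatures are illustrative stand-ins, not the actual MapReduce library API):

import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class DistributedGrepSketch {
    // map: emit a line only if it matches the given pattern
    static void map(String fileName, List<String> lines,
                    Pattern pattern, Consumer<String> emitIntermediate) {
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                emitIntermediate.accept(line);
            }
        }
    }

    // reduce: the identity function -- just copy intermediate lines to the output
    static void reduce(String key, List<String> values, Consumer<String> emit) {
        values.forEach(emit);
    }
}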

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions. [1]


    This story appeared on JavaWorld at

    http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html

    MapReduce programming with Apache Hadoop

    Process massive data sets in parallel on large clusters

    By Ravi Shankar and Govindu Narendra, JavaWorld.com, 09/23/08

    Google and its MapReduce framework may rule the roost when it comes to massive-scale data

    processing, but there's still plenty of that goodness to go around. This article gets you started with

    Hadoop, the open source MapReduce implementation for processing large data sets. Authors Ravi

Shankar and Govindu Narendra first demonstrate the powerful combination of map and reduce in a simple Java program, then walk you through a more complex data-processing application based on Hadoop. Finally, they show you how to install and deploy your application in both standalone mode and

clustering mode.

Are you amazed by the fast response you get while searching the Web with Google or Yahoo? Have you

ever wondered how these services manage to search millions of pages and return your results in milliseconds or less? The algorithms that drive both of these major-league search services originated with

    Google's MapReduce framework. While MapReduce is proprietary technology, the Apache Foundation

has implemented its own open source map-reduce framework, called Hadoop. Hadoop is used by Yahoo and many other services whose success is based on processing massive amounts of data. In this article

    we'll help you discover whether it might also be a good solution for your distributed data processing

    needs.

We'll start with an overview of MapReduce, followed by a couple of Java programs that demonstrate the simplicity and power of the framework. We'll then introduce you to Hadoop's MapReduce

    implementation and walk through a complex application that searches a huge log file for a specific string.

    Finally, we'll show you how to install Hadoop in a Microsoft Windows environment and deploy the

    application -- first as a standalone application and then in clustering mode.

    You won't be an expert in all things Hadoop when you're done reading this article, but you will have

    enough material to explore and possibly implement Hadoop for your own large-scale data-processing

    requirements.

    About MapReduce

    MapReduce is a programming model specifically implemented for processing large data sets. The model

was developed by Jeffrey Dean and Sanjay Ghemawat at Google (see "MapReduce: Simplified data processing on large clusters"). At its core, MapReduce is a combination of two functions -- map() and reduce(), as its name would suggest.

A quick look at a sample Java program will help you get your bearings in MapReduce. This application implements a very simple version of the MapReduce framework, but isn't built on Hadoop. The simple,

    abstracted program will illustrate the core parts of the MapReduce framework and the terminology

associated with it. The application creates some strings, counts the number of characters in each string, and finally sums them up to show the total number of characters altogether. Listing 1 contains the

    program's Main class.


    Listing 1. Main class for a simple MapReduce Java app

public class Main {
    public static void main(String[] args) {
        MyMapReduce my = new MyMapReduce();
        my.init();
    }
}

Listing 1 just instantiates a class called MyMapReduce, which is shown in Listing 2.

    Listing 2. MyMapReduce.java

    import java.util.*;

    public class MyMapReduce

    ...

    Download complete Listing 2

    As you see, the crux of the class lies in just four functions:

The init() method creates some dummy data (just 30 strings). This data serves as the input data for the program. Note that in the real world, this input could be gigabytes, terabytes, or petabytes of data!

The step1ConvertIntoBuckets() method segments the input data. In this example, the data is divided into five smaller chunks and put inside an ArrayList named buckets. You can see that the method takes a list, which contains all of the input data, and another int value, numberOfBuckets. This value has been hardcoded to five; if you divide 30 strings into five buckets, each bucket will have six strings. Each bucket in turn is represented as an ArrayList. These array lists are finally put into another list and returned. So, at the end of the function, you have an array list with five buckets (array lists) of six strings each.

These buckets can be put in memory (as in this case), saved to disk, or put onto different nodes in a cluster!

step2RunMapFunctionForAllBuckets() is the next method invoked from init(). This method internally creates five threads (because there are five buckets -- the idea is to start a thread for each bucket). The class responsible for threading is StartThread, which is implemented as an inner class. Each thread processes each bucket and puts the individual result in another array list named intermediateresults. All the computation and threading takes place within the same JVM, and the whole process runs on a single machine.


    If the buckets were on different machines, a master should be monitoring them to know when the

    computation is over, if there are any failures in processing in any of the nodes, and so on. It would

    be great if the master could perform the computations on different nodes, rather than bringing the

    data from all nodes to the master itself and executing it.

The step3RunReduceFunctionForAllBuckets() method collates the results from intermediateresults, sums them up, and gives you the final output.

Note that intermediateresults needs to combine the results from the parallel processing explained in the previous bullet point. The exciting part is that this process also can happen concurrently!

    A more complicated scenario

Processing 30 input elements doesn't really make for an interesting scenario. Imagine instead that there are 100,000 elements of data to be processed. The task at hand is to search for the total number of occurrences of the word JavaWorld. The data may be structured or unstructured. Here's how you'd

    approach it:

Assume that, in some way, the data is divided into smaller chunks and is inserted into buckets. You have a total of 10 buckets now, with 10,000 elements of data within each of them. (Don't bother worrying about who exactly does the dividing at the moment.)

Apply a function named map(), which in turn executes your search algorithm on a single bucket and repeats it concurrently for all the buckets in parallel, storing the result (of processing of each bucket) in another set of buckets, called result buckets. Note that there may be more than one result bucket. (A code sketch of this appears after the list.)

Apply a function named reduce() on each of these result buckets. This function iterates through the result buckets, takes in each value, and then performs some kind of processing, if needed. The processing may either aggregate the individual values or apply some kind of business logic on the aggregated or individual values. This functionality once again takes place concurrently. Finally, you will get the result you expected.
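Here is the minimal Java sketch promised above, using one thread per bucket. For simplicity it counts the data elements that contain the term; the names and structure are illustrative, not from the article's code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BucketSearchSketch {
    static long countOccurrences(List<List<String>> buckets, String term) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(buckets.size());
        List<Future<Long>> resultBuckets = new ArrayList<>();
        for (List<String> bucket : buckets) {
            // "map": search one bucket, concurrently with all the others
            resultBuckets.add(pool.submit((Callable<Long>) () ->
                    bucket.stream().filter(s -> s.contains(term)).count()));
        }
        // "reduce": aggregate the individual per-bucket results
        long total = 0;
        for (Future<Long> f : resultBuckets) {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }
}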

These steps are very simple, but there is so much power in them! Let's look at the details.

    Dividing the data

    In Step 1, note that the buckets created by someone for you may be on a single machine or on multiple

machines (though they must be on the same cluster in that case). In practice, that means that in large enterprise projects, multiple terabytes or petabytes of data could be segmented into thousands of buckets

    on different machines in the cluster, and processing could be performed in parallel, giving the user an

    extremely fast response. Google uses this concept to index every Web page it crawls. If you take

advantage of the power of the underlying filesystem used for storing the data in individual machines of the cluster, the result could be more fascinating. Google uses the proprietary Google File System (GFS)

    for this.

    The map() function

In Step 2, the map() function understands exactly where it should go to process the data. The source of data may be memory, or disk, or another node in the cluster. Please note that bringing data to the place

    where the map() function resides is more costly and time-consuming than letting the function execute at


    the place where the data resides. If you write a C++ or Java program to process data on multiple threads,

    then the program fetches data from a data source (typically a remote database server) and is usually

    executed on the machine where your application is running. In MapReduce implementations, the

    computation happens on the distributed nodes.

    The reduce() function

In Step 3, the reduce() function operates on one or more lists of intermediate results by fetching each of them from memory, disk, or a network transfer and performing a function on each element of each list. The final result of the complete operation is produced by collating and interpreting the results from all processes running reduce().