
Page 1

Tutorial for MapReduce (Hadoop) & Large Scale Processing

Le Zhao (LTI, SCS, CMU)

Database Seminar & Large Scale Seminar

2010-Feb-15

Some slides adapted from IR course lectures by Jamie Callan

© 2010, Le Zhao

Page 2

Outline

• Why MapReduce (Hadoop)

• MapReduce basics

• The MapReduce way of thinking

• Manipulating large data

Page 3

Outline

• Why MapReduce (Hadoop)

– Why go large scale

– Compared to other parallel computing models

– Hadoop related tools

• MapReduce basics

• The MapReduce way of thinking

• Manipulating large data

Page 4

Why NOT to do parallel computing

• Concerns: a parallel system needs to provide:

– Data distribution

– Computation distribution

– Fault tolerance

– Job scheduling

Page 5

Why MapReduce (Hadoop)

• Previous parallel computation models

– 1) scp + ssh

» Manual everything

– 2) network cross-mounted disks + condor/torque

» No data distribution; disk access is the bottleneck

» Can only partition embarrassingly parallel (totally distributive) computations

» No fault tolerance

» Prioritized job scheduling

Page 6

Hadoop

• Parallel batch computation

– Data distribution

» Hadoop Distributed File System (HDFS)

» Like the Linux FS, but with automatic data replication

– Computation distribution

» Automatic; the user only needs to specify #input_splits

» Can distribute aggregation computations as well

– Fault tolerance

» Automatic recovery from failure

» Speculative execution (a backup task)

– Job scheduling

» Ok, but still relies on the politeness of users

Page 7

How you can use Hadoop

• Hadoop Streaming

– Quick hacking – much like shell scripting

» Uses STDIN & STDOUT to carry data

» cat file | mapper | sort | reducer > output

– Easier to use legacy code; works with any programming language

• Hadoop Java API

– Build large systems

» More data types

» More control over Hadoop’s behavior

» Easier debugging with Java’s error stacktrace display

– NetBeans plugin for Hadoop provides easy programming

» http://hadoopstudio.org/docs.html
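The `cat file | mapper | sort | reducer` pipeline above can be imitated by two tiny scripts. A minimal word-count sketch in Python (the tab-separated key/value convention is the Streaming default; in a real job each function would be its own script reading `sys.stdin`):

```python
def mapper(lines):
    """Streaming mapper: one input line in, tab-separated (key, value) lines out."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input arrives sorted by key, so equal keys are adjacent."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Locally this is literally: cat file | mapper.py | sort | reducer.py
    mapped = sorted(mapper(["once upon a time", "once again"]))
    print(list(reducer(mapped)))
```

The reducer relies only on the sort guarantee, which is exactly what the shuffle phase provides on a cluster.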

Page 8

Outline

• Why MapReduce (Hadoop)

• MapReduce basics

• The MapReduce way of thinking

• Manipulating large data

Page 9

© 2009, Jamie Callan

Map and Reduce

MapReduce is a new use of an old idea in Computer Science

• Map: Apply a function to every object in a list

– Each object is independent

» Order is unimportant

» Maps can be done in parallel

– The function produces a result

• Reduce: Combine the results to produce a final result

You may have seen this in a Lisp or functional programming course

Page 10

MapReduce

• Input reader

– Divide input into splits, assign each split to a Map processor

• Map

– Apply the Map function to each record in the split

– Each Map function returns a list of (key, value) pairs

• Shuffle/Partition and Sort

– Shuffle distributes sorting & aggregation to many reducers

– All records for key k are directed to the same reduce processor

– Sort groups the same keys together, and prepares for aggregation

• Reduce

– Apply the Reduce function to each key

– The result of the Reduce function is a list of (key, value) pairs
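These phases can be mimicked in a few lines of ordinary Python. This toy, single-process simulation (not Hadoop code) makes the data flow concrete:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process simulation of the Map -> Shuffle/Sort -> Reduce pipeline."""
    # Map: each record yields a list of (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle/Sort: bring all values for the same key together.
    pairs.sort(key=itemgetter(0))
    # Reduce: apply the reduce function to each (key, [values]) group.
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_fn(key, [v for _, v in group]))
    return out

# Example: word count over two "splits".
docs = [(1, "a b a"), (2, "b c")]
result = run_mapreduce(
    docs,
    map_fn=lambda rec: [(w, 1) for w in rec[1].split()],
    reduce_fn=lambda k, vs: [(k, sum(vs))],
)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

On a cluster the only difference is that the map calls, the sort, and the reduce calls each run on many machines.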

Page 11

MapReduce in One Picture


Tom White, Hadoop: The Definitive Guide

Page 12

Outline

• Why MapReduce (Hadoop)

• MapReduce basics

• The MapReduce way of thinking

– Two simple use cases

– Two more advanced & useful MapReduce tricks

– Two MapReduce applications

• Manipulating large data

Page 13

MapReduce Use Case (1) – Map Only

Data distributive tasks – Map Only

• E.g. classify individual documents

• Map does everything

– Input: (docno, doc_content), …

– Output: (docno, [class, class, …]), …

• No reduce

Page 14

MapReduce Use Case (2) – Filtering and Accumulation

Filtering & Accumulation – Map and Reduce

• E.g. Counting total enrollments of two given classes

• Map selects records and outputs initial counts

– In: (Jamie, 11741), (Tom, 11493), …

– Out: (11741, 1), (11493, 1), …

• Shuffle/Partition by class_id

• Sort

– In: (11741, 1), (11493, 1), (11741, 1), …

– Out: (11493, 1), …, (11741, 1), (11741, 1), …

• Reduce accumulates counts

– In: (11493, [1, 1, …]), (11741, [1, 1, …])

– Sum and Output: (11493, 16), (11741, 35)
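The same filtering-and-accumulation pattern in plain Python (the three input records below are made up; the slide's totals of 16 and 35 come from a larger roster):

```python
from collections import defaultdict

# Map: select records for the two classes of interest, emit (class_id, 1).
enrollments = [("Jamie", 11741), ("Tom", 11493), ("Le", 11741)]
wanted = {11741, 11493}
mapped = [(cid, 1) for _, cid in enrollments if cid in wanted]

# Shuffle/Sort + Reduce: group by class_id and sum the counts.
totals = defaultdict(int)
for cid, n in sorted(mapped):
    totals[cid] += n
print(dict(totals))  # {11493: 1, 11741: 2}
```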

Page 15

MapReduce Use Case (3) – Database Join

Problem: Massive lookups

– Given two large lists: (URL, ID) and (URL, doc_content) pairs

– Produce (ID, doc_content)

Solution: Database join

• Input stream: both (URL, ID) and (URL, doc_content) lists

– (http://del.icio.us/post, 0), (http://digg.com/submit, 1), …

– (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), …

• Map simply passes input along

• Shuffle and Sort on URL (group ID & doc_content for the same URL together)

– Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, <html0>), (http://digg.com/submit, <html1>), (http://digg.com/submit, 1), …

• Reduce outputs result stream of (ID, doc_content) pairs

– In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), …

– Out: (0, <html0>), (1, <html1>), …
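A toy version of this reduce-side join, with the shuffle simulated by a sort on URL (the "id"/"doc" tags are an illustrative convention for telling the two input streams apart, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

ids = [("http://del.icio.us/post", 0), ("http://digg.com/submit", 1)]
docs = [("http://del.icio.us/post", "<html0>"), ("http://digg.com/submit", "<html1>")]

# Map passes both streams through, tagged so Reduce can tell them apart.
mapped = [(url, ("id", i)) for url, i in ids] + \
         [(url, ("doc", d)) for url, d in docs]

# Shuffle/Sort on URL groups each id with its document.
mapped.sort(key=itemgetter(0))
joined = []
for url, group in groupby(mapped, key=itemgetter(0)):
    vals = dict(v for _, v in group)          # {"id": ..., "doc": ...}
    joined.append((vals["id"], vals["doc"]))
print(joined)  # [(0, '<html0>'), (1, '<html1>')]
```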

Page 16

MapReduce Use Case (4) – Secondary Sort

Problem: Sorting on values

• E.g. Reverse graph edge directions & output in node order

– Input: adjacency list of graph (3 nodes and 4 edges): (3, [1, 2]), (1, [2, 3])

– Output: (1, [3]), (2, [1, 3]), (3, [1])

• Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys!

Solution: Secondary sort

• Map

– In: (3, [1, 2]), (1, [2, 3])

– Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction)

– Out: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])

– Copy node_ids from value to key.

[Figure: the 3-node example graph and its edge-reversed counterpart]

Page 17

MapReduce Use Case (4) – Secondary Sort

Secondary Sort (ctd.)

• Shuffle on Key.field1, and Sort on whole Key (both fields)

– In: (<1, 3>, [3]), (<2, 3>, [3]), (<2, 1>, [1]), (<3, 1>, [1])

– Out: (<1, 3>, [3]), (<2, 1>, [1]), (<2, 3>, [3]), (<3, 1>, [1])

• Grouping comparator

– Merge according to part of the key

– Out: (<1, 3>, [3]), (<2, 1>, [1, 3]), (<3, 1>, [1]) (this becomes the reducer's input)

• Reduce

– Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
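The whole secondary-sort trick, simulated in Python: Map moves the value into a composite key, sorting uses the full key, and the grouping step compares only the key's first field:

```python
from itertools import groupby

# Adjacency lists: reverse each edge, then emit composite keys <dst, src>.
graph = {3: [1, 2], 1: [2, 3]}
pairs = [((dst, src), src) for src, outs in graph.items() for dst in outs]

# Shuffle on key.field1, sort on the whole composite key.
pairs.sort(key=lambda kv: kv[0])

# Grouping comparator: merge runs that share key.field1 only.
result = [(k1, [v for _, v in grp])
          for k1, grp in groupby(pairs, key=lambda kv: kv[0][0])]
print(result)  # [(1, [3]), (2, [1, 3]), (3, [1])]
```

Because the full key was sorted, the values inside each group come out already ordered, with no in-memory sort in the reducer.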

Page 18

Using MapReduce to Construct Indexes: Preliminaries

Construction of binary inverted lists

• Input: documents: (docid, [term, term..]), (docid, [term, ..]), ..

• Output: (term, [docid, docid, …])

– E.g., (apple, [1, 23, 49, 127, …])

• Binary inverted lists fit on a slide more easily

• Everything also applies to frequency and positional inverted lists

A document id is an internal document id, e.g., a unique integer

• Not an external document id such as a url

MapReduce elements

• Combiner, Secondary Sort, complex keys, Sorting on keys’ fields

Page 19

Using MapReduce to Construct Indexes: A Simple Approach

A simple approach to creating binary inverted lists

• Each Map task is a document parser

– Input: A stream of documents

– Output: A stream of (term, docid) tuples

» (long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …

• Shuffle sorts tuples by key and routes tuples to Reducers

• Reducers convert streams of keys into streams of inverted lists

– Input: (long, 1) (long, 127) (long, 49) (long, 23) …

– The reducer sorts the values for a key and builds an inverted list

» Longest inverted list must fit in memory

– Output: (long, [df:492, docids:1, 23, 49, 127, …])

Page 20

Using MapReduce to Construct Indexes: A Simple Approach

A more succinct representation of the previous algorithm

• Map: (docid1, content1) → (t1, docid1) (t2, docid1) …

• Shuffle by t

• Sort by t

(t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …

• Reduce: (t4, [docid3 docid1 …]) → (t, ilist)

docid: a unique integer

t: a term, e.g., “apple”

ilist: a complete inverted list

But: a) it is inefficient, b) the docids must be sorted inside the reducers, and c) it assumes the ilist of a word fits in memory
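A direct transcription of this simple (and, as noted, inefficient) algorithm in Python:

```python
from itertools import groupby
from operator import itemgetter

docs = [(1, "long ago and"), (2, "once upon a time long ago")]

# Map: (docid, content) -> one (term, docid) pair per token.
pairs = [(t, docid) for docid, content in docs for t in content.split()]

# Shuffle/Sort by term.
pairs.sort(key=itemgetter(0))

# Reduce: sort the docids for each term into a binary inverted list.
index = {t: sorted(d for _, d in grp)
         for t, grp in groupby(pairs, key=itemgetter(0))}
print(index["long"])  # [1, 2]
```

The `sorted(...)` inside the reduce step is exactly the in-memory sort that the later combiner and secondary-sort variants are designed to avoid.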

Page 21

Using MapReduce to Construct Indexes: Using Combine

• Map: (docid1, content1) → (t1, ilist1,1) (t2, ilist2,1) (t3, ilist3,1) …

– Each output inverted list covers just one document

• Combine

Sort by t

Combine: (t1, [ilist1,2 ilist1,3 ilist1,1 …]) → (t1, ilist1,27)

– Each output inverted list covers a sequence of documents

• Shuffle by t

• Sort by t

(t4, ilist4,1) (t5, ilist5,3) … → (t4, ilist4,2) (t4, ilist4,4) (t4, ilist4,1) …

• Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilistfinal)

ilisti,j: the j’th inverted list fragment for term i
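A sketch of what Combine buys: per-document fragments are merged into larger per-split fragments before the shuffle, and Reduce then merges the (fewer, larger) fragments the same way. This assumes each fragment covers a disjoint docid range, so ordering fragments by their first docid yields a fully sorted list; the `combine` helper name is made up:

```python
def combine(fragments):
    """Merge several inverted-list fragments for one term into one fragment.
    Assumes fragments cover disjoint docid ranges, so ordering by the
    first docid of each fragment suffices (no element-wise merge needed)."""
    merged = []
    for frag in sorted(fragments, key=lambda f: f[0]):
        merged.extend(frag)
    return merged

# Two map tasks each produced per-document fragments for the same term.
split_a = [[1], [3]]        # docids seen in split A
split_b = [[7], [5]]        # docids seen in split B
frag_a = combine(split_a)   # combiner output of split A -> [1, 3]
frag_b = combine(split_b)   # combiner output of split B -> [5, 7]

# Reduce merges the per-split fragments the same way.
print(combine([frag_b, frag_a]))  # [1, 3, 5, 7]
```

The combiner cuts shuffle traffic from one pair per (term, document) to one pair per (term, split).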

Page 22

Using MapReduce to Construct Indexes

[Diagram: Documents → Parser/Indexer (Map/Combine processors) → inverted-list fragments → Shuffle/Sort by term range (A-F, G-P, Q-Z) → Merger (Reduce processors) → final inverted lists]

Page 23

Using MapReduce to Construct Partitioned Indexes

• Map: (docid1, content1) → ([p, t1], ilist1,1)

• Combine to sort and group values

([p, t1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1], ilist1,27)

• Shuffle by p

• Sort values by [p, t]

• Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilistfinal)

p: partition (shard) id

Page 24

Using MapReduce to Construct Indexes: Secondary Sort

So far, we have assumed that Reduce can sort values in memory …but what if there are too many to fit in memory?

• Map: (docid1, content1) → ([t1, fd1,1], ilist1,1)

• Combine to sort and group values

• Shuffle by t

• Sort by [t, fd], then Group by t (Secondary Sort)

([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1) … → (t7, [ilist7,1, ilist7,2, …])

• Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilistfinal)

Values arrive in order, so Reduce can stream its output

fdi,j is the first docid in ilisti,j

Page 25

Using MapReduce to Construct Indexes: Putting it All Together

• Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1)

• Combine to sort and group values

([p, t1, fd1,1], [ilist1,2 ilist1,3 ilist1,1 …]) → ([p, t1, fd1,27], ilist1,27)

• Shuffle by p

• Secondary Sort by [(p, t), fd]

([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …])

• Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilistfinal)

Page 26

Using MapReduce to Construct Indexes

[Diagram: Documents → Parser/Indexer (Map/Combine processors) → inverted-list fragments → Shuffle/Sort by shard → Merger (Reduce processors) → one inverted-list shard per reducer]

Page 27

PageRank Calculation: Preliminaries

One PageRank iteration:

• Input:

– (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..

• Output:

– (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..

MapReduce elements

• Score distribution and accumulation

• Database join

• Side-effect files

Page 28

PageRank: Score Distribution and Accumulation

• Map

– In: (id1, [score1(t), out11, out12, ..]), (id2, [score2(t), out21, out22, ..]) ..

– Out: (out11, score1(t)/n1), (out12, score1(t)/n1) .., (out21, score2(t)/n2), ..

• Shuffle & Sort by node_id

– In: (id2, score1), (id1, score2), (id1, score1), ..

– Out: (id1, score1), (id1, score2), .., (id2, score1), ..

• Reduce

– In: (id1, [score1, score2, ..]), (id2, [score1, ..]), ..

– Out: (id1, score1(t+1)), (id2, score2(t+1)), ..
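One such distribute-and-accumulate step in Python. This sketch folds in the standard damping formula with the usual (1-d)/N teleport term, which the slides leave implicit, and it ignores dangling nodes for now:

```python
from collections import defaultdict

def pagerank_step(graph, scores, damping=0.85):
    """One MapReduce-style PageRank iteration (toy; no dangling-node handling)."""
    # Map: each node sends score/n to each of its n outlinks.
    contrib = defaultdict(float)
    for node, outs in graph.items():
        for out in outs:
            contrib[out] += scores[node] / len(outs)
    # Reduce: accumulate incoming contributions per node, apply damping.
    n = len(graph)
    return {node: (1 - damping) / n + damping * contrib[node] for node in graph}

graph = {1: [2, 3], 2: [3], 3: [1]}
scores = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
new = pagerank_step(graph, scores)
print(round(sum(new.values()), 6))  # total score mass stays 1.0
```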

Page 29

PageRank: Database Join to associate outlinks with score

• Map

– In & Out: (id1, score1(t+1)), (id2, score2(t+1)), .., (id1, [out11, out12, ..]), (id2, [out21, out22, ..]) ..

• Shuffle & Sort by node_id

– Out: (id1, score1(t+1)), (id1, [out11, out12, ..]), (id2, [out21, out22, ..]), (id2, score2(t+1)), ..

• Reduce

– In: (id1, [score1(t+1), out11, out12, ..]), (id2, [out21, out22, .., score2(t+1)]), ..

– Out: (id1, [score1(t+1), out11, out12, ..]), (id2, [score2(t+1), out21, out22, ..]) ..

Page 30

PageRank: Side Effect Files for dangling nodes

• Dangling Nodes

– Nodes with no outlinks (observed but not crawled URLs)

– Score has no outlet

» Needs to be distributed to all graph nodes evenly

• Map for dangling nodes:

– In: .., (id3, [score3]), ..

– Out: .., ("*", 0.85×score3), ..

• Reduce

– In: .., ("*", [score1, score2, ..]), ..

– Out: .., everything else, ..

– Output to side-effect: ("*", score), fed to Mapper of next iteration
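A sketch of the dangling-node bookkeeping (the scores below are made up; 0.85 is the damping factor from the slide):

```python
# Dangling nodes: collect their mass under a special "*" key in Map,
# then spread it evenly over all nodes in the next iteration.
damping = 0.85
scores = {1: 0.5, 2: 0.3, 3: 0.2}
outlinks = {1: [2], 2: [1], 3: []}          # node 3 is dangling

dangling_mass = damping * sum(s for n, s in scores.items() if not outlinks[n])
# Reduce writes ("*", dangling_mass) to a side-effect file; the next
# iteration's mapper adds dangling_mass / N to every node's score.
share = dangling_mass / len(scores)
print(round(share, 4))  # 0.0567
```

The side-effect file is needed because this value must reach every mapper of the next iteration, not just the reducer that computed it.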

Page 31

Outline

• Why MapReduce (Hadoop)

• MapReduce basics

• The MapReduce way of thinking

• Manipulating large data

Page 32

Manipulating Large Data

• Do everything in Hadoop (and HDFS)

– Make sure every step is parallelized!

– Any serial step breaks your design

• E.g. storing the URL list for a Web graph

– Each node in Web graph has an id

– [URL1, URL2, …], using line numbers as ids – a bottleneck (assigning line numbers is a serial step)

– [(id1, URL1), (id2, URL2), …], explicit id

Page 33

Hadoop based Tools

• For Developing in Java, NetBeans plugin

– http://www.hadoopstudio.org/docs.html

• Pig Latin, a SQL-like, high-level data-processing scripting language

• Hive, Data warehouse, SQL

• Cascading, Data processing

• Mahout, Machine Learning algorithms on Hadoop

• HBase, Distributed data store as a large table

• More

– http://hadoop.apache.org/

– http://en.wikipedia.org/wiki/Hadoop

– Many other toolkits: Nutch, Cloud9, Ivory

Page 34

Get Your Hands Dirty

• Hadoop Virtual Machine

– http://www.cloudera.com/developers/downloads/virtual-machine/

» This runs Hadoop 0.20

– An earlier Hadoop 0.18.0 version is here http://code.google.com/edu/parallel/tools/hadoopvm/index.html

• Amazon EC2

• Various other Hadoop clusters around

• The NetBeans plugin simulates Hadoop

– The workflow view works on Windows

– Local running & debugging works on MacOS and Linux

– http://www.hadoopstudio.org/docs.html

Page 35

Conclusions

• Why large scale

• MapReduce advantages

• Hadoop uses

• Use cases

– Map only: for totally distributive computation

– Map+Reduce: for filtering & aggregation

– Database join: for massive dictionary lookups

– Secondary sort: for sorting on values

– Inverted indexing: combiner, complex keys

– PageRank: side effect files

• Large data

Page 36


For More Information

• L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 2003.

• J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.

• S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.

• I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.

• J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38 (2). 2006.

• http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. “Map/Reduce Tutorial”. Fetched January 21, 2010.

• Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.

• J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, Book Draft. February 7, 2010.