![Page 1: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/1.jpg)
Lecture 2 – MapReduce
CPE 458 – Parallel Programming, Spring 2009
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. http://creativecommons.org/licenses/by/2.5
![Page 2: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/2.jpg)
Outline
- MapReduce: Programming Model
- MapReduce Examples
- A Brief History
- MapReduce Execution Overview
- Hadoop
- MapReduce Resources
![Page 3: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/3.jpg)
MapReduce
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
![Page 4: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/4.jpg)
MapReduce
More simply, MapReduce is: a parallel programming model and associated implementation.
![Page 5: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/5.jpg)
Programming Model
- Description: the mental model the programmer has about the detailed execution of their application.
- Purpose: improve programmer productivity.
- Evaluation: expressibility, simplicity, performance.
![Page 6: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/6.jpg)
Programming Models
von Neumann model
- Execute a stream of instructions (machine code)
- Instructions can specify:
  - Arithmetic operations
  - Data addresses
  - Next instruction to execute
- Complexity
  - Track billions of data locations and millions of instructions
  - Manage with:
    - Modular design
    - High-level programming languages (isomorphic)
![Page 7: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/7.jpg)
Programming Models
Parallel Programming Models
- Message passing
  - Independent tasks encapsulating local data
  - Tasks interact by exchanging messages
- Shared memory
  - Tasks share a common address space
  - Tasks interact by reading and writing this space asynchronously
- Data parallelization
  - Tasks execute a sequence of independent operations
  - Data usually evenly partitioned across tasks
  - Also referred to as “embarrassingly parallel”
![Page 8: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/8.jpg)
MapReduce: Programming Model
- Process data using special map() and reduce() functions
- The map() function is called on every item in the input and emits a series of intermediate key/value pairs
- All values associated with a given key are grouped together
- The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
![Page 9: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/9.jpg)
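MapReduce: Programming Model
[Figure: word count through the MapReduce framework (Input → Map → Reduce → Output), drawn with four map tasks (M) and two reduce tasks (R)]
- Input: “How now Brown cow”, “How does It work now”
- Map output: <How,1> <now,1> <brown,1> <cow,1> <How,1> <does,1> <it,1> <work,1> <now,1>
- Reduce input (grouped by key): <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
- Output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1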
![Page 10: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/10.jpg)
MapReduce: Programming Model
More formally,
Map(k1, v1) --> list(k2, v2)
Reduce(k2, list(v2)) --> list(v2)
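The signatures above can be made concrete with a small in-memory sketch. This is not how a distributed runtime is implemented; it only illustrates the data flow on one machine. The names `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical, and lowercasing every word is an assumption made here for simplicity.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process illustration of Map -> group by key -> Reduce."""
    groups = defaultdict(list)
    # Map phase: each (k1, v1) input yields zero or more (k2, v2) pairs.
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            # Grouping: collect every v2 that shares the same k2.
            groups[k2].append(v2)
    # Reduce phase: each unique k2 and its value list yields a list of v2.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


def word_count_map(doc_name, text):
    # Map(k1, v1) --> list(k2, v2): emit (word, 1) for every word.
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    # Reduce(k2, list(v2)) --> list(v2): a one-element list with the total.
    return [sum(counts)]


documents = [("doc1", "How now Brown cow"), ("doc2", "How does It work now")]
counts = run_mapreduce(documents, word_count_map, word_count_reduce)
for word, total in counts.items():
    print(word, total[0])
```

Because the user supplies only `word_count_map` and `word_count_reduce`, the same driver can be reused for other jobs by swapping those two functions.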
![Page 11: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/11.jpg)
MapReduce Runtime System
1. Partitions input data
2. Schedules execution across a set of machines
3. Handles machine failure
4. Manages interprocess communication
![Page 12: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/12.jpg)
MapReduce Benefits
- Greatly reduces parallel programming complexity
  - Reduces synchronization complexity
  - Automatically partitions data
  - Provides failure transparency
  - Handles load balancing
- Practical
  - Approximately 1000 Google MapReduce jobs run every day.
![Page 13: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/13.jpg)
MapReduce Examples
Word frequency
[Figure: doc → Map → <word,1> <word,1> <word,1> → Runtime System → <word,1,1,1> → Reduce → <word,3>]
![Page 14: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/14.jpg)
MapReduce Examples
- Distributed grep
  - Map function emits <word, line_number> if word matches search criteria
  - Reduce function is the identity function
- URL access frequency
  - Map function processes web logs, emits <url, 1>
  - Reduce function sums values and emits <url, total>
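Both examples can be sketched with the same tiny single-process driver. Everything here is illustrative: `run_mapreduce`, `make_grep_map`, `identity_reduce`, `url_map`, and `sum_reduce` are hypothetical names, substring matching stands in for the "search criteria", and the two-field log format (`<client> <url>`) is an assumption, not a real web-server format.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def make_grep_map(pattern):
    """Build a map function that emits (word, line_number) for matching words."""
    def grep_map(line_number, line):
        for word in line.split():
            if pattern in word:  # assumed search criterion: substring match
                yield word, line_number
    return grep_map


def identity_reduce(key, values):
    # Distributed grep: pass the grouped values through unchanged.
    return values


def url_map(offset, log_line):
    # Assumed log format: "<client> <url>".
    client, url = log_line.split()
    yield url, 1


def sum_reduce(url, counts):
    # URL access frequency: total the 1s emitted for each URL.
    return sum(counts)


lines = [(1, "disk error on node7"), (2, "all good"), (3, "network error again")]
grep_result = run_mapreduce(lines, make_grep_map("error"), identity_reduce)

log = [(0, "10.0.0.1 /index.html"), (1, "10.0.0.2 /about.html"),
       (2, "10.0.0.3 /index.html")]
url_totals = run_mapreduce(log, url_map, sum_reduce)

print(grep_result)
print(url_totals)
```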
![Page 15: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/15.jpg)
A Brief History
Functional programming (e.g., Lisp)
- map() function: applies a function to each value of a sequence
- reduce() function: combines all elements of a sequence using a binary operator
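The two functional primitives can be tried directly in Python, used here only as a convenient stand-in for Lisp. Python's built-in `map` and `functools.reduce` behave as described above.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4]

# map(): apply a function to each value of a sequence.
squares = list(map(lambda x: x * x, numbers))

# reduce(): combine all elements of a sequence using a binary operator.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```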
![Page 16: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/16.jpg)
MapReduce Execution Overview
1. The user program, via the MapReduce library, shards the input data.
[Figure: the User Program splits the Input Data into Shard 0 through Shard 6]
* Shards are typically 16-64 MB in size
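As a rough sketch of the sharding step, the function below cuts an input of a given size into fixed-size byte ranges. The name `shard_ranges` is hypothetical, and splitting purely by byte offset is a simplifying assumption: a real library would also have to respect record boundaries.

```python
def shard_ranges(total_bytes, shard_bytes=64 * 1024 * 1024):
    """Return (start, end) byte offsets for each shard; end is exclusive."""
    if shard_bytes <= 0:
        raise ValueError("shard_bytes must be positive")
    return [
        (start, min(start + shard_bytes, total_bytes))
        for start in range(0, total_bytes, shard_bytes)
    ]


# A 150-byte input with 64-byte shards yields two full shards and a remainder.
small_example = shard_ranges(150, 64)
print(small_example)

# A 1 GiB input with the default 64 MB shard size yields 16 shards.
print(len(shard_ranges(1024 * 1024 * 1024)))
```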
![Page 17: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/17.jpg)
MapReduce Execution Overview
2. The user program creates process copies distributed on a machine cluster. One copy will be the “Master” and the others will be worker processes.
[Figure: the User Program forks one Master and several Workers]
![Page 18: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/18.jpg)
MapReduce Execution Overview
3. The master distributes M map and R reduce tasks to idle workers.
- M == number of shards
- R == number of parts into which the intermediate key space is divided
[Figure: the Master sends Message(Do_map_task) to an Idle Worker]
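One way to divide the intermediate key space into R parts is to hash each key and take the result modulo R. The slides do not say which partitioning function is used, so treat this as an assumption; `partition` and `bucket_pairs` are hypothetical names. `zlib.crc32` is used instead of Python's built-in `hash()` because string hashes from `hash()` vary between processes.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Map an intermediate key to one of R reduce tasks (0 .. R-1)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def bucket_pairs(pairs, num_reduce_tasks):
    """Split intermediate (key, value) pairs into R regions."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


R = 2
pairs = [("how", 1), ("now", 1), ("brown", 1), ("cow", 1), ("how", 1)]
regions = bucket_pairs(pairs, R)
for index, region in enumerate(regions):
    print(index, region)
```

Since the function is deterministic, every pair with the same key lands in the same region, so a single reduce task sees all values for that key.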
![Page 19: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/19.jpg)
MapReduce Execution Overview
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs.
- Output buffered in RAM.
[Figure: Shard 0 → Map worker → Key/value pairs]
![Page 20: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/20.jpg)
MapReduce Execution Overview
5. Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process.
[Figure: the Map worker writes to Local Storage and sends Disk locations to the Master]
![Page 21: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/21.jpg)
MapReduce Execution Overview
6. The Master process gives disk locations to an available reduce-task worker, which reads all associated intermediate data.
[Figure: the Master sends Disk locations to a Reduce worker, which reads from remote Storage]
![Page 22: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/22.jpg)
MapReduce Execution Overview
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
[Figure: the Reduce worker sorts data and writes to its Partition Output file]
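The sort-then-reduce behaviour of a reduce-task worker can be sketched with `sorted` and `itertools.groupby`. Sorting brings equal keys together, so each unique key is handed to the reduce function exactly once. The name `reduce_partition` is hypothetical, and the returned list stands in for the partition output file.

```python
from itertools import groupby
from operator import itemgetter


def reduce_partition(pairs, reduce_fn):
    """Sort intermediate pairs by key, then reduce each unique key once."""
    ordered = sorted(pairs, key=itemgetter(0))
    output = []  # stands in for the partition output file
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output


pairs = [("now", 1), ("how", 1), ("now", 1), ("cow", 1), ("how", 1)]
result = reduce_partition(pairs, lambda key, values: sum(values))
print(result)  # [('cow', 1), ('how', 2), ('now', 2)]
```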
![Page 23: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/23.jpg)
MapReduce Execution Overview
8. The Master process wakes up the user process when all tasks have completed. Output is contained in R output files.
[Figure: the Master sends wakeup to the User Program; results are in the Output files]
![Page 24: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/24.jpg)
MapReduce Execution Overview
Fault Tolerance
- Master process periodically pings workers
- Map-task failure
  - Re-execute the task, even if it had completed
  - All of its output was stored locally on the failed machine
- Reduce-task failure
  - Only re-execute partially completed tasks
  - All output of completed tasks is stored in the global file system
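The rescheduling rule above can be sketched as a small bookkeeping function on the master. The `Task` record, its state names, and `handle_worker_failure` are hypothetical. The sketch assumes the master has already decided, for example after missed pings, that a worker has failed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker currently or last assigned


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset the failed worker's tasks that must run again; return them."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lived on the failed machine's local disk,
        # so it is lost. Completed reduce output is in the global file system.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in_progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
redo = handle_worker_failure(tasks, "w1")
print(len(redo))  # 3
```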
![Page 25: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/25.jpg)
Hadoop
- Open source MapReduce implementation: http://hadoop.apache.org/core/index.html
- Uses Hadoop Distributed File System (HDFS): http://hadoop.apache.org/core/docs/current/hdfs_design.html
- Java
- ssh
![Page 26: Map Reduce](https://reader036.vdocuments.us/reader036/viewer/2022062418/554f5f70b4c905bb178b4604/html5/thumbnails/26.jpg)
References
- Introduction to Parallel Programming and MapReduce, Google Code University: http://code.google.com/edu/parallel/mapreduce-tutorial.html
- Distributed Systems: http://code.google.com/edu/parallel/index.html
- MapReduce: Simplified Data Processing on Large Clusters: http://labs.google.com/papers/mapreduce.html
- Hadoop: http://hadoop.apache.org/core/