Post on 09-Jun-2015
Introduction to MapReduce
Masoumeh Rezaei jam
In the name of God
Distributed Systems course, Department of Computer Engineering
University of Tabriz
22nd May 2012
Contents
- Introduction
- Mappers & Reducers
- Execution Framework
- Combiners & Partitioners
- Steps of MapReduce Execution
- Conclusion
Introduction
Cloud computing involves large-scale, distributed computations on data from multiple sources.
MapReduce is a programming model as well as a framework that supports the model.
It enables processing of massive volumes of data in parallel on many low-end computing nodes.
Introduction (cont.)
The Emergence of MapReduce
Introduced by Google in 2004 to support simplified distributed computing on
clusters of computers
Used by social-networking services, video-sharing sites, web-based email services, and other large-scale applications
Importance of MapReduce
- The single-machine von Neumann model isn't sufficient for large-scale data
- Scales to large clusters of machines
- Organizes computations on clusters
- An effective tool for tackling large-data problems
- The model is easy to use
Characteristics

Main ideas of MapReduce:
- Hide system-level details from the developers
  - No race conditions, lock contention, etc.
  - Developers focus on data processing strategies
- Separate the "what" from the "how"
  - The developer specifies the computation that needs to be performed
  - The execution framework ("runtime") handles the actual execution
- Built for big data
  - Move processing to the data
  - Seamless scalability
MapReduce Overview
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
The Execution Framework
A MapReduce program consists of code for mappers and reducers (as well as combiners and partitioners), together with configuration parameters.
The developer submits the job to the submission node of a cluster, and the execution framework (the "runtime") takes care of everything else: it transparently handles all other aspects of distributed code execution, on clusters ranging from a single node to a few thousand nodes.
Specific responsibilities of the execution framework:
- Scheduling
- Data/code co-location
- Synchronization
- Error and fault handling
Runtime Scheduling Scheme
- There is no static execution plan: nodes that have completed their assigned tasks are given new ones.
- Speculative, redundant execution achieves fault tolerance.
Simplified view of MapReduce

Key-value pairs form the basic data structure in MapReduce. Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets, e.g.:
- web pages: (URL, the actual HTML content)
- graph: (node id, adjacency list of the node)

The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs.
The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.
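As a concrete illustration of the two functions just described, here is a minimal word-count sketch in plain Python. The names `map_fn` and `reduce_fn` are hypothetical, not part of any real framework API:

```python
def map_fn(doc_id, text):
    """Mapper: for each word in the document, emit an intermediate (word, 1) pair."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reducer: sum all values associated with one intermediate key."""
    yield (word, sum(counts))
```

Given the input pair ("d1", "a b a"), the mapper emits ("a", 1), ("b", 1), ("a", 1); the reducer then collapses the grouped values for "a" into ("a", 2).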
The MapReduce model consists of two primitive functions: Map and Reduce.
The input to MapReduce is a list of (key1, value1) pairs.
Map() is applied to each pair to compute intermediate key-value pairs, (key2, value2).
The intermediate key-value pairs are then grouped together by key equality, i.e. (key2, list(value2)).
For each key2, Reduce() works on the list of all its values and produces zero or more aggregated results.
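The four-step flow above can be simulated in a single process. This is purely illustrative (no parallelism, no cluster), and the driver name `run_mapreduce` is an assumption for the sketch:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each (key1, value1) pair yields intermediate (key2, value2) pairs
    grouped = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)          # group on key equality: (key2, list(value2))
    # Reduce phase: for each key2, produce zero or more aggregated results
    out = []
    for k2, values in grouped.items():
        out.extend(reduce_fn(k2, values))
    return out

# Word count expressed with this driver
wc = run_mapreduce(
    [("d1", "to be or"), ("d2", "or not to be")],
    lambda k, text: [(w, 1) for w in text.split()],
    lambda w, counts: [(w, sum(counts))],
)
```

The grouping step in the middle is exactly what the real runtime's shuffle-and-sort phase performs across machines.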
Mappers & Reducers
Mappers and Reducers are objects that implement the Map and Reduce methods, respectively.
A mapper object is a worker initialized for each map task, and the Map method is called on each key-value pair by the execution framework.
For example, if an input is broken into 400 blocks and there are 40 mappers in a cluster, there are 400 map tasks, and they are executed in 10 waves of task runs.
The situation is similar for the reduce phase: a reducer object is a worker initialized for each reduce task, and the Reduce method is called once per intermediate key.
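The wave count in the example above follows from simple arithmetic: 400 map tasks on 40 concurrent mapper slots run in ceil(400/40) waves. A one-line sketch:

```python
import math

def wave_count(num_blocks, num_mappers):
    # One map task per block; tasks run in waves of `num_mappers` at a time
    return math.ceil(num_blocks / num_mappers)
```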
An example of using MapReduce
(Figure: Input → Map → Reduce → Output)
Combiners & Partitioners

Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle-and-sort phase: they perform local aggregation on the output of each mapper, e.g., computing a local count for a key over all the documents processed by that mapper. This cuts down the number of intermediate key-value pairs.
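A minimal sketch of the local aggregation a combiner performs, using a hypothetical helper (not a real Hadoop API): collapse one mapper's (word, 1) pairs into local counts before anything crosses the network.

```python
from collections import Counter

def combine(mapper_output):
    # Local aggregation: sum values per key within a single mapper's output
    local = Counter()
    for key, value in mapper_output:
        local[key] += value
    return list(local.items())
```

For example, a mapper's output ("a", 1), ("a", 1), ("b", 1) shrinks to ("a", 2), ("b", 1), so only two intermediate pairs instead of three are shuffled.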
Combiners & Partitioners (cont.)

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. The execution framework uses this information to copy the data to the right location during the shuffle-and-sort phase. The simplest partitioner computes the hash value of the key and takes the mod of that value with the number of reducers.
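The simplest partitioner described above fits in one line. This sketch uses CRC32 as a stand-in for the framework's hash function, since Python's built-in `hash()` for strings is salted per run and a partitioner must map the same key to the same reducer every time:

```python
import zlib

def partition(key, num_reducers):
    # Stable hash of the key, mod the number of reducers,
    # so a given key always lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```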
Steps of MapReduce Execution

1. Before starting the Map tasks, an input file is loaded onto the distributed file system. At loading, the file is partitioned into multiple data blocks of the same size, typically 64 MB. Each block is triplicated to guarantee fault tolerance.
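The arithmetic behind step 1 is worth making explicit. A small sketch (illustrative helper name, figures taken from the 64 MB block size and 3-way replication stated above):

```python
import math

BLOCK_MB = 64      # default block size, per the text
REPLICATION = 3    # each block is triplicated

def block_stats(file_mb):
    # Number of blocks the file occupies, and total block copies stored
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, blocks * REPLICATION
```

A 640 MB file therefore occupies 10 blocks, stored as 30 block copies across the cluster.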
2. Each block is then assigned to a mapper. The mapper applies Map() to each record in the data block.
3. The intermediate outputs produced by the mappers are then sorted locally to group key-value pairs sharing the same key.
4. After the local sort, Combine() is optionally applied to pre-aggregate the grouped key-value pairs, so that the communication cost of transferring all the intermediate outputs to the reducers is minimized.
5. The map outputs are stored on the local disks of the mappers, partitioned into R parts, where R is the number of Reduce tasks in the job. This partitioning is typically done by a hash function, e.g. hash(key) mod R.
6. When all Map tasks are completed, the MapReduce scheduler assigns Reduce tasks to workers.
7. Since all map outputs are already partitioned and stored on local disks, each reducer performs the shuffle by simply pulling its partition of the map outputs from the mappers.
8. A reducer reads the intermediate results and merges them by the intermediate key (key2), so that all values of the same key are grouped together. Sometimes this grouping is done by an external merge sort.
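The merge-by-key step just described can be sketched with the standard-library pattern: sort the partition's (key2, value2) pairs, then group adjacent equal keys. This is an in-memory analogue of the (possibly external) merge-sort grouping mentioned above:

```python
from itertools import groupby
from operator import itemgetter

def merge_by_key(pairs):
    # Sort by key, then group adjacent equal keys into (key2, list(value2))
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]
```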
9. Each reducer then applies Reduce() to the intermediate values for each key2 it encounters. The outputs of the reducers are stored and triplicated in HDFS.
Hadoop Architecture

The single opportunity for global synchronization is the barrier between the map and reduce phases. Because of that barrier, the map phase of a job is only as fast as its slowest map task; similarly, the completion time of a job is bounded by the running time of the slowest reduce task.
MapReduce Execution

The number of Map tasks does not depend on the number of nodes but on the number of input blocks. Each block is assigned to a single Map task.
Complete view of MapReduce
Conclusion
In this presentation, MapReduce was discussed as a programming model for expressing distributed computations on massive amounts of data, and as an execution framework for large-scale data processing on clusters of commodity servers.
MapReduce can enhance the productivity of junior developers who lack experience in distributed/parallel development.
References
- Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce," 2010.
- R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility," 2009.
- Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon, "Parallel Data Processing with MapReduce: A Survey," 2011.
- B. Thirumala Rao and L. S. S. Reddy, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments," 2011.
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," 2008.
Thank you!
Any questions?