
Page 1: Main map reduce

Introduction to MapReduce (An Introduction to the MapReduce Model)

Masoumeh Rezaei jam

In the name of God

Distributed Systems course
Department of Computer Engineering
University of Tabriz

22nd May 2012

Page 2: Main map reduce

Contents

Introduction

Mappers & Reducers

Combiners & Partitioners

Execution Framework

Steps of MapReduce Execution

Conclusion

Page 3: Main map reduce

Introduction

Cloud computing involves large-scale, distributed computations on data from multiple sources.

MapReduce is a programming model as well as a framework that supports the model.

It enables processing a massive volume of data in parallel with many low-end computing nodes.


Page 4: Main map reduce

Introduction (cont.)

The Emergence of MapReduce

Introduced by Google in 2004 to support simplified distributed computing on clusters of computers.

Examples: social-networking services, video-sharing sites, web-based email services, and other applications.


Page 5: Main map reduce

Importance of MapReduce

The von Neumann model alone isn't sufficient at this scale.

Scales to large clusters of machines.

Organizes computations on clusters.

An effective tool for tackling large-data problems.

The model is easy to use.


Page 6: Main map reduce

Characteristics

Main ideas of MapReduce:

Hide system-level details from the developers
• No race conditions, lock contention, etc.
• Developers focus on data processing strategies

Separating the "what" from the "how"
• The developer specifies the computation that needs to be performed
• The execution framework ("runtime") handles the actual execution

Big data
• Move processing to the data
• Seamless scalability


Page 7: Main map reduce

MapReduce Overview

Users specify the computation in terms of a map and a reduce function.

The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
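For example, the classic word count computation can be expressed with just these two functions. The sketch below is illustrative Python pseudocode, not code for any particular MapReduce framework; the names map_fn and reduce_fn are made up for the example:

def map_fn(key, value):
    # key: a document id, value: the document text
    for word in value.split():
        yield (word, 1)  # emit one intermediate pair per word occurrence

def reduce_fn(key, values):
    # key: a word, values: every count emitted for that word
    yield (key, sum(values))  # emit the total count for the word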


Page 8: Main map reduce

The Execution Framework

A MapReduce program consists of code for mappers and reducers (as well as combiners and partitioners) together with configuration parameters.

The developer submits the job to the submission node of a cluster, and the execution framework (the "runtime") takes care of everything else: it transparently handles all other aspects of distributed code execution, on clusters ranging from a single node to a few thousand nodes.

Page 9: Main map reduce

Specific responsibilities of the Execution Framework:

Scheduling

Data/code co-location

Synchronization

Error and fault handling


Page 10: Main map reduce

Runtime Scheduling Scheme

Nodes that have completed their tasks are assigned other tasks.

No pre-computed execution plan.

Speculative, redundant execution achieves fault tolerance.


Page 11: Main map reduce

Simplified view of MapReduce

Key-value pairs form the basic data structure in MapReduce. Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets, e.g. web pages: (URL, the actual HTML content); graph: (node id, adjacency list of the node).

The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs.

The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.
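For instance, those two representations can be written directly as key-value pairs, as in this short sketch (illustrative Python only; the URL and node ids are made-up examples):

web_page = ("http://www.example.com/index.html", "<html>...page content...</html>")  # (URL, raw HTML)

graph = [
    ("node1", ["node2", "node3"]),  # (node id, adjacency list)
    ("node2", ["node3"]),
    ("node3", []),
]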


Page 12: Main map reduce

The MapReduce model consists of two primitive functions: Map and Reduce.

The input for MapReduce is a list of (key1, value1) pairs.

Map() is applied to each pair to compute intermediate key-value pairs, (key2, value2).

The intermediate key-value pairs are then grouped together on a key-equality basis, i.e. (key2, list(value2)).

For each key2, Reduce() works on the list of all values, then produces zero or more aggregated results.
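This flow can be simulated in a few lines of plain Python, as a sketch of the model only (a real framework runs the phases in parallel across machines; map_fn and reduce_fn are the hypothetical functions sketched earlier):

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key1, value1) pair.
    intermediate = []
    for key1, value1 in inputs:
        intermediate.extend(map_fn(key1, value1))

    # Grouping: collect all values that share the same key2.
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)

    # Reduce phase: apply reduce_fn to each (key2, list(value2)) group.
    results = []
    for key2, values in groups.items():
        results.extend(reduce_fn(key2, values))
    return results

With the word count functions above, run_mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn) returns pairs such as ("to", 2) and ("be", 2).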


Page 13: Main map reduce

Mappers & Reducers

Mappers and Reducers are objects that implement the Map and Reduce methods, respectively.

A mapper object is a worker initialized for each map task and the Map method is called on each key-value pair by the execution framework.

For example, if an input is broken down into 400 blocks and there are 40 mappers in a cluster, the number of map tasks is 400 and the map tasks are executed through 10 waves of task runs.

The situation is similar for the reduce phase: a reducer object is a worker initialized for each reduce task, and the Reduce method is called once per intermediate key.
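The wave count in that example is just the ratio of map tasks to available mappers, rounded up; a quick check with a hypothetical helper:

import math

def map_waves(num_blocks, num_mappers):
    # One map task per input block, run num_mappers at a time.
    return math.ceil(num_blocks / num_mappers)

print(map_waves(400, 40))  # -> 10 waves of map task runs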


Page 14: Main map reduce

An example of using MapReduce

(Figure: the example's data flows from Input through Map and Reduce to Output.)


Page 15: Main map reduce


Combiners & Partitioners

Combiners

Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.

They perform local aggregation on the output of each mapper, e.g. computing a local count for a key over all the documents processed by that mapper.

This cuts down the number of intermediate key-value pairs that must be transferred.
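In word count, for example, a combiner can pre-sum the counts emitted by a single mapper, so a frequent word produces one (word, n) pair instead of n separate (word, 1) pairs. A sketch, reusing the hypothetical names from the earlier examples; when the reduce operation is associative and commutative, the reducer itself can often serve as the combiner:

def combine_fn(key, values):
    # Local, per-mapper aggregation: same shape as the reducer,
    # but applied only to the pairs produced by one mapper.
    yield (key, sum(values))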


Page 16: Main map reduce


Combiners & Partitioners (cont.)

Partitioners

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers.

The execution framework uses this information to copy the data to the right location during the shuffle and sort phase.

The simplest partitioner computes the hash value of the key and then takes the mod of that value with the number of reducers.
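A sketch of such a hash partitioner (illustrative Python; hash() stands in for whichever hash function the framework actually uses):

def partition(key, num_reducers):
    # Map an intermediate key to one of the R reducers;
    # every pair with the same key lands on the same reducer.
    return hash(key) % num_reducers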


Page 17: Main map reduce

Steps of MapReduce Execution (1)

Before starting the Map task, an input file is loaded on the distributed file system. At loading, the file is partitioned into multiple data blocks of the same size, typically 64 MB. Each block is triplicated to guarantee fault tolerance.
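As a rough local illustration of this block partitioning (a sketch only; in practice the split and the triplication are handled by the distributed file system itself, e.g. HDFS, not by user code):

BLOCK_SIZE = 64 * 1024 * 1024  # the typical 64 MB block size

def split_into_blocks(path, block_size=BLOCK_SIZE):
    # Read an input file as a sequence of fixed-size blocks.
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block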


Page 18: Main map reduce

Steps of MapReduce Execution (2)

Each block is then assigned to a mapper. The mapper applies Map() to each record in the data block.


Page 19: Main map reduce

Steps of MapReduce Execution (3)

The intermediate outputs produced by the mappers are then sorted locally to group key-value pairs sharing the same key.


Page 20: Main map reduce

Steps of MapReduce Execution (4)

After the local sort, Combine() is optionally applied to perform pre-aggregation on the grouped key-value pairs, so that the communication cost of transferring all the intermediate outputs to the reducers is minimized.


Page 21: Main map reduce

Steps of MapReduce Execution (5)

The mapped outputs are stored in the local disks of the mappers, partitioned into R partitions, where R is the number of Reduce tasks in the MapReduce job. This partitioning is basically done by a hash function, e.g. hash(key) mod R.


Page 22: Main map reduce

Steps of MapReduce Execution (6)

When all Map tasks are completed, the MapReduce scheduler assigns Reduce tasks to workers.


Page 23: Main map reduce

Steps of MapReduce Execution (7)

Since all mapped outputs are already partitioned and stored in local disks, each reducer performs the shuffling by simply pulling its partition of the mapped outputs from the mappers.


Page 24: Main map reduce

Steps of MapReduce Execution (8)

A reducer reads the intermediate results and merges them by the intermediate keys, i.e. key2, so that all values of the same key are grouped together. Sometimes this grouping is done by an external merge sort.
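Conceptually, this merge step behaves like grouping a key-sorted stream of intermediate pairs; a small sketch of just the grouping (the on-disk external merge sort itself is not shown):

from itertools import groupby
from operator import itemgetter

def group_by_key(sorted_pairs):
    # sorted_pairs: (key2, value2) pairs already sorted by key2.
    for key2, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key2, [value for _, value in group]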


Page 25: Main map reduce

Steps of MapReduce Execution (9)

Then each reducer applies Reduce() to the intermediate values for each key2 it encounters. The outputs of the reducers are stored and triplicated in HDFS.


Page 26: Main map reduce

Hadoop Architecture

The single opportunity for global synchronization is the barrier between the map and reduce phases. Because of this barrier between the map and reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion time of a job is bounded by the running time of the slowest reduce task.

Page 27: Main map reduce

MapReduce Execution

The number of Map tasks does not depend on the number of nodes, but on the number of input blocks. Each block is assigned to a single Map task.


Page 28: Main map reduce

Complete view of MapReduce


Page 29: Main map reduce

Conclusion

In this presentation, MapReduce was discussed as a programming model for expressing distributed computations on massive amounts of data and as an execution framework for large-scale data processing on clusters of commodity servers.

MapReduce can enhance the productivity of junior developers who lack experience with distributed/parallel development.


Page 30: Main map reduce

References

J. Lin and C. Dyer, "Data-Intensive Text Processing with MapReduce," 2010.

R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," 2009.

K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel Data Processing with MapReduce: A Survey," 2011.

B. Thirumala Rao and L. S. S. Reddy, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments," 2011.

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," 2008.


Page 31: Main map reduce

Thank you!

Any questions?