
Page 1: Main map reduce

Introduction to MapReduce (An Introduction to the MapReduce Model)

Masoumeh Rezaei jam

In the name of God

Distributed Systems course
Department of Computer Engineering
University of Tabriz

22nd May 2012

Page 2: Main map reduce

Contents

Introduction

Mappers & Reducers

Combiners & Partitioners

Execution Framework

Steps of MapReduce Execution

Conclusion

Page 3: Main map reduce

Introduction

Cloud computing involves large-scale, distributed computations on data from multiple sources.

MapReduce is a programming model as well as a framework that supports the model.

It enables processing a massive volume of data in parallel with many low-end computing nodes.


Page 4: Main map reduce

Introduction (cont.)

The Emergence of MapReduce

Introduced by Google in 2004 to support simplified distributed computing on clusters of computers.

Examples: social-networking services, video-sharing sites, web-based email services, and other applications.


Page 5: Main map reduce

Importance of MapReduce

The von Neumann model alone isn't sufficient at this scale.

Scales to large clusters of machines.

Organizes computations on clusters.

An effective tool for tackling large-data problems.

The model is easy to use.


Page 6: Main map reduce

Characteristics

Main ideas of MapReduce:

Hide system-level details from the developers
• No race conditions, lock contention, etc.
• Developers focus on data processing strategies

Separating the "what" from the "how"
• The developer specifies the computation that needs to be performed
• The execution framework ("runtime") handles the actual execution

Big data
• Move processing to the data
• Seamless scalability


Page 7: Main map reduce

MapReduce Overview

Users specify the computation in terms of a map and a reduce function.

The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
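For example, the classic word count computation can be expressed with just these two functions. The sketch below is illustrative Python pseudocode, not code for any particular MapReduce framework; the names map_fn and reduce_fn are made up for the example:

def map_fn(key, value):
    # key: a document id, value: the document text
    for word in value.split():
        yield (word, 1)  # emit one intermediate pair per word occurrence

def reduce_fn(key, values):
    # key: a word, values: every count emitted for that word
    yield (key, sum(values))  # emit the total count for the word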


Page 8: Main map reduce

The Execution Framework

A MapReduce program consists of code for mappers and reducers (as well as combiners and partitioners) together with configuration parameters.

The developer submits the job to the submission node of a cluster, and the execution framework (the "runtime") takes care of everything else: it transparently handles all other aspects of distributed code execution, on clusters ranging from a single node to a few thousand nodes.

Page 9: Main map reduce

Specific responsibilities of the Execution Framework:

Scheduling

Data/code co-location

Synchronization

Error and fault handling


Page 10: Main map reduce

Runtime Scheduling Scheme

Nodes that have completed their tasks are assigned other tasks.

No pre-computed execution plan.

Speculative, redundant execution achieves fault tolerance.


Page 11: Main map reduce

Simplified view of MapReduce

Key-value pairs form the basic data structure in MapReduce. Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets, e.g. web pages: (URL, the actual HTML content); graph: (node id, adjacency list of the node).

The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs.

The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.
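For instance, those two representations can be written directly as key-value pairs, as in this short sketch (illustrative Python only; the URL and node ids are made-up examples):

web_page = ("http://www.example.com/index.html", "<html>...page content...</html>")  # (URL, raw HTML)

graph = [
    ("node1", ["node2", "node3"]),  # (node id, adjacency list)
    ("node2", ["node3"]),
    ("node3", []),
]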


Page 12: Main map reduce

The MapReduce model consists of two primitive functions: Map and Reduce.

The input for MapReduce is a list of (key1, value1) pairs.

Map() is applied to each pair to compute intermediate key-value pairs, (key2, value2).

The intermediate key-value pairs are then grouped together on a key-equality basis, i.e. (key2, list(value2)).

For each key2, Reduce() works on the list of all values, then produces zero or more aggregated results.
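This flow can be simulated in a few lines of plain Python, as a sketch of the model only (a real framework runs the phases in parallel across machines; map_fn and reduce_fn are the hypothetical functions sketched earlier):

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key1, value1) pair.
    intermediate = []
    for key1, value1 in inputs:
        intermediate.extend(map_fn(key1, value1))

    # Grouping: collect all values that share the same key2.
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)

    # Reduce phase: apply reduce_fn to each (key2, list(value2)) group.
    results = []
    for key2, values in groups.items():
        results.extend(reduce_fn(key2, values))
    return results

With the word count functions above, run_mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn) returns pairs such as ("to", 2) and ("be", 2).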


Page 13: Main map reduce

Mappers & Reducers

Mappers and Reducers are objects that implement the Map and Reduce methods, respectively.

A mapper object is a worker initialized for each map task and the Map method is called on each key-value pair by the execution framework.

For example, if an input is broken down into 400 blocks and there are 40 mappers in a cluster, the number of map tasks is 400 and the map tasks are executed through 10 waves of task runs.

The situation is similar for the reduce phase: a reducer object is a worker initialized for each reduce task, and the Reduce method is called once per intermediate key.
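The wave count in that example is just the ratio of map tasks to available mappers, rounded up; a quick check with a hypothetical helper:

import math

def map_waves(num_blocks, num_mappers):
    # One map task per input block, run num_mappers at a time.
    return math.ceil(num_blocks / num_mappers)

print(map_waves(400, 40))  # -> 10 waves of map task runs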


Page 14: Main map reduce

An example of using MapReduce

(Figure: the example's data flows from Input through Map and Reduce to Output.)


Page 15: Main map reduce


Combiners & Partitioners

Combiners

Combiners are an optimization in MapReduce that allows for local aggregation before the shuffle and sort phase.

They perform local aggregation on the output of each mapper, e.g. computing a local count for a key over all the documents processed by that mapper.

This cuts down the number of intermediate key-value pairs that must be transferred.
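In word count, for example, a combiner can pre-sum the counts emitted by a single mapper, so a frequent word produces one (word, n) pair instead of n separate (word, 1) pairs. A sketch, reusing the hypothetical names from the earlier examples; when the reduce operation is associative and commutative, the reducer itself can often serve as the combiner:

def combine_fn(key, values):
    # Local, per-mapper aggregation: same shape as the reducer,
    # but applied only to the pairs produced by one mapper.
    yield (key, sum(values))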


Page 16: Main map reduce


Combiners & Partitioners (cont.)

Partitioners

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers.

The execution framework uses this information to copy the data to the right location during the shuffle and sort phase.

The simplest partitioner computes the hash value of the key and then takes the mod of that value with the number of reducers.
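A sketch of such a hash partitioner (illustrative Python; hash() stands in for whichever hash function the framework actually uses):

def partition(key, num_reducers):
    # Map an intermediate key to one of the R reducers;
    # every pair with the same key lands on the same reducer.
    return hash(key) % num_reducers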


Page 17: Main map reduce

Steps of MapReduce Execution (1)

Before starting the Map task, an input file is loaded on the distributed file system. At loading, the file is partitioned into multiple data blocks of the same size, typically 64 MB. Each block is triplicated to guarantee fault tolerance.
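As a rough local illustration of this block partitioning (a sketch only; in practice the split and the triplication are handled by the distributed file system itself, e.g. HDFS, not by user code):

BLOCK_SIZE = 64 * 1024 * 1024  # the typical 64 MB block size

def split_into_blocks(path, block_size=BLOCK_SIZE):
    # Read an input file as a sequence of fixed-size blocks.
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block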


Page 18: Main map reduce

Steps of MapReduce Execution (2)

Each block is then assigned to a mapper. The mapper applies Map() to each record in the data block.


Page 19: Main map reduce

Steps of MapReduce Execution (3)

The intermediate outputs produced by the mappers are then sorted locally to group key-value pairs sharing the same key.


Page 20: Main map reduce

Steps of MapReduce Execution (4)

After the local sort, Combine() is optionally applied to perform pre-aggregation on the grouped key-value pairs, so that the communication cost of transferring all the intermediate outputs to the reducers is minimized.


Page 21: Main map reduce

Steps of MapReduce Execution (5)

The mapped outputs are stored in the local disks of the mappers, partitioned into R partitions, where R is the number of Reduce tasks in the MapReduce job. This partitioning is basically done by a hash function, e.g. hash(key) mod R.


Page 22: Main map reduce

Steps of MapReduce Execution (6)

When all Map tasks are completed, the MapReduce scheduler assigns Reduce tasks to workers.


Page 23: Main map reduce

Steps of MapReduce Execution (7)

Since all mapped outputs are already partitioned and stored in local disks, each reducer performs the shuffling by simply pulling its partition of the mapped outputs from the mappers.


Page 24: Main map reduce

Steps of MapReduce Execution (8)

A reducer reads the intermediate results and merges them by the intermediate keys, i.e. key2, so that all values of the same key are grouped together. Sometimes this grouping is done by an external merge sort.
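Conceptually, this merge step behaves like grouping a key-sorted stream of intermediate pairs; a small sketch of just the grouping (the on-disk external merge sort itself is not shown):

from itertools import groupby
from operator import itemgetter

def group_by_key(sorted_pairs):
    # sorted_pairs: (key2, value2) pairs already sorted by key2.
    for key2, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key2, [value for _, value in group]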


Page 25: Main map reduce

Steps of MapReduce Execution (9)

Then each reducer applies Reduce() to the intermediate values for each key2 it encounters. The outputs of the reducers are stored and triplicated in HDFS.


Page 26: Main map reduce

Hadoop Architecture

The single opportunity for global synchronization is the barrier between the map and reduce phases. Because of this barrier between the map and reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion time of a job is bounded by the running time of the slowest reduce task.

Page 27: Main map reduce

MapReduce Execution

The number of Map tasks does not depend on the number of nodes, but on the number of input blocks. Each block is assigned to a single Map task.


Page 28: Main map reduce

Complete view of MapReduce


Page 29: Main map reduce

Conclusion

In this presentation, MapReduce was discussed as a programming model for expressing distributed computations on massive amounts of data and as an execution framework for large-scale data processing on clusters of commodity servers.

MapReduce can enhance the productivity of junior developers who lack experience with distributed/parallel development.


Page 30: Main map reduce

References

J. Lin and C. Dyer, "Data-Intensive Text Processing with MapReduce," 2010.

R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," 2009.

K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel Data Processing with MapReduce: A Survey," 2011.

B. Thirumala Rao and L. S. S. Reddy, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments," 2011.

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," 2008.


Page 31: Main map reduce

Thank you!

Any questions?