MapReduce
Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Outline
◦Introduction
◦Programming Model
◦Implementation
◦Refinement
◦Performance
◦Related Work
◦Conclusions
Introduction
◦What is the purpose?
◦The abstraction
Input Data → Map → Intermediate Key/Value pairs → Reduce → Output File
Programming model
◦Map
◦Reduce
◦Example
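The Map/Reduce pair can be made concrete with the classic word-count example. The sketch below is a minimal single-process illustration of the model only (the helper `run_mapreduce` is hypothetical; Google's actual implementation is a distributed C++ system):

```python
from collections import defaultdict

def map_func(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_func(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return sum(counts)

def run_mapreduce(inputs, map_func, reduce_func):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_func(name, contents):
            groups[key].append(value)
    # Reduce phase: apply reduce_func to each key's value list.
    return {key: reduce_func(key, values) for key, values in groups.items()}

result = run_mapreduce([("doc1", "the quick fox"), ("doc2", "the fox")],
                       map_func, reduce_func)
print(result)  # {'the': 2, 'quick': 1, 'fox': 2}
```

The user supplies only the two small functions; grouping, distribution, and fault tolerance are the framework's job.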
Programming model
◦Real example: building an index
Programming Model
◦More examples
Distributed grep
Count of URL Access Frequency
Reverse Web-link Graph
Term Vector per host
Inverted index
Distributed sort
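One of these examples, the inverted index, fits the same map/reduce shape: map emits (word, document id) pairs and reduce sorts each word's document ids into a posting list. A hypothetical single-process sketch (the real system would split this across many machines):

```python
from collections import defaultdict

def map_index(doc_id, contents):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in sorted(set(contents.split())):
        yield (word, doc_id)

def reduce_index(word, doc_ids):
    # Sort document ids to form the posting list for this word.
    return sorted(doc_ids)

docs = [("d1", "fox jumps"), ("d2", "fox sleeps")]
groups = defaultdict(list)
for doc_id, text in docs:
    for word, value in map_index(doc_id, text):
        groups[word].append(value)
index = {w: reduce_index(w, ids) for w, ids in groups.items()}
print(index)  # {'fox': ['d1', 'd2'], 'jumps': ['d1'], 'sleeps': ['d2']}
```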
Implementation
◦Execution overview
Implementation
◦Master data structures
◦Fault tolerance
Worker failure
Master failure
Semantics in the Presence of Failures
◦Locality
◦Task Granularity
◦Backup Tasks
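The worker-failure rule from the paper can be sketched as follows (a simplified illustration with hypothetical task records, not the master's real data structures): when a worker dies, its in-progress tasks are rescheduled, and its *completed map* tasks are rescheduled too, because their output sits on the failed machine's local disk; completed reduce tasks keep their state, since their output is already in the global file system.

```python
def handle_worker_failure(tasks, failed_worker):
    # Reset tasks affected by the failed worker back to idle so the
    # master will reschedule them on healthy workers.
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["state"] == "in_progress":
            task["state"] = "idle"  # rescheduled regardless of type
        elif task["state"] == "completed" and task["type"] == "map":
            task["state"] = "idle"  # map output was lost with the local disk

tasks = [
    {"type": "map", "state": "completed", "worker": "w1"},
    {"type": "reduce", "state": "completed", "worker": "w1"},
    {"type": "map", "state": "in_progress", "worker": "w2"},
]
handle_worker_failure(tasks, "w1")
print([t["state"] for t in tasks])  # ['idle', 'completed', 'in_progress']
```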
Refinements
◦Partitioning Function
◦Ordering Guarantees
◦Combiner Function
◦Input and Output Types
◦Side-effects
◦Skipping Bad Records
◦Local Execution
◦Status Information
◦Counters
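Of these refinements, the combiner is easy to illustrate: it does partial merging on the map worker before intermediate data crosses the network, and for word count it performs the same summation as the reduce function. A rough sketch of its effect (function and variable names are illustrative, not from the paper):

```python
from collections import defaultdict

def combine(pairs):
    # Partial merging on the map worker: collapse repeated keys locally
    # before anything is sent over the network to the reduce workers.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

# A map task that saw "the" three times would normally emit three pairs;
# with a combiner, only one pair per word crosses the network.
emitted = [("the", 1), ("the", 1), ("the", 1), ("fox", 1)]
print(combine(emitted))  # [('the', 3), ('fox', 1)]
```

This matters because network bandwidth is the scarce resource in the cluster, a point the conclusion slide returns to.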
Performance
◦Cluster Configuration
1,800 machines
Each with two 2 GHz Intel Xeon processors
4 GB memory
Two 160 GB IDE disks
1 Gbps Ethernet
Arranged in a two-level tree-shaped switched network
Performance
◦Grep
Scans through 10^10 100-byte records
Searches for a relatively rare three-character pattern (occurs in 92,337 records)
Data transfer rate over time
The entire computation takes approximately 150 s
Peaks at over 30 GB/s with 1,764 workers assigned
Performance
◦Sort
Sorts 10^10 100-byte records
Modeled after the TeraSort benchmark
Extracts a 10-byte sorting key
Normal execution
No backup
200 tasks killed
Performance
◦Sort
Input rate is lower than for grep
There is a delay between phases
The rates are ordered: input > shuffle > output
Effect of backup tasks
Machine failures
Related Work
◦Restricted programming models
◦Parallel processing, compared to Bulk Synchronous Programming and MPI primitives
◦Backup task mechanism, compared to the Charlotte System
◦Sorting facility, compared to NOW-Sort
Related Work
◦Sending data over distributed queues, compared to River
◦Programming model, compared to BAD-FS
Conclusion
◦What is the reason for the success of MapReduce?
Easy to use
Problems are easily expressible
Scales to large clusters
◦Learned from this work
Restricting the programming model makes parallelization easy
Network bandwidth is a scarce resource
Redundant execution