
Page 1: MapReduce

MapReduce

Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Page 2: MapReduce

Outline

◦Introduction

◦Programming Model

◦Implementation

◦Refinement

◦Performance

◦Related work

◦Conclusions

Page 3: MapReduce

Introduction

◦What is the purpose?

◦The abstraction

Input Data → Map → Intermediate Key/Value Pairs → Reduce → Output File
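The flow above can be captured in a few lines. Below is a minimal single-machine sketch in Python (the map_reduce helper and its signature are hypothetical, not from the paper): a user-supplied map function turns each input record into intermediate key/value pairs, the framework groups the pairs by key, and a user-supplied reduce function merges the values for each key.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    """Single-machine sketch of the MapReduce abstraction:
    input data -> map -> intermediate key/value pairs -> reduce -> output."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Group intermediate pairs by key (the framework's shuffle/sort step).
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        # Reduce phase: merge all values that share a key.
        output.append((key, reduce_fn(key, values)))
    return output
```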

Page 4: MapReduce

Programming model

◦Map

◦Reduce

◦Example
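The paper's running example is counting word occurrences in a collection of documents. A sketch of the user-supplied map and reduce functions, reusing the hypothetical map_reduce helper above (function names are illustrative):

```python
def wc_map(document):
    """Map: emit (word, 1) for every word in the document's text."""
    _name, text = document
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    """Reduce: sum the partial counts emitted for this word."""
    return sum(counts)

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
print(map_reduce(docs, wc_map, wc_reduce))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```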

Page 5: MapReduce

Programming model

◦Real example: make an index
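For index building, the map function can emit (word, document ID) pairs and the reduce function can collect the IDs per word, yielding a simple inverted index. A sketch reusing the hypothetical map_reduce helper (names are illustrative):

```python
def index_map(document):
    """Map: emit (word, document_id) for each distinct word in the document."""
    doc_id, text = document
    return [(word, doc_id) for word in set(text.split())]

def index_reduce(word, doc_ids):
    """Reduce: collect and sort the document IDs that contain this word."""
    return sorted(set(doc_ids))

docs = [("d1", "map reduce cluster"), ("d2", "map tasks on a cluster")]
print(map_reduce(docs, index_map, index_reduce))
# e.g. [('a', ['d2']), ('cluster', ['d1', 'd2']), ('map', ['d1', 'd2']), ...]
```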

Page 6: MapReduce

Programming Model

◦More examples

Distributed grep

Count of URL Access Frequency

Reverse Web-link Graph (sketched after this list)

Term Vector per host

Inverted index

Distributed sort
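As one illustration of the list above, the Reverse Web-link Graph can be expressed with a map function that emits (target, source) for every link on a page and a reduce function that collects the sources per target. A sketch reusing the hypothetical map_reduce helper:

```python
def links_map(page):
    """Map: for each link found on a page, emit (target URL, source URL)."""
    source, targets = page
    return [(target, source) for target in targets]

def links_reduce(target, sources):
    """Reduce: pair each target with the list of pages that link to it."""
    return sorted(sources)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
print(map_reduce(pages, links_map, links_reduce))
# [('b.html', ['a.html']), ('c.html', ['a.html', 'b.html'])]
```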

Page 7: MapReduce

Implementation

◦Execution overview
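A single-process sketch (hypothetical names, not the real distributed implementation) of the roles described in the execution overview: the master splits the input into M map tasks, each map task partitions its intermediate output into R regions, and each of the R reduce tasks sorts and reduces one region.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn, M=3, R=2):
    """Simulate the execution overview on one machine."""
    # "Master" splits the input into M pieces, one per map task.
    splits = [records[i::M] for i in range(M)]

    # Map tasks: each one's output is partitioned into R regions
    # (the real system buffers these on the map worker's local disk).
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[hash(key) % R][key].append(value)

    # Reduce tasks: each reads one region, sorts by key, and reduces.
    output_files = []
    for r in range(R):
        output_files.append(
            [(key, reduce_fn(key, values))
             for key, values in sorted(regions[r].items())])
    return output_files  # one output "file" per reduce task

# e.g. run_job(docs, wc_map, wc_reduce) with the word-count functions above.
```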

Page 8: MapReduce

Implementation

◦Master data structure

◦Fault tolerance

Worker failure

Master failure

Semantics in the Presence of Failures

◦Locality

◦Task Granularity

◦Backup Tasks

Page 9: MapReduce

Refinements

◦Partitioning Function

◦Ordering Guarantees

◦Combiner Function (see the sketch after this list)

◦Input and Output Types

◦Side-effects

◦Skipping Bad Records

◦Local Execution

◦Status Information

◦Counters
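For the Combiner Function refinement, the idea is to merge values that share a key on the map side before they cross the network. A minimal sketch (the combine helper is hypothetical, not the paper's API):

```python
from collections import defaultdict

def combine(pairs, combiner_fn):
    """Map-side combiner: merge values that share a key before they are
    written out, cutting the data shuffled over the network."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, combiner_fn(key, values)) for key, values in grouped.items()]

# For word count the combiner does the same partial sum as the reducer.
pairs = [("the", 1), ("fox", 1), ("the", 1)]
print(combine(pairs, lambda word, counts: sum(counts)))
# [('the', 2), ('fox', 1)]
```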

Page 10: MapReduce

Performance

◦Cluster Configuration

1800 machines

Each with two 2GHz Intel Xeon processors

4GB memory

2 × 160GB IDE disks

1 Gbps Ethernet

Arranged in a two-level tree-shaped network

Page 11: MapReduce

Performance

◦Grep

Scans through 10^10 100-byte records

Searches for a relatively rare three-character pattern (occurring in 92,337 records)

Data transfer rate over time peaks at over 30 GB/s with 1764 workers assigned

The entire computation takes approximately 150 seconds

Page 12: MapReduce

Performance

◦Sort

Sorts 10^10 100-byte records

Modeled after the TeraSort benchmark

Extracts a 10-byte sorting key

Normal execution

No backup tasks

200 tasks killed

Page 13: MapReduce

Performance

◦Sort

Input rate is less than for grep

There is a delay

The rate: input > shuffle > output

Effect of backup tasks

Machine failures

Page 14: MapReduce

Related Work

◦Restricted programming models

◦Parallel processing compared to Bulk Synchronous Programming and MPI primitives

◦Backup task mechanism compared to the Charlotte System

◦Sorting facility compared to NOW-Sort

Page 15: MapReduce

Related Work

◦Sending data over distributed queues compared to River

◦Programming model compared to BAD-FS

Page 16: MapReduce

Conclusion

◦What is the reason for the success of MapReduce?

Easy to use

Problems are easily expressible

Scales to large clusters

◦Lessons learned from this work

Restricting the programming model

Network bandwidth is a scarce resource

Redundant execution