MapReduce
Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Outline
◦Introduction
◦Programming Model
◦Implementation
◦Refinement
◦Performance
◦Related Work
◦Conclusions
Introduction
◦What is the purpose?
◦The abstraction
Input Data → Map → Intermediate Key/Value pairs → Reduce → Output File
Programming model
◦Map
◦Reduce
◦Example
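The Map/Reduce pair can be made concrete with the classic word-count example. The sketch below is a minimal single-process illustration of the model only (the helper `run_mapreduce` is hypothetical; Google's actual implementation is a distributed C++ system):

```python
from collections import defaultdict

def map_func(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_func(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return sum(counts)

def run_mapreduce(inputs, map_func, reduce_func):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_func(name, contents):
            groups[key].append(value)
    # Reduce phase: apply reduce_func to each key's value list.
    return {key: reduce_func(key, values) for key, values in groups.items()}

result = run_mapreduce([("doc1", "the quick fox"), ("doc2", "the fox")],
                       map_func, reduce_func)
print(result)  # {'the': 2, 'quick': 1, 'fox': 2}
```

The user supplies only the two small functions; grouping, distribution, and fault tolerance are the framework's job.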
Programming model
◦Real example: building an index
Programming Model
◦More examples
Distributed grep
Count of URL Access Frequency
Reverse Web-link Graph
Term Vector per host
Inverted index
Distributed sort
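One of these examples, the inverted index, fits the same map/reduce shape: map emits (word, document id) pairs and reduce sorts each word's document ids into a posting list. A hypothetical single-process sketch (the real system would split this across many machines):

```python
from collections import defaultdict

def map_index(doc_id, contents):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in sorted(set(contents.split())):
        yield (word, doc_id)

def reduce_index(word, doc_ids):
    # Sort document ids to form the posting list for this word.
    return sorted(doc_ids)

docs = [("d1", "fox jumps"), ("d2", "fox sleeps")]
groups = defaultdict(list)
for doc_id, text in docs:
    for word, value in map_index(doc_id, text):
        groups[word].append(value)
index = {w: reduce_index(w, ids) for w, ids in groups.items()}
print(index)  # {'fox': ['d1', 'd2'], 'jumps': ['d1'], 'sleeps': ['d2']}
```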
Implementation
◦Execution overview
Implementation
◦Master data structures
◦Fault tolerance
Worker failure
Master failure
Semantics in the Presence of Failures
◦Locality
◦Task Granularity
◦Backup Tasks
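The worker-failure rule from the paper can be sketched as follows (a simplified illustration with hypothetical task records, not the master's real data structures): when a worker dies, its in-progress tasks are rescheduled, and its *completed map* tasks are rescheduled too, because their output sits on the failed machine's local disk; completed reduce tasks keep their state, since their output is already in the global file system.

```python
def handle_worker_failure(tasks, failed_worker):
    # Reset tasks affected by the failed worker back to idle so the
    # master will reschedule them on healthy workers.
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["state"] == "in_progress":
            task["state"] = "idle"  # rescheduled regardless of type
        elif task["state"] == "completed" and task["type"] == "map":
            task["state"] = "idle"  # map output was lost with the local disk

tasks = [
    {"type": "map", "state": "completed", "worker": "w1"},
    {"type": "reduce", "state": "completed", "worker": "w1"},
    {"type": "map", "state": "in_progress", "worker": "w2"},
]
handle_worker_failure(tasks, "w1")
print([t["state"] for t in tasks])  # ['idle', 'completed', 'in_progress']
```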
Refinements
◦Partitioning Function
◦Ordering Guarantees
◦Combiner Function
◦Input and Output Types
◦Side-effects
◦Skipping Bad Records
◦Local Execution
◦Status Information
◦Counters
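Of these refinements, the combiner is easy to illustrate: it does partial merging on the map worker before intermediate data crosses the network, and for word count it performs the same summation as the reduce function. A rough sketch of its effect (function and variable names are illustrative, not from the paper):

```python
from collections import defaultdict

def combine(pairs):
    # Partial merging on the map worker: collapse repeated keys locally
    # before anything is sent over the network to the reduce workers.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

# A map task that saw "the" three times would normally emit three pairs;
# with a combiner, only one pair per word crosses the network.
emitted = [("the", 1), ("the", 1), ("the", 1), ("fox", 1)]
print(combine(emitted))  # [('the', 3), ('fox', 1)]
```

This matters because network bandwidth is the scarce resource in the cluster, a point the conclusion slide returns to.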
Performance
◦Cluster Configuration
1,800 machines
Each with two 2 GHz Intel Xeon processors
4 GB memory
Two 160 GB IDE disks
1 Gbps Ethernet
Arranged in a two-level tree-shaped switched network
Performance
◦Grep
Scans through 10^10 100-byte records
Searches for a relatively rare three-character pattern (occurs in 92,337 records)
Data transfer rate over time
The entire computation takes approximately 150 s
Peaks at over 30 GB/s with 1,764 workers assigned
Performance
◦Sort
Sorts 10^10 100-byte records
Modeled after the TeraSort benchmark
Extracts a 10-byte sorting key
Normal execution
No backup
200 tasks killed
Performance
◦Sort
Input rate is lower than for grep
There is a delay between phases
The rates are ordered: input > shuffle > output
Effect of backup tasks
Machine failures
Related Work
◦Restricted programming models
◦Parallel processing, compared to Bulk Synchronous Programming and MPI primitives
◦Backup task mechanism, compared to the Charlotte System
◦Sorting facility, compared to NOW-Sort
Related Work
◦Sending data over distributed queues, compared to River
◦Programming model, compared to BAD-FS
Conclusion
◦What is the reason for the success of MapReduce?
Easy to use
Problems are easily expressible
Scales to large clusters
◦Learned from this work
Restricting the programming model makes parallelization easy
Network bandwidth is a scarce resource
Redundant execution