an introduction to mapreduce
TRANSCRIPT
An Introduction to MapReduce
Presented by Frane Bandov at the Operating Complex IT-Systems seminar
Berlin, 1/26/2010
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 2 An Introduction to MapReduce
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 3 An Introduction to MapReduce
Introduction – Problem
2/16/10 An Introduction to MapReduce 4
0
50
100
150
200
250
You Facebook Yahoo! Groups German Climate Computing Centre
TBytes
Sometimes we have to deal with huge amounts of data
Introduction – Problem
The data needs to be processed, but how?
Can‘t process all of this data on one machine Distribute the processing to many machines
2/16/10 An Introduction to MapReduce 5
Introduction – Approach
Distributed computing is the solution “Let’s write our own distributed computing
software as a solution to our problem”
Development takes a long time Expensive: Cost-benefit ratio?
Build complex software for simple computations?
2/16/10 An Introduction to MapReduce 6
Checklist design protocols design data structures write the code assure failure tolerance
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 7 An Introduction to MapReduce
Google MapReduce – Idea
A framework for distributed computing
Don‘t care about protocols, failure tolerance, etc.
Just write your simple computation
2/16/10 An Introduction to MapReduce 8
Google MapReduce – Idea
Map: Apply function to all elements of a list
[1, 4, 9, 16, 25]
Reduce: Combine all elements of a list
15
2/16/10 An Introduction to MapReduce 9
square x = x * x; map square [1, 2, 3, 4, 5];
reduce (+)[1, 2, 3, 4, 5];
MapReduce Paradigm
Google MapReduce – Idea
Basic functioning
2/16/10 An Introduction to MapReduce 10
Input Map Reduce Output
Google MapReduce – Overview
2/16/10 An Introduction to MapReduce 11
MapReduce-Based User Program
Input file
Split 1
Split 2
Split 3
Split 4
Split 5
Master
Worker
Worker
Worker
Worker
Worker
File 1
File 2
Map Phase Reduce Phase
Output files
Intermediate File 1
Intermediate File 2
Intermediate File 3
GFS GFS
MapReduce – Fault Tolerance
• Workers are periodically pinged by master • No answer over certain time worker failed
Mapper fails: – Reset map job as idle – Even if job was completed intermediate files are
inaccessible – Notify reducers where to get the new intermediate file
Reducer fails: – Reset its job as idle
2/16/10 An Introduction to MapReduce 12
MapReduce – Fault Tolerance
Master fails: – Periodically sets checkpoints – In case of failure MapReduce-Operation is aborted – Operation can be restarted from last checkpoint
2/16/10 An Introduction to MapReduce 13
Google MapReduce – GFS
Google File System • In-house distributed file system at Google • Stores all input an output files • Stores files… – divided into 64 MB blocks – on at least 3 different machines
• Machines running GFS also run MapReduce
2/16/10 An Introduction to MapReduce 14
Google MapReduce – Job Example
2/16/10 An Introduction to MapReduce 15
Google MapReduce – Job Example
2/16/10 An Introduction to MapReduce 16
Google MapReduce – Job Example
2/16/10 An Introduction to MapReduce 17
Google MapReduce – Job Example
2/16/10 An Introduction to MapReduce 18
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 19 An Introduction to MapReduce
Alternative Implementations
Apache Hadoop
• Open-Source-Implementation in Java • Jobs can be written in C++, Java, Python, etc. • Used by Yahoo!, Facebook, Amazon and others • Most commonly used implementation • HDFS as open-source-implementation of GFS • Can also use Amazon S3, HTTP(S) or FTP • Extensions: Hive, Pig, HBase
2/16/10 An Introduction to MapReduce 20
Alternative Implementations
Mars MapReduce-Implementation for nVidia GPU
using the CUDA framework
MapReduce-Cell Implementation for the Cell multi-core
processor
Qizmt MySpace’s implementation of MapReduce in C#
2/16/10 An Introduction to MapReduce 21
Alternative Implementations
There are many other open- and closed- source implementations of MapReduce!
2/16/10 An Introduction to MapReduce 22
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 23 An Introduction to MapReduce
Reception and Criticism
• Yahoo!: Hadoop on a 10,000 server cluster • Facebook analyses the daily log (25TB) on
a 1,000 server cluster • Amazon Elastic MapReduce: Hadoop
clusters for rent on EC2 and S3 • IBM and Google: Support university
courses in distributed programming • UC Berkley announced to teach freashmen
programming MapReduce 2/16/10 An Introduction to MapReduce 24
Reception and Criticism
2/16/10 An Introduction to MapReduce 25
Reception and Criticism
• Criticism mainly by RDBMS experts DeWitt and Stonebraker
• MapReduce – is a step backwards in database access – is a poor implementation – is not novel – is missing features that are routinely provided
by modern DBMSs – is incompatible with the DBMS tools
2/16/10 An Introduction to MapReduce 26
Reception and Criticism
Response to criticism
MapReduce is no RDBMS
It suits well for processing and structuring huge amounts of unstructured data
MapReduce's big inovation is that it enables distributing data processing across a network of
cheap and possibly unreliable computers 2/16/10 An Introduction to MapReduce 27
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 28 An Introduction to MapReduce
Trends and Future Development
Trend of utilizing MapReduce/Hadoop as parallel database
• Hive: Query language for Hadoop • HBase: Column-oriented distributed database
(modeled after Google’s BigTable) • Map-Reduce-Merge: Adding merge to the
paradigm allows implementing features of relational algebra
2/16/10 An Introduction to MapReduce 29
Trends and Future Development
Trend to use the MapReduce-paradigm to better utilize multi-core CPUs
• Qt Concurrent – Simplified C++ version of MapReduce for distributing
tasks between multiple processor cores
• Mars • MapReduce-Cell
2/16/10 An Introduction to MapReduce 30
Outline
• Introduction • Google MapReduce – Idea – Overview – Fault Tolerance – GFS: Google File System – Job Example
• Alternative Implementations • Reception and Criticism • Trends and Future Development • Conclusion 2/16/10 31 An Introduction to MapReduce
Conclusion
MapReduce
provides an easy solution for the processing of large amounts of data
brings a paradigm shift in programming
changed the world, i.e. made data processing more efficient and
cheaper, is the foundation of many other approaches and solutions
2/16/10 An Introduction to MapReduce 32
Questions?
2/16/10 An Introduction to MapReduce 33
Thank You!
2/16/10 An Introduction to MapReduce 34