An Introduction to MapReduce Presented by Frane Bandov at the Operating Complex IT-Systems seminar Berlin, 1/26/2010

Page 1: An Introduction to MapReduce

An Introduction to MapReduce

Presented by Frane Bandov at the Operating Complex IT-Systems seminar

Berlin, 1/26/2010

Page 2: An Introduction to MapReduce

Outline

• Introduction
• Google MapReduce
  – Idea
  – Overview
  – Fault Tolerance
  – GFS: Google File System
  – Job Example
• Alternative Implementations
• Reception and Criticism
• Trends and Future Development
• Conclusion

2/16/10 2 An Introduction to MapReduce

Page 4: An Introduction to MapReduce

Introduction – Problem

[Bar chart: data volume in TBytes (axis 0 to 250) for You, Facebook, Yahoo! Groups, German Climate Computing Centre]

Sometimes we have to deal with huge amounts of data

Page 5: An Introduction to MapReduce

Introduction – Problem

The data needs to be processed, but how?

Can't process all of this data on one machine

Distribute the processing to many machines

Page 6: An Introduction to MapReduce

Introduction – Approach

Distributed computing is the solution

“Let’s write our own distributed computing software as a solution to our problem”

Checklist
• design protocols
• design data structures
• write the code
• ensure failure tolerance

• Development takes a long time
• Expensive: Cost-benefit ratio?

Build complex software for simple computations?

Page 8: An Introduction to MapReduce

Google MapReduce – Idea

A framework for distributed computing

Don't care about protocols, failure tolerance, etc.

Just write your simple computation

Page 9: An Introduction to MapReduce

Google MapReduce – Idea

MapReduce Paradigm

Map: Apply function to all elements of a list

    square x = x * x;
    map square [1, 2, 3, 4, 5];
    → [1, 4, 9, 16, 25]

Reduce: Combine all elements of a list

    reduce (+) [1, 2, 3, 4, 5];
    → 15
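
The two list operations on this slide can be tried out directly. The following is a minimal Python sketch of the same idea, not the notation used on the slide; the variable names `numbers`, `squares` and `total` are chosen here for illustration.

```python
from functools import reduce
from operator import add


def square(x):
    """Return x squared."""
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply square to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list with +.
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```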

Page 10: An Introduction to MapReduce

Google MapReduce – Idea

Basic functioning

Input → Map → Reduce → Output
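
As an illustration of the Input, Map, Reduce, Output flow, here is a small single-machine sketch. It assumes word counting as the job, which is a common teaching example and not necessarily the one shown in the talk; `map_fn`, `reduce_fn` and `run_job` are hypothetical names, not part of any framework API.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    # Map phase: every input record is processed independently.
    pairs = [pair for line in lines for pair in map_fn(line)]

    # Group the intermediate values by key before reducing.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


result = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(result["the"])  # 3
```

The user only writes `map_fn` and `reduce_fn`; everything inside `run_job` stands in for what a framework would do.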

Page 11: An Introduction to MapReduce

Google MapReduce – Overview

[Figure: execution overview. Labels: MapReduce-Based User Program; Master; Workers; Input file (Split 1 to Split 5) on GFS; Map Phase; Intermediate Files 1 to 3; Reduce Phase; Output files (File 1, File 2) on GFS]
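
The execution overview can be approximated in a few lines. This is a sequential sketch under stated assumptions: word counting as the job, two reduce partitions (the figure shows two output files), and a CRC32-based partition function. None of the names below (`split_input`, `partition`, `map_task`, `reduce_task`, `run`) come from Google's implementation.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical; the figure shows two output files


def split_input(records, num_splits):
    """Cut the input into at most num_splits splits, one per map task."""
    size = max(1, -(-len(records) // num_splits))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def partition(key):
    """Pick the reduce partition for a key (deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def map_task(split):
    """Run the map function on one split and bucket its output by partition."""
    buckets = defaultdict(list)
    for line in split:
        for word in line.split():
            buckets[partition(word)].append((word, 1))
    return buckets


def reduce_task(pairs):
    """Run the reduce function on one partition of intermediate data."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run(records, num_splits=5):
    # Map phase: one task per split (executed one after another here).
    map_outputs = [map_task(s) for s in split_input(records, num_splits)]

    # Gather each partition from all map tasks.
    intermediate = defaultdict(list)
    for buckets in map_outputs:
        for part, pairs in buckets.items():
            intermediate[part].extend(pairs)

    # Reduce phase: one output "file" (a dict here) per partition.
    return [reduce_task(intermediate[p]) for p in range(NUM_REDUCERS)]


output_files = run(["a b a", "b c", "a", "c c", "d"])
merged = {}
for output in output_files:
    merged.update(output)
print(merged == {"a": 3, "b": 2, "c": 3, "d": 1})  # True
```

Because every key is sent to exactly one partition, the output files never overlap and can simply be concatenated.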

Page 12: An Introduction to MapReduce

MapReduce – Fault Tolerance

• Workers are periodically pinged by the master
• No answer within a certain time: worker is considered failed

Mapper fails:
– Reset its map job to idle
– Even completed map jobs are reset, because their intermediate files are inaccessible
– Notify reducers where to get the new intermediate file

Reducer fails:
– Reset its job to idle
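
The worker-failure rules above can be sketched as a small bookkeeping class. This is an illustration only: the timeout value, the `Task` and `Master` classes and their fields are hypothetical. The sketch also assumes that completed reduce jobs are kept, which the slide does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Optional

PING_TIMEOUT = 10.0  # seconds; hypothetical value


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress" or "completed"
    worker: Optional[str] = None  # worker the task is assigned to


@dataclass
class Master:
    last_reply: dict = field(default_factory=dict)  # worker -> time of last ping reply
    tasks: list = field(default_factory=list)

    def record_reply(self, worker, now):
        self.last_reply[worker] = now

    def failed_workers(self, now):
        """Workers that have not answered a ping within the timeout."""
        return [w for w, t in self.last_reply.items() if now - t > PING_TIMEOUT]

    def handle_failures(self, now):
        """Reset the tasks of failed workers so they can be rescheduled."""
        rescheduled = []
        for worker in self.failed_workers(now):
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Map tasks are reset even when completed, because their
                # intermediate files are no longer reachable.
                lost_map = task.kind == "map" and task.state != "idle"
                # Assumption: only unfinished reduce tasks need a reset.
                lost_reduce = task.kind == "reduce" and task.state == "in_progress"
                if lost_map or lost_reduce:
                    task.state, task.worker = "idle", None
                    rescheduled.append(task)
            del self.last_reply[worker]
        return rescheduled


master = Master(tasks=[
    Task("map", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "in_progress", "w2"),
])
master.record_reply("w1", now=0.0)
master.record_reply("w2", now=25.0)
reset = master.handle_failures(now=30.0)
print(len(reset))  # 2
```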

Page 13: An Introduction to MapReduce

MapReduce – Fault Tolerance

Master fails:
– Master periodically sets checkpoints
– In case of failure the MapReduce operation is aborted
– Operation can be restarted from the last checkpoint
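
One way to picture checkpoint and restart is to persist the master's task table and reload it later. The sketch below assumes a JSON file as the checkpoint format and assumes that unfinished tasks become idle again after a restart; both are illustrative choices, and `write_checkpoint` and `restart_from_checkpoint` are hypothetical helpers.

```python
import json
import os
import tempfile


def write_checkpoint(path, task_states):
    """Persist the master's task table via a temporary file and a rename."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as tmp:
        json.dump(task_states, tmp)
    # Replacing the file in one step avoids a half-written checkpoint.
    os.replace(tmp_path, path)


def restart_from_checkpoint(path):
    """Load the last checkpoint; tasks that were not completed become idle."""
    with open(path) as f:
        task_states = json.load(f)
    return {task: ("completed" if state == "completed" else "idle")
            for task, state in task_states.items()}


with tempfile.TemporaryDirectory() as directory:
    checkpoint = os.path.join(directory, "master.ckpt")
    write_checkpoint(checkpoint, {"map-0": "completed", "map-1": "in_progress"})
    restored = restart_from_checkpoint(checkpoint)
print(restored)  # {'map-0': 'completed', 'map-1': 'idle'}
```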

Page 14: An Introduction to MapReduce

Google MapReduce – GFS

Google File System
• In-house distributed file system at Google
• Stores all input and output files
• Stores files…
  – divided into 64 MB blocks
  – on at least 3 different machines
• Machines running GFS also run MapReduce
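
To get a feeling for the numbers, the block size and replica count from the slide can be turned into a short calculation. The function name `storage_footprint` and the 1 GB example file are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # block size from the slide
REPLICAS = 3        # minimum number of copies from the slide


def storage_footprint(file_size_mb):
    """Return (number of blocks, number of stored block copies)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICAS


blocks, copies = storage_footprint(1024)  # a 1 GB file
print(blocks, copies)  # 16 48
```

A 1 GB file is therefore stored as 16 blocks and at least 48 block copies spread over the cluster.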

Page 15: An Introduction to MapReduce

Google MapReduce – Job Example

Page 20: An Introduction to MapReduce

Alternative Implementations

Apache Hadoop

• Open-source implementation in Java
• Jobs can be written in C++, Java, Python, etc.
• Used by Yahoo!, Facebook, Amazon and others
• Most commonly used implementation
• HDFS as open-source implementation of GFS
• Can also use Amazon S3, HTTP(S) or FTP
• Extensions: Hive, Pig, HBase
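
Jobs in languages other than Java are typically run through Hadoop Streaming, where the mapper and the reducer are programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows that style with word counting as an assumed job. `mapper` and `reducer` are written as functions over lists of lines so that they can be tried locally; in a real job each would be its own script reading `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: map, sort by key, reduce.
mapped = sorted(mapper(["to be or not to be"]))
counted = list(reducer(mapped))
print(counted)  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

The reducer relies on its input being sorted by key, which the framework guarantees between the map and reduce phases.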

Page 21: An Introduction to MapReduce

Alternative Implementations

Mars
MapReduce implementation for nVidia GPUs using the CUDA framework

MapReduce-Cell
Implementation for the Cell multi-core processor

Qizmt
MySpace’s implementation of MapReduce in C#

Page 22: An Introduction to MapReduce

Alternative Implementations

There are many other open- and closed-source implementations of MapReduce!

Page 24: An Introduction to MapReduce

Reception and Criticism

• Yahoo!: Hadoop on a 10,000 server cluster
• Facebook analyses the daily log (25 TB) on a 1,000 server cluster
• Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
• IBM and Google: Support university courses in distributed programming
• UC Berkeley announced plans to teach freshmen MapReduce programming

Page 26: An Introduction to MapReduce

Reception and Criticism

• Criticism mainly by RDBMS experts DeWitt and Stonebraker
• MapReduce
  – is a step backwards in database access
  – is a poor implementation
  – is not novel
  – is missing features that are routinely provided by modern DBMSs
  – is incompatible with the DBMS tools

Page 27: An Introduction to MapReduce

Reception and Criticism

Response to criticism

MapReduce is not an RDBMS

It is well suited for processing and structuring huge amounts of unstructured data

MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers

Page 29: An Introduction to MapReduce

Trends and Future Development

Trend of utilizing MapReduce/Hadoop as parallel database

• Hive: Query language for Hadoop
• HBase: Column-oriented distributed database (modeled after Google’s BigTable)
• Map-Reduce-Merge: Adding merge to the paradigm allows implementing features of relational algebra

Page 30: An Introduction to MapReduce

Trends and Future Development

Trend to use the MapReduce-paradigm to better utilize multi-core CPUs

• Qt Concurrent
  – Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
• Mars
• MapReduce-Cell

Page 32: An Introduction to MapReduce

Conclusion

MapReduce

• provides an easy solution for the processing of large amounts of data
• brings a paradigm shift in programming
• changed the world, i.e. made data processing more efficient and cheaper, and is the foundation of many other approaches and solutions

Page 33: An Introduction to MapReduce

Questions?

Page 34: An Introduction to MapReduce

Thank You!
