distributed computing & map reduce cs16: introduction to data structures & algorithms...

27
DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

Upload: doris-griffith

Post on 16-Dec-2015

233 views

Category:

Documents


6 download

TRANSCRIPT

Page 1: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

1

DISTRIBUTED COMPUTING & MAP REDUCE

CS16: Introduction to Data Structures & Algorithms

Thursday, April 17, 2014

Page 2: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

2

Outline

1) Distributed Computing Overview

2) MapReduceI. Application: Word Count

II. Application: Mutual Friends

Thursday, April 17, 2014

Page 3: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

3

Distributed Computing: Motivation

• Throughout the course of the semester, we have talked about many ways to optimize algorithm speed, for example:• Using efficient data structures• Taking greedy “shortcuts” when we are sure they are

correct• Dynamic programming algorithms

• However, we left out a seemingly obvious one...• Using more than one computer!

Thursday, April 17, 2014

Page 4: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

4

Distributed Computing: Explanation

• Distributed computing is the field of taking a computational problem and distributing it among many worker computers (nodes)

• Usually there is a master computer, which coordinates the distribution to and feedback from the nodes

• The distributed system can be represented as a graph where the vertices are nodes and the edges are the network connecting them

Thursday, April 17, 2014

Page 5: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

5

Distributed Computing: Primes

• Computing very large prime numbers is a very computationally intensive problem

• We don’t have any formula or algorithm we can use to generate primes• There are probabilistic tests, but the only known

deterministic method is to divide a test number n by every number from 2 to

• If none of the numbers in this range divide n, then n is prime

• Linear in the square root of the number• That doesn’t sound so bad…?

Thursday, April 17, 2014

Page 6: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

6

Distributed Computing: Primes (2)• Wrong! The largest known prime number was discovered

last year: 257,885,161 – 1. This number has 17,425,170 digits!• A single machine cannot verify the primality of this

number!• This prime (named M48) was discovered through GIMPS,

the distributed program “Great Internet Mersenne Prime Search”• launched in 1996 to find large prime numbers• searches for primes of the form 2m-1 where m itself is prime

• GIMPS allows users to donate some of their computer’s rest time toward testing the primality of numbers of this form

Thursday, April 17, 2014

Page 7: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

7

Distributed Computing: Primes (3)

Thursday, April 17, 2014

Is 101 prime?

Node 3Node 2Node 1

Master

{5,6,7} | 101? {8,9,10} | 101?{2,3,4} | 101?

No No No

Page 8: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

8

More Applications

• There are many problems being solved by massive distributed systems like GIMPS:• Stanford’s Folding@home project uses distributed computing to solve the problem of how proteins fold, helping scientists studying Alzheimer’s, Parkinson’s, and cancers

• Berkeley’s Milkyway@home project uses distributed computing to generate a highly accurate 3D model of the Milky Way galaxy using data collected over the past decade

Thursday, April 17, 2014

Page 9: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

9

Typical Large-Data Problem

• As the Internet has grown, so has the amount of data in existence!

• To make sense of all this data, we generally want to:• Iterate over a large number of records• Extract something of interest from each• Shuffle and sort intermediate results• Aggregate intermediate results• Generate final output

Thursday, April 17, 2014

Page 10: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

10

Developing MapReduce

• In 2004, Google was “indexing” websites: crawling pages and keeping track of words and their locations on each page

• The size of the Web was huge! To handle all this information, Google researchers came up with the MapReduce Framework

Thursday, April 17, 2014

Page 11: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

11

MapReduce

• MapReduce is a programming abstraction seeking to hide system-level details from developers while providing the advantages of distributed computation• Only focus on the “what” and not the “how”• The developer specifies the computation that needs to be performed, and the framework (MapReduce) handles the actual execution

• Serves as a black box for distributed computing

Thursday, April 17, 2014

Page 12: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

12

Typical Large-Data Problem

• To make sense of all this data, we generally want to:• Iterate over a large number of records• Extract something of interest from each• Shuffle and sort intermediate results• Aggregate intermediate results• Generate final output

• The MapReduce framework takes care of the rest

Thursday, April 17, 2014

REDUCE

MAP

Page 13: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

13

MapReduce

• Programmers specify two functions:• map(map_k, map_v) [set of (k, v) tuples]• reduce(k, [list of v with same k]) (k, output)

• The MapReduce “runtime” distributes portions of the large data set to separate nodes, running your map routine on the computers in parallel

• The runtime then takes care of sorting the intermediate results, which can then be aggregated in parallel by your reduce routine

Thursday, April 17, 2014

Page 14: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

14

Counting Words

• Suppose you want to know the most popular word on the Internet• One method you could use would be to create a hashtable with words as keys and counts as values, but looping through every word on every page and hashing it to the table takes a long time…

• MapReduce allows you to speed up the entire process. You need to determine the “mapping” and “reducing” portions, and the framework does the rest!

Thursday, April 17, 2014

Page 15: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

15

MapReduce Example• Suppose we have three documents and we would like to know the number of times each word occurs throughout all the documents

Thursday, April 17, 2014

donot

forgettodo

whatyoudo

sendmea

forgetmenot

twoplustwois

notone

doc1 doc2 doc3

Page 16: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

16

MapReduce Example

• The MapReduce runtime takes care of calling our map routine three times, once for each document.

• What are the input arguments to map?

Thursday, April 17, 2014

donot

forgettodo

whatyoudo

sendmea

forgetmenot

twoplustwois

notone

doc1 doc2 doc3

Page 17: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

17

MapReduce Example: Mapping

Thursday, April 17, 2014

donot

forgettodo

whatyoudo

sendmea

forgetmenot

twoplustwois

notone

doc1 doc2 doc3

map(doc1, [do, not, forget, to do, what, you, do])map(doc2, [send, me, a, forget, me, not])map(doc3, [two, plus, two, is, not, one])

map(map_k, map_v) [set of (k, v) tuples]

Page 18: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

18

• What should our map function return?• Remember that the function must return a list of (key, value) tuples. The output from all of the map calls is grouped by key and then passed to the reducer.

MapReduce Example: Mapping

Thursday, April 17, 2014

map(doc1, [do, not, forget, to, do, what, you, do])

map(doc2, [send, me, a, forget, me, not])

map(doc3, [two, plus, two, is, not, one])

?

?

?

Page 19: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

19

MapReduce Example: Mapping

Thursday, April 17, 2014

map(doc1, [do, not, forget, to, do, what, you, do]) [(do,1),(not,1),(forget,1),(to,1),(what,1),(you,1),(do,1)]

map(doc2, [send, me, a, forget, me, not]) [(send,1),(me,1),(a,1),(forget,1),(me,1),(not,1)]

map(doc3, [two, plus, two, is, not, one]) [(two,1),(plus,1),(two,1),(is,1),(not,1),(one,1)]

Page 20: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

20

• The framework takes care of grouping the returned tuples by key and passing the list of values to the reducers.

• All values for a certain key are sent to the same reducer

MapReduce Example: Mapping

Thursday, April 17, 2014

map(map_k, map_v): // In this example, map_k is a document name and // map_v is a list of strings in the document tuples = [] for v in map_v: tuples.add((v,1)) return tuples

Page 21: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

21

• Reduce is called for each of these (word, []) pairs, e.g. reduce(forget, [1,1])

MapReduce Example: Reducing

Thursday, April 17, 2014

do [1, 1, 1]not [1, 1, 1]forget [1, 1]me [1, 1]two [1, 1]to [1]what [1]

you [1]send [1]send [1]a [1]plus [1]is [1]one [1]

Page 22: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

22

• Once the reducers all return, we have our counts for all the words!

MapReduce Example: Reducing

Thursday, April 17, 2014

reduce(k, values): // In this example, k is a particular word and // values is a list of all 1’s sum = 0 for v in values: sum += 1 return (k, sum)

Page 23: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

23

The MapReduce Process

Thursday, April 17, 2014

framework framework framework

Page 24: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

24

• Note that all of the map operations can run entirely in parallel, as can all of the reduce operations (once the map operations terminate)• With hundreds or thousands of computing nodes, this is a huge benefit!

• This example seemed trivial, but suppose we were counting billions of entries!

MapReduce Benefits

Thursday, April 17, 2014

Page 25: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

25

One More Example

• For every person on Facebook, Facebook stores their friends in the format

Person [List of friends]• How can we use MapReduce precompute for any two people on Facebook all of the friends they have in common?

• Hint: we’ll have to use the property that identical keys from the mapper all go to the same reducer!

Thursday, April 17, 2014

Page 26: DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17, 2014 1

26

MapReduce Benefits (2)

• The MapReduce framework• Handles scheduling• Assigns workers to map and reduce tasks• Handles “data distribution”• Handles synchronization• Gathers, sorts, and shuffles intermediate data• Handles errors and worker failures and restarts• Everything happens on top of a distributed filesystem

(for another lecture!)

• All you have to do (for the most part) is write two functions, map and reduce!

Thursday, April 17, 2014