map-reduce examples 1. so, what is it? a two phase process geared toward optimizing broad, widely...

Map-Reduce examples 1

Upload: piers-ramsey

Post on 19-Jan-2016

Map-Reduce examples

So, what is it?

• A two-phase process geared toward broad, widely distributed parallel computing platforms
• Apache Hadoop is an open-source framework that pairs a MapReduce implementation with a distributed file system (HDFS).
• MapReduce is the name of Google's original version (and it is proprietary).
• Phases
  • 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context
  • 2. Perform a second pass where the new series of values is compressed into far fewer values

In its strictest sense…

• Map-reduce is a two-phase operation
• First, convert a list of data into a list of a different kind of data
• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
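
The two phases can be sketched in plain Python with the built-in map and functools.reduce. The word list is made-up illustration data, not something from the slides.

```python
from functools import reduce

# Hypothetical input: a list of one kind of data (strings).
words = ["map", "reduce", "hadoop", "map"]

# Phase 1 (map): convert each item into a different kind of data (its length).
lengths = list(map(len, words))

# Phase 2 (reduce): fold the second list into a single scalar value.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)

# A reduce can also yield the cardinality of the mapped list.
item_count = reduce(lambda acc, _: acc + 1, lengths, 0)

print(lengths, total_length, item_count)
```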

Relevant computing/data application types

• Suited to aggregate database processing, not so much to set-oriented querying, and certainly not to object-based querying

• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing

• Fits query patterns that calculate the cardinality of sets and remove duplicates

Strategy for M-R

• We try to do the computing on the machines where the data sits

• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations

The key bottom line concept

• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated

• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits

• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.

Another way of looking at this…

• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed databases

• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if that causes extra processing

• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data

Example

• 1. We start with a set of person keys and map each of these to the names of the people.
  • Key 1 -> Harry
  • Key 2 -> Harry
  • Key 3 -> Tommy

• 2. We aggregate the list of people names by counting how many unique names are in the list.
  • Harry, Harry, Tommy -> 2
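
The same example, as a minimal Python sketch. The dictionary stands in for whatever store holds the person records; only the three keys and names come from the slide.

```python
# Data taken from the slide: three person keys and their names.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}

# Step 1 (map): turn each person key into the person's name.
names = [people[key] for key in sorted(people)]

# Step 2 (reduce): collapse the names into the number of unique names.
unique_name_count = len(set(names))

print(names, unique_name_count)
```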

What actually happens?

• Informally:
  • Each key leads to a name field.
  • Then, the names are isolated.
  • Then, each is passed to a “mapper”, which returns the name, along with a 1.
  • Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
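
A rough single-machine sketch of these steps. The function names mapper, shuffle and reducer are illustrative choices, and the grouping that shuffle does here would normally be handled by the framework on a cluster.

```python
from collections import defaultdict

def mapper(name):
    # Return the name along with a 1.
    return (name, 1)

def shuffle(pairs):
    # Group the 1's by name, producing a list of 1's for each name.
    grouped = defaultdict(list)
    for name, one in pairs:
        grouped[name].append(one)
    return grouped

def reducer(name, ones):
    # Add up the 1's for a single name.
    return (name, sum(ones))

names = ["Harry", "Harry", "Tommy"]
grouped = shuffle(mapper(n) for n in names)
counts = [reducer(name, ones) for name, ones in grouped.items()]
print(counts)
```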

From Wikipedia

Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age.

SELECT age AS Y, AVG(contacts) AS A
FROM social.person
GROUP BY age
ORDER BY age

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record <Y,N>
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record <Y,N> do
        Accumulate in S the sum of N
        Accumulate in C the count of records so far
    repeat
    let A be S/C
    produce one output record <Y,A>
end function
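
The pseudocode can be tried out on a toy scale with the Python sketch below. The two small batches are invented stand-ins for the 1100 batches of social.person records, and the grouping loop plays the role the framework would play between Map and Reduce.

```python
from collections import defaultdict

# Hypothetical stand-in for social.person: batches of (age, contacts) tuples.
batches = [
    [(25, 10), (30, 4)],
    [(25, 20), (30, 6), (41, 7)],
]

def map_batch(batch):
    # Emit one (Y, N) record per person in the batch.
    for age, contacts in batch:
        yield (age, contacts)

def reduce_age(age, contact_counts):
    # Accumulate the sum S and the count C, then emit (Y, A) with A = S / C.
    s = sum(contact_counts)
    c = len(contact_counts)
    return (age, s / c)

# Group the mapper output by age before reducing.
grouped = defaultdict(list)
for batch in batches:
    for age, contacts in map_batch(batch):
        grouped[age].append(contacts)

averages = [reduce_age(age, grouped[age]) for age in sorted(grouped)]
print(averages)
```

Note that this Reduce cannot be reused as-is on partial averages, because an average of averages is not the overall average; a version that is safe to apply in stages would have to carry the sum and the count along instead.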

From NoSQL Distilled: 1. Creating a list with a map

2. Aggregating with a reduce

3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
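
A hedged sketch of what partitioning might look like, since the original figure is not reproduced here. The product data, the partition count and the use of a CRC32 checksum to pick a partition are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 2  # assumed number of reducer nodes

# Hypothetical mapper output: (product, quantity) pairs.
mapped = [("tea", 2), ("coffee", 1), ("tea", 5), ("cocoa", 3), ("coffee", 4)]

def partition_for(key):
    # Send every record with the same key to the same partition.
    return crc32(key.encode("utf-8")) % NUM_PARTITIONS

partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
for key, value in mapped:
    partitions[partition_for(key)][key].append(value)

def reduce_partition(partition):
    # Each partition can be reduced independently, hence in parallel.
    return {key: sum(values) for key, values in partition.items()}

# Final phase: merge the per-partition reducer results.
merged = {}
for partition in partitions:
    merged.update(reduce_partition(partition))

print(merged)
```

Because a key never appears in more than one partition, the merge step is a plain union of the reducer outputs.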

4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
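
A minimal sketch of a combiner, assuming made-up (product, quantity) pairs on two mapper nodes. The point it illustrates is the constraint in the caption: combine takes a list of pairs and returns a list of pairs, so the reducer works the same whether or not the combiner ran.

```python
from collections import defaultdict

def combine(pairs):
    # Runs on the mapper's node: sums quantities per key before the data moves.
    # Input and output are both lists of (key, quantity) pairs.
    totals = defaultdict(int)
    for key, qty in pairs:
        totals[key] += qty
    return list(totals.items())

def reduce_all(pairs):
    # The reducer accepts combined or uncombined pairs alike.
    totals = defaultdict(int)
    for key, qty in pairs:
        totals[key] += qty
    return dict(totals)

# Hypothetical mapper output on two nodes.
node_a = [("tea", 2), ("tea", 5), ("coffee", 1)]
node_b = [("tea", 1), ("coffee", 4), ("coffee", 2)]

shipped = combine(node_a) + combine(node_b)
result = reduce_all(shipped)
print(len(node_a) + len(node_b), "pairs before,", len(shipped), "after")
print(result)
```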

5. A combiner that removes duplicate product-customer pairs

6. Concatenating a combining and a reduce (counting) operation
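
Steps 5 and 6 can be sketched together as follows. The product and customer names are invented, and the function names dedupe_combiner and count_reducer are illustrative only.

```python
from collections import defaultdict

# Hypothetical mapper output: (product, customer) pairs, with repeats.
pairs = [
    ("tea", "ann"), ("tea", "ann"), ("tea", "bob"),
    ("coffee", "ann"), ("coffee", "ann"),
]

def dedupe_combiner(product_customer_pairs):
    # Drop duplicate (product, customer) pairs, keeping first-seen order.
    return list(dict.fromkeys(product_customer_pairs))

def count_reducer(product_customer_pairs):
    # Count the remaining pairs per product.
    counts = defaultdict(int)
    for product, _customer in product_customer_pairs:
        counts[product] += 1
    return dict(counts)

customers_per_product = count_reducer(dedupe_combiner(pairs))
print(customers_per_product)
```

On a real cluster, duplicates can span mapper nodes, so the reduce side would need to deduplicate again before counting.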

7. Maintaining the 1’s counts in the mapping phase

8. Adding temporal information to the map/reduce process

9. Using a reduce operator to create product per month totals
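
One way steps 8 and 9 might look in code, assuming invented order lines of the form (product, ISO date, quantity). The mapper adds the temporal information by building a composite key, and the reduce sums quantities per key.

```python
from collections import defaultdict

# Hypothetical order lines: (product, ISO date, quantity).
order_lines = [
    ("tea", "2011-03-04", 2),
    ("tea", "2011-03-19", 5),
    ("tea", "2011-04-02", 1),
    ("coffee", "2011-03-07", 3),
]

def mapper(line):
    product, date, qty = line
    year, month, _day = date.split("-")
    # Composite key: product plus year and month.
    return ((product, int(year), int(month)), qty)

totals = defaultdict(int)
for key, qty in map(mapper, order_lines):
    totals[key] += qty  # reduce: sum quantities per (product, year, month)

print(dict(totals))
```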

10. A second mapper that creates base year by year comparisons

11. A reduce operation combines records for a given year
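
The figures for steps 10 and 11 are not reproduced here, so the following is only one plausible reading: a second mapper re-keys monthly totals so that the same month of different years lands together, and a reduce merges those partial records so they can be compared. The totals, the years 2010 and 2011, and the function names are all assumptions.

```python
from collections import defaultdict

# Hypothetical output of an earlier stage: (product, year, month) -> quantity.
monthly_totals = {
    ("tea", 2010, 3): 4,
    ("tea", 2011, 3): 7,
    ("coffee", 2010, 3): 3,
    ("coffee", 2011, 3): 3,
}

def second_mapper(key, qty):
    product, year, month = key
    # Re-key by (product, month) so both years meet in the same reducer.
    return ((product, month), {year: qty})

def merge_reducer(partials):
    # Merge the partial per-year records into one record per key.
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged

grouped = defaultdict(list)
for key, qty in monthly_totals.items():
    new_key, record = second_mapper(key, qty)
    grouped[new_key].append(record)

comparison = {}
for key, partials in grouped.items():
    record = merge_reducer(partials)
    comparison[key] = record[2011] - record[2010]  # year-over-year change

print(comparison)
```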

Complaints

• M-R is low level.
• It is rigid.
• It exists to optimize the distributed cluster model – only.
• It demands that an application fit perfectly into the paradigm.
• It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations

• It thus does not accommodate on-the-fly browsing