map reduce -...

18
Map Reduce 1 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Upload: others

Post on 05-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Map Reduce

1 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 2: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

2 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

INPUT

PROCESS

OUTPUT

Typical application

Page 3: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

3 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

INPUT

PROCESS

OUTPUT

What if…

Page 4: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

4 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

INPUT

OUTPUT

Divide & Conquer

I1

W1

O1

I2

W2

O2

I3

W3

O3

I4

W4

O4

I5

W5

O5

Page 5: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Questions

•  How do we split the input?

•  How do we distribute the input splits?

•  How do we collect the output splits?

•  How do we aggregate the output?

•  How do we coordinate the work?

•  What if input splits > num workers?

•  What if workers need to share input/output splits?

•  What if a worker dies?

•  What if we have a new input?

5 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 6: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

6 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 7: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Design ideas

•  Scale “out”, not “up” –  Low end machines

•  Move processing to the data –  Network bandwidth bottleneck

•  Process data sequentially, avoid random access –  Huge data files –  Write once, read many

•  Seamless scalability –  Strive for the unobtainable

•  Right level of abstraction –  Hide implementation details from applications

development

7 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 8: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Typical Large-Data Problem

•  Iterate over a large number of records •  Extract something of interest from each •  Shuffle and sort intermediate results •  Aggregate intermediate results •  Generate final output

(Dean and Ghemawat, OSDI 2004)

Map Reduce Programming Model

8 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 9: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Intermediate Values

Final Value Initial Value

Intermediate list

Input list

Fold

Map

g g g g g

f f f f f

From functional programming…

9 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 10: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

…To MapReduce •  Programmers specify two functions:

map (k1, v1) → [(k2, v2)]

reduce (k2, [v2]) → [(k3, v3)]

•  All values with the same key are sent to the same reducer

•  Input keys and values (k1, v1) are drawn from different domain than output keys and values (k3, v3)

•  Intermediate keys (k2, v2) and values are from the same domain as the output keys and values (k3, v3)

•  The runtime handles everything else…

10 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 11: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Programming Model (simple)

INPUT

OUTPUT

O1 O2 O3

I1

map

I2

map

I3

map

I4

map

I5

map

Aggregate values by key

reduce reduce reduce

11 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 12: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Example (I)

12 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 13: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Example (II)

13 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Input Hello world Java and MapReduce Hello Java

split 1 Hello world

split 2 Java and

MapReduce

split 3 Hello Java

map 3

(hello, 1)

(java, 1)

map 2

(and, 1)

(mapreduce, 1)

(java, 1)

reduce 1

(hello, 1)

(and, 1)

(hello, 1)

reduce 2

(world, 1)

(mapreduce, 1)

(java, 1)

(java, 1)

map 1

(hello, 1)

(world, 1)

Output (hello, 2) (world, 1) (and, 1) (java, 2) (mapreduce, 1)

Page 14: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Runtime

•  Handles scheduling –  Assigns workers to map and reduce tasks

•  Handles “data distribution” –  Moves processes to data

•  Handles synchronization –  Gathers, sorts, and shuffles intermediate data

•  Handles errors and faults –  Detects worker failures and restarts

•  Everything happens on top of a distributed FS

14 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 15: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Partitioners and combiners

•  Programmers specify two functions: map (k1, v1) → [(k2, v2)] reduce (k2, [v2]) → [(k3, v3)] –  All values with the same key are reduced together

•  The execution framework handles everything else…

•  Not quite…usually, programmers also specify:

partition (k2, number of partitions) → partition for k2 –  Often a simple hash of the key, e.g., hash(k’) mod n –  Divides up key space for parallel reduce operations

combine (k2, v2) → [(k2, v2)] –  Mini-reducers that run in memory after the map phase –  Used as an optimization to reduce network traffic

15 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 16: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

Programming Model (complete)

INPUT

I1

map

I2

map

I3

map

I4

map

I5

map

Aggregate values by key

OUTPUT

O1 O2 O3

reduce reduce reduce

16 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

partition

combine

partition

combine

partition

combine

partition

combine

partition

combine

Page 17: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

MapReduce can refer to…

•  The programming model •  The execution framework (aka “runtime”) •  The specific implementation

17 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms

Page 18: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of

MapReduce Implementations

•  Google has a proprietary implementation in C++ –  Bindings in Java, Python

•  Hadoop is an open-source implementation in Java –  Development led by Yahoo, used in production –  Now an Apache project –  Rapidly expanding software ecosystem

•  Lots of custom research implementations –  For GPUs, cell processors, etc.

18 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms