assignment of different-sized inputs in mapreduce

Assignment of Different-Sized

Inputs in MapReduce

Shantanu Sharma2

joint work with

Foto N. Afrati1, Shlomi Dolev2, Ephraim Korach2, and Jeffrey D. Ullman3

1 National Technical University of Athens, Greece2 Ben-Gurion University of the Negev, Israel

3 Stanford University, USA

28th International Symposium on Distributed Computing (DISC 2014)Austin, Texas, USA (12-15 October 2014)

• Cluster Computing

– Terabytes or Petabytes amount of data cannot beprocessed on a single computer

– Cluster of computers

– How to mask failures, e.g., hardware failures

• MapReduce is a programming model used forparallel processing over large-scale data

Introduction

2

Introduction

3

Worker

Worker

Masterprocess

Worker

Worker

Worker

fork

Read Local write

Remote read, sort

OutputFile 0

OutputFile 1

Write

Chunk 0

Chunk 1

Chunk 2

Input Data

MapReduce job: Map Phase and Reduce Phase

Map Phase: applies a user-defined Map function

Reduce Phase: applies a user-defined Reduce function

Mapper

1

Reducer for

I

Mapper

2

1I

1like

IntroductionMapReduce working example – Word Count

2appleReducer for

like

Reducer for

apple

Reducer for

is

Reducer for

banana

Reducer for

fruit

(I, 2)

(like, 2)

(apple, 2)

(is, 1)

(fruit, 1)

(banana, 1)

I like

apple.

Apple is

fruit.

I like

banana.

1fruit

1is

1I

1like

1banana

Mapper

1

Reducer for

I

Mapper

2

1I

1like

IntroductionInputs and outputs in our context

2appleReducer for

like

Reducer for

apple

Reducer for

is

Reducer for

banana

Reducer for

fruit

(I, 2)

(like, 2)

(apple, 2)

(is, 1)

(fruit, 1)

(banana, 1)

I like

apple.

Apple is

fruit.

I like

banana.

1fruit

1is

1I

1like

1banana

Inputs

Outputs

• Values, provided by each mapper, have some sizes(input size)

• Reduce capacity: an upper bound on the sum of thesizes of the values that are assigned to the reducer

• Example: reducer capacity to be the size of the mainmemory of the processors on which reducers run

We consider two special matching problems

Reducer Capacity

6

State-of-the-Art

• F. Afrati, A.D. Sarma, S. Salihoglu, and J.D. Ullman,“Upper and Lower Bounds on the Cost of a Map-Reduce Computation,” PVLDB, 2013.

• Unit input size

• Reducer Size

– Maximum number of inputs that a given reducercan have.

7

Problem Statement

• Communication cost between the map and thereduce phases is a significant factor

• How we can reduce the communication cost?

– A lesser number of reducers, and hence, a smallercommunication cost

– How to minimize the total number of reducerswhile respecting their limited capacity?

• Not an easy task

– All-to-All mapping schema problem

– X-to-Y mapping schema problem

8

Mapper for

1st

input

Reducer for k1

(1, 2)

Reducer for k2

(1, 3)

Reducer for k3

(2, 3)

Mapper for

2nd

input

Mapper for

3rd

input

input1k1

input1k2

input2k1

input2k3

input3k2

input3k3

Mapper for

1st

input

Reducer for k1

(1, 2, 3)

Mapper for

2nd

input

Mapper for

3rd

input

input1k1

input2k1

input3k1

input 1

input 2

input 3

input 1

input 2

input 3

Notationki: key

• A set of inputs is given

• Each pair of inputs corresponds to one output

• Example

– Computing common friends

• Lists of friends of m persons are given

• Find common friends of the given m persons

• Every two friend lists must be assigned to a singlecommon reducer

A2A Mapping Schema Problem

9

Mapper for

1st

friend

fl2

fl3

fl1 Reducer for k1

(1, 2, 3)

fl4

Reducer for k2

(1, 2, 4)

Reducer for k3

(3, 4)

Mapper for

2nd

friend

Mapper for

3rd

friend

Mapper for

4th

friend

fl1k1

fl1k2

fl2k1

fl2k2

fl3k1

fl3k3

fl4k2

fl4k3

Reducer capacity is enough to hold some of

the friend lists together

10

Notationski: key

fli: ith friend list1, 2

1, 3

2, 3

2, 4

1, 4

3, 4


Mapper for

1st

friend

fl2

fl3

fl1

Reducer for k1

(1, 2, 3, 4)

fl4

Mapper for

2nd

friend

Mapper for

3rd

friend

Mapper for

4th

friend

fl1k1

fl2k1

fl3k1

fl4k1

Reducer capacity is enough to hold all the friend lists together

11

Notationski: key

fli: ith friend list1, 2

1, 31, 4

2, 32, 4

3, 4


• What to do?– Assigns the given m inputs to the given number of

reducers, without exceeding q, in a manner thatevery given input is coupled with every other giveninput in at least one reducer in common

• Polynomial time solution for one and tworeducers

• NP-hard for z > 2 reducers

12


Heuristics for A2A Mapping Schema Problem

• Based on

– First-Fit Decreasing (FFD) or Best-Fit Decreasing(BFD) bin-packing algorithm

– Pseudo-polynomial bin-packing algorithm*

– 2-step Algorithms

– The selection of a prime number p

• A fixed reducer capacity is given

13

*D. R. Karger and J. Scott. Efficient algorithms for fixed-precision instances of bin packing and euclidean tsp. In APPROX-RANDOM, pages 104–117, 2008.

• Two disjoint sets X and Y are given

• Each pairs of element xi, yj (where xi X, yj

Y, i, j) of the sets X and Y corresponds toone output

• Example

– Skew Join

• Two relations X(A, B) and Y(B, C) are given where lots oftuple have a common “b” value

• Every tuple with an identical “b” value is required toassign to at least one reducer

X2Y Mapping Schema Problem

14

X2Y Mapping Schema Problem

• What to do?

– Assigns each input of the set X with each inputof the set Y to at least one reducer in common,without exceeding q

• Polynomial for one reducer

– Can we assign all the inputs of the sets X and Y toa single reducer

• NP-hard for z > 1 reducers

15

Heuristics for X2Y Mapping Schema Problem

• Based on

– First-Fit Decreasing (FFD) or Best-Fit Decreasing(BFD) bin-packing algorithm

• A fixed reducer capacity is given

16

Conclusion

• Reducer capacity

– An important parameter to be considered in all MapReducealgorithms

– The capacity is in terms of, not necessarily identical, memoryauxiliary size, augmented and added to the index of the dataitem(s)

• Two assignment schemas of MapReduce are given

– All-to-All (A2A) mapping schema problem

– X-to-Y (X2Y) mapping schema problem

• Several heuristics for A2A and X2Y mapping schemaproblems are provided

17

Foto Afrati1, Shlomi Dolev2, Ephraim Korach3, Shantanu Sharma2, and Jeffrey D. Ullman4

1 School of Electrical and Computing Engineering, National Technical University of Athens, Greece

[email protected] Department of Computer Science, Ben-Gurion University of the

Negev, Israel{dolev,sharmas}@cs.bgu.ac.il

3 Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel

[email protected] Department of Computer Science, Stanford University, USA

[email protected]

Presentation is available athttp://www.cs.bgu.ac.il/~sharmas/publication.html

assignment of different-sized inputs in mapreduce

Data & Analytics

reducer example

reducer capacity values

apple isfruit

mapping schema problem

sizesinput size

set of inputs

pair of inputs

common friends