assignment of different-sized inputs in mapreduce
DESCRIPTION
Assignment problems in MapReduceTRANSCRIPT
Assignment of Different-Sized
Inputs in MapReduce
Shantanu Sharma2
joint work with
Foto N. Afrati1, Shlomi Dolev2, Ephraim Korach2, and Jeffrey D. Ullman3
1 National Technical University of Athens, Greece2 Ben-Gurion University of the Negev, Israel
3 Stanford University, USA
28th International Symposium on Distributed Computing (DISC 2014)Austin, Texas, USA (12-15 October 2014)
• Cluster Computing
– Terabytes or Petabytes amount of data cannot beprocessed on a single computer
– Cluster of computers
– How to mask failures, e.g., hardware failures
• MapReduce is a programming model used forparallel processing over large-scale data
Introduction
2
Introduction
3
Worker
Worker
Masterprocess
Worker
Worker
Worker
fork
Read Local write
Remote read, sort
OutputFile 0
OutputFile 1
Write
Chunk 0
Chunk 1
Chunk 2
Input Data
MapReduce job: Map Phase and Reduce Phase
Map Phase: applies a user-defined Map function
Reduce Phase: applies a user-defined Reduce function
Mapper
1
Reducer for
I
Mapper
2
1I
1like
IntroductionMapReduce working example – Word Count
2appleReducer for
like
Reducer for
apple
Reducer for
is
Reducer for
banana
Reducer for
fruit
(I, 2)
(like, 2)
(apple, 2)
(is, 1)
(fruit, 1)
(banana, 1)
I like
apple.
Apple is
fruit.
I like
banana.
1fruit
1is
1I
1like
1banana
Mapper
1
Reducer for
I
Mapper
2
1I
1like
IntroductionInputs and outputs in our context
2appleReducer for
like
Reducer for
apple
Reducer for
is
Reducer for
banana
Reducer for
fruit
(I, 2)
(like, 2)
(apple, 2)
(is, 1)
(fruit, 1)
(banana, 1)
I like
apple.
Apple is
fruit.
I like
banana.
1fruit
1is
1I
1like
1banana
Inputs
Outputs
• Values, provided by each mapper, have some sizes(input size)
• Reduce capacity: an upper bound on the sum of thesizes of the values that are assigned to the reducer
• Example: reducer capacity to be the size of the mainmemory of the processors on which reducers run
We consider two special matching problems
Reducer Capacity
6
State-of-the-Art
• F. Afrati, A.D. Sarma, S. Salihoglu, and J.D. Ullman,“Upper and Lower Bounds on the Cost of a Map-Reduce Computation,” PVLDB, 2013.
• Unit input size
• Reducer Size
– Maximum number of inputs that a given reducercan have.
7
Problem Statement
• Communication cost between the map and thereduce phases is a significant factor
• How we can reduce the communication cost?
– A lesser number of reducers, and hence, a smallercommunication cost
– How to minimize the total number of reducerswhile respecting their limited capacity?
• Not an easy task
– All-to-All mapping schema problem
– X-to-Y mapping schema problem
8
Mapper for
1st
input
Reducer for k1
(1, 2)
Reducer for k2
(1, 3)
Reducer for k3
(2, 3)
Mapper for
2nd
input
Mapper for
3rd
input
input1k1
input1k2
input2k1
input2k3
input3k2
input3k3
Mapper for
1st
input
Reducer for k1
(1, 2, 3)
Mapper for
2nd
input
Mapper for
3rd
input
input1k1
input2k1
input3k1
input 1
input 2
input 3
input 1
input 2
input 3
Notationki: key
• A set of inputs is given
• Each pair of inputs corresponds to one output
• Example
– Computing common friends
• Lists of friends of m persons are given
• Find common friends of the given m persons
• Every two friend lists must be assigned to a singlecommon reducer
A2A Mapping Schema Problem
9
Mapper for
1st
friend
fl2
fl3
fl1 Reducer for k1
(1, 2, 3)
fl4
Reducer for k2
(1, 2, 4)
Reducer for k3
(3, 4)
Mapper for
2nd
friend
Mapper for
3rd
friend
Mapper for
4th
friend
fl1k1
fl1k2
fl2k1
fl2k2
fl3k1
fl3k3
fl4k2
fl4k3
Reducer capacity is enough to hold some of
the friend lists together
10
Notationski: key
fli: ith friend list1, 2
1, 3
2, 3
2, 4
1, 4
3, 4
A2A Mapping Schema Problem
Mapper for
1st
friend
fl2
fl3
fl1
Reducer for k1
(1, 2, 3, 4)
fl4
Mapper for
2nd
friend
Mapper for
3rd
friend
Mapper for
4th
friend
fl1k1
fl2k1
fl3k1
fl4k1
Reducer capacity is enough to hold all the friend lists together
11
Notationski: key
fli: ith friend list1, 2
1, 31, 4
2, 32, 4
3, 4
A2A Mapping Schema Problem
• What to do?– Assigns the given m inputs to the given number of
reducers, without exceeding q, in a manner thatevery given input is coupled with every other giveninput in at least one reducer in common
• Polynomial time solution for one and tworeducers
• NP-hard for z > 2 reducers
12
A2A Mapping Schema Problem
Heuristics for A2A Mapping Schema Problem
• Based on
– First-Fit Decreasing (FFD) or Best-Fit Decreasing(BFD) bin-packing algorithm
– Pseudo-polynomial bin-packing algorithm*
– 2-step Algorithms
– The selection of a prime number p
• A fixed reducer capacity is given
13
*D. R. Karger and J. Scott. Efficient algorithms for fixed-precision instances of bin packing and euclidean tsp. In APPROX-RANDOM, pages 104–117, 2008.
• Two disjoint sets X and Y are given
• Each pairs of element xi, yj (where xi X, yj
Y, i, j) of the sets X and Y corresponds toone output
• Example
– Skew Join
• Two relations X(A, B) and Y(B, C) are given where lots oftuple have a common “b” value
• Every tuple with an identical “b” value is required toassign to at least one reducer
X2Y Mapping Schema Problem
14
X2Y Mapping Schema Problem
• What to do?
– Assigns each input of the set X with each inputof the set Y to at least one reducer in common,without exceeding q
• Polynomial for one reducer
– Can we assign all the inputs of the sets X and Y toa single reducer
• NP-hard for z > 1 reducers
15
Heuristics for X2Y Mapping Schema Problem
• Based on
– First-Fit Decreasing (FFD) or Best-Fit Decreasing(BFD) bin-packing algorithm
• A fixed reducer capacity is given
16
Conclusion
• Reducer capacity
– An important parameter to be considered in all MapReducealgorithms
– The capacity is in terms of, not necessarily identical, memoryauxiliary size, augmented and added to the index of the dataitem(s)
• Two assignment schemas of MapReduce are given
– All-to-All (A2A) mapping schema problem
– X-to-Y (X2Y) mapping schema problem
• Several heuristics for A2A and X2Y mapping schemaproblems are provided
17
Foto Afrati1, Shlomi Dolev2, Ephraim Korach3, Shantanu Sharma2, and Jeffrey D. Ullman4
1 School of Electrical and Computing Engineering, National Technical University of Athens, Greece
[email protected] Department of Computer Science, Ben-Gurion University of the
Negev, Israel{dolev,sharmas}@cs.bgu.ac.il
3 Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel
[email protected] Department of Computer Science, Stanford University, USA
Presentation is available athttp://www.cs.bgu.ac.il/~sharmas/publication.html