a homomorphism-based framework for systematic parallel programming with mapreduce

53
A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce Yu Liu 1 , Zhenjiang Hu 2 1 The Graduate University for Advanced Studies,Tokyo, Japan [email protected] 2 National Institute of Informatics,Tokyo, Japan 2 [email protected] Mar. 10th, 2011 Yu Liu 1 , Zhenjiang Hu 2 A Homomorphism-based Framework for Systematic Parallel Progr

Upload: yu-liu

Post on 22-Jan-2018

304 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A Homomorphism-based Framework forSystematic Parallel Programming with MapReduce

Yu Liu1, Zhenjiang Hu2

1 The Graduate University for Advanced Studies,Tokyo, [email protected]

2 National Institute of Informatics,Tokyo, Japan2 [email protected]

Mar. 10th, 2011

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 2: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Background

MapReduce

Google’s MapReduce is a popular parallel-distributed programmingmodel, for processing large data sets. It has been the de factostandard for large scale data analysis.

Concepts from functional programming languages

Automatic parallel processing, fault tolerance

Good scalability

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 3: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

MapReduce

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 4: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

MapReduce

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 5: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

MapReduce

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 6: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

MapReduce

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 7: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

MapReduce

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 8: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Programming with MapReduce

A user has to

design a D&C algorithm that fits MapReduce paradigm

map this algorithm to MapReduce.

Difficulties of programming with MapReduce

How to resolve the constrains on computing order.

How to resolve the data dependency.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 9: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Programming with MapReduce

A user has to

design a D&C algorithm that fits MapReduce paradigm

map this algorithm to MapReduce.

Difficulties of programming with MapReduce

How to resolve the constrains on computing order.

How to resolve the data dependency.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 10: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Example

The Maximum Prefix Sum problem

mps [3,−1, 4,−1,−5, 9, 2,−6, 5,−10] = 11

A sequential program for MPS in O(n) time

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 11: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Example

The Maximum Prefix Sum problem

mps [3,−1, 4,−1,−5, 9, 2,−6, 5,−10] = 11

Hard to compute MPS with MapReduce

Computation has order.

MPS of sub-lists cannot be conquered directly.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 12: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Questions

Is there a systematic way to resolving such problems withMapReduce ?

How to handle the problems with district order ?

How to systematically design the divide-and-conqueralgorithm ?

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 13: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Questions

Is there a systematic way to resolving such problems withMapReduce ?

How to handle the problems with district order ?

How to systematically design the divide-and-conqueralgorithm ?

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 14: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Motivation and objective

We propose a systematic approach to automatically generate fullyparallelized and scalable MapReduce programs.A new framework which provides algorithmic programminginterfaces has been implemented.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 15: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A systematic approach for programming with MapReduce

Firstly, derive a function h.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 16: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A systematic approach for programming with MapReduce

Then write a inverse function h◦.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 17: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A systematic approach for programming with MapReduce

D&C algorithm can be gotten.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 18: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A systematic approach for programming with MapReduce

Map it to MapReduce paradigm.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 19: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A systematic approach for programming with MapReduce

Parallelization is in a black box.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 20: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A systematic approach for programming with MapReduce

Implemented by multi-phases MapReduce processing.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 21: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Conditions of this f function

Theorem

If there exists a binary operator � such that

f (xs ++ ys) = f xs � f ysthen such � can be defined as :x � y = f (f ◦x ++ f ◦x)

where ++ islistconcatenation.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 22: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Iff a function can be defined both rightwards and leftwards, such �exists. We can derive a divide-and-conquer algorithm like this:

Divide-and-conquer

f (xs ++ ys) = f (f ◦ (f xs) ++ f ◦ (f ys))

Such functions are so called: homomorphisms.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 23: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Programming Interface

Fold and unfold

fold :: [α]→ βunfold :: β → [α].

The implementation in Java

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 24: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A function which computes MPS and its right inverse can bewritten as followings:

fold xs = mps 4 sum xsunfold (m, s) = [m, s −m]

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 25: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

The computation inside framework

Use fold and unfold functions doing the computation:

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 26: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Autonomous intermediate data

Each record of the intermediate data has the information ofposition, thus the distribution of data is indifferent.

< id , val > → << parId , id >, val >

By taking use of sorting and grouping mechanism of MapReduceframework, lists can be reconstructed when necessary.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 27: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

A formal definitation

homMR

homMR :: (α→ β)→ (β → β → β)→ {(ID, α)} → βhomMR f (⊕) = getValue ◦MapReduce mapper2 reducer2

◦MapReduce mapper1 reducer1where

mapper1 :: (ID, α)→ [((PID, ID), α)]mapper1 (i , a) = [(pid , i), a))]

where pid = makePid ireducer1 :: ((PID, ID), [α])→ ((PID, ID), β)reducer1 ((pid , j), ias) = ((pid , j), hom f (⊕) ias)

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 28: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

continued

mapper2 :: ((PID, ID), β)→ ((PID, ID), β)mapper2 ((pid , j), b) = ((c0, j), b)

where c0 is a predefined constant pidreducer2 :: ((PID, ID), [β])→ ((PID, ID), β)reducer2 ((c0, k), jbs) = ((c0, k), hom f (⊕) jbs)

getValue :: ((PID, ID), β)→ βgetValue ((c0, k), c) = c

Where, hom f (⊕) denotes a sequential version of ([f ,⊕]).

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 29: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Actual user-program for MPS

http://screwdriver.googlecode.com

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 30: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Performance evaluation

Environment: hardware

We configured clusters with 2, 4, 8, and 16 nodes. Eachcomputing/data node has two Xeon CPUs (Nocona, single-core,2.8 GHz), 2 GB memory. The nodes are connected with GigabitEthernet.

Environment: software

Linux2.6.26 ,Hadoop 0.21.0 +HDFS

Hadoop configuration: heap size= 1024MB

maximum mapper per node: 2

maximum reducer per node: 1

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 31: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Test cases

We implemented several programs for three problems on ourframework and Hadoop:

1 the maximum-prefix-sum problem.

MPS-lh is implemented using our framework’ API.MPS-mr is implemented by Hadoop API.

2 parallel sum of 64-bit integers

SUM-lh is implemented by our framework’ API.SUM-mr is implemented by Hadoop API.

3 VAR-lh computes the variance of 32-bit floating-pointnumbers;

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 32: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Test cases

Test data

100 million 64-bit integers (2.87 GB) for MPS, SUM.100 million 32-bit floating-point numbers (2.76 GB) for VAR.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 33: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Performance

The experiment results are summarized :

With 16 nodes speedup of all cases are more than 7.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 34: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Performance

Time curves:

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 35: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Concluding remarks

In this research:

Introduced a systematic way of parallel programming onMapReduce.

Developed a framework on top of Hadoop.

Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 36: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Concluding remarks

In this research:

Introduced a systematic way of parallel programming onMapReduce.

Developed a framework on top of Hadoop.

Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 37: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Concluding remarks

In this research:

Introduced a systematic way of parallel programming onMapReduce.

Developed a framework on top of Hadoop.

Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.

Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 38: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Concluding remarks

In this research:

Introduced a systematic way of parallel programming onMapReduce.

Developed a framework on top of Hadoop.

Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.

Achieved good scalability and parallelism.Automatic optimization can be equipped.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 39: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Concluding remarks

In this research:

Introduced a systematic way of parallel programming onMapReduce.

Developed a framework on top of Hadoop.

Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.

Automatic optimization can be equipped.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 40: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Concluding remarks

In this research:

Introduced a systematic way of parallel programming onMapReduce.

Developed a framework on top of Hadoop.

Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 41: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Future work

Decrease the system overhead and do more optimization.

Extend to more complex data structure such as tree andgraph.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 42: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Related work

Parallel programming with list homomorphisms (M.Cole 95)

The Third Homomorphism Theorem(J.Gibbons 96).

Systematic extraction and implementation ofdivide-and-conquer parallelism (Gorlatch PLILP96).

Automatic inversion generates divide-and-conquer parallelprograms(Morita et.al., PLDI07).

The third homomorphism theorem on trees: downward &upward lead to divide-and-conquer (Morihata, POPL09)

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 43: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Thank you very much.Questions?

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 44: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

List Homomorphism

Function h is said to be a list homomorphism

If there are a function f and an associative operator � such thatfor any list x and list y

h [a] = f ah (x ++ y) = h(x)� h(y).

Where ++ is the list concatenation.

Instance of a list homomorphism

sum [a] = asum (x ++ y) = sum x + sum y .

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 45: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Theorem (The Third Homomorphism Theorem (Gibbons,96) )

Let h be a given function and and � be binary operators. If thefollowing two equations hold for any element a and list y

h ([a] ++ y) = a h yh (y ++ [a]) = h y � a

then the function h is a homomorphism.

In fact, for a function h, if we have one of its right inverse h◦ thatsatisfies h ◦ h◦ ◦ h = h, then we can obtain the list-homomorphicdefinition as follows.

h = ([f ,�]) wheref a = h [a]l � r = h (h◦ l ++ h◦ r)

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 46: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

MapReduce programs can be automatically obtained bytwo sequential functions

homomorphism ([f ,⊕])

f :: a→ b⊕ :: b → b → b(a⊕ b)⊕ c = a⊕ (b ⊕ c).

fold and unfold, that compose leftwards and rightwards functions

fold([a] ++ x) = fold([a] ++ unfold(fold(x)))fold(x ++ [a]) = fold(unfold(fold(x)) ++ [a]).

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 47: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Currently, Screwdriver provides two kinds of programminginterfaces:

Programming interface corresponding to definition of listhomomorphism;

Programming interface corresponding to the 3rdhomomorphism theorem.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 48: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Basic Homomorphism-Programming Interface

Two functions which define an homomorphism

filter :: a→ bplus :: b → b → b.

The implementation in Java

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 49: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Programming Interface based on the 3rd homomorphismtheorem

A function and its right inverse

fold :: [a]→ bunfold :: b → [a].

The implementation in Java

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 50: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

The implementation of Screwdriver : list representation

To implement our programming interface with Hadoop, we need toconsider how to represent lists in a distributed manner.

Input data: index-value pairs

We use integer as the index’s type, the list [a, b, c , d , e] isrepresented by {(3, d), (1, b), (2, c), (0, a), (4, e)}.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 51: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Partition of input list

The pid(partition-id) of type PID is the index of a partial list. Theframework produces a same pid for the records which will begrouped together. These records have continues id .

Intermediate data: nested pairs ((pid , id), val)

Suppose the above list was divided to two parts and in differentnodes, then they are represented as{((0, 1), b), ((0, 2), c), ((0, 0), a)} and {((1, 3), d), ((1, 4), e)}.

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 52: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Grouping and sorting of intermediate data

We defined two functions: the comparatorG and comparatorS asfollows:

comparatorG (pid1, id1) (pid2, id2) = if pid1 == pid2

then 0else − 1

comparatorS (pid1, id1) (pid2, id2) = if id1 > id2then 1else − 1

for grouping intermediate records with same pid and sorting themby id .

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Page 53: A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce

Data partition

1 In MAP task,

intermediate records with same pid are grouped together andsorted by id .a partitioner dispatches the groups to different reducers.

2 In REDUCE task, reducers apply merge-sort on all groupswith same pid

Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce