a homomorphism-based mapreduce framework for systematic parallel programming

33
Outline Motivation Brief introduction of background The Design of Homomorphism-based Framework on MapReduce Case Study Performance Evaluation A Homomorphism-based MapReduce Framework for Systematic Parallel Programming Yu Liu The Graduate University for Advanced Studies Jan 12, 2011 Yu Liu A Homomorphism-based MapReduce Framework for Systematic P

Upload: yu-liu

Post on 09-Feb-2017

144 views

Category:

Technology


0 download

TRANSCRIPT

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

A Homomorphism-based MapReduce Frameworkfor Systematic Parallel Programming

Yu Liu

The Graduate University for Advanced Studies

Jan 12, 2011

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Outline

1 Motivations

2 Brief introduction of MapReduce

3 The Homomorphism-based Framework

4 Case Study: Parallel sum, Maximum prefix sum, Variance ofnumbers

5 Experimental Results

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Motivation of This Talk

Show how to make programming with MapReduce easier.

Introduce an approach of automatic parallel programgenerating.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Motivation of This Talk

Show how to make programming with MapReduce easier.

Introduce an approach of automatic parallel programgenerating.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

MapReduce Programming model

The Computation of MapReduce Framework

Google’s MapReduce is a parallel-distributed programming model,together with an associated implementation, for processing verylarge data sets in a massively parallel manner.

Components of a MapReduce program (Hadoop)

A Mapper;

A Partitioner that can be used shuffling data;

A Combiner that can be used doing local reduction;

A Reducer ;

A Comparator can be used while sorting or grouping;

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

MapReduce Programming modelMapReduce Data-processing flow

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

MapReduce Programming model

A simple functional specifcation of the MapReduce framework

Function mapS is a set version of the map function. FunctiongroupByKey :: {[(k , v)]} → {(k , [v ])} takes a set of list ofkey-value pairs (each pair is called a record) and groups the valuesof the same key into a list.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

Maximum Prefix Sum problem

The Maximum Prefix Sum problem (mps) is to find the maximumprefix-summation in a list:

3,−1, 4, 1,−5, 9, 2,−6, 5

This problem seems not obvious to solve this problem efficientlywith MapReduce.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

List Homomorphism

Function h is said to be a list homomorphism

If there are a function f and an associated operator � such thatfor any list x and list y

h [a] = f ah (x ++ y) = h(x)� h(y).

Where ++ is the list concatenation.

For instance, the function sum can be described as a listhomomorphism

sum [a] = asum (x ++ y) = sum x + sum y .

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

List Homomorphism and Homomorphism Theorems

Leftwards function

Function h is leftwards if it is defined in the following form withfunction f and operator ⊕,

h [a] = f ah ([a] ++ x) = a⊕ h x .

Rightwards function

Function h is rightwards if it is defined in the following form withfunction f and operator ⊗,

h [a] = f ah (x ++ [a]) = h x ⊗ a.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

List Homomorphism and Homomorphism Theorems

Map and Reduce

For a given function f , the function of the form ([[·] ◦ f ,++ ]) is amap function, and is written as map f .————————————————————————————The function of the form ([id ,�]) for some � is a reduce function,and is written as reduce (�).

The First Homomorphism Theorem

Any homomorphism can be written as the composition of a mapand a reduce:

([f ,�]) = reduce (�) ◦map f .

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Programming Paradigm of MapReduceList Homomorphism and Homomorphism Theorems

List Homomorphism and Homomorphism Theorems

The Third Homomorphism Theorem

Function h can be described as a list homomorphism, iff ∃ � and∃ f such that:

h = ([f ,�])

if and only if there exist f , ⊕, and ⊕ such that

h [a] = f ah ([a] ++ x) = a⊕ h xh (x ++ [b]) = h x ⊗ b.

The third homomorphism gives a necessary and sufficient conditionfor the existence of a list homomorphism.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

A homomorphism-based framework wrapping MapReduce

To make it easy for resolving problems such as mps byMapReduce. We using the knowledge of homomorphism especiallythe third homomorphism theorem to wrapping MapReduce model.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

A homomorphism-based framework wrapping MapReduce

Basic Homomorphism-Programming Interface

filter :: a→ baggregator :: b → b → b.

The implementlation on Hadoop

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

A homomorphism-based framework wrapping MapReduce

A simple example of using this interface for computing the sum ofa list

The implementlation on Hadoop

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

A homomorphism-based framework wrapping MapReduce

Programming Interface with Right Inverse

fold :: [a]→ bunfold :: b → [a].

The implementlation on Hadoop

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

A homomorphism-based framework wrapping MapReduce

A simple example of using this interface for computing the sum ofa list

The implementlation on Hadoop

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

A homomorphism-based framework wrapping MapReduce

Requirements of using this interface in addition to the right-inverseproperty of unfold over fold .

Both leftwards and rightwards functions exist

fold([a] ++ x) = fold([a] ++ unfold(fold(x)))fold(x ++ [a]) = fold(unfold(fold(x)) ++ [a]).

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

The implementation of homomorphism framework uponHadoop

To implement our programming interface with Hadoop, we need toconsider how to represent lists in a distributed manner.

Set of pairs as list

We use integer as the index’s type, the list [a, b, c , d , e] isrepresented by {(3, d), (1, b), (2, c), (0, a), (4, e)}.

Set of pairs as distributed List

We can represent the above list as two sub-sets{((0, 1), b), ((0, 2), c), ((0, 0), a)} and {((1, 3), d), ((1, 4), e)}, eachin different data-nodes

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

The implementation of homomorphism framework uponHadoop

The first homomorphism theorem implies that a listhomomorphism can be implemented by MapReduce, at least twopasses of MapReduce.

Defination of homMR

homMR :: (α→ β)→ (β → β → β)→ {(ID, α)} → βhomMR f (⊕) = getValue ◦MapReduce mapper2 reducer2

◦MapReduce mapper1 reducer1where

mapper1 :: (ID, α))→ [((ID, ID), β))]mapper1 (i , a) = [((pid , i), b)]

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

The implementation of homomorphism framework uponHadoop

Defination of homMR

reducer1 :: (ID, ID)→ [β]→ βreducer1 ((p, j), ias)) = hom f (⊕) ias

mapper2 :: ((ID, ID), β)→ [((ID, ID), β)]mapper2 ((p, j), b) = [((0, j), b)]

reducer2 :: (ID, ID)→ [β]→ βreducer2 ((0, k), jbs) = hom (⊕) jbs

getValue {(0, b)} = b

Where, hom f (⊕) denotes a sequential version of ([f ,⊕]).

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

The leftwards and rightwardsfunction

Derivation by right inverse

leftwards([a] ++ x) = fold([a] ++ unfold(fold(x)))rightwards(x ++ [a]) = fold(unfold(fold x) ++ [a]).

Now if for all xs,

rightwards xs = leftwards xs, (1)

then a list homomorphism ([f ,⊕]) that computes fold can beobtained automatically, where f and ⊕ are defined as follows:

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

The leftwards and rightwardsfunction

Derivation by right inverse

f a = fold([a])a⊕ b = fold(unfold a ++ unfold b).

Equation (1) should be satisfied.

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

Programming with this homomorphism frameworkMPS

A sequential program

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Automatic ParallelizationCase Study

Programming with this homomorphism frameworkMPS

A tupled function

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

MPS

(mps 4 sum) [a] = (a ↑ 0, a)(mps 4 sum) (x + +[a]) = let (m, s) = (mps 4 sum) x in (m ↑ (s + a), s + a).

We use this tupled function as the fold function. The right inverseof the tupled function, (mps 4 sum)◦:

(mps 4 sum)◦ (m, s) = [m, s −m]

Noting that for the any result (m, s) of the tupled function theinequality m > s hold,

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

The implementation of homomorphism framework uponHadoopperformance tests

Environment:Hardware

COE cluster in Tokyo University which has 192 computing nodes.We choose 16 , 8 , 4 , 2 and 1 node to run the MapReduce-MPSprogram. Each node has 2 Xeon(Nocona) CPU with 2GB RAM.

Environment:Software

Linux2.6.26 ,Hadoop0.20.2 +HDFS

Hadoop configuration: heap size= 1024MB

maximum mapper pre node: 2

maximum reducer pre node: 2

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Performance

The input data

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Performance

The time consuming of calculate 100 million-long list

(SequenceFile, Pair < Long >):

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

Performance

The speedup of 2-16 nodes:

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

PerformanceComparison of 2 version SUM

Comparison of 2-16 nodes:

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

PerformanceConclusions

The time curve indicate the system scalability with the number ofcomputing nodes. The curve between x-axis 2 and 8 has biggestslope, when the curve reaches to 16, the slope decreased, that isbecause when there are more nodes, the overhead ofcommunication increased. Totally, the curve shows the scalabilityis near-linear.Overhead of 2 phases Map-Reduce.Overhead of Java reflection.Not support local reduction now (not implemented yet).

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming

OutlineMotivation

Brief introduction of backgroundThe Design of Homomorphism-based Framework on MapReduce

Case StudyPerformance Evaluation

The endQuestions?

?

Yu Liu A Homomorphism-based MapReduce Framework for Systematic Parallel Programming