A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures

Wei Jiang
Data-Intensive and High Performance Computing Research Group
Department of Computer Science and Engineering
The Ohio State University
Advisor: Dr. Gagan Agrawal

Post on 12-Jan-2016

Page 2: The Era of “Big Data”

• When data size becomes a problem:
  – Need easy-to-use tools!
• What other aspects?
  – Performance? Analysis and Management? Security and Privacy?

[Figure: timeline of data tools, labeled Notebook, Spreadsheet, Parallel DBMS, Parallel Systems, and Next Generation, with time markers "Old days", 1979, 1984, 2007, and "?". Caption: Pick the right data tool!]

Page 3: Motivation

• Growing need for data-intensive supercomputing
  – Performance is the highest priority in HPC!
  – Efficient data processing and high programming productivity
• Emergence of various parallel architectures
  – Traditional CPU clusters (multi-cores)
  – GPU clusters (many-cores)
  – CPU-GPU clusters (heterogeneous systems)
• Given big data, high-end apps, and parallel architectures…
  – We need programming models and middleware support!

Page 4: Map-Reduce is good, but…

• Map-Reduce and its variants
  – Simple API: map and reduce
    • Easy to write parallel programs
    • Fault-tolerant for large-scale data centers with commodity nodes
    • High programming productivity
  – Performance?
    • Always a concern for the HPC community and also for the database community
• Data-intensive applications
  – Various subclasses:
    • Data center-oriented: search technologies
    • Data mining, graph mining, and scientific computing
    • Large intermediate structures: pre-processing/post-processing

Page 5: Parallel Computing Environments

• Parallel Architectures
  – CPU clusters (multi-cores)
    • Most widely used as traditional HPC platforms
    • Motivated MapReduce and many of its variants
  – GPU clusters (many-cores)
    • Higher performance with better cost & energy efficiency
    • Low programming productivity
    • Limited MapReduce-like support
  – CPU-GPU clusters
    • Emerging heterogeneous systems
    • No general MapReduce-like support to date
  – New Hybrid Architectures
    • CPU+GPU on the same chip: Sandy Bridge, Fusion, etc.

Page 6: Our Middleware Series

• Bridge the gap between the parallel architectures and the applications
  – Higher programming productivity than MPI
  – Better performance efficiency than MapReduce

[Figure: the middleware series MATE, Ex-MATE, MATE-CG, and FT-MATE, drawn alongside GPU nodes. Caption: Tall oaks grow from little acorns!]

Page 7: Our Current Work

• Four systems on different parallel architectures:
  – MATE (Map-reduce with an AlternaTE API)
    • For multi-core environments and data mining
  – Ex-MATE (Extended MATE)
    • For clusters of multi-cores
    • Provided large-sized reduction object support
  – MATE-CG (MATE for Cpu-Gpu)
    • For heterogeneous CPU-GPU clusters
    • Provided an auto-tuning framework for data distribution
  – FT-MATE (Fault Tolerant MATE)
    • Supports more efficient fault tolerance for MPI programs
    • Makes use of distributed memory and reliable storage

Page 8: The Programming Model

• The generalized reduction model
  – Based on user-declared reduction objects
  – Motivated by a set of data mining applications
    • For example, K-Means could have a very large set of data points to process but only need to update a small set of centroids (the reduction object!)
  – Forms a compact summary of computational states
    • Helps achieve more efficient fault tolerance and recovery than replication/job re-execution in Map-Reduce
  – Avoids large-sized intermediate data
    • Applies updates directly on the reduction object instead of going through Map, Intermediate Processing, and Reduce
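
The K-Means case mentioned on this slide can be sketched in a few lines of Python. This is only an illustration of the generalized reduction idea, not MATE's actual API: the function names (`new_reduction_object`, `reduction`, `finalize`, `kmeans_iteration`) and the layout of the reduction object (one running coordinate sum and one count per centroid) are assumptions made for the example.

```python
# Hypothetical sketch of generalized reduction for one K-Means iteration.
# The reduction object is small (k slots) no matter how many points are read.

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def new_reduction_object(k, dim):
    """One [coordinate sums, count] slot per centroid."""
    return [[[0.0] * dim, 0] for _ in range(k)]

def reduction(robj, point, centroids):
    """Fold one data point straight into the reduction object."""
    slot = robj[nearest(point, centroids)]
    for d, value in enumerate(point):
        slot[0][d] += value
    slot[1] += 1

def finalize(robj, centroids):
    """Turn sums and counts into new centroids (empty clusters keep the old one)."""
    return [
        [s / count for s in sums] if count else list(old)
        for (sums, count), old in zip(robj, centroids)
    ]

def kmeans_iteration(points, centroids):
    robj = new_reduction_object(len(centroids), len(centroids[0]))
    for point in points:
        reduction(robj, point, centroids)   # no intermediate (key, value) pairs
    return finalize(robj, centroids)
```

Each point updates the reduction object directly, so nothing proportional to the input size is ever buffered between a map and a reduce phase.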

Page 9: Comparing Processing Structures

• The reduction object represents the intermediate state of the execution
• The reduction function is commutative and associative
• Sorting, grouping, and shuffling overheads are eliminated by using the reduction function and the reduction object
• But we need global combination.
• Insight: we could even provide a better implementation of the same map-reduce API, e.g., Turbo MapReduce from Quantcast!
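
To make the comparison concrete, here is a minimal sketch that counts words in both styles. It is not taken from the talk; `map_reduce_count` and `generalized_reduction_count` are hypothetical names, and the list of splits stands in for the data handed to separate workers.

```python
from collections import defaultdict

def map_reduce_count(words):
    """Map-Reduce style: emit pairs, group them by key, then reduce each group."""
    intermediate = [(w, 1) for w in words]        # map phase output
    groups = defaultdict(list)
    for key, value in sorted(intermediate):       # sort/group/shuffle step
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

def generalized_reduction_count(splits):
    """Generalized reduction: update a reduction object in place, then combine."""
    local_objects = []
    for split in splits:                          # one reduction object per worker
        robj = defaultdict(int)
        for w in split:
            robj[w] += 1                          # no intermediate pairs are stored
        local_objects.append(robj)
    combined = defaultdict(int)                   # global combination
    for robj in local_objects:
        for key, count in robj.items():
            combined[key] += count
    return dict(combined)
```

Both functions return the same counts. The second one is order-independent only because addition is commutative and associative, which is the property the slide requires of the reduction function.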

Page 11: Shared-Memory Parallelization in MATE

• Basic one-stage dataflow in Full Replication scheme
  – Locking-free: each CPU core has a private copy of the reduction object
  – Parallel merge is performed in combination phase

[Figure: dataflow from input splits (Split 0 to Split 5) through per-thread Reduction steps into private Reduction Objects, followed by Combination and Output]
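
The full replication scheme can be sketched with Python's standard thread pool. This is an illustration under stated assumptions rather than MATE's runtime: the reduction object is simplified to a `[sum, count]` pair, the function names are hypothetical, and the merge here is sequential whereas the slide describes a parallel merge.

```python
from concurrent.futures import ThreadPoolExecutor

def reduce_split(split):
    """Each worker fills its own private reduction object, so no locks are needed."""
    private = [0, 0]                  # [sum, count], a stand-in reduction object
    for x in split:
        private[0] += x
        private[1] += 1
    return private

def combine(objects):
    """Combination phase: merge the private copies into one result."""
    total = [0, 0]
    for s, n in objects:
        total[0] += s
        total[1] += n
    return total

def full_replication(data, num_workers):
    # Deal the input out into one split per worker.
    splits = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        private_objects = list(pool.map(reduce_split, splits))
    return combine(private_objects)
```

Because of CPython's global interpreter lock this shows the structure of the scheme, not its speedup; the point is that workers never touch a shared object until the combination phase.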

Page 12: Functions

• APIs defined/customized by the user

Function Description                                   R/O
int (*splitter_t)(void *, int, reduction_args_t *)     O
void (*reduction_t)(reduction_args_t *)                R
void (*combination_t)(void *)                          O
void (*finalize_t)(void *)                             O

Page 13: Experiments Design

• For comparison against Phoenix, we used three data mining applications
  – K-Means Clustering, Principal Component Analysis (PCA), and Apriori Association Mining
  – Also evaluated the single-node performance of Hadoop on K-Means and Apriori
    • Combine function is used in Hadoop with careful tuning
• Experiments on two multi-core platforms
  – 8 cores on one 8-core node (Intel CPU)
  – 16 cores on one 16-core node (AMD CPU)

Page 14: Results: Data Mining (I)

• K-Means on 8-core and 16-core machines: 400MB dataset, 3-dim points, k = 100

[Figure: two charts, one per machine, of Avg. Time Per Iteration (sec) against # of threads (1 to 8 and 1 to 16) for Phoenix, MATE, and Hadoop; the charts carry the annotations 2.0 and 3.0]

Page 15: Results: Data Mining (II)

• PCA on 8-core and 16-core machines: 8000 * 1024 matrix

[Figure: two charts, one per machine, of Total Time (sec) against # of threads (1 to 8 and 1 to 16) for Phoenix and MATE; both charts carry the annotation 2.0]

Page 16: Extending MATE

• Main issue of the original MATE:
  – Assumes that the reduction object MUST fit in memory
• We extended MATE to address this limitation
  – Focus on graph mining: an emerging class of apps
    • Require large-sized reduction objects as well as large-scale datasets
    • E.g., PageRank could have a 16GB reduction object!
  – Support of managing arbitrary-sized reduction objects
    • Large-sized reduction objects are disk-resident
  – Evaluated Ex-MATE using PEGASUS
    • PEGASUS: A Hadoop-based graph mining system

Page 18: Ex-MATE Runtime Overview

• Basic one-stage execution

[Figure: Execution Overview of the Extended MATE]

Page 19: Implementation Considerations

• Support for processing very large datasets
  – Partitioning function:
    • Partition and distribute to a number of computing nodes
  – Splitting function:
    • Use the multi-core CPU on each node
• Management of a large reduction-object on disk:
  – How to reduce disk I/O?
  – Outputs (R.O.) are updated in a demand-driven way
    • Partition the reduction object into splits
  – Inputs are re-organized based on data access patterns
    • Reuse a R.O. split as much as possible in memory
  – Example: Matrix-Vector Multiplication

Page 20: A MV-Multiplication Example

[Figure: Input Matrix blocks (1, 1), (2, 1), and (1, 2), the Input Vector, and the Output Vector]

Matrix-Vector Multiplication using checkerboard partitioning. B(i,j) represents a matrix block, I_V(j) represents an input vector split, and O_V(i) represents an output vector split. The matrix/vector multiplies are done block-wise, not element-wise.
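
The block-wise scheme in the caption can be sketched as follows. This is a small in-memory illustration only: the function name `block_matvec` and the single `block` size parameter are assumptions, and the disk-resident handling that Ex-MATE adds is not modeled.

```python
def block_matvec(matrix, vector, block):
    """Multiply matrix by vector one (i, j) block at a time.

    The output vector plays the role of the reduction object: output split
    O_V(i) is reused while blocks B(i, 1), B(i, 2), ... are applied to it.
    """
    n_rows, n_cols = len(matrix), len(vector)
    output = [0] * n_rows
    for i0 in range(0, n_rows, block):              # output split O_V(i)
        rows = range(i0, min(i0 + block, n_rows))
        for j0 in range(0, n_cols, block):          # input split I_V(j)
            cols = range(j0, min(j0 + block, n_cols))
            for r in rows:                          # block B(i, j) times I_V(j)
                output[r] += sum(matrix[r][c] * vector[c] for c in cols)
    return output
```

Iterating over all column blocks before moving to the next row block keeps one output split live at a time, which mirrors the "reuse a R.O. split as much as possible in memory" goal from the previous slide.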

Page 21: Experiments Design

• Applications:
  – Three graph mining algorithms:
    • PageRank, Diameter Estimation (HADI), and Finding Connected Components (HCC), parallelized using the GIM-V method
• Evaluation:
  – Performance comparison with PEGASUS
    • PEGASUS provides a naïve version and an optimized version
    • Speedups with an increasing number of nodes
  – Scalability speedups with an increasing size of datasets
• Experimental platform:
  – A cluster of multi-core CPU machines
  – Used up to 128 cores (16 nodes)

Page 22: Results: Graph Mining (I)

• 16GB datasets:
  – Ex-MATE: ~10 times speedup

[Figure: three charts (PageRank, HADI, and HCC) of Avg. Time Per Iteration (min) against # of nodes (4, 8, 16) for P-Naïve, P-Block, and Ex-MATE]

Page 23: Scalability: Graph Mining (II)

• HCC: better scalability with larger datasets

[Figure: three charts (8GB, 32GB, and 64GB datasets) of Avg. Time Per Iteration (min) against # of nodes (4, 8, 16) for P-Naïve, P-Block, and Ex-MATE; the charts carry the annotations 1.5 and 3.0]

Page 25: MATE for CPU-GPU Clusters

• Still adopts Generalized Reduction
  – Built on top of MATE and Ex-MATE
• Accelerates data-intensive computations on heterogeneous systems
  – Focus on CPU-GPU clusters
  – A multi-level data partitioning
• Proposed a novel auto-tuning framework
  – Exploits iterative nature of many data-intensive apps
  – Automatically decides the workload distribution between CPUs and GPUs

Page 26: MATE-CG Overview

• Execution work-flow

Page 27: Auto-Tuning Framework

• Auto-tuning problem:
  – Given an application, find the optimal data distribution between the CPU and the GPU to minimize the overall running time on each node
    • For example: which is best, 20/80, 50/50, or 70/30?
• Our approach:
  – Exploits the iterative nature of many data-intensive applications with similar computations over a number of iterations
    • Construct an analytical model to predict performance
  – The optimal value is computed and learnt over the first few iterations
    • No compile-time search or tuning is needed
  – Low runtime overheads with a large number of iterations

Page 28: The Analytical Model

• Illustration of the relationship between T_cg and p:

[Figure: three plots showing how T_c, T_g, and T_cg vary with p]

T_c: processing times on CPU with p
T_o: fixed overheads on CPU
T_g: processing times on GPU with (1-p)
T_g_o: fixed overheads on GPU
T_cg: overall processing times on CPU+GPU with p
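
The slide does not give the model's formula, so the sketch below uses a simple linear form purely as an assumption: CPU time grows linearly with its share p, GPU time with (1-p), each side adds its fixed overhead, and the two devices run concurrently so the slower one sets T_cg. The parameter names `t_c` and `t_g` (time for each device to process the whole input alone) and both function names are hypothetical.

```python
def predicted_time(p, t_c, t_g, t_o=0.0, t_g_o=0.0):
    """Assumed model: T_cg(p) = max(p * t_c + T_o, (1 - p) * t_g + T_g_o)."""
    cpu = p * t_c + t_o               # T_c: CPU handles fraction p
    gpu = (1.0 - p) * t_g + t_g_o     # T_g: GPU handles fraction (1 - p)
    return max(cpu, gpu)

def optimal_fraction(t_c, t_g, t_o=0.0, t_g_o=0.0):
    """CPU share p that balances both sides under the assumed linear model."""
    # Solve p * t_c + t_o == (1 - p) * t_g + t_g_o for p.
    p = (t_g + t_g_o - t_o) / (t_c + t_g)
    # Clamp: if the balance point is outside [0, 1], one device takes everything.
    return min(1.0, max(0.0, p))
```

Under this assumption, a CPU that needs 30 s for the full input and a GPU that needs 10 s balance at p = 0.25, a 25/75 split. In an iterative application the two rates could be measured during the first few iterations and the split fixed afterwards, which matches the approach described on the previous slide.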

Page 29: Experiments Design

• Experiments Platform
  – A heterogeneous CPU-GPU cluster
    • Each node has one multi-core CPU and one GPU
      – Intel 8-core CPU
      – NVIDIA Tesla (Fermi) GPU (14*32 (448) cores)
    • Used up to 128 CPU cores and 7168 GPU cores on 16 nodes
• Three representative applications
  – Gridding kernel, EM, and PageRank
• For each application, we run it in four modes in the cluster
  – CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU

Page 30: Results: Scalability with increasing # of GPUs

• GPU-only is better than CPU-8

[Figure: three charts (PageRank, EM, and Gridding Kernel) of Avg. Time Per Iteration (sec) against # of nodes (2, 4, 8, 16) for CPU-1, CPU-8, and GPU-only; the charts carry the annotations 16%, 3.0, and 25%]

Page 31: Results: Auto-tuning

• On 16 nodes

[Figure: three charts (PageRank, EM with separate E-step and M-step series, and Gridding Kernel) of Execution Time In One Iteration (sec) against Iteration Number (1 to 10)]

Page 33: Reduction Object-based Fault Tolerance

• Fault tolerance is important for data-intensive computing
  – Reduction object helps to provide fault-tolerance at lower costs than Hadoop
  – Example: FREERIDE-G vs. Hadoop

K-Means Clustering:
• One node fails at 50% of data processing

Overheads
• Hadoop: 23.06 | 71.78 | 78.11
• FREERIDE-G: 20.37 | 8.18 | 9.18

Taken from Tekin Bicer et al. (IPDPS’2010)

Page 34: Applying Reduction Object-based Approach to MPI Programs

• The reduction object model achieves better fault tolerance for data mining and graph mining as in FREERIDE-G and MATE systems
• Fault-tolerance support for MPI applications remains an ongoing challenge and existing solutions will not work in the future
  – Checkpoint time will exceed the Mean-Time To Failure in the exascale era (exascale systems expected in 2018)!
• So, can our ideas from the reduction object work help other types of applications, such as MPI programs?

Page 35: Our Fault Tolerance Approach (I)

• Based on the Extended Generalized Reduction Model
  – Aims to improve the expensive checkpoint/restart (C/R) for MPI applications
  – Divides the reduction object into two parts
    • One inter-node global reduction object
    • One set of intra-node local reduction objects
    • Only the global reduction object participates in the global combination phase
• Targets applications that can be written using the extended generalized reduction model
  – No redundant/backup nodes are used/needed
  – Deals with fail-stop type of failures (not affecting others)
  – Assumes failure detection is accurate and instant

Page 36: Our Fault Tolerance Approach (II)

• Using distributed memory and reliable storage
  – Cache the global reduction object into the memory of other nodes
  – Save the local reduction objects onto a persistent storage
• Can be viewed as an adaptation of the application-level checkpointing approach
  – The key difference is we exploit the reduction object structures and re-distribute remaining data upon a failure
  – There is no need to restart a failed process
• Suitable for a diverse set of applications
  – Data mining: only the global reduction object is needed
  – Stencil computations: both global and local ones are needed
  – Irregular reductions: only the local reduction objects are needed
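
The recovery idea on this slide (keep the checkpointed reduction object, hand the unprocessed data to the survivors, never restart the failed process) can be simulated in a single process. Everything below is a hypothetical sketch: the function name, the sum used as a reduction, the checkpoint after every chunk, and the dictionary standing in for a copy cached on a peer node are all assumptions, and at least two nodes are assumed.

```python
def run_with_failure(chunks_per_node, failed_node, fail_after):
    """Sum all values while one node stops after `fail_after` chunks.

    chunks_per_node: {node_id: [chunk, ...]}, each chunk a list of numbers.
    """
    checkpoints = {}    # node_id -> (reduction object copy, chunks done);
                        # stands in for a copy cached in a peer's memory
    leftovers = []      # chunks the failed node never processed
    for node, chunks in chunks_per_node.items():
        robj, done = 0, 0
        for chunk in chunks:
            if node == failed_node and done == fail_after:
                break                           # fail-stop: the node goes silent
            robj += sum(chunk)                  # local reduction
            done += 1
            checkpoints[node] = (robj, done)    # checkpoint after each chunk
        if node == failed_node:
            saved_done = checkpoints.get(node, (0, 0))[1]
            leftovers = chunks[saved_done:]
    # Recovery: survivors share the failed node's remaining chunks ...
    survivors = [n for n in chunks_per_node if n != failed_node]
    extra = {n: 0 for n in survivors}
    for i, chunk in enumerate(leftovers):
        extra[survivors[i % len(survivors)]] += sum(chunk)
    # ... and the global combination uses every checkpointed reduction object.
    return sum(robj for robj, _ in checkpoints.values()) + sum(extra.values())
```

The work the failed node finished before stopping is not redone, because its reduction object survives in the checkpoint; only the chunks it had not reached are redistributed.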

Page 37: MPI Application Examples

• Dense Grid Computations
  – Stencil computations like Jacobi and Sobel Filter
• Sparse Grid Computations
  – Irregular Reductions like Euler Solver and Molecular Dynamics

Page 38: Implementing Dense Grid Computations using EGR (I)

• A simple way is based on output matrix partitioning
  – The input data needed for computing an output partition consists of the corresponding input partition and the elements on the border of its neighboring input partitions
    • Each output partition is a local reduction object and there is no use of global reduction object

[Figure: Output matrix partitioning and the corresponding input matrix]
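
A minimal sketch of output matrix partitioning for a Jacobi-style stencil is shown below. It assumes a four-neighbor average and a partition into horizontal row bands; neither detail is stated on the slide, and the function names are hypothetical. Each band reads its own input rows plus one border row from each neighbor and writes only its own output rows.

```python
def jacobi_step(grid):
    """Reference: four-neighbor average on interior points, borders unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (grid[r - 1][c] + grid[r + 1][c]
                         + grid[r][c - 1] + grid[r][c + 1]) / 4.0
    return out

def jacobi_step_partitioned(grid, num_parts):
    """Same step, but each output row band is computed independently."""
    rows, cols = len(grid), len(grid[0])
    band = (rows + num_parts - 1) // num_parts
    out = []
    for start in range(0, rows, band):
        stop = min(start + band, rows)
        # Input needed: the band's own rows plus one border row on each side.
        lo, hi = max(start - 1, 0), min(stop + 1, rows)
        local_in = grid[lo:hi]
        local_out = []                      # this band's local reduction object
        for r in range(start, stop):
            k = r - lo                      # row index inside local_in
            row = local_in[k][:]
            if 0 < r < rows - 1:
                for c in range(1, cols - 1):
                    row[c] = (local_in[k - 1][c] + local_in[k + 1][c]
                              + local_in[k][c - 1] + local_in[k][c + 1]) / 4.0
            local_out.append(row)
        out.extend(local_out)               # bands never overlap in the output
    return out
```

Since no two bands write the same output row, nothing needs to be merged across partitions, which is why this variant has no global reduction object.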

Page 39: Implementing Dense Grid Computations using EGR (II)

• An alternative is based on input matrix partitioning
  – A data-driven approach: determine the corresponding points to be updated in the output matrix for each point in the input matrix
    • The ghost output rows of two neighboring output partitions form the global reduction object and the other output rows are local reduction objects

[Figure: Input matrix partitioning and the corresponding output matrix]

Page 40: Implementing Sparse Grid Computations using EGR (I)

• For irregular reductions, the corresponding points in the output space are not known at compile time
  – Input space partitioning would have to treat the entire output space as the global reduction object, which resulted in poor scalability in our preliminary experiments
  – We choose output space partitioning in our implementation and re-organize the corresponding input in the pre-processing stage

Page 41: Implementing Sparse Grid Computations using EGR (II)

• Partitioning on Reduction Space for Irregular Reductions

Page 42: The FT-MATE System (I)

• The processing structure with fault tolerance support

[Figure: processing structure with Checkpointing and Recovery stages]

Page 43: The FT-MATE System (II)

• Fault tolerance runtime components:
  – Configuration: MATE_FTSetup()
    • Set up checkpoint interval, directory for saving checkpoints, etc.
  – Check-pointing
    • MATE_MemCheckpoint(): synchronous/asynchronous data exchange
    • MATE_DiskCheckpoint(): single-thread/multi-thread data output
  – Detecting Failures: MATE_DetectFailure()
    • Peer-to-peer communication among the nodes is kept alive and timeouts are used to detect node failures with the aid of the MPI stack
  – Recovering Failures: MATE_RecoverFailure()
    • Data re-distribution and processing of unfinished data
    • Output space recovery if needed

Page 44: The FT-MATE System (III)

• Example: the fault recovery process for an irregular application

Page 45: Experiments Design

• Experiments Platform
  – A cluster of multi-core nodes; each node has one Intel 8-core CPU
• Four representative apps in scientific computing
  – Stencil Computations: Jacobi and Sobel Filter
  – Irregular Reductions: Euler Solver and Molecular Dynamics
• For each application, we could run it in two modes:
  – CPU-1: use 1 CPU core per node
  – CPU-8: use 8 CPU cores per node
• Evaluated against the fault tolerant MPICH2 library
  – FT-MATE vs. MPICH2-BLCR

Page 46: Results: Scalabilities Study

• Scalabilities without a failure in FT-MATE
  – MPI’s absolute performance is similar to that of FT-MATE

[Figure: two charts (CPU-1 Versions and CPU-8 Versions) of Avg. Time Per Iteration (secs) against # of Nodes (2, 4, 8, 16) for Jacobi, Sobel, Euler, and Moldyn; the charts carry the annotations 7.8 and 6.8. Caption: Scalability with CPU-1/CPU-8 versions on 2, 4, 8, and 16 nodes for each of the four applications]

Page 47: Results: Checkpointing Overheads

• Low checkpointing overheads
• On 8 nodes and running for 1000 iterations in CPU-8 mode
• Checkpoint Size for 7.4GB Moldyn: 9.3GB vs. 48MB

[Figure: two charts (Sobel Filter and Molecular Dynamics) of Normalized Checkpointing Costs (%) against # of Iterations per Checkpoint Interval (1, 10, 100, 200) for MPICH2-BLCR and FT-MATE; the charts carry the annotations 1124, 115, 765, 5.54%, and 2.99%]

Page 48: Results: Fault Recovery

• W/REBIRTH in FT-MATE: the failed node recovers one iteration after the fault recovery starts
• Low Absolute Recovery Costs
• On 32 nodes and checkpoint interval is 100/1000

[Figure: two charts (Jacobi and Euler Solver) of Normalized Recovery Costs (%) against Failure Point (%) (25, 50, 75) for MPICH2-BLCR W/REBIRTH, FT-MATE W/REBIRTH, and FT-MATE W/O REBIRTH; the charts carry the annotations 0.02% and 0.19%]

Page 49: Summary

• MATE, Ex-MATE, MATE-CG, and FT-MATE for multi-cores, homogeneous clusters, and heterogeneous clusters
  – A diverse set of applications: data mining, graph mining, scientific computing, stencil computations, irregular reductions, etc.
  – FT-MATE supports more efficient fault tolerance for MPI programs
• Also, the MATE series has been used internally in our group
  – MATE-EC2: allows starting MATE instances on cloud providers like Amazon Web Services
  – Sci-MATE: supports scientific data formats like NetCDF and HDF5
  – As a backend to run C code generated from Python/R code
• Some ongoing follow-up projects:
  – Using an implicit reduction object with the same map-reduce API?
  – What are the opportunities in Sandy Bridge and Fusion APU?
  – Dealing with GPU failures in MATE-CG?

Page 50: Thank You!

• Questions, comments, and suggestions?