
1/20

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications

Sheng Di, Mohamed Slim Bouguerra, Leonardo Bautista-Gomez, Franck Cappello

INRIA and ANL

2013

2/20

Outline

Background of Multi-level Checkpoint Model
Problem Formulation
Optimization of Multi-level Checkpoint Model
  Optimizing Checkpoint Intervals for each level
  Optimizing the Selection of Levels
Performance Evaluation
Conclusion and Future Work

3/20

Background of Multi-level Ckpt Model

The traditional Ckpt/Restart model always stores checkpoint files on the Parallel File System (PFS).

The PFS is centrally controlled, so it becomes a bottleneck for large-scale applications.

For example, our experiments show that the checkpoint overhead on the PFS increases quickly with problem size and execution scale:

# cores      128        256        512        1024
Ckpt cost    7.4 sec    10.8 sec   16.8 sec   43.1 sec

4/20

Background of Multi-level Ckpt Model

Existing Multi-level checkpoint toolkits

Scalable Checkpoint/Restart Library (SCR) - SC’10
  RAM disk / local disk
  Partner-copy / XOR encoding
  Parallel File System (PFS), e.g., NFS

Fault Tolerance Interface (FTI) - SC’11
  Local disk: storing ckpt files on the local disk
  Partner-copy: storing ckpt files on the local disk & partner disk
  Reed-Solomon encoding (RS-encoding)
  Parallel File System (PFS), e.g., NFS

5/20

Problem Formulation

Different Types of Failures

CPL1: There are no hardware failures, only software errors.
CPL2: There are non-adjacent hardware failures.
CPL3: There are a few adjacent hardware failures.
CPL4: There are a lot of hardware failures.

[Figure: failures over time on Node1 to Node4, including a soft failure (Soft-F); the successive failure events are labeled CPL1, CPL2, CPL3, CPL4, CPL2]

6/20

Problem Formulation

The process of running an HPC application with failures over the multi-level checkpoint model

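[Figure: parallel application execution over four checkpoint levels (Level 1: Local FS, Level 2: Partner copy, Level 3: RS encoding, Level 4: PFS). Legend: normal run, roll-back loss, one checkpoint. Events shown: soft failure, hard failure (one node crash), hard failure (adjacent node crash). Checkpoint interval lengths are marked Te/x1, Te/x2, Te/x3.]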

7/20

Problem Formulation

Our Objective: minimize the expected wall-clock length for each HPC application with:
  optimized selection of levels
  optimized checkpoint intervals on each level

Mathematical Expectation of Wall-clock Length:

[Equation: E(Tw), with terms annotated as productive time, # of levels, # of ckpt intervals at level i, ckpt overhead, rollback loss, restart cost, # of failures at level i, and probability]

8/20

Optimization of Multi-level Checkpoint Model

E(Tw) is convex, because: [Equation: convexity condition on E(Tw)]

xi is the # of ckpt intervals at level i.

We get the optimal solution xi* as long as we solve the simultaneous first-order equations:

∂E(Tw)/∂xi = 0, where i = 1, 2, 3, ..., L

9/20

Optimization of Multi-level Checkpoint Model

Optimizing Checkpoint Intervals

Simplified equations:

[Equations: iterative update giving xi(k+1) from the values at iteration k]

We use an iterative algorithm to solve them:
  k=0: err=0.2
  k=1: err=0.08
  k=2: err=0.005
  k=3: err=0.0001
  ......

We use Young’s formula to initialize xi(0).
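The initialization step can be illustrated with Young's classic first-order formula, in which the checkpoint interval is sqrt(2 * C * MTBF). The sketch below is only an illustration: the slides do not state how the formula is mapped onto xi(0), so dividing the productive time Te by Young's interval, and supplying one MTBF per level, are assumptions, and both function names are hypothetical.

```python
import math


def young_interval(ckpt_cost, mtbf):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def initial_interval_counts(te, ckpt_costs, mtbfs):
    """Assumed mapping: x_i(0) = Te / (Young's interval at level i)."""
    return [te / young_interval(c, m) for c, m in zip(ckpt_costs, mtbfs)]


# Illustrative numbers only (not taken from the slides):
# a 10 s checkpoint with a 2000 s MTBF gives a 200 s interval,
# so a 1000 s run starts from 5 intervals at that level.
print(initial_interval_counts(1000.0, [10.0], [2000.0]))
```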

10/20

Optimization of Multi-level Checkpoint Model

Optimizing Checkpoint Intervals

How fast is our iterative optimization algorithm?
  If we set the error threshold to 10^-6, the algorithm converges in only about 20-30 iterations.

What is the performance gain of our method compared to the traditional Young’s formula?
  Suppose there are 8 levels and the application execution length is 1000 ~ 9000 seconds.
  The checkpoint overheads on the 8 levels are 10, 30, 45, 50, 55, 60, 65, and 240 seconds per checkpoint.
  Numerical simulation shows that our method is better than Young’s formula by 4.2% - 17.8%.
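To make the idea of iterating to a threshold concrete, here is a minimal sketch on a deliberately simplified toy model. It is not the E(Tw) formula or the simplified equations from these slides, which were not recovered. The toy model assumes that failures at level i arrive at a constant rate, that each failure costs half an interval of rollback plus a restart, and that levels do not interact: Tw = Te + sum(Ci * xi) + sum(rate_i * Tw * (Te / (2 * xi) + Ri)). The failure rates, the restart costs, and all function names are hypothetical; only the 8 checkpoint overheads and the 10^-6 threshold come from the slide.

```python
import math


def expected_wallclock(x, te, ckpt, restart, rate):
    """Toy model (an assumption, not the slides' formula), solved for Tw."""
    numerator = te + sum(c * xi for c, xi in zip(ckpt, x))
    denominator = 1.0 - sum(
        lam * (te / (2.0 * xi) + r) for lam, xi, r in zip(rate, x, restart)
    )
    if denominator <= 0.0:
        raise ValueError("failure rates too high for this toy model")
    return numerator / denominator


def young_start(te, ckpt, rate):
    """Young's interval sqrt(2 * C * MTBF), with MTBF = 1 / rate per level."""
    return [te / math.sqrt(2.0 * c / lam) for c, lam in zip(ckpt, rate)]


def optimize_intervals(te, ckpt, restart, rate, tol=1e-6, max_iter=100):
    """Fixed-point iteration on the toy model's first-order conditions."""
    x = young_start(te, ckpt, rate)
    for k in range(1, max_iter + 1):
        tw = expected_wallclock(x, te, ckpt, restart, rate)
        # Setting dTw/dx_i = 0 in the toy model gives this update.
        x_new = [math.sqrt(lam * tw * te / (2.0 * c)) for c, lam in zip(ckpt, rate)]
        err = max(abs(a - b) / b for a, b in zip(x_new, x))
        x = x_new
        if err < tol:
            return x, expected_wallclock(x, te, ckpt, restart, rate), k
    raise RuntimeError("did not converge within max_iter iterations")


# Checkpoint overheads from the slide; everything else is hypothetical.
ckpt = [10.0, 30.0, 45.0, 50.0, 55.0, 60.0, 65.0, 240.0]
restart = list(ckpt)   # assumption: restart cost equals checkpoint cost
rate = [1e-4] * 8      # assumption: failures per second at each level
te = 5000.0

x_opt, tw_opt, iterations = optimize_intervals(te, ckpt, restart, rate)
tw_young = expected_wallclock(young_start(te, ckpt, rate), te, ckpt, restart, rate)
print(iterations, round(tw_opt, 1), round(tw_young, 1))
```

Because the expected number of failures depends on the wall-clock length itself, the interval counts and Tw have to be solved together, which is why an iteration is used here. The iteration counts and gains of this toy model should not be compared with the figures reported on the slide.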

11/20

Optimization of Multi-level Checkpoint Model

Optimizing Selection of Checkpoint Levels

For a particular combination of levels, the computation cost is only about 30 iterations.

It is therefore feasible to traverse all combinations of levels to find the optimal selection.

Suppose there are 8 levels: there are 2^8 - 1 = 255 different combinations of levels, so the total computation cost is 255 * 30 = 7650 iterations, which is very small.
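The exhaustive traversal can be sketched as follows. This is a minimal illustration, not the authors' implementation: `level_subsets` and `best_selection` are hypothetical names, and `cost_of` stands in for whatever evaluator returns the optimized expected wall-clock length of a given combination of levels.

```python
from itertools import combinations


def level_subsets(num_levels):
    """Yield every non-empty combination of level indices 1..num_levels."""
    levels = range(1, num_levels + 1)
    for size in range(1, num_levels + 1):
        yield from combinations(levels, size)


def best_selection(num_levels, cost_of):
    """Return the combination with the smallest cost.

    cost_of is a caller-supplied (hypothetical) function, for example the
    optimized E(Tw) obtained for that combination of levels.
    """
    return min(level_subsets(num_levels), key=cost_of)


subsets = list(level_subsets(8))
print(len(subsets))       # 255 combinations for 8 levels
print(len(subsets) * 30)  # 7650 iterations at about 30 per combination
```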

12/20

Optimization of Multi-level Checkpoint Model

Analysis of a Practical Case: FTI

There are 4 levels: local disk, partner-copy, RS-encoding, and PFS

Use Clf, Cpc, Crs, Cpf to denote ckpt overheads

Use Rlf, Rpc, Rrs, Rpf to denote restart overheads

13/20

Optimization of Multi-level Checkpoint Model

Analysis of a Practical Case: FTI

The target simultaneous equations, derived from convex optimization (first-order derivatives), are:

[Equations: first-order conditions for the four FTI levels]

The solution to the above equations must be optimal.

We can use an iterative method to get it very quickly.

14/20

Performance Evaluation

Experimental Setting

Evaluation Type A: Numerical Simulation
  To evaluate a large number of cases with different parameters, including different ckpt overheads, restart costs, application lengths, etc.

Evaluation Type B: Real Experiment
  To validate the feasibility of using our optimal checkpoint model in a real use case: the FTI scenario.
  MPI program used in our experiment: Heat distribution

15/20

Performance Evaluation

Checkpoint Overhead of FTI on FUSION cluster

Key Indicator: Workload Processing Ratio (WPR) = productive time / wall-clock length

[Figure: checkpoint overhead of FTI on the FUSION cluster; panels: 26MB per proc and 57MB per proc]
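The key indicator is a simple ratio, sketched below. The function name and the sample numbers are illustrative assumptions, not measurements from the experiments.

```python
def workload_processing_ratio(productive_time, wallclock_length):
    """WPR = productive time / wall-clock length, a value in [0, 1]."""
    if wallclock_length <= 0:
        raise ValueError("wall-clock length must be positive")
    if not 0 <= productive_time <= wallclock_length:
        raise ValueError("productive time must lie within the wall-clock length")
    return productive_time / wallclock_length


# Illustrative: 900 s of useful work in a 1000 s run.
print(workload_processing_ratio(900.0, 1000.0))
```

A WPR closer to 1 means less time lost to checkpointing, rollback, and restart.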

16/20

Performance Evaluation

Different Selections of Checkpoint Levels

Simulation Settings

17/20

Performance Evaluation

Different Selections of Checkpoint Levels

Simulation Results

Improvement: 10-20%

18/20

Performance Evaluation

Experimental Results on FUSION cluster

19/20

Conclusion

Optimal Multi-level Checkpoint/Restart Model

Key Theoretical Conclusions:
  Ckpt intervals on each level can be optimized by fast iterative methods (converging within only 30 iterations).
  The ckpt intervals are optimal based on convex-optimization theory.

Key Simulation/Experimental Results:
  For FTI, the iterative optimal method with the best selection of levels is better than other solutions by up to 20%.
  For other cases, such as 8 levels, optimized selection of levels can improve performance by 50% in some cases.

20/20

Future Work

In the future, we plan to:

evaluate our optimal ckpt/restart model using more complex MPI programs, such as CESM, on real clusters at larger scales.

improve robustness and stability by taking into account possible prediction errors in checkpoint overheads and execution length.

optimize the execution scale (# of processes) based on checkpoint overheads for applications with a specific productive time.

21/20

Thanks!!

Contact me at: [email protected]
