1/20
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
Sheng Di, Mohamed Slim Bouguerra, Leonardo Bautista-Gomez, Franck Cappello
INRIA and ANL
2013
2/20
Outline
Background of Multi-level Checkpoint Model
Problem Formulation
Optimization of Multi-level Checkpoint Model
  Optimizing Checkpoint Intervals for Each Level
  Optimizing the Selection of Levels
Performance Evaluation
Conclusion and Future Work
3/20
Background of Multi-level Ckpt Model
The traditional Ckpt/Restart model always stores checkpoint files on a Parallel File System (PFS).
A PFS is centrally controlled, so it becomes a bottleneck for large-scale applications.
For example, our experiments show that the checkpoint overhead on PFS increases quickly with problem size and execution scale:
  # cores     128       256        512        1024
  Ckpt cost   7.4 sec   10.8 sec   16.8 sec   43.1 sec
4/20
Background of Multi-level Ckpt Model
Existing multi-level checkpoint toolkits
Scalable Checkpoint/Restart Library (SCR), SC'10
  RAM disk / local disk
  Partner-copy / XOR encoding
  Parallel File System (PFS), e.g., NFS
Fault Tolerance Interface (FTI), SC'11
  Local disk: storing ckpt files on the local disk
  Partner-copy: storing ckpt files on the local disk and a partner disk
  Reed-Solomon encoding (RS-encoding)
  Parallel File System (PFS), such as NFS
5/20
Problem Formulation
Different Types of Failures
CPL1: There are no hardware failures, only software errors.
CPL2: There are non-adjacent hardware failures.
CPL3: There are a few adjacent hardware failures.
CPL4: There are a lot of hardware failures.
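[Figure: failure timeline for Node1 to Node4 over time, showing a soft failure and the checkpoint level needed after each failure: CPL1, CPL2, CPL3, CPL4, CPL2]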
6/20
Problem Formulation
The process of running an HPC application with failures over the multi-level checkpoint model
  Level 1: Local FS
  Level 2: Partner copy
  Level 3: RS encoding
  Level 4: PFS
[Figure: parallel application execution timeline showing normal run and roll-back loss segments, checkpoints at each level, checkpoint intervals Te/x1, Te/x2, Te/x3, and failure events: a soft failure, a hard failure (one node crash), and a hard failure (adjacent node crash)]
7/20
Problem Formulation
Our Objective: minimize the expected wall-clock length for each HPC application with:
  optimized selection of levels
  optimized checkpoint intervals on each level
Mathematical Expectation of Wall-clock Length:
[Equation: E(Tw), annotated with the terms: productive time; # of levels; # of ckpt intervals at level i; ckpt overhead; rollback loss; restart cost; # of failures at level i; probability]
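The terms annotated on this slide (productive time, checkpoint overhead, rollback loss, restart cost, number of failures per level) can be combined in a minimal sketch. The slide's exact equation is not preserved in this transcript, so the additive form below, the half-interval rollback loss, and all function and parameter names are assumptions for illustration, not the authors' formula.

```python
from typing import Sequence


def expected_wallclock(
    te: float,
    x: Sequence[float],
    ckpt_cost: Sequence[float],
    restart_cost: Sequence[float],
    n_failures: Sequence[float],
) -> float:
    """Illustrative E(Tw) for a multi-level checkpoint model.

    te           -- productive time (seconds)
    x[i]         -- number of checkpoint intervals at level i
    ckpt_cost[i] -- overhead of one checkpoint at level i
    restart_cost[i] -- cost of one restart from level i
    n_failures[i]   -- expected number of failures handled at level i
    """
    total = te
    for xi, ci, ri, ni in zip(x, ckpt_cost, restart_cost, n_failures):
        total += ci * (xi - 1)         # xi intervals need xi - 1 checkpoints
        total += ni * te / (2 * xi)    # assumed: a failure loses half an interval on average
        total += ni * ri               # one restart per failure
    return total


# Hypothetical numbers: one level, 10 intervals, 2 expected failures.
print(expected_wallclock(1000.0, [10], [10.0], [5.0], [2.0]))  # 1200.0
```

More intervals raise the checkpoint overhead term and lower the rollback term, which is the trade-off the optimization on the following slides resolves.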
8/20
Optimization of Multi-level Checkpoint Model
E(Tw) is convex, because [equation]
x_i is referred to as the # of ckpt intervals at level i.
We get the optimal solution x_i* as long as we solve the simultaneous equations:
[equation]
where i = 1, 2, 3, ..., L
9/20
Optimization of Multi-level Checkpoint Model
Optimizing Checkpoint Intervals
Simplified equations:
[equation relating x_i at iteration k+1 to the values at iteration k]
We use an iterative algorithm to solve it:
  k=0: err=0.2
  k=1: err=0.08
  k=2: err=0.005
  k=3: err=0.0001
  ...
We use Young's formula to initialize x_i^(0).
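A minimal sketch of such an iteration follows. It is not the slide's simplified equations, which are not preserved in this transcript. It assumes failures at level i arrive at a constant rate `failure_rate[i]` per second of wall-clock time and that each failure loses half a checkpoint interval on average; under those assumptions the first-order condition gives x_i = sqrt(rate_i * Tw * Te / (2 * C_i)), which is re-evaluated until x stops changing. All names are hypothetical.

```python
import math


def young_intervals(te, ckpt_cost, failure_rate):
    # Young's formula: interval length sqrt(2 * C / rate), i.e. sqrt(2 * C * MTBF).
    return [max(1.0, te / math.sqrt(2.0 * c / lam))
            for c, lam in zip(ckpt_cost, failure_rate)]


def wallclock(te, x, ckpt_cost, restart_cost, failure_rate):
    # Solves Tw = te + overhead + Tw * loss_rate for Tw (assumed model).
    overhead = sum(c * (xi - 1.0) for xi, c in zip(x, ckpt_cost))
    loss_rate = sum(lam * (te / (2.0 * xi) + r)
                    for xi, r, lam in zip(x, restart_cost, failure_rate))
    if loss_rate >= 1.0:
        raise ValueError("failures outpace progress under this model")
    return (te + overhead) / (1.0 - loss_rate)


def optimize_intervals(te, ckpt_cost, restart_cost, failure_rate,
                       tol=1e-6, max_iter=100):
    x = young_intervals(te, ckpt_cost, failure_rate)   # x_i^(0)
    for k in range(1, max_iter + 1):
        tw = wallclock(te, x, ckpt_cost, restart_cost, failure_rate)
        # First-order condition of the assumed model at the current Tw.
        x_new = [max(1.0, math.sqrt(lam * tw * te / (2.0 * c)))
                 for c, lam in zip(ckpt_cost, failure_rate)]
        err = max(abs(a - b) / b for a, b in zip(x_new, x))
        x = x_new
        if err < tol:
            break
    return x, wallclock(te, x, ckpt_cost, restart_cost, failure_rate), k


# Hypothetical two-level setting (cheap local level, expensive PFS level).
x_opt, tw_opt, iters = optimize_intervals(
    5000.0, [10.0, 240.0], [10.0, 240.0], [1e-4, 5e-5])
print([round(v, 2) for v in x_opt], round(tw_opt, 1), iters)
```

Because Young's formula ignores the extra wall-clock time caused by failures and checkpoints, it only serves as the starting point here; the loop corrects for the longer actual run.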
10/20
Optimization of Multi-level Checkpoint Model
Optimizing Checkpoint Intervals
How fast is our iterative optimal algorithm?
  If we set the error threshold to 10^-6, the algorithm converges within only about 20-30 iterations.
What is the performance gain of our method compared to the traditional Young's formula?
  Suppose there are 8 levels and the application execution length is 1000 to 9000 seconds.
  The checkpoint overheads on the 8 levels are 10, 30, 45, 50, 55, 60, 65, and 240 seconds per checkpoint.
  Numerical simulation shows that our method is better than Young's formula by 4.2% to 17.8%.
11/20
Optimization of Multi-level Checkpoint Model
Optimizing Selection of Checkpoint Levels
For a particular combination of levels, the computation cost is only about 30 iterations.
It is feasible to traverse all combinations of levels to find the optimal selection of levels.
Suppose there are 8 levels: there are 2^8 - 1 = 255 different combinations of levels, and the total computation cost is 255 * 30 = 7650 iterations, which is very small.
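The count of level combinations can be checked with a short enumeration. This is an illustrative sketch; `level_subsets` and `best_selection` are hypothetical helper names, and the cost function passed to `best_selection` stands in for whatever per-combination optimization is run.

```python
from itertools import combinations


def level_subsets(num_levels):
    """Yield every non-empty subset of levels 1..num_levels as a tuple."""
    levels = range(1, num_levels + 1)
    for size in range(1, num_levels + 1):
        yield from combinations(levels, size)


def best_selection(num_levels, cost):
    """Exhaustive search: return the subset with the smallest cost(subset)."""
    return min(level_subsets(num_levels), key=cost)


subsets = list(level_subsets(8))
print(len(subsets))         # 255 == 2**8 - 1
print(len(subsets) * 30)    # 7650 iterations at about 30 per combination
```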
12/20
Optimization of Multi-level Checkpoint Model
Analysis of a Practical Case: FTI
There are 4 levels: local disk, partner-copy, RS-encoding, and PFS.
Use C_lf, C_pc, C_rs, C_pf to denote the ckpt overheads.
Use R_lf, R_pc, R_rs, R_pf to denote the restart overheads.
13/20
Optimization of Multi-level Checkpoint Model
Analysis of a Practical Case: FTI
The target simultaneous equations derived from convex optimization (first-order derivatives) are:
[equations]
The solution to the above equations must be optimal.
We can use the iterative method to get it very quickly.
14/20
Performance Evaluation
Experimental Setting
Evaluation Type A: Numerical Simulation
  To evaluate a large number of cases with different parameters, including different ckpt overheads, restart costs, application lengths, etc.
Evaluation Type B: Real Experiment
  To validate the feasibility of using our optimal checkpoint model in a real use case: the FTI scenario.
  MPI program used in our experiment: Heat distribution
15/20
Performance Evaluation
Checkpoint Overhead of FTI on FUSION cluster
Key Indicator: Workload Processing Ratio (WPR) = productive time / wall-clock length
[Figure: checkpoint overhead of FTI on the FUSION cluster; panels: 26 MB per process, 57 MB per process]
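The WPR indicator defined on this slide is a simple ratio; a minimal sketch with made-up numbers (the function name and the values are illustrative, not measurements from the experiments):

```python
def workload_processing_ratio(productive_time, wallclock_length):
    """WPR = productive time / wall-clock length (1.0 means no overhead)."""
    if wallclock_length <= 0:
        raise ValueError("wall-clock length must be positive")
    return productive_time / wallclock_length


# Hypothetical run: 5000 s of useful work finished in 6250 s of wall-clock time.
print(workload_processing_ratio(5000.0, 6250.0))  # 0.8
```

A higher WPR means less time lost to checkpointing, rollback, and restart.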
16/20
Performance Evaluation
Different Selections of Checkpoint Levels
Simulation Settings
17/20
Performance Evaluation
Different Selections of Checkpoint Levels
Simulation Results
Improvement: 10-20%
18/20
Performance Evaluation
Experimental Results on FUSION cluster
19/20
Conclusion
Optimal Multi-level Checkpoint/Restart Model
Key Theoretical Conclusions:
  Ckpt intervals on each level can be optimized by fast iterative methods (converging within only 30 iterations).
  The ckpt intervals are optimal based on convex-optimization theory.
Key Simulation/Experimental Results:
  For FTI, the iterative optimal method with the best selection of levels is better than other solutions by up to 20%.
  For other cases, such as 8 levels, optimized selection of levels can improve performance by 50% in some cases.
20/20
Future Work
In the future, we plan to:
  evaluate our optimal ckpt/restart model using more complex MPI programs, such as CESM, on real clusters at larger scales.
  optimize the robustness and stability by taking into account the possible prediction errors in checkpoint overheads and execution length.
  optimize the execution scale (# of processes) based on checkpoint overheads for applications with a specific productive time.
21/20
Thanks!!
Contact me at: [email protected]