RESEARCH POSTER PRESENTATION DESIGN © 2015
www.PosterPresentations.com
1.Motivation
2.LetGo – ARoll-ForwardC/R 4.ApproachandChallenges
§Roll-forward recovery may introduce SDCs§How likely?
§ Current evaluation approach: statistical faultinjections
§ Problems:§ Time consuming§ No indication for a particular repair§ No predictive power
§ Our goal is to efficiently predict how likely aparticular repair would lead to a SDC
3.Goal
6.OurDesign
7. LessonLearnedandPreliminaryResults
Netsyslab@UBC: http://netsyslab.ece.ubc.ca/ Bo Fang: [email protected]@UBC: http://blogs.ubc.ca/karthik/
BoFang,Karthik Pattabiraman,Matei RipeanuUniversityofBritishColumbia
PredictingtheImpactofRoll-ForwardRecoveryforHPCApplications
Crash,22%
SDC, 2%
Detected,2%
Benign, 74%
WithLetGo
§ LetGo converts 62% crashes into continuedexecution, with 1% increase in SDC rate and 1%increase in detected rate
§ Large scale systems are prone to experience transienthardware faults
§ Main causes include:
§ Outcomes:§ Silent data corruption (undetected incorrect output)§ Fail-stop failure (terminate unexpectedly)
§ Characteristics of recovery techniques:
§ Approximate recover may introduce new SDCs !
Cosmic ray Particle strike Voltage fluctuation
Roll-back Roll-forwardOverhead T(checkpoints)+T(redo) T(to next state)+T(checkpoints)?RecoveredState Precise Precise/Approximate
§ Attempts to avoid reload from checkpoints [1] § System design
§ Evaluation results with fault injections:
Crash,56%
SDC, 1% Detected*,1%
Benign, 41%
WithoutLetGo
Program stateSDCCrashBenign
SoftwareFault injection
Repeated process to achieve statistical significance
§General idea: tracing the dependent state of theapproximate data
§Approach: building dynamic data dependence graph
§Challenge i): State exploration§Example:
§HPC applications: orders of magnitude bigger§Challenge ii): No “ground truth”: approximate
recovery implies uncertainty
whi lea = b + c; d = mem[a ] ; i f d > e mem[ d ] = f ; d = c;
else:h = d + g;
r3(addr1)
r1 r2
r4 r5
r4(addr2)
r6
add
load
store
mov
r2’ r1’
r3’(addr3)
r6’
add
load
store
r4’(addr4)
r8
cmp
r4’ r5’
r8’
cmp
r4* r5*
r4*(addrX)
r6*store
mov
r2** r1**
r3**(addrY)
r6’
add
load
store
r4**(addrZ)
r8*
cmp
r4** r5**
r8**
cmp
source code A slice of DDG Affected state
* Detected means that the application correctness check catches the data corruption
§ I. Many HPC applications have repetitive behaviors§Control flow divergence detection through
profiling§ II. Data convergence
§Fault masking memory access patterns
Constructprofiled DDG(s)andaffected
DDG
Matchprofiled
DDG(s) andaffected DDG
ExaminetheDDGs withheuristics
High SDC-proneness
Low SDC-proneness
Check for memoryaccess pattern
Check for messagepassing semantics
Ranking(2-page write-up)
(Website for Project LetGohttp://netsyslab.ece.ubc.ca/wiki/index.php/LetGo)
§Data corruptions across multiple MPI processes§Checking if the repair affects messagepassing is promising
§Size of the affected memory locations:§Repair&SDC contain more memory writesthan repair&benign
§Match affected DDG and profiled DDG§More configurations for profilingsize/frequency are needed
[1] Bo Fang, Qiang Guan, Nathan Debardeleben, Karthik Pattabiraman, and Matei Ripeanu. LetGo: A Lightweight ContinuousFramework for HPC Applications Under Failures. In Proceedings of the 26th International Symposium on High-PerformanceParallel and Distributed Computing (HPDC '17). ACM, New York, NY, USA, 117-130.
ProcessState
Monitor
Modifier
SignalHandling
Processexecution
LetGo
OS
Application Running Time # of dyn inst Size of DDGLulesh (DOE mini-app) 40s 16 Billions 60 Billions of Edges
5.Key Observations
§Predicting system: collecting program state andanalyzing DDGs
§The core workflow of the predictor: