Bo Fang, Karthik Pattabiraman, Matei Ripeanu · matei/papers/ipdps.poster.2018.pdf · LetGo: A...



Post on 20-May-2020




1. Motivation

2. LetGo – A Roll-Forward C/R

4. Approach and Challenges

§ Roll-forward recovery may introduce SDCs (silent data corruptions)
  § How likely?
§ Current evaluation approach: statistical fault injections
§ Problems:
  § Time consuming
  § No indication for a particular repair
  § No predictive power
§ Our goal is to efficiently predict how likely a particular repair is to lead to an SDC

3. Goal

6. Our Design

7. Lessons Learned and Preliminary Results

Netsyslab@UBC: http://netsyslab.ece.ubc.ca/
Bo Fang: [email protected]
DependableSystemslab@UBC: http://blogs.ubc.ca/karthik/

Bo Fang, Karthik Pattabiraman, Matei Ripeanu
University of British Columbia

Predicting the Impact of Roll-Forward Recovery for HPC Applications

[Pie chart, with LetGo: Crash 22%, SDC 2%, Detected 2%, Benign 74%]

§ LetGo converts 62% of crashes into continued execution, with a 1% increase in the SDC rate and a 1% increase in the detected rate

§ Large-scale systems are prone to transient hardware faults
§ Main causes include: cosmic rays, particle strikes, voltage fluctuation
§ Outcomes:
  § Silent data corruption (undetected incorrect output)
  § Fail-stop failure (unexpected termination)
§ Characteristics of recovery techniques:

                     Roll-back                  Roll-forward
  Overhead           T(checkpoints)+T(redo)     T(to next state)+T(checkpoints)?
  Recovered state    Precise                    Precise/Approximate

§ Approximate recovery may introduce new SDCs!

§ Attempts to avoid reloading from checkpoints [1]
§ System design
§ Evaluation results with fault injections:

[Pie chart, without LetGo: Crash 56%, SDC 1%, Detected* 1%, Benign 41%]
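As a rough consistency check on the two pie charts, the drop in crash rate can be recomputed from the rounded percentages shown. This is only a sketch: the dictionary and function names are illustrative, and the poster's 62% figure presumably comes from unrounded data, so a small mismatch is expected.

```python
# Outcome rates (percent) as read off the two pie charts on the poster.
without_letgo = {"crash": 56, "sdc": 1, "detected": 1, "benign": 41}
with_letgo = {"crash": 22, "sdc": 2, "detected": 2, "benign": 74}


def crash_conversion(before, after):
    """Fraction of the original crashes that no longer end in a crash."""
    return (before["crash"] - after["crash"]) / before["crash"]


def delta(before, after, outcome):
    """Change in one outcome's rate, in percentage points."""
    return after[outcome] - before[outcome]


conversion = crash_conversion(without_letgo, with_letgo)
print(f"crashes converted: {conversion:.1%}")
print("SDC change (points):", delta(without_letgo, with_letgo, "sdc"))
print("detected change (points):", delta(without_letgo, with_letgo, "detected"))
```

With the rounded chart values the conversion comes out at roughly 61%, close to the reported 62%, and the SDC and detected rates each rise by one percentage point.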

[Figure: software fault injection into the program state, with outcomes SDC, Crash, or Benign. The process is repeated to achieve statistical significance.]
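The repeated injection process can be illustrated with a toy campaign. Everything below is an assumption made for illustration, not the authors' tooling: the "application" is a small gather-and-sum kernel, a fault is a single bit flip in one index value, and an out-of-bounds access stands in for a crash.

```python
import math
import random
from collections import Counter

DATA = [i // 4 for i in range(16)]   # repeated values let some faults be masked
INDEX = list(range(16))


def kernel(data, index):
    """Toy stand-in for an application: gather through an index array and sum."""
    return sum(data[i] for i in index)


GOLDEN = kernel(DATA, INDEX)         # fault-free reference output


def inject_once(rng):
    """Flip one random bit in one random index entry and classify the outcome."""
    faulty = list(INDEX)
    pos = rng.randrange(len(faulty))
    faulty[pos] ^= 1 << rng.randrange(8)
    try:
        out = kernel(DATA, faulty)
    except IndexError:               # out-of-bounds access plays the role of a crash
        return "crash"
    return "benign" if out == GOLDEN else "sdc"


def campaign(trials, seed=0):
    """Run `trials` independent injections and count the outcomes."""
    rng = random.Random(seed)
    return Counter(inject_once(rng) for _ in range(trials))


def rate_with_ci(count, trials, z=1.96):
    """Point estimate and normal-approximation 95% half-width for one outcome."""
    p = count / trials
    return p, z * math.sqrt(p * (1 - p) / trials)


if __name__ == "__main__":
    n = 2000
    counts = campaign(n)
    for outcome in ("crash", "sdc", "benign"):
        p, half = rate_with_ci(counts[outcome], n)
        print(f"{outcome:7s} {p:6.1%} +/- {half:.1%}")
```

The confidence half-width shrinks only with the square root of the number of trials, which is one way to see why such campaigns are time consuming, and the resulting rates describe the whole campaign rather than any particular repair.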

§ General idea: tracing the dependent state of the approximate data
§ Approach: building a dynamic data dependence graph (DDG)
§ Challenge i): State exploration
  § Example:
  § HPC applications: orders of magnitude bigger
§ Challenge ii): No “ground truth”: approximate recovery implies uncertainty

    while
        a = b + c;
        d = mem[a];
        if d > e
            mem[d] = f;
            d = c;
        else:
            h = d + g;

[Figure: three panels labeled "source code", "A slice of DDG", and "Affected state". The DDG slice connects register and memory nodes (r1 to r8, addr1 to addr4) through add, load, store, mov, and cmp operations across two loop iterations. The affected-state panel marks affected nodes with * and ** (for example r4*(addrX), r3**(addrY), r4**(addrZ)).]
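The idea of tracing the state that depends on approximate data can be sketched on one iteration of the example loop. The trace format, the register assignment, and the helper names below are assumptions for illustration; they are not taken from the authors' implementation.

```python
from collections import defaultdict

# One iteration of the example loop as a hypothetical dynamic trace:
# (opcode, destination location, source locations).
TRACE = [
    ("add",   "r3",         ["r1", "r2"]),          # a = b + c
    ("load",  "r4",         ["r3", "mem[addr1]"]),  # d = mem[a]
    ("cmp",   "r8",         ["r4", "r5"]),          # d > e
    ("store", "mem[addr2]", ["r4", "r6"]),          # mem[d] = f
    ("mov",   "r4",         ["r2"]),                # d = c
]


def build_ddg(trace):
    """Map each producer to the set of dynamic instructions that consume it.

    Instructions are identified by their position in the trace; values that
    were never written inside the trace appear as "in:<location>" nodes.
    """
    last_writer = {}
    edges = defaultdict(set)
    for i, (_op, dst, srcs) in enumerate(trace):
        for src in srcs:
            producer = last_writer.get(src, "in:" + src)
            edges[producer].add(i)
        last_writer[dst] = i
    return edges


def forward_slice(edges, start):
    """All instructions that transitively depend on the value produced at `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    ddg = build_ddg(TRACE)
    # Suppose a repair approximated the value loaded by instruction 1.
    affected = forward_slice(ddg, 1)
    print(sorted(affected), [TRACE[i][0] for i in sorted(affected)])
```

In this sketch the final mov is not in the affected set, because it overwrites r4 from a source that does not depend on the approximated value.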

* Detected means that the application correctness check catches the data corruption

§ I. Many HPC applications have repetitive behaviors
  § Control flow divergence detection through profiling
§ II. Data convergence
  § Fault masking memory access patterns
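Both observations can be made concrete with small checks. The sketch below assumes two hypothetical inputs that the poster does not specify: a recorded sequence of branch outcomes per run, and an ordered list of memory accesses observed after a repair. It also assumes one reading of "fault masking": a corrupted location that is overwritten before it is read cannot propagate.

```python
def first_divergence(profiled_branches, observed_branches):
    """Index of the first branch outcome that differs from the profile, or None."""
    for i, (p, o) in enumerate(zip(profiled_branches, observed_branches)):
        if p != o:
            return i
    if len(profiled_branches) != len(observed_branches):
        # One run stopped early or ran longer than the profiled one.
        return min(len(profiled_branches), len(observed_branches))
    return None


def classify_corruption(accesses, corrupted_addr):
    """Classify a corrupted location by its first access after the repair.

    accesses: ordered list of ("read" | "write", address) pairs.
    Returns "masked" if the location is overwritten before being read,
    "propagates" if it is read first, and "dead" if it is never touched.
    """
    for kind, addr in accesses:
        if addr != corrupted_addr:
            continue
        return "masked" if kind == "write" else "propagates"
    return "dead"


if __name__ == "__main__":
    profile = ["T", "T", "F", "T"]
    print(first_divergence(profile, ["T", "T", "T", "T"]))
    accesses = [("read", 0x10), ("write", 0x20), ("read", 0x20)]
    print(classify_corruption(accesses, 0x20))
    print(classify_corruption(accesses, 0x10))
```

A read-modify-write shows up as a read first in such a trace, so it is classified as propagating rather than masked.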

[Flowchart: core workflow of the predictor]
§ Construct profiled DDG(s) and affected DDG
§ Match profiled DDG(s) and affected DDG
§ Examine the DDGs with heuristics
  § Check for memory access pattern
  § Check for message passing semantics
§ Ranking: from high SDC-proneness to low SDC-proneness

(2-page write-up)

(Website for Project LetGo: http://netsyslab.ece.ubc.ca/wiki/index.php/LetGo)

§ Data corruptions across multiple MPI processes
  § Checking whether the repair affects message passing is promising
§ Size of the affected memory locations:
  § Repair&SDC cases contain more memory writes than repair&benign cases
§ Match affected DDG and profiled DDG
  § More configurations for profiling size/frequency are needed

[1] Bo Fang, Qiang Guan, Nathan Debardeleben, Karthik Pattabiraman, and Matei Ripeanu. LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). ACM, New York, NY, USA, 117-130.

[Figure: LetGo system design, with the labels Process State, Monitor, Modifier, Signal Handling, Process execution, LetGo, and OS]
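The roll-forward idea can be shown on a toy register machine. This is a simplified model under stated assumptions, not LetGo itself: the instruction set, the `SegFault` exception standing in for an OS signal, and the repair rule (fill the destination with 0 and continue at the next instruction) are all chosen here for illustration.

```python
class SegFault(Exception):
    """Stands in for the OS signal raised on an invalid memory access."""


def run(program, memory, roll_forward=True):
    """Execute a tiny register program; return (registers, repaired_pcs)."""
    regs, repaired = {}, []
    for pc, (op, dst, *args) in enumerate(program):
        try:
            if op == "li":                        # load immediate
                regs[dst] = args[0]
            elif op == "load":                    # dst = memory[regs[address register]]
                addr = regs[args[0]]
                if not 0 <= addr < len(memory):
                    raise SegFault(f"bad address {addr} at pc={pc}")
                regs[dst] = memory[addr]
            elif op == "add":
                regs[dst] = regs[args[0]] + regs[args[1]]
            else:
                raise ValueError(f"unknown opcode {op!r}")
        except SegFault:
            if not roll_forward:
                raise                             # fail-stop: the run ends here
            regs[dst] = 0                         # assumed repair: approximate value
            repaired.append(pc)                   # execution continues at pc + 1
    return regs, repaired


PROGRAM = [
    ("li", "r1", 99),          # corrupted address (memory has only 4 cells)
    ("load", "r2", "r1"),      # faults without roll-forward
    ("li", "r3", 5),
    ("add", "r4", "r2", "r3"),
]
MEMORY = [10, 20, 30, 40]

if __name__ == "__main__":
    regs, repaired = run(PROGRAM, MEMORY)
    print(regs["r4"], repaired)
```

The run completes instead of crashing, but r4 is computed from an approximated r2, which is how a repair of this kind can turn a crash into either a benign outcome or an SDC.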

  Application              Running time   # of dyn. inst.   Size of DDG
  Lulesh (DOE mini-app)    40 s           16 billion        60 billion edges
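A quick back-of-the-envelope calculation shows why state exploration is a challenge at this scale. The three measured values come from the table; the 16 bytes per edge is purely an assumption (two 8-byte node identifiers) used to get an order of magnitude.

```python
DYNAMIC_INSTRUCTIONS = 16e9   # from the table
DDG_EDGES = 60e9              # from the table
RUNTIME_SECONDS = 40          # from the table
BYTES_PER_EDGE = 16           # assumption: two 8-byte node ids per edge

edges_per_instruction = DDG_EDGES / DYNAMIC_INSTRUCTIONS
edges_per_second = DDG_EDGES / RUNTIME_SECONDS
edge_list_gib = DDG_EDGES * BYTES_PER_EDGE / 2**30

print(f"edges per dynamic instruction: {edges_per_instruction:.2f}")
print(f"edges generated per second of run time: {edges_per_second:.2e}")
print(f"naive edge-list size: {edge_list_gib:.0f} GiB")
```

Under these assumptions a 40-second run yields about 3.75 edges per dynamic instruction and a naive edge list on the order of 900 GiB.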

5. Key Observations

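§ Predicting system: collecting program state and analyzing DDGs
§ The core workflow of the predictor: construct the profiled and affected DDGs, match them, examine them with heuristics, and rank repairs by SDC-proneness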
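The workflow can be sketched end to end on summarized DDGs. This is a minimal illustration under several assumptions that do not come from the poster: each DDG is reduced to a count of (producer opcode, consumer opcode) edges, matching uses a multiset Jaccard similarity, and the scoring weights are arbitrary placeholders.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Repair:
    name: str
    affected_edges: Counter    # (producer_op, consumer_op) -> count
    affected_writes: int       # memory writes that depend on the repaired value
    reaches_message: bool      # affected data is sent to another process


def similarity(a, b):
    """Multiset Jaccard similarity between two edge-count summaries."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def best_match(profiled, affected):
    """Similarity of the affected DDG to its closest profiled DDG."""
    return max(similarity(p, affected) for p in profiled)


def sdc_proneness(repair, profiled):
    """Heuristic score; higher means more SDC-prone. Weights are placeholders."""
    score = 1.0 - best_match(profiled, repair.affected_edges)
    score += 0.1 * repair.affected_writes
    if repair.reaches_message:
        score += 1.0
    return score


def rank(repairs, profiled):
    """Order repairs from most to least SDC-prone."""
    return sorted(repairs, key=lambda r: sdc_proneness(r, profiled), reverse=True)


if __name__ == "__main__":
    profiled = [Counter({("add", "load"): 1, ("load", "cmp"): 1, ("load", "store"): 1})]
    repair_a = Repair("repair A", Counter(profiled[0]), 1, False)
    repair_b = Repair(
        "repair B", Counter({("add", "load"): 1, ("load", "store"): 3}), 3, True
    )
    for r in rank([repair_a, repair_b], profiled):
        print(r.name, round(sdc_proneness(r, profiled), 2))
```

Here the repair whose affected DDG deviates from the profile, writes more memory, and reaches message passing is ranked as more SDC-prone. How such signals should actually be weighted is exactly what the preliminary results leave open.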