bo fang, karthik pattabiraman, mateiripeanumatei/papers/ipdps.poster.2018.pdf · letgo: a...

Post on 20-May-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

RESEARCH POSTER PRESENTATION DESIGN © 2015

www.PosterPresentations.com

1.Motivation

2.LetGo – ARoll-ForwardC/R 4.ApproachandChallenges

§Roll-forward recovery may introduce SDCs§How likely?

§ Current evaluation approach: statistical faultinjections

§ Problems:§ Time consuming§ No indication for a particular repair§ No predictive power

§ Our goal is to efficiently predict how likely aparticular repair would lead to a SDC

3.Goal

6.OurDesign

7. LessonLearnedandPreliminaryResults

Netsyslab@UBC: http://netsyslab.ece.ubc.ca/ Bo Fang: bof@ece.ubc.caDependableSystemslab@UBC: http://blogs.ubc.ca/karthik/

BoFang,Karthik Pattabiraman,Matei RipeanuUniversityofBritishColumbia

PredictingtheImpactofRoll-ForwardRecoveryforHPCApplications

Crash,22%

SDC, 2%

Detected,2%

Benign, 74%

WithLetGo

§ LetGo converts 62% crashes into continuedexecution, with 1% increase in SDC rate and 1%increase in detected rate

§ Large scale systems are prone to experience transienthardware faults

§ Main causes include:

§ Outcomes:§ Silent data corruption (undetected incorrect output)§ Fail-stop failure (terminate unexpectedly)

§ Characteristics of recovery techniques:

§ Approximate recover may introduce new SDCs !

Cosmic ray Particle strike Voltage fluctuation

Roll-back Roll-forwardOverhead T(checkpoints)+T(redo) T(to next state)+T(checkpoints)?RecoveredState Precise Precise/Approximate

§ Attempts to avoid reload from checkpoints [1] § System design

§ Evaluation results with fault injections:

Crash,56%

SDC, 1% Detected*,1%

Benign, 41%

WithoutLetGo

Program stateSDCCrashBenign

SoftwareFault injection

Repeated process to achieve statistical significance

§General idea: tracing the dependent state of theapproximate data

§Approach: building dynamic data dependence graph

§Challenge i): State exploration§Example:

§HPC applications: orders of magnitude bigger§Challenge ii): No “ground truth”: approximate

recovery implies uncertainty

whi lea = b + c; d = mem[a ] ; i f d > e mem[ d ] = f ; d = c;

else:h = d + g;

r3(addr1)

r1 r2

r4 r5

r4(addr2)

r6

add

load

store

mov

r2’ r1’

r3’(addr3)

r6’

add

load

store

r4’(addr4)

r8

cmp

r4’ r5’

r8’

cmp

r4* r5*

r4*(addrX)

r6*store

mov

r2** r1**

r3**(addrY)

r6’

add

load

store

r4**(addrZ)

r8*

cmp

r4** r5**

r8**

cmp

source code A slice of DDG Affected state

* Detected means that the application correctness check catches the data corruption

§ I. Many HPC applications have repetitive behaviors§Control flow divergence detection through

profiling§ II. Data convergence

§Fault masking memory access patterns

Constructprofiled DDG(s)andaffected

DDG

Matchprofiled

DDG(s) andaffected DDG

ExaminetheDDGs withheuristics

High SDC-proneness

Low SDC-proneness

Check for memoryaccess pattern

Check for messagepassing semantics

Ranking(2-page write-up)

(Website for Project LetGohttp://netsyslab.ece.ubc.ca/wiki/index.php/LetGo)

§Data corruptions across multiple MPI processes§Checking if the repair affects messagepassing is promising

§Size of the affected memory locations:§Repair&SDC contain more memory writesthan repair&benign

§Match affected DDG and profiled DDG§More configurations for profilingsize/frequency are needed

[1] Bo Fang, Qiang Guan, Nathan Debardeleben, Karthik Pattabiraman, and Matei Ripeanu. LetGo: A Lightweight ContinuousFramework for HPC Applications Under Failures. In Proceedings of the 26th International Symposium on High-PerformanceParallel and Distributed Computing (HPDC '17). ACM, New York, NY, USA, 117-130.

ProcessState

Monitor

Modifier

SignalHandling

Processexecution

LetGo

OS

Application Running Time # of dyn inst Size of DDGLulesh (DOE mini-app) 40s 16 Billions 60 Billions of Edges

5.Key Observations

§Predicting system: collecting program state andanalyzing DDGs

§The core workflow of the predictor:

top related