Enhancing The Fault-Tolerance of Nonmasking Programs Sandeep S. Kulkarni and Ali Ebnenasir Software Engineering and Network Systems Laboratory Computer Science and Engineering Department Michigan State University

Post on 20-Dec-2015


Enhancing The Fault-Tolerance of Nonmasking Programs

Sandeep S. Kulkarni and Ali Ebnenasir

Software Engineering and Network Systems Laboratory

Computer Science and Engineering Department

Michigan State University


Acknowledgement

This work is partially sponsored by: NSF, DARPA NEST, ONR URI, and Michigan State University

Motivation

• Programs are subject to unanticipated faults
  • Encounter new classes of faults, add corresponding fault-tolerance
• How to add fault-tolerance?
  • Develop from scratch (expensive approach)
  • Incrementally add fault-tolerance
    • Reuse of the behaviors of the fault-intolerant program
    • Potential to preserve properties that are hard to specify (e.g., efficiency)
• How to ensure correctness?
  • After-the-fact verification
  • Automatic addition of fault-tolerance (correct by construction)

Motivation (Continued)

• Problem: Complexity of automatic addition
  • Automatic addition of fault-tolerance to distributed programs is NP-hard [FTRTFT00], [ICDCS02]
• How do we deal with this complexity?
  • Develop heuristics
  • Identify the boundary of polynomial-time addition
  • Step-wise addition (weaker forms of fault-tolerance)
• The goal of this paper
  • Enhance the fault-tolerance of nonmasking programs
  • Partial automation of the design of fault-tolerant programs


Outline

Preliminary Concepts

Enhancement Problem

Enhancement in High Atomicity Model

Enhancement for Distributed Programs

Example: Byzantine Agreement Program

Conclusion and Future Work

Preliminary Concepts: Programs and Faults

• Finite state space Sp
• Invariant S, fault-span T ⊆ Sp
• Program p, Fault f, Safety ⊆ { (s0, s1) | (s0, s1) ∈ Sp × Sp }
• Fault-tolerance
  • Failsafe, Nonmasking, Masking

[Figure: invariant S inside fault-span T inside state space Sp, with program transitions p, fault transitions f, and p/f (the program in the presence of faults).]

Step-Wise Addition

[Figure: step-wise addition of fault-tolerance, relating the intolerant program to the failsafe, nonmasking, and masking fault-tolerant programs. The step from nonmasking to masking is labeled "This paper"; the other labels are [FTRTFT00] and [ICDCS02].]

Enhancement Problem

• Input to the synthesis algorithm: nonmasking program p, specification Spec, invariant S, faults f
• Output: masking program p', invariant S', fault-span T'
• Requirements: Only fault-tolerance is added; no new functional behavior is added

[Figure: S, T, T' and fault transitions f within the state space Sp, annotated S' = T'.]


Enhancement in High Atomicity Model

Enhancement in High Atomicity Model

• High Atomicity Model
  • Each process can read/write all program variables
• ms: States from where safety will be violated by fault transitions

[Figure: invariant S inside fault-span T, with the set ms and fault transitions f.]

Enhancement in High Atomicity Model (Continued)

• Deadlock states appear due to removing some transitions
• Find a state predicate T' such that:
  • T' is closed in the computations of the program in the presence of faults
  • The specification is satisfied from every state of T' (i.e., no deadlocks)
• Construct p' such that for every (s0, s1) ∈ p':
  • (s0, s1) does not violate safety
  • s0 ∈ T' ⇒ s1 ∈ T'

[Figure: T' and S' inside T and S, with ms excluded.]
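The construction described on this slide can be sketched as a pair of fixpoint computations. This is only an illustration under assumptions that are not in the slides: states are hashable values, p and f are sets of (s0, s1) pairs, `violates_safety` is a hypothetical predicate on transitions, and a state counts as a deadlock only if it had an outgoing transition in p and has lost all of them. The function names are invented for the example and are not the authors' ConstructFaultSpan.

```python
def compute_ms(faults, violates_safety):
    """States from which fault transitions alone can lead to a safety violation."""
    ms = {s0 for (s0, s1) in faults if violates_safety((s0, s1))}
    changed = True
    while changed:  # backward closure over fault transitions
        changed = False
        for (s0, s1) in faults:
            if s1 in ms and s0 not in ms:
                ms.add(s0)
                changed = True
    return ms


def construct_fault_span(T, program, faults, violates_safety):
    """Return (T', transitions of p'); T' is empty if no candidate survives."""
    ms = compute_ms(faults, violates_safety)
    had_moves = {a for (a, _) in program}  # states that could move in p
    # Drop program transitions that violate safety or reach ms.
    kept = {(a, b) for (a, b) in program
            if not violates_safety((a, b)) and b not in ms}
    span = set(T) - ms
    while True:
        # Keep only transitions that start and end inside the candidate span.
        kept = {(a, b) for (a, b) in kept if a in span and b in span}
        # States from which a fault leaves the span break closure.
        escaping = {a for (a, b) in faults if a in span and b not in span}
        # States that lost every outgoing transition are deadlocks.
        can_move = {a for (a, _) in kept}
        deadlocked = {s for s in span if s in had_moves and s not in can_move}
        if not escaping and not deadlocked:
            return span, kept
        span -= escaping | deadlocked
```

An empty first component corresponds to the case where the sketch finds no masking program inside T.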

Enhancement

HighAtomicityEnhancement (p, f: transitions, T: StatePredicate, specification spec) {
  1. Calculate ms; Calculate mt;
  2. T' = ConstructFaultSpan( );
  3. if ( T' = {} ) declare no masking f-tolerant program exists; exit;
     else Construct the transitions of p';
}

Addition

AddMasking (p, f: transitions, S: StatePredicate, specification spec) {
  1. Calculate ms; Calculate mt;
  2. . . .
  3. . . .
  4. repeat
       4-1) . . .
       4-2) . . .
       4-3) T := ConstructFaultSpan( );
       4-4) . . .
       4-5) if (S = {} \/ T = {}) declare no masking f-tolerant program exists; exit;
     until (ExitConditionHolds);
  5. Remove cycles outside the invariant in T;
  6. Construct the transitions of p';
}

[Figure: partial automation. Fault-intolerant program to nonmasking program (manual), then nonmasking program to masking program (automatic: enhancement). The label [FTRTFT00] also appears in the figure.]


Enhancement For Distributed Programs

Difficulties with Distribution

• Read/Write restrictions (low atomicity model)
• A program p
  • Two processes j, k
  • Two Boolean variables a and b
  • Process j cannot read b
• Can we include the following transition?
    (a=0, b=0) → (a=1, b=0)
  Only if we include the transition
    (a=0, b=1) → (a=1, b=1)
• Groups of transitions (instead of individual transitions) must be chosen.
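The grouping effect in this example can be reproduced with a small Python sketch. It assumes, beyond what the slide states, that states are dictionaries from variable names to values and that a process cannot write a variable it cannot read. The helper `transition_group` and its parameters are hypothetical names chosen for the illustration.

```python
from itertools import product


def transition_group(s0, s1, unreadable, domains):
    """All transitions a process cannot tell apart from (s0, s1).

    unreadable: names of variables the process cannot read.
    domains: maps each variable name to its possible values.
    """
    # Assumption: an unreadable variable is also unwritable, so it stays fixed.
    if any(s0[v] != s1[v] for v in unreadable):
        raise ValueError("process would write a variable it cannot read")
    names = sorted(unreadable)
    group = []
    # Every valuation of the hidden variables yields one member of the group.
    for values in product(*(domains[v] for v in names)):
        hidden = dict(zip(names, values))
        group.append(({**s0, **hidden}, {**s1, **hidden}))
    return group
```

For process j with `unreadable={"b"}`, the transition from (a=0, b=0) to (a=1, b=0) comes back together with the one from (a=0, b=1) to (a=1, b=1), so both must be included or excluded together.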

Enhancement of Nonmasking Distributed Programs

1. Start.
2. Calculate T'high.
3. Calculate S'init = S'low.
4. Calculate Sreachable from S'low by fault/program transitions.
5. If Sreachable = {}: set T' = S'low, calculate p' transitions, and stop.
6. Otherwise, calculate Srecovery from where recovery is possible to S'low (search in T'high - S'low, under distribution restrictions).
7. If Srecovery = {}: declare failure.
8. Otherwise, set S'low = S'low ∪ Srecovery and return to step 4.
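The loop in this flowchart can be sketched in Python as follows. The sketch is deliberately simplified and rests on assumptions that are not in the slides: it ignores the read/write (grouping) restrictions entirely, takes T'high and S'init as already computed sets, assumes T'high is closed under fault transitions, and treats "recovery is possible" as "some program transition leads directly into S'low". The name `enhance_low_atomicity` is hypothetical.

```python
def enhance_low_atomicity(t_high, s_init, program, faults):
    """Return (T', transitions of p'), or None to signal failure."""
    program, faults = set(program), set(faults)
    s_low = set(s_init)
    while True:
        # States outside S'low reached in one fault/program step.
        reachable = {b for (a, b) in program | faults
                     if a in s_low and b in t_high and b not in s_low}
        if not reachable:
            # Nothing leaves S'low: take T' = S'low and keep transitions inside it.
            kept = {(a, b) for (a, b) in program if a in s_low and b in s_low}
            return s_low, kept
        # Reachable states with a single-step program transition back into S'low.
        recovery = {a for (a, b) in program if a in reachable and b in s_low}
        if not recovery:
            return None  # declare failure
        s_low |= recovery
```

Each iteration either stops or strictly enlarges S'low within the finite set T'high, so the loop terminates.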

A High Atomicity Fault-Span

• The largest possible domain for the states that can be included in the fault-span of the distributed program
• S'high = S ∩ T'high

[Figure: T'high inside T with ms excluded, and S'high inside S.]

The Initial Low Atomicity Invariant

• Remove states from where an outgoing transition crosses the boundary of S'high (e.g., s0)
• Removal is a non-deterministic choice when there is more than one state to remove

[Figure: S'init inside S'high inside T'high, with the removed state s0.]

Single-Step Reachable States

• Reachable by a fault/program transition (denoted Sreachable)

[Figure: S'init and S'low inside T'high, with states s0, s1, s2, s3, a fault transition f, and the set Sreachable.]

Single-Step Recovery States

• Safe recovery in a single step (denoted Srecovery)
• Goal: infinite computations are possible from all states in S'low
• s0 represents a typical recovery state

[Figure: S'init and S'low inside T'high, with states s0, s2, s3 and the set Srecovery.]


Example: Byzantine Agreement

• Why this example?
  • Was used to illustrate the addition of masking fault-tolerance in [SRDS01]
  • Manual enhancement has already been applied [TSE98]
• Processes: General, g, and three non-generals j, k, and l
• Variables
  • d.g : {0, 1}
  • d.j, d.k, d.l : {0, 1, ┴}
  • b.g, b.j, b.k, b.l : {0, 1}
  • f.j, f.k, f.l : {0, 1}
• Safety Specification:
  • Agreement: No two non-Byzantine non-generals can finalize with different decisions
  • Validity: If g is not Byzantine, no process can finalize with a decision different from that of g
  • A finalized process should not execute any transition

[Figure: general g connected to non-generals j, k, and l.]

Example: Byzantine Agreement

• Read/Write restrictions
  • Readable variables for process j: b.j, d.j, f.j, d.g, d.k, d.l
  • Process j can write d.j, f.j
• Dijkstra's guarded commands
  • Guard → Statement
  • { (s0, s1) | Guard holds at s0 and atomic execution of Statement yields s1 }
• Nonmasking fault-tolerant program transitions
    d.j = ┴ ∧ f.j = 0 → d.j := d.g
    d.j ≠ ┴ ∧ f.j = 0 → f.j := 1
    d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
    d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
• Fault transitions
    ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
    b.j → d.j := 0|1
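The guarded-command semantics on this slide, where a command denotes the set of pairs (s0, s1) such that the guard holds at s0 and the statement yields s1, can be illustrated with a short Python sketch of process j's nonmasking actions. The encoding is an assumption made for the example: states are dictionaries keyed by variable name, `BOT` stands for ┴, and `ACTIONS_J` and `successors` are hypothetical names.

```python
BOT = None  # stands for the undecided value ┴

# Each action is a (guard, statement) pair over a state dictionary.
ACTIONS_J = [
    # d.j = ┴ and f.j = 0  ->  d.j := d.g
    (lambda s: s["d.j"] is BOT and s["f.j"] == 0,
     lambda s: {**s, "d.j": s["d.g"]}),
    # d.j != ┴ and f.j = 0  ->  f.j := 1
    (lambda s: s["d.j"] is not BOT and s["f.j"] == 0,
     lambda s: {**s, "f.j": 1}),
    # d.j = 1 and d.k = 0 and d.l = 0  ->  d.j := 0
    (lambda s: s["d.j"] == 1 and s["d.k"] == 0 and s["d.l"] == 0,
     lambda s: {**s, "d.j": 0}),
    # d.j = 0 and d.k = 1 and d.l = 1  ->  d.j := 1
    (lambda s: s["d.j"] == 0 and s["d.k"] == 1 and s["d.l"] == 1,
     lambda s: {**s, "d.j": 1}),
]


def successors(state, actions):
    """States s1 such that (state, s1) belongs to some enabled action."""
    return [statement(state) for guard, statement in actions if guard(state)]
```

Calling `successors` on every state of a finite state space enumerates the transition set of process j, which is the form the synthesis steps operate on.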

Example: Byzantine Agreement (Continued)

[Figure: state-transition diagram with states labeled S0 to S4. The state descriptions shown are:
  d.j = d.k = ┴, d.g = 1, d.l = 1, f.l = 0
  d.j = d.k = ┴, d.g = 1, d.l = 1, f.l = 1
  d.j = d.k = ┴, d.g = 0, d.l = 1, f.l = 1 (shown twice)
  S4: d.j = d.k = 0, d.g = 0, d.l = 1, f.l = 1
Annotations: "A good transition inside the invariant", "Premature finalization", "Fault transition", "b.g = 1", "A deadlock state".]

• Why is enhancement easier?

Example: Byzantine Agreement (Continued)

• Masking fault-tolerant program: the nonmasking actions
    d.j = ┴ ∧ f.j = 0 → d.j := d.g
    d.j ≠ ┴ ∧ f.j = 0 → f.j := 1
    d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
    d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
  with the following conjuncts added to their guards, as shown on the slide:
    ((d.j = d.k) ∨ (d.j = d.l))
    (f.j = 0)
    (f.j = 0)
• High atomicity reasoning
  • Synthesize a masking program in high atomicity and then refine it to a distributed program


Enhancement vs. Addition

Reuse the computations of the nonmasking program

Reasoning in high atomicity model has the potential to reduce the complexity of addition

Synthesis Framework

Development of a synthesis framework

Developers of fault-tolerance can interactively add fault-tolerance to fault-intolerant programs

Partial automation helps us to reap the benefits of automation as much as possible

Enhancement identifies programs where partial automation is possible

Implementation of enhancement algorithms in the synthesis framework

http://www.cse.msu.edu/~sandeep/software/Code/synthesis-framework/

Conclusion and Future Work

• Enhancement simplifies automated design of masking programs
  • Lower asymptotic complexity
  • Polynomial-time enhancement in the low atomicity model (in the state space of the nonmasking program)
  • Sound, but not complete
• Reasoning in high atomicity simplifies the synthesis of masking distributed programs
• Future Work: A polynomial-time sound and complete enhancement algorithm for a restricted class of programs and specifications


Thank You!

Questions?

Example: Triple Modular Redundancy

• Processes: Three processes: j, k, and l
• Variables and their domains
  • in.j, in.k, and in.l are Boolean variables
  • out belongs to { 0, 1, ┴ }
• Nonmasking program (+ addition in modulo 3):
    N1: (out = ┴) → out := in.j
    N2: (out != ┴) /\ (out != in.j) /\ ((in.j = in.k) \/ (in.j = in.l)) → out := in.j
• Faults:
    F: (in.j = in.k) /\ (in.j = in.l) → in.j := 0|1
• Safety specification:
  • Do not reach states where out is different than the majority of inputs.
  • out should not be changed after it is assigned a value.

Example: Triple Modular Redundancy

• Invariant:
    S = ((out = ┴) /\ (in.j = in.k = in.l)) \/ (out = in.j = in.k) \/ (out = in.j = in.l) \/ (out = in.k = in.l)
• Fault-span:
    T = ( (in.j = in.k = in.l) => ((out = ┴) \/ (out = in.j = in.k = in.l)) )
• Enhancement algorithm:
  • Compute ms: ms = { }
  • Remove bad transitions: {t: t violates safety} and {t: t reaches ms}
  • Construct a new fault-span T':
      T' = T - { s: (out != ┴) /\ (out is not equal to majority of inputs) }
• Masking program:
    M1: (out = ┴) /\ ((in.j = in.k) \/ (in.j = in.l)) → out := in.j
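A brute-force check over all eight input combinations shows why the strengthened guard of M1 matters: whenever M1 is enabled, in.j agrees with at least one other input and is therefore the majority value, whereas N1 can copy a corrupted in.j. The sketch below assumes an encoding that is not in the slides: inputs are 0/1 integers, `BOT` stands for ┴, and all function names are hypothetical.

```python
from itertools import product

BOT = None  # stands for the unassigned value ┴


def majority(a, b, c):
    """Majority value of three 0/1 inputs."""
    return 1 if a + b + c >= 2 else 0


def m1_enabled(out, in_j, in_k, in_l):
    """Guard of M1: out is unassigned and in.j agrees with another input."""
    return out is BOT and (in_j == in_k or in_j == in_l)


def m1_is_safe():
    """True if M1 never assigns a non-majority value to out."""
    return all(in_j == majority(in_j, in_k, in_l)
               for in_j, in_k, in_l in product((0, 1), repeat=3)
               if m1_enabled(BOT, in_j, in_k, in_l))


def n1_can_violate():
    """True if N1 (guard: out is unassigned) can assign a non-majority value."""
    return any(in_j != majority(in_j, in_k, in_l)
               for in_j, in_k, in_l in product((0, 1), repeat=3))
```

The check covers only the assignment made by process j; it does not model the fault action F or the other processes.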


Enhancement of Nonmasking Distributed Programs

[Flowchart: the same enhancement flowchart as on the earlier slide with this title.]

• S'init = S'low at the first iteration
