Fault Tolerance Mechanisms ITV Model-based Analysis and Design of Embedded Software Techniques and methods for Critical Software Anders P. Ravn Aalborg University August 2011


Page 1

Fault Tolerance Mechanisms

ITV Model-based Analysis and Design of Embedded Software

Techniques and methods for Critical Software

Anders P. Ravn

Aalborg University

August 2011

Page 2

Fault Tolerance

Means to isolate component faults ... and mask them

Prevents system failures

May increase system dependability

Page 3

Fault Tolerance

Page 4

FT - levels

• Full tolerance

• Graceful Degradation

• Fail safe

BW p. 107

Page 5

FT basis: Redundancy

• Time: Try, Retry, Retry, ...

• Space: Try, Try, Try, ...
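The two forms of redundancy can be contrasted in a few lines of Java. This is only a sketch: the class and method names are illustrative and not from the course material, and the replicas are called one after another for brevity, whereas real space redundancy would place them on separate hardware.

```java
import java.util.List;
import java.util.function.Supplier;

/** Sketch: redundancy in time (retry) versus redundancy in space (replicas). */
class RedundancySketch {

    /** Time redundancy: repeat the same attempt up to maxTries times. */
    static <T> T retry(Supplier<T> attempt, int maxTries) {
        RuntimeException last = new IllegalStateException("no attempt made");
        for (int i = 0; i < maxTries; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e; // assume a transient fault and try again
            }
        }
        throw last;
    }

    /** Space redundancy: several replicas of the service; use the first that answers. */
    static <T> T firstOf(List<Supplier<T>> replicas) {
        RuntimeException last = new IllegalStateException("no replicas");
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                last = e; // this replica failed; move on to the next one
            }
        }
        throw last;
    }
}
```

Retrying only helps against transient faults; a permanent fault in the single attempt needs a replica that does not share it.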

Page 6

Fault Tolerance

Page 7

Basic Strategies

Page 8

Dynamic Redundancy

1. Error detection

2. Damage confinement and assessment

3. Error recovery

4. Fault treatment and continued service

BW p. 114

Page 9

Error Detection

f: State × Input → State × Output

• Environment (exception)

• Application

Assertion:

• precondition (input)

• postcondition (input, output)

• invariant (state, state')

Timing:

• WCET (f, input)

• Deadline (f, input)
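The application-level checks can be sketched in Java as a wrapper around a function f. Everything here is an assumption made for illustration: the class name, the DetectedError exception, the choice of an integer square root as f, and the deadlineNanos budget. The deadline is only checked after f returns, so the sketch detects an overrun but does not prevent it.

```java
import java.util.function.IntUnaryOperator;

/** Sketch of assertion and timing checks around a function f (hypothetical names). */
class CheckedCall {

    /** Signals that one of the checks detected an error. */
    static class DetectedError extends RuntimeException {
        DetectedError(String what) { super(what); }
    }

    /** Calls f as an integer square root and checks its contract and its deadline. */
    static int checkedIsqrt(int input, long deadlineNanos, IntUnaryOperator f) {
        // precondition (input)
        if (input < 0) throw new DetectedError("precondition: input < 0");

        long start = System.nanoTime();
        int output = f.applyAsInt(input);
        long elapsed = System.nanoTime() - start;

        // postcondition (input, output): output^2 <= input < (output + 1)^2
        long lo = (long) output * output;
        long hi = (long) (output + 1) * (output + 1);
        if (output < 0 || lo > input || input >= hi) {
            throw new DetectedError("postcondition violated");
        }

        // timing: deadline (f, input), detected after the fact
        if (elapsed > deadlineNanos) throw new DetectedError("deadline missed");
        return output;
    }
}
```

A call such as `CheckedCall.checkedIsqrt(17, 1_000_000L, x -> (int) Math.sqrt(x))` passes the assertion checks, while a faulty f that returns 5 for input 17 is caught by the postcondition.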

Page 10

Damage Confinement

• Static structure

• Dynamic structure (transaction)

[Figure: two boxes labelled "object"]

Page 11

Error Recovery

• Forward: repair the state, if you can!

• Backward:

  • define recovery points

  • checkpoint state at recovery point

  • roll back

  • retry

Domino effect
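The backward recovery steps (recovery point, checkpoint, roll back, retry) can be sketched for a single sequential process as follows. The class and method names are illustrative assumptions, and the state is simply a list of integers. With one process there is no domino effect; that only arises when several communicating processes roll back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of backward error recovery on a list-valued state (hypothetical names). */
class BackwardRecovery {

    /** Runs op on state; on an exception, rolls back to the checkpoint and retries. */
    static boolean runWithRollback(List<Integer> state, Consumer<List<Integer>> op, int maxTries) {
        List<Integer> checkpoint = new ArrayList<>(state); // checkpoint at the recovery point
        for (int attempt = 0; attempt < maxTries; attempt++) {
            try {
                op.accept(state);
                return true;              // no error detected
            } catch (RuntimeException e) {
                state.clear();            // roll back ...
                state.addAll(checkpoint); // ... to the saved state, then retry
            }
        }
        return false;                     // all tries failed; state equals the checkpoint
    }
}
```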

Page 12

Recovery blocks

ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR

BW p. 120

Page 13

Implementation of Recovery Blocks

Page 14

Abstract class RecoveryBlock

public abstract class RecoveryBlock {

    /** Acceptance test; it must be implemented by the application. */
    abstract boolean acceptanceTest();

    /** Method to produce the result; it must be implemented by the application.
     * @param module 0, ..., MaxBlocks-1 */
    abstract void block(int module);

    /* MaxBlocks must be set by the application to the number of blocks */
    protected int MaxBlocks;


Page 15

RecoveryBlock execution

    /** Method to execute recovery modules 0, 1, ..., MaxBlocks-1 until one succeeds.
     * @throws NoAccept if no module passes acceptanceTest.
     */
    public final void do_it() throws NoAccept, CloneNotSupportedException {
        save();
        int i = 0;
        do {
            try {
                block(i++);
                if (acceptanceTest()) return;
            } catch (Exception e) { /* if the block fails, we continue - not acceptance */ }
            restore(copy);
        } while (i < MaxBlocks);
        throw new NoAccept();
    }
}


Page 16

RecoveryBlock cache

public abstract class RecoveryBlock {

    /** The recovery cache is implemented by a clone of the original object */
    RecoveryBlock copy;

    /** Save object to recovery cache; uses Java clone, which must be a deep clone.
     * Object.clone() is shallow and requires Cloneable, so the application class
     * has to implement Cloneable and override clone() accordingly. */
    private final void save() throws CloneNotSupportedException {
        copy = (RecoveryBlock) this.clone();
    }

    /** Method to restore data from recovery cache; it must be implemented by the application.
     * @param copy value of the object to be restored */
    abstract void restore(RecoveryBlock copy);

Page 17

Application

/** Extends the basic abstract RecoveryBlock with faulty sorting
 * algorithms and logs calls, returns etc. to a TextArea. */
public class RecoveringSort extends RecoveryBlock {

    /** checksum for acceptance test */
    private int checksum;

    /** data to be saved in recovery cache */
    private int[] argument;

    /** log area; the declaration is not shown on the slide and is assumed here */
    private TextArea log;

    public RecoveringSort(TextArea t) {
        MaxBlocks = 3;
        log = t;
    }

Page 18

Acceptance criteria

    /* Acceptance test for sorting; it shall verify:
     * 1) the return value is an ordered list,
     * 2) the return value is a permutation of the initial values */
    boolean acceptanceTest() {
        boolean result = true;
        // check ordering
        int i = argument.length - 1;
        while (i > 0) if (argument[i] < argument[--i]) { result = false; break; }
        // check permutation; this is a partial check through a checksum.
        // A full check is as expensive computationally as sorting,
        // thus we use a partial check.
        i = argument.length; int sum = 0;
        while (i > 0) sum += argument[--i];
        return result && (sum == checksum);
    }

Page 19

Application - modules

    /** Starts sorting using the recovery block mechanism.
     * @param data integer array containing elements to be sorted. */
    public int[] sort(int[] data) {
        argument = (int[]) data.clone(); // copy needed for recovery to work
        checksum = 0; int i = argument.length; while (i > 0) checksum += argument[--i];
        try {
            do_it();
        } catch (NoAccept e) {
            log.append("All blocks failed\n");
        } catch (CloneNotSupportedException e) {
            // do_it() declares this checked exception, so it has to be handled here
            log.append("Recovery cache could not be saved\n");
        }
        return argument;
    }

    void block(int i) {
        switch (i) {
            case 0: BucketSort(argument); break;
            case 1: BadSort(argument); break;
            case 2: AlmostGoodSort(argument); break;
            default:
        }
    }

Page 20

Fault classes (scope of R-B)

• Origin

  • physical (internal/external)

  • logical (design/interaction)

• Kind

  • omission

  • value

  • timing

  • byzantine

• Property

  • duration (permanent, transient)

  • consistency (determinate, nondeterminate)

  • autonomy (spontaneous, event-dependent)

Coverage marks, in slide order (row alignment lost): ++ | (+)++(-) | + / (+) | + / ++ / +

Page 21

The ideal FT-component

[Figure: a component split into a "Normal mode" part and an "Exception handler" part. Labels on its connections, each appearing twice: Request/response, Interface exception, Failure exception]
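The roles of the two exception kinds can be sketched in Java. This is an illustration under stated assumptions, not the course's code: the class, the Sensor interface, the backup sensor used for masking, and the scaledReading operation are all hypothetical. An interface exception tells the caller that its request was invalid; a failure exception tells it that the component could not deliver the service even after its handler tried to recover.

```java
/** Sketch of an idealised fault-tolerant component (all names are hypothetical). */
class IdealComponentSketch {

    /** Raised when the request itself is invalid. */
    static class InterfaceException extends Exception {
        InterfaceException(String message) { super(message); }
    }

    /** Raised when the service cannot be delivered despite recovery. */
    static class FailureException extends Exception {
        FailureException(String message, Throwable cause) { super(message, cause); }
    }

    /** Lower-level component; it follows the same request/response and exception scheme. */
    interface Sensor {
        int read() throws FailureException;
    }

    private final Sensor primary;
    private final Sensor backup;

    IdealComponentSketch(Sensor primary, Sensor backup) {
        this.primary = primary;
        this.backup = backup;
    }

    /** Request/response operation: a sensor reading scaled by factor. */
    int scaledReading(int factor) throws InterfaceException, FailureException {
        if (factor <= 0) throw new InterfaceException("factor must be positive");
        try {
            return factor * primary.read();   // normal mode
        } catch (FailureException e) {
            // exception handler: try to mask the lower-level failure
            try {
                return factor * backup.read();
            } catch (FailureException e2) {
                throw new FailureException("both sensors failed", e2);
            }
        }
    }
}
```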

Page 22

N-version programming

V1 V2 V3

Driver (comparator)

Comparison vectors (votes)

Comparison status indicators

Comparison points
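A driver that collects one vote per version and compares them can be sketched as follows. The names are illustrative assumptions, the versions run sequentially rather than concurrently, and votes are compared for exact equality; results such as floating-point values would need inexact voting instead.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.IntUnaryOperator;

/** Sketch of an N-version driver with majority voting (hypothetical names). */
class NVersionDriver {

    /** Runs every version on input and returns the strict-majority result, if any. */
    static Optional<Integer> vote(List<IntUnaryOperator> versions, int input) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (IntUnaryOperator version : versions) {
            try {
                votes.merge(version.applyAsInt(input), 1, Integer::sum); // one vote per version
            } catch (RuntimeException e) {
                // a version that crashes casts no vote
            }
        }
        for (Map.Entry<Integer, Integer> entry : votes.entrySet()) {
            if (2 * entry.getValue() > versions.size()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty(); // no majority: the driver has to signal a failure
    }
}
```

With three versions, one faulty version is outvoted by the other two; if all three disagree, the driver finds no majority and cannot mask the fault.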

Page 23

Fault classes (scope of N-VP)

• Origin

  • physical (internal/external)

  • logical (design/interaction)

• Kind

  • omission

  • value

  • timing

  • byzantine

• Property

  • duration (permanent, transient)

  • consistency (determinate, nondeterminate)

  • autonomy (spontaneous, event-dependent)

Coverage marks, in slide order (row alignment lost): ++ | (+)+++ | + / (+) | + / ++ / +