
Fault Tolerance Mechanisms

ITV Model-based Analysis and Design of Embedded Software

Techniques and methods for Critical Software

Anders P. Ravn

Aalborg University

August 2011

Fault Tolerance

Means to isolate component faults and mask them

Prevents system failures

May increase system dependability

FT - levels

• Full tolerance

• Graceful Degradation

• Fail safe

BW p. 107

FT basis: Redundancy

• Time: Try, Retry, Retry, ...

• Space: Try, Try, Try, ...
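The two forms of redundancy can be sketched in Java as follows. This is only an illustration: the class `Redundancy` and the methods `retry` and `firstSuccess` are hypothetical names, not part of the slides, and real space redundancy would place the replicas on separate hardware rather than in one loop.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical illustration of time vs. space redundancy; not from the slides. */
class Redundancy {

    /** Time redundancy: repeat the same operation until it succeeds or attempts run out. */
    static <T> T retry(Supplier<T> operation, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }

    /** Space redundancy (simplified): several replicas exist; the first one that succeeds answers. */
    static <T> T firstSuccess(List<Supplier<T>> replicas) {
        RuntimeException last = null;
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                last = e; // this replica failed, ask the next one
            }
        }
        throw new IllegalStateException("all replicas failed", last);
    }
}
```

Time redundancy mainly helps against transient faults, since a permanent fault makes every retry fail in the same way.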

Basic Strategies

Dynamic Redundancy

1. Error detection

2. Damage confinement and assessment

3. Error recovery

4. Fault treatment and continued service

BW p. 114

Error Detection

f: State × Input → State × Output

• Environment (exception)

• Application

Assertion:

• precondition (input)

• postcondition (input, output)

• invariant (state, state’)

Timing:

• WCET(f, input)

• Deadline(f, input)
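A minimal sketch of application-level error detection, assuming a hypothetical component `CheckedSqrt` that is not part of the slides. It checks a precondition on the input, a postcondition relating input and output, an invariant relating the state before and after the call, and a deadline passed in by the caller.

```java
/** Hypothetical component illustrating assertion and timing checks; not from the slides. */
class CheckedSqrt {
    private long calls = 0; // state: number of completed computations

    /** Returns the integer square root of n, detecting errors through assertions and a deadline. */
    int isqrt(int n, long deadlineNanos) {
        // precondition (input)
        if (n < 0) throw new IllegalArgumentException("precondition violated: n >= 0");

        long callsBefore = calls;
        long start = System.nanoTime();

        int r = (int) Math.sqrt(n); // the function f itself
        calls++;

        // postcondition (input, output): r*r <= n < (r+1)*(r+1)
        if (!((long) r * r <= n && n < (long) (r + 1) * (r + 1)))
            throw new IllegalStateException("postcondition violated");

        // invariant (state, state'): the counter grows by exactly one per call
        if (calls != callsBefore + 1)
            throw new IllegalStateException("invariant violated");

        // timing: the elapsed time must not exceed the deadline
        if (System.nanoTime() - start > deadlineNanos)
            throw new IllegalStateException("deadline missed");

        return r;
    }
}
```

A check of this kind only detects a missed deadline after the fact; enforcing a WCET bound needs support from the run-time system or a watchdog.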

Damage Confinement

• Static structure

• Dynamic structure (transaction)

Error Recovery

• Forward: repair the state, if you can!

• Backward:

  • define recovery points

  • checkpoint state at each recovery point

  • roll back

  • retry

Domino effect
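Backward recovery with recovery points can be sketched as below. The class `Checkpointed` and its methods are hypothetical and only illustrate the checkpoint, roll back, retry cycle on an int array held by a single object; they do not address the domino effect, which arises when several communicating processes roll back independently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Hypothetical sketch of backward error recovery on an int array; not from the slides. */
class Checkpointed {
    private int[] state;
    private final Deque<int[]> recoveryPoints = new ArrayDeque<>();

    Checkpointed(int[] initial) { state = initial.clone(); }

    /** Establish a recovery point by saving a copy of the current state. */
    void checkpoint() { recoveryPoints.push(state.clone()); }

    /** Roll back to the latest recovery point; assumes checkpoint() was called before. */
    void rollBack() { state = recoveryPoints.peek().clone(); }

    /** Discard the latest recovery point once it is no longer needed. */
    void commit() { recoveryPoints.pop(); }

    int[] state() { return state; }

    /** Runs the action up to maxTries times, rolling back after each failed try. */
    boolean attempt(Consumer<int[]> action, int maxTries) {
        checkpoint();
        for (int i = 0; i < maxTries; i++) {
            try {
                action.accept(state);
                commit();
                return true;
            } catch (RuntimeException e) {
                rollBack(); // undo partial updates, then retry
            }
        }
        commit(); // state is already rolled back; drop the recovery point
        return false;
    }
}
```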

Recovery blocks

ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR

BW p. 120

Implementation of Recovery Blocks

Abstract class RecoveryBlock

// Cloneable is required: without it, Object.clone() throws CloneNotSupportedException
public abstract class RecoveryBlock implements Cloneable {

    abstract boolean acceptanceTest();

    /** method to produce the result, it must be implemented by the application.
     * @param module 0, ... , MaxBlocks-1 */
    abstract void block(int module);

    /* MaxBlocks must be set by the application to the number of blocks */
    protected int MaxBlocks;

RecoveryBlock execution

    /** method to execute recovery module 0, 1, ... MaxBlocks-1 until one succeeds
     * @throws NoAccept if no module passes acceptanceTest.
     */
    public final void do_it() throws NoAccept, CloneNotSupportedException {
        save();
        int i = 0;
        do {
            try {
                block(i++);
                if (acceptanceTest()) return;
            } catch (Exception e) {
                /* if the block fails, we continue - not acceptance */
            }
            restore(copy);
        } while (i < MaxBlocks);
        throw new NoAccept();
    }

}

RecoveryBlock cache

// Cloneable is required: without it, Object.clone() throws CloneNotSupportedException
public abstract class RecoveryBlock implements Cloneable {

    /** The recovery cache is implemented by a clone of the original object */
    RecoveryBlock copy;

    /** save object to recovery cache, uses Java clone which must be a deep clone. */
    private final void save() throws CloneNotSupportedException {
        copy = (RecoveryBlock) this.clone();
    }

    /** method to restore data from recovery cache, it must be implemented by the application
     * @param copy value of the object to be restored */
    abstract void restore(RecoveryBlock copy);

Application

/** Extends the basic abstract RecoveryBlock with faulty sorting
 * algorithms and logs calls, returns, etc. to a TextArea. */
public class RecoveringSort extends RecoveryBlock {

    /** checksum for acceptance test */
    private int checksum;

    /** data to be saved in recovery cache */
    private int[] argument;

    public RecoveringSort(TextArea t) {
        MaxBlocks = 3;
        log = t;
    }

Acceptance criteria

    /* Acceptance test for sorting; it shall verify:
     * 1) the return value is an ordered list,
     * 2) the return value is a permutation of the initial values */
    boolean acceptanceTest() {
        boolean result = true;

        // check ordering
        int i = argument.length - 1;
        while (i > 0) if (argument[i] < argument[--i]) { result = false; break; }

        // check permutation, this is a partial check through a checksum
        // A full check is as expensive computationally as sorting,
        // thus, we use a partial check.
        i = argument.length; int sum = 0;
        while (i > 0) sum += argument[--i];

        return result && (sum == checksum);
    }
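The checksum is a partial check: the arrays {1, 3} and {2, 2} have the same sum, so a faulty module that turns one into the other would pass. A stricter test could count occurrences, as in the sketch below. The class `PermutationCheck` is hypothetical and not part of the slides; it assumes the original input is still available and pays for the stricter check with a hash map of the input values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stricter acceptance test for sorting; not from the slides. */
class PermutationCheck {

    /** True if the array is in non-decreasing order. */
    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) {
            if (a[i - 1] > a[i]) return false;
        }
        return true;
    }

    /** True if result contains exactly the same values as original, with the same multiplicities. */
    static boolean isPermutation(int[] original, int[] result) {
        if (original.length != result.length) return false;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : original) counts.merge(x, 1, Integer::sum);
        for (int x : result) {
            Integer c = counts.get(x);
            if (c == null) return false;          // value absent or used up
            if (c == 1) counts.remove(x); else counts.put(x, c - 1);
        }
        return counts.isEmpty();
    }
}
```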

Application - modules

    /** Starts sorting using the recovery block mechanisms.
     * @param data integer array containing elements to be sorted. */
    public int[] sort(int[] data) {
        argument = (int[]) data.clone(); // copy needed for recovery to work
        checksum = 0;
        int i = argument.length;
        while (i > 0) checksum += argument[--i];
        try {
            do_it();
        } catch (NoAccept e) {
            log.append("All blocks failed\n");
        } catch (CloneNotSupportedException e) {
            // do_it() declares this checked exception, so it must be handled here
            log.append("Recovery cache could not be saved\n");
        }
        return argument;
    }

    void block(int i) {
        switch (i) {
            case 0: BucketSort(argument); break;
            case 1: BadSort(argument); break;
            case 2: AlmostGoodSort(argument); break;
            default:
        }
    }

Fault classes (scope of R-B)

• Origin: ++

  • physical (internal/external)

  • logical (design/interaction)

• Kind: (+)++(-)

  • omission

  • value

  • timing

  • byzantine

• Property

  • duration (permanent, transient): + / (+)

  • consistency (determinate, nondeterminate): + / +

  • autonomy (spontaneous, event-dependent): + / +

The ideal FT-component

[Figure: a component divided into a normal mode part and an exception handler. At both its upper and its lower interface it exchanges request/response, interface exception, and failure exception.]
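The structure of such a component can be sketched in Java. All names here (`Service`, `FtComponent`, `InterfaceException`, `FailureException`) are hypothetical, and the handler's policy of retrying the lower-level service once is only an assumption chosen to keep the example short.

```java
/** Signals that a request violated the component's interface. */
class InterfaceException extends Exception {
    InterfaceException(String message) { super(message); }
}

/** Signals that the component could not deliver its service. */
class FailureException extends Exception {
    FailureException(String message, Throwable cause) { super(message, cause); }
}

/** A service offered by a component; hypothetical signature for illustration. */
interface Service {
    int request(int x) throws InterfaceException, FailureException;
}

/** Component with a normal mode and an exception handler, built on a lower-level service. */
class FtComponent implements Service {
    private final Service lower;

    FtComponent(Service lower) { this.lower = lower; }

    @Override
    public int request(int x) throws InterfaceException, FailureException {
        if (x < 0) {
            // invalid request: report it to the caller as an interface exception
            throw new InterfaceException("x must be non-negative, got " + x);
        }
        try {
            return normalMode(x);
        } catch (InterfaceException | FailureException e) {
            return handleException(x);
        }
    }

    /** Normal mode: use the lower-level service and add one to its response. */
    private int normalMode(int x) throws InterfaceException, FailureException {
        return lower.request(x) + 1;
    }

    /** Exception handler: retry once; if that fails too, signal failure upward. */
    private int handleException(int x) throws FailureException {
        try {
            return normalMode(x);
        } catch (InterfaceException | FailureException e) {
            throw new FailureException("service unavailable", e);
        }
    }
}
```

A caller sees either a normal response, an interface exception for a bad request, or a failure exception when the handler cannot mask the error.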

N-version programming

• Versions V1, V2, V3

• Driver (comparator)

• Comparison vectors (votes)

• Comparison status indicators

• Comparison points
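A driver with majority voting might look like the sketch below. The class `NVersionDriver` and the method `vote` are hypothetical; the sketch assumes a single comparison point at the end of the computation and exact equality of results, which is often too strict for floating-point outputs.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Hypothetical N-version driver with majority voting; not from the slides. */
class NVersionDriver {

    /** Runs every version on the input and returns the result a strict majority agrees on. */
    static int vote(List<IntUnaryOperator> versions, int input) {
        int n = versions.size();
        Integer[] votes = new Integer[n]; // comparison vector; null marks a crashed version
        for (int i = 0; i < n; i++) {
            try {
                votes[i] = versions.get(i).applyAsInt(input);
            } catch (RuntimeException e) {
                votes[i] = null;
            }
        }
        for (int i = 0; i < n; i++) {
            if (votes[i] == null) continue;
            int agree = 0;
            for (Integer v : votes) {
                if (votes[i].equals(v)) agree++;
            }
            if (agree * 2 > n) return votes[i]; // strict majority found
        }
        throw new IllegalStateException("no majority among " + n + " versions");
    }
}
```

With three versions, one faulty or crashed version is outvoted by the other two; two versions that fail in the same way defeat the vote.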

Fault classes (scope of N-VP)

• Origin: ++

  • physical (internal/external)

  • logical (design/interaction)

• Kind: (+)+++

  • omission

  • value

  • timing

  • byzantine

• Property

  • duration (permanent, transient): + / (+)

  • consistency (determinate, nondeterminate): + / +

  • autonomy (spontaneous, event-dependent): + / +
