CS, AU Henrik Bærbak Christensen 1 Fault Tolerant Architectures Lyu Chapter 14 Sommerville Chapter 20 Part II

Upload: peter-goodwin

Post on 12-Jan-2016


CS, AU Henrik Bærbak Christensen 1

Fault Tolerant Architectures

Lyu Chapter 14

Sommerville Chapter 20 Part II


Application domains

  Fault tolerant systems are used in various domains, primarily for safety-critical systems

– First documented example is for railway systems (1978)

– Nuclear power plants
– Airplanes (Airbus)
– Space program


Experience

  Somewhat mixed, actually…  Why?

– Redundancy works for hardware because hardware often fails randomly

• Due to wearing out (component failure, not design failure)

– Software fails due to specification and design errors
• Thus simple replication does not provide protection (every copy contains the same defect)…
• Review the Ariane failure reported by Tan
  Redundant software units require diversity
– However, there is evidence of failure correlations even over diverse implementations…


The origins: Hardware

  Triple Modular redundancy (TMR)

– Three identical hardware units process the input
– Outputs are compared for equality
• A deviating output is ignored
• Fault manager may try to repair the unit, or reconfigure to take it out of service


Terminology for Software FT

  Principal requirement
– Redundancy of functionally equivalent but different software units
• different teams, tools, processes, …
– Oracle
• Method to dynamically determine if output is correct or within acceptable limits
– Recovery
• A defect will lead to an error state that leads to failure if not handled
• A detected error state results in recovery being initiated


Terminology for Software FT

  Recovery
– Backward Recovery
• Recovery points are stored during normal execution
• System is rolled back/restored to a previous recovery point and restarted from there
– Forward Recovery
• Transition into a degraded mode state which is functional but with lowered quality
• Or: error compensation in which algorithms derive the correct answer.
  Exercise:
– Give examples of each type


Oracles

  Result verification/Dynamic self-checking
  Acceptance test
– Internal acceptance test
• Test for correctness, or whether the answer is within limits or bounds
• Requires that testing correctness is easier than calculating the result, like |sqrt(x)*sqrt(x) – x| < E for a square root routine
  Examples
– Checksums, used to acceptance test datagram contents
– Data structure validation methods
– Hardware self tests
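
The square root acceptance test above can be sketched as follows. This is a minimal illustration, not taken from Lyu or Sommerville: the names `accept_sqrt` and `checked_sqrt` are hypothetical, and scaling the tolerance E with x is an added assumption to keep the test usable for large inputs.

```python
import math


def accept_sqrt(x: float, result: float, eps: float = 1e-9) -> bool:
    """Acceptance test: squaring the result must reproduce x within a tolerance."""
    if x < 0 or result < 0:
        return False
    # Assumption: the tolerance E is scaled with x so large inputs are not
    # rejected because of ordinary floating point rounding.
    return abs(result * result - x) <= eps * max(1.0, x)


def checked_sqrt(x: float) -> float:
    """Compute sqrt(x) and only release the answer if it passes the test."""
    result = math.sqrt(x)  # the unit being checked
    if not accept_sqrt(x, result):
        raise ArithmeticError(f"acceptance test failed for sqrt({x})")
    return result
```

Checking the answer costs one multiplication and one comparison, which is cheaper than computing the root itself. That is the property the slide asks for.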


Oracles

  Result verification
  External consistency:
– Uses additional knowledge outside of the unit producing results
  Examples
– Watchdogs (heartbeat in Bass) use timings to detect and resolve problems
– Exceptions: for instance floating point errors
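
A watchdog of the kind mentioned above might look like the sketch below. The class name `Watchdog`, its methods, and the injectable `clock` parameter are all assumptions made for illustration; the slide only states that timing is used to detect problems.

```python
import time


class Watchdog:
    """Flags a monitored unit as failed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the watchdog can be tested without waiting
        self.last_beat = clock()

    def heartbeat(self) -> None:
        # Called by the monitored unit to signal that it is still alive.
        self.last_beat = self.clock()

    def expired(self) -> bool:
        # External check: relies only on timing, not on the unit's output.
        return self.clock() - self.last_beat > self.timeout
```

The watchdog knows nothing about what the unit computes. It only uses knowledge from outside the unit (elapsed time), which is what makes it an external consistency check.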


Diversity

  The rationale behind diversity:
  Modules are assumed to fail on disjoint subsets of the input space, so for any input at least one module will process it correctly!

[Figure: two panels, "Program 1 execution state" and "Program 2 execution state", each relating the input space and a subset I_e to error states]


Redundancy

  Requires a software unit to judge the acceptability of the redundant modules' outputs: the adjudicator
– As it is a software unit, it may itself contain defects
– Techniques
• Voting
• Median value
• Acceptance testing
• And more…


Failure classes

  In unit testing, failures occur because of defects in the software unit.
– A test case either fails or passes
  A redundant system (= N functionally identical but different units) introduces more types/classes of failures
– k-fold coincident failures
• k out of N units fail on the same test case
– U1 says 7, U2 says 13, but the answer is 42.
• Identical-and-wrong (IAW) answer
– U1 and U2 both say 7, but the answer is 42


Failure classes

  Correlated/Dependent failures
– P( U1 fails | U2 fails ) ≠ P( U1 fails )
• That is, the probability that U1 fails on a test case, given that we know U2 fails on that test case, differs from the probability that U1 fails when we do not know whether U2 fails on it.
• Think of U1 being equal to U2. If we know that U2 fails and U1 is identical to U2, then we know for certain that U1 fails: P( U1 fails | U2 fails ) = 1. But if we know that U1 is identical to U2 but not whether U2 will fail, then we only know the distribution, which might be that the probability of U1 failing is 0.1%.
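
The identical-units example can be checked by counting over a small input space. The sketch below is illustrative only: the function `failure_probabilities`, the 1000-element input space, and the chosen failure sets are assumptions, and all inputs are taken to be equally likely.

```python
def failure_probabilities(inputs, fails_u1, fails_u2):
    """Return (P(U1 fails), P(U1 fails | U2 fails)) over equally likely inputs."""
    inputs = list(inputs)
    u1 = {x for x in inputs if fails_u1(x)}
    u2 = {x for x in inputs if fails_u2(x)}
    p_u1 = len(u1) / len(inputs)
    # The conditional probability is undefined if U2 never fails.
    p_u1_given_u2 = len(u1 & u2) / len(u2) if u2 else float("nan")
    return p_u1, p_u1_given_u2


inputs = range(1000)

# Identical units: both fail on the same single input (0.1% of the input space).
identical = failure_probabilities(inputs, lambda x: x == 0, lambda x: x == 0)
print(identical)  # (0.001, 1.0): fully correlated

# Hypothetical independent units: knowing that U2 fails tells us nothing about U1.
independent = failure_probabilities(inputs, lambda x: x % 10 == 0, lambda x: x < 100)
print(independent)  # (0.1, 0.1): uncorrelated
```

In the first case the conditional probability jumps from 0.1% to 100%, which is why replicating identical software gives no protection.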


Failure classes


Adjudication Techniques


Voting

  Majority voting
– m = number of matching outputs required
– m = ceil[(N+1)/2]
– Usually N = 3 which means m = ?
  Two-out-of-N voting
– Actually m = 2 is enough regardless of N (usually)
– Note: agreement ≠ correctness
– Best argument I have:
• Hitler was democratically elected
  Median voting
– Sort the answers and select the middle element
– Used in aerospace …
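
The three voters above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, outputs are assumed to be hashable and compared for exact equality, and returning `None` when no decision is reached is a choice made here, not something the slides prescribe.

```python
import math
from collections import Counter
from statistics import median_low


def majority_vote(outputs):
    """Return the value that at least ceil((N+1)/2) units agree on, else None."""
    if not outputs:
        return None
    needed = math.ceil((len(outputs) + 1) / 2)
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= needed else None


def two_out_of_n_vote(outputs):
    """Return the most common value if at least two units agree on it, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def median_vote(outputs):
    """Sort the answers and select the middle element (the lower one if N is even)."""
    return median_low(outputs)
```

Note that `majority_vote([7, 7, 42])` returns 7 even if the correct answer is 42: the voter only measures agreement, not correctness. The median voter suits numeric outputs, where a single wild value such as 1000.0 among 41.9 and 42.0 is simply outvoted by position.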


Voting

  Consensus voting


Redundancy Techniques


Recovery Blocks

  Failed Accept Test
– Often roll-back / recovery of system state
– Single processors suffer from sequential processing
• ‘core dumped’ in first unit is bad…
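
A recovery block with roll-back might be sketched as below. The function `recovery_block` and its parameters are hypothetical, and the system state is assumed to be a plain dictionary so that it can be checkpointed with a deep copy.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try alternates in order; roll the state back after each failed attempt."""
    checkpoint = copy.deepcopy(state)  # recovery point stored before execution
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # a raised exception counts as a failed attempt
        # Backward recovery: restore the state saved at the recovery point.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")
```

The alternates run one after another, so each failed attempt adds to the response time. The sketch also only survives failures that surface as exceptions or rejected results; a unit that takes down the whole process (the 'core dumped' case) never reaches the roll-back step.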


N-version Programming

  Executed in parallel
– Voting used to select proper answer
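
N-version programming can be approximated with a thread pool, as in the sketch below. The function `n_version_execute` is hypothetical, majority voting is assumed as the adjudicator, and threads stand in for what would normally be separate processors or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def n_version_execute(versions, x):
    """Run all versions on the same input in parallel and majority-vote on the outputs."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
        outputs = []
        for future in futures:
            try:
                outputs.append(future.result())
            except Exception:
                pass  # a version that raises simply contributes no vote
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # The majority is taken over all N versions, not only those that answered.
    return value if count > len(versions) // 2 else None
```

Unlike the recovery block, no version waits for another to fail, so the response time is set by the slowest version rather than by the number of failed attempts.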


Variants

  Lyu discusses various variants and combinations. One I find interesting is
  Acceptance voting
– N versions execute in parallel, and the answers are subjected to acceptance testing.
– Only accepted answers are then fed to the voter
– The voter must be dynamic, as the number of inputs to the voter, Ni <= N, varies according to the number of accepted outputs.
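
Acceptance voting combines the two earlier ideas, and a dynamic voter could be sketched as follows. The function `acceptance_vote` is hypothetical, and using a majority among the accepted outputs as the voting rule is an assumption; Lyu's chapter may define the voter differently.

```python
from collections import Counter


def acceptance_vote(outputs, acceptance_test):
    """Filter outputs through the acceptance test, then vote among the survivors."""
    accepted = [out for out in outputs if acceptance_test(out)]
    if not accepted:
        return None  # nothing passed the acceptance test
    value, count = Counter(accepted).most_common(1)[0]
    # Dynamic voter: the threshold depends on Ni, the number of accepted outputs.
    return value if count > len(accepted) // 2 else None
```

With outputs `[42, 42, -1, -1, -1]` and a test that rejects negative values, a plain majority voter would pick the identical-and-wrong answer -1, while the sketch above votes only among the two accepted outputs and returns 42.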