3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani


3. Hardware Redundancy

Reliable System Design 2010, by: Amir M. Rahmani


Forms of Redundancy

• Hardware redundancy
  – add extra hardware for detecting or tolerating faults
• Software redundancy
  – add extra software for detecting and possibly tolerating faults
• Information redundancy
  – extra information, i.e. codes
• Time redundancy
  – extra time for performing tasks for fault tolerance


Types of Hardware Redundancy

Fault tolerance requires redundancy.

1- Static Redundancy (i.e. passive)
• uses fault masking to hide the occurrence of a fault
• does not require reconfiguration
• Example: TMR, voting

2- Dynamic Redundancy (i.e. active)
• uses comparison for detection and/or diagnosis
• requires reconfiguration
  – remove faulty hardware from the system
• Example: stand-by system

3- Hybrid Redundancy
• combination of static and dynamic redundancy


1- Static Redundancy

A class of redundancy techniques that can tolerate faults without reconfiguration (failover).

Static redundancy can be divided into two major subclasses:

• Masking redundancy
• Active redundancy


Masking Redundancy

• Uses majority voting to mask faults
• Requires 2f + 1 modules to tolerate f faulty modules
• N-Modular Redundant system (NMR)
• N independent modules replicate the same function
  – parallelism
  – results are voted on
  – requirement: N >= 3
• TMR (Triple Modular Redundancy)
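The voting rule above can be sketched in a few lines of Python. This is only an illustration: the function names `modules_needed` and `majority_vote` are made up for this sketch, and module outputs are assumed to be plain hashable values that can be compared for equality.

```python
from collections import Counter

def modules_needed(f: int) -> int:
    """Modules required to mask f faulty modules by majority voting."""
    return 2 * f + 1

def majority_vote(results):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError if no value has a strict majority, which happens
    when more modules are faulty than the system was sized for.
    """
    if len(results) < 3:
        raise ValueError("NMR needs N >= 3 modules")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority: too many faulty modules")
    return value

# TMR (N = 3, f = 1): the single faulty output 7 is outvoted.
print(modules_needed(1))            # 3
print(majority_vote([42, 42, 7]))   # 42
```

With N = 5 the same function masks two faulty modules, matching the 2f + 1 rule.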


Triple Modular Redundancy (TMR)

Example: majority voting with a 1-bit majority voter (three AND gates whose outputs are ORed)
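The gate-level voter can be written directly as a Boolean expression. The sketch below assumes the three AND gates are 2-input gates, one per pair of module outputs; `majority_1bit` is a name chosen for this example.

```python
def majority_1bit(a: int, b: int, c: int) -> int:
    """1-bit majority voter: three 2-input AND gates feeding one OR gate."""
    return (a & b) | (b & c) | (a & c)

# Exhaustive truth table: the output is 1 exactly when at least
# two of the three inputs are 1.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", majority_1bit(a, b, c))
```

Because `&` and `|` act bitwise on Python integers, the same expression also votes on whole words, one bit position at a time.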


Triple Modular Redundancy (TMR)


Masking Redundancy

TMR with triple voting


Masking Redundancy

Multi-stage TMR


N-Modular Redundant system (NMR)


Active Redundancy

• Two or more units are active and produce replicated results simultaneously
• Relies on fail-stop units
  – Fail-stop property: a unit produces correct results or no results at all
• Requires f + 1 modules to tolerate f faulty modules
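A minimal sketch of why f + 1 fail-stop units are enough: since a unit never delivers a wrong value, any result that arrives can be used without voting. The helper `first_available` and the two example units are hypothetical, and "no result" is modeled here as `None`.

```python
from typing import Callable, Optional, Sequence

def first_available(units: Sequence[Callable[[], Optional[int]]]) -> int:
    """Return any delivered result from a set of fail-stop units.

    Each unit is assumed fail-stop: it returns a correct result or None
    (no result at all), never a wrong value. With f + 1 units, up to f
    may stay silent and a correct result is still delivered.
    """
    # All units are active and produce their results for the same request.
    results = [unit() for unit in units]
    for result in results:
        if result is not None:
            return result
    raise RuntimeError("all units failed")

def healthy_unit() -> Optional[int]:
    return 2 + 2

def stopped_unit() -> Optional[int]:
    return None  # fail-stop: produces nothing

print(first_available([stopped_unit, healthy_unit]))  # 4
```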


Fail-stop Nodes

Nodes 1 and 2 each send their results to nodes 3 and 4

All nodes are fail-stop: they send correct results or no results at all


2- Dynamic Redundancy

• Relies on error detection and reconfiguration
• Requires f + 1 modules to tolerate f faulty modules
• May require recovery of system or application state
• May require outage time


Example: Duplicate and Compare

– can only detect, but NOT diagnose
  • i.e. fault detection, no fault tolerance
– may order shutdown
– comparator is a single point of failure
  • simple implementation: 2-input XOR for single-bit compare
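The duplicate-and-compare scheme can be sketched as follows. The names `xor_compare`, `duplicate_and_compare`, `good` and `stuck_at_zero` are illustrative only, and module outputs are assumed to be integers so that XOR can compare whole words bit by bit.

```python
def xor_compare(bit_a: int, bit_b: int) -> int:
    """Single-bit comparator: a 2-input XOR outputs 1 when the bits differ."""
    return bit_a ^ bit_b

def duplicate_and_compare(module_a, module_b, x):
    """Run two copies of a module on the same input and flag a mismatch.

    Returns (output_of_a, error_flag). A mismatch shows that one copy is
    faulty but not which one, so the output cannot be trusted when
    error_flag is True.
    """
    out_a, out_b = module_a(x), module_b(x)
    error = (out_a ^ out_b) != 0  # bitwise XOR over the whole word
    return out_a, error

def good(x):
    return x + 1

def stuck_at_zero(x):
    return 0  # hypothetical faulty copy

print(duplicate_and_compare(good, good, 5))           # (6, False)
print(duplicate_and_compare(good, stuck_at_zero, 5))  # (6, True)
```

The error flag is all the scheme provides; deciding which copy failed needs additional diagnosis or a third module.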


Example: Stand-by System

– only one module is driving the outputs
– other modules are:
  • idle => hot spares
  • shut down => cold spares
– error detection => switch to a new module (hot or cold spare)
  • e.g. communications checksums and memory parity bits
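As a rough sketch of the switch-over logic, the class below lets one module drive the output and moves to the next spare when an error is detected. `StandbySystem` and the example modules are hypothetical, and a raised `RuntimeError` stands in for whatever error-detection signal the real hardware provides.

```python
class StandbySystem:
    """One module drives the output; spares take over on a detected error."""

    def __init__(self, modules):
        self.modules = list(modules)  # modules[0] is the primary, the rest are spares
        self.active = 0               # index of the module currently driving outputs

    def run(self, x):
        while self.active < len(self.modules):
            try:
                return self.modules[self.active](x)
            except RuntimeError:
                # Error detected: reconfigure by switching to the next spare.
                self.active += 1
        raise RuntimeError("no spare modules left")

def failed_module(x):
    raise RuntimeError("detected error")  # stands in for the detection signal

def working_module(x):
    return x * 2

system = StandbySystem([failed_module, working_module])
print(system.run(21))  # 42
print(system.active)   # 1
```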


Types of Stand-by Systems

• Hot standby
• Warm standby
• Cold standby


Hot Stand-by

Characteristics
• Spare updated simultaneously with primary module

Advantages
+ Very short or no outage time
+ Does not require recovery of application state

Drawbacks
- High failure rate (fault rate)
- High power consumption


Warm Stand-by

Characteristics
• Spare up and running
• Needs to recover application status

Advantages
+ Does not require simultaneous updating of spare and primary module

Drawbacks
- Requires recovery of application state
- High fault rate
- High power consumption


Cold Stand-by

Characteristics
• Spare powered down

Advantages
+ Low failure rate (fault rate)
+ Low power consumption
  • Satellite application

Drawbacks
- Very long outage time
- Needs to boot kernel/operating system and recover application status


3- Hybrid Redundancy

N-Modular Redundancy with spares
– N active + S spare modules (off-line)
– Voting and comparison
– Replaces erroneous module from spare pool
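One possible reading of these three steps, sketched in Python: the active modules are voted on, each output is compared with the voted value, and any module that disagrees is swapped for a spare. The class `NMRWithSpares` and the example modules are assumptions made for illustration, not a description of a specific design.

```python
from collections import Counter

class NMRWithSpares:
    """N voting modules backed by a pool of S off-line spares."""

    def __init__(self, active, spares):
        self.active = list(active)  # N modules taking part in the vote
        self.spares = list(spares)  # S off-line spare modules

    def run(self, x):
        outputs = [module(x) for module in self.active]
        voted, count = Counter(outputs).most_common(1)[0]
        if count * 2 <= len(outputs):
            raise RuntimeError("no majority among active modules")
        # Comparison: a module that disagrees with the voted value is
        # treated as erroneous and replaced from the spare pool.
        for i, out in enumerate(outputs):
            if out != voted and self.spares:
                self.active[i] = self.spares.pop(0)
        return voted

def square(x):
    return x * x

def faulty(x):
    return -1  # hypothetical erroneous module

system = NMRWithSpares([square, faulty, square], [square])
print(system.run(3))       # 9
print(len(system.spares))  # 0
```

After the first call the faulty module has been replaced, so the system is back to three healthy voters instead of running degraded.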


N-Modular Redundancy with spares



Coding checks / Exception checks

Coding checks
• Error detection codes are formed by the addition of check bits to a data word.
• A cyclic redundancy code check was used in the disk store of ESS.
• A parity bit was used in the RAM.

Exception checks
• Hardware constraints: usually result from the inability of the hardware to provide the service needed by the software.
• Examples
  – Improper address alignment
  – Unequipped memory locations
  – Unused op-code
  – Stack overflow
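The two coding checks mentioned above can be illustrated in a few lines. This is a sketch under assumptions: even parity is chosen arbitrarily, and Python's standard `zlib.crc32` (CRC-32) stands in for a cyclic redundancy check; the source does not say which polynomial or parity convention ESS used.

```python
import zlib

def parity_bit(word: int) -> int:
    """Even-parity check bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2

def parity_ok(word: int, stored_parity: int) -> bool:
    """Recompute the check bit and compare it with the stored one."""
    return parity_bit(word) == stored_parity

data = 0b1011_0010
check = parity_bit(data)
corrupted = data ^ 0b0000_1000  # flip a single bit
print(parity_ok(data, check), parity_ok(corrupted, check))  # True False

# CRC: check bits computed over a whole block instead of a single word.
block = b"disk block contents"
crc = zlib.crc32(block)
damaged = b"disk block c0ntents"  # one byte altered
print(zlib.crc32(block) == crc, zlib.crc32(damaged) == crc)  # True False
```

A single parity bit misses any even number of flipped bits, which is one reason block storage tends to use a stronger code such as a CRC.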


Watchdog Timers

• So far, we’ve figured out how to detect when something is wrong … but how do we detect when we’re not doing anything at all?
• A watchdog timer monitors a module and triggers a recovery if the module doesn’t do anything in a given amount of time
  – E.g., put a watchdog timer on a microprocessor bus
• Who watches the watchdog?
  – If we assume a single-fault scenario, then this usually isn’t a problem
  – But what if the watchdog has a hard fault that causes it to never time out and trigger a recovery?
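A watchdog can be sketched as a countdown that the monitored module must keep resetting. The class below is an illustrative model only: `WatchdogTimer`, `kick` and `tick` are names chosen for this example, and time is counted in abstract ticks rather than by a real hardware timer.

```python
class WatchdogTimer:
    """Triggers a recovery action if not kicked within `timeout` ticks."""

    def __init__(self, timeout: int, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout
        self.remaining = timeout

    def kick(self):
        """Called by the monitored module to show it is still active."""
        self.remaining = self.timeout

    def tick(self):
        """Advance time by one tick; fire the recovery action on expiry."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_timeout()
            self.remaining = self.timeout

recoveries = []
wd = WatchdogTimer(timeout=3, on_timeout=lambda: recoveries.append("reset"))

for _ in range(3):  # healthy module: kicks the watchdog every tick
    wd.kick()
    wd.tick()
for _ in range(2):  # hung module: no more kicks
    wd.tick()

print(recoveries)  # ['reset']
```

The recovery fires three ticks after the last kick. A watchdog whose `tick` never runs would never fire, which is the hard-fault case raised in the last bullet.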