Taxonomy and Trends
Dan Siewiorek
Carnegie Mellon University
June 2012
Outline
Taxonomy and Trends
General Purpose Examples
High Availability Examples
A Methodology
Conclusion
Application Taxonomy
General purpose
• Wide range of applications; frequently high performance
High availability
• Occasional loss of single user but not system; rapid restart
Long life
• No human maintenance; automatically detect and reconfigure; high coverage
Critical computations
• Usually real-time control systems; low recovery time; high coverage
General Purpose Examples
Error Detection Techniques in Typical General-Purpose System
Memory
• Double-error-detection code on memory data
• Parity on address and control information
Cache
• Parity on data, address, control information
I/O Unit
• Parity on data and control
CPU
• Parity on data paths
• Parity on control store
• Duplication and comparison of control logic
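Parity is the checker that recurs on almost every unit in the list above. A minimal sketch of even parity over a data word follows; the function names and the 8-bit example value are illustrative assumptions, not taken from any particular machine.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """True when the word still agrees with the parity stored alongside it."""
    return parity_bit(word) == stored_parity


data = 0b1011_0010                 # hypothetical 8-bit word on a data path
stored = parity_bit(data)          # parity generated when the word is written

single_flip = data ^ 0b0000_1000   # one bit corrupted in transit
double_flip = data ^ 0b0001_1000   # two bits corrupted in transit

assert check(data, stored)             # clean word passes
assert not check(single_flip, stored)  # single-bit error is detected
assert check(double_flip, stored)      # double-bit error escapes simple parity
```

The last assertion shows why memory data gets a stronger code than plain parity: any even number of flipped bits leaves the parity unchanged.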
Error Recovery Techniques in Typical General-Purpose System
Memory
• Single-error-correction code on data
• Retry on address or control information parity error
Cache
• Retry on data, address, control information parity error
I/O Unit
• Retry on data or control parity errors
CPU
• Retry on control store parity error
• Invert sense of control store
• Macroinstruction retry
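Most of the recovery entries above are some form of retry after a detected parity error. The sketch below shows bounded retry in software terms; `ParityError`, `with_retry`, and `make_flaky_read` are hypothetical names introduced only for illustration, and the limit of three attempts is an assumption.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ParityError(Exception):
    """Raised by an operation whose parity check failed (hypothetical)."""


def with_retry(operation: Callable[[], T], max_attempts: int = 3) -> T:
    """Re-run an operation after a parity error, up to max_attempts times."""
    last_error: Optional[ParityError] = None
    for _ in range(max_attempts):
        try:
            return operation()
        except ParityError as err:
            last_error = err          # transient faults usually clear on retry
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error


def make_flaky_read(failures: int, value: int) -> Callable[[], int]:
    """Build a read that raises ParityError `failures` times, then succeeds."""
    remaining = [failures]

    def read() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ParityError("parity mismatch on read")
        return value

    return read


# A fault that clears after two attempts is masked by the retry loop.
assert with_retry(make_flaky_read(failures=2, value=42)) == 42
```

A fault that persists past the retry limit surfaces as a hard error, which is the point at which a real system would escalate to reconfiguration or service.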
IBM 3090 Series Fault-Tolerance Features
Reliability
• Low intrinsic failure rate technology
• Extensive component burn-in during manufacture
• Dual processor controller that incorporates switchover
• Dual 3370 Direct Access Storage units support switchover
• Multiple consoles for monitoring processor activity and for backup
• LSI packaging vastly reduces number of circuit connections
• Internal machine power and temperature monitoring
• Chip sparing in memory replaces defective chips automatically
IBM 3090 Series Fault-Tolerance Features
Availability
• Two or four central processors
• Automatic error detection and correction in central and expanded storage
  – Single bit error correction and double bit error detection in central storage
  – Double bit error correction and triple bit error detection in expanded storage
• Storage deallocation in 4K-byte increments under system program control
• Ability to vary channels off line in one channel increments
• Instruction retry
• Channel command retry
• Error detection and fault isolation circuits provide improved recovery and serviceability
• Multipath I/O controllers and units
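Single bit error correction with double bit error detection (SEC-DED) can be illustrated with an extended Hamming code. The sketch below protects a 4-bit value with an 8-bit codeword purely to keep the example small; real storage arrays protect much wider words, and nothing here is claimed to match the 3090's actual code.

```python
from itertools import combinations
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    code = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; check bits occupy 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2       # overall parity over positions 1..7
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1  # overall parity violated?
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        fixed[syndrome] ^= 1          # single error; syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None  # nonzero syndrome, even parity: two errors
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


word = encode(0b1010)
assert decode(word) == ("ok", 0b1010)

one_error = list(word)
one_error[5] ^= 1
assert decode(one_error) == ("corrected", 0b1010)

for a, b in combinations(range(8), 2):
    two_errors = list(word)
    two_errors[a] ^= 1
    two_errors[b] ^= 1
    assert decode(two_errors) == ("uncorrectable", None)
```

The overall parity bit is what separates the two cases: one flipped bit makes total parity odd and the syndrome names the position, while two flipped bits leave total parity even with a nonzero syndrome.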
IBM 3090 Series Fault-Tolerance Features
Data integrity
• Key controlled storage protection (store and fetch)
• Critical address storage protection
• Storage error checking and correction
• Processor cache error handling
• Parity and other internal error checking
• Segment protection (S/370 mode)
• Page protection (S/370 mode)
• Clear reset of registers and main storage
• Automatic Remote Support authorization
• Block multiplexer channel command retry
• Extensive I/O recovery by hardware and control programs
IBM 3090 Series Fault-Tolerance Features
Serviceability
• Automatic fault isolation (analysis routines) concurrent with operation
• Automatic remote support capability: auto call to IBM if authorized by the customer
• Automatic customer engineer and parts dispatching
• Trace facilities
• Error logout recording
• Microcode update distribution via remote support facilities
• Remote service console capability
• Automatic validation tests after repair
• Customer problem analysis facilities
ED/FI in IBM 308X / 3090
Hundreds of thousands of isolation domains
Parity checks account for 70-80% of checkers: data, address, and shift/increment parity predictors
Decoder/encoder checkers
25% of IBM 3090 circuits for RAS
Can instantaneously detect 90% of all errors
25% of faults assumed solid for the technology
If less than two weeks between events, the cause is assumed to be the same intermittent
Call service if 24 errors in 2 hours
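The last two rules are simple thresholds over an error log, and can be sketched as a sliding-window counter. The class and function names below are hypothetical; only the figures (24 errors, 2 hours, two weeks) come from the slide, and timestamps are assumed to be in seconds.

```python
from collections import deque
from typing import Deque

TWO_HOURS_S = 2 * 60 * 60
TWO_WEEKS_S = 14 * 24 * 60 * 60


class ServiceCallThreshold:
    """Signal a service call once `limit` errors fall within `window_s` seconds."""

    def __init__(self, limit: int = 24, window_s: float = TWO_HOURS_S) -> None:
        self.limit = limit
        self.window_s = window_s
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Log one error; return True when service should be called."""
        self._events.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.limit


def same_intermittent(previous_s: float, current_s: float) -> bool:
    """Attribute two events to one intermittent if under two weeks apart."""
    return current_s - previous_s < TWO_WEEKS_S


# One error per minute: the 24th error trips the threshold.
burst = ServiceCallThreshold()
decisions = [burst.record(minute * 60) for minute in range(24)]
assert decisions == [False] * 23 + [True]
```

Errors that arrive slowly never accumulate in the window, so an occasional transient does not trigger a service call.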
High Availability Examples
Tandem Design Objectives
“Nonstop” operation where failures detected, components configured out of service, repaired components configured back in without stopping other system components
No single hardware failure can compromise data integrity of the system
Modular system expansion through adding more processing power, memory, and peripherals without impacting application software
Fault Containment
Software processes do not share state – only message passing
Hardware – no shared memory, dual porting I/O, multiple power supply
Fast-Fail Modules (detection)
Software – consistency checks, defensive programming
Hardware – software generated status probes, hardware self-tests
Software Bugs
Backup process does not encounter the same state and environment, so its code takes a different path
Software
Process pairs
Transaction processing – two phase commit protocol
Log write-ahead protocol – record before- and after-image of database in an audit trail
Network systems management – programmed operators help reduce administrative errors
Tandem maintenance and diagnostic system – analyze event logs to successfully call out FRU 90% of time
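The write-ahead rule (log the before- and after-image before touching the data) can be shown with a small in-memory sketch. `AuditedStore` and its methods are hypothetical names, `None` is assumed to mean "key absent", and a real system would force the audit trail to stable storage before applying each update.

```python
from typing import Any, Dict, List, Tuple

# One audit record: (transaction id, key, before-image, after-image).
AuditRecord = Tuple[int, str, Any, Any]


class AuditedStore:
    """Key-value store that records every change in an audit trail first."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.audit_trail: List[AuditRecord] = []

    def write(self, txn_id: int, key: str, value: Any) -> None:
        before = self.data.get(key)   # None stands for "key absent"
        self.audit_trail.append((txn_id, key, before, value))  # log first
        self.data[key] = value                                  # then update

    def undo(self, txn_id: int) -> None:
        """Back out a transaction by restoring before-images, newest first."""
        for logged_txn, key, before, _after in reversed(self.audit_trail):
            if logged_txn != txn_id:
                continue
            if before is None:
                self.data.pop(key, None)
            else:
                self.data[key] = before


store = AuditedStore()
store.write(1, "balance", 100)   # transaction 1 commits
store.write(2, "balance", 40)    # transaction 2 starts changing data...
store.write(2, "pending", True)
store.undo(2)                    # ...and is backed out after a failure
assert store.data == {"balance": 100}
```

The after-images serve the opposite direction: replaying them in log order would redo committed work onto an older copy of the database.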
Error Handling
Error detection logic records error
Operating system runs diagnostics
• Incident of failure algorithm
• If transient, return board to service
• If permanent, call Customer Assistance Center (CAC)
CAC determines problem
• Selects board of same revision level
• Prints installation instructions
• Ships via overnight courier
22 field engineers support 400 systems
Service 6% / year of LCC vs. 9% for others
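The transient-versus-permanent decision above can be sketched as a small function. The slide does not spell out the failure algorithm, so the rule below (a board that passes diagnostics but has failed too often is treated as permanent) is an assumption, as are the names `handle_board_error`, `prior_failures`, and `failure_limit`.

```python
from typing import Callable


def handle_board_error(
    run_diagnostics: Callable[[], bool],
    prior_failures: int,
    failure_limit: int = 3,
) -> str:
    """Decide what to do with a board after its detection logic logs an error.

    run_diagnostics returns True when the board passes its self-tests.
    prior_failures and failure_limit are hypothetical stand-ins for the
    failure algorithm named on the slide.
    """
    if not run_diagnostics():
        return "call_cac"             # diagnostics fail: permanent fault
    if prior_failures + 1 >= failure_limit:
        return "call_cac"             # passes, but fails too often to trust
    return "return_to_service"        # transient: put the board back


assert handle_board_error(lambda: True, prior_failures=0) == "return_to_service"
assert handle_board_error(lambda: False, prior_failures=0) == "call_cac"
assert handle_board_error(lambda: True, prior_failures=2) == "call_cac"
```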
A Methodology
A Methodology
Define objectives
Limit the scope
Define confinement regions
Design error handling mechanisms
Design error reporting mechanisms
Testing of error handling/reporting mechanisms
Evaluate design
Exercising Latent Faults

| Dormant Area | Exercise |
| --- | --- |
| Memory locations | MCU periodically reads every array location (scrubbing) |
| Detection mechanisms | Software* periodically forces error conditions into the detection mechanisms |
| Reporting mechanisms | Software* periodically initiates and observes error reports |
| Recovery mechanisms | Software* periodically invokes recovery operations |

*Special commands to support exercising dormant areas are provided in BIUs and MCUs
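Scrubbing amounts to reading every location on a schedule so that a latent error is found before a second one lands in the same word. The sketch below models the idea with a parity bit per word; `MemoryArray` and `scrub` are hypothetical names, and a real MCU would use its ECC logic to correct as well as detect.

```python
from dataclasses import dataclass, field
from typing import List


def parity_bit(word: int) -> int:
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


@dataclass
class MemoryArray:
    """Toy memory: each word is stored with the parity computed at write time."""

    words: List[int] = field(default_factory=list)
    parities: List[int] = field(default_factory=list)

    def write(self, value: int) -> int:
        """Append a word and return its address."""
        self.words.append(value)
        self.parities.append(parity_bit(value))
        return len(self.words) - 1


def scrub(memory: MemoryArray) -> List[int]:
    """Read every location and return the addresses that fail the parity check."""
    return [
        addr
        for addr, (word, stored) in enumerate(zip(memory.words, memory.parities))
        if parity_bit(word) != stored
    ]


memory = MemoryArray()
for value in (5, 9, 12, 255):
    memory.write(value)
assert scrub(memory) == []        # nothing latent yet

memory.words[2] ^= 0b1            # a bit flips in a location nobody is reading
assert scrub(memory) == [2]       # the next scrub pass exposes it
```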
Recovery Mechanisms and Coverage

| Mechanism | Coverage |
| --- | --- |
| Retry | Transient errors |
| ECC | Storage array address and data |
| Spare bit | DRAM replacement |
| Memory bus pairs | Memory bus failure |
| Module shadowing | Module failure, GDP, IP, or memory |
Conclusion
Conclusion
Designing from first principles to produce an architecture to tolerate failures achieves better reliability, availability, and cost-effectiveness than an ad-hoc, add-on approach
It is possible to build systems in which the activities of fault detection, diagnosis, and recovery are completely automated and transparent to the user