CS 505: Thu D. Nguyen, Rutgers University, Spring 2005

CS 505: Computer Structures

Fault Tolerance

Thu D. Nguyen

Spring 2005

Computer Science

Rutgers University

Fault Tolerance

• Computing components WILL fail

– Hardware, software, and people

• The general field of dependability, fault tolerance, reliability, etc. addresses how to keep a computing system running in the presence of component failures

• Lots of jargon (like all areas of computer science), so we need to start with terminology

– See short paper I posted on web today

Dependability, Reliability, Availability

• Dependability: the ability of a computing system to deliver service that can justifiably be trusted

– Service delivered by a system is its behavior as perceived by the service’s users

– Dependability is a general concept that encapsulates reliability, availability, etc.

• Availability: readiness for correct service

– What percentage of time is the service available

• Reliability: continuity of correct service

– How long until the next service failure

• Safety: absence of catastrophic consequences on the users and environment, even in presence of faults

Faults, Errors, and Failures

• Failure: an event that occurs when the delivered service deviates from correct service

– By definition, a failure is visible to the user

• A fault is a failure of a component of a computing system that may lead to service failure

– If the system can tolerate this fault, that is, continue to provide correct service despite the fault, then the fault does not lead to service failure

• An error is the activation of a fault

– Faults may be dormant or latent

– For example, a disk fault may not ever become an error if the service never uses that disk again
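The chain from fault to error to failure can be illustrated with a small sketch. The `Disk` class, the `serve` function, and the mirrored-disk setup are hypothetical names invented for this illustration; they are not from the lecture. A bad block is a dormant fault, reading it activates the fault into an error, and the error becomes a service failure only if nothing masks it.

```python
class Disk:
    """Hypothetical disk whose bad blocks model dormant faults."""

    def __init__(self, blocks, bad_blocks=()):
        self.blocks = dict(blocks)
        self.bad_blocks = set(bad_blocks)  # faults: present, but dormant until read

    def read(self, block_id):
        if block_id in self.bad_blocks:
            # Reading the bad block activates the fault: it is now an error.
            raise IOError(f"bad block {block_id}")
        return self.blocks[block_id]


def serve(block_id, primary, mirror=None):
    """Return block contents; raise only if the error cannot be masked."""
    try:
        return primary.read(block_id)
    except IOError:
        if mirror is None:
            raise  # the error reaches the user: a service failure
        return mirror.read(block_id)  # the error is tolerated: no service failure


data = {0: "a", 1: "b"}
faulty = Disk(data, bad_blocks={1})
healthy = Disk(data)

print(serve(0, faulty))           # block 1 is never touched, so the fault stays dormant
print(serve(1, faulty, healthy))  # fault activated, but masked by the mirror
```

Calling `serve(1, faulty)` with no mirror raises `IOError`, which is the case where the user sees a failure.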

Fault Tolerance

• How to continue delivering correct service in the presence of errors

• Error detection: figuring out that an error exists in the service

• Fault diagnosis: figuring out the root cause of the detected error(s)

• Error handling and recovery: dynamic reconfiguration of the service to continue delivering correct service

• Fault prediction: predicting when faults are likely to occur

• Fault prevention: pro-active reconfiguration of the service to tolerate likely future faults
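A minimal sketch of the detection and recovery steps, assuming a service built from named replicas that can each be probed. The function names, replica names, and the `down` set are all hypothetical; a real system would probe over the network and would also need diagnosis to find the root cause.

```python
def detect(replicas, probe):
    """Error detection: return the replicas whose health probe fails."""
    return [name for name in replicas if not probe(name)]


def recover(replicas, failed):
    """Error handling: reconfigure the service to use only healthy replicas."""
    healthy = [name for name in replicas if name not in failed]
    if not healthy:
        raise RuntimeError("no healthy replica left: service failure")
    return healthy


replicas = ["node-a", "node-b", "node-c"]
down = {"node-b"}  # hypothetical ground truth, used only to drive the probe


def probe(name):
    """Stand-in health check; a real probe might be a heartbeat or ping."""
    return name not in down


failed = detect(replicas, probe)
active = recover(replicas, failed)
print(failed, active)
```

The service keeps running on `active` as long as at least one replica passes its probe.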

Mathematical Definitions

• Availability = MTTF / (MTTF + MTTR)

• Reliability = MTTF

– MTTF: mean time to failure; MTTR: mean time to repair
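The availability formula is easy to try with numbers. In the sketch below, the MTTF and MTTR values are made up for illustration, and the function names are not from the lecture.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours, mttr_hours):
    """Expected hours per year during which the service is unavailable."""
    return (1 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Illustrative numbers only: fails every 1000 hours, takes 1 hour to repair.
a = availability(1000, 1)
d = downtime_hours_per_year(1000, 1)
print(f"availability = {a:.4%}, downtime = {d:.2f} hours/year")

# Halving MTTR improves availability exactly as much as doubling MTTF.
faster_repair = availability(1000, 0.5)
longer_life = availability(2000, 1)
print(abs(faster_repair - longer_life) < 1e-12)
```

The last comparison shows why repair time matters: availability depends only on the ratio of MTTF to MTTR, so fast recovery is as effective as rarer failures.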

Tandem Case Study

• Modularity

• Fail-fast (fail-stop) hardware

– Extensive self-monitoring

– Fault model enforcement

– What happens when the self-monitoring and fault model enforcement hardware fails?

• Replicate hardware for redundancy

– Tolerate single fault

• Fault-tolerance software

• On-line maintenance

• Simplified user interface
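The combination of fail-fast modules and replication can be sketched in a few lines. This is an illustration of the general idea only, not Tandem's actual mechanism; the class, the `corrupt` fault-injection switch, and the module names are hypothetical.

```python
class ModuleStopped(Exception):
    """Raised when a fail-fast module has halted itself."""


class FailFastModule:
    def __init__(self, name, corrupt=False):
        self.name = name
        self.corrupt = corrupt  # hypothetical fault-injection switch
        self.stopped = False

    def double(self, x):
        if self.stopped:
            raise ModuleStopped(self.name)
        result = x * 2 + (1 if self.corrupt else 0)
        if result != x + x:      # self-check against an independent computation
            self.stopped = True  # halt instead of returning a wrong answer
            raise ModuleStopped(self.name)
        return result


def run_on_pair(primary, backup, x):
    """Tolerate a single fault: fall back to the backup if the primary stops."""
    try:
        return primary.double(x)
    except ModuleStopped:
        return backup.double(x)


primary = FailFastModule("cpu0", corrupt=True)
backup = FailFastModule("cpu1")
print(run_on_pair(primary, backup, 21))  # prints 42
```

Because a faulty module stops rather than emitting bad output, the pair only has to handle one failure mode. If both modules stop, `ModuleStopped` propagates, which matches the single-fault limit on the slide. The self-check itself is assumed to be correct here, which is exactly the question the slide raises about failures in the monitoring hardware.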

Tandem NonStop

Tandem Integrity

Census of Tandem Availability

Census of Tandem Availability

Case Study of 1 Tandem Customer

Sources of Failures (Going Beyond Tandem)

• Operator mistakes are a major source of service failures

• Theory: insufficient infrastructural support is a major reason for operator mistakes

– System designers rarely consider human-system interactions

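[Figure: two pie charts of failure sources, from Patterson et al. 2002]

• Public Switched Telephone Network: Operator 59%, Hardware 22%, Software 8%, Overload 11%

• Average of 3 Internet Sites: Operator 51%, Hardware 15%, Software 34%, Overload 0%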

Data from Vivo Project

• Conducting survey to understand database and network administration

– ~100 respondents

– DBAs: all ≥ 2 years experience, 71% ≥ 5 years experience

– Networking: 98% ≥ 2 years experience, 81% ≥ 5 years experience

• Source of failures

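[Figure: two pie charts of failure sources, "Network and Systems" and "Database"]

– Legend: Operator Lack of Understanding/Experience, Complex Operation, Hardware Failure, Buggy Software, Operator Inattentive/Tired, Lack of Appropriate Tools, Unfriendly Interface, Not specified

– Network and Systems slice values as listed (not matched to categories): 44%, 15%, 15%, 10%, 10%, 2%, 2%, 2%

– Database slice values as listed (not matched to categories): 16%, 18%, 16%, 26%, 14%, 2%, 8%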