Why do Internet services fail, and what can be done about it?
David Oppenheimer, Archana Ganapathi, and David Patterson
Computer Science Division, University of California at Berkeley
4th USENIX Symposium on Internet Technologies and Systems, March 2003
Slide 2
Motivation
• Internet service availability is important
– email, instant messenger, web search, e-commerce, …
• User-visible failures are relatively frequent
– especially if use non-binary definition of “failure”
• To improve availability, must know what causes failures
– know where to focus research
– objectively gauge potential benefit of techniques
• Approach: study failures from real Internet svcs.
– evaluation includes impact of humans & networks
Slide 3
Outline
• Describe methodology and services studied
• Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
• Evaluate HA techniques to see which of them would mitigate the observed failures
• Drill down on one cause: operator error
• Future directions for studying failure data
Slide 6
Methodology
• Obtain “failure” data from three Internet services
– two services: problem tracking database
– one service: post-mortems of user-visible failures
• We analyzed each incident
– failure root cause
» hardware, software, operator, environment, unknown
– type of failure
» “component failure” vs. “service failure”
– time to diagnose + repair (TTR)
• Did not look at security problems
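The per-incident analysis above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Incident` record, its field names, and the sample log are hypothetical, not the format used by the services studied. It shows how the two impact measures used later in the talk (share of service failures and share of total TTR) fall out of the same records.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record layout; the real tracking databases are not described.
    cause: str             # e.g. hardware, software, operator, network, unknown
    service_failure: bool  # True if user-visible; False for a masked component failure
    ttr_hours: float       # time to diagnose + repair


def shares(incidents):
    """Return (% of service failures, % of total TTR) per root cause."""
    count = defaultdict(int)
    ttr = defaultdict(float)
    for inc in incidents:
        if inc.service_failure:  # component-only failures are excluded here
            count[inc.cause] += 1
            ttr[inc.cause] += inc.ttr_hours
    n, total = sum(count.values()), sum(ttr.values())
    by_count = {c: 100 * k / n for c, k in count.items()}
    by_ttr = {c: 100 * t / total for c, t in ttr.items()}
    return by_count, by_ttr


# Made-up incidents, for illustration only.
log = [
    Incident("operator", True, 6.0),
    Incident("software", True, 1.0),
    Incident("network", True, 1.0),
    Incident("hardware", False, 0.5),
    Incident("operator", True, 2.0),
]
by_count, by_ttr = shares(log)
```

In this made-up log the operator accounts for half of the service failures but a larger share of repair time, which is the kind of gap the two measures are meant to expose.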
Slide 7
Comparing the three services

| characteristic | Online | ReadMostly | Content |
|---|---|---|---|
| hits per day | ~100 million | ~100 million | ~7 million |
| # of machines | ~500 @ 2 sites | > 2000 @ 4 sites | ~500 @ ~15 sites |
| front-end node architecture | custom s/w; Solaris on SPARC, x86 | custom s/w; open-source OS on x86 | custom s/w; open-source OS on x86 |
| back-end node architecture | Network Appliance filers | custom s/w; open-source OS on x86 | custom s/w; open-source OS on x86 |
| period studied | 7 months | 6 months | 3 months |
| # component failures | 296 | N/A | 205 |
| # service failures | 40 | 21 | 56 |
Slide 8
Outline
• Describe methodology and services studied
• Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
• Evaluate HA techniques to see which of them would mitigate the observed failures
• Drill down on one cause: operator error
• Future directions for studying failure data
Slide 9
Failure cause by % of service failures

| cause | Online | Content | ReadMostly |
|---|---|---|---|
| hardware | 10% | 2% | - |
| software | 25% | 25% | 5% |
| network | 20% | 15% | 62% |
| operator | 33% | 36% | 19% |
| unknown | 12% | 22% | 14% |

(- : no slice shown in the chart)
Slide 10
Failure cause by % of TTR

| cause | Online | Content | ReadMostly |
|---|---|---|---|
| hardware | 6% | - | - |
| software | 17% | 6% | - |
| network | 1% | 19% | 97% |
| operator | 76% | 75% | 3% |
| unknown | 1% | - | - |

(- : no slice shown in the chart)
Slide 11
Most important failure root cause?
• Operator error generally the largest cause of service failure
– even more significant as fraction of total “downtime”
– configuration errors > 50% of operator errors
– generally happened when making changes, not repairs
• Network problems significant cause of failures
Slide 12
Related work: failure causes
• Tandem systems (Gray)
– 1985: Operator 42%, software 25%, hardware 18%
– 1989: Operator 15%, software 55%, hardware 14%
• VAX (Murphy)
– 1993: Operator 50%, software 20%, hardware 10%
• Public Telephone Network (Kuhn, Enriquez)
– 1997: Operator 50%, software 14%, hardware 19%
– 2002: Operator 54%, software 7%, hardware 30%
Slide 13
Outline
• Describe methodology and services studied
• Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
• Evaluate HA techniques to see which of them would mitigate the observed failures
• Drill down on one cause: operator error
• Future directions for studying failure data
Slide 14
Potential effectiveness of techniques?

technique
• post-deployment correctness testing*
• expose/monitor failures*
• redundancy*
• automatic configuration checking
• post-deploy. fault injection/load testing
• component isolation*
• pre-deployment fault injection/load test
• proactive restart*
• pre-deployment correctness testing*

* indicates technique already used by Online
Slide 15
Potential effectiveness of techniques?

| technique | failures avoided / mitigated |
|---|---|
| post-deployment correctness testing* | 26 |
| expose/monitor failures* | 12 |
| redundancy* | 9 |
| automatic configuration checking | 9 |
| post-deploy. fault injection/load testing | 6 |
| component isolation* | 5 |
| pre-deployment fault injection/load test | 3 |
| proactive restart* | 3 |
| pre-deployment correctness testing* | 2 |

(40 service failures examined)
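As a rough aid to reading these counts, the short Python sketch below converts each one into a percentage of the 40 service failures examined. The counts are taken from the slide; the variable names are illustrative. The counts add up to more than 40, which suggests that some failures are credited to more than one technique, so the percentages should not be summed.

```python
# Counts from the slide: service failures (out of 40 examined) that each
# technique could have avoided or mitigated.
EXAMINED = 40
mitigated = {
    "post-deployment correctness testing": 26,
    "expose/monitor failures": 12,
    "redundancy": 9,
    "automatic configuration checking": 9,
    "post-deploy. fault injection/load testing": 6,
    "component isolation": 5,
    "pre-deployment fault injection/load test": 3,
    "proactive restart": 3,
    "pre-deployment correctness testing": 2,
}

# Percentage of the examined failures each technique would have touched.
coverage = {t: 100 * n / EXAMINED for t, n in mitigated.items()}
for technique, pct in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{pct:5.1f}%  {technique}")

# Sum of all counts; larger than EXAMINED when techniques overlap.
total_credits = sum(mitigated.values())
```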
Slide 16
Outline
• Describe methodology and services studied
• Identify most significant failure root causes
– source: type of component
– impact: number of incidents, contribution to TTR
• Evaluate existing techniques to see which of them would mitigate the observed failures
• Drill down on one cause: operator error
• Future directions for studying failure data
Slide 17
Drilling down: operator error
Why does operator error cause so many svc. failures?

% of component failures resulting in service failures

| cause | Content | Online |
|---|---|---|
| operator | 50% | 25% |
| software | 24% | 21% |
| network | 19% | 19% |
| hardware | 6% | 3% |

Existing techniques (e.g., redundancy) are minimally effective at masking operator error
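The ratio on this slide can be computed per cause by dividing the number of service failures by the number of component failures. The Python sketch below assumes that definition; the function name and the counts are hypothetical, since the slide reports only the resulting percentages.

```python
def escalation_rate(component_failures, service_failures):
    """Percent of component failures, per cause, that became service failures."""
    return {
        cause: 100 * service_failures.get(cause, 0) / n
        for cause, n in component_failures.items()
        if n > 0  # skip causes with no component failures recorded
    }


# Hypothetical counts, chosen only to illustrate the calculation.
component = {"operator": 40, "software": 50, "network": 20, "hardware": 100}
service = {"operator": 20, "software": 10, "network": 4, "hardware": 5}
rates = escalation_rate(component, service)
```

A low rate means the service masks most failures of that kind, for example through redundancy; a high rate means such failures usually reach users.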
Slide 18
Drilling down: operator error TTR
Why does operator error contribute so much to TTR?

Failure cause by % of TTR

| cause | Online | Content |
|---|---|---|
| hardware | 6% | - |
| software | 17% | 6% |
| network | 1% | 19% |
| operator | 76% | 75% |
| unknown | 1% | - |

(- : no slice shown in the chart)

Detection and diagnosis difficult because of non-failstop failures and poor error checking
Slide 19
Future directions in studying failures
• Quantify impact of operational practices
• Study additional types of sites
– transactional, intranets, peer-to-peer
• Create a public failure data repository
– standard taxonomy of failure causes
– standard metrics for impact
– techniques for automatic anonymization
– security (not just reliability)
– automatic analysis (mining for trends, fixes, attacks, …)
• Perform controlled laboratory experiments
Slide 20
Conclusion
• Operator error large cause of failures, downtime
• Many failures could be mitigated with
– better post-deployment testing
– automatic configuration checking
– better error detection and diagnosis
• Longer-term: concern for operators must be built into systems from the ground up
– make systems robust to operator error
– reduce time it takes operators to detect, diagnose, and repair problems
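To make "automatic configuration checking" concrete, here is a minimal Python sketch of a check that could run before a configuration change is deployed. Everything in it is an assumption for illustration: the talk does not describe a specific checker, and the config keys (`backends`, `port`, `timeout_s`) and rules are hypothetical.

```python
def check_config(cfg):
    """Return a list of problems found in a hypothetical front-end config."""
    problems = []

    # A front end with no back ends, or the same one listed twice, is suspect.
    backends = cfg.get("backends", [])
    if not backends:
        problems.append("no back-end nodes listed")
    if len(set(backends)) != len(backends):
        problems.append("duplicate back-end entry")

    # The port must be an integer in the valid TCP range.
    port = cfg.get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"invalid port: {port!r}")

    # The timeout must be a positive number.
    timeout = cfg.get("timeout_s", 0)
    if not isinstance(timeout, (int, float)) or timeout <= 0:
        problems.append("timeout_s must be a positive number")

    return problems


# Hypothetical configs: one acceptable, one with typical slips.
good = {"backends": ["be1", "be2"], "port": 8080, "timeout_s": 2.5}
bad = {"backends": ["be1", "be1"], "port": "8080", "timeout_s": 0}
good_problems = check_config(good)
bad_problems = check_config(bad)
```

A deployment script could refuse to push a change whenever the returned list is non-empty, so that a configuration slip is caught before it becomes a service failure.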
Willing to contribute failure data, or information about problem detection/diagnosis techniques?
http://roc.cs.berkeley.edu/projects/faultmanage/