CSG, January, 2005
.99999
Dan Oberst, Princeton University
Some Definitions
Reliability Metrics: Percent Uptime

| % Uptime | Downtime (min/week) | Downtime (min/month) | Downtime (min/year) |
|---|---|---|---|
| 99% | 100 | 432 (7 hours) | 5256 (88 hours) |
| 99.9% | 10 | 43 | 525 (9 hours) |
| 99.99% | 1 | 4 | 52 |
| 99.999% | 0 (6 sec) | 0 (26 sec) | 5 |
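The downtime budgets for each uptime percentage follow from simple arithmetic. A minimal sketch, assuming a 7-day week, a 30-day month, and a 365-day year (the function name and period lengths are illustrative choices, not taken from the slides):

```python
MINUTES_PER_DAY = 24 * 60

# Assumed period lengths in days; a different month length shifts the monthly figure.
PERIOD_DAYS = {"week": 7, "month": 30, "year": 365}


def allowed_downtime_minutes(uptime_percent: float) -> dict[str, float]:
    """Return the downtime budget, in minutes, for each period."""
    down_fraction = 1.0 - uptime_percent / 100.0
    return {
        period: days * MINUTES_PER_DAY * down_fraction
        for period, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(pct)
        cells = ", ".join(f"{m:.1f} min/{p}" for p, m in budget.items())
        print(f"{pct}%: {cells}")
```

At 99.999% the weekly budget comes out at roughly 6 seconds, which is why five nines leaves essentially no room for unplanned restarts.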
Reliability Gotchas
- A 2-hour outage in 1 year requires about 23 years of otherwise 100% uptime to average .99999.
- 99% availability (88 hours/year of downtime) can mean any of:
  - one 3+ day outage
  - one ~7 hour outage every month
  - one ~1½ hour outage every week
- Reliability isn’t the whole story.
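The 23-year figure can be checked by dividing the outage by the yearly five-nines budget. A small sketch, assuming a 365-day year and no other downtime over the whole period (the function name is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def years_to_average(outage_minutes: float, target_uptime: float) -> float:
    """Total years (including the outage year) over which a single outage
    averages out to the target uptime, assuming no further downtime."""
    budget_per_year = MINUTES_PER_YEAR * (1.0 - target_uptime)
    return outage_minutes / budget_per_year


print(round(years_to_average(120, 0.99999), 1))  # 22.8
```

The result of about 22.8 years rounds to the 23 years quoted on the slide.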
The Weakest Link
- No system can be more reliable than any of its components.
- System reliability is the product of the component reliabilities.

| Component | Estimated Reliability |
|---|---|
| CPU | 99.999% |
| Memory | 99.999% |
| Disk | 99.8% |
| Software | 99.5% |
| System Overall | 99.3% (<99.5%) |
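The overall figure can be reproduced by multiplying the component estimates. This is a sketch that assumes the components are in series (the system needs all of them) and fail independently; the dictionary and function names are illustrative:

```python
from math import prod

# Component estimates as given on the slide, expressed as fractions.
components = {
    "CPU": 0.99999,
    "Memory": 0.99999,
    "Disk": 0.998,
    "Software": 0.995,
}


def series_reliability(reliabilities) -> float:
    """Reliability of a chain in which every component is required."""
    return prod(reliabilities)


overall = series_reliability(components.values())
print(f"{overall:.1%}")  # 99.3%
```

The product always lands below the weakest component, here 99.5% for software.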
Beyond Uptime
- Scheduled Uptime: How much can you afford to be down? Equivalently, how much do you need to plan to be up? Examples: 24x7, 24x6.75, 18x7, etc.
- RTO (Recovery Time Objective): How long before the system is back? How long can you afford to be without it?
- RPO (Recovery Point Objective): How much work can you afford to lose?
Example Service Levels
| Class | Service | Service Level |
|---|---|---|
| 1 (RTE) | Customer-facing, revenue-producing | 24x7 scheduled; 99.9% availability (<45 min/mo); RTO = 2 hr, RPO = 0 hr |
| 2 | Supply | 24x6.75 scheduled; 99.5% availability (<3.5 hr/mo); RTO = 8-24 hr, RPO = 4 hr |
| 3 | Back Office | 18x7 scheduled; 99% availability (<5.5 hr/mo); RTO = 3 days, RPO = 1 day |
| 4 | Departmental Function | 24x6.5 scheduled; 98% availability (<13.5 hr/mo); RTO = 5 days, RPO = 1 day |
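The downtime allowances for Classes 2 to 4 appear to be the availability target applied to scheduled hours only, not to the full calendar. A hedged sketch of that arithmetic, assuming an average month of 365/12 days; the `ServiceLevel` class and its field names are hypothetical:

```python
from dataclasses import dataclass

# Assumption: an average month is 365/12 days, i.e. about 4.35 weeks.
WEEKS_PER_MONTH = 365 / 12 / 7


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    hours_per_day: float
    days_per_week: float
    availability: float  # fraction of *scheduled* time the service must be up

    def scheduled_hours_per_month(self) -> float:
        return self.hours_per_day * self.days_per_week * WEEKS_PER_MONTH

    def downtime_budget_hours_per_month(self) -> float:
        return self.scheduled_hours_per_month() * (1.0 - self.availability)


CLASSES = [
    ServiceLevel("Class 1", 24, 7, 0.999),
    ServiceLevel("Class 2", 24, 6.75, 0.995),
    ServiceLevel("Class 3", 18, 7, 0.99),
    ServiceLevel("Class 4", 24, 6.5, 0.98),
]

for level in CLASSES:
    print(f"{level.name}: {level.downtime_budget_hours_per_month():.2f} hr/month")
```

Applied to Class 1, the same arithmetic gives about 44 minutes per month of allowed downtime.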
How’re We Doin’?
Gartner CIO Poll: How would you rank your most critical applications in unplanned downtime in the past year?

| Rating | Availability | Unplanned downtime |
|---|---|---|
| Average | <=98% | >=175 hr/yr |
| Very Good | 99% | <=87 hr/yr |
| Outstanding | 99.5% | <=43 hr/yr |
| Best in Class | 99.9% | <=9 hr/yr |
| 100% Availability | 100% | Zero |
How’re We Doin’? (cont.)
How would you rank your most critical application in planned downtime during the past year?

| Rating | Planned downtime | Respondents |
|---|---|---|
| Average | >250 hours/year | 13% |
| Very Good | <200 hours/year | 38% |
| Outstanding | <50 hours/year | 38% |
| Best in Class | <12 hours/year | 9% |
| 100% Availability | Zero planned downtime | 2% |
Getting to .99999
- Enhanced Availability: redundancy, RAID
- High Availability: clustering, remote mirroring
- Fault-Tolerant: all resources (including the application) replicated
Five Nines
- It’s hard, it’s expensive.
- Match the reliability to the service.
- Improve the component with the fewest nines.
- Find the cheapest nines in the chain.
- Review assumptions.
- Practice³!!
- Moore’s Law is your friend.
Resources
CIO Update: Poll Shows Application Availability Levels Have Increased, D. Scott, Gartner Article G00120892, 12 May, 2004.
Real-Time Enterprise: Business Continuity and Availability, D. Scott, J. Krischer, Gartner Research Note SPA-18-1683, 24 September, 2002.
Performance Tuning Active Call Center for Enterprise Applications, Sunny Beach Technology, Inc. White Paper, 7 January, 2001, http://www.sunny-beach.net.