CSG, January, 2005 .99999 Dan Oberst, Princeton University




Some Definitions

Reliability Metrics: Percent Uptime

% Uptime    Downtime Min/Week    Downtime Min/Month    Downtime Min/Year
99%         100                  432 (7 hours)         5256 (88 hours)
99.9%       10                   43                    525 (9 hours)
99.99%      1                    4                     52
99.999%     0 (6 sec)            0 (26 sec)            5
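The downtime figures for each level of uptime follow from simple arithmetic. The sketch below is one way to reproduce them; the function name `downtime_minutes` and the period constants (a 30-day month, a 365-day year) are assumptions made for illustration, not part of the original slides.

```python
# Period lengths in minutes. The 30-day month and 365-day year are assumptions.
MINUTES_PER_WEEK = 7 * 24 * 60       # 10,080
MINUTES_PER_MONTH = 30 * 24 * 60     # 43,200
MINUTES_PER_YEAR = 365 * 24 * 60     # 525,600


def downtime_minutes(uptime_percent, period_minutes):
    """Minutes of downtime permitted in a period at a given percent uptime."""
    return (100.0 - uptime_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        week = downtime_minutes(pct, MINUTES_PER_WEEK)
        month = downtime_minutes(pct, MINUTES_PER_MONTH)
        year = downtime_minutes(pct, MINUTES_PER_YEAR)
        print(f"{pct}%  week={week:.1f}  month={month:.1f}  year={year:.1f}")
```

At five nines the yearly budget comes out at a little over five minutes, which is why a single short outage can consume it.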


Reliability Gotchas

2 hour outage in 1 year
    Requires 23 years of 100% uptime for .99999

99% Availability (88 hours/year)
    One 3+ day outage, or
    One ~7 hour outage every month, or
    One ~1½ hour outage every week
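The arithmetic behind both gotchas can be checked with a short sketch. The function names here are hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24            # 8,760, assuming a 365-day year
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60


def years_to_absorb_outage(outage_minutes, target_uptime_percent):
    """Total span in years over which one outage still averages to the target uptime."""
    allowed_per_year = (100.0 - target_uptime_percent) / 100.0 * MINUTES_PER_YEAR
    return outage_minutes / allowed_per_year


def yearly_budget_hours(uptime_percent):
    """Hours of downtime per year permitted at a given percent uptime."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # A 2 hour outage measured against five nines: roughly 23 years.
    print(f"{years_to_absorb_outage(120, 99.999):.1f} years")

    # The 99% budget spent as one outage, monthly outages, or weekly outages.
    budget = yearly_budget_hours(99.0)
    print(f"{budget / 24:.2f} days at once")
    print(f"{budget / 12:.1f} hours every month")
    print(f"{budget / 52:.1f} hours every week")
```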

Reliability isn’t the whole story


The Weakest Link

No system can be more reliable than any of its components
System reliability is the product of the component reliabilities

Component         Estimated Reliability
CPU               99.999%
Memory            99.999%
Disk              99.8%
Software          99.5%
System Overall    99.3% (<99.5%)
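The overall figure can be reproduced by multiplying the component reliabilities together. This is a minimal sketch that assumes every component must be up for the system to be up (a series arrangement); the function name `serial_reliability` is hypothetical.

```python
from math import prod

# Estimated component reliabilities from the table, in percent.
components = {
    "CPU": 99.999,
    "Memory": 99.999,
    "Disk": 99.8,
    "Software": 99.5,
}


def serial_reliability(percents):
    """Reliability in percent of a system that needs every component up."""
    return 100.0 * prod(p / 100.0 for p in percents)


if __name__ == "__main__":
    overall = serial_reliability(components.values())
    weakest = min(components.values())
    print(f"System overall: {overall:.1f}% (weakest component: {weakest}%)")
```

The product always lands below the weakest component, which is the point of the slide.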


Beyond Uptime

Scheduled Uptime
    How much can you afford to be down? = How much do you need to plan to be up?
    24x7, 24x6.75, 18x7, etc.

RTO (Recovery Time Objective)
    How long before the system is back?
    How long can you afford to be without it?

RPO (Recovery Point Objective)
    How much lost work?


Example Service Levels

Class      Service                  Scheduled    Availability          RTO / RPO
1 (RTE)    Customer-facing,         24x7         99.9% (<45 min/mo)    2 hr / 0 hr
           revenue-producing
2          Supply                   24x6.75      99.5% (<3.5 hr/mo)    8-24 hr / 4 hr
3          Back Office              18x7         99% (<5.5 hr/mo)      3 days / 1 day
4          Departmental Function    24x6.5       98% (<13.5 hr/mo)     5 days / 1 day
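The monthly downtime allowance for each class depends on the scheduled hours as well as the availability target. The sketch below is one way to derive it. It assumes "AxB" means A hours per day for B days per week and that a month is 30 days; the function name is hypothetical.

```python
WEEKS_PER_MONTH = 30 / 7   # assumes a 30-day month


def allowed_downtime_hours_per_month(hours_per_day, days_per_week, availability_percent):
    """Hours of unplanned downtime per month permitted within the scheduled window."""
    scheduled_hours = hours_per_day * days_per_week * WEEKS_PER_MONTH
    return scheduled_hours * (100.0 - availability_percent) / 100.0


# (class, hours per day, days per week, availability %) taken from the table.
service_classes = [
    ("1 (RTE)", 24, 7, 99.9),
    ("2", 24, 6.75, 99.5),
    ("3", 18, 7, 99.0),
    ("4", 24, 6.5, 98.0),
]

if __name__ == "__main__":
    for name, hours, days, availability in service_classes:
        budget = allowed_downtime_hours_per_month(hours, days, availability)
        print(f"Class {name}: {budget:.2f} hr/mo ({budget * 60:.0f} min)")
```

Under these assumptions each class comes out just under the limit quoted in the table, for example about 43 minutes per month for Class 1.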


How’re We Doin’?

Gartner CIO Poll
How would you rank your most critical applications in unplanned downtime in the past year?

Rating               Availability (Unplanned Downtime)
Average              <=98% (>=175 hr/yr)
Very Good            99% (<=87 hr/yr)
Outstanding          99.5% (<=43 hr/yr)
Best in Class        99.9% (<=9 hr/yr)
100% Availability    Zero unplanned downtime


How’re We Doin’? (cont.)

How would you rank your most-critical application in planned downtime during the past year?

Rating               Planned Downtime         Responses
Average              > 250 hours/year         13%
Very Good            < 200 hours/year         38%
Outstanding          < 50 hours/year          38%
Best in Class        < 12 hours/year          9%
100% Availability    Zero planned downtime    2%


Getting to .99999

Enhanced Availability
    Redundancy
    RAID

High Availability
    Clustering
    Remote mirroring

Fault-Tolerant
    All resources (including application) replicated
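Redundancy adds nines because a replicated resource is down only when every replica is down at once. The sketch below illustrates that arithmetic. It is not from the slides, and it assumes replica failures are independent and failover is instantaneous and perfect, which real clusters only approximate. The function name `redundant_availability` is hypothetical.

```python
def redundant_availability(single_percent, replicas):
    """Availability in percent of N replicas where any one can carry the load.

    Assumes independent failures and perfect failover (an idealization).
    """
    unavailable = 1.0 - single_percent / 100.0
    return 100.0 * (1.0 - unavailable ** replicas)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} x 99% replica(s): {redundant_availability(99.0, n):.4f}%")
```

Under these idealized assumptions, two 99% replicas give 99.99%. Correlated failures and failover time lower that figure in practice.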


Five Nines

It’s hard, it’s expensive.
Match the reliability to the service.
Improve the component with the fewest nines.
Find the cheapest nines in the chain.
Review assumptions.
Practice³!!
Moore’s Law is your friend.


Resources

CIO Update: Poll Shows Application Availability Levels Have Increased, D. Scott, Gartner Article G00120892, 12 May, 2004.

Real-Time Enterprise: Business Continuity and Availability, D. Scott, J. Krischer, Gartner Research Note SPA-18-1683, 24 September, 2002.

Performance Tuning Active Call Center for Enterprise Applications, Sunny Beach Technology, Inc. White Paper, 7 January, 2001, http://www.sunny-beach.net.