How Many Nines? Understanding RPO and RTO Metrics for BC/DR
Mike Robinson Sr. Solution Marketing Manager [email protected] September 7, 2014
© 2014 NetIQ Corporation and its affiliates. All Rights Reserved. 2
Solutions Track 5: How Many Nines? Understanding RPO and RTO Metrics for BC/DR
Virtualization and the cloud give backup and disaster recovery the scale, flexibility and resilience that IT can leverage to shrink costs and consolidate infrastructure while maximizing application uptime. How does an understanding of RPOs (recovery point objectives) and RTOs (recovery time objectives) affect the ability of IT to set SLAs? How much downtime and data loss separate a “two nines” (99 percent) availability target from a “five nines” (99.999 percent) one? Explore the differences between RPOs and RTOs, determine how to apply them to server workloads, and learn guidelines for selecting the right DR technologies for the right workloads.
Mike Robinson is a senior product marketing manager at NetIQ Corporation.
Agenda
• The Importance of Disaster Recovery
• 3 Phases of Disaster Recovery
• Key Terminology
• Availability Tiers
• The Disaster Recovery Dichotomy
• Virtualized Disaster Recovery
• Matching Technology to Requirements
• Next Steps
The Importance of Disaster Recovery
Why Disaster Recovery Matters
• Total economic damage from disasters in 2009: $41.3 billion [1]
• Economic impact felt in the U.S. from disasters in 2009: $10.8 billion [1]
• Businesses that went bankrupt within 1 year after being unable to use their datacenter for 10 consecutive days: 93% [2]
• Proportion of companies that close within 2 years after losing data during a disaster: 90% [3]
[1] Forrester, September 2010: Business Continuity and Disaster Recovery are top IT Priorities
[2] National Archives & Records Administration
[3] 2003 London Chamber of Commerce and Industry Paper
Disaster Recovery Pressure: IT as Competitive Advantage
Availability
Cost of Downtime
Uptime Expectations
Number of Critical Systems
Employee and Customer Expectations
Backup Windows
Tolerance for Downtime
Failover Windows
Response Time
More Applications Classified as Critical
“What percentage of your applications and data fall into the following tiers?”
Mission-critical 34%
Business-critical 35%
Noncritical 31%
Source: Forrester/Disaster Recovery Journal November 2010 Global Disaster Recovery Preparedness Online Survey
Disaster Recovery – Top Priority
“Audience polling … shows that of all the data management options, re-architecting backup and recovery was viewed as the top priority….”
Source: Gartner (August, 2011)
Key Terminology
What is a Server Workload?
Server
Data
Applications
Operating System
A workload is the operating system, applications, middleware and data that reside on a physical server or virtual host.
The 3 Phases of Disaster Recovery
Disaster Recovery means:
1. Backing up (replicating) entire server workloads (the contents of a server, including the operating system, applications and data),
2. Recovering workloads during an outage, and
3. Restoring workloads to their original production locations after the outage.
Key Disaster Recovery Concepts
RPO: Recovery Point Objective
– A measure of maximum acceptable data loss in terms of time (minutes, hours, days).
– An RPO of 4 hours means that the most recent backup has to be no more than 4 hours old at the time of an outage.
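The RPO rule can be expressed as a simple age check on the newest backup. The sketch below is illustrative only: the function name `meets_rpo` and the timestamps are hypothetical and do not come from any NetIQ product.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """Return True if the newest backup is no older than the RPO when the outage starts."""
    return outage_start - last_backup <= rpo


rpo = timedelta(hours=4)
outage = datetime(2014, 9, 7, 12, 0)  # hypothetical outage at noon

# Backup taken 3 hours before the outage: within a 4-hour RPO.
print(meets_rpo(datetime(2014, 9, 7, 9, 0), outage, rpo))
# Backup taken 6 hours before the outage: the RPO is missed.
print(meets_rpo(datetime(2014, 9, 7, 6, 0), outage, rpo))
```

The same comparison, applied to the time between outage and service restoration, gives the equivalent check for an RTO.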
Key Disaster Recovery Concepts
RTO: Recovery Time Objective
– The target maximum allowable time to recover from an outage.
– An RTO of 4 hours means systems have to be back up and operational no more than 4 hours after an outage.
Key Disaster Recovery Concepts: RPO and RTO
[Timeline figure: a 42-hour axis marked every 3 hours, from 9 pm to 3 pm two days later. Labels: Tape backup window, Tape backup, Outage begins, Servers repaired/replaced, Restore, Service restored, Lost data, Downtime, Recovery time.]
RPO = 24 hours – Actual recovery point: 21 hours
RTO = 12 hours – Actual recovery time: 15 hours
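The two comparisons on this slide can be worked through as date arithmetic. The timestamps below are assumptions chosen to reproduce the slide's 21-hour and 15-hour figures (a 9 pm tape backup, an outage at 6 pm the next day, service restored at 9 am the day after); the slide itself does not state the exact times.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps, picked to match the 21 h and 15 h figures on the slide.
last_backup = datetime(2014, 9, 1, 21, 0)       # tape backup completes at 9 pm
outage_start = datetime(2014, 9, 2, 18, 0)      # outage begins at 6 pm the next day
service_restored = datetime(2014, 9, 3, 9, 0)   # service restored at 9 am the day after

rpo = timedelta(hours=24)
rto = timedelta(hours=12)

# Lost data: everything written between the last backup and the outage.
actual_recovery_point = outage_start - last_backup
# Downtime: time from the outage until service is restored.
actual_recovery_time = service_restored - outage_start

rpo_met = actual_recovery_point <= rpo
rto_met = actual_recovery_time <= rto

print(f"Recovery point: {actual_recovery_point.total_seconds() / 3600:.0f} h, RPO met: {rpo_met}")
print(f"Recovery time:  {actual_recovery_time.total_seconds() / 3600:.0f} h, RTO met: {rto_met}")
```

Under these assumptions the recovery point objective is met (21 hours against a 24-hour RPO) while the recovery time objective is missed (15 hours against a 12-hour RTO).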
Key Disaster Recovery Concepts
Availability tiers: 99.9%, Five 9’s, etc.
– Groupings of server workloads by uptime requirements or SLAs
– Different availability requirements have different costs and use different technologies
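An availability percentage translates directly into a downtime allowance: the unavailable fraction multiplied by the length of the period. The sketch below assumes a 365-day year and a 30-day month; the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24    # 8,760 hours, assuming a non-leap year
HOURS_PER_MONTH = 30 * 24    # 720 hours, assuming a 30-day month


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability percentage."""
    return (1 - availability_pct / 100) * period_hours


for pct in (90, 95, 99, 99.9, 99.99, 99.999):
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct:>7}%  {per_year:10.3f} h/year  {per_month * 60:10.3f} min/month")
```

The monthly figure depends on the month length chosen: at 99.9% a 30-day month allows 43.2 minutes, whereas one-twelfth of a 365-day year allows 43.8 minutes.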
Disaster Recovery Solutions
Availability %            Downtime per Year    Downtime per Month (30 days)
90 (“one nine”)           36.5 days            72 hours
95                        18.25 days           36 hours
99 (“two nines”)          3.65 days            7.2 hours
99.9 (“three nines”)      8.76 hours           43.2 minutes
99.99 (“four nines”)      52.56 minutes        4.32 minutes
99.999 (“five nines”)     5.26 minutes         25.9 seconds
Typical RTO/RPO, from the lowest to the highest availability tiers: 12 – 24 hours; 15 – 60 minutes; <5 minutes
The Disaster Recovery Dichotomy
Disaster Recovery Budgets Rising Slowly
“Approximately what percentage of your combined IT operating and capital budget will go to business continuity and disaster recovery?”
• Q2 2010 (N = 566): 5.27%
• Q2 2011 (N = 476): 5.38%
• Q2 2012 (N = 471): 6.15%
Base: IT decision-makers from US organizations with more than 500 employees
Source: Forrsights Budgets And Priorities Tracker Survey, Q2 2010, Q2 2011, Q2 2012
Disaster Recovery SLAs
“Audience polling shows that 87% of the enterprises surveyed have RTO for their most mission-critical applications/services as four hours or less….”
RTOs of Mission-Critical Applications (n=93)
Source: Gartner (August, 2011)
Mission-Critical Recovery Objectives
“For mission-critical … systems in your organization, what are/were the recovery objectives today and three years ago?”
[Chart: recovery point objectives (RPO) and recovery time objectives (RTO), each shown for “Three years ago” and “Today”. Legend: Less than 15 minutes, Less than 1 hour, Less than 4 hours, Less than 24 hours, Less than 72 hours. Data labels in extraction order: 25%, 39%, 14%, 31%; 24%, 20%, 27%, 27%; 25%, 14%, 25%, 14%; 12%, 6%, 14%, 6%.]
Base: 51 disaster recovery decision-makers at enterprises with more than 500 employees
Source: A commissioned study conducted by Forrester Consulting on behalf of NetIQ, December 2012
Business-Critical App Recovery Objectives
“For mission-critical and business-critical systems in your organization, what are/were the recovery objectives today and three years ago?”
[Chart (business-critical): recovery point objectives (RPO) and recovery time objectives (RTO), each shown for “Three years ago” and “Today”. Legend: Less than 15 minutes, Less than 1 hour, Less than 4 hours, Less than 24 hours, Less than 72 hours. Data labels in extraction order: 25%, 31%, 18%, 27%; 18%, 29%, 25%, 37%; 37%, 16%, 37%, 14%; 8%, 4%, 6%, 4%.]
Base: 51 disaster recovery decision-makers at enterprises with more than 500 employees
Source: A commissioned study conducted by Forrester Consulting on behalf of NetIQ, December 2012
Most Organizations Want to Improve Recovery Objectives
“What are your plans for improving your current recovery objectives?”
• We have plans to improve within the next 6 months: 41%
• We have plans to improve in the next 6 to 12 months: 33%
• We would like to, but we have no plans currently: 16%
• We are happy with the current objectives: 10%
Base: 51 disaster recovery decision-makers at enterprises with more than 500 employees
Source: A commissioned study conducted by Forrester Consulting on behalf of NetIQ, December 2012
The Need for Better Protection
• More workloads are considered to be business critical and need better protection
– “Historically, the proportion of an organization's applications that it deems mission-critical has been between 10% and 20% … of the audience, 60% has more than 20% of their applications/services categorized at the highest level of criticality” Gartner (August 2011)
• IT is under pressure to stretch its budget to accommodate business needs
– “Best practice points to spending more money on the 20% that is mission-critical and less on the 80% that isn't, to reduce or eliminate the impact of an outage on the business” Gartner (August 2011)
• Traditional disaster recovery approaches are either too expensive or inadequate in terms of protection…
Mirroring is Used for Critical Apps
“How do you copy data between your primary and recovery site(s)?” (Select all that apply)
[Chart: share of respondents using each method, by tier. Methods: Synchronous replication, Asynchronous replication, Periodic point in time copies, Remote backup over the wide area network, Backup locally to tape and transport our tapes. Tiers: Mission critical, Business critical and Non-critical applications and data. Data labels in extraction order: 35%, 41%, 22%, 17%, 39%, 22%, 28%, 29%, 18%, 46%, 5%, 8%, 24%, 17%, 59%.]
Base: 136 disaster recovery decision-makers at enterprises with more than 500 employees
Source: Forrester/Disaster Recovery Journal November 2010 Global Disaster Recovery Preparedness Online Survey
Synchronous Replication Cost Issues
• Building & Running a Secondary Datacenter
– Secondary location outside of “disaster” zone
– Out of state, country, even continent
• Additional Equipment & Software Licenses
– Duplicate hardware and software licenses
– Significant expense for rarely or under-utilized servers
• Networking
– Bandwidth costs between sites can be significant
– Varies with amount of data and RPO tolerances
• Staff
– Training, testing, additional specialized staff, etc.
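The networking point above, that bandwidth needs vary with the amount of data and the RPO tolerance, can be illustrated with a back-of-the-envelope estimate. This is a deliberately simplified sketch: it assumes a single burst of changed data, a fully available link, decimal units (1 GB = 8,000 megabits), and no compression or protocol overhead. The function name and the example figures are hypothetical.

```python
def transfer_minutes(changed_gb: float, link_mbps: float) -> float:
    """Minutes needed to ship `changed_gb` of changed data over a `link_mbps` link."""
    megabits = changed_gb * 8000          # 1 GB = 8,000 megabits (decimal units)
    return megabits / link_mbps / 60      # seconds -> minutes


rpo_minutes = 15  # hypothetical RPO for a critical workload

for link_mbps in (100, 1000):
    minutes = transfer_minutes(50, link_mbps)   # hypothetical 50 GB burst of changes
    verdict = "within" if minutes <= rpo_minutes else "exceeds"
    print(f"{link_mbps:>5} Mbps: {minutes:6.1f} min to replicate, {verdict} the {rpo_minutes}-minute RPO")
```

While a burst is still in transit the recovery site lags production, so the tighter the RPO, the faster (and costlier) the link has to be for the same volume of change.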
Mirroring is Used for Critical Apps; Tape is for Non-Critical Applications
Virtualized Disaster Recovery
Disaster Recovery Solutions
Availability %            Downtime per Year    Downtime per Month (30 days)
90 (“one nine”)           36.5 days            72 hours
95                        18.25 days           36 hours
99 (“two nines”)          3.65 days            7.2 hours
99.9 (“three nines”)      8.76 hours           43.2 minutes
99.99 (“four nines”)      52.56 minutes        4.32 minutes
99.999 (“five nines”)     5.26 minutes         25.9 seconds
Typical RTO/RPO, from the lowest to the highest availability tiers: 12 – 24 hours; 15 – 60 minutes; <5 minutes
Additional column headings on this slide: Cost, Solution
Disaster Recovery Using Tape
Tape headaches
• Physical media needs to be managed
• Sequential access to data
Data-only view of the world
• Data is backed up but system is not
• How do you effectively recover the entire server?
Slow restore
• Data-only view makes restore a long process
• Rebuild server, install OS, install application, recover data
Slow testing
• Same painful restore process
• Laborious testing leads to no testing
Recovery Risk – How well protected are you?
High Availability in the Physical World
Wide Area Network
Disaster Recovery with Virtualization
Wide Area Network
Benefits of Virtualized DR
• Heterogeneous workload protection
– Protect Windows and Linux workloads, both physical and virtual, with the same system
• Rapid failover and failback
– Warm standby VMs provide very fast recovery
– Restore to bare metal, repaired server or virtual platform
– Use hypervisor-based snapshots for point-in-time recovery
• Safe sandbox testing
– Virtual workloads can be tested at any time without affecting the production environment
• Simplified licensing
– No additional OS or app licenses required on recovery servers
Taking Disaster Recovery to the Cloud
Cloud-Based DR Delivery
Do-It-Yourself: Configure & manage your own solution using public cloud resources
DR-as-a-Service: Prepackaged pay-as-you-go recovery services to the cloud with specified RPO & RTO SLAs
Cloud-to-Cloud DR: Failover from one cloud environment to another
Traditional Disaster Recovery
[Chart: RPO/RTO (best to worst) plotted against cost ($ to $$$$) for the three approaches below.]
• Offsite Replication: Expensive; requires a secondary site and redundant hardware (which is idle or under-utilized most of the time)
• Local Replication: Only good for individual server failure. No protection against site failures.
• Vaulting (tape, imaging): Recovery can take days or weeks. Difficult to test.
Storage as a Service
Advantages
• Fixed per-gigabyte cost
• Off-site cloud-based storage
• Scale up or down on demand
• Service provider handles hardware maintenance, backups
Disadvantages
• Data only, not workloads
• Static storage can’t run server workloads
• If a local outage occurs, data needs to be copied to recovery environment first
Recovery as a Service: Storage as a Service + IaaS
Advantages
• Fixed per-gigabyte cost
• Off-site cloud-based storage
• Scale up or down on demand
• Service provider handles hardware maintenance, backups
More Advantages
• Protect whole workloads, not just data
• Replicate to the cloud, recover and run in the cloud
• Live restore back to repaired data center
RaaS vs. Traditional Disaster Recovery
[Chart: RPO/RTO (best to worst) plotted against cost ($ to $$$$) for Offsite Protection, Local Protection, Vaulting and RaaS.]
Private/Hybrid RaaS
• Dedicated backup hardware at service provider premise
• Scale by adding hardware
• Hardware owned, managed, maintained by customer or service provider
• Replicate workloads directly to offsite facility
• Run recovery workloads in dedicated environment
Public RaaS
• Shared backup hardware at service provider premise
• Scale using service provider’s resource pool
• Hardware owned, managed, maintained by service provider
• Replicate workloads directly to offsite facility
• Run recovery workloads in shared environment
Next Steps…
• Determine your tolerance for downtime & data loss
• Establish DR metrics
– Categorize server workloads into tiers by RPO, RTO
• Match organizational needs to DR technologies
– You can and will use multiple technologies
– Balance budget with needs
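The tiering step above can be sketched as a small classification routine. Everything here is an illustrative assumption: the three tier boundaries reuse the "typical RTO/RPO" ranges from the availability table, the technology suggested for each tier is one plausible pairing drawn from the approaches discussed in this deck, and the workload names and objectives are made up.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Workload:
    name: str
    rpo: timedelta
    rto: timedelta


def recommend(workload: Workload) -> str:
    """Pick the least demanding tier that still satisfies the tighter of RPO and RTO."""
    tightest = min(workload.rpo, workload.rto)
    if tightest >= timedelta(hours=24):
        # Assumed pairing: the 12-24 hour tier maps to tape or image vaulting.
        return "tier 3: vaulting (tape, imaging)"
    if tightest >= timedelta(minutes=60):
        # Assumed pairing: the 15-60 minute tier maps to virtualized DR with warm standby VMs.
        return "tier 2: virtualized DR (warm standby VMs)"
    # Assumed pairing: anything tighter maps to synchronous replication (mirroring).
    return "tier 1: synchronous replication (mirroring)"


# Hypothetical workloads and objectives.
workloads = [
    Workload("order-entry", rpo=timedelta(minutes=1), rto=timedelta(minutes=5)),
    Workload("email", rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    Workload("file-archive", rpo=timedelta(hours=24), rto=timedelta(hours=48)),
]

for w in workloads:
    print(f"{w.name:<14} {recommend(w)}")
```

Classifying by the tighter of the two objectives errs on the side of more protection; an organization balancing budget against need might instead tier RPO and RTO separately.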
Mike Robinson Sr. Product Marketing Manager [email protected]
+1 713.548.1700 (Worldwide) 888.323.6768 (Toll-free) [email protected] NetIQ.com
Worldwide Headquarters 515 Post Oak Blvd., Suite 1200 Houston, TX 77027 USA http://community.netiq.com