Post on 19-Apr-2020
Operational Resiliency for a Virtualized Environment
Peter Laz, MBCP, MBCI, Managing Consultant, Forsythe
Brendan Foye, Enterprise Account Manager, Zerto
AGENDA
• Operational Resiliency vs. Disaster Recovery
• Hypothetical Case Study
• Summary and Q&A
OPERATIONAL RESILIENCY VS. DISASTER RECOVERY
OPERATIONAL RESILIENCY MODEL
Optimum level of performance
Architecture and processes for continuous availability of business operations and IT environments
A much larger percentage of IT services and business functions experience:
• Continuous availability of IT service
• End-to-end process is business as usual (application interdependencies, no workarounds)
• Full performance and capacity (IT & business functions), no customer service impact
TRADITIONAL DR MODEL
Minimum acceptable level of performance
Invoke alternate procedures to recover & resume operations following a significant disruptive event
Interruption in the following for a large percentage of apps and business functions:
• IT service
• Degraded IT capability
• Operation/workflow
• Limited customer service
TRADITIONAL DR SOLUTION VIEW
PRODUCTION DATA CENTER
• Full production (capacity and performance)
• Application HA within the data center only
• Metrics include capacity, performance and availability within the data center
DR SITE (INTERNAL OR VENDOR PROVIDED)
• Used in response to single-site outage (out-of-region)
• Limited capacity & performance
• Primary metrics are RTO / RPO
OPERATIONAL RESILIENCY SOLUTION VIEW
PRODUCTION DATA CENTER and PRODUCTION DATA CENTER II, 10 – 50 miles apart
• Production applications at both sites, in-region
• Application HA capabilities within and between sites (converged HA & DR)
• Metrics include capacity, performance and availability applied to both application and site-level outage events
DR SITE (INTERNAL OR VENDOR PROVIDED)
• Used in response to dual regional outage (out-of-region)
• Limited capacity & performance
• Primary metrics are RTO / RPO
CHALLENGE OF MOVING TO THE ‘RESILIENT’ MODEL
Controlling the proliferation of technologies that arise to meet resiliency requirements is key, because they:
• Produce functional gaps
• Drive up risk
• Drive up cost
VIRTUAL CASE STUDY
• A mid-size enterprise with 100 VMs
• US and International Data Centers
• Legacy hardware, considering new/dissimilar flash array
• Lean IT team, one headcount dedicated to replication
• High standards for RPO, RTO, SLA
• Regular audits
ACME ANVIL CORPORATION
CURRENT
TIER   RTO          RPO
1      24 hrs       24 hrs
2      48 hrs       24 hrs
3      3 – 5 days   1 week
Sites:
• St. Louis HQ: Corporate Data Center
• Chicago: Co-lo DR site
• Denver: West Region Hub
• Raleigh: East Region Hub
• Remote Offices
CHALLENGE: REDUCE COSTS
Fat data
• Space-consuming snapshots
• Not deduplicated or compressed
• Replicating all VMs and VDRs, regardless of which ones you really care about (such as application groups)
Overseas operations
• Requires big pipe
• Adds expense, complexity
Duplicated hardware
• Backup array always matches storage array
• Costs doubled without increased functionality
CHALLENGE: MANAGEMENT COMPLEXITY
Valuable headcount assigned to DR
• Manually re-mapping storage arrays
Additional time-consuming tasks mean the team spends more hours in DR than production, such as:
• De-duping and compressing data
• Overseas pipeline management
• Managing refreshes
• After-hours audit support
• Restoring applications
CHALLENGE: MEET RPOS & RTOS
Event-based challenges
• Failback and recovery
• General DR tests
• Application tests/development
• Multi-site, multi-country data transfer
Functionality-based challenges
• Snapshot-only environment
FROM THE CTO
• More flexibility in hardware with less money
• Sick of spending time at DR site vs. production site
• No DR downtime and no lost data
• Maintain SLAs & ensure RTOs and RPOs
• Implement all this yesterday
• Small learning curve
• Simple/short install that won’t consume a lot of resources
ELMER’S MANDATE
TIER   RTO        RPO
1      < 4 hrs    < 30 mins
2      < 24 hrs   < 30 mins
3      48 hrs     24 hrs
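To make the gap between the current tiers and the mandate concrete, the two tables can be compared directly. The following is a minimal Python sketch, not part of the original deck: the names `Tier`, `CURRENT`, `MANDATE` and `tightening` are illustrative assumptions, the current Tier 3 RTO of "3 – 5 days" is taken at its 5-day upper bound, and the "<" bounds in the mandate are stored as their limiting values.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    rto: timedelta  # recovery time objective
    rpo: timedelta  # recovery point objective


# Current targets from the ACME Anvil slide (Tier 3 RTO assumed at the 5-day upper bound).
CURRENT = {
    1: Tier(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
    2: Tier(rto=timedelta(hours=48), rpo=timedelta(hours=24)),
    3: Tier(rto=timedelta(days=5), rpo=timedelta(weeks=1)),
}

# Mandated targets; "< 4 hrs" and similar bounds are stored as their limits.
MANDATE = {
    1: Tier(rto=timedelta(hours=4), rpo=timedelta(minutes=30)),
    2: Tier(rto=timedelta(hours=24), rpo=timedelta(minutes=30)),
    3: Tier(rto=timedelta(hours=48), rpo=timedelta(hours=24)),
}


def tightening(current: Tier, target: Tier) -> tuple[float, float]:
    """Return how many times tighter the target RTO and RPO are than the current ones."""
    return current.rto / target.rto, current.rpo / target.rpo


if __name__ == "__main__":
    for tier in sorted(CURRENT):
        rto_x, rpo_x = tightening(CURRENT[tier], MANDATE[tier])
        print(f"Tier {tier}: RTO {rto_x:g}x tighter, RPO {rpo_x:g}x tighter")
```

Under these assumptions the Tier 1 and Tier 2 RPOs tighten by a factor of 48 (24 hours down to 30 minutes), which is the kind of jump a snapshot-only environment struggles to deliver.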
SOLUTION: FLASH AND HYPERVISOR-BASED DR
• Not everything is virtualized or hypervisor based
• All primary applications and data need protection, including (especially) larger, non-x86 infrastructure
• One size does not fit all
• But complementary technologies will help mitigate risk
• Flash at the macro/array/volume/host group/protection group level
• Zerto at the granular/VM/VM group/cluster level
SOLUTION: DECREASE COSTS
• Streamline data with deduplication & compression
• Reduce workload on overseas pipeline
• Enable hardware-agnostic replication and DR (any-to-any)
• Test out adding new functionality to arrays
• Support multi-site, multi-country with minimal performance impact
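As a rough illustration of why deduplication and compression shrink the data that has to cross the overseas pipeline, here is a minimal Python sketch. It is a generic technique demonstration under stated assumptions (fixed 4 KiB blocks, SHA-256 block hashes, `zlib` compression), not a description of how any particular array or replication product works; the function names `dedupe_and_compress` and `restore` are hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def dedupe_and_compress(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks and keep one compressed copy per unique block.

    Returns (store, recipe): store maps a block hash to its compressed bytes,
    recipe is the ordered list of hashes needed to rebuild the data.
    """
    store: dict[str, bytes] = {}
    recipe: list[str] = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:  # only previously unseen blocks are stored
            store[digest] = zlib.compress(block)
        recipe.append(digest)
    return store, recipe


def restore(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Rebuild the original data from the block store and the recipe."""
    return b"".join(zlib.decompress(store[digest]) for digest in recipe)


if __name__ == "__main__":
    # Synthetic payload with many repeated blocks, as with near-identical VM images.
    payload = (b"A" * BLOCK_SIZE) * 50 + (b"B" * BLOCK_SIZE) * 50
    store, recipe = dedupe_and_compress(payload)
    stored_bytes = sum(len(chunk) for chunk in store.values())
    assert restore(store, recipe) == payload
    print(f"original: {len(payload)} bytes, stored: {stored_bytes} bytes")
```

Only the unique compressed blocks plus the recipe would need to be transferred, which is the effect the "small data vs. fat data" contrast in this deck refers to.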
SOLUTION: STREAMLINE COMPLEXITY
Free up headcount from DR and replication
• Automatically re-map storage, no manual re-mapping
Simplify other key time sucks
• Automatically de-dupe and compress data
• Reduce need for overseas pipeline
• Streamline refreshes
• Audit support during business hours
• Restore applications in less time
SOLUTION: MEET OR EXCEED REQUIREMENTS
Enable new capabilities in key workloads
• Full live failover and failback
• Small data vs. fat data
• Refreshes take minutes instead of hours or days
• Enable near-sync continuous replication as well as snapshots
• Streamline restores with Point-in-Time Recovery journal
• Deliver RPOs in seconds, RTOs in minutes
HYPERVISOR–BASED DR IN ACTION
ZERTO VIRTUAL REPLICATION ARCHITECTURE
[Diagram: two sites, each with its own vCenter Server]
• Replicate from anything to anything: save cost and reuse hardware
• Highly scalable: software only, hypervisor-based, downloadable
• Bandwidth optimization, WAN resiliency
• Point-in-time recovery: recover from logical failures with journal-based, any-point-in-time recovery, no snapshots
• RPO = seconds: near-sync, continuous replication with no application performance impact
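The idea behind journal-based, any-point-in-time recovery can be sketched in a few lines of Python. This is a generic illustration of an append-only write journal, not Zerto's implementation; the `Journal` class, its methods and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Append-only log of block writes on top of a base disk image."""
    base: dict[int, bytes]
    entries: list[tuple[float, int, bytes]] = field(default_factory=list)

    def record(self, timestamp: float, block: int, data: bytes) -> None:
        """Append one write; timestamps are assumed to be non-decreasing."""
        self.entries.append((timestamp, block, data))

    def recover(self, point_in_time: float) -> dict[int, bytes]:
        """Rebuild the disk as it was at point_in_time by replaying writes in order."""
        image = dict(self.base)  # copy, so the base image is never modified
        for timestamp, block, data in self.entries:
            if timestamp > point_in_time:
                break  # later writes (for example a corruption) are left out
            image[block] = data
        return image


if __name__ == "__main__":
    journal = Journal(base={0: b"boot", 1: b"empty"})
    journal.record(10.0, 1, b"orders-v1")
    journal.record(20.0, 1, b"orders-v2")
    journal.record(30.0, 1, b"CORRUPTED")  # logical failure at t=30
    print(journal.recover(25.0))  # state just before the corruption
```

Because every write is kept in order, recovery can target any moment covered by the journal rather than only the moments at which a snapshot happened to be taken.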
APPLICATION PROTECTION: VIRTUAL PROTECTION GROUP
[Diagram: an application (SharePoint, CRM, ERP, Exchange, etc.) protected as a Virtual Protection Group and replicated to the replication site]
Complete application protection and recovery
• VM & VMDK level consistency groups
• Protect across server and storage locations
• Fully support VMotion, Storage VMotion, HA, vApp
• Journal-based point-in-time protection
• Group policy and configuration
• VSS Support
AUTOMATION: FAILOVER, FAILBACK, RECOVERY
• RTO = minutes: fully automated failover and failback of multiple VMs with write-order fidelity, including parallel VM recovery, boot order, IP reconfiguration, test networks and more
• Offsite cloning: clone entire app offsite for test & dev or backup
• Click-to-test, anytime: immediate, automated failover testing while protecting production, also to a previous point in time
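Boot order combined with parallel VM recovery can be pictured as recovering VMs in "waves": everything with the same boot order starts together, and the next wave waits for the previous one. The Python sketch below is an illustrative assumption of how such orchestration might look, not the product's actual logic; `plan_boot_waves`, `fail_over` and the VM names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def plan_boot_waves(vms: list[tuple[str, int]]) -> list[list[str]]:
    """Group (vm_name, boot_order) pairs into waves, lowest boot order first."""
    ordered = sorted(vms, key=lambda vm: (vm[1], vm[0]))
    return [[name for name, _ in wave]
            for _, wave in groupby(ordered, key=lambda vm: vm[1])]


def fail_over(vms, recover_vm) -> list:
    """Recover each wave in parallel; start the next wave only when one finishes."""
    recovered = []
    with ThreadPoolExecutor() as pool:
        for wave in plan_boot_waves(vms):
            # Consuming pool.map blocks until every VM in the wave is recovered.
            recovered.extend(pool.map(recover_vm, wave))
    return recovered


if __name__ == "__main__":
    # Hypothetical CRM application: database first, then app tier, then web tier.
    crm = [("crm-web", 3), ("crm-db", 1), ("crm-app", 2), ("crm-cache", 2)]
    print(plan_boot_waves(crm))
    fail_over(crm, recover_vm=lambda name: print(f"recovering {name}"))
```

In a real environment `recover_vm` would power on the replica and apply steps such as IP reconfiguration; here it is only a placeholder callback.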
WORKFLOW AUTOMATION - END-USER ACCEPTED RECOVERY
• End-user acceptance testing with the ability to roll back a failover automatically
  ○ Validates prior to production release of application
  ○ Simply recover from logical failures
• Ability to automate the commit or failback event
  ○ Reduce operational complexity with workflow
  ○ Significantly reduces the time it takes to reverse a failover activity
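The commit-or-rollback step described above behaves like a small state machine: fail over, let end users validate, then either commit or return to the protected state. The following Python sketch is a hypothetical model of that workflow; the `State` names and the `FailoverWorkflow` class are assumptions made for illustration and do not correspond to a real product API.

```python
from enum import Enum


class State(Enum):
    PROTECTED = "protected"    # application running at the production site
    TESTING = "testing"        # failed over, awaiting end-user acceptance
    COMMITTED = "committed"    # failover accepted and made permanent


class FailoverWorkflow:
    """Tiny state machine: fail over, then commit or roll back."""

    def __init__(self) -> None:
        self.state = State.PROTECTED

    def fail_over(self) -> None:
        if self.state is not State.PROTECTED:
            raise RuntimeError(f"cannot fail over from {self.state.value}")
        self.state = State.TESTING

    def resolve(self, accepted: bool) -> State:
        """Commit when users accept the recovered application, otherwise roll back."""
        if self.state is not State.TESTING:
            raise RuntimeError(f"nothing to resolve in state {self.state.value}")
        self.state = State.COMMITTED if accepted else State.PROTECTED
        return self.state


if __name__ == "__main__":
    workflow = FailoverWorkflow()
    workflow.fail_over()
    print(workflow.resolve(accepted=False))  # rejected, so rolled back
```

Modelling the decision as an explicit transition is what allows it to be automated, for example by committing only after an acceptance check succeeds.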
BEFORE – COMPLEX, MANUAL REPLICATION PROCESS
Example: current replication configuration process for virtualized CRM
Components: Local vCenter, Storage, Replication Management, Remote Storage, Remote vCenter
Teams: Virtualization Team, Storage Team
• Locate all VMs affecting CRM
• Locate all VM datastores
• Map & document all LUNs
• Document all LUN properties
• Consolidate CRM VMs on separate LUN
• Consolidate all CRM VMs to same LUN
• Move all other app VMs to other LUNs
• Allocate LUNs in replica with same properties
• Ensure sufficient space for replica
• Configure all replication pairs and entities
• Verify replication
• Ongoing monitoring
• Create and document recovery plan
• Test recovery plan
Complex! Manual! Inflexible!
AFTER – AUTOMATED REPLICATION PROCESS
Example: replication configuration process for virtualized CRM
• Locate all VMs affecting CRM
• Allocate space for all replicated VMs
• Ensure sufficient space for replica
• Configure and verify replication and policies
• Ongoing replication monitoring
SUMMARY
Business requirements:
• Driving demand toward Operational Resiliency model, including Elmer’s requirements:
  ○ Reduce costs, complexity, & resources
  ○ Assure service level capabilities
    – Traditional DR
    – HA within & between sites
    – Metrics for capacity, performance & availability