
Post on 19-Apr-2020


Operational Resiliency for a Virtualized Environment

Peter Laz, MBCP, MBCI Managing Consultant

Forsythe

Brendan Foye Enterprise Account Manager

Zerto

2

AGENDA

• Operational Resiliency vs. Disaster Recovery

• Hypothetical Case Study

• Summary and Q&A

3

OPERATIONAL RESILIENCY VS. DISASTER RECOVERY

OPERATIONAL RESILIENCY MODEL: Optimum level of performance

Architecture and processes for continuous availability of business operations and IT environments

A much larger percentage of IT services and business functions experience:

•  Continuous availability of IT service
•  End-to-end process is business as usual (application interdependencies, no workarounds)
•  Full performance and capacity (IT & business functions)
•  No customer service impact

TRADITIONAL DR MODEL: Minimum acceptable level of performance

Invoke alternate procedures to recover and resume operations following a significant disruptive event

Interruption in the following for a large percentage of apps and business functions:

•  IT service
•  Degraded IT capability
•  Operation/workflow
•  Limited customer service

4

TRADITIONAL DR SOLUTION VIEW

PRODUCTION DATA CENTER

•  Full production (capacity and performance)
•  Application HA within the data center only
•  Metrics include capacity, performance and availability within the data center

DR SITE (INTERNAL OR VENDOR PROVIDED)

•  Used in response to single-site outage (out-of-region)
•  Limited capacity & performance
•  Primary metrics are RTO / RPO

5

OPERATIONAL RESILIENCY SOLUTION VIEW

PRODUCTION DATA CENTER and PRODUCTION DATA CENTER II (10 – 50 miles)

•  Production applications at both sites, in-region
•  Application HA capabilities within and between sites (converged HA & DR)
•  Metrics include capacity, performance and availability applied to both application and site-level outage events

DR SITE (INTERNAL OR VENDOR PROVIDED)

•  Used in response to dual regional outage (out-of-region)
•  Limited capacity & performance
•  Primary metrics are RTO / RPO

6

CHALLENGE OF MOVING TO THE ‘RESILIENT’ MODEL

Controlling the proliferation of technologies that arise to meet resiliency requirements is key, because they:

•  Produce functional gaps
•  Drive up risk
•  Drive up cost

7

VIRTUAL CASE STUDY

•  A mid-size enterprise with 100 VMs

•  US and International Data Centers

•  Legacy hardware, considering new/dissimilar flash array

•  Lean IT team, one headcount dedicated to replication

•  High standards for RPO, RTO, SLA

•  Regular audits

8

ACME ANVIL CORPORATION

CURRENT

TIER   RTO          RPO
1      24 hrs       24 hrs
2      48 hrs       24 hrs
3      3 – 5 days   1 week

•  St. Louis HQ: Corporate Data Center
•  Chicago: Co-lo DR site
•  Denver: West Region Hub, with remote offices
•  Raleigh: East Region Hub, with remote offices

9

CHALLENGE: REDUCE COSTS

Fat Data
•  Space-consuming snapshots
•  Not de-duped or compressed
•  Replicating all VMs and VDRs, regardless of which you really care about (such as application groups)

Overseas operations
•  Requires big pipe
•  Adds expense, complexity

Duplicated hardware
•  Back-up array always matches storage array
•  Costs doubled without increased functionality

10

CHALLENGE: MANAGEMENT COMPLEXITY

Valuable headcount assigned to DR
•  Manually re-mapping storage arrays

Additional time-consuming tasks mean the team spends more hours in DR than production, such as:
•  De-duping and compressing data
•  Overseas pipeline management
•  Managing refreshes
•  After-hours audit support
•  Restoring applications

11

CHALLENGE: MEET RPOS & RTOS

Event-based challenges
•  Failback and recovery
•  General DR tests
•  Application tests/development
•  Multi-site, multi-country data transfer

Functionality-based challenges
•  Snapshot-only environment

12

FROM THE CTO

•  More flexibility in hardware with less money

•  Sick of spending time at DR site vs. production site

•  No DR downtime and no lost data

•  Maintain SLAs & ensure RTOs and RPOs

•  Implement all this yesterday

•  Small learning curve

•  Simple/short install that won’t consume a lot of resources

ELMER’S MANDATE

TIER   RTO        RPO
1      < 4 hrs    < 30 mins
2      < 24 hrs   < 30 mins
3      48 hrs     24 hrs
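To show how much tighter the mandate is, the sketch below compares the current objectives on the ACME Anvil slide with the mandated ones. It is only an illustration: the dictionary layout and the function name `tightening_factors` are assumptions, limits such as "< 4 hrs" are stored as their upper bound, and "3 – 5 days" is taken as 5 days.

```python
from datetime import timedelta

# Current objectives per tier (ACME Anvil "CURRENT" table).
# Assumption: "3 - 5 days" is represented by its upper bound, 5 days.
CURRENT = {
    1: {"rto": timedelta(hours=24), "rpo": timedelta(hours=24)},
    2: {"rto": timedelta(hours=48), "rpo": timedelta(hours=24)},
    3: {"rto": timedelta(days=5), "rpo": timedelta(weeks=1)},
}

# Mandated objectives per tier.
# Assumption: "< 4 hrs" style limits are stored as the limit itself.
MANDATE = {
    1: {"rto": timedelta(hours=4), "rpo": timedelta(minutes=30)},
    2: {"rto": timedelta(hours=24), "rpo": timedelta(minutes=30)},
    3: {"rto": timedelta(hours=48), "rpo": timedelta(hours=24)},
}


def tightening_factors(current, mandate):
    """Return, per tier, how many times shorter each mandated objective is."""
    factors = {}
    for tier in sorted(current):
        factors[tier] = {
            metric: current[tier][metric] / mandate[tier][metric]
            for metric in ("rto", "rpo")
        }
    return factors


for tier, factor in tightening_factors(CURRENT, MANDATE).items():
    print(f"Tier {tier}: RTO x{factor['rto']:g} tighter, RPO x{factor['rpo']:g} tighter")
```

Under these assumptions the largest change is the Tier 1 and Tier 2 RPO, which shrinks from 24 hours to under 30 minutes.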

13

SOLUTION: FLASH AND HYPERVISOR-BASED DR

•  Not everything is virtualized or hypervisor based

•  All primary applications and data need protection, including (especially) larger, non-x86 infrastructure

•  One size does not fit all

•  But complementary technologies will help mitigate risk

•  Flash at the macro/array/volume/host group/protection group level

•  Zerto at the granular/VM/VM groups/cluster level

14

SOLUTION: DECREASE COSTS

•  Streamline data with de-duping & compression

•  Reduce workload on overseas pipeline

•  Enable hardware-agnostic replication and DR (any – any)

•  Test out adding new functionality to arrays

•  Support multi-site, multi-country with minimal performance impact

15

SOLUTION: STREAMLINE COMPLEXITY

Free up headcount from DR and replication
•  Automatically re-map storage, no manual re-mapping

Simplify other key time sucks
•  Automatically de-dupe and compress data

•  Reduce need for overseas pipeline

•  Streamline refreshes

•  Audit support during business hours

•  Restoring applications in less time

16

SOLUTION: MEET OR EXCEED REQUIREMENTS

Enable new capabilities in key workloads
•  Full live failover and failback
•  Small data vs. fat data
•  Refreshes take minutes instead of hours or days
•  Enable near-sync continuous replication as well as snapshots
•  Streamline restores with Point-in-Time Recovery journal

•  Deliver RPOs in seconds, RTOs in minutes

HYPERVISOR–BASED DR IN ACTION

18

ZERTO VIRTUAL REPLICATION ARCHITECTURE

[Architecture diagram: two sites, each with its own vCenter Server]

•  Replicate from anything to anything: save cost and reuse HW
•  Highly scalable: software only, hypervisor based, downloadable
•  Bandwidth optimization, WAN resiliency
•  Point-in-time recovery: recover from logical failures
•  Journal-based any-point-in-time recovery, no snapshots
•  RPO = seconds, no app performance impact
•  Near-sync, continuous replication
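The idea of journal-based point-in-time recovery can be sketched in a few lines. This is a conceptual illustration only, not Zerto's implementation: the `Journal` and `JournalEntry` classes, the block-address model and the timestamps in seconds are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    timestamp: float  # seconds since protection started (illustrative unit)
    block: int        # hypothetical block address
    data: bytes


@dataclass
class Journal:
    entries: list = field(default_factory=list)

    def record(self, timestamp, block, data):
        # Writes are appended in arrival order, so list order is write order.
        if self.entries and timestamp < self.entries[-1].timestamp:
            raise ValueError("writes must be journaled in time order")
        self.entries.append(JournalEntry(timestamp, block, data))

    def recover(self, base_image, point_in_time):
        # Replay every journaled write up to and including the chosen moment.
        image = dict(base_image)
        for entry in self.entries:
            if entry.timestamp > point_in_time:
                break
            image[entry.block] = entry.data
        return image


journal = Journal()
journal.record(10.0, 0, b"order table v1")
journal.record(20.0, 1, b"customer table v1")
journal.record(30.0, 0, b"corrupted")  # the logical failure

# Recover to a moment just before the bad write landed.
before_failure = journal.recover({}, point_in_time=25.0)
```

Because every write is kept in order, recovery can stop at any moment in the journal rather than only at the times a snapshot happened to be taken.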

20

APPLICATION PROTECTION: VIRTUAL PROTECTION GROUP

[Diagram: an application (SharePoint, CRM, ERP, Exchange, etc.) in a Virtual Protection Group, with a replication site]

Complete application protection and recovery

•  VM & VMDK level consistency groups

•  Protect across server and storage locations

•  Fully supports vMotion, Storage vMotion, HA, vApp

•  Journal-based point-in-time protection

•  Group policy and configuration

•  VSS Support

21

AUTOMATION: FAILOVER, FAILBACK, RECOVERY

RTO = Minutes!
•  Fully automated failover and failback of multiple VMs with write-order fidelity, including parallel VM recovery, boot order, IP reconfiguration, test networks and more

Offsite Cloning
•  Clone entire app offsite for test & dev or backup

Click-to-Test, Anytime
•  Immediate, automated failover testing while protecting production, also to a previous point in time
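Boot order and parallel VM recovery can be combined by starting VMs in waves. The sketch below is a hypothetical illustration, not the product's API: `recovery_plan`, `run_failover`, the `power_on` callback and the VM names are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def recovery_plan(vms):
    """Group (name, boot_order) pairs into waves, lowest boot order first."""
    ordered = sorted(vms, key=lambda vm: (vm[1], vm[0]))
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda vm: vm[1])
    ]


def run_failover(vms, power_on):
    """Start each wave in parallel; a wave begins only after the previous one ends."""
    started = []
    with ThreadPoolExecutor() as pool:
        for wave in recovery_plan(vms):
            # Consuming pool.map blocks until every VM in the wave has returned.
            started.extend(pool.map(power_on, wave))
    return started


# Hypothetical CRM application: databases first, then app tier, then web tier.
crm_vms = [("crm-web", 3), ("crm-db", 1), ("crm-app", 2), ("crm-cache", 1)]
plan = recovery_plan(crm_vms)
```

VMs that share a boot order have no dependency on each other, so they can be powered on together, which is where the time saving over a strictly serial runbook comes from.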

22

WORKFLOW AUTOMATION - END-USER ACCEPTED RECOVERY

•  End-user acceptance testing with the ability to roll back a failover automatically
   ○  Validates prior to production release of application
   ○  Simply recover from logical failures

•  Ability to automate the commit or failback event
   ○  Reduce operational complexity with workflow
   ○  Significantly reduces the time it takes to reverse a failover activity
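The accept-or-roll-back flow can be pictured as a small state machine. This is a minimal sketch under stated assumptions: the `FailoverWorkflow` class, its state names and the `default_action` policy are hypothetical and do not describe any vendor's interface.

```python
from enum import Enum


class State(Enum):
    PROTECTED = "protected"          # production running at the original site
    PENDING = "awaiting acceptance"  # failed over, not yet committed
    COMMITTED = "committed"          # failover accepted and made permanent


class FailoverWorkflow:
    def __init__(self, default_action="rollback"):
        if default_action not in ("commit", "rollback"):
            raise ValueError("default_action must be 'commit' or 'rollback'")
        self.default_action = default_action
        self.state = State.PROTECTED

    def fail_over(self):
        if self.state is not State.PROTECTED:
            raise RuntimeError("can only fail over from the protected state")
        self.state = State.PENDING

    def resolve(self, accepted=None):
        """Commit or roll back; with no user decision, apply the automated default."""
        if self.state is not State.PENDING:
            raise RuntimeError("no failover is awaiting acceptance")
        if accepted is None:
            action = self.default_action
        else:
            action = "commit" if accepted else "rollback"
        self.state = State.COMMITTED if action == "commit" else State.PROTECTED
        return action


workflow = FailoverWorkflow(default_action="rollback")
workflow.fail_over()
outcome = workflow.resolve(accepted=False)  # end users rejected the recovered app
```

Keeping the failover in a pending state until someone accepts it is what makes the rollback cheap: nothing has been made permanent yet.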

23

BEFORE – COMPLEX, MANUAL REPLICATION PROCESS

Example - Current replication configuration process for virtualized CRM

Diagram lanes: Local vCenter, Storage, Replication Management, Remote Storage, Remote vCenter

Teams involved: Virtualization Team, Storage Team

Steps shown in the diagram:
•  Configure all replication pairs and entities
•  Verify replication
•  Ongoing monitoring
•  Create and document recovery plan
•  Test recovery plan
•  Move all other app VMs to other LUNs
•  Consolidate all CRM VMs to same LUN
•  Document all LUN properties
•  Locate all VMs affecting CRM
•  Map & document all LUNs
•  Locate all VM datastores
•  Consolidate CRM VMs on separate LUN
•  Allocate LUNs in replica with same properties
•  Ensure sufficient space for replica

Complex! Manual! Inflexible!

24

AFTER – AUTOMATED REPLICATION PROCESS

Example - Current replication configuration process for virtualized CRM

Steps in the automated process:
•  Locate all VMs affecting CRM
•  Configure and verify replication and policies
•  Ensure sufficient space for replica
•  Allocate space for all replicated VMs
•  Ongoing replication monitoring

SUMMARY

26

SUMMARY

Business requirements:
•  Driving demand toward Operational Resiliency model, including Elmer’s requirements:
   ○  Reduce costs, complexity, & resources
   ○  Assure service level capabilities
      –  Traditional DR
      –  HA within & between sites
      –  Metrics for capacity, performance & availability

27

QUESTION AND ANSWER

Peter Laz

Brendan Foye