operational resiliency vs. disaster recovery hypothetical ...• lean it team, one headcount...

• Operational Resiliency for a Virtualized Environment

Peter Laz, MBCP, MBCI, Managing Consultant

Forsythe

Brendan Foye, Enterprise Account Manager

AGENDA

• Operational Resiliency vs. Disaster Recovery

• Hypothetical Case Study

• Summary and Q&A

OPERATIONAL RESILIENCY VS. DISASTER RECOVERY

Optimum level of performance

Architecture and processes for continuous availability of business operations and IT environments

A much larger percentage of IT services and business functions experience:

•  Continuous availability of IT service

•  End-to-end process is business as usual (application interdependencies, no workarounds)

•  Full performance and capacity (IT & business functions) – No customer service impact

Minimum acceptable level of performance

Invoke alternate procedures to recover & resume operations following a significant disruptive event

Interruption in the following for a large percentage of apps and business functions:

•  IT service

•  Degraded IT capability

•  Operation/workflow

•  Limited customer service

TRADITIONAL DR MODEL OPERATIONAL RESILIENCY MODEL

•  Used in response to single-site outage (out-of-region)

•  Limited capacity & performance

•  Primary metrics are RTO / RPO

DR Site (Internal or Vendor Provided)

TRADITIONAL DR SOLUTION VIEW

•  Full production: (capacity and performance)

•  Application HA within the data center only

•  Metrics include capacity, performance and availability within the data center

PRODUCTION DATA CENTER

OPERATIONAL RESILIENCY SOLUTION VIEW

•  Production applications at both sites, in-region

•  Application HA capabilities within and between sites (converged HA & DR)

•  Metrics include capacity, performance and availability applied to both application and site-level outage events

•  Used in response to dual regional outage (out-of-region)

•  Limited capacity & performance

•  Primary metrics are RTO / RPO

DR Site (Internal or Vendor Provided)

PRODUCTION DATA CENTER

PRODUCTION DATA CENTER II

10 – 50 miles

CHALLENGE OF MOVING TO THE ‘RESILIENT’ MODEL

Controlling the proliferation of technologies that arise to meet resiliency requirements is key, because they:

PRODUCE FUNCTIONAL

DRIVE UP RISK

DRIVE UP COST

VIRTUAL CASE STUDY

•  A mid-size enterprise with 100 VMs

•  US and International Data Centers

•  Legacy hardware, considering new/dissimilar flash array

•  Lean IT team, one headcount dedicated to replication

•  High standards for RPO, RTO, SLA

•  Regular audits

ACME ANVIL CORPORATION

TIER   RTO          RPO
1      24 hrs       24 hrs
2      48 hrs       24 hrs
3      3 – 5 days   1 week

CURRENT

St. Louis HQ: Corporate Data Center

Chicago: Co-lo DR site

Denver: West Region Hub

Raleigh: East Region Hub

Remote Offices Remote Offices

CHALLENGE: REDUCE COSTS

Fat Data

•  Space-consuming snapshots

•  Not deduplicated or compressed

•  Replicating all VMs and VDRs, regardless of which you really care about (such as application groups)

Overseas operations

•  Requires big pipe

•  Adds expense, complexity

Duplicated hardware

•  Back-up array always matches storage array

•  Costs doubled without increased functionality

CHALLENGE: MANAGEMENT COMPLEXITY

Valuable headcount assigned to DR

•  Manually re-mapping storage arrays

Additional time-consuming tasks mean the team spends more hours in DR than in production, such as:

•  De-duping and compressing data

•  Overseas pipeline management

•  Managing refreshes

•  After-hours audit support

•  Restoring applications

CHALLENGE: MEET RPOS & RTOS

Event-based challenges

•  Failback and recovery

•  General DR tests

•  Application tests/development

•  Multi-site, multi-country data transfer

Functionality-based challenges

•  Snapshot-only environment

FROM THE CTO

•  More flexibility in hardware with less money

•  Sick of spending time at DR site vs. production site

•  No DR downtime and no lost data

•  Maintain SLAs & ensure RTOs and RPOs

•  Implement all this yesterday

•  Small learning curve

•  Simple/short install that won’t consume a lot of resources

TIER   RTO        RPO
1      < 4 hrs    < 30 mins
2      < 24 hrs   < 30 mins
3      48 hrs     24 hrs

ELMER’S MANDATE
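The mandated tiers can be treated as data and checked against measured recovery results from a DR test. The following is a minimal sketch in Python, not part of the original deck: the names `TierTarget`, `MANDATE`, and `meets_mandate` are hypothetical, and reading "<" as a strict bound (with tier 3 read as inclusive) is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierTarget:
    rto: timedelta
    rpo: timedelta
    strict: bool  # True when the table states the target with "<"


# Targets copied from the mandate table; the strict/inclusive reading is an assumption.
MANDATE = {
    1: TierTarget(rto=timedelta(hours=4), rpo=timedelta(minutes=30), strict=True),
    2: TierTarget(rto=timedelta(hours=24), rpo=timedelta(minutes=30), strict=True),
    3: TierTarget(rto=timedelta(hours=48), rpo=timedelta(hours=24), strict=False),
}


def meets_mandate(tier, measured_rto, measured_rpo):
    """Return True if a measured recovery satisfies the mandated tier targets."""
    target = MANDATE[tier]
    if target.strict:
        return measured_rto < target.rto and measured_rpo < target.rpo
    return measured_rto <= target.rto and measured_rpo <= target.rpo
```

Under these assumptions, a tier 1 test that recovers in 10 minutes with 15 seconds of data loss passes, while the current tier 1 figures of 24 hours for both RTO and RPO do not.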

SOLUTION: FLASH AND HYPERVISOR-BASED DR

•  Not everything is virtualized or hypervisor based

•  All primary applications and data need protection, including (especially) larger, non-x86 infrastructure

•  One size does not fit all

•  But complementary technologies will help mitigate risk

•  Flash at the macro/array/volume/host group/protection group

•  Zerto at the granular/vm/vm groups/cluster level

SOLUTION: DECREASE COSTS

•  Streamline data with deduplication & compression

•  Reduce workload on overseas pipeline

•  Enable hardware-agnostic replication and DR (any – any)

•  Test out adding new functionality to arrays

•  Support multi-site, multi-country with minimal performance impact
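The link between data reduction and the overseas pipeline is simple arithmetic: the less data that has to cross the WAN, the smaller the sustained link needed. The sketch below is illustrative only. The function name `required_mbps` is hypothetical, the change rate and reduction ratio are made-up inputs rather than figures from the case study, and it assumes decimal units and replication spread evenly over 24 hours.

```python
SECONDS_PER_DAY = 86_400


def required_mbps(changed_gb_per_day, reduction_ratio=1.0):
    """Estimate the sustained WAN throughput (megabits per second) needed
    to replicate a daily change rate.

    reduction_ratio is the combined dedupe/compression factor,
    e.g. 4.0 means the data shrinks to a quarter of its size.
    """
    if changed_gb_per_day < 0:
        raise ValueError("changed_gb_per_day must not be negative")
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    bits_per_day = changed_gb_per_day * 8 * 1000**3 / reduction_ratio
    return bits_per_day / SECONDS_PER_DAY / 1_000_000


# Hypothetical example: 864 GB of changed data per day.
raw = required_mbps(864)           # no reduction
reduced = required_mbps(864, 4.0)  # assumed 4:1 dedupe + compression
```

With these hypothetical inputs, the unreduced stream needs about 80 Mbps and the 4:1 reduced stream about 20 Mbps. Real links also need headroom for bursts and resynchronization, which this estimate ignores.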

SOLUTION: STREAMLINE COMPLEXITY

Free up headcount from DR and replication

•  Automatically re-map storage, no manual

Simplify other key time sucks

•  Automatically de-dupe and compress data

•  Reduce need for overseas pipeline

•  Streamline refreshes

•  Audit support during business hours

•  Restoring applications in less time

SOLUTION: MEET OR EXCEED REQUIREMENTS

Enable new capabilities in key workloads

•  Full live failover and failback

•  Small data vs. fat data

•  Refreshes take minutes instead of hours or days

•  Enable near-sync continuous replication as well as snapshots

•  Streamline restores with Point-in-Time Recovery journal

•  Deliver RPOs in seconds, RTOs in minutes

HYPERVISOR–BASED DR IN ACTION

ZERTO VIRTUAL REPLICATION ARCHITECTURE

vCenter Server


•  Replicate from anything to anything: save cost and reuse

•  Highly scalable: software only, hypervisor based, downloadable

•  Bandwidth optimization, WAN resiliency

•  Point-in-time recovery: recover from logical failures

•  Journal-based any-point-in-time recovery, no snapshots

•  RPO = seconds, no app performance impact

•  Near-sync, continuous replication
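The idea behind journal-based point-in-time recovery can be shown with a toy model: every write is appended to a time-ordered journal, and recovery replays the journal onto a base image up to the chosen moment. This sketch is a generic illustration and is not Zerto's implementation; the class `WriteJournal`, its methods, and the dictionary-of-blocks disk model are all hypothetical.

```python
from bisect import bisect_right


class WriteJournal:
    """Toy write journal: records block writes in time order."""

    def __init__(self):
        self._times = []   # write timestamps, ascending
        self._writes = []  # (block_id, data) pairs, parallel to _times

    def record(self, timestamp, block_id, data):
        if self._times and timestamp < self._times[-1]:
            raise ValueError("writes must be recorded in time order")
        self._times.append(timestamp)
        self._writes.append((block_id, data))

    def recover(self, base_image, point_in_time):
        """Return a copy of base_image with all writes up to and
        including point_in_time applied, in their original order."""
        image = dict(base_image)
        upto = bisect_right(self._times, point_in_time)
        for block_id, data in self._writes[:upto]:
            image[block_id] = data
        return image


journal = WriteJournal()
journal.record(1, 0, "good-a")
journal.record(2, 1, "good-b")
journal.record(3, 0, "corrupted")        # a logical failure at t=3
before_failure = journal.recover({}, 2)  # state just before the bad write
```

Replaying in the original order is what preserves write-order fidelity in this model; choosing `point_in_time=2` skips the bad write at `t=3`.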

APPLICATION PROTECTION: VIRTUAL PROTECTION GROUP

REPLICATION SITE

Application

SharePoint, CRM, ERP, Exchange etc.

Virtual Protection Group

Complete application protection and recovery

•  VM & VMDK level consistency groups

•  Protect across server and storage locations

•  Fully support VMotion, Storage VMotion, HA, vApp

•  Journal-based point-in-time protection

•  Group policy and configuration

•  VSS Support

AUTOMATION: FAILOVER, FAILBACK, RECOVERY

RTO = Minutes! Fully automated failover and failback of multiple VMs with write-order fidelity, including parallel VM recovery, boot order, IP reconfiguration, test networks and more

Offsite Cloning: Clone entire app offsite for test & dev or backup

Click-to-Test, Anytime: Immediate, automated failover testing while protecting production, also to a previous point in time
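Boot order and parallel VM recovery fit together naturally: VMs that share a boot priority can start in parallel, and each group waits for the previous one. The sketch below only plans those groups. It is a generic illustration and does not use any Zerto API; the function `plan_boot_waves` and the VM names are hypothetical.

```python
from itertools import groupby


def plan_boot_waves(boot_order):
    """Group VMs into waves by boot priority.

    boot_order maps VM name -> integer priority (lower boots first).
    VMs in the same wave can be started in parallel; waves run in sequence.
    """
    ordered = sorted(boot_order.items(), key=lambda item: (item[1], item[0]))
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda item: item[1])
    ]


# Hypothetical CRM application: database first, then app servers, then web.
waves = plan_boot_waves({
    "crm-web": 3,
    "crm-app-2": 2,
    "crm-db": 1,
    "crm-app-1": 2,
})
```

Here `waves` comes out as the database alone, then both app servers together, then the web tier. An orchestrator would start each wave, wait for health checks, and then move on.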

WORKFLOW AUTOMATION - END-USER ACCEPTED RECOVERY

•  End-user acceptance testing with the ability to roll back a failover automatically

○  Validates prior to production release of application

○  Simply recover from logical failures

•  Ability to automate the commit or failback event

○  Reduce operational complexity with workflow

○  Significantly reduces the time it takes to reverse a failover activity

BEFORE – COMPLEX, MANUAL REPLICATION PROCESS

Example - Current replication configuration process for virtualized CRM

Configure all replication pairs and entities

Verify replication

On going monitoring

Create and document recovery plan

Test recovery plan

Move all other app VMs to other LUNs

Consolidate all CRM VMs to same LUN

Document all LUN properties

Locate all VMs affecting CRM

Map & Document All LUNs

Locate all VM Datastores

Consolidate CRM VMs on separate LUN

Local vCenter

Storage

Replication Management

Remote Storage

Remote vCenter

Virtualization

Storage Team

Allocate LUNs in replica with same properties

Ensure sufficient space for replica

Complex! Manual! Inflexible!

Example - Current replication configuration process for virtualized CRM

AFTER – AUTOMATED REPLICATION PROCESS

Configure all replication pairs and entities

Verify replication

On going monitoring

Create and document recovery plan

Test recovery plan

Move all other app VMs to other LUNs

Consolidate all CRM VMs to same LUN

Document all LUN properties

Map & Document All LUNs

Locate all VM Datastores

Consolidate CRM VMs on separate LUN

Local vCenter

Storage

Replication Management

Remote Storage

Remote vCenter

Virtualization

Storage Team

Allocate LUNs in replica with same properties

Configure and verify replication and policies

Allocate space for all replicated VMs

Ongoing replication monitoring

SUMMARY

Business requirements:

•  Driving demand toward the Operational Resiliency model, including Elmer’s requirements:

○  Reduce costs, complexity, & resources

○  Assure service level capabilities

–  Traditional DR

–  HA within & between sites

–  Metrics for capacity, performance & availability

Peter Laz: plaz@forsythe.com

Brendan Foye: Brendan.Foye@zerto.com

QUESTION AND ANSWER
