AWS Cloud Disaster Recovery Plan Checklist: Are You Ready?

Solving the problem of downtime in the cloud


About CloudEndure

Founded: 2012

Offers Disaster Recovery as a Service for cloud-based applications, using continuous replication of your entire application stack.

Source: Forrester


Agenda

DR 101 – Definitions and Terminology

Why AWS for DR?

AWS Global Infrastructure

4 Types of Disaster

3 Takeaways

Q&A

Disaster Recovery in 30 Words

Disaster recovery (DR) is the set of processes, policies and procedures related to preparing for the recovery or continuation of technology infrastructure that is vital to an organization after a natural or human-induced crisis.

DR Key Terminology

RPO – Recovery Point Objective: the maximum tolerable period in which data might be lost, i.e. how much recent work, measured in time, you can afford to lose when disaster strikes.

RTO – Recovery Time Objective: the duration of time and the service level within which a business process must be restored after a disaster (or disruption) to avoid unacceptable consequences.

Data replication: sharing information so as to ensure consistency between redundant resources.
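To make RPO and RTO concrete, here is a minimal Python sketch that checks a candidate DR setup against both objectives. It assumes periodic copies (for example, scheduled snapshots) where each copy captures the state at the moment it starts and needs `replication_lag` to reach the DR site; the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta, replication_lag: timedelta) -> timedelta:
    """Worst-case window of lost writes, assuming periodic copies.

    If disaster strikes just before the next copy lands at the DR site,
    the newest usable copy is the previous one, so everything written in
    the last interval plus the transfer lag is lost.
    """
    return replication_interval + replication_lag


def meets_objectives(
    replication_interval: timedelta,
    replication_lag: timedelta,
    recovery_duration: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> tuple[bool, bool]:
    """Return (rpo_ok, rto_ok) for a candidate DR setup."""
    data_loss = worst_case_data_loss(replication_interval, replication_lag)
    return data_loss <= rpo, recovery_duration <= rto


if __name__ == "__main__":
    # Hourly snapshots, 10-minute copy lag, 30-minute recovery,
    # checked against a 15-minute RPO and a 1-hour RTO.
    print(meets_objectives(
        timedelta(hours=1), timedelta(minutes=10), timedelta(minutes=30),
        rpo=timedelta(minutes=15), rto=timedelta(hours=1),
    ))
```

Hourly snapshots cannot satisfy a 15-minute RPO no matter how fast recovery is, which is why low-RPO targets push toward continuous replication rather than periodic backups.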

DR – What it’s not

Unlike backup, which is mostly about preventing data loss, DR is about service availability: achieving a low RPO and a low RTO.

DR complements other High Availability (HA) activities, but while those deal with disaster prevention, DR is for the times when prevention has failed.

Why DR?

54% of cloud IT managers experienced an outage in the past 3 months.

Top challenges in meeting availability goals: insufficient IT resources, budget limitations and software bugs.

79% report a service availability goal of “Three Nines” (99.9%).

Source: 2014 Cloud Disaster Recovery Survey (available for download in the “Resources” tab of the webinar)
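An availability goal translates directly into a downtime budget: “Three Nines” allows roughly 8.76 hours of downtime per year, or about 43 minutes in a 30-day month. A short Python sketch of that arithmetic (the helper name is hypothetical, and a year is taken as 365 days):

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def allowed_downtime_hours(availability_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime, in hours, that an availability target permits over a period."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        yearly = allowed_downtime_hours(target)
        monthly_minutes = allowed_downtime_hours(target, period_hours=30 * 24) * 60
        print(f"{target}%: {yearly:.2f} h/year, {monthly_minutes:.1f} min per 30-day month")
```

A single outage of a few hours can consume the whole yearly budget, which is why the RTO of your DR plan has to fit inside it.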

Why AWS for DR?

Flexible

Define different recovery objectives for different components and change them on the fly. You can grow and shrink your disaster site whenever necessary (even automatically).

Cheap

Pay for hourly usage of resources. Only create your disaster site when it’s needed, instead of paying for two running sites all the time.

Easy

DR and HA made easier: there is no need to build your DR solution from scratch. AWS already has many of the building blocks built in, such as Auto Scaling, snapshots and CloudFormation.


[Diagram: AWS Region and Availability Zone]


Regions

8 publicly available regions, spread all over the world.

Regions are completely independent, with different teams and different infrastructure.

Availability Zones (AZs)

Each region contains one or more availability zones.

AZs are physically separated, but located in the same geographical area.

AZs in a region share teams and software infrastructure.

Dynamic Resource Allocation

Pay for resources on an hourly basis.

Create and destroy resources quickly on demand using the AWS Management Console, the CLI or the API.

Automation is built into several services (such as Auto Scaling), and the APIs let you add additional automation layers.
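As an illustration of creating and destroying resources through the API, here is a minimal Python sketch built on the EC2 `run_instances` and `terminate_instances` calls. It takes the EC2 client as a parameter (for example, one created with `boto3.client("ec2", region_name="us-west-2")`), so it can be exercised without AWS credentials; the helper names, AMI ID and instance type in the usage line are hypothetical.

```python
def launch_dr_instances(ec2, image_id: str, instance_type: str, count: int) -> list[str]:
    """Launch exactly `count` instances from an AMI and return their instance IDs."""
    response = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=count,  # fail rather than start a partial disaster site
        MaxCount=count,
    )
    return [instance["InstanceId"] for instance in response["Instances"]]


def tear_down_dr_instances(ec2, instance_ids: list[str]) -> None:
    """Terminate the instances so you stop paying for them."""
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
```

Usage (hypothetical values): `ids = launch_dr_instances(ec2, "ami-0123456789abcdef0", "t3.micro", 2)` when the disaster site is needed, then `tear_down_dr_instances(ec2, ids)` once the primary site is back. Passing the client in, rather than creating it inside the helpers, keeps the region choice with the caller, which matters for cross-region DR.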

Types of downtime

Single-resource disaster

Single-AZ disaster

Single-service disaster

Whole-region disaster

Disaster Type 1 - Single-resource disaster

What is it?

A single resource (instance, EBS volume, ELB…) stops functioning.

Frequency

Very high. For example, instances are sometimes terminated by AWS or just stop working without warning.

How to prepare?

Make sure that no single resource is a point of failure. Use clusters for stateless instances (Auto Scaling and AMIs can help you). Configure RAID for volumes. Use services that are managed by AWS, such as RDS, to store your state and data.
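One way to apply the “no single point of failure” rule as a checklist item is to scan an inventory of your components. The Python sketch below assumes a hypothetical inventory that maps each component to the resources serving it, and flags components backed by fewer than two resources. It only counts resources; it does not verify that they fail independently.

```python
def single_points_of_failure(components: dict[str, list[str]]) -> list[str]:
    """Return, sorted, the components backed by fewer than two resources."""
    return sorted(name for name, resources in components.items() if len(resources) < 2)


if __name__ == "__main__":
    # Hypothetical inventory: component name -> IDs of the resources serving it.
    inventory = {
        "web": ["i-web-1", "i-web-2"],
        "worker": ["i-worker-1"],
        "cache": [],
    }
    print(single_points_of_failure(inventory))
```

Components managed by AWS, such as RDS, are usually better handled by turning on the service’s own redundancy than by adding instances of your own.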

Disaster Type 2 - Single-AZ disaster

What is it?

A whole AZ goes down, but all the other AZs in the region still function.

Frequency

More than 10 times a year (possibly a different AZ every time).

How to prepare?

Build your system so that it’s spread across multiple AZs and can survive the failure of any single AZ. Connect subnets in different AZs to your ELB and turn on Multi-AZ for RDS.
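“Can survive the failure of any single AZ” can be checked with simple arithmetic: for every AZ, the instances left in the other AZs must still cover the capacity you need. A minimal Python sketch, assuming you already have the AZ of each instance (for example, from the `Placement.AvailabilityZone` field returned by EC2 `describe_instances`) and a required capacity expressed as an instance count; the function name is hypothetical.

```python
from collections import Counter


def survives_any_single_az_loss(instance_azs: list[str], required_capacity: int) -> bool:
    """True if losing any one AZ still leaves at least `required_capacity` instances."""
    per_az = Counter(instance_azs)
    total = len(instance_azs)
    if len(per_az) < 2:
        return False  # everything runs in one AZ (or nothing runs at all)
    # Losing an AZ removes exactly the instances placed in it.
    return all(total - count >= required_capacity for count in per_az.values())


if __name__ == "__main__":
    azs = ["us-east-1a", "us-east-1a", "us-east-1b", "us-east-1b"]
    print(survives_any_single_az_loss(azs, required_capacity=2))
    print(survives_any_single_az_loss(azs, required_capacity=3))
```

Spreading instances across more AZs lowers the over-provisioning needed: with two AZs each must carry the full load on its own, while with three each only has to absorb half of a lost AZ’s share.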

Disaster Type 3 - Single-service disaster

What is it?

A specific service goes down across an entire region. Such outages are almost always contained within that single region.

Frequency

Several times a year (a different service every time).

How to prepare?

Resist the temptation to use AWS services for everything; choose your services carefully. Be ready to recreate your system in a different region where the service works well (see Disaster Type 4).
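Recreating the system elsewhere starts with choosing a region where none of the services your stack depends on is impaired. The Python sketch below assumes a hypothetical `impaired` mapping (region to the set of services currently having problems), which you would fill from whatever health information you trust, such as the AWS Service Health Dashboard; candidate regions are tried in order of preference.

```python
from typing import Optional


def pick_recovery_region(
    candidates: list[str],
    impaired: dict[str, set[str]],
    required_services: set[str],
) -> Optional[str]:
    """Return the first candidate region where no required service is impaired, else None."""
    for region in candidates:
        if not (impaired.get(region, set()) & required_services):
            return region
    return None


if __name__ == "__main__":
    impaired = {"us-east-1": {"ebs"}, "us-west-2": {"sqs"}}  # hypothetical status
    print(pick_recovery_region(
        ["us-east-1", "us-west-2", "eu-west-1"], impaired, {"ec2", "ebs", "rds"},
    ))
```

The fewer services in `required_services`, the more regions qualify, which is the practical payoff of choosing services carefully.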

Disaster Type 4 - Whole-region disaster

What is it?

An entire region goes down, taking all the applications running in it down with it.

Frequency

Several times a year (a different region every time). See the CloudEndure blog post comparing the uptime of all AWS regions: https://www.cloudendure.com/blog/whos-buggiest-aws-region/

How to prepare?

Implement a cross-region DR methodology. Take snapshots of your instances’ volumes and copy them to a different region. Use CloudFormation to define your application stack. Copy AMIs to a different region. Use cross-region read replicas for RDS. Use continuous data replication.
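Copying snapshots and AMIs to another region can be scripted with the EC2 `copy_snapshot` and `copy_image` calls, which are issued from a client in the destination region and name the source region explicitly. A minimal Python sketch; the helper name and the naming scheme for the copies are hypothetical, and the client is passed in (for example, `boto3.client("ec2", region_name="us-west-2")`) so the code can be tested without AWS access.

```python
def copy_to_dr_region(
    dest_ec2, source_region: str, snapshot_ids: list[str], image_ids: list[str]
) -> dict[str, str]:
    """Copy EBS snapshots and AMIs from source_region into dest_ec2's region.

    Returns a mapping from each source ID to the ID of its copy. The copy
    calls return immediately; the copies complete asynchronously.
    """
    copies: dict[str, str] = {}
    for snapshot_id in snapshot_ids:
        response = dest_ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"DR copy of {snapshot_id} from {source_region}",
        )
        copies[snapshot_id] = response["SnapshotId"]
    for image_id in image_ids:
        response = dest_ec2.copy_image(
            Name=f"dr-copy-{image_id}",
            SourceImageId=image_id,
            SourceRegion=source_region,
        )
        copies[image_id] = response["ImageId"]
    return copies
```

Run on a schedule, a script like this keeps the DR region stocked with recent images; the interval between runs sets the worst-case data loss, so it has to match your RPO.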

Beyond AWS

Not all outages are caused by your cloud provider. Downtime of the 3rd-party services you use can take your application down too, for example DNS, CDN and 3rd-party login services.

Pick your 3rd-party services carefully. Check the historical stability of the services you are considering, and don’t rely on 3rd-party services more than you need to.
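Historical stability can be compared with the same availability arithmetic used for your own goals. The Python sketch below assumes you have a list of outage durations (in minutes) for each candidate service over the same period; the service names and figures are hypothetical, and overlapping outages would be double-counted.

```python
def observed_availability(outage_minutes: list[float], period_days: float) -> float:
    """Percentage of the period during which the service was up."""
    period_minutes = period_days * 24 * 60
    return 100.0 * (1 - sum(outage_minutes) / period_minutes)


def rank_by_availability(history: dict[str, list[float]], period_days: float) -> list[tuple[str, float]]:
    """Rank services from most to least available over the period."""
    scores = [(name, observed_availability(outages, period_days)) for name, outages in history.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    history = {"dns-provider-a": [5.0], "dns-provider-b": [60.0, 30.0]}  # hypothetical
    for name, availability in rank_by_availability(history, period_days=30):
        print(f"{name}: {availability:.3f}%")
```

A dependency whose observed availability is below your own target caps what your application can achieve unless you can run without it.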

3 Takeaways

1. Design DR into your system. The earlier you implement DR, the easier it is to recover. It’s too late to think about DR after disaster strikes.

2. Take advantage of what AWS offers. AWS provides many building blocks to help you build a DR solution for your application; you don’t need to do everything from scratch.

3. Understand the impact of relying on services: each service you use can cause downtime. Check the stability of the services you’re using and design your system to stay up even if some of the services it depends on are down.
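Staying up while a dependency is down usually means degrading gracefully, for example serving a cached value or a default when a 3rd-party service fails. A minimal Python sketch of that pattern; the helper name is hypothetical, and real code should catch only the errors the dependency actually raises and add a timeout to the primary call.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Call a dependency; if it raises, log the failure and return the fallback result."""
    try:
        return primary()
    except Exception:  # narrow this to the dependency's specific errors in real code
        log.warning("dependency failed; serving fallback", exc_info=True)
        return fallback()
```

Usage: `with_fallback(fetch_live_profile, load_cached_profile)`, where both functions are hypothetical stand-ins for a 3rd-party call and a local cache lookup.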

Thank You

Leonid Feinberg

VP Products

[email protected]

