#ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM DESIGNING FOR FAILURE Disaster Recovery using AWS Karan Desai Solutions Architect AWS

Post on 06-Apr-2020


Page 1: DESIGNING FOR FAILUREfiles.informatandm.com/uploads/2018/10/Designing_for...Why AWS for Disaster Recovery? Your operational DNA has to be crafted for reliability. •Service SLAs between

#ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM

DESIGNING FOR FAILURE
Disaster Recovery using AWS

Karan Desai
Solutions Architect

AWS

Page 2:

speaker:~ $ whoami

> Solutions Architect at AWS since 2016

> Previously Akamai

> Previously Ericsson

> MS EE Virginia Tech

> San Francisco Bay Area resident

> Likes the cloud, airplanes, photography

Page 3:

What to Expect from the Session

• Disaster Recovery Concepts & Terminology
• Why AWS for Disaster Recovery?
• DR Design Options
• Data Backup and Restore Strategies
• DR Testing & Assurance
• One More Thing…

Page 4:

A long time ago in a galaxy far, far away….

Page 5:

1986-04-26 01:23:04
Begin experiment...

Page 6:

Page 7:

Recovery point

Page 8:

Panic
Systems are not normal

Manually interpret signals and intervene

Page 9:

[Timeline: Recovery point, Disaster. The gap between the recovery point and the disaster is data loss.]

Page 10:

“There must be an incredible amount of radiation here. We'll be lucky if we're all still alive in the morning.”

– Anatoli Zakharov, Fire Station 2 Chernobyl

Page 11:

[Timeline: Recovery point, Disaster, Recovery time. The gap from the recovery point to the disaster is data loss; the gap from the disaster to the recovery time is down time.]

Page 12:

And yet…

Page 13:

Fukushima Daiichi
11 March 2011

Page 14:

Shared failures

Chernobyl

• Unplanned event causes coolant failure
• Uncontrolled fuel rod triggers meltdown event
• Uncontrolled release of steam triggers explosion
• Generator present, but failure occurs

Fukushima Daiichi

• Unplanned event causes coolant failure
• Uncontrolled fuel rod triggers meltdown event
• Uncontrolled release of steam triggers explosion
• Generator present, but failure occurs

Page 15:

Lesson learned?

Failure is not one thing.
It’s many.

Page 16:

What are we planning for?

Page 17:

Page 18:

Page 19:

Page 20:

Why do I care about disaster recovery if I am in the cloud?

Page 21:

“Everything fails, all the time”
- Werner Vogels (CTO, Amazon.com)

Page 22:

• Over 1 million active customers per month across 190 countries

• 2,300 government agencies

• 7,000 educational institutions

Services can be deployed at Global – Regional – Availability Zone levels of reliability

18 worldwide regions, 55 Availability Zones.

4 new regions, 12 additional Availability Zones announced for 2019

Why AWS for Disaster Recovery?

Page 23:

Why AWS for Disaster Recovery?

Your operational DNA has to be crafted for reliability.

• Service SLAs between 99.9% and 100% availability

• Amazon S3 is designed for 99.999999999% durability

• AWS Availability Zones exist on isolated fault lines, flood plains, networks, and electrical grids to substantially reduce the chance of simultaneous failure.

Page 24:

Do not wait for disaster.

It’s not all or nothing.

Start somewhere and scale up

Disaster Recovery in the Cloud

Page 25:

Concepts & Terminology

Page 26:

Start Here: Business Continuity Requirements

Recovery Time Objective (RTO)

▪ How quickly do I need this service to be recovered?
▪ 1 minute? 15 minutes? 1 hour? 4 hours? 1 day?

Recovery Point Objective (RPO)

▪ How much data loss can be tolerated?
▪ Zero data loss? 15 minutes out of date?

[Timeline: RPO covers the transactions lost before the disaster; RTO covers the down time after it.]
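As a minimal sketch of how these two objectives might be checked after an incident or a drill: the achieved recovery point is the gap between the last usable copy of the data and the disaster, and the achieved recovery time is the gap between the disaster and the service being back. The function names and timestamps below are hypothetical and are not taken from the talk.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: time between the last usable copy and the disaster."""
    return disaster - last_recovery_point


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Down time: time between the disaster and the service being back."""
    return service_restored - disaster


def meets_objectives(last_recovery_point: datetime, disaster: datetime,
                     service_restored: datetime,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss and down-time windows are within target."""
    return (achieved_rpo(last_recovery_point, disaster) <= rpo_target
            and achieved_rto(disaster, service_restored) <= rto_target)


if __name__ == "__main__":
    # Illustrative timestamps only.
    backup = datetime(2018, 10, 1, 2, 0)
    outage = datetime(2018, 10, 1, 2, 10)
    restored = datetime(2018, 10, 1, 3, 0)
    print(achieved_rpo(backup, outage))    # 0:10:00
    print(achieved_rto(outage, restored))  # 0:50:00
```

Recording these two numbers for every drill gives a concrete basis for deciding whether a service needs a faster (and more expensive) DR option.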

Page 27:

Ascending levels of DR options

Backup & Restore
Backup of on-premises data to AWS to use in a DR event
RPO: 24 hours   RTO: 24 hours   COST: $

Pilot Light
Replicate data and minimal running services into AWS, ready to take over and flare up
RPO: 12 hours   RTO: 4 hours   COST: $$

Warm Standby
Replicate data and services into AWS ready to take over
RPO: 1-4 hours   RTO: 15 min   COST: $$$

Hot-Site
Replicated and load balanced environments that are both actively taking production traffic
RPO: <15 min   RTO: 0-5 min   COST: $$$

The scale runs from "Business continuity begins" (Backup & Restore) to "Un-interrupted business continuity" (Hot-Site).
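One way to read the slide is as a lookup: given the RPO and RTO a service needs, pick the cheapest option that satisfies both. The sketch below encodes the upper bound of each range quoted on the slide. The class and function names are hypothetical, and ranking Hot-Site as more expensive than Warm Standby is an assumption (the slide shows "$$$" for both).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo_hours: float  # worst-case data-loss window quoted on the slide
    rto_hours: float  # worst-case down time quoted on the slide
    cost_rank: int    # 1 = cheapest; relative only, the slide gives "$" levels


# Upper bounds of the slide's ranges, cheapest first.
# Assumption: Hot-Site ranks above Warm Standby in cost.
OPTIONS = [
    DrOption("Backup & Restore", 24, 24, 1),
    DrOption("Pilot Light", 12, 4, 2),
    DrOption("Warm Standby", 4, 15 / 60, 3),
    DrOption("Hot-Site", 15 / 60, 5 / 60, 4),
]


def cheapest_option(rpo_hours: float, rto_hours: float) -> Optional[DrOption]:
    """Return the lowest-cost option whose RPO and RTO both fit the requirement."""
    for option in sorted(OPTIONS, key=lambda o: o.cost_rank):
        if option.rpo_hours <= rpo_hours and option.rto_hours <= rto_hours:
            return option
    return None  # none of the listed options is fast enough


if __name__ == "__main__":
    print(cheapest_option(rpo_hours=12, rto_hours=6).name)  # Pilot Light
    print(cheapest_option(rpo_hours=1, rto_hours=1).name)   # Hot-Site
```

The figures are the talk's illustrative ranges, not guarantees; real RPO and RTO depend on how each option is implemented and tested.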

Page 28:

DR Terminology Map

Your Data Centers            AWS
DNS                          Route 53
Load Balancers               ELB/Appliance
Web/App Servers              EC2/Auto Scaling
Database Servers             Amazon RDS
Firewall                     Security Groups / ACL
Data Centers                 Availability Zones / VPC
Geographical Redundancy      Multi-region

Page 29:

Disaster Recovery Approaches

Page 30:

Backup and Restore

Page 31:

Backup and Restore Architecture

[Diagram: On-premises active production for www.example.com in the corporate data center (web servers, app servers, DB server, 1 TB data volume, backup system) receiving Internet traffic. A VPN connection and a Storage Gateway (iSCSI) connect it to the AWS region used for DR failover, which holds the S3 buckets and Glacier / Archive.]

Page 32:

Backup and Restore Use-Case

• Suitable for
  • Solutions that can sustain higher technical debt
  • Lower business critical nature
• Low cost DR option
• Leverage existing investments in
  • De-duplication
  • Compression
  • WAN Acceleration

Page 33:

Pilot light

Page 34:

Pilot light – Preparation

[Diagram: www.example.com is served from the corporate data center (reverse proxy / caching server, application server, master database server). The pilot light system contains a reverse proxy / caching server and an application server marked "Not running", plus a secondary database server and data volume fed by data mirroring / replication.]

Page 35:

Pilot light – Recovery

[Diagram: www.example.com, the corporate data center (reverse proxy / caching server, application server, master database server), and the pilot light environment (reverse proxy / caching server, application server, database server, data volume). Labels: "Start in minutes"; "Add additional capacity, if needed".]

Page 36:

Suitable for:

• Meeting lower RTO & RPO requirements

• Services that can tolerate some downtime

• Mid-range cost option for DR

Pilot Light Use-Cases

Page 37:

Warm Standby

Page 38:

Warm standby – Preparation

[Diagram: Route 53 resolves www.example.com. The corporate data center is active (reverse proxy / caching server, application server, master database server). The AWS region runs a scaled-down standby that is not active for production traffic (Elastic Load Balancer, reverse proxy / caching server, application server, subordinate database server, data volume). Labels: "Mirroring / replication"; "Application data source cut over".]

Page 39:

Warm standby – Recovery

[Diagram: Route 53 resolves www.example.com. The Elastic Load Balancer in the AWS region is active, and the environment there is scaled up to production (reverse proxy / caching server, application server, database server, data volume). The corporate data center (reverse proxy / caching server, application server, master database server) is also shown.]

Page 40:

Suitable for:

• Full failover if needed during a disaster

• Solutions that require RTO & RPO in minutes

• Core business-critical functions

Higher cost than pilot light

Warm Standby Use-Cases

Page 41:

Hot Site

Page 42:

Hot site – Preparation

[Diagram: Route 53 resolves www.example.com. Both sites are marked "Active": the corporate data center (reverse proxy / caching server, application server, master database server) and the AWS region (Elastic Load Balancer, reverse proxy / caching server, application server, subordinate database server, data volume). Labels: "Mirroring / replication"; "Application data source cut over".]

Page 43:

Hot site – Recovery

[Diagram: Route 53 resolves www.example.com. The AWS region is active and scaled up for production use (Elastic Load Balancer, reverse proxy / caching server, application server, database server, data volume). The corporate data center (reverse proxy / caching server, application server, master database server) is also shown.]

Page 44:

Suitable for:

• Most important business-critical functions

• Applications that cannot afford any downtime

• RTO and RPO in seconds

Highest cost option of Disaster Recovery

Hot Site Use-Cases

Page 45:

What about my data?

Page 46:

Use case 1 Basic backup and recovery

Page 47:

AWS CLI-based backup, manual DR failover

• $ aws s3 sync /backups s3://mybucket
  // Back up and sync the backup folder

• $ aws s3 sync /backups s3://mybucket --delete
  // Like the preceding, but also delete files from the bucket that are no longer present in /backups

• $ aws s3 sync /backups s3://mybucket --delete --storage-class STANDARD_IA
  // Like the preceding, but store the objects in S3 Infrequent Access
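The three variants above differ only in their flags, so a scheduled backup job could assemble the command from a few settings. The following is a rough sketch, not part of the talk: `build_sync_command` is a hypothetical helper, and it only builds the argument list rather than running the AWS CLI.

```python
import shlex
from typing import List, Optional


def build_sync_command(source: str, bucket: str, delete: bool = False,
                       storage_class: Optional[str] = None) -> List[str]:
    """Build the argv for `aws s3 sync`, mirroring the three variants above."""
    argv = ["aws", "s3", "sync", source, f"s3://{bucket}"]
    if delete:
        # Remove objects from the bucket that no longer exist locally.
        argv.append("--delete")
    if storage_class:
        # For example STANDARD_IA, as on the slide.
        argv += ["--storage-class", storage_class]
    return argv


if __name__ == "__main__":
    cmd = build_sync_command("/backups", "mybucket", delete=True,
                             storage_class="STANDARD_IA")
    print(shlex.join(cmd))
    # To actually run it (requires the AWS CLI and valid credentials):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Keeping `--delete` as an explicit opt-in is deliberate in this sketch: a sync that deletes can propagate an accidental local deletion into the backup.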

Page 48:

What does it look like?

[Diagram labels: Remote location; S3 bucket /mybucket (S3 STANDARD_IA); lifecycle policy; Amazon Glacier; steps numbered 1 and 2.]

Page 49:

What does a recovery look like?

[Diagram labels: Remote location; Failover; AWS DR Region with Amazon EC2; S3 bucket /mybucket (S3 STANDARD_IA); lifecycle policy; Amazon Glacier; steps numbered 1 and 2.]

Page 50:

What would it cost?

Data size: 100 GB of images, 500 GB of older data, 1 TB of archives

Storage class        Price
S3                   $0.023/GB
S3 STANDARD_IA       $0.0125/GB
Amazon Glacier       $0.004/GB

Service                                    Cost
S3 - 100 GB images                         $2.30
S3 Infrequent Access - 500 GB of data      $6.25
Amazon Glacier - 1 TB archives             $4.10
Total                                      $12.65/month

Prices shown are for us-east-1 region as of Oct 2018 and subject to change over time.
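The slide's total can be reproduced with a few lines of arithmetic. This is only a sketch using the October 2018 per-GB prices quoted above; it assumes 1 TB = 1024 GB (which matches the slide's Glacier figure) and ignores request, retrieval, and transfer charges.

```python
# Per-GB monthly storage prices from the slide (us-east-1, Oct 2018).
PRICE_PER_GB = {
    "S3": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.004,
}

GB_PER_TB = 1024  # assumption: binary terabytes, matching the slide's $4.10


def monthly_cost(storage_class: str, size_gb: float) -> float:
    """Monthly storage cost in dollars, rounded to cents."""
    return round(PRICE_PER_GB[storage_class] * size_gb, 2)


if __name__ == "__main__":
    images = monthly_cost("S3", 100)                  # 2.3
    older = monthly_cost("STANDARD_IA", 500)          # 6.25
    archives = monthly_cost("GLACIER", 1 * GB_PER_TB)  # 4.1
    print(f"{images + older + archives:.2f}")         # 12.65
```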

Page 51:

Use case 2 Large data archive and recovery

Page 52:

Large data set – Backup using AWS Snowball

[Diagram labels: Corporate data center with NGS on-premises compute / cluster and sequence data (Flowcell-ID); AWS CLI; AWS Snowball device; AWS Snowball; AWS cloud with Amazon Glacier; steps numbered 1 to 3.]

Page 53:

Large data set – Backup using Volume Gateway

[Diagram labels: Corporate data center with NGS on-premises compute / cluster, virtual server, iSCSI, AWS Storage Gateway (cached volume, virtual tape library); AWS cloud with AWS Storage Gateway, Amazon S3, and Amazon Glacier; steps numbered 1 and 2.]

Page 54:

Large data set – Backup using File Gateway

[Diagram labels: Corporate data center with NGS on-premises compute / cluster, File Gateway VM, NFS; AWS cloud US-West-2 with Amazon S3 bucket and lifecycle policy; AWS cloud US-East-1 with Amazon S3 bucket.]

Page 55:

Large data set – Recovery using AWS Snowball

[Diagram labels: AWS DR Region with sequence data (Flowcell-ID), Amazon Glacier, S3 VPC endpoint, and Amazon EC2; AWS Snowball; corporate DR facility with server infrastructure; steps numbered 1 and 2.]

Page 56:

Large data set – Recovery using Volume Gateway

[Diagram labels: Corporate data center with NGS on-premises compute / cluster, virtual server, iSCSI, AWS Storage Gateway (cached volume, virtual tape library); AWS DR Region with Amazon S3, Amazon Glacier, EBS snapshot, Amazon EBS, AMI, and instances; steps numbered 1 and 2.]

Page 57:

Large data set – Recovery using File Gateway

[Diagram labels: Corporate data center with NGS on-premises compute / cluster, File Gateway VM, NFS; AWS DR Region with Amazon S3 bucket, S3 endpoint, and Amazon EC2; steps numbered 1 to 3.]

Page 58:

What would it cost? – with Gateways

Data size: 10 TB of files, 32 TB of storage volume, 250 TB of tapes

Storage type         Price
File Storage         $0.023/GB
Volume Storage       $0.023/GB
VTL - Archived       $0.004/GB

Service                           Cost
File Gateway - 10 TB              $235.40
Storage Gateway - 32 TB           $736
Storage Gateway VTL - 250 TB      $1,000
Total                             $1,971.40/mo

Prices shown are for us-west-2 region as of Oct 2018 and subject to change over time.

Page 59:

What would it cost? – with Snowball

Data size: 1 PB of data, 1 PB of archives

Item                  Price
S3                    $0.023/GB
Snowball Edge         $300/100 TB
Amazon Glacier        $0.004/GB

Service                             Cost
AWS Snowball * 10                   $3,000.00
Amazon Glacier archive 1 PB         $4,194.31
Total                               $7,194.31 including the Snowball fee; $4,194.31/month for Glacier storage

Prices shown are for us-west-2 region as of Oct 2018 and subject to change over time.
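To see why shipping devices can be attractive at this scale, it helps to estimate how long the same 1 PB would take over a network link. The sketch below assumes a dedicated 1 Gbps link at full utilization and 1 PB = 2^50 bytes; both are illustrative assumptions, and `transfer_days` and `snowball_fee` are hypothetical helpers.

```python
PB = 2 ** 50  # bytes; assumption: binary petabyte


def transfer_days(size_bytes: int, link_bits_per_second: float,
                  utilization: float = 1.0) -> float:
    """Days needed to push size_bytes over a link at the given utilization."""
    seconds = size_bytes * 8 / (link_bits_per_second * utilization)
    return seconds / 86_400


def snowball_fee(devices: int, fee_per_device: float = 300.0) -> float:
    """Device fee using the slide's $300 per 100 TB device."""
    return devices * fee_per_device


if __name__ == "__main__":
    print(f"{transfer_days(PB, 1e9):.1f} days over 1 Gbps")  # 104.2 days over 1 Gbps
    print(f"${snowball_fee(10):,.2f} for 10 devices")        # $3,000.00 for 10 devices
```

At lower real-world utilization the network figure only gets worse, which is the trade-off the Snowball option addresses.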

Page 60:

What if I have even more data?

Page 61:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

- Andrew S. Tanenbaum

Page 62:

Use case 3: Multi site replication and failover

Page 63:

Multisite failover

[Diagram labels: Users; corporate data center with server and customer gateway; Equinix DA1; AWS Direct Connect; VPN; us-east-1 and us-west-2 with servers across two Availability Zones; failback.]

Page 64:

Multisite failover

[Diagram labels: Users; corporate data center with server and customer gateway; Equinix DA1; AWS Direct Connect; VPN; us-east-1 and us-west-2 with servers across two Availability Zones; AWS CloudFormation; failback.]

Page 65:

What would it cost? (30 days) - Remote Site

Item                           Rate
VPC VPN                        $0.05/hr
EC2* (m4.xlarge)               $0.20/hr
1 Gb Direct Connect            $0.30/hr
EBS                            $0.10/GB
Region data transfer fee       $0.02/GB

Service                                 Cost
1 Gb Direct Connect                     $219.60
VPN Fallback Connection                 $36.00
(2) EC2 instances – 1 in each AZ        $292.80
(2) EBS 60 GB volumes                   $12.00
(1) AMI copy to us-west-2               $1.20
Total                                   $561.60

* US-West-2, Amazon Linux AMI

Page 66:

Use case 4: All in on AWS – Planning for Amazon S3 data loss

Page 67:

“I’m worried about losing data from S3!”

• Amazon S3 is built for 11 9s of durability
  • If you store 10,000 objects, you can on average expect to incur a loss of a single object once every 10,000,000 years.
• Amazon S3 supports cross region replication
• Amazon S3 supports versioning
• Amazon S3 supports MFA delete
• IAM roles can also be used to limit access to S3
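The "once every 10,000,000 years" figure follows directly from the 11 nines. A short worked check, treating 99.999999999% as the annual durability of each object and using exact fractions to avoid floating-point noise (the function name is hypothetical):

```python
from fractions import Fraction

# 99.999999999% annual durability per object ("11 nines").
ELEVEN_NINES = Fraction(99_999_999_999, 100_000_000_000)


def years_per_expected_loss(objects: int, durability: Fraction) -> Fraction:
    """Average number of years between single-object losses."""
    annual_loss_rate = 1 - durability                  # per object, per year
    expected_losses_per_year = objects * annual_loss_rate
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    print(years_per_expected_loss(10_000, ELEVEN_NINES))  # 10000000
```

Durability covers loss inside the storage service itself; accidental or malicious deletion is what versioning, MFA delete, and restricted IAM access address.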

Page 68:

Use case 5: All in on AWS – Planning for Database failover

Page 69:

RDS Database

• Create Multi-AZ deployments
  • Data synchronously replicated to another Availability Zone
• Set up automatic backup/snapshots
• Use Cross-region Read Replicas for MySQL, PostgreSQL, MariaDB
• Use Amazon Aurora for MySQL and PostgreSQL
  • Distributed, fault-tolerant, self-healing storage system
  • Low-latency read replicas
  • Point-in-time recovery
  • Continuous backup to Amazon S3
  • Replication across three Availability Zones

Page 70:

Database Migration Service (DMS)

• Continuous or one time DB replication to EC2 or RDS
• Leverage DMS to replicate your database to AWS or even change your schema from one engine to another

Source Database            Target Database on Amazon RDS
Oracle Database            Amazon Aurora, MySQL, PostgreSQL, MariaDB
Oracle Data Warehouse      Amazon Redshift
Microsoft SQL Server       Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL, MariaDB

Page 71:

What about third party support?

Page 72:

Amazon BC/DR partner ecosystem (sample)

• Solutions that utilize AWS to enable recovery strategies

• Focused on RTO and RPO requirements

• Full suite of both cold and warm BC/DR solutions

Page 73:

Disaster Recovery Testing & Assurance

Page 74:

Test continuously and constantly

• Regularly execute tests in stable, production & production-like test environments

• Set up Infrastructure as Code

• CI/CD Test in Infrastructure Build Pipeline

• Playbook to follow documented procedures

Test your DR plan before disaster strikes

Page 75:

Warm Standby – Testing

[Diagram: Route 53 resolves www.example.com. The corporate data center is active (reverse proxy / caching server, application server, master database server). The AWS region runs a scaled-down standby that is not active for production traffic (Elastic Load Balancer, reverse proxy / caching server, application server, subordinate database server, data volume). Labels: "Mirroring / replication"; "Application data source cut over".]

Page 76:

Warm Standby – Testing

[Diagram: Same labels as Page 75: Route 53, www.example.com, active corporate data center, scaled-down standby in the AWS region not active for production traffic, mirroring / replication, application data source cut over.]

Page 77:

Warm Standby – Testing

[Diagram: Same labels as Page 75: Route 53, www.example.com, active corporate data center, scaled-down standby in the AWS region not active for production traffic, mirroring / replication, application data source cut over.]

Page 78:

Warm Standby – Testing

[Diagram: Same labels as Page 75: Route 53, www.example.com, active corporate data center, scaled-down standby in the AWS region not active for production traffic, mirroring / replication, application data source cut over.]

aws rds reboot-db-instance --db-instance-identifier dbInstanceID --force-failover
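A forced failover like the command above is most useful when the recovery is timed against the RTO. The following is a minimal drill harness, offered as a sketch rather than the speaker's method: `run_failover_drill`, `trigger`, and `is_healthy` are hypothetical names, and the caller supplies both the action that starts the failover and the health check.

```python
import time
from typing import Callable


def run_failover_drill(trigger: Callable[[], None],
                       is_healthy: Callable[[], bool],
                       rto_seconds: float,
                       poll_seconds: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> float:
    """Trigger a failover, poll until the service is healthy again, and
    return the measured recovery time in seconds.

    Raises TimeoutError if the service is still unhealthy after rto_seconds.
    """
    start = clock()
    trigger()  # e.g. shell out to the reboot-db-instance command above
    while not is_healthy():
        if clock() - start > rto_seconds:
            raise TimeoutError(f"service not healthy within {rto_seconds}s RTO")
        sleep(poll_seconds)
    return clock() - start


if __name__ == "__main__":
    # Stand-in trigger and health check so the sketch runs without AWS access.
    checks = {"count": 0}

    def fake_health_check() -> bool:
        checks["count"] += 1
        return checks["count"] >= 3  # "recovers" on the third poll

    elapsed = run_failover_drill(lambda: None, fake_health_check,
                                 rto_seconds=10, poll_seconds=0.1)
    print(f"recovered in {elapsed:.1f}s")
```

Injecting `clock` and `sleep` keeps the harness testable, so the drill logic itself can run in a CI pipeline alongside the infrastructure code.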

Page 79:

https://github.com/Netflix/chaosmonkey

Unleash the Simian Army!

Page 80:

How easy can I make my DR?

Page 81:

“Alexa, fail over my data center”

#Alexafailover

https://failover.karandemo.com/

Page 82:

Voice Activated Failover with Alexa

[Diagram: Route 53 resolves failover.karandemo.com. Singapore Region: ELB in SIN region, web servers (EC2 - ASG), application servers (EC2 - ASG), primary database (Aurora MySQL). Sydney Region: ELB in SYD region, web servers (EC2 - ASG), application servers (EC2 - ASG), database read replica (Aurora MySQL). Also shown: Alexa-enabled device, Alexa Skill, Lambda function, SNS topic.]

Page 83:

[Diagram: Route 53 resolves failover.karandemo.com. Singapore Region: ELB in SIN region, web servers (EC2 - ASG), application servers (EC2 - ASG), primary database (Aurora MySQL). Sydney Region: ELB in SYD region, web servers (EC2 - ASG), application servers (EC2 - ASG), database read replica (Aurora MySQL). Also shown: Alexa-enabled device, Alexa Skill, Lambda function, SNS topic.]

Calls shown on the diagram:

route53.changeResourceRecordSets()

SNS.publish()

rds.failoverDBCluster()
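The slide names three SDK calls but not their order or parameters. As a rough sketch of what such a Lambda handler might do, the function below uses the boto3 equivalents (`failover_db_cluster`, `change_resource_record_sets`, `publish`). The call order, the CNAME record type, the TTL, and every identifier passed in are assumptions; depending on how the cross-region replica is set up, a different RDS call may be needed to promote it. The clients are passed in so the logic can be exercised with stand-ins.

```python
from typing import Any


def fail_over(rds: Any, route53: Any, sns: Any, *,
              db_cluster_id: str, hosted_zone_id: str, record_name: str,
              standby_dns_name: str, topic_arn: str) -> None:
    """Fail the database over, repoint DNS at the standby, then notify.

    rds, route53 and sns are expected to behave like boto3 clients,
    e.g. boto3.client("rds"). All identifiers are supplied by the caller.
    """
    # 1. Ask RDS to fail the Aurora cluster over (assumed to run first).
    rds.failover_db_cluster(DBClusterIdentifier=db_cluster_id)

    # 2. Point the public name at the standby region's load balancer.
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "DR failover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",  # assumption: a plain CNAME record
                    "TTL": 60,        # assumption: short TTL for quick cutover
                    "ResourceRecords": [{"Value": standby_dns_name}],
                },
            }],
        },
    )

    # 3. Tell the operators what happened.
    sns.publish(
        TopicArn=topic_arn,
        Subject="DR failover triggered",
        Message=f"{record_name} now points at {standby_dns_name}",
    )
```

With real clients this would be called as, for example, `fail_over(boto3.client("rds"), boto3.client("route53"), boto3.client("sns"), ...)`, with placeholder identifiers replaced by the account's own.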

Page 84:

Putting it all together

Page 85:

Lessons from history

Plan for more than just what you expect to happen

Page 86:

Lessons from history

Test your execution plan before you think you can implement it

Page 87:

Lessons from history

Knowledge is critical. Know how to interpret an alarm on events.

Page 88:

Words of advice

People generally don’t do well under pressure.

Relying on manual intervention to trigger a DR plan is an invitation for trouble.

Page 89:

Words of advice

• Automate as much as you can

• Table-top exercises can really help you understand roles and responsibilities

• Not all services have to require the same RTO/RPO

• If you don’t have a runbook, it’s time to make one

• If you have one, have you tested it?

Seriously, automate as much as you can ahead of time!

Page 90:

Further Reading:

https://aws.amazon.com/disaster-recovery/

Whitepaper: Using AWS for Disaster Recovery

https://media.amazonwebservices.com/AWS_Disaster_Recovery.pdf

Page 91:

Thank You!

Karan Desai
Solutions Architect
AWS

Email: [email protected]

Twitter: @somecloudguy

DESIGNING FOR FAILURE
Disaster Recovery Using AWS