Leveraging the Public Cloud for Disaster Recovery

Lahav Savir, Architect & CEO, Emind Systems Ltd.

About

Lahav Savir
• 15+ years’ experience in the online industry
• Architect and CEO @ Emind Systems

Emind Systems (est. 2006)
• Boutique system integrator
• ~100 AWS customers
• AWS solution provider

Amazon (AWS) Certification

Amazon Solution Provider & Consulting Partner

https://aws.amazon.com/solution-providers/si/emind-systems-ltd

Disaster Recovery in a Nutshell

• Business continuity
• Minimize downtime and data loss
• Recovery Time Objective (RTO)
• Recovery Point Objective (RPO)
• Price

DR Approaches

Complete server mirroring

Data mirroring / replication

Configuration replication

Emind’s Best Practice

[Diagram: Server Mirror, Configuration Mirror, Data Mirror]

Why Amazon?

Flexible, Global Infrastructure
• N. Virginia
• Oregon
• N. California
• Ireland
• Singapore
• Tokyo
• Sydney
• São Paulo
• GovCloud

Secure

• VPC - Virtual Private Cloud on AWS's infrastructure

• Specify private IP address range

• Bridge your onsite IT infrastructure and the VPC with a VPN connection or Direct Connect

• Extend your existing security and management policies to the cloud

A different cost model

[Chart: infrastructure cost over time, across the phases Test, Test, Failover, Failback. Series: 2nd Site Cost, AWS Cost, Demand.]

• Cost savings w/ AWS
• Ability to scale – no arbitrary time limit to failback

Zooming In on the Techniques

Disaster Recovery Terms
• RTO: Recovery Time Objective
  – Acceptable time period within which normal operation (or degraded operation) needs to be restored after an event
• RPO: Recovery Point Objective
  – Acceptable data loss measured in time
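To make the two terms concrete, here is a minimal sketch that measures one incident against both objectives. The function names, the incident timeline and the target values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def measure_recovery(last_good_copy, disaster_at, restored_at):
    """Return (data_loss, downtime) for one incident as timedeltas."""
    data_loss = disaster_at - last_good_copy  # compared against the RPO
    downtime = restored_at - disaster_at      # compared against the RTO
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """True when both the data loss and the downtime stay within target."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident: last backup at 02:00, failure at 09:30, service back at 12:30.
data_loss, downtime = measure_recovery(
    last_good_copy=datetime(2024, 1, 1, 2, 0),
    disaster_at=datetime(2024, 1, 1, 9, 30),
    restored_at=datetime(2024, 1, 1, 12, 30),
)
within_targets = meets_objectives(
    data_loss, downtime, rpo=timedelta(hours=24), rto=timedelta(hours=4)
)
```

The same incident would miss a one-hour RPO, which is the kind of gap that pushes a design from backups toward replication.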

Backup and Restore

[Diagram: on-premises infrastructure with a traditional server; data copied to an S3 bucket with objects. Also shown: Amazon Route 53, AWS Import/Export.]

Backup and Restore

[Diagram: within an AWS Region and Availability Zone, an Amazon EC2 instance is quickly provisioned from an AMI pre-bundled with OS and applications; its data volume holds data copied from objects in the Amazon S3 bucket.]

Backup and Restore

• Advantages
  – Simple to get started
  – Extremely cost effective (mostly backup storage)
• Preparation Phase
  – Take backups of current systems
  – Store backups in S3
  – Describe procedure to restore from backup on AWS
    • Know which AMI to use, build your own as needed
    • Know how to restore system from backups
    • Know how to switch to new system

Backup and Restore

• In Case of Disaster
  – Retrieve backups from S3
  – Bring up required infrastructure
    • EC2 instances with prepared AMIs, Load Balancing, etc.
  – Restore system from backup
  – Switch over to the new system
    • Adjust DNS records to point to AWS
• Objectives
  – RTO: as long as it takes to bring up infrastructure and restore system from backups
  – RPO: time since last backup
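The "adjust DNS records" step can be sketched as building a Route 53 change request. The helper name `build_dns_switch`, the record name and the IP address below are assumptions for illustration; the dictionary follows the change-batch layout Route 53 uses for an `UPSERT` of an A record.

```python
def build_dns_switch(record_name, new_ip, ttl=60):
    """Build a Route 53 change batch that repoints an A record.

    A short TTL is assumed so that clients pick up the switch quickly.
    """
    return {
        "Comment": "DR failover: point record at the AWS environment",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite it
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }


# Hypothetical record and address (documentation ranges, not real hosts).
change_batch = build_dns_switch("www.example.com.", "203.0.113.10")
```

With boto3, a dictionary of this shape is what the Route 53 client's `change_resource_record_sets` call takes as `ChangeBatch`, together with a `HostedZoneId`; the call itself is left out here so the sketch runs without AWS credentials.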

Pilot Light

[Diagram sequence: the user or system reaches the primary web, application and database servers through Amazon Route 53. In AWS, a smaller database server instance with its data volume receives data mirroring/replication, while the web and application servers are not running. On failover, those servers start in minutes and the database instance is resized as desired.]

Pilot Light

• Advantages
  – Very cost effective (fewer 24/7 resources)
• Preparation Phase
  – Enable replication of all critical data to AWS
  – Prepare all required resources for automatic start
    • AMIs, Network Settings, Load Balancing, etc.

Pilot Light

• In Case of Disaster
  – Automatically bring up resources around the replicated core data set
  – Scale the system as needed to handle current production traffic
  – Switch over to the new system
• Objectives
  – RTO: as long as it takes to detect need for DR and automatically scale up replacement system
  – RPO: depends on replication type
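One way to read "RPO depends on replication type" is the simplified model below. The three category names and the formula for each are assumptions made for this sketch, not figures from the slides: synchronous replication loses nothing that was acknowledged, asynchronous replication loses up to the replication lag, and periodic shipping loses up to one full interval plus the time to transfer the copy.

```python
from datetime import timedelta


def worst_case_rpo(replication_type, lag=timedelta(0), interval=timedelta(0)):
    """Estimate worst-case data loss for a replication style (simplified model).

    lag      -- time for a change (or a shipped copy) to arrive in AWS
    interval -- time between copies, used only for periodic shipping
    """
    if replication_type == "synchronous":
        return timedelta(0)
    if replication_type == "asynchronous":
        return lag
    if replication_type == "periodic":
        return interval + lag
    raise ValueError(f"unknown replication type: {replication_type!r}")


# Hypothetical numbers for comparison.
sync_rpo = worst_case_rpo("synchronous")
async_rpo = worst_case_rpo("asynchronous", lag=timedelta(seconds=30))
periodic_rpo = worst_case_rpo(
    "periodic", lag=timedelta(minutes=10), interval=timedelta(hours=1)
)
```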

Fully-Working Low-Capacity Standby

[Diagram sequence: the user or system reaches the primary web, application and database servers through Amazon Route 53. In AWS, low-capacity web, app and DB servers with their data volume run continuously and receive data mirroring/replication. On failover, the AWS side grows capacity.]

• Advantages
  – Can take some production traffic at any time
  – Cost savings (IT footprint smaller than full DR)
• Preparation
  – Similar to Pilot Light
  – All necessary components running 24/7, but not scaled for production traffic
  – Best practice – continuous testing
    • “Trickle” a statistical subset of production traffic to DR site
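The "trickle" idea can be sketched as a deterministic split: hash a client identifier into one of 100 buckets and send the lowest buckets to the DR site. The function name, the site labels and the 5% share are assumptions for illustration; in practice the split might instead be done at the DNS layer, for example with weighted records.

```python
import hashlib


def route_request(client_id, dr_percent):
    """Return "dr" for roughly dr_percent% of client IDs, else "primary".

    Hashing keeps the decision stable: the same client always lands
    on the same site for a given percentage.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return "dr" if bucket < dr_percent else "primary"


# Hypothetical client IDs; about 5% of them should be routed to the DR site.
clients = [f"client-{n}" for n in range(10000)]
dr_count = sum(1 for c in clients if route_request(c, 5) == "dr")
```

A stable split also makes test results easier to interpret, because any errors seen at the DR site can be traced back to a fixed set of clients.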


• In Case of Disaster
  – Immediately fail over most critical production load
    • Adjust DNS records to point to AWS
  – (Auto) Scale the system further to handle all production load
• Objectives
  – RTO: for critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further
  – RPO: depends on replication type

Multi-Site Hot Standby

[Diagram: Amazon Route 53 in front of the user or system, the primary site, and an AWS site running web, application and database servers with their data volumes at full capacity.]

Multi-Site Hot Standby

• Advantages
  – At any moment can take all production load
• Preparation
  – Similar to Low-Capacity Standby
  – Fully scaling in/out with production load
• In Case of Disaster
  – Immediately fail over all production load
• Objectives
  – RTO: as long as it takes to fail over
  – RPO: depends on replication type

Summary

• Plan
  – Analyze your existing applications and services
  – Find the right approach per case
• Adapt
  – Match your plan to RTO, RPO and Budget
• POC
  – Validate your plan
• Test
  – Periodic testing
• Monitor
  – Ensure continuous operation of all
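Matching a plan to RTO, RPO and budget can be sketched as a filter over candidate approaches. Every number below is a made-up placeholder, not a measurement or a price; real values would come from your own analysis and POC.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Approach:
    name: str
    rto: timedelta       # estimated time to recover
    rpo: timedelta       # estimated worst-case data loss
    monthly_cost: float  # estimated steady-state cost


def pick_approach(candidates, rto_target, rpo_target, budget):
    """Return the cheapest candidate meeting all three limits, or None."""
    fitting = [
        c for c in candidates
        if c.rto <= rto_target and c.rpo <= rpo_target and c.monthly_cost <= budget
    ]
    return min(fitting, key=lambda c: c.monthly_cost) if fitting else None


# Placeholder estimates for the four approaches covered in this deck.
candidates = [
    Approach("Backup and Restore", timedelta(hours=24), timedelta(hours=24), 100.0),
    Approach("Pilot Light", timedelta(hours=1), timedelta(minutes=5), 500.0),
    Approach("Low-Capacity Standby", timedelta(minutes=15), timedelta(minutes=5), 1500.0),
    Approach("Multi-Site Hot Standby", timedelta(minutes=1), timedelta(minutes=1), 5000.0),
]
choice = pick_approach(
    candidates,
    rto_target=timedelta(hours=2),
    rpo_target=timedelta(minutes=15),
    budget=2000.0,
)
```

Running the selection per application, rather than once for the whole estate, reflects the "right approach per case" point above.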

• goCloud – Emind’s optimal road to the cloud
  – Secure cloud architecture
  – Scalable & high-availability design
  – Customized system deployment
  – Orchestrating cloud and software
  – Cloud operation team
  – Monitoring and alerting
  – 24x7 SLA

Contact
@lahavsavir
054-4321688
