Leveraging the Public Cloud for Disaster Recovery
Lahav Savir, Architect & CEO
Emind Systems
[email protected]
About
Lahav Savir
• 15+ years’ experience in the online industry
• Architect and CEO @ Emind Systems
Emind Systems (est. 2006)
• Boutique system integrator
• ~100 AWS customers
• AWS solution provider
Amazon (AWS) Certification
Amazon Solution Provider & Consulting Partner
https://aws.amazon.com/solution-providers/si/emind-systems-ltd
Disaster Recovery in a Nutshell
• Business continuity
• Minimize downtime and data loss
• Recovery Time Objective (RTO)
• Recovery Point Objective (RPO)
• Price
DR Approaches
Complete server mirroring
Data mirroring / replication
Configuration replication
Emind’s Best Practice
[Diagram: Server Mirror, Configuration Mirror, Data Mirror]
Why Amazon ?
Flexible, Global Infrastructure
• N. Virginia
• Oregon
• N. California
• Ireland
• Singapore
• Tokyo
• Sydney
• São Paulo
• GovCloud
Secure
• VPC - Virtual Private Cloud on AWS's infrastructure
• Specify private IP address range
• Bridge your onsite IT infrastructure and the VPC with a VPN connection or Direct Connect
• Extend your existing security and management policies to the cloud
A different cost model
[Chart: infrastructure cost over time, comparing 2nd site cost and AWS cost against demand. Timeline events: Test, Test, Failover, Failback. Callouts: "Cost savings w/ AWS"; "Ability to scale – no arbitrary time limit to failback".]
Zooming In on the Techniques
Disaster Recovery Terms
• RTO: Recovery Time Objective
  – Acceptable time period within which normal operation (or degraded operation) needs to be restored after an event
• RPO: Recovery Point Objective
  – Acceptable data loss measured in time
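To make the two terms concrete, the following is a minimal Python sketch that measures a single outage against RTO and RPO targets. It is not part of the original deck; the names `Incident`, `last_recovery_point`, `failure_time`, and `restored_time` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps describing one outage (all field names are illustrative)."""
    last_recovery_point: datetime  # last backup or replicated write that survived
    failure_time: datetime         # when the event happened
    restored_time: datetime        # when normal or degraded operation resumed


def achieved_rto(incident: Incident) -> timedelta:
    """Time taken to restore operation after the event."""
    return incident.restored_time - incident.failure_time


def achieved_rpo(incident: Incident) -> timedelta:
    """Data loss measured in time: work done since the last recovery point."""
    return incident.failure_time - incident.last_recovery_point


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the recovery time and the data loss are within target."""
    return achieved_rto(incident) <= rto and achieved_rpo(incident) <= rpo
```

Under these assumptions, an outage restored two hours after the event, with the last backup taken six hours before it, meets a 4-hour RTO and a 24-hour RPO but would miss a 1-hour RTO.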
Backup and Restore
[Diagram: on-premises infrastructure (traditional server), Amazon Route 53, AWS Import/Export, and an S3 bucket with objects; data copied to S3.]
Backup and Restore
[Diagram: inside an AWS Region and Availability Zone, an Amazon EC2 instance is quickly provisioned from an AMI pre-bundled with OS and applications; its data volume receives data copied from objects in the Amazon S3 bucket.]
Backup and Restore
• Advantages
  – Simple to get started
  – Extremely cost effective (mostly backup storage)
• Preparation Phase
  – Take backups of current systems
  – Store backups in S3
  – Describe procedure to restore from backup on AWS
    • Know which AMI to use, build your own as needed
    • Know how to restore system from backups
    • Know how to switch to new system
Backup and Restore
• In Case of Disaster
  – Retrieve backups from S3
  – Bring up required infrastructure
    • EC2 instances with prepared AMIs, Load Balancing, etc.
  – Restore system from backup
  – Switch over to the new system
    • Adjust DNS records to point to AWS
• Objectives
  – RTO: as long as it takes to bring up infrastructure and restore system from backups
  – RPO: time since last backup
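The restore procedure above is an ordered sequence, and its total duration is the RTO actually achieved. A minimal Python sketch of such a runbook follows. It is an illustration only: `run_restore` and `build_runbook` are hypothetical helpers, and each action merely records that it ran, where real tooling would call AWS.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_restore(steps: List[Step]) -> List[Tuple[str, float]]:
    """Run the restore steps in order and record how long each one took.

    The sum of the durations is the RTO achieved by this run. A step that
    raises stops the run, because every later step depends on it.
    """
    timings = []
    for name, action in steps:
        started = time.monotonic()
        action()
        timings.append((name, time.monotonic() - started))
    return timings


def build_runbook(log: List[str]) -> List[Step]:
    """Hypothetical runbook: each placeholder action only appends to `log`."""
    names = [
        "retrieve backups from S3",
        "bring up infrastructure from prepared AMIs",
        "restore system from backup",
        "switch DNS records to AWS",
    ]
    # Bind `name` as a default argument so each lambda keeps its own value.
    return [(name, lambda name=name: log.append(name)) for name in names]


if __name__ == "__main__":
    log: List[str] = []
    for step_name, seconds in run_restore(build_runbook(log)):
        print(f"{step_name}: {seconds:.3f}s")
```

Timing each step during a periodic test run gives a measured estimate of the RTO, instead of a guess.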
Pilot Light
[Diagram sequence, three slides. Labels: user or system; Amazon Route 53; web, application and database servers with data volumes; data mirroring/replication between the database servers. Slides 1 and 2: web and application servers on AWS marked "Not Running", database server on AWS marked "Smaller Instance". Slide 3: servers on AWS marked "Start in minutes" and "Resize as desired".]
Pilot Light
• Advantages
  – Very cost effective (fewer 24/7 resources)
• Preparation Phase
  – Enable replication of all critical data to AWS
  – Prepare all required resources for automatic start
    • AMIs, Network Settings, Load Balancing, etc.
Pilot Light
• In Case of Disaster
  – Automatically bring up resources around the replicated core data set
  – Scale the system as needed to handle current production traffic
  – Switch over to the new system
    • Adjust DNS records to point to AWS
• Objectives
  – RTO: as long as it takes to detect need for DR and automatically scale up replacement system
  – RPO: depends on replication type
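Because the Pilot Light RTO includes the time to detect the need for DR, the detection rule matters. The sketch below is one possible rule, not the deck's: fail over after a number of consecutive failed health checks. The function names, the callables passed in, and the default threshold of 3 are all assumptions for illustration.

```python
from typing import Callable, Iterable


def should_fail_over(health_checks: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive health checks have failed.

    Requiring several failures in a row avoids triggering DR on a single
    dropped probe. The default of 3 is an arbitrary illustrative value.
    """
    consecutive_failures = 0
    for healthy in health_checks:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False


def pilot_light_failover(
    health_checks: Iterable[bool],
    start_standby: Callable[[], None],
    scale_up: Callable[[], None],
    switch_dns: Callable[[], None],
    threshold: int = 3,
) -> bool:
    """Run the three disaster steps in order if the primary looks down.

    The callables are placeholders for real automation: starting the
    stopped servers, scaling to production size, and updating DNS.
    """
    if not should_fail_over(health_checks, threshold):
        return False
    start_standby()
    scale_up()
    switch_dns()
    return True
```

A higher threshold lowers the risk of a false failover but lengthens detection time, and therefore the RTO.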
Fully-Working Low-Capacity Standby
[Diagram sequence, four slides. Labels: user or system; Amazon Route 53; web, application and database servers with data volumes; data mirroring/replication. Slides 1 and 2: AWS side marked "Low Capacity". Slides 3 and 4: AWS side marked "Grow Capacity", with additional web, application and database servers.]
Fully-Working Low-Capacity Standby
• Advantages
  – Can take some production traffic at any time
  – Cost savings (IT footprint smaller than full DR)
• Preparation
  – Similar to Pilot Light
  – All necessary components running 24/7, but not scaled for production traffic
  – Best practice – continuous testing
    • “Trickle” a statistical subset of production traffic to DR site
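One way to picture the "trickle" is weighted selection: each request goes to the DR site with a small fixed probability. In practice this is usually done at the DNS or load-balancer layer, for example with weighted DNS records; the Python sketch below only simulates the selection. The function names and the 5% weight used in the example are assumptions, not values from the deck.

```python
import random


def choose_site(rng: random.Random, dr_weight: float) -> str:
    """Pick a site for one request; `dr_weight` is the fraction sent to DR."""
    if not 0.0 <= dr_weight <= 1.0:
        raise ValueError("dr_weight must be between 0 and 1")
    return "dr" if rng.random() < dr_weight else "primary"


def simulate(requests: int, dr_weight: float, seed: int = 0) -> dict:
    """Count how many of `requests` land on each site for a given weight."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    counts = {"primary": 0, "dr": 0}
    for _ in range(requests):
        counts[choose_site(rng, dr_weight)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative only: send roughly 5% of 10,000 requests to the DR site.
    print(simulate(10_000, dr_weight=0.05))
```

Even a small weight exercises the full DR stack with real traffic, which is the point of the continuous-testing practice.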
Fully-Working Low-Capacity Standby
• In Case of Disaster
  – Immediately fail over most critical production load
    • Adjust DNS records to point to AWS
  – (Auto) Scale the system further to handle all production load
• Objectives
  – RTO: for critical load: as long as it takes to fail over; for all other load, as long as it takes to scale further
  – RPO: depends on replication type
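The two-stage RTO above (critical load first, the rest after scaling) can be sketched as a simple capacity plan. This is a hypothetical illustration: `Workload`, `plan_failover`, and the abstract "capacity units" are assumptions, and a real plan would use measured load figures.

```python
import math
from typing import List, NamedTuple, Tuple


class Workload(NamedTuple):
    name: str
    load: float      # capacity needed, in any consistent unit
    critical: bool


def plan_failover(
    workloads: List[Workload],
    standby_capacity: float,
    instance_capacity: float,
) -> Tuple[List[str], List[str], int]:
    """Split workloads into those served immediately and those that wait.

    Critical workloads are placed first. Returns (immediate, deferred,
    extra_instances), where extra_instances is how many more instances of
    size `instance_capacity` are needed before the deferred load fits.
    """
    if instance_capacity <= 0:
        raise ValueError("instance_capacity must be positive")
    remaining = standby_capacity
    deferred_load = 0.0
    immediate: List[str] = []
    deferred: List[str] = []
    # sorted() is stable, so workloads keep their order within each group.
    for workload in sorted(workloads, key=lambda w: not w.critical):
        if workload.load <= remaining:
            immediate.append(workload.name)
            remaining -= workload.load
        else:
            deferred.append(workload.name)
            deferred_load += workload.load
    shortfall = max(0.0, deferred_load - remaining)
    return immediate, deferred, math.ceil(shortfall / instance_capacity)
```

For example, with a standby of 4 units, instances of 2 units each, and workloads of 3 (critical), 5, and 4 units, the critical workload fails over immediately and four more instances are needed for the rest.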
Multi-Site Hot Standby
[Diagram: user or system; Amazon Route 53; web, application and database servers with data volumes on both sites; data mirroring/replication; AWS side marked "Full Capacity".]
Multi-Site Hot Standby
• Advantages
  – At any moment can take all production load
• Preparation
  – Similar to Low-Capacity Standby
  – Fully scaling in/out with production load
• In Case of Disaster
  – Immediately fail over all production load
    • Adjust DNS records to point to AWS
• Objectives
  – RTO: as long as it takes to fail over
  – RPO: depends on replication type
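The four approaches trade standing cost against recovery time. A minimal Python sketch of choosing among them follows, assuming the order in which the slides present them runs from lowest to highest standing cost, and assuming the caller supplies an estimated RTO for each approach. The function name and the example estimates are hypothetical, not measurements.

```python
from datetime import timedelta
from typing import Dict, Optional

# Assumed to run from lowest to highest standing cost.
APPROACHES = [
    "Backup and Restore",
    "Pilot Light",
    "Fully-Working Low-Capacity Standby",
    "Multi-Site Hot Standby",
]


def cheapest_approach(
    estimated_rto: Dict[str, timedelta], target_rto: timedelta
) -> Optional[str]:
    """Return the first (cheapest) approach whose estimated RTO meets the target.

    Approaches without an estimate are skipped. Returns None when no
    approach is fast enough.
    """
    for name in APPROACHES:
        rto = estimated_rto.get(name)
        if rto is not None and rto <= target_rto:
            return name
    return None


if __name__ == "__main__":
    # Illustrative estimates only; real values come from testing.
    estimates = {
        "Backup and Restore": timedelta(hours=24),
        "Pilot Light": timedelta(hours=2),
        "Fully-Working Low-Capacity Standby": timedelta(minutes=30),
        "Multi-Site Hot Standby": timedelta(minutes=5),
    }
    print(cheapest_approach(estimates, target_rto=timedelta(hours=4)))
```

RPO is left out of the sketch because, for every approach except Backup and Restore, it depends on the replication type and not on the approach itself.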
Summary
• Plan
  – Analyze your existing applications and services
  – Find the right approach per case
• Adapt
  – Match your plan to RTO, RPO and Budget
• POC
  – Validate your plan
• Test
  – Periodic testing
• Monitor
  – Ensure continuous operation of all
• goCloud – Emind’s optimal road to the cloud
  – Secure cloud architecture
  – Scalable & high-availability design
  – Customized system deployment
  – Orchestrating cloud and software
  – Cloud operation team
  – Monitoring and alerting
  – 24x7 SLA