AWS Cloud Disaster Recovery Plan Checklist - Are You Ready?
DESCRIPTION
Building your DR plan in the AWS cloud can be tricky compared to on-premises methodologies. Make sure you take the following into consideration when designing your AWS DR plan:
• Amazon Web Services required for DR
• High Availability AWS architecture
• Known AWS disaster types
• Impact of 3rd-party services
Get an in-depth presentation covering DR on the AWS cloud.
TRANSCRIPT
Solving the problem of
downtime in the cloud
AWS Cloud Disaster Recovery Plan Checklist
Are You Ready?
About CloudEndure
Founded: 2012
Offers Disaster Recovery as a Service for cloud-based applications
Using continuous replication of your entire application stack
Source: Forrester
Some Of Our Customers
Agenda
DR 101 – Definitions and Terminology
Why AWS for DR?
AWS Global Infrastructure
4 Types of Disaster
3 Takeaways
Q&A
Disaster Recovery in 30 Words
Disaster recovery (DR) is the set of processes, policies and
procedures for preparing for the
recovery or continuation of technology
infrastructure that is vital to an organization
after a natural or human-induced crisis.
DR Key Terminology
RPO – Recovery Point Objective – The maximum tolerable period in
which data might be lost.
RTO – Recovery Time Objective – The duration of time and a
service level within which a business process must be restored
after a disaster (or disruption) in order to avoid unacceptable
consequences.
Data replication – sharing information so as to ensure consistency
between redundant resources.
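The two objectives can be checked against the timeline of a single incident. The following is a minimal sketch, not a method from the presentation: the helper names (`data_loss_window`, `meets_rpo`, `meets_rto`), the timestamps, and the 15-minute and 30-minute targets are all hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated: datetime, failure: datetime) -> timedelta:
    """Anything written after the last replicated point is lost."""
    return failure - last_replicated


def meets_rpo(last_replicated: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the amount of lost data stays within the RPO."""
    return data_loss_window(last_replicated, failure) <= rpo


def meets_rto(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if the service was restored within the RTO."""
    return restored - failure <= rto


# Hypothetical incident timeline
last_replicated = datetime(2014, 6, 1, 11, 55)
failure = datetime(2014, 6, 1, 12, 0)
restored = datetime(2014, 6, 1, 12, 45)

rpo_ok = meets_rpo(last_replicated, failure, rpo=timedelta(minutes=15))
rto_ok = meets_rto(failure, restored, rto=timedelta(minutes=30))
print("RPO met:", rpo_ok)  # 5 minutes of data lost, within 15 minutes
print("RTO met:", rto_ok)  # 45 minutes of downtime, beyond 30 minutes
```

In this made-up incident the RPO is met but the RTO is not, which shows why the two objectives have to be tracked separately.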
DR – What it’s not
Unlike backup, which is mostly about
preventing data loss, DR is about service
availability: low RPO and RTO.
DR complements other high-availability
activities, but while those deal with
disaster prevention, DR is for the times
when prevention has failed.
Why DR?
54% of Cloud IT Managers experienced
an outage in the past 3 months
Top challenges in meeting availability
goals: Insufficient IT resources, Budget
limitations, Software Bugs
79% report a service availability goal
of “Three Nines” (99.9%)
Source: 2014 Cloud Disaster Recovery Survey
Available for download in the “Resources” tab of the webinar
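To put the “Three Nines” goal in concrete terms, an availability percentage can be converted into a yearly downtime budget. This is a small worked calculation rather than anything taken from the survey; the function name is an assumption, and the year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year that still satisfy the availability goal."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

For 99.9% this comes to roughly 8.76 hours a year, so one long outage can use up the whole budget.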
Why AWS for DR
Flexible
Define different
recovery objectives for
different components
and change them on the
fly. You can grow and
shrink your disaster site
whenever necessary
(even automatically).
Cheap
Pay for hourly usage of
resources. Only create your
disaster site when it’s
needed. Don’t pay for two
running sites all the time
Easy
DR and HA made easier.
No need to build your
DR solution from
scratch. AWS already
has many of the building
blocks built in:
Auto Scaling, snapshots,
CloudFormation…
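The “Cheap” argument is simple arithmetic on hourly billing. The sketch below compares a disaster site that runs all the time with one that is created only when needed. The hourly rate, instance count, and usage hours are hypothetical, and the calculation deliberately ignores storage and data-transfer charges, which an on-demand site would still incur for replicated data.

```python
HOURS_PER_MONTH = 730  # average month, an approximation


def always_on_cost(hourly_rate: float, instances: int) -> float:
    """Monthly compute cost of a second site that runs all the time."""
    return hourly_rate * instances * HOURS_PER_MONTH


def on_demand_cost(hourly_rate: float, instances: int, hours_used: float) -> float:
    """Monthly compute cost of a site that exists only during drills or disasters."""
    return hourly_rate * instances * hours_used


rate = 0.10  # hypothetical price per instance-hour, not a real AWS price
standby = always_on_cost(rate, instances=10)
drill = on_demand_cost(rate, instances=10, hours_used=8)
print(f"Always-on site: ${standby:.2f} per month")
print(f"On-demand site: ${drill:.2f} per month")
```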
AWS Global Infrastructure
[Figure: map showing AWS Regions and Availability Zones]
Regions
8 publicly available regions.
Spread all over the world.
Completely independent. Different teams. Different infrastructure.
Availability Zones (AZs)
Each region contains one or more availability zones.
Physically separated, but in the same geographical location.
Share teams and software infrastructure.
Dynamic Resource Allocation
Pay for resources on an hourly basis.
Create and destroy resources quickly on demand using the AWS console,
CLI or API.
Automation is built into several services (such as Auto Scaling). APIs let
you add further automation layers.
Types of downtime
Single-AZ disaster
Whole-region disaster
Single-service disaster
Single-resource disaster
Disaster Type 1 - Single-resource disaster
What is it?
A single resource (instance, EBS volume, ELB…)
stops functioning.
Frequency
Very high. For example, instances are
sometimes terminated by AWS or just
stop working without warning.
How to prepare?
Make sure that no single resource is a
point of failure. Use clusters for
stateless instances (Auto Scaling and
AMIs can help).
Configure RAID for volumes. Use
services that are managed by AWS, such
as RDS, to store your state and data.
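One way to act on the advice that no single resource should be a point of failure is to audit an inventory of the system. This is a minimal sketch under stated assumptions: the inventory is a hand-written dictionary with made-up component names and resource IDs, and a component counts as safe once it is backed by at least two resources.

```python
from typing import Dict, List


def single_points_of_failure(inventory: Dict[str, List[str]]) -> List[str]:
    """Return the components that are backed by fewer than two resources."""
    return sorted(name for name, resources in inventory.items() if len(resources) < 2)


# Hypothetical inventory: component name -> IDs of the resources serving it
inventory = {
    "web": ["i-web-1", "i-web-2", "i-web-3"],
    "worker": ["i-worker-1"],
    "cache": ["i-cache-1", "i-cache-2"],
}

spofs = single_points_of_failure(inventory)
print("Single points of failure:", spofs)
```

Here only `worker` is flagged, because it runs on a single instance.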
Disaster Type 2 - Single-AZ disaster
What is it?
A whole AZ goes down, but all the
other AZs in the region still function.
Frequency
More than 10 times a year (may be a
different AZ every time).
How to prepare?
Build your system so that it’s spread
across multiple AZs and can survive
the failure of any single AZ.
Connect subnets in different AZs to
your ELB and turn on Multi-AZ for
RDS.
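Whether a system can survive the loss of any single AZ can be checked by removing each AZ in turn and comparing the remaining capacity with what the application needs. The sketch below assumes capacity is measured as a simple instance count; the function name, the placement, and the required minimum of three instances are hypothetical.

```python
from typing import Dict, List


def az_failures_not_survived(instances_per_az: Dict[str, int], required: int) -> List[str]:
    """Return the AZs whose loss would leave fewer than `required` instances running."""
    total = sum(instances_per_az.values())
    return sorted(
        az for az, count in instances_per_az.items() if total - count < required
    )


# Hypothetical placements of six instances across three AZs
skewed = {"us-east-1a": 4, "us-east-1b": 1, "us-east-1c": 1}
balanced = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 2}

print("Skewed placement at risk:", az_failures_not_survived(skewed, required=3))
print("Balanced placement at risk:", az_failures_not_survived(balanced, required=3))
```

With the same six instances, the skewed placement cannot lose `us-east-1a`, while the balanced placement tolerates the loss of any one AZ.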
Disaster Type 3 - Single-service disaster
What is it?
A specific service goes down across an
entire region. Almost always contained
within a single region.
Frequency
Several times a year (a different service
every time).
How to prepare?
Resist the temptation to use AWS
services for everything. Choose your
services carefully. Be ready to recreate
your system in a different region, where
the service is still working (see next slide).
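Choosing services carefully starts with knowing which parts of the application go down with each service. The following sketch maps components to the managed services they rely on and lists what a single-service outage would affect. The component names and the dependency map are invented for illustration.

```python
from typing import Dict, List, Set


def components_affected_by(service: str, dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the components that rely on the given managed service."""
    return sorted(name for name, used in dependencies.items() if service in used)


# Hypothetical application: component -> AWS services it relies on
dependencies = {
    "checkout": {"RDS", "SQS"},
    "search": {"RDS"},
    "thumbnails": {"SQS", "S3"},
}

for service in ("RDS", "SQS", "S3"):
    print(service, "outage affects:", components_affected_by(service, dependencies))
```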
Disaster Type 4 - Whole-region disaster
What is it?
An entire region goes down, taking all the
applications running in it down too.
Frequency
Several times a year (a different region
every time); see the CloudEndure blog post
comparing the uptime of all AWS regions.
How to prepare?
Implement a cross-region DR methodology.
Take snapshots of your instances and copy
them to a different region. Use
CloudFormation to define your application
stack. Copy AMIs to a different region. Use
cross-region read replicas for RDS. Use
continuous data replication.
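Copying snapshots and AMIs to another region only helps if the copies exist and are recent. The sketch below checks a record of copies for resources that have no copy outside the primary region, or whose newest remote copy is older than an allowed age. It does not call any AWS API; the record format, resource names, and the one-hour limit are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_or_missing_copies(
    copies: Dict[str, Dict[str, datetime]],
    primary_region: str,
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return resources with no sufficiently recent copy outside the primary region."""
    problems = []
    for resource, by_region in copies.items():
        remote = [ts for region, ts in by_region.items() if region != primary_region]
        if not remote or now - max(remote) > max_age:
            problems.append(resource)
    return sorted(problems)


# Hypothetical record: resource -> {region: time of the latest copy there}
now = datetime(2014, 6, 1, 12, 0)
copies = {
    "db-snapshot": {"us-east-1": datetime(2014, 6, 1, 11, 50),
                    "us-west-2": datetime(2014, 6, 1, 11, 30)},
    "app-ami": {"us-east-1": datetime(2014, 6, 1, 10, 0)},
    "uploads-volume": {"us-east-1": datetime(2014, 6, 1, 11, 55),
                       "eu-west-1": datetime(2014, 6, 1, 9, 0)},
}

at_risk = stale_or_missing_copies(copies, "us-east-1", now, timedelta(hours=1))
print("Not recoverable in another region:", at_risk)
```

In this example `app-ami` has no remote copy at all and `uploads-volume` has one that is three hours old, so both are reported.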
Beyond AWS
Not all outages are caused by your cloud provider. Downtime in the
third-party services you use can take your application down too. For
example: DNS, CDN, third-party login services…
Pick your third-party services carefully.
Check the historical stability of the
services you are considering. Don’t rely on
third-party services more than you need to.
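The cost of each extra dependency can be estimated by multiplying availabilities. This is a worked calculation under two stated assumptions: failures are independent, and the application is down whenever any one of its dependencies is down. The 99.9% figures are placeholders, not measured values for any real provider.

```python
from math import prod
from typing import Iterable


def composite_availability(availabilities: Iterable[float]) -> float:
    """Availability of a system that needs every listed dependency to be up."""
    return prod(availabilities)


# Placeholder figures: each piece is assumed to be available 99.9% of the time
services = {
    "own application": 0.999,
    "DNS": 0.999,
    "CDN": 0.999,
    "login service": 0.999,
}

overall = composite_availability(services.values())
print(f"Overall availability: {overall:.4%}")
```

Under these assumptions, four pieces at 99.9% each give a combined availability below 99.9%, which is why each added service deserves scrutiny.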
3 Takeaways
1. Design DR into your
system: the earlier
you implement DR the
easier it is to recover.
It’s too late to think
about DR after disaster
strikes.
2. Take advantage of
what AWS offers. AWS
provides many building
blocks to help you
build a DR solution for
your application, so you
don’t need to do
everything from
scratch.
3. Understand the impact
of relying on services:
each service you use can
cause downtime.
Check the stability of
the services you’re
using and design your
system to stay up even
if some of the services
it depends on are
down.