Disaster Recovery on Demand on the Cloud

Protect your app from Outages Nati Shalom CTO GigaSpaces @natishalom May 2013

Upload: nati-shalom

Post on 14-Jun-2015


Category:

Technology



DESCRIPTION

How to avoid Cloud Outages and leverage cloud economics to keep the cost down through automation of disaster recovery processes and on-demand deployment of the backup nodes.

TRANSCRIPT

Page 1: Disaster Recovery on Demand on the Cloud

Protect your app from Outages

Nati Shalom, CTO, GigaSpaces
@natishalom

May 2013

Page 2: Disaster Recovery on Demand on the Cloud

AGENDA

• AWS and outages
• Outage impact
• Disaster Recovery – it’s all about redundancy!
• Cloudify as a solution for redundancy
• Demo with Cloudify on EC2

® Copyright 2013 GigaSpaces Ltd. All Rights Reserved

Page 3: Disaster Recovery on Demand on the Cloud

AWS USAGE

Managing Big Data on the Cloud

• AWS – around 0.5M servers
• Facebook – less than 0.1M servers
• Google – around 1M servers

Page 4: Disaster Recovery on Demand on the Cloud


THE OUTAGE PROBLEM

Page 5: Disaster Recovery on Demand on the Cloud

® Copyright 2012 GigaSpaces Ltd. All Rights Reserved

OUTAGE – APRIL 21, 2011

Page 6: Disaster Recovery on Demand on the Cloud


OUTAGE - JUNE 29, 2012

Page 7: Disaster Recovery on Demand on the Cloud


OUTAGE - OCTOBER 22, 2012

Page 8: Disaster Recovery on Demand on the Cloud


OUTAGE - CHRISTMAS EVE 2012

Page 9: Disaster Recovery on Demand on the Cloud


NOT ONLY AMAZON

28 December 2012 - Some owners of Microsoft's Xbox 360 gaming console were unable to access some of their cloud-based storage files.

26 July 2012 - Service for Microsoft’s Windows Azure Europe region went down for more than two hours.

29 February 2012 - Azure service was impacted for 8-10 hours for users of data centers in Dublin (Ireland), Chicago, and San Antonio.

Page 10: Disaster Recovery on Demand on the Cloud

THAT’S WHAT YOU EXPECT?

Downtime per year by availability level:

• 99% – 3.65 days
• 99.9% – 8.76 hours
• 99.99% – 53 minutes
• 99.999% – 5.26 minutes
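The downtime figures above follow from multiplying the unavailable fraction by the length of a year. A minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative and does not come from the slides:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, allowed by an availability level."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


for level in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(level)
    print(f"{level}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the list drops from days to hours to minutes.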

Page 11: Disaster Recovery on Demand on the Cloud


OUTAGE IMPACT – DESIGN FOR FAILURES

Outage could cost…

• $89K per hour for Amadeus
• $225K per hour for PayPal!
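To put these hourly figures in context, they can be multiplied by a yearly downtime budget. A rough sketch, assuming the 8.76 hours per year that corresponds to 99.9% availability; combining the two numbers is an illustration, not a claim made in the slides:

```python
HOURLY_OUTAGE_COST_USD = {"Amadeus": 89_000, "PayPal": 225_000}  # from the slide
ASSUMED_DOWNTIME_HOURS = 8.76  # assumption: 99.9% availability over one year


def yearly_outage_cost(hourly_cost_usd: float, downtime_hours: float) -> float:
    """Estimate yearly outage cost as hourly cost times hours of downtime."""
    return hourly_cost_usd * downtime_hours


for company, hourly in HOURLY_OUTAGE_COST_USD.items():
    cost = yearly_outage_cost(hourly, ASSUMED_DOWNTIME_HOURS)
    print(f"{company}: about ${cost:,.0f} per year")
```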

Page 12: Disaster Recovery on Demand on the Cloud


DISASTER RECOVERY

Page 13: Disaster Recovery on Demand on the Cloud

MULTI CLOUD

Page 14: Disaster Recovery on Demand on the Cloud

PREPARE FOR DISASTER RECOVERY

• Dedicated expert for DR architecture
• Define target recovery time & point
• Assume every tier can fail
• Use monitoring and alerts
• Document your operational processes

Page 15: Disaster Recovery on Demand on the Cloud

CHAOS MONKEY

Page 16: Disaster Recovery on Demand on the Cloud


It’s all about REDUNDANCY!

Page 17: Disaster Recovery on Demand on the Cloud

CLONE YOUR ENVIRONMENT

Page 18: Disaster Recovery on Demand on the Cloud

CLONE YOUR DATA

• RDS Read Replica
• More to come…

Page 19: Disaster Recovery on Demand on the Cloud

Automating your DR Processes

Page 20: Disaster Recovery on Demand on the Cloud

Leverage Existing Automation Frameworks

• Configuration Centric
• App Centric (PaaS)

Page 21: Disaster Recovery on Demand on the Cloud

CLONE YOUR ENV - HOW DOES IT WORK?

Page 22: Disaster Recovery on Demand on the Cloud

BUILT IN SUPPORT FOR MANAGING DATA IN THE CLOUD

• Real Time: Storm, Elastic Caching (XAP)
• Relational DB Clusters: MySQL, PostgreSQL
• NoSQL Clusters: MongoDB, Cassandra, Couchbase, ElasticSearch
• Hadoop: Hadoop (Hive, Pig, ..), ZooKeeper

Page 23: Disaster Recovery on Demand on the Cloud


Real Life Scenario

Page 24: Disaster Recovery on Demand on the Cloud

CASE STUDY: VERIFI

Technology-based concrete process control and information service

Deployments across North America, Latin America, Asia, and Europe for nearly a decade

Part of W.R. Grace & Co., a $6.3B company.

The problem: On-Demand HA/DR over multiple cloud regions.

High Availability

Data Replication

Disaster Recovery

Page 25: Disaster Recovery on Demand on the Cloud

ELASTIC ON-DEMAND DISASTER RECOVERY

Problem: Can we eliminate the RTO vs. cost trade-off in the cloud?

Solution (Elastic DR):

• A hybrid between Hot and Warm DR
• Switch to the active site in a matter of seconds through cloud-agnostic lifecycle automation recipes

Page 26: Disaster Recovery on Demand on the Cloud

VERIFI (INITIAL) ARCHITECTURE

[Diagram: Availability region (US-West: Oregon). Labels: Internet, EC2 Instance (x4), mod_cluster, JBoss, PostgreSQL, Cassandra, Data Volume (x2), 4 recipes.]

Page 27: Disaster Recovery on Demand on the Cloud

ELASTIC DR ON-DEMAND: FAILOVER SCENARIO

[Diagram: Cloud #1 in Region US-West (Oregon) runs the app servers and PostgreSQL. Cloud #2 in Region US-East (Virginia) runs PostgreSQL and sends a liveness poll to cloud #1. After the failure, cloud #1 is marked failed, cloud #2 runs the app servers and PostgreSQL, and Cloud #3 in Region US-West (California) runs PostgreSQL and polls cloud #2.]

0. Upon initial deployment, the primary deployment of the application “verifi” will be bootstrapped onto cloud #1. Another, slightly modified application recipe, “verifi_dr”, will be bootstrapped as cloud #2, polling cloud #1 for failure and acting as a PostgreSQL DB slave.

1. Region failure occurs.

2. Turn the Postgres slave into master and start app server instances.*

3. Bootstrap another cloud in a different region using the same application recipe used to bootstrap cloud #2 above.*

* Initially, all those actions may be done manually by Verifi’s Ops team (e.g. via recipe commands in the CLI).
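The liveness poll and the failover that follows it can be sketched as a small watchdog loop. This is a hypothetical illustration and not Cloudify's API or Verifi's recipe: the function name, the callables, the missed-poll threshold, and the poll interval are all assumptions. It covers only the polling and the promotion step; bootstrapping a replacement cloud is left out.

```python
import time
from typing import Callable


def watch_and_fail_over(
    primary_is_alive: Callable[[], bool],   # hypothetical liveness check
    promote_replica: Callable[[], None],    # hypothetical: slave becomes master
    start_app_servers: Callable[[], None],  # hypothetical: start the app tier
    max_missed_polls: int = 3,              # assumed threshold
    poll_interval_seconds: float = 5.0,     # assumed interval
) -> None:
    """Poll the primary site; fail over after several consecutive missed polls."""
    missed = 0
    while missed < max_missed_polls:
        if primary_is_alive():
            missed = 0  # a healthy response resets the failure counter
        else:
            missed += 1
        if missed < max_missed_polls:
            time.sleep(poll_interval_seconds)
    # Promote the database replica first, then start the app tier on the DR site.
    promote_replica()
    start_app_servers()


if __name__ == "__main__":
    responses = iter([True, False, False, False])  # simulated poll results
    watch_and_fail_over(
        primary_is_alive=lambda: next(responses),
        promote_replica=lambda: print("promoting PostgreSQL replica"),
        start_app_servers=lambda: print("starting app servers"),
        poll_interval_seconds=0.0,
    )
```

Requiring several consecutive missed polls is a design choice in this sketch: it avoids failing over on a single dropped request, at the cost of a slightly longer detection time.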

Page 28: Disaster Recovery on Demand on the Cloud

FAILOVER SCENARIO

[Diagram: same failover diagram and steps 0 to 3 as on the previous slide.]

Page 29: Disaster Recovery on Demand on the Cloud

NEXT STEPS

Availability scope:

• Across clouds (AWS, Rackspace, Azure, etc.): 1 application + overrides, several cloud drivers
• Across AWS regions: 1 application + overrides, 1 cloud driver
• Across AWS zones: 1 application + overrides, 1 cloud driver

Supported by Verifi phase #1

Page 30: Disaster Recovery on Demand on the Cloud

ELASTIC ON-DEMAND DR: COSTS

Site                      Cost
Main Site (US-West)       $82,068
Warm DR Site (US-East)    $12,625
Hot DR Site               $82,068

Main Site: 1 load balancer, 2 JBoss instances, 1 PostgreSQL master, 3 Cassandra

DR Site: 1 PostgreSQL slave; all other instances start on demand upon failover
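The saving of the warm, on-demand DR site over a hot DR site follows directly from the cost figures on this slide. A small sketch of that arithmetic; the function name is illustrative, and the billing period is not stated in the slides:

```python
HOT_DR_COST_USD = 82_068   # from the slide: hot DR site, same as the main site
WARM_DR_COST_USD = 12_625  # from the slide: warm DR site (US-East)


def dr_savings(hot_cost_usd: float, warm_cost_usd: float) -> tuple[float, float]:
    """Return (absolute saving, saving as a percentage of the hot DR cost)."""
    saving = hot_cost_usd - warm_cost_usd
    return saving, 100.0 * saving / hot_cost_usd


saving, percent = dr_savings(HOT_DR_COST_USD, WARM_DR_COST_USD)
print(f"Warm DR saves ${saving:,.0f}, about {percent:.1f}% of the hot DR cost")
```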

Page 31: Disaster Recovery on Demand on the Cloud

ELASTIC DR: WARM DR COST, CLOUD PORTABILITY

[Figure: warm DR site cost with 4 recipes: $12k. Same recipe, other cost figures shown: $14k, $6k, $5k, $9k.]

Page 32: Disaster Recovery on Demand on the Cloud

ELASTIC DR: HOT DR COST

[Figure: hot DR site cost with 4 recipes: $82k. Same recipe, other cost figures shown: $79k, $115k, $68k, $91k.]

Page 33: Disaster Recovery on Demand on the Cloud

SUMMARY

Disaster Recovery – it’s all about redundancy!

• Cloning your environment – app stack
• Cloning your data – DB replication

Automation makes DR processes simple

• Use recipes to clone your app stack consistently
• Use replication to clone your data

Leverage cloud economics to reduce the cost

• DR on Demand
• Multi Cloud

Page 34: Disaster Recovery on Demand on the Cloud

QUESTIONS & ANSWERS

Thank You!

@natishalom