
Accelerating Enterprise OpenStack

When Disaster Strikes the Cloud: Who, What, When, Where and How to Recover

Michael Factor, IBM Research - Haifa, factor@il.ibm.com
Ronen Kat, IBM Research - Haifa, ronenkat@il.ibm.com
Sean Cohen, Red Hat, scohen@redhat.com

Talk Outline

- What is disaster recovery?
- Concepts and basics
- Protecting data and applications from disasters
- OpenStack Cinder toolbox for disaster recovery
- Applications are more than just data
- The road ahead: Kilo and beyond

What is Disaster Recovery?

According to Wikipedia, Disaster Recovery (DR) is "the process, policies and procedures . . . for recovery . . . of technology infrastructure . . . after a natural or human-induced disaster."

Servers, storage, network, software, configuration

Surviving a disaster requires geographic dispersion

Recovery Point Objective and Recovery Time Objective

- Recovery Point Objective (RPO): how far back in time a disaster takes one
- Recovery Time Objective (RTO): how long until operational after a disaster

[Figure: timeline centered on the disaster at time 0. RPO extends backward and RTO extends forward, each on a scale of seconds, minutes, hours, days, weeks. Labels on the figure: replication, backup, restore, active site, hot site.]
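As a worked illustration of the two objectives, the sketch below computes the data-loss window and the downtime for a single incident and checks them against targets. The timestamps, the target values, and the names `Incident` and `meets_objectives` are all made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_recoverable_copy: datetime  # newest backup or replica that survived
    disaster: datetime               # when the primary site was lost
    service_restored: datetime       # when the application was usable again

    @property
    def data_loss_window(self) -> timedelta:
        # What the RPO bounds: how far back in time the disaster takes us.
        return self.disaster - self.last_recoverable_copy

    @property
    def downtime(self) -> timedelta:
        # What the RTO bounds: how long until operational again.
        return self.service_restored - self.disaster


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    return incident.data_loss_window <= rpo and incident.downtime <= rto


# Hypothetical incident: nightly backup at 02:00, disaster at 14:30,
# service restored from that backup at 20:30 the same day.
incident = Incident(
    last_recoverable_copy=datetime(2014, 11, 3, 2, 0),
    disaster=datetime(2014, 11, 3, 14, 30),
    service_restored=datetime(2014, 11, 3, 20, 30),
)

print(incident.data_loss_window)  # 12:30:00
print(incident.downtime)          # 6:00:00
print(meets_objectives(incident, rpo=timedelta(hours=24), rto=timedelta(hours=8)))   # True
print(meets_objectives(incident, rpo=timedelta(minutes=5), rto=timedelta(hours=8)))  # False
```

A nightly backup satisfies a 24-hour RPO here, but a 5-minute RPO would need something closer to continuous replication.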

Data and Metadata Consistency

Data consistency
- If a modified datum is available, all data it depends upon is also available

Metadata consistency
- Configuration updates are seen in the same order relative to one another and to data updates

[Figure: application VM with DB and LOG volumes; remote site]
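The data-consistency rule can be checked mechanically if each write records which earlier writes it depends on. The sketch below assumes a write-ahead-log pattern in which a database page update depends on its log record. The write names and the `DEPENDS_ON` table are hypothetical.

```python
from typing import Dict, Iterable, Set

# Each write is identified by a name and lists the writes it depends on.
# Hypothetical example: a database page update depends on its log record.
DEPENDS_ON: Dict[str, Set[str]] = {
    "log:txn1": set(),
    "db:page7": {"log:txn1"},
    "log:txn2": {"log:txn1"},
    "db:page9": {"log:txn2"},
}


def is_consistent(available: Iterable[str]) -> bool:
    """True if every available write has all of its dependencies available."""
    present = set(available)
    return all(DEPENDS_ON[w] <= present for w in present)


primary_order = ["log:txn1", "db:page7", "log:txn2", "db:page9"]

# A replica that applies writes in the primary's order is consistent
# no matter where the disaster cuts the stream.
assert all(is_consistent(primary_order[:n]) for n in range(len(primary_order) + 1))

# Replicating the DB and LOG volumes independently can lose that property:
# here the DB volume is ahead of the LOG volume at the moment of the disaster.
db_volume_replica = ["db:page7", "db:page9"]
log_volume_replica = ["log:txn1"]
print(is_consistent(db_volume_replica + log_volume_replica))  # False
```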

OpenStack Cloud Metadata

- Virtual networks between the cloud VMs
- External network access
- Attached volumes
- Volume types
- Virtual machine flavors
- SSH keys for VM access
- Virtual machine images
- Identities of users

Protecting Data and Applications from Disasters

Data Protection: Cinder Backup and Restore

- Cinder backup
  - Back up a volume to backup storage (backup-create)

[Figure: primary cloud; backup-create]

- Can Cinder restore on a secondary cloud?
  - Problem: Cinder on the secondary cloud is not aware of the backup

[Figure: Swift; backup-restore; primary cloud and secondary cloud]

- Solution: "electronic tape shipping"
  - backup-export
  - backup-import
- Supported in Cinder since Icehouse

[Figure: backup-export on the primary cloud; backup reference; backup-import on the secondary cloud]

- After backup-import, Cinder can restore on the secondary cloud
  - backup-restore

[Figure: Swift; backup-restore; primary cloud and secondary cloud]
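The export/import flow can be modeled in a few lines. This is an in-memory toy, not the Cinder API: `ObjectStore`, `ToyCinder` and their methods are hypothetical names chosen to mirror the backup-create, backup-export, backup-import and backup-restore steps.

```python
class ObjectStore:
    """Stands in for backup storage (e.g. Swift) reachable from both clouds."""

    def __init__(self):
        self.objects = {}


class ToyCinder:
    """Toy model of one cloud's backup catalog; not the real Cinder API."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.catalog = {}  # backup_id -> object key; local to this cloud

    def backup_create(self, backup_id, volume_data):
        key = f"backups/{backup_id}"
        self.store.objects[key] = volume_data
        self.catalog[backup_id] = key

    def backup_export(self, backup_id):
        # The "tape label": a small reference, not the data itself.
        return {"backup_id": backup_id, "key": self.catalog[backup_id]}

    def backup_import(self, reference):
        self.catalog[reference["backup_id"]] = reference["key"]

    def backup_restore(self, backup_id):
        if backup_id not in self.catalog:
            raise LookupError(f"{self.name}: unknown backup {backup_id}")
        return self.store.objects[self.catalog[backup_id]]


swift = ObjectStore()
primary = ToyCinder("primary", swift)
secondary = ToyCinder("secondary", swift)

primary.backup_create("bk-1", b"volume contents")

# The data is already in shared storage, but the secondary has no record of it.
try:
    secondary.backup_restore("bk-1")
except LookupError as err:
    print(err)  # secondary: unknown backup bk-1

# Ship the reference, import it, and the restore succeeds.
secondary.backup_import(primary.backup_export("bk-1"))
print(secondary.backup_restore("bk-1"))  # b'volume contents'
```

The model shows that only the small reference needs to travel between clouds, because the backup data already sits in storage that both sites can reach.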

Data Protection: Cinder Volume replication

- Cinder has initial support for volume replication in the Juno release
- Cinder back-ends can "advertise" support for replication
- A volume created with the replication extra spec is allocated on a back-end that supports replication, and is replicated
- Supporting back-ends:
  - IBM Storwize, more expected in Kilo

[Figure: Cinder back-end]

Volume-type extra specs: "capabilities:replication <is> True"

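To illustrate how a volume type's extra spec can steer placement, the sketch below matches an extra spec of the form `capabilities:replication` = `<is> True` against the capabilities each back-end advertises. It is a simplified stand-in for scheduler capability matching, not Cinder's own code. The function names and the back-end names are hypothetical, and only the boolean `<is>` operator is handled.

```python
def spec_matches(required: str, reported) -> bool:
    """Evaluate a '<is> True' / '<is> False' style boolean extra spec."""
    op, _, value = required.partition(" ")
    if op != "<is>":
        raise ValueError(f"unsupported operator in {required!r}")
    return bool(reported) == (value.strip().lower() == "true")


def eligible_backends(extra_specs: dict, backends: dict) -> list:
    """Return back-ends whose advertised capabilities satisfy every spec."""
    matches = []
    for name, capabilities in backends.items():
        ok = True
        for key, required in extra_specs.items():
            scope, _, capability = key.partition(":")
            if scope != "capabilities":
                continue  # this sketch only handles capability-scoped keys
            if not spec_matches(required, capabilities.get(capability, False)):
                ok = False
                break
        if ok:
            matches.append(name)
    return matches


# Hypothetical back-ends and the capabilities they advertise.
backends = {
    "storwize-1": {"replication": True},
    "lvm-1": {"replication": False},
    "nfs-1": {},  # advertises nothing about replication
}
volume_type = {"capabilities:replication": "<is> True"}

print(eligible_backends(volume_type, backends))  # ['storwize-1']
```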
Data Protection: Cinder Volume replication

- A secondary volume can become primary when promoted
  - replication-promote
- Replication can be reversed following a replication-promote
  - replication-reenable

[Figure: Cinder back-end]

Consistency Groups

- New in Juno
- Support for volume grouping for consistency
- Grouping of volumes is based on the volume type
- Supports
  - Consistency group snapshots
- Needs to be extended to support
  - Cinder backup
  - Cinder volume replication

[Figure: DB and LOG volumes]

Protecting Applications from Disasters

Servers, storage, network, software, configuration

Disaster Recovery Orchestration

OpenStack Tools

- Applications are defined in OpenStack by
  - Heat Orchestration Templates
- However
  - Not all applications are template based
  - Deployments (including configuration) change over time
  - Some definitions are cloud specific, e.g., networks, types
  - Heat templates and stacks don't stay consistent
- Tools can create a template from a deployment, e.g., Flame, ReHeat
- But the template will only fit the current cloud

OpenStack Tools and Beyond

- Demo: A technology preview for disaster recovery with IBM Cloud Manager

THE ROAD AHEAD

Ceph Multi-Site & Disaster Recovery (Block) example

- Export snapshots to geographically dispersed data centers
  - Provides disaster recovery
- Export incremental snapshots
  - Minimize network bandwidth by only sending changes
- The Kilo cycle focus is to extend the multi-site and disaster recovery options
  - RBD mirroring
  - Cinder volume replication
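The bandwidth saving from "only sending changes" can be illustrated with a generic block-level diff between two snapshots. This is not the Ceph/RBD export format. The block size, the function names, and the sample data are assumptions made to keep the example readable.

```python
BLOCK_SIZE = 4  # bytes; unrealistically small so the example is easy to read


def blocks(image: bytes):
    return [image[i:i + BLOCK_SIZE] for i in range(0, len(image), BLOCK_SIZE)]


def incremental_diff(previous: bytes, current: bytes) -> dict:
    """Map block index -> new contents for blocks that changed or were added."""
    old, new = blocks(previous), blocks(current)
    return {
        i: block
        for i, block in enumerate(new)
        if i >= len(old) or old[i] != block
    }


def apply_diff(previous: bytes, diff: dict, new_length: int) -> bytes:
    """Rebuild the current snapshot at the remote site from base + diff."""
    result = blocks(previous)
    for i, block in sorted(diff.items()):
        if i < len(result):
            result[i] = block
        else:
            result.append(block)
    return b"".join(result)[:new_length]


snap1 = b"AAAABBBBCCCCDDDD"
snap2 = b"AAAAbbbbCCCCDDDDEEEE"

diff = incremental_diff(snap1, snap2)
print(diff)  # {1: b'bbbb', 4: b'EEEE'}

# The remote site holds snap1 and receives only the two changed blocks.
assert apply_diff(snap1, diff, len(snap2)) == snap2
```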

Ceph Multi-Site & Disaster Recovery (Object) example

- Zones and region support
  - Deploy topologies similar to S3 and others with a global namespace
- Data center synchronization
  - Back up full or partial sets of data between regions
- Read affinity
  - Serve local copies of data to local

Disaster Recovery as a Service Catalog

- Pluggable Disaster Recovery policies
- Replication targets can specify different RPO/RTO levels, offered based on the supported back-end capabilities
- Disaster Recovery policies
  - Active - Cold standby
  - Active - Hot standby
  - Active - Active (requires application awareness and transaction integrity)
  - Backup to Cloud / From the Cloud
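One way to picture such a catalog is as a table of policies, each promising an RPO/RTO level, from which the cheapest policy that meets a workload's objectives is chosen. The sketch below is purely illustrative: the RPO, RTO and cost figures are invented, and `Policy`, `CATALOG` and `cheapest_policy` are hypothetical names rather than part of any OpenStack API.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    name: str
    rpo: timedelta      # worst-case data loss the policy promises
    rto: timedelta      # worst-case time to recover
    relative_cost: int  # unitless, for ranking only


# Illustrative numbers only; real values depend on the back-end and the sites.
CATALOG: List[Policy] = [
    Policy("Backup to Cloud", timedelta(hours=24), timedelta(hours=24), 1),
    Policy("Active - Cold standby", timedelta(hours=1), timedelta(hours=4), 2),
    Policy("Active - Hot standby", timedelta(minutes=1), timedelta(minutes=15), 4),
]


def cheapest_policy(rpo: timedelta, rto: timedelta) -> Optional[Policy]:
    """Pick the lowest-cost policy that satisfies both objectives."""
    candidates = [p for p in CATALOG if p.rpo <= rpo and p.rto <= rto]
    return min(candidates, key=lambda p: p.relative_cost, default=None)


print(cheapest_policy(timedelta(hours=2), timedelta(hours=8)).name)  # Active - Cold standby
print(cheapest_policy(timedelta(seconds=1), timedelta(minutes=1)))   # None
```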

Extending Heat Orchestration for Disaster Recovery

- Heat can be used to automate
  - Add support for Cinder replication
- Need to make consistency groups work across OpenStack projects
  - Nova, Cinder, Trove, ...
- Stack snapshot backup / rollback
- Enable customization of workload components at the recovery site
  - Networks, VM configuration changes, guest agent, etc.

The Road Toward Application Consistency

First phase: File system consistency

- Integrate into OpenStack to allow consistent snapshots and backups
  - Nova needs to request the QEMU Guest Agent to freeze the file systems (and applications if fsfreeze-hook is installed) during the snapshot
- Patches have been proposed for Nova and Cinder, targeting the Kilo release

Source: Hitachi

The Road Toward Application Consistency

Next phase: Consistency at the application level

- Application-aware on Windows with VSS support in qemu-ga
  - Application notification via Microsoft Volume Shadow Copy Service (VSS)
- Application-aware on Linux using qemu-ga hooks
  - Application-consistent snapshots can be created with scripts interacting with the QEMU guest agent
  - The scripts can notify applications to flush their data
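A hook script of this kind could look roughly like the sketch below. It assumes the guest agent invokes the hook with a single action argument, `freeze` before the file systems are frozen and `thaw` afterwards. `ToyApplication` and `run_hook` are hypothetical stand-ins, and a real hook would talk to the actual application (for example, by asking a database to flush and hold writes).

```python
class ToyApplication:
    """Stand-in for a real application such as a database."""

    def __init__(self):
        self.dirty_buffers = ["row 1", "row 2"]  # data not yet written out
        self.on_disk = []
        self.accepting_writes = True

    def flush_and_pause(self):
        self.on_disk.extend(self.dirty_buffers)
        self.dirty_buffers.clear()
        self.accepting_writes = False

    def resume(self):
        self.accepting_writes = True


def run_hook(action: str, app: ToyApplication) -> None:
    """Dispatch on the action passed to the hook (assumed: 'freeze' or 'thaw')."""
    if action == "freeze":
        app.flush_and_pause()
    elif action == "thaw":
        app.resume()
    else:
        raise ValueError(f"unexpected action: {action!r}")


app = ToyApplication()

run_hook("freeze", app)
snapshot = list(app.on_disk)  # the snapshot is taken while the application is quiesced
run_hook("thaw", app)

print(snapshot)              # ['row 1', 'row 2']
print(app.accepting_writes)  # True
```

Without the freeze step, the snapshot in this model would be empty, because the two rows would still be sitting in the application's buffers.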

Disaster Recovery at Scale

- The holy grail of site evacuation is an automatic, planned migration of workloads and data from one cloud-scale data center to another
- New OpenStack HA approaches help recovery from infrastructure failures:
  - Leveraging Pacemaker to provide automated detection of a failed hypervisor and recovery of the VMs that were running there
  - Evacuating an instance to a scheduled host was added in Juno
  - A simple tagging API for instances in Nova was accepted for the Kilo release
    - Can support a new automatic-recovery tag


OpenStack Documentation needs to catch up…

- Join the OpenStack Disaster Recovery Guide
- We have a basic OpenStack High Availability Guide
  - http://docs.openstack.org/high-availability-guide/content/
- A very outdated "Recover cloud after disaster" section in the Admin guide
  - http://docs.openstack.org/admin-guide-cloud/content/section_nova-disaster-recovery-process.html

THANK YOU

Michael Factor, IBM Research - Haifa, factor@il.ibm.com
Ronen Kat, IBM Research - Haifa, ronenkat@il.ibm.com
Sean Cohen, Red Hat, scohen@redhat.com
