
Page 1: Disaster Recovery and Ceph Block Storage: Introducing Multi-Site Mirroring

Jason Dillaman
RBD Project Technical Lead
Vault 2017

Page 2: WHAT IS CEPH ALL ABOUT

▪ Software-defined distributed storage
▪ All components scale horizontally
▪ No single point of failure
▪ Self-manage whenever possible
▪ Hardware agnostic, commodity hardware
▪ Object, block, and file in a single cluster
▪ Open source (LGPL)

Page 3: CEPH COMPONENTS

Object, block, and file services layered over a common RADOS cluster:

▪ RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
▪ LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
▪ RGW: a web services gateway for object storage, compatible with S3 and Swift
▪ RBD: a reliable, fully-distributed block device with cloud platform integration
▪ CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management

Page 4: RADOS BLOCK DEVICE

▪ Block device abstraction
▪ Striped over fixed-size objects
▪ Highlights (CLI example below):
  ▪ Broad integration
  ▪ Thinly provisioned
  ▪ Snapshots
  ▪ Copy-on-write clones
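
To make the highlights concrete, here is a minimal illustrative rbd CLI session; the pool, image, snapshot, and clone names (mypool, base-image, golden, vm-clone) are placeholders rather than anything from the talk:

  $ rbd create mypool/base-image --size 10240           # thin-provisioned 10 GiB image
  $ rbd snap create mypool/base-image@golden            # point-in-time snapshot
  $ rbd snap protect mypool/base-image@golden           # a snapshot must be protected before cloning
  $ rbd clone mypool/base-image@golden mypool/vm-clone  # copy-on-write clone of the snapshot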

Page 5: RADOS BLOCK DEVICE

[Diagram: a user-space client (LIBRBD on LIBRADOS) and a Linux host (KRBD on LIBCEPH) both send image updates to a RADOS cluster (M = monitor nodes).]

Page 6: MIRRORING MOTIVATION

▪ Massively scalable and fault-tolerant design
▪ What about data center failures?
  ▪ Data is the “special sauce”
  ▪ Failure to plan is planning to fail
▪ Snapshot-based incremental backups
▪ Desire online, continuous backups
  ▪ à la RBD Mirroring

Page 7: MIRRORING DESIGN PRINCIPLES

▪ Replication needs to be asynchronous
  ▪ IO shouldn't be blocked by a slow WAN
  ▪ Support transient connectivity issues
▪ Replication needs to be crash consistent
  ▪ Respect write barriers
▪ Easy management
  ▪ Expect that failure can happen at any time

Page 8: MIRRORING DESIGN FOUNDATION

▪ Journal-based approach to log all modifications
▪ Supports access by multiple clients
▪ Client-side operation
▪ Event logs are appended into journal objects
▪ All image modifications are delayed until the event is safe
▪ Journal events are committed once the image modification is safe
▪ Provides an ordered view of all updates to an image

Page 9: JOURNAL DESIGN

▪ Typical IO path with journaling (in-memory cache enabled):
  a. Create an event to describe the update
  b. Asynchronously append the event to a journal object
  c. Update the in-memory image cache
  d. Block cache writeback for the affected extent
  e. Complete the IO to the client
  f. Unblock writeback once the event is safe

Page 10: JOURNAL DESIGN

▪ IO path with journaling, without the cache:
  a. Create an event to describe the update
  b. Asynchronously append the event to a journal object
  c. Asynchronously update the image once the event is safe
  d. Complete the IO to the client once the update is safe

Page 11: JOURNAL DESIGN

[Diagram: a user-space client (LIBRBD on LIBRADOS) writes and reads journal events and image updates to and from the RADOS cluster (M = monitor nodes).]

Page 12: MIRRORING OVERVIEW

▪ Mirroring peers are configured per pool
▪ Enabling mirroring is a per-image property
▪ Requires that the journaling image feature is enabled
▪ An image is either primary (R/W) or non-primary (R/O)
▪ Replication is handled by the rbd-mirror daemon (inspection commands below)
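
A quick, hedged way to inspect this state from the rbd CLI; the pool and image names (image-pool, vm-disk-1) are placeholders:

  $ rbd mirror pool info image-pool                  # mirroring mode and configured peers
  $ rbd info image-pool/vm-disk-1                    # the features line should include "journaling"
  $ rbd mirror image status image-pool/vm-disk-1     # per-image mirroring state and description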

Page 13: RBD-MIRROR DAEMON

▪ Requires access to both the remote and local clusters (deployment sketch below)
▪ Responsible for image (re)sync
▪ Replays the journal to achieve consistency:
  ▪ Pulls journal events from the remote cluster
  ▪ Applies the events to the local cluster
  ▪ Commits the journal events
  ▪ Trims the journal
▪ Transparently handles failover/failback
▪ Two-way replication between two sites
▪ One-way replication between N sites
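
A hedged sketch of running the daemon and checking its progress, assuming the ceph-rbd-mirror systemd unit shipped with your Ceph packages and a client.admin keyring; the pool name image-pool is a placeholder:

  $ systemctl enable --now ceph-rbd-mirror@admin     # one active daemon per cluster in Jewel/Kraken
  $ rbd mirror pool status --verbose image-pool      # per-image replication state, e.g. "up+replaying"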

Page 14: RBD-MIRROR DAEMON

[Diagram: a client using LIBRBD writes journal events and image updates to the cluster at one site; the RBD-MIRROR daemon at the other site pulls the journal events from the remote cluster via LIBRBD and applies the image updates to its local cluster.]

Page 15: MIRRORING SETUP

▪ Deploy the rbd-mirror daemon on each cluster
  ▪ Jewel/Kraken only support a single active daemon per cluster
▪ Provide a uniquely named “ceph.conf” for each remote cluster
▪ Create pools with the same name on each cluster
▪ Enable pool mirroring (rbd mirror pool enable)
▪ Specify the peer cluster via the rbd CLI (rbd mirror pool peer add)
▪ Enable the image journaling feature (rbd feature enable journaling)
▪ Enable image mirroring (rbd mirror image enable); a worked example follows below
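
Putting those steps together, a hedged end-to-end sketch; the cluster names (local, remote), pool name (image-pool), and image name (vm-disk-1) are placeholders:

  # on each cluster, with /etc/ceph/local.conf and /etc/ceph/remote.conf in place
  $ rbd --cluster local mirror pool enable image-pool image        # per-image mirroring mode
  $ rbd --cluster local mirror pool peer add image-pool client.admin@remote
  $ rbd --cluster remote mirror pool enable image-pool image
  $ rbd --cluster remote mirror pool peer add image-pool client.admin@local

  # on the primary cluster, per image
  $ rbd --cluster local feature enable image-pool/vm-disk-1 journaling
  $ rbd --cluster local mirror image enable image-pool/vm-disk-1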

Page 16: SITE FAILOVER

▪ Per-image failover / failback (sequence sketched below)
▪ Coordinated demotion / promotion
  ▪ (rbd mirror image demote)
  ▪ (rbd mirror image promote)
▪ Uncoordinated promotion + resync
  ▪ (rbd mirror image promote --force)
▪ Resync after a force-promotion / split-brain
  ▪ (rbd mirror image resync)
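
To illustrate the two paths, a hedged command sequence using the same placeholder names (clusters local and remote, image image-pool/vm-disk-1):

  # coordinated failover: demote the current primary, then promote the peer
  $ rbd --cluster local mirror image demote image-pool/vm-disk-1
  $ rbd --cluster remote mirror image promote image-pool/vm-disk-1

  # uncoordinated failover while the primary site is down
  $ rbd --cluster remote mirror image promote --force image-pool/vm-disk-1
  # when the old primary returns, demote it and resync it from the new primary
  $ rbd --cluster local mirror image demote image-pool/vm-disk-1
  $ rbd --cluster local mirror image resync image-pool/vm-disk-1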

Page 17: CAVEATS

▪ Write IOs have a worst-case 2x performance hit
  ▪ Journal event append
  ▪ Image object write
▪ The in-memory cache can mask the hit if the working set fits (config sketch below)
▪ Only supported by librbd-based clients
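
The in-memory cache referenced above is the librbd writeback cache; a hedged ceph.conf sketch, with illustrative (not recommended) sizes:

  [client]
      rbd cache = true                   # enable the librbd in-memory (writeback) cache
      rbd cache size = 67108864          # 64 MiB cache, illustrative value
      rbd cache max dirty = 50331648     # dirty bytes allowed before writeback; 0 forces write-through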

Page 18: MITIGATION

▪ Use a small SSD/NVMe-backed pool for journals (combined ceph.conf example below)
  ▪ ‘rbd journal pool = <fast pool name>’
▪ Batch multiple events into a single journal append
  ▪ ‘rbd journal object flush age = <seconds>’
▪ Increase journal data width to match queue depth
  ▪ ‘rbd journal splay width = <number of objects>’
▪ Future work: potentially parallelize journal append + image write between write barriers
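
For example, the options above could be combined in ceph.conf on the librbd clients; the pool name and the numeric values are illustrative assumptions, not recommendations from the talk:

  [client]
      rbd journal pool = journal-ssd            # keep journal objects on a fast SSD/NVMe pool
      rbd journal object flush age = 0.5        # batch events for up to 0.5 s per journal append
      rbd journal splay width = 8               # spread in-flight appends across 8 journal objects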

Page 19: FUTURE FEATURES

▪ Active/Active rbd-mirror daemons
▪ Deferred replication and deletion
▪ “Deep Scrub” of replicated images
▪ Smarter image resynchronization
▪ Improved health status reporting
▪ Improved pool promotion process

Page 20: BE PREPARED

▪ Design for failure
▪ Ceph now provides additional tools for full-scale disaster recovery
▪ RBD workloads can seamlessly relocate between geographic sites
▪ Feedback is welcome

Page 21: Questions?

Page 22: THANK YOU!

Jason Dillaman
RBD Project Tech Lead
[email protected]