
Page 1: Ceph Day LA - RBD: A deep dive

RBD: A Deep Dive

Josh Durgin

Page 2: Ceph Day LA - RBD: A deep dive

ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: A reliable, fully-distributed block device with cloud platform integration

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

[Diagram: APP, HOST/VM, and CLIENT consumers sit above these interfaces]

Page 3: Ceph Day LA - RBD: A deep dive

ARCHITECTURAL COMPONENTS

(Same architectural overview slide as Page 2.)

Page 4: Ceph Day LA - RBD: A deep dive

STORING VIRTUAL DISKS

[Diagram: a VM's disk is served by the hypervisor through librbd, which talks to the RADOS cluster and its monitors (M)]

Page 5: Ceph Day LA - RBD: A deep dive

KERNEL MODULE

[Diagram: a Linux host uses the krbd kernel module to talk to the RADOS cluster and its monitors (M)]

Page 6: Ceph Day LA - RBD: A deep dive

RBD FEATURES

● Stripe images across entire cluster (pool)

● Read-only snapshots

● Copy-on-write clones

● Broad integration

– QEMU, libvirt

– Linux kernel

– iSCSI (STGT, LIO)

– OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox, oVirt

● Incremental backup (relative to snapshots)
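
As a concrete example of the library API behind these features, here is a minimal sketch using the python-rados and python-rbd bindings; the pool name 'rbd' and image name 'test' are placeholders, and error handling is omitted.

    # Minimal sketch: create and use an RBD image via the Python bindings.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                 # pool that holds the image

    try:
        rbd.RBD().create(ioctx, 'test', 4 * 1024**3)  # 4 GiB image
        with rbd.Image(ioctx, 'test') as image:
            image.write(b'hello', 0)                  # lands in rbd_data.* objects
            image.create_snap('snap1')                # read-only snapshot
    finally:
        ioctx.close()
        cluster.shutdown()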

Page 7: Ceph Day LA - RBD: A deep dive

RBD

Page 8: Ceph Day LA - RBD: A deep dive

RADOS

● Flat namespace within a pool

● Rich object API

– Bytes, attributes, key/value data

– Partial overwrite of existing data

– Single-object compound atomic operations

– RADOS classes (stored procedures)

● Strong consistency (CP system)

● Infrastructure aware, dynamic topology

● Hash-based placement (CRUSH)

● Direct client to server data path
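
Most of these RADOS capabilities are visible directly from librados. A minimal sketch with python-rados, touching bytes, an attribute, and key/value (omap) data on one object; the pool and object names are placeholders.

    # Bytes, xattrs, and omap data on a single object, plus a compound
    # write operation applied atomically by the OSD.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    ioctx.write('demo-object', b'partial', 0)            # partial overwrite at offset 0
    ioctx.set_xattr('demo-object', 'owner', b'josh')     # attribute

    with rados.WriteOpCtx() as op:                       # single-object compound op
        ioctx.set_omap(op, ('color',), (b'blue',))       # key/value data
        ioctx.operate_write_op(op, 'demo-object')        # applied atomically

    ioctx.close()
    cluster.shutdown()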

Page 9: Ceph Day LA - RBD: A deep dive

RADOS COMPONENTS

OSDs:

● 10s to 1000s in a cluster

● One per disk (or one per SSD, RAID group…)

● Serve stored objects to clients

● Intelligently peer for replication & recovery

Monitors:

● Maintain cluster membership and state

● Provide consensus for distributed decision-making

● Small, odd number (e.g., 5)

● Not part of data path

Page 10: Ceph Day LA - RBD: A deep dive

ARCHITECTURAL COMPONENTS

(Same architectural overview slide as Page 2.)

Page 11: Ceph Day LA - RBD: A deep dive

Metadata

● rbd_directory

– Maps image name to id, and vice versa

● rbd_children

– Lists clones in a pool, indexed by parent

● rbd_id.$image_name

– The internal id, locatable using only the user-specified image name

● rbd_header.$image_id

– Per-image metadata
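
These are ordinary RADOS objects, so they can be inspected directly. A minimal sketch with python-rados, assuming a pool named 'rbd' containing an image named 'test':

    # List RBD's metadata objects and read an image's internal id.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # rbd_directory, rbd_children, rbd_id.*, rbd_header.* sit alongside the data
    for obj in ioctx.list_objects():
        if obj.key.startswith(('rbd_directory', 'rbd_children', 'rbd_id.', 'rbd_header.')):
            print(obj.key)

    # rbd_id.$image_name holds the internal id (a length-prefixed string)
    print(ioctx.read('rbd_id.test'))

    ioctx.close()
    cluster.shutdown()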

Page 12: Ceph Day LA - RBD: A deep dive

Data

● rbd_data.* objects

– Named based on offset in image

– Non-existent to start with

– Plain data in each object

– Snapshots handled by rados

– Often sparse

Page 13: Ceph Day LA - RBD: A deep dive

Striping

● Objects are uniformly sized

– Default is simple 4MB divisions of device

● Randomly distributed among OSDs by CRUSH

● Parallel work is spread across many spindles

● No single set of servers responsible for image

● Small objects lower space variance
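
The mapping from image offset to data object is simple arithmetic. A sketch of the default layout (4 MB objects, no custom striping); the image id used here is made up, and CRUSH then places each resulting object pseudo-randomly across the OSDs.

    # Default data-object naming: byte offset N of the image lives in object
    # N // object_size, named rbd_data.<image id>.<object number, 16 hex digits>.
    OBJECT_SIZE = 4 * 1024 * 1024          # default 4 MB object size

    def data_object_for(image_id, offset):
        objno = offset // OBJECT_SIZE
        return 'rbd_data.%s.%016x' % (image_id, objno)

    print(data_object_for('10074b0dc51', 0))            # ...0000000000000000
    print(data_object_for('10074b0dc51', 6 * 1024**2))  # ...0000000000000001 (second 4 MB chunk)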

Page 14: Ceph Day LA - RBD: A deep dive

I/O

[Diagram: VM I/O flows through the hypervisor's librbd and its cache to the RADOS cluster and its monitors (M)]

Page 15: Ceph Day LA - RBD: A deep dive

Snapshots

● Object granularity

● Snapshot context [list of snap ids, latest snap id]

– Stored in rbd image header (self-managed)

– Sent with every write

● Snapshot ids managed by monitors

● Deleted asynchronously

● RADOS keeps per-object overwrite stats, so diffs are easy
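
The bindings expose the same snapshot machinery. A minimal sketch with python-rbd; pool 'rbd' and image 'test' are placeholders, and the snapshot name mirrors the diagrams that follow.

    # librbd updates the snap context in the image header; RADOS does the
    # per-object copy-on-write when later writes arrive.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'test') as image:
        image.write(b'before', 0)
        image.create_snap('snap8')           # allocates a new snap id
        image.write(b'after!', 0)            # sent with the updated snap context
        for snap in image.list_snaps():
            print(snap['id'], snap['name'], snap['size'])

    # open the image read-only at the snapshot
    with rbd.Image(ioctx, 'test', snapshot='snap8') as snap_image:
        print(snap_image.read(0, 6))         # b'before'

    ioctx.close()
    cluster.shutdown()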

Page 16: Ceph Day LA - RBD: A deep dive

Snapshots

[Diagram: librbd issues a write with snap context ([], 7)]

Page 17: Ceph Day LA - RBD: A deep dive

Snapshots

[Diagram: librbd requests creation of snap 8]

Page 18: Ceph Day LA - RBD: A deep dive

Snapshots

[Diagram: librbd sets its snap context to ([8], 8)]

Page 19: Ceph Day LA - RBD: A deep dive

Snapshots

[Diagram: librbd writes with snap context ([8], 8)]

Page 20: Ceph Day LA - RBD: A deep dive

Snapshots

[Diagram: a write with snap context ([8], 8); the OSD preserves the snap 8 version of the object]

Page 21: Ceph Day LA - RBD: A deep dive

WATCH/NOTIFY

● establish stateful 'watch' on an object

– client interest persistently registered with object

– client keeps connection to OSD open

● send 'notify' messages to all watchers

– notify message (and payload) sent to all watchers

– notifier receives a completion (with reply payloads) once all watchers ack or time out

● strictly time-bounded liveness check on watch

– so a notifier never falsely believes a disconnected watcher received the message

● example: distributed cache w/ cache invalidations
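
librbd uses this for exactly that cache-invalidation pattern on the image header. A rough sketch with python-rados, assuming the watch/notify wrappers available in newer bindings (Ioctx.watch / Ioctx.notify); exact signatures vary by release, and the header object name is a made-up example.

    # One client watches the header; another notifies watchers that it changed.
    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    def on_notify(notify_id, notifier_id, watch_id, data):
        # a real client would re-read the image header here
        print('header changed, payload:', data)

    watch = ioctx.watch('rbd_header.10074b0dc51', on_notify)   # register interest

    # a writer (e.g. a snapshot create) tells watchers their cached metadata is
    # stale; this blocks until every watcher acks or times out
    ioctx.notify('rbd_header.10074b0dc51', 'snap added')

    time.sleep(1)
    watch.close()
    ioctx.close()
    cluster.shutdown()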

Page 22: Ceph Day LA - RBD: A deep dive

Watch/Notify

[Diagram: hypervisors running librbd and a /usr/bin/rbd client (also using librbd) each hold a watch on the image header object]

Page 23: Ceph Day LA - RBD: A deep dive

Watch/Notify

[Diagram: the /usr/bin/rbd client adds a snapshot to the image]

Page 24: Ceph Day LA - RBD: A deep dive

Watch/Notify

[Diagram: a notify is sent to all watchers of the header object]

Page 25: Ceph Day LA - RBD: A deep dive

Watch/Notify

[Diagram: watchers acknowledge with notify ack; the notifier receives notify complete]

Page 26: Ceph Day LA - RBD: A deep dive

Watch/Notify

[Diagram: on notification, the hypervisors' librbd instances re-read the image metadata]

Page 27: Ceph Day LA - RBD: A deep dive

Clones

● Copy-on-write (or read, in Hammer)

● Object granularity

● Independent settings

– striping, feature bits, object size can differ

– can be in different pools

● Clones are based on protected snapshots

– 'protected' means they can't be deleted

● Can be flattened

– Copy all data from parent

– Remove parent relationship
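
The same flow through python-rbd, as a minimal sketch; pool 'rbd', parent image 'golden', and clone name 'vm-1' are placeholders.

    # Protected snapshot -> clone -> (optionally) flatten.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'golden') as parent:
        parent.create_snap('base')
        parent.protect_snap('base')          # clones require a protected snapshot

    # the clone could use a different ioctx (pool) and different features
    rbd.RBD().clone(ioctx, 'golden', 'base', ioctx, 'vm-1',
                    features=rbd.RBD_FEATURE_LAYERING)

    with rbd.Image(ioctx, 'vm-1') as clone:
        clone.flatten()                      # copy parent data, drop the relationship

    ioctx.close()
    cluster.shutdown()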

Page 28: Ceph Day LA - RBD: A deep dive

Clones - read

[Diagram: a read from the clone; objects the clone does not have yet are read from the parent]

Page 29: Ceph Day LA - RBD: A deep dive

Clones - write

[Diagram: a write is issued to the clone for an object that so far exists only in the parent]

Page 30: Ceph Day LA - RBD: A deep dive

Clones - write

[Diagram: the clone's object does not exist yet, so librbd reads the object from the parent]

Page 31: Ceph Day LA - RBD: A deep dive

Clones - write

[Diagram: the parent's object is copied up to the clone, then the write is applied]

Page 32: Ceph Day LA - RBD: A deep dive

OpenStack

Page 33: Ceph Day LA - RBD: A deep dive

Virtualized Setup

● Secret key stored by libvirt

● XML defining the VM is fed in; it includes:

– Monitor addresses

– Client name

– QEMU block device cache setting

● Writeback recommended

– Bus on which to attach block device

● virtio-blk/virtio-scsi recommended

● IDE OK for legacy systems

– Discard options (ide/scsi only), I/O throttling if desired

Page 34: Ceph Day LA - RBD: A deep dive

Virtualized Setup

[Diagram: libvirt reads vm.xml and runs "qemu -enable-kvm ..."; QEMU uses librbd with its cache to provide the VM's disk]

Page 35: Ceph Day LA - RBD: A deep dive

Kernel RBD

● rbd map sets everything up

● /etc/ceph/rbdmap is like /etc/fstab

● Udev adds handy symlinks:

– /dev/rbd/$pool/$image[@snap]

● striping v2 and later feature bits not supported currently

● Can be used to back LIO, NFS, SMB, etc.

● No specialized cache; the filesystem on top uses the page cache

Page 36: Ceph Day LA - RBD: A deep dive

ROADMAP

Page 37: Ceph Day LA - RBD: A deep dive

What's new in Hammer

● Copy-on-read

● Librbd cache enabled in safe mode by default

● Readahead during boot

● LTTng tracepoints

● Allocation hints

● Cache hints

● Exclusive locking (off by default for now)

● Object map (off by default for now)
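
Since exclusive locking and the object map are opt-in in Hammer, they have to be requested per image. A sketch with python-rbd, using the feature constants exported by the bindings; the pool and image names are placeholders.

    # Request exclusive-lock and object-map at image creation time.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    features = (rbd.RBD_FEATURE_LAYERING |
                rbd.RBD_FEATURE_EXCLUSIVE_LOCK |   # required by the object map
                rbd.RBD_FEATURE_OBJECT_MAP)

    rbd.RBD().create(ioctx, 'fast-image', 10 * 1024**3,
                     old_format=False, features=features)

    ioctx.close()
    cluster.shutdown()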

Page 38: Ceph Day LA - RBD: A deep dive

Future Work

● Kernel client catch up

● RBD mirroring

● Consistency groups

● QoS for rados (provisioned IOPS at rbd level)

● Active/Active iSCSI

● Performance improvements

– Newstore OSD backend

– Improved cache tiering

– Finer-grained caching

– Many more

Page 39: Ceph Day LA - RBD: A deep dive

THANK YOU!

Josh Durgin

[email protected]