OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Page 1: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Using Ceph with OpenNebula

John Spray, [email protected]

Page 2: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Agenda

● What is it?

● Architecture

● Integration with OpenNebula

● What's new?

Page 3: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


What is Ceph?

Page 4: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


What is Ceph?

● Highly available resilient data store

● Free Software (LGPL)

● 10 years since inception

● Flexible object, block and filesystem interfaces

● Especially popular in private clouds as a VM image service and an S3-compatible object storage service.

Page 5: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Interfaces to storage

● OBJECT STORAGE (RGW): native API, S3 & Swift, multi-tenant, Keystone, geo-replication

● BLOCK STORAGE (RBD): OpenStack, Linux kernel, iSCSI, clones, snapshots

● FILE SYSTEM (CephFS): POSIX, Linux kernel, CIFS/NFS, HDFS, distributed metadata

Page 6: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Ceph Architecture

Page 7: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Architectural Components

● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors (see the rados CLI sketch below)

● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

● RGW: a web services gateway for object storage, compatible with S3 and Swift

● RBD: a reliable, fully-distributed block device with cloud platform integration

● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
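As a quick, hedged illustration of the object layer, the rados CLI that ships with Ceph can store and fetch RADOS objects directly. The pool name "mypool" and the PG count are illustrative values, not anything prescribed by the slides:

ceph osd pool create mypool 64           # 64 placement groups, illustrative
echo "hello ceph" > hello.txt
rados -p mypool put greeting hello.txt   # store object "greeting"
rados -p mypool ls                       # list objects in the pool
rados -p mypool get greeting out.txt     # fetch it back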

Page 8: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Object Storage Daemons

[Diagram: each OSD daemon stores data via a local filesystem (btrfs, xfs or ext4) on its own disk; monitors (M) run alongside the OSDs]

Page 9: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

RADOS Components

● OSDs: 10s to 10,000s in a cluster. One per disk (or one per SSD, RAID group…). Serve stored objects to clients. Intelligently peer for replication & recovery.

● Monitors: Maintain cluster membership and state. Provide consensus for distributed decision-making. Small, odd number. These do not serve stored objects to clients.
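For orientation, a few standard status commands show the OSD and monitor membership described above; a minimal sketch (output will of course vary by cluster):

ceph -s          # overall cluster health and membership summary
ceph mon stat    # monitor quorum membership
ceph osd stat    # OSD counts: how many exist, are up, are in
ceph osd tree    # OSDs arranged by host/rack with CRUSH weights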

Page 10: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

RADOS Cluster

[Diagram: an application talking to a RADOS cluster of OSDs and monitors]

Page 11: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Where do objects live?

[Diagram: an application holding an object, facing a cluster of storage nodes and monitors: where should the object be stored?]

Page 12: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

A Metadata Server?

[Diagram: one option is a central metadata server: the application (1) looks up the object's location, then (2) accesses the storage node]

Page 13: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Calculated placement

[Diagram: the application computes placement itself, e.g. hashing object names into ranges (A-G, H-N, O-T, U-Z) assigned to storage nodes]

Page 14: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Even better: CRUSH

[Diagram: an object is hashed and mapped by CRUSH onto a set of OSDs in the RADOS cluster]

Page 15: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

CRUSH is a quick calculation

[Diagram: the client calculates the object's location directly via CRUSH, with no lookup step]

Page 16: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

CRUSH: Dynamic data placement

CRUSH: pseudo-random placement algorithm

● Fast calculation, no lookup

● Repeatable, deterministic

● Statistically uniform distribution

● Stable mapping: limited data migration on change

● Rule-based configuration: infrastructure topology aware, adjustable replication, weighting (example below)
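As a hedged sketch of how the rule-based, topology-aware configuration is expressed in practice (the rule and pool names here are examples, and the exact CLI varies between Ceph releases):

ceph osd crush rule create-simple example_rule default host   # replicate across hosts under the "default" root
ceph osd crush rule ls
ceph osd pool set mypool crush_ruleset 1   # ruleset id is illustrative; check `ceph osd crush rule dump`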

Page 17: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Architectural Components

[Architectural components diagram repeated from page 7, leading into RBD]

Page 18: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

RBD: Virtual disks in Ceph

● Storage of disk images in RADOS

● Decouples VMs from host

● Images are striped across the cluster (pool)

● Snapshots

● Copy-on-write clones (example below)

● Support in: mainline Linux kernel (2.6.39+), Qemu/KVM, OpenStack, CloudStack, OpenNebula, Proxmox
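A hedged sketch of the snapshot and copy-on-write clone workflow with the rbd CLI; the pool name "one" and the image names are illustrative:

rbd create one/base-image --size 10240        # 10 GB image (size is in MB)
rbd snap create one/base-image@golden
rbd snap protect one/base-image@golden        # clones require a protected snapshot
rbd clone one/base-image@golden one/vm-disk-0 # thin, copy-on-write clone
rbd ls one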

Page 19: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Storing virtual disks

[Diagram: a VM's hypervisor uses LIBRBD to store the virtual disk in the RADOS cluster]

Page 20: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Using Ceph with OpenNebula

Page 21: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Storage in OpenNebula deployments

OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)

Page 22: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


RBD and libvirt/qemu

● librbd (user space) client integration with libvirt/qemu

● Support for live migration, thin clones

● Get recent versions!

● Directly supported in OpenNebula since 4.0 with the Ceph Datastore (wraps `rbd` CLI)

More info online:

http://ceph.com/docs/master/rbd/libvirt/

http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
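For illustration, a hedged sketch of wiring this up, loosely following the Ceph datastore documentation linked above; every value (pool, monitor hosts, cephx user, secret UUID, bridge hosts) is a placeholder to adapt:

cat > ceph_ds.conf <<'EOF'
NAME        = ceph_datastore
DS_MAD      = ceph
TM_MAD      = ceph
DISK_TYPE   = RBD
POOL_NAME   = one
CEPH_HOST   = "mon1:6789 mon2:6789 mon3:6789"
CEPH_USER   = libvirt
CEPH_SECRET = "<uuid of the libvirt secret>"
BRIDGE_LIST = "hv1 hv2"
EOF
onedatastore create ceph_ds.conf

# cephx key for the libvirt/qemu client (capability syntax from the Ceph docs)
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow rwx pool=one'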

Page 23: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Other hypervisors

● OpenNebula is flexible, so can we also use Ceph with non-libvirt/qemu hypervisors?

● Kernel RBD: can present RBD images in /dev/ on the hypervisor host for software unaware of librbd (example after this list)

● Docker: can exploit RBD volumes with a local filesystem for use as data volumes – maybe CephFS in future...?

● For unsupported hypervisors, can adapt to Ceph using e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports carefully!)
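A minimal sketch of the kernel RBD route mentioned above; the pool/image and mount point are examples, and the rbd kernel module must be available on the host:

rbd map one/vm-disk-0      # appears as e.g. /dev/rbd0
rbd showmapped
mkfs.xfs /dev/rbd0         # or hand the raw device to software unaware of librbd
mount /dev/rbd0 /mnt/volume
# later: umount /mnt/volume && rbd unmap /dev/rbd0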

Page 24: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Choosing hardware

Testing/benchmarking/expert advice is needed, but there are general guidelines:

● Prefer many cheap nodes to few expensive nodes (10 is better than 3)

● Include small but fast SSDs for OSD journals (example after this list)

● Don't simply buy the biggest drives: consider the IOPS/capacity ratio

● Provision network and IO capacity sufficient for your workload plus recovery bandwidth from node failure.
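As one hedged example of the SSD-journal guideline, ceph-deploy of this era accepts a HOST:DATA[:JOURNAL] triplet; the hostnames and device names below are purely illustrative:

ceph-deploy osd create node1:sdb:/dev/sdd1   # data on sdb, journal on SSD partition sdd1
ceph-deploy osd create node1:sdc:/dev/sdd2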

Page 25: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


What's new?

Page 26: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Ceph releases

● Ceph 0.80 firefly (May 2014)

– Cache tiering & erasure coding

– Key/value OSD backends

– OSD primary affinity

● Ceph 0.87 giant (October 2014)

– RBD cache enabled by default

– Performance improvements

– Locally recoverable erasure codes

● Ceph x.xx hammer (2015)

Page 27: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Additional components

● Ceph FS – scale-out POSIX filesystem service, currently being stabilized

● Calamari – monitoring dashboard for Ceph

● ceph-deploy – easy SSH-based deployment tool

● Puppet, Chef modules

Page 28: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Get involved

Evaluate the latest releases:

http://ceph.com/resources/downloads/

Mailing list, IRC:

http://ceph.com/resources/mailing-list-irc/

Bugs:

http://tracker.ceph.com/projects/ceph/issues

Online developer summits:

https://wiki.ceph.com/Planning/CDS

Page 29: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Questions?

Page 30: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Page 31: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Spare slides

Page 32: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Page 33: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Ceph FS

Page 34: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


CephFS architecture

● Dynamically balanced scale-out metadata

● Inherit flexibility/scalability of RADOS for data

● POSIX compatibility

● Beyond POSIX: Subtree snapshots, recursive statistics

Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06). USENIX Association, 2006. http://ceph.com/papers/weil-ceph-osdi06.pdf

Page 35: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Components

● Client: kernel, fuse, libcephfs

● Server: MDS daemon

● Storage: RADOS cluster (mons & OSDs)

Page 36: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Components

[Diagram: a Linux host with the ceph.ko kernel client sends metadata operations to the Ceph server daemons (MDS and monitors) and data directly to the OSDs]

Page 37: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

From application to disk

[Diagram: application → client (kernel client, ceph-fuse or libcephfs) → client network protocol → ceph-mds and RADOS → disk]

Page 38: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Scaling out FS metadata

● Options for distributing metadata?

– By static subvolume

– By path hash

– By dynamic subtree

● Consider performance, ease of implementation

Page 39: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Dynamic subtree placement

Page 40: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Dynamic subtree placement

● Locality: get the dentries in a dir from one MDS

● Support read heavy workloads by replicating non-authoritative copies (cached with capabilities just like clients do)

● In practice work at directory fragment level in order to handle large dirs

Page 41: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Data placement

● Stripe file contents across RADOS objects

– Get full RADOS cluster bandwidth from clients

– Fairly tolerant of object losses: reads return zeros

● Control striping with layout vxattrs (example below)

– Layouts also select between multiple data pools

● Deletion is a special case: client deletions mark files 'stray'; RADOS delete ops are sent by the MDS
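A hedged sketch of the layout vxattrs mentioned above; the mount point, file names and the extra data pool "fs_data_ssd" are examples:

getfattr -n ceph.file.layout /mnt/ceph/somefile                     # show current striping
ceph mds add_data_pool fs_data_ssd                                  # make the extra pool usable (this era's command)
setfattr -n ceph.dir.layout.pool -v fs_data_ssd /mnt/ceph/fastdir   # new files under here go to that pool
setfattr -n ceph.file.layout.stripe_count -v 4 /mnt/ceph/newfile    # layouts can only be set on empty/new files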

Page 42: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Clients

● Two implementations:

– ceph-fuse/libcephfs

– kclient

● Interplay with VFS page cache; efficiency is harder with FUSE (extraneous stats etc.)

● Client performance matters for single-client workloads

● A slow client can hold up others if it's hogging metadata locks: include clients in troubleshooting

● Future: want more per-client perf stats and maybe per-client metadata QoS. Clients probably group into jobs or workloads.

● Future: may want to tag client I/O with a job ID (e.g. HPC workload, Samba client ID, container/VM ID)

Page 43: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Journaling and caching in MDS

● Metadata ops initially journaled to striped journal "file" in the metadata pool.

– I/O latency on metadata ops is sum of network latency and journal commit latency.

– Metadata remains pinned in in-memory cache until expired from journal.

Page 44: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Journaling and caching in MDS

● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.

● Control cache size with mds_cache_size (example below)

● Cache eviction relies on client cooperation

● MDS journal replay not only recovers data but also warms up cache. Use standby replay to keep that cache warm.
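A hedged example of inspecting and raising the cache size limit via the admin socket; "mds.a" is a placeholder daemon name and the commands are run on the MDS host:

ceph daemon mds.a config get mds_cache_size
ceph daemon mds.a config set mds_cache_size 200000   # number of inodes to keep cached

# or persistently in ceph.conf on the MDS host:
# [mds]
#     mds cache size = 200000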

Page 45: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Lookup by inode

● Sometimes we need the inode → path mapping:

– Hard links

– NFS handles

● Costly to store this: mitigate by piggybacking paths (backtraces) onto data objects

– Con: storing metadata to the data pool

– Con: extra IOs to set backtraces

– Pro: disaster recovery from the data pool

● Future: improve backtrace writing latency

Page 46: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

CephFS in practice

# Create an MDS, the data and metadata pools, then the filesystem, and mount it:

ceph-deploy mds create myserver

ceph osd pool create fs_data 64       # a pg_num value (64 here) must be supplied

ceph osd pool create fs_metadata 64

ceph fs new myfs fs_metadata fs_data

mount -t ceph x.x.x.x:6789:/ /mnt/ceph   # kernel client mount type is "ceph"; add -o name=...,secret=... when cephx is enabled

Page 47: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Managing CephFS clients

● New in giant: see hostnames of connected clients

● Client eviction is sometimes important:

– Skip the wait during the reconnect phase on MDS restart

– Allow others to access files locked by a crashed client

● Use OpTracker to inspect ongoing operations (example below)
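A hedged sketch of the client-management commands referred to above, run via the MDS admin socket; "mds.a" and the client id are placeholders:

ceph daemon mds.a session ls            # connected clients, with the new metadata (hostname, mount point)
ceph daemon mds.a dump_ops_in_flight    # OpTracker view of ongoing operations
ceph daemon mds.a session evict 4305    # drop a dead/misbehaving client by the id shown in session ls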

Page 48: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


CephFS tips

● Choose MDS servers with lots of RAM

● Investigate clients when diagnosing stuck/slow access

● Use recent Ceph and recent kernel

● Use a conservative configuration:

– Single active MDS, plus one standby

– Dedicated MDS server

– Kernel client

– No snapshots, no inline data

Page 49: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Towards a production-ready CephFS

● Focus on resilience:

1. Don't corrupt things

2. Stay up

3. Handle the corner cases

4. When something is wrong, tell me

5. Provide the tools to diagnose and fix problems

● Achieve this first within a conservative single-MDS configuration

Page 50: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Giant->Hammer timeframe

● Initial online fsck (a.k.a. forward scrub)

● Online diagnostics (`session ls`, MDS health alerts)

● Journal resilience & tools (cephfs-journal-tool)

● flock in the FUSE client

● Initial soft quota support

● General resilience: full OSDs, full metadata cache

Page 51: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

FSCK and repair

● Recover from damage:

– Loss of data objects (which files are damaged?)

– Loss of metadata objects (what subtree is damaged?)

● Continuous verification:

– Are recursive stats consistent?

– Does metadata on disk match cache?

– Does file size metadata match data on disk?

● Repair:

– Automatic where possible

– Manual tools to enable support

Page 52: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Client management

● Current eviction is not 100% safe against rogue clients

– Update the client protocol to wait for OSD blacklist (example below)

● Client metadata

– Initially domain name, mount point

– Extension to other identifiers?
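For context, a hedged example of the OSD blacklist referred to above; the client address is a made-up example of the addr shown by session ls:

ceph osd blacklist add 192.168.0.10:0/3418771818   # stop a half-dead client's in-flight writes
ceph osd blacklist ls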

Page 53: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Online diagnostics

● Bugs exposed relate to failures of one client to release resources for another client: "my filesystem is frozen". Introduce new health messages:

– "client xyz is failing to respond to cache pressure"

– "client xyz is ignoring capability release messages"

– Add client metadata so we can give domain names instead of IP addrs in messages

● Opaque behavior in the face of dead clients. Introduce `session ls`:

– Which clients does the MDS think are stale?

– Identify clients to evict with `session evict`

Page 54: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Journal resilience

● A bad journal prevents MDS recovery: "my MDS crashes on startup"

– Data loss

– Software bugs

● Updated on-disk format to make recovery from damage easier

● New tool: cephfs-journal-tool (example below)

– Inspect the journal, search/filter

– Chop out unwanted entries/regions
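A hedged sketch of a cephfs-journal-tool session (verbs as documented for the tool; check --help on your release, and keep the exported backup before any surgery):

cephfs-journal-tool journal inspect                  # report journal integrity
cephfs-journal-tool journal export backup.bin        # take a copy before modifying anything
cephfs-journal-tool event get list                   # search/filter journal events
cephfs-journal-tool event recover_dentries summary   # salvage metadata out of the journal
cephfs-journal-tool journal reset                    # last resort: discard the damaged journal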

Page 55: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Handling resource limits

● Write a test, see what breaks!

● Full MDS cache:

– Require some free memory to make progress

– Require client cooperation to unpin cache objects

– Anticipate tuning required for cache behaviour: what should we evict?

● Full OSD cluster:

– Require explicit handling to abort with -ENOSPC

● MDS → RADOS flow control:

– Contention between I/O to flush cache and I/O to journal

Page 56: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Test, QA, bug fixes

● The answer to "Is CephFS production ready?"

● teuthology test framework:

– Long-running/thrashing tests

– Third-party FS correctness tests

– Python functional tests

● We dogfood CephFS internally

– Various kclient fixes discovered

– Motivation for new health monitoring metrics

● Third party testing is extremely valuable

Page 57: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

What's next?

● You tell us!

● Recent survey highlighted:

– FSCK hardening

– Multi-MDS hardening

– Quota support

● Which use cases will the community test with?

– General purpose

– Backup

– Hadoop

Page 58: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray


Reporting bugs

● Does the most recent development release or kernel fix your issue?

● What is your configuration? MDS config, Ceph version, client version, kclient or fuse

● What is your workload?

● Can you reproduce with debug logging enabled? (example below)

http://ceph.com/resources/mailing-list-irc/

http://tracker.ceph.com/projects/ceph/issues

http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
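A hedged example of turning up debug logging while reproducing an MDS issue; "mds.a" is a placeholder daemon name, and level 20 is very verbose (see the log-and-debug page above):

ceph daemon mds.a config set debug_mds 20
ceph daemon mds.a config set debug_ms 1

# ceph.conf equivalent:
# [mds]
#     debug mds = 20
#     debug ms = 1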

Page 59: OpenNebula Conf 2014 | Using Ceph to provide scalable storage for OpenNebula by John Spray

Future

● Ceph Developer Summit:

– When: 8 October

– Where: online

● Post-Hammer work:

– Recent survey highlighted multi-MDS, quota support

– Testing with clustered Samba/NFS?