OpenNebulaConf 2014 - Using Ceph to Provide Scalable Storage for OpenNebula - John Spray
TRANSCRIPT
Agenda
● What is it?
● Architecture
● Integration with OpenNebula
● What's new?
What is Ceph?
● Highly available resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service.
Interfaces to storage
● OBJECT STORAGE (RGW): native API, S3 & Swift compatibility, multi-tenant, Keystone integration, geo-replication
● BLOCK STORAGE (RBD): snapshots, clones, OpenStack integration, Linux kernel driver, iSCSI
● FILE SYSTEM (CephFS): POSIX, distributed metadata, Linux kernel client, CIFS/NFS, HDFS
Ceph Architecture
Architectural Components
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
[Diagram: apps use RGW or LIBRADOS, hosts/VMs use RBD, clients use CEPHFS; all are layered on RADOS]
Object Storage Daemons
[Diagram: many OSDs, each running on top of a filesystem (btrfs, xfs or ext4) on its own disk, plus monitors (M)]
RADOS Components
OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors (M):
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
RADOS Cluster
[Diagram: an application talks directly to a RADOS cluster made up of many OSDs and a small, odd number of monitors (M)]
Where do objects live?
[Diagram: an application holding an object faces the question: which node in the RADOS cluster should store it?]
A Metadata Server?
[Diagram: one option: (1) the application asks a central metadata server where the object lives, then (2) reads or writes it on the node indicated]
Calculated placement
[Diagram: the application calculates placement from the object name alone, e.g. static hash buckets A-G, H-N, O-T, U-Z mapped to nodes; object 'F' lands in the A-G bucket]
Even better: CRUSH
[Diagram: an object is mapped by CRUSH onto OSDs spread across the RADOS cluster]
CRUSH is a quick calculation
[Diagram: the same CRUSH mapping is a quick calculation every client performs itself; no lookup table, no central server]
CRUSH: Dynamic data placement
CRUSH: pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping: limited data migration on change
● Rule-based configuration: infrastructure topology aware, adjustable replication, weighting
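You can watch this calculation from the command line: `ceph osd map` runs the cluster's placement computation for a named object without doing any I/O. A minimal sketch (pool and object names are made up):

ceph osd map rbd my-object    # prints the placement group and the set of OSDs CRUSH selects for 'my-object'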
Architectural Components (recap)
[The same component stack as before — RADOS, LIBRADOS, RGW, RBD, CEPHFS — shown again before focusing on RBD]
RBD: Virtual disks in Ceph
RADOS BLOCK DEVICE:
● Storage of disk images in RADOS
● Decouples VMs from host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones (workflow sketch after this list)
● Support in: mainline Linux kernel (2.6.39+), Qemu/KVM, OpenStack, CloudStack, OpenNebula, Proxmox
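As a hedged illustration of the snapshot and clone features (pool and image names are made up; the commands are the standard `rbd` CLI verbs):

rbd create --size 10240 rbd/base-image        # 10 GiB image in pool 'rbd'
rbd snap create rbd/base-image@gold           # point-in-time snapshot
rbd snap protect rbd/base-image@gold          # clones require a protected snapshot
rbd clone rbd/base-image@gold rbd/vm-disk-1   # thin, copy-on-write clone for a new VM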
Storing virtual disks
[Diagram: a VM on a hypervisor does its disk I/O through LIBRBD, which talks to the image stored in the RADOS cluster]
Using Ceph with OpenNebula
Storage in OpenNebula deployments
OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)
RBD and libvirt/qemu
● librbd (user space) client integration with libvirt/qemu
● Support for live migration, thin clones
● Get recent versions!
● Directly supported in OpenNebula since 4.0 with the Ceph Datastore, which wraps the `rbd` CLI (example below)
More info online:
http://ceph.com/docs/master/rbd/libvirt/
http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
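For illustration only, a minimal Ceph datastore definition in the spirit of the ceph_ds guide linked above (attribute values are placeholders, and exact attributes vary by OpenNebula version). Put the following in ceph_ds.conf and register it with onedatastore:

NAME        = ceph_datastore
DS_MAD      = ceph                    # datastore driver
TM_MAD      = ceph                    # transfer manager driver
DISK_TYPE   = RBD
POOL_NAME   = one                     # RADOS pool holding the images
CEPH_HOST   = "mon1:6789 mon2:6789"   # monitor addresses
CEPH_USER   = libvirt                 # cephx user the hosts authenticate as
CEPH_SECRET = "<libvirt secret UUID>"
BRIDGE_LIST = "storage-host"          # host(s) where rbd staging commands run

onedatastore create ceph_ds.conf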
Other hypervisors
● OpenNebula is flexible, so can we also use Ceph with non-libvirt/qemu hypervisors?
● Kernel RBD: can present RBD images in /dev/ on the hypervisor host for software unaware of librbd (sketch after this list)
● Docker: can exploit RBD volumes with a local filesystem for use as data volumes – maybe CephFS in future...?
● For unsupported hypervisors, can adapt to Ceph using e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports carefully!)
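A sketch of the kernel RBD route (image and mount-point names are made up): create an image, map it into /dev/, then treat it like any local disk:

rbd create --size 4096 rbd/data-vol   # 4 GiB image
rbd map rbd/data-vol                  # appears as e.g. /dev/rbd0
mkfs.ext4 /dev/rbd0                   # plain local filesystem on top
mount /dev/rbd0 /mnt/data-vol         # usable by anything unaware of librbd, e.g. a Docker data volume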
Choosing hardware
Testing/benchmarking/expert advice is needed, but there are general guidelines:
● Prefer many cheap nodes to few expensive nodes (10 is better than 3)
● Include small but fast SSDs for OSD journals
● Don't simply buy the biggest drives: consider the IOPS/capacity ratio
● Provision network and I/O capacity sufficient for your workload plus the recovery bandwidth needed after a node failure (rough worked example below).
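A rough worked example for the last point (all numbers assumed): if a failed node held 12 × 4 TB ≈ 48 TB of data, that data must be re-replicated across the surviving nodes; over a single 10 GbE link (~1.25 GB/s) that is roughly 48 000 GB ÷ 1.25 GB/s ≈ 11 hours of sustained transfer, competing with client traffic the whole time. More, smaller nodes shrink both the amount to recover and each node's share of the work.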
What's new?
Ceph releases
● Ceph 0.80 firefly (May 2014)
– Cache tiering & erasure coding
– Key/value OSD backends
– OSD primary affinity
● Ceph 0.87 giant (October 2014)
– RBD cache enabled by default
– Performance improvements
– Locally recoverable erasure codes
● Ceph x.xx hammer (2015)
Additional components
● Ceph FS – scale-out POSIX filesystem service, currently being stabilized
● Calamari – monitoring dashboard for Ceph
● ceph-deploy – easy SSH-based deployment tool
● Puppet, Chef modules
Get involved
Evaluate the latest releases:
http://ceph.com/resources/downloads/
Mailing list, IRC:
http://ceph.com/resources/mailing-list-irc/
Bugs:
http://tracker.ceph.com/projects/ceph/issues
Online developer summits:
https://wiki.ceph.com/Planning/CDS
Questions?
Spare slides
Ceph FS
CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX Association, 2006. http://ceph.com/papers/weil-ceph-osdi06.pdf
Components
● Client: kernel, fuse, libcephfs
● Server: MDS daemon
● Storage: RADOS cluster (mons & OSDs)
[Diagram: a Linux host running the ceph.ko client exchanges metadata and data with the Ceph server daemons (monitors (M), MDS, OSDs)]
From application to disk
[Diagram: Application → libcephfs / ceph-fuse / kernel client → client network protocol → ceph-mds and RADOS → disk]
Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance and ease of implementation
Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read-heavy workloads by replicating non-authoritative copies (cached with capabilities, just like clients do)
● In practice, work at the directory-fragment level in order to handle large dirs
Data placement
● Stripe file contents across RADOS objects
– get full RADOS cluster bandwidth from clients
– fairly tolerant of object losses: reads return zeros
● Control striping with layout vxattrs (example below)
– layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray'; the RADOS delete ops are sent by the MDS
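A hedged sketch of the layout vxattrs (paths and pool name are made up; attribute names follow the ceph.file.layout.* scheme):

getfattr -n ceph.file.layout /mnt/ceph/file                       # inspect current striping
setfattr -n ceph.file.layout.pool -v ssd_pool /mnt/ceph/newfile   # direct a file's data to another pool

Note that a file's layout can only be changed while the file is still empty; directory layouts apply to new files created beneath them.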
Clients
● Two implementations: ceph-fuse/libcephfs and the kernel client (kclient)
● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it is hogging metadata locks: include clients in troubleshooting
● Future: more per-client performance stats, and maybe per-client metadata QoS; clients probably group into jobs or workloads
● Future: may want to tag client I/O with a job ID (e.g. HPC workload, Samba client ID, container/VM ID)
Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file" in the metadata pool.
– I/O latency on metadata ops is sum of network latency and journal commit latency.
– Metadata remains pinned in in-memory cache until expired from journal.
Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.
● Control the cache size with mds_cache_size (config sketch below)
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also warms up the cache. Use standby replay to keep that cache warm across failover.
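A sketch of the corresponding ceph.conf settings (values illustrative; note that mds_cache_size counts inodes, not bytes):

[mds]
mds cache size = 100000     # inode count; the default in this era
mds standby replay = true   # standby MDS tails the active journal, keeping its cache warm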
Lookup by inode
● Sometimes we need the inode → path mapping:
– hard links
– NFS handles
● Costly to store this: mitigate by piggybacking paths (backtraces) onto data objects
– Con: storing metadata to the data pool
– Con: extra I/Os to set backtraces
– Pro: disaster recovery from the data pool
● Future: improve backtrace writing latency
CephFS in practice
ceph-deploy mds create myserver          # start an MDS on myserver
ceph osd pool create fs_data 64          # 64 placement groups here; tune for your cluster
ceph osd pool create fs_metadata 64
ceph fs new myfs fs_metadata fs_data     # metadata pool first, then data pool
mount -t ceph x.x.x.x:6789:/ /mnt/ceph   # kernel client mount (plus -o name=...,secret=... with cephx)
Managing CephFS clients
● New in giant: see the hostnames of connected clients
● Client eviction is sometimes important:
– Skip the wait during the reconnect phase on MDS restart
– Allow others to access files locked by a crashed client
● Use the OpTracker to inspect ongoing operations (example below)
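These are exposed through the MDS admin socket; a hedged sketch (the daemon name 'myserver' is made up, and exact verbs vary by version):

ceph daemon mds.myserver session ls             # list client sessions, with hostname metadata in giant
ceph daemon mds.myserver dump_ops_in_flight     # OpTracker view of operations currently in progress
ceph daemon mds.myserver session evict <id>     # forcibly evict a client session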
CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and a recent kernel
● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data
Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS configuration
Giant->Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
FSCK and repair
● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?
● Repair:
– Automatic where possible
– Manual tools to enable support
Client management
● Current eviction is not 100% safe against rogue clients
– Update the client protocol to wait for OSD blacklist
● Client metadata
– Initially: domain name, mount point
– Extension to other identifiers?
Online diagnostics
● Bugs exposed so far mostly relate to one client failing to release resources needed by another: "my filesystem is frozen". Introduce new health messages:
– "client xyz is failing to respond to cache pressure"
– "client xyz is ignoring capability release messages"
– Add client metadata so that messages can give domain names instead of IP addrs
● Opaque behaviour in the face of dead clients. Introduce `session ls`:
– Which clients does the MDS think are stale?
– Identify clients to evict with `session evict`
Journal resilience
● A bad journal prevents MDS recovery: "my MDS crashes on startup"
– Data loss
– Software bugs
● Updated the on-disk format to make recovery from damage easier
● New tool: cephfs-journal-tool (usage sketch below)
– Inspect the journal, search/filter
– Chop out unwanted entries/regions
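A hedged sketch of using the tool (subcommand names per its journal/event modes; always take an export before modifying anything):

cephfs-journal-tool journal inspect                   # check journal integrity
cephfs-journal-tool journal export backup.bin         # save a copy before any surgery
cephfs-journal-tool event recover_dentries summary    # salvage dentries from journal events
cephfs-journal-tool journal reset                     # last resort: discard the damaged journal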
Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?
● Full OSD cluster:
– Require explicit handling to abort with -ENOSPC
● MDS → RADOS flow control:
– Contention between I/O to flush the cache and I/O to the journal
Test, QA, bug fixes
● The answer to "Is CephFS production ready?"
● teuthology test framework:
– Long-running/thrashing tests
– Third-party FS correctness tests
– Python functional tests
● We dogfood CephFS internally:
– Various kclient fixes discovered
– Motivation for new health monitoring metrics
● Third-party testing is extremely valuable
What's next?
● You tell us!
● Recent survey highlighted:
– FSCK hardening
– Multi-MDS hardening
– Quota support
● Which use cases will the community test with?
– General purpose
– Backup
– Hadoop
Reporting bugs
● Does the most recent development release or kernel fix your issue?
● What is your configuration? MDS config, Ceph version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled? (see the sketch below)
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
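For the debug-logging point, a sketch per the log-and-debug page above (subsystems and levels are typical values, not a prescription):

[mds]
debug mds = 20   # verbose MDS subsystem logging
debug ms = 1     # log message traffic

or, at runtime via the admin socket:

ceph daemon mds.myserver config set debug_mds 20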
Future
● Ceph Developer Summit:
– When: 8 October
– Where: online
● Post-Hammer work:
– Recent survey highlighted multi-MDS and quota support
– Testing with clustered Samba/NFS?