CephFS: The Stable Distributed Filesystem, Greg Farnum (slides: schd.ws/hosted_files/ossna2017/52/2017-09-13 OSS CephFS Intro.pdf)
TRANSCRIPT
CephFS: The Stable Distributed Filesystem
Greg Farnum
Hi, I'm Greg
Greg Farnum
Ceph developer since 2009
Principal Software Engineer, Red Hat
Overview
● What is Ceph/CephFS
● CephFS: What Works
– It's a distributed POSIX filesystem!
– There are many niceties that go with that
– Directory fragmentation
– Multi-Active MDS
● CephFS: What Doesn't Work (Yet)
– Snapshots
● Other New Ceph Stuff
● Pain Points & Use Cases
Where Does Ceph Come From?
● Then: UC Santa Cruz Storage Systems Research Center
– Long-term research project in petabyte-scale storage
– Trying to develop a Lustre successor
● Now: Red Hat, SuSE, Canonical, 42on, Hastexo, ...
– Customers in virtual block devices and object storage
– ...and reaching for filesystem users!
Now: Fully Awesome
● RADOS: software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS: client library allowing apps to access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW: web services gateway for object storage, compatible with S3 and Swift
● RBD: reliable, fully-distributed block device with cloud platform integration
● CEPHFS: distributed file system with POSIX semantics and scale-out metadata management
(Diagram: applications, hosts/VMs, and clients access the stack through these interfaces, all layered on RADOS)
RADOS CLUSTER
(Diagram: an application talking to a RADOS cluster of storage nodes and monitors (M))
WHERE DO OBJECTS LIVE?
(Diagram: an application holding an object, unsure which node in the cluster should store it)
A METADATA SERVER?
(Diagram: the application first asks a central metadata server where the object lives (1), then talks to that node (2))
CALCULATED PLACEMENT
(Diagram: the application computes placement itself, e.g. hashing object names into ranges A-G, H-N, O-T, U-Z)
CRUSH IS A QUICK CALCULATION
(Diagram: an object is hashed by CRUSH directly onto a set of OSDs in the RADOS cluster)
CRUSH AVOIDS FAILED DEVICES
(Diagram: when an OSD fails, CRUSH maps the object's placement onto a different healthy OSD)
CRUSH: DYNAMIC DATA PLACEMENT
CRUSH: Pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
● Limited data migration on change
● Rule-based configuration
● Infrastructure topology aware
● Adjustable replication
● Weighting
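As an illustration (not from the slides), you can ask a running cluster where CRUSH places any object name; the pool and object names below are placeholders:

ceph osd map mypool myobject
# prints something like: pool 'mypool' -> pg 1.5e2f8d4e -> up/acting OSD set [2,0,1]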
DATA IS ORGANIZED INTO POOLS
(Diagram: objects in the cluster are grouped into pools, e.g. POOL A, POOL B, POOL C, POOL D, each containing placement groups (PGs))
RADOS COMPONENTS
OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors (M):
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Not in the data path!
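To poke at these components on a live cluster, the standard ceph CLI status commands (shown here purely as an illustration) list the monitor quorum and OSD topology:

ceph -s          # overall health, monitor quorum, OSD counts
ceph osd tree    # OSD layout by host, with up/down and in/out state
ceph mon stat    # monitor membership and quorum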
![Page 15: CephFS: The Stable Distributed Filesystem Greg Farnumschd.ws/hosted_files/ossna2017/52/2017-09-13 OSS CephFS Intro.pdf · Where Does Ceph Come From? ... Performance optimization is](https://reader034.vdocuments.us/reader034/viewer/2022051507/5a6ff4d97f8b9ac0538b8419/html5/thumbnails/15.jpg)
15
RADOS: FAILURE RECOVERY
● Each OSDMap is numbered with an epoch number
● The Monitors and OSDs store a history of OSDMaps
● Using this history, an OSD which becomes a new member of a PG can deduce every OSD which could have received a write which it needs to know about
● The process of discovering the authoritative state of the objects stored in the PG by contacting old PG members is called Peering
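For the curious, the relevant state is easy to inspect on a live cluster (the PG id below is just a placeholder):

ceph osd dump | head -1         # current OSDMap epoch
ceph osd getmap -o /tmp/osdmap  # fetch the full OSDMap for offline inspection
ceph pg 1.2f query              # peering and recovery state for one PG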
RADOS: FAILURE RECOVERY
(Diagram: a PG's OSD membership across OSDMap epochs, e.g. epochs 19884, 20113, and 20220, showing how the stored history identifies which OSDs may hold writes the new member needs to learn about)
LIBRADOS: RADOS ACCESS FOR APPS
LIBRADOS:
● Direct access to RADOS for applications
● C, C++, Python, PHP, Java, Erlang
● Direct access to storage nodes
● No HTTP overhead

Rich object API:
● Bytes, attributes, key/value data
● Partial overwrite of existing data
● Single-object compound atomic operations
● RADOS classes (stored procedures)
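librados itself is a library, but the rados CLI exercises the same object API and is a quick way to try it; pool and object names below are placeholders:

rados -p mypool put greeting ./hello.txt        # store a whole object
rados -p mypool setxattr greeting lang en       # attach an attribute
rados -p mypool setomapval greeting color blue  # attach key/value (omap) data
rados -p mypool get greeting ./copy.txt         # read it back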
Existing Awesome Ceph Stuff
THE RADOS GATEWAY
(Diagram: applications speak REST to RADOSGW instances, which use LIBRADOS to talk to the RADOS cluster)
RADOSGW MAKES RADOS WEBBY
RADOSGW:
● REST-based object storage proxy
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
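Any S3-compatible client works against RGW. A hedged sketch, assuming a gateway on the default civetweb port and placeholder host and bucket names:

radosgw-admin user create --uid=demo --display-name="Demo User"   # issues S3 credentials
s3cmd --host=rgw.example.com:7480 --host-bucket=rgw.example.com:7480 mb s3://demo-bucket
s3cmd --host=rgw.example.com:7480 --host-bucket=rgw.example.com:7480 put ./photo.jpg s3://demo-bucket/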
STORING VIRTUAL DISKS
(Diagram: a VM running on a hypervisor whose LIBRBD driver stores the virtual disk in the RADOS cluster)
RBD STORES VIRTUAL DISKS
RADOS BLOCK DEVICE:
● Storage of disk images in RADOS
● Decouples VMs from host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in: mainline Linux kernel (2.6.39+), Qemu/KVM, OpenStack, CloudStack, Nebula, Proxmox
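Basic RBD usage, as an illustrative sketch (pool and image names are placeholders):

rbd create mypool/vm-disk --size 10240          # 10 GiB image
rbd snap create mypool/vm-disk@base             # point-in-time snapshot
rbd snap protect mypool/vm-disk@base            # required before cloning
rbd clone mypool/vm-disk@base mypool/vm-clone   # copy-on-write clone
rbd map mypool/vm-disk                          # expose via the kernel module as /dev/rbdX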
CephFS Is Pretty Awesome
(Diagram: a Linux host mounts CephFS via the kernel module, writing data and metadata into the RADOS cluster; ceph-fuse, Samba, and NFS-Ganesha are alternative access paths)
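Mounting looks like any other network filesystem; both the kernel client and ceph-fuse work (monitor address, client name, and paths below are placeholders):

mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=foo,secretfile=/etc/ceph/foo.secret
ceph-fuse /mnt/cephfs --id foo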
Awesomeness Timeline
● Hammer (LTS), Spring 2015 (0.94.z)
● Infernalis, Fall 2015
● Jewel (LTS), Spring 2016 (10.2.z)
● Kraken, Winter 2016/17
● Luminous (LTS), Summer 2017 (12.2.z)
(The slide marks this progression from "Pre-Awesome" through "Some Awesome" to "Awesome" as the releases get newer.)
Awesome: It's A Filesystem!
POSIX Filesystem
● Mounting, from multiple clients
– Not much good without that!
● POSIX-y goodness:
– Atomic updates
– Files, with names and directories and rename
● and hard links too!
● Coherent caching
– Updates from one node are visible elsewhere, immediately
POSIX Filesystem: Consistency
● CephFS has “consistent caching”
● Clients are allowed to cache, and the server invalidates them before making changes
– This means clients never see stale data of any kind!
– And there's no opportunity for any kind of split brain situation
POSIX Filesystem: Scaling Data
● All data is stored in RADOS
● Filesystem clients write directly to RADOS
● Need more data space? Add more OSDs!
● Faster throughput?
– Faster SSDs!
– Wider striping of files across objects! (see the layout example below)
– ...at least, up until you're limited by latency instead of throughput
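Striping is controlled per file or directory through CephFS's virtual layout xattrs; the paths and values below are only an illustration:

setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/bigdata       # stripe new files 4 ways
setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /mnt/cephfs/bigdata  # 1 MiB stripe units
getfattr -n ceph.dir.layout /mnt/cephfs/bigdata                         # inspect the result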
POSIX Filesystem: Scaling Metadata
● Only active metadata is stored in memory
● Size your metadata server (MDS) by active set size, not total metadata
● (We’ll discuss scaling past one server’s limits shortly)
rstats are cool
# ext4 reports dirs as 4K
$ ls -lhd /home/john/data
drwxrwxr-x. 2 john john 4.0K Jun 25 14:58 /home/john/data
# CephFS reports dir size from contents
$ ls -lhd /cephfs/mydata
drwxrwxr-x. 1 john john 16M Jun 25 14:57 /cephfs/mydata
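The recursive statistics behind this are also exposed directly as virtual xattrs, for example:

getfattr -n ceph.dir.rbytes /cephfs/mydata   # recursive bytes under the directory
getfattr -n ceph.dir.rfiles /cephfs/mydata   # recursive file count
getfattr -n ceph.dir.rctime /cephfs/mydata   # newest change time anywhere in the subtree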
Awesome: A Security Model
CephX security capabilities
● Clients start out unable to access the MDS.
– Incrementally granted permissions for subtrees (or the whole tree)
– To act as a specific user
– Etc
● For real security, these must be coordinated with OSD caps:
ceph auth get-or-create client.foo \
  mds "allow rw path=/foodir" \
  osd "allow rw pool=foopool" \
  mon "allow r"
CephX security capabilities: Protection
● The security capabilities are encrypted by the server; can't be changed by client
● MDS only examines MDS grants
– Protects against acting as an unauthorized user
– Prevents all access to inodes/dentries not under granted path
● OSDs independently examine OSD grants
– Protects against access to unauthorized pools and namespaces
● Possible hole: if clients share namespace+pool, they can trample on raw file data
– If you don't trust your clients, give them each their own namespace (this is free) and specify it in CephFS layout for their directory hierarchy
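A sketch of that per-tenant isolation; the client, pool, namespace, and path names below are placeholders:

ceph auth caps client.foo \
  mds "allow rw path=/foodir" \
  osd "allow rw pool=foopool namespace=foons" \
  mon "allow r"
setfattr -n ceph.dir.layout.pool_namespace -v foons /mnt/cephfs/foodir   # route this tree's data to that namespace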
Awesome: Hot standby MDS
Standby servers
● Nothing ties metadata to a particular server!
● Spin up an arbitrary number of “standby” and “standby-replay” servers
– Standby: just waiting around; can be made active
– Standby-replay: actively replaying the MDS log
● Warms up the cache for fast takeover
Example MDS log entries that a standby-replay daemon keeps replaying:
  rename /tmp/file1 -> /home/greg/foo
  rename /tmp/file2 -> /home/greg/bar
  create /home/greg/baz
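Configuring a standby-replay daemon is (in this era of Ceph) just a ceph.conf setting on that MDS; the section name and rank below are placeholders:

[mds.b]
  mds standby replay = true
  mds standby for rank = 0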
Standby servers: reconnect
● Replay log, load all necessary file data from RADOS
● Let clients replay uncommitted operations, process them
● Synchronize caching states between clients and MDS
● Go active!
Basically Awesome: Active Multi-MDS
Active Multi-MDS
● Because no metadata is stored on MDS servers, migrating it is “easy”!
Active Multi-MDS
Cooperative Partitioning between servers:
● Keep track of how hot metadata is
● Migrate subtrees to keep heat distribution similar
– Cheap because all metadata is in RADOS
● Maintains locality
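Going multi-active is a per-filesystem setting; the filesystem name below is a placeholder, and some releases also require an explicit confirmation flag:

ceph fs set cephfs allow_multimds true   # Luminous-era safety flag; may need --yes-i-really-mean-it
ceph fs set cephfs max_mds 2             # ask for two active MDS ranks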
Active Multi-MDS: What's short
● Performance optimization is hard, so the balancing algorithm might not work well for you
● Solution: Subtree pinning
– Explicitly map folders to specific MDSes (by setting xattrs) to achieve a known-good balance.
● You can change this mapping at will
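Pinning is a virtual xattr on the directory; the rank number and path below are placeholders:

setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects    # pin this subtree to MDS rank 1
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/projects   # -1 hands it back to the balancer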
Awesome: Directory Frags
Directory Fragmentation
● Directories are generally loaded from disk as a unit
– But sometimes that's too much data at once!
– Or you want to spread a hot directory over many active MDSes
Example: one directory split into two fragments (dirfrags), each stored as its own object:

/<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ mydir[01], total size 7 MB
  foo -> ino 1342, 4 MB
  bar -> ino 1001, 1024 KB
  baz -> ino 1242, 2 MB

/<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ mydir[10], total size 8 MB
  hi -> ino 1000, 6 MB
  hello -> ino 6743, 1024 KB
  whaddup -> ino 9872, 1 MB
Mostly Awesome: Scrub/Repair
Forward Scrub
● Forward scrubbing, to ensure consistency
ceph daemon mds.<id> scrub_path <path>
ceph daemon mds.<id> scrub_path <path> recursive
ceph daemon mds.<id> scrub_path <path> repair
ceph daemon mds.<id> tag path
● You have to run this manually right now, no automatic background scrub :(
Repair tools: cephfs-journal-tool
● Disaster recovery for damaged journals:
– inspect/import/export/reset
– header get/set
– event recover_dentries
● Allows rebuild of metadata that exists in journal but is lost on disk
● Companion cephfs-table-tool exists for resetting session/inode/snap tables as needed afterwards.
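A typical recovery sequence with these tools, as a hedged sketch (always export a backup of the journal first):

cephfs-journal-tool journal export backup.bin        # save the damaged journal
cephfs-journal-tool event recover_dentries summary   # salvage dentries/inodes into the backing store
cephfs-journal-tool journal reset                    # discard the journal once salvaged
cephfs-table-tool all reset session                  # then clear stale client sessions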
Repair tools: cephfs-data-scan
● “Backwards scrub”
● Iterate through all RADOS objects and tie them back to the namespace
● Parallel workers, thanks to new RADOS functionality
– cephfs-data-scan scan_extents
– cephfs-data-scan scan_inodes
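Run in order, optionally sharded across parallel workers; the data pool name and worker counts below are placeholders:

cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data   # worker 0 of 4
cephfs-data-scan scan_inodes  --worker_n 0 --worker_m 4 cephfs_data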
Repair tool methods
● Examine object names and send inferred stat info to “root” object
(Diagram: a file's data objects 1342.0 and 1342.1; the first object, 1342.0, carries the backtrace /<v2>/home<v5>/greg<v9>/foo, and the inferred stat info is sent to it)
Repair tool methods
● Assemble tree information from backtrace and inferred stat
(Diagram: from object 1342.0 and its backtrace /<v2>/home<v5>/greg<v9>/foo, the scan rebuilds the directory /<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ mydir, total size 2074 bytes, with entries foo -> ino 1342, 6 MB; bar -> ino 1001, 1024 bytes; baz -> ino 1242, 2 MB)
Repair tool methods
● Do inference and then insertion in parallel across the cluster
(Diagram: several workers operating on the RADOS cluster in parallel)
CephFS: Almost Awesome Parts
Almost Awesome: Snapshots
Snapshots: Disk Data Structures
● Arbitrary sub-tree snapshots of the hierarchy
● Metadata stored as old_inode_t map in memory/disk
● Data stored in RADOS object snapshots
(Diagram: the dirfrag /<ino 0,v2>/home<ino 1,v5>/greg<ino 5,v9>/ mydir[01], total size 7 MB, holds foo -> ino 1342, 4 MB with snapshot ids [<1>,<3>,<10>], alongside bar -> ino 1001, 1024 KB and baz -> ino 1242, 2 MB; data object 1342.0 carries the backtrace /<v2>/home<v5>/greg<v9>/foo)
Snapshots
● Arbitrary sub-tree snapshots of the hierarchy
● Metadata stored as old_inode_t map in memory/disk
● Data stored in RADOS object snapshots
(Diagram: the HEAD data object 1342.0 carries the current backtrace /<v2>/home<v5>/greg<v9>/foo, while its RADOS snapshot clone 1342.0/1 preserves the older backtrace /<v1>/home<v3>/greg<v7>/foo)
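From the client side, snapshots are simply directories created under the hidden .snap directory (paths below are placeholders; on some releases the feature must first be enabled with "ceph fs set <fs> allow_new_snaps true"):

mkdir /mnt/cephfs/mydir/.snap/before-upgrade   # snapshot this subtree
ls /mnt/cephfs/mydir/.snap/                    # list existing snapshots
rmdir /mnt/cephfs/mydir/.snap/before-upgrade   # delete the snapshot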
Snapshots: What's Short
● Testing
● The exciting combinatorial explosion of tracking all this across different metadata servers!
– Much of this exists; it's incomplete in various ways
– As always, recovering from other failures which impact our state transitions
● There are pull requests under discussion which resolve the known issues
– Targeted for Mimic, we hope
– It already works well on single-MDS systems
Almost Awesome: Multi-FS
MultiFS: What's Present
● You can create multiple filesystems within a RADOS cluster
– Different pools or namespaces
● Each FS gets its own MDS and has to be connected to independently
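Creating and mounting a second filesystem, as a rough sketch (pool and filesystem names are placeholders; the client option name varies across releases):

ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create fs2_metadata 64
ceph osd pool create fs2_data 128
ceph fs new fs2 fs2_metadata fs2_data
ceph-fuse /mnt/fs2 --client_mds_namespace=fs2   # mount that specific filesystem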
MultiFS: What's Missing
● Testing: This gets limited coverage in our test suite
● Security model: we know where we're going, but it's not done
– Currently exposes filesystem existence to users who don’t have permissions on it
● On the roadmap
Pain Point: Client Trust
Client Trust
● Clients can trash anything they can write
– Give clients separate namespaces!
● Clients can deny writes to anything they can read
– Don't share stuff across tenants
● Clients can DoS the MDS they attach to
– ...Multiple FSes in a cluster will fix this
● This is pretty fundamental. If you actively don’t trust your clients, put them behind an NFS gateway.
Pain Point: Debugging Live Systems
Exposing State: What's available
ceph daemon mds.a dump_ops_in_flight
{ "ops": [
    { "description": "client_request(client. ...
      "initiated_at": "2015-03-10 22:26:17.4 ...
      "age": 0.052026,
      "duration": 0.001098,
      "type_data": [
        "submit entry: journal_and_reply",
        "client.4119:21120", ...
Exposing State: What's available
# ceph daemon mds.a session ls
...
"client_metadata": {
    "ceph_sha1": "a19f92cf...",
    "ceph_version": "ceph version 0.93...",
    "entity_id": "admin",
    "hostname": "claystone",
    "mount_point": "\/home\/john\/mnt"
}
Exposing State: What's available
● ceph mds tell 0 dumpcache /path/to/dump/to
● Newish: can also dump specific inodes and dentries
Exposing State: What's missing
● Good ways of identifying why things are blocked
● Tracking accesses to a file
● ...and other things we haven't thought of yet?
Who Should Use CephFS?
CephFS Users
● Some vendors are targeting it at OpenStack users, since it pairs so well with RBD
● CephFS is good at large files
– It does well with small files for a distributed FS, but there’s no comparing to a local FS
● CephFS for home directories?
– If your users are patient or metadata can all be cached, but remember you want very new clients to get all the bug fixes
● Anybody who likes exploring: go for it!
Other New Ceph Stuff
Erasure Coding (Overwrites)
● Instead of replicating across OSDs, give them shards and parity blocks
● Previously, EC RADOS pools were append-only
– Simple, stable, suitable for RGW
● Now available: EC with overwrite support!
– Slow to start, since it requires a two-phase commit; optimizations to make it reasonably efficient will follow
– Means you can store CephFS and RBD data at 1.5x instead of 3x cost, resilient against the same (or larger) number of node failures
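Putting CephFS data on an erasure-coded pool, as a hedged sketch (pool names and PG counts are placeholders; a k=2, m=1 profile is what gives the 1.5x figure above):

ceph osd pool create ecpool 128 128 erasure
ceph osd pool set ecpool allow_ec_overwrites true               # required for RBD/CephFS data
ceph fs add_data_pool cephfs ecpool                             # attach as an extra CephFS data pool
setfattr -n ceph.dir.layout.pool -v ecpool /mnt/cephfs/ecdata   # send one directory's data there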
OSD: BLUESTORE
● BlueStore = Block + NewStore
– key/value database (RocksDB) for metadata
– all data written directly to raw block device(s)
– can combine HDD, SSD, NVMe, NVRAM
● Full data checksums (crc32c, xxhash)
● Inline compression (zlib, snappy, zstd)
● ~2x faster than FileStore
– better parallelism, efficiency on fast devices
– no double writes for data
– performs well with very small SSD journals
(Diagram: within the OSD, the ObjectStore interface is implemented by BlueStore; metadata goes through RocksDB on BlueFS/BlueRocksEnv, data goes straight to raw block devices, with an allocator and compressor alongside)
ASYNCMESSENGER
● New implementation of network layer
– replaces aging SimpleMessenger
– fixed size thread pool (vs 2 threads per socket)
– scales better to larger clusters
– more healthy relationship with tcmalloc
– now the default!
● Pluggable backends
– PosixStack – Linux sockets, TCP (default, supported)
– Two experimental backends!
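The messenger backend is a ceph.conf option; async is already the default, so this is only interesting for experimenting with the other backends (a sketch, exact values depend on how Ceph was built):

[global]
  ms type = async+posix   # the default; experimental alternatives include async+rdma and async+dpdk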
CEPH-MGR
● ceph-mon monitor daemons used to do a lot more than necessary
– frequent "PG stat" messages to support things like 'df'
– this limited cluster scalability
● ceph-mgr moves non-critical metrics into a separate daemon
– that is more efficient
– that efficiently integrates with external modules (Python!)
● Good host for
– integrations, like Calamari REST API endpoint
– time-series DB integration
– coming features like 'ceph top' or 'rbd top'
– high-level management functions and policy
(Diagram: the manager daemon sitting alongside the monitors; "time for new iconography")
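Manager modules are toggled at runtime; dashboard and prometheus are examples of modules shipped in this era:

ceph mgr module ls                 # available and enabled modules
ceph mgr module enable dashboard   # simple built-in web UI
ceph mgr module enable prometheus  # expose metrics for time-series collection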
FOR MORE INFORMATION
● docs
– http://docs.ceph.com/
– https://github.com/ceph
● help
– the ceph-users and ceph-devel mailing lists
– #ceph, #ceph-devel on irc.oftc.net