Ceph: an awesome distributed storage system
DESCRIPTION
A presentation on Ceph, a unified distributed storage system: its history (begun in 2006 as part of Sage Weil's Ph.D. research at the University of California, Santa Cruz), its architecture, and its ecosystem.
Ceph-History
• At the beginning (2006): part of Sage Weil's Ph.D. research
• After graduation (2007): open-sourced, with 3 main developers
• 2011: S. Weil founded Inktank Storage to provide professional services and support for Ceph (~60 developers)
• April 2014: Red Hat acquired Inktank ($175 million)
Open Source Project
• www.ceph.com
• www.github.com/ceph
• 9th release (Infernalis, 11/2015)
Ceph-Target
Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.
• Designed for commodity hardware
• Software-based
• Open source
• Self-managing/healing
• Self-balancing
• Painless scaling
• No SPOF (single point of failure)
• Object storage (S3, Swift)
• Block storage
• File system (POSIX)
[Diagram: CAP theorem trade-off]
Architecture outline
RBD
• RADOS Block Device
– Thin provisioning
– Snapshot / clone
– Can be used as an OpenStack Cinder backend
– Can be used by libvirt
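A minimal sketch with the python-rbd bindings; it assumes a running cluster with a default ceph.conf and keyring, and the pool and image names are placeholders:

```python
import rados
import rbd

# Connect to the cluster (assumes a standard ceph.conf and client keyring).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')   # the 'rbd' pool is an assumption

rbd_inst = rbd.RBD()
# 4 GiB image; space is only consumed as data is written (thin provisioning).
# Format-2 with layering is required for cloning.
rbd_inst.create(ioctx, 'myimage', 4 * 1024**3,
                old_format=False, features=rbd.RBD_FEATURE_LAYERING)

with rbd.Image(ioctx, 'myimage') as image:
    image.create_snap('base')      # point-in-time snapshot
    image.protect_snap('base')     # a snapshot must be protected before cloning

# Copy-on-write clone of the snapshot.
rbd_inst.clone(ioctx, 'myimage', 'base', ioctx, 'myclone',
               features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()
```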
CephFS
• File system
– POSIX-compliant (legacy)
– Network file system
– Plugin for Hadoop (HDFS alternative)
– Kernel client (Linux ≥ 2.6.34, 2010) or FUSE
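Besides the kernel and FUSE clients, CephFS can be reached through libcephfs. A minimal sketch with the python-cephfs bindings, assuming a running MDS and a CephFS file system; paths are placeholders:

```python
import cephfs

# Bind to libcephfs and mount the file system
# (assumes ceph.conf, a keyring, and a running MDS).
fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()

fs.mkdir('/demo', 0o755)
fd = fs.open('/demo/hello.txt', 'w', 0o644)   # create for writing
fs.write(fd, b'hello cephfs\n', 0)            # write at offset 0
fs.close(fd)

fs.unmount()
fs.shutdown()
```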
RGW
• RADOS Gateway: HTTP REST gateway for the RADOS object store
– AWS S3-compliant API
– OpenStack Swift-compliant API
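Because the API is S3-compliant, any standard S3 client works against RGW. A minimal sketch with boto3; the endpoint URL and the credentials (created beforehand, e.g. with radosgw-admin) are placeholders:

```python
import boto3

# Point a standard S3 client at the RGW endpoint instead of AWS.
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',  # placeholder RGW address
    aws_access_key_id='ACCESS_KEY',              # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo')
s3.put_object(Bucket='demo', Key='hello.txt', Body=b'hello rgw\n')
print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())
```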
What services can easily be built with Ceph?
• A Dropbox-like service (Ceph RGW + ownCloud)
• A volume provider (Ceph RBD)
• An NFS-like service (CephFS)
All with the same Ceph cluster.
A Ceph cluster
Ceph-Concept
• Object Storage Daemons (OSD): store the objects
• Monitor servers (MON): watch over the storage network, maintain the group membership, and ensure (strong) consistency
• Metadata Servers (MDS): store the file system structure
• A service uses a Pool that is composed of Placement Groups (see the sketch after this list)
– a placement group is a storage space distributed over n OSDs
• CRUSH map: defines placement rules
• Replication vs. Erasure Code
• Cache tiering
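A minimal sketch of talking to a pool with the python-rados bindings; the pool name 'mypool' is an assumption (a pool created beforehand), and CRUSH transparently maps each object to a PG and then to OSDs:

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Objects live in a pool; CRUSH maps each object to a PG, then to OSDs.
ioctx = cluster.open_ioctx('mypool')             # pool name is a placeholder
ioctx.write_full('greeting', b'hello rados\n')   # create/replace an object
print(ioctx.read('greeting'))                    # read it back
ioctx.remove_object('greeting')                  # delete it

ioctx.close()
cluster.shutdown()
```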
Entities
[Diagram: a Ceph client on a client host performs CRUD operations on a Pool; the pool is made of placement groups PG0, PG1, PG2, …, PGn, and each PG maps to a set of OSDs on the OSD hosts (e.g., PG0 → OSD0a, OSD0b, OSD0c).]
Pool Type / Resiliency
• Replicated
– each PG is composed of n OSDs
– one OSD is designated as Primary
– IO is done on the Primary
– each object is copied to the other OSDs by the Primary (strong consistency: ack after copy)
• Erasure Code (see the toy sketch after this list)
– a pool can have an erasure code profile (parameters k, m)
– each PG is composed of k+m OSDs
– one OSD is designated as Primary
– IO is done on the Primary (encode and decode)
– each object is encoded into k+m chunks by the Primary and then spread over the k+m OSDs (strong consistency: ack after creation)
– the default erasure code library is jerasure; other libraries can be loaded dynamically (plugins)
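A toy illustration of the k+m idea with k = 2 data chunks and m = 1 XOR parity chunk; this only sketches the principle, not Ceph's jerasure (Reed-Solomon style) implementation:

```python
def encode_k2m1(data: bytes):
    """Split data into k=2 chunks and add m=1 XOR parity chunk."""
    half = (len(data) + 1) // 2
    d0 = data[:half]
    d1 = data[half:].ljust(half, b'\0')      # pad so chunks are equal length
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return d0, d1, parity                    # any 2 of these 3 recover the data

def recover_d1(d0: bytes, parity: bytes) -> bytes:
    """Rebuild the lost chunk d1 from d0 and parity (XOR is its own inverse)."""
    return bytes(a ^ b for a, b in zip(d0, parity))

d0, d1, p = encode_k2m1(b'hello erasure code')
assert recover_d1(d0, p) == d1               # survives the loss of one chunk
```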
Erasure code overview
Object Store Device
[Diagram: an OSD host stacks the OSD daemon on top of a local file system (using xattrs) on top of the disk.]
An OSD is primary for some objects:
– responsible for resiliency
– responsible for coherency
– responsible for re-balancing
– responsible for recovery
An OSD is secondary for other objects:
– under the control of the primary
– capable of becoming primary
OSDs expose atomic transactions: put, get, delete, … (see the sketch below)
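A sketch of a compound atomic operation through the python-rados bindings (a WriteOpCtx batches steps so the primary OSD applies them atomically); pool, object, and key names are placeholders:

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')     # placeholder pool

# All steps inside a WriteOpCtx are applied atomically by the primary OSD.
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('owner',), ('alice',))   # key/value metadata
    ioctx.operate_write_op(op, 'inventory')      # placeholder object name

ioctx.close()
cluster.shutdown()
```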
Object Placement
CRUSH: Controlled Replication Under Scalable Hashing
A pseudo-random, deterministic data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster.
Because every client can compute placements itself, no index server is needed to coordinate reads and writes.
Based on:
• OSD weights
• rules
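A toy stand-in that shows the flavor of the idea (deterministic, weighted, pseudo-random placement computed from the object name alone); it uses weighted rendezvous hashing, not the real CRUSH algorithm with its bucket hierarchy and rules:

```python
import hashlib
import math

def choose_osds(obj_name: str, weights: dict, replicas: int = 3) -> list:
    """Toy stand-in for CRUSH: deterministic and pseudo-random, so every
    client computes the same placement from the object name alone
    (no index server needed). weights maps OSD name -> weight."""
    def score(osd: str) -> float:
        h = hashlib.sha256(f'{obj_name}:{osd}'.encode()).digest()
        u = (int.from_bytes(h[:8], 'big') + 0.5) / 2**64  # uniform in (0, 1)
        return -weights[osd] / math.log(u)                # weight biases choice
    return sorted(weights, key=score, reverse=True)[:replicas]

cluster = {'osd.0': 1.0, 'osd.1': 1.0, 'osd.2': 1.0, 'osd.3': 2.0}
print(choose_osds('greeting', cluster))   # same answer on every client
```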
The CRUSH map declares three things: the OSDs (devices), the buckets that group them (hosts, roots), and the placement rules:

```
# OSDs (devices)
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# Buckets
host ceph-osd-ssd-server-1 {
    id -1
    alg straw
    hash 0
    item osd.0 weight 1.00
    item osd.1 weight 1.00
}
host ceph-osd-ssd-server-2 {
    id -2
    alg straw
    hash 0
    item osd.2 weight 1.00
    item osd.3 weight 1.00
}
host ceph-osd-platter-server-1 {
    id -3
    alg straw
    hash 0
    item osd.4 weight 1.00
    item osd.5 weight 1.00
}
host ceph-osd-platter-server-2 {
    id -4
    alg straw
    hash 0
    item osd.6 weight 1.00
    item osd.7 weight 1.00
}
root platter {
    id -5
    alg straw
    hash 0
    item ceph-osd-platter-server-1 weight 2.00
    item ceph-osd-platter-server-2 weight 2.00
}
root ssd {
    id -6
    alg straw
    hash 0
    item ceph-osd-ssd-server-1 weight 2.00
    item ceph-osd-ssd-server-2 weight 2.00
}

# Rules
rule data {
    ruleset 0
    type replicated
    min_size 2
    max_size 2
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule platter {
    ruleset 3
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd {
    ruleset 4
    type replicated
    min_size 0
    max_size 4
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
```
Cache Tiering
Two modes: write-back (the cache pool absorbs reads and writes and flushes dirty objects to the backing pool) and read-only (the cache pool serves reads; writes go directly to the backing pool).
Monitor
• Maintains the cluster state and history
– MON map
– OSD map
– PG map
– CRUSH map
– MDS map
• Every Ceph client has a list of monitor addresses (see the sketch after this list)
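A sketch of querying the monitors through the python-rados bindings (mon_command sends a JSON command to a monitor); it assumes admin credentials in the default locations:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Ask the monitors for the current monitor map.
ret, outbuf, outs = cluster.mon_command(
    json.dumps({'prefix': 'mon dump', 'format': 'json'}), b'')
monmap = json.loads(outbuf)
print([m['name'] for m in monmap['mons']])

cluster.shutdown()
```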
Dependability
• Monitors use a consensus algorithm (Paxos) to maintain the maps
– all the monitors share the same map view (strong consistency)
– based on a quorum: a strict majority of monitors must be up, so if half or more of the monitors crash or become unreachable, the cluster becomes unavailable (e.g., a 3-monitor cluster survives 1 failure, a 5-monitor cluster survives 2)
Metadata Server (MDS)
• Stores the metadata of CephFS (permission bits, ACLs, ownership, …)
• The metadata are stored in a Ceph pool (not locally)
• Caches the metadata
• Provides high availability of metadata (multiple MDSs)
MDS
An adaptive metadata cluster architecture based on Dynamic Subtree Partitioning, which adaptively and intelligently distributes responsibility for the file system directory hierarchy among the available MDSs in the MDS cluster.
Ceph-Status
• Portal: www.ceph.com
• Code: https://github.com/ceph/ceph
• Versions:
– Infernalis (11/2015)
– Hammer (04/2015)
– Giant (10/2014)
– Firefly (05/2014)
– Emperor (11/2013)
– Dumpling (08/2013)
– Cuttlefish (05/2013)
– Bobtail (01/2013)
– Argonaut (07/2012)
• Licenses: LGPL v2.1, BSD, MIT, Apache 2, …
• On March 19, 2010, Linus Torvalds merged the Ceph client into Linux kernel version 2.6.34.
• Active contributors: ~120
• Very active community
inkScope is a Ceph visualization and admin interface
• Open source: https://github.com/inkscope/inkscope
• Version 1.3 (23/12/2015)
Architecture