Ceph - A distributed storage system
whoami
● Italo Santos
● @ Locaweb since 2007
● Sysadmin @ Storage Team
Introduction
● Single Storage System
● Scalable
● Reliable
● Self-healing
● Fault Tolerant
● NO single point of failure
Architecture
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
(Clients: APP on LIBRADOS or RADOSGW; HOST/VM on RBD; CLIENT on CEPH FS)
Ceph Storage Cluster
(Diagram: OSDs, Monitors (M), and MDS daemons)
OSDs
OSDs
● One per disk
● Store data
● Replication
● Recovery
● Backfilling
● Rebalancing
● Heartbeats between OSDs
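The replication duty above works as primary-copy replication: the client writes to the PG's primary OSD, which forwards the write to the replica OSDs and acknowledges only once every copy is stored. A minimal, hypothetical Python sketch (the class and method names are illustrative, not Ceph APIs):

```python
# Hypothetical sketch of primary-copy replication, not a Ceph API.
class OSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}          # object name -> data

    def write_primary(self, name, data, replicas):
        """Primary persists locally, fans out to replicas, then acks."""
        self.store[name] = data
        for osd in replicas:
            osd.write_replica(name, data)
        return "ack"             # client hears back only after all copies exist

    def write_replica(self, name, data):
        self.store[name] = data

osds = [OSD(i) for i in range(3)]
primary, replicas = osds[0], osds[1:]
assert primary.write_primary("obj1", b"payload", replicas) == "ack"
assert all(o.store["obj1"] == b"payload" for o in osds)
```

Recovery and backfilling are the flip side of the same design: when an OSD rejoins or is added, peers copy it the objects it is missing.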
(Diagram: each OSD runs on top of a filesystem (btrfs, xfs, or ext4) on its own disk, with monitors (M) alongside)
Ceph Monitors
Monitors
● Cluster map
○ Monitor map
○ OSD map
○ Placement Group map
○ CRUSH map
Metadata Server (MDS)
MDS
● Used only by CephFS
● POSIX-compliant shared filesystem
● Manage metadata
○ Directory hierarchy
○ File metadata
● Stores metadata on RADOS
CRUSH
CRUSH
● Pseudo-random placement algorithm
○ Fast calculation
○ Deterministic
● Statistically uniform distribution
● Limited data migration on change
● Rule-based configuration
Two-step placement:
pg = hash(object name) % num_pg
OSDs = CRUSH(pg, cluster state, rule set)
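The two formulas above can be sketched in plain Python. As a stand-in for CRUSH itself, this sketch uses rendezvous (highest-random-weight) hashing, which shares the properties listed earlier (fast, deterministic, roughly uniform, limited migration on change) but is not the real algorithm, which walks a weighted hierarchy of buckets:

```python
import hashlib

def pg_for_object(name, num_pg):
    # Step 1: object name -> placement group: hash(object name) % num_pg
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % num_pg

def osds_for_pg(pg, osd_ids, size=3):
    # Step 2: PG -> ordered list of OSDs. Rendezvous hashing is only a
    # stand-in here for CRUSH, which walks a weighted bucket hierarchy
    # driven by the cluster map and rule set.
    score = lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest()
    return sorted(osd_ids, key=score)[:size]

pg = pg_for_object("myobject", num_pg=128)
acting = osds_for_pg(pg, osd_ids=range(6))
# Deterministic: any client computes the same mapping, with no central
# lookup table to consult.
assert acting == osds_for_pg(pg, osd_ids=range(6))
```

Because both steps are pure functions of the object name and the cluster state, clients locate data by calculation rather than by asking a metadata server.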
Placement Groups (PGs)
Placement Groups
● Logical collection of objects
● Maps PGs to OSDs dynamically
● Computationally less expensive
○ Fewer placement computations to track
○ Less per-object metadata
● Dynamic rebalancing
Placement Groups
Placement Groups
● Increasing the PG count reduces per-OSD load variance
● Aim for ~100 PGs per OSD
(each object maps to as many OSDs as its replica count)
● pg_num is defined at pool creation
● With multiple pools:
○ Balance the PGs per pool against the total PGs per OSD
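The ~100 PGs per OSD rule of thumb corresponds to the common sizing formula: total PGs ≈ (OSDs × 100) / pool size, rounded up to a power of two. A quick sketch:

```python
def recommended_pg_num(num_osds, pgs_per_osd=100, pool_size=3):
    """Target total PGs for a pool, rounded up to the nearest power
    of two (power-of-two pg_num gives the most even distribution)."""
    target = num_osds * pgs_per_osd / pool_size
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num

# e.g. 9 OSDs, 3 replicas -> target 300 -> pg_num 512
assert recommended_pg_num(9) == 512
```

Too few PGs means uneven data placement; too many means extra peering overhead per OSD, which is why the total is balanced across all pools.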
Pools
Pools
● Replicated
○ Object replicated N times (i.e., default size = 3)
○ Object + 2 protection replicas
● Erasure Coded
○ Stores objects as K+M chunks (i.e., size = K+M)
○ Divided into K data chunks and M coding chunks
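As a toy illustration of the K+M idea (a simple XOR parity code, not one of Ceph's actual erasure-code plugins), K = 2 data chunks plus M = 1 coding chunk can survive the loss of any single chunk:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data):
    """Split into K=2 data chunks and add M=1 XOR parity chunk."""
    half = len(data) // 2
    k1, k2 = data[:half], data[half:]
    assert len(k1) == len(k2), "toy encoder needs even-length input"
    return [k1, k2, xor_bytes(k1, k2)]

def decode(chunks):
    """Recover the object with any one chunk missing (None)."""
    k1, k2, p = chunks
    if k1 is None:
        k1 = xor_bytes(k2, p)   # lost data chunk rebuilt from parity
    elif k2 is None:
        k2 = xor_bytes(k1, p)
    return k1 + k2

chunks = encode(b"ceph-objects")
chunks[0] = None                       # simulate losing one OSD's chunk
assert decode(chunks) == b"ceph-objects"
```

The trade-off mirrors the slide: replication costs N× the raw space, while erasure coding stores (K+M)/K× at the price of more CPU on write and recovery.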
Ceph Clients
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
(Clients: APP on LIBRADOS or RADOSGW; HOST/VM on RBD; CLIENT on CEPH FS)
RadosGW (Ceph Object Gateway)
RadosGW
● Object Storage Interface
● Apache + FastCGI
● S3-compatible
● Swift-compatible
● Common namespace
● Store data on Ceph cluster
RBD (RADOS Block Device)
RBD
● Block device interface
● Data striped across the Ceph cluster
● Thin-provisioned
● Snapshot support
● Linux kernel client and userspace library (librbd)
● Cloud platform support
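"Data striped" here means an RBD image is cut into fixed-size RADOS objects (4 MB by default), so a block-device offset maps to an object name plus an offset inside that object. A sketch of that arithmetic (treat the exact object-name format as illustrative):

```python
OBJECT_SIZE = 4 * 1024 * 1024   # default RBD object size: 4 MiB

def rbd_object_for_offset(image_prefix, byte_offset):
    """Map a byte offset in the block device to
    (backing object name, offset within that object)."""
    index = byte_offset // OBJECT_SIZE
    # Zero-padded hex index appended to the image's object prefix;
    # the exact naming scheme is illustrative here.
    return f"{image_prefix}.{index:016x}", byte_offset % OBJECT_SIZE

name, off = rbd_object_for_offset("rbd_data.abc123", 9 * 1024 * 1024)
# 9 MiB lands in the third 4 MiB object (index 2), 1 MiB into it
assert name.endswith("0000000000000002") and off == 1024 * 1024
```

Because each 4 MB object is placed independently by CRUSH, reads and writes to one image fan out across many OSDs in parallel, which is also what makes thin provisioning natural: objects are only created when first written.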
CephFS (Ceph File System)
CephFS
● POSIX-compliant filesystem
● Shared filesystem
● Directory hierarchy
● File metadata (owner, timestamps, mode, etc.)
● Ceph MDS required
● NOT production ready!
Thanks!
Italo Santos @ Storage Team