an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/weil-ceph-20130416-lug.p… ·...

33
an intro to ceph for hpc sage weil – inktank lug – 2013.04.16

Upload: others

Post on 14-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

an intro to ceph for hpc

sage weil – inktanklug – 2013.04.16

Page 2: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

what is ceph?

● distributed storage system– reliable system built with unreliable components– fault tolerant, no SPoF

● commodity hardware– expensive arrays, controllers, specialized networks not required

● large scale (10s to 10,000s of nodes)– heterogenous hardware (no fork-lift upgrades)– incremental expansion (or contraction)

● dynamic cluster

Page 3: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

what is ceph?

● unified storage platform– scalable object + compute storage platform– RESTful object storage (e.g., S3, Swift)– block storage– distributed file system

● open source– LGPL server-side– client support in mainline Linux kernel

Page 4: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

RADOS – the Ceph object store

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS – the Ceph object store

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

LIBRADOS

A library allowingapps to directlyaccess RADOS,with support forC, C++, Java,Python, Ruby,and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APPAPP APPAPP HOST/VMHOST/VM CLIENTCLIENT

Page 5: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

conventional HA

reliable disk array

redundant head

access network

“clients stripe data across reliable things”

Page 6: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

distributed model

“client stripe across unreliable things”“servers coordinate replication, recovery”

access network

unreliable targets

backside network(optional)

Page 7: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

DISK

FS

DISK DISK

OSD

DISK DISK

OSD OSD OSD OSD

FS FS FSFS btrfsxfsext4zfs?

MMM

Page 8: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

1010 1010 0101 0101 1010 1010 0101 1111 0101 1010

hash(object name) % num pg

CRUSH(pg, cluster state, policy)

Page 9: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

1010 1010 0101 0101 1010 1010 0101 1111 0101 1010

Page 10: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

CLIENTCLIENT

??

Page 11: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with
Page 12: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with
Page 13: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

CLIENT

??

Page 14: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

LLlibrados

● direct access to RADOS from applications

● C, C++, Python, PHP, Java, Erlang● direct access to storage nodes● no HTTP overhead

Page 15: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

● efficient key/value storage inside an object● atomic single-object transactions

– update data, attr, keys together– atomic compare-and-swap

● object-granularity snapshot infrastructure● embed code in ceph-osd daemon via plugin API

– arbitrary atomic object mutations, processing● inter-client communication via object

rich librados API

Page 16: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

die, POSIX, die

● successful exascale architectures will replace or transcend POSIX– hierarchical model does not distribute

● line between compute and storage will blur– some processes is data-local, some is not

● fault tolerance will be first-class property of architecture– for both computation and storage

Page 17: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

POSIX – I'm not dead yet!

● CephFS builds POSIX namespace on top of RADOS– metadata managed by ceph-mds daemons– stored in objects

● strong consistency, stateful client protocol– heavy prefetching, embedded inodes

● architected for HPC workloads– distribute namespace across cluster of MDSs– mitigate bursty workloads– adapt distribution as workloads shift over time

Page 18: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

MM

MM

MM

CLIENTCLIENT

0110

0110

datametadata

Page 19: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

MM

MM

MM

Page 20: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

one tree

three metadata servers

??

Page 21: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with
Page 22: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with
Page 23: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with
Page 24: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with
Page 25: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

DYNAMIC SUBTREE PARTITIONING

Page 26: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

recursive accounting

● ceph-mds tracks recursive directory stats– file sizes – file and directory counts– modification time

● efficient

$ ls -alSh | headtotal 0drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomcephdrwxr-xr-x 1 mcg_test1 pg2419992 23G 2011-02-02 08:57 mcg_test1drwx--x--- 1 luko adm 19G 2011-01-21 12:17 lukodrwx--x--- 1 eest adm 14G 2011-02-04 16:29 eestdrwxr-xr-x 1 mcg_test2 pg2419992 3.0G 2011-02-02 09:34 mcg_test2drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzycephdrwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph

Page 27: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

snapshots

● snapshot arbitrary subdirectories● simple interface

– hidden '.snap' directory– no special tools

$ mkdir foo/.snap/one # create snapshot$ ls foo/.snapone$ ls foo/bar/.snap_one_1099511627776 # parent's snap name is mangled$ rm foo/myfile$ ls -F foobar/$ ls -F foo/.snap/onemyfile bar/$ rmdir foo/.snap/one # remove snapshot

Page 28: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

running ceph in lustre environments

● it's not ideal, but it's possible● ceph is not optimized for high end hardware

– redundancy from expensive arrays unnecessary– ceph replicates across unreliable servers– more disks, cheaper hardware

● ceph utilizes flash/NVRAM directly– write journal/buffer– usually present but buried inside disk array

Page 29: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

ORNL experiment

● tune ceph on lustre OSTs backed by DDN● started at 100MB/sec, ended at 5.5GB/sec

– net >11GB/sec w/ journaling– 12GB/sec max, so reached >90%

● double-writes– journal to one LUN, write to another

● IPoIB– no native IB support...yet

Page 30: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

slow march to respectable

● range of issues– IB, IPoIB configuration– misc DDN/SCSI tweaks– data on SAS, journals on SATA– reorganization of DDN RAID LUNs– tune OSD/node ratios– disabled cache mirroring on DDN controllers– disabled TCP autotuning– tune readahead

Page 31: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

how can you help?

● try ceph and tell us what you think– http://ceph.com/resources/downloads

● http://ceph.com/resources/mailing-list-irc/– ask if you need help

● ask your organization to start dedicating resources to the project http://github.com/ceph

● find a bug (http://tracker.ceph.com) and fix it● participate in our ceph developer summit

– http://ceph.com/events/ceph-developer-summit

Page 32: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

questions?

Page 33: an intro to ceph for hpcopensfs.org/wp-content/uploads/2013/04/Weil-Ceph-20130416-LUG.p… · 16/04/2013  · QEMU/KVM driver RBD A reliable and fully-distributed block device, with

thanks

sage weil

[email protected]

@liewegas

http://github.com/ceph

http://ceph.com/