Ceph Internals & Data Processing Capabilities (SUSECON)


Joao Eduardo Luis, Senior Software Engineer

OVERALL ARCHITECTURE

APP, HOST/VM, CLIENT

RGW: web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: client library allowing apps to access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: reliable, fully-distributed block device with cloud platform integration

CEPHFS: distributed file system with POSIX semantics and scale-out metadata management


RADOS CLUSTER

[Diagram: an APPLICATION on top of the RADOS CLUSTER, which includes five monitors (M)]

RADOS COMPONENTS

OSDs
‒ Smart storage
‒ Resilient, distributed, self-healing, etc.
‒ Hundreds to thousands per cluster

Monitors (M)
‒ Keep track of cluster state
‒ Always consistent, or otherwise...
‒ 3, 5, 7, ...
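The odd monitor counts listed above fit majority-based agreement: monitors can only make progress while a strict majority of them is reachable. The following is a minimal sketch of that arithmetic only. The helper names `quorum_size` and `tolerated_failures` are hypothetical and not part of any Ceph API.

```cpp
#include <cassert>
#include <cstddef>

// Smallest number of monitors that forms a strict majority of `total`.
// Hypothetical helper for illustration; not a Ceph function.
inline std::size_t quorum_size(std::size_t total) {
  assert(total > 0 && "a cluster needs at least one monitor");
  return total / 2 + 1;
}

// Number of monitors that may fail while a majority is still reachable.
inline std::size_t tolerated_failures(std::size_t total) {
  return total - quorum_size(total);
}
```

Under these assumptions 3 monitors tolerate one failure and 5 tolerate two, while going from 3 to 4 adds no tolerance, which is one way to read the "3, 5, 7, ..." on the slide.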

OBJECT STORAGE DAEMONS

[Diagram: four OSDs, each on top of a file system (FS) on its own DISK, alongside three monitors (M)]

WHERE IS MY OBJECT?

[Diagram: an APPLICATION with an OBJECT and a cluster with three monitors (M); where the object goes is marked ??]

METADATA SERVER?

[Diagram: the APPLICATION reaches its data in two numbered steps, 1 then 2; three monitors (M) shown]

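CALCULATED PLACEMENT?

[Diagram: the APPLICATION places object F by name range across nodes labelled A-G, H-N, O-T, U-Z; three monitors (M) shown; ?? marks the open question]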
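The name-range scheme in the diagram above can be sketched as a fixed lookup on the first letter of the object name. This is only an illustration of the naive scheme the slide questions, not of how Ceph places data. The function name `range_for` is hypothetical, and the four ranges are the ones shown on the slide.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Static range partitioning by the first letter of the object name.
// Illustrative only; hypothetical helper, not a Ceph function.
inline std::string range_for(const std::string& object_name) {
  assert(!object_name.empty());
  const char c = static_cast<char>(
      std::toupper(static_cast<unsigned char>(object_name[0])));
  if (c >= 'A' && c <= 'G') return "A-G";
  if (c >= 'H' && c <= 'N') return "H-N";
  if (c >= 'O' && c <= 'T') return "O-T";
  if (c >= 'U' && c <= 'Z') return "U-Z";
  return "??";  // names outside A-Z have no home in this fixed table
}
```

Here `range_for("F")` yields `"A-G"`. Because the table is fixed, adding or removing a node means redrawing the ranges and moving data, which is presumably what the ?? on the slide is pointing at.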

CRUSH

[Diagram: OBJECTS are grouped into PLACEMENT GROUPS (PGs), which are distributed across the RADOS CLUSTER]

CRUSH

[Diagram: a single OBJECT traced to its place in the RADOS CLUSTER]

CRUSH – Failure?

[Diagram: the same OBJECT and RADOS CLUSTER after a failure]
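The CRUSH slides show placement being computed rather than looked up, and raise the question of what happens on failure. The real CRUSH algorithm is hierarchical and weighted and is not reproduced here. As a stand-in, the sketch below uses rendezvous (highest-random-weight) hashing, a simpler technique that has two similar properties: any client can compute the location on its own, and losing a node only moves the data that lived on that node. All names (`mix`, `score`, `place`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 64-bit mixer (splitmix64 finalizer), used as a deterministic score.
inline std::uint64_t mix(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Score of one (placement group, OSD) pair.
inline std::uint64_t score(std::uint32_t pg, int osd) {
  return mix((static_cast<std::uint64_t>(pg) << 32) |
             static_cast<std::uint32_t>(osd));
}

// Rendezvous hashing: rank every OSD for this PG, keep the top `replicas`.
// NOT the CRUSH algorithm; a stand-in that shares the "compute it" idea.
inline std::vector<int> place(std::uint32_t pg, std::vector<int> osds,
                              std::size_t replicas) {
  assert(replicas <= osds.size());
  std::sort(osds.begin(), osds.end(), [pg](int a, int b) {
    return score(pg, a) > score(pg, b);
  });
  osds.resize(replicas);
  return osds;
}
```

Calling `place(pg, osds, 3)` before and after removing one OSD from `osds` returns the same surviving OSDs for every PG, plus one replacement where the failed OSD used to be.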

DATA ORGANIZED INTO POOLS

[Diagram: OBJECTS in the CLUSTER are grouped into POOLS (containing PGs): POOL A, POOL B, POOL C, POOL D]

POOLS

[Diagram: in the CEPH STORAGE CLUSTER, a replicated pool stores an OBJECT as three COPYs; an erasure coded pool stores an OBJECT as chunks 1 2 3 4 X Y]

REPLICATED POOL
‒ Full copies of stored objects
‒ Very high durability
‒ 3x (200% overhead)
‒ Quicker recovery

ERASURE CODED POOL
‒ One copy plus parity
‒ Cost-effective durability
‒ 1.5x (50% overhead)
‒ Expensive recovery
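The overhead figures above follow from simple arithmetic, sketched here. The erasure-coded case assumes 4 data chunks plus 2 coding chunks, which matches the 1.5x figure and the six chunks in the diagram. The slide does not name a profile, so treat `k = 4, m = 2` as an assumption. The helper names are hypothetical.

```cpp
#include <cassert>

// Raw bytes stored per byte of user data in a replicated pool.
inline double replicated_factor(unsigned copies) {
  assert(copies >= 1);
  return static_cast<double>(copies);
}

// Raw bytes stored per byte of user data with k data + m coding chunks.
inline double erasure_coded_factor(unsigned k, unsigned m) {
  assert(k >= 1);
  return static_cast<double>(k + m) / k;
}

// Extra storage beyond the user data itself, as a percentage.
inline double overhead_percent(double factor) {
  return (factor - 1.0) * 100.0;
}
```

With three copies the factor is 3x (200% overhead). With the assumed 4+2 layout it is 1.5x (50% overhead).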

LIBRADOS


EXAMPLE `HELLO WORLD!`

#include <rados/librados.hpp>

int main()
{
  // Connect to cluster
  librados::Rados rados;
  rados.init("admin");
  rados.conf_read_file(NULL);  // default ceph.conf; not on the original slide
  rados.connect();

  // Create pool
  rados.pool_create("hello_pool");

  librados::IoCtx ctx;
  rados.ioctx_create("hello_pool", ctx);

  // Atomic (re)write
  librados::bufferlist data;
  data.append("hello world!");
  ctx.write_full("hello_object", data);

  librados::bufferlist attr;
  attr.append("1");
  ctx.setxattr("hello_object", "version", attr);

  rados.shutdown();
  return 0;
}

LIBRADOS

[Diagram: APP with LIBRADOS, talking to three monitors (M)]

object "foo" -> hash -> 0x2d87c31 -> mod pg_num -> c31
pool "bar" -> id 2
result: pg 2.c31

pg 2.c31 + cluster state (osdmap, CRUSH hierarchy) -> CRUSH
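The object-to-PG step on this slide can be sketched as follows. The hash value 0x2d87c31 is taken from the slide rather than computed, since the hash function itself is not shown there. `pg_num = 4096` is an assumption that happens to reproduce the slide's result, and `pg_name` is a hypothetical helper. The plain modulo follows the slide's "mod pg_num".

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Build a PG name "<pool id>.<pg in hex>" from an object-name hash.
// Hypothetical helper; uses a plain modulo as written on the slide.
inline std::string pg_name(std::uint64_t pool_id, std::uint32_t name_hash,
                           std::uint32_t pg_num) {
  assert(pg_num > 0);
  std::ostringstream out;
  out << pool_id << '.' << std::hex << (name_hash % pg_num);
  return out.str();
}
```

Under these assumptions, `pg_name(2, 0x2d87c31, 4096)` gives `2.c31`, the placement group shown on the slide.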

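COMPOUND OBJECT OPERATIONS

#include <rados/librados.hpp>

int main()
{
  // Connect to cluster
  librados::Rados rados;
  rados.init("admin");
  rados.conf_read_file(NULL);  // default ceph.conf; not on the original slide
  rados.connect();

  // Create pool
  rados.pool_create("hello_pool");

  librados::IoCtx ctx;
  rados.ioctx_create("hello_pool", ctx);

  // Queue the write and the xattr update on one operation
  librados::ObjectWriteOperation op;

  librados::bufferlist data;
  data.append("hello world!");
  op.write_full(data);

  librados::bufferlist attr;
  attr.append("1");
  op.setxattr("version", attr);

  // Both updates are applied to the object as a single atomic operation
  ctx.operate("hello_object", &op);

  rados.shutdown();
  return 0;
}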

RADOS OBJECT CLASSES

[Diagram: APP with LIBRADOS calls put_foo(data) and read("bar", data); the object class my_foo.so provides put_foo() and calc_bar()]

EXAMPLE

Server-side class (hello_hash_class)

int compute_md5(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  // Find the object's size so the whole object can be read
  uint64_t size;
  int ret = cls_cxx_stat(hctx, &size, NULL);
  if (ret < 0)
    return ret;

  bufferlist data;
  ret = cls_cxx_read(hctx, 0, size, &data);
  if (ret < 0)
    return ret;

  // 16-byte MD5 digest (Crypto++-style API, as on the slide)
  byte digest[MD5::DIGESTSIZE];
  MD5().CalculateDigest(digest, (byte*)data.c_str(), data.length());

  out->append((const char*)digest, sizeof(digest));

  return 0;
}

librados client

bufferlist input, output;

ioctx.exec("hello_object", "hello_hash_class", "compute_md5", input, output);

Compound operations

ObjectReadOperation op;

uint64_t size;
time_t m_time;
op.stat(&size, &m_time, NULL);

bufferlist in, out;
op.exec("hello_hash_class", "compute_md5", in, &out, NULL);

// Run the stat and the class method together against the object
int r = ioctx.operate("hello_object", &op, NULL);

REAL APPLICATIONS

• Cooperative Locking

• Simple Object Reference Counting

• Image manipulation

• RADOS Block Device (RBD) & Gateway (RGW)

sources in src/cls/*

DYNAMIC OBJECT CLASSES IN LUA

• Noah Watkins (UCSC / Red Hat)
‒ http://ceph.com/rados/dynamic-object-interfaces-with-lua/

local script = [[
function say_hello(input, output)
  output:append("Hello, ")
  if #input == 0 then
    output:append("world")
  else
    output:append(input:str())
  end
  output:append("!")
end
cls.register(say_hello)
]]

local ret, outdata = clslua.exec(ioctx, "oid", script, "say_hello", "")
print(outdata)

local ret, outdata = clslua.exec(ioctx, "oid", script, "say_hello", "John")
print(outdata)

Thank you.

[email protected]@lists.ceph.com
#ceph / #ceph-devel @ OFTC
www.ceph.com
