Ceph Day Amsterdam 2015 – Building your own disaster? The safe way to make Ceph storage ready!
The safe way to make Ceph storage enterprise ready!
Build your own disaster?
Copyright 2015 FUJITSU
Dieter Kasper, CTO Data Center Infrastructure, Emerging Technologies & Solutions, Global Delivery
2015-03-31
1
■ The safe way to make Ceph storage enterprise ready
■ ETERNUS CD10k integrated in OpenStack
■ mSHEC Erasure Code from Fujitsu
■ Contribution to performance enhancements
2
Building Storage with Ceph looks simple
Ceph
+ some servers
+ network
= storage
3
Building Storage with Ceph looks simple – but…
Many new complexities:
■ Right-sizing servers, disk types, network bandwidth
■ Silos of management tools (HW, SW, …)
■ Keeping Ceph versions in sync with versions of server HW, OS, connectivity, drivers
■ Management of maintenance and support contracts of components
■ Troubleshooting
Build Ceph open source storage yourself
4
The challenges of software defined storage
■ What users want:
  • Open standards
  • High scalability
  • High reliability
  • Lower costs
  • No vendor lock-in
■ What users may get:
  • A self-developed storage system based on open / industry-standard HW & SW components
  • High scalability and reliability? If the stack works!
  • Lower investments but higher operational efforts
  • Lock-in to their own stack
5
ETERNUS CD10000 – Making Ceph enterprise ready
Build Ceph open source storage yourself → out of the box: ETERNUS CD10000, incl. support and maintenance
ETERNUS CD10000 combines open source storage with enterprise-class quality of service:
■ E2E solution contract by Fujitsu, based on Red Hat Ceph Enterprise
■ Easy deployment / management by Fujitsu
■ Lifecycle management for hardware & software by Fujitsu
6
Fujitsu Maintenance, Support and Professional Services
ETERNUS CD10000: A complete offer
7
Unlimited Scalability
■ Cluster of storage nodes
■ Capacity and performance scale by adding storage nodes
■ Three different node types enable differentiated service levels:
  • Density / capacity optimized
  • Performance optimized
  • Optimized for small-scale dev & test
■ The 1st version of CD10000 (Q3/2014) is released for a range of 4 to 224 nodes
■ Scales up to >50 Petabyte
Node types: Basic node 12 TB · Performance node 35 TB · Capacity node 252 TB
8
Immortal System
[Diagram: Node 1, Node 2, … Node (n) – adding nodes, and adding nodes with a new generation of hardware]
■ Non-disruptive add / remove / exchange of hardware (disks and nodes)
■ Mix of nodes of different generations, online technology refresh
■ Very long lifecycle reduces migration efforts and costs
9
TCO optimized
■ Based on x86 industry-standard architectures
■ Based on open source software (Ceph)
■ High-availability and self-optimizing functions are part of the design at no extra cost
■ Highly automated and fully integrated management reduces operational efforts
■ Online maintenance and technology refresh reduce the costs of downtime dramatically
■ Extremely long lifecycle delivers investment protection
■ End-to-end design and maintenance from Fujitsu reduce evaluation, integration, and maintenance costs
Better service levels at reduced costs – business centric storage
10
One storage – seamless management
■ ETERNUS CD10000 delivers one seamless management for the complete stack:
  • Central Ceph software deployment
  • Central storage node management
  • Central network management
  • Central log file management
  • Central cluster management
  • Central configuration, administration and maintenance
  • SNMP integration of all nodes and network components
11
Seamless management (2)
■ Dashboard – overview of cluster status
■ Server management – management of cluster hardware: add/remove server (storage node), replace storage devices
■ Cluster management – management of cluster resources: cluster and pool creation
■ Monitoring the cluster – overall capacity, pool utilization, status of OSD, Monitor, and MDS processes, Placement Group status, and RBD status
■ Managing OpenStack interoperation – connection to the OpenStack server, and placement of pools in the Cinder multi-backend
12
Optional use of Calamari Management GUI
13
Example: Replacing an HDD
■ Plain Ceph (a rough shell sketch of these steps follows below):
  • take the failed disk offline in Ceph
  • take the failed disk offline on OS / controller level
  • identify the (right) hard drive in the server
  • exchange the hard drive
  • partition the hard drive on OS level
  • make and mount the file system
  • bring the disk up in Ceph again
■ On ETERNUS CD10000:
  • vsm_cli <cluster> replace-disk-out <node> <dev>
  • exchange the hard drive
  • vsm_cli <cluster> replace-disk-in <node> <dev>
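To make the difference concrete, here is a minimal sketch of the plain-Ceph procedure, assuming the failed disk backs osd.12 and the replacement comes up as /dev/sdX; the OSD id, device name and the pre-systemd service call are illustrative placeholders, not values from the slides.

    # Take the failed OSD out of the data distribution and stop its daemon
    ceph osd out 12
    sudo service ceph stop osd.12
    # Remove it from the CRUSH map, the auth database and the OSD map
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12

    # ... physically replace the drive, then re-create the OSD on the new disk ...
    sudo ceph-disk prepare /dev/sdX     # partition the disk and create the filesystem
    sudo ceph-disk activate /dev/sdX1   # register and start the new OSD; data backfills automatically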
14
Example: Adding a Node
■ Plain Ceph (see the ceph-deploy sketch after this list):
  • install hardware
  • install OS
  • configure OS
  • partition disks (OSDs, journals)
  • make filesystems
  • configure network
  • configure ssh
  • configure Ceph
  • add node to cluster
■ On ETERNUS CD10000:
  • install hardware – the hardware will automatically PXE boot and install the current cluster environment including the current configuration
  • make the node available to the GUI
  • add the node to the cluster with a mouse click in the GUI
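For plain Ceph, once OS, network and ssh are prepared by hand, the remaining steps are typically driven with ceph-deploy from the admin node; the hostname and device names below are assumptions for illustration.

    # Install the Ceph packages on the new host and push the cluster configuration
    ceph-deploy install node5
    ceph-deploy admin node5

    # Create one OSD per data disk, with its journal on an SSD partition
    ceph-deploy osd create node5:/dev/sdb:/dev/sdg1
    ceph-deploy osd create node5:/dev/sdc:/dev/sdg2

    # Watch the cluster rebalance onto the new node
    ceph -w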
15
Seamless management drives productivity
■ Manual Ceph installation:
  • Setting up a 4-node Ceph cluster with 15 OSDs: 1.5–2 admin days
  • Adding an additional node: 3 admin hours up to half a day
■ Automated installation through ETERNUS CD10000:
  • Setting up a 4-node Ceph cluster with 15 OSDs: 1 hour
  • Adding an additional node: 0.5 hours
16
Adding and Integrating Apps
■ The ETERNUS CD10000 architecture enables the integration of apps
■ Fujitsu is working with customers and software vendors to integrate selected storage apps
■ E.g. archiving, sync & share, data discovery, cloud apps …
[Architecture diagram: cloud services, sync & share, and archive / iRODS / data discovery apps sit on top of ETERNUS CD10000, which provides object, block, and file level access, central management, the Ceph storage system software with Fujitsu extensions, a 10 GbE frontend network, a fast interconnect network, and performance and capacity nodes]
17
ETERNUS CD10000 at University Mainz
■ Large university in Germany
■ Uses the iRODS application for library services
  • iRODS is an open-source data management software in use at research organizations and government agencies worldwide
  • Organizes and manages large depots of distributed digital data
■ The customer has built an interface from iRODS to Ceph
■ Stores raw data of measurement instruments (e.g. research in chemistry and physics) for 10+ years, meeting EU compliance rules
■ Needs to provide extensive and rapidly growing data volumes online at reasonable costs
■ Will implement a sync & share service on top of ETERNUS CD10000
18
How ETERNUS CD10000 supports cloud biz
Cloud IT Trading Platform
■ A European provider operates a trading platform for cloud resources (CPU, RAM, storage)
Cloud IT Resources Supplier
■ The Darmstadt data center (DARZ) offers storage capacity via the trading platform
■ Using ETERNUS CD10000 to provide storage resources for unpredictable demand
19
Summary ETERNUS CD10k – Key Values
ETERNUS CD10000 – the new unified storage:
■ Unlimited Scalability
■ TCO optimized
■ Immortal System
■ Zero Downtime
ETERNUS CD10000 combines open source storage with enterprise–class quality of service
20
■ The safe way to make Ceph storage enterprise ready
■ ETERNUS CD10k integrated in OpenStack
■ mSHEC Erasure Code from Fujitsu
■ Contribution to performance enhancements
21
What is OpenStack
Free open source (Apache license) software governed by a non-profit foundation (corporation) with a mission to produce the ubiquitous open source cloud computing platform that will meet the needs of public and private clouds regardless of size, by being simple to implement and massively scalable.
(OpenStack Foundation member tiers: Platinum, Gold, Corporate sponsors, …)
■ Massively scalable cloud operating system that controls large pools of compute, storage, and networking resources
■ Community OSS with contributions from 1000+ developers and 180+ participating organizations
■ Open web-based API – programmatic IaaS
■ Plug-in architecture: allows different hypervisors, block storage systems, network implementations; hardware agnostic, etc.
http://www.openstack.org/foundation/companies/
22
OpenStack Summit in Paris Nov.2014
OpenStack Momentum
■ Impressively demonstrated at the OpenStack Summit: more than 5,000 participants from 60+ countries, high-profile companies from all industries – e.g. AT&T, BBVA, BMW, CERN, Expedia, Verizon – sharing their experience and plans around OpenStack
■ OpenStack @ BMW: replacement of a self-built IaaS cloud; covers a pool of x,000 VMs; rapid growth planned; the system is up & running but currently used productively by selected departments only
■ OpenStack @ CERN: in production since July 2013; 4 operational IaaS clouds, the largest one with 70k cores on 3,000 servers; expected to pass 150k cores by Q1/2015
23
Attained fast growing customer interest
■ VMware clouds dominate
■ OpenStack clouds already #2
■ Worldwide adoption
Source: OpenStack User Survey and Feedback Nov 3rd 2014
Source: OpenStack User Survey and Feedback May 13th 2014
24
Why are Customers so interested?
Source: OpenStack User Survey and Feedback Nov 3rd 2014
Greatest industry & community support compared to alternative open platforms:
Eucalyptus, CloudStack, OpenNebula
“Ability to Innovate” jumped from #6 to #1
25
OpenStack.org User Survey Paris: Nov. 2014
26
OpenStack Cloud Layers
OpenStack and ETERNUS CD10000
[Layer diagram: on top, the OpenStack Cloud APIs with the Dashboard (Horizon) and a Billing Portal; below, the OpenStack services – Compute (Nova), Network (Neutron) + plugins, Volume (Cinder), Object (Swift), Images (Glance), Metering (Ceilometer), EC2 API, Manila (File), Authentication (Keystone); the hypervisor layer (KVM, ESXi, Hyper-V); the base operating system (CentOS) with OAM / DHCP / deployment / lifecycle management; and the physical servers (CPU, memory, SSD, HDD) and network. Fujitsu Open Cloud Storage (ETERNUS CD10000) plugs in underneath via RADOS, serving Block (RBD), S3 (Rados-GW), and File (CephFS).]
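As an illustration of how Cinder and Glance are typically wired to a Ceph/RBD backend (the generic upstream procedure, not CD10000-specific; pool names, PG counts and client names below are common conventions rather than values from the slides):

    # Create dedicated pools for Cinder volumes and Glance images
    ceph osd pool create volumes 128
    ceph osd pool create images 128

    # Create the cephx clients that Cinder and Glance will use
    ceph auth get-or-create client.cinder mon 'allow r' \
      osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images'
    ceph auth get-or-create client.glance mon 'allow r' \
      osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'

    # cinder.conf then points a backend at the pool, roughly:
    #   volume_driver   = cinder.volume.drivers.rbd.RBDDriver
    #   rbd_pool        = volumes
    #   rbd_user        = cinder
    #   rbd_secret_uuid = <libvirt secret uuid>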
27
The OpenStack – Ceph Ecosystem @Work
[Diagram: an OpenStack cloud controller and several OpenStack compute nodes all use one Ceph storage cluster. A VM template is created and stored (with replicas) in the cluster; production VMs are created from snapshots/clones of that template, used and moved by the compute nodes, and likewise stored as replicas in the cluster.]
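The snapshot/clone step in this picture maps onto standard RBD copy-on-write cloning; a minimal sketch, assuming hypothetical pool and image names (images/vm-template, volumes/prod-vm-01):

    # Import a golden image as a format-2 RBD image (format 2 is required for cloning)
    rbd import --image-format 2 template.img images/vm-template

    # Snapshot the template and protect the snapshot so clones can reference it
    rbd snap create images/vm-template@base
    rbd snap protect images/vm-template@base

    # Create a copy-on-write clone for a production VM; only changed blocks consume space
    rbd clone images/vm-template@base volumes/prod-vm-01
    rbd ls volumes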
28
■ The safe way to make Ceph storage enterprise ready
■ ETERNUS CD10k integrated in OpenStack
■ mSHEC Erasure Code from Fujitsu
■ Contribution to performance enhancements
29
Backgrounds (1)
■ Erasure codes for content data
  • Content data for ICT services is ever-growing
  • Demand for higher space efficiency and durability
  • Reed Solomon code (the de facto erasure code) improves both
[Figure: triple replication (old style) stores the content data plus two copies and needs 3x space; Reed Solomon code stores the content data plus parities and needs only 1.5x space. However, Reed Solomon code is not so recovery-efficient.]
30
Backgrounds (2)
■ Local parity improves recovery efficiency
  • Data recovery should be as efficient as possible, in order to avoid multiple disk failures and data loss
  • Reed Solomon code was improved by local parity methods: data read from disks is reduced during recovery
[Figure: compared with Reed Solomon code (no local parities), a local parity method reduces the data read from disks when recovering data and parity chunks. However, multiple disk failures are not considered.]
31
Our Goal
■ A local parity method for multiple disk failures
  • Existing methods are optimized for a single disk failure, e.g. Microsoft MS-LRC, Facebook Xorbas
  • However, their recovery overhead is large in case of multiple disk failures, because they may then have to use global parities for recovery
■ Our goal is a method that efficiently handles multiple disk failures
32
Our Proposal Method (SHEC)
■ SHEC (= Shingled Erasure Code): an erasure code with only local parity groups, to improve recovery efficiency in case of multiple disk failures
■ The calculation ranges of the local parities are shifted and partly overlap with each other (like the shingles on a roof), to keep enough durability
■ Parameters: k = data chunks (= 10), m = parity chunks (= 6), l = calculation range (= 5)
33
SHEC's Implementation on Ceph
■ SHEC is implemented as an erasure code plugin of Ceph, an open source scalable object storage
■ 4 MB objects are split into data/parity chunks and distributed over OSDs
■ The encode/decode logic (the SHEC plugin) is separated from the main part of the Ceph storage code
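Once the plugin is available, a SHEC-coded pool is created through the usual erasure-code-profile mechanism. A minimal sketch follows; note that the upstream plugin is configured via k, m and a durability estimator c rather than the calculation range l used on these slides, so the values below (c=3, roughly corresponding to l=5 for k=10, m=6) and the pool/PG numbers are illustrative assumptions.

    # Define an erasure-code profile that uses the SHEC plugin
    ceph osd erasure-code-profile set shec-demo \
        plugin=shec k=10 m=6 c=3

    # Inspect the resulting profile
    ceph osd erasure-code-profile get shec-demo

    # Create an erasure-coded pool backed by that profile
    ceph osd pool create shec-pool 256 256 erasure shec-demo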
34
Summary mSHEC
1. mSHEC is more adjustable than Reed Solomon code, because SHEC provides many recovery-efficient layouts, including Reed Solomon codes
2. mSHEC's recovery time was ~20% faster than Reed Solomon code in case of double disk failures
3. The mSHEC erasure code was added to Ceph v0.93 (the pre-Hammer release)
4. For more information see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu
35
■ The safe way to make Ceph storage enterprise ready
■ ETERNUS CD10k integrated in OpenStack
■ mSHEC Erasure Code from Fujitsu
■ Contribution to performance enhancements
36
Areas to improve Ceph performance
Ceph has adequate performance today, but there are performance issues which prevent us from taking full advantage of our hardware resources.
Two main goals for improvement:
(1) Decrease latency in the Ceph code path
(2) Enhance large cluster scalability with many nodes / OSDs
37
Turn Around Time of a single Write IO
38
1. LTTng general – http://lttng.org/
General
■ open source tracing framework for Linux
■ traces the Linux kernel and user space applications
■ low overhead and therefore usable on production systems
■ tracing can be activated at runtime
■ the Ceph code contains LTTng trace points already
Our LTTng based profiling
■ activate within a function, collect timestamp information at the interesting places
■ save the collected information in a single trace point at the end of the function
■ transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points
■ focused on primary and secondary (replication) write operations
39
2. Test setup
Ceph cluster
■ 3 storage nodes: 2 CPU sockets, 8 cores per socket, Intel E5-2640, 2.00 GHz, 128 GB memory
■ 12 OSDs: 4 OSDs per storage node (SAS disks), journals on raw SSD partitions
■ CentOS 6.6, Linux 3.10.32, Ceph v0.91, storage pools with replication 3
Ceph client
■ 2 CPU sockets, 6 cores per socket, Intel E5-2630, 2.30 GHz, 192 GB memory
■ CentOS 6.6, Linux 3.10.32
■ Ceph kernel client (rbd.ko + libceph.ko)
Test program
■ fio 2.1.10
■ randwrite, 4 kByte buffer size, libaio / iodepth 16
■ each test writes 1 GByte of data (or 262144 I/O requests)
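The fio job described above corresponds roughly to the following invocation; the target device (/dev/rbd0, i.e. the kernel RBD client's block device) and the job name are assumptions for illustration.

    # 4 KiB random writes through libaio, queue depth 16, 1 GiB total (262144 requests)
    fio --name=rbd-randwrite \
        --filename=/dev/rbd0 \
        --rw=randwrite --bs=4k \
        --ioengine=libaio --iodepth=16 \
        --direct=1 --size=1g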
40
3. LTTng trace session
■ The Ceph cluster is up and running: ceph-osd binaries from the standard packages
■ stop one ceph-osd daemon
■ restart it with a ceph-osd binary that includes the LTTng based profiling
■ wait until the cluster is healthy again
■ start an LTTng session
■ run the fio test
■ stop the LTTng session
■ collect the trace data and evaluate it
Typical sample size on the OSD under test:
■ 22,000 primary writes (approx. 262144 / 12)
■ 44,000 replication writes (approx. (262144 * 2) / 12)
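The LTTng part of that session boils down to a few commands; the session name and the userspace provider/event pattern ('osd:*') are assumptions about how the added trace points are named.

    # Create a tracing session and enable the userspace trace points of the instrumented ceph-osd
    lttng create ceph-osd-profiling
    lttng enable-event --userspace 'osd:*'

    lttng start
    # ... run the fio workload against the cluster ...
    lttng stop

    # Dump the collected events as text for offline evaluation, then tear the session down
    lttng view > ceph-osd-trace.txt
    lttng destroy ceph-osd-profiling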
41
4.1. LTTng data evaluation: Replication Write
42
4.1. LTTng data evaluation: Replication Write
Observations:
■ replication write latency suffers from the "large variance problem": minimum and average differ by a factor of 2
■ this is a common problem visible for many ceph-osd components – why is the variance so large?
■ no single hotspot is visible
■ active processing steps do not differ between the minimum and the average sample as much as the total latency does
■ the additional latency penalty occurs mostly at the switch from sub_op_modify_commit to Pipe::writer
■ no indication that queue length is the cause
Question: can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the Pipe::writer thread?
43
4.1.1 LTTng Microbenchmark Pipe::reader
� "decode": fill message MSG_OSD_SUBOP data structure from bytes in the input buffer. There is no decoding of the data buffer!
Optimizations:� "decode": a project currently restructures some messages to decrease the effort for
message encoding and decoding. � "authenticate": is currently optimized, too. Disable via "cephx sign messages"
44
4.1.2 LTTng Microbenchmark Pipe::writer
� "message setup": buffer allocation and encoding of message structure� "enqueue": enqueue at low level socket layer (not quite sure whether this really
covers the write/sendmsg system call to the socket)
45
4.1.2 LTTng Primary Write
46
4.1.2 LTTng Primary Write
47
5. Thread classes and ceph-osd CPU usage
The number of threads per ceph-osd depends on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each ≈ 700 threads per node; 9 nodes with 40 OSDs each > 100k threads per node
■ ThreadPool::WorkThread is a hot spot = the work in the ObjectStore / FileStore
Total CPU usage during the test: 43.17 CPU seconds; breakdown by thread class (CPU seconds, share of total):
Pipe::Writer 4.59 10.63%
Pipe::Reader 5.81 13.45%
ShardedThreadPool::WorkThreadSharded 8.08 18.70%
ThreadPool::WorkThread 15.56 36.04%
FileJournal::Writer 2.41 5.57%
FileJournal::WriteFinisher 1.01 2.33%
Finisher::finisher_thread_entry 2.86 6.63%
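The per-daemon thread count is easy to verify on a storage node; a small sketch using standard Linux tooling only (no Ceph-specific commands involved):

    # One line per running ceph-osd daemon: PID, number of threads (NLWP), command line
    ps -o pid,nlwp,cmd -C ceph-osd

    # Or, equivalently, read the thread count from /proc for each daemon
    for pid in $(pidof ceph-osd); do
        echo -n "ceph-osd pid $pid: "
        grep '^Threads:' /proc/$pid/status
    done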
48
5.1. FileStore benchmarking
■ Most of the work is done in FileStore::do_transactions
■ Each write transaction consists of:
  • 3 calls to omap_setkeys
  • the actual call to write to the file system
  • 2 calls to setattr
■ Proposal: coalesce the calls to omap_setkeys – 1 function call instead of 3, setting 5 key/value pairs instead of 6 (one duplicate key)
49
5.2. FileStore with coalesced omap_setkeys
50
6. With our omap_setkeys coalescing patch
■ Reduced latency in ThreadPool::WorkThread by 54 microseconds = 25%
■ Significant reduction of CPU usage at the ceph-osd: 9% for the complete ceph-osd
■ Approx. 5% better performance at the Ceph client
Total CPU usage during the test: 43.17 CPU seconds (before) vs. 39.33 CPU seconds (with the patch); breakdown by thread class (CPU seconds, share – before vs. with the patch):
Pipe::Writer 4.59 10.63% 4.73 12.02%
Pipe::Reader 5.81 13.45% 5.91 15.04%
ShardedThreadPool::WorkThreadSharded 8.08 18.70% 7.94 20.18%
ThreadPool::WorkThread 15.56 36.04% 12.45 31.66%
FileJournal::Writer 2.41 5.57% 2.44 6.22%
FileJournal::WriteFinisher 1.01 2.33% 1.03 2.61%
Finisher::finisher_thread_entry 2.86 6.63% 2.76 7.01%
51
Summary on Performance
Two main goals for improvement:
(1) Decrease latency in the Ceph code path
(2) Enhance large cluster scalability with many nodes / OSDs
■ There is a long path to improve the overall Ceph performance.
■ Many steps are necessary to get a factor of 2; the current performance work focuses on (1), decreasing latency.
■ To get an order-of-magnitude improvement on (2) we have to master the limits bound to the overall OSD design:
  • transaction structure bound across multiple objects
  • PG omap data with high-level state logging
52
Overall Summary and Conclusion
53
Summary and Conclusion
■ ETERNUS CD10k is the safe way to make Ceph enterprise ready
  • Unlimited scalability: 4 to 224 nodes, scales up to >50 Petabyte
  • Immortal system with zero downtime: non-disruptive add / remove / exchange of hardware (disks and nodes) or software updates
  • TCO optimized: highly automated and fully integrated management reduces operational efforts
  • Tight integration in OpenStack with its own GUI
■ Fujitsu's mSHEC technology (integrated in Hammer) shortens recovery time by ~20% compared to Reed Solomon code
■ We love Ceph! But love is not blind, so we actively contribute to the performance analysis and to code/performance improvements.
54
55
Fujitsu Technology Solutions
Dieter.Kasper@ts.fujitsu.com