Ceph Day Berlin: CEPH@Deutsche Telekom - A 2+ Years Production Liaison
TRANSCRIPT
CEPH@Deutsche Telekom - A 2+ Years Production Liaison
Ievgen Nelen, Gerd Prüßmann - Deutsche Telekom AG, DBU Cloud Services, P&I
07.05.2015
Speakers: Ievgen Nelen & Gerd Prüßmann

Ievgen Nelen
• Cloud Operations Engineer
• CEPH since Cuttlefish
• OpenStack since Diablo
• @eugene_nelen

Gerd Prüßmann
• Head of Platform Engineering
• CEPH since Argonaut
• OpenStack since Cactus
• @2digitsLeft
Overview: The Business Case
Overview: Business Marketplace
• https://portal.telekomcloud.com/
• SaaS applications from software partners (ISVs) and DT, offered to SME customers
• e.g. Saperion, Sage, PadCloud, Teamlike, Fastbill, Imeet, Weclapp, SilverERP, Teamdisk ...
• Complements other cloud offerings from Deutsche Telekom (Enterprise Cloud from T-Systems, Cisco Intercloud, Mediencenter etc.)
• IaaS platform based solely on Open Source technologies: OpenStack, CEPH and Linux
• Project started in 2012 on OpenStack Essex; CEPH in production since 3/2013 (Bobtail)
Overview: Why Open Source? Why CEPH?
• No vendor lock-in!
• Easier to change and adopt new technologies / concepts; more independent of vendor priorities
• Low cost of ownership and operation, utilizing commodity hardware and Open Source
• No license fees, but professional support is available
• Modular, horizontally scalable platform
• Automation and flexibility allow for faster deployment cycles than in traditional hosting
• Control over the source code: faster bug fixing and feature delivery
Details: Basics
Details: CEPH Basics
• Upgrade path: Bobtail → Cuttlefish → Dumpling → Firefly (0.80.9)
• Multiple CEPH clusters
• Overall raw capacity 4.8 PB
• One S3 cluster (~810 TB raw capacity, 15 storage nodes, 3 MONs)
• Multiple smaller RBD clusters for REF, LIFE and DEV
• S3 storage for cloud-native apps (Teamdisk, Teamlike) and for backups (e.g. of RBD volumes)
• RBD for persistent volumes / data via OpenStack Cinder (e.g. DB volumes)
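As a quick sanity check on these figures, n-way replication divides raw capacity by the replica count (the clusters here use 3 replicas, as noted later in the deck). A minimal sketch, ignoring filesystem and metadata overhead:

```python
def usable_tb(raw_tb, replicas=3):
    """Usable capacity under n-way replication, ignoring overhead."""
    return raw_tb / replicas

# The ~810 TB raw S3 cluster yields roughly 270 TB of usable capacity.
print(usable_tb(810))
```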
Details: CEPH Basics
Details: Hardware
Details: Hardware
• Supermicro: 2x Intel Xeon E5-2640 v2 @ 2.00 GHz, 64 GB RAM, 7x SSDs, 18x HDDs
• Seagate Terascale ST4000NC000 4 TB HDDs
• LSI MegaRAID SAS 9271-8i
• 18 OSDs per node: RAID1 with 2 SSDs for /, 3x RAID0 with 1 SSD each for journals, 18x RAID0 with 1 HDD each for OSDs
• 2x 10 Gb network adapters
Details: Hardware
• Supermicro: 1x Intel Xeon E5-2650L @ 1.80 GHz, 64 GB RAM, 36x HDDs
• Seagate Barracuda ST3000DM001 3 TB HDDs
• LSI MegaRAID SAS 9271-8i
• 10 OSDs per node: RAID1 for /, 10x RAID0 with 1 HDD each for journals, 10x RAID0 with 2 HDDs each for OSDs
• 2x 10 Gb network adapters
Details: Configuration & Deployment
Details: Configuration & Deployment
• Razor
• Puppet
• https://github.com/TelekomCloud/puppet-ceph
• dm-crypt disk encryption
• OSD location
• XFS
• 3 replicas
• OMD/Check_mk http://omdistro.org/
• ceph-dash https://github.com/TelekomCloud/ceph-dash for dashboard and API
• check_mk plugins (cluster health, OSDs, S3)
Details: Performance Tuning
Details: Performance Tuning
• Problem: low IOPS, IOPS drops
• Benchmarked with fio
• Fix: enable RAID0 writeback cache
• Fix: use separate disks for CEPH journals (better: use SSDs, as planned in the scale-out project)
• Problem: recovery/backfilling consumes a lot of CPU, decreasing client performance
• osd_recovery_max_active = 1 (number of active recovery requests per OSD at one time)
• osd_max_backfills = 1 (maximum number of backfills allowed to or from a single OSD)
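The two recovery throttles above belong in the `[osd]` section of ceph.conf; a minimal fragment with the values from the slide:

```ini
[osd]
# Limit concurrent recovery/backfill work so client IO keeps priority
osd recovery max active = 1
osd max backfills = 1
```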
Details: Performance Tests - Current Hardware / IO
Details: Performance Tests - Current Hardware / Bandwidth
Lessons Learned
Lessons Learned: Operational Experience
• Choose your hardware well!
• e.g. RAID controllers and hard disks: use enterprise-grade disks (desktop HDDs are missing important features like TLER/ERC)
• CPU/RAM planning: calculate 1 GHz of CPU power and 2 GB RAM per OSD
• Pick nodes with low storage capacity density for smaller clusters
• At least 5 nodes for a 3-replica cluster (e.g. for PoC, testing and development purposes)
• Cluster configuration "adjustments": increasing PG num has a big impact on the cluster because of massive data migration
• Rolling software updates / upgrades worked perfectly
• CEPH has a character, but is highly reliable: we never lost data
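The CPU/RAM planning rule above can be sketched as a quick calculation; the per-OSD figures are from the slide, the 18-OSD example node matches the S3 cluster hardware described earlier:

```python
# Rule of thumb from the talk: ~1 GHz of CPU and ~2 GB of RAM per OSD.
def node_requirements(osds_per_node, ghz_per_osd=1.0, ram_gb_per_osd=2.0):
    """Return (total GHz, total GB RAM) a storage node should provide for its OSDs."""
    return osds_per_node * ghz_per_osd, osds_per_node * ram_gb_per_osd

# An 18-OSD node needs roughly 18 GHz of aggregate CPU and 36 GB of RAM
# for the OSD daemons alone.
cpu_ghz, ram_gb = node_requirements(18)
print(cpu_ghz, ram_gb)
```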
Lessons Learned: Operational Experience
• Failed / "slow" disks
• Inconsistent PGs
• Incomplete PGs
• RBD pool configured with min_size=2
• A PG dropping below min_size blocks IO operations to the pool / cluster
• Fixed in Hammer (allows PG recovery while the replica count is below min_size)
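The min_size behavior described above can be illustrated with a tiny sketch (pure illustration of the rule, not Ceph code):

```python
def io_allowed(active_replicas, min_size=2):
    """A PG accepts client IO only while at least min_size replicas are active."""
    return active_replicas >= min_size

# With size=3 and min_size=2: losing one replica is fine,
# losing two blocks IO to the PG until recovery restores a replica.
print(io_allowed(3))  # True
print(io_allowed(2))  # True
print(io_allowed(1))  # False
```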
/var/log/syslog.log:
Apr 12 04:59:47 cephosd5 kernel: [12473860.669262] sd 6:2:10:0: [sdk] Unhandled error code

root@cephosd5:/var/log# mount | grep sdk
/dev/mapper/cephosd5-journal-sdk on /var/lib/ceph/osd/journal-disk9

root@cephosd5:/var/log# grep journal-disk9 /etc/ceph/ceph.conf
osd journal = /var/lib/ceph/osd/journal-disk9/osd.151-journal

/var/log/ceph/ceph-osd.151.log.1.gz:
2015-04-12 04:59:47.891284 7f8a10c76700 -1 journal FileJournal::do_write: pwrite(fd=25, hbp.length=4096) failed: (5) Input/output error
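The debugging chain in the log excerpt (failed device → journal mountpoint → OSD id) can be scripted; a minimal sketch using the lines from the slide (the helper names are made up for the example):

```python
import re

MOUNT_LINE = "/dev/mapper/cephosd5-journal-sdk on /var/lib/ceph/osd/journal-disk9"
CONF_LINE = "osd journal = /var/lib/ceph/osd/journal-disk9/osd.151-journal"

def journal_mountpoint(mount_line):
    """Extract the mountpoint from a `mount | grep <dev>` output line."""
    return mount_line.split(" on ")[1].split()[0]

def osd_id_for_journal(conf_line):
    """Pull the OSD id out of a ceph.conf journal path like .../osd.151-journal."""
    m = re.search(r"osd\.(\d+)-journal", conf_line)
    return int(m.group(1)) if m else None

print(journal_mountpoint(MOUNT_LINE))  # /var/lib/ceph/osd/journal-disk9
print(osd_id_for_journal(CONF_LINE))   # 151
```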
Lessons Learned: Operational Experience
Lessons Learned: Incomplete PGs - What Happened?
[Diagram: three OSD nodes, each with OSDs, their journals, and the PGs placed on them]
Glimpse of the Future
Overview: Scale-Out Project
+40%

Current overall capacity:
• ~60 storage nodes
• 5.4 PB storage gross
• ~0.5 PB S3 storage net

Planned capacity for 2015:
• ~90 storage nodes
• 7.5 PB storage gross
• ~1.5 PB S3 storage net
Future Setup: Scale-Out Project
• 2 physically separated rooms
• Data distributed according to the rule:
• Not more than 2 replicas in one room
• Not more than 1 replica in one rack
Future Setup: New Crushmap Rules
rule myrule {
ruleset 3
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type rack
step emit
}
crushtool -i real7 --test --show-statistics --rule 3 --min-x 1 --max-x 1024 --num-rep 3 --show-mappings
CRUSH rule 3 x 1 [12,19,15]
CRUSH rule 3 x 2 [14,16,13]
CRUSH rule 3 x 3 [3,0,7]
…

Listing 1: crushmap rule. Listing 2: simulate placement of 1024 objects.
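The intent of the rule (at most 2 replicas per room, at most 1 per rack) can be checked against a placement with a small sketch; the OSD-to-rack-to-room topology below is made up for the example, not the real cluster map:

```python
from collections import Counter

# Hypothetical topology: osd id -> (room, rack)
TOPOLOGY = {
    0: ("room1", "rack1"), 1: ("room1", "rack1"),
    2: ("room1", "rack2"), 3: ("room2", "rack3"),
    4: ("room2", "rack4"), 5: ("room2", "rack4"),
}

def placement_ok(osds, topo=TOPOLOGY):
    """Check the rule's intent: <= 2 replicas per room, <= 1 replica per rack."""
    rooms = Counter(topo[o][0] for o in osds)
    racks = Counter(topo[o][1] for o in osds)
    return max(rooms.values()) <= 2 and max(racks.values()) <= 1

print(placement_ok([0, 2, 3]))  # rooms 2+1, all racks distinct -> True
print(placement_ok([0, 1, 3]))  # osd 0 and 1 share rack1 -> False
```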
Future Setup: Dreams
• Cache tiering
• Make use of shiny new SSDs in a hot zone / cache pool
• SSD pools
• OpenStack live migration for VMs (boot from RBD volume)
Q & A
Questions & Answers
• Ievgen Nelen
• @eugene_nelen
• Gerd Prüßmann
• @2digitsLeft