Ceph Day Berlin: CEPH@Deutsche Telekom - A 2+ Years Production Liaison
TRANSCRIPT
CEPH@Deutsche Telekom - A 2+ Years Production Liaison
Ievgen Nelen, Gerd Prüßmann - Deutsche Telekom AG, DBU Cloud Services, P&I
07.05.2015
Speakers: Ievgen Nelen & Gerd Prüßmann

Ievgen Nelen
• Cloud Operations Engineer
• CEPH since Cuttlefish
• OpenStack since Diablo
• @eugene_nelen

Gerd Prüßmann
• Head of Platform Engineering
• CEPH since Argonaut
• OpenStack since Cactus
• @2digitsLeft
Overview: The Business Case
Overview: Business Marketplace
• https://portal.telekomcloud.com/
• SaaS applications from software partners (ISVs) and DT, offered to SME customers
• e.g. Saperion, Sage, PadCloud, Teamlike, Fastbill, Imeet, Weclapp, SilverERP, Teamdisk ...
• Complements other cloud offerings from Deutsche Telekom (Enterprise Cloud from T-Systems, Cisco Intercloud, Mediencenter etc.)
• IaaS platform based solely on Open Source technologies: OpenStack, CEPH and Linux
• Project started in 2012 on OpenStack Essex; CEPH in production since 3/2013 (Bobtail)
Overview: Why Open Source? Why CEPH?
• No vendor lock-in!
• Easier to change and adopt new technologies / concepts; more independent of vendor priorities
• Low cost of ownership and operation, utilizing commodity hardware and Open Source
• No license fees, but professional support is available
• Modular, horizontally scalable platform
• Automation and flexibility allow for faster deployment cycles than in traditional hosting
• Control over the source code: faster bug fixing and feature delivery
Details: Basics
Details: CEPH Basics
• Upgrade path: Bobtail → Cuttlefish → Dumpling → Firefly (0.80.9)
• Multiple CEPH clusters
• Overall raw capacity 4.8 PB
• One S3 cluster (~810 TB raw capacity, 15 storage nodes, 3 MONs)
• Multiple smaller RBD clusters for REF, LIFE and DEV
• S3 storage for cloud-native apps (Teamdisk, Teamlike) and for backups (e.g. of RBD volumes)
• RBD for persistent volumes / data via OpenStack Cinder (e.g. DB volumes)
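As a quick sanity check on these figures, n-way replication divides raw capacity by the replica count (the clusters here use 3 replicas, as noted later in the deck). A minimal sketch, ignoring filesystem and metadata overhead:

```python
def usable_tb(raw_tb, replicas=3):
    """Usable capacity under n-way replication, ignoring overhead."""
    return raw_tb / replicas

# The ~810 TB raw S3 cluster yields roughly 270 TB of usable capacity.
print(usable_tb(810))
```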
Details: CEPH Basics
Details: Hardware
Details: Hardware
• Supermicro: 2x Intel Xeon E5-2640 v2 @ 2.00 GHz, 64 GB RAM, 7x SSDs, 18x HDDs
• Seagate Terascale ST4000NC000 4 TB HDDs
• LSI MegaRAID SAS 9271-8i
• 18 OSDs per node: RAID1 with 2 SSDs for /, 3x RAID0 with 1 SSD each for journals, 18x RAID0 with 1 HDD each for OSDs
• 2x 10 Gb network adapters
Details: Hardware
• Supermicro: 1x Intel Xeon E5-2650L @ 1.80 GHz, 64 GB RAM, 36x HDDs
• Seagate Barracuda ST3000DM001 3 TB HDDs
• LSI MegaRAID SAS 9271-8i
• 10 OSDs per node: RAID1 for /, 10x RAID0 with 1 HDD each for journals, 10x RAID0 with 2 HDDs each for OSDs
• 2x 10 Gb network adapters
Details: Configuration & Deployment
Details: Configuration & Deployment
• Razor
• Puppet
• https://github.com/TelekomCloud/puppet-ceph
• dm-crypt disk encryption
• OSD location
• XFS
• 3 replicas
• OMD/Check_mk http://omdistro.org/
• ceph-dash https://github.com/TelekomCloud/ceph-dash for dashboard and API
• check_mk plugins (cluster health, OSDs, S3)
Details: Performance Tuning
Details: Performance Tuning
• Problem: low IOPS, IOPS drops
• Benchmarked with fio
• Fix: enable RAID0 writeback cache
• Fix: use separate disks for CEPH journals (better: use SSDs, as planned in the scale-out project)
• Problem: recovery/backfilling consumes a lot of CPU, decreasing client performance
• osd_recovery_max_active = 1 (number of active recovery requests per OSD at one time)
• osd_max_backfills = 1 (maximum number of backfills allowed to or from a single OSD)
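The two recovery throttles above belong in the `[osd]` section of ceph.conf; a minimal fragment with the values from the slide:

```ini
[osd]
# Limit concurrent recovery/backfill work so client IO keeps priority
osd recovery max active = 1
osd max backfills = 1
```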
Details: Performance Tests - Current Hardware / IO
Details: Performance Tests - Current Hardware / Bandwidth
Lessons Learned
Lessons Learned: Operational Experience
• Choose your hardware well!
• e.g. RAID controllers and hard disks: use enterprise-grade disks (desktop HDDs are missing important features like TLER/ERC)
• CPU/RAM planning: calculate 1 GHz of CPU power and 2 GB RAM per OSD
• Pick nodes with low storage capacity density for smaller clusters
• At least 5 nodes for a 3-replica cluster (e.g. for PoC, testing and development purposes)
• Cluster configuration "adjustments": increasing PG num has a big impact on the cluster because of massive data migration
• Rolling software updates / upgrades worked perfectly
• CEPH has a character, but is highly reliable: we never lost data
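The CPU/RAM planning rule above can be sketched as a quick calculation; the per-OSD figures are from the slide, the 18-OSD example node matches the S3 cluster hardware described earlier:

```python
# Rule of thumb from the talk: ~1 GHz of CPU and ~2 GB of RAM per OSD.
def node_requirements(osds_per_node, ghz_per_osd=1.0, ram_gb_per_osd=2.0):
    """Return (total GHz, total GB RAM) a storage node should provide for its OSDs."""
    return osds_per_node * ghz_per_osd, osds_per_node * ram_gb_per_osd

# An 18-OSD node needs roughly 18 GHz of aggregate CPU and 36 GB of RAM
# for the OSD daemons alone.
cpu_ghz, ram_gb = node_requirements(18)
print(cpu_ghz, ram_gb)
```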
Lessons Learned: Operational Experience
• Failed / "slow" disks
• Inconsistent PGs
• Incomplete PGs
• RBD pool configured with min_size=2
• A PG dropping below min_size blocks IO operations to the pool / cluster
• Fixed in Hammer (allows PG recovery while the replica count is below min_size)
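The min_size behavior described above can be illustrated with a tiny sketch (pure illustration of the rule, not Ceph code):

```python
def io_allowed(active_replicas, min_size=2):
    """A PG accepts client IO only while at least min_size replicas are active."""
    return active_replicas >= min_size

# With size=3 and min_size=2: losing one replica is fine,
# losing two blocks IO to the PG until recovery restores a replica.
print(io_allowed(3))  # True
print(io_allowed(2))  # True
print(io_allowed(1))  # False
```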
/var/log/syslog.log:
Apr 12 04:59:47 cephosd5 kernel: [12473860.669262] sd 6:2:10:0: [sdk] Unhandled error code

root@cephosd5:/var/log# mount | grep sdk
/dev/mapper/cephosd5-journal-sdk on /var/lib/ceph/osd/journal-disk9

root@cephosd5:/var/log# grep journal-disk9 /etc/ceph/ceph.conf
osd journal = /var/lib/ceph/osd/journal-disk9/osd.151-journal

/var/log/ceph/ceph-osd.151.log.1.gz:
2015-04-12 04:59:47.891284 7f8a10c76700 -1 journal FileJournal::do_write: pwrite(fd=25, hbp.length=4096) failed: (5) Input/output error
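The debugging chain in the log excerpt (failed device → journal mountpoint → OSD id) can be scripted; a minimal sketch using the lines from the slide (the helper names are made up for the example):

```python
import re

MOUNT_LINE = "/dev/mapper/cephosd5-journal-sdk on /var/lib/ceph/osd/journal-disk9"
CONF_LINE = "osd journal = /var/lib/ceph/osd/journal-disk9/osd.151-journal"

def journal_mountpoint(mount_line):
    """Extract the mountpoint from a `mount | grep <dev>` output line."""
    return mount_line.split(" on ")[1].split()[0]

def osd_id_for_journal(conf_line):
    """Pull the OSD id out of a ceph.conf journal path like .../osd.151-journal."""
    m = re.search(r"osd\.(\d+)-journal", conf_line)
    return int(m.group(1)) if m else None

print(journal_mountpoint(MOUNT_LINE))  # /var/lib/ceph/osd/journal-disk9
print(osd_id_for_journal(CONF_LINE))   # 151
```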
Lessons Learned: Operational Experience
Lessons Learned: Incomplete PGs - What Happened?
[Diagram: three OSD nodes, each with OSDs, their journals, and the PGs placed on them]
Glimpse of the Future
Overview: Scale-Out Project
+40%

Current overall capacity:
• ~60 storage nodes
• 5.4 PB storage gross
• ~0.5 PB S3 storage net

Planned capacity for 2015:
• ~90 storage nodes
• 7.5 PB storage gross
• ~1.5 PB S3 storage net
Future Setup: Scale-Out Project
• 2 physically separated rooms
• Data distributed according to the rule:
• Not more than 2 replicas in one room
• Not more than 1 replica in one rack
Future Setup: New Crushmap Rules
rule myrule {
ruleset 3
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type rack
step emit
}
crushtool -i real7 --test --show-statistics --rule 3 --min-x 1 --max-x 1024 --num-rep 3 --show-mappings
CRUSH rule 3 x 1 [12,19,15]
CRUSH rule 3 x 2 [14,16,13]
CRUSH rule 3 x 3 [3,0,7]
…

Listing 1: crushmap rule. Listing 2: simulate placement of 1024 objects.
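The intent of the rule (at most 2 replicas per room, at most 1 per rack) can be checked against a placement with a small sketch; the OSD-to-rack-to-room topology below is made up for the example, not the real cluster map:

```python
from collections import Counter

# Hypothetical topology: osd id -> (room, rack)
TOPOLOGY = {
    0: ("room1", "rack1"), 1: ("room1", "rack1"),
    2: ("room1", "rack2"), 3: ("room2", "rack3"),
    4: ("room2", "rack4"), 5: ("room2", "rack4"),
}

def placement_ok(osds, topo=TOPOLOGY):
    """Check the rule's intent: <= 2 replicas per room, <= 1 replica per rack."""
    rooms = Counter(topo[o][0] for o in osds)
    racks = Counter(topo[o][1] for o in osds)
    return max(rooms.values()) <= 2 and max(racks.values()) <= 1

print(placement_ok([0, 2, 3]))  # rooms 2+1, all racks distinct -> True
print(placement_ok([0, 1, 3]))  # osd 0 and 1 share rack1 -> False
```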
Future Setup: Dreams
• Cache tiering
• Make use of shiny new SSDs in a hot zone / cache pool
• SSD pools
• OpenStack live migration for VMs (boot from RBD volume)
Q & A
Questions & Answers
• Ievgen Nelen
• @eugene_nelen
• Gerd Prüßmann
• @2digitsLeft