Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Page 1: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud

Yuming Ma, Architect

StaaS, Cisco Cloud Foundation

Seattle WA, 10/18/2016

Stabilizing Petabyte Ceph Cluster in OpenStack Cloud

Page 2: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Highlights

1. What are we doing with Ceph?

2. What did we start with?

3. We need a bigger boat

4. Getting better and sleeping through the night

5. Lessons learned

Page 3: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Cisco Cloud Services provides an OpenStack platform to Cisco SaaS applications and tenants through a worldwide deployment of data centers.

Background

SaaS cases:
• Collaboration
• IoT
• Security
• Analytics
• "Unknown projects"

Swift:
• Database (Trove) backups
• Static content
• Cold/offline data for Hadoop

Cinder:
• Generic/magnetic volumes
• Low performance

Page 4: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Boot Volumes for all VM flavors except those with Ephemeral (local) storage

• Glance Image store

• Generic Cinder Volume

• RGW for Swift Object store

• In production since March 2014

• 13 clusters in production in two years

• Each cluster is 1800TB raw over 45 nodes and 450 OSDs.

How Do We Use Ceph?

[Diagram: Cisco UCS Ceph high-performance platform serving generic and provisioned-IOPS volumes through the Cinder API, and objects through the Swift API.]

Page 5: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Get to MVP and keep costs down
• High capacity, hence C240 M3 LFF for 4TB HDDs
• Tradeoff: the C240 M3 LFF could not also accommodate SSDs, so the journal was collocated on the OSD data disks
• Monitors were on HDD-based systems as well

Initial Design Considerations

Page 6: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


CCS Ceph 1.0

[Diagram: three racks of 15x UCS C240 OSD nodes (45 total) plus monitors, connected by a 2x10Gb public network and a 2x10Gb private cluster network. OpenStack (Keystone, Swift, Cinder, Glance and Nova APIs) reaches the cluster through the RADOS Gateway and the Ceph block device (RBD) via libvirt/KVM, on top of the Ceph librados API. Each OSD node runs OSD1..OSD10 on HDDs behind an LSI 9271 HBA, with an XFS data partition and a collocated journal partition per disk, and the OS on a RAID1 mirror.]

OSD: 45x UCS C240 M3
• 2x E5-2690 v2 (40 HT cores)
• 64GB RAM
• 2x10Gb for public, 2x10Gb for cluster network
• 3x replication
• LSI 9271 HBA
• 10x 4TB HDD, 7200 RPM
• 10GB journal partition on each HDD
• RHEL 7, kernel 3.10.0-229.1.2.el7.x86_64

NOVA: UCS C220
• Ceph 0.94.1
• RHEL 7, kernel 3.10.0-229.4.2.el7.x86_64

MON/RGW: UCS C220 M3
• 2x E5-2680 v2 (40 HT cores)
• 64GB RAM
• 2x10Gb for public
• 4x 3TB HDD, 7200 RPM
• RHEL 7, kernel 3.10.0-229.4.2.el7.x86_64

Started with Cuttlefish/Dumpling

Page 7: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Nice consistent growth… but your users will not warn you before:
  • "going live"
  • migrating out of S3
  • backing up a Hadoop HDFS
• Stability problems emerge after the cluster is ~50% used

Growth: It will happen, just not sure when

Page 8: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Major Stability Problems: Monitors

Problem: MON election storm impacting client IO
Impact: Monmap changes due to a flaky NIC or chatty messaging between MONs and clients caused an unstable quorum and an election storm between MON hosts. Result: blocked and slowed client IO requests.

Problem: LevelDB inflation
Impact: LevelDB grows to XXGB over time, preventing the MON daemon from serving OSD requests. Result: blocked IO and slow requests.

Problem: DDoS due to chatty client message attack
Impact: Slow responses from the MONs (caused by LevelDB or an election storm) trigger a message flood from clients. Result: failed client operations, e.g. volume creation, RBD connection.

Page 9: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Major Stability Problems: Cluster

Problem: Backfill & recovery impacting client IO
Impact: Osdmap changes due to loss of a disk result in PG peering and backfilling. Result: clients receive blocked and slow IO.

Problem: Unbalanced data distribution
Impact: Data on OSDs isn't evenly distributed; the cluster may be 50% full, but some OSDs are at 90%. Result: backfill isn't always able to complete.

Problem: Slow disk impacting client IO
Impact: A single slow (sick, not dead) OSD can severely impact many clients until it's ejected from the cluster. Result: clients have slow or blocked IO.

Page 10: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Stability Improvement Strategy

Strategy: Client IO throttling*
Improvement: Rate-limit IOPS at the Nova host to 250 IOPS per volume.

Strategy: Backfill and recovery throttling
Improvement: Reduced IO consumption by backfill and recovery processes so they yield to client IO.

Strategy: Retrofit with NVMe (PCIe) journals
Improvement: Increased overall IOPS capacity of the cluster.

Strategy: Upgrade to 1.2.3/1.3.2
Improvement: Overall stability and hardened MONs, preventing election storms.

Strategy: LevelDB on SSD (replaced entire MON node)
Improvement: Faster cluster map queries.

Strategy: Re-weight by utilization
Improvement: Balanced data distribution.

*Client is the RBD client, not the tenant.

Page 11: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Limit/cap IO consumption at the qemu layer (see the sketch below):
  • iops (IOPS, read + write): 250
  • bps (bytes per second, read + write): 100 MB/s
• Predictable and controlled IOPS capacity
• No minimum/guaranteed IOPS -> future Ceph feature
• No burst map -> qemu feature:
  • iops_max: 500
  • bps_max: 120 MB/s

Client IO throttling

[Charts: per-volume IOPS swing of ~100% without throttling vs. ~12% with throttling]
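The deck doesn't show the mechanism itself; as a hedged sketch, per-volume limits like these are commonly applied as a Cinder front-end QoS spec (enforced by qemu on the Nova host) or directly as libvirt iotune settings. Names here are illustrative; the values mirror the slide's 250 IOPS / 100 MB/s.

# Illustrative Cinder front-end QoS spec attached to a volume type
cinder qos-create rbd-throttle consumer="front-end" total_iops_sec=250 total_bytes_sec=104857600
cinder qos-associate <qos-spec-id> <volume-type-id>

# Equivalent libvirt/qemu iotune settings applied to the attached RBD disk:
#   <iotune>
#     <total_iops_sec>250</total_iops_sec>
#     <total_bytes_sec>104857600</total_bytes_sec>
#   </iotune>

Throttling at the hypervisor keeps a single noisy volume from consuming the cluster's shared IOPS budget.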

Page 12: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Problem:
  • Blocked IO during peering
  • Slow requests during backfill
  • Both could cause client IO stalls and vCPU soft lockups
• Solution: throttle backfill and recovery (applied as sketched below):

  osd recovery max active = 3 (default: 15)
  osd recovery op priority = 3 (default: 10)
  osd max backfills = 1 (default: 10)

Backfill and Recovery Throttling
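A minimal sketch of applying the throttles, assuming they live in ceph.conf on the OSD hosts and are also injected into running OSDs so they take effect immediately:

# ceph.conf, [osd] section on every OSD host:
#   osd_recovery_max_active = 3
#   osd_recovery_op_priority = 3
#   osd_max_backfills = 1

# Inject into running OSDs without a restart
ceph tell osd.* injectargs '--osd-recovery-max-active 3 --osd-recovery-op-priority 3 --osd-max-backfills 1'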

Page 13: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Goal: 2x IOPS capacity gain
• Tuning: filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)

Retrofit Ceph Journal from HDD to NVME

[Diagram, before: 10 HDDs behind the LSI 9271 HBA, each with an XFS data partition and a collocated journal partition; OS on a RAID1 mirror. After: the 10 journals move to partitions on an NVMe device (starting at 4MB, 10GB each, with 4MB offsets in between, ~300GB left free), each HDD becomes a single-disk RAID0 data device for one OSD, and the OS stays on the RAID1 mirror.]
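A rough per-OSD retrofit sketch, assuming FileStore OSDs with the journal at /var/lib/ceph/osd/ceph-$ID/journal and a pre-created 10GB NVMe partition; the device name and OSD id are illustrative:

ID=12                        # example OSD id
NEW_JOURNAL=/dev/nvme0n1p3   # assumed 10GB journal partition on the NVMe

ceph osd set noout                 # keep CRUSH from rebalancing while the OSD is down
service ceph stop osd.$ID          # (or systemctl stop ceph-osd@$ID on systemd hosts)
ceph-osd -i $ID --flush-journal    # drain the old on-HDD journal
rm -f /var/lib/ceph/osd/ceph-$ID/journal
ln -s $NEW_JOURNAL /var/lib/ceph/osd/ceph-$ID/journal
ceph-osd -i $ID --mkjournal        # initialize the journal on the NVMe partition
service ceph start osd.$ID
ceph osd unset noout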

Page 14: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


NVMe Stability Improvement Analysis

Backfill and recovery config:
osd recovery max active = 3 (default: 15)
osd max backfills = 1 (default: 10)
osd recovery op priority = 3 (default: 10)

Server impact:
• Shorter recovery time

Client impact:
• <10% impact (tested without IO throttling; impact should be smaller with IO throttling)

Page 15: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


LevelDB:
• Key-value store for cluster metadata, e.g. osdmap, pgmap, monmap, clientID, authID, etc.
• Not in the data path
• Still impactful to IO operations: IO can be blocked by DB queries
• Larger size means longer query time, hence longer IO wait -> slow requests
• Solution:
  • Put LevelDB on SSD to increase disk IO rate
  • Upgrade to Hammer to reduce DB size

MON Level DB Issues
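As a hedged operational sketch (not shown in the deck), the store can be watched and compacted on each MON while the SSD retrofit is planned:

du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db   # LevelDB size on this monitor
ceph tell mon.$(hostname -s) compact                     # trigger compaction; repeat per MON

# ceph.conf safeguard so each MON compacts its store on startup:
#   [mon]
#   mon_compact_on_start = true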

Page 16: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Retrofit MON LevelDB from HDD to SSD

New BOM:
• UCS C220 M4 with 120GB SSD

[Charts: MON write wait time with LevelDB on HDD vs. with LevelDB on SSD]

Page 17: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Problem:
  • Election storm & LevelDB inflation
• Solutions:
  • Upgrade to 1.2.3 to fix the election storm
  • Upgrade to 1.3.2 to fix LevelDB inflation
  • Configuration changes (below)

Hardening MON Cluster with Hammer and Tuning

[mon]
mon_lease = 20 (default: 5)
mon_lease_renew_interval = 12 (default: 3)
mon_lease_ack_timeout = 40 (default: 10)
mon_accept_timeout = 40 (default: 10)

[client]
mon_client_hunt_interval = 40 (default: 3)
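A minimal sketch of rolling the MON values out; the runtime injection step is an assumption about how they were applied (the [client] value lands in ceph.conf on the Nova hosts and is picked up when librados reconnects):

# On each monitor host
ceph tell mon.$(hostname -s) injectargs \
    '--mon-lease 20 --mon-lease-renew-interval 12 --mon-lease-ack-timeout 40 --mon-accept-timeout 40'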

Page 18: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Problem:
  • High skew of per-disk %used prevents data intake even though overall cluster capacity allows it
• Impact:
  • Unbalanced PG distribution hurts performance
  • Rebalancing is impactful as well
• Solution:
  • Upgrade to Hammer 1.3.2 + patch
  • Re-weight by utilization at >10% delta (command sketch after the chart below)

Data Distribution and Balance

[Chart: us-internal-1 per-OSD disk % used across all ~436 OSDs]

Cluster: 67.9% full
OSDs:
• Min: 47.2%
• Max: 83.5%
• Mean: 69.6%
• Stddev: 6.5
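A hedged sketch of the re-weighting step with the Hammer-era CLI; the 110 threshold reflects the slide's ">10% delta" rule (act on OSDs more than 10% above the cluster mean):

ceph osd df                             # per-OSD utilization, to see the skew
ceph osd reweight-by-utilization 110    # lower the weight of OSDs above 110% of average utilization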

Page 19: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Migrate OS from Ubuntu to RHEL
• Retrofit journals from HDD to SSD
• Retrofit MON LevelDB from HDD to SSD
• Expand cluster from 3 racks to 4/5 racks
• Continuously upgrade Ceph versions
• Challenge is on the client side: nova instances need to be restarted to reload librbd and librados

Zero Down Time Ops

Page 20: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


Storage Cluster Monitoring and Analytics

• Three types of data: events, metrics, logs
• Data collected from each node
• Data pushed to monitoring portals
• In-flight analytics for run-time RCA
• Predictive analytics for proactive alerts, e.g. ProphetStor disk failure prediction
• Plugin to synthesize data into cluster-level metrics and status
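The collector itself isn't shown in the deck; as an illustrative sketch, a per-node agent can pull cluster-level status and metrics as JSON and push them to the monitoring portal (the endpoint URL is a placeholder):

ceph status --format json > /tmp/ceph_status.json          # cluster-level health and capacity
ceph osd pool stats --format json > /tmp/pool_stats.json   # per-pool client and recovery IO
# curl -X POST --data @/tmp/ceph_status.json https://monitoring.example.com/ceph/status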

Page 21: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Problem:
  • RBD image data is distributed across all disks, so a single disk failure can impact critical data IO
• Solution: proactively detect future disk failures
  • DiskProphet solution:
    • Disk near-failure likelihood prediction
    • Disk life-expectancy prediction
    • Actions to optimize Ceph

Proactive Detection of Disk Failure

[Chart, IOPS over time: normal workload; 1 OSD failed with Ceph rebalancing; 1 OSD failure predicted with no-impact recovery by DiskProphet]

Page 22: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• Set clear stability goals: zero-downtime operation
• You can plan for everything except how tenants will use it
• Look for issues in services that consume storage
  • Had 50TB of "deleted volumes" that should not have been left behind
• DevOps
  • It's not just technology, it's how your team operates as a team
  • Consistent performance and stability modeling
  • Automate rigorous testing
  • Automate builds and rebuilds
• Balance performance, cost and time
• Shortcuts create technical debt

Lessons Learned

Page 23: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud

Yuming Ma: [email protected]

Thank You

Page 24: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


[Diagram: test topology with Rack-1 (6 OSD nodes), Rack-2 (5 OSD nodes) and Rack-3 (6 OSD nodes), each node running osd1..osd10, plus Nova compute nodes nova1..nova10, each hosting 20 VMs (vm1..vm20).]

NVMe Journaling: Performance Testing Setup

[Diagram: journal layout as in the retrofit slide: NVMe journal partitions start at 4MB, 10GB each with 4MB offsets in between (~300GB left free); each HDD is a single-disk RAID0 data device behind the LSI 9271 HBA with an XFS data partition; OS on a RAID1 mirror.]

OSD: UCS C240 M3
• 2x E5-2690 v2 (40 HT cores)
• 64GB RAM
• 2x10Gb for public, 2x10Gb for cluster network
• 3x replication
• Intel P3700 400GB NVMe
• LSI 9271 HBA
• 10x 4TB HDD, 7200 RPM

Nova: UCS C220
• 2x E5-2680 v2 (40 HT cores)
• 380GB RAM
• 2x10Gb for public
• Kernel 3.10.0-229.4.2.el7.x86_64

Page 25: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


NVMe Journaling: Performance Tuning

OSD host iostat observations:
• Both NVMe and HDD %util are low most of the time, with spikes every ~45s.
• Both NVMe and HDD show very low queue size (iodepth), while the front-end VM pushes queue depth 16 to FIO.
• CPU %used is reasonable, converging at <30%, but iowait is low, consistent with the low disk activity.
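For reference, a hedged example of the kind of sampling behind these observations; device names and the FIO job are illustrative:

iostat -x 1 /dev/nvme0n1 /dev/sd{b..k}     # watch %util and queue size on the journal NVMe and data HDDs

# Inside a test VM, FIO drives queue depth 16 against the attached RBD volume:
# fio --name=randwrite --rw=randwrite --bs=4k --iodepth=16 --direct=1 --filename=/dev/vdb --runtime=300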

Page 26: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


NVMe Journaling: Performance Tuning

Tuning directions, to increase disk %util (ceph.conf sketch below):
• Disk threads: 4, 16, 32
• Filestore max sync interval: 0.1, 0.2, 0.5, 1, 5, 10, 20
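A sketch of how the sweep could be expressed in ceph.conf; filestore_max_sync_interval is the documented option, while the mapping of "disk threads" to filestore_op_threads is an assumption (the deck doesn't name the exact knob):

# [osd]
#   filestore_op_threads = 4          # swept 4, 16, 32 (assumed knob for "disk threads")
#   filestore_max_sync_interval = 5   # swept 0.1, 0.2, 0.5, 1, 5, 10, 20 seconds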

Page 27: Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack Cloud


• These two tunings showed no impact:
  filestore_wbthrottle_xfs_ios_start_flusher: default 500 vs 10
  filestore_wbthrottle_xfs_inodes_start_flusher: default 500 vs 10

• Final config:
  osd_journal_size = 10240
  journal_max_write_entries = 1000 (default: 100)
  journal_max_write_bytes = 1048576000 (default: 10485760)
  journal_queue_max_bytes = 1048576000 (default: 10485760)
  filestore_queue_max_bytes = 1048576000 (default: 10485760)
  filestore_queue_committing_max_bytes = 1048576000 (default: 10485760)
  filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)

NVMe Performance Tuning

[Charts: linear tuning sweeps of filestore_wbthrottle_xfs_bytes_start_flusher, filestore_wbthrottle_xfs_inodes_start_flusher and filestore_wbthrottle_xfs_ios_start_flusher]