MySQL on Ceph
TRANSCRIPT
MySQL and Ceph
1:20pm – 2:10pm
Room 203

MySQL in the Cloud: Head-to-Head Performance Lab
2:20pm – 3:10pm
Room 203
WHOIS
Brent Compton and Kyle Bader
Storage Solution Architectures
Red Hat
Yves Trudeau
Principal Architect
Percona
AGENDA
MySQL on Ceph
• Why MySQL on Ceph
• Ceph Architecture
• Tuning: MySQL on Ceph
• HW Architectural Considerations
MySQL in the Cloud: Head-to-Head Performance Lab
• MySQL on Ceph vs. AWS
• Head-to-head: Performance
• Head-to-head: Price/performance
• IOPS performance nodes for Ceph
WHY MYSQL ON CEPH
WHY MYSQL ON CEPH? MARKET DRIVERS
• Ceph is the #1 block storage for OpenStack clouds
• MySQL is the #4 workload on OpenStack (workloads #1–3 often use databases too!)
• 70% of apps on OpenStack use the LAMP stack
• Ceph is the leading open-source SDS
• MySQL is the leading open-source RDBMS
WHY MYSQL ON CEPH? OPS EFFICIENCY
• Shared, elastic storage pool
• Dynamic DB placement
• Flexible volume resizing
• Live instance migration
• Backup to object pool
• Read replicas via copy-on-write snapshots (sketched below)
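The last bullet lends itself to a short illustration. Below is a minimal sketch, assuming Ceph's `rados` and `rbd` Python bindings and hypothetical pool/image names, of seeding a MySQL read replica from an RBD copy-on-write clone; the parent image must have the layering feature enabled, and the snapshot must be protected before cloning.

```python
# Minimal sketch: seed a MySQL read replica via an RBD copy-on-write
# clone. Pool/image names ("mysql-pool", "mysql-master") are
# hypothetical; the parent image needs the layering feature enabled.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mysql-pool")

img = rbd.Image(ioctx, "mysql-master")
img.create_snap("replica-seed")          # point-in-time snapshot
img.protect_snap("replica-seed")         # cloning requires protection
img.close()

# Instant copy-on-write clone: the replica shares unmodified blocks
# with the parent snapshot, so no bulk copy of the dataset occurs.
rbd.RBD().clone(ioctx, "mysql-master", "replica-seed",
                ioctx, "mysql-replica-1",
                features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()
```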
WHY MYSQL ON CEPH? PUBLIC CLOUD FIDELITY
• Hybrid Cloud requires familiar platforms
• Developers want platform consistency
• Block storage, like the big kids
• Object storage, like the big kids
• Your hardware, datacenter, staff
WHY MYSQL ON CEPH? HYBRID CLOUD REQUIRES HIGH IOPS
Ceph provides:
• Spinning Block – General Purpose
• Object Storage – Capacity
• SSD Block – High IOPS
CEPH ARCHITECTURE
ARCHITECTURAL COMPONENTS
• RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
• LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
• RGW: a web services gateway for object storage, compatible with S3 and Swift
• RBD: a reliable, fully-distributed block device with cloud platform integration
• CEPHFS: a distributed file system with POSIX semantics and scale-out metadata
[Diagram: APP atop LIBRADOS/RGW, HOST/VM atop RBD, CLIENT atop CEPHFS]
RADOS CLUSTER
[Diagram: many Ceph OSDs composing a single RADOS cluster]
RADOS COMPONENTS
OSDs
• 10s to 10000s in a cluster
• Typically one per disk
• Serve stored objects to clients
• Intelligently peer for replication & recovery
Monitors
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Deployed as a small, odd number (to maintain quorum)
• Do not serve stored objects to clients
WHERE DO OBJECTS LIVE?
Option 1: A METADATA SERVER?
Option 2: CALCULATED PLACEMENT
EVEN BETTER: CRUSH
[Diagram: objects hash into PLACEMENT GROUPS (PGs), which CRUSH maps onto OSDs across the CLUSTER]
CRUSH IS A QUICK CALCULATION
DYNAMIC DATA PLACEMENT
CRUSH (Controlled Replication Under Scalable Hashing), sketched after this list:
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
• Limited data migration on change
• Rule-based configuration
• Infrastructure topology aware
• Adjustable replication
• Weighting
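To make "fast calculation, no lookup" concrete, here is a toy Python sketch. It is not the real CRUSH algorithm (which walks the infrastructure topology and honors weights and rules); it only demonstrates the core property that any client can deterministically compute an object's placement group and replica OSDs from hashes, with no central table.

```python
import hashlib

def pg_for_object(object_name: str, pg_count: int) -> int:
    """Hash an object name into one of the pool's placement groups."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return int(digest, 16) % pg_count          # same input -> same PG

def osds_for_pg(pool_id: int, pg_id: int, osd_ids: list, replicas: int = 3) -> list:
    """Deterministically rank OSDs for a PG (rendezvous-style hashing).

    Real CRUSH walks the cluster topology and honors weights; this toy
    version only shows the deterministic, lookup-free mapping."""
    ranked = sorted(
        osd_ids,
        key=lambda osd: hashlib.md5(f"{pool_id}.{pg_id}.{osd}".encode()).digest(),
    )
    return ranked[:replicas]

# Any client computes the same placement with no central lookup:
pg = pg_for_object("rbd_data.1234.0000000000000042", pg_count=128)
print(osds_for_pg(pool_id=3, pg_id=pg, osd_ids=list(range(12))))
```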
DATA IS ORGANIZED INTO POOLS
[Diagram: the CLUSTER holds multiple POOLS, each containing PGs: POOL A, POOL B, POOL C, POOL D]
ACCESS METHODS
ACCESSING A RADOS CLUSTER
[Diagram: an application links LIBRADOS and communicates with the RADOS CLUSTER directly over a socket]
RADOS ACCESS FOR APPLICATIONS
LIBRADOS
• Direct access to RADOS for applications (see the sketch after this list)
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
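As a minimal sketch of that direct access path, the snippet below uses the `rados` Python binding to store and fetch one object. The pool name is an assumption, and a reachable cluster with a standard ceph.conf is presumed.

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()                       # contacts monitors, then OSDs

ioctx = cluster.open_ioctx("app-pool")  # hypothetical pool name
ioctx.write_full("greeting", b"hello RADOS")  # store an object
print(ioctx.read("greeting"))                 # read it back directly
ioctx.close()
cluster.shutdown()
```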
STORING VIRTUAL DISKS
[Diagram sequence: a VM's virtual disk, exposed via RBD, is stored as objects distributed across the RADOS CLUSTER]
PERCONA ON KRBD
[Diagram: Percona Server accessing the RADOS CLUSTER through the kernel RBD (krbd) driver]
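As a hedged sketch of provisioning such a volume, the snippet below creates an RBD image with the `rbd` Python binding. The pool and image names are invented for illustration; the krbd mapping step happens outside librbd, so it appears only as a comment.

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mysql-pool")       # hypothetical pool

# Create a 100 GiB block image to back a MySQL data directory.
rbd.RBD().create(ioctx, "mysql-data", 100 * 1024 ** 3)

# On the database host the image would then be mapped through the
# kernel RBD driver (e.g. `rbd map mysql-pool/mysql-data`), formatted,
# and mounted as the Percona Server datadir.
ioctx.close()
cluster.shutdown()
```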
TUNING MYSQL ON CEPH
TUNING FOR HARMONY: OVERVIEW
Tuning MySQL
• Buffer pool > 20% (sizing sketch after this list)
• Flush each Tx, or batch?
• Parallel doublewrite-buffer flush
Tuning Ceph
• RHCS 1.3.2, tcmalloc 2.4
• 128M thread cache
• Co-resident journals
• 2-4 OSDs per SSD
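A rough sketch of how the two MySQL knobs above are commonly set: `innodb_buffer_pool_size` for buffer-pool sizing and `innodb_flush_log_at_trx_commit` for per-transaction versus batched redo flushing. The 25% sizing, and the assumption that the buffer-pool percentage is measured against the active dataset, are illustrative rather than values taken from this benchmark.

```python
# Emit illustrative my.cnf values for the two MySQL knobs above. The
# 25% buffer-pool sizing (above the 20% floor) and the mapping of
# "batch vs. per-Tx flush" to innodb_flush_log_at_trx_commit are
# assumptions for this sketch, not the benchmark's configuration.
dataset_gib = 100                                   # assumed hot-data size
buffer_pool_gib = max(1, round(dataset_gib * 0.25))

print(f"innodb_buffer_pool_size = {buffer_pool_gib}G")
# 1 = flush redo log at every commit (durable, slower);
# 2 or 0 = batch flushes roughly once per second (faster, small risk window)
print("innodb_flush_log_at_trx_commit = 2")
```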
TUNING FOR HARMONY: SAMPLE EFFECT OF MYSQL BUFFER POOL ON TpmC
[Chart: TpmC (0–1,200,000) vs. time in seconds (0–8000, one data point per minute). 64x MySQL instances on a Ceph cluster, each with 25x TPC-C warehouses. Series: 1%, 5%, 25%, 50%, and 75% buffer pool.]
TUNING FOR HARMONY: SAMPLE EFFECT OF MYSQL Tx FLUSH ON TpmC
[Chart: TpmC (0–2,500,000) vs. time in seconds (0–8000, one data point per minute). 64x MySQL instances on a Ceph cluster, each with 25x TPC-C warehouses. Series: batch Tx flush (1 sec) vs. per-Tx flush.]
TUNING FOR HARMONY: SAMPLE EFFECT OF CEPH TCMALLOC VERSION ON TpmC
[Chart: TpmC (0–1,200,000) vs. time in seconds (0–8000, one data point per minute). 64x MySQL instances on a Ceph cluster, each with 25x TPC-C warehouses. Series: per-Tx flush vs. per-Tx flush with tcmalloc v2.4.]
TUNING FOR HARMONY: CREATING A SEPARATE POOL TO SERVE IOPS WORKLOADS
Creating multiple pools in the CRUSH map
• Distinct branch in OSD tree
• Edit CRUSH map, add SSD rules
• Create pool, set crush_ruleset to SSD rule (sketched below)
• Add Volume Type to Cinder
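A sketch of the create-and-assign step, driven through librados' mon_command interface (the same steps are usually done with the `ceph` CLI). The pool name, pg_num, and rule id are assumptions, and the SSD rule itself must already exist in the edited CRUSH map.

```python
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

def mon(cmd: dict) -> bytes:
    """Send one monitor command; raise on a non-zero return code."""
    ret, out, err = cluster.mon_command(json.dumps(cmd), b"")
    if ret != 0:
        raise RuntimeError(err)
    return out

# Create the IOPS pool, then point it at the SSD CRUSH rule (id 1 here;
# the variable is named crush_rule in later Ceph releases).
mon({"prefix": "osd pool create", "pool": "mysql-ssd", "pg_num": 128})
mon({"prefix": "osd pool set", "pool": "mysql-ssd",
     "var": "crush_ruleset", "val": "1"})
cluster.shutdown()
```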
TUNING FOR HARMONY: IF YOU MUST USE MAGNETIC MEDIA
Reducing seeks on magnetic pools
• RBD cache is safe (ceph.conf sketch below)
• RAID Controllers with write-back cache
• SSD Journals
• Software caches
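On the "RBD cache is safe" point: librbd's write-back cache honors flush requests from the guest, which is why it does not endanger InnoDB's durability guarantees. Below is an illustrative client-side ceph.conf fragment, emitted from Python to keep all of the deck's examples in one language; the cache size is an example, not a recommendation.

```python
# Illustrative client-side ceph.conf section enabling the librbd
# write-back cache. "writethrough until flush" keeps the cache in
# writethrough mode until the guest issues its first flush, guarding
# against guests that never send barriers. Values are examples only.
RBD_CACHE_CONF = """\
[client]
rbd cache = true
rbd cache writethrough until flush = true
rbd cache size = 67108864            # 64 MiB
"""
print(RBD_CACHE_CONF)
```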
HW ARCHITECTURE CONSIDERATIONS
ARCHITECTURAL CONSIDERATIONS: UNDERSTANDING THE WORKLOAD
Traditional Ceph Workload
• $/GB
• PBs
• Unstructured data
• MB/sec
MySQL Ceph Workload
• $/IOP
• TBs
• Structured data
• IOPS
NEXT UP
MySQL in the Cloud: Head-to-Head Performance Lab
2:20pm – 3:10pm
Room 203