PLUK 2013 – Bodybuilding.com (Ratheesh Kaniyala)

DESCRIPTION
Slides from my talk at Percona Live London 2013. This talk covers database administration and how we manage Percona XtraDB at Bodybuilding.com, with a few benchmarks on Percona Server, Fusion-io, and the XFS/EXT4 file systems.

TRANSCRIPT
Bodybuilding.com • Information • Motivation • Supplementation
About Me
• Ratheesh Kaniyala
• MySQL Database Administrator @ Bodybuilding.com
• Responsible for keeping the site up and running
• Work with system engineers, developers, and product owners to get the best out of the database system
Agenda
• About Bodybuilding.com
• MySQL Architecture @ Bodybuilding.com
• Backups
• Monitoring
• Some Benchmarks on Fusion-io
• NoSQL @ BBcom
Bodybuilding.com
• What is Bodybuilding.com? We:
  • Provide information + motivation + supplementation for leading a healthy lifestyle
  • Are not just for bodybuilders
  • Have hundreds of vendors
  • Have thousands of pieces of content (articles, exercises, videos)
  • Have millions of users – and still growing
Statistics
• Ranked the #319 US website and #709 worldwide
• 15th largest forum, with over 100 million posts
• About 65 TB served from Akamai and 57 TB from CloudFront
• About 25 million unique visitors per month
• About 5000 Bodyspace registrations per day
APPLICATIONS
• Our main applications:
  • Busy forum – powered by vBulletin, MySQL
  • Bodyspace – PHP, Java, MySQL, MongoDB
  • Store – ATG (Art Technology Group), Oracle 11g
  • Content – PHP, MySQL
Why Percona?
• This is what Percona says:
  • Your queries will run faster and more consistently.
  • You will consolidate servers on powerful hardware.
  • You will delay sharding, or avoid it entirely.
  • You will spend less time tuning and administering.
  • You will achieve higher uptime.
  • You will troubleshoot without guesswork.
• And we found this to be true on every point.
High Level Tech Diagram
[Diagram: a security layer in front of the web servers, a caching layer, and a data layer holding the MySQL RDBMS, NoSQL stores, and Sphinx/Solr, split across clusters A, B, C, and D]
A Typical MySQL Cluster
[Diagram: the active master takes writes and some reads; slaves serve reads, backups and offsite backup, analytics, search indexes, and query replay for lru_dump]
HA
• Multi-master asynchronous replication
• Active – Passive mode
• What happens when any slave dies?
• What happens when passive master dies?
• What happens when Active master dies?
Active Master Died – Case 1
[Diagram: M2 and the read-only slave pool]
• M2 and all slaves end up at the same replication position
• Point all slaves to M2 with CHANGE MASTER
• Resume traffic
Active Master Died – Case 2
[Diagram: M2 and the read-only slave pool]
• M2 and all slaves are in different states
• Promote the most up-to-date slave as the new master
• Fix the slave pool by looking into the binary logs
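Picking the most up-to-date slave boils down to comparing each slave's executed position in the dead master's binlog. A minimal sketch, assuming 5.5-era `SHOW SLAVE STATUS` field names; the slave names and positions below are made up:

```python
# Sketch: pick the most up-to-date slave to promote when the active
# master dies (Case 2). Positions come from SHOW SLAVE STATUS on each
# slave; the input data here is illustrative.

def binlog_key(status):
    """Order slaves by how far they have executed the dead master's
    binlog: first by file sequence number, then by position."""
    fname = status["Relay_Master_Log_File"]       # e.g. "mysql-bin.000042"
    seq = int(fname.rsplit(".", 1)[1])
    return (seq, status["Exec_Master_Log_Pos"])

def pick_new_master(slaves):
    """Return the name of the slave that executed the most events."""
    return max(slaves, key=lambda name: binlog_key(slaves[name]))

slaves = {
    "s1": {"Relay_Master_Log_File": "mysql-bin.000042", "Exec_Master_Log_Pos": 1500},
    "s2": {"Relay_Master_Log_File": "mysql-bin.000043", "Exec_Master_Log_Pos": 120},
    "s3": {"Relay_Master_Log_File": "mysql-bin.000042", "Exec_Master_Log_Pos": 9000},
}
print(pick_new_master(slaves))  # s2 – it is on a later binlog file
```

The remaining slaves are then repointed with CHANGE MASTER TO after matching their positions against the new master's binary logs.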
Maintenance on Active Master
[Diagram: read-only slaves]
• Move all slaves to M2
• Change DNS to make M2 active
Maintenance on active contd…
• It is not that easy to move N slaves to M2 manually, because the binlog file and position on M2 do not match those on M1
• Can we automate this slave migration? Yes
• How? There is an algorithm I can share
Slave migration algorithm
• Step 1: On each slave node, STOP SLAVE
• Step 2: Sleep for 10 seconds so that the secondary master is slightly ahead of the stopped slaves
• Step 3: STOP SLAVE on the secondary master and record its SHOW SLAVE STATUS result
• Step 4: On each slave node, START SLAVE UNTIL the position recorded in step 3
• Step 5: On each slave node, STOP SLAVE (required before changing master later)
  • a) Check the slave status and stop the slave only when the current position equals the position recorded in step 3
  • b) If a slave is too far behind and does not reach the UNTIL position within 20 seconds, skip the node and note it in the console and the log file. This node has to be handled manually later; look in the log file for directions on how to change master on it.
• Step 6: Run SHOW MASTER STATUS on the secondary master and record the position
• Step 7: On each slave node, CHANGE MASTER TO the secondary master with the position recorded in step 6
• Step 8: START SLAVE on the secondary master
• Step 9: On each slave node, START SLAVE
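The steps above can be sketched as a small simulation, with each node reduced to a binlog-position counter. A real implementation would issue the STOP SLAVE / START SLAVE UNTIL / CHANGE MASTER TO statements over MySQL connections; the `Node` class and the numbers here are illustrative:

```python
# Minimal simulation of the slave-migration algorithm: why the UNTIL
# step lines every slave up with the secondary master M2.

class Node:
    def __init__(self, pos):
        self.pos = pos           # executed position in M1's binlog
        self.running = True
    def catch_up(self, target):
        # START SLAVE UNTIL target: replay events up to target, no further
        if self.running:
            self.pos = max(self.pos, target)  # assume events are available

def migrate(secondary, slaves):
    for s in slaves:
        s.running = False                     # step 1: STOP SLAVE everywhere
    secondary.pos += 10                       # step 2: M2 pulls slightly ahead
    secondary.running = False                 # step 3: STOP SLAVE on M2 and
    until = secondary.pos                     #         record its position
    for s in slaves:
        s.running = True                      # step 4: START SLAVE UNTIL
        s.catch_up(until)
        s.running = False                     # step 5: STOP SLAVE again
    # steps 6-9: every slave now does CHANGE MASTER TO M2 at its position
    return all(s.pos == until for s in slaves)

m2 = Node(pos=100)
pool = [Node(90), Node(95), Node(100)]
print(migrate(m2, pool))  # True: every slave stopped at the same point
```

The simulation also shows where step 5b comes from: a slave that cannot reach `until` in time would break the `all(...)` check and must be fixed by hand.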
Slave migration algorithm contd..
Backups
• Nightly logical backups (mysqldump)
• Every 2 hours, LVM snapshot backups of the MySQL datadir:
  • Stop slave
  • Take snapshot
  • Copy to NAS
  • Be ready for the next snapshot
• All backups are done on a dedicated backup slave (B1)
• The mylvmbackup script provides various hook points to call custom scripts
• Note: if you are not validating your backups, you don't really care about your data
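The hook mechanism can be wired up roughly like this. A sketch only: the option names, sections, and paths follow mylvmbackup's configuration format of that era and the values are illustrative, so verify against your installed version's documentation:

```ini
# /etc/mylvmbackup.conf (fragment) – illustrative values
[lvm]
vgname=vg_data        ; volume group holding the MySQL datadir
lvname=lv_mysql
lvsize=5G             ; copy-on-write space reserved for the snapshot

[misc]
hooksdir=/usr/share/mylvmbackup   ; holds preconnect, premount, backupsuccess
skip_hooks=0                      ; run the hook scripts
```

Each hook is an executable script named after its hook point, which is how the validation work below gets triggered after a successful snapshot.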
Snapshot Process
[Diagram: snapshot flow through the mylvmbackup hooks – preconnect, premount, backupsuccess – plus a monitoring update]
• In the backupsuccess hook:
  • Start mysql on the snapshot data – this runs MySQL recovery and keeps the snapshot ready to use
  • Validate the backup by running custom queries, counts, statistics, etc., so you are sure it can be used for server rebuilds and recovery
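The validation step might look like this in outline. The comparison logic is runnable as-is; in practice each number would come from a query (counts, checksums, statistics) against the snapshot instance and the live master, and the check names and values here are made up:

```python
# Sketch of backup validation: after mysqld is started on the snapshot,
# compare its query results against expected values from the live master.

def validate_backup(expected, actual, tolerance=0.01):
    """Return a list of failed checks. A check fails when the snapshot's
    value deviates from the master's by more than `tolerance` (fraction),
    or is missing entirely."""
    failures = []
    for check, master_value in expected.items():
        snap_value = actual.get(check)
        if snap_value is None:
            failures.append(f"{check}: missing from snapshot")
        elif master_value and abs(snap_value - master_value) / master_value > tolerance:
            failures.append(f"{check}: {snap_value} vs {master_value}")
    return failures

expected = {"users.count": 1_000_000, "posts.count": 100_000_000}
actual   = {"users.count":   999_500, "posts.count": 100_000_000}
print(validate_backup(expected, actual))  # [] – within 1%, backup looks sane
```

A small tolerance is needed because the master keeps taking writes while the snapshot is frozen; an exact match is only expected for tables the backup slave had fully replicated at snapshot time.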
Monitoring
• We use Zabbix for monitoring – alerting + trending
• Close to 100,000 checks
• MySQL Performance Monitor plugin for InnoDB
• Custom scripts can use the Zabbix trapper – e.g. data-size trend
• Dbview (PHP-based tool) for realtime processlist and slow-query history
• Red light is ON – means trouble
Monitoring Contd..
DBVIEW
DBVIEW contd..
Benchmarking
• To understand the capacity of the hardware and software
• To understand what happens when you have a high-traffic day
• To understand the impact of configuration changes
• To test the storage unit
Benchmarking with linkbench
• Linkbench from Facebook simulates a social-graph workload
• The default workload pattern roughly matches our workload
• https://github.com/facebook/linkbench
• https://www.facebook.com/notes/facebook-engineering/linkbench-a-database-benchmark-for-the-social-graph/10151391496443920
• Read–write operation mix, throttled requests
• Targets hot nodes
• Extensible – you can write plugins (not tried!)
• Note: Linkbench partitions the link table. If you want to benchmark without partitioning, create the table without partitions first and then load the data.
Linkbench results
• Ops/Sec per thread
• [Thread-122]: 2190000/100000000 requests finished: 2.2% complete at 28311.8 ops/sec 77.4/100000
• Ops/Sec Total
• [main]: REQUEST PHASE COMPLETED. 100000000 requests done in 2641 seconds. Requests/second = 37852
Benchmark Cases
• Database size – 175 GB, 400 GB, 40 GB
• RAM – 64 GB
• innodb_buffer_pool_size – 50 GB for all tests
• Storage – Fusion-io ioDrive2
• File system – XFS, EXT4
• Database server specs – Dell R620, 2 sockets × 8 cores with HT => 32 procs, Intel Xeon E5-2680, Debian Linux Squeeze, 2.6.32 kernel
Benchmark Parameters
InnoDB settings for Fusion-io
• innodb_read_ahead=0
• innodb_read_io_threads=8
• innodb_adaptive_checkpoint=keep_average
• innodb_flush_method=O_DIRECT
• innodb_io_capacity=10000
• innodb_flush_neighbor_pages=0
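The settings above, as they might appear in my.cnf. A sketch: several of these are Percona Server extensions whose names and accepted values differ between the 5.1 and 5.5 series, so check your version's documentation before copying:

```ini
[mysqld]
# Flash has no seek penalty: disable read-ahead and neighbor-page
# flushing, and raise the IO ceiling for the ioDrive2
innodb_read_ahead           = 0
innodb_read_io_threads      = 8
innodb_adaptive_checkpoint  = keep_average
innodb_flush_method         = O_DIRECT
innodb_io_capacity          = 10000
innodb_flush_neighbor_pages = 0
```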
Benchmark results
[Chart: ops/sec on XFS vs EXT4 – Percona 5.1: 25000 (XFS), 30000 (EXT4); Percona 5.5: 33000 (XFS), 38000 (EXT4). Linkbench, 175G database on Fusion-io ioDrive2, 200 threads, 70:30 RW mix]
InnoDB Mutex in 5.1
Bigger Database
[Chart: Percona 5.5.33, 400G DB, 200 threads, 70:30 RW mix – 10000 ops/sec (XFS) vs 13000 ops/sec (EXT4)]
Fusion Fullness
[Chart: Percona 5.5.33, 175G DB (plus 400G junk for the "more full" case), 200 threads, 70:30 RW mix – more full: 21000 (XFS), 33000 (EXT4); less full: 33000 (XFS), 38000 (EXT4)]
Fragmentation Effect
[Chart: Percona 5.5.33, 175G DB, 200 threads, 70:30 RW mix – 27000 ops/sec (XFS) vs 36000 ops/sec (EXT4)]
Doublewrite Effect
[Chart: Percona 5.5.33, 175G DB, 200 threads, 70:30 RW mix – 38000 ops/sec with doublewrite ON vs 45000 with it OFF]
• No atomicity without doublewrite
• Percona supports atomic writes using the DirectFS NVM file system
READ Workload
[Chart: Percona 5.5.33, 175G DB, 200 threads, 90% reads / 10% writes – 59000 ops/sec (XFS) vs 59500 (EXT4)]
WRITE Workload
[Chart: Percona 5.5.33, 175G DB, 200 threads, 90% writes / 10% reads – 6000 ops/sec (XFS) vs 14000 (EXT4)]
More Threads
[Chart: Percona 5.5.33, 175G DB, 400 threads, 70:30 RW mix – 18000 ops/sec (XFS) vs 35000 (EXT4)]
Bufferpool Instances
[Chart: Percona 5.5.33, 175G DB, 200 threads, 70:30 RW mix – ops/sec for 1, 2, and 8 buffer pool instances, all between roughly 37000 and 40000; does NUMA play a role? Separate graph for the single-instance case]
Low concurrency
• Is important, because the workload may not always be highly concurrent
• System resources are best utilized under high concurrency, but there are plenty of situations where only 4 or 8 threads are running
• How does the database behave in these situations?
• Could there be a performance regression?
Low concurrency contd..
[Chart: ops/sec – 4 threads: 7500 (Percona 5.1) vs 7300 (Percona 5.5); 8 threads: 11000 (5.1) vs 10500 (5.5) – slightly lower in 5.5 at low concurrency]
O_DIRECT vs Buffered
[Chart: Percona 5.5.33 ops/sec – buffered: 16500 (XFS), 30000 (EXT4); O_DIRECT: 33000 (XFS), 38000 (EXT4)]
EXT4 Buffered vs XFS Buffered
[Graphs: CPU usage, await, Fusion-io throughput, InnoDB semaphores, InnoDB data reads/writes]
EXT4 O_DIRECT vs XFS O_DIRECT
[Graphs: CPU usage, Fusion-io throughput, InnoDB data reads/writes, InnoDB semaphores, InnoDB transactions]
NOSQL
• User feeds, nested comments, personalization
• User feeds can be nasty:
  • Some users have thousands of friends
  • Millions of feeds
  • Feeds can have events that cannot share a schema
  • E.g. "somebody updated weight" != "somebody updated body fat %"
• MongoDB: a savior plus a killer
  • Savior because of its schemaless nature
  • Killer because of its locking issues and index immaturity
• Sharding may solve the locking problem, but this is not a shardable problem
User Feeds
[Diagram:
Feeds collection – { _id:, Text:, Date:, Event_ids: [Id1, Id2, …] }
Events collection – { _id:, Weight:, … }, { _id:, BodyFat:, … }]
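The two collections above, as illustrative documents (every field value here is made up). The point is that events need not share a schema, and that stitching events into a feed happens application-side:

```python
# Illustrative feeds/events documents: a "weight" event and a "bodyfat"
# event carry different fields, yet both hang off the same feed entry.

feeds = [
    {"_id": "f1", "Text": "Progress update", "Date": "2013-11-12",
     "Event_ids": ["e1", "e2"]},
]
events = {
    "e1": {"_id": "e1", "Weight": 82.5},    # weight-update event
    "e2": {"_id": "e2", "BodyFat": 14.2},   # body-fat-update event
}

def render_feed(feed):
    """Expand a feed's event references – the application-side 'join'
    that MongoDB itself does not perform."""
    return [events[eid] for eid in feed["Event_ids"]]

print(render_feed(feeds[0]))
```

In MongoDB the same shape would be two collections queried by `_id`; the schemaless event documents are the "savior" part, while every lookup fanning out across millions of events is where the locking pain shows up.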
Thank You