perconalive feb-2011-share
TRANSCRIPT
![Page 1: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/1.jpg)
High Value Transaction Processing
Mark Callaghan
![Page 2: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/2.jpg)
What do I mean by value?
▪ Low price?
▪ High price/performance?
▪ Valuable data
![Page 3: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/3.jpg)
OLTP in the datacenter
▪ Sharding
▪ Availability
▪ Legacy applications
▪ Used by many applications
![Page 4: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/4.jpg)
Sharding
▪ Sharding is easy, resharding is hard
▪ Joins within a shard are still frequent and useful
▪ Some all-shards joins must use Hive
▪ Provides some fault-isolation benefits
![Page 5: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/5.jpg)
Availability
▪ Sources of downtime
▪ Schema change (but now we have OSC)
▪ Manual failover
▪ Misbehaving applications
▪ Oops
![Page 6: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/6.jpg)
Used by many applications
If your company is successful then
▪ Your database will be accessed by many different applications
▪ Application authors might not be MySQL experts
▪ Application owners might have different priorities than the DB team
![Page 7: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/7.jpg)
Legacy applications
If your company is successful then you will have
▪ Applications written many years ago by people who are gone
▪ Design decisions that are not good for your current size
▪ Not enough resources or time to rewrite applications
![Page 8: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/8.jpg)
Our busy OLTP deployment
▪ Query response time
▪ 4 ms reads, 5ms writes
▪ Network bytes per second
▪ 38GB peak
▪ Queries per second
▪ 13M peak
▪ Rows read per second
▪ 450M peak
▪ Rows changed per second
▪ 3.5M peak
▪ InnoDB page IO per second
▪ 5.2M peak
![Page 9: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/9.jpg)
Recent improvements
▪ Joint work by Facebook, Percona and Oracle/MySQL
▪ Prevent InnoDB stalls
▪ Stalls from caches
▪ Stalls from mutexes
▪ IO efficiency
▪ Improve monitoring
▪ Improve XtraBackup
![Page 10: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/10.jpg)
How do you measure performance?
▪ Response time variance leads to bad user experiences
▪ Optimizations that defer work must handle steady-state loads
▪ When designing a server the choices are:
▪ No concurrency (and no mutexes)
▪ One mutex
▪ More than one mutex
![Page 11: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/11.jpg)
This has good average performance
![Page 12: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/12.jpg)
Which metric matters?
![Page 13: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/13.jpg)
Stalls from caches
Caches that defer expensive operations must eventually complete them at the same rate at which they are deferred.
▪ InnoDB purge
▪ InnoDB insert buffer
▪ Async writes are not async
▪ Fuzzy checkpoint constraint enforcement
![Page 14: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/14.jpg)
InnoDB purge stalls
▪ InnoDB purge removes delete-marked rows
▪ Done by the main background thread in 5.1 plugin
▪ Optionally done by a separate thread in 5.5
▪ Purge is single-threaded and might be stalled by disk reads
▪ Further it gets behind, more likely it won’t catch up
▪ Need multiple purge threads as the main background thread can become the dedicated purge thread and that isn’t enough
do {n_pages_purged = trx_purge();
} while (n_pages_purged);
![Page 15: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/15.jpg)
InnoDB insert buffer stalls
▪ The insert buffer is not drained as fast as it can get full
▪ Drain rate is 5% of innodb_io_capacity
▪ bugs.mysql.com/59214
▪ Fixed in the Facebook patch and XtraDB
▪ Patch pending for MySQL 5.5
![Page 16: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/16.jpg)
Performance drops when ibuf is full
![Page 17: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/17.jpg)
Otherwise, the insert buffer is awesome
![Page 18: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/18.jpg)
Fuzzy checkpoint constraint
▪ TotalLogSize = #log_files X innodb_log_file_size
▪ AsyncLimit = 0.70 X TotalLogSize
▪ SyncLimit = 0.75 X TotalLogSize
▪ OldestDirtyLSN is the smallest oldest_modification LSN of all dirty pages in the buffer pool
▪ Age = CurrentLSN – OldestDirtyLSN
Fuzzy Checkpoint Constraint
▪ If Age > SyncLimit then flush_dirty_pages_synch()
▪ Else if Age > AsyncLimit then flush_dirty_pages_async()
![Page 19: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/19.jpg)
Async page writes are not async
▪ Async page write requests submitted per fuzzy checkpoint constraint are not async
▪ User transactions may do this via log_preflush_pool_modified_pages
▪ Caller does large write for doublewrite buffer
▪ Caller then submits in-place write requests for background write threads
▪ Caller then waits for background write threads to finish
▪ bugs.mysql.com/55004
▪ Fixed in the Facebook patch
![Page 20: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/20.jpg)
Fuzzy checkpoint constraint enforcement
Prior to InnoDB plugin 5.1.38, page writes done to enforce the fuzzy checkpoint constraint were not submitted by the main background thread.
▪ InnoDB plugin added innodb_adaptive_flushing in 5.1.38 plugin
▪ Percona added innodb_adaptive_checkpoint
▪ Facebook patch added innodb_background_checkpoint
![Page 21: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/21.jpg)
Sysbench QPS at 20 second intervals with checkpoint stalls
![Page 22: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/22.jpg)
Stalls from mutexes
▪ Extending InnoDB files
▪ Opening InnoDB tables
▪ Purge/undo lock conflicts
▪ TRUNCATE table and LOCK_open
▪ DROP table and LOCK_open
▪ Buffer pool invalidate
▪ LOCK_open and kernel_mutex
▪ Excessive calls to fcntl
▪ Deadlock detection overhead
▪ innodb_thread_concurrency
![Page 23: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/23.jpg)
Stalls from extending InnoDB files
▪ A global mutex is locked when InnoDB tables are extended while writes are done to extend the file
▪ All reads on the file are blocked until the writes are done
▪ bugs.mysql.com/56433
▪ To be fixed real soon in the Facebook patch
![Page 24: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/24.jpg)
Stalls from opening InnoDB tables
▪ Opening table handler instances is serialized on LOCK_open. Index cardinality stats might then be computed using random reads
▪ bugs.mysql.com/49463 and bugs.mysql.com/53046
▪ Fixed in the Facebook patch and MySQL 5.5
▪ When stats are recomputed many uses of that table will stall
▪ Fixed in the Facebook patch
▪ Index stats could be recomputed too frequently
▪ bugs.mysql.com/56340
▪ Fixed in the Facebook patch, MySQL 5.1 and MySQL 5.5
![Page 25: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/25.jpg)
Stalls from purge/undo lock conflicts
▪ Purge and undo are not concurrent on the same InnoDB table
▪ Purge gets a share lock on the table
▪ Undo gets an exclusive lock on the table
▪ REPLACE statements that use insert-then-undo can generate undo
▪ bugs.mysql.com/54538
▪ Fixed in MySQL 5.1.55 and MySQL 5.5
![Page 26: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/26.jpg)
TRUNCATE table and LOCK_open
▪ LOCK_open is held when the truncate is done by InnoDB
▪ When file-per-table is used the file must be removed and that can take too long
▪ The InnoDB buffer pool LRU must be scanned
▪ New queries cannot be started
▪ bugs.mysql.com/41158 and bugs.mysql.com/56696
▪ Fixed in MySQL 5.5 courtesy of meta-data locking
![Page 27: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/27.jpg)
DROP table and LOCK_open
▪ LOCK_open is held when the drop is done by InnoDB
▪ When file-per-table is used the file must be removed and that can take too long
▪ The InnoDB buffer pool LRU must be scanned
▪ New queries cannot be started
▪ bugs.mysql.com/56655
▪ Fixed in the Facebook patch
▪ Do most InnoDB processing in the background drop queue
▪ Fixed in MySQL 5.5 courtesy of meta-data locking
![Page 28: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/28.jpg)
TRUNCATE/DROP table and invalidate
▪ Pages for table removed from buffer pool and adaptive hash
▪ InnoDB buffer pool mutex locked while the LRU is scanned
▪ This is slow with a large buffer pool
▪ Most threads in InnoDB will block waiting for the buffer pool mutex
▪ bugs.mysql.com/51325 and bugs.mysql.com/56332
▪ I hope Yasufumi can fix it
![Page 29: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/29.jpg)
LOCK_open and kernel_mutex conflicts
▪ Thread A
▪ Gather table statistics while holding LOCK_open
▪ Block on kernel_mutex while starting a transaction
▪ Thread B
▪ Hold kernel_mutex while doing deadlock detection
▪ All other threads block on LOCK_open or kernel_mutex
▪ bugs.mysql.com/51557
▪ Fixed in MySQL 5.5
![Page 30: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/30.jpg)
Stalls from excessive calls to fcntl
▪ fcntl
▪ My Linux kernels get the big kernel lock on fcntl calls
▪ MySQL called fcntl too often
▪ Doubled peak QPS by hacking MySQL to call fcntl less
▪ Almost 200,000 QPS without using HandlerSocket
▪ bugs.mysql.com/54790
▪ Fixed in Facebook patch, then reverted because it broke SSL tests
▪ Not sure where or when this will be fixed
![Page 31: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/31.jpg)
Sysbench read-only with fcntl fix
![Page 32: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/32.jpg)
Stalls from deadlock detection overhead
▪ InnoDB deadlock detection was very inefficient. Worst case when all threads waited on the same row lock.
▪ Added option to disable it in the Facebook patch and rely on lock wait timeout
▪ MySQL made it more efficient in MySQL 5.1
▪ bugs.mysql.com/49047
![Page 33: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/33.jpg)
Stalls from innodb_thread_concurrency
▪ When there are 1000+ sleeping threads it can take too long to wake up a specific thread
▪ Change innodb_thread_concurrency to use FIFO scheduling in addition to existing use of LIFO and FIFO+LIFO = FLIFO
▪ Fixed in the Facebook patch
![Page 34: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/34.jpg)
Sysbench TPS with FLIFO
![Page 35: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/35.jpg)
IO efficiency
High priority problems for me are:
▪ Reducing IOPs used for my workload
▪ Supporting very large databases
Significant improvements:
▪ Switch from mysqldump to XtraBackup
▪ Run innosim to confirm storage performance
▪ Tune InnoDB
▪ Improve schemas and queries
![Page 36: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/36.jpg)
mysqldump vs XtraBackup
▪ mysqldump is slower for backup
▪ Clustered index is scanned row-at-a-time in key order (lots of random reads)
▪ Backup accounts for half of the disk reads for servers I watch
▪ Single-table restore is easy with mysqldump
▪ Possible with XtraBackup thanks to work by Vamsi from Facebook
▪ Incremental backup
▪ Not possible with mysqldump
▪ XtraBackup has incremental (scan all data, write only the changed blocks)
▪ Vamsi from Facebook added support for really incremental, scan & write only
the changed blocks
![Page 37: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/37.jpg)
innosim storage benchmark
▪ InnoDB IO simulator that models
▪ Doublewrite buffer
▪ Dirty page writes
▪ Transaction log and binlog fsync and IO
▪ User transactions that do read, write and commit
▪ Search for “facebook innosim”
▪ Source code on launchpad
![Page 38: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/38.jpg)
Tune InnoDB
▪ It is not easy to support many concurrent disk reads
▪ Innodb_thread_concurrency tickets not released when waiting for a read
▪ If innodb_thread_concurrency is too high then writers suffer
▪ If innodb_thread_concurrency is too low then readers suffer
▪ Smaller pages are better for some but not all tables
▪ A large log file can reduce the dirty page flush rate
▪ A large buffer pool can reduce the page read rate
![Page 39: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/39.jpg)
IOPs is a function of size and concurrency
![Page 40: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/40.jpg)
Smaller pages aren’t always better
![Page 41: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/41.jpg)
Checkpoint IO rate by log file size
![Page 42: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/42.jpg)
Page read rate by buffer pool size
![Page 43: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/43.jpg)
Improve schemas
▪ Make your performance critical queries index only
▪ Primary key columns are included in the secondary index
▪ Understand how the insert buffer makes index maintenance cheaper
▪ Figure out how to do schema changes with minimal downtime
▪ We used the Online Schema Change tool (thanks Vamsi)
▪ You can also do the schema change on a slave first and then promote it
![Page 44: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/44.jpg)
Monitoring
▪ Per table, index, account via information_schema tables
▪ Efficient and always enabled
▪ Easy to use
▪ Enhanced slow query log
▪ Facebook patch added options to do sampling for the slow query log
▪ Sample from all queries and from all queries that have an error
▪ Error is limited to errno, error text must wait for 5.5 plugin
▪ Aggregate by query text and URL from query commen
![Page 45: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/45.jpg)
Open Problems
▪ Parallel replication apply
▪ Support max concurrent queries
▪ Automate slave failover when a master fails
▪ Use InnoDB compression for OLTP
▪ Multi-master replication with conflict resolution
![Page 46: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/46.jpg)
Parallel replication apply
▪ Replication apply is single-threaded. This causes lag on IO-bound slaves even when SQL is simple
▪ mk-slave-prefetch can help but something better is needed
▪ Is a thread running BEGIN; replay-slave-sql; ROLLBACK better?
▪ I want:
▪ N replay queues
▪ Binlog events (SBR or RBR) hashed to queues by database names
▪ Each queue replayed in parallel
![Page 47: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/47.jpg)
Max concurrent queries
▪ Use large values for max concurrent connections per account
▪ Enforce smaller values for max concurrent queries
▪ We have begun testing an implementation.
▪ Enforce at statement entry
▪ Account for threads that block (row lock, disk IO, network IO)
![Page 48: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/48.jpg)
Automate slave failover
▪ Global transactions IDs from the Google patch is awesome
▪ But I don’t have the skills to port or support it
▪ A unique ID per binlog group or event might be sufficient
▪ Add an attribute to binlog event metadata
▪ Preserve on the slave similar to server ID
![Page 49: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/49.jpg)
InnoDB compression for OLTP
▪ Change InnoDB to not log page images for compressed pages
▪ Logging them increases the log IO rate
▪ Increasing the log IO rate then increases the checkpoint IO rate
▪ Change InnoDB to use QuickLZ instead of zlib for compression
▪ Add an option to limit compression to the PK index
▪ Add per-table compression statistics
![Page 50: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/50.jpg)
MySQL in the datacenter
▪ Previously dominated the market
▪ Now it must learn to share
▪ PostgreSQL continues to improve for OLTP
▪ Hbase, Cassandra, MongoDB are getting transactions today
![Page 51: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/51.jpg)
Why NoSQL
▪ Do less, but do it better
▪ Some offer write-optimized data stores
▪ Some don’t require sharding
▪ Interesting HA models
▪ Cassandra doesn’t have the notion of failover
▪ HBase doesn’t require failover when a server dies
▪ Healthy development communities improve code quickly
![Page 52: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/52.jpg)
What comes next
▪ Batch extraction is not the answer for MySQL/NoSQL integration
▪ NoSQL deployments will be reminded that
▪ Some of your problems are independent of technology
▪ You need better monitoring
▪ There is downtime when you need to modify the clustered index
▪ Database ops is hard with legacy apps and multi-user deployments
▪ In a few years someone will document the many stalls in HBase
![Page 53: Perconalive feb-2011-share](https://reader031.vdocuments.us/reader031/viewer/2022030310/58f9a91b760da3da068b6b11/html5/thumbnails/53.jpg)
The End
Thank you