perconalive feb-2011-share

High Value Transaction Processing

Mark Callaghan

What do I mean by value?

▪ Low price?

▪ High price/performance?

▪ Valuable data

OLTP in the datacenter

▪ Sharding

▪ Availability

▪ Legacy applications

▪ Used by many applications

Sharding

▪ Sharding is easy, resharding is hard

▪ Joins within a shard are still frequent and useful

▪ Some all-shards joins must use Hive

▪ Provides some fault-isolation benefits

Availability

▪ Sources of downtime

▪ Schema change (but now we have OSC)

▪ Manual failover

▪ Misbehaving applications

▪ Oops

Used by many applications

If your company is successful then

▪ Your database will be accessed by many different applications

▪ Application authors might not be MySQL experts

▪ Application owners might have different priorities than the DB team

Legacy applications

If your company is successful then you will have

▪ Applications written many years ago by people who are gone

▪ Design decisions that are not good for your current size

▪ Not enough resources or time to rewrite applications

Our busy OLTP deployment

▪ Query response time

▪ 4 ms reads, 5ms writes

▪ Network bytes per second

▪ 38GB peak

▪ Queries per second

▪ 13M peak

▪ Rows read per second

▪ 450M peak

▪ Rows changed per second

▪ 3.5M peak

▪ InnoDB page IO per second

▪ 5.2M peak

Recent improvements

▪ Joint work by Facebook, Percona and Oracle/MySQL

▪ Prevent InnoDB stalls

▪ Stalls from caches

▪ Stalls from mutexes

▪ IO efficiency

▪ Improve monitoring

▪ Improve XtraBackup

How do you measure performance?

▪ Response time variance leads to bad user experiences

▪ Optimizations that defer work must handle steady-state loads

▪ When designing a server the choices are:

▪ No concurrency (and no mutexes)

▪ One mutex

▪ More than one mutex

This has good average performance

Which metric matters?

Stalls from caches

Caches that defer expensive operations must eventually complete them at the same rate at which they are deferred.

▪ InnoDB purge

▪ InnoDB insert buffer

▪ Async writes are not async

▪ Fuzzy checkpoint constraint enforcement

InnoDB purge stalls

▪ InnoDB purge removes delete-marked rows

▪ Done by the main background thread in 5.1 plugin

▪ Optionally done by a separate thread in 5.5

▪ Purge is single-threaded and might be stalled by disk reads

▪ Further it gets behind, more likely it won’t catch up

▪ Need multiple purge threads as the main background thread can become the dedicated purge thread and that isn’t enough

do {n_pages_purged = trx_purge();

} while (n_pages_purged);

InnoDB insert buffer stalls

▪ The insert buffer is not drained as fast as it can get full

▪ Drain rate is 5% of innodb_io_capacity

▪ bugs.mysql.com/59214

▪ Fixed in the Facebook patch and XtraDB

▪ Patch pending for MySQL 5.5

Performance drops when ibuf is full

Otherwise, the insert buffer is awesome

Fuzzy checkpoint constraint

▪ TotalLogSize = #log_files X innodb_log_file_size

▪ AsyncLimit = 0.70 X TotalLogSize

▪ SyncLimit = 0.75 X TotalLogSize

▪ OldestDirtyLSN is the smallest oldest_modification LSN of all dirty pages in the buffer pool

▪ Age = CurrentLSN – OldestDirtyLSN

Fuzzy Checkpoint Constraint

▪ If Age > SyncLimit then flush_dirty_pages_synch()

▪ Else if Age > AsyncLimit then flush_dirty_pages_async()

Async page writes are not async

▪ Async page write requests submitted per fuzzy checkpoint constraint are not async

▪ User transactions may do this via log_preflush_pool_modified_pages

▪ Caller does large write for doublewrite buffer

▪ Caller then submits in-place write requests for background write threads

▪ Caller then waits for background write threads to finish


▪ Fixed in the Facebook patch

Fuzzy checkpoint constraint enforcement

Prior to InnoDB plugin 5.1.38, page writes done to enforce the fuzzy checkpoint constraint were not submitted by the main background thread.

▪ InnoDB plugin added innodb_adaptive_flushing in 5.1.38 plugin

▪ Percona added innodb_adaptive_checkpoint

▪ Facebook patch added innodb_background_checkpoint

Sysbench QPS at 20 second intervals with checkpoint stalls

Stalls from mutexes

▪ Extending InnoDB files

▪ Opening InnoDB tables

▪ Purge/undo lock conflicts

▪ TRUNCATE table and LOCK_open

▪ DROP table and LOCK_open

▪ Buffer pool invalidate

▪ LOCK_open and kernel_mutex

▪ Excessive calls to fcntl

▪ Deadlock detection overhead

▪ innodb_thread_concurrency

Stalls from extending InnoDB files

▪ A global mutex is locked when InnoDB tables are extended while writes are done to extend the file

▪ All reads on the file are blocked until the writes are done


▪ To be fixed real soon in the Facebook patch

Stalls from opening InnoDB tables

▪ Opening table handler instances is serialized on LOCK_open. Index cardinality stats might then be computed using random reads

▪ bugs.mysql.com/49463 and bugs.mysql.com/53046

▪ Fixed in the Facebook patch and MySQL 5.5

▪ When stats are recomputed many uses of that table will stall


▪ Index stats could be recomputed too frequently


▪ Fixed in the Facebook patch, MySQL 5.1 and MySQL 5.5

Stalls from purge/undo lock conflicts

▪ Purge and undo are not concurrent on the same InnoDB table

▪ Purge gets a share lock on the table

▪ Undo gets an exclusive lock on the table

▪ REPLACE statements that use insert-then-undo can generate undo


▪ Fixed in MySQL 5.1.55 and MySQL 5.5

TRUNCATE table and LOCK_open

▪ LOCK_open is held when the truncate is done by InnoDB

▪ When file-per-table is used the file must be removed and that can take too long

▪ The InnoDB buffer pool LRU must be scanned

▪ New queries cannot be started


▪ Fixed in MySQL 5.5 courtesy of meta-data locking

DROP table and LOCK_open

▪ LOCK_open is held when the drop is done by InnoDB

▪ When file-per-table is used the file must be removed and that can take too long

▪ The InnoDB buffer pool LRU must be scanned

▪ New queries cannot be started



▪ Do most InnoDB processing in the background drop queue

▪ Fixed in MySQL 5.5 courtesy of meta-data locking

TRUNCATE/DROP table and invalidate

▪ Pages for table removed from buffer pool and adaptive hash

▪ InnoDB buffer pool mutex locked while the LRU is scanned

▪ This is slow with a large buffer pool

▪ Most threads in InnoDB will block waiting for the buffer pool mutex


▪ I hope Yasufumi can fix it

LOCK_open and kernel_mutex conflicts

▪ Thread A

▪ Gather table statistics while holding LOCK_open

▪ Block on kernel_mutex while starting a transaction

▪ Thread B

▪ Hold kernel_mutex while doing deadlock detection

▪ All other threads block on LOCK_open or kernel_mutex


▪ Fixed in MySQL 5.5

Stalls from excessive calls to fcntl

▪ fcntl

▪ My Linux kernels get the big kernel lock on fcntl calls

▪ MySQL called fcntl too often

▪ Doubled peak QPS by hacking MySQL to call fcntl less

▪ Almost 200,000 QPS without using HandlerSocket


▪ Fixed in Facebook patch, then reverted because it broke SSL tests

▪ Not sure where or when this will be fixed

Sysbench read-only with fcntl fix

Stalls from deadlock detection overhead

▪ InnoDB deadlock detection was very inefficient. Worst case when all threads waited on the same row lock.

▪ Added option to disable it in the Facebook patch and rely on lock wait timeout

▪ MySQL made it more efficient in MySQL 5.1


Stalls from innodb_thread_concurrency

▪ When there are 1000+ sleeping threads it can take too long to wake up a specific thread

▪ Change innodb_thread_concurrency to use FIFO scheduling in addition to existing use of LIFO and FIFO+LIFO = FLIFO


Sysbench TPS with FLIFO

IO efficiency

High priority problems for me are:

▪ Reducing IOPs used for my workload

▪ Supporting very large databases

Significant improvements:

▪ Switch from mysqldump to XtraBackup

▪ Run innosim to confirm storage performance

▪ Tune InnoDB

▪ Improve schemas and queries

mysqldump vs XtraBackup

▪ mysqldump is slower for backup

▪ Clustered index is scanned row-at-a-time in key order (lots of random reads)

▪ Backup accounts for half of the disk reads for servers I watch

▪ Single-table restore is easy with mysqldump

▪ Possible with XtraBackup thanks to work by Vamsi from Facebook

▪ Incremental backup

▪ Not possible with mysqldump

▪ XtraBackup has incremental (scan all data, write only the changed blocks)

▪ Vamsi from Facebook added support for really incremental, scan & write only

the changed blocks

innosim storage benchmark

▪ InnoDB IO simulator that models

▪ Doublewrite buffer

▪ Dirty page writes

▪ Transaction log and binlog fsync and IO

▪ User transactions that do read, write and commit

▪ Search for “facebook innosim”

▪ Source code on launchpad

Tune InnoDB

▪ It is not easy to support many concurrent disk reads

▪ Innodb_thread_concurrency tickets not released when waiting for a read

▪ If innodb_thread_concurrency is too high then writers suffer

▪ If innodb_thread_concurrency is too low then readers suffer

▪ Smaller pages are better for some but not all tables

▪ A large log file can reduce the dirty page flush rate

▪ A large buffer pool can reduce the page read rate

IOPs is a function of size and concurrency

Smaller pages aren’t always better

Checkpoint IO rate by log file size

Page read rate by buffer pool size

Improve schemas

▪ Make your performance critical queries index only

▪ Primary key columns are included in the secondary index

▪ Understand how the insert buffer makes index maintenance cheaper

▪ Figure out how to do schema changes with minimal downtime

▪ We used the Online Schema Change tool (thanks Vamsi)

▪ You can also do the schema change on a slave first and then promote it

Monitoring

▪ Per table, index, account via information_schema tables

▪ Efficient and always enabled

▪ Easy to use

▪ Enhanced slow query log

▪ Facebook patch added options to do sampling for the slow query log

▪ Sample from all queries and from all queries that have an error

▪ Error is limited to errno, error text must wait for 5.5 plugin

▪ Aggregate by query text and URL from query commen

Open Problems

▪ Parallel replication apply

▪ Support max concurrent queries

▪ Automate slave failover when a master fails

▪ Use InnoDB compression for OLTP

▪ Multi-master replication with conflict resolution

Parallel replication apply

▪ Replication apply is single-threaded. This causes lag on IO-bound slaves even when SQL is simple

▪ mk-slave-prefetch can help but something better is needed

▪ Is a thread running BEGIN; replay-slave-sql; ROLLBACK better?

▪ I want:

▪ N replay queues

▪ Binlog events (SBR or RBR) hashed to queues by database names

▪ Each queue replayed in parallel

Max concurrent queries

▪ Use large values for max concurrent connections per account

▪ Enforce smaller values for max concurrent queries

▪ We have begun testing an implementation.

▪ Enforce at statement entry

▪ Account for threads that block (row lock, disk IO, network IO)

Automate slave failover

▪ Global transactions IDs from the Google patch is awesome

▪ But I don’t have the skills to port or support it

▪ A unique ID per binlog group or event might be sufficient

▪ Add an attribute to binlog event metadata

▪ Preserve on the slave similar to server ID

InnoDB compression for OLTP

▪ Change InnoDB to not log page images for compressed pages

▪ Logging them increases the log IO rate

▪ Increasing the log IO rate then increases the checkpoint IO rate

▪ Change InnoDB to use QuickLZ instead of zlib for compression

▪ Add an option to limit compression to the PK index

▪ Add per-table compression statistics

MySQL in the datacenter

▪ Previously dominated the market

▪ Now it must learn to share

▪ PostgreSQL continues to improve for OLTP

▪ Hbase, Cassandra, MongoDB are getting transactions today

Why NoSQL

▪ Do less, but do it better

▪ Some offer write-optimized data stores

▪ Some don’t require sharding

▪ Interesting HA models

▪ Cassandra doesn’t have the notion of failover

▪ HBase doesn’t require failover when a server dies

▪ Healthy development communities improve code quickly

What comes next

▪ Batch extraction is not the answer for MySQL/NoSQL integration

▪ NoSQL deployments will be reminded that

▪ Some of your problems are independent of technology

▪ You need better monitoring

▪ There is downtime when you need to modify the clustered index

▪ Database ops is hard with legacy apps and multi-user deployments

▪ In a few years someone will document the many stalls in HBase

The End

Thank you

perconalive feb-2011-share

Data & Analytics