dw-tpain - Gordon Klok

dw-tpain Gordon Klok Demonware, Inc

Upload: devopsdays

Post on 08-May-2015


Page 1: Dw tpain - Gordon Klok

dw-tpain

Gordon Klok, Demonware, Inc

Page 2: Dw tpain - Gordon Klok

A Demonware MySQL cluster

For more information about the LSG, see: “Erlang and First Person Shooters”

http://www.erlang-factory.com/upload/presentations/395/ErlangandFirst-PersonShooters.pdf

Page 4: Dw tpain - Gordon Klok
Page 5: Dw tpain - Gordon Klok

Not good

Page 6: Dw tpain - Gordon Klok

Everything checks out on the database server

Page 7: Dw tpain - Gordon Klok

NFS Latency during backup window.

Page 8: Dw tpain - Gordon Klok

Paging...

So what's the problem? How is the slow performance of the NAS affecting the client?

Page 9: Dw tpain - Gordon Klok
Page 10: Dw tpain - Gordon Klok

pdflush

• /proc/sys/vm/dirty_background_ratio (default 10): Maximum percentage of active memory that can be filled with dirty pages before pdflush begins to write them

• /proc/sys/vm/dirty_expire_centisecs (default 3000): In hundredths of a second, how long data can be in the page cache before it's considered expired and must be written at the next opportunity. Note that this default is very long: a full 30 seconds. That means that under normal circumstances, unless you write enough to trigger the other pdflush method, Linux won't actually commit anything you write until 30 seconds later.

• See “The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads” http://www.westnet.com/~gsmith/content/linux-pdflush.htm
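To make the two tunables above concrete, here is a minimal Python sketch that reads them and compares the amount of dirty page cache against the background writeback threshold. It is an illustration, not something shown in the talk: the helper names are hypothetical, and using MemTotal as the base for the percentage is a simplifying assumption, since the kernel applies the ratio to its own estimate of dirtyable memory.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def read_vm_tunable(name: str, vm_dir: Path = VM_DIR) -> int:
    """Return an integer tunable such as dirty_background_ratio."""
    return int((vm_dir / name).read_text().split()[0])


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field: value}; most fields are in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def background_threshold_kb(base_kb: int, ratio_percent: int) -> int:
    """Dirty-page level (kB) at which background writeback would start."""
    return base_kb * ratio_percent // 100


def describe_writeback() -> str:
    """Summarise dirty memory against the threshold on a live Linux host."""
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    ratio = read_vm_tunable("dirty_background_ratio")
    expire_cs = read_vm_tunable("dirty_expire_centisecs")
    # Assumption: MemTotal approximates the memory the ratio is applied to.
    limit_kb = background_threshold_kb(info["MemTotal"], ratio)
    return (
        f"dirty={info['Dirty']} kB, background threshold~{limit_kb} kB, "
        f"expire after {expire_cs / 100:.0f} s"
    )
```

Calling `describe_writeback()` on a database host during the backup window gives a quick view of whether dirty pages are being flushed because of the ratio or only because they have expired.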

Page 11: Dw tpain - Gordon Klok

pdflush continued

• Lots of memory in our database servers: not hitting the 10% threshold

• During the pdflush interval about 775MB of dirty pages were accumulating

• NFS write operations were taking about 0.6 sec per 65k block

• Meaning an 11 second stall in all IO while waiting for dirty pages to be written out

Page 12: Dw tpain - Gordon Klok

Very useful feature:

pv(1) pv - monitor the progress of data through a pipe

-L RATE, --rate-limit RATE
    Limit the transfer to a maximum of RATE bytes per second. A suffix of "k", "m", "g", or "t" can be added to denote kilobytes (*1024), megabytes, and so on.
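As an illustration of the rate limit, the sketch below streams one process's output through `pv -L` into a file. The talk does not show the actual backup command, so the function names and the example paths are assumptions; only `-L` (rate limit) and `-q` (quiet) are taken from pv itself.

```python
import subprocess


def pv_limit_argv(rate: str) -> list:
    """argv for a quiet pv that caps throughput at `rate`, e.g. "10m"."""
    return ["pv", "-q", "-L", rate]


def run_throttled(producer_argv: list, dest_path: str, rate: str) -> int:
    """Stream the producer's stdout through `pv -L rate` into dest_path.

    Returns the producer's exit status.
    """
    with open(dest_path, "wb") as dest:
        producer = subprocess.Popen(producer_argv, stdout=subprocess.PIPE)
        limiter = subprocess.Popen(
            pv_limit_argv(rate), stdin=producer.stdout, stdout=dest
        )
        # Close our copy so the producer gets SIGPIPE if pv exits early.
        producer.stdout.close()
        limiter.wait()
        return producer.wait()
```

A hypothetical invocation might be `run_throttled(["innobackupex", "--stream=xbstream", "/tmp"], "/mnt/nas/backup.xbstream", "10m")`, where the mount point and the 10m limit are placeholders.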

Page 13: Dw tpain - Gordon Klok

Fixed limit problem

• If the pv rate limit is set too high, we still interfere with MySQL by stealing IO from the database server's attached storage

• If the pv rate limit is set lower than the rate of incoming writes to MySQL, innobackupex will not finish before it fills the volume holding the xtrabackup logs

• Need a dynamic rate limiter

Page 14: Dw tpain - Gordon Klok

Making pv dynamic

pv(1) pv - monitor the progress of data through a pipe

-R PID, --remote PID
    If PID is an instance of pv that is already running, -R PID will cause that instance to act as though it had been given this instance's command line instead. For example, if pv -L 123k is running with process ID 9876, then running pv -R 9876 -L 321k will cause it to start using a rate limit of 321k instead of 123k. Note that some options cannot be changed while running, such as -c, -l, -f, -E, and -S.
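The man page example translates directly into a small helper that changes the limit of an already running pv. This is a sketch with hypothetical function names; it passes the rate as a plain byte count rather than a suffixed value.

```python
import subprocess


def pv_retune_argv(pid: int, rate_bytes_per_sec: int) -> list:
    """argv asking the pv instance with process ID `pid` to use a new limit."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate must be a positive number of bytes per second")
    return ["pv", "-R", str(pid), "-L", str(rate_bytes_per_sec)]


def retune_pv(pid: int, rate_bytes_per_sec: int) -> int:
    """Run `pv -R PID -L RATE` and return its exit status."""
    return subprocess.run(pv_retune_argv(pid, rate_bytes_per_sec)).returncode
```

For the case in the man page, `retune_pv(9876, 321 * 1024)` would move the running instance from 123k to 321k.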

Page 15: Dw tpain - Gordon Klok

Metric for IO Health

iostat -x -d -m 5 /dev/sda

Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.06    0.51  0.08  0.39   0.00   0.00     21.17      0.00   2.23   1.22   0.06

avgqu-sz - The average queue length of the requests that were issued to the device.

Through a bit of experimentation we found that when avgqu-sz hovered around 1.0 for the device containing the MySQL data files, we achieved the most efficient backup with the least impact.
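To use avgqu-sz as a metric, something has to pull it out of the iostat output. The following is a minimal sketch, assuming the column layout shown on this slide; the function name is hypothetical, and more recent sysstat releases label the column differently (aqu-sz, as far as I know), so the header lookup would need adjusting there.

```python
from typing import Optional


def parse_avgqu_sz(iostat_text: str, device: str) -> Optional[float]:
    """Return avgqu-sz from the last report for `device`, or None if absent."""
    column = None
    value = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            # Locate the column by name so reordered headers still work.
            column = fields.index("avgqu-sz") if "avgqu-sz" in fields else None
        elif fields[0] == device and column is not None and len(fields) > column:
            value = float(fields[column])
    return value
```

Because `iostat -x -d -m 5` prints a new report every five seconds, taking the last matching row gives the most recent sample.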

Page 16: Dw tpain - Gordon Klok

A PID Controller
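The slide itself is a diagram, so the controller's details are not in the transcript. What follows is a generic textbook PID loop in Python, offered as a sketch of how avgqu-sz could drive the pv rate limit. The gains, the rate bounds, and every name here are assumptions for illustration, and anti-windup handling is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    kp: float        # proportional gain (hypothetical value chosen by tuning)
    ki: float        # integral gain
    kd: float        # derivative gain
    setpoint: float  # target avgqu-sz; the talk found about 1.0 worked well
    integral: float = 0.0
    previous_error: Optional[float] = None

    def update(self, measurement: float, dt: float) -> float:
        """Return the controller output for one new sample taken dt seconds later."""
        error = self.setpoint - measurement
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no slope is available on the first sample
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def clamp_rate(rate: float, min_rate: int, max_rate: int) -> int:
    """Keep the rate limit inside sane bounds before handing it to pv."""
    return int(min(max_rate, max(min_rate, rate)))
```

In a control loop, each iostat sample would be fed to `update()`, and the clamped output would become the new `-L` value sent to the running pv via `pv -R`. A queue shorter than the setpoint produces a positive error and raises the rate; a longer queue lowers it.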

Page 17: Dw tpain - Gordon Klok
Page 18: Dw tpain - Gordon Klok

Success!

Page 19: Dw tpain - Gordon Klok