Wish list from PostgreSQL - Linux Kernel Summit 2009
This talk explains storage and buffer usage in Postgres and discusses I/O and buffer management in the Linux kernel.
Linux Kernel Summit 2009
Wish list from PostgreSQL
Itagaki Takahiro
NTT Open Source Software Center
Released at “Linux Kernel Summit 2009”
http://events.linuxfoundation.org/archive/2009/linux-kernel-summit
October 18 - 20, 2009 - Tokyo, Japan
2
Agenda
Background
Postgres won’t use Direct I/O!
Storage and buffer usage in Postgres
Discussions
Low priority I/O for background tasks
Avoid duplicated caching in DB and kernel buffers
3
Background: Postgres won’t use Direct I/O!
Our policy is to delegate as much as possible to the kernel and avoid re-implementing the whole block layer in PostgreSQL’s user-space code.
This may be the opposite of what commercial DBMS vendors require.
We’d like to keep the I/O layer small.
We won’t use raw devices, either.
File layout should be managed by the file system.
Not ideal, but it is a good approach for supporting many platforms with a small number of developers.
<100 active main developers
<10 committers
support >10 platforms
Code for the block layer is <30K lines (5%) of the 600K lines of Postgres code.
4
Background: Storage and buffer usage in Postgres
Consists of multiple processes.
Uses the file system and multiple files (one file per 1GB of table data, one file per 16MB of xlog).
Mainly uses traditional system calls (lseek, read, write, fsync).
Starting to use posix_fadvise() in the latest version.
We depend on the kernel buffer cache and I/O management.
Does not use synchronous I/O to access data files.
Does not read ahead by itself; expects read() to do it.
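[Figure: process and storage layout]
postmaster (listener process): fork() creates each backend
backend (SQL executor process): lseek(), read(), write()
writer (sync process): lseek(), write(), fsync()
Processes share their own buffer pool, allocated with shmget(), under their own I/O exclusion control.
storage + file system: data files (1GB each), xlog files (16MB each)
Write patterns labeled in the figure: overwrites, expands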
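The file layout and system calls on this slide can be sketched as follows. This is a minimal illustration, not Postgres code: the 1GB segment size comes from the slide, while the 8192-byte block size, the function names, and the file name `segment.0` are assumptions made for the example. The read path uses plain lseek() + read() and leaves caching to the kernel, with an optional posix_fadvise() hint.

```python
import os
import tempfile

SEGMENT_SIZE = 1 << 30   # 1GB per data file, as stated on the slide
BLOCK_SIZE = 8192        # assumption: the talk does not give a block size


def locate_block(block_no, segment_size=SEGMENT_SIZE, block_size=BLOCK_SIZE):
    """Return (segment_no, byte_offset) of a block within segmented data files."""
    blocks_per_segment = segment_size // block_size
    segment_no = block_no // blocks_per_segment
    offset = (block_no % blocks_per_segment) * block_size
    return segment_no, offset


def read_block(fd, offset, block_size=BLOCK_SIZE):
    """Buffered read with lseek() + read(); the kernel page cache does the caching."""
    # Optional hint to the kernel, only where the platform offers posix_fadvise().
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, block_size, os.POSIX_FADV_WILLNEED)
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, block_size)


def demo():
    """Write two tiny blocks to a temp file, fsync, and read the second one back."""
    block_size = 16
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.0")  # hypothetical file name
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"A" * block_size + b"B" * block_size)
            os.fsync(fd)
            _, offset = locate_block(1, segment_size=64, block_size=block_size)
            return read_block(fd, offset, block_size)
        finally:
            os.close(fd)
```

Under the assumed block size, one 1GB segment holds 131072 blocks, so block 131072 is the first block of the second file. Nothing here opens files with O_DIRECT or reads ahead explicitly, which matches the delegation policy described earlier.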
5
Low priority I/O for background tasks
PostgreSQL uses some background tasks
VACUUM – cleans up DELETE’d rows and reclaims the space.
CHECKPOINT – flushes all modified pages to disk.
Current behavior in Postgres
Sleeps after every fixed amount of I/O.
Consumes constant I/O bandwidth regardless of workload.
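The current behavior can be sketched as a simple counter that sleeps whenever a fixed I/O budget is used up. This is an illustrative sketch only: the class name, the parameters `limit_bytes` and `delay_seconds`, and the page size used below are hypothetical and are not Postgres settings.

```python
import time


class IoThrottle:
    """Sleep for a fixed delay each time a fixed amount of I/O has been done."""

    def __init__(self, limit_bytes, delay_seconds, sleep=time.sleep):
        self.limit_bytes = limit_bytes      # hypothetical knob: I/O budget per nap
        self.delay_seconds = delay_seconds  # hypothetical knob: nap length
        self.sleep = sleep                  # injectable so the sketch can be tested
        self.balance = 0                    # bytes of I/O since the last nap
        self.naps = 0

    def account(self, nbytes):
        """Record nbytes of I/O and nap once the budget is used up."""
        self.balance += nbytes
        if self.balance >= self.limit_bytes:
            self.sleep(self.delay_seconds)
            self.naps += 1
            self.balance = 0


def run_background_task(num_pages, page_size, throttle):
    """Pretend to process num_pages pages, accounting for each one."""
    for _ in range(num_pages):
        throttle.account(page_size)
    return throttle.naps
```

The number of naps depends only on how much I/O the task itself has done, never on how busy the storage is. That is why this scheme uses the same bandwidth whether the system is idle or overloaded.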
Ideal behavior
Background tasks can use all surplus I/O bandwidth as long as they do not affect service.
Requirements
Low priority I/O should affect buffered writes and fsync.
Normal I/O should not wait for low priority I/Os; so fsync should not block lseek, read, write (both overwrites and extends).
Is the operation blocked by fsync()?
operation   on-cache page   off-cache page
read()      not blocked     blocked
write()     blocked         blocked
lseek()     blocked
pread()     not blocked     sometimes
pwrite()    sometimes       sometimes
6
Avoid duplicated caching in DB and kernel buffers
Both Postgres and the kernel may cache file data because Postgres uses buffered I/O.
The same blocks may be cached in both DB and kernel buffers.
Approaches to eliminate duplicated caching
Direct I/O
Pros: Can eliminate kernel cache
Cons: Need to add I/O manager to Postgres
mmap
Pros: Can eliminate DB cache
Cons: Hard to implement “Write-Ahead Logging” because mapped blocks could be flushed out at arbitrary times.
mmap is the better option because it avoids reinventing an I/O manager in Postgres.
Requirements
Have a control flag that prevents modified blocks from being flushed out. The flag is released when WAL buffers are written to storage.
– mlock() is not enough because it cannot prevent flushing.
madvise( MADV_{ DOFLUSH | DONTFLUSH } ) ?
[Figure: the same blocks are duplicated in DB buffers and kernel buffers, both layered above storage]
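The ordering constraint behind this requirement can be modeled in a few lines. The sketch below is a toy, user-space model with hypothetical names; it does not call mmap() or madvise(), and it does not assume the flags proposed on the slide exist. It only shows the rule that a modified page may reach storage once the WAL covering its latest change has been written, which is the state the proposed flag would need to convey to the kernel.

```python
class WalGuardedPool:
    """Toy buffer pool: a page is flushed only after the WAL covering it."""

    def __init__(self):
        self.next_lsn = 1        # next log sequence number to hand out
        self.flushed_lsn = 0     # WAL is on storage up to this LSN
        self.page_lsn = {}       # page id -> LSN of its latest change
        self.flushed_pages = []  # order in which pages reached storage

    def modify(self, page_id):
        """Record a change in the WAL buffer and mark the page with its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.page_lsn[page_id] = lsn
        return lsn

    def can_flush(self, page_id):
        """True once the written WAL covers the page's latest change."""
        return self.page_lsn.get(page_id, 0) <= self.flushed_lsn

    def flush_wal(self, upto_lsn):
        """Write WAL buffers to storage up to upto_lsn."""
        self.flushed_lsn = max(self.flushed_lsn, upto_lsn)

    def flush_page(self, page_id):
        """Write a page out, forcing the WAL first if it is not yet covered."""
        if not self.can_flush(page_id):
            self.flush_wal(self.page_lsn[page_id])
        self.flushed_pages.append(page_id)
```

With a private buffer pool, Postgres can run the `can_flush` check itself before every write. With mmap, the kernel decides when a dirty page is written, so the same check would have to be expressed to the kernel through some per-block control.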