wish list from postgresql - linux kernel summit 2009


Upload: takahiro-itagaki

Post on 18-Dec-2014


DESCRIPTION

This explains storage and buffer usage in Postgres and discusses I/O and buffer management in the Linux kernel.

TRANSCRIPT

Page 1: Wish list from PostgreSQL - Linux Kernel Summit 2009

Linux Kernel Summit 2009

Wish list from PostgreSQL

Itagaki Takahiro, NTT Open Source Software Center

Released at “Linux Kernel Summit 2009”

http://events.linuxfoundation.org/archive/2009/linux-kernel-summit

October 18 - 20, 2009 - Tokyo, Japan

Page 2: Wish list from PostgreSQL - Linux Kernel Summit 2009


Agenda

Background

Postgres won’t use Direct I/O!

Storage and buffer usage in Postgres

Discussions

Low priority I/O for background tasks

Avoid duplicated caching in DB and kernel buffers

Page 3: Wish list from PostgreSQL - Linux Kernel Summit 2009


Background: Postgres won’t use Direct I/O!

Our policy is to delegate as much as possible to the kernel and to avoid re-implementing the whole block layer in PostgreSQL's user space.

This may be the opposite of what commercial DBMS developers ask for.

We'd like to keep the I/O layer small.

We won't use raw devices, either.

The layout of files should be managed by the file system.

Not ideal, but it is a good approach for supporting many platforms with a small number of developers.

<100 active main developers

<10 committers

support >10 platforms

Code for the block layer is <30K lines (5%) of the 600K lines of Postgres code.

Page 4: Wish list from PostgreSQL - Linux Kernel Summit 2009


Background: Storage and buffer usage in Postgres

Consists of multiple processes.

Uses the file system and multiple files (one file per 1GB of table, one per 16MB of xlog).

Mainly uses traditional system calls (lseek, read, write, fsync). Starting to use posix_fadvise() in the latest version.

We depend on the kernel buffer cache and I/O management. We do not use synchronous I/O to access data files.

We do not read ahead ourselves; we expect read() to do it.
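The access pattern described above can be sketched with Python's `os` module, which wraps the same system calls. This is only an illustration, not Postgres code: the 8KB block size, the file name `segment.0`, and the helper names `locate_block`, `write_block`, and `read_block` are assumptions made for the example; only the 1GB segment size and the system calls come from the slide.

```python
import os
import tempfile

BLOCK_SIZE = 8192            # assumed block size (not stated on the slide)
SEGMENT_SIZE = 1 << 30       # 1GB per data file, as on the slide
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def locate_block(block_no):
    """Map a table block number to (segment file number, byte offset)."""
    segment, block_in_segment = divmod(block_no, BLOCKS_PER_SEGMENT)
    return segment, block_in_segment * BLOCK_SIZE


def write_block(fd, offset, data):
    """Buffered write: the data lands in the kernel cache, not yet on disk."""
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, data)


def read_block(fd, offset):
    """Hint the kernel about the upcoming access, then read via its cache."""
    os.posix_fadvise(fd, offset, BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, BLOCK_SIZE)


def demo():
    """Write one block, fsync it, and read it back through the kernel cache."""
    page = b"x" * BLOCK_SIZE
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.0")   # hypothetical file name
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            _, offset = locate_block(3)
            write_block(fd, offset, page)   # backend side: lseek() + write()
            os.fsync(fd)                    # writer side: force pages to storage
            return read_block(fd, offset) == page
        finally:
            os.close(fd)
```

Under the assumed block size, each 1GB file holds 131072 blocks, so block 131072 is the first block of the second file.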

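[Figure: Postgres process and storage layout. The postmaster (listener process) fork()s backends (SQL executor processes). Backends call lseek(), read() and write() under their own I/O exclusion control; the writer (sync process) calls lseek(), write() and fsync(). The processes use their own shared buffer pool, allocated with shmget(). On storage + file system, data files are 1GB each and xlog files are 16MB each; writes are labeled "overwrites" and "expands".]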
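The process model in the figure, where forked processes work on one shared buffer pool, can be illustrated roughly as follows. Note the substitution: the slide names shmget(), while this sketch uses an anonymous shared mmap because Python exposes it directly. The pool size and the function name `shared_pool_demo` are hypothetical.

```python
import mmap
import os

POOL_SIZE = 4096  # hypothetical pool size for the demo


def shared_pool_demo():
    """Show that memory mapped before fork() is shared by parent and child."""
    # Anonymous mapping, MAP_SHARED by default on Unix; it stands in for
    # the shmget() segment named on the slide.
    pool = mmap.mmap(-1, POOL_SIZE)
    pid = os.fork()
    if pid == 0:
        # Child ("backend"): modify a page in the shared pool, then exit
        # immediately without running any parent cleanup code.
        try:
            pool[0:5] = b"dirty"
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    # Parent: the child's modification is visible in the same pool.
    data = bytes(pool[0:5])
    pool.close()
    return data
```

Because every process sees the same pages, access to them has to be coordinated, which corresponds to the "own I/O exclusion control" label in the figure.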

Page 5: Wish list from PostgreSQL - Linux Kernel Summit 2009


Low priority I/O for background tasks

PostgreSQL runs some background tasks:

VACUUM: cleans up DELETE'd rows and reclaims the space.

CHECKPOINT: flushes all modified pages to disk.

Current behavior in Postgres

Sleep after every constant amount of I/O.

Consume constant I/O bandwidth regardless of workload.

Ideal behavior

Background tasks can use all of the surplus I/O bandwidth as long as they do not affect service.

Requirements

Low priority I/O should apply to buffered writes and fsync.

Normal I/O should not wait for low priority I/O, so fsync should not block lseek, read, or write (both overwrites and extends).
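The "current behavior" above, sleeping after every constant amount of I/O, might look roughly like the following sketch. The function and parameter names (`throttled_io`, `cost_limit`, `delay`) are hypothetical and chosen for illustration; they are not taken from the slides.

```python
import time


def throttled_io(costs, cost_limit, delay, do_io, sleep=time.sleep):
    """Run I/O operations, sleeping `delay` seconds each time the
    accumulated cost reaches `cost_limit`. Returns the number of sleeps."""
    balance = 0
    sleeps = 0
    for cost in costs:
        do_io(cost)            # perform one unit of background I/O
        balance += cost
        if balance >= cost_limit:
            sleep(delay)       # pause regardless of how busy the system is
            sleeps += 1
            balance = 0
    return sleeps
```

The number of pauses depends only on the volume of I/O issued, never on the load from foreground queries. That is why this scheme consumes constant bandwidth, and why the slide asks the kernel for low priority I/O instead.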

Is the operation blocked by fsync()?

            on-cache page    off-cache page
read()      not blocked      blocked
write()     blocked          blocked
lseek()     blocked
pread()     not blocked      sometimes
pwrite()    sometimes        sometimes
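The table lists lseek() plus read()/write() separately from pread()/pwrite(). The difference between the two call patterns can be sketched as below; the sketch shows only the API shapes and the effect on the file offset, not the blocking behavior measured on the slide. The function name `offsets_after_reads` is hypothetical.

```python
import os
import tempfile


def offsets_after_reads():
    """Compare lseek()+read() with pread()/pwrite() on a scratch file."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.write(fd, b"0123456789")

        # Two system calls; read() advances the file offset.
        os.lseek(fd, 2, os.SEEK_SET)
        a = os.read(fd, 4)
        after_read = os.lseek(fd, 0, os.SEEK_CUR)

        # One system call with an explicit offset; the file offset is untouched.
        b = os.pread(fd, 4, 2)
        after_pread = os.lseek(fd, 0, os.SEEK_CUR)

        # pwrite() likewise writes at an explicit offset.
        os.pwrite(fd, b"AB", 0)
        c = os.pread(fd, 2, 0)
        return a, after_read, b, after_pread, c
```

With pread() and pwrite() there is no separate lseek() call to be blocked, which is one reason the two patterns may behave differently under a concurrent fsync().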

Page 6: Wish list from PostgreSQL - Linux Kernel Summit 2009


Avoid duplicated caching in DB and kernel buffers

Both Postgres and the kernel might cache file data because Postgres uses buffered I/O.

The same blocks might be cached in both DB and kernel buffers.

Approaches to eliminate duplicated caching

Direct I/O
Pros: Can eliminate the kernel cache
Cons: Need to add an I/O manager to Postgres

mmap
Pros: Can eliminate the DB cache
Cons: Hard to implement “Write-Ahead Logging” because mapped blocks could be flushed out at arbitrary times.

mmap is the better choice because it avoids reinventing an I/O manager in Postgres.
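A minimal sketch of the mmap approach follows, assuming a scratch file named `datafile` and a hypothetical helper `mmap_update`. It shows a page being modified directly in the kernel's cache through the mapping and then written back explicitly with `flush()`, which calls msync().

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def mmap_update():
    """Modify a file through a shared mapping and read the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "datafile")   # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"\0" * PAGE)              # one zero-filled page
        with open(path, "r+b") as f:
            m = mmap.mmap(f.fileno(), PAGE)    # MAP_SHARED by default
            m[0:5] = b"tuple"   # dirties the page; no private DB copy exists
            m.flush()           # msync(): explicit write-back
            m.close()
        with open(path, "rb") as f:
            return f.read(5)
```

Between the assignment and `flush()`, the kernel is free to write the dirty page back on its own schedule. That uncontrolled window is the difficulty with “Write-Ahead Logging” noted above.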

Requirements

Have a control flag that prevents modified blocks from being flushed out. The flag is released when the WAL buffers are written to storage.

mlock() is not enough because it cannot prevent flushing. madvise( MADV_{ DOFLUSH | DONTFLUSH } ) ?

[Figure: storage, kernel buffers, and DB buffers; blocks are duplicated between the kernel buffers and the DB buffers.]
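The requested control flag can be modeled in plain Python to show the ordering it would enforce. This is a toy model with no kernel calls: MADV_DONTFLUSH and MADV_DOFLUSH are the slide's proposal, not existing madvise() flags, and the class name, method names, and the use of a log sequence number (LSN) per page are assumptions made for this sketch.

```python
class MappedPool:
    """Toy model: a modified page may be flushed only after the WAL that
    covers its latest change (identified by an LSN) is on storage."""

    def __init__(self):
        self.wal_flushed_lsn = 0   # highest log position known to be durable
        self.dirty = {}            # page number -> LSN of its latest change

    def modify(self, page_no, lsn):
        # Corresponds to setting the proposed "DONTFLUSH" hold on the page.
        self.dirty[page_no] = lsn

    def wal_written(self, lsn):
        # WAL buffers up to `lsn` have been written to storage.
        self.wal_flushed_lsn = max(self.wal_flushed_lsn, lsn)

    def flushable(self):
        """Pages whose hold could now be released ("DOFLUSH")."""
        return sorted(p for p, lsn in self.dirty.items()
                      if lsn <= self.wal_flushed_lsn)

    def flush(self):
        """Flush every page that is allowed to go out; return their numbers."""
        done = self.flushable()
        for p in done:
            del self.dirty[p]
        return done
```

For example, after `modify(1, 100)` and `modify(2, 200)`, a call to `wal_written(150)` makes only page 1 eligible for flushing; page 2 stays held until the log reaches position 200.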