postgresql on nfs - j.battiato, pgday2016

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

PostgreSQL on Network File System?




© Lucasfilm™

PostgreSQL on Network File System?




What people say about NFS




What people say about NFS

IT’S FAST!

CC0 1.0




PostgreSQL & NFS

Myths & Truths

Jonathan Battiato

www.2ndquadrant.com




Jonathan Battiato

Linux SysAdmin & DBA

Jonathan Battiato

Linux SysAdmin & DBA

jonny_lee_84

[email protected]

jonny-lee31

~$ whoami




Thanks to:

Giuseppe Broccolo, PhD

PostgreSQL & PostGIS consultant

@giubro gbroccolo7

[email protected]

gbroccolo gemini__81




#PGNFS#PostgreSQL#NFS

TAGs

@jonny_lee_84@2ndquadrant_it@PGDayIT




Network File Systemgeneral characteristics

Part 1




Network File System (Sun Microsystem®, 1984)

● A protocol for distributed file system – v2, v3, v4

● servers export the file system, clients mounts the export locally

● Many clients, one server

● High performances through fast network

LAN




NFS v3 vs. NFS v2● Strenght

– TCP (UDP is still a choice)

– Asynchronous communication (caching)

– A daemon manages file locking and connections

– Larger block size than 8KB (up to 64KB - depends on Kernel)● Weakness

– Stateless (NFS)

– Plain text data transmissions

– Host authentication only

– Dynamic port assignment




NFS v4 vs. NFS v3● Partial solutions

– Single port (2049)easy to apply firewall rules to filter NFS traffic

– Authentication and locking deamons included in the protocol

– operations are always stateful

– UDP not allowed anymore

NFS v4 is the default on RHEL 6




Client-Server Optimization

● Block size (wsize, rsize – nfs protocol level)Requests are organised in data blocks exchanged between client & server

● NFS relies on TCP or UDP protocols (transport level)UDP is statless, TCP is statefulTCP choose automatically the packet size

● Packet size & MTU (network level)Packets size can be optimised to transfer blocks within the network: jumbo frames

CHOOSE WISELY!




Main concerns about synchronisation

● NO Time synchronisationUse NTP!

● File Lockingv2 does not support file lockingv3 uses daemon for locking, performances drop when in use

● Delayed write cacheCaching writes could cause data lossWithout write cache performances drop

● Read (metadata) cacheNFS server may show not-updated attributes, any program that relies on file attributes may not work




Part 2

NFS & PostgreSQL




NFS & PostgreSQL

www.postgresql.org/docs/9.6/static/creating-cluster.html#CREATING-CLUSTER-NFS

● NFS disks are completely transparent for PostgreSQL

● If NFS client/server implementation does not provide standard file system semantics, this can cause reliability problems

● PostgreSQL advices

– NFS mount options

● avoid soft-mounting: hard

– General mount options

● avoid asynchronous writes: fsync

LAN




hard-mount optionNFS calls must be retried indefinitely

both data and WAL entries sequentiality may not be preserved

write cacheWALs have to be flushed as a database action is committed

many processes are involved during WALs flush on disk

→ several NFS clients

→ WAL entries sequentiality may not be preserved

attribute cachemany processes are involved during WALs flush on disk

→ several NFS clients

→ files attributes may not have consistent views

Sources of corruptions




Synchronous writessync vs. fsync=on:

– NFS v2:● if sync is specified, the server will not complete a request until both data/metadata are

written to the disk

– NFS v3 / v4:● if sync is specified, the server will complete a request returning the status of the write:

– NFS_FILE_SYNC, NFS_DATA_SYNC, NFS_UNSTABLE● data is effectively forced to be flushed on disk once a sync method system call is issued

– be careful to set fsync=on




Reliability vs. Performance

● NFS exports must be mounted with safe options

● Are these option fine for a database?

– is the performance deeply impacted?

– is data safely guaranteed in case of crash?




Part 3

NFS & PostgreSQLperformance

&reliability benchmarks




The customer

NetAppFAS 8080 Full Flash

Clustered Mode

physical host

storage

2 CPUs x 8GB RAMRHEL 6

Kernel v. 2.6.32PostgreSQL 9.5.3

Ethernet10Gb/s

No routers




Reliability tests● try different crash scenarios under high concurrency load, check if

the instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

mailto:root@pgsql




● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

Reliability tests

mailto:root@pgsql




T. Vondra – test the persistence of recycled WALs in ext4– update attributes of recycled WALs flush them on disk through → fdatasync

– fdatasync does not force the flush of metadata the update may get lost after a crash→

– logged changes are contained in file “in the future” data loss!→

github.com/2ndQuadrant/ext4-data-loss

● INSERT/UPDATE new records in parallel and synchronously on the db and on a file● simulate a crash compare db & file contents after the crash recovery→

Reliability tests




Check NFS behaviour for recycled WALs (file attributes caching)

– noac

●

– ac

●

Reliability tests





– noac

● several tests, no data loss!– ac

●

Reliability tests





– noac

● several tests, no data loss!– ac

● several tests, no data loss!

Reliability tests




All tests passed, but this does not ensure that it is totally safe!

Reliability tests




Enhance reliability for PostgreSQL DBs ● allow page checksums (9.3+): initdb –data-checksums ...

– execute CRC32 calculation for each 8KB data block

● a checksum failure means that the data block is corrupted

– force wal_log_hint=true

● write the entire 8KB page to the WAL, even for hint bits modification

– take into account the impact to the performance:

● R: checksum extra-calculation every 8KB● W: increase the amount of information logged into WALs




Enhance performance for PostgreSQL DBs ● sync’ed writes are slow, even if NFS caching is enabled

● if possible, consider asynchronous commit:– let the server return success as soon as the transaction is logically completed

● synchronous_commit=off – it can be set per user/session● WAL entries will be flushed in a second moment,

but not later than 3x wal_writer_delay● in case of crashes, the DB can be recovered in a consistent state,

but there could be a window of data loss




file system I/O speed:

– reads: 700MB/s

– writes: 700MB/s

synchronous_commit=on:

– INSERT:

● noac ~ ac (single & multi client)

– SELECT:

● noac ~ 16x ac

DB commit rate and SELECT rate:

synchronous_commit=off:

– INSERT:

● noac ~ 8x ac (single & multi client)

– SELECT:

● noac ~ 16x ac




Conclusions● NFS is not natively thought for reliability purposes

– the protocol is thought to enhance the performance

– NFS v4 is preferable

● PostgreSQL allows to adopt many countermeasures – it is at least able to promptly detect data corruptions

● PostgreSQL can be used with NFS– ready to accept minimal data loss

11/06/2010 ITPUG - ConfSL 2010 37

Jonathan Battiato 2ndQuadrant Italia@2ndquadrant_it @jonny_lee_84


Creative Commons license

This work is licensed under a Creative Commons

Attribution-ShareAlike 4.0 International License

https://creativecommons.org/licenses/by-nc-sa/4.0/

© 2016 2ndQuadrant Italia – http://www.2ndquadrant.it

postgresql on nfs - j.battiato, pgday2016

Software