postgresql on nfs - j.battiato, pgday2016
TRANSCRIPT
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
PostgreSQL on Network File System?
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
© Lucasfilm™
PostgreSQL on Network File System?
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
What people say about NFS
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
What people say about NFS
IT’S FAST!
CC0 1.0
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
PostgreSQL & NFS
Myths & Truths
Jonathan Battiato
www.2ndquadrant.com
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Jonathan Battiato
Linux SysAdmin & DBA
Jonathan Battiato
Linux SysAdmin & DBA
jonny_lee_84
jonny-lee31
~$ whoami
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Thanks to:
Giuseppe Broccolo, PhD
PostgreSQL & PostGIS consultant
@giubro gbroccolo7
gbroccolo gemini__81
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
#PGNFS#PostgreSQL#NFS
TAGs
@jonny_lee_84@2ndquadrant_it@PGDayIT
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Network File Systemgeneral characteristics
Part 1
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Network File System (Sun Microsystem®, 1984)
● A protocol for distributed file system – v2, v3, v4
● servers export the file system, clients mounts the export locally
● Many clients, one server
● High performances through fast network
LAN
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
NFS v3 vs. NFS v2● Strenght
– TCP (UDP is still a choice)
– Asynchronous communication (caching)
– A daemon manages file locking and connections
– Larger block size than 8KB (up to 64KB - depends on Kernel)● Weakness
– Stateless (NFS)
– Plain text data transmissions
– Host authentication only
– Dynamic port assignment
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
NFS v4 vs. NFS v3● Partial solutions
– Single port (2049)easy to apply firewall rules to filter NFS traffic
– Authentication and locking deamons included in the protocol
– operations are always stateful
– UDP not allowed anymore
NFS v4 is the default on RHEL 6
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Client-Server Optimization
● Block size (wsize, rsize – nfs protocol level)Requests are organised in data blocks exchanged between client & server
● NFS relies on TCP or UDP protocols (transport level)UDP is statless, TCP is statefulTCP choose automatically the packet size
● Packet size & MTU (network level)Packets size can be optimised to transfer blocks within the network: jumbo frames
CHOOSE WISELY!
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Main concerns about synchronisation
● NO Time synchronisationUse NTP!
● File Lockingv2 does not support file lockingv3 uses daemon for locking, performances drop when in use
● Delayed write cacheCaching writes could cause data lossWithout write cache performances drop
● Read (metadata) cacheNFS server may show not-updated attributes, any program that relies on file attributes may not work
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Part 2
NFS & PostgreSQL
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
NFS & PostgreSQL
www.postgresql.org/docs/9.6/static/creating-cluster.html#CREATING-CLUSTER-NFS
● NFS disks are completely transparent for PostgreSQL
● If NFS client/server implementation does not provide standard file system semantics, this can cause reliability problems
● PostgreSQL advices
– NFS mount options
● avoid soft-mounting: hard
– General mount options
● avoid asynchronous writes: fsync
LAN
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
hard-mount optionNFS calls must be retried indefinitely
both data and WAL entries sequentiality may not be preserved
write cacheWALs have to be flushed as a database action is committed
many processes are involved during WALs flush on disk
→ several NFS clients
→ WAL entries sequentiality may not be preserved
attribute cachemany processes are involved during WALs flush on disk
→ several NFS clients
→ files attributes may not have consistent views
Sources of corruptions
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Synchronous writessync vs. fsync=on:
– NFS v2:● if sync is specified, the server will not complete a request until both data/metadata are
written to the disk
– NFS v3 / v4:● if sync is specified, the server will complete a request returning the status of the write:
– NFS_FILE_SYNC, NFS_DATA_SYNC, NFS_UNSTABLE● data is effectively forced to be flushed on disk once a sync method system call is issued
– be careful to set fsync=on
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Reliability vs. Performance
● NFS exports must be mounted with safe options
● Are these option fine for a database?
– is the performance deeply impacted?
– is data safely guaranteed in case of crash?
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Part 3
NFS & PostgreSQLperformance
&reliability benchmarks
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
The customer
NetAppFAS 8080 Full Flash
Clustered Mode
physical host
storage
2 CPUs x 8GB RAMRHEL 6
Kernel v. 2.6.32PostgreSQL 9.5.3
Ethernet10Gb/s
No routers
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Reliability tests● try different crash scenarios under high concurrency load, check if
the instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
T. Vondra – test the persistence of recycled WALs in ext4– update attributes of recycled WALs flush them on disk through → fdatasync
– fdatasync does not force the flush of metadata the update may get lost after a crash→
– logged changes are contained in file “in the future” data loss!→
github.com/2ndQuadrant/ext4-data-loss
● INSERT/UPDATE new records in parallel and synchronously on the db and on a file● simulate a crash compare db & file contents after the crash recovery→
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
●
– ac
●
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
● several tests, no data loss!– ac
●
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
● several tests, no data loss!– ac
● several tests, no data loss!
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
● several tests, no data loss!– ac
● several tests, no data loss!
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
All tests passed, but this does not ensure that it is totally safe!
Reliability tests
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Enhance reliability for PostgreSQL DBs ● allow page checksums (9.3+): initdb –data-checksums ...
– execute CRC32 calculation for each 8KB data block
● a checksum failure means that the data block is corrupted
– force wal_log_hint=true
● write the entire 8KB page to the WAL, even for hint bits modification
– take into account the impact to the performance:
● R: checksum extra-calculation every 8KB● W: increase the amount of information logged into WALs
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Enhance performance for PostgreSQL DBs ● sync’ed writes are slow, even if NFS caching is enabled
● if possible, consider asynchronous commit:– let the server return success as soon as the transaction is logically completed
● synchronous_commit=off – it can be set per user/session● WAL entries will be flushed in a second moment,
but not later than 3x wal_writer_delay● in case of crashes, the DB can be recovered in a consistent state,
but there could be a window of data loss
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
file system I/O speed:
– reads: 700MB/s
– writes: 700MB/s
synchronous_commit=on:
– INSERT:
● noac ~ ac (single & multi client)
– SELECT:
● noac ~ 16x ac
DB commit rate and SELECT rate:
synchronous_commit=off:
– INSERT:
● noac ~ 8x ac (single & multi client)
– SELECT:
● noac ~ 16x ac
Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Conclusions● NFS is not natively thought for reliability purposes
– the protocol is thought to enhance the performance
– NFS v4 is preferable
● PostgreSQL allows to adopt many countermeasures – it is at least able to promptly detect data corruptions
● PostgreSQL can be used with NFS– ready to accept minimal data loss
11/06/2010 ITPUG - ConfSL 2010 37
Jonathan Battiato 2ndQuadrant Italia@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th edition Prato Dec 13th 2016
Creative Commons license
This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License
https://creativecommons.org/licenses/by-nc-sa/4.0/
© 2016 2ndQuadrant Italia – http://www.2ndquadrant.it