postgresql on nfs - j.battiato, pgday2016

37
Jonathan Battiato 2ndQuadrant Italia @2ndquadrant_it @jonny_lee_84 PGDay.IT 2016 - 10 th edition Prato Dec 13 th 2016 PostgreSQL on Network File System?

Upload: jonathan-battiato

Post on 24-Jan-2017

43 views

Category:

Software


8 download

TRANSCRIPT

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

PostgreSQL on Network File System?

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

© Lucasfilm™

PostgreSQL on Network File System?

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

What people say about NFS

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

What people say about NFS

IT’S FAST!

CC0 1.0

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

PostgreSQL & NFS

Myths & Truths

Jonathan Battiato

www.2ndquadrant.com

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Jonathan Battiato

Linux SysAdmin & DBA

Jonathan Battiato

Linux SysAdmin & DBA

jonny_lee_84

[email protected]

jonny-lee31

~$ whoami

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Thanks to:

Giuseppe Broccolo, PhD

PostgreSQL & PostGIS consultant

@giubro gbroccolo7

[email protected]

gbroccolo gemini__81

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

#PGNFS#PostgreSQL#NFS

TAGs

@jonny_lee_84@2ndquadrant_it@PGDayIT

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Network File Systemgeneral characteristics

Part 1

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Network File System (Sun Microsystem®, 1984)

● A protocol for distributed file system – v2, v3, v4

● servers export the file system, clients mounts the export locally

● Many clients, one server

● High performances through fast network

LAN

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

NFS v3 vs. NFS v2● Strenght

– TCP (UDP is still a choice)

– Asynchronous communication (caching)

– A daemon manages file locking and connections

– Larger block size than 8KB (up to 64KB - depends on Kernel)● Weakness

– Stateless (NFS)

– Plain text data transmissions

– Host authentication only

– Dynamic port assignment

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

NFS v4 vs. NFS v3● Partial solutions

– Single port (2049)easy to apply firewall rules to filter NFS traffic

– Authentication and locking deamons included in the protocol

– operations are always stateful

– UDP not allowed anymore

NFS v4 is the default on RHEL 6

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Client-Server Optimization

● Block size (wsize, rsize – nfs protocol level)Requests are organised in data blocks exchanged between client & server

● NFS relies on TCP or UDP protocols (transport level)UDP is statless, TCP is statefulTCP choose automatically the packet size

● Packet size & MTU (network level)Packets size can be optimised to transfer blocks within the network: jumbo frames

CHOOSE WISELY!

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Main concerns about synchronisation

● NO Time synchronisationUse NTP!

● File Lockingv2 does not support file lockingv3 uses daemon for locking, performances drop when in use

● Delayed write cacheCaching writes could cause data lossWithout write cache performances drop

● Read (metadata) cacheNFS server may show not-updated attributes, any program that relies on file attributes may not work

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Part 2

NFS & PostgreSQL

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

NFS & PostgreSQL

www.postgresql.org/docs/9.6/static/creating-cluster.html#CREATING-CLUSTER-NFS

● NFS disks are completely transparent for PostgreSQL

● If NFS client/server implementation does not provide standard file system semantics, this can cause reliability problems

● PostgreSQL advices

– NFS mount options

● avoid soft-mounting: hard

– General mount options

● avoid asynchronous writes: fsync

LAN

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

hard-mount optionNFS calls must be retried indefinitely

both data and WAL entries sequentiality may not be preserved

write cacheWALs have to be flushed as a database action is committed

many processes are involved during WALs flush on disk

→ several NFS clients

→ WAL entries sequentiality may not be preserved

attribute cachemany processes are involved during WALs flush on disk

→ several NFS clients

→ files attributes may not have consistent views

Sources of corruptions

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Synchronous writessync vs. fsync=on:

– NFS v2:● if sync is specified, the server will not complete a request until both data/metadata are

written to the disk

– NFS v3 / v4:● if sync is specified, the server will complete a request returning the status of the write:

– NFS_FILE_SYNC, NFS_DATA_SYNC, NFS_UNSTABLE● data is effectively forced to be flushed on disk once a sync method system call is issued

– be careful to set fsync=on

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Reliability vs. Performance

● NFS exports must be mounted with safe options

● Are these option fine for a database?

– is the performance deeply impacted?

– is data safely guaranteed in case of crash?

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Part 3

NFS & PostgreSQLperformance

&reliability benchmarks

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

The customer

NetAppFAS 8080 Full Flash

Clustered Mode

physical host

storage

2 CPUs x 8GB RAMRHEL 6

Kernel v. 2.6.32PostgreSQL 9.5.3

Ethernet10Gb/s

No routers

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Reliability tests● try different crash scenarios under high concurrency load, check if

the instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

● try different crash scenarios under high concurrency load, check ifthe instance recovers properly, then execute a VACUUM FULL:

– kill -9 to postmaster

– forced reboot through kernel SysReq: ● echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

– power off of the VM from the VMWare vSphere® remote panel

– kill the TCP/IP connections through tcpkill

[root@pgsql] ~# tcpkill host X.X.X.X

tcpkill: listening on eth0 [host X.X.X.X]

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

T. Vondra – test the persistence of recycled WALs in ext4– update attributes of recycled WALs flush them on disk through → fdatasync

– fdatasync does not force the flush of metadata the update may get lost after a crash→

– logged changes are contained in file “in the future” data loss!→

github.com/2ndQuadrant/ext4-data-loss

● INSERT/UPDATE new records in parallel and synchronously on the db and on a file● simulate a crash compare db & file contents after the crash recovery→

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Check NFS behaviour for recycled WALs (file attributes caching)

– noac

– ac

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Check NFS behaviour for recycled WALs (file attributes caching)

– noac

● several tests, no data loss!– ac

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Check NFS behaviour for recycled WALs (file attributes caching)

– noac

● several tests, no data loss!– ac

● several tests, no data loss!

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Check NFS behaviour for recycled WALs (file attributes caching)

– noac

● several tests, no data loss!– ac

● several tests, no data loss!

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

All tests passed, but this does not ensure that it is totally safe!

Reliability tests

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Enhance reliability for PostgreSQL DBs ● allow page checksums (9.3+): initdb –data-checksums ...

– execute CRC32 calculation for each 8KB data block

● a checksum failure means that the data block is corrupted

– force wal_log_hint=true

● write the entire 8KB page to the WAL, even for hint bits modification

– take into account the impact to the performance:

● R: checksum extra-calculation every 8KB● W: increase the amount of information logged into WALs

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Enhance performance for PostgreSQL DBs ● sync’ed writes are slow, even if NFS caching is enabled

● if possible, consider asynchronous commit:– let the server return success as soon as the transaction is logically completed

● synchronous_commit=off – it can be set per user/session● WAL entries will be flushed in a second moment,

but not later than 3x wal_writer_delay● in case of crashes, the DB can be recovered in a consistent state,

but there could be a window of data loss

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

file system I/O speed:

– reads: 700MB/s

– writes: 700MB/s

synchronous_commit=on:

– INSERT:

● noac ~ ac (single & multi client)

– SELECT:

● noac ~ 16x ac

DB commit rate and SELECT rate:

synchronous_commit=off:

– INSERT:

● noac ~ 8x ac (single & multi client)

– SELECT:

● noac ~ 16x ac

Jonathan Battiato 2ndQuadrant Italia

@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Conclusions● NFS is not natively thought for reliability purposes

– the protocol is thought to enhance the performance

– NFS v4 is preferable

● PostgreSQL allows to adopt many countermeasures – it is at least able to promptly detect data corruptions

● PostgreSQL can be used with NFS– ready to accept minimal data loss

11/06/2010 ITPUG - ConfSL 2010 37

Jonathan Battiato 2ndQuadrant Italia@2ndquadrant_it @jonny_lee_84

PGDay.IT 2016 - 10th edition Prato Dec 13th 2016

Creative Commons license

This work is licensed under a Creative Commons

Attribution-ShareAlike 4.0 International License

https://creativecommons.org/licenses/by-nc-sa/4.0/

© 2016 2ndQuadrant Italia – http://www.2ndquadrant.it