Streaming Huge Databases using Logical Decoding
Adventures of a naive programmer
Oleksandr “Alex” Shulgin, Zalando SE
Overview
● Introduction
● Problem Statement
● Approach
● Problems
● Some Numbers
● Conclusion
Introduction
What is Logical Decoding all about?
● A new feature of PostgreSQL since version 9.4.
● Allows streaming database changes in a custom format.
● Requires an Output Plugin to be written (yes, in C).
● Consistent snapshot before all the changes?
Logical Decoding
CREATE_REPLICATION_SLOT "slot1" LOGICAL <plugin_name>;
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT 'XXXXXXXX-N';
SELECT pg_export_snapshot();
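In application code, the snapshot hand-off above looks roughly like this (a sketch assuming a psycopg2-style connection object; the snapshot id is the one exported when the replication slot was created, or by pg_export_snapshot() on another backend):

```python
def attach_to_snapshot(conn, snapshot_id):
    """Make a worker connection see the same consistent snapshot.

    The exporting backend must still hold its transaction open,
    otherwise SET TRANSACTION SNAPSHOT fails.
    """
    cur = conn.cursor()
    cur.execute("BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    # psycopg2 interpolates the parameter client-side, producing
    # SET TRANSACTION SNAPSHOT 'XXXXXXXX-N'
    cur.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))
    return cur
```

Every worker that attaches this way then reads exactly the state of the database as of slot creation.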
Logical Decoding Output
{
"action": "I", /* INSERT */
"relation": [
"myschema", /* INTO myschema.mytable(id) */
"mytable"
],
"newtuple": {
"id": 1 /* VALUES(1) */
}
}
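A consumer can unpack such a message with a few lines of Python (a sketch; the field names follow the example above, and real messages do not contain the /* */ comments):

```python
import json

def parse_change(raw):
    """Decode one change message from a JSON output plugin.

    Returns (action, "schema.table", newtuple_or_None).
    """
    msg = json.loads(raw)
    schema, table = msg["relation"]
    return msg["action"], "%s.%s" % (schema, table), msg.get("newtuple")
```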
Problem Statement
?
Approach
?
https://github.com/zalando/saiki-tawon
Command Line and Dashboard
Long Live the Snapshot
{
  "w-8ee20b3": {
    "snapshot_id": "51E426C2-1",
    "ts_start": "2016-01-14 11:02:40 UTC",
    "heartbeat": "2016-01-15 07:10:27 UTC",
    "pid": 1,
    "backend_pid": 58793
  },
  "w-fbfb655": {
    "snapshot_id": "51E426C4-1",
    "ts_start": "2016-01-14 11:02:41 UTC",
    "heartbeat": "2016-01-15 07:10:28 UTC",
    "pid": 1,
    "backend_pid": 58794
  },
...
The Source System
● Ubuntu precise (12.04), kernel 3.2.0-xx-generic
● CPU: Xeon @ 2.50 GHz, 24 cores
● RAM: 125.88 GB
● 6x HDDs (spinning drives) in RAID 1+0, 5 TB total capacity
● Data size: 3.0 TB / 17 billion rows (+ 1.8 TB of indexes)
● PostgreSQL 9.6devel
The Target System
“Things are working amazingly fast when you write to /dev/null.”
– proverbial wisdom
Problems?
Problems!
● OpenVPN quickly becomes the bottleneck on the laptop
Obvious solution: deploy workers closer to the database.
Docker + Mesosphere DCOS
https://zalando-techmonkeys.github.io/
Problems
● Workers quickly run out of memory
The (problematic) code:
cursor.execute("SELECT * FROM mytable")
● Invokes PQexec().
● Async. connection doesn’t help.
● psycopg2 is not designed to stream results.
Problems
● Invoke COPY protocol!
Corrected code:
cursor.copy_expert(
"COPY (SELECT * FROM mytable) TO STDOUT",
...)
● Tried 32 MB, then 64 MB per worker: it was not enough...
● One of the values was around 80 MB(!). Not much we can do.
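Put together, copy_expert() writes the COPY stream into any file-like object, so client memory stays bounded regardless of table size. A sketch with an illustrative sink (the RowCounter class and stream_table helper are not from the talk):

```python
class RowCounter:
    """Minimal file-like sink for copy_expert(): consumes each chunk
    as it arrives instead of buffering the whole result set."""
    def __init__(self):
        self.rows = 0
        self.bytes = 0

    def write(self, chunk):
        self.bytes += len(chunk)
        # COPY text format emits one newline per row.
        self.rows += chunk.count(b"\n" if isinstance(chunk, bytes) else "\n")
        return len(chunk)

def stream_table(cursor, qualified_name, sink):
    # COPY ... TO STDOUT streams rows to the client as they are read,
    # without materializing the result in the backend.
    cursor.copy_expert(
        "COPY (SELECT * FROM ONLY %s) TO STDOUT" % qualified_name, sink)
```

Any object with a write() method works as the sink, e.g. one that forwards chunks to the downstream queue.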
More Problems?
● More problems with this code
The correct(?) code:
cursor.copy_expert(
"COPY (SELECT * FROM mytable) TO STDOUT",
...)
● Should have been: SELECT * FROM ONLY myschema.mytable (schema-qualify the table and use ONLY to exclude inheritance children)
NoSQL?
● How about some JSON for comparison?
SELECT row_to_json(x.*) FROM mytable AS x
● Slows down the export 2-3 times.
● Not 100% equivalent to what output plugin emits.
● Have to write a C function for every plugin.
What if we wrote a generic function...
CREATE FUNCTION pg_logical_slot_stream_relation(
IN slot_name NAME,
IN relnamespace NAME,
IN relname NAME,
IN nochildren BOOL DEFAULT FALSE,
VARIADIC options TEXT[] DEFAULT '{}',
OUT data TEXT
)
RETURNS SETOF TEXT ...
The Final Code
cursor.copy_expert(
"COPY (SELECT pg_logical_slot_stream_relation(...)) TO STDOUT",
...)
● Do not use SELECT … FROM pg_logical_slot_…: it materializes the whole result in the backend.
● Requires changes to core PostgreSQL.
● Ideally should not require a slot, only a snapshot.
● Slots cannot be used concurrently (yet).
At Last: Some Numbers
6 client processes, SSL (no compression), 1 Gbit/s network interface
Query Run Time Volume Notes
SELECT * FROM … 7.5 h 2.7 TB 105 MB/s
pglogical / JSON 17.5 h 6.7 TB 112 MB/s
pglogical / native 30+ h (incomplete) 11+ TB 106 MB/s
Bottled Water / Avro 13.5 h 5.0 TB 108 MB/s
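The throughput figures in the Notes column follow from volume over run time, assuming binary units (1 TB = 1024² MB):

```python
def throughput_mb_s(volume_tb, hours):
    """Average export throughput in MB/s, with 1 TB = 1024 * 1024 MB."""
    return volume_tb * 1024 * 1024 / (hours * 3600)
```

For example, 2.7 TB in 7.5 h comes out to about 105 MB/s, matching the first row.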
Space for Improvement
In the native protocol format, pglogical_output emits metadata with each tuple.
● Metadata overhead: 5.0 TB (167%)
○ nspname + relname + attnames
● Protocol overhead: 1.5 TB (50%)
○ message type + lengths
Set plugin option relmeta_cache_size to -1.
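These percentages appear to be relative to the 3.0 TB of raw table data (an assumption consistent with the numbers above):

```python
def overhead_pct(overhead_tb, payload_tb=3.0):
    """Overhead volume as a percentage of the raw data volume."""
    return overhead_tb / payload_tb * 100
```

5.0 TB of metadata on 3.0 TB of data gives the quoted 167%, and 1.5 TB of protocol framing gives 50%.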
● Network is apparently the bottleneck.
● What if we enable SSL compression?..
A Common Number: ~110 MB/s
sslcompression=1?
● Hard to do nowadays because of the CRIME vulnerability.
● Older distro versions: set env. OPENSSL_DEFAULT_ZLIB
● Newer distro versions: OpenSSL is compiled without zlib. Good luck!
● TLSv1.3 will remove support for compression.
● HINT: your streaming replication is likely to be running uncompressed.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-4929
A Much Better Picture
Compression FTW!
24 client processes, SSL (with compression)
Query Run Time Volume Notes
SELECT * FROM … 3h (vs. 7.5 h) 2.7 TB
pglogical / JSON 7-8* h (vs. 17.5 h) 6.7 TB *ordering
pglogical / native 8-9* h (vs. 30+ h) 7.2 TB (vs. 11+ TB)
Bottled Water / Avro 10 h (vs. 13.5 h) 5.0 TB
In Conclusion
● Set relmeta_cache_size with pglogical_output native.
● Run a benchmark to see if you need compression.
● Order tables from largest to smallest.
● Do start consuming the replication slot once the export is finished (an unread slot keeps WAL from being recycled on the server).
● Help needed: review the proposed streaming interface!
References
● https://github.com/zalando/saiki-tawon
● https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
● https://github.com/confluentinc/bottledwater-pg/
● https://zalando-techmonkeys.github.io
● https://github.com/zalando/pg_view
● http://www.slideshare.net/AlexanderShulgin3/adding-replication-protocol-support-for-psycopg2
● http://www.postgresql.org/message-id/flat/CACACo5RNZ0OB8K...
Thank you!
Questions?