Automating PostgreSQL failover for high availability
David Pacheco (@dapsays), Joyent
This talk
• Using PostgreSQL to build Joyent’s Manta service
• Operational experiences
• Incidents
• Tooling
• Wish list
Background
• Most of this work was done in 2012 - 2013, with a major rework in late 2014
• None of us is primarily a database engineer or database administrator (at least not when this work was done)
• We learned a lot since then!
The Problem
• Building Joyent’s Manta storage service
• HTTP-based, multi-tenant object store
• PUT/GET/DELETE of arbitrary byte streams
• ... with in situ, Unix-like compute as a first-class operation
Manta: storage architecture
• Front door handles API requests to PUT/GET/DELETE objects
• Storage tier stores user data on ZFS.
• Metadata tier maps user-facing paths to backend storage objects
Metadata tier
• Durability is top priority.
• Strongly consistent (CP): because you can build AP atop CP, but not the reverse
• Workload:
• heavy read/write (small records)
• 24/7/365 duty cycle
Modern cloud system
• Fault tolerant (replication)
• Horizontally scalable (sharding)
PostgreSQL for metadata
• Going-in position: use PostgreSQL, at least until we know why it won’t work
• Known to scale well vertically
• Good operability (tools, observability)
• Built-in replication
Fault tolerance
• Three-peer cluster:
• Primary peer services all requests
• PostgreSQL synchronous replication to a synchronous peer (“secondary”, or “S”)
• Asynchronous replication to a third peer (“async”, or “A”)
• In three-datacenter config, each peer is in a separate datacenter.
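As a quick illustration of this topology (a sketch, not from the talk): with a psycopg2 client and a hypothetical DSN, pg_stat_replication on the primary shows one row per standby, and sync_state marks which peer is synchronous.

```python
# Sketch: inspect the replication topology from the primary.
# The DSN is a hypothetical placeholder.
import psycopg2

conn = psycopg2.connect("dbname=moray host=primary.example.com")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state, sync_state
          FROM pg_stat_replication
    """)
    for name, state, sync_state in cur.fetchall():
        # Expect one 'sync' row (the secondary) and one 'async' row (the async peer)
        print(name, state, sync_state)
```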
Manatee cluster
[Diagram: Clients connect to the Primary peer; sync replication flows from Primary to Sync, and async replication from Sync to Async 1.]
Sync fails
[Diagram: initial state A(prim) → B(sync) → C(async). B fails, so A promotes C: A(prim) → C(sync). Later, B returns and rejoins as the async: A(prim) → C(sync) → B(async).]
Unexpected constraint
• If P is synchronously replicating to S, it’s not guaranteed that switching P and S will work. And it often doesn’t.
Primary fails
[Diagram: initial state A(prim) → B(sync) → C(async). A fails, so B takes over and promotes C: A(deposed), B(prim) → C(sync). Later, A is rebuilt and rejoins as the async: B(prim) → C(sync) → A(async).]
Fault tolerance
• If P fails, S takes over and promotes A.
• If S fails, P promotes A.
• If A fails, no immediate impact.
• Writes to S or A will fail, so the client needs to know which peer is primary in order to work.
• Moray takes care of this (more later).
Cluster components
[Diagram: Clients send pg queries through Moray; Moray monitors cluster state in ZooKeeper; each Manatee-managed PG peer monitors and updates cluster state in ZooKeeper; sync replication runs from the Primary’s PG to the Sync’s, and async replication on to Async 1.]
Consensus
• Each PostgreSQL peer is run by a babysitter process called Manatee
• Manatee instances are connected to ZooKeeper for consensus.
• Cluster state stored in ZooKeeper.
• Clients read ZooKeeper state (see the sketch below).
• Still non-trivial.
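A minimal sketch of a client reading the cluster state, using the kazoo ZooKeeper client; the znode path and JSON fields are hypothetical placeholders, not Manatee's actual layout.

```python
# Sketch: read the cluster-state znode to find the current primary.
# Path and field names are hypothetical.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")  # hypothetical ensemble
zk.start()
data, stat = zk.get("/manatee/1.moray.example.com/state")
state = json.loads(data.decode("utf-8"))
print("primary:", state["primary"], "generation:", state["generation"])
zk.stop()
```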
Demo
Cluster state
• Identities of primary peer, sync peer, async peers (in order), deposed peers
• Generation number
• WAL position at start of generation
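Illustratively, the record might look something like the following; the field names are assumptions based on the list above, not Manatee's actual schema.

```python
# Illustrative shape of the cluster-state record (field names assumed).
cluster_state = {
    "generation": 4,                          # bumped on every takeover
    "primary": {"id": "B", "pgUrl": "tcp://B.example.com:5432"},
    "sync":    {"id": "C", "pgUrl": "tcp://C.example.com:5432"},
    "async":   [],                            # ordered list of async peers
    "deposed": [{"id": "A"}],                 # former primaries awaiting rebuild
    "initWal": "0/17B7188",                   # WAL position at start of generation
}
```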
Cluster state management
• Distributed algorithm guarantees no unsafe transitions (see the compare-and-set sketch below)
• Tools clearly report current state and history of all transitions
• From-scratch rebuild required when primary becomes deposed.
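One way to rule out unsafe transitions is ZooKeeper's versioned compare-and-set: a takeover only commits if nobody else changed the state since it was read. A sketch under that assumption, again with kazoo and hypothetical paths/fields; this is not Manatee's actual algorithm.

```python
# Sketch: versioned compare-and-set on the state znode.
import json
from kazoo.exceptions import BadVersionError

def try_takeover(zk, path, new_state):
    data, stat = zk.get(path)
    old = json.loads(data.decode("utf-8"))
    new_state["generation"] = old["generation"] + 1   # new generation per takeover
    try:
        # version=stat.version makes the write conditional: it fails if any
        # other peer modified the znode after our read.
        zk.set(path, json.dumps(new_state).encode("utf-8"), version=stat.version)
        return True
    except BadVersionError:
        return False   # lost the race; caller must re-read state and re-decide
```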
Moray: custom client
• Connection pooling
• Decouple application (Manta) from database
• Handles client side of cluster state (keeps track of which peer is primary)
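Moray itself is Node.js; as a language-agnostic sketch of the pooling idea, here is the same pattern with psycopg2's built-in pool, pointed at whichever peer the ZooKeeper state says is primary (DSN hypothetical).

```python
# Sketch: pooled connections to the current primary.
from psycopg2.pool import ThreadedConnectionPool

# In Moray, the host would come from the cluster state read out of ZooKeeper.
pool = ThreadedConnectionPool(minconn=2, maxconn=16,
                              dsn="dbname=moray host=B.example.com")

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")   # application query goes here
finally:
    pool.putconn(conn)            # return the connection to the pool
```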
Moray
• Moray: key-value store built atop PostgreSQL
• “buckets”: tables
• “objects”: rows
• Operations
• put/get/delete bucket config
• put/get/delete/update/find objects
• transactions of the above
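A rough sketch of the bucket/object mapping: one table per bucket, one row per object keyed by name. Table and column names here are illustrative, not Moray's actual schema, and the update-then-insert put is not race-free.

```python
# Sketch: key-value operations over per-bucket tables (illustrative schema).
import json
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN

def put_object(bucket, key, value):
    with conn, conn.cursor() as cur:
        cur.execute(sql.SQL("UPDATE {} SET _value = %s WHERE _key = %s")
                    .format(sql.Identifier(bucket)), (json.dumps(value), key))
        if cur.rowcount == 0:      # no existing row: insert (racy; sketch only)
            cur.execute(sql.SQL("INSERT INTO {} (_key, _value) VALUES (%s, %s)")
                        .format(sql.Identifier(bucket)), (key, json.dumps(value)))

def get_object(bucket, key):
    with conn, conn.cursor() as cur:
        cur.execute(sql.SQL("SELECT _value FROM {} WHERE _key = %s")
                    .format(sql.Identifier(bucket)), (key,))
        row = cur.fetchone()
        return json.loads(row[0]) if row else None
```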
Horizontal scalability
• Shard based on the object’s directory name
• Keeping all of a directory’s objects on a single peer makes listing possible.
• Use a consistent hashing implementation (minimal sketch below)
• Hidden behind Electric Moray service.
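A minimal consistent-hashing sketch (not the implementation Joyent uses): each shard owns several virtual nodes on a hash ring, and a directory name maps to the first virtual node at or after its hash, so all objects in a directory land on one shard.

```python
# Sketch: consistent hashing over directory names.
import bisect
import hashlib

class Ring:
    def __init__(self, shards, vnodes=64):
        # Each shard gets `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted((self._hash("{}-{}".format(s, i)), s)
                            for s in shards for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, dirname):
        # First virtual node clockwise from the hash (wrapping around).
        i = bisect.bisect(self._keys, self._hash(dirname)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["shard1", "shard2", "shard3"])
print(ring.shard_for("/dap/stor/logs"))   # every object in this directory maps together
```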
All open-source
• Manatee: https://github.com/joyent/manatee
• Much love still needed to make this useful for others (sorry we’re behind on PRs!)
• Manatee state machine: https://github.com/joyent/manatee-state-machine (includes a more formal write-up)
• Moray: https://github.com/joyent/moray
• Manta: https://github.com/joyent/manta
Manatee today
• Still running version 9.2. (Are we stuck?)
• Moray/Manatee stack incorporated into SDC (SmartDataCenter) for storing all SDC data: users, compute nodes, containers, networks, ...
Manatee/Moray for jobs
• We also use the Manatee/Moray/PostgreSQL stack to manage execution of compute jobs
• All state in database: easy to debug, easy to reason about fault tolerance
• Arbitrary numbers of inputs, tasks, and outputs (jobs purged after 24h)
• Most queries based on jobId or workerId (all state is denormalized)
Incident 1: job performance
• Job queries became very slow
• 7-second job takes 10 minutes
Incident 1: job performance
• Typical reason: insufficient vacuum/analyze (exacerbated by long transactions)
• Problem: with a 24/7 duty cycle, cron-based “vacuum analyze” is not okay.
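The kind of check that flags this early, sketched with psycopg2 (hypothetical DSN): poll pg_stat_user_tables for dead-tuple counts and stale vacuum/analyze timestamps.

```python
# Sketch: find tables with the most dead tuples and stale (auto)vacuum times.
import psycopg2

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
          FROM pg_stat_user_tables
         ORDER BY n_dead_tup DESC
         LIMIT 10
    """)
    for row in cur.fetchall():
        # Large n_dead_tup or old timestamps mean vacuum/analyze is falling behind.
        print(row)
```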
Incident 2: queries hung
• Observation: queries hung
• Cause: poor client error handling (uncaughtException handler FTL)
Incident 3: queries hung (again)
• Observation: queries hung
• Observation: autovacuum running: “to prevent wraparound”
Incident 3: queries hung
• We spent a while trying to speed up the autovacuum
• prefetching
• disabling sleeping (!)
Incident 3: queries hung
• Five hours later: autovacuum completes
• Service is back online
Incident 3: queries hung
• 5 minutes later: autovacuum starts again (“to prevent wraparound”)
Incident 3: queries hung
• Solution: raise autovacuum_freeze_max_age (from 200M to 500M)
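To see how close tables are to a forced anti-wraparound autovacuum, compare age(relfrozenxid) against autovacuum_freeze_max_age; a sketch (hypothetical DSN):

```python
# Sketch: report each table's transaction-ID age relative to the freeze limit.
import psycopg2

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("SHOW autovacuum_freeze_max_age")
    max_age = int(cur.fetchone()[0])
    cur.execute("""
        SELECT relname, age(relfrozenxid) AS xid_age
          FROM pg_class
         WHERE relkind = 'r'
         ORDER BY age(relfrozenxid) DESC
         LIMIT 10
    """)
    for relname, xid_age in cur.fetchall():
        # Once xid_age reaches max_age, autovacuum runs "to prevent wraparound".
        print("{} {} ({:.0%} of limit)".format(relname, xid_age, xid_age / max_age))
```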
Incident 3: queries hung
• But why were we blocked?
• In the heat of battle, saved pg_locks information
Incident 3: queries hung
• Wrote “pglockanalyze” to summarize locks
• Root cause: DROP TRIGGER
PG locking
Queries         Locks
data path       ACCESS SHARE
autovacuum      SHARE UPDATE EXCLUSIVE
DROP TRIGGER    ACCESS EXCLUSIVE
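In the spirit of pglockanalyze (not its actual implementation), a self-join on pg_locks shows which granted lock each ungranted request is waiting behind:

```python
# Sketch: pair each waiting lock request with the granted lock blocking it.
import psycopg2

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT waiting.pid  AS waiting_pid,
               waiting.mode AS wanted_mode,
               holder.pid   AS holding_pid,
               holder.mode  AS held_mode
          FROM pg_locks waiting
          JOIN pg_locks holder
            ON holder.locktype = waiting.locktype
           AND holder.relation = waiting.relation
           AND holder.granted
           AND holder.pid <> waiting.pid
         WHERE NOT waiting.granted
    """)
    for row in cur.fetchall():
        # e.g. autovacuum's SHARE UPDATE EXCLUSIVE stuck behind ACCESS EXCLUSIVE
        print(row)
```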
Incident 3: queries hung
• Root cause was our application
• … but we could not hit this without a wraparound autovacuum
• … and now that we know about it, there’s no way to verify that exclusive locks are not being used (DTrace probes are close!)
Incident: 70K autovacuums
• At one point, a dubious design choice resulted in 70,000 tiny tables created around the same time
• Months later, nearly all of them ran into transaction wraparound autovacuum
• Surprisingly, this was mostly fine, except for the enormous I/O required to keep updating the 130MB stats collector file
Tools we built
• https://github.com/joyent/pgsqlstat (DTrace-based tools for watching overall activity, queries, transactions, lock events, and latency)
• https://github.com/joyent/pglockanalyze (summarize lock dependencies)
• https://github.com/joyent/pgstatsmon (periodically report basic stats to statsd)
Silly stuff we did
• Used ZK join order for initial cluster config
• Tried to use PostgreSQL as a queue
• Autovacuum configuration
• Connection management (especially for super-users, replication)
• Created 70,000 tables
• Periodic CREATE/DROP TRIGGER
• Didn’t look carefully at existing ecosystem.
Still not root-caused
• Manual “vacuum” as root succeeded, but didn’t update the table’s vacuum time
• “Freeze” autovacuum succeeded, but didn’t update the freeze age
• Several times: PostgreSQL crashes on startup during WAL replay.
• Typically: primary fine, sync crashes
• Rebuild sync from latest primary snapshot. Scary.
Stuff we’ve loved
• Durability, performance
• Observability:
• EXPLAIN
• DTrace probes
• internal views (pg_stat_activity, pg_stat_*_tables, pg_locks)
• We really value root-cause-analysis!
Wish list
• Some of these seem to be implemented...
• Online upgrade (or migration)
• What locks are being taken? (not based on sampling)
• Option for reads to go through sync replication path (to guarantee read-after-write consistency)