Automating PostgreSQL failover for high availability
David Pacheco (@dapsays), Joyent
This talk
• Using PostgreSQL to build Joyent’s Manta service
• Operational experiences
• Incidents
• Tooling
• Wish list
Background
• Most of this work was done in 2012 - 2013, with a major rework in late 2014
• None of us is primarily a database engineer or database administrator (at least not when this work was done)
• We learned a lot since then!
The Problem
• Building Joyent’s Manta storage service
• HTTP-based, multi-tenant object store
• PUT/GET/DELETE of arbitrary byte streams
• ... with in situ, Unix-like compute as a first-class operation
Manta: storage architecture
• Front door handles API requests to PUT/GET/DELETE objects
• Storage tier stores user data on ZFS.
• Metadata tier maps user-facing paths to backend storage objects
Metadata tier
• Durability is top priority.
• Strongly consistent (CP): because you can build AP atop CP, but not the reverse
• Workload:
• heavy read/write (small records)
• 24/7/365 duty cycle
Modern cloud system
• Fault tolerant (replication)
• Horizontally scalable (sharding)
PostgreSQL for metadata
• Going-in position: use PostgreSQL, at least until we know why it won’t work
• Known to scale well vertically
• Good operability (tools, observability)
• Built-in replication
Fault tolerance
• Three-peer cluster:
• Primary peer services all requests
• PostgreSQL synchronous replication to a synchronous peer (“secondary”, or “S”)
• Asynchronous replication to a third peer (“async”, or “A”)
• In three-datacenter config, each peer is in a separate datacenter.
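As a quick illustration of this topology (a sketch, not from the talk): with a psycopg2 client and a hypothetical DSN, pg_stat_replication on the primary shows one row per standby, and sync_state marks which peer is synchronous.

```python
# Sketch: inspect the replication topology from the primary.
# The DSN is a hypothetical placeholder.
import psycopg2

conn = psycopg2.connect("dbname=moray host=primary.example.com")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state, sync_state
          FROM pg_stat_replication
    """)
    for name, state, sync_state in cur.fetchall():
        # Expect one 'sync' row (the secondary) and one 'async' row (the async peer)
        print(name, state, sync_state)
```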
Manatee cluster
[Diagram: Clients connect to the Primary peer; sync replication flows from Primary to Sync, and async replication from Sync to Async 1.]
Sync fails
[Diagram: initial state A(prim) → B(sync) → C(async). B fails, so A promotes C: A(prim) → C(sync). Later, B returns and rejoins as the async: A(prim) → C(sync) → B(async).]
Unexpected constraint
• If P is synchronously replicating to S, it’s not guaranteed that switching P and S will work. And it often doesn’t.
Primary fails
[Diagram: initial state A(prim) → B(sync) → C(async). A fails, so B takes over and promotes C: A(deposed), B(prim) → C(sync). Later, A is rebuilt and rejoins as the async: B(prim) → C(sync) → A(async).]
Fault tolerance
• If P fails, S takes over and promotes A.
• If S fails, P promotes A.
• If A fails, no immediate impact.
• Writes to S or A will fail, so the client needs to know which peer is primary in order to work.
• Moray takes care of this (more later).
Cluster components
[Diagram: Clients send pg queries through Moray; Moray monitors cluster state in ZooKeeper; each Manatee-managed PG peer monitors and updates cluster state in ZooKeeper; sync replication runs from the Primary’s PG to the Sync’s, and async replication on to Async 1.]
Consensus
• Each PostgreSQL peer is run by a babysitter process called Manatee
• Manatee instances are connected to ZooKeeper for consensus.
• Cluster state stored in ZooKeeper.
• Clients read ZooKeeper state (see the sketch below).
• Still non-trivial.
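A minimal sketch of a client reading the cluster state, using the kazoo ZooKeeper client; the znode path and JSON fields are hypothetical placeholders, not Manatee's actual layout.

```python
# Sketch: read the cluster-state znode to find the current primary.
# Path and field names are hypothetical.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")  # hypothetical ensemble
zk.start()
data, stat = zk.get("/manatee/1.moray.example.com/state")
state = json.loads(data.decode("utf-8"))
print("primary:", state["primary"], "generation:", state["generation"])
zk.stop()
```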
Demo
Cluster state
• Identities of primary peer, sync peer, async peers (in order), deposed peers
• Generation number
• WAL position at start of generation
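Illustratively, the record might look something like the following; the field names are assumptions based on the list above, not Manatee's actual schema.

```python
# Illustrative shape of the cluster-state record (field names assumed).
cluster_state = {
    "generation": 4,                          # bumped on every takeover
    "primary": {"id": "B", "pgUrl": "tcp://B.example.com:5432"},
    "sync":    {"id": "C", "pgUrl": "tcp://C.example.com:5432"},
    "async":   [],                            # ordered list of async peers
    "deposed": [{"id": "A"}],                 # former primaries awaiting rebuild
    "initWal": "0/17B7188",                   # WAL position at start of generation
}
```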
Cluster state management
• Distributed algorithm guarantees no unsafe transitions (see the compare-and-set sketch below)
• Tools clearly report current state and history of all transitions
• From-scratch rebuild required when primary becomes deposed.
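One way to rule out unsafe transitions is ZooKeeper's versioned compare-and-set: a takeover only commits if nobody else changed the state since it was read. A sketch under that assumption, again with kazoo and hypothetical paths/fields; this is not Manatee's actual algorithm.

```python
# Sketch: versioned compare-and-set on the state znode.
import json
from kazoo.exceptions import BadVersionError

def try_takeover(zk, path, new_state):
    data, stat = zk.get(path)
    old = json.loads(data.decode("utf-8"))
    new_state["generation"] = old["generation"] + 1   # new generation per takeover
    try:
        # version=stat.version makes the write conditional: it fails if any
        # other peer modified the znode after our read.
        zk.set(path, json.dumps(new_state).encode("utf-8"), version=stat.version)
        return True
    except BadVersionError:
        return False   # lost the race; caller must re-read state and re-decide
```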
Moray: custom client
• Connection pooling
• Decouple application (Manta) from database
• Handles client side of cluster state (keeps track of which peer is primary)
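Moray itself is Node.js; as a language-agnostic sketch of the pooling idea, here is the same pattern with psycopg2's built-in pool, pointed at whichever peer the ZooKeeper state says is primary (DSN hypothetical).

```python
# Sketch: pooled connections to the current primary.
from psycopg2.pool import ThreadedConnectionPool

# In Moray, the host would come from the cluster state read out of ZooKeeper.
pool = ThreadedConnectionPool(minconn=2, maxconn=16,
                              dsn="dbname=moray host=B.example.com")

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")   # application query goes here
finally:
    pool.putconn(conn)            # return the connection to the pool
```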
Moray
• Moray: key-value store built atop PostgreSQL
• “buckets”: tables
• “objects”: rows
• Operations
• put/get/delete bucket config
• put/get/delete/update/find objects
• transactions of the above
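A rough sketch of the bucket/object mapping: one table per bucket, one row per object keyed by name. Table and column names here are illustrative, not Moray's actual schema, and the update-then-insert put is not race-free.

```python
# Sketch: key-value operations over per-bucket tables (illustrative schema).
import json
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN

def put_object(bucket, key, value):
    with conn, conn.cursor() as cur:
        cur.execute(sql.SQL("UPDATE {} SET _value = %s WHERE _key = %s")
                    .format(sql.Identifier(bucket)), (json.dumps(value), key))
        if cur.rowcount == 0:      # no existing row: insert (racy; sketch only)
            cur.execute(sql.SQL("INSERT INTO {} (_key, _value) VALUES (%s, %s)")
                        .format(sql.Identifier(bucket)), (key, json.dumps(value)))

def get_object(bucket, key):
    with conn, conn.cursor() as cur:
        cur.execute(sql.SQL("SELECT _value FROM {} WHERE _key = %s")
                    .format(sql.Identifier(bucket)), (key,))
        row = cur.fetchone()
        return json.loads(row[0]) if row else None
```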
Horizontal scalability
• Shard based on the object’s directory name
• Keeping all of a directory’s objects on a single peer makes listing possible.
• Use a consistent hashing implementation (minimal sketch below)
• Hidden behind Electric Moray service.
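A minimal consistent-hashing sketch (not the implementation Joyent uses): each shard owns several virtual nodes on a hash ring, and a directory name maps to the first virtual node at or after its hash, so all objects in a directory land on one shard.

```python
# Sketch: consistent hashing over directory names.
import bisect
import hashlib

class Ring:
    def __init__(self, shards, vnodes=64):
        # Each shard gets `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted((self._hash("{}-{}".format(s, i)), s)
                            for s in shards for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, dirname):
        # First virtual node clockwise from the hash (wrapping around).
        i = bisect.bisect(self._keys, self._hash(dirname)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["shard1", "shard2", "shard3"])
print(ring.shard_for("/dap/stor/logs"))   # every object in this directory maps together
```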
All open-source
• Manatee: https://github.com/joyent/manatee
• Much love still needed to make this useful for others (sorry we’re behind on PRs!)
• Manatee state machine: https://github.com/joyent/manatee-state-machine (includes a more formal write-up)
• Moray: https://github.com/joyent/moray
• Manta: https://github.com/joyent/manta
Manatee today
• Still running version 9.2. (Are we stuck?)
• Moray/Manatee stack incorporated into SDC (SmartDataCenter) for storing all SDC data: users, compute nodes, containers, networks, ...
Manatee/Moray for jobs
• We also use the Manatee/Moray/PostgreSQL stack to manage execution of compute jobs
• All state in database: easy to debug, easy to reason about fault tolerance
• Arbitrary numbers of inputs, tasks, and outputs (jobs purged after 24h)
• Most queries based on jobId or workerId (all state is denormalized)
Incident 1: job performance
• Job queries became very slow
• 7-second job takes 10 minutes
Incident 1: job performance
• Typical reason: insufficient vacuum/analyze (exacerbated by long transactions)
• Problem: with a 24/7 duty cycle, cron-based “vacuum analyze” is not okay.
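The kind of check that flags this early, sketched with psycopg2 (hypothetical DSN): poll pg_stat_user_tables for dead-tuple counts and stale vacuum/analyze timestamps.

```python
# Sketch: find tables with the most dead tuples and stale (auto)vacuum times.
import psycopg2

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
          FROM pg_stat_user_tables
         ORDER BY n_dead_tup DESC
         LIMIT 10
    """)
    for row in cur.fetchall():
        # Large n_dead_tup or old timestamps mean vacuum/analyze is falling behind.
        print(row)
```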
Incident 2: queries hung
• Observation: queries hung
• Cause: poor client error handling (uncaughtException handler FTL)
Incident 3: queries hung (again)
• Observation: queries hung
• Observation: autovacuum running: “to prevent wraparound”
Incident 3: queries hung
• We spent a while trying to speed up the autovacuum
• prefetching
• disabling sleeping (!)
Incident 3: queries hung
• Five hours later: autovacuum completes
• Service is back online
Incident 3: queries hung
• 5 minutes later: autovacuum starts again (“to prevent wraparound”)
Incident 3: queries hung
• Solution: raise autovacuum_freeze_max_age (from 200M to 500M)
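To see how close tables are to a forced anti-wraparound autovacuum, compare age(relfrozenxid) against autovacuum_freeze_max_age; a sketch (hypothetical DSN):

```python
# Sketch: report each table's transaction-ID age relative to the freeze limit.
import psycopg2

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("SHOW autovacuum_freeze_max_age")
    max_age = int(cur.fetchone()[0])
    cur.execute("""
        SELECT relname, age(relfrozenxid) AS xid_age
          FROM pg_class
         WHERE relkind = 'r'
         ORDER BY age(relfrozenxid) DESC
         LIMIT 10
    """)
    for relname, xid_age in cur.fetchall():
        # Once xid_age reaches max_age, autovacuum runs "to prevent wraparound".
        print("{} {} ({:.0%} of limit)".format(relname, xid_age, xid_age / max_age))
```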
Incident 3: queries hung
• But why were we blocked?
• In the heat of battle, saved pg_locks information
Incident 3: queries hung
• Wrote “pglockanalyze” to summarize locks
• Root cause: DROP TRIGGER
PG locking
Queries         Locks
data path       ACCESS SHARE
autovacuum      SHARE UPDATE EXCLUSIVE
DROP TRIGGER    ACCESS EXCLUSIVE
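In the spirit of pglockanalyze (not its actual implementation), a self-join on pg_locks shows which granted lock each ungranted request is waiting behind:

```python
# Sketch: pair each waiting lock request with the granted lock blocking it.
import psycopg2

conn = psycopg2.connect("dbname=moray")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT waiting.pid  AS waiting_pid,
               waiting.mode AS wanted_mode,
               holder.pid   AS holding_pid,
               holder.mode  AS held_mode
          FROM pg_locks waiting
          JOIN pg_locks holder
            ON holder.locktype = waiting.locktype
           AND holder.relation = waiting.relation
           AND holder.granted
           AND holder.pid <> waiting.pid
         WHERE NOT waiting.granted
    """)
    for row in cur.fetchall():
        # e.g. autovacuum's SHARE UPDATE EXCLUSIVE stuck behind ACCESS EXCLUSIVE
        print(row)
```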
Incident 3: queries hung
• Root cause was our application
• … but we could not hit this without a wraparound autovacuum
• … and now that we know about it, there’s no way to verify that exclusive locks are not being used (DTrace probes are close!)
Incident: 70K autovacuums
• At one point, a dubious design choice resulted in 70,000 tiny tables created around the same time
• Months later, nearly all of them ran into transaction wraparound autovacuum
• Surprisingly, this was mostly fine, except for the enormous I/O required to keep updating the 130MB stats collector file
Tools we built
• https://github.com/joyent/pgsqlstat (DTrace-based tools for watching overall activity, queries, transactions, lock events, and latency)
• https://github.com/joyent/pglockanalyze (summarize lock dependencies)
• https://github.com/joyent/pgstatsmon (periodically report basic stats to statsd)
Silly stuff we did
• Used ZK join order for initial cluster config
• Tried to use PostgreSQL as a queue
• Autovacuum configuration
• Connection management (especially for super-users, replication)
• Created 70,000 tables
• Periodic CREATE/DROP TRIGGER
• Didn’t look carefully at existing ecosystem.
Still not root-caused
• Manual “vacuum” as root succeeded, but didn’t update the table’s vacuum time
• “Freeze” autovacuum succeeded, but didn’t update the freeze age
• Several times: PostgreSQL crashes on startup during WAL replay.
• Typically: primary fine, sync crashes
• Rebuild sync from latest primary snapshot. Scary.
Stuff we’ve loved
• Durability, performance
• Observability:
• EXPLAIN
• DTrace probes
• internal views (pg_stat_activity, pg_stat_*_tables, pg_locks)
• We really value root-cause-analysis!
Wish list
• Some of these seem to be implemented...
• Online upgrade (or migration)
• What locks are being taken? (not based on sampling)
• Option for reads to go through sync replication path (to guarantee read-after-write consistency)