advanced postgres monitoring
TRANSCRIPT
SPEAKER
WHO IS THIS GUY?
▸ Sr. Database Architect at Medallia
▸ Recent fun employments:
▸ Principal Database Engineer@ WithMe
▸ Lead Database Architect @ OmniTI
▸ Expertise in PostgreSQL , Oracle, MySQL, NoSQL
▸ Contact : [email protected] or [email protected]
▸ Twitter: @DenishPatel
▸ Blog: http://www.pateldenish.com
▸ Postgres Slack Channel (https://postgres-slack.herokuapp.com/)
2
AGENDA
DISCUSSION LIST
▸ What to look for monitoring solution in general?
▸ Comparison - selected open source and commercial monitoring solutions
▸ Which metrics to collect and how?
▸ Which metrics to alert on and how to define thresholds?
▸ How to keep up with monitoring changes ?
▸ How to react on alerts at 3AM?
▸ Open discussion3
SELECTION CRITERIA
WHAT TO LOOK FOR IN MONITORING SOLUTION ?
▸ Blend of system monitoring with Postgres support
▸ Centralized monitoring
▸ Hosted vs On-premise
▸ Security concerns of clients
▸ Alerting and Dashboard/Graphs
▸ Easy installation and configuration
▸ Postgres Support
▸ pg_stat_statements
▸ Resource monitoring - CPU, RAM, DISK IO & Network
▸ pgbouncer support
4
COMPARISON
MONITORING SOLUTIONS
▸ Open Source
▸ Sensu
▸ Zabbix
▸ Zenoss (Limited capabilities)
▸ Nagios
▸ (Stop using Nagios so it can die peacefully!!)
▸ SAAS Offerings
▸ Wavefront
▸ Circonus
▸ Vividcortex
▸ OkMeter
▸ NewRelic5
COMPARISON
OPEN SOURCE SOLUTIONS
6
Postgres support?
Configuration Reco Confidence
Sensu Yescheck_postgres
Easy Graphite
HIGH
Zabbix Yes Plugin Easy
MED
Zenoss Yes Plugin Easy
MED
Nagios Yes check_postgres Difficult
LOW
COMPARISON
SAAS OFFERINGS
7
Postgres support Configuration Confidence
Wavefront Collectd
Yes collectd plugins HIGH
Circonus Yes Default checks HIGH
Vividcortex Yes Default checks HIGH
Okmeter Yes One click install pgbouncer
HIGH
New Relic Yes Plugins - missing some metrics
MEDIUM
COMPARISON
SAAS OFFERINGS
▸ Capacity Planning
▸ Real Time Analytics
▸ Anomaly Detection
▸ Data Retention
▸ Support Reviews
▸ Pricing
8
WAVEFRONT.COM
WAVEFRONT
▸ Nice Dashboard and alerting functionality
▸ Very scalable solution
▸ Works with existing metrics collection tools i.e collectd
▸ Real time analytics capability
▸ Complete monitoring suite
9
OKMETER.IO
OKMETER
▸ It is agent based system so you just need to install agent in your environment to monitor application, database or any other servers
▸ Very easy to install and configure
▸ Provides easy to configure Postgres Server monitoring using pg_stat_statements with server stats. Once you install agent, you get everything without any effort
▸ Built-in pgbouncer monitoring
▸ Built-in all resources monitoring ; Disk, CPU, Network & Memory10
USE CASE
MONITORING SOLUTION
▸ 150+ DB clusters across the globe
▸ Easy installation
▸ Standardization
▸ Centralized solution
▸ Real time analytics
▸ Support new Infra - Docker/Aurora/Mesos
12
METRICS COLLECTION
SETUP ROLE
13
create role collectd login encrypted password 'XXX'; create schema collectd; set search_path = collectd,pg_catalog; grant usage on schema collectd to collectd; alter role collectd set search_path = collectd,pg_catalog;
▸ Things to consider:
▸ Separate role for monitoring
▸ No SUPER ROLE
▸ Limited permissions
METRICS COLLECTION
COLLECTD PLUGIN
14
LoadPlugin postgresql # https://github.com/collectd/collectd/blob/master/src/postgresql_default.conf <Plugin postgresql> <Database dba> Host "localhost" Port "5432" User "collectd" Query backends Query transactions Query queries Query table_states Query disk_io Query disk_usage Query query_plans Query connections #custom Query slow_queries #custom Query txn_wraparound #custom Query locks #custom Query wal_files #custom Query scans #custom Query seq_scans #custom Query avg_querytime #custom Query checkpoints #custom Query slave_lag #custom </Database> </Plugin>
METRICS COLLECTION
BACKENDS
15
Query backends> Statement "SELECT count(*) AS count \ FROM pg_stat_activity \ WHERE datname = $1;"
Param database
<Result> Type "pg_numbackends" ValuesFrom "count" </Result> </Query>
METRICS COLLECTION
PG_STAT_ACTIVITY
16
create or replace function pg_stat_activity()
returns set of pg_catalog.pg_stat_activity
as $$
begin
return query(select * from pg_catalog.pg_stat_activity);
end $$
language plpgsql security definer;
revoke all on function pg_stat_activity() from public;
grant execute on function pg_stat_activity() to collectd;
METRICS COLLECTION
TRANSACTIONS
17
<Query transactions> Statement "SELECT xact_commit, xact_rollback \ FROM pg_stat_database \ WHERE datname = $1;" Param database <Result> Type "pg_xact" InstancePrefix "commit" ValuesFrom "xact_commit" </Result> <Result> Type "pg_xact" InstancePrefix "rollback" ValuesFrom "xact_rollback" </Result> </Query>
METRICS COLLECTION
QUERIES (DML)
18
<Query queries> Statement "SELECT sum(n_tup_ins) AS ins, \ sum(n_tup_upd) AS upd, \ sum(n_tup_del) AS del, \ sum(n_tup_hot_upd) AS hot_upd \ FROM pg_stat_user_tables;" <Result> Type "pg_n_tup_c" InstancePrefix "ins" ValuesFrom "ins" </Result> .. . . . </Query>
METRICS COLLECTION
TABLE_STATES
19
<Query table_states> Statement "SELECT sum(n_live_tup) AS live, sum(n_dead_tup) AS dead \ FROM pg_stat_user_tables;"
<Result> Type "pg_n_tup_g" InstancePrefix "live" ValuesFrom "live" </Result> <Result> Type "pg_n_tup_g" InstancePrefix "dead" ValuesFrom "dead" </Result>
</Query>
METRICS COLLECTION
QUERY_PLANS
20
<Query query_plans> Statement "SELECT sum(seq_scan) AS seq, \ sum(seq_tup_read) AS seq_tup_read, \ sum(idx_scan) AS idx, \ sum(idx_tup_fetch) AS idx_tup_fetch \ FROM pg_stat_user_tables;"
<Result> Type "pg_scan" InstancePrefix "seq" ValuesFrom "seq" . . </Query>
METRICS COLLECTION
DISK_IO
21
<Query disk_io>
Statement "SELECT coalesce(sum(heap_blks_read), 0) AS heap_read, \
coalesce(sum(heap_blks_hit), 0) AS heap_hit, \
coalesce(sum(idx_blks_read), 0) AS idx_read, \
coalesce(sum(idx_blks_hit), 0) AS idx_hit, \
coalesce(sum(toast_blks_read), 0) AS toast_read, \
coalesce(sum(toast_blks_hit), 0) AS toast_hit, \
coalesce(sum(tidx_blks_read), 0) AS tidx_read, \
coalesce(sum(tidx_blks_hit), 0) AS tidx_hit \
FROM pg_statio_user_tables;"
METRICS COLLECTIONS
DISK USAGE / DB SIZE
22
<Query disk_usage>
Statement "SELECT pg_database_size($1) AS size;"
Param database
<Result>
Type pg_db_size
ValuesFrom "size"
</Result>
</Query>
METRICS COLLECTION
CONNECTIONS #CUSTOM
23
<Query connections> Statement "SELECT COUNT(state) AS count, state FROM (SELECT CASE WHEN state = 'idle' THEN 'idle' WHEN state = 'idle in transaction' THEN 'idle_in_transaction' WHEN state = 'active' THEN 'active' ELSE 'unknown' END AS state FROM collectd.pg_stat_activity) state GROUP BY state UNION SELECT COUNT(*) AS count, 'waiting' AS state FROM collectd.pg_stat_activity WHERE waiting ;" <Result> Type "pg_numbackends" InstancePrefix "state" InstancesFrom "state" ValuesFrom "count" </Result> </Query>
METRICS COLLECTION
SLOW_QUERIES
25
<Query slow_queries>
Statement "SELECT COUNT(*) AS count FROM collectd.pg_stat_activity WHERE state='active'
AND now()-query_start > '300 seconds'::interval
AND query ~* '^(insert|update|delete|select)' ;"
<Result>
Type "counter"
InstancePrefix "pg_slow_queries"
ValuesFrom "count"
</Result>
</Query>
METRICS COLLECTION
TXN_WRAPAROUND
26
<Query txn_wraparound>
Statement "SELECT age(datfrozenxid) as txn_wrap_age FROM pg_database ;"
<Result>
Type "counter"
InstancePrefix "txn_wraparound"
ValuesFrom "txn_wrap_age"
</Result>
</Query>
METRICS COLLECTION
LOCKS
27
<Query locks>
Statement "SELECT COUNT(mode) AS count, mode FROM pg_locks GROUP BY mode
UNION SELECT COUNT(*) AS count, 'waiting' AS mode FROM pg_locks
WHERE granted is false ;"
<Result>
Type "gauge"
InstancePrefix "pg_locks"
InstancesFrom "mode"
ValuesFrom "count"
</Result>
</Query>
METRICS COLLECTION
WAL_FILES
29
<Query wal_files>
Statement "SELECT archived_count AS count, failed_count AS failed FROM pg_stat_archiver;" <Result>
Type "gauge"
InstancePrefix "pg_wal_count"
ValuesFrom "count"
</Result>
<Result>
Type "gauge"
InstancePrefix "pg_wal_failed"
ValuesFrom "failed"
</Result> </Query>
METRICS COLLECTION
SCANS
30
<Query scans>
Statement "SELECT sum(idx_scan) as index_scans, sum(seq_scan) as seq_scans,
sum(idx_tup_fetch) as index_tup_fetch, sum(seq_tup_read) as seq_tup_read
FROM pg_stat_all_tables ; "
<Result>
Type "pg_scan"
InstancePrefix "index"
ValuesFrom "index_scans"
</Result>
.
.
</Query>
METRIC COLLECTION
SEQ_SCANS
31
<Query seq_scans>
Statement "SELECT CASE WHEN status='OK' THEN 0 ELSE 1 END AS status
FROM ( SELECT get_seq_scan_on_large_tables AS status
FROM collectd.get_seq_scan_on_large_tables) AS foo;"
<Result>
Type "gauge"
InstancePrefix "pg_seq_scans"
ValuesFrom "status"
</Result>
</Query>
METRICS COLLECTION
SEQ_SCAN_ON_LARGE_TABLES
32
CREATE MATERIALIZED VIEW collectd.seq_scan_on_large_tables AS SELECT relid, schemaname, relname, seq_scan, seq_tup_read , pg_relation_size(relid) as relsize, now() as refreshed_at FROM pg_stat_all_tables WHERE pg_relation_size(relid) > 1073741824 AND schemaname not in ('pg_catalog', 'information_schema') UNION ALL SELECT 0,'0','0','0',0,0,now();
ALTER materialized VIEW collectd.seq_scan_on_large_tables OWNER TO collectd;
METRICS COLLECTION
GET_SEQ_SCAN_ON_LARGE_TABLES
33
CREATE OR REPLACE FUNCTION collectd.get_seq_scan_on_large_tables() RETURNS text AS $$
DECLARE v_matview text; v_refreshed_at timestamptz; v_tables_with_seq_scan text[]; BEGIN SELECT refreshed_at INTO v_refreshed_at FROM collectd.seq_scan_on_large_tables WHERE relid=0; -- refresh MV every 4 hours IF v_refreshed_at < now() - interval '4 hours' and pg_is_in_recovery() is false THEN REFRESH MATERIALIZED VIEW collectd.seq_scan_on_large_tables; END IF;
SELECT ARRAY (SELECT base.relname ||':'|| (current.seq_scan-base.seq_scan) INTO v_tables_with_seq_scan FROM collectd.seq_scan_on_large_tables AS base LEFT JOIN pg_stat_all_tables AS current ON (base.schemaname=base.schemaname AND base.relname=current.relname) WHERE (current.seq_scan-base.seq_scan) > 0 AND ((current.seq_tup_read-base.seq_tup_read)/(current.seq_scan-base.seq_scan)) > 50000 ) AS tables_with_seq_scan;
IF v_tables_with_seq_scan = '{}' THEN RETURN 'OK'; ELSE RETURN 'PROBLEM: Seq scan on table: '|| array_to_string(v_tables_with_seq_scan,'&'); END If; END; $$ LANGUAGE 'plpgsql' SECURITY DEFINER;
METRICS COLLECTION
AVG_QUERYTIME
34
<Query avg_querytime>
Statement "SELECT sum(total_time)/sum(calls) AS avg_querytime FROM collectd.get_stat_statements() ;"
<Result>
Type "gauge"
InstancePrefix "pg_avg_querytime"
ValuesFrom "avg_querytime"
</Result>
</Query>
<Query scans>
METRICS COLLECTION
GET_STAT_STATEMENTS
35
create extension IF NOT EXISTS pg_stat_statements WITH SCHEMA collectd;
alter schema collectd owner to collectd;
CREATE OR REPLACE FUNCTION collectd.get_stat_statements() RETURNS SETOF pg_stat_statements AS
$$
SELECT * FROM pg_stat_statements
WHERE dbid IN (SELECT oid FROM pg_database WHERE datname = current_database());
$$ LANGUAGE sql VOLATILE SECURITY DEFINER;
METRICS COLLECTIONS
CHECKPOINTS
38
<Query checkpoints>
Statement "SELECT (checkpoints_timed + checkpoints_req) AS total_checkpoints
FROM pg_stat_bgwriter ;"
<Result>
Type "counter"
InstancePrefix "pg_checkpoints"
ValuesFrom "total_checkpoints"
</Result>
</Query>
METRICS COLLECTION
SLAVE LAG
39
<Query slave_lag>
Statement "SELECT CASE WHEN pg_is_in_recovery = 'false' THEN 0
ELSE COALESCE(ROUND(EXTRACT(epoch FROM now() pg_last_xact_replay_timestamp())),0) END AS seconds
FROM pg_is_in_recovery();"
<Result>
Type "counter"
InstancePrefix "slave_lag"
ValuesFrom "seconds"
</Result>
</Query>
ALERTING
SETUP ALERTS ON DB METRICS
▸ Uptime
▸ Waiting Connections
▸ # of connections waiting > 5
▸ Slow queries
▸ # of slow queries > 5
▸ Seq scan on large tables
▸ TXN Wraparound
▸ Age Over 1.5B
▸ Disk space usage
▸ 85%?
▸ Slave lag
▸ 5 minutes?
40
MONITORING CHANGES
HOW TO KEEP UP?▸ Design with failover in mind
▸ Keep eyes on new features for monitoring in latest DB or OS version
▸ Postgres 9.5 enhancements
▸ Commit timestamp tracking
▸ SELECT * FROM pg_last_committed_xact();
▸ cluster_name
▸ $ ps -ef | grep checkpointer
▸ postgres 12181 12178 0 11:12 ? 00:00:00 postgres: personnel: checkpointer process
▸ postgres 12207 12204 0 11:12 ? 00:00:00 postgres: reportsdb: checkpointer process
▸ postgres 12233 12230 0 11:12 ? 00:00:00 postgres: management: checkpointer process
▸ A bunch of changes coming in Postgres 9.6
▸ Improve the pg_stat_activity view provides more details about waiting on what resources
▸ Deploy monitoring through config management tools 41
INCIDENT MANAGEMENT
HOW TO BE READY TO HANDLE 3AM CALL?▸ PagerDuty calendar : https://www.pagerduty.com/
▸ Document metrics
▸ URL for the Dashboard
▸ Alert resolution procedure
▸ Clear SLAs (Decision)
▸ Escalation policy
▸ Scenarios
▸ Wait for server to bring backup
▸ Failover
▸ Review alerts before going OnCall
▸ Oncall notification
▸ Think for the worst and document accordingly
▸ What if you are in movie theatre/beach etc.?
▸ What if you can’t jump on the server?
▸ Keep the document up-to-date 42
KEEP IN TOUCH
THANKS & Q/A
▸ You!
▸ Conference committee
▸ Contact for further Q/A
▸ Twitter: @DenishPatel
▸ pateldenish.com