pg_flight_recorder: Continuously sample PostgreSQL system state via pg_cron

· databases ai-agents · Source ↗

A pure-SQL flight recorder for PostgreSQL 15–18 that captures wait events, locks, WAL, and query performance every 60s via pg_cron—no sidecars, no agents, no external polling.

What Matters

  • Two extensions: pgfr_record (core collection, ring buffers, scheduling) and optional pgfr_analyze (reporting, anomaly detection, time travel queries).
  • Ring buffers retain sampled activity (wait events, sessions, locks) for 2h hot; archives persist 7 days. Snapshots (WAL, I/O, tables) kept 30 days.
  • Default retention is ~2.5GB uncompressed, ~150MB compressed—exportable via pg_dump on a single schema.
  • Circuit breaker skips collection if recent runs averaged >1s; load shedding kicks in above 70% active connections; per-query section timeout is 250ms.
  • pgfr_analyze.what_happened_at('timestamp') and incident_timeline() enable point-in-time incident reconstruction from archived samples.
  • XID and MultiXID wraparound monitoring included; configurable warning ratios (default tunable via xid_warning_ratio config key, per postgres-howto #0044).
  • Switching to troubleshooting profile drops to 60s sampling with expanded capture; production_safe profile uses 300s intervals for minimal overhead.

Original | Discuss on HN