Linux 7.0 Broke PostgreSQL: The Preemption Regression Explained


TLDR

  • Linux 7.0 removed PREEMPT_NONE, and under PREEMPT_LAZY, minor page faults taken while a spinlock is held turned into severe lock contention, roughly halving PostgreSQL throughput on high-core-count servers.

Key Takeaways

  • On a 96-vCPU Graviton4 running pgbench at scale factor 8,470 with 1,024 clients, throughput fell from 98,565 to 50,751 TPS on Linux 7.0, a drop of about 48.5%.
  • perf pinpointed the culprit: 55.6% of CPU time inside s_lock, the spinlock in StrategyGetBuffer, PostgreSQL’s global buffer pool selection function.
  • Root cause: PREEMPT_LAZY lets the kernel preempt a backend in the middle of the page-fault handler. If that backend holds the spinlock, every other backend that needs it spins and burns CPU, so the cost multiplies across all waiting backends.
  • With PREEMPT_NONE (removed in Linux 7.0), the fault resolved before rescheduling occurred; spinlock hold time stayed short and damage stayed bounded.
  • Fix: enable huge pages (2 MB or 1 GB). For a 120 GB shared buffer pool, this cuts the number of potential page faults from ~31 million (at 4 KB per page) to ~61,440 or ~120, respectively. Set huge_pages = on (not try) so misconfigured instances fail fast.
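
The huge-page figures above follow from dividing the pool size by the page size. The sketch below reproduces that arithmetic in Python. It assumes "GB", "MB", and "KB" are binary units, a 4 KB base page, and that each page faults at most once, so the counts are upper bounds on first-touch faults. The names pages_needed and fault_table are illustrative only and do not come from PostgreSQL or the kernel.

```python
GIB = 1024 ** 3

# Assumption: a 120 GB shared buffer pool, read as binary gigabytes.
POOL_BYTES = 120 * GIB

# Assumption: 4 KB base pages, plus the two huge-page sizes named above.
PAGE_SIZES = {
    "4 KB": 4 * 1024,
    "2 MB": 2 * 1024 ** 2,
    "1 GB": GIB,
}


def pages_needed(pool_bytes: int, page_bytes: int) -> int:
    """Return the number of pages required to map pool_bytes (ceiling division)."""
    return -(-pool_bytes // page_bytes)


def fault_table(pool_bytes: int = POOL_BYTES) -> dict:
    """Map each page-size label to the maximum number of first-touch faults."""
    return {
        label: pages_needed(pool_bytes, size)
        for label, size in PAGE_SIZES.items()
    }


if __name__ == "__main__":
    for label, count in fault_table().items():
        print(f"{label:>5} pages: {count:,} potential faults")
```

Under these assumptions the 4 KB case comes to 31,457,280 pages, which is the "~31 million" quoted above, while the 2 MB and 1 GB cases come to 61,440 and 120.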

Hacker News Comment Review

  • No substantive HN discussion yet. The single comment links to a related prior thread without adding new technical detail.
