Linux 7.0 removed PREEMPT_NONE, and PREEMPT_LAZY’s scheduler behavior turned minor page faults into a spinlock catastrophe, halving PostgreSQL throughput on high-core servers.
Key Takeaways
On a 96-vCPU Graviton4 running pgbench at scale factor 8,470 with 1,024 clients, Linux 7.0 dropped throughput from 98,565 to 50,751 TPS.
perf pinpointed the culprit: 55.6% of CPU time inside s_lock, the spinlock in StrategyGetBuffer, PostgreSQL’s global buffer pool selection function.
Root cause: PREEMPT_LAZY allows the kernel to preempt a backend in the middle of a page-fault handler. If that backend holds the spinlock when it is preempted, every other backend that needs the lock spins, burning CPU; the cost multiplies across all waiting backends.
With PREEMPT_NONE (removed in Linux 7.0), the fault resolved before rescheduling occurred; spinlock hold time stayed short and damage stayed bounded.
Fix: enable huge pages (2 MB or 1 GB). For a 120 GB shared buffer pool, this cuts the number of potential page faults from ~31 million to ~61,440 or ~120, respectively. Set huge_pages = on (not try) so misconfigured instances fail fast.
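The figures in the takeaways above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that the ~31 million figure corresponds to 4 KB base pages and that "120 GB" means 120 GiB. The helper names (pages_needed, throughput_drop) are hypothetical and are not taken from PostgreSQL or the kernel.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Assumption: a 4 KB base page size, which is what the ~31 million figure implies.
PAGE_SIZES = {"4 KB": 4 * KIB, "2 MB": 2 * MIB, "1 GB": GIB}


def pages_needed(pool_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map pool_bytes (ceiling division)."""
    return -(-pool_bytes // page_bytes)


def throughput_drop(before_tps: float, after_tps: float) -> float:
    """Fractional throughput loss going from before_tps to after_tps."""
    return (before_tps - after_tps) / before_tps


if __name__ == "__main__":
    pool = 120 * GIB  # shared buffer pool size quoted above
    for label, size in PAGE_SIZES.items():
        # Each page is at most one first-touch (minor) fault per mapping.
        print(f"{label:>5} pages: {pages_needed(pool, size):,}")
    print(f"throughput drop: {throughput_drop(98_565, 50_751):.1%}")
```

Under these assumptions the page counts come out to 31,457,280, 61,440, and 120, and the move from 98,565 to 50,751 TPS is a drop of roughly 48.5%, consistent with the "halving" described in the headline.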
Hacker News Comment Review
No substantive HN discussion yet. The single comment links to a related prior thread without adding new technical detail.