When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

TLDR

  • Cloudflare’s quiche QUIC stack inherited a Linux CUBIC idle-period bug that pinned cwnd at its minimum, so 61% of test runs failed during post-loss recovery.

Key Takeaways

  • CUBIC’s congestion_recovery_start_time in quiche advances on every send where bytes_in_flight == 0, pushing the recovery boundary into the future and forcing perpetual recovery state.
  • The root cause traces to a 2017 Linux kernel CUBIC idle fix that later received a follow-up correction; quiche ported the first fix but missed the second.
  • The death spiral requires three simultaneous conditions: a real loss event setting the recovery boundary, congestion avoidance mode active, and cwnd collapsed to the two-packet floor (2700 bytes).
  • Reno passed the same test 100% of the time, confirming the bug is CUBIC-specific and tied to its epoch-based growth curve, not shared loss-based logic.
  • The near-one-line fix correctly distinguishes RTT wait time from true application idleness, preserving CUBIC’s growth curve shape without resetting the epoch.
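
The mechanism in the takeaways above can be illustrated with a toy model. This is a minimal sketch in Python, not quiche's Rust code: the class ToySender, the flag exclude_rtt_wait, the 1350-byte packet size, and the Reno-style additive growth are all assumptions made for illustration, and the "fixed" branch (measuring idleness from the later of the last send and the last ACK) is one plausible way to separate RTT wait from real idleness, not necessarily the actual patch.

```python
from dataclasses import dataclass

MSS = 1350            # assumed packet size; 2 * MSS matches the 2700-byte floor
MIN_CWND = 2 * MSS


@dataclass
class ToySender:
    exclude_rtt_wait: bool          # False models the bug, True the hypothetical fix
    cwnd: int = MIN_CWND
    bytes_in_flight: int = 0
    recovery_start: float = 0.0     # packets sent at or before this are "in recovery"
    last_sent: float = 0.0
    last_ack: float = 0.0

    def on_loss(self, now):
        # A real loss event sets the recovery boundary and collapses cwnd.
        self.recovery_start = now
        self.cwnd = MIN_CWND

    def on_send(self, now, size=MSS):
        if self.bytes_in_flight == 0:
            # Buggy variant: the whole gap since the last send counts as idle,
            # even though most of it was spent waiting one RTT for ACKs.
            idle_since = self.last_sent
            if self.exclude_rtt_wait:
                # Hypothetical fix: idleness starts when the flight emptied.
                idle_since = max(self.last_sent, self.last_ack)
            self.recovery_start += max(0.0, now - idle_since)
        self.bytes_in_flight += size
        self.last_sent = now

    def on_ack(self, now, sent_time, size=MSS):
        self.bytes_in_flight -= size
        self.last_ack = now
        if sent_time > self.recovery_start:
            # Stand-in for congestion avoidance growth (Reno-style additive
            # increase, not the real CUBIC curve).
            self.cwnd += MSS * MSS // self.cwnd


def run(sender, rtt=0.05, rounds=10):
    """Send one full window per RTT after a loss and return the final cwnd."""
    now = rtt
    sender.last_sent = 0.0          # the lost flight went out at t = 0
    sender.last_ack = now           # the ACK that revealed the loss arrives now
    sender.on_loss(now)
    for _ in range(rounds):
        n = sender.cwnd // MSS
        sent_at = now
        for _ in range(n):
            sender.on_send(now)
        now += rtt                  # ACKs for the whole flight arrive one RTT later
        for _ in range(n):
            sender.on_ack(now, sent_at)
    return sender.cwnd


buggy_cwnd = run(ToySender(exclude_rtt_wait=False))
fixed_cwnd = run(ToySender(exclude_rtt_wait=True))
print(buggy_cwnd, fixed_cwnd)
```

In this model the buggy sender never leaves the 2700-byte floor: at that window size the flight drains completely every round trip, so each new send pushes recovery_start one RTT further ahead and every ACK is judged to belong to the recovery period. The variant that excludes RTT wait leaves the boundary where the loss put it and starts growing again after one round.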

Hacker News Comment Review

  • No substantive HN discussion yet.
