Our agent found a bug with WireGuard in Google Kubernetes Engine

TLDR

  • Lovable’s infra team traced sporadic GKE failures to an anetd WireGuard concurrency panic and a hidden MTU mismatch, using AI-assisted log analysis and tcpdump.

Key Takeaways

  • anetd pods were crashing ~120 times per pod over six days due to a concurrent map-access panic in Google’s WireGuard integration code, not WireGuard itself.
  • Disabling node-to-node WireGuard encryption stopped the crashes but exposed a second failure: nodes had mismatched MTUs (1420 vs 1500 bytes), breaking Valkey connections intermittently.
  • The MTU mismatch was only visible after the first fix; nodes not yet restarted still used the WireGuard-era 1420-byte MTU, causing fragmentation errors on cross-node traffic.
  • An AI agent querying ClickHouse logs surfaced the anetd restart pattern that manual log review had missed; tcpdump and Wireshark confirmed the MTU issue.
  • Google has since patched the WireGuard concurrency bug; Lovable’s pod creation volume (50+ sandboxes/sec peak) surfaced a race condition Google hadn’t caught in typical workloads.

Hacker News Comment Review

  • No substantive HN discussion yet.
