Lovable’s infra team traced sporadic GKE failures to an anetd WireGuard concurrency panic and a hidden MTU mismatch, using AI-assisted log analysis and tcpdump.
Key Takeaways
anetd pods were crashing ~120 times per pod over six days due to a concurrent map-access panic in Google’s WireGuard integration code, not WireGuard itself.
Disabling node-to-node WireGuard encryption stopped the crashes but exposed a second failure: nodes had mismatched MTUs (1420 vs 1500 bytes), breaking Valkey connections intermittently.
The MTU mismatch became visible only after the first fix: nodes that had not yet been restarted still used the WireGuard-era 1420-byte MTU, causing fragmentation errors on cross-node traffic.
An AI agent querying ClickHouse logs surfaced the anetd restart pattern that manual log review had missed; tcpdump and Wireshark confirmed the MTU issue.
Google has since patched the WireGuard concurrency bug; Lovable’s pod creation volume (50+ sandboxes/sec peak) surfaced a race condition Google hadn’t caught in typical workloads.
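The 1420 vs 1500 figures in the takeaways can be made concrete with a little arithmetic. The sketch below is illustrative only and is not Lovable's tooling: it assumes IPv4 and TCP headers without options (20 bytes each), and the node names and function names are hypothetical. It shows how much TCP payload fits under each MTU and which node pairs in a mixed fleet disagree.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def max_tcp_payload(mtu: int) -> int:
    """Largest TCP payload that fits in one packet at the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits(packet_size: int, receiver_mtu: int) -> bool:
    """True if a packet of packet_size bytes fits within receiver_mtu."""
    return packet_size <= receiver_mtu


def mismatched_pairs(node_mtus: dict[str, int]) -> list[tuple[str, str]]:
    """Return every pair of nodes whose configured MTUs differ."""
    names = sorted(node_mtus)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if node_mtus[a] != node_mtus[b]
    ]


# Hypothetical fleet: one node still on the WireGuard-era MTU.
fleet = {"node-a": 1420, "node-b": 1500, "node-c": 1500}

print(max_tcp_payload(1500))    # payload budget on a restarted node
print(max_tcp_payload(1420))    # payload budget on a not-yet-restarted node
print(fits(1500, 1420))         # a full-size packet against the smaller MTU
print(mismatched_pairs(fleet))  # pairs where large packets may not get through
```

Under these assumptions, small packets fit on both sides and only full-size packets between mismatched nodes run into trouble, which is consistent with failures that appear intermittent rather than constant.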
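The restart figure in the takeaways (about 120 crashes per pod over six days) is the kind of pattern that is easy to miss line by line and obvious once aggregated. The following is a minimal sketch of that aggregation, assuming restart events have already been exported from the log store as (pod name, timestamp) pairs; the record shape, pod names, and function names are hypothetical, and no ClickHouse query is shown.

```python
from collections import Counter
from datetime import datetime, timedelta


def restarts_per_pod(events: list[tuple[str, datetime]]) -> Counter:
    """Count restart events per pod name."""
    return Counter(pod for pod, _ in events)


def mean_interval(restart_count: int, window: timedelta) -> timedelta:
    """Average time between restarts over the observation window."""
    return window / restart_count


# Hypothetical sample: two pods, three restart events in total.
start = datetime(2025, 1, 1)
events = [
    ("anetd-abc", start),
    ("anetd-abc", start + timedelta(hours=1)),
    ("anetd-xyz", start + timedelta(hours=2)),
]

print(restarts_per_pod(events))
# Using the figures from the summary: 120 restarts in six days.
print(mean_interval(120, timedelta(days=6)))
```

With the summary's numbers, 120 restarts in six days works out to one crash roughly every 72 minutes per pod, frequent enough to disrupt traffic yet sparse enough to hide in a manual log review.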