Incident Report: Railway Blocked by Google Cloud (Resolved)


TLDR

  • An automated action by GCP incorrectly suspended Railway’s production account on May 19, causing an 8-hour platform-wide outage that spanned GCP, Railway Metal, and AWS.

Key Takeaways

  • The GCP account suspension at 22:20 UTC instantly took down Railway’s dashboard, API, control-plane databases, and GCP compute.
  • Edge proxies on Railway Metal and AWS stayed up briefly, but they populated their routing tables from a network control plane hosted on GCP; once the cached routes expired (~22:35 UTC), every region returned 404s.
  • Full recovery took ~8 hours: persistent disks came back by 23:54 UTC, networking by 01:30 UTC, dashboard by 02:55 UTC, full resolution by 07:58 UTC.
  • GitHub then rate-limited Railway’s OAuth and webhook integrations because of the burst of retries, blocking logins and builds as a secondary failure.
  • Planned fixes: decouple the network control plane from GCP (a true mesh), extend high-availability (HA) database shards across AWS and Metal, and remove GCP from the data-plane hot path.
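
The 404s that appeared about 15 minutes after the suspension follow from how a TTL-based route cache behaves when its source of truth disappears. The sketch below is a minimal, hypothetical model, not Railway’s proxy code: `EdgeRouteCache`, `fetch_route`, `FakeClock`, the upstream address, and the `serve_stale` flag are illustrative names, and the 900-second TTL is an assumed value chosen only to mirror the gap between 22:20 and ~22:35 UTC. Routes keep resolving while cached entries are fresh; once an entry expires and the control plane is unreachable, the proxy has nothing to route to and returns 404.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CachedRoute:
    upstream: str      # backend address the proxy forwards traffic to
    fetched_at: float  # clock time of the last successful control-plane lookup


class EdgeRouteCache:
    """Hypothetical edge-proxy route cache populated from a remote control plane."""

    def __init__(
        self,
        fetch_route: Callable[[str], Optional[str]],
        ttl_s: float,
        serve_stale: bool = False,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_route = fetch_route  # control-plane lookup; raises ConnectionError when unreachable
        self._ttl_s = ttl_s
        self._serve_stale = serve_stale
        self._clock = clock
        self._routes: Dict[str, CachedRoute] = {}

    def resolve(self, host: str) -> Tuple[int, Optional[str]]:
        """Return (HTTP status, upstream) for a request addressed to `host`."""
        now = self._clock()
        entry = self._routes.get(host)
        if entry is not None and now - entry.fetched_at < self._ttl_s:
            return 200, entry.upstream  # fresh entry: no control-plane call needed
        try:
            upstream = self._fetch_route(host)  # missing or expired: ask the control plane
        except ConnectionError:
            if self._serve_stale and entry is not None:
                return 200, entry.upstream  # mitigation: keep the last-known-good route
            self._routes.pop(host, None)  # expired entry is dropped
            return 404, None  # nothing left to route to
        if upstream is None:
            return 404, None  # control plane reachable but has no route for this host
        self._routes[host] = CachedRoute(upstream, now)
        return 200, upstream


class FakeClock:
    """Manually advanced clock so the demo does not have to wait."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        return self.t


def simulate(serve_stale: bool) -> List[int]:
    clock = FakeClock()
    state = {"control_plane_up": True}

    def fetch(host: str) -> Optional[str]:
        if not state["control_plane_up"]:
            raise ConnectionError("control plane unreachable")
        return "10.0.0.5:8080"  # hypothetical upstream address

    cache = EdgeRouteCache(fetch, ttl_s=900, serve_stale=serve_stale, clock=clock)
    statuses = [cache.resolve("app.example.com")[0]]  # t=0: route cached
    state["control_plane_up"] = False  # control plane goes away
    clock.t = 600
    statuses.append(cache.resolve("app.example.com")[0])  # t=600: still within TTL
    clock.t = 901
    statuses.append(cache.resolve("app.example.com")[0])  # t=901: TTL expired
    return statuses


print(simulate(serve_stale=False))  # [200, 200, 404]
print(simulate(serve_stale=True))   # [200, 200, 200]
```

Setting `serve_stale=True` shows one common mitigation (keep serving the last-known-good route when a refresh fails). It is included only for contrast; the planned fixes listed above instead remove the GCP dependency from the network control plane itself.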

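The GitHub rate limiting fits the pattern of a retry storm: many clients retrying at once, on a fixed schedule, concentrate load on the upstream API. A standard client-side mitigation is capped exponential backoff with "full jitter", sketched below. This is an illustrative technique, not a description of Railway’s retry logic, which the source does not detail: `RateLimited`, `call_with_backoff`, and `flaky_github_call` are hypothetical names, not GitHub client APIs, and the base and cap values are assumptions.

```python
import random
import time
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error for a rate-limited response (e.g. HTTP 403/429)."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_s: float = 1.0,
    cap_s: float = 60.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with capped exponential backoff and full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of retrying forever
            # Ceiling doubles each attempt (1s, 2s, 4s, ...) up to cap_s.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Full jitter: wait a uniformly random time in [0, ceiling].
            sleep(rng.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be >= 1")


def demo() -> Tuple[str, List[float]]:
    calls = {"n": 0}
    slept: List[float] = []

    def flaky_github_call() -> str:
        # Hypothetical stand-in for an OAuth or webhook API request:
        # rate-limited on the first three calls, then succeeds.
        calls["n"] += 1
        if calls["n"] <= 3:
            raise RateLimited("rate limit exceeded")
        return "ok"

    # Record sleeps instead of actually sleeping, so the demo runs instantly.
    result = call_with_backoff(flaky_github_call, rng=random.Random(42), sleep=slept.append)
    return result, slept


result, slept = demo()
print(result, [round(s, 2) for s in slept])
```

Jitter randomizes each wait within an exponentially growing ceiling, so retries from many workers spread out instead of arriving in synchronized waves. A real GitHub client should also honor the `Retry-After` response header when it is present, and per-client backoff does not by itself bound the aggregate rate from many clients; a shared rate limiter is the usual complement.
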
Hacker News Comment Review

  • Skepticism runs alongside sympathy: several commenters questioned Railway’s own abuse prevention, pointing to heavy spam from Railway IPs and a free tier that attracts crypto miners and bot operators, and raised the possibility that abusive customer workloads triggered GCP’s automated suspension.
  • GCP’s pattern of automated account actions without proactive human outreach drew comparisons to the 2024 UniSuper deletion incident and reinforced a persistent reputational risk that many commenters cite as a reason to avoid GCP despite its ergonomics.
  • Some longtime Railway users called this outage a breaking point, citing a broader pattern of operational immaturity rather than treating it purely as a GCP fault.

Notable Comments

  • @jkogara: Noticed that after the incident, Railway showed new terms and conditions at login that explicitly ban bots, torrents, and illegal workloads, suggesting that abuse may have triggered GCP’s automated restriction.
  • @BitWiseVibe: “The amount of spam from Railway IPs is insane” – flags systemic abuse prevention failure as a root contributor.
