Tech

Railway Outage Exposes Single-Point Dependency on Google Cloud

The cloud infrastructure provider acknowledges architectural flaws following a service interruption triggered by a Google Cloud Platform account error and secondary GitHub rate-limiting.

Author
Owen Mercer
Markets and Finance Editor
Source: Hacker News · original
Platform-wide disruption lasted eight hours after automated suspension cascaded across AWS and Metal environments

Railway experienced a platform-wide service disruption on May 19, 2026, following an incorrect automated suspension of its production account by Google Cloud Platform. The incident, which lasted approximately eight hours, resulted in the temporary loss of service for all infrastructure hosted on Google Cloud, including the dashboard, API, and core network control plane.

The outage began at 22:20 UTC when Google Cloud placed Railway’s account into a suspended status. While workloads on Railway’s own Metal and AWS environments remained technically online, the platform’s edge proxies relied on a Google Cloud-hosted control plane API to populate their routing tables. As cached routes expired, the proxies could no longer resolve routes to active instances, and the outage cascaded beyond Google Cloud. By the time the caches had cleared, all Railway workloads in all regions were unreachable, with requests returning 404 errors.
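The failure mode described above, where cached routes expire while the control plane is unreachable, can be illustrated with a small sketch. This is not Railway's code: the `RouteCache` class, the `fetch` callback, and the TTL value are all hypothetical, and the sketch only shows one common mitigation, which is to keep serving the last known route when a refresh fails rather than dropping the entry.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RouteCache:
    """Hypothetical edge-proxy route cache (illustrative only).

    Entries older than `ttl` seconds trigger a refresh from the control
    plane. If that refresh fails, the last known route is served instead
    of being discarded.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # assumed control-plane lookup: host -> upstream
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> Optional[str]:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # entry is still fresh
        try:
            target = self._fetch(host)
        except ConnectionError:
            # Control plane unreachable: fall back to the stale route if one
            # exists. A host never seen before cannot be resolved at all.
            return cached[0] if cached is not None else None
        self._entries[host] = (target, now)
        return target
```

A cache that instead evicted expired entries would return `None` for every host once the TTL elapsed, which matches the pattern the article describes: workloads still running but unreachable. Serving stale routes has its own cost, since traffic may be sent to instances that have since moved.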

Recovery was complicated by secondary effects, including GitHub rate-limiting Railway’s OAuth and webhook integrations due to a surge in retry requests. This temporarily blocked user logins and builds. Additionally, terms-of-service acceptance records were reset, requiring users to re-accept the terms on their next dashboard visit. Persistent disks were restored to a ready state by 23:54 UTC, but core networking and edge routing were not fully restored until approximately 01:30 UTC on May 20.
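Retry surges of the kind that triggered the GitHub rate limit are commonly dampened with exponential backoff and random jitter. The sketch below shows that general technique only; the article does not say what retry policy Railway used or has since adopted, and the function name and default values are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized delay in seconds before retry number `attempt`.

    Uses "full jitter": the delay is drawn uniformly from zero up to an
    exponentially growing ceiling, capped at `cap`. The defaults are
    illustrative, not taken from any real service's policy.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Spreading retries across a random window keeps many clients from retrying at the same instant after a shared failure, which is what tends to trip an upstream rate limiter.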

Railway acknowledged that its network architecture contained a single-point dependency on the GCP-hosted control plane. Although the underlying mesh ring of high-availability fibre interconnects between Metal, GCP, and AWS remained operational, the edge proxies could not re-populate their routing tables once the caches expired, so traffic could not be routed over those links. The company stated that it takes full responsibility for the architectural decisions that allowed an upstream provider action to cascade into a platform-wide outage.

In response, Railway has outlined plans to implement a "true mesh" network architecture. This involves removing the dependency on the GCP-hosted control plane, extending high-availability database shards across AWS and Metal, and removing Google Cloud services from the data plane’s hot path. The company aims to ensure that core services do not depend on any single vendor. Full restoration of services was confirmed at approximately 04:00 UTC on May 20.
