One CI check would have caught both of Railway's billion-row Postgres migration outages
Railway had two Postgres migration outages in six weeks: October 28 and December 8, 2025. Same cascade, different trigger. A close reading of both post-mortems, and of the single pre-merge check that would have blocked each migration.
On October 28, 2025, a schema change took down Railway's API for roughly 52 minutes. Six weeks later, on December 8, a second migration cascaded the same way through the same infrastructure, after the team had already shipped the fix they believed addressed the October incident. Both incidents are public. Both were documented in detail by Railway. What follows is a close reading of the two post-mortems, the lock mechanic that connected them, and the single check that would have flagged either migration at pull-request time.
October 28: the lock that locked out the operators
Railway's internal monitoring fired two minutes after the change went live. The on-call team got paged, declared an incident, and then lost their ability to intervene.
The change added a column and created a non-CONCURRENTLY index on a Postgres table containing roughly one billion rows. That table is touched by nearly every API operation Railway exposes. CREATE INDEX without CONCURRENTLY takes a SHARE lock that blocks writes for the duration of the build, and the build took about thirty minutes because the table has about a billion rows.
What made it an outage rather than a slow window was the cascade downstream of the lock. API servers kept accepting incoming requests. Each request that touched the locked table blocked, waiting. Each blocked request held a Postgres connection open while it waited. PgBouncer's connection pool drained. Railway had configured reserved_connections to hold a handful of slots back for roles with the pg_use_reserved_connections privilege, but those were consumed too, because the reserved slots drain from the same connection ceiling once the waiting stack grows deep enough. At that point, an operator who opened a terminal to cancel the statement and end the outage was met with this:
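The drain mechanic is easy to model. A toy sketch of the cascade, where every blocked request keeps its connection slot until the lock releases; all numbers here are illustrative, not Railway's actual configuration:

```python
# Toy model of the October cascade: blocked queries hold their connections,
# so the pool, including the reserved slots, exhausts while the lock is held.
# pool_size, reserved, and arrival rate are invented numbers for illustration.

def drain_pool(pool_size: int, reserved: int, arrivals_per_min: int,
               lock_held_min: int) -> dict:
    """Report when the app pool and the reserved slots run out."""
    held = 0
    app_exhausted_at = None    # minute the non-reserved slots are gone
    fully_exhausted_at = None  # minute even the reserved slots are gone
    for minute in range(1, lock_held_min + 1):
        held += arrivals_per_min  # every blocked request keeps its slot
        if held >= pool_size - reserved and app_exhausted_at is None:
            app_exhausted_at = minute
        if held >= pool_size and fully_exhausted_at is None:
            fully_exhausted_at = minute
    return {"app_exhausted_at_min": app_exhausted_at,
            "fully_exhausted_at_min": fully_exhausted_at}

print(drain_pool(pool_size=100, reserved=5, arrivals_per_min=32,
                 lock_held_min=30))
# → {'app_exhausted_at_min': 3, 'fully_exhausted_at_min': 4}
```

The point of the model is the shape, not the numbers: with a thirty-minute lock and any steady arrival rate, the reserved slots buy minutes at best.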
FATAL: remaining connection slots are reserved for
roles with privileges of the "pg_use_reserved_connections" role
They could not get in. There was no administrative path back into the database from outside the saturated pool, and no way for the pool to drain while the lock was still held. The index finished on its own after about thirty minutes. The outage ended when the lock released, not when anyone intervened.
October's post-mortem fixes
Railway published a thorough post-mortem the following day. Three preventative measures were announced. First, CI enforcement of CONCURRENTLY on every index creation, implemented as a lint rule on migration files in the repository. Second, tighter PgBouncer connection pool limits to slow the drain the next time a cascade started. Third, a dedicated admin-only Postgres user with connection limits configured to sit above the application's saturation ceiling, so that even at full pool exhaustion an operator could still get in.
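The CI rule is the kind of check that fits in a few lines. A sketch of what such a lint rule might look like; the pattern details are assumptions, since Railway's actual implementation is not public:

```python
import re

# Sketch of the lint rule October's post-mortem describes: reject any
# migration file that builds an index without CONCURRENTLY. The exact
# pattern is an assumption; Railway's real rule is not published.
NON_CONCURRENT_INDEX = re.compile(
    r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)",
    re.IGNORECASE,
)

def lint_migration(sql: str) -> list[str]:
    """Return one violation message per non-concurrent index build."""
    return [
        f"non-concurrent index build: {m.group(0)!r}"
        for m in NON_CONCURRENT_INDEX.finditer(sql)
    ]
```

Note what this rule sees and what it cannot: it flags the October migration, and it waves the December migration straight through, because ADD COLUMN contains no index build at all.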
All three are correct fixes for the symptoms observed in October. All three are narrow. Only one of them, the CI rule, was a change to the system that merges code, and it targeted exactly the syntactic shape that caused the October migration to fail.
December 8: same cascade, different door
Six weeks later, the team merged a migration at 15:03 UTC. Applying began at 15:17. By 15:18, read replicas were failing their health checks and dropping out of rotation. At 15:26 the migration completed and replicas recovered. Twenty-three minutes of degraded service, one nullable column added.
The change was ADD COLUMN on a nullable field with no default and no index. Adding a nullable column with no default is a metadata-only operation in every modern Postgres version; since Postgres 11, even a column with a constant default avoids the table rewrite. No full scan, effectively instant. The DDL itself was not the problem.
The problem was that a long-running query was already holding an ACCESS SHARE lock on the same table when the migration arrived. Postgres received the migration's ACCESS EXCLUSIVE request and queued it behind the held ACCESS SHARE. That part is unavoidable. What happens next is the part that still surprises engineers who have not been through it: every subsequent lock request on that table, including trivial reads asking for ACCESS SHARE, queues behind the waiting ACCESS EXCLUSIVE. Postgres does not allow the new readers to jump ahead, even though ACCESS SHARE is compatible with ACCESS SHARE. Letting them jump would starve the waiting writer. So they wait. And while they wait they hold connections. PgBouncer saturates. Replica health checks are among the queries that queue, so replicas fall out of rotation.
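The queueing rule is simple enough to state as code: a new request must wait if it conflicts with anything already granted or anything already waiting. A minimal model, reduced to the two lock modes in play here; the real lock manager has eight modes, deadlock detection, and more:

```python
# Minimal model of Postgres's per-table lock queue. A new request waits if
# it conflicts with anything already GRANTED or anything already WAITING.
# That second clause is why ACCESS SHARE readers stack up behind a waiting
# ACCESS EXCLUSIVE even though readers are mutually compatible.
# Only two of Postgres's eight lock modes are modeled.

CONFLICTS = {
    ("ACCESS SHARE", "ACCESS EXCLUSIVE"),
    ("ACCESS EXCLUSIVE", "ACCESS SHARE"),
    ("ACCESS EXCLUSIVE", "ACCESS EXCLUSIVE"),
}

def request(granted: list[str], waiting: list[str], mode: str) -> str:
    """Grant the lock only if it conflicts with nothing granted or queued."""
    for held in granted + waiting:
        if (mode, held) in CONFLICTS:
            waiting.append(mode)
            return "waits"
    granted.append(mode)
    return "granted"

granted, waiting = ["ACCESS SHARE"], []        # the long-running reader
request(granted, waiting, "ACCESS EXCLUSIVE")  # the migration: waits
request(granted, waiting, "ACCESS SHARE")      # a trivial read: also waits
```

Remove the `waiting` half of the conflict scan and the second reader is granted immediately; that is exactly the starvation Postgres refuses to allow.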
That was the state of the lock queue on December 8, shortly after the migration began applying. The DDL is not the slow thing; the held reader is. But the cost is paid by every query that arrives after the DDL and joins the back of a queue it did not cause.
When the long-running reader finally finished, the migration applied in a fraction of a second, and the cascade unwound in the order it had formed.
Why the October fix did not catch December
The October lint rule scanned migration files for a regex pattern: CREATE INDEX without CONCURRENTLY. The December migration did not create an index. It did not need CONCURRENTLY. It would have passed every syntactic rule a team could reasonably write about schema changes without also rejecting half of legitimate work.
What October and December share is not a DDL shape. It is a lock interaction. The October DDL needed an exclusive lock that Postgres could not release quickly because an index build was in progress. The December DDL needed an exclusive lock that Postgres could not grant because a reader was holding an incompatible share. Side by side, the two incidents line up on almost everything that matters:
| | Incident 1: October 28, 2025 | Incident 2: December 8, 2025 |
|---|---|---|
| DDL | CREATE INDEX (non-CONCURRENT) | ADD COLUMN (nullable, no index) |
| Lock requested | SHARE, held for ~30 min | ACCESS EXCLUSIVE, queued behind ACCESS SHARE |
| Waiting exclusive lock in the queue | yes | yes |
| Later requests queue behind it | yes | yes |
| Connections pile up, pool saturates | yes | yes |
| Operational access lost | admin connections | replica health checks |

The rows that differ are in the top half. The rows that matter for downstream impact are in the bottom half, and they match. From Postgres's perspective, the two situations are the same failure: a waiting ACCESS EXCLUSIVE in a queue behind an existing lock, and every subsequent request joining the back of the queue. From the connection pool's perspective, identical: connections accumulating on waits, slots exhausting, admin access or replica health disappearing. Only the arrival path differs.
The symmetry is cleaner on paper than it was in the war room. Real incidents rarely reduce this neatly. Railway's own post-mortems credit several contributing factors that do not fit in either column of the table: the deploy timing landed during peak traffic, monitoring surfaced the pool saturation before it surfaced the underlying lock, and the team's rollback tooling assumed it could connect to the database to read state. None of that changes the argument that follows, which is about the mechanism shared between the two. It does mean that "one CI check would have caught both" is a claim about the highest-risk link in the chain, not a claim that the chain had only one link.
The only way a static lint rule catches both is if it rejects every migration that requires ACCESS EXCLUSIVE. Which is virtually every schema change worth shipping.
This is not a rare edge case
Any team running continuous migrations against a hot table will hit a version of it. The arithmetic is not subtle. More migrations per week, more long-running queries in flight at any moment, more probability that a DDL arrives while a reader is holding. Railway is a public example because Railway writes careful post-mortems. The same pattern is running daily, unlabeled, inside dozens of other SaaS companies that will not publish the incident report because nobody paged outside the engineering org or because the duration fell under the SLA threshold that triggers external communication.
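The arithmetic can be made concrete. If long-running queries cover some fraction of the clock on a hot table, each migration is roughly an independent draw against that fraction; the numbers below are illustrative, not measurements:

```python
# Back-of-envelope for "the arithmetic is not subtle": treat each migration
# as an independent draw against the fraction of time a long-running query
# holds a lock on the target table. All inputs are illustrative.

def p_any_collision(busy_fraction: float, migrations: int) -> float:
    """P(at least one migration lands while a long reader holds a lock)."""
    return 1 - (1 - busy_fraction) ** migrations

# 5% of the clock covered by long readers, 10 migrations/week, one year:
print(round(p_any_collision(0.05, 10 * 52), 4))  # → 1.0 (to 4 decimals)
```

Under these assumed inputs the yearly collision probability is indistinguishable from certainty. The model ignores correlation with deploy timing, which in practice makes things worse, not better, since deploys and heavy queries both cluster in business hours.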
The reason this matters more in 2026 than it did when the lock mechanic was first written up at length is not that Postgres changed. It did not. What changed is the population on either side of the merge button. More teams ship more migrations per week. Fewer of those teams have a DBA in the review loop. And a growing share of schema changes is written, or at least drafted, by coding agents that do not know which tables are a billion rows and which are ten thousand, do not read pg_stat_activity, and have no model of the lock graph. The population of migrations that could trigger a cascade has grown. The population of humans positioned to notice before merge has shrunk.
The check that would have blocked both
What is realistic, and what would have flagged both, is a check that treats the migration as a plan rather than as text. At pull-request time, before merge, it parses the DDL, identifies the lock level the statement requires against the live schema, reads the current state of pg_locks and pg_stat_activity on production, and estimates how long the exclusive lock will actually be held given table size and current read pressure. It blocks the merge when that estimate exceeds the team's deploy window, or when long-running queries already hold conflicting locks on the target table.
Applied to October: the check reads that the target table has ~1B rows. It reads that CREATE INDEX without CONCURRENTLY takes SHARE and will hold it for the full duration of the build. It estimates that duration at 20 to 40 minutes at observed I/O rates on that instance. Verdict: block the merge. Add CONCURRENTLY, or schedule the build for a maintenance window with writes paused.
Applied to December: the check reads that the DDL needs ACCESS EXCLUSIVE on a table where a reader has been holding ACCESS SHARE for several minutes and is still active. Estimated queue time: indefinite, depending on when the reader finishes. Verdict: block the merge. Wait for the reader to complete, or set an aggressive lock_timeout on the migration so that if the cascade begins, the migration aborts in seconds instead of minutes.
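Both verdicts reduce to the same decision function. A minimal sketch under stated assumptions: the inputs would come from parsing the DDL and sampling pg_locks and pg_stat_activity, but here they are plain values, and every threshold is invented for illustration:

```python
from dataclasses import dataclass

# Sketch of the pre-merge verdict described above. In a real check the
# fields would be derived from the live schema and lock state; here they
# are hand-filled, and the deploy-window threshold is an assumption.

@dataclass
class MigrationPlan:
    lock_mode: str            # lock the DDL requires, e.g. "ACCESS EXCLUSIVE"
    est_hold_seconds: float   # how long the lock is held once granted
    conflicting_holders: int  # live queries holding incompatible locks now

def verdict(plan: MigrationPlan, deploy_window_seconds: float) -> str:
    if plan.conflicting_holders > 0:
        return "BLOCK: conflicting lock held; queue time unbounded"
    if plan.est_hold_seconds > deploy_window_seconds:
        return "BLOCK: estimated lock hold exceeds deploy window"
    return "PASS"

# October: no conflict at PR time, but a ~30 min SHARE lock on the build.
october = MigrationPlan("SHARE", est_hold_seconds=30 * 60,
                        conflicting_holders=0)
# December: instant DDL, but a reader already holds ACCESS SHARE.
december = MigrationPlan("ACCESS EXCLUSIVE", est_hold_seconds=0.1,
                         conflicting_holders=1)
```

The two incidents trip different branches of the same function, which is the whole argument: one check, two failure paths.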
A rendering of what that verdict might look like at PR time, delivered as a comment on the change rather than a dashboard:
# datapace-agent · schema-safety
BLOCK · PR #2843 (migrations/0147_add_archived_at.sql)
DDL ALTER TABLE sessions ADD COLUMN archived_at timestamptz
Lock level ACCESS EXCLUSIVE on sessions (1.04B rows)
Observed 1 reader holding ACCESS SHARE for 00:04:12 (pid 48213)
ETA indefinite, depends on reader completion
Window deploy closes in 2m
Verdict merge blocked · set lock_timeout=3s
or wait for pid 48213 to complete
Fix apply suggestion: wrap DDL with
SET lock_timeout = '3s'; SET statement_timeout = '30s';
Neither verdict is a rule about syntax. Both are statements about the interaction between the proposed statement, the live schema, and the live lock state. That interaction is what determined whether Railway had an outage on each of those days, and it is what a pre-merge check would have had to see to prevent either.
The check is probabilistic, not deterministic
Production state shifts between PR time and deploy time. Long-running queries are stochastic. A simulation that reports "estimated queue 4 minutes" against the readers observed at 14:00 may be irrelevant by 14:15 when the merge button is clicked. The check will raise false positives on migrations that would have applied cleanly, and it will occasionally miss a cascade that formed between the verdict and the deploy.
What the check does well is rank. It consistently flags the highest-risk migrations at the earliest point in the workflow where a human can still reconsider, which is the point where the cost of reconsideration is close to zero. It does not replace lock_timeout on the DDL itself, it does not replace a tight deploy window, and it does not replace a rollback plan. It sits in front of those and filters down the volume they have to handle.
Complementary guardrails
The mitigations that would have shortened October and December, once the cascade started, already exist and are well understood. lock_timeout set to a few seconds on the DDL, so that if it finds itself at the back of a long wait it aborts itself instead of holding the queue open. statement_timeout as a backstop on application queries, so that a single slow reader cannot hold ACCESS SHARE for minutes. A deployment window that refuses to merge DDL when traffic on the target table is above a threshold. A Postgres admin role with reserved connections configured outside the pool the application uses. These are complements to a pre-merge check, not substitutes for it. The check decides whether the DDL should merge at all. The guardrails decide how bad it can get if it does. Neither layer replaces the other. The check without the guardrails still lets marginal migrations deploy without seatbelts. The guardrails without the check let every migration into production and rely on lock_timeout to bail out the bad ones at the cost of visible failed deploys.
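The guardrail pattern on the DDL itself is small. A sketch, reusing the table and column names from the rendered verdict above; the timeout values are illustrative, not Railway's:

```sql
-- Guardrails on the migration itself: if the DDL lands at the back of a
-- long lock queue, it aborts in seconds instead of holding the queue open
-- for minutes. Timeout values are illustrative.
BEGIN;
SET LOCAL lock_timeout = '3s';        -- give up fast if the lock is contended
SET LOCAL statement_timeout = '30s';  -- backstop on the statement as a whole
ALTER TABLE sessions ADD COLUMN archived_at timestamptz;
COMMIT;
```

SET LOCAL scopes both timeouts to the transaction, so application sessions keep their own settings. The failure mode this buys is a visible, retryable failed deploy instead of a silent queue behind the DDL.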
Closing note
The reason the October and December incidents share one article, rather than one each, is that they are the same incident with different triggers. The first one taught Railway something specific. The second one taught them that what they learned was narrower than they thought. Both would have been flagged by the same check, because the check that flags either has to be about the lock graph at merge time, not the text of the SQL. Neither check nor guardrail is perfect on its own. Together, they shift the expected cost of the next migration from "we will see" to "we have seen, and the worst case is bounded."
Sources
- Railway, Incident report, October 28, 2025
- Railway, Incident report, December 8, 2025