Pattern
April 19, 2026
11 min read

PR-time simulation vs post-deploy lock_timeout: both, not either

lock_timeout is correct and necessary. It is not enough. A PR-time lock-graph simulator catches the same class of migration before the merge, without producing a failed-deploy signal.

#PostgreSQL · #Schema migrations · #lock_timeout · #Database reliability · #CI/CD

lock_timeout is a correct, necessary, and well-documented Postgres setting. GoCardless published the canonical write-up in 2016 after their production API went down for 15 seconds during an ALTER TABLE. Nikolay Samokhvalov's 2021 Postgres.ai post made it authoritative: "Zero-downtime Postgres schema migrations need this: lock_timeout and retries." Every serious team running Postgres should be using it. It is also not enough on its own. The class of migration that takes down production can be flagged before it merges, with a different check that reads production state at PR time. This post is about why those two layers complement each other, and what each one actually catches.

TL;DR. lock_timeout catches a migration mid-flight by aborting it after N seconds of waiting. It produces a visible failed deploy. PR-time lock-graph simulation catches the same class of migration before the merge button is pressed. The migration never reaches the deploy pipeline. Both layers are needed: the simulator is the first line for the predictable cases, lock_timeout is the safety net for the cases the simulator misses. Neither replaces the other.

What lock_timeout actually does

The Postgres setting is simple. SET lock_timeout = '3s'; before a DDL tells Postgres to abort the statement if it has been waiting to acquire a lock for more than 3 seconds. The abort is immediate and clean; the DDL fails with a "canceling statement due to lock timeout" error, no partial state is left behind, and any queries queued behind the DDL are released.

The canonical use, from the GoCardless post and the Samokhvalov follow-up, is:

SET lock_timeout = '3s';
SET statement_timeout = '30s';

ALTER TABLE sessions ADD COLUMN archived_at timestamptz;

If the ALTER TABLE cannot acquire ACCESS EXCLUSIVE within 3 seconds (because a reader is holding ACCESS SHARE, or because there is an unbounded queue in front of it as the earlier post on lock-queue fairness describes), Postgres aborts the DDL. The migration framework catches the error, marks the deploy failed, and the team retries later. No cascade, no 23-minute outage, no pool drain.

Tools like ActiveRecord SaferMigrations (GoCardless), Doctolib's safe-pg-migrations, and Xata's pgroll ship this pattern by default, typically with automatic retry and exponential backoff. A Postgres team in 2026 that is not setting lock_timeout in migrations is ignoring a decade of public operational wisdom.
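In SQL terms, what those retry-capable runners effectively execute per attempt looks something like the following sketch (the loop itself lives in application code; the SQLSTATE is the one Postgres raises for a lock_timeout abort):

```sql
-- Per-attempt shape of a retry-capable migration runner (sketch).
-- SET LOCAL scopes the timeouts to this transaction only.
BEGIN;
SET LOCAL lock_timeout = '3s';
SET LOCAL statement_timeout = '30s';
ALTER TABLE sessions ADD COLUMN archived_at timestamptz;
COMMIT;
-- On SQLSTATE 55P03 (lock_not_available): the runner rolls back,
-- sleeps with exponential backoff, and retries a bounded number of times.
```

SET LOCAL rather than SET matters here: if the migration aborts, the timeouts do not leak into whatever the connection runs next.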

What lock_timeout does not do

Three classes of migration that lock_timeout alone does not catch.

The migration that is going to take hours but acquires the lock quickly. lock_timeout limits how long the DDL waits for the lock. It does not limit how long the DDL runs once it has the lock. A CREATE INDEX without CONCURRENTLY on a billion-row table acquires SHARE immediately and holds it for 30 minutes while the build proceeds. lock_timeout = 3s does not save you. This is exactly the shape of Railway's October 28, 2025 incident, covered in the first post on this blog. The lock was acquired fast. The index build was slow.
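The fix for this class is one keyword, and it changes which lock is taken, not just how long it is held. A sketch, using the sessions table from the earlier example:

```sql
-- Blocks all writes for the full build: takes SHARE and holds it
-- until the index is complete. lock_timeout does not help here,
-- because the lock is acquired almost instantly.
CREATE INDEX idx_sessions_archived_at ON sessions (archived_at);

-- Builds without blocking writes: takes SHARE UPDATE EXCLUSIVE instead.
-- Cannot run inside a transaction block, so the migration framework
-- must disable its wrapping transaction for this statement.
CREATE INDEX CONCURRENTLY idx_sessions_archived_at ON sessions (archived_at);
```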

The migration that produces cascades under the timeout. A lock_timeout of 3 seconds bounds the wait, but a 3-second queue on a heavily trafficked table can still exhaust the connection pool as readers accumulate. The migration aborts, but the cascade already peaked. For a 30-engineer SaaS with a 50-connection pool and 200 requests per second, 3 seconds is enough queue to drain the pool.

The migration that produces a visible failed deploy. This is the subtler one. Even when lock_timeout works perfectly and the migration aborts before any damage, the failed deploy is itself an operational signal. The deploy pipeline shows red. The team has to investigate. The migration has to be retried, possibly under different conditions. The friction is real, and it compounds: teams learn that migrations sometimes fail and build cultural workarounds (deploying at 3am, splitting migrations pre-emptively) that leak complexity across the organization. The operational cost is smaller than an outage but not zero.

What PR-time simulation adds

Defense-layer timeline (diagram). Top lane: the migration's path. Developer writes migration, opens PR, merges, deploy pipeline runs, migration runs in prod, users affected. Layer 1, PR-time simulation: delivers a verdict at PR review time by reading pg_locks, pg_stat_activity, and table sizes live, and blocks the merge if the lock window is unsafe; unsafe migrations never reach deploy. Layer 2, post-deploy lock_timeout: a guardrail during the migration run; the migration aborts if the lock wait exceeds N seconds, which surfaces as a failed CI run or deploy that must be retried. Both layers are needed: the simulator catches the predictable case, lock_timeout catches what the simulator missed, and only the simulator prevents a user-visible failed deploy for migrations already known to be unsafe.

The two layers intercept at different points in the workflow.

A PR-time lock-graph simulator is a different layer. It reads production state before the merge, parses the proposed DDL, estimates the lock window against current traffic, and delivers a verdict as a PR comment. The mechanics, described in more detail in the earlier posts on the staging gap and the lock-queue fairness rule, involve four pieces of data: the proposed DDL, the current contents of pg_locks and pg_stat_activity, the target table's size and row count, and a cost model for the specific DDL operation.
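The second and third of those four inputs are ordinary catalog reads. A sketch of the kind of read-only queries such a check might run, using the sessions table from the earlier example as the migration target:

```sql
-- Target table's size and estimated row count (inputs to the cost model):
SELECT pg_total_relation_size('public.sessions') AS total_bytes,
       reltuples::bigint                         AS approx_rows
FROM pg_class
WHERE oid = 'public.sessions'::regclass;

-- Locks currently held or requested on that table, joined to the
-- sessions holding them:
SELECT l.pid, l.mode, l.granted, a.state, a.query_start
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'public.sessions'::regclass;
```

Both queries need only a read-only role; neither takes a lock that interferes with production traffic.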

The simulator catches three specific classes that lock_timeout does not.

The slow-build migration. A CREATE INDEX on a billion-row table. The simulator reads the row count, applies the I/O-rate cost model, estimates a 30-minute build, and blocks the merge with a suggestion to add CONCURRENTLY. lock_timeout cannot catch this. The simulator can.

The cascade-under-timeout migration. An ALTER TABLE that will queue behind a currently-running long reader. The simulator reads pg_stat_activity, sees the reader, estimates the queue time, and blocks the merge. lock_timeout = 3s might catch the cascade at run time, but the cascade already cost the pool. The simulator prevents the cascade from starting.
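Detecting that long reader is, again, a catalog read. A sketch of the query shape, with the 30-second threshold as an illustrative cutoff:

```sql
-- Statements that currently hold a lock on the target table and have
-- been running for a while; an ACCESS EXCLUSIVE request would queue
-- behind any of these, and readers would queue behind it in turn.
SELECT a.pid,
       now() - a.query_start AS runtime,
       l.mode,
       left(a.query, 80)     AS query_prefix
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE l.relation = 'public.sessions'::regclass
  AND l.granted
  AND now() - a.query_start > interval '30 seconds';
```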

The migration that would cause a failed deploy. Any migration the simulator would reject never reaches the deploy pipeline. The team fixes the migration in the PR, re-runs the check, and merges only when the verdict is clean. No failed-deploy signal reaches users, and no red deploy appears on the dashboard.

What PR-time simulation does not do

Honest limits. The simulator is probabilistic, and the cases it cannot cover are the cases where lock_timeout is still necessary.

Production state shifts between PR time and deploy time. The simulator reads pg_locks at the moment of the PR comment. The developer reads the comment an hour later, fixes the issue, and merges. By the time the deploy runs, the lock graph may have changed. A new long-running reader may have started. The simulator's estimate from an hour ago is stale. lock_timeout at deploy time is the safety net for exactly this shift.

Uncommon migrations the cost model does not cover well. The cost model works well for common DDL (index builds, column additions, constraint additions) because the lock mechanics are well-understood and the cost is largely a function of table size. For less common operations (partition splits, inheritance changes, custom extension DDL), the model is less accurate, and its confidence interval should reflect that. A simulator that overclaims accuracy on operations it does not model well is worse than one that defers to lock_timeout plus human review.

Workload spikes. A migration that would be safe on a Tuesday afternoon is unsafe at 11am on Black Friday. The simulator can be aware of the deploy window, but it does not know what is going to happen in the next minute of production traffic. lock_timeout covers the surprise.

The long-tail behavioral change. A migration that adds a column is fast. A migration that adds a column with a default value on Postgres 10 or earlier is a table rewrite. The simulator knows this; a careless developer may not. The check catches the common shape of this trap. It will not catch every version-specific behavioral quirk Postgres has accumulated over a decade. lock_timeout is the guard for the unknown.
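The version-specific trap and its safe split look like this (the plan column is an illustrative example, not from the posts cited above):

```sql
-- Unsafe on Postgres 10 and earlier: rewrites the whole table under
-- ACCESS EXCLUSIVE. Fast on 11+ because the default is stored in the
-- catalog instead of written to every row.
ALTER TABLE sessions ADD COLUMN plan text DEFAULT 'free';

-- Version-safe split: each step holds ACCESS EXCLUSIVE only briefly.
ALTER TABLE sessions ADD COLUMN plan text;
ALTER TABLE sessions ALTER COLUMN plan SET DEFAULT 'free';
-- Then backfill existing rows in small batches, outside the
-- migration transaction.
```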

The two-layer picture

The two layers catch different failure modes and stack cleanly. The simulator is deterministic-ish and loud: it speaks up at PR time, names a risk, and suggests a fix. lock_timeout is automatic and silent until it fires. The simulator moves the decision left, to the point where fixing is cheap. lock_timeout is the last line of defense when the decision cannot be moved left in time.

PR-time simulation

When: before merge
Reads: live production state
Catches: slow builds, cascades, known patterns
Misses: state shift between PR and deploy
Output: PR comment, block or pass
Cost of a miss: pushes the check to layer 2

Post-deploy lock_timeout

When: during migration run
Reads: live lock queue
Catches: any prolonged wait
Misses: slow DDL that has the lock
Output: aborted DDL, failed deploy
Cost of a miss: cascade to pool, outage

Configuring both

The recommended stack for a Postgres team that has both layers running.

Migration runner. Use a framework that sets lock_timeout and statement_timeout per-migration automatically, and that retries with backoff when the timeout fires. GoCardless's ActiveRecord SaferMigrations, Doctolib's safe-pg-migrations, and Xata's pgroll all ship this by default. Samokhvalov's post is the standard reference for tuning the retry behavior.

PR-time check. A check in CI or in the PR-review surface that reads live production state via a read-only Postgres role, parses the proposed DDL, and delivers a verdict before merge. Datapace is one implementation. A team can build one in-house; the non-obvious piece is the cost model, which requires calibration against the team's own production instance rather than TPC benchmarks.
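The simplest version of that cost model is a single division: table bytes over an assumed build rate. A sketch, with the 200 MB/s figure as a placeholder that should be replaced by a rate measured on your own instance:

```sql
-- Crude build-time estimate for an index on sessions: total relation
-- size divided by an assumed sequential I/O rate of 200 MB/s.
-- The rate is illustrative; calibrate it against your own production
-- hardware, not a TPC benchmark.
SELECT pg_total_relation_size('public.sessions')     AS total_bytes,
       pg_total_relation_size('public.sessions')
         / (200.0 * 1024 * 1024)                     AS est_build_seconds;
```

A real model layers on index type, maintenance_work_mem, and concurrent write load, but even this crude estimate separates a 2-second build from a 30-minute one, which is the distinction that matters at PR time.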

Observability layer. pganalyze, Datadog DBM, or equivalent, watching for post-deploy regressions the first two layers missed. These are the diagnostic signal for when both earlier layers were insufficient and a problem still reached production. Covered in detail in the earlier post on pg_stat_statements and auto_explain.

Three layers, each catching what the previous one missed. The cost is low because each layer is cheap to run. The benefit is that the migration that would have caused the outage gets filtered at the first layer that is capable of seeing it, not the last.

Closing note

The case for PR-time simulation is not a case against lock_timeout. lock_timeout is correct, necessary, and unconditional. Samokhvalov's post is required reading for anyone shipping Postgres migrations. What the simulator adds is the layer above: catching the migration before it reaches the point where lock_timeout has to decide whether to abort it. A team that runs both is not double-protected against the same failure; it is protected against two different failure modes that each get through the other layer.

If you want this kind of PR-time verdict in front of your existing lock_timeout discipline, that is what we are building at Datapace.

Frequently asked questions

If lock_timeout catches the cascade, does the simulator matter?

It catches it at a different cost. lock_timeout aborts the migration after N seconds of waiting, which is enough time for connection-pool saturation to begin. The simulator prevents the wait from starting. The cascade that lock_timeout aborts has still consumed connections. The cascade the simulator prevents has consumed nothing. For teams with a fragile pool configuration, this difference is operationally meaningful.

What value of lock_timeout should we use?

The Samokhvalov post and the GoCardless posts both recommend "the shortest pause your application can tolerate," typically 1 to 3 seconds. Longer timeouts catch more cascades before they escalate but also raise the false-positive rate (legitimate migrations aborted because the reader happened to be slow today). 3 seconds is a common default. Tuning down is a judgment call about your pool's sensitivity.
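Rather than relying on every migration file to remember the SET, the timeout can be made the default for the role that runs migrations. A sketch; the role name is illustrative:

```sql
-- Per-role defaults: every session the migration role opens gets these
-- without any per-migration boilerplate. Individual migrations can
-- still override with SET LOCAL when a specific step needs it.
ALTER ROLE migrator SET lock_timeout = '3s';
ALTER ROLE migrator SET statement_timeout = '30s';
```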

Can the simulator replace lock_timeout entirely if it is accurate enough?

No. Production state shifts between PR time and deploy time, and the simulator's estimate has a real error bar. A migration that the simulator judged safe can encounter a long-running reader that started after the verdict but before the deploy. lock_timeout is the cheap, deterministic guard for that case. Nothing replaces it.

Does pgroll already handle this?

Partly. pgroll applies DDL through versioned views and uses short lock_timeout with retries on the phases that take locks. It is a migration runner with lock_timeout built in. It is not a PR-time simulator; it does not read production state before the merge and it does not block the merge based on a predicted cost. The two are complementary. pgroll + PR-time check is a stronger stack than either alone. Covered in detail in the earlier post on online DDL tools.

Is this argument specific to schema migrations?

The same shape applies to any database change whose cost is a function of production state: large backfills, data-migration jobs, index builds outside the migration system, extension upgrades. The specifics of what the simulator reads differ per class; the two-layer argument is the same.

Sources

  1. N. Samokhvalov, "Zero-downtime Postgres schema migrations need this: lock_timeout and retries", Postgres.ai, September 2021.
  2. GoCardless Engineering, "Zero-downtime Postgres migrations: a little help" and "Zero-downtime Postgres migrations: the hard parts", 2016.
  3. Xata, pgroll GitHub repository.
  4. Doctolib, safe-pg-migrations.
  5. Railway, Incident report, October 28, 2025.
  6. Datapace blog, "ACCESS SHARE does not jump the queue: Postgres lock fairness".
  7. Datapace blog, "The 8,400x staging gap: why staging lies about migration safety".
  8. Datapace blog, "pgroll, pg-osc, pg_karnak, gh-ost: an online DDL tool face-off".
