Essay
April 19, 2026
12 min read

Spider 2.0, BIRD-Interact, and AI-generated SQL in production

Frontier LLMs hit 10 to 24 percent on realistic enterprise SQL benchmarks. Coding agents write 4 percent of public GitHub commits. The gap between the two is the new systemic risk.

#PostgreSQL · #Schema migrations · #AI coding · #LLM benchmarks · #Database reliability

On Spider 1.0, a 2018 text-to-SQL benchmark, GPT-4o hits 86.6 percent. On Spider 2.0, the 2024 benchmark designed to reflect real enterprise workflows, the same GPT-4o hits 10.1 percent. On BIRD-Interact's full 600-problem set released in August 2025, the best frontier LLM measured hits 16.33 percent. In the same window that those benchmarks documented the drop, coding agents wrote roughly one in twenty-five public GitHub commits. Some fraction of those commits are Postgres migrations. The gap between benchmark behavior and production deployment is larger than it has ever been, and it is still widening.

TL;DR. Three independent data points converge: frontier LLMs hit 10 to 24 percent on realistic enterprise SQL benchmarks, paraphrasing drops accuracy by 10 to 20 more points, and coding agents are being deployed to write migrations at the same time. The response is not "stop using AI." The response is to run the AI-generated DDL through a lock-graph simulator at PR time, so the migration that would fail silently in production gets flagged before merge.

Three datapoints, one trajectory

The three benchmarks that matter are Spider 2.0 (ICLR 2025 Oral), BIRD-Interact (ICLR 2026, released in two stages in June and August 2025), and the EMNLP Findings 2025 SQL2NL paraphrase-robustness paper. Each measures a different failure mode. Together they describe a single trajectory: the closer a benchmark gets to real production SQL work, the more frontier LLM accuracy collapses.

Figure: bar chart of frontier LLM success rates. Spider 1.0 single-shot with GPT-4o: 86.6 percent. BIRD single-shot with o1-preview agent: 73 percent. Spider 2.0 enterprise GPT-4o: 10.1 percent. Spider 2.0 o1-preview agent: 21.3 percent. BIRD-Interact full 600-set best LLM: 16.3 percent. BIRD-Interact a-mode Claude 3.7 Sonnet: 17.8 percent. A vertical dashed line separates the 2018 and 2023 benchmarks from the 2025 and 2026 benchmarks.

As benchmarks approach real enterprise SQL, frontier LLM accuracy drops by 60 to 70 percentage points.

Spider 2.0: the enterprise-workflow collapse

Spider 2.0, from the xlang-ai group and published at ICLR 2025, comprises 632 problems drawn from real enterprise database use cases. Its databases carry over 1,000 columns and live in BigQuery or Snowflake. The paper reports GPT-4o at 10.1 percent, o1-preview at 17.1 percent, and a code-agent framework built on o1-preview at 21.3 percent. The comparison numbers the same paper reports for Spider 1.0, on the same models, are 86.6 percent (GPT-4o) and 91.2 percent (code-agent). BIRD, an earlier realistic benchmark from 2023, sat at 73 percent. Moving from Spider 1.0 to Spider 2.0 is a drop of roughly 70 percentage points on the same model. The architecture did not change. The benchmark got harder in a way that maps to real work.

BIRD-Interact: the multi-turn collapse

BIRD-Interact, from bird-bench, reframes text-to-SQL as an interactive task. The model works through conversational (c-Interact) or agentic (a-Interact) sessions rather than single-shot generation. For the June 2025 release, the paper reports o3-mini at 24.4 percent success rate on c-Interact and Claude 3.7 Sonnet at 17.78 percent on a-Interact. When the full 600-problem test set dropped in August 2025, the best model measured came in at 16.33 percent, with c-Interact and a-Interact splits sitting around 10 percent each. The paper identifies "Interaction-Time Scaling": giving the model more turns helps. It does not get the model to the accuracy numbers the marketing of frontier LLMs would suggest.

SQL2NL: the paraphrase collapse

Safarzadeh, Oroojlooy, and Roth's EMNLP Findings 2025 paper tests what happens when you paraphrase a prompt without changing its meaning. They use a schema-aligned SQL-to-NL pipeline to generate semantically equivalent prompts that are lexically different, then measure accuracy on the same underlying query. LLaMa 3.3 70B drops from 77.11 percent to 66.9 percent on paraphrased Spider queries, a drop of roughly ten points. LLaMa 3.1 8B drops from 62.9 percent to 42.5 percent, a 20.4-point drop. GPT-4o mini is "disproportionately affected." The interpretation the paper draws is blunt: state-of-the-art models are far more brittle than standard benchmarks suggest. The same prompt, rephrased by the product manager rather than the data analyst, shifts the accuracy by double digits.

Meanwhile, agents are writing the migrations

The second datapoint is independently tracked. Claude Code launched in February 2025 and reached general availability in May 2025. By early 2026, Anthropic's telemetry showed Claude Code authoring roughly 4 percent of all public GitHub commits. The JetBrains 2026 Developer Survey puts workplace adoption of GitHub Copilot at 29 percent and of Cursor and Claude Code at 18 percent each. These are aggregate numbers across all kinds of code. There is no published breakdown of what share of those commits are database migration files, but the base rates suggest the absolute count is not small: a 15-to-60-engineer SaaS team shipping one or two migrations a week, with half its engineers using a coding agent some fraction of the time, is generating a steady stream of AI-drafted DDL.

The coding agent, unlike the text-to-SQL benchmark model, has access to the repo. It can read prior migrations, read the ORM definitions, read the test suite. This raises its floor. It does not raise its ceiling past the benchmark numbers, because the part of the task the agent cannot do, and the text-to-SQL model cannot do, is read the production lock graph and estimate how long ACCESS EXCLUSIVE is going to be held. That information is not in the repo.

What the benchmarks measure, and what they miss

The three benchmarks above measure different slices of the SQL-generation problem. None of them measures the one slice that determines whether a migration takes down production.

What the benchmarks test

Input: schema + NL question
Output: a SQL query or DDL
Success criterion: execution matches reference
Environment: offline test database
Signal: does the query return the right rows

What production requires

Input: schema + live prod state + workload
Output: a migration that deploys safely
Success criterion: no lock cascade, no timeout, no data loss
Environment: live production under load
Signal: does the lock release inside the deploy window

A benchmark success means the generated SQL produces the correct result against a reference database that has no concurrent readers, no write traffic, no long-running analytical queries holding locks, and no billion-row production tables. A production success means the generated DDL acquires ACCESS EXCLUSIVE in a window that does not cascade into a pool drain, validates any new constraint without rewriting the heap, and leaves replicas healthy on the other side.
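That distinction between a brief metadata-only change and a lock held across a scan or heap rewrite can be sketched as a small classifier. This is an illustrative sketch, not a parser: it pattern-matches a few common DDL shapes against the lock levels documented for Postgres, and a real check would use a proper SQL parser plus the target schema. Every rule and name here is an assumption of the sketch.

```python
import re

# Illustrative rules: map common Postgres DDL shapes to (lock level,
# hold duration, blocks normal reads/writes). Rules follow the documented
# Postgres lock levels for the common cases; edge cases (FK constraints,
# binary-coercible type changes) need a real parser and the schema.
RULES = [
    # CREATE INDEX CONCURRENTLY: SHARE UPDATE EXCLUSIVE; reads and writes continue
    (r"create\s+index\s+concurrently", ("SHARE UPDATE EXCLUSIVE", "long", False)),
    # Plain CREATE INDEX: SHARE lock; writes blocked for the whole build
    (r"create\s+index", ("SHARE", "long", True)),
    # VALIDATE CONSTRAINT: full scan, but only SHARE UPDATE EXCLUSIVE
    (r"validate\s+constraint", ("SHARE UPDATE EXCLUSIVE", "long", False)),
    # ADD CONSTRAINT ... NOT VALID: ACCESS EXCLUSIVE, but no scan, so brief
    (r"add\s+constraint[\s\S]+not\s+valid", ("ACCESS EXCLUSIVE", "brief", True)),
    # ALTER COLUMN ... TYPE: ACCESS EXCLUSIVE held across a heap rewrite
    (r"alter\s+column\s+\S+\s+(set\s+data\s+)?type", ("ACCESS EXCLUSIVE", "long", True)),
    # SET NOT NULL: ACCESS EXCLUSIVE held across a full-table scan
    (r"set\s+not\s+null", ("ACCESS EXCLUSIVE", "long", True)),
    # Nullable ADD COLUMN: ACCESS EXCLUSIVE but metadata-only, so brief
    (r"add\s+column", ("ACCESS EXCLUSIVE", "brief", True)),
]

def classify(ddl: str) -> tuple[str, str, bool]:
    """Return (lock level, hold duration, blocks reads/writes) for one statement."""
    text = ddl.lower()
    for pattern, verdict in RULES:
        if re.search(pattern, text):
            return verdict
    return ("UNKNOWN", "unknown", True)  # fail closed on anything unrecognized
```

Note that a nullable ADD COLUMN classifies as brief, which is exactly why hold duration alone is not the check: a brief ACCESS EXCLUSIVE still queues behind every current lock holder on the table.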

The benchmark can be perfect and the migration can still cause an outage. Railway's December 8, 2025 incident was a nullable-column ADD COLUMN that any LLM would emit correctly and any benchmark would score as success. The DDL was correct. The outage was the waitMask rule interacting with a long-running reader, covered in depth in an earlier post on this blog. There is no benchmark in the NL2SQL literature that scores this.

The systemic-risk argument

Three things are true at once, and the combination is new in 2026.

First, frontier LLM accuracy on realistic enterprise SQL sits between 10 and 24 percent. That number is not going to jump to 80 in the next quarter. The gap between Spider 1.0 and Spider 2.0 persisted across model generations, and each new benchmark has been designed to track capability rather than race to a ceiling.

Second, paraphrasing the same prompt shifts accuracy by 10 to 20 points. The production developer asking an agent to "add a nullable archived_at timestamp to the sessions table" and the one asking to "soft-delete sessions by timestamp" are describing the same migration. One gets a migration that works. The other may not. The team cannot know in advance which rephrasing lands on the accurate side of the model's paraphrase manifold.

Third, coding agents are writing a measurable fraction of migrations. 4 percent of public GitHub commits is the published floor; the actual number for private repos, where most production migrations live, is unknown. The population of DDL that reaches git push without a DBA-level human in the authorship loop has grown substantially since 2023.

Each datapoint on its own is defensible: AI coding is useful, benchmarks are benchmarks, robustness gaps exist in every ML system. The combination produces a measurable shift in the base rate of production-unsafe migrations reaching merge. The Railway post-mortems are public because Railway writes post-mortems. The same pattern is running unlogged at many other companies.

What to do that is not "stop using AI"

The response that the research community implicitly endorses, "wait for the benchmarks to saturate before deploying agents to write DDL," is not a policy any engineering organization is going to adopt. The agents are useful, the productivity gain is real, and the teams that forgo it lose ground to the teams that do not. The question is what check sits between the agent's output and production.

A pull-request-time lock-graph simulator is one answer. The check does not need to know whether the DDL was written by a human or an agent. It reads the proposed DDL, parses the lock levels it will require against the target schema, reads the current state of pg_locks and pg_stat_activity on production, and rejects the merge when the estimated lock window exceeds a threshold. The check is independent of authorship. It catches the migration because the migration is unsafe, not because the agent generated it.
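The core of that verdict can be sketched as a small pure function. Everything here is hypothetical naming from this sketch, not Datapace's implementation: `ActiveQuery` stands in for a row distilled from pg_stat_activity, and the worst-case model assumes each current lock holder on the target table may keep running until the statement timeout before the queued ACCESS EXCLUSIVE is granted.

```python
from dataclasses import dataclass

@dataclass
class ActiveQuery:
    # Hypothetical distillation of one pg_stat_activity row: which table the
    # query holds a lock on (even ACCESS SHARE counts, because a queued
    # ACCESS EXCLUSIVE must wait for it) and how long it has been running.
    table: str
    running_seconds: float

def lock_window_verdict(
    target_table: str,
    estimated_hold_seconds: float,          # from the DDL classifier / table stats
    active: list[ActiveQuery],              # snapshot of current production activity
    threshold_seconds: float = 5.0,         # longest acceptable lock window
    statement_timeout_seconds: float = 60.0,
) -> tuple[str, float]:
    """Worst-case lock window: wait behind current holders, then hold the lock."""
    holders = [q for q in active if q.table == target_table]
    # Worst case: the longest-remaining holder runs until the statement timeout.
    worst_wait = max(
        (statement_timeout_seconds - q.running_seconds for q in holders),
        default=0.0,
    )
    window = max(worst_wait, 0.0) + estimated_hold_seconds
    return ("block" if window > threshold_seconds else "pass", window)
```

Under this model, a metadata-only ADD COLUMN against an idle table passes, and the same DDL against a table with a long-running analytical reader blocks, which is the Railway pattern.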

The three benchmark failure modes each correspond to a specific class of lock-graph miss:

  1. Spider 2.0 enterprise complexity: the agent emits a join against a table with a thousand columns and no useful index. The lock simulator notices the sequential scan, estimates the ACCESS SHARE hold time against current traffic, and blocks.
  2. BIRD-Interact multi-turn drift: the agent, after several refinement turns, emits a DDL that subtly differs from what the developer asked for: an ALTER TABLE that rewrites the heap instead of the metadata-only variant. The simulator notices the rewrite and blocks.
  3. SQL2NL paraphrase brittleness: the same developer asks the same thing two different ways, gets two different migrations, and merges the wrong one. The simulator evaluates the DDL, not the prompt. The benchmark gap is invisible to it.

None of the three is solved by the simulator. The first one is a question of index design and table architecture, which the simulator is not trying to fix. The second one is a question of whether the developer reads the diff before merging, which no tool replaces. The third one is a question of prompt engineering. What the simulator does is take the output of whatever process produced the SQL, benchmark-accurate or not, and evaluate it against production state. The check is defensible regardless of who drafted the DDL. That is the point.

Why this check is different from the existing stack

Every existing guardrail around AI-generated code targets the code surface. Linters catch syntactic errors. Unit tests catch behavioral regressions. CI catches compile failures. None of these look at the production database. An AI-written migration that passes lint, passes tests, and passes CI is then deployed against a database the CI environment has no visibility into. The simulator plugs the specific hole the existing stack leaves: the production lock graph the test environment cannot reproduce.

Closing note

The benchmark numbers are not an attack on frontier LLMs. They are the most honest measurement the research community has produced, and they say something specific: enterprise SQL is hard, and the hard part is not generating syntactically valid SQL. The hard part is generating SQL that behaves correctly against a live production schema with live production traffic, and that part is what the benchmarks explicitly do not model. The agents are going to keep writing migrations. The benchmarks are going to keep climbing slowly. The gap between the two is where the cascades happen.

A pre-merge check that reads production state and renders a verdict on the proposed DDL is the one piece of the stack that sits in the gap. If you want that kind of verdict at PR time instead of a post-mortem, that is what we are building at Datapace.

Frequently asked questions

Are the benchmark numbers representative of what Claude Code or Cursor produce on private repos?

Nobody knows. The benchmarks are public; the private-repo output is not. What is known is that the coding agents have access to the repo and prior migrations, which the benchmark models do not. That advantage raises the floor. It does not raise the ceiling past the limits the benchmarks describe, because the limiting factor (reading the production lock graph) is not in the repo.

Aren't the benchmarks already improving? Won't 2026 and 2027 numbers be much higher?

Possibly. Spider 1.0 went from 20 percent to 90 percent in four years. The relevant question is whether Spider 2.0 and BIRD-Interact follow the same trajectory. The benchmarks were deliberately built to resist that saturation, and the drop from Spider 1.0 to Spider 2.0 is larger than the gain any single model generation has delivered in the last two years. Betting that the benchmarks saturate in time is a bet with asymmetric downside.

If paraphrasing shifts accuracy by 10 to 20 points, why not just standardize prompts?

Teams could, and some do. But the effect shows up in structured prompts too: the paraphrase-robustness paper tests prompts drawn from the same schema and intent, not wildly different rephrasings. A team that standardizes on "add a nullable timestamp column to X" is still exposed to the brittleness on any clause that is not part of the template.

Is this specific to Postgres?

No. The benchmarks cover BigQuery, Snowflake, SQLite, and Postgres. The lock-graph miss is specific to Postgres and MySQL and any engine that has an exclusive DDL lock mode. On BigQuery and Snowflake the equivalent failure is a slot-budget exhaustion rather than a lock cascade. The argument shape transfers; the specific cost model differs.

What does the simulator look like in practice?

A PR comment that lists the DDL's required lock levels, the current state of pg_locks on production, an estimated lock window, and a block or pass verdict. The developer sees the verdict at the merge button, which is the same place the diff lives. It does not replace the developer's judgment. It gives the developer one additional data point they cannot get from the AI or from the CI.
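As one possible shape of that comment (layout and field names are illustrative, not Datapace's actual output), the pieces listed above assemble into a short text body:

```python
def render_verdict_comment(ddl: str, lock_level: str, window_s: float,
                           threshold_s: float, verdict: str) -> str:
    # Assemble the PR comment described above: the DDL, its required lock
    # level, the estimated window against the threshold, and the verdict.
    status = "BLOCK" if verdict == "block" else "PASS"
    return "\n".join([
        "Migration lock check",
        f"  DDL: {ddl}",
        f"  Required lock: {lock_level}",
        f"  Estimated lock window: {window_s:.1f}s (threshold: {threshold_s:.1f}s)",
        f"  Verdict: {status}",
    ])
```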

Sources

  1. F. Lei, J. Chen, Y. Ye et al., "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows", ICLR 2025 Oral.
  2. J. Xie, B. Yuan, J. Lin et al., "BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions", ICLR 2026.
  3. M. Safarzadeh, A. Oroojlooy, D. Roth, "Evaluating NL2SQL via SQL2NL", EMNLP Findings 2025.
  4. BIRD-Interact project page, bird-interact.github.io (leaderboard and August 2025 release notes).
  5. Spider 2.0 project page, spider2-sql.github.io.
  6. Anthropic, "Claude 4 and Claude Code general availability", May 2025.
  7. JetBrains, 2026 Developer Ecosystem Survey (AI coding tool adoption).
  8. Datapace blog, "One CI check would have caught both of Railway's billion-row Postgres migration outages".
  9. Datapace blog, "ACCESS SHARE does not jump the queue: Postgres lock fairness".

