Essay
April 19, 2026
13 min read

The attribution gap: why DB research can't name the commit

DBSherlock, iSQUAD, D-Bot, and RCRank each infer a DB-internal root cause. None of them names the commit. The gap between the fourth depth of inference and the developer workflow.

#PostgreSQL · #Database reliability · #Root cause analysis · #Research · #APM

A decade of database performance research has converged on sophisticated root-cause analysis. DBSherlock in 2016 produced causal models over hundreds of DB statistics. Alibaba's iSQUAD in 2020 diagnosed intermittent slow queries with an 80.4% F1 score. D-Bot in 2024 added LLM-powered reasoning and natural-language diagnosis reports. RCRank in 2025 ranked root causes by impact across multimodal signals. Every one of those systems stops at the same wall. None of them answers the question a developer actually asks when an endpoint slows down: which commit did this.

TL;DR. Published database-performance diagnosis systems infer root causes at four depths: metric threshold, plan change, anomaly class, and DB configuration. The developer workflow operates at two deeper depths: SQL or migration change, and Git commit. No published system natively crosses the gap between the fourth and the fifth. That gap is not a research oversight. It is a different problem living outside the DB.

Six depths of causal inference

Every diagnostic system can be placed on a single axis: how deep into the causal chain does its output reach. The reader scanning a Datadog alert at 2am, the SRE on call, and the engineer who merged the offending migration are each asking a different question, and they are at different depths.

Six levels of causal inference across database diagnostic systems:

Level 1, metric threshold: baseline APM stops here.
Level 2, query plan change: pganalyze, Datadog DBM, New Relic QPM stop here.
Level 3, anomaly class: iSQUAD and RCRank stop here.
Level 4, DB configuration or knob: DBSherlock and D-Bot stop here.
Levels 5 and 6, SQL or migration change and Git commit or pull request: no published system operates here. These two levels are the attribution gap.

Where each generation of diagnostic system stops. The depth metaphor is ours; the stopping points are what the papers and product docs actually claim.

Walk the levels in order. A baseline APM dashboard reports that p99 latency on the checkout endpoint crossed 500ms. That is level 1. A database monitoring tool like Datadog DBM, pganalyze, or New Relic Query Performance Monitoring will go further: it shows the offending query, the new execution plan, the row counts, the wait events. That is level 2. A research system like iSQUAD or RCRank groups the signals into an anomaly class: buffer-pool pressure, lock cascade, slow I/O, plan regression. Level 3. DBSherlock and D-Bot get one more step down: they infer that a specific DB configuration change, knob setting, or environment shift is the cause. Level 4.
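
Level 1 is mechanical enough to state in a few lines. A minimal sketch of the threshold check, assuming a fixed 500ms cutoff like the example above; function names are illustrative, and real APM baselining is adaptive rather than a fixed threshold:

```python
# Level-1 detection: a p99 threshold check over a window of latency
# samples. Illustrative only; production baselining adapts the threshold.
from statistics import quantiles

def p99(samples_ms):
    # 99th percentile via statistics.quantiles (n=100 gives cut points 1..99)
    return quantiles(samples_ms, n=100)[98]

def level1_alert(samples_ms, threshold_ms=500.0):
    """True when the window's p99 latency crosses the threshold."""
    return p99(samples_ms) > threshold_ms

healthy = [120.0] * 990 + [300.0] * 10     # fast, with a mild tail
degraded = [120.0] * 900 + [2100.0] * 100  # 10% of requests at 2.1s
```

Everything deeper than this needs more context than a latency series: the plan, the anomaly class, the configuration, and eventually the commit.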

The developer asks a different question. "Did a schema change cause this." "Is my PR the one that did it." "Which commit shipped the query that now times out." Those are levels 5 and 6. Every system named above falls silent at level 5.

What each system actually claims

The gap is not a matter of interpretation. It is what each paper and each product literally says its scope is.

DBSherlock (Yoon, Niu, Mozafari, SIGMOD 2016)

DBSherlock operates on "hundreds of statistics and configurations collected over the lifetime of the system." The output is "a small set of potential causes" drawn from the DB's own observed state. The paper's contribution is causal models that identify "which DB-side factor correlates with the anomaly." The DBA confirms the root cause; the confirmation is then used to improve future diagnoses. Nothing in the loop reads the application repository or considers a commit as a candidate cause. The model's inputs are DB statistics. Its outputs are DB-internal explanations.

iSQUAD (Ma et al., VLDB 2020, Alibaba)

iSQUAD targets intermittent slow queries on Alibaba's OLTP cloud database. The framework has four components: anomaly extraction from KPIs, dependency cleansing to remove downstream propagation, type-oriented pattern clustering, and a Bayesian Case Model. The reported F1 is 80.4% on real production data. The diagnosis classifies the anomaly type: CPU saturation, I/O contention, buffer pool pressure, lock wait pattern. That classification is level 3. iSQUAD does not know which application code change preceded the anomaly, because application code is not an input signal.

D-Bot (Zhou, Li et al., VLDB 2024 and SIGMOD Companion 2025)

D-Bot is an LLM-powered DBA copilot. It "automatically acquires pertinent knowledge from diagnostic documents, interacts with users for self-refinement, and generates reasonable and well-founded diagnosis reports (i.e., specifying root causes, solutions, and references)." Its inputs are the same DB-side inputs as prior systems, plus a corpus of diagnostic documentation that the LLM retrieves from. The output is a natural-language diagnosis naming a DB-side cause. Level 4. D-Bot does not read the application's migration history or its Git log.

RCRank (Ouyang et al., VLDB 2025)

RCRank is a multimodal ranking framework. The modalities are query statements, execution plans, execution logs, and key performance indicators. The output is a ranked list of root-cause types, with the ranking "adapted to the impact of each cause on query latency." RCRank's contribution is impact-aware ranking and multimodal fusion. Its inputs and its taxonomy are DB-internal. An anomaly triggered by a schema migration appears to RCRank as a plan change plus a lock pattern, not as a migration.

Survey positioning

The 2023 Science China survey by Huang, Qin, Zhang, Tu, Li, and Cui names the three topics of database performance optimization research: prediction, diagnosis, and tuning. "Diagnosis" in that taxonomy is defined as "determining the root cause of performance regression" using DB-side signals. The survey covers dozens of systems. None of the systems it surveys takes application code or the Git log as a primary input. That is not a blind spot in the survey. It is an accurate description of the field.

The commercial tier does not close the gap either

A reasonable objection is that academic systems target DBA workflows while commercial APM tools target developer workflows, and the commercial tools must therefore close the gap. They do not. They move it one step.

Datadog DBM correlates database load to APM services by injecting trace identifiers into DB metric collection. The docs are explicit about what this buys you: "you can attribute load on a database to individual services to understand which services are most active on the database." That is service-to-query attribution. The leap from "this query is hot" to "this commit shipped the change that made it hot" is not automatic. Datadog offers separate Deployment Tracking that requires the user to instrument deployment events with repository_url and commit_sha. The UI then overlays deployments on latency charts so a human can visually correlate. Correlation is a dashed line on a graph, not a causal claim.

pganalyze combines pg_stat_statements with auto_explain logs to give exact plan samples per query. Its deployment integration with Datadog is described as "emit trace data from slow query plan samples via OpenTelemetry" using the traceparent tag. The tool will show that a query's plan changed between time T1 and T2. It will not natively tell you that PR #2843 merged at T2 is why.

New Relic Query Performance Monitoring shows slow-query details, stack traces in the application code that issued the query, and historical baselines. A developer can combine that with the team's deploy calendar to reason about whether a deployment preceded the regression. The reasoning is manual.

What the commercial tools give you

Signal: query plan, wait events, load
Attribution: query → service
Deploy overlay: visual correlation (manual)
Commit link: user instrumentation (SHA tag)
Causal claim: none; the user infers causality

What the developer needs

Signal: regression anchored to a time
Attribution: regression → commit or PR
Deploy overlay: automatic, per change
Commit link: direct, with confidence interval
Causal claim: "this PR is the most likely cause"

The commercial tier offers a place to manually connect a regression to a commit, given enough user instrumentation. No tier, commercial or academic, offers a system that reads the DB signals, reads the Git history, and outputs a commit-level attribution with a confidence score.

Why the gap is structural, not an oversight

Four reasons the published research does not cross from level 4 to level 5.

Different input set. A system that attributes a regression to a Git commit needs to read, at minimum, the commit log, the migration history, and the SQL diff between commits. Those are not DB signals. They live in the repository. A diagnostic framework built around DB metric collection has no reason to ingest them, and pivoting its input set changes the problem from "infer anomaly type from KPIs" to "correlate two time-series from different systems." That is a different research problem with a different research community.

Different output shape. DB research outputs an explanation in the DB's own vocabulary: "buffer pool pressure plus lock cascade on table X." A developer workflow needs an output in the developer's vocabulary: "PR #2843, line 147 of 0042_add_archived_at.sql, introduced a sequential scan on a 12M-row table." The translation between those vocabularies is nontrivial and is not the contribution a DB paper is rewarded for making.

Different workflow. Level 1 through 4 outputs live in a dashboard. The DBA or SRE reads them. Level 5 and 6 outputs live in the code review surface. The developer reads them. The research lineage has optimized for the dashboard because the dashboard is where a DBA spends their day. The attribution gap sits across a workflow boundary that the research lineage has never had a strong incentive to cross.

The population that would pay for it has changed. Between 2015 and 2025, the dedicated DBA role all but disappeared from SaaS startups of 15 to 60 engineers. The person holding the pager for a database regression is now a backend engineer who works out of a PR, not a dashboard. The research community is correctly serving the population that commissioned it: cloud-database operators with DBA-ish workflows. The startup engineer is a different customer, and has been underserved on purpose.

What closing the gap looks like

A system that closes the gap has to do three things simultaneously.

First, ingest DB signals continuously: pg_stat_statements for aggregate query cost, auto_explain for plan shapes, pg_locks for contention, pg_stat_activity for the current working set. This is what existing systems already do. The signal is known.
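
The plan-shape part of that signal is already structured: with auto_explain.log_format set to json, logged plans have the same tree shape as EXPLAIN (FORMAT JSON). A sketch of diffing the scan shape of two snapshots of the same query (the snapshot plumbing is assumed; only the diff is shown):

```python
# Diff the scan shape of two auto_explain snapshots of the same query.
# Plans are the JSON tree emitted by EXPLAIN (FORMAT JSON): a "Node Type",
# an optional "Relation Name", and children under "Plans".
def scan_nodes(plan):
    """Collect (node type, relation) pairs from a plan tree."""
    nodes = {(plan.get("Node Type"), plan.get("Relation Name"))}
    for child in plan.get("Plans", []):
        nodes |= scan_nodes(child)
    return nodes

def plan_diff(before, after):
    """Nodes that disappeared or appeared between the two snapshots."""
    b, a = scan_nodes(before), scan_nodes(after)
    return {"gone": b - a, "new": a - b}

before = {"Node Type": "Index Scan", "Relation Name": "orders",
          "Index Name": "idx_orders_user"}
after = {"Node Type": "Seq Scan", "Relation Name": "orders"}
```

An Index Scan that becomes a Seq Scan on the same relation shows up as one entry in "gone" and one in "new".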

Second, ingest repository signals: the commit log with timestamps, migration files, the SQL diff between commits, and the CI/CD event stream that reports when each commit became the running version in production. None of the academic systems named above ingest any of this. It is what the gap is made of.
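
Gathering that second input set is bookkeeping, not research. A sketch of the candidate-selection step, with an invented deploy-event schema (commit, deploy time, files changed, e.g. assembled from `git show --name-only` plus a CI webhook); no real CI system's payload looks exactly like this:

```python
# Select commits that became live shortly before a regression, and flag
# the ones touching migrations or SQL. The dict schema is illustrative.
from datetime import datetime, timedelta

def candidate_commits(deploys, regression_start, window_minutes=30):
    """Commits deployed in the window before the regression began."""
    lo = regression_start - timedelta(minutes=window_minutes)
    out = []
    for d in deploys:
        if lo <= d["deployed_at"] <= regression_start:
            touches_sql = any(f.endswith(".sql") or "/migrations/" in f
                              for f in d["files"])
            out.append({**d, "touches_sql": touches_sql})
    return out

deploys = [
    {"commit": "8b8dc7a", "deployed_at": datetime(2026, 4, 18, 14, 3),
     "files": ["api/orders.py", "queries/checkout.sql"]},   # paths invented
    {"commit": "1a2b3c4", "deployed_at": datetime(2026, 4, 18, 12, 0),
     "files": ["README.md"]},  # deployed hours earlier; not a candidate
]
candidates = candidate_commits(deploys, datetime(2026, 4, 18, 14, 7))
```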

Third, for every detected regression, answer the question: given the time the regression began, the plan change observed, and the set of commits that deployed in the preceding window, what is the most probable cause? The output is a commit or PR, not a knob. The answer is probabilistic, with a confidence bound. For a SQL-diff cause ("the query in PR #2843 removed the index hint"), the confidence is high. For a correlated-but-indirect cause ("migration in PR #2841 locked the table long enough for queue saturation to start two minutes later"), the confidence is lower and the output names both.
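
The third step can be sketched as a scorer over those candidates. The weights below are invented for illustration (a real system would calibrate them); the one structural point is the explicit null candidate, which keeps the ranking honest when nothing in the deploy window explains the regression:

```python
# Rank candidate causes for a regression and normalize scores into
# confidences. Weights are made up for illustration. The null candidate
# covers non-deploy causes such as auto-analyze or data growth.
def rank_causes(candidates, plan_changed):
    scored = []
    for c in candidates:
        score = 1.0                 # base rate: "deployed in the window"
        if c.get("touches_sql"):
            score *= 8.0            # SQL or migration diff: strong evidence
            if plan_changed:
                score *= 3.0        # plan change plus SQL diff: stronger
        scored.append((c["commit"], score))
    scored.append(("none: auto-analyze / data growth", 2.0))
    total = sum(s for _, s in scored)
    return sorted(((name, s / total) for name, s in scored),
                  key=lambda pair: pair[1], reverse=True)

ranked = rank_causes(
    [{"commit": "8b8dc7a", "touches_sql": True},    # from the verdict example
     {"commit": "9c9c9c9", "touches_sql": False}],  # hypothetical bystander
    plan_changed=True)
```

Two SQL-touching PRs in the same window would split the probability mass, and the top confidence drops accordingly, which is the honest behavior.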

A verdict in that shape looks like this, rendered in the surface where the developer already works:

# datapace-agent · regression attribution
SLOW · api.orders.checkout p99: 420ms → 2100ms (+400%)
         detected 2026-04-18 14:07 UTC, 4 min after deploy

  Query        SELECT * FROM orders WHERE email = $1
  Plan change  Index Scan (idx_orders_user) → Seq Scan (12M rows)
  Root cause   commit 8b8dc7a (PR #2843, merged 14:03 UTC)
               diff removed WHERE user_id = $1 predicate
               query now matches fewer indexes
  Confidence   0.92
  Alternate    PR #2841 (migrations/0042_add_archived_at.sql)
               confidence 0.04, reason: no plan-relevant column changes

Nothing in that verdict is novel in its individual parts. pg_stat_statements emits the query. auto_explain emits the plan change. The Git log emits the commits. What is novel is the join: the regression time window on the left, the commit window on the right, the plan diff in the middle, and a ranked output that names a PR.

The verdict is probabilistic, not deterministic

The attribution is a ranking, not a certainty. Two PRs may deploy in the same five-minute window. A regression may be caused by data growth that happens to cross a threshold at the same moment as an unrelated commit. A plan change may have nothing to do with the deploys at all and be triggered by statistics auto-analyze. The system has to be honest about these cases: rank them with low confidence, name the alternates, and let the developer decide.

This is the same posture every modern fault-localization system takes, from SZZ-style bug prediction to flaky-test root causers. The output is a ranked list, and the top-ranked cause is usually right, and sometimes it is not, and the cost of being occasionally wrong is bounded by the fact that the developer reads the verdict and applies judgment. The failure mode is worse if the system claims certainty.

Closing note

The research lineage from DBSherlock through RCRank is the right lineage to read if you want to understand how DB-side anomaly diagnosis works. It is also the right lineage to understand the shape of its ceiling. None of these systems is broken. None of them missed a trick. They solved a clean version of a different problem: given DB signals, infer the DB-internal cause. Cross that problem with the commit graph and you get a different system, with different inputs, a different output shape, and a different customer. The gap between the two is not an oversight. It is what the two research communities have each decided is out of scope, reasonably, for more than a decade.

If you want a verdict at PR time that names a commit rather than a knob, that is what we are building at Datapace. The academic lineage is the foundation. What we add is the join between the DB signals it already knows how to read, and the Git history it has never had a reason to ingest.

Frequently asked questions

Does Datadog DBM attribute a regression to a commit?

Not automatically. Datadog DBM correlates query load to APM services via injected trace identifiers. Separately, Deployment Tracking lets a user tag deployments with commit_sha and repository_url, which the UI then overlays on latency charts. The correlation between a regression and a commit is a visual overlay that the user reads. No causal claim is emitted by the product.

What about pganalyze Index Advisor or Query Advisor?

Those surfaces recommend indexes or query rewrites based on plan data. They do not map a detected regression back to a PR. pganalyze's integration model is to emit plan data via OpenTelemetry and let the downstream APM do correlation. The user connects the two surfaces manually.

Can you not just git-bisect the regression?

Git bisect works when the regression reproduces deterministically on demand. Database regressions often do not: they depend on production data volume, concurrent workload, cache warmth, and the plan cache. Bisecting a production regression means deploying each candidate commit to production and waiting for the regression to manifest, which is not viable at merge frequency in a healthy SaaS team.

Why not just ingest commit history into one of the existing systems?

Because the existing systems are optimized end-to-end for a different input shape. iSQUAD's clustering, DBSherlock's causal models, and RCRank's cross-modal Transformers all assume DB-metric inputs. Attaching commit history as an additional modality is an interesting research project, not a bolt-on feature. The fact that nobody has published it is evidence that the join is not trivial.

Is this a position against the research, or against the products?

Neither. The research is correct within its scope, and the products are honest about their scope. The point is that the developer's workflow lives at levels 5 and 6 and has never been covered by the research or the products. That is a gap to fill, not a mistake to criticize.

Sources

  1. D. Y. Yoon, N. Niu, B. Mozafari, "DBSherlock: A Performance Diagnostic Tool for Transactional Databases", SIGMOD 2016.
  2. M. Ma, Z. Yin, S. Zhang et al., "Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases" (iSQUAD), PVLDB 13(8), 2020.
  3. X. Zhou, G. Li et al., "D-Bot: Database Diagnosis System using Large Language Models", PVLDB 2024; SIGMOD Companion 2025.
  4. B. Ouyang, Y. Zhang, H. Cheng et al., "RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems", PVLDB 2025.
  5. S. Huang, Y. Qin, X. Zhang, Y. Tu, Z. Li, B. Cui, "Survey on performance optimization for database systems", Science China Information Sciences, 2023.
  6. Datadog, "Correlate Database Monitoring and Traces", product documentation.
  7. Datadog, "Deployment Tracking", product documentation.
  8. pganalyze, "Integrating pganalyze with Datadog APM", product documentation.
  9. New Relic, "View slow query details", product documentation.
