Why generic time-series anomaly detection fails on Postgres
Yahoo S5, NAB, and UCR shaped the TSAD literature. Postgres metrics do not look like those benchmarks, and methods that score well on them produce unusable alerting pipelines in production.
The time-series anomaly detection literature has shipped a generation of sophisticated methods benchmarked on Yahoo S5, the Numenta Anomaly Benchmark, and the UCR Time Series Anomaly Archive. Teams that point these methods at Postgres metrics, specifically pg_stat_statements, auto_explain logs, and the standard Prometheus Postgres exporter, get alerting pipelines with false-positive rates that make on-call rotations unusable. This is not a bug in the methods. It is a mismatch between what the benchmarks optimized for and what a Postgres workload produces. Two 2025 papers call this out specifically.
TL;DR. Yahoo S5, NAB, and UCR characterize anomalies that are structurally different from what Postgres metrics actually show. Postgres anomalies are regime shifts after a merge, not point outliers in a smooth series. A detector trained on the benchmarks will flag normal diurnal peaks, miss the post-deploy plan regression, and page the on-call engineer on the wrong events. The fix is domain-specific detection that understands query normalization, workload seasonality, and deploy-boundary regressions. Generic TSAD is not that.
What the benchmarks measure
The three datasets that dominate TSAD evaluation are Yahoo S5, the Numenta Anomaly Benchmark (NAB), and the UCR Time Series Anomaly Archive. Each was built with different assumptions and carries different failure modes.
Yahoo S5 mixes real and synthetic time series from Yahoo services, with tagged anomaly points. The 2020 critique paper by Wu and Keogh, "Current Time Series Anomaly Detection Benchmarks are Flawed," documents four specific problems with Yahoo S5 and similar datasets: trivially detectable anomalies, run-to-failure sequences rather than realistic in-stream anomalies, unrealistic density of labeled anomalies, and mislabeling. The paper's argument, summarized by the title, is that methods that score well on Yahoo S5 are solving a benchmark, not the general problem.
NAB provides 58 time series across seven categories including AWS CloudWatch, Twitter engagement, traffic, and synthetic series. The anomalies are individual labeled windows. NAB's design explicitly assumes streaming detection, which is closer to production than Yahoo's offline framing, but the categories are still sensor-like rather than workload-like.
UCR, proposed after the Wu and Keogh critique, is 250 carefully curated time series designed to fix the specific flaws identified in the earlier benchmarks. It is the cleanest of the three. It is also heavily pattern-wise: the anomalies are shape disruptions in otherwise regular signals, which maps better to industrial sensor data than to a database workload.
The 2025 paper by Andreas C. Müller (Microsoft Research), "Open Challenges in Time Series Anomaly Detection: An Industry Perspective," makes the industry critique explicit. Current research uses definitions that "miss critical aspects of how anomaly detection is commonly used in practice." The paper calls out five specific areas the benchmarks underweight: streaming algorithms, human-in-the-loop refinement, point-process anomalies, conditional anomalies, and population-level analysis across many series. All five are operationally central and benchmark-peripheral.
What Postgres metrics actually produce
The 2025 Expert Systems with Applications survey, "Insights into KPI-based Performance Anomaly Detection in Database Systems," covers the specific shape of database KPI data. Five properties make Postgres metrics distinctive, and all five push away from the Yahoo/NAB/UCR assumptions.
There are five axes of mismatch, and a generic TSAD method optimized on Yahoo/NAB/UCR fails on at least one of them.
1. Deploy boundaries, not smooth drift
The most consequential Postgres metric regressions happen at deploy boundaries. A commit ships, the plan cache flushes or the query changes, and the p99 latency for a specific normalized query jumps from 12 ms to 300 ms within a minute. To a generic TSAD method, this looks like a step function inside a time series that was otherwise stationary. Yahoo S5's dominant anomaly type is a point outlier in a smooth series. NAB catches both outliers and step changes but scores them the same way. UCR's pattern-wise framing treats a sustained step as a long anomaly window. None of the three captures the crucial operational signal: the step coincides with a deploy event, and the deploy event is the attribution.
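A deploy-boundary step of this kind can be checked directly by comparing the latency distribution on either side of the deploy timestamp. A minimal sketch, with every function name, window size, and variable below assumed for illustration rather than taken from any real tool:

```python
from statistics import median

def step_at_deploy(samples, deploy_ts, window=600):
    """Compare median latency in the windows before and after a deploy.

    samples: list of (unix_ts, latency_ms) for one normalized query.
    Returns the post/pre median ratio, or None if either window is empty.
    Names and the fixed window are illustrative assumptions.
    """
    pre = [v for t, v in samples if deploy_ts - window <= t < deploy_ts]
    post = [v for t, v in samples if deploy_ts <= t < deploy_ts + window]
    if not pre or not post:
        return None
    return median(post) / median(pre)

# Synthetic example: latency jumps from 12 ms to 300 ms at the deploy.
samples = [(t, 12.0) for t in range(0, 600, 60)] + \
          [(t, 300.0) for t in range(600, 1200, 60)]
ratio = step_at_deploy(samples, deploy_ts=600)
print(ratio)  # 25.0
```

In a real pipeline the window and the ratio threshold would be tuned per query; the point is only that the deploy timestamp, not the series alone, defines where to look.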
2. Seasonality is compound
A database workload carries diurnal seasonality (US business hours versus nighttime), weekly seasonality (Monday-Friday versus weekend), per-tenant seasonality (one customer running their monthly batch on the 1st), and release-cadence effects (every Tuesday-Thursday deploy window behaves differently). A TSAD method that models one or two of these layers will flag the other layers as anomalies. Yahoo S5 is mostly single-seasonality. NAB includes some multi-seasonality but not at the scale Postgres metrics carry. UCR is pattern-wise and does not emphasize multi-layer seasonality at all.
3. Per-query intent, thousands of series
A well-instrumented Postgres deployment exposes per-query metrics through pg_stat_statements. On a production instance this is thousands of normalized queries, each with its own latency and throughput series. A detector that treats the engine's aggregate latency as a single series misses regressions that hit only one query. A detector that runs independently on each query series gets thousands of false positives because most queries are low-volume and noisy. The correct approach is population-level analysis across the query set, which is one of the five gaps Müller's 2025 paper names.
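One way to make population-level analysis concrete is to score each query's latency change against the distribution of changes across the whole query set, using robust statistics so the long tail of noisy low-volume queries does not dominate. A sketch under those assumptions, with all names invented for illustration:

```python
from statistics import median

def population_scores(change_by_query):
    """Robust z-like score of each query's latency-change ratio against
    the population of all queries on the instance (illustrative sketch).

    change_by_query: dict mapping normalized query -> change ratio.
    """
    changes = list(change_by_query.values())
    center = median(changes)
    # Median absolute deviation; 1.4826 rescales MAD to a stdev-like unit.
    mad = median(abs(c - center) for c in changes) or 1e-9
    return {q: (c - center) / (1.4826 * mad)
            for q, c in change_by_query.items()}

# Twenty queries with normal jitter, one query regressed 25x.
changes = {"q%d" % i: 1.0 + 0.01 * i for i in range(20)}
changes["checkout"] = 25.0
scores = population_scores(changes)
```

The regressed query stands far outside the population while the jittery majority scores near zero, which is exactly the separation per-series thresholds fail to produce.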
4. Workload context changes the meaning of a spike
A three-second p99 on the checkout-page query at 2am is an anomaly. The same three-second p99 at 11am on Black Friday is not. Conditional anomaly detection, again one of Müller's five gaps, is the framing that handles this. Generic TSAD methods treat the metric value as the signal. The correct signal is the metric conditional on the workload context (time of day, campaign, tenant mix).
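The conditional framing can be sketched by keying the baseline on a workload-regime tuple rather than keeping one global threshold. A minimal sketch, assuming a (time-bucket, campaign) regime key and a 3-sigma rule, neither of which is prescribed by any of the cited papers:

```python
from collections import defaultdict
from statistics import mean, stdev

class ConditionalThreshold:
    """Per-regime latency baseline: a value is judged against the history
    for its own workload context, not a global threshold. Sketch only."""

    def __init__(self):
        self.history = defaultdict(list)

    def observe(self, regime, value):
        self.history[regime].append(value)

    def is_anomalous(self, regime, value, min_n=5, k=3.0):
        h = self.history[regime]
        if len(h) < min_n:
            return False  # not enough context to judge this regime
        mu, sd = mean(h), stdev(h)
        return value > mu + k * max(sd, 1e-9)

ct = ConditionalThreshold()
for _ in range(10):
    ct.observe(("02:00", "baseline"), 40.0)
    ct.observe(("11:00", "black_friday"), 3000.0)

# 3 s at 2am is anomalous; the same 3 s under Black Friday load is not.
print(ct.is_anomalous(("02:00", "baseline"), 3000.0))      # True
print(ct.is_anomalous(("11:00", "black_friday"), 3000.0))  # False
```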
5. False-positive cost is operational, not mathematical
The benchmarks measure precision, recall, and various F-scores. In production, a false positive is a page at 3am. A team that accepts a false-positive rate of 5 percent per query per day on a thousand-query instance gets roughly 50 spurious pages per day, which is career-ending for the on-call rotation. The precision threshold production requires is several orders of magnitude tighter than anything the benchmarks reward.
Two ways the mismatch shows up in real pipelines
The two failure modes are systematic enough to name.
Noise above threshold. The team tunes a generic detector; it flags the 11am diurnal peak as anomalous; the team adds a seasonality filter; the detector now misses the post-deploy step on a low-volume query; the team adds deploy-aware reasoning; at that point the detector is no longer a generic TSAD method, it is a domain-specific pipeline. Every mature database observability stack has been through this loop. The endpoint is always a set of hand-tuned rules that do more work than the detector.
Alert fatigue into silence. The on-call rotation accumulates false pages; the team raises the threshold; the threshold quiets the noise and also quiets the real signal; the next production regression slips through for hours. This path is well documented in the SRE literature and is the practical consequence of applying a TSAD method tuned on the wrong distribution.
What the fix actually looks like
Four concrete components that together close the gap.
Query normalization as the primary axis. The unit of analysis is the normalized query from pg_stat_statements, not the instance aggregate. Per-query baselines, per-query anomaly scoring, per-query history. This is population-level analysis, in the sense Müller's paper uses the term, with the population being the set of normalized queries on a single instance.
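The normalization step itself can be approximated with a couple of regexes. The real fingerprint (pg_stat_statements' queryid) is computed inside the server from the parse tree, so the sketch below is only a rough textual stand-in:

```python
import re

def normalize_query(sql):
    """Collapse literals so textually different statements with the same
    shape share one fingerprint. A simplified textual approximation of
    what pg_stat_statements does from the parse tree; not its algorithm."""
    q = sql.strip()
    q = re.sub(r"'(?:[^']|'')*'", "?", q)     # string literals -> ?
    q = re.sub(r"\b\d+(\.\d+)?\b", "?", q)    # numeric literals -> ?
    q = re.sub(r"\s+", " ", q)                # collapse whitespace
    return q.lower()

a = normalize_query("SELECT * FROM orders WHERE id = 42")
b = normalize_query("select *  from orders where id = 913")
print(a == b)  # True
```

Once two statements share a fingerprint, their samples feed one baseline, which is what makes the per-query population tractable.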
Deploy events as first-class context. The pipeline ingests the deploy event stream (commit timestamps, CI/CD publish events) and uses it as conditional context. A step change coinciding with a deploy is not an anomaly to be flagged; it is a regression to be attributed. The attribution is covered in detail in the earlier post on the attribution gap.
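Attribution then reduces to a temporal join between detected step changes and the deploy stream. A minimal sketch, with the tolerance window and all names assumed for illustration:

```python
def attribute_step(step_ts, deploys, tolerance=300):
    """Return the commit of the deploy closest to a detected step change,
    if one lands within `tolerance` seconds; otherwise None.

    deploys: list of (unix_ts, commit_sha). Sketch; names are assumptions.
    """
    candidates = [(abs(step_ts - t), t, sha) for t, sha in deploys
                  if abs(step_ts - t) <= tolerance]
    if not candidates:
        return None  # no deploy nearby: treat as a genuine anomaly
    _, _, sha = min(candidates)
    return sha

deploys = [(1000, "a1b2c3"), (5000, "d4e5f6")]
print(attribute_step(1090, deploys))  # a1b2c3
print(attribute_step(3000, deploys))  # None
```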
Compound seasonality models. Diurnal, weekly, and per-tenant seasonality decomposed and subtracted before any outlier reasoning. A Prophet-style model or an STL decomposition handles the first two; the third requires a tenant-aware grouping.
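As a stand-in for the Prophet/STL step, a coarse bucketed baseline illustrates the decompose-then-subtract idea; the (weekday, hour) keying and the function name are assumptions, and a tenant id would simply become a third key component:

```python
from collections import defaultdict
from statistics import median

def seasonal_residuals(points):
    """Subtract a (weekday, hour) median baseline so diurnal and weekly
    cycles never reach the outlier stage. points: (weekday, hour, value).
    An illustrative stand-in for an STL or Prophet decomposition."""
    buckets = defaultdict(list)
    for wd, hr, v in points:
        buckets[(wd, hr)].append(v)
    baseline = {k: median(vs) for k, vs in buckets.items()}
    return [(wd, hr, v - baseline[(wd, hr)]) for wd, hr, v in points]

# The Monday 11am peak and the Monday 2am trough both flatten to ~0.
points = [(0, 11, 100.0), (0, 11, 102.0), (0, 2, 10.0), (0, 2, 12.0)]
residuals = seasonal_residuals(points)
```

Outlier reasoning then runs on the residuals, so a peak that recurs every weekday at 11am contributes nothing to the anomaly score.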
Conditional thresholds by workload regime. The checkout-page query under Black Friday load has a different expected distribution than under baseline load. The detector has to know which regime it is in. This is more engineering than ML, and it is what production-grade pipelines end up doing anyway.
Generic TSAD methods do not get to any of these four automatically. They can be retrofitted with enough hand-engineering, at which point the "TSAD method" label describes a small part of the pipeline. The rest is domain knowledge that does not appear in the benchmarks at all.
Generic TSAD stack: one aggregate latency series, benchmark-tuned thresholds, single-layer seasonality, no deploy context, each anomaly scored in isolation.
Postgres-aware stack: per-query baselines from pg_stat_statements, deploy-event ingestion and attribution, compound seasonality models, conditional thresholds by workload regime.
Closing note
The TSAD research lineage has real results and keeps improving. The critique here is not that the methods are bad. It is that the benchmarks they optimize on encode assumptions that do not match the shape of Postgres metrics. Müller's 2025 paper says this from the industry side; the ESA 2025 database survey says it from the database side. The intersection is where database-observability tooling has to operate, and generic TSAD is one component of that tooling, not the tooling itself.
What we are building at Datapace treats database anomaly detection as a domain-specific problem: per-query baselines, deploy-boundary attribution, compound seasonality, conditional thresholds. The TSAD literature informs the component work. The product is the pipeline around it.
Frequently asked questions
Are there Postgres-specific TSAD benchmarks?
Not of the same scale as Yahoo S5 or UCR. The ESA 2025 database-systems KPI survey is the most recent literature review, and it notes the absence. Production observability vendors (pganalyze, Datadog DBM) publish case studies, not benchmarks. The cleanest path forward would be a public Postgres-metrics benchmark derived from real production traces, and nobody has published one at a scale and license that supports reproduction.
What about Prophet or STL for Postgres metrics?
Both work for the seasonality-decomposition piece. Neither solves the per-query population-level problem or the deploy-boundary attribution problem on its own. They are components in the pipeline, not the pipeline.
Are deep-learning methods any better here?
On the specific TSAD metrics the benchmarks report, yes. In production on Postgres metrics, the gains do not obviously transfer, because the failure mode is not "the model is not expressive enough." It is "the model is optimizing for the wrong objective on the wrong distribution." A deeper model optimized on Yahoo S5 is not a better detector for pg_stat_statements regressions. It is the same detector with more parameters.
If generic TSAD fails, what is the minimum viable Postgres-aware detector?
Per-query baseline from a rolling window, deploy-event ingestion, a simple conditional threshold that accounts for diurnal and weekly seasonality, and a top-N ranker that surfaces the worst regressions rather than alerting on every one. This is a small enough stack that teams build it in-house. The limiting factor tends to be the deploy-event ingestion, because it crosses a workflow boundary between CI/CD and observability that most stacks do not connect by default.
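Those pieces, minus the deploy-event ingestion, fit in a few dozen lines. A sketch of the rolling-baseline-plus-top-N shape, with every name, the window size, and the 1.5x ratio cutoff assumed for illustration:

```python
from collections import deque
from statistics import median

class MinimumViableDetector:
    """Rolling per-query baseline plus a top-N ranker: surface the N worst
    regressions instead of alerting on every one. Sketch of the in-house
    stack described above; all names and thresholds are assumptions."""

    def __init__(self, window=100):
        self.history = {}
        self.window = window

    def observe(self, query, latency_ms):
        self.history.setdefault(
            query, deque(maxlen=self.window)).append(latency_ms)

    def top_regressions(self, current, n=3, min_ratio=1.5):
        scored = []
        for q, now in current.items():
            base = self.history.get(q)
            if base:
                scored.append((now / max(median(base), 1e-9), q))
        return [q for ratio, q in sorted(scored, reverse=True)[:n]
                if ratio > min_ratio]

det = MinimumViableDetector()
for _ in range(50):
    det.observe("checkout", 12.0)
    det.observe("search", 30.0)
    det.observe("report", 800.0)
print(det.top_regressions(
    {"checkout": 300.0, "search": 31.0, "report": 790.0}))  # ['checkout']
```

The ranker is what keeps the page rate bounded: the slowest-moving queries never fire, and only the worst ratio crosses the cutoff.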
Is the critique specific to Postgres, or does it apply to other DBs?
The shape of the argument transfers. MySQL's performance_schema, MongoDB's slow-query log, and Snowflake's query history all share the multi-seasonality, per-query population, and deploy-boundary characteristics. Postgres is the one this blog focuses on because it is Datapace's scope, but a similar essay could be written for any engine.
Sources
- A. C. Müller, "Open Challenges in Time Series Anomaly Detection: An Industry Perspective", arXiv 2025.
- "Insights into KPI-based Performance Anomaly Detection in Database Systems: A Comprehensive Study", Expert Systems with Applications, 2025.
- R. Wu and E. Keogh, "Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress", arXiv 2020.
- UCR Time Series Anomaly Archive, 2021.
- Numenta Anomaly Benchmark, github.com/numenta/NAB.
- Yahoo, S5 labeled anomaly detection dataset.
- Datapace blog, "The attribution gap: why DB research can't name the commit".