One sequential scan, one missing index, one larger instance
A sequential scan on a hot table is both a performance problem and a cost problem. The database does not distinguish. The fix for both is the same index, and catching it at PR time is straightforward.
A sequential scan on a table with a hundred million rows is always the same story. The query gets progressively slower as the table grows, nobody notices on day one, the database CPU creeps up week by week, and eventually the team is on a larger instance that costs twice as much. The performance problem and the cost problem are the same problem. The fix for both is one index.
What a sequential scan actually is
When Postgres executes SELECT ... WHERE some_column = $1, it has two options. Option one: walk an index on some_column, jump to the matching row, return it. Option two: read every page of the table from start to finish, check each row, return the ones that match. Option two is a sequential scan, also called a Seq Scan in EXPLAIN output.
Sequential scans are not bad in themselves. On a small table, they are the fastest plan because the whole table fits in a handful of pages. The planner will prefer a sequential scan over an index scan when it estimates the scan will cost less. That decision is rational at small scale and wrong at large scale. The planner re-evaluates every execution, but only with the statistics it has about the table. If the statistics are stale, or the query returns many more rows than expected, Postgres chooses the wrong option.
The moment a sequential scan becomes a problem is when a hot read path runs it on a table that keeps growing. A users table with ten thousand rows is a seq scan under 5 ms. The same table with a hundred million rows is a seq scan taking three seconds and reading three gigabytes into the buffer cache on every call.
The performance cost
Concrete. Consider a sessions table with 50 million rows, a row width of 240 bytes (so 12 GB), and a query pattern SELECT * FROM sessions WHERE user_id = $1 running five thousand times per minute.
Before an index:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM sessions WHERE user_id = 12345;
Seq Scan on sessions
Filter: (user_id = 12345)
Rows Removed by Filter: 49,999,988
Buffers: shared hit=1,024 read=1,570,432
(actual time=2,104.220..2,847.190 rows=12 loops=1)
Execution Time: 2,847.412 ms
Every call reads 12 GB of pages. Most come from disk. Execution time is three seconds. Five thousand calls per minute times three seconds is 250 minutes of database-CPU-seconds spent per wall-clock minute, which on any realistic instance means 100 percent CPU and no headroom for anything else.
After an index:
CREATE INDEX CONCURRENTLY ON sessions (user_id);
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM sessions WHERE user_id = 12345;
Index Scan using sessions_user_id_idx on sessions
Index Cond: (user_id = 12345)
Buffers: shared hit=16 read=2
(actual time=0.208..0.740 rows=12 loops=1)
Execution Time: 0.820 ms
The plan changes from Seq Scan to Index Scan. Buffer reads drop from 1.5 million to 18. Execution time drops from 2,847 ms to 0.82 ms. Same data, same result, 3,472x faster.
The cost cost
Cloud Postgres pricing is dominated by CPU and I/O. Paying to scan 12 GB of pages five thousand times a minute is paying to move roughly 3.6 TB of data through the buffer cache every hour. That does not come from nowhere. It comes from your instance's CPU budget, from its provisioned IOPS, and from whatever throughput allowance your storage tier gives you.
The cost of the missing index is the difference between the instance you are running and the instance you could be running. If a sequential scan is pushing CPU to 80 percent during peak, the rational response is to move to a larger instance that can absorb the spike. That migration is usually permanent: you do not typically scale back down after the scan is fixed, because nobody thinks to. The extra instance size is paid every month forever, for a problem that was one index create away from not existing.
Before the index
Seq ScanAfter the index
Index ScanFinding the sequential scans that matter
Not every sequential scan is worth fixing. A Seq Scan on a thousand-row lookup table is fine. The ones that matter are seq scans on hot tables that keep growing. The query below identifies them directly from Postgres statistics, without guessing:
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
n_live_tup,
CASE
WHEN seq_scan = 0 THEN 0
ELSE seq_tup_read / seq_scan
END AS avg_rows_read_per_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
AND n_live_tup > 100000
ORDER BY seq_tup_read DESC
LIMIT 20;
Sort by seq_tup_read: this is the total number of rows read by sequential scans on each table. A high value with a large n_live_tup (row count) and low idx_scan (index scan count) is the signature of a table that is being scanned end-to-end on a hot query path. That is the table that needs an index. avg_rows_read_per_scan tells you how expensive each scan is individually; if it is close to n_live_tup, every scan is reading the entire table.
Adding the index without an outage
A naive CREATE INDEX takes a SHARE lock that blocks writes for the duration of the build. On a 50-million-row table, the build takes long enough to page someone. Always use CREATE INDEX CONCURRENTLY on production tables:
CREATE INDEX CONCURRENTLY ON sessions (user_id);
CONCURRENTLY uses a weaker lock that does not block writes, and the build runs in the background. The trade-off is that the build takes longer (typically 2x) and cannot run inside a transaction. If the build fails midway, Postgres leaves an invalid index behind that has to be dropped manually: DROP INDEX CONCURRENTLY sessions_user_id_idx;. Verify the final state with \d sessions.
Catching the next one at PR time
Sequential scans on hot paths are one of the easiest database problems to detect statically. The check is: for every WHERE clause in the application's SQL, which columns are filtered, and do those columns have indexes on them. The answer lives in the same schema the database already exposes. A tool running at PR time can read the new DDL in the migration, cross-reference against pg_stats for the target table, and flag a missing index before the migration lands. That is what Datapace does here, and the check is deterministic, not probabilistic: either the index exists or it does not. The agent can optionally suggest the CREATE INDEX CONCURRENTLY statement to add alongside the migration.
A useful caveat
Adding indexes is not free. Each index slows down every INSERT, UPDATE, and DELETE on the table, because the index itself has to be updated. On a write-heavy table, indexing every filterable column will make writes pay for reads they do not need. The calculus is: an index pays for itself when the read path runs often enough, or costs enough individually, that its read benefit exceeds the write overhead. For a cold audit log that is mostly written and rarely read, no index. For a sessions table that is mostly read and rarely written, every filter column indexed.
Closing note
The performance cost and the billing cost of a missing index are two views of the same sequential scan. The database does not distinguish. A query that reads 1.5 million pages from disk costs CPU time and I/O quota. Both of those are already paid for in the instance class your team chose, and if the scan pushes them above threshold, the team upgrades the instance. That upgrade is the cost of the missing index, compounding monthly, indefinitely. One CREATE INDEX CONCURRENTLY is a cheaper fix than a forever-larger instance.
Frequently asked questions
How do I know if I have a missing index?
Query pg_stat_user_tables sorted by seq_tup_read DESC. A row with a large table size (n_live_tup), high sequential-scan row count, and low index-scan count is being scanned on hot paths. That is the signature of a missing index. Cross-reference against pg_stat_statements to find the specific query doing the scanning.
Why would Postgres choose a sequential scan when an index exists?
Because the planner estimates it will be cheaper. This happens when statistics are stale (run ANALYZE <table>), when the query returns so many rows that walking the index is actually slower than reading everything, or when the index is not usable for the filter (wrong column order, wrong operator class, function applied to the column). EXPLAIN ANALYZE will show you which.
Is CREATE INDEX CONCURRENTLY always safe?
Safer, not always safe. It avoids blocking writes, but it still consumes CPU and I/O during the build, which can degrade other queries on a heavily loaded system. It also cannot run inside a transaction, so migration tooling that wraps every statement in a BEGIN; ... COMMIT; will reject it. Most migration tools have an opt-out marker for this specific statement. If the build fails, the invalid index has to be dropped manually.
What about partial indexes and expression indexes?
Both are first-class solutions. A partial index (CREATE INDEX ... WHERE condition) is smaller and faster to maintain when only a subset of the table is ever queried, for example WHERE archived = false. An expression index (CREATE INDEX ... ON table (lower(email))) is required when the query applies a function to the column in its filter. Both are usable by the planner if the query's filter matches the index predicate or expression exactly.
Sources
- PostgreSQL documentation, CREATE INDEX
- PostgreSQL documentation, pg_stat_user_tables
- PostgreSQL documentation, Partial indexes
Want to optimize your database performance?
Get AI-powered recommendations for your specific database setup.