PostgreSQL Inside
PostgreSQL Inside
The longer we use PostgreSQL, the more we run into a few concepts at least once: MVCC, indexes, EXPLAIN, isolation levels, VACUUM, partitioning.
1. MVCC
Multi-Version Concurrency Control is a concurrency control technique that keeps multiple versions of the same row at once so that reads and writes do not block each other. PostgreSQL is built on this model.
- A transaction sees data through a snapshot taken at its start.
- An update creates a new row version and marks the old version with a transaction ID.
- Old versions that no transaction looks at anymore become candidates for VACUUM.
The advantage is that it offers a consistent snapshot without read locks. The limit is bloat from accumulated dead tuples.
2. Index types
PostgreSQL ships with several index access methods.
| Type | Where it fits |
|---|---|
| B-tree | Equality and range comparisons (default). Almost any equality or inequality search. |
| Hash | Equality only. WAL logging stabilized from 9.6+. |
| GIN | Multi-value columns (tsvector, jsonb, arrays). Full-text and key existence checks. |
| GiST | Ranges and spatial (PostGIS), part of similarity search. |
| SP-GiST | Space partitioning (Quad-tree, KD-tree style). |
| BRIN | Naturally sorted columns of very large tables (timestamps, etc.). Small index size. |
| BLOOM | Multi-column equality combination check (extension). |
Index choice depends on data distribution and query shape. The wrong index drives up disk and write cost before it helps performance.
3. EXPLAIN and EXPLAIN ANALYZE
EXPLAIN shows the execution plan the planner produced, while EXPLAIN ANALYZE measures actual execution time and row counts.
EXPLAIN ANALYZE
SELECT id, email FROM users WHERE email = 'a@b.com';
Common nodes we see when reading.
- Seq Scan: full table scan. Normal on small tables or low selectivity.
- Index Scan / Index Only Scan: access through an index. Index Only is meaningful only when the visibility map is clean.
- Bitmap Heap Scan: combines several index results as a bitmap and visits the heap once.
- Nested Loop / Hash Join / Merge Join: join algorithms. The planner picks based on data size and distribution.
actual timevscost: cost is estimated, actual time is measured. A large gap suggests stale statistics.
Turning on the BUFFERS option shows cache hits and disk reads.
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
4. Transaction isolation levels
The SQL standard defines four levels. Here is how PostgreSQL behaves.
| Level | PostgreSQL's actual behavior |
|---|---|
| Read Uncommitted | Treated identically to Read Committed (no dirty reads occur). |
| Read Committed | Default. A new snapshot per statement. Non-repeatable reads and phantoms possible. |
| Repeatable Read | Snapshot fixed at transaction start. SSI reinforcement blocks some phantoms. |
| Serializable | SSI (Serializable Snapshot Isolation, introduced in 9.1). Guarantees a serializable result. On conflict, one transaction fails. |
Raising the level boosts safety but increases conflict and retry cost. Most OLTP runs fine on Read Committed. Places like payments or inventory are worth reviewing for Serializable or explicit locks.
5. VACUUM and autovacuum
VACUUM reclaims dead tuples and refreshes the visibility map and statistics. The autovacuum daemon runs it automatically based on thresholds.
VACUUM(default): returns space to the free space map. Disk usage does not shrink.VACUUM FULL: rewrites the table and shrinks disk too. It holds an ACCESS EXCLUSIVE lock during the operation, so it is risky in production.ANALYZE: collects statistics. Affects planner accuracy.
Bloat happens when dead tuples accumulate faster than reclamation. Results are increased disk usage, index bloat, and lower cache hit rates. It is common on tables with heavy update or delete patterns. The pgstattuple extension is often mentioned as a diagnostic tool.
6. Partitioning
Declarative partitioning landed in PostgreSQL 10. A single table is split into child tables by a key.
CREATE TABLE events (
id BIGSERIAL,
occurred_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (occurred_at);
CREATE TABLE events_2026_04 PARTITION OF events
FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
The advantage is that index and VACUUM cost on a large table can be split per partition. The limit is that each partition keeps its own statistics, and there is (so far) no global index, so PK and unique constraints must include the partition key.
7. Distributed SQL and time-series specialists
When sharding automation and distributed strong consistency take priority, candidates like Citus (PostgreSQL extension), CockroachDB, and YugabyteDB come up. We trade single-node PostgreSQL's simplicity for global scale.
timescaledb adds time-series hypertables, compression, and continuous aggregate views on top of PostgreSQL. Standalone engines like InfluxDB occupy their own place, but timescaledb is often picked when the requirement is "we want to keep using existing SQL tools."
8. Operational tools
CREATE INDEX CONCURRENTLY— reduces locking when building an index on a table with production traffic (only one such build per table at a time).pg_stat_statements— shows accumulated execution time, count, and plan per query. The doorway to tuning.pg_stat_activity,pg_locks— inspect current sessions and locks. Track deadlocks.- Long-running transactions — long transactions hold the vacuum horizon and accelerate bloat. Periodic checks are needed.
9. Common pitfalls
JSONB index choice — pick GIN(jsonb_path_ops) or expression indexes to match the query shape. GIN is not always the right answer.
SELECT * plus large rows — TOASTed columns get pulled along and create surprising costs.
Bad statistics — if ANALYZE has not run recently or samples are small, the planner picks a poor plan. default_statistics_target can also be tuned per column.
SERIALIZABLE's retry obligation — when a conflict happens, the client must rerun the transaction. Libraries and code must assume this.
Index bloat — frequent updates to the same row break HOT updates and inflate index entries. Periodic REINDEX CONCURRENTLY is worth considering.
Closing thoughts
The internals of PostgreSQL do not need to be known all at once. Reading EXPLAIN is a start, and isolation levels and VACUUM only become meaningful after an incident. This article sits in front of those incidents before they hit production.
Next
- three-layer-cache
- redis-roles
References: Postgres MVCC docs, Postgres index types, Using EXPLAIN, Transaction isolation, Routine Vacuuming.