pgvector and RAG
The trend of combining large language model responses with external knowledge has settled under the name RAG (Retrieval-Augmented Generation). At its core sits vector similarity search, and PostgreSQL's pgvector extension is one of the most accessible ways to add it to an existing stack.
1. About pgvector
pgvector is a PostgreSQL extension that Andrew Kane published in 2021. It adds the vector type, distance operators, and vector indexes (IVFFlat and HNSW).
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    BIGINT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(1536) NOT NULL
);
Distance operators.
| Operator | Distance |
|---|---|
| `<->` | L2 (Euclidean) |
| `<=>` | Cosine distance |
| `<#>` | Negative inner product (smaller is more similar) |
Use `ORDER BY embedding <=> $1 LIMIT 10` to fetch the K nearest rows.
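To get a feel for the three operators side by side, a quick sketch against the chunks table above (with a vector_cosine_ops index, only the cosine ordering can be served by it):

-- Evaluate all three distances for the same query vector; the ORDER BY
-- on <=> is the part an index built with vector_cosine_ops can serve.
SELECT id,
       embedding <-> $1 AS l2_distance,
       embedding <=> $1 AS cosine_distance,
       embedding <#> $1 AS neg_inner_product
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;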
2. Vector similarity
The starting assumption is that mapping text or images into fixed-dimensional real vectors places semantically close items close in the vector space. Three distance measures are common.
- Cosine similarity — compares direction only; differences in magnitude are ignored.
- L2 (Euclidean) — sensitive to magnitude as well as direction.
- Inner product — for normalized vectors, effectively the same ranking as cosine.
Embedding API responses from OpenAI and Cohere are usually normalized, so cosine is often used. Model documentation often specifies the recommended distance.
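Since pgvector's cosine distance is 1 minus cosine similarity, and the inner product of unit vectors equals their cosine similarity, the two orderings coincide for normalized embeddings. A quick check, assuming both the stored vectors and the query vector are normalized:

-- For unit-length vectors these two columns should match:
-- cosine distance = 1 - cos_sim = 1 + (negative inner product).
SELECT id,
       embedding <=> $1       AS cosine_distance,
       1 + (embedding <#> $1) AS via_inner_product
FROM chunks
LIMIT 5;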
3. IVFFlat
Brute force (a Seq Scan plus a distance calculation per row) is fast enough on small data. As the data grows, ANN (Approximate Nearest Neighbor) indexes come into play.
IVFFlat pre-partitions the vector space into K clusters (lists) and looks at only the nearest few clusters (probes) at search time.
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
SELECT id FROM chunks ORDER BY embedding <=> $1 LIMIT 10;
- Fast build, low memory.
- Accuracy and speed are decided by the balance of `lists` and `probes`.
- Build the index after loading enough data so clustering quality is good.
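One way to sanity-check recall is to compare the approximate results against an exact scan. Disabling index scans for a single transaction forces the true ordering (a debugging sketch, not something for production paths):

-- Exact results for comparison: with index scans off, the ORDER BY
-- falls back to a sequential scan computing true distances.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM chunks ORDER BY embedding <=> $1 LIMIT 10;
COMMIT;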
4. HNSW
HNSW (Hierarchical Navigable Small World) was proposed by Yu. A. Malkov and D. A. Yashunin in their 2018 paper. It builds a multilayer graph whose sparse upper layers guide the search down to denser lower layers, finding nearest neighbors quickly.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 40;
- Search accuracy and speed are generally excellent.
- Index size and build time are larger than IVFFlat.
- Strong with incremental additions.
HNSW support landed in pgvector 0.5.0, and later releases improved build performance and memory usage.
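Index builds are the expensive part, and ordinary PostgreSQL knobs help. A sketch with placeholder values to tune for your hardware (parallel HNSW builds need pgvector 0.6 or newer):

-- More maintenance memory keeps the graph build in RAM; CONCURRENTLY
-- avoids blocking writes at the cost of a slower build.
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 4;
CREATE INDEX CONCURRENTLY chunks_embedding_hnsw_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);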
5. Embedding models
| Model | Provider | Dimension | Memo |
|---|---|---|---|
| OpenAI text-embedding-3-small | OpenAI API | 1536 (reducible) | Multilingual, relatively cheap |
| OpenAI text-embedding-3-large | OpenAI API | 3072 | High quality |
| Cohere embed-multilingual-v3.0 | Cohere API | 1024 | Strong multilingual |
| BGE (BAAI) | Open weights | 384/768/1024 | Self-hostable |
| nomic-embed-text | Nomic | 768 | Open weights |
| Sentence-Transformers | Open | Various | Many small models |
Higher dimensions generally raise accuracy, but storage and search costs rise proportionally. Models trained with Matryoshka representation learning allow trimming dimensions after the fact.
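For models trained this way, pgvector 0.7.0 and newer can do the trimming in SQL with subvector, re-normalizing afterwards so cosine distance stays meaningful. A sketch:

-- Keep the first 256 dimensions of a Matryoshka-style embedding and
-- re-normalize the result (both functions require pgvector >= 0.7).
SELECT id,
       l2_normalize(subvector(embedding, 1, 256)) AS embedding_256
FROM chunks
LIMIT 5;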
6. Comparison with dedicated vector DBs
| System | Origin and year | Memo |
|---|---|---|
| Pinecone | 2019, managed SaaS | Simple ops, usage-based pricing |
| Qdrant | 2021, Rust | Self-hostable, strong filtering |
| Weaviate | 2019, Go | Modular, hybrid search |
| Milvus | 2019, open | Large scale, GPU acceleration |
| Chroma | 2022, Python | Local-dev friendly |
| Vespa | open-sourced 2017 (Yahoo) | Search, ranking, vectors combined |
| Elasticsearch / OpenSearch | 2010 / 2021 | Full-text plus vector (kNN) |
Strengths of dedicated vector DBs.
- Very large scale (billions or more) and distributed sharding.
- Pre-filtering and metadata combinations are often faster.
- Vector-specific ops tools (statistics, tuning).
Strengths of pgvector.
- Combine vectors, relational data, and transactions inside one DB.
- One operational surface (backups, auth, monitoring all in PostgreSQL).
- Sufficient performance at small to medium scale.
The choice balances data scale, operational headcount, and accuracy needs.
7. RAG pipeline
source → chunking → embedding → storage → retrieval → reranking → LLM context → response
Chunking — split documents into small pieces. Too large and retrieval loses focus; too small and meaning is fragmented across pieces. Common strategies (a SQL sketch follows the list):
- Fixed token or character count (e.g. 512 tokens with 50-token overlap).
- Markdown headings or sentence boundaries.
- Semantic-based (estimate boundaries with another model).
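A minimal sketch of the fixed-size strategy in SQL, purely for illustration. It assumes a hypothetical documents(id, body) table; token-based chunking normally happens in application code before embedding:

-- 512-character windows with 64 characters of overlap (stride 448);
-- each window becomes one chunk, to be embedded in a later step.
SELECT d.id AS doc_id,
       substr(d.body, s.start_pos, 512) AS content
FROM documents AS d
CROSS JOIN LATERAL generate_series(1, length(d.body), 448) AS s(start_pos);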
Embedding and storage — compute embeddings per chunk via a model API or a local model and store in a vector column. Keep metadata (document ID, source URL, date, tags) as separate columns.
Retrieval — embed the query with the same model and fetch the K nearest with `<=>`. Combining metadata filters (date, document type) is where the RDBMS `WHERE` shines.
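A sketch of filtered retrieval, assuming a hypothetical published_at metadata column has been added to chunks:

-- Vector similarity and ordinary relational filtering in one query.
SELECT id, content
FROM chunks
WHERE doc_id = $2
  AND published_at >= now() - interval '1 year'
ORDER BY embedding <=> $1
LIMIT 10;

One caveat: with an ANN index the filter is applied after the approximate scan, so a highly selective filter can return fewer than K rows; pgvector 0.8.0 added iterative index scans to mitigate this.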
8. Reranking and hybrid search
Vector search is fast, but its top hits are not always the most relevant. A common pattern is to fetch 30 to 50 candidates first, rerank them with a cross-encoder or an LLM, and feed the top 5 to 10 into the LLM context. Commonly used rerankers include Cohere Rerank, BGE Reranker, and the cross-encoder/ms-marco family.
Hybrid search combines keyword search (BM25, tsvector) with vector search. In documents heavy on abbreviations and proper nouns, keyword search fills the gaps vector search misses. Simple fusion functions like RRF (Reciprocal Rank Fusion) are common; see the sketch below.
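A sketch of RRF in plain SQL, assuming a hypothetical tsv tsvector column alongside embedding; 60 is the conventional smoothing constant:

-- Rank the top 50 by full-text relevance and by vector distance,
-- then fuse: score = sum over sources of 1 / (60 + rank).
WITH keyword AS (
    SELECT id, row_number() OVER (ORDER BY ts_rank(tsv, q) DESC) AS r
    FROM chunks, to_tsquery('english', $2) AS q
    WHERE tsv @@ q
    ORDER BY r
    LIMIT 50
),
semantic AS (
    SELECT id, row_number() OVER (ORDER BY embedding <=> $1) AS r
    FROM chunks
    ORDER BY embedding <=> $1
    LIMIT 50
)
SELECT id,
       COALESCE(1.0 / (60 + k.r), 0) + COALESCE(1.0 / (60 + s.r), 0) AS rrf_score
FROM keyword AS k
FULL OUTER JOIN semantic AS s USING (id)
ORDER BY rrf_score DESC
LIMIT 10;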
9. Common pitfalls
Re-embedding is mandatory after a model change — when the embedding model changes, the vector space changes too. Old vectors and new queries live in different spaces, and comparing them is meaningless. Plan the migration early.
Dimension mismatch — inserting a vector of a different dimension into a vector(1536) column errors out. Keep per-model embeddings in separate columns or tables.
Index build timing — building IVFFlat when there is little data leaves clustering incomplete. Recommended to build the index after some load.
Distance operator must match index — when an index was built with vector_cosine_ops, only <=> uses it. Other operators fall back to Seq Scan.
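EXPLAIN makes the mismatch easy to catch. A sketch that borrows an existing row's embedding as the query vector (the same pattern the pgvector README uses):

-- Served by the cosine index:
EXPLAIN SELECT id FROM chunks
ORDER BY embedding <=> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;
-- Falls back to a sequential scan (the opclass does not match <->):
EXPLAIN SELECT id FROM chunks
ORDER BY embedding <-> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;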
Meaning broken at chunk boundaries — mid-sentence cuts degrade retrieval. Use overlap or align chunks to sentence boundaries.
Context window overflow — feeding all retrieval results to the LLM exceeds the token limit. Combine reranking with summarization.
Closing thoughts
In vector search, chunking, reranking, and metadata design are harder than the mechanics of the search itself. Index type and dimension choices are expensive to revisit, so make them carefully at the start.
Next
- supabase
- fcm-push
References: pgvector GitHub, pgvector indexing guide, HNSW paper, OpenAI Embeddings, Pinecone Learn — RAG, Cohere Rerank.