pgvector and RAG
The trend of combining large language model responses with external knowledge has settled under the name RAG (Retrieval-Augmented Generation). At its core sits vector similarity search, and PostgreSQL's pgvector extension is one of the most accessible ways to add it to an existing stack.
1. About pgvector
pgvector is a PostgreSQL extension that Andrew Kane published in 2021. It adds the vector type, distance operators, and vector indexes (IVFFlat and HNSW).
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    BIGINT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(1536) NOT NULL
);
Distance operators.
| Operator | Distance |
|---|---|
| `<->` | L2 (Euclidean) |
| `<=>` | Cosine distance |
| `<#>` | Negative inner product (smaller is more similar) |
Use `ORDER BY embedding <=> $1 LIMIT 10` to fetch the K nearest rows.
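To get a feel for the three operators side by side, a quick sketch against the chunks table above (with a vector_cosine_ops index, only the cosine ordering can be served by it):

-- Evaluate all three distances for the same query vector; the ORDER BY
-- on <=> is the part an index built with vector_cosine_ops can serve.
SELECT id,
       embedding <-> $1 AS l2_distance,
       embedding <=> $1 AS cosine_distance,
       embedding <#> $1 AS neg_inner_product
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;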
2. Vector similarity
The starting assumption is that mapping text or images into fixed-dimensional real vectors places semantically close items close in the vector space. Three distance measures are common.
- Cosine similarity — compares direction only; differences in magnitude are ignored.
- L2 (Euclidean) — sensitive to magnitude as well as direction.
- Inner product — for normalized vectors, effectively the same ranking as cosine.
Embedding API responses from OpenAI and Cohere are usually normalized, so cosine is often used. Model documentation often specifies the recommended distance.
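Since pgvector's cosine distance is 1 minus cosine similarity, and the inner product of unit vectors equals their cosine similarity, the two orderings coincide for normalized embeddings. A quick check, assuming both the stored vectors and the query vector are normalized:

-- For unit-length vectors these two columns should match:
-- cosine distance = 1 - cos_sim = 1 + (negative inner product).
SELECT id,
       embedding <=> $1       AS cosine_distance,
       1 + (embedding <#> $1) AS via_inner_product
FROM chunks
LIMIT 5;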
3. IVFFlat
Brute force (a Seq Scan plus a distance calculation per row) is fast enough on small data. As the data grows, ANN (Approximate Nearest Neighbor) indexes come into play.
IVFFlat pre-partitions the vector space into K clusters (lists) and looks at only the nearest few clusters (probes) at search time.
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
SELECT id FROM chunks ORDER BY embedding <=> $1 LIMIT 10;
- Fast build, low memory.
- Accuracy and speed are decided by the balance of `lists` and `probes`.
- Build the index after loading enough data so clustering quality is good.
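One way to sanity-check recall is to compare the approximate results against an exact scan. Disabling index scans for a single transaction forces the true ordering (a debugging sketch, not something for production paths):

-- Exact results for comparison: with index scans off, the ORDER BY
-- falls back to a sequential scan computing true distances.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM chunks ORDER BY embedding <=> $1 LIMIT 10;
COMMIT;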
4. HNSW
HNSW (Hierarchical Navigable Small World) was proposed by Yu. A. Malkov and D. A. Yashunin in their 2018 paper. It builds a multilayer graph whose sparse upper layers guide the search down to denser lower layers, finding nearest neighbors quickly.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 40;
- Search accuracy and speed are generally excellent.
- Index size and build time are larger than IVFFlat.
- Strong with incremental additions.
HNSW support landed in pgvector 0.5.0, and later releases improved build performance and memory usage.
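Index builds are the expensive part, and ordinary PostgreSQL knobs help. A sketch with placeholder values to tune for your hardware (parallel HNSW builds need pgvector 0.6 or newer):

-- More maintenance memory keeps the graph build in RAM; CONCURRENTLY
-- avoids blocking writes at the cost of a slower build.
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 4;
CREATE INDEX CONCURRENTLY chunks_embedding_hnsw_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);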
5. Embedding models
| Model | Provider | Dimension | Memo |
|---|---|---|---|
| OpenAI text-embedding-3-small | OpenAI API | 1536 (reducible) | Multilingual, relatively cheap |
| OpenAI text-embedding-3-large | OpenAI API | 3072 | High quality |
| Cohere embed-multilingual-v3.0 | Cohere API | 1024 | Strong multilingual |
| BGE (BAAI) | Open weights | 384/768/1024 | Self-hostable |
| nomic-embed-text | Nomic | 768 | Open weights |
| Sentence-Transformers | Open | Various | Many small models |
Higher dimensions generally raise accuracy, but storage and search costs rise proportionally. Models trained with Matryoshka representation learning allow trimming dimensions after the fact.
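For models trained this way, pgvector 0.7.0 and newer can do the trimming in SQL with subvector, re-normalizing afterwards so cosine distance stays meaningful. A sketch:

-- Keep the first 256 dimensions of a Matryoshka-style embedding and
-- re-normalize the result (both functions require pgvector >= 0.7).
SELECT id,
       l2_normalize(subvector(embedding, 1, 256)) AS embedding_256
FROM chunks
LIMIT 5;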
6. Comparison with dedicated vector DBs
| System | Origin and year | Memo |
|---|---|---|
| Pinecone | 2019, managed SaaS | Simple ops, usage-based pricing |
| Qdrant | 2021, Rust | Self-hostable, strong filtering |
| Weaviate | 2019, Go | Modular, hybrid search |
| Milvus | 2019, open | Large scale, GPU acceleration |
| Chroma | 2022, Python | Local-dev friendly |
| Vespa | open-sourced 2017 (Yahoo) | Search, ranking, vectors combined |
| Elasticsearch / OpenSearch | 2010 / 2021 | Full-text plus vector (kNN) |
Strengths of dedicated vector DBs.
- Very large scale (billions or more) and distributed sharding.
- Pre-filtering and metadata combinations are often faster.
- Vector-specific ops tools (statistics, tuning).
Strengths of pgvector.
- Combine vectors, relational data, and transactions inside one DB.
- One operational surface (backups, auth, monitoring all in PostgreSQL).
- Sufficient performance at small to medium scale.
The choice balances data scale, operational headcount, and accuracy needs.
7. RAG pipeline
source → chunking → embedding → storage → retrieval → reranking → LLM context → response
Chunking — split documents into small pieces. Too large and retrieval loses focus; too small and meaning is fragmented across pieces. Common strategies (a SQL sketch follows the list):
- Fixed token or character count (e.g. 512 tokens with 50-token overlap).
- Markdown headings or sentence boundaries.
- Semantic-based (estimate boundaries with another model).
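A minimal sketch of the fixed-size strategy in SQL, purely for illustration. It assumes a hypothetical documents(id, body) table; token-based chunking normally happens in application code before embedding:

-- 512-character windows with 64 characters of overlap (stride 448);
-- each window becomes one chunk, to be embedded in a later step.
SELECT d.id AS doc_id,
       substr(d.body, s.start_pos, 512) AS content
FROM documents AS d
CROSS JOIN LATERAL generate_series(1, length(d.body), 448) AS s(start_pos);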
Embedding and storage — compute embeddings per chunk via a model API or a local model and store in a vector column. Keep metadata (document ID, source URL, date, tags) as separate columns.
Retrieval — embed the query with the same model and fetch the K nearest with `<=>`. Combining metadata filters (date, document type) is where the RDBMS `WHERE` shines.
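A sketch of filtered retrieval, assuming a hypothetical published_at metadata column has been added to chunks:

-- Vector similarity and ordinary relational filtering in one query.
SELECT id, content
FROM chunks
WHERE doc_id = $2
  AND published_at >= now() - interval '1 year'
ORDER BY embedding <=> $1
LIMIT 10;

One caveat: with an ANN index the filter is applied after the approximate scan, so a highly selective filter can return fewer than K rows; pgvector 0.8.0 added iterative index scans to mitigate this.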
8. Reranking and hybrid search
Vector search is fast, but its top hits are not always the most relevant. A common pattern is to fetch 30 to 50 candidates first, rerank them with a cross-encoder or an LLM, and feed the top 5 to 10 into the LLM context. Commonly used rerankers include Cohere Rerank, BGE Reranker, and the cross-encoder/ms-marco family.
Hybrid search combines keyword search (BM25, tsvector) with vector search. In documents heavy on abbreviations and proper nouns, keyword search fills the gaps vector search misses. Simple fusion functions like RRF (Reciprocal Rank Fusion) are common; see the sketch below.
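A sketch of RRF in plain SQL, assuming a hypothetical tsv tsvector column alongside embedding; 60 is the conventional smoothing constant:

-- Rank the top 50 by full-text relevance and by vector distance,
-- then fuse: score = sum over sources of 1 / (60 + rank).
WITH keyword AS (
    SELECT id, row_number() OVER (ORDER BY ts_rank(tsv, q) DESC) AS r
    FROM chunks, to_tsquery('english', $2) AS q
    WHERE tsv @@ q
    ORDER BY r
    LIMIT 50
),
semantic AS (
    SELECT id, row_number() OVER (ORDER BY embedding <=> $1) AS r
    FROM chunks
    ORDER BY embedding <=> $1
    LIMIT 50
)
SELECT id,
       COALESCE(1.0 / (60 + k.r), 0) + COALESCE(1.0 / (60 + s.r), 0) AS rrf_score
FROM keyword AS k
FULL OUTER JOIN semantic AS s USING (id)
ORDER BY rrf_score DESC
LIMIT 10;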
9. Common pitfalls
Re-embedding is mandatory after a model change — when the embedding model changes, the vector space changes too. Old vectors and new queries live in different spaces, and comparing them is meaningless. Plan the migration early.
Dimension mismatch — inserting a vector of a different dimension into a vector(1536) column errors out. Keep per-model embeddings in separate columns or tables.
Index build timing — building IVFFlat when there is little data leaves clustering incomplete. Recommended to build the index after some load.
Distance operator must match index — when an index was built with vector_cosine_ops, only <=> uses it. Other operators fall back to Seq Scan.
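EXPLAIN makes the mismatch easy to catch. A sketch that borrows an existing row's embedding as the query vector (the same pattern the pgvector README uses):

-- Served by the cosine index:
EXPLAIN SELECT id FROM chunks
ORDER BY embedding <=> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;
-- Falls back to a sequential scan (the opclass does not match <->):
EXPLAIN SELECT id FROM chunks
ORDER BY embedding <-> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;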
Meaning broken at chunk boundaries — mid-sentence cuts degrade retrieval. Use overlap or align chunks to sentence boundaries.
Context window overflow — feeding all retrieval results to the LLM exceeds the token limit. Combine reranking with summarization.
Closing thoughts
In vector search, chunking, reranking, and metadata design are harder than the mechanics of the search itself. Index type and dimension choices are expensive to revisit, so make them carefully at the start.
Next
- supabase
- fcm-push
References: pgvector GitHub, pgvector indexing guide, HNSW paper, OpenAI Embeddings, Pinecone Learn — RAG, Cohere Rerank.