RAG and Search — Chunking · Embedding · Reranking · Hybrid
RAG (Retrieval-Augmented Generation) is an approach that supplements an LLM's knowledge, bounded by its training cutoff and context limits, with external sources. Because retrieval and generation are separated, new material can be handled without retraining the model.
1. About RAG
The term RAG was cemented by the 2020 NeurIPS paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis et al. The idea is simple: retrieve related documents from an external corpus and include them in the model's input at generation time.
source → preprocess·chunk → embed → store in index
↓
user query → embed → candidate retrieval → rerank → context → LLM → response
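The same flow as a minimal Python sketch; embed(), vector_search(), rerank(), and generate() are hypothetical helpers standing in for the stages in the diagram, not functions from any particular library:

def answer(query: str) -> str:
    q_vec = embed(query)                      # same embedding model as at index time
    candidates = vector_search(q_vec, k=50)   # ANN candidate retrieval
    best = rerank(query, candidates)[:5]      # cross-encoder or LLM re-scoring
    context = "\n\n".join(best)
    prompt = f"Answer from this context only:\n{context}\n\nQ: {query}"
    return generate(prompt)                   # LLM call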
2. Chunking strategies
Documents often cannot fit whole into the LLM context, so we split them into smaller units. There is no single right answer for chunking; it depends on the data and domain.
| Strategy | Description | Pros / Cons |
|---|---|---|
| Fixed size | By token · char count (e.g. 512 tokens) | Simple · ignores semantic boundaries |
| Sliding window | Fixed size + overlap (e.g. 50 tokens) | Mitigates boundary loss · increases storage |
| Semantic | Estimate boundaries by sentence-embedding similarity | Preserves meaning · expensive |
| Hierarchical | Document → section → paragraph → chunk | Search by abstraction · complex impl |
| Structure-based | Markdown headings · HTML · code AST | For materials with clear structure |
Overlap is usually 10~20% of chunk size. Too much increases redundant retrieval · cost; too little cuts off sentences spanning boundaries.
For mixed Korean·English documents, the 256~768 token range is often cited. Code and technical documents suit structure-based chunking by function or heading.
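As a concrete example, a minimal sliding-window chunker in Python. Token counts are approximated by whitespace words here; a real implementation would count with the embedding model's tokenizer:

def sliding_window_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap (word count as a token proxy)."""
    tokens = text.split()
    step = size - overlap                        # advance by size minus overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + size]))
    return chunks

With size=512 and overlap=64, consecutive chunks share 12.5% of their content, inside the 10~20% range above.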
3. Embedding → search
Pass chunks through an embedding model to make fixed-dimensional vectors. The query is embedded with the same model, and the K nearest by distance · similarity are retrieved. Cosine is the most common distance, but follow the model's recommendation.
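A minimal sketch with sentence-transformers and NumPy; the model name is only an example, and in practice the query must be embedded with the same model used at index time:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # example model, 384 dimensions

def top_k(query: str, chunks: list[str], k: int = 5) -> list[tuple[float, str]]:
    # normalize_embeddings=True makes dot product equal cosine similarity
    vecs = model.encode(chunks, normalize_embeddings=True)
    q = model.encode([query], normalize_embeddings=True)[0]
    sims = vecs @ q                                # cosine similarity per chunk
    order = np.argsort(-sims)[:k]
    return [(float(sims[i]), chunks[i]) for i in order]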
4. pgvector
With the pgvector extension on Postgres, vector search lives inside the same DB. It provides distance operators (<-> L2 · <=> cosine · <#> negative inner product) and two ANN index types:
- IVFFlat — Clustering-based. Fast to build · low memory. The index should be built after some data has accumulated, for cluster quality.
- HNSW — Multi-layer graph based (Malkov · Yashunin 2018). Excellent search accuracy·speed; longer build time · more memory. Strong with incremental adds.
Choosing between this and a dedicated vector DB (Pinecone · Qdrant · Weaviate · Milvus) is a balance of data scale · operational surface · filtering needs.
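A hedged setup sketch via Python and psycopg, using pgvector's Python helper; the DSN, table, and column names are made up for illustration, and the 384 dimensions match the example embedding model above:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=rag", autocommit=True)   # example DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)   # teaches psycopg to adapt numpy arrays to vector
conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(384))""")
# HNSW index with cosine ops; only <=> queries will use it
conn.execute("""CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops)""")

q_vec = np.random.rand(384).astype(np.float32)   # stand-in for a real query embedding
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 10",
    (q_vec,),
).fetchall()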
5. Reranking
Vector search is fast but has limits in semantic accuracy. Reranking fetches 30~100 candidates first, then re-scores them with a cross-encoder or LLM and keeps only the top 5~10.
| Model | Note |
|---|---|
| Cohere Rerank | Commercial API. |
| BGE-Reranker (BAAI) | Open weights, multilingual variants. |
| Cross-Encoder/ms-marco | Published by Sentence-Transformers. |
| Jina Reranker | Multilingual · code variants. |
A bi-encoder (embedding model) encodes the two texts separately, while a cross-encoder feeds query and document together for deeper interaction. Accurate but expensive, so use it only on the narrowed candidate set after first-pass retrieval.
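A minimal reranking sketch with the sentence-transformers CrossEncoder; the model name is one of the ms-marco cross-encoders from the table above, and candidate retrieval is assumed to have happened already:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # the cross-encoder scores each (query, document) pair jointly
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:keep]]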
6. Hybrid search
Combine keyword search (BM25 · Postgres tsvector) with vector search. Tokens like abbreviations · proper nouns · numbers can be weak for embeddings, and keyword search complements them.
A simple combination — RRF (Reciprocal Rank Fusion):
score(d) = Σ 1 / (k + rank_i(d)) # k is usually 60
RRF uses only each retriever's rank, never the raw scores, yet it works well in practice.
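The formula above, directly in Python; each input is a ranked list of document ids, best first:

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists using ranks only."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ids, vector_ids])   # bm25_ids, vector_ids: ranked id lists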
7. Evaluation metrics
Retrieval stage:
| Metric | Meaning |
|---|---|
| Recall@K | Fraction of queries whose answer document appears in the top K |
| Precision@K | Fraction of the top K results that are relevant |
| MRR | Mean reciprocal rank of the first answer |
| nDCG | Cumulative gain weighted by rank |
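Given a small evaluation set, the first two rows of this table are a few lines of Python; results holds one ranked id list per query, answers the matching relevant id:

def recall_at_k(results: list[list[str]], answers: list[str], k: int = 5) -> float:
    hits = sum(ans in res[:k] for res, ans in zip(results, answers))
    return hits / len(answers)

def mrr(results: list[list[str]], answers: list[str]) -> float:
    total = 0.0
    for res, ans in zip(results, answers):
        if ans in res:
            total += 1.0 / (res.index(ans) + 1)   # reciprocal rank of first hit
    return total / len(answers)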
Generation stage — frameworks like RAGAS · TruLens evaluate context faithfulness · accuracy · answer relevance via an LLM. Treat automated evaluation as a reference and pair it with manual review.
8. Other paths
- Long-context LLMs — Approaches like Gemini 1.5 Pro's 1M tokens, Claude 3's 200k, that put everything in without RAG. Balance with cost · latency · "lost in the middle."
- Fine-tuning — Train directly on domain data. When the material is mostly static and there's enough volume.
- Graph RAG — Build documents into entity·relation graphs. Microsoft published the GraphRAG case in 2024.
- Agentic retrieval — The LLM iteratively generates and refines search queries.
9. Combining metadata
Store creation date · source URL · document type · tags as columns alongside chunks. pgvector's strength is that these combine directly with ordinary RDBMS WHERE clauses:
SELECT id, content
FROM chunks
WHERE doc_type = 'manual' AND created_at > now() - interval '30 days'
ORDER BY embedding <=> $1
LIMIT 20;
If the filter is too selective, most of the candidates the ANN index returns get discarded afterwards, leaving too few results and hurting recall. The prefilter-vs-postfilter decision depends on the data distribution.
10. Spots where you often get stuck
Re-embedding when changing the embedding model — vector spaces of different models are not compatible. Plan migration from the start.
Dimension mismatch — A vector(1536) column won't accept other dimensions. Separate columns·tables per model.
Distance operator and index match — If the index is built with vector_cosine_ops, only <=> uses the index.
Context window overflow — Putting all retrieval results into the LLM exceeds the limit. Combine with reranking · summarization.
Lost in the middle — The observation that information placed in the middle of context is reflected less by models than at either end (Liu et al. 2023). Place key material at the front and back.
Duplicate documents — When the same content enters as multiple chunks, retrieval results fill with the same information. Dedup, or apply MMR (Maximal Marginal Relevance; a sketch follows after this list).
No evaluation data — Without query·answer pairs from your own domain, all tuning is guesswork. Start with a small evaluation set.
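A minimal MMR sketch, assuming pre-normalized embedding vectors so dot product equals cosine similarity; lam trades relevance to the query against novelty relative to what is already selected:

import numpy as np

def mmr(q: np.ndarray, docs: np.ndarray, k: int = 5, lam: float = 0.7) -> list[int]:
    """Select k doc indices balancing relevance and diversity (vectors normalized)."""
    rel = docs @ q                                 # relevance of each doc to the query
    selected: list[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max(docs[i] @ docs[j] for j in selected) if selected else 0.0
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected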
Closing thoughts
RAG is a simple idea, but each spot — chunking · embedding model · index · reranking · evaluation — is domain-dependent. Without making a small evaluation set first, all tuning stays at the level of intuition. Combining pgvector + Postgres metadata filters is the simplest answer for small to mid-sized operations.
Next
- prompt-design
- gemini-api
References: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020) · Liu et al., "Lost in the Middle" (2023) · Malkov & Yashunin, HNSW (2018) · pgvector · Pinecone Learn: RAG · Cohere Rerank · RAGAS · Microsoft GraphRAG.