RAG and Search — Chunking · Embedding · Reranking · Hybrid
RAG (Retrieval-Augmented Generation) is an approach that supplements an LLM's knowledge, bounded by its training cutoff and context limits, with external sources. Because retrieval and generation are separated, new material can be handled without retraining the model.
1. About RAG
The term RAG was cemented by the 2020 NeurIPS paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis et al. The idea is simple: retrieve related documents from an external corpus and include them in the model's input at generation time.
source → preprocess·chunk → embed → store in index
↓
user query → embed → candidate retrieval → rerank → context → LLM → response
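The same flow as a minimal Python sketch; embed(), vector_search(), rerank(), and generate() are hypothetical helpers standing in for the stages in the diagram, not functions from any particular library:

def answer(query: str) -> str:
    q_vec = embed(query)                      # same embedding model as at index time
    candidates = vector_search(q_vec, k=50)   # ANN candidate retrieval
    best = rerank(query, candidates)[:5]      # cross-encoder or LLM re-scoring
    context = "\n\n".join(best)
    prompt = f"Answer from this context only:\n{context}\n\nQ: {query}"
    return generate(prompt)                   # LLM call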
2. Chunking strategies
Documents often cannot fit whole into the LLM context, so we split them into smaller units. There is no single right answer for chunking; it depends on the data and domain.
| Strategy | Description | Pros / Cons |
|---|---|---|
| Fixed size | By token · char count (e.g. 512 tokens) | Simple · ignores semantic boundaries |
| Sliding window | Fixed size + overlap (e.g. 50 tokens) | Mitigates boundary loss · increases storage |
| Semantic | Estimate boundaries by sentence-embedding similarity | Preserves meaning · expensive |
| Hierarchical | Document → section → paragraph → chunk | Search by abstraction · complex impl |
| Structure-based | Markdown headings · HTML · code AST | For materials with clear structure |
Overlap is usually 10~20% of chunk size. Too much increases redundant retrieval · cost; too little cuts off sentences spanning boundaries.
For mixed Korean·English documents, the 256~768 token range is often cited. Code and technical documents suit structure-based chunking by function or heading.
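As a concrete example, a minimal sliding-window chunker in Python. Token counts are approximated by whitespace words here; a real implementation would count with the embedding model's tokenizer:

def sliding_window_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks with overlap (word count as a token proxy)."""
    tokens = text.split()
    step = size - overlap                        # advance by size minus overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(" ".join(tokens[start:start + size]))
    return chunks

With size=512 and overlap=64, consecutive chunks share 12.5% of their content, inside the 10~20% range above.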
3. Embedding → search
Pass chunks through an embedding model to make fixed-dimensional vectors. The query is embedded with the same model, and the K nearest by distance · similarity are retrieved. Cosine is the most common distance, but follow the model's recommendation.
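A minimal sketch with sentence-transformers and NumPy; the model name is only an example, and in practice the query must be embedded with the same model used at index time:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # example model, 384 dimensions

def top_k(query: str, chunks: list[str], k: int = 5) -> list[tuple[float, str]]:
    # normalize_embeddings=True makes dot product equal cosine similarity
    vecs = model.encode(chunks, normalize_embeddings=True)
    q = model.encode([query], normalize_embeddings=True)[0]
    sims = vecs @ q                                # cosine similarity per chunk
    order = np.argsort(-sims)[:k]
    return [(float(sims[i]), chunks[i]) for i in order]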
4. pgvector
With the pgvector extension on Postgres, vector search lives inside the same DB. It provides distance operators (<-> L2 · <=> cosine · <#> negative inner product) and two ANN index types:
- IVFFlat — Clustering-based. Fast to build · low memory. The index should be built after some data has accumulated, for cluster quality.
- HNSW — Multi-layer graph based (Malkov · Yashunin 2018). Excellent search accuracy·speed; longer build time · more memory. Strong with incremental adds.
Choosing between this and a dedicated vector DB (Pinecone · Qdrant · Weaviate · Milvus) is a balance of data scale · operational surface · filtering needs.
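A hedged setup sketch via Python and psycopg, using pgvector's Python helper; the DSN, table, and column names are made up for illustration, and the 384 dimensions match the example embedding model above:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=rag", autocommit=True)   # example DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)   # teaches psycopg to adapt numpy arrays to vector
conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(384))""")
# HNSW index with cosine ops; only <=> queries will use it
conn.execute("""CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops)""")

q_vec = np.random.rand(384).astype(np.float32)   # stand-in for a real query embedding
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 10",
    (q_vec,),
).fetchall()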
5. Reranking
Vector search is fast but has limits in semantic accuracy. Reranking fetches 30~100 candidates first, then re-scores them with a cross-encoder or LLM and keeps only the top 5~10.
| Model | Note |
|---|---|
| Cohere Rerank | Commercial API. |
| BGE-Reranker (BAAI) | Open weights, multilingual variants. |
| Cross-Encoder/ms-marco | Published by Sentence-Transformers. |
| Jina Reranker | Multilingual · code variants. |
A bi-encoder (embedding model) encodes the two texts separately, while a cross-encoder feeds query and document together for deeper interaction. Accurate but expensive, so use it only on the narrowed candidate set after first-pass retrieval.
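A minimal reranking sketch with the sentence-transformers CrossEncoder; the model name is one of the ms-marco cross-encoders from the table above, and candidate retrieval is assumed to have happened already:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # the cross-encoder scores each (query, document) pair jointly
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:keep]]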
6. Hybrid search
Combine keyword search (BM25 · Postgres tsvector) with vector search. Tokens like abbreviations · proper nouns · numbers can be weak for embeddings, and keyword search complements them.
A simple combination — RRF (Reciprocal Rank Fusion):
score(d) = Σ 1 / (k + rank_i(d)) # k is usually 60
RRF uses only each retriever's rank, never the raw scores, yet it works well in practice.
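The formula above, directly in Python; each input is a ranked list of document ids, best first:

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists using ranks only."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ids, vector_ids])   # bm25_ids, vector_ids: ranked id lists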
7. Evaluation metrics
Retrieval stage:
| Metric | Meaning |
|---|---|
| Recall@K | Fraction of queries whose answer document appears in the top K |
| Precision@K | Fraction of the top K results that are relevant |
| MRR | Mean reciprocal rank of the first answer |
| nDCG | Cumulative gain weighted by rank |
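Given a small evaluation set, the first two rows of this table are a few lines of Python; results holds one ranked id list per query, answers the matching relevant id:

def recall_at_k(results: list[list[str]], answers: list[str], k: int = 5) -> float:
    hits = sum(ans in res[:k] for res, ans in zip(results, answers))
    return hits / len(answers)

def mrr(results: list[list[str]], answers: list[str]) -> float:
    total = 0.0
    for res, ans in zip(results, answers):
        if ans in res:
            total += 1.0 / (res.index(ans) + 1)   # reciprocal rank of first hit
    return total / len(answers)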
Generation stage — frameworks like RAGAS · TruLens evaluate context faithfulness · accuracy · answer relevance via an LLM. Treat automated evaluation as a reference and pair it with manual review.
8. Other paths
- Long-context LLMs — Approaches like Gemini 1.5 Pro's 1M tokens, Claude 3's 200k, that put everything in without RAG. Balance with cost · latency · "lost in the middle."
- Fine-tuning — Train directly on domain data. When the material is mostly static and there's enough volume.
- Graph RAG — Build documents into entity·relation graphs. Microsoft published the GraphRAG case in 2024.
- Agentic retrieval — The LLM iteratively generates and refines search queries.
9. Combining metadata
Store creation date · source URL · document type · tags as columns alongside chunks. pgvector's strength is that these combine directly with ordinary RDBMS WHERE clauses:
SELECT id, content
FROM chunks
WHERE doc_type = 'manual' AND created_at > now() - interval '30 days'
ORDER BY embedding <=> $1
LIMIT 20;
If the filter is too selective, most of the candidates the ANN index returns get discarded afterwards, leaving too few results and hurting recall. The prefilter-vs-postfilter decision depends on the data distribution.
10. Spots where you often get stuck
Re-embedding when changing the embedding model — vector spaces of different models are not compatible. Plan migration from the start.
Dimension mismatch — A vector(1536) column won't accept other dimensions. Separate columns·tables per model.
Distance operator and index match — If the index is built with vector_cosine_ops, only <=> uses the index.
Context window overflow — Putting all retrieval results into the LLM exceeds the limit. Combine with reranking · summarization.
Lost in the middle — The observation that information placed in the middle of context is reflected less by models than at either end (Liu et al. 2023). Place key material at the front and back.
Duplicate documents — When the same content enters as multiple chunks, retrieval results fill with the same information. Dedup, or apply MMR (Maximal Marginal Relevance; a sketch follows after this list).
No evaluation data — Without query·answer pairs from your own domain, all tuning is guesswork. Start with a small evaluation set.
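A minimal MMR sketch, assuming pre-normalized embedding vectors so dot product equals cosine similarity; lam trades relevance to the query against novelty relative to what is already selected:

import numpy as np

def mmr(q: np.ndarray, docs: np.ndarray, k: int = 5, lam: float = 0.7) -> list[int]:
    """Select k doc indices balancing relevance and diversity (vectors normalized)."""
    rel = docs @ q                                 # relevance of each doc to the query
    selected: list[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max(docs[i] @ docs[j] for j in selected) if selected else 0.0
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected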
Closing thoughts
RAG is a simple idea, but each spot — chunking · embedding model · index · reranking · evaluation — is domain-dependent. Without making a small evaluation set first, all tuning stays at the level of intuition. Combining pgvector + Postgres metadata filters is the simplest answer for small to mid-sized operations.
Next
- prompt-design
- gemini-api
References: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020) · Liu et al., "Lost in the Middle" (2023) · Malkov & Yashunin, HNSW (2018) · pgvector · Pinecone Learn: RAG · Cohere Rerank · RAGAS · Microsoft GraphRAG.