Step 4
RAG pipeline
35 min
Chunk → embed → store → retrieve → inject → generate. Six steps, that's all.
1. Chunking
import re

def chunk_by_sentence(text: str, max_chars=1000, overlap=100):
    # Split on sentence-ending punctuation (Latin and CJK) followed by whitespace.
    sentences = re.split(r'(?<=[.!?。])\s+', text)
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + len(s) > max_chars:
            chunks.append(cur)
            # Carry the tail of the finished chunk forward as overlap.
            cur = cur[-overlap:] + " " + s
        else:
            cur = (cur + " " + s).strip()
    if cur:
        chunks.append(cur)
    return chunks
1000 chars (~500 tokens) with 100-char overlap. Respect sentence and code-block boundaries.
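A quick sanity check of the chunker; the sample string and the tiny limits are made up for illustration:

sample = "First sentence. Second sentence! Third one? And a fourth."
for c in chunk_by_sentence(sample, max_chars=30, overlap=10):
    print(repr(c))  # note the repeated tail at each chunk boundary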
2. Indexing (batch)
import google.generativeai as genai

async def index_document(doc_id, text):
    chunks = chunk_by_sentence(text)
    for i in range(0, len(chunks), 10):
        batch = chunks[i:i+10]
        resp = genai.embed_content(
            model="models/text-embedding-004",
            content=batch,  # a list yields one embedding per chunk
            task_type="retrieval_document",
        )
        embeddings = resp["embedding"]
        # bulk insert ...
Batches of 10–50. Mind API rate limits.
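One way to fill in that bulk insert, assuming an asyncpg pool and pgvector's asyncpg adapter (register_vector teaches asyncpg to pass Python lists as vector values). The name insert_chunks is illustrative:

from pgvector.asyncpg import register_vector

async def insert_chunks(doc_id, batch, embeddings):
    async with pool.acquire() as conn:
        await register_vector(conn)  # encode Python lists as pgvector values
        await conn.executemany(
            "INSERT INTO document_chunks (doc_id, content, embedding) VALUES ($1, $2, $3)",
            [(doc_id, c, e) for c, e in zip(batch, embeddings)],
        )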
3. Retrieve — top-k
async def retrieve(query, k=5):
    # Queries use retrieval_query; documents were embedded with retrieval_document.
    q_emb = genai.embed_content(
        model="models/text-embedding-004",
        content=query,
        task_type="retrieval_query",
    )["embedding"]
    # <=> is pgvector's cosine-distance operator, so 1 - distance is similarity.
    rows = await pool.fetch(
        "SELECT content, 1 - (embedding <=> $1::vector) AS score "
        "FROM document_chunks ORDER BY embedding <=> $1::vector LIMIT $2",
        q_emb, k,
    )
    return [(r["content"], r["score"]) for r in rows]
k of 3–10 is practical.
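For reference, a plausible schema and index behind that query. This is a sketch: the table layout and the 768-dimension width (text-embedding-004's output size) are assumptions inferred from the snippets above.

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS document_chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    TEXT,
    content   TEXT,
    embedding vector(768)
);
CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx
    ON document_chunks USING hnsw (embedding vector_cosine_ops);
"""

async def init_schema():
    # asyncpg runs multi-statement SQL when no parameters are bound.
    await pool.execute(SCHEMA_SQL)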
4. (Optional) rerank
top-k=20 → rerank model → top-5. Adds 200–500ms, improves accuracy 10–20pp. Skip for MVP.
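If you do add one, a cross-encoder is a common approach. A minimal sketch with sentence-transformers; the model name is a well-known example, not a recommendation from this guide:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_n=5):
    # Score every (query, chunk) pair with the cross-encoder, keep the best.
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

Retrieve with k=20, then rerank down to 5, matching the flow above.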
5. Prompt injection
def build_prompt(query, chunks):
    context = "\n\n---\n\n".join(chunks)
    return f"""Answer ONLY from the documents below.
If not found, reply "Not found in the documents."
Cite sources.
# Documents
{context}
# Question
{query}
# Answer (with citations)
"""
Three anti-hallucination moves in one prompt: restrict to the documents ("Answer ONLY from…"), give an escape path ("Not found in the documents."), and demand citations.
6. Generate
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")  # any OpenAI-compatible endpoint serving Gemma (see step 05)

def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gemma-2-9b-it",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3, max_tokens=500,
    )
    return resp.choices[0].message.content
7. Full flow
async def ask(query):
    chunks = await retrieve(query, k=5)
    prompt = build_prompt(query, [c for c, _ in chunks])
    answer = generate(prompt)
    return {"answer": answer, "sources": chunks}
Returning the sources alongside the answer lets the UI offer a "why this answer?" expansion.
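End-to-end usage; the question is a made-up example:

import asyncio

result = asyncio.run(ask("What does the warranty cover?"))
print(result["answer"])
for content, score in result["sources"]:
    print(f"[{score:.2f}] {content[:60]}")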
8. Gotchas
- Chunks too small → meaning fragments
- Chunks too large → dilution, overflow
- No threshold → low-score chunks injected (see the sketch after this list)
- No citations requested → hallucinations rise
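A minimal guard for the threshold gotcha; the 0.5 cutoff is an assumed starting point to tune, not a universal value:

MIN_SCORE = 0.5  # assumption: tune against real query logs

def filter_by_score(scored_chunks, min_score=MIN_SCORE):
    # Drop chunks whose cosine similarity falls below the cutoff.
    kept = [(c, s) for c, s in scored_chunks if s >= min_score]
    # An empty result lets the prompt's escape path answer "Not found".
    return kept

Call it between retrieve() and build_prompt() in ask().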
Closing
A first RAG with k=5, temperature 0.3, and "only from…" usually works. Tune later with real user logs.
Next
- 05-gemini-openai-api