Step 4
RAG pipeline
35 min
Chunk → embed → store → retrieve → inject → generate. Six steps, that's all.
1. Chunking
import re

def chunk_by_sentence(text: str, max_chars=1000, overlap=100):
    # Split on sentence-ending punctuation (Latin and CJK) followed by whitespace.
    sentences = re.split(r'(?<=[.!?。])\s+', text)
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + len(s) > max_chars:
            chunks.append(cur)
            # Carry the tail of the finished chunk forward as overlap.
            cur = cur[-overlap:] + " " + s
        else:
            cur = (cur + " " + s).strip()
    if cur:
        chunks.append(cur)
    return chunks
1000 chars (~500 tokens) with 100-char overlap. Respect sentence and code-block boundaries.
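A quick sanity check of the chunker; the sample string and the tiny limits are made up for illustration:

sample = "First sentence. Second sentence! Third one? And a fourth."
for c in chunk_by_sentence(sample, max_chars=30, overlap=10):
    print(repr(c))  # note the repeated tail at each chunk boundary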
2. Indexing (batch)
import google.generativeai as genai

async def index_document(doc_id, text):
    chunks = chunk_by_sentence(text)
    for i in range(0, len(chunks), 10):
        batch = chunks[i:i+10]
        resp = genai.embed_content(
            model="models/text-embedding-004",
            content=batch,  # a list yields one embedding per chunk
            task_type="retrieval_document",
        )
        embeddings = resp["embedding"]
        # bulk insert ...
Batches of 10–50. Mind API rate limits.
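One way to fill in that bulk insert, assuming an asyncpg pool and pgvector's asyncpg adapter (register_vector teaches asyncpg to pass Python lists as vector values). The name insert_chunks is illustrative:

from pgvector.asyncpg import register_vector

async def insert_chunks(doc_id, batch, embeddings):
    async with pool.acquire() as conn:
        await register_vector(conn)  # encode Python lists as pgvector values
        await conn.executemany(
            "INSERT INTO document_chunks (doc_id, content, embedding) VALUES ($1, $2, $3)",
            [(doc_id, c, e) for c, e in zip(batch, embeddings)],
        )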
3. Retrieve — top-k
async def retrieve(query, k=5):
    # Queries use retrieval_query; documents were embedded with retrieval_document.
    q_emb = genai.embed_content(
        model="models/text-embedding-004",
        content=query,
        task_type="retrieval_query",
    )["embedding"]
    # <=> is pgvector's cosine-distance operator, so 1 - distance is similarity.
    rows = await pool.fetch(
        "SELECT content, 1 - (embedding <=> $1::vector) AS score "
        "FROM document_chunks ORDER BY embedding <=> $1::vector LIMIT $2",
        q_emb, k,
    )
    return [(r["content"], r["score"]) for r in rows]
k of 3–10 is practical.
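For reference, a plausible schema and index behind that query. This is a sketch: the table layout and the 768-dimension width (text-embedding-004's output size) are assumptions inferred from the snippets above.

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS document_chunks (
    id        BIGSERIAL PRIMARY KEY,
    doc_id    TEXT,
    content   TEXT,
    embedding vector(768)
);
CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx
    ON document_chunks USING hnsw (embedding vector_cosine_ops);
"""

async def init_schema():
    # asyncpg runs multi-statement SQL when no parameters are bound.
    await pool.execute(SCHEMA_SQL)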
4. (Optional) rerank
top-k=20 → rerank model → top-5. Adds 200–500ms, improves accuracy 10–20pp. Skip for MVP.
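If you do add one, a cross-encoder is a common approach. A minimal sketch with sentence-transformers; the model name is a well-known example, not a recommendation from this guide:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_n=5):
    # Score every (query, chunk) pair with the cross-encoder, keep the best.
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

Retrieve with k=20, then rerank down to 5, matching the flow above.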
5. Prompt injection
def build_prompt(query, chunks):
    context = "\n\n---\n\n".join(chunks)
    return f"""Answer ONLY from the documents below.
If not found, reply "Not found in the documents."
Cite sources.
# Documents
{context}
# Question
{query}
# Answer (with citations)
"""
Three anti-hallucination moves in one prompt: restrict to the documents ("Answer ONLY from…"), give an escape path ("Not found in the documents."), and demand citations.
6. Generate
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")  # any OpenAI-compatible endpoint serving Gemma (see step 05)

def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gemma-2-9b-it",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3, max_tokens=500,
    )
    return resp.choices[0].message.content
7. Full flow
async def ask(query):
    chunks = await retrieve(query, k=5)
    prompt = build_prompt(query, [c for c, _ in chunks])
    answer = generate(prompt)
    return {"answer": answer, "sources": chunks}
Returning the sources alongside the answer lets the UI offer a "why this answer?" expansion.
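End-to-end usage; the question is a made-up example:

import asyncio

result = asyncio.run(ask("What does the warranty cover?"))
print(result["answer"])
for content, score in result["sources"]:
    print(f"[{score:.2f}] {content[:60]}")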
8. Gotchas
- Chunks too small → meaning fragments
- Chunks too large → dilution, overflow
- No threshold → low-score chunks injected (see the sketch after this list)
- No citations requested → hallucinations rise
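A minimal guard for the threshold gotcha; the 0.5 cutoff is an assumed starting point to tune, not a universal value:

MIN_SCORE = 0.5  # assumption: tune against real query logs

def filter_by_score(scored_chunks, min_score=MIN_SCORE):
    # Drop chunks whose cosine similarity falls below the cutoff.
    kept = [(c, s) for c, s in scored_chunks if s >= min_score]
    # An empty result lets the prompt's escape path answer "Not found".
    return kept

Call it between retrieve() and build_prompt() in ask().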
Closing
A first RAG with k=5, temperature 0.3, and "only from…" usually works. Tune later with real user logs.
Next
- 05-gemini-openai-api