Local LLM · pgvector · building a RAG chatbot
Build a chatbot that answers from your own documents with LM Studio + pgvector + Gemini. Seven steps — embeddings, prompts, and a SaaS comparison.
- Difficulty
- advanced
- Lessons
- 7
Local LLM · pgvector · building a RAG chatbot
Sometimes a single ChatGPT call is not enough. Internal docs, personal notes, data you cannot send outside. RAG (Retrieval Augmented Generation) lets an LLM answer only from materials you hand-pick.
Who it's for
- Engineers running LLMs on local GPUs or on-prem without sending data out
- Anyone who wants a chatbot that answers with citations from their own documents
- People wanting a single track covering embeddings, vector search, and prompt design
What you can do afterwards
- Run Gemma / Llama family models locally with LM Studio
- Store embeddings in PostgreSQL + pgvector with HNSW indexes
- Build a minimal FastAPI + LangChain pipeline (retrieve → prompt → generate)
- Swap Gemini and local LLMs freely
- Control system prompts, few-shot, and output schemas
Flow
[1] Local LLM ──▶ [2] Embeddings ──▶ [3] pgvector ──▶ [4] RAG pipeline
│
▼
[7] vs SaaS RAG ◀── [6] Prompts ◀── [5] Cloud switch
The first half (1–4) is the mechanical "turn meaning into numbers and search." The second half (5–7) is the operational judgment on models, prompts, tools.
Steps
- Why local LLMs · getting started with LM Studio — OpenAI-compatible endpoint · swapping models · VRAM
- Embeddings — text to vectors — the math behind semantic search · 768 dims
- pgvector + HNSW setup — install · index choice · cosine vs dot product
- RAG pipeline — chunking · retrieve · top-k · rerank · prompt injection
- Gemini · OpenAI-compatible APIs — switching local ↔ cloud · cost · latency
- Prompt design — system prompts · few-shot · output schemas · hallucination
- NotebookLM vs your own RAG — SaaS RAG comparison; choosing the right tool per slot
Prerequisites — python-data-pipeline + Python 3.13+ + uv + PostgreSQL 15+ + LM Studio.