Local LLM · pgvector · building a RAG chatbot

Sometimes a single ChatGPT call is not enough. Internal docs, personal notes, data you cannot send outside. RAG (Retrieval Augmented Generation) lets an LLM answer only from materials you hand-pick.

Who it's for

Engineers running LLMs on local GPUs or on-prem without sending data out
Anyone who wants a chatbot that answers with citations from their own documents
People wanting a single track covering embeddings, vector search, and prompt design

What you can do afterwards

Run Gemma / Llama family models locally with LM Studio
Store embeddings in PostgreSQL + pgvector with HNSW indexes
Build a minimal FastAPI + LangChain pipeline (retrieve → prompt → generate)
Swap Gemini and local LLMs freely
Control system prompts, few-shot, and output schemas

Steps

Why local LLMs · getting started with LM Studio — OpenAI-compatible endpoint · swapping models · VRAM
Embeddings — text to vectors — the math behind semantic search · 768 dims
pgvector + HNSW setup — install · index choice · cosine vs dot product
RAG pipeline — chunking · retrieve · top-k · rerank · prompt injection
Gemini · OpenAI-compatible APIs — switching local ↔ cloud · cost · latency
Prompt design — system prompts · few-shot · output schemas · hallucination

Prerequisites — python-data-pipeline + Python 3.11 + uv + PostgreSQL 15+ + LM Studio.

Local LLM · pgvector · building a RAG chatbot

Local LLM · pgvector · building a RAG chatbot

Who it's for

What you can do afterwards

Steps

Lessons

Other courses

Why local LLMs · getting started with LM Studio

Embeddings — text to vectors

pgvector + HNSW setup

RAG pipeline

Gemini · OpenAI-compatible APIs

Prompt design

Step 7 — NotebookLM vs your own RAG