Step 1
Why local LLMs · getting started with LM Studio
A one-line ChatGPT call is fast and easy. Still, there are places where a local LLM is the answer.
1. Four places local wins
- Data cannot leave — internal docs, health, finance
- Per-request cost adds up — dozens of calls per second in backends
- Predictable latency — cloud tail latencies hit 500ms+
- Offline · personal device — AI baked into a Tauri desktop app
Quality and context length still favour Claude Opus or GPT-4 class models.
2. LM Studio — the standard local launcher
Free, macOS / Windows / Linux. Pick a GGUF and run Gemma, Llama, Qwen, Mistral.
# Download LM Studio
# Search → gemma-2-9b-it · llama-3.2-3b · qwen2.5-coder
# Load Model → Server tab → Start Server (default http://localhost:1234)
3. OpenAI-compatible endpoint
Call it with the OpenAI SDK as-is.
from openai import OpenAI

# The api_key is required by the SDK but ignored by LM Studio.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="gemma-2-9b-it",  # must match the id the server reports
    messages=[{"role": "user", "content": "Answer briefly: 1 + 1 = ?"}],
    temperature=0.3,
)
print(resp.choices[0].message.content)
Swap base_url + model to switch between cloud and local.
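One way to make the swap concrete is to drive both settings from environment variables, so the same code talks to either backend. A minimal sketch (the LLM_* variable names are illustrative, not an SDK convention):

import os
from openai import OpenAI

# Defaults target the local LM Studio server; override via env vars.
BASE_URL = os.environ.get("LLM_BASE_URL", "http://localhost:1234/v1")
MODEL = os.environ.get("LLM_MODEL", "gemma-2-9b-it")
API_KEY = os.environ.get("LLM_API_KEY", "lm-studio")  # real key for cloud

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content

Setting LLM_BASE_URL=https://api.openai.com/v1 and LLM_MODEL=gpt-4o-mini then points the same code at the cloud with no source change.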
4. VRAM guide
| Params | Quant | Approx. VRAM |
|---|---|---|
| 3B | Q4_K_M | 4 GB |
| 7 ~ 9B | Q4_K_M | 8 ~ 12 GB |
| 14B | Q4_K_M | 16 GB |
| 32B | Q4_K_M | 24 GB + |
CPU-only works, but at 1–5 tok/s. Use a GPU for real-time use.
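The table roughly follows from arithmetic: Q4_K_M stores about 4.5 bits per weight, and the runtime needs headroom for the KV cache and buffers, which grows with context length. A back-of-envelope sketch (the 4.5-bit figure and 30% overhead are approximations; the table above additionally rounds up to common GPU sizes):

def vram_estimate_gb(params_billions: float,
                     bits_per_weight: float = 4.5,  # ~Q4_K_M average
                     overhead: float = 1.3) -> float:
    """Weights at the quantized bit width plus ~30% for KV cache/buffers."""
    weights_gb = params_billions * bits_per_weight / 8  # billions of params * bytes/param = GB
    return weights_gb * overhead

for p in (3, 9, 14, 32):
    print(f"{p}B @ Q4_K_M ~ {vram_estimate_gb(p):.1f} GB")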
5. Picking a model
- Code · RAG summary — Qwen2.5-Coder · Gemma 2 9B
- Korean quality — Gemma 2 9B · Gemma 4 e2b-it (2026)
- Low VRAM — Llama 3.2 3B · Phi-3 mini
Start with Gemma 2 9B Q4_K_M.
6. Gotchas
- Model name mismatch — use the id returned by curl http://localhost:1234/v1/models (see the sketch after this list)
- Temperature too high — RAG 0.1–0.4, creative 0.7–1.0
- Context accumulates — the server does not auto-trim history across calls; trim manually
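A minimal sketch covering the first and third gotchas: read the exact model id from /v1/models via the SDK, and keep the message list bounded yourself (MAX_TURNS is an illustrative knob, not an LM Studio setting):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Gotcha 1: use the id the server actually reports, not a guessed name.
model_id = client.models.list().data[0].id

# Gotcha 3: the server will not trim history for you; keep it bounded.
history: list[dict] = []
MAX_TURNS = 8  # keep only the last 8 messages (illustrative cap)

def chat(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    del history[:-MAX_TURNS]  # manual trim before each call
    resp = client.chat.completions.create(model=model_id, messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer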
Closing
Start your first RAG against Gemini or OpenAI to validate the flow, then switch to local. Local is not a silver bullet; being able to switch on demand is the real win.
Next
- 02-embeddings