Step 1
Why local LLMs · getting started with LM Studio
A one-line ChatGPT call is fast and easy. Still, there are places where a local LLM is the answer.
1. Four places local wins
- Data cannot leave — internal docs, health, finance
- Per-request cost adds up — dozens of calls per second in backends
- Predictable latency — cloud tail latencies hit 500ms+
- Offline · personal device — AI baked into a Tauri desktop app
Quality and context length still favour Claude Opus or GPT-4 class models.
2. LM Studio — the standard local launcher
Free, macOS / Windows / Linux. Pick a GGUF and run Gemma, Llama, Qwen, Mistral.
# Download LM Studio
# Search → gemma-2-9b-it · llama-3.2-3b · qwen2.5-coder
# Load Model → Server tab → Start Server (default http://localhost:1234)
3. OpenAI-compatible endpoint
Call it with the OpenAI SDK as-is.
from openai import OpenAI

# The api_key is required by the SDK but ignored by LM Studio.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="gemma-2-9b-it",  # must match the id the server reports
    messages=[{"role": "user", "content": "Answer briefly: 1 + 1 = ?"}],
    temperature=0.3,
)
print(resp.choices[0].message.content)
Swap base_url + model to switch between cloud and local.
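One way to make the swap concrete is to drive both settings from environment variables, so the same code talks to either backend. A minimal sketch (the LLM_* variable names are illustrative, not an SDK convention):

import os
from openai import OpenAI

# Defaults target the local LM Studio server; override via env vars.
BASE_URL = os.environ.get("LLM_BASE_URL", "http://localhost:1234/v1")
MODEL = os.environ.get("LLM_MODEL", "gemma-2-9b-it")
API_KEY = os.environ.get("LLM_API_KEY", "lm-studio")  # real key for cloud

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content

Setting LLM_BASE_URL=https://api.openai.com/v1 and LLM_MODEL=gpt-4o-mini then points the same code at the cloud with no source change.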
4. VRAM guide
| Params | Quant | Approx. VRAM |
|---|---|---|
| 3B | Q4_K_M | 4 GB |
| 7 ~ 9B | Q4_K_M | 8 ~ 12 GB |
| 14B | Q4_K_M | 16 GB |
| 32B | Q4_K_M | 24 GB + |
CPU-only works, but at 1–5 tok/s. Use a GPU for real-time use.
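The table roughly follows from arithmetic: Q4_K_M stores about 4.5 bits per weight, and the runtime needs headroom for the KV cache and buffers, which grows with context length. A back-of-envelope sketch (the 4.5-bit figure and 30% overhead are approximations; the table above additionally rounds up to common GPU sizes):

def vram_estimate_gb(params_billions: float,
                     bits_per_weight: float = 4.5,  # ~Q4_K_M average
                     overhead: float = 1.3) -> float:
    """Weights at the quantized bit width plus ~30% for KV cache/buffers."""
    weights_gb = params_billions * bits_per_weight / 8  # billions of params * bytes/param = GB
    return weights_gb * overhead

for p in (3, 9, 14, 32):
    print(f"{p}B @ Q4_K_M ~ {vram_estimate_gb(p):.1f} GB")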
5. Picking a model
- Code · RAG summary — Qwen2.5-Coder · Gemma 2 9B
- Korean quality — Gemma 2 9B · Gemma 4 e2b-it (2026)
- Low VRAM — Llama 3.2 3B · Phi-3 mini
Start with Gemma 2 9B Q4_K_M.
6. Gotchas
- Model name mismatch — use the id returned by curl http://localhost:1234/v1/models (see the sketch after this list)
- Temperature too high — RAG 0.1–0.4, creative 0.7–1.0
- Context accumulates — the server does not auto-trim history across calls; trim manually
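A minimal sketch covering the first and third gotchas: read the exact model id from /v1/models via the SDK, and keep the message list bounded yourself (MAX_TURNS is an illustrative knob, not an LM Studio setting):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Gotcha 1: use the id the server actually reports, not a guessed name.
model_id = client.models.list().data[0].id

# Gotcha 3: the server will not trim history for you; keep it bounded.
history: list[dict] = []
MAX_TURNS = 8  # keep only the last 8 messages (illustrative cap)

def chat(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    del history[:-MAX_TURNS]  # manual trim before each call
    resp = client.chat.completions.create(model=model_id, messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer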
Closing
Start your first RAG against Gemini or OpenAI to validate the flow, then switch to local. Local is not a silver bullet; being able to switch on demand is the real win.
Next
- 02-embeddings