Local LLM — LM Studio · llama.cpp · Ollama · vLLM
The trend of running large language models directly on personal computers has taken hold rapidly since 2023. Where there used to be only cloud APIs, the combination of quantization formats like GGUF, inference runtimes like llama.cpp, and tools like LM Studio · Ollama · vLLM has made it possible to run reasonably large models even on a laptop.
1. About the tools
llama.cpp — A C/C++ inference runtime released by Georgi Gerganov in March 2023. It started as a way to run Meta's LLaMA weights on CPU; since then it has gained GPU acceleration (CUDA · Metal · Vulkan · ROCm), a much wider model lineup, and the standardized GGUF format. Many other tools use llama.cpp as their internal engine.
LM Studio — A desktop app released by Element Labs in 2023. On Windows · macOS · Linux it provides a GUI to search for models, download them, and run inference. Internally it uses llama.cpp and, on Apple Silicon, the MLX backend. A strong point is the OpenAI-compatible local server (http://localhost:1234/v1): turn it on and existing OpenAI-client apps connect as-is. Free for non-commercial use; commercial use has a separate licensing policy to check.
Ollama — A CLI-centric tool released in 2023. A single command like ollama run llama3.2 pulls the model and starts a chat. It runs its own server (default 127.0.0.1:11434) and also exposes an OpenAI-compatible endpoint. Its Modelfile format resembles a Dockerfile.
vLLM — A serving engine released by UC Berkeley Sky Computing Lab in 2023. It pushed up throughput with a KV cache management technique called PagedAttention. It sits in a different place than LM Studio · Ollama, which are for a single user running light inference — vLLM targets many concurrent requests · high-throughput serving, and a GPU is essentially a prerequisite.
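Both of the single-user tools above announce themselves over HTTP, so a quick probe tells you what is running. A minimal sketch using only the standard library, assuming the default ports mentioned above (LM Studio on 1234, Ollama on 11434):

```python
import urllib.request

# Default endpoints from the descriptions above; adjust if you changed the ports.
SERVERS = {
    "LM Studio": "http://localhost:1234/v1/models",  # OpenAI-compatible model list
    "Ollama": "http://127.0.0.1:11434/api/tags",     # Ollama's own model list
}

for name, url in SERVERS.items():
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            print(f"{name}: reachable (HTTP {resp.status})")
    except OSError:
        print(f"{name}: not reachable")
```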
2. GGUF and quantization
GGUF (GPT-Generated Unified Format) is the successor to GGML, a format that settled in around August 2023. It packs model weights + metadata (tokenizer · architecture) into a single file. Quantization is the technique of reducing weights from 16/32-bit floats to fewer bits.
| Notation | Bits | Note |
|---|---|---|
| Q2_K | 2~3 | Smallest · large quality loss |
| Q4_0 / Q4_K_M | 4~5 | Common compromise |
| Q5_K_M | 5~6 | Quality·size balance |
| Q6_K | 6~7 | Good quality |
| Q8_0 | 8 | Minimal loss · about half the size of F16 |
| F16 / BF16 | 16 | Close to full precision |
Variants with K are K-quants — block-wise quantization that improves quality at the same bit count. _M · _S mean medium · small.
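Because GGUF carries its metadata inside the file, you can inspect a download before loading it. A minimal sketch, assuming the gguf Python package published from the llama.cpp repo (pip install gguf) and a hypothetical local file path:

```python
from gguf import GGUFReader  # pip install gguf

# Hypothetical path: point this at any .gguf file you have downloaded.
reader = GGUFReader("models/llama-3.2-3b-instruct-Q4_K_M.gguf")

# Metadata keys cover architecture, context length, tokenizer, quantization, ...
for key in list(reader.fields)[:10]:
    print(key)

print(f"{len(reader.tensors)} tensors in the file")
```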
3. Model size vs memory
Approximate RAM · VRAM footprint for the weights alone, by model size (context length · KV cache add more on top):
| Model size | F16 | Q4_K_M | Q8_0 |
|---|---|---|---|
| 7B | ~13 GB | ~4 GB | ~7 GB |
| 13B | ~24 GB | ~8 GB | ~13 GB |
| 70B | ~130 GB+ | ~40 GB+ | ~70 GB+ |
Exact numbers vary by model · tokenizer · implementation, so refer to the model card and measurements.
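The table's figures are close to simple arithmetic: parameter count times bits per weight. A rough sketch; the effective bits-per-weight values for Q8_0 and Q4_K_M are approximations, and real files add per-block scales and metadata on top:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in GiB (1024**3 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# Approximate effective bits per weight (assumed, not exact spec values).
FORMATS = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

for size in (7, 13, 70):
    row = ", ".join(f"{fmt} ~{weight_size_gb(size, bpw):.1f} GB"
                    for fmt, bpw in FORMATS.items())
    print(f"{size}B: {row}")
```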
4. Inference backends
- CUDA — NVIDIA GPU. Best supported by most tools.
- Metal — Apple Silicon (M1/M2/M3/M4). Thanks to Unified Memory, the GPU can use a large portion of system RAM.
- ROCm — AMD GPU. Support is gradually expanding.
- Vulkan — General purpose. Performance is usually lower than CUDA · Metal.
- CPU — Slowest, but runs anywhere.
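Whichever backend a build was compiled with, the usual knob is how many transformer layers to offload to the GPU. A minimal sketch with the llama-cpp-python binding; the model path and prompt are placeholders, and n_gpu_layers=-1 asks for all layers while 0 keeps everything on CPU:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA/Metal/... as needed)

llm = Llama(
    model_path="models/llama-3.2-3b-instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload every layer the backend can take; 0 = CPU only
    n_ctx=4096,       # context window; larger values grow the KV cache (see section 7)
)

out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```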
5. Other tools
| Tool | Position | Trait |
|---|---|---|
| LM Studio | Desktop GUI | Search · download · local server in one screen. |
| Ollama | CLI · background daemon | Simple with ollama run. |
| llama.cpp | Library · binary | Lowest layer. |
| Jan | Desktop GUI | Open-source LM Studio alternative. |
| GPT4All | Desktop GUI | Nomic-led · own model ecosystem. |
| vLLM | Server engine | High throughput · multi-user · GPU required. |
| TGI | Server engine | Hugging Face's serving tool. |
| MLX | Apple Silicon framework | Released by Apple in 2023. |
| llamafile | Single executable | Mozilla project by Justine Tunney · model + runtime in one file. |
6. OpenAI-compatible server
LM Studio · Ollama · vLLM all provide OpenAI-compatible endpoints. Client code can stay almost untouched — only the base URL changes.
export OPENAI_BASE_URL=http://localhost:1234/v1
export OPENAI_API_KEY=anything
In Windows PowerShell: $env:OPENAI_BASE_URL = "...".
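The same switch from Python with the openai SDK: only base_url and a dummy key differ from the cloud setup. A minimal sketch assuming LM Studio's default port; for Ollama the base URL would be http://localhost:11434/v1, and the model name must match whatever your server actually has loaded:

```python
from openai import OpenAI  # pip install openai

# Local server instead of the cloud; the API key just has to be non-empty.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="anything")

resp = client.chat.completions.create(
    model="local-model",  # placeholder; use an identifier listed under /v1/models
    messages=[{"role": "user", "content": "Summarize the benefits of GGUF in two sentences."}],
)
print(resp.choices[0].message.content)
```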
7. Model selection
- 7~8B — Practical on laptops · consumer GPUs. Fine for general dialogue · summarization.
- 13~14B — One step up. 16 GB+ VRAM recommended.
- 30~34B — 24 GB+ VRAM, or a Mac with large unified memory.
- 70B+ — Datacenter-class or multiple GPUs.
Context length and KV cache — as context grows, the KV cache occupies memory on top of the model weights. The cache scales linearly with context length, so going 8k → 32k → 128k quickly adds tens of gigabytes and is a common cause of OOM.
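A back-of-the-envelope sketch of that growth, assuming a Llama-2-7B-like layout (32 layers, 32 KV heads, head dim 128, fp16 cache); models with grouped-query attention keep fewer KV heads and therefore a smaller cache:

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: keys + values, every layer, every position."""
    total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len
    return total / 1024**3

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.0f} GB KV cache")
# ~4 GB at 8k, ~16 GB at 32k, ~64 GB at 128k: linear in context length,
# and at long contexts larger than the Q4_K_M weights themselves.
```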
8. Spots where you often get stuck
Model license differences — Llama · Gemma · Qwen · Mistral all have different licenses. Check commercial use · redistribution conditions on the model card.
Tokenizer mismatch — Outputs can degrade when the tokenizer is mishandled during GGUF conversion. Use the official conversion scripts when possible.
Quantization limits — Low-bit quantization like Q2 · Q3 shows pronounced quality loss on small models; larger models tend to tolerate the same bit count with relatively less loss.
GPU memory + system memory split — When the model doesn't fit entirely on the GPU, layers spill to CPU and speed drops sharply.
Driver · CUDA version — If the NVIDIA driver · CUDA version falls outside what the tool expects, GPU acceleration drops out and execution proceeds on CPU.
Exposing the local server externally — Default binding is usually 127.0.0.1. When exposing externally, handle authentication · firewall separately.
Difference between benchmarks and felt experience — Real usage after quantization can differ from short benchmarks. Compare directly with your own domain inputs.
Closing thoughts
Local LLMs are appealing for privacy · offline operation · no per-token API cost. A 7B-class model at Q4_K_M is the practical starting point on laptops · consumer GPUs. With large models + large context, the KV cache decides memory, so trimming context length down to what the domain actually requires is the standard move.
Next
- rag-pgvector
- prompt-design
References: llama.cpp GitHub · LM Studio · Ollama · vLLM · GGUF spec · Apple MLX · NVIDIA CUDA.