LLM Landscape — Closed · Open · Korean-Specialized · Evaluation · Pricing
The LLM market shifts fast. Closed-API and open-weight, English-centric and multilingual, cloud-hosted and self-hosted, plus models specialized for Korean — all sit alongside each other.
1. Closed (API · weights private)
| Provider | Representative models | First release |
|---|---|---|
| OpenAI | GPT-3.5 · GPT-4 · GPT-4o · o1 · o3 | ChatGPT 2022-11-30. |
| Anthropic | Claude · Claude 2 · 3 · 3.5 · 4 series | Claude 2023-03. |
| Google DeepMind | Gemini 1.0 · 1.5 · 2.0 · 2.5 | Gemini 2023-12-06. |
| Mistral AI | Mistral Large · Pixtral | 2023~. |
| Cohere | Command R · R+ | 2021~. |
| xAI | Grok series | 2023-11. |
Even within the same provider, model capability shifts quickly across generations and release dates.
2. Open weights
Model families whose weights can be downloaded and run for inference. License conditions differ per model.
| Model family | Origin | Note |
|---|---|---|
| Llama 2 / 3 / 3.1 / 3.2 / 3.3 | Meta | Custom license (conditional commercial). |
| Mistral · Mixtral · Codestral | Mistral AI | Mix of Apache 2.0 variants and non-commercial variants. |
| Gemma · Gemma 2 / 3 | Google | Gemma license. |
| Qwen / Qwen2 / Qwen2.5 / Qwen3 | Alibaba | Many Apache 2.0 variants. |
| DeepSeek (V2 · V3 · R1) | DeepSeek | License conditions vary per model. |
| Phi series | Microsoft | Known for small size. |
| Yi series | 01.AI | 2023~. |
| Falcon | TII (UAE) | 2023~. |
| OLMo | Allen AI | Aims to open even the training data. |
| StableLM · StableCode | Stability AI | 2023~. |
The degree of "open" varies per model: some release only the weights, some also the training code, and a few even the training data. Hence the view that "open weights" is a more accurate label than "open source."
3. Korean-specialized · Korean-company models
| Model | Origin | Note |
|---|---|---|
| HyperCLOVA X | Naver | Released 2023. Self-trained Korean LLM. |
| A.X (Adot X) | SK Telecom | Self-developed Korean model family. |
| Solar | Upstage | Open-weight variant published. |
| EXAONE | LG AI Research | Some open-weight variants published. |
| KoAlpaca · Polyglot-Ko | Community | Korean fine-tuning attempts. |
Korean ability is best judged model by model. Even the same global model family can vary widely in Korean quality across generations.
4. Reasoning models · multimodal · context length
Reasoning models — From late 2024, OpenAI o1 · o3, DeepSeek R1, Claude's extended thinking, and Gemini 2.5's thinking mode set the trend. The model goes through longer internal reasoning before responding, spending more tokens · time accordingly.
Multimodal — Models that take images · audio · video · documents as input alongside text are now standard. GPT-4o · Gemini · Claude 3.x.
Context length expansion:
| Model | Context |
|---|---|
| GPT-4 (initial) | 8k · 32k |
| GPT-4-Turbo / GPT-4o | 128k |
| Claude 3 / 3.5 | 200k |
| Gemini 1.5 Pro | 1M (at release) |
A larger context isn't always the answer: position effects like "lost in the middle" persist, and cost · latency grow with input length.
5. Evaluation sites
| Site | Operator | Trait |
|---|---|---|
| LMArena | LMSYS · UC Berkeley | Human blind comparison of two models → Elo. |
| LiveBench | Abacus.AI | Periodically refreshed eval set (mitigates data leakage). |
| MMLU | Hendrycks et al. 2020 | Multi-subject multiple choice. |
| BigBench / BBH | Google Research | Collection of various hard tasks. |
| HumanEval / MBPP | OpenAI · Google | Standard for coding eval. |
| SWE-bench | Princeton | Real GitHub-issue resolution rate. |
| GAIA | Hugging Face · Meta | General assistant tasks. |
| Open LLM Leaderboard | Hugging Face | Composite for open-weight models. |
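The pairwise Elo scheme behind arena-style leaderboards like LMArena can be sketched as below. The K-factor, initial rating, and 400-point scale are textbook Elo defaults used for illustration, not LMArena's exact parameters (real leaderboards fit ratings over the whole vote history rather than updating one vote at a time).

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a blind A-vs-B vote.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is the step size (illustrative default; leaderboards tune it).
    """
    # Expected win probability of A under the standard logistic Elo curve.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start equal; A wins one blind comparison.
ra, rb = elo_update(1000.0, 1000.0, 1.0)  # → (1016.0, 984.0)
```

A tie between equally rated models changes nothing, and an upset win moves more points than an expected one — which is why a handful of votes against a strong opponent can shift a ranking.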
Limits of evaluation:
- Suspicion of training-data leakage (cases where benchmarks ended up in training data).
- Many evaluations are English-centric.
- A single score doesn't directly tie to your domain performance.
6. Pricing models
Per-token billing (API) — Most closed models price input · output tokens separately. Output tokens are usually more expensive. With the introduction of context caching · prompt caching, cached input gets a discount.
Cost per request ≈ (input tokens × input rate) + (output tokens × output rate)
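The formula above as a tiny helper, with prompt caching folded in. The rates and the 50% cache discount are placeholder numbers for illustration, not any provider's actual price list.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float,
                 cached_tokens: int = 0, cache_discount: float = 0.5) -> float:
    """Dollar cost of one API call. Rates are per 1M tokens.

    cached_tokens are billed at (cache_discount × input rate);
    the discount value is a placeholder — it varies per provider.
    """
    billable_input = (input_tokens - cached_tokens) + cached_tokens * cache_discount
    return (billable_input * input_rate + output_tokens * output_rate) / 1_000_000

# Hypothetical: 10k input / 2k output at $3 / $15 per 1M tokens
cost = request_cost(10_000, 2_000, 3.0, 15.0)  # → 0.06
```

Note that output tokens dominate here despite being 5× fewer — a common surprise when the output rate is several times the input rate.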
Subscription model (consumer) — ChatGPT Plus / Team / Enterprise · Claude Pro / Team · Gemini Advanced · Perplexity Pro. Bundles UI · quota · extra features.
Self-hosted — Open weights + your own GPU or cloud GPU. The per-call cost disappears, but GPU time · MLOps staffing · model updates · evaluation · operational burden grow. For small, light workloads, the API; for high traffic or strong data-control needs, self-hosted. The break-even threshold is per-workload.
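The per-workload threshold can be roughed out as a break-even comparison. Every number below is a made-up assumption to illustrate the arithmetic, not a real quote, and the model deliberately ignores capacity limits and the near-zero-but-not-zero marginal cost of self-hosted inference.

```python
def breakeven_requests_per_month(api_cost_per_request: float,
                                 gpu_cost_per_month: float,
                                 ops_cost_per_month: float) -> float:
    """Monthly request volume above which self-hosting is cheaper.

    Simplification: self-hosted marginal cost per request is treated
    as ~0 once the GPU and ops overhead are paid for.
    """
    fixed = gpu_cost_per_month + ops_cost_per_month
    return fixed / api_cost_per_request

# Hypothetical: $0.002/request via API vs $1,500/mo GPU + $3,000/mo ops
threshold = breakeven_requests_per_month(0.002, 1500.0, 3000.0)
# below this volume the API is cheaper; above it, self-hosting wins
```

Even this crude model makes the section's point: the fixed ops cost in the numerator is usually what pushes the threshold into the millions of requests.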
Data usage policy — Even within the same provider, free tier · paid API · enterprise policies differ. Check the terms and the model card every time.
7. Choosing a model
- Fast and cheap, in volume — GPT-4o-mini / Claude Haiku / Gemini Flash / small open models.
- Quality first — GPT-4 / Claude Sonnet · Opus / Gemini Pro / large open models.
- Reinforced reasoning — o1 · o3 / Claude extended thinking / Gemini Thinking / DeepSeek R1.
- On-device · privacy — Small variants of Llama · Gemma · Phi · Qwen + LM Studio · Ollama.
- Heavy Korean — Korean-specialized models · multilingual-strong global models, with your own domain evaluation.
8. Spots where you often get stuck
Alias volatility — Aliases like gpt-4 · gemini-1.5-pro-latest point at different underlying models over time. In production, pin to dated snapshots.
Benchmark over-trust — #1 isn't #1 in your domain.
License differences — Open weights doesn't mean all are commercially usable. Check the model card.
Data training use — Free / paid / enterprise can have different policies. Don't let sensitive info into the input.
Generation-change regression — A new model doesn't beat the old in every aspect. Sometimes regression appears in your tasks.
Advertised vs actual context length — Sometimes the advertised limit and the per-model input · output limit differ.
Reasoning-model token billing — Whether thinking tokens are counted inside the response's output tokens or billed separately differs per provider.
"AGI" · "superhuman" expressions — Marketing expressions should be filtered out when interpreting evaluation results.
Closing thoughts
The LLM landscape shifts fast, so depending on a single model in production carries regression risk. A pinned model version + your own domain evaluation set + an abstraction that swaps models with one environment-variable change + cost monitoring — these four are the baseline for stable operation.
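The one-environment-variable swap can look like this minimal sketch. The variable name LLM_MODEL is an assumption for illustration; the two model ids are examples of the dated-snapshot style the pitfalls section recommends pinning to.

```python
import os

# Pin a dated snapshot as the default; override with one env var.
DEFAULT_MODEL = "gpt-4o-2024-08-06"  # example of a dated model id

def resolve_model() -> str:
    """Every call site reads the model id from one place, so swapping
    models is a deployment-time change, not a code change."""
    return os.environ.get("LLM_MODEL", DEFAULT_MODEL)

os.environ.pop("LLM_MODEL", None)  # clean slate for the demo
model = resolve_model()            # falls back to the pinned default
os.environ["LLM_MODEL"] = "claude-3-5-sonnet-20241022"
swapped = resolve_model()          # picks up the override
```

Paired with a domain evaluation set, this makes a model swap a one-line rollout that can be rolled back just as fast if the new model regresses on your tasks.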
Next
- (end of ai)
References: LMArena · LiveBench · Open LLM Leaderboard · OpenAI Models · Anthropic Models · Gemini Models · Meta Llama · Mistral · DeepSeek.