LLM Landscape — Closed · Open · Korean-Specialized · Evaluation · Pricing
The LLM market shifts fast. Closed-API and open-weight, English-centric and multilingual, cloud-hosted and self-hosted, plus models specialized for Korean — all sit alongside each other.
1. Closed (API · weights private)
| Provider | Representative models | First release |
|---|---|---|
| OpenAI | GPT-3.5 · GPT-4 · GPT-4o · o1 · o3 | ChatGPT 2022-11-30. |
| Anthropic | Claude · Claude 2 · 3 · 3.5 · 4 series | Claude 2023-03. |
| Google DeepMind | Gemini 1.0 · 1.5 · 2.0 · 2.5 | Gemini 2023-12-06. |
| Mistral AI | Mistral Large · Pixtral | 2023~. |
| Cohere | Command R · R+ | 2021~. |
| xAI | Grok series | 2023-11. |
Even within the same provider, model capability shifts quickly across generations and release dates.
2. Open weights
Model families whose weights can be downloaded and run for inference. License conditions differ per model.
| Model family | Origin | Note |
|---|---|---|
| Llama 2 / 3 / 3.1 / 3.2 / 3.3 | Meta | Custom license (conditional commercial). |
| Mistral · Mixtral · Codestral | Mistral AI | Mix of Apache 2.0 variants and non-commercial variants. |
| Gemma · Gemma 2 / 3 | Google | Gemma license. |
| Qwen / Qwen2 / Qwen2.5 / Qwen3 | Alibaba | Many Apache 2.0 variants. |
| DeepSeek (V2 · V3 · R1) | DeepSeek | License conditions vary per model. |
| Phi series | Microsoft | Known for small size. |
| Yi series | 01.AI | 2023~. |
| Falcon | TII (UAE) | 2023~. |
| OLMo | Allen AI | Aims to open even the training data. |
| StableLM · StableCode | Stability AI | 2023~. |
The degree of "open" varies per model: some release only the weights, some also the training code, and a few even the training data. Hence the view that "open weights" is a more accurate label than "open source."
3. Korean-specialized · Korean-company models
| Model | Origin | Note |
|---|---|---|
| HyperCLOVA X | Naver | Released 2023. Self-trained Korean LLM. |
| A.X (Adot X) | SK Telecom | Self-developed Korean model family. |
| Solar | Upstage | Open-weight variant published. |
| EXAONE | LG AI Research | Some open-weight variants published. |
| KoAlpaca · Polyglot-Ko | Community | Korean fine-tuning attempts. |
Korean ability is best judged model by model. Even the same global model family can vary widely in Korean quality across generations.
4. Reasoning models · multimodal · context length
Reasoning models — From late 2024, OpenAI o1 · o3, DeepSeek R1, Claude's extended thinking, and Gemini 2.5's thinking mode set the trend. The model goes through longer internal reasoning before responding, spending more tokens · time accordingly.
Multimodal — Models that take images · audio · video · documents as input alongside text are now standard. GPT-4o · Gemini · Claude 3.x.
Context length expansion:
| Model | Context |
|---|---|
| GPT-4 (initial) | 8k · 32k |
| GPT-4-Turbo / GPT-4o | 128k |
| Claude 3 / 3.5 | 200k |
| Gemini 1.5 Pro | 1M (at release) |
A larger context isn't always the answer: position effects like "lost in the middle" persist, and cost · latency grow with input length.
5. Evaluation sites
| Site | Operator | Trait |
|---|---|---|
| LMArena | LMSYS · UC Berkeley | Human blind comparison of two models → Elo. |
| LiveBench | Abacus.AI | Periodically refreshed eval set (mitigates data leakage). |
| MMLU | Hendrycks et al. 2020 | Multi-subject multiple choice. |
| BigBench / BBH | Google Research | Collection of various hard tasks. |
| HumanEval / MBPP | OpenAI · Google | Standard for coding eval. |
| SWE-bench | Princeton | Real GitHub-issue resolution rate. |
| GAIA | Hugging Face · Meta | General assistant tasks. |
| Open LLM Leaderboard | Hugging Face | Composite for open-weight models. |
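The pairwise Elo scheme behind arena-style leaderboards like LMArena can be sketched as below. The K-factor, initial rating, and 400-point scale are textbook Elo defaults used for illustration, not LMArena's exact parameters (real leaderboards fit ratings over the whole vote history rather than updating one vote at a time).

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a blind A-vs-B vote.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is the step size (illustrative default; leaderboards tune it).
    """
    # Expected win probability of A under the standard logistic Elo curve.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start equal; A wins one blind comparison.
ra, rb = elo_update(1000.0, 1000.0, 1.0)  # → (1016.0, 984.0)
```

A tie between equally rated models changes nothing, and an upset win moves more points than an expected one — which is why a handful of votes against a strong opponent can shift a ranking.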
Limits of evaluation:
- Suspicion of training-data leakage (cases where benchmarks ended up in training data).
- Many evaluations are English-centric.
- A single score doesn't directly tie to your domain performance.
6. Pricing models
Per-token billing (API) — Most closed models price input · output tokens separately. Output tokens are usually more expensive. With the introduction of context caching · prompt caching, cached input gets a discount.
Cost per request ≈ (input tokens × input rate) + (output tokens × output rate)
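The formula above as a tiny helper, with prompt caching folded in. The rates and the 50% cache discount are placeholder numbers for illustration, not any provider's actual price list.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float,
                 cached_tokens: int = 0, cache_discount: float = 0.5) -> float:
    """Dollar cost of one API call. Rates are per 1M tokens.

    cached_tokens are billed at (cache_discount × input rate);
    the discount value is a placeholder — it varies per provider.
    """
    billable_input = (input_tokens - cached_tokens) + cached_tokens * cache_discount
    return (billable_input * input_rate + output_tokens * output_rate) / 1_000_000

# Hypothetical: 10k input / 2k output at $3 / $15 per 1M tokens
cost = request_cost(10_000, 2_000, 3.0, 15.0)  # → 0.06
```

Note that output tokens dominate here despite being 5× fewer — a common surprise when the output rate is several times the input rate.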
Subscription model (consumer) — ChatGPT Plus / Team / Enterprise · Claude Pro / Team · Gemini Advanced · Perplexity Pro. Bundles UI · quota · extra features.
Self-hosted — Open weights + your own GPU or cloud GPU. The per-call cost disappears, but GPU time · MLOps staffing · model updates · evaluation · operational burden grow. For small, light workloads, the API; for high traffic or strong data-control needs, self-hosted. The break-even threshold is per-workload.
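The per-workload threshold can be roughed out as a break-even comparison. Every number below is a made-up assumption to illustrate the arithmetic, not a real quote, and the model deliberately ignores capacity limits and the near-zero-but-not-zero marginal cost of self-hosted inference.

```python
def breakeven_requests_per_month(api_cost_per_request: float,
                                 gpu_cost_per_month: float,
                                 ops_cost_per_month: float) -> float:
    """Monthly request volume above which self-hosting is cheaper.

    Simplification: self-hosted marginal cost per request is treated
    as ~0 once the GPU and ops overhead are paid for.
    """
    fixed = gpu_cost_per_month + ops_cost_per_month
    return fixed / api_cost_per_request

# Hypothetical: $0.002/request via API vs $1,500/mo GPU + $3,000/mo ops
threshold = breakeven_requests_per_month(0.002, 1500.0, 3000.0)
# below this volume the API is cheaper; above it, self-hosting wins
```

Even this crude model makes the section's point: the fixed ops cost in the numerator is usually what pushes the threshold into the millions of requests.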
Data usage policy — Even within the same provider, free tier · paid API · enterprise policies differ. Check the terms and the model card every time.
7. Choosing a model
- Fast and cheap, in volume — GPT-4o-mini / Claude Haiku / Gemini Flash / small open models.
- Quality first — GPT-4 / Claude Sonnet · Opus / Gemini Pro / large open models.
- Reinforced reasoning — o1 · o3 / Claude extended thinking / Gemini Thinking / DeepSeek R1.
- On-device · privacy — Small variants of Llama · Gemma · Phi · Qwen + LM Studio · Ollama.
- Heavy Korean — Korean-specialized models · multilingual-strong global models, with your own domain evaluation.
8. Spots where you often get stuck
Alias volatility — Aliases like gpt-4 · gemini-1.5-pro-latest point at different underlying models over time. In production, pin to dated snapshots.
Benchmark over-trust — #1 isn't #1 in your domain.
License differences — Open weights doesn't mean all are commercially usable. Check the model card.
Data training use — Free / paid / enterprise can have different policies. Don't let sensitive info into the input.
Generation-change regression — A new model doesn't beat the old in every aspect. Sometimes regression appears in your tasks.
Advertised vs actual context length — Sometimes the advertised limit and the per-model input · output limit differ.
Reasoning-model token billing — Whether thinking tokens are counted inside the response's output tokens or billed separately differs per provider.
"AGI" · "superhuman" expressions — Marketing expressions should be filtered out when interpreting evaluation results.
Closing thoughts
The LLM landscape shifts fast, so depending on a single model in production carries regression risk. A pinned model version + your own domain evaluation set + an abstraction that swaps models with one environment-variable change + cost monitoring — these four are the baseline for stable operation.
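The one-environment-variable swap can look like this minimal sketch. The variable name LLM_MODEL is an assumption for illustration; the two model ids are examples of the dated-snapshot style the pitfalls section recommends pinning to.

```python
import os

# Pin a dated snapshot as the default; override with one env var.
DEFAULT_MODEL = "gpt-4o-2024-08-06"  # example of a dated model id

def resolve_model() -> str:
    """Every call site reads the model id from one place, so swapping
    models is a deployment-time change, not a code change."""
    return os.environ.get("LLM_MODEL", DEFAULT_MODEL)

os.environ.pop("LLM_MODEL", None)  # clean slate for the demo
model = resolve_model()            # falls back to the pinned default
os.environ["LLM_MODEL"] = "claude-3-5-sonnet-20241022"
swapped = resolve_model()          # picks up the override
```

Paired with a domain evaluation set, this makes a model swap a one-line rollout that can be rolled back just as fast if the new model regresses on your tasks.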
Next
- (end of ai)
References: LMArena · LiveBench · Open LLM Leaderboard · OpenAI Models · Anthropic Models · Gemini Models · Meta Llama · Mistral · DeepSeek.