Prompt Design — Message Roles · CoT · ReAct · Sampling · Injection

A prompt is closer to the design of an input interface itself than just a string sent to an LLM. Message structure · reasoning patterns · sampling parameters · security all entangle at once.

1. About message roles

OpenAI's ChatML, which settled around late 2022, made message roles the standard shape. Most current Chat Completions-compatible APIs follow the same structure.

Role	Meaning
`system`	The model's persona · rules · output format directives.
`user`	The end user's input.
`assistant`	The model's previous responses (conversation history).
`tool` (or `function`)	Returns the result of a tool call.

The Anthropic Claude API has a slightly different shape, putting system outside the messages array as a separate field. The meaning is the same.

2. Tokenizers

LLMs don't see text as is — they see it in token units. tiktoken (OpenAI) · SentencePiece · BPE variants differ per model. Korean tends to use more tokens than English (1.5~2× for the same meaning). Token count determines cost · context limit, so the tokenizer difference is worth being aware of.

3. Zero-shot · Few-shot

Zero-shot — Instructions only, without examples.
Few-shot — A few input·output examples included. Brown et al.'s GPT-3 paper (2020) labeled this in-context learning.

input: apple
output: red

input: banana
output: yellow

input: grape
output:

4. Chain-of-Thought (CoT)

A pattern formalized by Wei et al. in 2022. Prompts like "let's think step by step" or examples that walk through reasoning push the model to output its thought process. Reports show improved accuracy on arithmetic · logic · multi-step reasoning.

Kojima et al. (2022) showed that the single line "Let's think step by step." produces a CoT effect even in zero-shot.

Self-Consistency (Wang et al. 2022) — Solve the same question multiple times at high temperature, then majority-vote. Often discussed alongside CoT.

5. ReAct

Yao et al. (2022). A blend of Reasoning + Acting. The model explicitly outputs a "Thought → Action → Observation" loop, weaving tool calls and reasoning into a single flow. Many modern agent frameworks are based on ReAct variants.

Thought: The user is asking about yesterday's exchange rate.
Action: search("yesterday USD/KRW")
Observation: 1378.5
Thought: Compose the answer.
Final Answer: ...

Tree-of-Thought (Yao et al. 2023) — Instead of a single chain, multiple reasoning paths are spread out as a tree, and each node is evaluated · selected. More expensive, but suited for problems that need search.

6. Structured output

The place where output is forced into a fixed schema like JSON · XML:

Provide an explicit schema + examples.
JSON mode or schema-enforce options on OpenAI · Anthropic · Google.
Library-side validation · retry (Pydantic AI · Instructor · Outlines).

7. Sampling parameters

Parameter	Meaning	Effect
`temperature`	Flatten the probability distribution (0~2)	Closer to 0 = more deterministic.
`top_p` (nucleus)	Candidates up to cumulative probability p	1.0 = all, lower = narrower.
`top_k`	Only top k candidates	Support varies by model · API.
`presence_penalty`	Penalty on already-appeared tokens	Drives new topics.
`frequency_penalty`	Penalty on frequently-appeared tokens	Suppresses repetition.
`seed`	Try reproducibility by fixing the seed	Guarantees vary by API.

Threads of choice:

For deterministic outputs like classification·extraction — temperature=0.
For writing·ideation — temperature=0.7~1.0 + top_p=0.9.
Squeezing both temperature and top_p hard at once makes outputs monotone. The common recommendation is to tune only one.

8. Quirks of Korean prompts

Token efficiency — Korean often expresses the same meaning with more tokens than English. This affects context and cost.

Mixed honorific·casual register — Without explicit instruction, the model may mix tones inconsistently. Specify the tone in the system message.

Proper nouns·abbreviations — Loanword spellings can vary by model. Add the English form on first appearance.

Variation in Korean ability per model — Even within the same model family, Korean stability differs from English.

9. System message patterns

You are a professional editor.
- Reply in Korean.
- Reduce assertive expressions and describe objectively.
- Output is in markdown format.
- Say "I don't know" when you don't know.

The observation is that overly long rules cause the model to ignore parts. Keeping core rules short and showing exceptions through examples is more stable.

Threads of evaluation:

Build a small evaluation set first (20~100 cases).
When changing the model · prompt, rerun the same evaluation set to watch for regressions.
Evaluation can also be automated by an LLM, but human review in parallel is recommended.

10. Prompt injection

Since 2023, OWASP has organized the LLM Top 10, placing LLM01:2025 Prompt Injection as the top threat.

Direct injection — A user inserts text like "ignore previous instructions and …" into input to alter model behavior.

Indirect injection — Instructions planted in advance by an attacker inside external materials (web pages · email · PDFs) the model retrieves · summarizes. Unless the model distinguishes tool-result text from user intent, it tries to follow them.

Threads of mitigation:

Trust boundary separation — different trust levels for system · user input · external tool output.
Constrain in system to quote tool results but not follow instructions inside them.
Output validation · tool whitelist · least privilege.
Put irreversible actions (file deletion · payment) behind a separate human approval step.

The consensus is that complete defense doesn't exist yet. Leave room for regression when the model changes.

11. Spots where you often get stuck

Context position effect — "lost in the middle." Material in the middle is reflected less than at either end.

CoT exposure — There may be a policy not to show the reasoning to users. Separate channel · summary separation.

Confident wrong answers — The model gives natural-sounding answers even where it doesn't know. Specify "say I don't know when you don't know" + complement with tools · search.

Few-shot example bias — The format · length · order of examples affects outputs. Diversify and randomize order of examples.

Model version differences — Even with the same model name, behavior varies by point in time. Pin to API model snapshots · dates.

Even at temperature=0 it isn't identical — Reports indicate that infrastructure-side non-determinism prevents complete reproducibility.

Limits of JSON mode — Even with a forced schema, the result can have semantically empty values · inconsistent labels. Postprocessing validation is needed.

Closing thoughts

Small prompt changes can produce big differences in results, but without measurement it's hard to be sure which change improved things. Start with a small evaluation set, pin the model, then tune incrementally — that's the firmest flow. For injection mitigation, trust-boundary separation at the system-design stage is the core; almost no place can be defended by a single prompt line.

gemini-api
embeddings-deep

References: Brown et al. GPT-3 (2020) · Wei et al. CoT (2022) · Yao et al. ReAct (2022) · Liu et al. Lost in the Middle (2023) · OpenAI Prompt Guide · Anthropic Prompt Guide · Gemini Prompt Guide · OWASP LLM Top 10.

Prompt Design — Message Roles · CoT · ReAct · Sampling · Injection

Prompt Design — Message Roles · CoT · ReAct · Sampling · Injection

1. About message roles

2. Tokenizers

3. Zero-shot · Few-shot

4. Chain-of-Thought (CoT)

5. ReAct

6. Structured output

7. Sampling parameters

8. Quirks of Korean prompts

9. System message patterns

10. Prompt injection

11. Spots where you often get stuck

Closing thoughts

Next

Back to ai