Prompt Design — Message Roles · CoT · ReAct · Sampling · Injection
Prompt Design — Message Roles · CoT · ReAct · Sampling · Injection
A prompt is closer to the design of an input interface itself than just a string sent to an LLM. Message structure · reasoning patterns · sampling parameters · security all entangle at once.
1. About message roles
OpenAI's ChatML, which settled around late 2022, made message roles the standard shape. Most current Chat Completions-compatible APIs follow the same structure.
| Role | Meaning |
|---|---|
system |
The model's persona · rules · output format directives. |
user |
The end user's input. |
assistant |
The model's previous responses (conversation history). |
tool (or function) |
Returns the result of a tool call. |
The Anthropic Claude API has a slightly different shape, putting system outside the messages array as a separate field. The meaning is the same.
2. Tokenizers
LLMs don't see text as is — they see it in token units. tiktoken (OpenAI) · SentencePiece · BPE variants differ per model. Korean tends to use more tokens than English (1.5~2× for the same meaning). Token count determines cost · context limit, so the tokenizer difference is worth being aware of.
3. Zero-shot · Few-shot
- Zero-shot — Instructions only, without examples.
- Few-shot — A few input·output examples included. Brown et al.'s GPT-3 paper (2020) labeled this in-context learning.
input: apple
output: red
input: banana
output: yellow
input: grape
output:
4. Chain-of-Thought (CoT)
A pattern formalized by Wei et al. in 2022. Prompts like "let's think step by step" or examples that walk through reasoning push the model to output its thought process. Reports show improved accuracy on arithmetic · logic · multi-step reasoning.
Kojima et al. (2022) showed that the single line "Let's think step by step." produces a CoT effect even in zero-shot.
Self-Consistency (Wang et al. 2022) — Solve the same question multiple times at high temperature, then majority-vote. Often discussed alongside CoT.
5. ReAct
Yao et al. (2022). A blend of Reasoning + Acting. The model explicitly outputs a "Thought → Action → Observation" loop, weaving tool calls and reasoning into a single flow. Many modern agent frameworks are based on ReAct variants.
Thought: The user is asking about yesterday's exchange rate.
Action: search("yesterday USD/KRW")
Observation: 1378.5
Thought: Compose the answer.
Final Answer: ...
Tree-of-Thought (Yao et al. 2023) — Instead of a single chain, multiple reasoning paths are spread out as a tree, and each node is evaluated · selected. More expensive, but suited for problems that need search.
6. Structured output
The place where output is forced into a fixed schema like JSON · XML:
- Provide an explicit schema + examples.
- JSON mode or schema-enforce options on OpenAI · Anthropic · Google.
- Library-side validation · retry (Pydantic AI · Instructor · Outlines).
7. Sampling parameters
| Parameter | Meaning | Effect |
|---|---|---|
temperature |
Flatten the probability distribution (0~2) | Closer to 0 = more deterministic. |
top_p (nucleus) |
Candidates up to cumulative probability p | 1.0 = all, lower = narrower. |
top_k |
Only top k candidates | Support varies by model · API. |
presence_penalty |
Penalty on already-appeared tokens | Drives new topics. |
frequency_penalty |
Penalty on frequently-appeared tokens | Suppresses repetition. |
seed |
Try reproducibility by fixing the seed | Guarantees vary by API. |
Threads of choice:
- For deterministic outputs like classification·extraction —
temperature=0. - For writing·ideation —
temperature=0.7~1.0+top_p=0.9. - Squeezing both temperature and top_p hard at once makes outputs monotone. The common recommendation is to tune only one.
8. Quirks of Korean prompts
Token efficiency — Korean often expresses the same meaning with more tokens than English. This affects context and cost.
Mixed honorific·casual register — Without explicit instruction, the model may mix tones inconsistently. Specify the tone in the system message.
Proper nouns·abbreviations — Loanword spellings can vary by model. Add the English form on first appearance.
Variation in Korean ability per model — Even within the same model family, Korean stability differs from English.
9. System message patterns
You are a professional editor.
- Reply in Korean.
- Reduce assertive expressions and describe objectively.
- Output is in markdown format.
- Say "I don't know" when you don't know.
The observation is that overly long rules cause the model to ignore parts. Keeping core rules short and showing exceptions through examples is more stable.
Threads of evaluation:
- Build a small evaluation set first (20~100 cases).
- When changing the model · prompt, rerun the same evaluation set to watch for regressions.
- Evaluation can also be automated by an LLM, but human review in parallel is recommended.
10. Prompt injection
Since 2023, OWASP has organized the LLM Top 10, placing LLM01:2025 Prompt Injection as the top threat.
Direct injection — A user inserts text like "ignore previous instructions and …" into input to alter model behavior.
Indirect injection — Instructions planted in advance by an attacker inside external materials (web pages · email · PDFs) the model retrieves · summarizes. Unless the model distinguishes tool-result text from user intent, it tries to follow them.
Threads of mitigation:
- Trust boundary separation — different trust levels for system · user input · external tool output.
- Constrain in
systemto quote tool results but not follow instructions inside them. - Output validation · tool whitelist · least privilege.
- Put irreversible actions (file deletion · payment) behind a separate human approval step.
The consensus is that complete defense doesn't exist yet. Leave room for regression when the model changes.
11. Spots where you often get stuck
Context position effect — "lost in the middle." Material in the middle is reflected less than at either end.
CoT exposure — There may be a policy not to show the reasoning to users. Separate channel · summary separation.
Confident wrong answers — The model gives natural-sounding answers even where it doesn't know. Specify "say I don't know when you don't know" + complement with tools · search.
Few-shot example bias — The format · length · order of examples affects outputs. Diversify and randomize order of examples.
Model version differences — Even with the same model name, behavior varies by point in time. Pin to API model snapshots · dates.
Even at temperature=0 it isn't identical — Reports indicate that infrastructure-side non-determinism prevents complete reproducibility.
Limits of JSON mode — Even with a forced schema, the result can have semantically empty values · inconsistent labels. Postprocessing validation is needed.
Closing thoughts
Small prompt changes can produce big differences in results, but without measurement it's hard to be sure which change improved things. Start with a small evaluation set, pin the model, then tune incrementally — that's the firmest flow. For injection mitigation, trust-boundary separation at the system-design stage is the core; almost no place can be defended by a single prompt line.
Next
- gemini-api
- embeddings-deep
References: Brown et al. GPT-3 (2020) · Wei et al. CoT (2022) · Yao et al. ReAct (2022) · Liu et al. Lost in the Middle (2023) · OpenAI Prompt Guide · Anthropic Prompt Guide · Gemini Prompt Guide · OWASP LLM Top 10.