Documents written in the native language
Whether to write docs and comments in a native language other than English is a decision that development teams in non-English-speaking regions often face. This article covers the convention of keeping identifiers in English while writing documentation and comments in the native language, along with the Unicode identifier standards that make other choices possible.
1. Two kinds of text
Software code contains two kinds of text:
- Identifiers (variables, functions, types, file names) — the surface that meets the compiler and language specification.
- Natural language (comments, docs, logs, error messages) — text only humans read.
A common split adopted by non-English-speaking teams:
- Identifiers — English. This keeps them consistent with external libraries and standard APIs, and makes tooling and search easier.
- Comments, docs, commit messages — native language. This keeps internal team communication precise.
- User-facing text — separated through a localization system (i18n).
This split is not an absolute rule, but it is commonly described as the de facto pattern in internal codebases of Japanese, Korean, Chinese, German, and French companies.
2. Unicode identifiers
Most modern languages allow non-ASCII characters in identifiers:
- Python — Unicode identifiers allowed by PEP 3131 (written in 2007, implemented in Python 3.0). The rules follow UAX #31.
- JavaScript / ECMAScript — Unicode identifiers based on ID_Start / ID_Continue since ECMAScript 2015 (ES6).
- Java — Unicode identifiers allowed from the start.
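- C# — Unicode identifiers allowed (the @ prefix lets a keyword be used as an identifier).
- Rust — non-ASCII identifiers allowed via RFC 2457 (proposed in 2018, stable since Rust 1.53 in 2021). Enabled by default; a team can opt out with the non_ascii_idents lint.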
- Go — Unicode identifiers allowed.
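As a quick illustration of the Python entry in the list above, the sketch below checks a few candidate names with the built-in str.isidentifier() and shows that Python normalizes identifiers to NFKC while parsing (as PEP 3131 specifies). The helper name is_valid_identifier and the sample names are made up for this example.

```python
import keyword
import unicodedata


def is_valid_identifier(name: str) -> bool:
    """True if `name` could be used as a Python identifier (keywords excluded)."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Hangul and Chinese-character names are accepted; a leading digit, a hyphen,
# or a reserved keyword is not.
for candidate in ["user_count", "사용자_수", "用户数", "2nd_user", "user-count", "class"]:
    print(f"{candidate!r}: {is_valid_identifier(candidate)}")

# Identifiers are normalized to NFKC when source code is parsed, so the
# decomposed (NFD) and composed (NFC) spellings of a name bind the same variable.
composed = unicodedata.normalize("NFC", "값")
decomposed = unicodedata.normalize("NFD", "값")
namespace: dict = {}
exec(f"{decomposed} = 42", namespace)
print(decomposed == composed)  # the raw strings differ
print(namespace[composed])     # yet the composed name finds the variable
```

The normalization step matters in practice: two source files can spell the "same" Hangul name with different code points and Python will still treat them as one identifier, while tools such as grep compare raw bytes and will not.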
So Korean / Chinese-character identifiers are syntactically possible in many languages. Yet they are rarely adopted in practice because:
- External compatibility — standard APIs, libraries, docs, and examples are in English. English identifiers blend naturally with them.
- Tools — IDE autocomplete, stack traces, log parsers, and grep often assume English.
- Typing / IME — switching IME every time you type an identifier breaks flow.
- Security — homoglyph attacks. Latin a (U+0061) and Cyrillic а (U+0430) look identical, which allows identifier spoofing. Some toolchains (such as the Rust compiler) emit lint warnings for this.
For these reasons, the split of identifiers in English and natural-language text in the native language is often recommended.
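The homoglyph risk listed above can be made concrete with a small check. This is a rough sketch, not an implementation of UTS #39: Python's standard library does not expose the Unicode Script property, so the code approximates a letter's script by the first word of its character name. The function names scripts_in and looks_spoofable are invented for this example.

```python
import unicodedata


def scripts_in(identifier: str) -> set[str]:
    """Rough script labels for the letters in `identifier`.

    Heuristic: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", "HANGUL", ...), not the real Script property.
    """
    labels = set()
    for ch in identifier:
        if ch.isalpha():
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels


def looks_spoofable(identifier: str) -> bool:
    """Flag identifiers that mix Latin letters with Cyrillic or Greek ones."""
    found = scripts_in(identifier)
    return "LATIN" in found and bool(found & {"CYRILLIC", "GREEK"})


plain = "payload"
spoofed = "p\u0430yload"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

print(plain == spoofed)  # the strings differ although they render alike
print(scripts_in(spoofed))
print(looks_spoofable(plain), looks_spoofable(spoofed))
```

A check like this could run in a pre-commit hook over identifiers extracted from source files; a production setup would more likely rely on the compiler or linter warnings mentioned above.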
3. Other paths
All-English — the standard choice for global OSS projects or multinational teams. Easier for outside contributors to join, with less friction at public release. The downside is reduced communication accuracy for non-English speakers. Subtle business rules written only in English can blur the intent.
All-native — unify identifiers in the native language too. Occasionally seen in internal-only or educational code. Maintenance burden tends to grow because consistency breaks at the moment external dependencies (library calls) mix in.
English identifiers + native natural language (mixed) — the most common compromise. New team members can adapt to both external standards and internal context quickly.
The choice varies by team composition, customers, and external exposure.
4. Practice for Korean documents
For internal docs and comments written in Korean, the following are often recommended:
- Half-width / full-width distinction — code blocks and identifiers in half-width only, body text in regular Hangul.
- Loanword notation — keep library/product names in their original form (PostgreSQL, React). Mixing transliteration and original form makes search hard.
- Technical terms — established Korean translations ("비동기", "동시성") in Korean; for unsettled terms keep the original ("memoization") or list both once side by side.
- Line breaks / punctuation — one sentence per line (or one paragraph per line) helps git diff readability.
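The half-width rule for code and identifiers in the list above is easy to break when an IME is left in full-width mode. The sketch below, with the hypothetical helpers find_fullwidth and to_halfwidth, uses unicodedata.east_asian_width to locate full-width characters and NFKC normalization to convert only those characters; Hangul body text is left untouched.

```python
import unicodedata


def find_fullwidth(text: str) -> list[tuple[int, str]]:
    """Return (index, character) for every full-width character in `text`."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if unicodedata.east_asian_width(ch) == "F"
    ]


def to_halfwidth(text: str) -> str:
    """Replace only the full-width characters with their NFKC (half-width) form."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if unicodedata.east_asian_width(ch) == "F"
        else ch
        for ch in text
    )


# A code snippet typed with full-width "=", "(" and ")".
snippet = "total \uff1d count\uff08x\uff09"

print(find_fullwidth(snippet))
print(to_halfwidth(snippet))
print(to_halfwidth("값을 확인"))  # Hangul is not full-width, so it is unchanged
```

Converting character by character, rather than normalizing the whole string, avoids rewriting other compatibility characters that a document may contain on purpose.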
5. Common shapes
English identifiers with native-language comments (shown here in English):
# Validates the user token and issues a new one if it has expired or is about to expire.
# The expiration threshold is controlled by the environment variable TOKEN_TTL.
def verify_and_refresh(token: str) -> Token:
    decoded = decode(token)
    if decoded.expires_at - now() < REFRESH_THRESHOLD:
        return issue_new(decoded.user_id)
    return decoded
User text separated via i18n:
// messages/ko.json
{ "login.fail": "로그인에 실패했습니다." }
// messages/en.json
{ "login.fail": "Login failed." }
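One way such catalogs might be consumed is a lookup with a fallback locale. The sketch below is only an illustration: the catalogs are inlined instead of being read from messages/ko.json and messages/en.json, and the names CATALOGS, DEFAULT_LOCALE, and translate are assumptions, not part of any particular i18n library.

```python
import json

# Inline stand-ins for the contents of messages/ko.json and messages/en.json.
CATALOGS = {
    "ko": json.loads('{ "login.fail": "로그인에 실패했습니다." }'),
    "en": json.loads('{ "login.fail": "Login failed." }'),
}
DEFAULT_LOCALE = "en"  # assumption: English is the fallback locale


def translate(key: str, locale: str) -> str:
    """Look up `key` for `locale`, then the default locale, then return the key itself."""
    for candidate in (locale, DEFAULT_LOCALE):
        message = CATALOGS.get(candidate, {}).get(key)
        if message is not None:
            return message
    return key


print(translate("login.fail", "ko"))
print(translate("login.fail", "ja"))    # no Japanese catalog, falls back to English
print(translate("signup.done", "ko"))   # unknown key, returned as-is
```

Because application code refers only to the key login.fail, the identifier side stays in English while every user-facing string lives in a catalog.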
6. Common pitfalls
Releasing a library with Hangul identifiers makes it hard for English-speaking users to adopt — keep separate policies for public OSS and internal code.
Homoglyph accidents — Latin a and Cyrillic а mixed into identifiers create hard-to-catch bugs. Keep linter and compiler warnings on.
Hangul filenames have OS-dependent normalization forms — macOS commonly produces NFD (decomposed jamo), while Windows and Linux typically keep NFC (composed). The same Hangul filename can therefore appear in a git repo as two different byte sequences and cause conflicts. This is one reason the convention of preferring English filenames has solidified.
Native-language comments grow too long and go unread — express the same information in code itself (names, structure) first; have comments focus on "why it was done this way".
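The filename normalization pitfall above can be reproduced without touching a filesystem. The following sketch compares the NFC and NFD forms of one Hangul filename; same_filename and find_non_nfc are hypothetical helpers that a repository check might use.

```python
import unicodedata

name_nfc = unicodedata.normalize("NFC", "한글.txt")  # composed syllables
name_nfd = unicodedata.normalize("NFD", "한글.txt")  # decomposed jamo

print(len(name_nfc), len(name_nfd))  # different lengths for the "same" name
print(name_nfc == name_nfd)          # a plain comparison treats them as different


def same_filename(a: str, b: str) -> bool:
    """Compare two filenames after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find_non_nfc(paths: list[str]) -> list[str]:
    """Return the paths that are not already in NFC form."""
    return [p for p in paths if unicodedata.normalize("NFC", p) != p]


print(same_filename(name_nfc, name_nfd))
print(find_non_nfc(["README.md", name_nfc, name_nfd]))
```

Running a check like find_non_nfc over the tracked paths in CI is one way to catch decomposed names before they reach collaborators on other operating systems.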
Closing thoughts
The most natural split for non-English-speaking teams is English identifiers + native-language natural language + i18n user text. When these three balance, you can have both the consistency of external standards and the accuracy of internal communication. Korean identifiers may be syntactically possible, but the costs of tools, IME, and homoglyph characters are high.
References: PEP 3131, Supporting Non-ASCII Identifiers · Unicode UAX #31, Identifier and Pattern Syntax · ECMAScript, Identifier Names and Identifiers · Rust RFC 2457, Non-ASCII identifiers · Unicode Security Considerations (UTS #39) · National Institute of Korean Language · The Twelve-Factor App.