Text encoding and line endings
Text encoding and line endings
Even for the same character, the bytes a computer stores depend on the era and environment. As a result, Korean text breaks or git diff fills with meaningless line changes.
1. A short history of encoding
ASCII (1963) — American Standard Code for Information Interchange. 7-bit (0-127). English letters, digits, basic symbols. The bedrock of English text but unable to represent Korean, Japanese, or Chinese.
EUC-KR · CP949 — EUC-KR is an 8-bit encoding for Korean. KS X 1001 hanja and Hangul ranges in 2 bytes. Even so, EUC-KR's Hangul range is limited to KS X 1001's 2,350 syllables, leaving some out. CP949 (code page 949) is Microsoft's extension of EUC-KR carrying all 11,172 Hangul syllables. The Korean Windows ANSI code page is CP949.
Unicode (1991) and UTF-8 (1993) — the unified character set the Unicode Consortium published in 1991. Every human character receives a code point (Hangul '가' = U+AC00). UTF-8 is the variable-length encoding Ken Thompson and Rob Pike designed in 1993:
- The ASCII range (0-127) takes 1 byte and is fully ASCII compatible.
- Hangul usually takes 3 bytes.
- Supplementary plane characters such as emoji take 4 bytes.
UTF-8's strengths are ASCII compatibility, self-synchronization (decoding can start mid-stream and find boundaries), and endian independence. The web, Unix, and most modern tools default to UTF-8. W3C has recommended UTF-8 in HTML since 2008.
UTF-16 — the 16-bit-unit encoding used inside Java, .NET, Windows, and some older systems. Hangul is 2 bytes, emoji is 4 (surrogate pairs). Endianness (LE/BE) is its own concern.
2. BOM (Byte Order Mark)
A BOM is a marker byte at the start of a file that announces the encoding and endianness.
| Encoding | BOM bytes |
|---|---|
| UTF-8 | EF BB BF |
| UTF-16 LE | FF FE |
| UTF-16 BE | FE FF |
UTF-8 has no endianness so a BOM is not required, but some tools rely on it for encoding detection:
- Microsoft Excel — opening a CSV on Korean Windows guesses ANSI (CP949). UTF-8 CSVs commonly break, so saving with a BOM lets Excel recognize UTF-8.
- Where it hurts — a BOM at the top of a shell script breaks the shebang (
#!/usr/bin/env bash). Some JSON parsers also struggle with BOMs.
3. Line endings
| Notation | Bytes | Origin |
|---|---|---|
| LF | 0A |
Unix family, modern macOS, Linux, Web |
| CR | 0D |
Old Mac (System 9 and earlier). Nearly extinct. |
| CRLF | 0D 0A |
DOS, Windows, some network protocols (HTTP and SMTP headers) |
CR (Carriage Return) and LF (Line Feed) come from mechanical teletypes — "carriage to the left edge" and "advance one line." DOS chose CRLF, which sends both. Unix simplified to a single LF.
Most modern tools handle either side automatically. Differences surface in specific places: git diff, shell script execution, some JSON or YAML parsers, Python's open(..., newline='').
4. git's line-ending handling
core.autocrlf:
| Value | Behavior |
|---|---|
true |
LF→CRLF on checkout, CRLF→LF on commit (recommended on Windows). |
input |
CRLF→LF on commit only (recommended on macOS, Linux). |
false |
No conversion. |
.gitattributes (recommended SSOT) — declared at the repository level. Takes precedence over core.autocrlf:
* text=auto
*.sh text eol=lf
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
*.png binary
*.jpg binary
text=auto lets git normalize files it identifies as text. eol=lf or eol=crlf enforces line endings regardless of OS. Mark binaries as binary.
5. PowerShell 5.1 encoding traps
Windows PowerShell 5.1's output cmdlets (Out-File, Set-Content, >) default to UTF-16 LE (with BOM). Files turn out broken in environments expecting UTF-8.
# Safe pattern under 5.1
"hello" | Out-File -Encoding utf8 file.txt
[System.IO.File]::WriteAllText("file.txt", "hello", [System.Text.UTF8Encoding]::new($false)) # UTF-8 without BOM
PowerShell 7+ defaults to UTF-8 (no BOM), so the issue largely disappears.
6. The VS Code indicator
The VS Code status bar's bottom-right shows the current file's encoding and line ending:
- Encoding —
UTF-8,UTF-8 with BOM,Windows 1252. - Line ending —
LF,CRLF.
Clicking opens convert and reinterpret options. "Save with Encoding" rewrites the file, "Reopen with Encoding" reads a broken file with the right encoding.
{
"files.encoding": "utf8",
"files.eol": "\n"
}
7. Common commands
| Task | macOS · Linux | Windows |
|---|---|---|
| Check file encoding | file -I path |
(PowerShell) Get-Content path -Encoding Byte -TotalCount 3 |
| Force convert to LF | dos2unix path |
dos2unix path (Git Bash) |
| Force convert to CRLF | unix2dos path |
unix2dos path |
| Check git encoding | git ls-files --eol |
Same |
8. Common pitfalls
A .sh saved as CRLF gets \r appended to the shebang's interpreter, so the system tries to run bash\r and fails.
Parsing UTF-8 JSON with a BOM via the standard library errors out in some languages (an older era of JSON.parse rejected BOM).
Korean breakage in Excel UTF-8 CSV — add a BOM, or use the "Data → Get External Data" menu to specify the encoding.
Notepad on Windows saving as UTF-16 LE so Linux tools fail — Notepad on Windows 10 1903+ defaults to UTF-8 (no BOM).
Python's open(path, "r") defaults the encoding to the OS locale. On Korean Windows that may be cp949, so pass encoding="utf-8" explicitly.
Closing thoughts
Once UTF-8 + LF + .gitattributes settle in as the SSOT trio, encoding and line-ending accidents from OS differences nearly disappear. The two spots still worth babysitting are PowerShell 5.1's UTF-16 LE default and Excel's CP949 guesswork.
Next
- first-terminal-day
- data-formats
The Unicode Standard · RFC 3629 UTF-8 · git-scm gitattributes · Microsoft Code Pages · Wikipedia Newline for reference.