Text encoding and line endings

Even for the same character, the bytes a computer stores depend on the era and environment. As a result, Korean text breaks or git diff fills with meaningless line changes.

1. A short history of encoding

ASCII (1963) — American Standard Code for Information Interchange. 7-bit (0-127). English letters, digits, basic symbols. The bedrock of English text but unable to represent Korean, Japanese, or Chinese.

EUC-KR · CP949 — EUC-KR is an 8-bit encoding for Korean. KS X 1001 hanja and Hangul ranges in 2 bytes. Even so, EUC-KR's Hangul range is limited to KS X 1001's 2,350 syllables, leaving some out. CP949 (code page 949) is Microsoft's extension of EUC-KR carrying all 11,172 Hangul syllables. The Korean Windows ANSI code page is CP949.

Unicode (1991) and UTF-8 (1993) — the unified character set the Unicode Consortium published in 1991. Every human character receives a code point (Hangul '가' = U+AC00). UTF-8 is the variable-length encoding Ken Thompson and Rob Pike designed in 1993:

The ASCII range (0-127) takes 1 byte and is fully ASCII compatible.
Hangul usually takes 3 bytes.
Supplementary plane characters such as emoji take 4 bytes.

UTF-8's strengths are ASCII compatibility, self-synchronization (decoding can start mid-stream and find boundaries), and endian independence. The web, Unix, and most modern tools default to UTF-8. W3C has recommended UTF-8 in HTML since 2008.

UTF-16 — the 16-bit-unit encoding used inside Java, .NET, Windows, and some older systems. Hangul is 2 bytes, emoji is 4 (surrogate pairs). Endianness (LE/BE) is its own concern.

2. BOM (Byte Order Mark)

A BOM is a marker byte at the start of a file that announces the encoding and endianness.

Encoding	BOM bytes
UTF-8	`EF BB BF`
UTF-16 LE	`FF FE`
UTF-16 BE	`FE FF`

UTF-8 has no endianness so a BOM is not required, but some tools rely on it for encoding detection:

Microsoft Excel — opening a CSV on Korean Windows guesses ANSI (CP949). UTF-8 CSVs commonly break, so saving with a BOM lets Excel recognize UTF-8.
Where it hurts — a BOM at the top of a shell script breaks the shebang (#!/usr/bin/env bash). Some JSON parsers also struggle with BOMs.

3. Line endings

Notation	Bytes	Origin
LF	`0A`	Unix family, modern macOS, Linux, Web
CR	`0D`	Old Mac (System 9 and earlier). Nearly extinct.
CRLF	`0D 0A`	DOS, Windows, some network protocols (HTTP and SMTP headers)

CR (Carriage Return) and LF (Line Feed) come from mechanical teletypes — "carriage to the left edge" and "advance one line." DOS chose CRLF, which sends both. Unix simplified to a single LF.

Most modern tools handle either side automatically. Differences surface in specific places: git diff, shell script execution, some JSON or YAML parsers, Python's open(..., newline='').

4. git's line-ending handling

core.autocrlf:

Value	Behavior
`true`	LF→CRLF on checkout, CRLF→LF on commit (recommended on Windows).
`input`	CRLF→LF on commit only (recommended on macOS, Linux).
`false`	No conversion.

.gitattributes (recommended SSOT) — declared at the repository level. Takes precedence over core.autocrlf:

* text=auto
*.sh text eol=lf
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
*.png binary
*.jpg binary

text=auto lets git normalize files it identifies as text. eol=lf or eol=crlf enforces line endings regardless of OS. Mark binaries as binary.

5. PowerShell 5.1 encoding traps

Windows PowerShell 5.1's output cmdlets (Out-File, Set-Content, >) default to UTF-16 LE (with BOM). Files turn out broken in environments expecting UTF-8.

# Safe pattern under 5.1
"hello" | Out-File -Encoding utf8 file.txt
[System.IO.File]::WriteAllText("file.txt", "hello", [System.Text.UTF8Encoding]::new($false))  # UTF-8 without BOM

PowerShell 7+ defaults to UTF-8 (no BOM), so the issue largely disappears.

6. The VS Code indicator

The VS Code status bar's bottom-right shows the current file's encoding and line ending:

Encoding — UTF-8, UTF-8 with BOM, Windows 1252.
Line ending — LF, CRLF.

Clicking opens convert and reinterpret options. "Save with Encoding" rewrites the file, "Reopen with Encoding" reads a broken file with the right encoding.

{
  "files.encoding": "utf8",
  "files.eol": "\n"
}

7. Common commands

Task	macOS · Linux	Windows
Check file encoding	`file -I path`	(PowerShell) `Get-Content path -Encoding Byte -TotalCount 3`
Force convert to LF	`dos2unix path`	`dos2unix path` (Git Bash)
Force convert to CRLF	`unix2dos path`	`unix2dos path`
Check git encoding	`git ls-files --eol`	Same

8. Common pitfalls

A .sh saved as CRLF gets \r appended to the shebang's interpreter, so the system tries to run bash\r and fails.

Parsing UTF-8 JSON with a BOM via the standard library errors out in some languages (an older era of JSON.parse rejected BOM).

Korean breakage in Excel UTF-8 CSV — add a BOM, or use the "Data → Get External Data" menu to specify the encoding.

Notepad on Windows saving as UTF-16 LE so Linux tools fail — Notepad on Windows 10 1903+ defaults to UTF-8 (no BOM).

Python's open(path, "r") defaults the encoding to the OS locale. On Korean Windows that may be cp949, so pass encoding="utf-8" explicitly.

Closing thoughts

Once UTF-8 + LF + .gitattributes settle in as the SSOT trio, encoding and line-ending accidents from OS differences nearly disappear. The two spots still worth babysitting are PowerShell 5.1's UTF-16 LE default and Excel's CP949 guesswork.

first-terminal-day
data-formats

The Unicode Standard · RFC 3629 UTF-8 · git-scm gitattributes · Microsoft Code Pages · Wikipedia Newline for reference.

Text encoding and line endings

Text encoding and line endings

1. A short history of encoding

2. BOM (Byte Order Mark)

3. Line endings

4. git's line-ending handling

5. PowerShell 5.1 encoding traps

6. The VS Code indicator

7. Common commands

8. Common pitfalls

Closing thoughts

Next

More in environment