Regular expressions — finding strings by pattern
Whenever someone asks "pull just the emails out of this text", the first tool that comes to mind is the regular expression (regex). Short strings carry big expressive power, and that same density makes regex look alien to newcomers. This article covers regex's origins, the core metacharacters, common patterns, language differences, and the trap of "the perfect regex".
1. About regular expressions
The roots reach back to mathematician Stephen Kleene's 1956 notation for regular sets. In 1968, Ken Thompson at Bell Labs added regex search to the QED text editor, and his paper that year set the standard for the regex → NFA conversion algorithm. After that came grep in 1973, sed and awk later in the 1970s, and Perl in 1987, which made regex ubiquitous.
Today's regex syntax splits into two main lineages:
| Lineage | Origin | Where |
|---|---|---|
| POSIX BRE / ERE | POSIX.2 standard (1992) | Traditional grep, sed |
| PCRE (Perl Compatible) | 1987 Perl, 1997 PCRE library | Almost every modern tool |
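JS, Python, Java, .NET, Go, Rust, and Ruby all resemble PCRE but differ in subtle ways. The same pattern often works in one language and fails, or quietly matches something else, in another. Go and Rust, for example, use linear-time engines that leave out backreferences and lookaround.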
2. Metacharacters
| Character | Meaning |
|---|---|
| `.` | Any single character except a newline |
| `^` · `$` | Start and end of a line (or string) |
| `*` | Previous token, 0 or more times |
| `+` | Previous token, 1 or more times |
| `?` | Previous token, 0 or 1 times |
| `{n}` · `{n,m}` | Exactly n times, or n–m times |
| `[abc]` | Character class (one of these) |
| `[^abc]` | Negated class (none of these) |
| `[a-z]` | Range |
| `\d` · `\D` | Digit, non-digit |
| `\w` · `\W` | Word char (`[A-Za-z0-9_]`), or not |
| `\s` · `\S` | Whitespace, non-whitespace |
| `\b` · `\B` | Word boundary, non-boundary |
| `()` | Group (captures) |
| `(?:)` | Non-capturing group |
| `(?<name>)` | Named group |
| `\|` | Or (alternation) |
| `\` | Escape |
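A few of the metacharacters above can be tried out in a short Python sketch. The sample sentence and variable names are made up for illustration; only the standard `re` module is assumed.

```python
import re

text = "Order 66 shipped to cat_lover99 on 2025-04-25; call 555-0101."

# \d+  one or more digits in a row
digit_runs = re.findall(r"\d+", text)

# \b\w+_\w+\b  a whole word that contains an underscore
handles = re.findall(r"\b\w+_\w+\b", text)

# \d{4}-\d{2}-\d{2}  exactly 4, 2, and 2 digits joined by hyphens
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# colou?r  the "u" is optional
spellings = re.findall(r"colou?r", "color colour colr")

# (?:cat|dog)s?  non-capturing group with alternation, optional plural "s"
pets = re.findall(r"\b(?:cat|dog)s?\b", "cats, dog, cow")

print(digit_runs)  # ['66', '99', '2025', '04', '25', '555', '0101']
print(handles)     # ['cat_lover99']
print(dates)       # ['2025-04-25']
print(spellings)   # ['color', 'colour']
print(pets)        # ['cats', 'dog']
```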
3. Greedy vs lazy
*, +, and ? are greedy by default: they match as much as they can. Appending a ? (giving *?, +?, or ??) makes them lazy, so they match as little as possible:
"<b>hello</b>"
<.*> → <b>hello</b> (everything)
<.*?> → <b> (the shortest single chunk)
*? and +? show up often when grabbing the closest pair, like HTML tags.
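The same comparison can be reproduced in Python as a quick check. This is a minimal sketch using the standard `re` module; the negated-class variant at the end is an extra alternative that is often used for the same purpose, not something the example above depends on.

```python
import re

html = "<b>hello</b>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.search(r"<.*>", html).group()

# Lazy: .*? stops at the first ">" it can.
lazy = re.search(r"<.*?>", html).group()

# With findall, the lazy form yields each tag separately.
lazy_tags = re.findall(r"<.*?>", html)

# A negated class gives the same tags without relying on laziness.
class_tags = re.findall(r"<[^>]*>", html)

print(greedy)      # <b>hello</b>
print(lazy)        # <b>
print(lazy_tags)   # ['<b>', '</b>']
print(class_tags)  # ['<b>', '</b>']
```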
4. Groups and backreferences
Pattern: (\w+)\s+\1
Target: "ho ho"
Match: "ho ho" (\1 references the same text as group 1)
Groups can also be referenced in replacements:
"2025-04-25".replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");
// "25/04/2025"
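For comparison, here is a rough Python equivalent of the backreference and the replacement above. Note the syntax differences: Python's `re` writes replacement references as `\1` or `\g<name>` rather than `$1`, and spells named groups `(?P<name>...)`. The group names `y`, `m`, and `d` are chosen here only for illustration.

```python
import re

# Backreference inside the pattern: \1 must repeat what group 1 captured.
repeated = re.search(r"(\w+)\s+\1", "ho ho")

# Numbered groups in the replacement string.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2025-04-25")

# Named groups, referenced with \g<name> in the replacement.
named = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    "2025-04-25",
)

print(repeated.group(), repeated.group(1))  # ho ho ho
print(swapped)                              # 25/04/2025
print(named)                                # 25/04/2025
```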
5. Lookahead · lookbehind
Tools that test conditions without consuming characters.
| Notation | Meaning |
|---|---|
| `(?=X)` | Followed by X (positive lookahead) |
| `(?!X)` | Not followed by X (negative lookahead) |
| `(?<=X)` | Preceded by X (positive lookbehind) |
| `(?<!X)` | Not preceded by X (negative lookbehind) |
Pattern: \d+(?=원)
Target: "1500원"
Match: "1500" (only digits followed by "원", "원" itself is not part of the match)
Lookbehind support varies by engine. JS has supported lookbehind, including variable-length patterns, since ES2018. Python's built-in re only allows fixed-width lookbehind, while the third-party regex library allows variable-width.
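A small Python sketch of lookahead and lookbehind, including the fixed-width restriction of the built-in `re` module. The dollar-amount sentence is an invented example.

```python
import re

# Positive lookahead: digits only when "원" follows; "원" is not consumed.
price = re.search(r"\d+(?=원)", "1500원").group()

text = "cost $42 and 17 items"

# Positive lookbehind: digits directly preceded by "$".
dollars = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digit runs not preceded by "$" or by another digit.
plain = re.findall(r"(?<![$\d])\d+", text)

# The built-in re module rejects variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(price)        # 1500
print(dollars)      # ['42']
print(plain)        # ['17']
print(variable_ok)  # False
```

The extra `\d` in the negative lookbehind matters: without it, the engine could start a match at the "2" of "42", because that digit is not directly preceded by "$".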
6. Flags
| Flag | Meaning |
|---|---|
| `i` | Case-insensitive |
| `g` | Global match (every occurrence) |
| `m` | Multiline. `^` and `$` apply per line |
| `s` | dotall. `.` also matches newlines |
| `u` | Unicode. Code-point granularity |
| `x` | Extended (whitespace and comments allowed; Python, PCRE) |
In JS, String.prototype.replaceAll and matchAll require the g flag when they are given a regex; without it they throw a TypeError.
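In Python the same flags are passed as constants from the `re` module, and there is no `g` flag: `re.findall` and `re.finditer` return every match, while `re.search` returns the first. A minimal sketch, with an invented log string:

```python
import re

log = "error: disk\nERROR: net\nok"

# i -> re.IGNORECASE
any_case = re.findall(r"error", log, re.IGNORECASE)

# m -> re.MULTILINE: ^ matches at the start of every line, not just the string.
first_words = re.findall(r"^\w+", log, re.MULTILINE)
first_word_only = re.findall(r"^\w+", log)

# s -> re.DOTALL: "." also matches "\n".
spans_lines = re.search(r"disk.*net", log, re.DOTALL) is not None
stops_at_newline = re.search(r"disk.*net", log) is None

# x -> re.VERBOSE: whitespace and # comments inside the pattern are ignored.
date_re = re.compile(
    r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
    """,
    re.VERBOSE,
)
parts = date_re.match("2025-04-25").groups()

print(any_case)         # ['error', 'ERROR']
print(first_words)      # ['error', 'ERROR', 'ok']
print(first_word_only)  # ['error']
print(parts)            # ('2025', '04', '25')
```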
7. Other paths
The same job done with non-regex tools:
- CSV / JSON parsing — Structured formats are always safer with a dedicated parser than with regex.
- HTML / XML parsing — Slicing with regex breaks down. Use parsers like cheerio, BeautifulSoup, or DOMParser.
- Natural language search — Long regexes are worse than a search engine (Elasticsearch, Postgres full-text).
- PEG · parser combinators — Complex grammars are outside regex territory.
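To make the CSV point concrete, here is a small sketch comparing a naive regex split with Python's standard `csv` module on a line that contains a quoted comma. The sample line is invented.

```python
import csv
import io
import re

line = 'name,"Seoul, Korea",42'

# Splitting on every comma cuts the quoted field in half.
naive = re.split(r",", line)

# The csv module understands quoting and returns three fields.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['name', '"Seoul', ' Korea"', '42']
print(parsed)  # ['name', 'Seoul, Korea', '42']
```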
8. Common shapes
Basic email, URL, and phone-number patterns only catch things "roughly". A perfect regex is nearly impossible.
Simple email (rough):
^[\w.+-]+@[\w-]+\.[\w.-]+$
URL (rough):
\bhttps?://[^\s)]+
Korean mobile phone (rough):
01[016789]-?\d{3,4}-?\d{4}
YYYY-MM-DD:
\b\d{4}-\d{2}-\d{2}\b
Empty line:
^\s*$
Strip ANSI color codes:
\x1b\[[0-9;]*m
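The patterns above can be exercised in Python roughly as follows. The constant names, the sample inputs, and the `is_real_date` helper are all hypothetical; the helper sketches the "catch roughly, then verify with a library" approach by handing the calendar check to `datetime`.

```python
import re
from datetime import date

EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
URL = re.compile(r"\bhttps?://[^\s)]+")
KR_MOBILE = re.compile(r"01[016789]-?\d{3,4}-?\d{4}")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
ANSI_COLOR = re.compile(r"\x1b\[[0-9;]*m")

email_ok = EMAIL.match("kim+news@example.co.kr") is not None
email_bad = EMAIL.match("not an email") is not None
urls = URL.findall("see (https://example.com/a?b=1) now")
phone = KR_MOBILE.search("call 010-1234-5678 today").group()
plain = ANSI_COLOR.sub("", "\x1b[31mred\x1b[0m")


def is_real_date(s: str) -> bool:
    """Check the shape with the regex, then let datetime check the calendar."""
    if not ISO_DATE.fullmatch(s):
        return False
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False


print(email_ok, email_bad)         # True False
print(urls)                        # ['https://example.com/a?b=1']
print(phone)                       # 010-1234-5678
print(plain)                       # red
print(is_real_date("2025-04-25"))  # True
print(is_real_date("2025-13-99"))  # False (right shape, impossible date)
```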
Per language:
// JavaScript
const re = /^\d{4}-\d{2}-\d{2}$/;
re.test("2025-04-25"); // true
"a-1, b-2".match(/[a-z]-\d/g); // ["a-1", "b-2"]
"hello".replace(/l/g, "L"); // "heLLo"
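# Python
import re
re.match(r"^\d{4}-\d{2}-\d{2}$", "2025-04-25")  # returns a Match object
re.findall(r"[a-z]-\d", "a-1, b-2")  # ['a-1', 'b-2']
re.sub(r"l", "L", "hello")  # 'heLLo'
pat = re.compile(r"[a-z]+", re.IGNORECASE)  # precompiled; flags go in the second argument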
// Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern p = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
Matcher m = p.matcher("2025-04-25");
m.find(); // true
"hello".replaceAll("l", "L"); // "heLLo"
# grep · sed (mac · Linux)
grep -E "^[a-z]+@[a-z]+\.[a-z]+$" emails.txt
sed -E 's/[0-9]+/N/g' file.txt
# Windows PowerShell
Get-Content file.txt | Select-String -Pattern "ERROR"
"hello" -replace "l", "L"
9. Common pitfalls
A perfect email or URL regex is effectively impossible — the proper RFC 5321 / 5322 regex is hundreds of characters. "Catch it roughly and verify by sending or via a library" is the practical move.
HTML regex parsing — there's a famous Stack Overflow answer satirizing this trap. Use a parser.
Catastrophic backtracking: nested quantifiers like (a+)+$ take exponential time on certain inputs in backtracking engines, and many ReDoS security issues originate here. The defenses are rewriting the pattern to remove the ambiguity, atomic groups or possessive quantifiers where the engine supports them, or a linear-time engine like RE2.
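As a rough illustration, the nested pattern can often be flattened without changing what it accepts. The sketch below compares the two forms on deliberately short inputs (the nested form should not be run against long strings of "a" followed by a non-matching character). The `split_count` helper is hypothetical and only counts how many ways a run of n letters can be cut into groups, which is the amount of work a backtracking engine may have to explore before giving up.

```python
import re

nested = re.compile(r"^(a+)+$")  # many ways to split the same run of a's
flat = re.compile(r"^a+$")       # same accepted strings, one way to match

samples = ["a", "aaaa", "", "aab", "baa", "aaaa!"]
agree = all(bool(nested.match(s)) == bool(flat.match(s)) for s in samples)


def split_count(n: int) -> int:
    """Ways to cut a run of n 'a's into one or more non-empty groups."""
    return 2 ** (n - 1)


print(agree)            # True
print(split_count(10))  # 512
print(split_count(30))  # 536870912
```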
. not matching newlines: by default, . does not match a newline. To match across newlines with .*, use the s flag or [\s\S]*.
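Unicode: in JS and Java, \w is ASCII-only by default and excludes Korean. In JS the u flag does not widen \w; it enables Unicode property escapes such as \p{L}, which do match Hangul. Python 3 treats str patterns as Unicode by default, and Java needs (?U) or Pattern.UNICODE_CHARACTER_CLASS.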
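Python's behavior can be checked directly. In this minimal sketch, `re.ASCII` switches `\w` back to the ASCII-only definition, which mimics what the other engines do by default; the sample text is invented.

```python
import re

text = "한글 hello 123"

# Python 3 str patterns: \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# An explicit range for precomposed Hangul syllables works in either mode.
hangul = re.findall(r"[가-힣]+", text, re.ASCII)

print(unicode_words)  # ['한글', 'hello', '123']
print(ascii_words)    # ['hello', '123']
print(hangul)         # ['한글']
```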
Escaping: a backslash has to be escaped once for the language's string literal and once for the regex, so there are two layers. Python's raw string r"\d" and JS's regex literal /\d/ remove the string-literal layer.
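A short Python sketch of both layers. `re.escape` is the standard-library helper for turning arbitrary text into a pattern that matches it literally; the sample strings are invented.

```python
import re

# String-literal layer: a normal literal needs "\\d", a raw string does not.
same_pattern = "\\d+" == r"\d+"

# Regex layer: an unescaped "." matches any character, so "1.5" matches "105".
loose = re.search(r"1.5", "105") is not None
strict = re.search(r"1\.5", "105") is not None

# re.escape() escapes the regex-special characters in arbitrary text.
literal = re.escape("1.5+2")
exact = re.fullmatch(literal, "1.5+2") is not None

print(same_pattern)   # True
print(loose, strict)  # True False
print(literal)        # 1\.5\+2
print(exact)          # True
```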
POSIX vs PCRE differences: \d is not part of POSIX BRE or ERE. With traditional grep, use [0-9] or the POSIX class [[:digit:]], which works in both BRE (plain grep) and ERE (grep -E).
Closing thoughts
Regex packs heavy expressive power into short strings. Chasing a "perfect" pattern leads into the swamp of catastrophic backtracking, RFC conflicts, Unicode, and per-language differences. The operationally safe stance is catch roughly, verify with a library. For HTML, natural language, and deep structures, reach for a parser or a search engine instead of regex.
References include regex101.com, regexr.com, RegexLearn, MDN Regular Expressions, Python re docs, Java Pattern, Russ Cox — Regular Expression Matching Can Be Simple And Fast, RE2, and OWASP ReDoS.