Regular expressions — finding strings by pattern

Whenever someone asks "pull just the emails out of this text", the first tool that comes to mind is the regular expression (regex). Short strings carry big expressive power, and that same density makes regex look alien when you first learn it. This article covers regex's origins, the core metacharacters, common patterns, language differences, and the trap of "the perfect regex".

1. About regular expressions

The roots reach back to mathematician Stephen Kleene's 1956 notation for regular sets. In 1968, Ken Thompson at Bell Labs added regex search to the QED text editor, and his paper that year set the standard for the regex → NFA conversion algorithm. grep followed in 1973, sed and awk later in the 1970s, and Perl in 1987, which made regex ubiquitous.

Today's regex syntax splits into two main lineages:

Lineage	Origin	Where
POSIX BRE / ERE	POSIX.2 standard (IEEE 1003.2, 1992)	Traditional grep, sed
PCRE (Perl Compatible)	1987 Perl, 1997 PCRE library	Almost every modern tool

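JS, Python, Java, .NET, and Ruby all resemble PCRE but differ in subtle ways. Go and Rust use RE2-style engines that look similar but drop backreferences and lookaround in exchange for linear-time matching. The same pattern often works in one language and silently fails in another.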

2. Metacharacters

Character	Meaning
`.`	Any single character except a newline
`^` · `$`	Start and end of a line (or string)
`*`	Previous token, 0 or more times
`+`	Previous token, 1 or more times
`?`	Previous token, 0 or 1 times
`{n}` · `{n,m}`	Exactly n times, or n–m times
`[abc]`	Character class (one of these)
`[^abc]`	Negated class (none of these)
`[a-z]`	Range
`\d` · `\D`	Digit, non-digit
`\w` · `\W`	Word char (`[A-Za-z0-9_]`), or not
`\s` · `\S`	Whitespace, non-whitespace
`\b` · `\B`	Word boundary, non-boundary
`()`	Group (captures)
`(?:)`	Non-capturing group
`(?<name>)`	Named group
`|`	Or (alternation)
`\`	Escape
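As a minimal sketch of how a few of these metacharacters combine, the following uses Python's standard re module. The sample strings and variable names are invented for illustration.

```python
import re

text = "cat bat rat 42 cats_9"

# [cbr]at: one of c, b, r followed by "at"; \b on both sides rejects "cats_9"
whole_words = re.findall(r"\b[cbr]at\b", text)

# \d+: one or more digits
numbers = re.findall(r"\d+", text)

# colou?r: the "u" is optional
spellings = re.findall(r"colou?r", "color colour colr")

# (?:ab){2}: non-capturing group repeated exactly twice
repeated = re.fullmatch(r"(?:ab){2}", "abab") is not None

print(whole_words)  # ['cat', 'bat', 'rat']
print(numbers)      # ['42', '9']
print(spellings)    # ['color', 'colour']
print(repeated)     # True
```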

3. Greedy vs lazy

*, +, ?, and {n,m} are greedy by default: they match as much as they can. Append ? (giving *?, +?, ??, or {n,m}?) to make them lazy, matching as little as possible:

"<b>hello</b>"

<.*>     → <b>hello</b>   (everything)
<.*?>    → <b>            (the shortest single chunk)

*? and +? show up often when grabbing the closest pair, like HTML tags.
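The same comparison can be reproduced in Python's re module. This is only a sketch of the greedy and lazy behaviour on the sample string above, not a recommendation to parse HTML this way.

```python
import re

html = "<b>hello</b>"

greedy = re.search(r"<.*>", html).group()   # .* runs to the last ">"
lazy = re.search(r"<.*?>", html).group()    # .*? stops at the first ">"
all_tags = re.findall(r"<.*?>", html)       # each tag matched separately

print(greedy)    # <b>hello</b>
print(lazy)      # <b>
print(all_tags)  # ['<b>', '</b>']
```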

4. Groups and backreferences

Pattern: (\w+)\s+\1
Target:  "ho ho"
Match:   "ho ho"      (\1 references the same text as group 1)

Groups can also be referenced in replacements:

"2025-04-25".replace(/(\d{4})-(\d{2})-(\d{2})/, "$3/$2/$1");
// "25/04/2025"
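For comparison, here is a sketch of the same ideas in Python's re module. Two details differ from the JS version: replacement strings reference groups with backslashes (\1) rather than $1, and named groups are written (?P<name>...). The group names year, month, and day are chosen for illustration.

```python
import re

# \1 must repeat exactly the text that group 1 captured
repeated = re.search(r"(\w+)\s+\1", "ho ho")

# Numbered groups in the replacement: \3, \2, \1
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2025-04-25")

# Named groups are read back by name
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2025-04-25")

print(repeated.group())  # ho ho
print(swapped)           # 25/04/2025
print(m.group("year"))   # 2025
print(m.groupdict())     # {'year': '2025', 'month': '04', 'day': '25'}
```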

5. Lookahead · lookbehind

Lookarounds test a condition at the current position without consuming characters.

Notation	Meaning
`(?=X)`	Followed by X (positive lookahead)
`(?!X)`	Not followed by X (negative lookahead)
`(?<=X)`	Preceded by X (positive lookbehind)
`(?<!X)`	Not preceded by X (negative lookbehind)

Pattern: \d+(?=원)
Target:  "1500원"
Match:   "1500"   (only digits followed by "원", "원" itself is not part of the match)

Lookbehind support varies by language. JS has supported lookbehind, including variable-length patterns, since ES2018; Python's re only allows fixed-width lookbehind; the third-party regex library for Python allows variable width.
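A short sketch of the four lookaround forms in Python's re module follows. The sample text is invented. Note that the negative lookahead also excludes digits: a plain \d+(?!원) would backtrack and match "150" out of "1500원".

```python
import re

text = "1500원, 20개, 300원"

# Positive lookahead: digits followed by "원"; "원" is not consumed
won = re.findall(r"\d+(?=원)", text)

# Negative lookahead: digit runs not followed by "원" (or by another digit)
not_won = re.findall(r"\d+(?![\d원])", text)

# Positive lookbehind: digits preceded by "$"
after_dollar = re.findall(r"(?<=\$)\d+", "cost $42, qty 17")

# Python's re rejects variable-width lookbehind at compile time
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(won)                # ['1500', '300']
print(not_won)            # ['20']
print(after_dollar)       # ['42']
print(variable_width_ok)  # False
```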

6. Flags

Flag	Meaning
`i`	Case-insensitive
`g`	Global match (every occurrence)
`m`	Multiline. `^` and `$` apply per line
`s`	dotall. `.` also matches newlines
`u`	Unicode. Code-point granularity
`x`	Extended (whitespace and comments allowed; Python, PCRE)

In JS, the g flag is required when a regex is passed to String.prototype.replaceAll or matchAll; without it both throw a TypeError.
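In Python the same flags are passed as constants rather than suffix letters, and there is no g flag because re.findall and re.finditer already return every match. The sketch below shows i, m, s, and x on an invented log string.

```python
import re

log = "error: disk\nERROR: net\nok"

# i + m: case-insensitive, and ^ matches at the start of every line
errors = re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)

# Without m, ^ only matches at the very start of the string
first_only = re.findall(r"^error", log, re.IGNORECASE)

# s: "." also matches "\n"
without_s = re.search(r"disk.*net", log)
with_s = re.search(r"disk.*net", log, re.DOTALL)

# x: whitespace and comments inside the pattern are ignored
date = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
""", re.VERBOSE)

print(errors)                             # ['error', 'ERROR']
print(first_only)                         # ['error']
print(without_s)                          # None
print(repr(with_s.group()))               # 'disk\nERROR: net'
print(date.match("2025-04-25").groups())  # ('2025', '04', '25')
```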

7. Other paths

The same job done with non-regex tools:

CSV / JSON parsing — Structured formats are always safer with a dedicated parser than with regex.
HTML / XML parsing — Slicing with regex breaks down. Use parsers like cheerio, BeautifulSoup, or DOMParser.
Natural language search — Long regexes are worse than a search engine (Elasticsearch, Postgres full-text).
PEG · parser combinators — Complex grammars are outside regex territory.
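To illustrate the CSV point, the sketch below compares a naive regex split with Python's standard csv module on an invented row containing a quoted comma and an escaped quote.

```python
import csv
import io
import re

data = 'id,name,note\n1,"Kim, Minsu","said ""hi"""\n'

# A naive split on commas cuts the quoted field in two
naive = re.split(r",", data.splitlines()[1])

# The csv module understands quoting and doubled quotes
rows = list(csv.reader(io.StringIO(data)))

print(naive)    # ['1', '"Kim', ' Minsu"', '"said ""hi"""']
print(rows[1])  # ['1', 'Kim, Minsu', 'said "hi"']
```

A regex could be extended to handle quotes, but each extra rule (embedded newlines, alternative delimiters) adds another special case that the parser already covers.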

8. Common shapes

Basic email, URL, and phone-number patterns only catch things "roughly". A perfect regex is nearly impossible.

Simple email (rough):
^[\w.+-]+@[\w-]+\.[\w.-]+$

URL (rough):
\bhttps?://[^\s)]+

Korean mobile phone (rough):
01[016789]-?\d{3,4}-?\d{4}

YYYY-MM-DD:
\b\d{4}-\d{2}-\d{2}\b

Empty line:
^\s*$

Strip ANSI color codes:
\x1b\[[0-9;]*m
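As a rough check of the patterns above, the sketch below runs several of them through Python's re module. The sample inputs and constant names are invented for illustration, and the false positive shows why these patterns only catch things "roughly".

```python
import re

EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
URL = re.compile(r"\bhttps?://[^\s)]+")
PHONE = re.compile(r"01[016789]-?\d{3,4}-?\d{4}")
ANSI = re.compile(r"\x1b\[[0-9;]*m")

ok_email = bool(EMAIL.match("user.name+tag@example.co.kr"))
no_at = bool(EMAIL.match("not-an-email"))
false_positive = bool(EMAIL.match("a@b..c"))  # invalid domain, accepted anyway

urls = URL.findall("see (https://example.com/path?q=1) or http://a.b/c")
phones = PHONE.findall("010-1234-5678 / 01112345678")
plain = ANSI.sub("", "\x1b[31mred\x1b[0m text")

print(ok_email, no_at, false_positive)  # True False True
print(urls)    # ['https://example.com/path?q=1', 'http://a.b/c']
print(phones)  # ['010-1234-5678', '01112345678']
print(plain)   # red text
```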

Per language:

// JavaScript
const re = /^\d{4}-\d{2}-\d{2}$/;
re.test("2025-04-25");                  // true
"a-1, b-2".match(/[a-z]-\d/g);          // ["a-1", "b-2"]
"hello".replace(/l/g, "L");             // "heLLo"

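# Python
import re
re.match(r"^\d{4}-\d{2}-\d{2}$", "2025-04-25")   # Match object (None if no match)
re.findall(r"[a-z]-\d", "a-1, b-2")              # ['a-1', 'b-2']
re.sub(r"l", "L", "hello")                       # 'heLLo'
pat = re.compile(r"[a-z]+", re.IGNORECASE)       # precompiled; IGNORECASE also matches A-Z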

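// Java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");  // backslashes doubled for the string literal
Matcher m = p.matcher("2025-04-25");
m.find();                                              // true; m.group() is "2025-04-25"
"hello".replaceAll("l", "L");                          // "heLLo" (the first argument is a regex)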

# grep · sed (macOS · Linux)
grep -E '^[a-z]+@[a-z]+\.[a-z]+$' emails.txt
sed -E 's/[0-9]+/N/g' file.txt

# Windows PowerShell
Get-Content file.txt | Select-String -Pattern "ERROR"
"hello" -replace "l", "L"

9. Common pitfalls

A perfect email or URL regex is effectively impossible — the proper RFC 5321 / 5322 regex is hundreds of characters. "Catch it roughly and verify by sending or via a library" is the practical move.

HTML regex parsing — there's a famous Stack Overflow answer satirizing this trap. Use a parser.

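Catastrophic backtracking — nested quantifiers like (a+)+$ blow up to exponential time on certain inputs, such as a long run of "a" followed by a character that forces the match to fail. ReDoS security issues often originate here. The defenses are rewriting the pattern to remove the nesting, atomic groups or possessive quantifiers where the engine supports them, or a linear-time engine like RE2.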
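The growth can be observed with a small timing sketch in Python, whose re module is a backtracking engine. The helper seconds_to_fail is hypothetical and written for this illustration; n is kept small on purpose, because each extra "a" roughly doubles the work for the nested pattern.

```python
import re
import time

def seconds_to_fail(pattern: str, n: int) -> float:
    """Time a failing match against 'a' * n + 'b' (hypothetical helper)."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = re.match(pattern, text)
    elapsed = time.perf_counter() - start
    if result is not None:
        raise ValueError("expected the match to fail")
    return elapsed

nested = r"(a+)+$"  # ambiguous: many ways to split a run of a's between iterations
flat = r"a+$"       # accepts the same strings, with only one way to match

for n in (10, 14, 18):
    print(n, f"{seconds_to_fail(nested, n):.4f}s", f"{seconds_to_fail(flat, n):.6f}s")
```

Exact timings depend on the machine, so none are shown here; the point is the trend of the nested column against the flat one.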

. not matching newlines — the default. To match across newlines with .*, use the s flag or [\s\S]*.

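Unicode — whether \w covers Korean depends on the language. In JS and Java, \w is ASCII-only by default and excludes Korean. JS's u flag does not widen \w; it enables property escapes such as \p{L}. Python 3 str patterns are Unicode-aware by default, and Java needs (?U) or Pattern.UNICODE_CHARACTER_CLASS.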
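A brief sketch of the Python side of this: with str patterns \w already matches Hangul, and the re.ASCII flag switches it back to [A-Za-z0-9_]. The sample text is invented.

```python
import re

text = "hello 안녕 world"

unicode_words = re.findall(r"\w+", text)          # str patterns are Unicode-aware by default
ascii_words = re.findall(r"\w+", text, re.ASCII)  # re.ASCII restricts \w to [A-Za-z0-9_]
hangul_only = re.findall(r"[가-힣]+", text)        # explicit Hangul syllable range

print(unicode_words)  # ['hello', '안녕', 'world']
print(ascii_words)    # ['hello', 'world']
print(hangul_only)    # ['안녕']
```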

Escaping — a pattern written inside an ordinary string literal is escaped twice: once for the language's string syntax and once for the regex. Python's raw string r"\d" and JS's regex literal /\d/ remove the first layer.
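The two layers can be seen side by side in Python. The sketch below also uses re.escape, the standard-library helper for the related case where literal text from elsewhere has to be embedded in a pattern. The sample strings are invented.

```python
import re

# Without a raw string, each backslash is doubled for the string literal
plain = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same = plain == raw

version = re.search(raw, "v3.14 released").group()

# re.escape neutralises metacharacters in text that should match literally
literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"
found_escaped = re.search(re.escape(literal), haystack) is not None
found_unescaped = re.search(literal, haystack) is not None

print(same)             # True
print(version)          # 3.14
print(found_escaped)    # True
print(found_unescaped)  # False
```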

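POSIX vs PCRE differences — \d is not part of POSIX BRE or ERE. With traditional grep, use [0-9] or the bracket class [[:digit:]], both of which work in BRE and ERE. GNU grep offers -P for PCRE syntax where it is available.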

Closing thoughts

Regex packs heavy expressive power into short strings. Chasing a "perfect" pattern leads into the swamp of catastrophic backtracking, RFC conflicts, Unicode, and per-language differences. The operationally safe stance is "catch roughly, verify with a library". For HTML, natural language, and deep structures, reach for a parser or a search engine instead of regex.

References include regex101.com, regexr.com, RegexLearn, MDN Regular Expressions, Python re docs, Java Pattern, Russ Cox — Regular Expression Matching Can Be Simple And Fast, RE2, and OWASP ReDoS.
