Crawler ethics and tooling
Collecting public web data touches not only technology but also ethics and law. Requesting too frequently burdens the target server, and ignoring the terms of service can spill into legal trouble.
1. robots.txt
A text-based protocol that tells web crawlers what is allowed and disallowed. Proposed by Martijn Koster in 1994, it served as a de facto standard for years and became an official IETF standard as RFC 9309 in September 2022.
```
User-agent: *
Disallow: /private/
Allow: /private/public.html
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
```
robots.txt is a promise rather than a legal mandate, but ignoring it can become grounds for blocking and legal disputes.
2. Crawling ethics
- Identify the User-Agent — a UA that identifies who collects what for what purpose, plus a contactable URL (see the sketch after this list).
- Rate limit — cap concurrent connections and per-second requests. If `Crawl-delay` is provided, follow it.
- Use caches — send `ETag` and `Last-Modified` to receive 304 responses.
- Check terms — verify the site's terms of service or API terms for automated collection policies.
- Personal data — even when public, data may fall under PIPA (Korea), GDPR (EU), or CCPA (US).
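A minimal sketch of the first point, assuming `httpx`; the bot name and the contact URL are placeholders:

```python
import httpx

# "MyBot" and the info URL are placeholders: name the bot, give it a
# version, and link a page explaining what is collected and how to reach you.
headers = {"User-Agent": "MyBot/1.0 (+https://example.com/bot-info)"}

r = httpx.get("https://example.com/page", headers=headers, timeout=10.0)
```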
3. Browser automation tools
| Tool | First appeared | Provider | Features |
|---|---|---|---|
| Selenium | 2004 | OSS, SeleniumHQ | The longest-standing standard. WebDriver spec (W3C). Multi-language. |
| Puppeteer | 2017 | Chrome team | Chrome DevTools Protocol (CDP). Node first. |
| Playwright | 2020 | Microsoft (former Puppeteer members joined) | Multi-language (Node · Python · Java · .NET). Chromium · Firefox · WebKit. Auto-wait, tracing. |
Playwright arrived relatively late but is praised for operational comforts like auto-wait, the selector engine, and the trace viewer. Puppeteer is optimized for the single Chrome target.
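A minimal sketch of Playwright's sync API in Python; the URL and selector are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # Auto-wait in action: wait_for() blocks until the element is ready,
    # so explicit sleeps are rarely needed.
    page.locator("h1").wait_for()
    print(page.title())
    browser.close()
```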
4. HTML parsers
| Library | First appeared | Notes |
|---|---|---|
| BeautifulSoup | 2004, Leonard Richardson | A Python staple. Forgiving with messy, real-world HTML. |
| lxml | 2005 | C-based (libxml2). Fast. XPath. |
| parsel | 2015 | Scrapy's selector packaged separately. CSS + XPath. |
| html5lib | 2008 | Faithful to the HTML5 spec. Slow but compatible. |
| Cheerio | 2012 | Node, jQuery-like API. |
BeautifulSoup lets us choose the underlying parser among html.parser (standard library), lxml, and html5lib. Speed goes to lxml; compatibility goes to html5lib.
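A minimal sketch of swapping the backing parser; the tag-soup input is contrived:

```python
from bs4 import BeautifulSoup

html = "<p>Unclosed paragraph<li>stray item"  # deliberately messy markup

soup_std = BeautifulSoup(html, "html.parser")  # stdlib, no extra install
soup_fast = BeautifulSoup(html, "lxml")        # fastest, requires lxml
soup_spec = BeautifulSoup(html, "html5lib")    # slowest, browser-grade repair
```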
5. Static scraping vs browser automation
- Static scraping (`httpx` + BeautifulSoup) — works on server-rendered HTML only. Fast and light.
- Browser automation (Playwright) — sees the result after JS executes. Slow but essential for SPAs.
For cost and speed, it is reasonable to try static scraping first and step up to a browser only when JS dependence is clear.
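A sketch of that static-first habit; `h2.title` is a hypothetical selector:

```python
import httpx
from bs4 import BeautifulSoup

r = httpx.get("https://example.com/articles")
soup = BeautifulSoup(r.text, "lxml")
titles = [h.get_text(strip=True) for h in soup.select("h2.title")]

# Empty result even though the data shows up in a real browser? The page
# likely renders via JS: that is the cue to step up to Playwright.
```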
6. Limits of anti-bot
Crawl-blocking techniques and bypasses keep evolving.
- UA rotation — limited effectiveness. Other signals (header combinations, TLS fingerprint, behavior patterns) are often more decisive.
- Headless detection evasion — `navigator.webdriver`, font/canvas fingerprints, WebGL renderer strings, and more are inspected. Helpers like `playwright-stealth` exist, but there is no permanent solution.
- IP rotation — datacenter IPs are easily blocked, and residential proxies sit in a costly, legally gray zone.
- Cloudflare · Akamai · PerimeterX — JS challenges, device fingerprints, ML-based. Bypass attempts come close to terms-of-service violations.
- CAPTCHA — automated solving is risky on both terms-of-service and legal grounds.
What is technically possible and what is ethically and legally permitted are not the same. Blocks are generally read as the site's expression of intent.
7. API first
Korea's public data portal (data.go.kr, opened in 2013) provides many government datasets as OpenAPIs. When the same data is available through one, an API is a better answer than scraping HTML: it is superior on stability, terms, and structure. The US data.gov and EU data.europa.eu play similar roles.
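A hedged sketch of the habit; the endpoint, parameter names, and key below are all hypothetical, since each portal documents its own:

```python
import httpx

# Hypothetical endpoint and parameters; consult the portal's API
# documentation for the real base URL, fields, and auth scheme.
r = httpx.get(
    "https://api.example.go.kr/v1/datasets",
    params={"serviceKey": "YOUR_KEY", "page": 1, "perPage": 100},
)
records = r.json()  # structured JSON instead of brittle HTML parsing
```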
8. Respecting robots.txt and using caches
```python
import urllib.robotparser

import httpx

cache = {}  # toy in-memory cache; swap in Redis or similar in practice

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

def fetch_if_allowed(url: str) -> str | None:
    # Skip anything robots.txt disallows for our user agent.
    if not rp.can_fetch('MyBot/1.0', url):
        return None
    headers = {}
    # Conditional request: send the cached ETag so the server can reply 304.
    if etag := cache.get(f'etag:{url}'):
        headers['If-None-Match'] = etag
    r = httpx.get(url, headers=headers)
    if r.status_code == 304:
        return cache.get(f'body:{url}')  # unchanged: reuse the cached body
    if etag := r.headers.get('ETag'):
        cache[f'etag:{url}'] = etag
        cache[f'body:{url}'] = r.text
    return r.text
```
9. Concurrency limits
```python
import asyncio

import httpx

# Cap in-flight requests; 5 is a conservative per-domain default.
sem = asyncio.Semaphore(5)

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    async with sem:
        r = await client.get(url)
        return r.text
```
A policy of keeping per-domain concurrency at 5–10 or below is common. Tune to the size and policy of the target site.
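For instance, a hypothetical `crawl` helper can fan out with `asyncio.gather` while the semaphore keeps at most five requests in flight:

```python
async def crawl(urls):
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))

pages = asyncio.run(crawl(['https://example.com/a', 'https://example.com/b']))
```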
10. Common pitfalls
- Skipping the terms — there may be clauses banning automated collection. The bigger risk comes from law and terms, not technology.
- Retry storms — unbounded retries on 5xx responses turn into a DDoS. Set backoff and a maximum retry count (see the sketch after this list).
- Session cookies and auth — scraping after authentication often falls under stricter terms.
- License of stored HTML — redistributing crawl results requires copyright and database-right review.
- Personal data — even when published, emails and contact details may have collection and retention restrictions.
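A minimal backoff sketch for the retry-storm point; the base delay and retry cap are assumptions to tune per site:

```python
import asyncio
import random

import httpx

async def get_with_backoff(client: httpx.AsyncClient, url: str,
                           max_retries: int = 4) -> httpx.Response:
    for attempt in range(max_retries):
        r = await client.get(url)
        if r.status_code < 500:
            return r
        # Exponential backoff with jitter: ~1s, 2s, 4s, 8s plus randomness.
        await asyncio.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")
```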
Closing thoughts
Crawling is more about the boundaries of law, terms, and ethics than about the technology itself. Wherever a public-data OpenAPI is available, that is the safer place to start. Respecting the intent of a block tends to last longer than the urge to bypass anti-bot measures.
Next
- openapi-spec
- rest-api-intro
See RFC 9309 — Robots Exclusion Protocol · Playwright · Puppeteer · Selenium · Beautiful Soup · Scrapy · data.go.kr · W3C WebDriver.