Crawler ethics and tooling
Collecting public web data touches not only technology but also ethics and law. Requesting too frequently burdens the target server, and ignoring the terms of service can spill into legal trouble.
1. robots.txt
A text-based protocol that tells web crawlers what is allowed and disallowed. Proposed by Martijn Koster in 1994, it served as a de facto standard for years and became an official IETF standard as RFC 9309 in September 2022.
```
User-agent: *
Disallow: /private/
Allow: /private/public.html
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
```
robots.txt is a promise rather than a legal mandate, but ignoring it can become grounds for blocking and legal disputes.
2. Crawling ethics
- Identify the User-Agent — a UA that identifies who collects what for what purpose, plus a contactable URL (see the sketch after this list).
- Rate limit — cap concurrent connections and per-second requests. If `Crawl-delay` is provided, follow it.
- Use caches — send `ETag` and `Last-Modified` to receive 304 responses.
- Check terms — verify the site's terms of service or API terms for automated collection policies.
- Personal data — even when public, data may fall under PIPA (Korea), GDPR (EU), or CCPA (US).
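A minimal sketch of the first point, assuming `httpx`; the bot name and the contact URL are placeholders:

```python
import httpx

# "MyBot" and the info URL are placeholders: name the bot, give it a
# version, and link a page explaining what is collected and how to reach you.
headers = {"User-Agent": "MyBot/1.0 (+https://example.com/bot-info)"}

r = httpx.get("https://example.com/page", headers=headers, timeout=10.0)
```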
3. Browser automation tools
| Tool | First appeared | Provider | Features |
|---|---|---|---|
| Selenium | 2004 | OSS, SeleniumHQ | The longest-standing standard. WebDriver spec (W3C). Multi-language. |
| Puppeteer | 2017 | Chrome team | Chrome DevTools Protocol (CDP). Node first. |
| Playwright | 2020 | Microsoft (former Puppeteer members joined) | Multi-language (Node · Python · Java · .NET). Chromium · Firefox · WebKit. Auto-wait, tracing. |
Playwright arrived relatively late but is praised for operational comforts like auto-wait, the selector engine, and the trace viewer. Puppeteer is optimized for the single Chrome target.
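A minimal sketch of Playwright's sync API in Python; the URL and selector are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # Auto-wait in action: wait_for() blocks until the element is ready,
    # so explicit sleeps are rarely needed.
    page.locator("h1").wait_for()
    print(page.title())
    browser.close()
```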
4. HTML parsers
| Library | First appeared | Notes |
|---|---|---|
| BeautifulSoup | 2004, Leonard Richardson | A Python staple. Forgiving with messy, real-world HTML. |
| lxml | 2005 | C-based (libxml2). Fast. XPath. |
| parsel | 2015 | Scrapy's selector packaged separately. CSS + XPath. |
| html5lib | 2008 | Faithful to the HTML5 spec. Slow but compatible. |
| Cheerio | 2012 | Node, jQuery-like API. |
BeautifulSoup lets us choose the underlying parser among html.parser (standard library), lxml, and html5lib. Speed goes to lxml; compatibility goes to html5lib.
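A minimal sketch of swapping the backing parser; the tag-soup input is contrived:

```python
from bs4 import BeautifulSoup

html = "<p>Unclosed paragraph<li>stray item"  # deliberately messy markup

soup_std = BeautifulSoup(html, "html.parser")  # stdlib, no extra install
soup_fast = BeautifulSoup(html, "lxml")        # fastest, requires lxml
soup_spec = BeautifulSoup(html, "html5lib")    # slowest, browser-grade repair
```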
5. Static scraping vs browser automation
- Static scraping (`httpx` + BeautifulSoup) — works on server-rendered HTML only. Fast and light.
- Browser automation (Playwright) — sees the result after JS executes. Slow but essential for SPAs.
For cost and speed, it is reasonable to try static scraping first and step up to a browser only when JS dependence is clear.
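A sketch of that static-first habit; `h2.title` is a hypothetical selector:

```python
import httpx
from bs4 import BeautifulSoup

r = httpx.get("https://example.com/articles")
soup = BeautifulSoup(r.text, "lxml")
titles = [h.get_text(strip=True) for h in soup.select("h2.title")]

# Empty result even though the data shows up in a real browser? The page
# likely renders via JS: that is the cue to step up to Playwright.
```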
6. Limits of anti-bot
Crawl-blocking techniques and bypasses keep evolving.
- UA rotation — limited effectiveness. Other signals (header combinations, TLS fingerprint, behavior patterns) are often more decisive.
- Headless detection evasion — `navigator.webdriver`, font/canvas fingerprints, WebGL renderer strings, and more are inspected. Helpers like `playwright-stealth` exist, but there is no permanent solution.
- IP rotation — datacenter IPs are easily blocked, and residential proxies sit in a costly, legally gray zone.
- Cloudflare · Akamai · PerimeterX — JS challenges, device fingerprints, ML-based. Bypass attempts come close to terms-of-service violations.
- CAPTCHA — automated solving is risky on both terms-of-service and legal grounds.
What is technically possible and what is ethically and legally permitted are not the same. Blocks are generally read as the site's expression of intent.
7. API first
Korea's public data portal (data.go.kr, opened in 2013) provides many government datasets as OpenAPIs. When the same data is available through one, an API is a better answer than scraping HTML: it is superior on stability, terms, and structure. The US data.gov and EU data.europa.eu play similar roles.
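A hedged sketch of the habit; the endpoint, parameter names, and key below are all hypothetical, since each portal documents its own:

```python
import httpx

# Hypothetical endpoint and parameters; consult the portal's API
# documentation for the real base URL, fields, and auth scheme.
r = httpx.get(
    "https://api.example.go.kr/v1/datasets",
    params={"serviceKey": "YOUR_KEY", "page": 1, "perPage": 100},
)
records = r.json()  # structured JSON instead of brittle HTML parsing
```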
8. Respecting robots.txt and using caches
```python
import urllib.robotparser

import httpx

cache = {}  # toy in-memory cache; swap in Redis or similar in practice

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

def fetch_if_allowed(url: str) -> str | None:
    # Skip anything robots.txt disallows for our user agent.
    if not rp.can_fetch('MyBot/1.0', url):
        return None
    headers = {}
    # Conditional request: send the cached ETag so the server can reply 304.
    if etag := cache.get(f'etag:{url}'):
        headers['If-None-Match'] = etag
    r = httpx.get(url, headers=headers)
    if r.status_code == 304:
        return cache.get(f'body:{url}')  # unchanged: reuse the cached body
    if etag := r.headers.get('ETag'):
        cache[f'etag:{url}'] = etag
        cache[f'body:{url}'] = r.text
    return r.text
```
9. Concurrency limits
```python
import asyncio

import httpx

# Cap in-flight requests; 5 is a conservative per-domain default.
sem = asyncio.Semaphore(5)

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    async with sem:
        r = await client.get(url)
        return r.text
```
A policy of keeping per-domain concurrency at 5–10 or below is common. Tune to the size and policy of the target site.
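For instance, a hypothetical `crawl` helper can fan out with `asyncio.gather` while the semaphore keeps at most five requests in flight:

```python
async def crawl(urls):
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))

pages = asyncio.run(crawl(['https://example.com/a', 'https://example.com/b']))
```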
10. Common pitfalls
- Skipping the terms — there may be clauses banning automated collection. The bigger risk comes from law and terms, not technology.
- Retry storms — unbounded retries on 5xx responses turn into a DDoS. Set backoff and a maximum retry count (see the sketch after this list).
- Session cookies and auth — scraping after authentication often falls under stricter terms.
- License of stored HTML — redistributing crawl results requires copyright and database-right review.
- Personal data — even when published, emails and contact details may have collection and retention restrictions.
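A minimal backoff sketch for the retry-storm point; the base delay and retry cap are assumptions to tune per site:

```python
import asyncio
import random

import httpx

async def get_with_backoff(client: httpx.AsyncClient, url: str,
                           max_retries: int = 4) -> httpx.Response:
    for attempt in range(max_retries):
        r = await client.get(url)
        if r.status_code < 500:
            return r
        # Exponential backoff with jitter: ~1s, 2s, 4s, 8s plus randomness.
        await asyncio.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")
```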
Closing thoughts
Crawling is more about the boundaries of law, terms, and ethics than about the technology itself. Wherever a public-data OpenAPI is available, that is the safer place to start. Respecting the intent of a block tends to last longer than the urge to bypass anti-bot measures.
Next
- openapi-spec
- rest-api-intro
See RFC 9309 — Robots Exclusion Protocol · Playwright · Puppeteer · Selenium · Beautiful Soup · Scrapy · data.go.kr · W3C WebDriver.