Building public-data crawlers
Build an ethical crawler in six steps with Playwright, http_utils, and APScheduler.
- Difficulty: intermediate
- Lessons: 6
- Total time: 145 min
Public data sources such as NPS, DART, and HIRA are open to everyone, but automating access to them comes with rules — robots.txt, rate limits, terms of service. This course covers six steps to an ethical, sustainable crawler.
Who it's for
- Developers who need more control than portal APIs offer
- Anyone who has been blocked by a crawl target
- Teams who want incremental collection, schedules, and observability
What you can do afterwards
- Separate dynamic pages (Playwright) from static ones (BS4)
- Apply robots.txt + rate limit + backoff (see the sketch after this list)
- Schedule in KST with APScheduler
- Combine public APIs, ministry CSVs, and web scraping
- Incremental collection, dedup, checkpoints
- Healthchecks and failure alerts
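The robots.txt check referenced above needs nothing beyond the standard library. A minimal sketch, assuming a hypothetical user-agent string and a placeholder target host:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent string; identify your real crawler here.
USER_AGENT = "public-data-crawler/0.1"

def allowed(url: str, robots_url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch url."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

# Placeholder host for illustration:
if allowed("https://example.com/notices", "https://example.com/robots.txt"):
    print("fetch permitted by robots.txt")
```

RobotFileParser can also report crawl_delay() and request_rate() when a site declares them, which feeds directly into the rate-limit rule.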
Steps
- Crawler ethics and legal boundaries — robots.txt · terms · personal data
- Static vs dynamic — BS4 + Playwright — pick the right tool
- Rate limiting · retries · backoff — exponential + jitter (sketched below)
- APScheduler + KST — idempotency · replace_existing=True · double-trigger defence (sketched below)
- Incremental collection · deduplication — checkpoints · unique keys · change detection (sketched below)
- Observability · alerts — success rate · latency · Slack · PagerDuty
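To make the third step concrete, exponential backoff with jitter might look like the sketch below. It uses requests rather than the course's http_utils, and the retryable status codes and delay cap are assumptions:

```python
import random
import time

import requests  # stand-in HTTP client; the course's http_utils may differ

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            # Retry only on typical transient statuses (assumption).
            if resp.status_code not in (429, 500, 502, 503, 504):
                return resp
        except requests.RequestException:
            pass  # network error: fall through to the backoff sleep
        # Exponential base delay (1s, 2s, 4s, ...) capped at 60s,
        # with full jitter to avoid synchronized retry storms.
        delay = min(2 ** attempt, 60)
        time.sleep(random.uniform(0, delay))
    raise RuntimeError(f"{url}: still failing after {max_retries} attempts")
```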
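The fourth step's scheduling pattern, sketched with APScheduler 3's cron trigger pinned to KST. The job id, the 06:00 schedule, and collect_daily are placeholders; coalesce and misfire_grace_time are one plausible reading of the double-trigger defence the lesson names:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def collect_daily():
    """Placeholder for the real collection job."""
    print("collecting...")

# Pin the scheduler to KST so cron fields mean Korean wall-clock time.
scheduler = BlockingScheduler(timezone="Asia/Seoul")
scheduler.add_job(
    collect_daily,
    "cron",
    hour=6,                   # placeholder schedule: daily at 06:00 KST
    id="daily_collect",       # stable id: re-registering replaces, never duplicates
    replace_existing=True,    # idempotent registration across restarts
    coalesce=True,            # collapse a backlog of missed runs into one
    misfire_grace_time=3600,  # run up to an hour late, otherwise skip
)
scheduler.start()
```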
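And the fifth step's checkpointing can be as small as a SQLite table of keys and content hashes. A sketch under the assumption that each record exposes a stable unique key; the table and function names are invented:

```python
import hashlib
import sqlite3

def open_checkpoint(path: str = "checkpoint.db") -> sqlite3.Connection:
    """Open (or create) the dedup/checkpoint store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY, content_hash TEXT)"
    )
    return conn

def is_new_or_changed(conn: sqlite3.Connection, key: str, content: str) -> bool:
    """True if a record is unseen, or seen but changed since the last run."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM seen WHERE key = ?", (key,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # exact duplicate: skip it
    # Record (or update) the hash so the next run sees this version.
    conn.execute("INSERT OR REPLACE INTO seen VALUES (?, ?)", (key, digest))
    conn.commit()
    return True
```

Hashing the content as well as keying it gives change detection for free: unchanged records are skipped, so incremental runs only touch deltas.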
Prerequisites — complete python-data-pipeline.