Step 1
Crawler ethics and legal boundaries
20 min
The tech is easy, the boundaries less so. The biggest risk isn't bans or lawsuits but accidental outages on someone else's site.
1. robots.txt
User-agent: *
Disallow: /admin/
Crawl-delay: 10
- Only weakly enforceable in law, but the accepted convention
- Ignoring it raises ban and legal risk
- Declare your crawler UA (e.g. MyBot/1.0 (+https://mysite.com/bot))
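A minimal sketch of checking robots.txt before fetching, using the standard-library urllib.robotparser; the bot name and target site are placeholders.

# Check robots.txt before crawling (standard library only).
# The UA string and URLs below are placeholders.
import urllib.robotparser

USER_AGENT = "MyBot/1.0 (+https://mysite.com/bot)"

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                    # fetch and parse robots.txt

url = "https://example.com/admin/users"
if rp.can_fetch(USER_AGENT, url):            # honours User-agent / Disallow rules
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)

# crawl_delay() returns the Crawl-delay value for this UA, or None if unset
print("crawl delay:", rp.crawl_delay(USER_AGENT))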
2. Self rate-limit
await session.get(url)
await asyncio.sleep(1 + random.random())  # ~1-2 s jitter between requests
1–5 req/s is polite. 100 req/s borders on DoS.
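In context, a hedged sketch of the same self-imposed delay, assuming aiohttp; the URL list and the 1-2 s jitter are illustrative, not a fixed rule.

# Sketch: sequential fetches with ~1-2 s jitter between requests (aiohttp assumed).
import asyncio
import random
import aiohttp

USER_AGENT = "MyBot/1.0 (+https://mysite.com/bot)"    # declared UA with contact

async def polite_fetch(urls):
    headers = {"User-Agent": USER_AGENT}
    async with aiohttp.ClientSession(headers=headers) as session:
        for url in urls:
            async with session.get(url) as resp:
                body = await resp.text()
                print(url, resp.status, len(body))
            await asyncio.sleep(1 + random.random())   # stay well under a few req/s

asyncio.run(polite_fetch(["https://example.com/a", "https://example.com/b"]))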
3. Terms of Service
- Many sites forbid automation
- Public data is different — often there's an open API
- Portal APIs are the safest path
4. No personal data
Emails, phone numbers, names: PIPA / GDPR territory. Even if publicly posted, bulk collection or repurposing is legally fraught.
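One way to enforce this at collection time is to mask obvious identifiers before anything is stored; a minimal sketch, with regexes that are illustrative only, not a complete PII detector.

# Sketch: mask obvious emails and phone numbers before storing scraped text.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{2,4}[-.\s]\d{3,4}[-.\s]\d{4}\b")

def scrub(text: str) -> str:
    text = EMAIL.sub("[email removed]", text)
    return PHONE.sub("[phone removed]", text)

print(scrub("Contact: jane@example.com, 010-1234-5678"))
# -> "Contact: [email removed], [phone removed]"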
5. Copyright
- Copying full text is out
- Summaries + links are fine
- Images carry both copyright and portrait rights
- Databases carry their own protection (sui generis rights)
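Storing only metadata, a short excerpt, and the source link keeps you on the "summaries + links" side; a small sketch with illustrative field names.

# Sketch: keep only a short excerpt and the source URL, never the full article text.
from dataclasses import dataclass

@dataclass
class ArticleRef:
    title: str
    url: str        # always link back to the original
    excerpt: str    # short summary, not the full body

def summarize(title: str, url: str, body: str, limit: int = 200) -> ArticleRef:
    return ArticleRef(title=title, url=url, excerpt=body[:limit].strip() + "...")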
6. No evasion
CAPTCHA bypass, IP rotation, cookie tricks = intentional circumvention. CFAA / unauthorized access laws apply.
7. Safer defaults
- Use public data portals first
- Obey robots.txt
- 1–5 req/s
- Announce your UA with contact
- Skip personal data
- Summarize and link for copyrighted text
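These defaults can live in one config block so every run starts from the same safe settings; values below are illustrative.

# Sketch: one place for the safe defaults listed above (values are illustrative).
CRAWLER_DEFAULTS = {
    "user_agent": "MyBot/1.0 (+https://mysite.com/bot)",  # UA with contact URL
    "obey_robots_txt": True,
    "max_requests_per_second": 2,       # within the 1-5 req/s range
    "min_delay_seconds": 1.0,
    "collect_personal_data": False,
    "store_full_text": False,           # summaries + links only
}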
8. If things go wrong
An admin complains about an outage:
- Stop immediately
- Apologise, explain
- Share prevention steps
- Offer help with recovery
Most admins forgive honest mistakes. Dishonesty compounds.
9. Public API portals first
- data.go.kr, data.gov
- opendart.fss.or.kr, data.nps.or.kr, opendata.hira.or.kr
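A hedged sketch of hitting a portal API instead of scraping; the endpoint and parameter names are placeholders (real portals such as data.go.kr issue per-user API keys and document their own endpoints).

# Sketch: prefer a documented portal API over scraping HTML.
# The endpoint and parameter names below are placeholders, not a real API.
import requests

API_KEY = "issued-by-the-portal"    # portals hand out per-user keys
resp = requests.get(
    "https://api.example-portal.go.kr/v1/datasets",   # placeholder endpoint
    params={"serviceKey": API_KEY, "page": 1, "perPage": 100},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())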
10. Gotchas
- Ignoring robots.txt
- Concurrency too high
- Personal data inclusion
- Forged UAs
Closing
Respect the fact that your requests run on someone else's server. Keeping your speed down, staying polite, and attributing sources cover 90% of the issues.
Next
- 02-static-vs-dynamic