Step 1
Crawler ethics and legal boundaries
20 min
The tech is easy, the boundaries less so. The biggest risk isn't bans or lawsuits but accidental outages on someone else's site.
1. robots.txt
User-agent: *
Disallow: /admin/
Crawl-delay: 10
- Only weakly enforceable in law, but the accepted convention
- Ignoring it raises ban and legal risk
- Declare your crawler UA (e.g. MyBot/1.0 (+https://mysite.com/bot))
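A minimal sketch of checking robots.txt before fetching, using the standard-library urllib.robotparser; the bot name and target site are placeholders.

# Check robots.txt before crawling (standard library only).
# The UA string and URLs below are placeholders.
import urllib.robotparser

USER_AGENT = "MyBot/1.0 (+https://mysite.com/bot)"

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                    # fetch and parse robots.txt

url = "https://example.com/admin/users"
if rp.can_fetch(USER_AGENT, url):            # honours User-agent / Disallow rules
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)

# crawl_delay() returns the Crawl-delay value for this UA, or None if unset
print("crawl delay:", rp.crawl_delay(USER_AGENT))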
2. Self rate-limit
await session.get(url)
await asyncio.sleep(1 + random.random())  # ~1-2 s jitter between requests
1–5 req/s is polite. 100 req/s borders on DoS.
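In context, a hedged sketch of the same self-imposed delay, assuming aiohttp; the URL list and the 1-2 s jitter are illustrative, not a fixed rule.

# Sketch: sequential fetches with ~1-2 s jitter between requests (aiohttp assumed).
import asyncio
import random
import aiohttp

USER_AGENT = "MyBot/1.0 (+https://mysite.com/bot)"    # declared UA with contact

async def polite_fetch(urls):
    headers = {"User-Agent": USER_AGENT}
    async with aiohttp.ClientSession(headers=headers) as session:
        for url in urls:
            async with session.get(url) as resp:
                body = await resp.text()
                print(url, resp.status, len(body))
            await asyncio.sleep(1 + random.random())   # stay well under a few req/s

asyncio.run(polite_fetch(["https://example.com/a", "https://example.com/b"]))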
3. Terms of Service
- Many sites forbid automation
- Public data is different — often there's an open API
- Portal APIs are the safest path
4. No personal data
Emails, phone numbers, names: PIPA / GDPR territory. Even if publicly posted, bulk collection or repurposing is legally fraught.
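One way to enforce this at collection time is to mask obvious identifiers before anything is stored; a minimal sketch, with regexes that are illustrative only, not a complete PII detector.

# Sketch: mask obvious emails and phone numbers before storing scraped text.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{2,4}[-.\s]\d{3,4}[-.\s]\d{4}\b")

def scrub(text: str) -> str:
    text = EMAIL.sub("[email removed]", text)
    return PHONE.sub("[phone removed]", text)

print(scrub("Contact: jane@example.com, 010-1234-5678"))
# -> "Contact: [email removed], [phone removed]"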
5. Copyright
- Copying full text is out
- Summaries + links are fine
- Images carry both copyright and portrait rights
- Databases carry their own protection (sui generis rights)
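Storing only metadata, a short excerpt, and the source link keeps you on the "summaries + links" side; a small sketch with illustrative field names.

# Sketch: keep only a short excerpt and the source URL, never the full article text.
from dataclasses import dataclass

@dataclass
class ArticleRef:
    title: str
    url: str        # always link back to the original
    excerpt: str    # short summary, not the full body

def summarize(title: str, url: str, body: str, limit: int = 200) -> ArticleRef:
    return ArticleRef(title=title, url=url, excerpt=body[:limit].strip() + "...")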
6. No evasion
CAPTCHA bypass, IP rotation, cookie tricks = intentional circumvention. CFAA / unauthorized access laws apply.
7. Safer defaults
- Use public data portals first
- Obey robots.txt
- 1–5 req/s
- Announce your UA with contact
- Skip personal data
- Summarize and link for copyrighted text
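These defaults can live in one config block so every run starts from the same safe settings; values below are illustrative.

# Sketch: one place for the safe defaults listed above (values are illustrative).
CRAWLER_DEFAULTS = {
    "user_agent": "MyBot/1.0 (+https://mysite.com/bot)",  # UA with contact URL
    "obey_robots_txt": True,
    "max_requests_per_second": 2,       # within the 1-5 req/s range
    "min_delay_seconds": 1.0,
    "collect_personal_data": False,
    "store_full_text": False,           # summaries + links only
}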
8. If things go wrong
An admin complains about an outage:
- Stop immediately
- Apologise, explain
- Share prevention steps
- Offer help with recovery
Most admins forgive honest mistakes. Dishonesty compounds.
9. Public API portals first
- data.go.kr, data.gov
- opendart.fss.or.kr, data.nps.or.kr, opendata.hira.or.kr
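A hedged sketch of hitting a portal API instead of scraping; the endpoint and parameter names are placeholders (real portals such as data.go.kr issue per-user API keys and document their own endpoints).

# Sketch: prefer a documented portal API over scraping HTML.
# The endpoint and parameter names below are placeholders, not a real API.
import requests

API_KEY = "issued-by-the-portal"    # portals hand out per-user keys
resp = requests.get(
    "https://api.example-portal.go.kr/v1/datasets",   # placeholder endpoint
    params={"serviceKey": API_KEY, "page": 1, "perPage": 100},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())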
10. Gotchas
- Ignoring robots.txt
- Concurrency too high
- Personal data inclusion
- Forged UAs
Closing
Respect the fact that your requests run on someone else's server. Keeping your speed down, staying polite, and attributing sources cover 90% of the issues.
Next
- 02-static-vs-dynamic