Step 5 — External APIs · crawler ethics
25 min
Pulling data from someone else's site is a matter of both etiquette and technique. Skip either and you get blocked, or worse, sued.
Five rules
- Read robots.txt first — /robots.txt lists which paths crawlers may and may not fetch
- Throttle requests — ≤1 per second
- Set a real User-Agent — identify your bot
- Cache — don't fetch twice
- Prefer the official API — RSS / Atom / an open API first (see the sketch after this list)
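As a quick illustration of the last rule, here is a minimal sketch of consuming an RSS/Atom feed instead of scraping HTML. It assumes the third-party feedparser package is installed and that the site exposes a feed at the hypothetical /feed.xml path; check the site's docs or its <link rel="alternate"> tags for the real one.

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical feed URL, not from this lesson
feed = feedparser.parse("https://example.com/feed.xml")
for entry in feed.entries[:5]:
    print(entry.title, entry.link)
```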
Rate-limited HTTP client
```python
import time, httpx

class RateLimitedClient:
    def __init__(self, base_url: str, requests_per_second: float = 1.0):
        self.client = httpx.Client(
            base_url=base_url,
            headers={"User-Agent": "codingstairs-crawler/1.0 (https://codingstairs.duckdns.org)"},
            timeout=10.0,
        )
        self.min_interval = 1.0 / requests_per_second
        self.last_request_at = 0.0

    def get(self, path, **kw):
        elapsed = time.time() - self.last_request_at
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request_at = time.time()
        return self.client.get(path, **kw)
```
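A quick usage sketch; the base URL and path are placeholders, not part of the lesson:

```python
client = RateLimitedClient("https://example.com", requests_per_second=1.0)
resp = client.get("/some-page")  # a second call within 1 s will sleep first
print(resp.status_code, len(resp.text))
```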
robots.txt parsing
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if not rp.can_fetch("codingstairs-crawler/1.0", "https://example.com/some-page"):
    raise PermissionError("blocked by robots.txt")
```
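robots.txt can also declare a Crawl-delay. A small sketch of honoring it when present; feeding the value into the RateLimitedClient above is an assumption of this sketch, not something the standard library does for you:

```python
# crawl_delay() returns the site's Crawl-delay, or None if it doesn't set one.
delay = rp.crawl_delay("codingstairs-crawler/1.0")
rps = 1.0 / delay if delay else 1.0  # fall back to the 1 request/second rule
client = RateLimitedClient("https://example.com", requests_per_second=rps)
```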
Cache — DB or Redis
```python
from datetime import UTC, datetime

def get_with_cache(url, ttl=3600):
    # get_conn() and client come from earlier steps
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT body, fetched_at FROM http_cache WHERE url = %s", (url,))
        row = cur.fetchone()
        if row and (datetime.now(tz=UTC) - row[1]).total_seconds() < ttl:
            return row[0]  # fresh cache hit
        body = client.get(url).text
        # …UPSERT into http_cache
        return body
```
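The heading also mentions Redis. A minimal sketch of the same idea with redis-py; the client setup, key naming, and local Redis instance are assumptions, not part of the lesson's code:

```python
import redis  # third-party: pip install redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def get_with_redis_cache(url, ttl=3600):
    cached = r.get(url)          # bytes, or None on a miss
    if cached is not None:
        return cached.decode()
    body = client.get(url).text  # `client` as above
    r.setex(url, ttl, body)      # store with expiry; Redis evicts it after ttl seconds
    return body
```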
Playwright — only when needed
For JavaScript-rendered sites, a real browser (Playwright) is the last resort; it costs roughly 100× more time and memory than a plain HTTP request.
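If you do reach for it, a minimal sketch with Playwright's sync API; the target URL is a placeholder, and `playwright install chromium` must have been run first:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/js-rendered-page")
    html = page.content()  # HTML after JavaScript has run
    browser.close()
```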
Try it
Pull https://jsonplaceholder.typicode.com/posts 5 times at 1-second intervals. Verify the spacing.
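One possible sketch, reusing the RateLimitedClient from above; printing elapsed times is just one way to verify the spacing:

```python
import time

client = RateLimitedClient("https://jsonplaceholder.typicode.com")
start = time.time()
for _ in range(5):
    resp = client.get("/posts")
    print(f"{time.time() - start:5.2f}s  status={resp.status_code}")
# Expect the requests to land roughly 1 second apart.
```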
Next
Step 6 builds the full ETL pipeline.