Step 5 — External APIs · crawler ethics
25 min
Pulling data from someone else's site is a matter of both etiquette and technique. Skip either and you get blocked, or worse, sued.
Five rules
- Read robots.txt first — /robots.txt lists which paths crawlers may and may not fetch
- Throttle requests — ≤1 per second
- Set a real User-Agent — identify your bot
- Cache — don't fetch twice
- Prefer the official API — RSS / Atom / an open API first (see the sketch after this list)
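As a quick illustration of the last rule, here is a minimal sketch of consuming an RSS/Atom feed instead of scraping HTML. It assumes the third-party feedparser package is installed and that the site exposes a feed at the hypothetical /feed.xml path; check the site's docs or its <link rel="alternate"> tags for the real one.

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical feed URL, not from this lesson
feed = feedparser.parse("https://example.com/feed.xml")
for entry in feed.entries[:5]:
    print(entry.title, entry.link)
```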
Rate-limited HTTP client
```python
import time, httpx

class RateLimitedClient:
    def __init__(self, base_url: str, requests_per_second: float = 1.0):
        self.client = httpx.Client(
            base_url=base_url,
            headers={"User-Agent": "codingstairs-crawler/1.0 (https://codingstairs.duckdns.org)"},
            timeout=10.0,
        )
        self.min_interval = 1.0 / requests_per_second
        self.last_request_at = 0.0

    def get(self, path, **kw):
        elapsed = time.time() - self.last_request_at
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request_at = time.time()
        return self.client.get(path, **kw)
```
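A quick usage sketch; the base URL and path are placeholders, not part of the lesson:

```python
client = RateLimitedClient("https://example.com", requests_per_second=1.0)
resp = client.get("/some-page")  # a second call within 1 s will sleep first
print(resp.status_code, len(resp.text))
```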
robots.txt parsing
```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if not rp.can_fetch("codingstairs-crawler/1.0", "https://example.com/some-page"):
    raise PermissionError("blocked by robots.txt")
```
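robots.txt can also declare a Crawl-delay. A small sketch of honoring it when present; feeding the value into the RateLimitedClient above is an assumption of this sketch, not something the standard library does for you:

```python
# crawl_delay() returns the site's Crawl-delay, or None if it doesn't set one.
delay = rp.crawl_delay("codingstairs-crawler/1.0")
rps = 1.0 / delay if delay else 1.0  # fall back to the 1 request/second rule
client = RateLimitedClient("https://example.com", requests_per_second=rps)
```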
Cache — DB or Redis
```python
from datetime import UTC, datetime

def get_with_cache(url, ttl=3600):
    # get_conn() and client come from earlier steps
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT body, fetched_at FROM http_cache WHERE url = %s", (url,))
        row = cur.fetchone()
        if row and (datetime.now(tz=UTC) - row[1]).total_seconds() < ttl:
            return row[0]  # fresh cache hit
        body = client.get(url).text
        # …UPSERT into http_cache
        return body
```
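The heading also mentions Redis. A minimal sketch of the same idea with redis-py; the client setup, key naming, and local Redis instance are assumptions, not part of the lesson's code:

```python
import redis  # third-party: pip install redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def get_with_redis_cache(url, ttl=3600):
    cached = r.get(url)          # bytes, or None on a miss
    if cached is not None:
        return cached.decode()
    body = client.get(url).text  # `client` as above
    r.setex(url, ttl, body)      # store with expiry; Redis evicts it after ttl seconds
    return body
```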
Playwright — only when needed
For JavaScript-rendered sites, a real browser (Playwright) is the last resort; it costs roughly 100× more time and memory than a plain HTTP request.
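If you do reach for it, a minimal sketch with Playwright's sync API; the target URL is a placeholder, and `playwright install chromium` must have been run first:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/js-rendered-page")
    html = page.content()  # HTML after JavaScript has run
    browser.close()
```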
Try it
Pull https://jsonplaceholder.typicode.com/posts 5 times at 1-second intervals. Verify the spacing.
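One possible sketch, reusing the RateLimitedClient from above; printing elapsed times is just one way to verify the spacing:

```python
import time

client = RateLimitedClient("https://jsonplaceholder.typicode.com")
start = time.time()
for _ in range(5):
    resp = client.get("/posts")
    print(f"{time.time() - start:5.2f}s  status={resp.status_code}")
# Expect the requests to land roughly 1 second apart.
```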
Next
Step 6 builds the full ETL pipeline.