Step 3
Rate limit · retries · backoff
25 min
Two axes determine whether a crawler respects its target: how fast you go, and how you back off when things fail.
1. Self rate-limit
import asyncio
import random
import httpx

async def polite_get(client, url):
    resp = await client.get(url)
    await asyncio.sleep(1 + random.random())  # pause 1-2 s, with jitter
    return resp
The random jitter keeps parallel workers from aligning into synchronized bursts.
2. Concurrency cap
sem = asyncio.Semaphore(3)  # at most 3 requests in flight at once

async def bounded_get(client, url):
    async with sem:
        return await polite_get(client, url)
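Usage sketch, building on the two snippets above: launch every task at once and let the semaphore keep at most 3 requests in flight.

async def crawl(urls):
    async with httpx.AsyncClient() as client:
        # gather() starts all tasks; the semaphore serializes the excess.
        return await asyncio.gather(*(bounded_get(client, u) for u in urls))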
3. Exponential backoff
async def fetch_with_retry(client, url, max_retries=4):
    for i in range(max_retries):
        try:
            resp = await client.get(url, timeout=30)
            if resp.status_code in (429, 503):
                # Server-side throttling: exponential backoff with jitter.
                await asyncio.sleep(2 ** i + random.random())
                continue
            resp.raise_for_status()
            return resp
        except (httpx.TimeoutException, httpx.ConnectError):
            if i == max_retries - 1:
                raise
            await asyncio.sleep(2 ** i)
    raise RuntimeError("max retries exceeded")
4. Respect Retry-After
# This belongs inside the retry loop of fetch_with_retry, replacing the
# fixed backoff for 429s:
if resp.status_code == 429:
    retry_after = resp.headers.get("Retry-After")
    if retry_after:
        # Retry-After is either a number of seconds or an HTTP-date.
        wait = float(retry_after) if retry_after.isdigit() else parse_http_date(retry_after)
        await asyncio.sleep(wait)
        continue
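parse_http_date is not defined above; a minimal sketch using only the standard library, returning how many seconds to wait (clamping past dates to zero is an assumption):

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_http_date(value):
    # "Wed, 21 Oct 2025 07:28:00 GMT" -> seconds from now until then.
    target = parsedate_to_datetime(value)
    return max((target - datetime.now(timezone.utc)).total_seconds(), 0.0)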
5. Circuit breaker
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=60):
        self.fails = 0
        self.opened_at = None
        self.threshold = threshold
        self.cooldown = cooldown

    async def call(self, fn):
        # While open and still cooling down, refuse without touching the host.
        if self.opened_at and time.time() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open")
        try:
            r = await fn()
            self.fails = 0
            self.opened_at = None  # success closes the circuit
            return r
        except Exception:
            self.fails += 1
            if self.fails >= self.threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
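Wiring the breaker around the retry helper from step 3; functools.partial binds the arguments so the breaker only sees a zero-argument coroutine function.

from functools import partial

breaker = CircuitBreaker(threshold=5, cooldown=60)

async def guarded_fetch(client, url):
    # After `threshold` consecutive failures, calls fail fast for
    # `cooldown` seconds instead of hammering a struggling host.
    return await breaker.call(partial(fetch_with_retry, client, url))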
6. If blocked
Continuous 403/429 → likely banned.
- Pause for hours
- Halve the rate (a sketch follows this list)
- Check your User-Agent string
- VPN/proxies are usually the wrong call
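A minimal sketch of the halve-the-rate idea, assuming you feed it every response status. AdaptiveLimiter and its thresholds are illustrative, not a library API: it doubles the delay on block signals and drifts back on success.

import asyncio, random

class AdaptiveLimiter:
    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def on_response(self, status):
        if status in (403, 429):
            # Block signal: halve the rate by doubling the delay.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Healthy response: drift back toward the base rate.
            self.delay = max(self.delay * 0.9, self.base_delay)

    async def wait(self):
        await asyncio.sleep(self.delay + random.random())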
7. Distributed rate limit
# Fixed-window counter shared across workers via Redis: at most
# `capacity` requests per one-minute window. `redis` is assumed to be
# an already-connected redis.asyncio.Redis client.
async def acquire_token(key, capacity):
    bucket = int(time.time() / 60)  # current one-minute window
    k = f"rl:{key}:{bucket}"
    count = await redis.incr(k)
    if count == 1:
        await redis.expire(k, 120)  # let stale windows expire on their own
    return count <= capacity
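One way to use it: poll until this worker wins a slot in the shared window (the key "example.com" and the capacity are illustrative values).

async def distributed_get(client, url):
    while not await acquire_token("example.com", capacity=60):
        await asyncio.sleep(1)  # window is full; wait for the next one
    return await polite_get(client, url)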
8. Logging
logger.info("fetch", url=url, status=resp.status_code, attempt=i, wait=wait)
Can't tune what you don't measure.
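The keyword-argument call style above is structlog's, not the stdlib logger's. A minimal setup, assuming structlog is installed:

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),  # one JSON object per event
    ]
)
logger = structlog.get_logger()

# Each fetch now emits a machine-parseable record, e.g.:
# {"event": "fetch", "url": "...", "status": 200, "attempt": 0, "wait": 1.3}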
9. Timeouts
# httpx.Timeout needs either a default or all four phases; set all four.
timeout = httpx.Timeout(connect=10, read=30, write=30, pool=30)
async with httpx.AsyncClient(timeout=timeout) as client:
    ...
Never run without a timeout.
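Individual calls can still override the client-wide default, which httpx supports per request:

# Give one slow endpoint a longer read timeout without loosening
# the defaults for everything else.
resp = await client.get(url, timeout=httpx.Timeout(10, read=120))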
10. Gotchas
- No timeout — one request stalls the pipeline
- Excess concurrency — 100 workers × 1 req/s = 100 req/s, an easy ban
- Infinite retries — always set max_retries
- Ignoring Retry-After — looks hostile
Closing
A crawler that fails frequently is under-tuned: aim for a success rate of 95% or better, and use backoff and circuit breakers so your failures don't hurt others.
Next
- 04-apscheduler-kst