Step 3
Rate limit · retries · backoff
25 min
Two axes determine whether a crawler respects its target: how fast you go, and how you back off when things fail.
1. Self rate-limit
import asyncio
import random
import httpx

async def polite_get(client, url):
    resp = await client.get(url)
    await asyncio.sleep(1 + random.random())  # pause 1-2 s, with jitter
    return resp
The random jitter keeps parallel workers from aligning into synchronized bursts.
2. Concurrency cap
sem = asyncio.Semaphore(3)  # at most 3 requests in flight at once

async def bounded_get(client, url):
    async with sem:
        return await polite_get(client, url)
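Usage sketch, building on the two snippets above: launch every task at once and let the semaphore keep at most 3 requests in flight.

async def crawl(urls):
    async with httpx.AsyncClient() as client:
        # gather() starts all tasks; the semaphore serializes the excess.
        return await asyncio.gather(*(bounded_get(client, u) for u in urls))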
3. Exponential backoff
async def fetch_with_retry(client, url, max_retries=4):
    for i in range(max_retries):
        try:
            resp = await client.get(url, timeout=30)
            if resp.status_code in (429, 503):
                # Server-side throttling: exponential backoff with jitter.
                await asyncio.sleep(2 ** i + random.random())
                continue
            resp.raise_for_status()
            return resp
        except (httpx.TimeoutException, httpx.ConnectError):
            if i == max_retries - 1:
                raise
            await asyncio.sleep(2 ** i)
    raise RuntimeError("max retries exceeded")
4. Respect Retry-After
# This belongs inside the retry loop of fetch_with_retry, replacing the
# fixed backoff for 429s:
if resp.status_code == 429:
    retry_after = resp.headers.get("Retry-After")
    if retry_after:
        # Retry-After is either a number of seconds or an HTTP-date.
        wait = float(retry_after) if retry_after.isdigit() else parse_http_date(retry_after)
        await asyncio.sleep(wait)
        continue
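parse_http_date is not defined above; a minimal sketch using only the standard library, returning how many seconds to wait (clamping past dates to zero is an assumption):

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_http_date(value):
    # "Wed, 21 Oct 2025 07:28:00 GMT" -> seconds from now until then.
    target = parsedate_to_datetime(value)
    return max((target - datetime.now(timezone.utc)).total_seconds(), 0.0)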
5. Circuit breaker
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=60):
        self.fails = 0
        self.opened_at = None
        self.threshold = threshold
        self.cooldown = cooldown

    async def call(self, fn):
        # While open and still cooling down, refuse without touching the host.
        if self.opened_at and time.time() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open")
        try:
            r = await fn()
            self.fails = 0
            self.opened_at = None  # success closes the circuit
            return r
        except Exception:
            self.fails += 1
            if self.fails >= self.threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
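Wiring the breaker around the retry helper from step 3; functools.partial binds the arguments so the breaker only sees a zero-argument coroutine function.

from functools import partial

breaker = CircuitBreaker(threshold=5, cooldown=60)

async def guarded_fetch(client, url):
    # After `threshold` consecutive failures, calls fail fast for
    # `cooldown` seconds instead of hammering a struggling host.
    return await breaker.call(partial(fetch_with_retry, client, url))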
6. If blocked
Continuous 403/429 → likely banned.
- Pause for hours
- Halve the rate (a sketch follows this list)
- Check your User-Agent string
- VPN/proxies are usually the wrong call
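A minimal sketch of the halve-the-rate idea, assuming you feed it every response status. AdaptiveLimiter and its thresholds are illustrative, not a library API: it doubles the delay on block signals and drifts back on success.

import asyncio, random

class AdaptiveLimiter:
    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def on_response(self, status):
        if status in (403, 429):
            # Block signal: halve the rate by doubling the delay.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Healthy response: drift back toward the base rate.
            self.delay = max(self.delay * 0.9, self.base_delay)

    async def wait(self):
        await asyncio.sleep(self.delay + random.random())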
7. Distributed rate limit
# Fixed-window counter shared across workers via Redis: at most
# `capacity` requests per one-minute window. `redis` is assumed to be
# an already-connected redis.asyncio.Redis client.
async def acquire_token(key, capacity):
    bucket = int(time.time() / 60)  # current one-minute window
    k = f"rl:{key}:{bucket}"
    count = await redis.incr(k)
    if count == 1:
        await redis.expire(k, 120)  # let stale windows expire on their own
    return count <= capacity
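One way to use it: poll until this worker wins a slot in the shared window (the key "example.com" and the capacity are illustrative values).

async def distributed_get(client, url):
    while not await acquire_token("example.com", capacity=60):
        await asyncio.sleep(1)  # window is full; wait for the next one
    return await polite_get(client, url)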
8. Logging
logger.info("fetch", url=url, status=resp.status_code, attempt=i, wait=wait)
Can't tune what you don't measure.
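The keyword-argument call style above is structlog's, not the stdlib logger's. A minimal setup, assuming structlog is installed:

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),  # one JSON object per event
    ]
)
logger = structlog.get_logger()

# Each fetch now emits a machine-parseable record, e.g.:
# {"event": "fetch", "url": "...", "status": 200, "attempt": 0, "wait": 1.3}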
9. Timeouts
# httpx.Timeout needs either a default or all four phases; set all four.
timeout = httpx.Timeout(connect=10, read=30, write=30, pool=30)
async with httpx.AsyncClient(timeout=timeout) as client:
    ...
Never run without a timeout.
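Individual calls can still override the client-wide default, which httpx supports per request:

# Give one slow endpoint a longer read timeout without loosening
# the defaults for everything else.
resp = await client.get(url, timeout=httpx.Timeout(10, read=120))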
10. Gotchas
- No timeout — one request stalls the pipeline
- Excess concurrency — 100 workers × 1 req/s = 100 req/s, an easy ban
- Infinite retries — always set max_retries
- Ignoring Retry-After — looks hostile
Closing
A crawler that fails frequently is under-tuned: aim for a success rate of 95% or better, and use backoff and circuit breakers so your failures don't hurt others.
Next
- 04-apscheduler-kst