rate limit · retries · backoff

A crawler respects its target site along two axes: how fast it sends requests, and how it backs off after a failure.

1. Request spacing (self-rate-limit)

import asyncio
import random

async def polite_get(client, url):
    resp = await client.get(url)
    await asyncio.sleep(1 + random.random())   # pause a random 1-2 s before the next request
    return resp

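Random jitter keeps requests from piling up at the same instant (for example, exactly on the hour). One to five requests per second is the usual courtesy.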

2. Token bucket (approximated with a semaphore)

When several workers hit the same site at once, a per-request delay alone is not enough.

from asyncio import Semaphore
sem = Semaphore(3)   # at most 3 requests in flight

async def bounded_get(client, url):
    async with sem:
        return await polite_get(client, url)

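asyncio.Semaphore caps concurrency. It bounds requests in flight rather than requests per second; because polite_get holds its slot through the 1-2 s sleep, 3 slots work out to at most about 3 requests per second here.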
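
A semaphore limits how many requests run at once; a token bucket limits how many may start per second. A minimal sketch of the latter, assuming a single process and a single event loop. The class name TokenBucket and its parameters rate and capacity are illustrative, not from any library:

```python
import asyncio
import time


class TokenBucket:
    """Hypothetical in-process token bucket: `rate` tokens/s, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)      # start full, so an initial burst is allowed
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        # Add the tokens earned since the last update, never exceeding capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        async with self._lock:
            self._refill()
            while self.tokens < 1:
                # Sleep just long enough for the missing fraction of a token to refill.
                await asyncio.sleep((1 - self.tokens) / self.rate)
                self._refill()
            self.tokens -= 1
```

Calling await bucket.acquire() before each client.get(...) would hold all workers in one process to roughly rate requests per second, whatever the worker count.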

3. Exponential backoff

Retrying immediately after a failure invites disaster. Wait progressively longer instead.

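import asyncio
import random

import httpx

async def fetch_with_retry(client, url, max_retries=4):
    for i in range(max_retries):
        last = i == max_retries - 1
        wait = 2 ** i + random.random()          # 1, 2, 4, 8 s plus jitter
        try:
            resp = await client.get(url, timeout=30)
            if resp.status_code in (429, 503):   # retryable: the server is overloaded
                if not last:                     # no point sleeping after the final attempt
                    await asyncio.sleep(wait)
                continue
            resp.raise_for_status()              # other 4xx/5xx: raise, do not retry
            return resp
        except (httpx.TimeoutException, httpx.ConnectError):
            if last:
                raise
            await asyncio.sleep(wait)
    raise RuntimeError("max retries")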

429 Too Many Requests · 503 Service Unavailable → retry candidates
404 · 400 and other client errors → retrying is pointless
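
To make the rule above concrete, this sketch separates the two decisions: which status codes are worth retrying, and how long to wait before the next attempt. The names RETRYABLE, is_retryable, and backoff_delay are hypothetical, and the 60 s cap is an added assumption (the loop in this section has no cap):

```python
import random

RETRYABLE = {429, 503}          # from the list above; everything else fails fast


def is_retryable(status_code: int) -> bool:
    return status_code in RETRYABLE


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential delay base * 2**attempt, capped at `cap`, plus up to 1 s of jitter."""
    return min(cap, base * 2 ** attempt) + random.random()


if __name__ == "__main__":
    for attempt in range(4):
        # Delays fall in roughly 1-2, 2-3, 4-5, and 8-9 s.
        print(attempt, round(backoff_delay(attempt), 2))
```

The cap matters once max_retries grows: without it, attempt 10 would already wait over 17 minutes.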

4. Honor the Retry-After header

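from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
if resp.status_code == 429:
    retry_after = resp.headers.get("Retry-After")
    if retry_after:
        if retry_after.isdigit():   # delta-seconds form, e.g. "120"
            wait = float(retry_after)
        else:                       # HTTP-date form: seconds from now until that time
            when = parsedate_to_datetime(retry_after)
            wait = max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
        await asyncio.sleep(wait)
        continue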

If the server tells you "come back after this long", do exactly that.
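
Retry-After comes in two forms, a number of seconds or an HTTP date, and a malformed value should not crash the crawler. A standalone helper, sketched under the assumption that an unparseable header simply falls back to the normal backoff; the function name parse_retry_after and its now parameter are hypothetical:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header cannot be parsed."""
    value = value.strip()
    if value.isdigit():                       # delta-seconds form
        return float(value)
    try:
        when = parsedate_to_datetime(value)   # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:                   # treat a zone-less date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A date in the past yields 0.0, so the caller never sleeps for a negative duration.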

5. Circuit Breaker

After N consecutive failures, block all calls for a fixed period.

import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=60):
        self.fails = 0              # consecutive failures so far
        self.opened_at = None       # monotonic time the circuit opened; None while closed
        self.threshold = threshold
        self.cooldown = cooldown    # seconds to refuse calls once open

    async def call(self, fn):
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open")
        try:
            r = await fn()
        except Exception:
            self.fails += 1
            if self.fails >= self.threshold:
                self.opened_at = time.monotonic()   # open (or re-open) the circuit
            raise
        self.fails = 0              # any success closes the circuit
        self.opened_at = None
        return r

Stops pointless repeat requests while the site is down, and avoids wasting resources.

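6. Responding to an IP block

Persistent 403 or 429 responses suggest you are probably already blocked.

Pause for a while: wait a few hours to a day
Re-tune crawl speed: slow down by 50%
Check the User-Agent: one that is too explicit is easy to block
VPN · proxy: raises legal and ethical questions (not recommended in most cases)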
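
The slow-down step can be automated instead of tuned by hand. This is only a sketch under assumptions: the class AdaptiveDelay, the doubling on a block signal, and the 10% recovery per success are hypothetical choices, not taken from the notes above. Doubling the gap between requests is what halves the request rate.

```python
class AdaptiveDelay:
    """Hypothetical per-site delay: double on a block signal, recover slowly on success."""

    def __init__(self, base: float = 1.0, max_delay: float = 300.0):
        self.base = base
        self.max_delay = max_delay
        self.delay = base

    def record(self, status_code: int) -> float:
        """Update the delay from the latest response and return the new value in seconds."""
        if status_code in (403, 429):
            # Doubling the gap between requests halves the request rate.
            self.delay = min(self.max_delay, self.delay * 2)
        elif status_code < 400:
            # Creep back toward the base delay by 10% per successful response.
            self.delay = max(self.base, self.delay * 0.9)
        return self.delay
```

Usage would be a single line after each response, for example await asyncio.sleep(delay.record(resp.status_code)).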

7. Distributed rate limit (multiple workers)

Redis-backed limiter (a fixed-window counter standing in for a token bucket):

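import time

# `redis` is assumed to be an already-connected asyncio client (e.g. redis.asyncio.Redis).
async def acquire_token(key: str, capacity: int, refill_per_sec: float):
    # Simple fixed window: one counter per minute, at most `capacity` requests in each.
    # refill_per_sec is unused here; a true token bucket would refill continuously with it.
    now = time.time()
    bucket = int(now / 60)          # index of the current minute
    k = f"rl:{key}:{bucket}"
    count = await redis.incr(k)
    if count == 1:
        await redis.expire(k, 120)  # let stale window keys expire
    return count <= capacity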

Even with several app instances, the rate limit is enforced globally.
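
acquire_token only answers yes or no; a worker still has to decide what to do on a refusal. One option, sketched here under assumptions, is to sleep until the current one-minute window ends and then try again. The helper names seconds_until_next_window and wait_for_slot are hypothetical, and try_acquire stands in for a call such as acquire_token(...):

```python
import asyncio
import time
from typing import Awaitable, Callable


def seconds_until_next_window(now: float, window: int = 60) -> float:
    """Time left in the fixed window that contains `now` (windows aligned to the epoch)."""
    return window - (now % window)


async def wait_for_slot(try_acquire: Callable[[], Awaitable[bool]], window: int = 60) -> None:
    """Block until try_acquire() returns True, sleeping to the next window on each refusal."""
    while not await try_acquire():
        await asyncio.sleep(seconds_until_next_window(time.time(), window))
```

Sleeping to the window boundary avoids hammering Redis with refused INCR calls, at the cost of a burst when every waiting worker wakes at once; adding a little jitter to the sleep would soften that.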

8. Logging · metrics

logger.info("fetch", url=url, status=resp.status_code, attempt=i, wait=wait)   # keyword fields assume a structlog-style logger; stdlib logging does not accept them

Aggregate retry counts and success rate. If you only log the 200s and stop there, you have no basis for tuning.
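
A tally like the following is enough to get those two numbers. It is a minimal in-memory sketch; the class FetchStats and its method names are hypothetical, and a real deployment would more likely export the counts to a metrics system:

```python
from collections import Counter


class FetchStats:
    """Hypothetical in-memory tally of fetch outcomes."""

    def __init__(self):
        self.status = Counter()   # final status code -> number of fetches
        self.retries = 0          # attempts beyond the first, summed over all fetches
        self.fetches = 0

    def record(self, status_code: int, attempts: int) -> None:
        """Record one finished fetch: its final status and how many attempts it took."""
        self.fetches += 1
        self.status[status_code] += 1
        self.retries += attempts - 1

    def success_rate(self) -> float:
        """Fraction of fetches that ended with a 2xx status."""
        ok = sum(n for code, n in self.status.items() if 200 <= code < 300)
        return ok / self.fetches if self.fetches else 0.0
```

A rising retries count with a steady success rate means the backoff is absorbing pressure; a falling success rate means it is time to slow down.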

9. Timeout settings

async with httpx.AsyncClient(timeout=httpx.Timeout(connect=10, read=30, write=30, pool=10)) as client:   # httpx.Timeout needs a default or all four values
    ...

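connect: time to establish the TCP connection
read: time to read the response
write: time to send the request body
pool: time waiting for a free connection from the pool

A plain one-line timeout=30 is fine too. Never allow an unbounded wait.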

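10. Common pitfalls

No timeout: one request waits forever and stalls the whole pipeline
Too much concurrency: 100 workers × 1 request per second = 100 requests per second, which raises the odds of a block
Unbounded retries: always set max_retries
Ignoring Retry-After: rejecting the server's signal looks malicious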

Takeaway

A crawler that "fails often" is an under-tuned crawler. A success rate of 95% or higher is normal. Use backoff and a circuit breaker so that you do not harm other services.
