Step 6
Observability · alerts
25 min
Crawlers break quietly: a site redesign, a ban, a network blip. Without dashboards and alerts, you can lose weeks of data before anyone notices.
1. What to collect
- Success rate
- Latency (p50/p95/p99)
- Rows ingested per day / source
- Block signals (403/429/CAPTCHA) — see the classifier sketch after this list
- Queue lag
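CAPTCHAs rarely announce themselves with a status code; challenge pages often come back as 200. A minimal classifier sketch (the marker strings are assumptions to tune per source, not a library API):

def classify_response(status: int, body: str) -> str:
    # Map a raw response onto the block signals tracked above.
    if status in (403, 429):
        return "blocked"
    if status == 200 and ("captcha" in body.lower() or "unusual traffic" in body.lower()):
        return "captcha"  # hypothetical markers; tune per source
    return "ok" if 200 <= status < 300 else "error"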
2. Structured logging
import json, time, logging

logging.basicConfig(level=logging.INFO)  # ensure records actually reach a handler
logger = logging.getLogger("crawler")

def log(level, event, **fields):
    # One JSON object per line: trivial to grep locally, trivial to ship to a log store.
    logger.log(getattr(logging, level.upper()), json.dumps({
        "event": event, "ts": time.time(), **fields
    }))

log("info", "fetch_ok", url=url, status=200, latency_ms=320)
log("warn", "fetch_blocked", url=url, status=429)
3. PostgreSQL events table
CREATE TABLE crawl_events (
    id            BIGSERIAL PRIMARY KEY,
    source        VARCHAR NOT NULL,
    status        INT NOT NULL,
    latency_ms    INT,
    rows_inserted INT DEFAULT 0,
    error_type    VARCHAR,
    created_at    TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX ON crawl_events (source, created_at DESC);
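Recording an event is one INSERT per fetch; a sketch assuming an asyncpg pool named db (as in step 10):

await db.execute(
    """INSERT INTO crawl_events (source, status, latency_ms, rows_inserted, error_type)
       VALUES ($1, $2, $3, $4, $5)""",
    "nps", 200, 320, 42, None,  # values illustrative
)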
4. Aggregation
SELECT source,
       count(*) FILTER (WHERE status = 200)::float / NULLIF(count(*), 0) AS success_rate,
       count(*) AS total,
       avg(latency_ms) AS avg_latency,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency
FROM crawl_events
WHERE created_at > now() - interval '24 hours'
GROUP BY source;
success_rate comes back as a 0–1 fraction so the alert below can compare against it directly, and percentile_cont supplies the p95 from step 1 that a bare average would hide.
5. Alerts
import os
import aiohttp

async def send_slack(text):
    # The incoming-webhook URL lives in the environment, not the codebase.
    webhook = os.environ["SLACK_WEBHOOK_URL"]
    async with aiohttp.ClientSession() as s:
        await s.post(webhook, json={"text": text})

if success_rate < 0.8:
    await send_slack(f"⚠️ {source} success {success_rate:.1%} (24h)")
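Wiring steps 4 and 5 together is a short loop; a sketch assuming the asyncpg pool db and the step-4 query stored as DAILY_STATS_SQL:

async def check_alerts():
    rows = await db.fetch(DAILY_STATS_SQL)  # one row per source
    for r in rows:
        if r["success_rate"] is not None and r["success_rate"] < 0.8:
            await send_slack(f"⚠️ {r['source']} success {r['success_rate']:.1%} (24h)")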
6. Alert hygiene
- Don't alert on everything
- Suppress repeats (a cooldown sketch follows this list)
- Separate INFO/WARN/CRITICAL
- After-hours policy for CRITICAL only
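Suppressing repeats can be a ten-line in-memory cooldown; a minimal sketch (the one-hour window is an assumption):

import time

_last_sent: dict[str, float] = {}

async def alert_once(key: str, text: str, cooldown_s: float = 3600):
    # Drop the alert if the same key fired within the cooldown window.
    now = time.monotonic()
    if now - _last_sent.get(key, float("-inf")) < cooldown_s:
        return
    _last_sent[key] = now
    await send_slack(text)

This state resets on restart; persist the timestamps if that matters to you.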
7. Daily summary
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

@scheduler.scheduled_job("cron", hour=9, minute=0, timezone="Asia/Seoul")
async def daily_summary():
    stats = await fetch_yesterday_stats()
    await send_slack(format_report(stats))
A one-liner every morning catches regressions early.
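fetch_yesterday_stats and format_report are yours to define; one possible shape for the report, assuming rows from the step-4 query:

def format_report(stats) -> str:
    # One line for the whole fleet: source, success rate, fetch count.
    return "📊 Yesterday: " + " | ".join(
        f"{r['source']} {r['success_rate']:.0%} ({r['total']} fetches)" for r in stats
    )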
8. Prometheus + Grafana (optional)
from prometheus_client import Counter, Histogram

fetch_total = Counter("crawler_fetch_total", "requests", ["source", "status"])
fetch_latency = Histogram("crawler_fetch_latency_seconds", "latency", ["source"])

with fetch_latency.labels(source="nps").time():
    resp = await session.get(url)
fetch_total.labels(source="nps", status=resp.status).inc()
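Prometheus pulls, so the process must expose the metrics; prometheus_client ships a tiny server for this (port 9100 is an arbitrary choice here):

from prometheus_client import start_http_server

start_http_server(9100)  # serves /metrics for Prometheus to scrape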
Worth it only with many crawlers.
9. Sentry
import os
import sentry_sdk

sentry_sdk.init(dsn=os.environ["SENTRY_DSN"])

try:
    await crawl_job()
except Exception as e:
    sentry_sdk.capture_exception(e)  # ship the stack trace, then re-raise
    raise
10. Healthcheck
@app.get("/health/crawler")
async def health():
last = await db.fetchval("SELECT MAX(created_at) FROM crawl_events WHERE status=200")
if (now() - last).total_seconds() / 3600 > 25:
raise HTTPException(503, "crawler stale")
return {"status": "ok", "last_success": last}
External uptime monitors poll this endpoint.
11. Gotchas
- Too many alerts → alert fatigue and muted channels
- No alerts → silent failures
- Paging on INFO → disturbed sleep
- Only watching errors → gradual degradation goes unseen (sketch below)
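Gradual degradation usually shows up in volume before it shows up in errors; a sketch comparing yesterday's ingested rows against the trailing week (the 50% threshold is an assumption):

async def check_volume_drop(source: str):
    row = await db.fetchrow(
        """SELECT coalesce(sum(rows_inserted) FILTER (
                      WHERE created_at > now() - interval '1 day'), 0) AS yesterday,
                  sum(rows_inserted) FILTER (
                      WHERE created_at BETWEEN now() - interval '8 days'
                                           AND now() - interval '1 day') / 7.0 AS week_avg
           FROM crawl_events
           WHERE source = $1""",
        source,
    )
    if row["week_avg"] and row["yesterday"] < float(row["week_avg"]) * 0.5:
        await send_slack(f"📉 {source}: {row['yesterday']} rows vs 7d avg {row['week_avg']:.0f}")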
Closing
A one-line Slack summary is often the most valuable dashboard you'll ever build.
Next
- security/06-headers-and-cors
- quality/03-observability-minimal
🎉 You finished Building public-data crawlers