Step 6
Observability · alerts
25 min
Crawlers break quietly: a site redesign, a ban, a network blip. Without dashboards and alerts, you can lose weeks of data before anyone notices.
1. What to collect
- Success rate
- Latency (p50/p95/p99)
- Rows ingested per day / source
- Block signals (403/429/CAPTCHA) — see the classifier sketch after this list
- Queue lag
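CAPTCHAs rarely announce themselves with a status code; challenge pages often come back as 200. A minimal classifier sketch (the marker strings are assumptions to tune per source, not a library API):

def classify_response(status: int, body: str) -> str:
    # Map a raw response onto the block signals tracked above.
    if status in (403, 429):
        return "blocked"
    if status == 200 and ("captcha" in body.lower() or "unusual traffic" in body.lower()):
        return "captcha"  # hypothetical markers; tune per source
    return "ok" if 200 <= status < 300 else "error"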
2. Structured logging
import json, time, logging

logging.basicConfig(level=logging.INFO)  # ensure records actually reach a handler
logger = logging.getLogger("crawler")

def log(level, event, **fields):
    # One JSON object per line: trivial to grep locally, trivial to ship to a log store.
    logger.log(getattr(logging, level.upper()), json.dumps({
        "event": event, "ts": time.time(), **fields
    }))

log("info", "fetch_ok", url=url, status=200, latency_ms=320)
log("warn", "fetch_blocked", url=url, status=429)
3. PostgreSQL events table
CREATE TABLE crawl_events (
    id            BIGSERIAL PRIMARY KEY,
    source        VARCHAR NOT NULL,
    status        INT NOT NULL,
    latency_ms    INT,
    rows_inserted INT DEFAULT 0,
    error_type    VARCHAR,
    created_at    TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX ON crawl_events (source, created_at DESC);
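Recording an event is one INSERT per fetch; a sketch assuming an asyncpg pool named db (as in step 10):

await db.execute(
    """INSERT INTO crawl_events (source, status, latency_ms, rows_inserted, error_type)
       VALUES ($1, $2, $3, $4, $5)""",
    "nps", 200, 320, 42, None,  # values illustrative
)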
4. Aggregation
SELECT source,
       count(*) FILTER (WHERE status = 200)::float / NULLIF(count(*), 0) AS success_rate,
       count(*) AS total,
       avg(latency_ms) AS avg_latency,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency
FROM crawl_events
WHERE created_at > now() - interval '24 hours'
GROUP BY source;
success_rate comes back as a 0–1 fraction so the alert below can compare against it directly, and percentile_cont supplies the p95 from step 1 that a bare average would hide.
5. Alerts
import os
import aiohttp

async def send_slack(text):
    # The incoming-webhook URL lives in the environment, not the codebase.
    webhook = os.environ["SLACK_WEBHOOK_URL"]
    async with aiohttp.ClientSession() as s:
        await s.post(webhook, json={"text": text})

if success_rate < 0.8:
    await send_slack(f"⚠️ {source} success {success_rate:.1%} (24h)")
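Wiring steps 4 and 5 together is a short loop; a sketch assuming the asyncpg pool db and the step-4 query stored as DAILY_STATS_SQL:

async def check_alerts():
    rows = await db.fetch(DAILY_STATS_SQL)  # one row per source
    for r in rows:
        if r["success_rate"] is not None and r["success_rate"] < 0.8:
            await send_slack(f"⚠️ {r['source']} success {r['success_rate']:.1%} (24h)")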
6. Alert hygiene
- Don't alert on everything
- Suppress repeats (a cooldown sketch follows this list)
- Separate INFO/WARN/CRITICAL
- After-hours policy for CRITICAL only
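Suppressing repeats can be a ten-line in-memory cooldown; a minimal sketch (the one-hour window is an assumption):

import time

_last_sent: dict[str, float] = {}

async def alert_once(key: str, text: str, cooldown_s: float = 3600):
    # Drop the alert if the same key fired within the cooldown window.
    now = time.monotonic()
    if now - _last_sent.get(key, float("-inf")) < cooldown_s:
        return
    _last_sent[key] = now
    await send_slack(text)

This state resets on restart; persist the timestamps if that matters to you.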
7. Daily summary
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

@scheduler.scheduled_job("cron", hour=9, minute=0, timezone="Asia/Seoul")
async def daily_summary():
    stats = await fetch_yesterday_stats()
    await send_slack(format_report(stats))
A one-liner every morning catches regressions early.
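fetch_yesterday_stats and format_report are yours to define; one possible shape for the report, assuming rows from the step-4 query:

def format_report(stats) -> str:
    # One line for the whole fleet: source, success rate, fetch count.
    return "📊 Yesterday: " + " | ".join(
        f"{r['source']} {r['success_rate']:.0%} ({r['total']} fetches)" for r in stats
    )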
8. Prometheus + Grafana (optional)
from prometheus_client import Counter, Histogram

fetch_total = Counter("crawler_fetch_total", "requests", ["source", "status"])
fetch_latency = Histogram("crawler_fetch_latency_seconds", "latency", ["source"])

with fetch_latency.labels(source="nps").time():
    resp = await session.get(url)
fetch_total.labels(source="nps", status=resp.status).inc()
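Prometheus pulls, so the process must expose the metrics; prometheus_client ships a tiny server for this (port 9100 is an arbitrary choice here):

from prometheus_client import start_http_server

start_http_server(9100)  # serves /metrics for Prometheus to scrape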
Worth it only with many crawlers.
9. Sentry
import os
import sentry_sdk

sentry_sdk.init(dsn=os.environ["SENTRY_DSN"])

try:
    await crawl_job()
except Exception as e:
    sentry_sdk.capture_exception(e)  # ship the stack trace, then re-raise
    raise
10. Healthcheck
@app.get("/health/crawler")
async def health():
last = await db.fetchval("SELECT MAX(created_at) FROM crawl_events WHERE status=200")
if (now() - last).total_seconds() / 3600 > 25:
raise HTTPException(503, "crawler stale")
return {"status": "ok", "last_success": last}
External uptime monitors poll this endpoint.
11. Gotchas
- Too many alerts → alert fatigue and muted channels
- No alerts → silent failures
- Paging on INFO → disturbed sleep
- Only watching errors → gradual degradation goes unseen (sketch below)
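Gradual degradation usually shows up in volume before it shows up in errors; a sketch comparing yesterday's ingested rows against the trailing week (the 50% threshold is an assumption):

async def check_volume_drop(source: str):
    row = await db.fetchrow(
        """SELECT coalesce(sum(rows_inserted) FILTER (
                      WHERE created_at > now() - interval '1 day'), 0) AS yesterday,
                  sum(rows_inserted) FILTER (
                      WHERE created_at BETWEEN now() - interval '8 days'
                                           AND now() - interval '1 day') / 7.0 AS week_avg
           FROM crawl_events
           WHERE source = $1""",
        source,
    )
    if row["week_avg"] and row["yesterday"] < float(row["week_avg"]) * 0.5:
        await send_slack(f"📉 {source}: {row['yesterday']} rows vs 7d avg {row['week_avg']:.0f}")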
Closing
A one-line Slack summary is often the most valuable dashboard you'll ever build.
Next
- security/06-headers-and-cors
- quality/03-observability-minimal
🎉 You finished Building public-data crawlers