Minimal observability — logs, metrics, traces
The word "observability" tends to suggest rolling out a full stack, but for a small system a full stack makes operational cost outpace value pretty quickly.
1. The 3 pillars
Observability is usually framed around three pillars.
| Pillar | What it shows | Common tools |
|---|---|---|
| Logs | Time-ordered records of events and messages | stdout · files · Loki · ELK · CloudWatch |
| Metrics | Numbers over time windows (request count, latency, error rate) | Prometheus · Datadog · CloudWatch Metrics |
| Traces | The path one request took through the system | Jaeger · Tempo · Honeycomb · Datadog APM |
The three complement each other. Metrics raise the alarm, logs show the events at that moment, and traces reveal where in the request the slowness lives.
2. OpenTelemetry
OpenTelemetry is a CNCF project formed in 2019 by merging the OpenTracing and OpenCensus projects (opentelemetry.io). It defines a standard API, SDK, and wire format (OTLP) that reduce vendor lock-in.
Core shape:
- The application emits signals through the OpenTelemetry SDK (traces · metrics · logs).
- The OTel Collector (or the app itself) receives the signals and forwards them to a backend.
- The backend (Tempo · Jaeger · Datadog, etc.) provides visualization and search.
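As a rough sketch of that shape in Python — assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages, and a collector listening on `localhost:4317`:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans to a local OTel Collector over OTLP/gRPC.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api")
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.route", "/users")  # attributes become searchable in the backend
```

The application code never names a vendor; swapping backends is a collector configuration change.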
It comes up often when avoiding vendor lock-in matters.
3. Log essentials
- Structured logs (JSON) — written as key-value so they are searchable and aggregatable.
- Correlation ID (correlation id / trace id) — the same key on every log line that belongs to one request.
- Levels — DEBUG · INFO · WARN · ERROR · FATAL.
- Sampling — too many logs become cost and noise. Either sample only important spots or sample by ratio.
```json
{
  "ts": "2026-04-25T01:23:45Z",
  "level": "ERROR",
  "service": "api",
  "trace_id": "0a1b2c...",
  "msg": "DB query timeout",
  "query": "SELECT ... LIMIT 10",
  "elapsed_ms": 5021
}
```
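A log line like the one above needs nothing beyond the standard library. A minimal sketch — field names match the example, with correlation fields attached per call via `extra`:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "api",
            "msg": record.getMessage(),
        }
        # Copy optional correlation fields when a call site provides them.
        for key in ("trace_id", "query", "elapsed_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.error("DB query timeout", extra={"trace_id": "0a1b2c...", "elapsed_ms": 5021})
```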
4. Metric essentials
- Counter — monotonically increasing (request count, error count).
- Gauge — instantaneous value (current memory, current connections).
- Histogram — distribution (latency) — quantiles like p50 · p95 · p99.
- Summary — quantiles pre-computed by the client.
In many places p99 latency is more meaningful than the average: watching only the average hides the tail, which is where the worst user experience lives.
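A sketch of the three instrument types with the `prometheus_client` package (metric names are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["route"])

start_http_server(8000)  # serves /metrics for Prometheus to scrape

def handle_request():
    with IN_FLIGHT.track_inprogress(), LATENCY.labels(route="/users").time():
        # ... do the actual work ...
        REQUESTS.labels(route="/users", status="200").inc()
```

Quantiles like p95 come out of the histogram buckets at query time (e.g. `histogram_quantile` in PromQL), so the application only records raw observations.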
5. Trace essentials
- Span — a unit of work with a start and end.
- Trace — the tree of spans belonging to one request.
- Context propagation — pass trace and span IDs via HTTP headers (`traceparent`) and message headers.
- Sampling — tracing every request is expensive. Use ratio, head, or tail sampling.
Header standards include W3C Trace Context and B3, with OTel defaulting to W3C.
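A sketch of propagation with the OpenTelemetry Python API — the default propagator is W3C Trace Context, so `inject` writes a `traceparent` header:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("api")

# Caller side: copy the current trace context into outgoing headers.
with tracer.start_as_current_span("outgoing_call"):
    headers: dict = {}
    inject(headers)  # adds {"traceparent": "00-<trace_id>-<span_id>-01"}
    # ... send the HTTP request or message with these headers ...

# Callee side: restore the context so the new span joins the same trace.
ctx = extract(headers)
with tracer.start_as_current_span("handle_incoming", context=ctx):
    pass
```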
6. Grafana stack
- Grafana — visualization.
- Prometheus — metrics collection and storage.
- Loki — log storage.
- Tempo — trace storage.
- Mimir — large-scale metrics (distributed Prometheus-compatible).
It is open source and self-hostable, but it means running, upgrading, and backing up 5–6 services.
7. SaaS tools
| Tool | Focus |
|---|---|
| Datadog | All-in-one APM, infra, logs, security. Cost grows fast with usage. |
| New Relic | Similar all-in-one APM. Per-user licensing. |
| Honeycomb | Trace and event focused, with strong support for high-cardinality queries. |
| Sentry | Error tracking first, with tracing and sessions added on. |
| Logtail · Better Stack · Axiom | Newer SaaS players in logging. |
The choice balances:
- Operations headcount.
- Cost predictability.
- Data retention and regulatory needs.
- Integration coverage (languages, frameworks, infra).
8. Minimal observability — 4 stages
A staged approach that fits small systems and a single server.
Stage 1 — structured logs + health checks
- Every service logs as JSON.
- `/health` and `/ready` endpoints.
- An external uptime monitor (UptimeRobot · Better Stack · Cronitor) checks those endpoints and pages out.
This single stage already covers "we know when the service goes down."
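A stage-1 health endpoint needs nothing beyond the standard library. A minimal sketch — the readiness check for real dependencies is left as a stub:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok():
    """Stub: ping the DB, cache, etc. here."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":        # liveness: the process is up
            self.send_response(200)
        elif self.path == "/ready":       # readiness: dependencies answer too
            self.send_response(200 if dependencies_ok() else 503)
        else:
            self.send_response(404)
        self.end_headers()

HTTPServer(("", 8080), HealthHandler).serve_forever()
```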
Stage 2 — error tracking
- Wire Sentry (or similar) to auto-send exceptions.
- Stack traces, releases, and user context arrive together.
- Cost stays predictable.
Stage 3 — core metrics
- Expose request count, error rate, and p95 latency.
- Prometheus + a small Grafana, or a managed cloud metrics service.
- Start alerting on 1–2 core indicators.
Stage 4 — tracing
- As multi-service calls grow, traces become more valuable.
- Combine the OpenTelemetry SDK with a backend (Tempo · Jaeger · SaaS).
- Control cost via sampling.
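A sketch of head sampling with the OpenTelemetry SDK — the ratio is illustrative, and `ParentBased` keeps a child span's decision consistent with its parent's:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: keep roughly 1% of traces, decided once at the root span.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail sampling — keeping a trace based on what happened inside it, such as an error — is done in the collector rather than the SDK.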
9. What information needs to surface eventually
Whatever the names and tools at each stage, this information needs to be visible.
- Error count and p95 latency over the last hour.
- Active alerts right now.
- Which users and requests were affected.
10. Common stumbles
Premature full stack — bringing up 5 services for a small system makes the observability stack heavier to operate than the system it observes. Phased rollout is safer.
Secrets in logs — passwords, tokens, and PII end up in logs. Block them with masking libraries and filters.
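A naive masking filter as a Python sketch — the regex is illustrative, not a substitute for a vetted redaction library:

```python
import logging
import re

SECRET = re.compile(r"(password|token|api[_-]?key)=\S+", re.IGNORECASE)

class RedactFilter(logging.Filter):
    def filter(self, record):
        # Rewrite the message in place before any handler sees it.
        record.msg = SECRET.sub(r"\1=[REDACTED]", str(record.msg))
        return True

logger = logging.getLogger("api")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactFilter())
logger.warning("login failed for password=hunter2")  # -> password=[REDACTED]
```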
Cardinality explosion — putting user IDs or request IDs into metric labels blows up time series count.
Trace gaps — context propagation breaks across async paths and message queues. Propagate explicitly.
Alert fatigue — too many alerts get ignored entirely. Start with 1–3 core ones and add gradually.
The meaning of sampling — 1% trace sampling is often plenty for debugging but misses rare errors. Decide between head sampling (chosen at the start of the request) and tail sampling (chosen after completion, so error traces can always be kept).
Time sync — when clocks drift between servers, trace and log correlation drifts too. Run NTP.
Vendor lock-in — relying on a SaaS-specific SDK makes migration hard. Build on a standard interface like OpenTelemetry.
Closing thoughts
Observability is the kind of thing whose value only becomes visible after an operational incident. Even so, bringing in the full stack at once just inflates operational cost. Starting with stage 1 (structured logs + health checks) and moving to the next stage only where current coverage falls short is the safest path.
Next
- github-actions
- vitest-pytest-real-world
See OpenTelemetry, W3C Trace Context, Prometheus, Grafana, Loki, Sentry, and Charity Majors' writing.