Minimal observability — logs, metrics, traces
The word "observability" tends to suggest rolling out a full stack, but for a small system a full stack makes operational cost outpace value pretty quickly.
1. The 3 pillars
Observability is usually framed around three pillars.
| Pillar | What it shows | Common tools |
|---|---|---|
| Logs | Time-ordered records of events and messages | stdout · files · Loki · ELK · CloudWatch |
| Metrics | Numbers over time windows (request count, latency, error rate) | Prometheus · Datadog · CloudWatch Metrics |
| Traces | The path one request took through the system | Jaeger · Tempo · Honeycomb · Datadog APM |
The three complement each other. Metrics raise the alarm, logs show the events at that moment, and traces reveal where in the request the slowness lives.
2. OpenTelemetry
OpenTelemetry is a CNCF project formed in 2019 by merging the OpenTracing and OpenCensus projects (opentelemetry.io). It defines a standard API, SDK, and wire format (OTLP) that reduce vendor lock-in.
Core shape:
- The application emits signals through the OpenTelemetry SDK (traces · metrics · logs).
- The OTel Collector (or the app itself) receives the signals and forwards them to a backend.
- The backend (Tempo · Jaeger · Datadog, etc.) provides visualization and search.
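As a rough sketch of that shape in Python — assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages, and a collector listening on `localhost:4317`:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans to a local OTel Collector over OTLP/gRPC.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api")
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.route", "/users")  # attributes become searchable in the backend
```

The application code never names a vendor; swapping backends is a collector configuration change.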
It comes up often when avoiding vendor lock-in matters.
3. Log essentials
- Structured logs (JSON) — written as key-value so they are searchable and aggregatable.
- Correlation ID (correlation id / trace id) — the same key on every log line that belongs to one request.
- Levels — DEBUG · INFO · WARN · ERROR · FATAL.
- Sampling — too many logs become cost and noise. Either sample only important spots or sample by ratio.
```json
{
  "ts": "2026-04-25T01:23:45Z",
  "level": "ERROR",
  "service": "api",
  "trace_id": "0a1b2c...",
  "msg": "DB query timeout",
  "query": "SELECT ... LIMIT 10",
  "elapsed_ms": 5021
}
```
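A log line like the one above needs nothing beyond the standard library. A minimal sketch — field names match the example, with correlation fields attached per call via `extra`:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "api",
            "msg": record.getMessage(),
        }
        # Copy optional correlation fields when a call site provides them.
        for key in ("trace_id", "query", "elapsed_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.error("DB query timeout", extra={"trace_id": "0a1b2c...", "elapsed_ms": 5021})
```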
4. Metric essentials
- Counter — monotonically increasing (request count, error count).
- Gauge — instantaneous value (current memory, current connections).
- Histogram — distribution (latency) — quantiles like p50 · p95 · p99.
- Summary — quantiles pre-computed by the client.
In many places p99 latency is more meaningful than the average: watching only the average hides the tail, which is where the worst user experience lives.
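A sketch of the three instrument types with the `prometheus_client` package (metric names are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["route"])

start_http_server(8000)  # serves /metrics for Prometheus to scrape

def handle_request():
    with IN_FLIGHT.track_inprogress(), LATENCY.labels(route="/users").time():
        # ... do the actual work ...
        REQUESTS.labels(route="/users", status="200").inc()
```

Quantiles like p95 come out of the histogram buckets at query time (e.g. `histogram_quantile` in PromQL), so the application only records raw observations.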
5. Trace essentials
- Span — a unit of work with a start and end.
- Trace — the tree of spans belonging to one request.
- Context propagation — pass trace and span IDs via HTTP headers (`traceparent`) and message headers.
- Sampling — tracing every request is expensive. Use ratio, head, or tail sampling.
Header standards include W3C Trace Context and B3, with OTel defaulting to W3C.
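A sketch of propagation with the OpenTelemetry Python API — the default propagator is W3C Trace Context, so `inject` writes a `traceparent` header:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("api")

# Caller side: copy the current trace context into outgoing headers.
with tracer.start_as_current_span("outgoing_call"):
    headers: dict = {}
    inject(headers)  # adds {"traceparent": "00-<trace_id>-<span_id>-01"}
    # ... send the HTTP request or message with these headers ...

# Callee side: restore the context so the new span joins the same trace.
ctx = extract(headers)
with tracer.start_as_current_span("handle_incoming", context=ctx):
    pass
```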
6. Grafana stack
- Grafana — visualization.
- Prometheus — metrics collection and storage.
- Loki — log storage.
- Tempo — trace storage.
- Mimir — large-scale metrics (distributed Prometheus-compatible).
It is open source and self-hostable, but it means running, upgrading, and backing up 5–6 services.
7. SaaS tools
| Tool | Focus |
|---|---|
| Datadog | All-in-one APM, infra, logs, security. Cost grows fast with usage. |
| New Relic | Similar all-in-one APM. Per-user licensing. |
| Honeycomb | Trace and event focused, with strong support for high-cardinality queries. |
| Sentry | Error tracking first, with tracing and sessions added on. |
| Logtail · Better Stack · Axiom | Newer SaaS players in logging. |
The choice balances:
- Operations headcount.
- Cost predictability.
- Data retention and regulatory needs.
- Integration coverage (languages, frameworks, infra).
8. Minimal observability — 4 stages
A staged approach that fits small systems and a single server.
Stage 1 — structured logs + health checks
- Every service logs as JSON.
- `/health` and `/ready` endpoints.
- An external uptime monitor (UptimeRobot · Better Stack · Cronitor) checks those endpoints and pages out.
This single stage already covers "we know when the service goes down."
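A stage-1 health endpoint needs nothing beyond the standard library. A minimal sketch — the readiness check for real dependencies is left as a stub:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok():
    """Stub: ping the DB, cache, etc. here."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":        # liveness: the process is up
            self.send_response(200)
        elif self.path == "/ready":       # readiness: dependencies answer too
            self.send_response(200 if dependencies_ok() else 503)
        else:
            self.send_response(404)
        self.end_headers()

HTTPServer(("", 8080), HealthHandler).serve_forever()
```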
Stage 2 — error tracking
- Wire Sentry (or similar) to auto-send exceptions.
- Stack traces, releases, and user context arrive together.
- Cost stays predictable.
Stage 3 — core metrics
- Expose request count, error rate, and p95 latency.
- Prometheus + a small Grafana, or a managed cloud metrics service.
- Start alerting on 1–2 core indicators.
Stage 4 — tracing
- As multi-service calls grow, traces become more valuable.
- Combine the OpenTelemetry SDK with a backend (Tempo · Jaeger · SaaS).
- Control cost via sampling.
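A sketch of head sampling with the OpenTelemetry SDK — the ratio is illustrative, and `ParentBased` keeps a child span's decision consistent with its parent's:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: keep roughly 1% of traces, decided once at the root span.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail sampling — keeping a trace based on what happened inside it, such as an error — is done in the collector rather than the SDK.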
9. What information needs to surface eventually
Whatever the names and tools at each stage, this information needs to be visible.
- Error count and p95 latency over the last hour.
- Active alerts right now.
- Which users and requests were affected.
10. Common stumbles
Premature full stack — bringing up 5 services for a small system makes the observability stack heavier to operate than the system it observes. Phased rollout is safer.
Secrets in logs — passwords, tokens, and PII end up in logs. Block them with masking libraries and filters.
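A naive masking filter as a Python sketch — the regex is illustrative, not a substitute for a vetted redaction library:

```python
import logging
import re

SECRET = re.compile(r"(password|token|api[_-]?key)=\S+", re.IGNORECASE)

class RedactFilter(logging.Filter):
    def filter(self, record):
        # Rewrite the message in place before any handler sees it.
        record.msg = SECRET.sub(r"\1=[REDACTED]", str(record.msg))
        return True

logger = logging.getLogger("api")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactFilter())
logger.warning("login failed for password=hunter2")  # -> password=[REDACTED]
```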
Cardinality explosion — putting user IDs or request IDs into metric labels blows up time series count.
Trace gaps — context propagation breaks across async paths and message queues. Propagate explicitly.
Alert fatigue — too many alerts get ignored entirely. Start with 1–3 core ones and add gradually.
The meaning of sampling — 1% trace sampling is often plenty for debugging but misses rare errors. Decide between head sampling (chosen at the start of the request) and tail sampling (chosen after completion, so error traces can always be kept).
Time sync — when clocks drift between servers, trace and log correlation drifts too. Run NTP.
Vendor lock-in — relying on a SaaS-specific SDK makes migration hard. Build on a standard interface like OpenTelemetry.
Closing thoughts
Observability is the kind of thing whose value only becomes visible after an operational incident. Even so, bringing in the full stack at once just inflates operational cost. Starting with stage 1 (structured logs + health checks) and moving to the next stage only where current coverage falls short is the safest path.
Next
- github-actions
- vitest-pytest-real-world
See OpenTelemetry, W3C Trace Context, Prometheus, Grafana, Loki, Sentry, and Charity Majors' writing.