2026-05-10
Distributed tracing: the landscape
A trace tells you what happened to a single request across every service it touched, in time order, with the latency and status of each step. Metrics tell you how often something happened. Logs tell you what individual events looked like. Traces tell you the shape of a request.
That third axis is what makes tracing indispensable the moment your system grows past one process.
Three signals, one job
The “three pillars of observability” framing is overused but useful as a cheat sheet:
| Signal | Question it answers | Cardinality | Cost shape |
|---|---|---|---|
| Metrics | Is anything broken? How fast? | Low | Cheap, aggregated |
| Logs | What did this event look like? | High | Linear in event volume |
| Traces | Where did this request spend its time? | Highest | Linear in requests × spans per request |
The pitch for traces specifically: when a checkout takes 4.2 seconds and the customer gives up, metrics say “P99 spiked,” logs say “a hundred services fired warnings,” and the trace says “the auth call hit a cold cache and waited 3.8s on Redis.” Without a trace, you reconstruct that from logs and timestamps, badly.
The model: trace, span, context
A trace is a tree (sometimes a DAG) of spans. A span is one unit of work — a service handling a request, a DB query, a queue publish — with a start time, end time, status, and arbitrary attributes (user_id, http.route, db.statement, …).
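In the Python SDK this model maps almost one-to-one onto the API. A minimal sketch (the service name and attribute values are illustrative; the console exporter just prints finished spans):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# One-time setup: a provider that prints every finished span to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative name

# One unit of work: start/end times and status are recorded for you.
with tracer.start_as_current_span("db.query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT total FROM carts WHERE id = %s")
    # ... do the query; any span started in here becomes a child of db.query
```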
What makes it work end-to-end is context propagation: every outgoing call carries trace headers (W3C traceparent, tracestate) so the next service stitches its spans onto the same trace. If propagation breaks anywhere — usually at a queue, lambda, or third-party SDK that doesn’t forward headers — the trace splits, and the picture goes dark from that point.
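Concretely, propagation is an inject on the way out and an extract on the way in. A sketch with the Python SDK’s W3C propagator, using a plain dict to stand in for HTTP or message headers:

```python
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()

# Caller: inject the active span's context into the outgoing carrier.
# (Assumes a span is currently active; otherwise nothing is injected.)
headers = {}
propagator.inject(headers)
# headers now holds a W3C traceparent, e.g. "00-<trace_id>-<span_id>-01"

# Callee: extract the context so new spans attach to the same trace.
ctx = propagator.extract(headers)
tracer = trace.get_tracer("downstream-service")  # illustrative name
with tracer.start_as_current_span("handle-request", context=ctx):
    pass  # spans here share the caller's trace_id
```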
A request, traced
The diagram below is one checkout request through a typical service graph. Solid edges are the request path. Dashed edges are spans being exported to the OpenTelemetry collector, which forwards them to a backend.
A few things worth pointing out:
- The request path is the trace’s logical shape. Each hop is a span; each span’s `parent_span_id` points to the caller’s span. Reassembled, that gives you the tree (see the sketch after this list).
- The OTel collector sits between your services and the backend. It batches, filters, redacts PII, samples — and decouples apps from the storage choice. Swap Tempo for Jaeger, or add a commercial APM as a second exporter, without touching service code.
- All services emit to the same collector. That’s the magic of OpenTelemetry as a standard — propagation, instrumentation, and wire format are uniform across runtimes.
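To make the first bullet concrete, here is a toy reassembly: group spans by `parent_span_id`, then walk down from the root. An illustration only, not any particular backend’s code:

```python
from collections import defaultdict

# Four exported spans from one trace, flattened (fields mirror the model above).
spans = [
    {"span_id": "a1", "parent_span_id": None, "name": "gateway"},
    {"span_id": "b2", "parent_span_id": "a1", "name": "checkout"},
    {"span_id": "c3", "parent_span_id": "b2", "name": "auth"},
    {"span_id": "d4", "parent_span_id": "b2", "name": "redis GET"},
]

children = defaultdict(list)
for s in spans:
    children[s["parent_span_id"]].append(s)

def print_tree(parent_id=None, depth=0):
    for s in children[parent_id]:
        print("  " * depth + s["name"])
        print_tree(s["span_id"], depth + 1)

print_tree()  # gateway, then checkout, then its two children
```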
OpenTelemetry, briefly
Before OpenTelemetry (OTel) merged OpenCensus and OpenTracing in 2019, instrumenting a polyglot system was painful: every vendor had its own SDK, propagation format, and semantic conventions. You’d commit to a vendor at the SDK layer and pay to leave.
OTel changed three things:
- Vendor-neutral SDKs in 12+ languages with stable APIs.
- OTLP — a single wire protocol for exporting traces, metrics, and logs.
- Semantic conventions — `http.method`, `db.system`, `messaging.destination` mean the same thing everywhere.
The practical effect: instrument once, route the data anywhere. That single decision is what made the rest of the landscape coherent.
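In code, “route the data anywhere” is a few lines of SDK setup. A sketch assuming a collector listening on the default OTLP/gRPC port 4317:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export OTLP to the local collector; where spans go next is collector config.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```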
The tools landscape
Roughly four layers, easy to mix and match because of OTel:
Instrumentation (inside your services)
- OpenTelemetry SDKs — auto-instrumentation for Java, .NET, Node, Python, Go covers most frameworks out of the box
- eBPF-based auto-instrumentation (Beyla, Pixie) — span capture without code changes; useful for legacy services
Collection / pipeline
- OpenTelemetry Collector — the default. Receivers, processors (filter, batch, sample), exporters
- Vector — log-first but extends to traces
- Vendor agents (Datadog Agent, New Relic Infra) when you’re committed to a vendor
Storage / query (open source)
- Grafana Tempo — object-storage-backed, cheap to retain everything
- Jaeger — the long-standing CNCF option; ClickHouse and OpenSearch backends are common now
- Zipkin — older, simpler, still in use
- ClickHouse + SigNoz — increasingly popular as a single store for traces, logs, and metrics
Storage / query (commercial)
- Datadog APM, New Relic, Honeycomb, Lightstep, Dynatrace, Splunk Observability
- Honeycomb’s high-cardinality query model is genuinely different from the others; worth a look if you do a lot of debugging-as-querying
The thing to internalize: with OTel as the SDK and OTLP as the wire format, the storage layer is no longer a one-way decision. You can run Tempo for retention and ship a sampled stream to a commercial backend for debugging, or migrate between vendors with a config change.
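The SDK-side version of that split is just two processors on one provider; more commonly you would fan out in the collector instead. Endpoints here are hypothetical:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Same spans, two destinations: cheap retention plus a debugging backend.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo-gateway:4317"))  # hypothetical
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://vendor-otlp:4317"))  # hypothetical
)
```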
What’s actually hard
A short list of recurring pain points:
- Sampling. You can’t store every span — at scale, you’d pay millions for storage you’ll never query. Head sampling (decide at the start of the trace) is cheap but may miss the interesting traces. Tail sampling (decide after the trace is complete, based on errors or latency) requires the collector to buffer spans until the trace can be assembled. Most production systems run a mix; a head-sampling sketch follows this list.
- Context propagation gaps. A queue, a lambda, a third-party SDK that doesn’t carry headers — and your trace tree splits. Audit your async paths; they’re where traces quietly die.
- Instrumentation drift. Auto-instrumentation covers HTTP, DB, queue clients. Custom in-process work — feature flags, ML inference, batch jobs — needs manual spans, and they decay if no one owns them.
- Cardinality explosions. Putting `user_id` or `request_id` on every span is great for debugging, expensive for storage, and often violates retention or PII policy. Decide what goes on the span vs. what stays in linked logs.
- Trace ≠ profile. A trace tells you which span was slow. To know why inside the process, you need continuous profiling (Pyroscope, Parca) — which OTel is now standardizing too.
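On the sampling point: head sampling is a one-line provider setting in the Python SDK. A sketch that keeps 10% of traces at the root, with `ParentBased` so child services honor the root’s decision and traces stay whole (tail sampling, by contrast, is collector-side):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces; inherit the decision when a parent context exists.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
```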
Where to start
Starting from zero on a real system:
- Pick OTel SDKs for your top two languages and turn on auto-instrumentation; a sketch follows this list. (One sprint per service, less if framework-supported.)
- Run an OTel collector — locally first — and point everything at it.
- Send to one backend. Tempo if you want cheap retention; a commercial one if you want a UI that pays for itself fast. Pick one — don’t shop.
- Verify propagation works across your message bus. This is where it usually breaks.
- Add manual spans only where you actually debug — async jobs, fan-outs, anything auto-instrumentation can’t see.
- Then think about sampling. Don’t pre-optimize before you know your trace volume.
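As one concrete version of the first step, assuming a Flask service that makes outbound calls with `requests` (the zero-code route is the `opentelemetry-instrument` wrapper, which patches the same libraries at startup):

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # inbound requests become server spans
RequestsInstrumentor().instrument()      # outbound calls get client spans + traceparent
```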
The fastest way to waste a year on tracing is to bikeshed the backend before you’ve solved instrumentation and propagation. The hard work is upstream of the storage choice.