2026-05-10
Distributed tracing: the landscape
A trace tells you what happened to a single request across every service it touched, in time order, with the latency and status of each step. Metrics tell you how often something happened. Logs tell you what individual events looked like. Traces tell you the shape of a request.
That third axis is what makes tracing indispensable the moment your system grows past one process.
Three signals, one job
The “three pillars of observability” framing is overused but useful as a cheat sheet:
| Signal | Question it answers | Cardinality | Cost shape |
|---|---|---|---|
| Metrics | Is anything broken? How fast? | Low | Cheap, aggregated |
| Logs | What did this event look like? | High | Linear in event volume |
| Traces | Where did this request spend its time? | Highest | Linear in requests × spans per request |
The pitch for traces specifically: when a checkout takes 4.2 seconds and the customer gives up, metrics say “P99 spiked,” logs say “a hundred services fired warnings,” and the trace says “the auth call hit a cold cache and waited 3.8s on Redis.” Without a trace, you reconstruct that from logs and timestamps, badly.
The model: trace, span, context
A trace is a tree (sometimes a DAG) of spans. A span is one unit of work — a service handling a request, a DB query, a queue publish — with a start time, end time, status, and arbitrary attributes (user_id, http.route, db.statement, …).
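In the Python SDK this model maps almost one-to-one onto the API. A minimal sketch (the service name and attribute values are illustrative; the console exporter just prints finished spans):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# One-time setup: a provider that prints every finished span to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative name

# One unit of work: start/end times and status are recorded for you.
with tracer.start_as_current_span("db.query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT total FROM carts WHERE id = %s")
    # ... do the query; any span started in here becomes a child of db.query
```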
What makes it work end-to-end is context propagation: every outgoing call carries trace headers (W3C traceparent, tracestate) so the next service stitches its spans onto the same trace. If propagation breaks anywhere — usually at a queue, lambda, or third-party SDK that doesn’t forward headers — the trace splits, and the picture goes dark from that point.
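Concretely, propagation is an inject on the way out and an extract on the way in. A sketch with the Python SDK’s W3C propagator, using a plain dict to stand in for HTTP or message headers:

```python
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()

# Caller: inject the active span's context into the outgoing carrier.
# (Assumes a span is currently active; otherwise nothing is injected.)
headers = {}
propagator.inject(headers)
# headers now holds a W3C traceparent, e.g. "00-<trace_id>-<span_id>-01"

# Callee: extract the context so new spans attach to the same trace.
ctx = propagator.extract(headers)
tracer = trace.get_tracer("downstream-service")  # illustrative name
with tracer.start_as_current_span("handle-request", context=ctx):
    pass  # spans here share the caller's trace_id
```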
A request, traced
The diagram below is one checkout request through a typical service graph. Solid edges are the request path. Dashed edges are spans being exported to the OpenTelemetry collector, which forwards them to a backend.
A few things worth pointing out:
- The request path is the trace’s logical shape. Each hop is a span; each span’s `parent_span_id` points to the caller’s span. Reassembled, that gives you the tree (see the sketch after this list).
- The OTel collector sits between your services and the backend. It batches, filters, redacts PII, samples — and decouples apps from the storage choice. Swap Tempo for Jaeger, or add a commercial APM as a second exporter, without touching service code.
- All services emit to the same collector. That’s the magic of OpenTelemetry as a standard — propagation, instrumentation, and wire format are uniform across runtimes.
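To make the first bullet concrete, here is a toy reassembly: group spans by `parent_span_id`, then walk down from the root. An illustration only, not any particular backend’s code:

```python
from collections import defaultdict

# Four exported spans from one trace, flattened (fields mirror the model above).
spans = [
    {"span_id": "a1", "parent_span_id": None, "name": "gateway"},
    {"span_id": "b2", "parent_span_id": "a1", "name": "checkout"},
    {"span_id": "c3", "parent_span_id": "b2", "name": "auth"},
    {"span_id": "d4", "parent_span_id": "b2", "name": "redis GET"},
]

children = defaultdict(list)
for s in spans:
    children[s["parent_span_id"]].append(s)

def print_tree(parent_id=None, depth=0):
    for s in children[parent_id]:
        print("  " * depth + s["name"])
        print_tree(s["span_id"], depth + 1)

print_tree()  # gateway, then checkout, then its two children
```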
OpenTelemetry, briefly
Before OpenTelemetry (OTel) merged OpenCensus and OpenTracing in 2019, instrumenting a polyglot system was painful: every vendor had its own SDK, propagation format, and semantic conventions. You’d commit to a vendor at the SDK layer and pay to leave.
OTel changed three things:
- Vendor-neutral SDKs in 12+ languages with stable APIs.
- OTLP — a single wire protocol for exporting traces, metrics, and logs.
- Semantic conventions — `http.method`, `db.system`, `messaging.destination` mean the same thing everywhere.
The practical effect: instrument once, route the data anywhere. That single decision is what made the rest of the landscape coherent.
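In code, “route the data anywhere” is a few lines of SDK setup. A sketch assuming a collector listening on the default OTLP/gRPC port 4317:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export OTLP to the local collector; where spans go next is collector config.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```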
The tools landscape
Roughly four layers, easy to mix and match because of OTel:
Instrumentation (inside your services)
- OpenTelemetry SDKs — auto-instrumentation for Java, .NET, Node, Python, Go covers most frameworks out of the box
- eBPF-based auto-instrumentation (Beyla, Pixie) — span capture without code changes; useful for legacy services
Collection / pipeline
- OpenTelemetry Collector — the default. Receivers, processors (filter, batch, sample), exporters
- Vector — log-first but extends to traces
- Vendor agents (Datadog Agent, New Relic Infra) when you’re committed to a vendor
Storage / query (open source)
- Grafana Tempo — object-storage-backed, cheap to retain everything
- Jaeger — the long-standing CNCF option; ClickHouse and OpenSearch backends are common now
- Zipkin — older, simpler, still in use
- ClickHouse + SigNoz — increasingly popular as a single store for traces, logs, and metrics
Storage / query (commercial)
- Datadog APM, New Relic, Honeycomb, Lightstep, Dynatrace, Splunk Observability
- Honeycomb’s high-cardinality query model is genuinely different from the others; worth a look if you do a lot of debugging-as-querying
The thing to internalize: with OTel as the SDK and OTLP as the wire format, the storage layer is no longer a one-way decision. You can run Tempo for retention and ship a sampled stream to a commercial backend for debugging, or migrate between vendors with a config change.
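The SDK-side version of that split is just two processors on one provider; more commonly you would fan out in the collector instead. Endpoints here are hypothetical:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Same spans, two destinations: cheap retention plus a debugging backend.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo-gateway:4317"))  # hypothetical
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://vendor-otlp:4317"))  # hypothetical
)
```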
What’s actually hard
A short list of recurring pain points:
- Sampling. You can’t store every span — at scale, you’d pay millions for storage you’ll never query. Head sampling (decide at the start of the trace) is cheap but may miss the interesting traces. Tail sampling (decide after the trace is complete, based on errors or latency) requires the collector to buffer spans until the trace can be assembled. Most production systems run a mix; a head-sampling sketch follows this list.
- Context propagation gaps. A queue, a lambda, a third-party SDK that doesn’t carry headers — and your trace tree splits. Audit your async paths; they’re where traces quietly die.
- Instrumentation drift. Auto-instrumentation covers HTTP, DB, queue clients. Custom in-process work — feature flags, ML inference, batch jobs — needs manual spans, and they decay if no one owns them.
- Cardinality explosions. Putting `user_id` or `request_id` on every span is great for debugging, expensive for storage, and often violates retention or PII policy. Decide what goes on the span vs. what stays in linked logs.
- Trace ≠ profile. A trace tells you which span was slow. To know why inside the process, you need continuous profiling (Pyroscope, Parca) — which OTel is now standardizing too.
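On the sampling point: head sampling is a one-line provider setting in the Python SDK. A sketch that keeps 10% of traces at the root, with `ParentBased` so child services honor the root’s decision and traces stay whole (tail sampling, by contrast, is collector-side):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces; inherit the decision when a parent context exists.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
```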
Where to start
Starting from zero on a real system:
- Pick OTel SDKs for your top two languages and turn on auto-instrumentation; a sketch follows this list. (One sprint per service, less if framework-supported.)
- Run an OTel collector — locally first — and point everything at it.
- Send to one backend. Tempo if you want cheap retention; a commercial one if you want a UI that pays for itself fast. Pick one — don’t shop.
- Verify propagation works across your message bus. This is where it usually breaks.
- Add manual spans only where you actually debug — async jobs, fan-outs, anything auto-instrumentation can’t see.
- Then think about sampling. Don’t pre-optimize before you know your trace volume.
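As one concrete version of the first step, assuming a Flask service that makes outbound calls with `requests` (the zero-code route is the `opentelemetry-instrument` wrapper, which patches the same libraries at startup):

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # inbound requests become server spans
RequestsInstrumentor().instrument()      # outbound calls get client spans + traceparent
```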
The fastest way to waste a year on tracing is to bikeshed the backend before you’ve solved instrumentation and propagation. The hard work is upstream of the storage choice.