2026-05-10

Datadog: the observability platform tour

Datadog is the SaaS observability platform that started in 2010 as infrastructure monitoring for cloud workloads and has since grown into twenty-odd products competing across every category of observability and security. It’s the default answer for “give me a unified UI for everything happening across our cloud environment, and don’t make me think about scaling the backend.” It’s also the platform whose monthly bill is the recurring conversation in every engineering all-hands once a company scales past a few hundred hosts.

This post is what Datadog actually contains in 2026, how the architecture works, where the value compounds, and the cost dynamics worth understanding before you commit.

The position

Datadog’s value proposition has two parts:

  1. Unification. Metrics, logs, traces, RUM, security findings, infrastructure inventory — all in one UI, all correlated automatically (the same host tag links a metric, a log line, a trace, and a security alert). The alternative is stitching together Prometheus + Loki + Tempo + Falco + a SIEM, and being responsible for that integration yourself.
  2. Managed. No clusters to scale, no retention quotas to operate, no per-team SREs maintaining the observability stack. The Datadog org has done that work; you pay them to keep doing it.

The trade-off is paid in dollars and lock-in. Datadog’s pricing scales with usage in a way that surprises every customer at least once. And while getting data in is easy (open standards, OpenTelemetry support), the cross-product correlation is the part that’s hard to migrate away from.

The architecture

[Architecture diagram: mini map of the components described below]

Reading the diagram:

  • Datadog Agent runs on every host you want to monitor. It’s an open-source Go process that ships metrics, logs, traces, and process info to the Datadog SaaS intake. Configured via YAML; supports 800+ integrations (one per common piece of software it knows how to scrape).
  • Cluster Agent on Kubernetes — a separate process that talks to the K8s API and provides cluster-level metadata, leader election for cluster checks, and admission webhooks for auto-instrumentation.
  • APM tracer is library code inside your application, in whatever language you’re using (Python, Go, Java, Node, Ruby, .NET, etc.). It generates distributed traces, profiles, and live debugger snapshots, and ships them to the local Agent over HTTP (port 8126 by default) or a Unix domain socket.
  • Browser RUM and Synthetics are different — the RUM SDK runs in users’ browsers; Synthetics tests are scheduled HTTP checks from Datadog’s edge network.
  • Datadog intake — Datadog’s SaaS regions (US1, US3, US5, EU1, AP1, GovCloud). One of these is your “site” and all your data flows there.
  • The products — Metrics, Logs, APM, Security — sit on top of Datadog’s internal storage. Same backend, different query and visualization layers per product.

The green dashed edges show all data flowing outbound from your infrastructure to Datadog’s intake. You don’t expose anything inbound; the Agent initiates all connections. (You can also run Agents in a “private link” mode for VPC-internal egress.)
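The Agent’s local intake is easy to exercise directly: DogStatsD, the Agent’s metrics entry point, listens on UDP port 8125 and accepts a simple text protocol (`metric.name:value|type|#tags`). A minimal sketch using only the standard library — the metric name and tags are invented, and because UDP is fire-and-forget, the datagram is silently dropped if no Agent is listening:

```python
import socket

def dogstatsd_packet(name, value, metric_type="c", tags=None):
    """Build a DogStatsD-format datagram: metric.name:value|type|#tag1:v1,tag2:v2"""
    packet = f"{name}:{value}|{metric_type}"
    if tags:
        packet += "|#" + ",".join(tags)
    return packet

def send_metric(name, value, metric_type="c", tags=None,
                host="127.0.0.1", port=8125):
    # Fire-and-forget UDP: succeeds whether or not an Agent is running locally.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(dogstatsd_packet(name, value, metric_type, tags).encode(),
                    (host, port))
    finally:
        sock.close()

send_metric("checkout.orders", 1, "c", tags=["env:prod", "service:checkout"])
```

The official client libraries (and the tracer itself) speak this same protocol to the local Agent, which batches and forwards to the intake.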

The product catalog

Datadog has a lot of products. The major lines:

| Product | What it does |
| --- | --- |
| Infrastructure Monitoring | Host metrics, integration metrics, dashboards, monitors. The original product. |
| APM | Distributed tracing in your application code. Service maps, trace search, exception tracking. |
| Logs | Log ingestion, indexing, archival to S3. Live tail. Sensitive data scanner. |
| Continuous Profiler | Always-on production profiling (CPU, memory, lock contention). |
| RUM (Real User Monitoring) | Frontend performance — page loads, route changes, errors, user sessions. |
| Synthetics | Scheduled API and browser-based checks from Datadog’s edge. |
| Network Performance Monitoring | Flow data, DNS, TCP retransmits, host-to-host throughput. |
| Database Monitoring | Query-level analysis for Postgres, MySQL, SQL Server, MongoDB. |
| Serverless | Specific instrumentation for Lambda, GCF, Azure Functions. |
| Cloud SIEM | Security signal correlation over logs. |
| Cloud Security Posture Management (CSPM) | Cloud account / Kubernetes misconfiguration scanning. |
| Cloud Workload Security (CWPP) | Runtime workload security; competes with Falco/Sysdig in the K8s slot. |
| Application Security Management (ASM) | In-app threat detection via the APM tracer (RASP-style). |
| Cloud Cost Management | Cost analytics tied to your observability data. |
| CI Visibility | CI pipeline performance and flaky test detection. |
| Test Visibility | Per-test execution time, flakiness, ownership. |
| LLM Observability | New in 2024 — token usage, latency, prompt/completion tracking per LLM call. |
| Workflow Automation (Datadog Workflows) | No-code response automation triggered by signals. |

Roughly the first six are the core observability stack; the rest are adjacent products that share the Agent and the data model. The pricing comes in modules — you pay per product you turn on.

What makes the platform feel different in practice

Three things explain why teams pick it despite the cost:

  • Tag-based unification. Every metric, log, trace, profile, and security event gets the same tags (env, service, version, host, kube_namespace, etc.). One click on a tag in any product filters every other product to the same scope. This is what “unification” actually means — and it’s the part you’d spend a quarter building yourself with OpenTelemetry + Grafana.
  • Watchdog (anomaly detection). Datadog runs ML over your metrics to surface “this is anomalous” without you defining a threshold. Sometimes it’s noise. Often enough, it surfaces a real degradation 20 minutes before any monitor would have fired.
  • Live debugger and Continuous Profiler. Production profiling, always on, attributable to a specific request via trace correlation, with near-zero overhead. You can answer “why is this request slow?” in seconds.
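The tag-based pivot is easier to see than to describe. A toy illustration — the record shapes and values here are invented, not Datadog’s internal model — of how one shared tag vocabulary lets a single filter scope every product at once:

```python
# Hypothetical records from three different products, all carrying the same tag keys.
records = [
    {"product": "metric", "name": "cpu.user", "tags": {"env": "prod", "service": "checkout", "host": "i-abc"}},
    {"product": "log", "message": "timeout calling payments", "tags": {"env": "prod", "service": "checkout", "host": "i-abc"}},
    {"product": "trace", "resource": "POST /pay", "tags": {"env": "prod", "service": "payments", "host": "i-def"}},
]

def pivot(records, **scope):
    """One 'click' on a tag: keep records whose tags match every key in scope."""
    return [r for r in records if all(r["tags"].get(k) == v for k, v in scope.items())]

# Clicking service:checkout in any product scopes every other product the same way.
print([r["product"] for r in pivot(records, service="checkout")])  # ['metric', 'log']
```

The point is that the join key is the tag set itself — no per-product query language needed to correlate a metric spike with the log lines from the same service.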

Cost dynamics worth understanding upfront

Datadog’s pricing model is the source of most customer-Datadog friction. Three patterns to internalize:

  • Hosts are billed on a high-water mark. Datadog counts hosts hourly and bills the month at the 99th percentile of those hourly counts, so a brief spike mostly washes out — but a fleet that runs at 200 hosts every weekday business hour is billed as a 200-host fleet. Containers are billed differently — by container density or by host depending on the SKU.
  • Custom metrics are the runaway cost. Datadog bills per unique tag combination per metric. A metric with user_id as a tag and 10 million users is 10 million custom metrics. Avoid high-cardinality tags on metrics. Use logs or traces for high-cardinality data.
  • Indexed log volume bills by GB, not by event count. Verbose JSON logs with stack traces are expensive. Use the Logs Pipelines to drop noisy fields, sample high-volume ones, and archive everything to S3 instead of indexing.

The structural advice: set up Datadog with cost discipline from day one. A team that adds Datadog without limits will see a 5-10× bigger bill than one that’s deliberate about cardinality, log volume, and APM trace sampling.
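The tag-combination arithmetic is worth making concrete. A billable custom-metric series is one unique combination of tag values, so per-tag distinct counts multiply. A sketch of a back-of-envelope estimator (the tag names and counts are hypothetical):

```python
from math import prod

def estimate_series(tag_value_counts):
    """Upper bound on unique series for one metric:
    the product of distinct values across its tags."""
    return prod(tag_value_counts.values())

safe = {"env": 3, "service": 40, "region": 5}
risky = {"env": 3, "service": 40, "user_id": 10_000_000}

print(estimate_series(safe))   # 600 series: fine
print(estimate_series(risky))  # 1200000000 series: bill shock
```

This is why the rule of thumb is “bounded-cardinality tags on metrics, unbounded ones on logs and traces” — the multiplication is merciless.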

Where it sits in the landscape

| Competitor | What it does well | What Datadog does better |
| --- | --- | --- |
| New Relic | APM heritage, pricing model now per-user | Datadog has wider product breadth and tag unification |
| Dynatrace | OneAgent auto-instrumentation, AI-driven RCA | Datadog has cleaner UX and faster iteration |
| Splunk Observability | Strong APM (SignalFx heritage), fits Splunk shops | Datadog is more integrated for cloud-native stacks |
| Grafana Cloud / LGTM | Open-source-compatible (Prometheus, Loki, Tempo, Mimir) | Datadog handles correlation automatically; LGTM gives you the building blocks to assemble |
| Honeycomb | High-cardinality observability, very different query model | Honeycomb is the right answer for debugging-as-querying; Datadog for everything else |
| Elastic Observability | Open data formats, self-hostable | Elastic is more flexible but more operational work |
| AWS CloudWatch / GCP Operations / Azure Monitor | Built in, cheap baseline | Native is fine for baseline; Datadog is what you turn to when you outgrow it |

The natural lane: cloud-native organizations of 50–5000 engineers that want unified observability without owning the observability platform. Below 50 engineers, you can usually get by with the cloud provider’s native tools plus Sentry. Above 5000, the bill becomes large enough that Grafana LGTM with dedicated platform engineering may be cheaper.

Limitations and pitfalls

  • Lock-in via correlation. Migrating away from Datadog is straightforward for any single product (metrics, logs, traces — all open standards). Migrating away from the correlation across products is the hard part. Plan for this if vendor independence is a long-term goal.
  • Tag explosion. Custom metrics are billed by tag combination, so one Slack thread asking “can we add customer_id as a tag?” is how you get bill shock at month-end. Establish tag governance early.
  • Logs indexed without retention discipline. Default index retention is 15 days; default behavior is “index everything.” A noisy service can index terabytes per month. Use exclusion filters and archive-only routing for logs you don’t actively query.
  • APM trace sampling defaults. The default samples 100% of traces, which sounds great until your bill arrives. Set service-level trace sampling rates intentionally.
  • Agent CPU / memory. Agents are generally lightweight, but Cluster Agent in large K8s clusters and Continuous Profiler can be non-trivial. Monitor the monitor.
  • Status page reliability. Datadog has had multi-region outages. Your monitoring should not be your only signal of an incident, especially when the incident is Datadog itself.
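On the trace-sampling point: the key property of a sane sampling policy is that every span of a trace, on every host, reaches the same keep/drop decision. A sketch of the standard trick — hash the trace ID rather than roll a random number — with hypothetical per-service rates (the real knobs live in the tracer and ingestion-control configuration, not in application code like this):

```python
import hashlib

# Hypothetical per-service head-sampling rates.
RATES = {"checkout": 1.0, "healthchecks": 0.0, "search": 0.1}

def keep_trace(service: str, trace_id: int, default: float = 0.2) -> bool:
    """Deterministic sampling: hash the trace ID into a bucket and compare
    to the service's rate. Hashing (not random()) means every participant
    in the same trace agrees on the decision."""
    rate = RATES.get(service, default)
    bucket = int(hashlib.sha256(str(trace_id).encode()).hexdigest(), 16) % 10_000
    return bucket / 10_000 < rate

print(keep_trace("checkout", 12345))      # True: rate 1.0 keeps everything
print(keep_trace("healthchecks", 12345))  # False: rate 0.0 drops everything
```

Whatever mechanism you use, the decision should happen at the head of the trace so the whole request is either fully kept or fully dropped.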

Where to start

  1. Sign up for a Datadog account (free tier covers a few hosts, 5 users, 1-day retention). Pick a region (US1 / EU1 / etc.) — you can’t migrate regions later.
  2. Install the Agent on one host. Watch the Infrastructure view fill in. This validates the rails.
  3. Add the Agent to your biggest server fleet or your Kubernetes cluster. Don’t add APM yet; metrics and logs alone give you most of the immediate value.
  4. Turn on APM for one service. Wire up the tracer in code, deploy, watch the service map populate. Validate before fanning out.
  5. Set cardinality discipline early. Establish your standard tags (env, service, version, team, region). Document what’s not allowed as a tag on metrics.
  6. Configure log exclusion filters and trace sampling rates before going production-wide. These are the cost knobs.
  7. Add additional products (RUM, Synthetics, Cloud Security) when you have a specific use case. Don’t turn them all on at once.
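Step 6’s cost knobs can be reasoned about as a small pipeline of rules. A toy model of log exclusion filters — the patterns and fractions here are invented, and the real filters are configured per-index in the Logs UI rather than in code — showing the two moves that matter: drop entirely, or index a sample:

```python
import random

# Each rule: substring match on the log line plus the fraction to index.
# 0.0 means exclude entirely; the first matching rule wins.
FILTERS = [
    ("GET /healthz", 0.0),   # drop health-check noise outright
    ("DEBUG", 0.05),         # index 5% of debug lines, archive the rest
]

def should_index(line, rng=random.random):
    for pattern, keep_fraction in FILTERS:
        if pattern in line:
            return rng() < keep_fraction
    return True  # no filter matched: index it

print(should_index("INFO payment accepted"))  # True: unmatched lines are indexed
print(should_index("GET /healthz 200"))       # False: excluded outright
```

Pairing filters like these with archive-only routing (everything still lands in S3) keeps the data recoverable while the indexed — and billed — volume stays bounded.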

The mistake to avoid: enabling every product without cost discipline because “we’re already paying for Datadog.” Every product is a separate billing line; turning on Logs + APM + Continuous Profiler + RUM + Security on day one without sampling and retention discipline is how Datadog bills become a quarterly board topic. Adopt deliberately.