What Is Observability?
System Observability
Observability is the ability to understand a system's internal state from its external outputs: logs, metrics, and traces. Where monitoring asks "is it up?", observability aims to answer "why is it slow?" and "what went wrong?" in complex distributed systems.
How Observability Works
Observability rests on three pillars. Logs are timestamped events that record what happened. Metrics are numerical measurements over time, such as request rate, error rate, and latency. Traces follow a request as it flows across services and show where it slowed down. Common tools include Datadog, Grafana, and Honeycomb, with OpenTelemetry serving as an open standard for instrumentation.
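To make the metrics pillar concrete, here is a minimal sketch of how raw request records might be reduced to request rate, error rate, and a latency percentile. The `Request` record, the `summarize` function, and the sample data are all hypothetical and only for illustration. A real system would normally rely on a metrics library or backend rather than computing these by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float   # seconds since the start of the window (illustrative)
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def summarize(requests, window_seconds):
    """Reduce raw request records to three common metrics."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p95: the smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids floating-point rounding.
    rank = (95 * total + 99) // 100
    return {
        "request_rate_per_s": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "p95_latency_ms": latencies[rank - 1] if total else None,
    }


# Made-up sample: 20 requests in a 60-second window, every tenth one fails.
sample = [Request(i, 500 if i % 10 == 0 else 200, 20.0 + i) for i in range(20)]
summary = summarize(sample, window_seconds=60)
print(summary)
```

The percentile is reported instead of the average because a few very slow requests can hide behind a healthy-looking mean.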
Key Concepts
- Logs: Timestamped records of discrete events, such as errors, requests, and state changes
- Metrics: Numerical values tracked over time, such as CPU usage, request latency, and error rate
- Traces: End-to-end tracking of a request across microservices, which helps pinpoint the service that is slow or failing
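The trace concept above can be sketched as a set of spans that share a trace ID and point to their parent span. The `Span` fields, the service names, and the timings below are invented for illustration and do not follow any particular tracing library. The sketch also assumes that child spans run one after another. With concurrent children the self-time arithmetic would need to account for overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_by_span(spans):
    """Time spent in each span, excluding time spent in its direct children."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def slowest_service(spans):
    """Return (service, self_time_ms) for the span with the largest self time."""
    self_times = self_time_by_span(spans)
    by_id = {s.span_id: s for s in spans}
    worst = max(self_times, key=self_times.get)
    return by_id[worst].service, self_times[worst]


# One hypothetical request passing through four services.
trace = [
    Span("t1", "a", None, "api-gateway", 0, 250),
    Span("t1", "b", "a", "checkout", 10, 240),
    Span("t1", "c", "b", "inventory", 20, 50),
    Span("t1", "d", "b", "payments", 60, 230),
]
print(slowest_service(trace))
```

Looking at self time rather than total duration keeps the root span from always appearing to be the slowest, since its duration includes every downstream call.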
Frequently Asked Questions
What is the difference between monitoring and observability?
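Monitoring tells you when something is wrong, typically through alerts on thresholds. Observability helps you understand why it is wrong by letting you explore logs, metrics, and traces to find root causes. The two are complementary: the same telemetry that supports investigation also feeds the alerts that monitoring depends on.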
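The distinction can be illustrated with a small sketch. A threshold check stands in for monitoring, and grouping errors by different fields stands in for the exploratory side of observability. The event fields (`route`, `region`, `version`), the 5% threshold, and the generated data are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical request events: every error comes from version "v2".
events = []
for i in range(100):
    version = "v2" if i % 4 == 0 else "v1"
    events.append({
        "route": "/checkout",
        "region": "eu-west" if i % 3 == 0 else "us-east",
        "version": version,
        "status": 500 if i % 8 == 0 else 200,
    })


def error_rate(events):
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) if events else 0.0


def should_alert(events, threshold=0.05):
    """Monitoring: a fixed threshold says that something is wrong."""
    return error_rate(events) > threshold


def errors_by(events, field):
    """Observability: slice errors by a field to look for the cause."""
    counts = Counter(e[field] for e in events if e["status"] >= 500)
    return counts.most_common()


print(should_alert(events))
print(errors_by(events, "region"))
print(errors_by(events, "version"))
```

In this made-up data the errors are spread across both regions but come from a single version, which is the kind of pattern that a threshold alert alone would not reveal.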
What observability tools should I use?
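Use OpenTelemetry, an open standard, for instrumentation. Prometheus with Grafana is a common pairing for metrics, with Prometheus collecting and storing the data and Grafana displaying it. Datadog and Honeycomb are options for full-stack observability. If you are just getting started, begin with structured logging and basic metrics.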
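As a starting point, structured logging and a basic counter can be sketched with only the Python standard library. The logger name `shop`, the `fields` key passed through `extra`, and the `record_request` helper are hypothetical choices for this example. A production setup would more likely use OpenTelemetry or a vendor SDK for the same purpose.

```python
import io
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("shop")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# A basic in-process metric: request count per (route, status).
request_counts = Counter()


def record_request(logger, route, status, latency_ms):
    request_counts[(route, status)] += 1
    logger.info(
        "request handled",
        extra={"fields": {"route": route, "status": status, "latency_ms": latency_ms}},
    )


stream = io.StringIO()  # stands in for stdout or a log file
log = build_logger(stream)
record_request(log, "/checkout", 200, 42.5)
record_request(log, "/checkout", 500, 310.0)
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

Because every line is valid JSON with consistent keys, a log backend can filter and aggregate on fields such as `route` or `status` without parsing free-form text.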