Observability 101: metrics, logs, and traces explained

In a monolithic application, tracking down a bug was comparatively straightforward: log on to the server, open the text log file, and search for the stack trace. In modern distributed architectures, a single user request can trigger a chain of calls across dozens of microservices, serverless functions, database clusters, and external payment APIs, so no single log file tells the whole story. Observability is the practice of understanding the internal state of a system from its external outputs. To implement it, you need to collect and correlate three pillars: metrics, logs, and traces.

1. Metrics: High-Level Alerting (The "What")

Metrics are numeric values aggregated over time. They are computationally cheap to collect, query, and store, making them ideal for real-time alerting and high-level dashboards. When configuring metrics dashboards, prioritize the RED Method for web services:

Rate: The number of incoming requests per second.
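Errors: The number of failing requests per second (e.g., HTTP 5xx responses or failed database queries), often also tracked as a share of all requests.
Duration (Latency): The time it takes to process requests. Monitor p95 and p99 latencies rather than the average alone, because an average can look healthy while the slowest 1-5% of requests are badly delayed.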
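As a rough illustration of the three RED signals, the sketch below computes them from a batch of request records covering one time window. The `RequestRecord` type, the `red_summary` function, and the nearest-rank percentile method are assumptions made for this example; in practice a metrics system such as Prometheus typically derives these values from counters and histograms instead.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request record; field names are illustrative."""
    status: int          # HTTP status code returned to the caller
    duration_ms: float   # time spent handling the request


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list; p is an integer 1..100."""
    if not sorted_values:
        return None
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def red_summary(records, window_seconds):
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(records)
    errors = sum(1 for r in records if r.status >= 500)
    durations = sorted(r.duration_ms for r in records)
    return {
        "rate_per_s": total / window_seconds,
        "errors_per_s": errors / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Calling `red_summary(records, window_seconds=60)` on the last minute of traffic gives the values a RED dashboard panel would plot for that minute.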

2. Logs: Detailed Event Records (The "Why")

Logs are timestamped records of discrete events, written when specific operations occur. To make logs useful at scale:

Use Structured Logging (JSON): Format every log entry as a structured JSON object. This allows log aggregation systems (e.g., Elasticsearch, Grafana Loki, Datadog) to filter logs by fields such as tenant_id, user_id, or http_status instead of relying on free-text search.
Log Levels: Assign every entry an appropriate severity level (e.g., DEBUG, INFO, WARN, ERROR, FATAL) so that noise can be filtered out and alerts can be limited to the levels that matter. Never log secrets, and avoid logging personally identifiable information (PII).
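One way to produce structured entries like this is a custom formatter for Python's standard logging module. This is a minimal sketch: the `JsonFormatter` class, the `fields` key passed through `extra`, and the sample field values are assumptions for illustration, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumption: callers pass extra={"fields": {...}} to attach
        # searchable context such as tenant_id or http_status.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Identifiers only; no PII or secrets in the payload.
logger.error(
    "payment provider returned an error",
    extra={"fields": {"tenant_id": "t-42", "http_status": 502}},
)
```

Because every entry is one JSON object per line, an aggregation system can index `tenant_id` and `http_status` as fields rather than parsing them out of a message string.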

3. Traces: End-to-End Request Lifecycles (The "Where")

A trace tracks a request as it flows through your distributed services. A trace consists of multiple "spans," representing units of work (e.g., an API call, a database query, or a message queue operation) with start and end times. Traces are critical for identifying latency bottlenecks in microservices.
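To make the span model concrete, here is a small sketch of a trace as a list of spans, with a helper that finds where the time actually went. The `Span` fields, the `parent_id` link between spans, and the "self time" heuristic are assumptions for this example; it also assumes that the child spans of a given parent do not overlap in time. Real tracing systems such as OpenTelemetry define a richer span model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical minimal span: one unit of work within a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None for the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in a span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def bottleneck(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


trace = [
    Span("t1", "a", None, "POST /checkout", 0, 2300),
    Span("t1", "b", "a", "auth.verify", 10, 60),
    Span("t1", "c", "a", "db.transaction", 100, 2100),
    Span("t1", "d", "a", "payment.charge", 2110, 2290),
]
print(bottleneck(trace).name)  # prints "db.transaction"
```

The root span lasts 2300 ms, but almost all of that is waiting on its children; ranking by self time points at the database transaction rather than at the checkout endpoint as a whole.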

4. The Power of Correlation: A Real-World Walkthrough

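Metrics, logs, and traces viewed in isolation are not sufficient. You must connect them using a unique identifier called a **Trace ID**, which is generated when a request enters the system, propagated between services in HTTP headers (for example, the W3C Trace Context `traceparent` header), and written into every log entry produced while handling that request. Let's see how these three pillars work together during an incident: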

The Metric Alert: A Prometheus alert fires because the p99 latency of the checkout API has exceeded 2 seconds.
The Trace Investigation: The engineer opens the Grafana dashboard, selects the checkout latency graph, and views a slow trace sample. The trace shows a 2-second span for a database transaction.
The Log Correlation: The engineer clicks on the database span. The system retrieves all logs containing that specific Trace ID, revealing a database lock contention error.
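The mechanics behind that last step can be sketched in a few lines. The example below builds a header value in the W3C Trace Context `traceparent` layout (version, 32-hex-character trace ID, 16-hex-character parent span ID, flags), carries the same Trace ID into a downstream call, and then filters structured log entries by it. The function names and the `trace_id` log field are assumptions for illustration; a production service would normally rely on a tracing library such as OpenTelemetry rather than building headers by hand.

```python
import secrets


def new_traceparent():
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def trace_id_from(traceparent):
    """Extract the Trace ID from a traceparent header value."""
    _version, trace_id, _parent_id, _flags = traceparent.split("-")
    return trace_id


def child_traceparent(traceparent):
    """Header for an outgoing call: same Trace ID, new span ID."""
    version, trace_id, _parent_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def logs_for_trace(log_entries, trace_id):
    """Return the structured log entries that carry the given Trace ID."""
    return [e for e in log_entries if e.get("trace_id") == trace_id]


incoming = new_traceparent()
outgoing = child_traceparent(incoming)   # sent to the database service
tid = trace_id_from(outgoing)

logs = [
    {"trace_id": tid, "level": "ERROR", "message": "lock wait timeout"},
    {"trace_id": "unrelated", "level": "INFO", "message": "cache hit"},
]
for entry in logs_for_trace(logs, tid):
    print(entry["level"], entry["message"])
```

Because every service logs the Trace ID it received, a single lookup returns the relevant entries from all services involved in the slow request.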
