Why Your Metrics Don't Match Your Logs

One says 4 errors. The other says 15. Both are 'correct.'

Metrics and logs routinely disagree about error rates, request counts, and latency. The divergence isn't noise -- it's structural differences in what they measure, when, and how they sample.

You are debugging a production incident. Metrics show 200 requests per second with 2% error rate. Logs show 180 successful requests and 15 errors in the same time window.

The numbers do not match.

One system reports 4 errors (2% of 200). The other reports 15. Both are measuring the same application at the same time. They cannot both be correct.

This is not a rare occurrence. Metrics and logs routinely disagree about basic facts: request counts, error rates, latency percentiles, resource utilization. The divergence is not noise. It is structural.

They Measure Different Things

Metrics and logs appear to measure the same events. They do not.

Metrics count state transitions. Logs record events. A state transition and an event are not the same thing.

When a request completes, the application increments a counter. This is a metric. It also writes a log line containing the request details. This is a log entry.

If the log write fails, the counter still increments. If the counter increment fails, the log line still writes. They are independent operations with independent failure modes.

A disk full condition may block log writes while metrics continue to increment in memory. A metrics agent crash may lose counter state while logs continue to accumulate. Network partition may prevent metrics export while local logs buffer successfully.

The two systems observe different slices of application behavior. They will report different numbers.
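
The independence of the two write paths can be sketched in a few lines. This is a minimal toy model, not any real library: the counter and the log buffer are stand-ins, and the failure flags are illustrative.

```python
# Sketch (not a real metrics/logging library): one handler updates a
# metric and writes a log line as two independent operations.
errors_total = 0     # in-memory counter (metrics path)
log_lines = []       # stand-in for a log file (logging path)

def handle_request(failed, disk_full=False, agent_up=True):
    """Record one request in both systems; either side can fail alone."""
    global errors_total
    if failed:
        if agent_up:                 # a crashed metrics agent loses increments
            errors_total += 1
        if not disk_full:            # a full disk blocks log writes
            log_lines.append("ERROR request failed")

# Ten errors while the disk is full: metrics see them, logs do not.
for _ in range(10):
    handle_request(failed=True, disk_full=True)

print(errors_total, len(log_lines))  # 10 0
```

Flip the flags (`agent_up=False`, `disk_full=False`) and the asymmetry reverses: logs record events the counter never saw.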

Time Is Not Synchronized

Metrics are timestamped when collected. Logs are timestamped when written.

These timestamps are not the same.

A metric representing requests in the last minute is collected at 14:32:00. It aggregates counters incremented between 14:31:00 and 14:32:00. The timestamps on those increments are whenever the requests completed.

A log entry is written at 14:31:47 when the request handler completes. The log line includes a timestamp field set by the application. The log forwarder reads the file at 14:32:15 and sends it to the aggregator. The aggregator receives it at 14:32:18 and indexes it under the application-provided timestamp.

Which timestamp is used for querying? The write timestamp? The send timestamp? The receive timestamp? The application timestamp?

Different systems use different timestamps. A query for “errors between 14:31:00 and 14:32:00” will return different results depending on which timestamp field is indexed.

This is before considering clock skew. Different servers have different system times. NTP synchronization is approximate. A server that believes it is 14:31:55 may be at 14:32:03 according to the metrics collector’s clock.

Metrics and logs are not just measuring different things. They are measuring them at different times according to different clocks.
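
The pipeline above can be made concrete with the timestamps from the example. The field names here are illustrative; the point is that the same window query gives different answers depending on which field is indexed.

```python
from datetime import datetime

# Hypothetical record: one log entry carries several timestamps.
entry = {
    "app_ts":     datetime(2024, 1, 1, 14, 31, 47),  # set by the application
    "read_ts":    datetime(2024, 1, 1, 14, 32, 15),  # forwarder reads the file
    "receive_ts": datetime(2024, 1, 1, 14, 32, 18),  # aggregator receives it
}

window = (datetime(2024, 1, 1, 14, 31), datetime(2024, 1, 1, 14, 32))

def in_window(ts):
    """Is this timestamp inside the 14:31:00-14:32:00 query window?"""
    return window[0] <= ts < window[1]

# The same query over different timestamp fields disagrees.
print(in_window(entry["app_ts"]))      # True
print(in_window(entry["receive_ts"]))  # False
```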

Aggregation Windows Do Not Align

Metrics aggregate over fixed time windows. Logs are indexed by event time. These windows do not align.

A Prometheus scrape happens every 15 seconds. The counter is cumulative: each scrape reads the total since process start, and rates are derived from the delta between consecutive scrapes. That delta is not “requests in the last 15 seconds.” It is “requests between whenever the last two scrapes actually happened.”

If a scrape is delayed by 2 seconds due to network latency, that delta covers 17 seconds, not 15, and the next one may cover only 13. A per-second rate computed against the nominal interval wobbles even when traffic is perfectly steady.
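
A small worked example, with made-up scrape times and a perfectly steady 100 requests per second, shows how scrape jitter distorts the apparent rate:

```python
# Sketch: a cumulative counter, scraped at jittery intervals, under a
# perfectly steady 100 requests/second of traffic.
scrapes = [            # (scrape time in seconds, cumulative counter value)
    (0, 0),
    (15, 1500),        # on time
    (32, 3200),        # 2 s late: this delta covers 17 s
    (45, 4500),        # and this one only 13 s
]

deltas = [(t1 - t0, c1 - c0)
          for (t0, c0), (t1, c1) in zip(scrapes, scrapes[1:])]

# Dividing by the actual elapsed time recovers the true rate...
actual = [d / dt for dt, d in deltas]
# ...but assuming the nominal 15 s window misreports steady traffic.
nominal = [d / 15 for dt, d in deltas]

print(actual)   # [100.0, 100.0, 100.0]
print(nominal)  # [100.0, 113.3..., 86.6...]
```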

Logs do not have this problem because they are not aggregated at collection time. But they have a different problem: they are aggregated at query time.

A query for “errors in the last minute” scans log entries with timestamps in that range. This is not the same as “errors that occurred in the last minute.” It is “errors that were logged with a timestamp in the last minute.”

If the application’s clock is 30 seconds fast, logs from 90 seconds ago appear to be from 60 seconds ago. If the log forwarder is buffering due to backpressure, logs from 5 minutes ago are being indexed now but carry old timestamps.

Metrics aggregate at collection time with non-deterministic windows. Logs aggregate at query time with non-deterministic timestamps. The results will diverge.
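
The clock-skew case can be worked through directly. The numbers below mirror the example in the text: an application clock running 30 seconds fast makes a 90-second-old error look 60 seconds old.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 14, 32, 0)
skew = timedelta(seconds=30)   # the application's clock runs 30 s fast

# An error that actually occurred 90 s ago, stamped by the fast clock.
actual_time = now - timedelta(seconds=90)
logged_time = actual_time + skew

def in_last_minute(ts):
    """Would a 'last minute' query at `now` match this timestamp?"""
    return now - timedelta(seconds=60) <= ts <= now

print(in_last_minute(actual_time))  # False: it really happened 90 s ago
print(in_last_minute(logged_time))  # True: but the log says 60 s ago
```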

Sampling Is Inconsistent

High-throughput systems cannot afford to record every event. They sample.

Metrics sample differently than logs. A metrics agent may sample 1% of requests and multiply by 100 to estimate total volume. A logging pipeline may keep 5% of requests to reduce storage costs and report raw, unscaled counts.

If the sampling decisions are independent, the numbers cannot be reconciled without knowing both rates: the raw log count understates true volume by 20x, and even if both sides scale their estimates back up, two independent samples of the same traffic rarely agree exactly.

Sampling is also non-uniform. Applications sample based on local heuristics: log the first error of each type, then sample 1% of subsequent instances. This makes sense for reducing log spam. It destroys accurate counting.

Metrics may use reservoir sampling to maintain statistical properties. Logs may use hash-based sampling to ensure the same request ID is always logged or never logged. These sampling strategies produce different distributions.

When you query metrics, you get an estimate derived from one sampling strategy. When you query logs, you get an estimate derived from a different sampling strategy. The estimates do not agree.
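
A sketch of two such strategies side by side. The rates, IDs, and the choice of MD5 for hash-based sampling are all illustrative, not any particular system's implementation:

```python
import hashlib

N = 100_000  # true number of requests
request_ids = [f"req-{i}" for i in range(N)]

# Metrics path: keep every 100th request, scale the count back up by 100.
metric_sample = request_ids[::100]
metric_estimate = len(metric_sample) * 100

# Logging path: hash-based 5% sample keyed on request ID (so a given
# request is always logged or never logged), reported as a raw count.
def keep(rid):
    return int(hashlib.md5(rid.encode()).hexdigest(), 16) % 100 < 5

log_count = sum(1 for rid in request_ids if keep(rid))

print(metric_estimate)  # 100000
print(log_count)        # roughly 5000: a ~20x gap in the raw numbers
```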

Cardinality Limits Cause Dropped Data

Metrics systems impose cardinality limits. Too many unique label combinations, and metrics are dropped.

A metric labeled by user_id works fine in development with 10 users. It breaks in production with 10 million users. The metrics backend refuses to ingest that many unique time series. Silently or loudly, data is lost.

Logs do not have cardinality limits in the same way. Each log line can contain arbitrary fields. A user_id field in logs does not create a new index entry for every unique user.

This means logs contain data that metrics cannot represent. When you query both systems, logs show activity from millions of users. Metrics show activity from whatever subset fit within the cardinality limit.

The divergence is not random. It is systematic. Series belonging to active, popular users are created early and refreshed often, so they are more likely to survive the cardinality limit. Rare users are more likely to be dropped. Metrics show a biased view of system behavior.
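
A sketch of one common limit policy (first come, first served up to a series cap) with illustrative numbers makes the asymmetry visible:

```python
# Sketch: a metrics backend that accepts at most MAX_SERIES unique
# label sets and silently drops the rest.
MAX_SERIES = 1000
series = {}   # user_id -> request count (one time series per user)

def record(user_id):
    if user_id in series:
        series[user_id] += 1
    elif len(series) < MAX_SERIES:
        series[user_id] = 1
    # else: over the cap -- silently dropped

logs = []
for i in range(10_000):          # 10,000 distinct users, one request each
    uid = f"user-{i}"
    record(uid)                  # metrics path, subject to the cap
    logs.append({"user_id": uid})  # logging path, no cardinality limit

print(sum(series.values()))  # 1000: metrics kept only the first 1000 users
print(len(logs))             # 10000: logs kept every one
```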

Buffering Creates Temporal Distortion

Both metrics and logs buffer data before export. Buffering introduces temporal distortion.

An application writes logs to a local file. A log forwarder reads the file every 10 seconds and sends batches to the aggregator. The aggregator processes batches in arrival order, not timestamp order.

If the forwarder is CPU-bound, the queue grows. Logs written at 14:30:00 are not sent until 14:35:00. They are indexed under their original timestamp, but they were not observable until 5 minutes later.

During those 5 minutes, querying logs shows no errors. Querying metrics shows errors. After the buffer flushes, logs suddenly show errors that “happened” 5 minutes ago.

This is worse during incidents. High error rates generate more log volume. More log volume increases buffering delay. Increased delay makes logs less useful for real-time debugging. Metrics remain current because they are aggregated before export, reducing data volume.

But metrics lose detail. Logs retain individual error messages. The trade-off is latency versus fidelity.
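
The visibility gap during buffering can be modeled with a query that respects both the event timestamp and the moment the entry became queryable. The five-minute backlog below is the example from the text; the index structure is a toy.

```python
from datetime import datetime, timedelta

# Sketch: a backlogged forwarder ships a log entry 5 minutes after it
# was written; the aggregator indexes it under the original timestamp.
written_at = datetime(2024, 1, 1, 14, 30, 0)
indexed_at = written_at + timedelta(minutes=5)   # forwarder backlog

index = [(written_at, indexed_at)]  # (event timestamp, time it became queryable)

def query(window_start, window_end, as_of):
    """Count events in a time window, as seen at query time `as_of`."""
    return sum(1 for ts, visible in index
               if window_start <= ts < window_end and visible <= as_of)

window = (written_at, written_at + timedelta(minutes=1))

# Querying during the incident: the error is invisible.
print(query(*window, as_of=written_at + timedelta(minutes=2)))  # 0
# Querying after the buffer flushes: it "appears" in the past.
print(query(*window, as_of=written_at + timedelta(minutes=6)))  # 1
```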

Aggregation Functions Are Not Reversible

Metrics pre-aggregate data. Logs store raw events. Aggregation is not reversible.

A counter increments by 1 for each request. In a delta-style pipeline (statsd, for example), the counter is flushed every minute and reset. The metrics system stores the per-minute count. The raw increments are lost.

If you query for total requests over 10 minutes, you sum 10 per-minute values. If 3 of those flushes failed due to network issues, those minutes show zero requests. The sum is wrong by 3 minutes of traffic.
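
The arithmetic is simple but worth seeing. Assuming steady traffic of 120 requests per minute and three lost flushes:

```python
# Sketch: per-minute delta counts, with three failed flushes read as 0.
true_per_minute = [120] * 10                 # steady traffic, 10 minutes
received = list(true_per_minute)
for failed_minute in (3, 4, 5):              # three flushes lost in transit
    received[failed_minute] = 0              # those minutes show zero

print(sum(true_per_minute))  # 1200: what actually happened
print(sum(received))         # 840: short by 3 minutes of traffic
```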

Logs do not have this problem. If log forwarding fails for 3 minutes, those logs are buffered locally and forwarded later. The query eventually returns correct results, once the buffer flushes.

But logs have the opposite problem. If a log line is malformed and rejected by the indexer, it is gone. The metrics counter still incremented. Metrics show the request. Logs do not.

Neither system is wrong. They are measuring different aspects of system behavior with different durability guarantees.

Histograms Lie About Percentiles

Latency metrics are usually histograms. Histograms bucket observations into ranges and count how many fall into each bucket.

A histogram with buckets [0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+] cannot tell you the true p99 latency. It can only tell you which bucket the p99 falls into.

If p99 is in the 100-500ms bucket, the actual value could be anywhere from 100ms to 499ms. The metrics system interpolates and reports an estimate. The estimate is wrong.

Logs contain exact latency values. Querying logs for p99 returns the actual 99th percentile value, not an interpolated estimate.

For low-volume systems, this divergence is small. For high-volume systems, the interpolation error compounds. Histogram-based p99 may report 150ms while log-based p99 reports 320ms. Both are measuring the same requests.
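
A sketch of bucket-based quantile estimation against exact values shows where the error comes from: interpolation assumes observations are spread uniformly inside a bucket, which a skewed latency tail violates. The bucket bounds match the example above; the synthetic exponential workload is an assumption.

```python
import random

random.seed(7)
# 10,000 latencies (ms): mostly fast, with a long tail.
latencies = [random.expovariate(1 / 40) for _ in range(10_000)]

# Exact p99 from raw values (what a log query computes).
exact_p99 = sorted(latencies)[int(0.99 * len(latencies)) - 1]

# Histogram estimate: count per bucket, then linearly interpolate
# inside the bucket containing the 99th percentile.
bounds = [10, 50, 100, 500, float("inf")]
counts = [0] * len(bounds)
for v in latencies:
    for i, b in enumerate(bounds):
        if v <= b:
            counts[i] += 1
            break

rank, seen, lower = 0.99 * len(latencies), 0, 0.0
for i, c in enumerate(counts):
    if seen + c >= rank:
        # Assume values are uniform within [lower, bounds[i]].
        est_p99 = lower + (bounds[i] - lower) * (rank - seen) / c
        break
    seen += c
    lower = bounds[i]

# The two p99s disagree: the exact value clusters near the bucket's
# lower edge, while interpolation places it far into the bucket.
print(round(exact_p99, 1), round(est_p99, 1))
```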

The histogram is trading accuracy for space efficiency. The logs are trading space for accuracy. They cannot both be right.

Logs Drop Events, Metrics Drop State

Logs are append-only. When the log buffer fills, new events are dropped. The application continues running. Metrics continue incrementing.

A burst of errors fills the log buffer. The application drops log writes to prevent blocking. No log entries are created. The error counter still increments.

Metrics show 1000 errors. Logs show 200 errors. The missing 800 were dropped during the burst.
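
A minimal sketch of this drop behavior, using the same numbers (the capacity and burst size are illustrative):

```python
# Sketch: a bounded log buffer that drops writes when full, while the
# in-memory error counter keeps incrementing.
BUFFER_CAPACITY = 200
log_buffer = []
errors_total = 0

def record_error(msg):
    global errors_total
    errors_total += 1                      # metrics path: never blocks
    if len(log_buffer) < BUFFER_CAPACITY:  # logging path: drops when full
        log_buffer.append(msg)

for i in range(1000):                      # a burst of 1000 errors
    record_error(f"error {i}")

print(errors_total)     # 1000
print(len(log_buffer))  # 200: 800 events were silently dropped
```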

The inverse also happens. A metrics agent crashes and loses in-memory counter state. The application continues writing logs. When the metrics agent restarts, counters reset to zero.

Metrics show 50 requests since restart. Logs show 850 requests. The missing 800 were counted before the crash but are not reflected in metrics because the state was lost.

Both systems are correct about what they observed. They observed different things.

Aggregation Happens at Different Layers

Metrics can be aggregated in the application, the collection agent, the metrics backend, or at query time.

An application may maintain per-endpoint counters. The collection agent may aggregate those into a single service-level counter. The metrics backend may further downsample to reduce storage. A query may sum across multiple services.

Each aggregation layer introduces opportunities for divergence. If the application sends metric requests{endpoint="/api"} and requests{endpoint="/health"}, the agent may sum them into requests{service="api"}. If the backend then downsamples from 1-second resolution to 1-minute resolution, the query sees only the downsampled values.

Logs are aggregated only at query time. There is no intermediate aggregation layer losing fidelity. But query-time aggregation is expensive. Counting 10 million log entries is slower than summing 10 counter values.

The trade-off is query latency versus aggregation accuracy. Metrics prioritize fast queries at the cost of aggregation fidelity. Logs prioritize accurate aggregation at the cost of query performance.

Labels and Fields Do Not Correspond

A metric is identified by its name and labels. A log entry is a semi-structured blob with fields.

Metrics enforce schema at write time. Labels must be declared when the metric is created. Adding a new label requires code changes and redeployment.

Logs enforce schema at read time. Any field can be added to any log entry without coordination. Querying a field that does not exist simply returns empty results.

This creates semantic drift. A metric labeled status="error" may not correspond to logs with field level="ERROR". They sound similar but represent different concepts. The metric may track HTTP 5xx responses. The log field may track application-level errors that return HTTP 200.

When you query both systems for “errors,” you get different results because the word “error” means different things in each system.
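
The drift is easy to demonstrate with a handful of classified requests. The field names and classification rules here are illustrative, not from any particular system:

```python
# Sketch: the same traffic classified two ways. The metric counts HTTP
# 5xx responses; the log field flags application-level errors, some of
# which are returned as HTTP 200.
requests = [
    {"status": 200, "level": "INFO"},
    {"status": 200, "level": "ERROR"},  # app error surfaced as HTTP 200
    {"status": 200, "level": "ERROR"},  # another one
    {"status": 500, "level": "ERROR"},
    {"status": 503, "level": "INFO"},   # infra failure, nothing app-logged
]

metric_errors = sum(1 for r in requests if r["status"] >= 500)
log_errors = sum(1 for r in requests if r["level"] == "ERROR")

print(metric_errors, log_errors)  # 2 3: same traffic, different "errors"
```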

There is no mechanism to enforce semantic consistency between metrics labels and log fields. They evolve independently and diverge over time.

Multi-Process Systems Compound the Problem

A distributed system has multiple processes writing metrics and logs. Each process has its own counters, buffers, clocks, and failure modes.

Process A handles 100 requests and increments a counter. Process B handles 100 requests and increments a different counter. The metrics backend sums them: 200 requests.

Process A logs all 100 requests successfully. Process B’s log forwarder is down. Zero log entries from Process B reach the aggregator. Logs show 100 requests.

The divergence is 2x, and it is undetectable without process-level visibility. The metrics backend shows the sum but not the breakdown. The log aggregator shows entries without knowing which processes are not sending logs.

This is worse when processes crash. A crashed process stops sending metrics and logs. Metrics show the last known value until a timeout expires, then drop to zero. Logs show nothing because the process is not writing.

If the process crashes after handling 50 requests but before exporting metrics, those 50 requests are visible in logs but not in metrics. If the process crashes after exporting metrics but before flushing log buffers, those 50 requests are visible in metrics but not in logs.

There is no global transaction coordinator ensuring metrics and logs agree. Each system operates independently with independent failure modes.

They Are Designed for Different Use Cases

Metrics are designed for monitoring: detecting anomalies, tracking trends, alerting on thresholds.

Logs are designed for debugging: investigating specific requests, reconstructing event sequences, diagnosing root causes.

These use cases have different requirements. Monitoring requires low-latency aggregates with bounded cardinality. Debugging requires high-fidelity details with unbounded cardinality.

A system optimized for monitoring will aggressively aggregate and sample. A system optimized for debugging will preserve details at the cost of storage and query performance.

When you query both systems, you are asking them to serve use cases they were not designed for. Metrics provide poor debugging information because they discard details. Logs provide poor monitoring information because aggregating raw events is expensive and slow.

The divergence is not a bug. It is the result of optimizing for orthogonal objectives.

Why This Matters

When metrics and logs disagree, operators lose confidence in observability data. If the tools cannot agree on basic facts like request counts, how can they be trusted for complex diagnostics?

The answer is that metrics and logs are not measuring the same thing. They are two independent views of system behavior, each with its own sampling strategy, aggregation logic, failure modes, and latency characteristics.

Treating them as redundant sources of truth is a mistake. They are complementary sources of partial truth.

Metrics tell you what is happening right now, approximately, with low latency. Logs tell you what happened, exactly, with high latency.

When the numbers diverge, it is not because one system is broken. It is because they are measuring different aspects of a complex distributed system that does not have a single coherent ground truth.

The goal is not to make them agree. The goal is to understand why they disagree and what that divergence reveals about system behavior.