Why Observability’s ‘Universal Language’ Still Needs Translation



Metrics promise universal understanding across systems, but with evolving formats and complex math, they often cause more confusion than clarity. Here’s what we’re getting wrong and how we can fix it.

In 1887, an ophthalmologist named L.L. Zamenhof introduced Esperanto, a universal language designed to break down barriers and unite people around the world. It was ambitious, idealistic, and ultimately niche, with only about 100,000 speakers today.

Observability has its own version of Esperanto: metrics. They’re the standardized, numerical representations of system health. In theory, metrics should simplify how we monitor and troubleshoot digital infrastructure. In practice, they’re often misunderstood, misused, and maddeningly inconsistent.

Let’s explore why metrics, our supposed universal language, remain so difficult to get right.

Metrics, Decoded (and Re-Encoded)

A metric is a numeric measurement at a point in time. That seems straightforward until you dig into the nuances of how metrics are defined and used. Take redis.keyspace.hits, for example: a counter that tracks how often a Redis instance successfully finds data in the keyspace. Depending on the telemetry format, whether OpenTelemetry, Prometheus, or StatsD, the same datapoint is formatted differently, even with identical dimensions, aggregations, and metric value.
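
Here is a rough sketch of that same datapoint in each format, written as plain Python for illustration; the label name, instance value, and simplified OTLP structure are assumptions rather than excerpts from any spec.

import json
import time

# The same measurement, redis.keyspace.hits = 1042 for one Redis instance,
# expressed in three wire formats. Label names and values are illustrative.

# Prometheus exposition format: a cumulative counter with inline labels;
# dots are conventionally rewritten as underscores and counters get a _total suffix.
prometheus_line = 'redis_keyspace_hits_total{redis_instance="cache-01"} 1042'

# StatsD line protocol: a counter ("|c"); plain StatsD has no label syntax,
# so the tag here uses the DogStatsD-style "|#" extension.
statsd_line = "redis.keyspace.hits:1042|c|#redis_instance:cache-01"

# OTLP-style metric: a structured datapoint that carries attributes,
# temporality, and monotonicity explicitly (shown as a plain dict, not the real protobuf).
otlp_metric = {
    "name": "redis.keyspace.hits",
    "sum": {
        "is_monotonic": True,
        "aggregation_temporality": "CUMULATIVE",
        "data_points": [{
            "attributes": {"redis_instance": "cache-01"},
            "as_int": 1042,
            "time_unix_nano": time.time_ns(),
        }],
    },
}

print(prometheus_line)
print(statsd_line)
print(json.dumps(otlp_metric, indent=2))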

We now have competing standards, such as StatsD, Prometheus, and OpenTelemetry (OTLP) metrics, each introducing its own way to define and transmit datapoints and their associated metadata. These formats don’t just differ in syntax; they vary in fundamental behavior and metadata structure. The result? Three tools may show you the same metric value but require entirely different logic to collect, store, and analyze it.

That fragmentation leads to operational confusion, inflated storage costs, and teams spending more time decoding telemetry than acting on it.

Format Conversion Does Not Equal Metric Understanding

Even when format translation is handled, aggregation still causes confusion. Imagine collecting redis.keyspace.hits every six seconds across 10 containers. If the container.id tag is dropped, values that were once separate series must be combined, and each format handles that differently. Prometheus might sum the values, OTLP can treat the result as a delta counter, and StatsD could average them, which behaves more like a gauge than a counter. These subtle differences in interpretation can lead to inconsistent analysis. Without intentional handling of metrics, teams risk drawing incorrect conclusions from the data.
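
A small sketch with invented numbers shows how much the aggregation choice changes what the resulting series means once container.id is gone.

# Hypothetical 6-second collection interval: per-container increments of
# redis.keyspace.hits from 10 containers. The numbers are invented for illustration.
hits_by_container = {f"container-{i}": 120 + i for i in range(10)}

# Drop the container.id dimension and the backend has to combine the values somehow.
total = sum(hits_by_container.values())    # sum: 1245 hits across the fleet
mean = total / len(hits_by_container)      # average: 124.5, which reads like a gauge

print(f"summed={total}, averaged={mean}")
# Same raw datapoints, two different stories: one answers "how many hits did
# the fleet serve this interval?", the other "how busy is a typical container?"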


Once formats are reconciled, the hardest part often comes next: deciding how to aggregate those metrics. The answer depends on the metric type. Summing gauges can produce meaningless results. Treating a delta as a cumulative counter distorts rates and totals. Aggregation math that is technically correct may still confuse downstream systems, especially if those systems expect monotonic behavior.

Metrics are math, and the math matters. This is why tools need metric-specific logic, similar to the event-centric logic that already exists for logs and traces.
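
As a minimal sketch of what that metric-specific logic could look like, the snippet below dispatches on three simplified, hypothetical type names; real pipelines carry richer type and temporality metadata, but the principle is the same.

# Metric-type-aware aggregation, heavily simplified for illustration.
def aggregate(values, metric_type):
    if metric_type == "counter":  # monotonic cumulative totals: summing preserves meaning
        return sum(values)
    if metric_type == "delta":    # per-interval increments: sum, but do not re-emit as cumulative
        return sum(values)
    if metric_type == "gauge":    # point-in-time readings: combine with mean (or min/max/last)
        return sum(values) / len(values)
    raise ValueError(f"unknown metric type: {metric_type}")

cpu_utilization = [0.62, 0.48, 0.71]        # gauge readings from three hosts
print(aggregate(cpu_utilization, "gauge"))  # ~0.60, not a nonsensical 1.81 "utilization"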

Why It Matters

If we can’t rely on a shared understanding of metrics, observability suffers. Incidents take longer to resolve. Alerting becomes noisy. Teams lose faith in their data.

The path forward isn’t about creating another standard. It’s about developing better tooling that simplifies format handling, smarter ways to aggregate and interpret data, and education that helps teams use metrics effectively without needing a math degree.

By treating metrics as a unique form of telemetry with its own structure and challenges, we can remove the guesswork and empower teams to act with confidence. It’s time to build with clarity in mind—not just for machines, but for the humans interpreting the data.

About the author: Josh Biggley is a staff product manager at Cribl. A 25-year veteran of the tech industry, Biggley loves to talk about monitoring, observability, OpenTelemetry, network telemetry, and all things nerdy. He has experience with Fortune 25 companies and pre-seed startups alike, across manufacturing, healthcare, government, and consulting verticals.


 
