Published February 12, 2026
OpenRCA Isn’t Root Cause Analysis — and Why That Matters
Large language models are increasingly being applied to incident investigation and root cause analysis—and benchmarks like OpenRCA are an important early step, providing shared datasets, standardized evaluation, and a way to track progress over time.
As scores improve, it's tempting to treat them as signals of production readiness. But the hardest parts of RCA—causation, scale, and system structure—lie largely outside what benchmarks measure. And that gap becomes visible the moment you work through the dataset directly.
Identifiability and Evidence Gaps in Practice
Looking closely at OpenRCA's structure and evaluation criteria reveals a mismatch between synthetic benchmark performance and real-world, production-grade incident root cause analysis. In several scenarios, the labeled root cause cannot be reliably recovered from the provided telemetry under the benchmark's own constraints:
Bank 0 — No observable fault onset
Injected fault: High memory usage for Mysql02 at 2021-03-04 14:57
Across the entire evaluation window, Mysql02's memory utilization sits at approximately 98%: before, during, and after injection. No spike, no inflection, no onset. Memory was already saturated before the injection occurred, so the fault is not recoverable from the telemetry.
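To make this concrete, here is a minimal sketch of the kind of onset check an investigator or agent might run over that metric. The file path and column names are hypothetical stand-ins, not OpenRCA's actual schema; the point is that a series pinned near 98% on both sides of the injection timestamp produces no detectable level shift under any reasonable threshold.

```python
import pandas as pd

# Hypothetical inputs: an exported metric series for Mysql02 and the labeled
# injection time. The path and column names are illustrative, not OpenRCA's schema.
metrics = pd.read_csv("mysql02_memory_usage.csv", parse_dates=["timestamp"])
injection = pd.Timestamp("2021-03-04 14:57")

window = pd.Timedelta(minutes=30)
before = metrics[(metrics.timestamp >= injection - window) & (metrics.timestamp < injection)]
after = metrics[(metrics.timestamp >= injection) & (metrics.timestamp < injection + window)]

# Crude onset test: did the mean level shift by more than a few standard deviations
# of pre-injection noise? For Bank 0 the series sits near 98% throughout, so no
# threshold choice surfaces an onset.
shift = after["value"].mean() - before["value"].mean()
noise = before["value"].std() or 1e-9  # guard against a perfectly flat series
print(f"level shift: {shift:.2f} points, z ~ {shift / noise:.2f}")
```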
Bank 60 — Multiple plausible causes with no principled way to choose
Injected fault: High disk space usage for apache01 at 2021-03-09 17:39
Anomaly detection surfaces anomalies across 13 different components. apache01 — the labeled root cause — exhibits fewer anomalies than many others, doesn't appear in traces, and can't be placed reliably in the dependency graph. There is no principled basis for selecting it over components with stronger or more numerous anomalies.
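A rough sketch of the selection problem: given only per-component anomaly counts (the names and numbers below are placeholders, not values from the dataset), the only rule available without topology or change data is "pick the most anomalous component," and that rule does not select the labeled answer.

```python
from collections import Counter

# Placeholder anomaly counts from a hypothetical detector sweep; the component names
# and numbers are illustrative, not drawn from OpenRCA. They mirror the situation
# above: the labeled root cause shows fewer anomalies than several peers.
anomaly_counts = Counter({"os_021": 14, "db_007": 11, "tomcat02": 9, "apache01": 3})

# With no dependency graph to place apache01 in, and no deploys or config changes to
# attribute, the only available selection rule is "most anomalous component".
ranked = anomaly_counts.most_common()
print(ranked[0][0])                # "os_021", not the labeled root cause
print(anomaly_counts["apache01"])  # 3
```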
Bank 47 — Anomalies that violate expected system semantics
Injected fault: High JVM CPU load for IG01 at 2021-03-04 21:06
IG01 and IG02 show nearly identical patterns, with IG01's latency actually decreasing during the incident. The components with the strongest anomaly signals don't appear in traces and sit in positions that are difficult to reconcile with standard propagation patterns. Causal attribution is ambiguous by design.
These cases illustrate a broader structural issue: when evidence is incomplete, ambiguous, or inconsistent with system structure, selecting a labeled answer is not the same as performing root cause analysis.
RCA Is a Causation Problem, Not a Pattern Matching Problem
In real production environments, root cause analysis is a causation problem in a complex, dynamic system. Incidents unfold over time. Signals propagate across dependencies. Symptoms appear far from their causes. Correct diagnosis requires enforcing constraints—temporal ordering, dependency relationships, causal consistency—not just pattern recognition over available data.
Text-first, model-centric agents struggle here, not because they lack intelligence, but because they lack structure. Even equipped with tools to read metrics, logs, and traces, they must infer causation and topology on the fly. As environments grow larger and noisier, this approach plateaus. You can game the problem on a narrowly scoped part of the stack, but that is not the same as autonomous RCA in ever-evolving production systems.
Effective RCA systems invert this. They make causation, topology, and temporal constraints explicit, letting models operate within that structure rather than reconstruct it from scratch.
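As a toy illustration of what that inversion looks like (the services, dependency edges, and timestamps below are hypothetical, and this is a sketch of the idea rather than any particular product's implementation), an explicit constraint layer can discard candidates that could not have caused the observed symptom:

```python
from datetime import datetime

# Hypothetical system model: which service calls which, and when each anomaly began.
# Names, edges, and timestamps are illustrative only.
depends_on = {                       # caller -> callees; failures propagate callee -> caller
    "checkout": ["payments", "cart"],
    "payments": ["mysql01"],
    "cart": ["redis01"],
}
anomaly_onset = {
    "checkout": datetime(2021, 3, 4, 21, 10),   # the observed symptom
    "mysql01":  datetime(2021, 3, 4, 21, 6),
    "redis01":  datetime(2021, 3, 4, 21, 15),
}

def upstream_of(service: str) -> set[str]:
    """Components the service transitively depends on (the only possible fault origins)."""
    seen, stack = set(), list(depends_on.get(service, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen

symptom = "checkout"
candidates = [
    c for c in anomaly_onset
    if c != symptom
    and c in upstream_of(symptom)                      # topology: must be a dependency
    and anomaly_onset[c] <= anomaly_onset[symptom]     # timing: cause precedes symptom
]
print(candidates)  # ['mysql01']; redis01 is upstream, but its anomaly starts too late
```

Hypotheses that survive both checks are the only ones worth testing further; everything else is eliminated before any narrative is generated.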
Why the Benchmark Can't See What Matters
OpenRCA evaluates pattern matching over bounded, static telemetry artifacts. That's meaningfully different from how SREs debug incidents in production, where investigations span large systems, evolving state, and competing hypotheses under operational constraints.
The structural gaps make this concrete:
OpenRCA operates at gigabyte scale where telemetry can be scanned directly; production environments generate petabytes per day, forcing reliance on retrieval, indexing, and prioritization under strict constraints.
Telemetry is provided as files, bypassing indexing cost, query latency, and bandwidth limits entirely.
Deployments, configuration changes, and feature flags—often the primary disambiguators in real incidents—are absent.
Dependencies and propagation paths are weakly represented, allowing failures to appear independent of the system's actual structure.
The result is that hill-climbing on OpenRCA scores is possible without ever addressing causation, scale, or system structure. Recent gains in benchmark scores can invite linear extrapolation: if progress continues at this pace, production-grade RCA must be just a few years away. This inference is incorrect. The hardest parts of RCA—operating over terabytes of data per day, enforcing retrieval constraints, attributing change, and reasoning over system semantics—don't improve linearly with model capability.
When substantial gains are achievable without addressing any of those properties, the more likely conclusion is that they lie outside the benchmark's scope.
A Different Design Point
What OpenRCA highlights is not a failure of language models. Recent models show impressive gains in reasoning, tool use, and pattern recognition. The limitation is architectural.
Traversal was built around this premise. Instead of asking models to infer partial system structure from noisy raw telemetry on the fly, Traversal constructs a machine-readable model of the production environment and reprocesses observability data for investigation. Agents don't generate narratives over arbitrary data; they propose and test hypotheses against explicit constraints. Explanations that violate timing, dependencies, or system behavior are eliminated early. The result is a single diagnosis consistent with the system, not a list of plausible causes.
These capabilities are largely invisible to benchmarks like OpenRCA not because they're unimportant, but because they operate outside what the benchmark measures.
Reading Benchmarks Responsibly
OpenRCA is a valuable contribution. It provides a shared reference point and helps surface real progress in model capabilities. But its scores should be interpreted for what they are: signals of bounded telemetry reasoning, not proxies for production root cause analysis.
Treating OpenRCA scores as a proxy for enterprise RCA capability conflates benchmark reasoning with system-level diagnosis, a category error that obscures the real challenges of incident response.
As AI systems move from demonstrations to operational roles, this distinction becomes increasingly important. Progress in RCA will come less from ever-larger models and more from systems that encode causation, structure, and scale as first-class concepts.
Understanding that difference is a prerequisite for building AI that can keep production systems running.


