Published February 17, 2026

Why You Can't Build an AI SRE with Claude Code (and What It Takes to Get There)

Every engineering leader has had the thought: Claude Code and Codex are powerful, the team is sharp. Why not build an AI SRE ourselves?

It's the right instinct applied to the wrong problem. See Traversal’s AI SRE by booking a demo.

The Build Trap

DIY AI SRE looks tractable from the outside. Connect an LLM to your observability stack, write some prompts, add a few tools. What could go wrong?

Eighteen months and $5M in senior engineering talent spend later, you find out.

What you've built works in demos, on a subset of your stack, under the specific conditions your team has had time to account for. The moment it encounters real production scale—the noise, the volume, the ambiguity—the cracks appear. It hallucinates on incomplete telemetry. It times out against API rate limits mid-investigation. It loses the thread of what it was looking at three steps ago. When it does surface a root cause, it does so with high confidence whether or not it's right. Your on-call engineer acts on it anyway, because that's what the tool told them.

Every time your infrastructure evolves (a new service, a schema change, a dependency shift), someone has to go back in and fix it.

That someone is your best senior engineer: the one who carries the most institutional knowledge about how your systems actually behave. They are now spending that knowledge maintaining AI instead of applying it to the reliability problems it was supposed to solve.

And the setup cost alone should give you pause. Expect roughly 100 engineer hours per application just to get an agent operational: writing runbooks, mapping dependencies, encoding institutional knowledge into prompts and tooling. Multiply that across a portfolio of services and you're looking at a staggering upfront investment before the agent has handled a single real incident.

Worse, those runbooks start going stale the moment they're written. Code evolves. Services get refactored. Dependencies shift. In a traditional workflow, the engineer who makes the change updates the runbook. But in a world where AI is the one making changes and fixing things, who updates the runbooks? The feedback loop breaks. Your agent is operating against documentation that no longer reflects reality, and nobody notices until it matters most.

Why Simple Agents Fail in Production

The core problem isn't effort or talent. It's that production environments are adversarial to out-of-the-box LLMs in ways that aren't obvious until you're deep in it, including:

  • Data volume. Production systems generate terabytes of telemetry per day. The observability APIs agents rely on were built for humans navigating dashboards, not agentic systems running thousands of parallel investigations. Rate limits that feel invisible in normal use quickly become hard ceilings, pushing agents toward expensive, slow pattern matching: grouping alerts that fire repeatedly, or correlating alerts that frequently occur together, instead of true root cause analysis.

  • The scale math. Each app gets its own agent, with no coordination across the fleet. At a million-plus alerts per day and ~$5 per LLM-driven investigation, that's $5M/day in API costs, if you could even get the rate limits raised that far. Our Fortune 100 customer has roughly $1,440/day in standard API capacity; they'd need about 3,472x that to cover their alert volume.

  • The observability bottleneck. DIY agents query observability platforms directly via public MCPs and standard APIs, creating an impossible dilemma. Query aggressively and you crash your Elasticsearch clusters, leaving yourself blind during the incident you're trying to resolve. Throttle to protect infrastructure and the agent takes hours to respond—so slow the business ignores it entirely.

  • Ambiguity. Real incidents are messy: symptoms appear far from their causes, multiple components show anomalies simultaneously, and signals propagate across dependency chains in non-obvious ways. A DIY agent with no deep understanding of your system's topology has no principled way to distinguish a root cause from a downstream effect. It guesses—and at production scale, it guesses wrong often enough to be a risk.
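The scale math in the bullets above can be checked with a few lines. This is a back-of-envelope sketch using the figures stated in the text, not a billing model:

```python
# Cost math from the text: million-plus alerts/day, ~$5 per LLM-driven
# investigation, and ~$1,440/day of standard API capacity.
alerts_per_day = 1_000_000
cost_per_investigation = 5.0        # dollars of LLM API spend per investigation
daily_llm_cost = alerts_per_day * cost_per_investigation
print(f"${daily_llm_cost:,.0f}/day")            # $5,000,000/day

standard_capacity = 1440.0          # dollars/day of standard API capacity
multiplier = daily_llm_cost / standard_capacity
print(f"{multiplier:,.0f}x standard capacity")  # 3,472x standard capacity
```

The point of the arithmetic is that the gap is multiplicative: it doesn't close by negotiating a higher tier, only by changing how investigations consume the APIs.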

The True Cost

The $5M in engineering talent is the visible cost. The invisible costs accumulate quietly and don't stop.

Every schema change requires someone to update the agent. Every new service requires re-mapping dependencies. Every model update requires re-validating behavior across an increasingly complex system. And every incident the AI handles badly erodes the team's trust in it—until engineers stop using it altogether and route around it, at which point you've paid full price for something that sits idle while incident debt compounds in the background.

You're not just building a tool. You're taking on a platform without the resources of a company whose entire mission is to make it work.

What You're Actually Betting

The cost is one thing; the risk is another.

When a major incident hits a part of your system the custom AI doesn't model well, a wrong diagnosis under pressure doesn't just extend the outage—it shapes how the incident is remembered. The post-mortem gets shared. Customers notice before you've fully understood what happened. A deal in late stages quietly goes cold. Engineering credibility, built carefully over years, is surprisingly fragile when the tool you championed gets it wrong at the wrong moment.

There’s also the compliance and security surface that tends to get underestimated until it becomes a blocker. A custom AI agent with broad read access to production telemetry—metrics, logs, traces, and potentially sensitive transaction data—introduces a meaningful attack vector. It requires access controls, audit logging, and data residency policies that meet your organization's standards. In regulated industries, it may require legal and security sign-off before it can touch production at all. 

Most DIY builds treat these requirements as something to address later. By the time later arrives, the cost of retrofitting is significant and the team is already committed.

What Enterprise-Grade AI SRE Actually Requires

Two capabilities separate an enterprise-grade AI SRE platform from a project, and both require years of foundational AI research to get right.

A high-fidelity, dynamic knowledge graph: An AI SRE needs a continuously updated, machine-readable model of your entire architecture—built by mining telemetry and code to extract millions of entities, statistical baselines, and dependency relationships. Not a hand-maintained CMDB. Not a static graph. A living representation that updates as your systems change, captures behavioral baselines, and gives agents the structural context to form and test hypotheses rather than react blindly to whatever telemetry happens to be in the prompt window.

Without a structured World Model™, an agent isn’t investigating against a map of your system. It’s pulling logs and metrics on demand, one slice at a time. There’s no built-in understanding of how services depend on each other, what “normal” looks like, or how signals propagate through the stack. When multiple components show anomalies, the agent has no structural context to separate cause from effect. Everything that spikes together looks equally suspicious. Root cause analysis collapses into pattern matching.
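A toy sketch makes the cause-versus-effect point concrete. The services, dependency graph, and anomaly set below are hypothetical: without a dependency map, every anomalous service looks equally suspicious; with one, only the service whose own dependencies are healthy survives as a root-cause candidate.

```python
# Hypothetical dependency map: each service -> the services it calls.
deps = {
    "checkout":  ["payments", "inventory"],
    "payments":  ["db"],
    "inventory": ["db"],
    "db":        [],
}
# Three services alarm at once. Which one is the cause?
anomalous = {"checkout", "payments", "db"}

# A root-cause candidate is an anomalous service none of whose own
# dependencies are anomalous; everything above it is likely a symptom.
root_causes = [
    svc for svc in anomalous
    if not any(d in anomalous for d in deps[svc])
]
print(root_causes)   # ['db'] — checkout and payments are downstream effects
```

Real topologies are vastly larger and noisier than this, which is exactly why the map has to be discovered and kept current automatically rather than hand-drawn.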

But the limitation isn’t just structure. It’s also access.

Standard observability APIs are rate-limited, sequential, and optimized for human retrieval—not AI investigation. An agent forced to investigate through those interfaces can only evaluate a narrow slice of possible hypotheses before hitting cost and latency ceilings. Root cause analysis becomes shallow by design.
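A rough sketch of that ceiling, with every number assumed purely for illustration: under a fixed incident budget, sequential querying caps how many hypotheses can be evaluated, and parallel access over a pre-indexed store multiplies that cap.

```python
# Assumed numbers, for illustration only.
queries_per_hypothesis = 3       # telemetry lookups needed to test one hypothesis
query_latency_s = 2.0            # one rate-limited, human-oriented API round trip
incident_budget_s = 300          # 5 minutes before a diagnosis goes stale

# Sequential: one query at a time through the public API.
sequential = incident_budget_s // (queries_per_hypothesis * query_latency_s)
print(int(sequential))           # 50 hypotheses, evaluated one at a time

# Parallel over a pre-indexed store with no per-call rate limit (assumed).
parallelism = 200
parallel = sequential * parallelism
print(int(parallel))             # 10,000 hypotheses in the same budget
```

The absolute figures don't matter; the shape does. Latency and rate limits bound the sequential path no matter how good the model is, so depth of investigation has to come from the data layer.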

Traversal’s World Model™ solves both problems at once. We ingest your telemetry and code, compress and re-index it into a structured, searchable representation, and layer on millions of entities, baselines, and dependency relationships. The telemetry isn’t just stored—it’s reorganized for causal reasoning. Observability data is compressed and re-ranked for search across causal pathways rather than dashboard navigation.

The result is a continuously updated, machine-readable model of your entire production environment—one that agents can reason over exhaustively, not sequentially.

Building this required years of foundational work in automated topology discovery, large-scale telemetry indexing, statistical baselining, and causal machine learning—and it requires ongoing engineering to keep pace with systems that never stop changing. It isn’t a feature layered on top of dashboards. It’s the substrate that makes enterprise-grade AI SRE possible.

Causal search at scale: Most observability tools surface signals that correlate in time. When multiple services spike together, they’re grouped. When alerts fire simultaneously, they’re associated. That’s useful—but it isn’t root cause analysis. Root cause analysis requires evaluating competing explanations and ruling out everything inconsistent with how the system actually behaves.

Traversal’s Causal Search Engine™ does exactly that.

Using agentic search and causal machine learning, Causal Search Engine™ evaluates thousands of hypotheses in parallel, running roughly 10,000 analytical tests in the time traditional systems evaluate a fraction of that. Each hypothesis is filtered against your system's topology, dependency structure, timing constraints, and behavioral baselines. Explanations that violate those constraints are eliminated. The output isn't a list of correlated anomalies, but a causally consistent diagnosis that fits the structure and behavior of your system.
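To illustrate the general technique of constraint-based filtering (hypothetical services, dependency graph, and anomaly onset times; a sketch of the idea, not Traversal's implementation): each candidate cause→effect pair survives only if the cause is upstream of the effect and its anomaly starts first.

```python
# Hypothetical topology and anomaly onset times (seconds into the incident).
deps = {"api": {"cache", "db"}, "cache": {"db"}, "db": set()}
onset = {"api": 120.0, "cache": 95.0, "db": 90.0}

def upstream_of(effect, cause, graph):
    """True if `cause` is reachable from `effect` along dependency edges."""
    stack, seen = [effect], set()
    while stack:
        node = stack.pop()
        for d in graph[node]:
            if d == cause:
                return True
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return False

# Every ordered pair is a candidate explanation; keep only the pairs
# consistent with topology (cause upstream) and timing (cause starts first).
candidates = [(c, e) for c in onset for e in onset if c != e]
consistent = [
    (c, e) for c, e in candidates
    if upstream_of(e, c, deps) and onset[c] <= onset[e]
]
print(consistent)   # [('cache', 'api'), ('db', 'api'), ('db', 'cache')]
```

In this toy run, "db" is the only service that explains others without itself being explained, which is the causal-consistency idea in miniature; correlation alone would have flagged all three spikes as equally suspicious.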

This is why most DIY approaches plateau. The limitation isn’t the LLM—it’s that they stop at correlation. Without large-scale causal hypothesis testing, pattern matching masquerades as root cause analysis.

The Ongoing Risk

Every month a DIY tool is in production is a month your uptime depends on infrastructure your team built under time pressure, hasn't fully stress-tested at scale, and is one engineer departure away from becoming unmaintainable. That is a critical business risk. When the next major incident hits a part of your system the custom AI doesn't model well, the cost isn't just the engineering hours. It's the downtime, your reputation, and your customers' trust.

What Traversal Is

Traversal is the only AI SRE platform built on both capabilities from the ground up, validated at petabyte scale across companies like American Express, DigitalOcean, PepsiCo, and Cloudways. It runs agentless and schema-less, with no instrumentation overhead.

The teams that chose to build themselves are still building. The teams that chose Traversal are resolving incidents. Book a demo today.