We Spent 30% of Our Engineering Time Debugging AI Agents. So I Built One to Do It.
AI engineers waste 30% of their week scrolling traces. Kelet automates root cause analysis across thousands of sessions and generates validated fixes: the AI agent that debugs your AI agents.
TL;DR: Kelet is an AI agent that automates root cause analysis for production AI agents. It reads your traces across thousands of sessions, identifies failure patterns no human would find manually, and generates validated prompt fixes with before/after proof. Free during beta 🤩.
I’ve built 50+ LLM apps/agents over the past few years. Some scaled to millions of transactions a day. And after all of that, the thing I’m most frustrated by is how we debug them. It’s embarrassing. We have models writing production code, passing exams, replacing entire workflows — and when they break in production? Our best strategy is still: open our (LLM) monitoring system, scroll, squint, guess, patch, deploy, pray. Monday morning, same thing again.
I’ve talked to 112 AI engineers. Same story, every time. About a third of their week just… gone. Not building anything. Not shipping. Just scrolling traces. And it’s not just anecdotal: McKinsey and Gartner report that roughly 90% of AI projects look great in PoCs and then fail in production, largely because nobody has a reliable way to diagnose what’s going wrong at scale.
I tried everything. Evals? They pass in staging, then real users show up and the whole thing falls apart. Autonomous monitoring? It caught basically nothing useful. Observability dashboards? Every one I could find gave me beautiful charts and zero actual answers.
The only thing that actually worked? The oldest, least glamorous trick in data science: error analysis. Just going through production data, session by session, tagging failures by hand, slowly building up a picture of what’s actually breaking. The best AI engineers I know, people shipping real agents at serious scale, were all doing the same thing. Spreadsheets. Every morning. One session at a time. It felt like 2015 all over again, manually labeling training data.
It works. It’s also an absurd waste of very expensive engineering time. And it definitely doesn’t scale.
“Just throw Claude Code at it”
Everyone tried this. Claude Code, Cursor, deep research agents, autonomous debugging loops. Same wall every time. Nobody in this industry wants to say this out loud, so I will: a single LLM cannot solve this problem. It’s the wrong tool for the job.
In traditional software, a coding agent follows a traceback to the root cause. One session, one bug, one fix. AI failures are stochastic. The same input succeeds ten times, fails on the eleventh, works again for no reason. One hallucination is an anecdote. You need to see a failure across hundreds of sessions before you can separate patterns from noise.
Picture dozens of needles scattered across thousands of haystacks, connected in ways you can only see when you zoom out. No LLM call or agent loop is going to crack that. You need an assembly of specialized models — some LLM, some classical ML — learning continuously over weeks and months. We train dozens of models per subagent in your pipeline.
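To put a number on why single sessions lie to you, here’s a back-of-the-envelope sketch (plain Python, toy numbers of my own choosing, not anything Kelet computes): the confidence interval around an observed failure rate. One failure in eleven runs is consistent with almost any underlying rate; the same rate over hundreds of sessions is something you can act on.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed failure rate."""
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# One failure in 11 sessions: the true rate could be anywhere
# from ~2% to ~38%. That's an anecdote, not a pattern.
print(wilson_interval(1, 11))

# The same ~9% observed rate over 550 sessions: roughly 7%-12%.
# Now it's a finding you can chase.
print(wilson_interval(50, 550))
```

Same observed rate, wildly different certainty. That gap is the whole reason error analysis has to happen across sessions, not inside one trace.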
The question I couldn’t stop thinking about: we’re building agents for legal research, code generation, customer support. Why can’t an agent do error analysis? Let the humans make the judgment calls. Let a machine grind through the data.
So we built that machine.
The failure nobody saw coming
An insurance company we worked with had a two-agent pipeline: first agent summarizes support call transcripts, second classifies the claim type. Passed all their evals. Everyone was happy.
One session that stuck with me: a woman calls about a storm. Flooding, broken pipe, mud everywhere, the whole disaster. The summarization agent pulls out the facts. The classification agent reads that summary and says “water system problem.” The underwriter looks at it, nods, and just… changes it to “weather issue.” Doesn’t flag it or open a ticket, just fixes it silently.
If you looked at that one session in your trace viewer, you’d blame the classification agent. Add a better example to the prompt. Ship the fix. Done, right?
We analyzed hundreds of their sessions. Fire claims, power outages, electrical failures. The classification agent wasn’t the problem. The summarization agent was. It kept squashing the timeline into a flat bag of facts, stripping out the order things happened in. And in insurance, the sequence is what determines the claim type, not the events themselves. That’s institutional knowledge that experienced underwriters carry in their heads. It’s not written down anywhere. A single session would have sent you chasing the wrong agent entirely.
One session gives you the symptom. Hundreds give you the root cause. And once you find the real root cause, the fix is short.
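A toy illustration of that insurance case (hypothetical counts, not the customer’s data): look at any single failing session and you blame the classifier, but aggregate the sessions and the failures line up almost perfectly with summaries that dropped the order of events.

```python
# Hypothetical session records: did the summary preserve event order,
# and was the final classification correct?
sessions = (
    [{"order_kept": False, "correct": False}] * 42   # order lost, mostly wrong
    + [{"order_kept": False, "correct": True}] * 8
    + [{"order_kept": True, "correct": True}] * 140  # order kept, mostly right
    + [{"order_kept": True, "correct": False}] * 10
)

def failure_rate(rows: list[dict]) -> float:
    return sum(not r["correct"] for r in rows) / len(rows)

lost = [s for s in sessions if not s["order_kept"]]
kept = [s for s in sessions if s["order_kept"]]

# The classifier only fails when the summarizer fed it a scrambled timeline.
print(f"failure rate when summary loses order: {failure_rate(lost):.0%}")  # 84%
print(f"failure rate when summary keeps order: {failure_rate(kept):.0%}")  # 7%
```

The classifier is mostly innocent; the upstream agent is destroying the signal it needs. You only see that split once you tag and group sessions, which is exactly the grind error analysis automates.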
Your observability tool is a very expensive screenshot
I’ll say what the Langfuse and LangSmith teams won’t say: trace collection is a solved problem. OpenTelemetry commoditized the whole thing. Langfuse, LangSmith, Datadog, Braintrust, Arize. They’re all showing you the same underlying data, just with different UIs on top. The competition at this point is literally about who renders traces more beautifully. And none of them — not one — can tell you why your agent broke.
The way I see the stack:
| Layer | What it does | Status |
|---|---|---|
| Traces + metrics | Collecting and viewing what happened | Solved. Commoditized. |
| Root cause analysis | Understanding why it failed, with evidence | Every tool stops here. |
| Automated fix | Generating a validated fix with before/after proof | Nobody does this. |
The whole observability industry built thermometers. Some of them are really good thermometers, I’ll give them that. But a thermometer doesn’t diagnose strep throat — it just confirms what you already knew, which is that something’s wrong. Your agent is failing. Your users are complaining. You didn’t need a fancier dashboard to tell you that.
Kelet is the doctor.
And look, if you’re already on Langfuse, great, keep it. Kelet pulls your traces directly, no re-instrumentation needed. We’re not replacing anything in your stack. We’re the part that was missing.
Kelet is a detective that never sleeps
Signals are tips. Traces are the crime scene. Kelet is the detective. Unlike your best engineer, it doesn’t get tired at 3pm or push the weird edge case to the next sprint.
When a user gives a thumbs-down, retries a request, or quietly edits what your agent wrote, those are all signals. They point Kelet to where the real problems are hiding. And when a pattern keeps showing up across hundreds of sessions, it surfaces as a named finding with a root cause backed by actual evidence. Not gut feeling. Not “I spent an hour scrolling, and I think maybe it’s the retrieval step.”
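The “quiet edit” signal, for example, can be approximated with nothing fancier than string similarity between what the agent wrote and what the human actually shipped. This is a simplified sketch, not Kelet’s actual detector, and the threshold is a made-up number:

```python
import difflib

def quiet_edit_signal(agent_output: str, final_text: str,
                      threshold: float = 0.9) -> bool:
    """Flag sessions where a human silently rewrote the agent's output.

    Plain string similarity is a stand-in for whatever signal
    extraction a production pipeline would actually use.
    """
    similarity = difflib.SequenceMatcher(None, agent_output, final_text).ratio()
    return similarity < threshold

# Untouched output: no signal.
print(quiet_edit_signal("claim type: water system problem",
                        "claim type: water system problem"))  # False

# The underwriter's silent correction from the story above: signal.
print(quiet_edit_signal("claim type: water system problem",
                        "claim type: weather issue"))         # True
```

Nobody filed a bug in either case. The second session is exactly the kind of failure that never reaches a dashboard unless something is watching for it.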
If you want extra confidence before shipping a fix, there’s GEPA optimization, basically evolutionary search tested against your real production sessions. It shows you the improvement before you deploy anything. No more crossing fingers after a prompt change.
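To make “evolutionary search” concrete, here’s a toy mutate-and-select loop over prompt instructions, scored against a stand-in objective. In the real setup the scorer replays your production sessions against each candidate prompt; everything here (the trait names, the scorer, the loop size) is illustrative, not GEPA’s actual algorithm.

```python
import random

random.seed(0)

# Stand-in objective: a real scorer would replay production sessions
# against the candidate prompt and measure the failure rate.
TARGET_TRAITS = {"preserve event order", "cite the transcript", "state claim type"}
ALL_TRAITS = sorted(TARGET_TRAITS | {"be concise", "use bullet points"})

def score(prompt_traits: frozenset) -> float:
    return len(prompt_traits & TARGET_TRAITS) / len(TARGET_TRAITS)

def mutate(traits: frozenset) -> frozenset:
    # Toggle one instruction in or out of the candidate prompt.
    return traits ^ {random.choice(ALL_TRAITS)}

best = frozenset({"be concise"})
for _ in range(200):
    candidate = mutate(best)
    if score(candidate) >= score(best):  # keep ties so the search can drift
        best = candidate

print(sorted(best & TARGET_TRAITS))
print(score(best))  # reaches 1.0 on this toy objective
```

The point isn’t the toy loop, it’s the validation step: every candidate is judged on the same held-out sessions, so the improvement you see is the improvement you ship.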
You don’t look at Kelet. Kelet looks at your agent.
So far we’ve analyzed over 33,000 production sessions. 73% of the teams we worked with had failure patterns that nobody on the team knew about. Average time from raw traces to a root cause plus a tested fix: 14.3 minutes. I still find that number kind of wild, honestly. It used to take us weeks. (The teams where it’s slowest, I should say, are the ones where failure patterns haven’t had time to repeat yet; 14 minutes assumes there’s enough data for a pattern to surface. Which means the earlier you connect, the faster it gets.)
Try it
Your agent is failing somewhere right now, and scrolling traces won’t fix it.
Start finding failures in 5 minutes
Free during beta. No credit card.
If you’re building agents and want to talk about this: [email protected].
— Almog