We Spent 30% of Our Engineering Time Debugging AI Agents. So I Built One to Do It.
AI engineers waste 30% of their week scrolling traces. Kelet automates root cause analysis across thousands of sessions and generates validated fixes: the AI agent that debugs your AI agents.
TL;DR: Kelet is an AI agent that automates root cause analysis for production AI agents. It reads your traces across thousands of sessions, identifies failure patterns no human would find manually, and generates validated prompt fixes with before/after proof. Free during beta 🤩.
I’ve built 50+ LLM apps/agents over the past few years. Some scaled to millions of transactions a day. And after all of that, the thing I’m most frustrated by is how we debug them. It’s embarrassing. We have models writing production code, passing exams, replacing entire workflows — and when they break in production? Our best strategy is still: open our (LLM) monitoring system, scroll, squint, guess, patch, deploy, pray. Monday morning, same thing again.
I’ve talked to 112 AI engineers. Same story, every time. About a third of their week just… gone. Not building anything. Not shipping. Just scrolling traces. And it’s not just anecdotal: McKinsey and Gartner report that roughly 90% of AI projects look great in PoCs and then fail in production, largely because nobody has a reliable way to diagnose what’s going wrong at scale.
I tried everything. Evals? They pass in staging, then real users show up and the whole thing falls apart. Autonomous monitoring? It caught basically nothing useful. Observability dashboards? Every one I could find gave me beautiful charts and zero actual answers.
The only thing that actually worked? The oldest, least glamorous trick in data science: error analysis. Just going through production data, session by session, tagging failures by hand, slowly building up a picture of what’s actually breaking. The best AI engineers I know, people shipping real agents at serious scale, were all doing the same thing. Spreadsheets. Every morning. One session at a time. It felt like 2015 all over again, manually labeling training data.
It works. It’s also an absurd waste of very expensive engineering time. And it definitely doesn’t scale.
“Just throw Claude Code at it”
Everyone tried this. Claude Code, Cursor, deep research agents, autonomous debugging loops. Same wall every time. Nobody in this industry wants to say this out loud, so I will: a single LLM cannot solve this problem. It’s the wrong tool for the job.
In traditional software, a coding agent follows a traceback to the root cause. One session, one bug, one fix. AI failures are stochastic. The same input succeeds ten times, fails on the eleventh, works again for no reason. One hallucination is an anecdote. You need to see a failure across hundreds of sessions before you can separate patterns from noise.
Picture dozens of needles scattered across thousands of haystacks, connected in ways you can only see when you zoom out. No LLM call or agent loop is going to crack that. You need an assembly of specialized models — some LLM, some classical ML — learning continuously over weeks and months. We train dozens of models per subagent in your pipeline.
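To put a number on why single sessions lie to you, here’s a back-of-the-envelope sketch (plain Python, toy numbers of my own choosing, not anything Kelet computes): the confidence interval around an observed failure rate. One failure in eleven runs is consistent with almost any underlying rate; the same rate over hundreds of sessions is something you can act on.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed failure rate."""
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# One failure in 11 sessions: the true rate could be anywhere
# from ~2% to ~38%. That's an anecdote, not a pattern.
print(wilson_interval(1, 11))

# The same ~9% observed rate over 550 sessions: roughly 7%-12%.
# Now it's a finding you can chase.
print(wilson_interval(50, 550))
```

Same observed rate, wildly different certainty. That gap is the whole reason error analysis has to happen across sessions, not inside one trace.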
The question I couldn’t stop thinking about: we’re building agents for legal research, code generation, customer support. Why can’t an agent do error analysis? Let the humans make the judgment calls. Let a machine grind through the data.
So we built that machine.
The failure nobody saw coming
An insurance company we worked with had a two-agent pipeline: first agent summarizes support call transcripts, second classifies the claim type. Passed all their evals. Everyone was happy.
One session that stuck with me: a woman calls about a storm. Flooding, broken pipe, mud everywhere, the whole disaster. The summarization agent pulls out the facts. The classification agent reads that summary and says “water system problem.” The underwriter looks at it, nods, and just… changes it to “weather issue.” Doesn’t flag it or open a ticket, just fixes it silently.
If you looked at that one session in your trace viewer, you’d blame the classification agent. Add a better example to the prompt. Ship the fix. Done, right?
We analyzed hundreds of their sessions. Fire claims, power outages, electrical failures. The classification agent wasn’t the problem. The summarization agent was. It kept squashing the timeline into a flat bag of facts, stripping out the order things happened in. And in insurance, the sequence is what determines the claim type, not the events themselves. That’s institutional knowledge that experienced underwriters carry in their heads. It’s not written down anywhere. A single session would have sent you chasing the wrong agent entirely.
One session gives you the symptom. Hundreds give you the root cause. And once you find the real root cause, the fix is short.
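A toy illustration of that insurance case (hypothetical counts, not the customer’s data): look at any single failing session and you blame the classifier, but aggregate the sessions and the failures line up almost perfectly with summaries that dropped the order of events.

```python
# Hypothetical session records: did the summary preserve event order,
# and was the final classification correct?
sessions = (
    [{"order_kept": False, "correct": False}] * 42   # order lost, mostly wrong
    + [{"order_kept": False, "correct": True}] * 8
    + [{"order_kept": True, "correct": True}] * 140  # order kept, mostly right
    + [{"order_kept": True, "correct": False}] * 10
)

def failure_rate(rows: list[dict]) -> float:
    return sum(not r["correct"] for r in rows) / len(rows)

lost = [s for s in sessions if not s["order_kept"]]
kept = [s for s in sessions if s["order_kept"]]

# The classifier only fails when the summarizer fed it a scrambled timeline.
print(f"failure rate when summary loses order: {failure_rate(lost):.0%}")  # 84%
print(f"failure rate when summary keeps order: {failure_rate(kept):.0%}")  # 7%
```

The classifier is mostly innocent; the upstream agent is destroying the signal it needs. You only see that split once you tag and group sessions, which is exactly the grind error analysis automates.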
Your observability tool is a very expensive screenshot
I’ll say what the Langfuse and LangSmith teams won’t say: trace collection is a solved problem. OpenTelemetry commoditized the whole thing. Langfuse, LangSmith, Datadog, Braintrust, Arize. They’re all showing you the same underlying data, just with different UIs on top. The competition at this point is literally about who renders traces more beautifully. And none of them — not one — can tell you why your agent broke.
The way I see the stack:
| Layer | What it does | Status |
|---|---|---|
| Traces + metrics | Collecting and viewing what happened | Solved. Commoditized. |
| Root cause analysis | Understanding why it failed, with evidence | Every tool stops here. |
| Automated fix | Generating a validated fix with before/after proof | Nobody does this. |
The whole observability industry built thermometers. Some of them are really good thermometers, I’ll give them that. But a thermometer doesn’t diagnose strep throat — it just confirms what you already knew, which is that something’s wrong. Your agent is failing. Your users are complaining. You didn’t need a fancier dashboard to tell you that.
Kelet is the doctor.
And look, if you’re already on Langfuse, great, keep it. Kelet pulls your traces directly, no re-instrumentation needed. We’re not replacing anything in your stack. We’re the part that was missing.
Kelet is a detective that never sleeps
Signals are tips. Traces are the crime scene. Kelet is the detective. Unlike your best engineer, it doesn’t get tired at 3pm or push the weird edge case to the next sprint.
When a user gives a thumbs-down, retries a request, or quietly edits what your agent wrote, those are all signals. They point Kelet to where the real problems are hiding. And when a pattern keeps showing up across hundreds of sessions, it surfaces as a named finding with a root cause backed by actual evidence. Not gut feeling. Not “I spent an hour scrolling, and I think maybe it’s the retrieval step.”
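The “quiet edit” signal, for example, can be approximated with nothing fancier than string similarity between what the agent wrote and what the human actually shipped. This is a simplified sketch, not Kelet’s actual detector, and the threshold is a made-up number:

```python
import difflib

def quiet_edit_signal(agent_output: str, final_text: str,
                      threshold: float = 0.9) -> bool:
    """Flag sessions where a human silently rewrote the agent's output.

    Plain string similarity is a stand-in for whatever signal
    extraction a production pipeline would actually use.
    """
    similarity = difflib.SequenceMatcher(None, agent_output, final_text).ratio()
    return similarity < threshold

# Untouched output: no signal.
print(quiet_edit_signal("claim type: water system problem",
                        "claim type: water system problem"))  # False

# The underwriter's silent correction from the story above: signal.
print(quiet_edit_signal("claim type: water system problem",
                        "claim type: weather issue"))         # True
```

Nobody filed a bug in either case. The second session is exactly the kind of failure that never reaches a dashboard unless something is watching for it.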
If you want extra confidence before shipping a fix, there’s GEPA optimization, basically evolutionary search tested against your real production sessions. It shows you the improvement before you deploy anything. No more crossing fingers after a prompt change.
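To make “evolutionary search” concrete, here’s a toy mutate-and-select loop over prompt instructions, scored against a stand-in objective. In the real setup the scorer replays your production sessions against each candidate prompt; everything here (the trait names, the scorer, the loop size) is illustrative, not GEPA’s actual algorithm.

```python
import random

random.seed(0)

# Stand-in objective: a real scorer would replay production sessions
# against the candidate prompt and measure the failure rate.
TARGET_TRAITS = {"preserve event order", "cite the transcript", "state claim type"}
ALL_TRAITS = sorted(TARGET_TRAITS | {"be concise", "use bullet points"})

def score(prompt_traits: frozenset) -> float:
    return len(prompt_traits & TARGET_TRAITS) / len(TARGET_TRAITS)

def mutate(traits: frozenset) -> frozenset:
    # Toggle one instruction in or out of the candidate prompt.
    return traits ^ {random.choice(ALL_TRAITS)}

best = frozenset({"be concise"})
for _ in range(200):
    candidate = mutate(best)
    if score(candidate) >= score(best):  # keep ties so the search can drift
        best = candidate

print(sorted(best & TARGET_TRAITS))
print(score(best))  # reaches 1.0 on this toy objective
```

The point isn’t the toy loop, it’s the validation step: every candidate is judged on the same held-out sessions, so the improvement you see is the improvement you ship.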
You don’t look at Kelet. Kelet looks at your agent.
So far we’ve analyzed over 33,000 production sessions. 73% of the teams we worked with had failure patterns that nobody on the team knew about. Average time from raw traces to a root cause plus a tested fix: 14.3 minutes. I still find that number kind of wild, honestly. It used to take us weeks. (The teams where it’s slowest, I should say, are the ones where failure patterns haven’t had time to repeat yet; 14 minutes assumes there’s enough data for a pattern to surface. Which means the earlier you connect, the faster it gets.)
Try it
Your agent is failing somewhere right now, and scrolling traces won’t fix it.
Start finding failures in 5 minutes
Free during beta. No credit card.
If you’re building agents and want to talk about this: [email protected].
— Almog