7 LLM App Failure Patterns in Production, or Why Your Traces Lie

The trace was right. The diagnosis was wrong.

A security team’s phishing detection agent had been misclassifying the same type of email for three weeks. Every trace was explainable: the model saw more safe content than malicious content, and called it safe. Correct behavior. Wrong result. Nobody had noticed that every misclassified case shared the same structure — a phishing body with a legitimate reply thread below it.

Feed both as one blob and the legitimate tokens dilute the attack signal. The model was being misled by the input.

You can’t see this in one trace. You need hundreds before the shape becomes visible. The fix was a code change — split the body and thread, summarize separately, feed structured context. The agent didn’t change. What it received did.

# Before: entire email passed as a single string
result = classify_email(raw_email_text)

# After: body and thread processed separately
thread_summary = summarize(email.reply_thread)
result = classify_email({
    "body": email.body,
    "thread_summary": thread_summary,
})

The classifier prompt didn’t change. A summarization step was added upstream. Two agents instead of one.
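One way the split itself might look, as a minimal sketch. The `split_email` helper and its regex are assumptions for illustration: the heuristic treats the first "On ... wrote:" header or the first ">"-quoted line as the start of the reply thread. Real mail clients quote in many other ways, so a production version would need a proper parser.

```python
import re

# Hypothetical heuristic: the reply thread starts at the first
# "On ... wrote:" header or the first line quoted with ">".
THREAD_START = re.compile(r"^(On .+ wrote:|>.*)$", re.MULTILINE)


def split_email(raw_email_text: str) -> dict:
    """Split a raw email into the new message body and the quoted thread."""
    match = THREAD_START.search(raw_email_text)
    if match is None:
        # No thread marker found: treat the whole text as the body.
        return {"body": raw_email_text.strip(), "reply_thread": ""}
    return {
        "body": raw_email_text[: match.start()].strip(),
        "reply_thread": raw_email_text[match.start():].strip(),
    }


raw = (
    "Your account is locked. Verify at http://example.test/login\n"
    "\n"
    "On Mon, Jan 6, Alice wrote:\n"
    "> Thanks for sending the quarterly report.\n"
)
parts = split_email(raw)
print(parts["body"])
print(parts["reply_thread"])
```

The two parts can then be summarized and classified separately, so a long legitimate thread no longer outweighs a short malicious body.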

73% of the teams we’ve worked with had active failure patterns their engineers couldn’t name. Most had observability. Most reviewed traces. They were just looking at the wrong level. (More in our launch post.)

The story in one trace isn’t the failure mode. It’s a sample.

The shape only appears at scale

A B2B sales team built a pipeline to score inbound discovery calls. Two agents: a summarizer that extracted key moments from call transcripts, and a scorer that rated deal quality from those summaries.

The scorer kept over-qualifying stalled deals — forecasts were inflated and reps were chasing conversations that had already gone cold. The team spent weeks on it: threshold tuning, retraining on closed-lost examples, different scoring rubrics. Nothing moved.

It wasn’t the scorer.

The summarizer was extracting questions from call transcripts — “what does your current process look like?”, “when are you looking to decide?”, “who else is involved?” — without tracking who asked them. In a sales call, speaker attribution is everything. A prospect asking “when can we get started?” is buying. A rep asking the same question is pushing on silence.

The scorer had no way to distinguish them. Every over-scored stalled deal had the same structure in the raw transcripts: the rep asked all the forward-looking questions. The prospect answered but never initiated one. The summarizer reported the questions. Stripped the speaker. The scorer read buying signals that belonged to the rep.

Every single-session trace looked like a scoring error. Hundreds of sessions revealed the summarizer had been dropping speaker attribution on every call.
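To make the attribution point concrete, here is a rough sketch of counting forward-looking questions per speaker. Everything in it is an assumption for illustration: the `(speaker, text)` turn format, the `CUES` keyword list, and the `forward_questions_by_speaker` helper. A real summarizer would use something better than keyword matching, but the point is that the speaker label travels with the question.

```python
from collections import Counter

# Hypothetical cue phrases that mark a forward-looking question.
CUES = ("when", "next step", "get started", "who else")


def forward_questions_by_speaker(turns, cues=CUES):
    """Count forward-looking questions, keeping the speaker attached."""
    counts = Counter()
    for speaker, text in turns:
        lowered = text.lower()
        if "?" in text and any(cue in lowered for cue in cues):
            counts[speaker] += 1
    return counts


turns = [
    ("rep", "When are you looking to decide?"),
    ("prospect", "Probably sometime next quarter."),
    ("rep", "Who else is involved?"),
    ("prospect", "Just me for now."),
    ("rep", "When can we get started?"),
]
counts = forward_questions_by_speaker(turns)

# The stalled-deal shape: the rep asks every forward-looking question.
rep_only = counts["rep"] > 0 and counts["prospect"] == 0
print(dict(counts), rep_only)
```

With the speaker preserved, a downstream scorer can treat "rep asked all the forward-looking questions" as a warning sign instead of a buying signal.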

Both of these failures were “explained” in individual traces. Neither failure mode was visible in individual traces.

The shapes we’ve named

We’ve looked at 33,000 production sessions across real deployments. Patterns repeat — not the same failures, the same shapes. The vocabulary we use:

Hallucination — the model generates something not grounded in its context. Factual (“your plan includes free returns”) or action-based (“I’ve cancelled your order”). They look identical in a trace, but the fixes are completely different — which is the part that bites you. Applied-llms.org puts the baseline at 5–10% in well-prompted production systems.

Format failure — the model reasoned correctly and output the wrong shape. JSON that doesn’t parse. A required field missing. Your pipeline caught the exception. The user got a fallback. Nobody filed a bug. AgentBench found this accounts for 53% of failures on database tasks — more than reasoning errors.

Context starvation — the model lacks what it needs, or what it has is the wrong shape. The phishing case is the example: the information was all there, but packaging it as one blob destroyed the signal entirely. Most fixes for this end up being code changes — restructuring what goes into the context window — not prompt changes.

Instruction fog — the model can’t fully parse what’s required, picks an interpretation, and runs with it consistently. You won’t see any uncertainty in the output — it committed and moved forward. It just had a different mental model of what you were asking.

Style drift — the model gives the right answer in the wrong register. Formal enterprise assistant going casual. Support bot using internal jargon. This one surfaces as implicit feedback — thumbs down on correct responses — long before anyone complains explicitly.

Edge case blindness — handles the common case well; breaks on inputs outside the design space. The gap isn’t obvious because your eval set tends to share the same assumptions as your prompt — so tests pass cleanly while a whole category of real-world inputs is quietly failing.

Pipeline blindspot — failure shows up downstream, but the cause is upstream. In multi-agent systems this is extremely common: the agent that gets blamed had nothing wrong with it. The sales case above is the cleanest example we have.

These are starting points. Your agent’s actual failure pattern might fit one cleanly, span two, or not match any of them. (We’ve watched teams spend weeks misclassifying a pipeline blindspot as hallucination because the final output looked hallucinated. The name is a hypothesis, not a verdict.)
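Format failure is the easiest of these to start measuring, because the check is mechanical. A minimal sketch, assuming a hypothetical output schema with `label` and `confidence` fields: instead of silently falling back when parsing fails, record why it failed so the counts can be compared across sessions.

```python
import json
from collections import Counter

# Hypothetical schema: the fields the downstream pipeline requires.
REQUIRED_FIELDS = {"label", "confidence"}


def check_output(raw_output: str) -> str:
    """Classify a model output as 'ok' or as a named format failure."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return "unparseable_json"
    if not isinstance(parsed, dict):
        return "not_an_object"
    if REQUIRED_FIELDS - set(parsed):
        return "missing_field"
    return "ok"


outputs = [
    '{"label": "safe", "confidence": 0.9}',
    '{"label": "safe"}',
    'Sure! Here is the JSON: {"label": "safe"}',
    "[1, 2]",
]
tally = Counter(check_output(o) for o in outputs)
print(tally)
```

Logging the tally next to the fallback path turns "the user got a fallback" from an invisible event into a number that can be tracked.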

Why you haven’t found yours yet

A senior AI engineering manager at a public company was direct about the numbers: “Each run takes about 15–20 minutes to review. We need to do 100–1,000s of those to uncover a real failure pattern.”

That’s the constraint — not tooling, not model quality, not prompt craft. Just time. Among the teams we’ve worked with, roughly 30% of AI engineering time goes to manual trace review, and even then, patterns only emerge when someone has enough context to connect sessions from weeks apart.

By the time you can see it, it’s been running for weeks.
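The arithmetic behind that quote is worth spelling out. A quick sketch using only the quoted figures (15 to 20 minutes per run, 100 to 1,000 runs); the `review_hours` helper is just for illustration:

```python
def review_hours(sessions: int, minutes_per_session: float) -> float:
    """Total manual review time in hours."""
    return sessions * minutes_per_session / 60


low = review_hours(100, 15)     # best case: 100 runs at 15 minutes each
high = review_hours(1000, 20)   # worst case: 1,000 runs at 20 minutes each
print(f"{low:.0f} to {high:.0f} hours")  # 25 to 333 hours
```

At the low end that is several working days of uninterrupted review. At the high end it is roughly two months of one engineer's time.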

The question that changes things

Not “what went wrong in this session?” — that finds a bug. “What pattern keeps appearing across sessions?” — that finds the failure mode. You need enough sessions to tell the difference, and you need to be looking at all of them together.
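A minimal sketch of what "looking at all of them together" can mean in practice: tag each session with structural features, then compare failure rates per feature. The session format, the feature names, and the `failure_rate_by_feature` helper are all assumptions for illustration. Real cross-session analysis would involve clustering and far more data than this toy sample.

```python
from collections import defaultdict


def failure_rate_by_feature(sessions):
    """Return the share of failed sessions for each structural feature."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for session in sessions:
        for feature in session["features"]:
            totals[feature] += 1
            if session["failed"]:
                failures[feature] += 1
    return {feature: failures[feature] / totals[feature] for feature in totals}


# Hypothetical sessions, each tagged with structural features of its input.
sessions = [
    {"features": {"has_reply_thread"}, "failed": True},
    {"features": {"has_reply_thread"}, "failed": True},
    {"features": {"has_reply_thread", "has_attachment"}, "failed": True},
    {"features": {"has_attachment"}, "failed": False},
    {"features": set(), "failed": False},
    {"features": {"has_attachment"}, "failed": False},
]
rates = failure_rate_by_feature(sessions)
for feature, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{feature}: {rate:.0%}")
```

No single session in this sample explains anything. The aggregate does: every session with a reply thread failed, which is a hypothesis worth testing rather than a verdict.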

Why evals don’t catch this

Evals test what you expected. Real failure patterns come from users doing things you didn’t design for — they won’t show up in evals you built from your own assumptions. The phishing case: no synthetic eval set would include “phishing body + legitimate reply thread” as a test case. The sales pipeline case: nobody writes “score a call where only the rep asked forward-looking questions” into their test suite. The only source is what real users actually send you.

The shape only becomes visible at volume, and production data takes time to accumulate.

Why not LangSmith / Datadog / Langfuse?

Observability tools show you individual traces. They’re the right tool for understanding a single session. Failure patterns need cross-session analysis — clustering, drift detection, correlation across time. That layer doesn’t exist in trace viewers.

Why not Claude Code / Codex / AI SRE?

Coding agents are genuinely useful here: give one a stack trace and it’ll find your bug. But AI apps are stochastic. A “failure” isn’t a line you can grep for. Finding patterns means clustering thousands of sessions and reasoning about what connects them, which isn’t something a naive LLM agent can do.
