← Blog/Engineering

My fix for hallucinating case notes was weirdly boring: stop stuffing context and split the job in two

Priya SharmaJune 7, 2026 · 10 min read

I keep seeing the same failure pattern in AI workflows that touch sensitive notes. Someone wires up a model to read therapy notes, incident reports, HR case logs, or legal intake summaries, and the first demo looks impressive for about five minutes. Then the model starts making recommendations that sound polished but don’t cleanly map back to the source.

That’s exactly what I was thinking when I read a thread on r/openclaw from someone trying to turn therapy notes into action plans. The problem was painfully familiar: analyze reports, classify each contact, recommend next steps. Instead, the model drifted.

One reply in that thread said something so unsexy that I immediately trusted it. The advice was basically: stop asking one prompt to do everything, extract raw evidence first, and only then generate the action plan from that evidence.

That is such a boring answer. Which is exactly why it tends to work.

A lot of people still look for the glamorous fix first. They want a smarter system prompt, a bigger model, a 1M-token context window, or some elaborate multi-agent loop with confidence scores and retries and self-critique.

But in note-to-action workflows, the real problem usually isn’t that GPT-5, Claude, Grok, Qwen, or Llama suddenly forgot how to reason. The real problem is context window bloat. You gave the model too many jobs, too much text, and too little discipline about what counts as evidence.

Then it improvised, because that’s what language models do when the task is underspecified.

Once I started looking at these failures as architecture problems instead of prompt problems, a lot of things clicked into place. A single-pass prompt for case notes usually asks the model to read messy notes, decide what matters, classify the case, infer missing context, prioritize risk, generate recommendations, and explain why.

That’s not one task. That’s a committee meeting crammed into one API call.

Anthropic has been pretty blunt about this in its guidance on building agents. The systems that survive production are often the simple ones: predefined workflows, clear handoffs, narrow tasks, fewer moving parts. Not because giant autonomous prompts are impossible, but because simple workflows are easier to predict, audit, and debug.

That matters a lot more when the source material is sensitive. If OpenClaw is reading therapy notes stored locally and then routing follow-ups over Slack, Discord, Telegram, or WhatsApp, the last thing you want is “creative synthesis” doing the heavy lifting.

You want a chain of custody for the reasoning.

The two-pass pattern gives you that. It feels slightly annoying at first because it adds another step, and everyone is conditioned to think fewer model calls must be better. But the first time you need to explain why the system recommended something serious, that extra pass suddenly feels very cheap.

Here’s the practical difference.

Single-pass note-to-action prompt

One model call does extraction, classification, and recommendations together
Fast to prototype
Harder to audit because the recommendation comes from the full noisy context
Drift is easy to miss because the output still sounds fluent

Two-pass grounded workflow

Pass 1 extracts evidence or verbatim quotes with source references
Pass 2 generates recommendations from only that evidence
Easier to debug because you can inspect the evidence set separately from the recommendation step
Much safer for sensitive workflows where every claim should tie back to a source

The thing that finally pushed me over the edge on this was long-context research. People still talk about long context as if more text automatically means better grounding, but that’s not really how it works.

The “Lost in the Middle” paper made this painfully clear. Long-context models often perform better when the important information is near the beginning or the end of the prompt, and worse when the key evidence is buried in the middle.

That should make anyone building note-processing workflows a little nervous. A giant stuffed prompt is basically a very efficient way to hide the most important sentence on page 8 between six paragraphs of background and three pages of irrelevant history.

Pinecone’s RAG guidance lands in the same place from a more practical angle. More retrieved text can improve recall at the retrieval stage, but once you dump all of it into the prompt, the model still has to locate and use the right evidence correctly. At some point, adding context stops helping and starts acting like clutter.

While researching OpenClaw memory setups, I ran into another r/openclaw thread where someone said they stuck with a human-editable markdown memory setup plus hybrid retrieval instead of just stuffing everything. That comment stuck with me because it gets at the real issue.

Stuffing feels safe because nothing was left out. But the model still has to find the needle after you dumped the whole garage on top of it.

To be clear, I’m not arguing that long context is useless. That would be the wrong takeaway.

Anthropic has said that if your knowledge base is under roughly 200,000 tokens, sometimes the simplest thing really is to include the whole thing in the prompt instead of building retrieval. And for some jobs, long context plus prompt caching is genuinely great.

If you want to chat with a handbook, analyze a stable corpus repeatedly, or ask questions over a cached 100,000-token prompt, that can be elegant. Anthropic’s examples around prompt caching are especially practical: lower latency, lower repeated cost, and less orchestration overhead.

But that is not the same job as generating high-stakes recommendations from sensitive notes. If I’m asking Claude to answer questions about a policy manual, long context is fine. If I’m asking GPT-5 to turn therapy notes into an action plan, I want explicit evidence extraction first.

Different job, different failure mode.

The workflow I trust now is almost annoyingly plain.

Pass 1: extract only evidence. Ask the model for verbatim quotes, structured facts, and source citations. No recommendations, no broad summary, no “best next action,” no room to freestyle.

Pass 2: generate recommendations from the evidence only. Feed the approved evidence into a second model call and require every recommendation to cite the evidence IDs from pass 1.

And then there’s the step people always want to skip: abstention. If the evidence is weak, conflicting, or incomplete, the workflow should be allowed to say “insufficient evidence” and escalate to a human.

That’s not a bug. That’s the design doing its job.

OpenAI’s more recent hallucination research has made this point pretty clearly: models don’t only hallucinate because retrieval fails. They also hallucinate because training and evaluation often reward guessing over admitting uncertainty.

So if your workflow has no graceful way to abstain, you are quietly training the system architecture to prefer confident nonsense over honest ambiguity. That’s a terrible trade in any workflow touching case notes.

The implementation doesn’t need to be fancy. With an OpenAI-compatible client, the first pass can be as simple as one call that says: extract only verbatim evidence quotes from these notes and return structured JSON with citations.

Then the second pass takes just that evidence and produces the classification or action plan. No giant prompt, no giant memory dump, no hidden leap from “I saw this” to “therefore do that.”

This is also where Standard Compute becomes very practical if you’re running these workflows at scale. A two-pass or three-pass architecture is usually the right design, but per-token pricing punishes you for doing the safer thing. Teams end up collapsing steps together because every extra call feels like a billing event.

That’s backwards.

If you’re building agents in OpenClaw, n8n, Make, Zapier, or custom automations, you should be able to split extraction, reasoning, verification, and escalation into separate calls without doing token math in your head all day. That’s the appeal of Standard Compute’s flat monthly model: you can keep the architecture disciplined instead of constantly asking whether one more grounding pass is “worth the cost.”

And because it’s a drop-in OpenAI API replacement, you don’t have to rebuild your stack to get there. If your current workflow already uses an OpenAI-compatible SDK or HTTP client, the migration path is refreshingly boring, which I mean as a compliment.

OpenClaw is a particularly natural fit for this pattern. Once it’s acting as a self-hosted, model-agnostic gateway with stateful sessions, memory, tools, and routing across channels, it becomes very easy to separate responsibilities. One agent extracts evidence. Another writes recommendations. Session history stays isolated. Sensitive records stay local-first.

That is a much better design than one overloaded agent trying to do everything in one breath.

People often frame this whole problem as a model selection question. Should I switch from GPT-5 to Claude? Should I use Grok? Should I test a local Qwen or Llama variant? Should I pay for a bigger context window?

Sometimes the answer is yes. Better models do help, and routing intelligently across models can help even more.

But if the architecture is wrong, model shopping is mostly procrastination with a benchmark spreadsheet.

Retrieval is similar. For very small corpora, stuffing can be fine. For growing corpora, retrieval plus reranking usually wins because it surfaces the right evidence instead of hoping the model will notice it in a giant blob.

Anthropic’s Contextual Retrieval writeup is one of the more useful pieces here because it focuses on actual retrieval failures, not just vibes. Their reported gains are large enough to matter in production, especially when recommendation quality depends on whether the right chunk was surfaced in the first place.

Here’s the tradeoff in plain English.

Context stuffing

Simpler for small corpora
Easier to prototype because there are fewer moving parts
Gets worse as prompts grow and relevant facts get buried
Encourages “just include everything” thinking, which feels safe but often weakens grounding

Retrieval plus reranking

More setup and more components to maintain
Scales much better as the corpus grows
Stronger when the right evidence would otherwise disappear inside long context
Pairs naturally with a two-pass workflow because retrieval feeds a smaller evidence set into extraction

The funny part is how often the boring fix keeps winning. I saw another r/openclaw thread where someone troubleshooting flaky behavior boiled the lesson down to this: ask the model to do a very specific task.

That sounds almost embarrassingly basic. It is also one of the most reliable rules in applied LLM work.

Narrow tasks are easier to evaluate, easier to compare across models, easier to guardrail, and easier to debug when they fail. Broad tasks hide sloppiness because the output can still sound smart.

And with case notes, “sounds smart” is exactly the danger. A recommendation that cites the wrong sentence is worse than no recommendation at all.

So this is the default I’d use until proven otherwise. Retrieve or select the smallest useful evidence set. Extract verbatim quotes and structured facts first. Attach source references to every extracted item. Generate recommendations only from that extracted evidence. Allow abstention when the evidence is weak or missing.

That’s it.

It’s not glamorous. It won’t get framed as frontier magic. It’s not the kind of trick people brag about after a weekend hackathon.

But it’s the pattern I trust when the workflow actually matters.

Long context is real. Prompt caching is real. OpenClaw is a solid home for local-first agent workflows. Standard Compute makes multi-step architectures much easier to run without per-token anxiety, especially when you want separate extraction, recommendation, and verification passes instead of one overloaded prompt.

Still, the core lesson is simpler than any of that.

When a model hallucinates on case notes, the first question usually isn’t “which model should I switch to?” It’s “why did I ask one model call to do three jobs while hiding the evidence in the middle of a giant prompt?”

My fix for hallucinating case notes was weirdly boring: stop stuffing context and split the job in two

Keep reading

I thought a family calendar bot should run everything until I realized AI is way better at intake than decisions

I stopped letting my AI agent do the final click, and my automations got way more useful