My fix for hallucinating case notes was weirdly boring: stop stuffing context and split the job in two

Sarah Mitchell · June 7, 2026 · 9 min read

[Figure: Two-pass note → action workflow. "Stuffed context" (case note, history, RAG dump, policies) leads to shaky output; "Split in two" (show receipts, evidence → action) is grounded by cited evidence.]

The best fix for context window bloat in sensitive note-to-action workflows is usually architectural, not prompt magic: use two passes. First extract verbatim evidence with citations, then generate recommendations from only that evidence. Research on long context, retrieval, and agent design all points the same way: smaller, grounded inputs beat one giant prompt.

I found this while reading a thread on r/openclaw from someone trying to turn therapy notes into action plans.

The problem was painfully familiar. They wanted OpenClaw to analyze reports, classify each contact, and recommend next steps. Instead, the model drifted. It hallucinated. It made recommendations that sounded polished but weren’t cleanly tied to the source notes.

And one reply cut through the fog better than most prompt engineering threads I’ve seen all year:

“Hallucination in data extraction usually happens when the prompt is too open-ended or the context window is crowded. Try implementing a two-step verification process: first, have the agent extract raw quotes from the notes that support the action item, and then have a second pass generate the action plan based only on those quotes.”

That is such a boring answer.

Which is exactly why I trust it.

Everybody wants the sexy fix. A smarter system prompt. A bigger model. A 1M-token context window. Some baroque agent loop with five tools and a confidence score.

But for sensitive workflows like therapy notes, incident reports, clinical summaries, HR case logs, or legal intake notes, the villain usually isn’t that GPT-5 or Claude or Qwen is “bad at reasoning.” The villain is context window bloat.

You gave the model too many jobs at once, too much text at once, and too little discipline about what counts as evidence. Then you acted surprised when it improvised.

The real bug wasn’t hallucination. It was architecture.

Once you see it, you can’t unsee it.

A single-pass note-to-action prompt usually asks one model call to do all of this at once:

read a pile of messy notes
decide what matters
classify the case
infer missing context
prioritize risk
generate recommendations
explain why

That’s not one task. That’s a committee meeting.

Anthropic has been pretty direct about this in its guidance on building agents: the systems that work in production are often simple, predefined workflows, not one giant autonomous prompt. Not because giant prompts are impossible. Because workflows are easier to predict, audit, and debug.

That matters a lot more when the input is sensitive.

If an OpenClaw agent is reading therapy notes stored locally, maybe routing follow-ups over WhatsApp, Slack, or Discord, you do not want “creative synthesis” to be the thing holding your process together. You want a chain of custody.

And this is where the two-pass pattern wins.

Approach	What actually happens
Single-pass note-to-action prompt	One model call does extraction, classification, and recommendations together. Fast to prototype, but recommendations are generated from the full noisy context, so drift is harder to catch.
Two-pass grounded workflow	Pass 1 extracts evidence or quotes with source references; Pass 2 generates recommendations from only approved evidence. More auditable, easier to debug, and much safer for sensitive workflows.

That extra pass feels inefficient right up until the first time you have to explain why the model recommended something serious.

Then it feels cheap.

Why does stuffing more text make the model worse?

This is the part people still treat like superstition.

They’ll say, “Come on, the whole point of long-context models is that they can handle long context.” Sure. Up to a point. But more context is not the same thing as better grounding.

The clearest research on this is the “Lost in the Middle” paper. The short version: long-context models often do better when relevant information appears near the beginning or end of the prompt, and worse when the key evidence is buried in the middle.

That should make every note-processing workflow designer slightly uncomfortable.

Because what is a giant stuffed prompt, really? It’s a way of burying the important sentence on page 8 between six paragraphs of background and three pages of irrelevant history.

Pinecone’s RAG guide says basically the same thing in practical terms: adding more retrieved text can improve retrieval recall, but it can also reduce the model’s ability to actually recall and use the right evidence once it’s inside the prompt.

That matches what I keep seeing in the wild. While researching OpenClaw memory setups, I ran into another r/openclaw discussion where one user said, “the human-editable markdown part is why I stuck with memory-lancedb-pro for daily use. I still keep MEMORY.md and project notes readable, then let hybrid retrieval pull the right chunks instead of stuffing”.

Exactly.

Stuffing feels safe because nothing was left out. But the model still has to find the needle after you dumped the whole garage on top of it.

So should you avoid long context entirely?

No. That would be the wrong lesson.

Anthropic explicitly says that if your knowledge base is smaller than about 200,000 tokens, sometimes the simplest move is to include the whole thing in the prompt instead of building retrieval. That can absolutely be the right call.
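If you want that rule of thumb as a check rather than a vibe, here is a minimal sketch. The 200,000-token threshold is the figure cited above; the four-characters-per-token estimate and the names `estimateTokens` and `chooseStrategy` are my own assumptions, and a real tokenizer would give a better count.

```javascript
// Rough heuristic: ~4 characters per token for English text. This is an
// assumption; use your model's real tokenizer when the answer is close.
const CHARS_PER_TOKEN = 4;
// Threshold based on the ~200,000-token figure cited above.
const STUFFING_LIMIT_TOKENS = 200_000;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Returns "stuff" when the whole corpus plausibly fits in one prompt,
// otherwise "retrieve".
function chooseStrategy(documents) {
  const totalTokens = documents.reduce((sum, doc) => sum + estimateTokens(doc), 0);
  return {
    totalTokens,
    strategy: totalTokens < STUFFING_LIMIT_TOKENS ? "stuff" : "retrieve",
  };
}

const small = chooseStrategy(["a".repeat(40_000), "b".repeat(20_000)]);
console.log(small); // { totalTokens: 15000, strategy: 'stuff' }
```

This only answers "does it fit?", not "should recommendations be generated from it?", which is the distinction the rest of this post is about.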

And long context has some genuinely great use cases.

Anthropic’s old 100K context announcement showed Claude-Instant finding a modified line in The Great Gatsby (about 72K tokens) in 22 seconds. Its prompt caching examples are also useful in practice: cache a 100,000-token prompt once, then ask repeated questions against it. Anthropic says prompt caching can cut latency by more than 2x and costs by up to 90%; in one example, time to first token dropped from 11.5s to 2.4s, a 79% reduction.
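For anyone who likes to check the arithmetic, the two latency numbers quoted above are consistent with both the "79%" and the "more than 2x" claims:

```javascript
// Time to first token from the example above, in seconds.
const before = 11.5;
const after = 2.4;

const reductionPct = ((before - after) / before) * 100;
const speedup = before / after;

console.log(reductionPct.toFixed(1)); // 79.1
console.log(speedup.toFixed(1)); // 4.8
```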

That is fantastic for “chat with a book,” policy Q&A, or repeated analysis over a stable corpus.

But that is not the same thing as generating high-stakes recommendations from sensitive notes.

If I’m asking Claude to answer questions about a handbook, cached long context is elegant. If I’m asking GPT-5 to turn therapy notes into an action plan, I want explicit evidence extraction first. Different job. Different failure mode.

What should the workflow actually look like?

The simplest version is almost annoyingly plain.

Pass 1: extract only evidence

Tell the model to return verbatim quotes, structured fields, and source citations. No recommendations. No summarizing beyond what can be directly supported.
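"Verbatim" is only worth something if you check it. Here is a minimal sketch of that check, assuming pass 1 returns items shaped like `{ id, noteId, quote }`. Those field names, the `verifyEvidence` helper, and the sample note are all made up for illustration. A quote is approved only if it appears character-for-character in the note it cites.

```javascript
// Hypothetical pass-1 item shape: { id, noteId, quote }.
function verifyEvidence(notesById, extracted) {
  const approved = [];
  const rejected = [];
  for (const item of extracted) {
    const source = notesById[item.noteId];
    // A quote counts as evidence only if it appears verbatim in the cited note.
    const start = typeof source === "string" ? source.indexOf(item.quote) : -1;
    if (item.quote && start !== -1) {
      // Keep character offsets so a reviewer can jump to the exact span.
      approved.push({ ...item, start, end: start + item.quote.length });
    } else {
      rejected.push(item);
    }
  }
  return { approved, rejected };
}

// Invented sample data, for illustration only.
const notes = {
  "note-1": "Client missed two sessions. Reports trouble sleeping since March.",
};
const modelOutput = [
  { id: "E1", noteId: "note-1", quote: "Reports trouble sleeping since March." },
  { id: "E2", noteId: "note-1", quote: "Client is at high risk." }, // not in the note
];

const { approved, rejected } = verifyEvidence(notes, modelOutput);
console.log(approved.map((e) => e.id), rejected.map((e) => e.id)); // [ 'E1' ] [ 'E2' ]
```

Exact substring matching is deliberately strict. It will reject quotes where the model "helpfully" fixed a typo, which in this kind of workflow is probably what you want.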

Pass 2: generate recommendations from the evidence only

Now feed only the approved evidence into a second model call. Ask for classification, prioritization, or next steps. Require every recommendation to cite the evidence IDs from pass 1.
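A sketch of the glue code around that second call, under the assumption that evidence items carry an `id` and `quote` and that the model returns recommendations with an `evidenceIds` array. The field names, both helper functions, and the sample data are hypothetical. The point is that the raw notes never reach pass 2, and any recommendation citing an ID that pass 1 never produced gets flagged.

```javascript
// Build the pass-2 input from approved evidence only; raw notes are not included.
function buildPass2Input(approvedEvidence) {
  const lines = approvedEvidence.map((e) => `[${e.id}] "${e.quote}"`);
  return [
    "Recommend next steps using ONLY the evidence below.",
    "Cite at least one evidence ID for every recommendation.",
    "",
    ...lines,
  ].join("\n");
}

// Flag recommendations with no citations or with IDs pass 1 never produced.
function checkCitations(recommendations, approvedEvidence) {
  const knownIds = new Set(approvedEvidence.map((e) => e.id));
  return recommendations.map((rec) => {
    const cited = rec.evidenceIds ?? [];
    const grounded = cited.length > 0 && cited.every((id) => knownIds.has(id));
    return { ...rec, grounded };
  });
}

// Invented sample data, for illustration only.
const evidence = [{ id: "E1", quote: "Reports trouble sleeping since March." }];
const recs = [
  { text: "Ask about sleep at next session.", evidenceIds: ["E1"] },
  { text: "Refer for medication review.", evidenceIds: ["E9"] }, // unknown ID
  { text: "Schedule weekly check-ins." }, // no citation at all
];

console.log(buildPass2Input(evidence));
console.log(checkCitations(recs, evidence).map((r) => r.grounded)); // [ true, false, false ]
```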

Pass 3, if needed: abstain or escalate

This is the part people skip.

OpenAI’s 2025 research on hallucinations makes an uncomfortable point: models don’t only hallucinate because retrieval is bad. They also hallucinate because training and evaluation often reward guessing instead of admitting uncertainty.

So if the evidence is weak, conflicting, or incomplete, the workflow should allow the model to say “insufficient evidence” and kick the case to a human.

That is not a failure. That is the whole point.
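In code, abstention is mostly a routing decision. This sketch assumes the recommendation call was instructed to return JSON with a `status` field that may be `"insufficient_evidence"`. That contract, the `routeResult` name, and the route labels are assumptions for illustration. Anything unparseable, empty, or explicitly abstaining goes to a human.

```javascript
// Decide whether a model response can proceed or must go to a human reviewer.
function routeResult(rawModelOutput) {
  let parsed;
  try {
    parsed = JSON.parse(rawModelOutput);
  } catch {
    return { route: "human_review", reason: "unparseable output" };
  }
  if (parsed === null || typeof parsed !== "object") {
    return { route: "human_review", reason: "unexpected output shape" };
  }
  // The model is allowed to abstain; treat that as a valid outcome.
  if (parsed.status === "insufficient_evidence") {
    return { route: "human_review", reason: parsed.reason ?? "model abstained" };
  }
  if (!Array.isArray(parsed.recommendations) || parsed.recommendations.length === 0) {
    return { route: "human_review", reason: "no recommendations returned" };
  }
  return { route: "auto", recommendations: parsed.recommendations };
}

console.log(routeResult('{"status":"insufficient_evidence","reason":"notes conflict"}'));
// { route: 'human_review', reason: 'notes conflict' }
```

Note that the default path is the human one. The workflow has to earn the `"auto"` route, not the other way around.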

Here’s the shape of it with an OpenAI-compatible LLM client:

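import { readFileSync } from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
// "notes.txt" is a placeholder path for the note text being processed.
const notes = readFileSync("notes.txt", "utf8");

// Pass 1 only: ask for verbatim evidence, not recommendations.
const response = await client.responses.create({
  model: "gpt-5.5",
  instructions: "Extract only verbatim evidence quotes from these notes and return JSON with citations.",
  input: notes
});
console.log(response.output_text);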

If you’re running this through OpenClaw, the setup is straightforward:

npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw dashboard

And if you want to sanity-check the local services before wiring in note workflows:

openclaw status
openclaw status --all
openclaw health --json

OpenClaw’s docs recommend Node 24, or Node 22 LTS 22.19+ for compatibility. That matters more than it sounds like it should.

Once OpenClaw is acting as a self-hosted, model-agnostic gateway with stateful sessions, memory, tools, and multi-agent routing across Slack, Telegram, Discord, and WhatsApp, it becomes a very natural place to split responsibilities. One agent extracts evidence. Another agent writes recommendations. Session history stays isolated. Sensitive records stay local-first.

That is a much better design than one overloaded agent trying to do everything in one breath.

What about retrieval? Is RAG actually better than stuffing?

For small corpora, not always.

For growing corpora, yes, usually by a lot.

Anthropic’s Contextual Retrieval writeup is one of the more useful practical pieces here. It says the method reduced failed retrievals by 49%, and by 67% when combined with reranking.

Those are not cosmetic gains.

If your recommendation step is built on top of the wrong chunks, it doesn’t matter whether you use Claude Opus, GPT-5, Grok, Qwen, or Llama. The answer will still drift because the evidence was never surfaced cleanly.

Strategy	Tradeoff
Context stuffing	Simpler for small corpora. But as prompts grow, relevant facts can get buried in the middle and become harder for the model to use correctly.
Retrieval + reranking	More moving parts, but it scales better and is stronger when the right evidence would otherwise be lost inside long context.
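To make "retrieval + reranking" concrete, here is a toy two-stage sketch. It is not Anthropic's Contextual Retrieval method. The keyword-count retriever and the `coverage` scoring function are stand-ins I made up for a real embedding or BM25 search and a real reranker model, and the chunks are invented. What it shows is the shape: a cheap, recall-oriented first stage, then a stricter second stage that decides what actually enters the prompt.

```javascript
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Stage 1: cheap and recall-oriented. Counts query-term occurrences per chunk.
function retrieve(query, chunks, topN) {
  const terms = new Set(tokenize(query));
  return chunks
    .map((chunk) => ({ chunk, score: tokenize(chunk.text).filter((t) => terms.has(t)).length }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((c) => c.chunk);
}

// Stage 2: precision-oriented. `scoreFn` stands in for a real reranker model.
function rerank(query, candidates, scoreFn, topK) {
  return candidates
    .map((chunk) => ({ chunk, score: scoreFn(query, chunk.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((c) => c.chunk);
}

// Toy reranker: how many distinct query terms does the chunk cover?
const coverage = (query, text) => {
  const words = new Set(tokenize(text));
  return tokenize(query).filter((t) => words.has(t)).length;
};

// Invented sample chunks, for illustration only.
const chunks = [
  { id: "c1", text: "Client missed two sessions in April." },
  { id: "c2", text: "Sessions are scheduled weekly; sessions run 50 minutes; sessions are billed monthly." },
  { id: "c3", text: "Billing address updated." },
];

const candidates = retrieve("missed sessions", chunks, 2);
console.log(candidates.map((c) => c.id)); // [ 'c2', 'c1' ]
console.log(rerank("missed sessions", candidates, coverage, 1).map((c) => c.id)); // [ 'c1' ]
```

The first stage is fooled by a chunk that repeats one keyword; the second stage fixes the ordering before anything reaches the model.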

The funny part is that people often frame this as a model question. Should I switch from GPT-5 to Claude? Should I use a local Qwen or Llama variant? Should I upgrade to a bigger context window?

Sometimes that helps.

But if your architecture is wrong, model shopping is just procrastination with a benchmark spreadsheet.

Why do narrower tasks keep winning?

Because models are still easier to steer when the target is small.

I was reminded of that by one more r/openclaw thread where someone troubleshooting unreliable behavior boiled the fix down to this: ask questions with a very specific task.

That sounds almost insultingly basic.

It is also true.

Specific tasks make it easier to evaluate outputs, add guardrails, compare models, and know when the workflow broke. Broad tasks hide sloppiness because the output can still sound fluent.

And fluent is exactly what gets teams into trouble with case notes.

A recommendation that cites the wrong sentence is worse than no recommendation at all.

The boring fix is the one that survives contact with reality

If you’re building any workflow that turns sensitive notes into decisions, I would use this default until proven otherwise:

Retrieve or select the smallest useful evidence set
Extract verbatim quotes and structured facts first
Attach source references to every extracted item
Generate recommendations only from extracted evidence
Allow abstention when evidence is weak or missing
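Those five steps compose into a very small orchestrator. This is a sketch, not a definitive implementation: `runNoteWorkflow` and the `select`, `extract`, and `recommend` step names are hypothetical, and the steps are injected as plain functions so the example stays model-agnostic. In a real system they would wrap retrieval and model calls and would almost certainly be async.

```javascript
// `steps` supplies the three model-facing operations; everything else is glue.
function runNoteWorkflow(notes, steps) {
  // 1. Retrieve or select the smallest useful evidence set.
  const selected = steps.select(notes);
  // 2. Extract verbatim quotes and structured facts first.
  // 3. Keep only items that carry a source reference.
  const evidence = steps.extract(selected).filter((e) => e.quote && e.source);
  // 5. Abstain when evidence is weak or missing.
  if (evidence.length === 0) {
    return { status: "insufficient evidence", evidence, recommendations: [] };
  }
  // 4. The recommendation step sees extracted evidence only, never raw notes.
  const recommendations = steps.recommend(evidence);
  return { status: "ok", evidence, recommendations };
}

// Stub steps and invented data, for illustration only.
const steps = {
  select: (notes) => notes.slice(0, 1),
  extract: (selected) =>
    selected.map((n, i) => ({ id: `E${i + 1}`, quote: n.text, source: n.id })),
  recommend: (evidence) =>
    evidence.map((e) => ({ text: "Follow up on attendance.", evidenceIds: [e.id] })),
};

const result = runNoteWorkflow([{ id: "note-1", text: "Client missed two sessions." }], steps);
console.log(result.status, result.recommendations[0].evidenceIds); // ok [ 'E1' ]
```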

That’s it.

Not glamorous. Not viral. Not the kind of thing that gets framed as frontier magic.

But it’s the pattern I trust.

Long context is real. Prompt caching is real. A good OpenAI-compatible LLM stack can make repeated analysis fast and cheap. OpenClaw is a very plausible home for this architecture if you want local control and separate agents for extraction and recommendation.

Still, the core lesson is simpler than any of that.

When a model hallucinates on case notes, the first question usually isn’t “which model should I switch to?”

It’s “why did I ask one model call to do three jobs while hiding the evidence in the middle of a giant prompt?”

Frequently Asked Questions

How do I stop an LLM from hallucinating when turning notes into action plans?

Split the workflow into two passes. First extract verbatim evidence with citations, then generate recommendations using only that extracted evidence. This makes the process easier to audit and reduces drift from noisy source text.

Is a bigger context window enough to fix hallucinations?

No. Larger context windows help in some cases, especially for smaller corpora or repeated Q&A with prompt caching, but they do not replace grounding. If relevant evidence gets buried in the middle of a huge prompt, model performance can still degrade.

When should I use retrieval instead of stuffing everything into the prompt?

If your corpus is small, sometimes including the whole thing is simplest; Anthropic says this can work under about 200,000 tokens. As the corpus grows, retrieval plus reranking usually becomes better because it surfaces the right evidence instead of burying it in long context.

Why is a two-pass workflow better for therapy notes or other sensitive records?

Sensitive workflows need traceability. A two-pass design separates evidence extraction from recommendation generation, so every action can be tied back to specific quotes or fields. That makes errors easier to catch and supports human review when stakes are high.

Can I build this with an OpenAI-compatible LLM and OpenClaw?

Yes. OpenClaw is a self-hosted, model-agnostic gateway with sessions, memory, tools, and multi-agent routing, so it fits a design where one agent extracts evidence and another creates recommendations. Any OpenAI-compatible LLM client can handle the model calls as long as you keep the passes separate.