The best fix for context window bloat in sensitive note-to-action workflows is usually architectural, not prompt magic: use two passes. First extract verbatim evidence with citations, then generate recommendations from only that evidence. Research on long context, retrieval, and agent design all points the same way: smaller, grounded inputs beat one giant prompt.
I found this while reading a thread on r/openclaw from someone trying to turn therapy notes into action plans.
The problem was painfully familiar. They wanted OpenClaw to analyze reports, classify each contact, and recommend next steps. Instead, the model drifted. It hallucinated. It made recommendations that sounded polished but weren’t cleanly tied to the source notes.
And one reply cut through the fog better than most prompt engineering threads I’ve seen all year:
“Hallucination in data extraction usually happens when the prompt is too open-ended or the context window is crowded. Try implementing a two-step verification process: first, have the agent extract raw quotes from the notes that support the action item, and then have a second pass generate the action plan based only on those quotes.”
That is such a boring answer.
Which is exactly why I trust it.
Everybody wants the sexy fix. A smarter system prompt. A bigger model. A 1M-token context window. Some baroque agent loop with five tools and a confidence score.
But for sensitive workflows like therapy notes, incident reports, clinical summaries, HR case logs, or legal intake notes, the villain usually isn’t that GPT-5 or Claude or Qwen is “bad at reasoning.” The villain is context window bloat.
You gave the model too many jobs at once, too much text at once, and too little discipline about what counts as evidence. Then you acted surprised when it improvised.
The real bug wasn’t hallucination. It was architecture.
Once you see it, you can’t unsee it.
A single-pass note-to-action prompt usually asks one model call to do all of this at once:
- read a pile of messy notes
- decide what matters
- classify the case
- infer missing context
- prioritize risk
- generate recommendations
- explain why
That’s not one task. That’s a committee meeting.
Anthropic has been pretty direct about this in its guidance on building agents: the systems that work in production are often simple, predefined workflows, not one giant autonomous prompt. Not because giant prompts are impossible. Because workflows are easier to predict, audit, and debug.
That matters a lot more when the input is sensitive.
If an OpenClaw agent is reading therapy notes stored locally, maybe routing follow-ups over WhatsApp, Slack, or Discord, you do not want “creative synthesis” to be the thing holding your process together. You want a chain of custody.
And this is where the two-pass pattern wins.
| Approach | What actually happens |
|---|---|
| Single-pass note-to-action prompt | One model call does extraction, classification, and recommendations together. Fast to prototype, but recommendations are generated from the full noisy context, so drift is harder to catch. |
| Two-pass grounded workflow | Pass 1 extracts evidence or quotes with source references; Pass 2 generates recommendations from only approved evidence. More auditable, easier to debug, and much safer for sensitive workflows. |
That extra pass feels inefficient right up until the first time you have to explain why the model recommended something serious.
Then it feels cheap.
Why does stuffing more text make the model worse?
This is the part people still treat like superstition.
They’ll say, “Come on, the whole point of long-context models is that they can handle long context.” Sure. Up to a point. But more context is not the same thing as better grounding.
The clearest research on this is the “Lost in the Middle” paper. The short version: long-context models often do better when relevant information appears near the beginning or end of the prompt, and worse when the key evidence is buried in the middle.
That should make every note-processing workflow designer slightly uncomfortable.
Because what is a giant stuffed prompt, really? It’s a way of burying the important sentence on page 8 between six paragraphs of background and three pages of irrelevant history.
Pinecone’s RAG guide says basically the same thing in practical terms: adding more retrieved text raises the odds that the right passage is somewhere in the prompt, but it can also make the model worse at finding and using that passage once it’s there.
That matches what I keep seeing in the wild. While researching OpenClaw memory setups, I ran into another r/openclaw discussion where one user said, “the human-editable markdown part is why I stuck with memory-lancedb-pro for daily use. I still keep MEMORY.md and project notes readable, then let hybrid retrieval pull the right chunks instead of stuffing”.
Exactly.
Stuffing feels safe because nothing was left out. But the model still has to find the needle after you dumped the whole garage on top of it.
So should you avoid long context entirely?
No. That would be the wrong lesson.
Anthropic explicitly says that if your knowledge base is smaller than about 200,000 tokens, sometimes the simplest move is to include the whole thing in the prompt instead of building retrieval. That can absolutely be the right call.
And long context has some genuinely great use cases.
Anthropic’s old 100K context announcement showed Claude-Instant finding a modified line in The Great Gatsby (about 72K tokens) in 22 seconds. Its prompt caching examples are just as practical: cache a 100,000-token prompt once, then ask repeated questions against it. Anthropic says prompt caching can cut latency by more than 2x and costs by up to 90%; in one example, time to first token dropped from 11.5s to 2.4s, a 79% reduction.
That is fantastic for “chat with a book,” policy Q&A, or repeated analysis over a stable corpus.
But that is not the same thing as generating high-stakes recommendations from sensitive notes.
If I’m asking Claude to answer questions about a handbook, cached long context is elegant. If I’m asking GPT-5 to turn therapy notes into an action plan, I want explicit evidence extraction first. Different job. Different failure mode.
What should the workflow actually look like?
The simplest version is almost annoyingly plain.
Pass 1: extract only evidence
Tell the model to return verbatim quotes, structured fields, and source citations. No recommendations. No summarizing beyond what can be directly supported.
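One cheap guardrail for this pass is checking that every returned quote really is verbatim. The sketch below assumes pass 1 returns items shaped like `{ id, quote, source }` and that you hold the raw notes in a `notesById` map; both shapes and the `verifyQuotes` helper are illustrative names, not something OpenClaw or any SDK defines.

```javascript
// Hypothetical pass-1 item shape: { id, quote, source }.
// `notesById` maps a source ID to the raw note text (also an assumption).

// Collapse runs of whitespace so line wraps do not cause false mismatches.
function normalize(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Split evidence into quotes that appear verbatim in the cited note and
// quotes that should be rejected before pass 2 ever sees them.
function verifyQuotes(notesById, evidence) {
  const approved = [];
  const rejected = [];
  for (const item of evidence) {
    const note = notesById[item.source];
    const found =
      typeof note === "string" &&
      item.quote.trim().length > 0 &&
      normalize(note).includes(normalize(item.quote));
    (found ? approved : rejected).push(item);
  }
  return { approved, rejected };
}

// Usage with made-up sample data.
const notesById = {
  "note-1": "Client reported poor sleep this week.\nMissed two appointments.",
};
const evidence = [
  { id: "E1", quote: "Missed two appointments.", source: "note-1" },
  { id: "E2", quote: "Client is at high risk.", source: "note-1" },
];
const result = verifyQuotes(notesById, evidence);
console.log(result.approved.map((e) => e.id)); // [ 'E1' ]
console.log(result.rejected.map((e) => e.id)); // [ 'E2' ]
```

Anything in `rejected` is a paraphrase or an invention, so it never becomes input to the recommendation step. A plain substring check is deliberately strict; whether to tolerate punctuation or casing differences is a judgment call for your own workflow.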
Pass 2: generate recommendations from the evidence only
Now feed only the approved evidence into a second model call. Ask for classification, prioritization, or next steps. Require every recommendation to cite the evidence IDs from pass 1.
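Both rules here, "only approved evidence" and "cite the evidence IDs", can be enforced in plain code around the model call instead of trusted to the prompt. This is a sketch under assumptions: `buildPass2Input`, `checkCitations`, and the `{ text, evidenceIds }` recommendation shape are hypothetical names chosen for illustration.

```javascript
// Build the pass-2 input from approved evidence only; raw notes never appear.
function buildPass2Input(approvedEvidence) {
  const lines = approvedEvidence.map(
    (e) => `[${e.id}] "${e.quote}" (source: ${e.source})`
  );
  return [
    "Use ONLY the evidence below. Cite evidence IDs for every recommendation.",
    "If the evidence does not support a recommendation, do not make it.",
    "",
    ...lines,
  ].join("\n");
}

// Flag any recommendation that cites nothing or cites an unknown ID.
function checkCitations(recommendations, approvedEvidence) {
  const known = new Set(approvedEvidence.map((e) => e.id));
  return recommendations.map((rec) => {
    const ids = Array.isArray(rec.evidenceIds) ? rec.evidenceIds : [];
    const unknown = ids.filter((id) => !known.has(id));
    return { ...rec, valid: ids.length > 0 && unknown.length === 0, unknown };
  });
}

// Usage with made-up data standing in for a parsed pass-2 response.
const approved = [
  { id: "E1", quote: "Missed two appointments.", source: "note-1" },
];
const recs = [
  { text: "Schedule a check-in call.", evidenceIds: ["E1"] },
  { text: "Refer to a specialist.", evidenceIds: ["E9"] },
  { text: "Increase session frequency.", evidenceIds: [] },
];
console.log(buildPass2Input(approved));
console.log(checkCitations(recs, approved).map((r) => r.valid));
// [ true, false, false ]
```

A recommendation marked `valid: false` is not necessarily wrong, but it has no chain of custody, which is reason enough to drop it or send it for review.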
Pass 3 if needed: abstain or escalate
This is the part people skip.
OpenAI’s 2025 research on hallucinations makes an uncomfortable point: models don’t only hallucinate because retrieval is bad. They also hallucinate because training and evaluation often reward guessing instead of admitting uncertainty.
So if the evidence is weak, conflicting, or incomplete, the workflow should allow the model to say “insufficient evidence” and kick the case to a human.
That is not a failure. That is the whole point.
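Abstention works best as an explicit branch in code rather than something you hope the model volunteers. A minimal sketch, assuming a hypothetical `triage` helper: the `minEvidence` threshold and the `conflicts` flag (set by whatever upstream check or reviewer you use) are made up for illustration, and the right values depend on your workflow.

```javascript
// Decide whether pass 2 may run or the case should go to a human.
// `minEvidence` is an illustrative threshold, not a recommended value.
function triage(approvedEvidence, { minEvidence = 2, conflicts = false } = {}) {
  if (approvedEvidence.length === 0) {
    return { action: "escalate", reason: "insufficient evidence: no verified quotes" };
  }
  if (conflicts) {
    return { action: "escalate", reason: "conflicting evidence" };
  }
  if (approvedEvidence.length < minEvidence) {
    return { action: "escalate", reason: "insufficient evidence: too few verified quotes" };
  }
  return { action: "recommend", reason: "evidence threshold met" };
}

// Usage with placeholder evidence items.
console.log(triage([]).action); // escalate
console.log(triage([{ id: "E1" }, { id: "E2" }]).action); // recommend
console.log(triage([{ id: "E1" }, { id: "E2" }], { conflicts: true }).action); // escalate
```

The point of returning a `reason` is that the human who picks up the escalated case can see why the workflow stopped.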
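Here’s the shape of pass 1 with an OpenAI-compatible LLM client:

```javascript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
const notes = "<note text goes here>"; // placeholder: pass the real notes

const response = await client.responses.create({
  model: "gpt-5.5",
  instructions:
    "Extract only verbatim evidence quotes from the notes and return JSON with citations.",
  input: notes,
});

console.log(response.output_text);
```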
If you’re running this through OpenClaw, the setup is straightforward:
```shell
npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw dashboard
```
And if you want to sanity-check the local services before wiring in note workflows:
```shell
openclaw status
openclaw status --all
openclaw health --json
```
OpenClaw’s docs recommend Node 24, or Node 22 LTS (22.19 or later) for compatibility. That matters more than it sounds like it should.
Because once OpenClaw is acting as a self-hosted, model-agnostic gateway with stateful sessions, memory, tools, and multi-agent routing across Slack, Telegram, Discord, and WhatsApp, it becomes a very natural place to split responsibilities. One agent extracts evidence. Another agent writes recommendations. Session history stays isolated. Sensitive records stay local-first.
That is a much better design than one overloaded agent trying to do everything in one breath.
What about retrieval? Is RAG actually better than stuffing?
For small corpora, not always.
For growing corpora, yes, usually by a lot.
Anthropic’s Contextual Retrieval writeup is one of the more useful practical pieces here. It says the method reduced failed retrievals by 49%, and by 67% when combined with reranking.
Those are not cosmetic gains.
If your recommendation step is built on top of the wrong chunks, it doesn’t matter whether you use Claude Opus, GPT-5, Grok, Qwen, or Llama. The answer will still drift because the evidence was never surfaced cleanly.
| Strategy | Tradeoff |
|---|---|
| Context stuffing | Simpler for small corpora. But as prompts grow, relevant facts can get buried in the middle and become harder for the model to use correctly. |
| Retrieval + reranking | More moving parts, but it scales better and is stronger when the right evidence would otherwise be lost inside long context. |
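To make the "retrieve, then keep only the best few chunks" idea concrete, here is a toy sketch. It is not embedding retrieval or a real reranker: it scores chunks by plain word overlap with the query, which is a deliberately crude stand-in. The `selectChunks` helper and the sample chunks are assumptions for illustration.

```javascript
// Toy stand-in for retrieval: score chunks by word overlap with the query
// and keep the top k. Real systems would use embeddings, BM25, or a reranker.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function selectChunks(query, chunks, k = 2) {
  const queryTerms = new Set(tokenize(query));
  return chunks
    .map((chunk) => {
      const terms = new Set(tokenize(chunk.text));
      let overlap = 0;
      for (const term of queryTerms) {
        if (terms.has(term)) overlap += 1;
      }
      return { ...chunk, score: overlap };
    })
    .filter((chunk) => chunk.score > 0) // drop chunks with no shared terms
    .sort((a, b) => b.score - a.score) // highest overlap first
    .slice(0, k);
}

// Usage with made-up chunks.
const chunks = [
  { id: "c1", text: "Client missed two appointments in March." },
  { id: "c2", text: "Office moved to a new building last year." },
  { id: "c3", text: "Client asked to reschedule missed appointments." },
];
const picked = selectChunks("missed appointments", chunks, 2);
console.log(picked.map((c) => c.id)); // [ 'c1', 'c3' ]
```

Whatever scoring you use, the shape is the same: the irrelevant chunk never reaches the prompt, so the model has less to dig through.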
The funny part is that people often frame this as a model question. Should I switch from GPT-5 to Claude? Should I use a local Qwen or Llama variant? Should I upgrade to a bigger context window?
Sometimes that helps.
But if your architecture is wrong, model shopping is just procrastination with a benchmark spreadsheet.
Why do narrower tasks keep winning?
Because models are still easier to steer when the target is small.
I was reminded of that by one more r/openclaw thread where someone troubleshooting unreliable behavior boiled the fix down to this: ask questions with a very specific task.
That sounds almost insultingly basic.
It is also true.
Specific tasks make it easier to evaluate outputs, add guardrails, compare models, and know when the workflow broke. Broad tasks hide sloppiness because the output can still sound fluent.
And fluent is exactly what gets teams into trouble with case notes.
A recommendation that cites the wrong sentence is worse than no recommendation at all.
The boring fix is the one that survives contact with reality
If you’re building any workflow that turns sensitive notes into decisions, I would use this default until proven otherwise:
- Retrieve or select the smallest useful evidence set
- Extract verbatim quotes and structured facts first
- Attach source references to every extracted item
- Generate recommendations only from extracted evidence
- Allow abstention when evidence is weak or missing
That’s it.
Not glamorous. Not viral. Not the kind of thing that gets framed as frontier magic.
But it’s the pattern I trust.
Long context is real. Prompt caching is real. A good OpenAI-compatible LLM stack can make repeated analysis fast and cheap. OpenClaw is a very plausible home for this architecture if you want local control and separate agents for extraction and recommendation.
Still, the core lesson is simpler than any of that.
When a model hallucinates on case notes, the first question usually isn’t “which model should I switch to?”
It’s “why did I ask one model call to do three jobs while hiding the evidence in the middle of a giant prompt?”
