I stopped trusting “same answers, fewer tokens” after watching an agent lose the one detail that mattered

James Olsen · June 12, 2026 · 6 min read

Three hours into a Claude Code run, I watched an agent make the wrong API call with total confidence. The weird part was that nothing looked obviously broken. The plan was coherent, the reasoning read clean, and the running summary of prior steps sounded exactly like the kind of thing people point to when they say compression is “basically the same result for fewer tokens.”

It was also wrong.

Earlier in the session, the agent had hit an error log that included one field name it absolutely needed later. Somewhere along the way, that detail got compressed out of memory. Not the whole issue. Not the general lesson. Just the one sharp little fact that mattered three hours later, after the agent had already committed to the wrong action.

That was the moment I stopped trusting the easy pitch around context compression.

While researching this, I ran into a thread on r/openclaw about using Headroom with OpenClaw. What I liked about that discussion was that it got closer to the real engineering tradeoff than most product pages do. Compression is useful. Sometimes extremely useful. But only if you treat it like a reversible optimization, not a memory wipe with nicer branding.

The real problem with context compression is not that it loses words. It is that it can lose the one fact your agent needs much later, when the rest of the workflow has already built on top of that missing fact. That is a very different failure mode from a chatbot giving a slightly worse answer.

And that distinction matters a lot if you are running long-lived agents in n8n, Make, Zapier, OpenClaw, Claude Code, or a custom OpenAI-compatible workflow. In those systems, one dropped detail can quietly poison everything downstream. You do not notice it when the summary is generated. You notice it when the agent confidently does the wrong thing.

The sales pitch for compression is not fake. Shrink the prompt, lower latency, reduce cost, keep the model focused. Those are real benefits. Headroom says it can cut context by 60 to 95 percent by compressing logs, files, tool output, and history. RTK can shrink a terminal-heavy Claude Code session from roughly 118,000 tokens to around 23,900.
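To put the RTK figures on the same scale as Headroom's percentage range, here is the arithmetic as a quick sketch. The function name is mine, not part of either tool.

```python
def reduction_percent(before_tokens: int, after_tokens: int) -> float:
    """Share of the original context that was removed, as a percentage."""
    if before_tokens <= 0:
        raise ValueError("before_tokens must be positive")
    return 100.0 * (before_tokens - after_tokens) / before_tokens


# Figures quoted above for a terminal-heavy Claude Code session.
rtk = reduction_percent(118_000, 23_900)
print(f"{rtk:.1f}% fewer tokens")  # prints "79.7% fewer tokens"
```

Note what the number measures: how much was removed, not whether any of it was needed later.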

If your workload is mostly noisy command output, that is genuinely impressive. I do not think those numbers are the problem. The problem is pretending that compressed context is interchangeable with original context.

It is not. Compressed context is an interpretation of original context.

That sounds obvious when you say it plainly, but a lot of agent tooling still behaves like summaries are a drop-in replacement for memory. They are not. The moment your agent needs the exact wording of an error, the exact shape of a JSON payload, the exact schema constraint, or the exact user instruction from two hours ago, an interpretation is often not enough.

That is why I think aggressive compression is fine for disposable chat and much riskier for multi-hour agents. If the raw context is not retrievable on demand, you are not really saving tokens. You are taking on hidden reliability debt and hoping it does not come due at 2 a.m.

My rule now is simple: compress what is noisy, reload what is consequential.

Tool chatter, repetitive logs, bulky terminal output, and intermediate traces are good candidates for compression. Final decisions, unresolved branches, retrieved evidence, schema details, error signatures, and debugging clues are not. Those should either stay raw or remain one fetch away from raw.
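A minimal sketch of that rule as a routing step, assuming each piece of context already carries a label. The label names, `ContextItem`, and `route` are hypothetical and lifted straight from the two lists above; a real agent would need its own taxonomy and some way to assign it.

```python
from dataclasses import dataclass

# Hypothetical labels, one per category named in the text.
NOISY = {"tool_chatter", "repetitive_log", "terminal_output", "intermediate_trace"}
CONSEQUENTIAL = {
    "final_decision", "unresolved_branch", "retrieved_evidence",
    "schema_detail", "error_signature", "debugging_clue",
}


@dataclass
class ContextItem:
    kind: str
    text: str


def route(item: ContextItem) -> str:
    """Return 'compress' or 'keep_raw' for one context item."""
    if item.kind in CONSEQUENTIAL:
        return "keep_raw"
    if item.kind in NOISY:
        return "compress"
    # Unknown kinds stay raw: a lost fact costs more than the extra tokens.
    return "keep_raw"
```

The default branch is the part I care about. Anything the policy cannot classify stays raw, so the failure mode is a bigger prompt instead of a missing field name.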

That is the difference between a safe system and a fragile one. Safe systems assume the summary might be wrong. Fragile systems assume the summary is the truth.

This is also where the differences between tools actually matter.

Headroom

Best at: aggressive context reduction across a wider agent stack
Strength: can dramatically shrink logs, files, tool output, and history
Risk: if the agent cannot recover original source material, the compression becomes a one-way rewrite of memory
My take: useful if paired with retrieval, dangerous if used as a permanent substitute for raw history

RTK

Best at: narrowing the token footprint of terminal-heavy Claude Code sessions
Strength: easier to reason about because it transforms a narrower category of data, mainly Bash output
Risk: still risky if important debugging details are hidden inside “just logs”
My take: simpler tradeoff than full-stack compression, but still not something I would trust blindly

OpenAI Prompt Caching

Best at: reusing exact prompt prefixes for latency and cost improvements
Strength: preserves the original prompt instead of rewriting it
Risk: does not solve selective recall or long-context memory by itself
My take: much safer than summarization, but it is a different tool for a different problem

If I had to pick winners and losers, the winner is reversible compression plus retrieval. The loser is one-way summarization pretending to be memory.

And honestly, I think a lot of teams are over-compressing for a boring reason that has nothing to do with model quality. Per-token pricing trains people to fear their own context windows.

You can see it everywhere. Teams running automations in n8n, Make, Zapier, OpenClaw, or custom agent frameworks start trimming history, collapsing tool traces, and summarizing everything in sight, not because that is the best architecture, but because every extra token feels like a meter running in the background.

That pricing pressure leaks directly into technical design. People end up making reliability decisions that are really billing decisions in disguise.

This is the part where Standard Compute becomes relevant in a very practical way. If you are using a flat-rate, OpenAI-compatible API instead of worrying about per-token charges on every long run, you get to make different choices. You can keep more raw context, reload source material when the agent needs it, and stop pretending every problem should be solved by another layer of summarization.

That does not mean you should be sloppy. Unlimited compute is not permission to build garbage prompts. It just means you can choose the safer architecture instead of the cheapest-looking one.

For long-running agents, that is a huge difference.

The boring rule I trust now is this: never let your agent depend on compressed context unless it can recover the raw source later. If the original log, chunk, tool output, or conversation turn is gone, then the compression step is not optimization.

It is amputation.
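Here is roughly what "recoverable" could look like in code. This is a sketch under my own assumptions, not how Headroom or RTK work internally: `RawStore`, `compress`, and `rehydrate` are hypothetical names, the store is an in-memory dict, the summarizer is plain truncation standing in for a model call, and the `customer_ref` field is invented for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Compressed:
    summary: str
    raw_ref: str  # key for fetching the original text later


class RawStore:
    """In-memory stand-in for wherever raw context gets archived."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, raw: str) -> str:
        ref = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        self._blobs[ref] = raw
        return ref

    def get(self, ref: str) -> str:
        return self._blobs[ref]


def compress(raw: str, store: RawStore, max_chars: int = 80) -> Compressed:
    """Archive the raw text first, then return a lossy summary plus its ref."""
    ref = store.put(raw)
    # Placeholder summarizer: truncation. A real one would call a model.
    summary = raw if len(raw) <= max_chars else raw[:max_chars] + "..."
    return Compressed(summary=summary, raw_ref=ref)


def rehydrate(item: Compressed, store: RawStore) -> str:
    """Fetch the original text behind a summary."""
    return store.get(item.raw_ref)


store = RawStore()
log = "INFO retrying\n" * 20 + "ERROR 422: missing field 'customer_ref'"
item = compress(log, store)
assert "customer_ref" not in item.summary        # the summary lost the field name
assert "customer_ref" in rehydrate(item, store)  # the raw log still has it
```

The ordering inside `compress` is the point: the raw text is archived before anything lossy happens, so every summary carries a pointer back to its source.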

And once you have watched an agent fail because one field name vanished from memory, the whole “same answers, fewer tokens” slogan starts sounding less like engineering and more like marketing copy written by someone who has never had to debug a broken workflow in the middle of the night.

If you are building agents that run for hours, touch real systems, and make real decisions, I think that is the line worth drawing. Compress the noise. Keep the evidence. And if pricing is forcing you to choose the brittle version of that tradeoff, the pricing model is part of the bug.
