LLM cost optimization in OpenClaw usually fails when you pick one cheap model for everything. The real savings come from routing by risk: let low-cost models handle routine steps, and reserve stronger models for planning, recovery, and high-stakes tool calls. The stakes are real: one r/openclaw user reported spending $100 in two days before changing models.
I keep seeing the same mistake in agent stacks.
Someone gets excited about OpenClaw, wires up a few tools, watches Claude Opus or GPT-class models do something magical, then opens the bill and has a small spiritual crisis. So they swing hard in the other direction: "Fine. I'll just use the cheapest model everywhere."
That sounds disciplined. It sounds like LLM cost optimization. It is usually the exact move that makes the whole stack more expensive.
While researching budget setups, I came across a thread on r/openclaw where one user said, "I blew 100 usd in two days in openclaw using opus, sonnet, haiku. Moved to deepseek and its consuming pennies."
At first glance, that sounds like the whole story. Stop using expensive models. Switch to DeepSeek Flash. Problem solved.
But that wasn't the part that interested me.
The interesting part was what the thread accidentally revealed: the biggest cost driver in agent systems is often not the posted price of the model. It's what your agent does when the model is slightly wrong.
The expensive part isn't the token price
A chatbot can survive a mediocre answer. An agent can't.
When OpenClaw is driving tools, a weak answer doesn't just look bad on screen. It creates retries. It triggers extra tool calls. It stalls mid-task. It asks for clarification when it should have acted. Or worse, it confidently takes the wrong action and forces a recovery loop.
That is where penny-per-call thinking breaks.
A cheap model that needs three attempts, one supervisor check, and a cleanup pass from Claude Sonnet is not cheap. It can end up as the most expensive line item in your stack, because it multiplies everything around it.
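To make that multiplication concrete, here is a minimal sketch that tallies every model call needed to finish one task. The prices are arbitrary relative units I made up for illustration, not real provider rates, and the `Call` and `path_cost` names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical relative prices per call (arbitrary units, not real rates).
PRICE = {"cheap": 1, "strong": 10}

@dataclass(frozen=True)
class Call:
    label: str
    model: str  # "cheap" or "strong"

def path_cost(calls):
    """Sum the price of every model call made to finish one task."""
    return sum(PRICE[call.model] for call in calls)

# The scenario from the paragraph above: three cheap attempts,
# then a strong model has to check the work and clean up.
cheap_first = [
    Call("attempt 1", "cheap"),
    Call("attempt 2", "cheap"),
    Call("attempt 3", "cheap"),
    Call("supervisor check", "strong"),
    Call("cleanup pass", "strong"),
]
strong_only = [Call("attempt 1", "strong")]

print(path_cost(cheap_first), path_cost(strong_only))  # prints: 23 10
```

With these made-up numbers the "cheap" path costs more than twice the strong-only path, even though each cheap call is a tenth of the price.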
And OpenClaw makes this painfully visible, because OpenClaw is built for agents, not just chat. Its docs describe it as a gateway for sessions, routing, and channel connections, with per-agent routing and failover plus multi-agent routing across providers like Anthropic, OpenAI, MiniMax, OpenRouter, and local models. In other words, "one cheap model for everything" is not a requirement of OpenClaw. It's a configuration choice.
That distinction matters a lot.
Because once you realize OpenClaw already supports agent routing, the question changes from "which model is cheapest?" to "which steps are safe enough to be cheap?"
So what actually goes wrong with small models in OpenClaw?
Reddit was unusually honest about this.
In another r/openclaw discussion, one user wrote: "I use Gemma 4 E4B for simple tool tasks, but I would have serious doubts about trying to use any of the Gemma 4 models for the main agent. It will almost certainly fail in horrible and unpredictable ways."
That sentence is brutal. It is also more useful than most benchmark charts.
Because that is exactly how these failures feel in practice. Not "slightly lower reasoning quality." More like: weird tool sequencing, forgotten constraints, brittle recovery, and random collapses after a long session.
Another commenter in the same thread said: "I've got ollama gemma4:latest as my fallback (12gb when loaded). It's failed over to it and it's the best I've seen for openclaw on a 16GB gpu, but still super basic. As in, barely keep the lights on basic."
That phrase stuck with me: barely keep the lights on.
That's the right mental model for a lot of cheap and local models in an agent harness. Not useless. Not fake. Just miscast.
Where smaller models actually shine
This is where people get confused.
Google's Gemma 3 launch made small models look a lot more capable on paper. Google says Gemma 3 comes in 1B, 4B, 12B, and 27B sizes, supports function calling, structured output, and a 128k-token context window, and is designed to run on a single GPU or TPU. Google also says Gemma adoption has passed 100 million downloads with more than 60,000 community variants.
Those are real capabilities. They matter.
But a function-calling checkbox is not the same thing as being reliable as the main autonomous controller inside OpenClaw.
A model can be perfectly fine at extracting fields from a support email, classifying a request, formatting JSON, or deciding whether a Discord message looks urgent. That same model can still fall apart when it has to plan a six-step tool sequence, recover from an API timeout, and decide whether to retry or escalate.
That's the trap.
The winner is not DeepSeek or Gemma. It's routing.
If I had to boil this down to one opinionated take, it's this: single-model OpenClaw setups are usually lazy architecture wearing a budget hat.
The answer is not "always use Claude Opus." The answer is not "just use DeepSeek Flash." The answer is LLM routing.
Use the cheap model where mistakes are cheap.
Use the strong model where mistakes cascade.
That sounds obvious, but almost nobody starts there. They start with list price instead of failure cost.
A simple way to think about model roles
| Model option | Best role in OpenClaw |
|---|---|
| DeepSeek Flash | Very low-cost worker for classification, extraction, formatting, and other bounded steps where retries are acceptable |
| Gemma 3 or Gemma 4 12B-class models | Local or single-GPU helper for simple tool tasks, fallback, and low-risk subtasks; risky choice for the main harness agent |
| Stronger frontier models like Claude Opus, Claude Sonnet, or GPT-5-class models | Planner, supervisor, recovery model, and decision-maker for ambiguous or high-consequence turns |
That table is the whole game.
Not model tribalism. Role design.
What should route to a cheap model and what should not?
Here's the split I wish more people used.
Good jobs for cheap models
- Intent classification
- Document extraction
- Structured output formatting
- Spam filtering
- Low-risk summarization
- Fallback responses when failure is obvious and recoverable
- Simple tool tasks with tight schemas
Bad jobs for cheap models
- Main agent planning across multiple tools
- Recovery after failed tool calls
- Long-horizon tasks with lots of state
- Any step that can trigger side effects like sending emails, updating records, or executing transactions
- Ambiguous decisions where the agent has to infer intent from messy context
If a mistake only means "we rerun a parser," use DeepSeek Flash, Qwen, Gemma, GLM, MiniMax, or a local Ollama-hosted model.
If a mistake means "the agent spirals for ten minutes and then Claude Sonnet has to rescue it," stop pretending you're saving money.
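The split above can be sketched as a small routing function. This is an illustration of the decision rule only, not OpenClaw's configuration format: the task-type strings, the `Step` fields, and the tier names are all assumptions of mine.

```python
from dataclasses import dataclass

# Task types drawn from the "good jobs" list; the identifiers are made up.
CHEAP_OK = {
    "intent_classification",
    "document_extraction",
    "output_formatting",
    "spam_filtering",
    "low_risk_summary",
    "simple_tool_task",
}

@dataclass(frozen=True)
class Step:
    task_type: str
    has_side_effects: bool = False  # e.g. sends email, updates a record
    is_recovery: bool = False       # follows a failed tool call

def pick_tier(step: Step) -> str:
    """Return "strong" whenever a mistake could cascade, else "cheap"."""
    if step.has_side_effects or step.is_recovery:
        return "strong"
    if step.task_type in CHEAP_OK:
        return "cheap"
    # Unknown or ambiguous work defaults to the safer tier.
    return "strong"
```

Note the order of the checks: side effects and recovery override the task type, so even a "simple tool task" goes to the strong tier if it can send an email or update a record.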
What does this look like in a real OpenClaw stack?
OpenClaw already gives you the control plane for this. Its docs explicitly frame the gateway around sessions, routing, and failover. That means your architecture can be deliberate instead of all-or-nothing.
A practical pattern looks like this:
- Start with a cheap worker model for intake and routine tool work.
- Track where the agent hesitates, retries, or requests supervision.
- Escalate only those turns to a stronger model like Claude Sonnet, Claude Opus, or GPT-5.
- Keep a local fallback for continuity, not quality.
- Review failures by task type, not by average token price.
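The pattern above can be sketched as a small escalation loop. It assumes each model is just a callable that returns `None` on failure; `run_step`, `failures_by_task_type`, and the attempt limit are hypothetical names and values, and none of this is OpenClaw's actual API.

```python
from collections import Counter
from typing import Callable, Optional

# A model is any callable that returns a result, or None when it fails.
Model = Callable[[str], Optional[str]]

# Counts cheap-tier failures per task type, for later review.
failures_by_task_type: Counter = Counter()

def run_step(task_type: str, prompt: str, cheap: Model, strong: Model,
             fallback: Model, max_cheap_attempts: int = 2) -> str:
    """Try the cheap worker first, escalate on failure, fall back last."""
    for _ in range(max_cheap_attempts):
        result = cheap(prompt)
        if result is not None:
            return result
        failures_by_task_type[task_type] += 1
    # Only turns the cheap worker could not finish reach the strong model.
    result = strong(prompt)
    if result is not None:
        return result
    # The fallback exists for continuity, not quality.
    result = fallback(prompt)
    return result if result is not None else "unavailable"
```

After a day of traffic, `failures_by_task_type.most_common()` would show which task types keep escalating, which is a more useful signal than average token price when deciding what to route where.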
Even the setup flow hints at this being an operational system, not a toy chat app:
```shell
npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw status --deep
```
OpenClaw recommends Node 24, or Node 22 LTS (22.19 or later) for compatibility. That's not the documentation style of a single-model hobby wrapper. It's infrastructure. Treat it like infrastructure.
And infrastructure gets cheaper when you route intelligently.
But what if a cheap or local model really is enough?
Sometimes it is.
If your workload is tightly scoped and low-risk, then yes, a cheap model may genuinely be the most economical option. The Reddit threads mention people getting acceptable results from DeepSeek Flash, Gemma fallback setups, Qwen, GLM, and MiniMax for simpler tasks.
For fully local stacks, API cost can drop to zero. Then the tradeoff shifts to hardware, latency, setup pain, and reliability. One r/openclaw post made the wonderfully blunt argument that you should fit the biggest local model your hardware can handle because, basically, bigger local models make OpenClaw better.
I don't disagree.
But that's still a routing argument, not a universal cheap-model argument.
If your Raspberry Pi side project is classifying inbound webhooks, go cheap. If your OpenClaw agent is orchestrating tools across a CRM, a database, Slack, and an internal API, pretending Gemma 12B is a drop-in main agent because it technically supports function calling is how you end up paying for mistakes in a more annoying currency than tokens.
The weird truth about agent costs
The weird truth is that stronger models are often cheaper per successful task, even when they're more expensive per call.
That feels wrong until you've watched a weak model burn minutes wandering around a tool graph.
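One way to see why is to compute expected cost per successful task rather than cost per call. The sketch below assumes each attempt succeeds independently with a fixed probability and that every failed attempt adds a fixed overhead (extra tool calls, cleanup). Both are simplifications, and every number is a made-up placeholder in arbitrary units.

```python
from fractions import Fraction

def cost_per_success(call_cost, success_rate, failure_overhead=0):
    """Expected spend to get one success when attempts are independent.

    Expected attempts are 1 / success_rate, so expected failures are
    (1 - success_rate) / success_rate, each adding failure_overhead.
    """
    p = Fraction(success_rate)
    if not 0 < p <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (Fraction(call_cost) + Fraction(failure_overhead) * (1 - p)) / p

# Hypothetical numbers, chosen only for illustration.
cheap = cost_per_success(call_cost=1, success_rate=Fraction(1, 4),
                         failure_overhead=5)
strong = cost_per_success(call_cost=10, success_rate=Fraction(9, 10),
                          failure_overhead=5)

print(float(cheap), float(strong))
```

Under these assumptions the model that is ten times pricier per call comes out cheaper per finished task. Raise the cheap model's success rate or shrink the failure overhead and the conclusion flips, which is exactly why the answer depends on the task and not on the price list.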
The user who spent $100 in two days with Opus, Sonnet, and Haiku found one kind of pain. The users describing Gemma as fallback-grade found the other kind. Put those together and the answer becomes obvious: the cheapest model is often the most expensive part of your OpenClaw stack when you ask it to do the wrong job.
The practical takeaway is simple.
Don't optimize for the lowest sticker price.
Optimize for the lowest cost of getting the task done once, correctly, without supervision. In OpenClaw, that usually means cheap models for low-risk steps, strong models for hard turns, and failover that reflects reality instead of wishful thinking.
