Standard Compute

I read the 32-comment OpenClaw fight about GPT 5.5 and I think people are blaming the wrong thing

James Olsen · May 16, 2026 · 9 min read
[Figure: “Why it feels off: OpenClaw can make a strong model feel lifeless.” Thread: 32 comments. Blame map: model, prompting, defaults, bugs. Model feel: GPT-5.5 (smart, muted), Codex (tool-heavy), Opus (more alive). Caption: Not “bad model”; often prompt wrappers, defaults, and bugs.]

GPT 5.5 in OpenClaw probably isn’t a bad model in the simple sense. In a 10-upvote, 32-comment r/openclaw thread, most complaints were really about agent behavior defaults: lower initiative, less emotional intuition, and more need for explicit instructions. The twist is that commenters also pointed at OpenClaw prompting, upgrades, and reliability issues as major reasons the experience feels worse.

A post on r/openclaw caught my eye because the title sounded simple and the comments absolutely were not.

“Is GPT 5.5 in OpenClaw a bad model?”

That sounds like a yes-or-no question. It turned into a 32-comment argument about something much messier: what happens when you confuse model quality with agent personality, and then layer OpenClaw bugs on top.

That’s why this thread matters. Not because one side “won,” but because it exposes a problem everyone building agents runs into sooner or later: you think you’re comparing GPT 5.5 to Claude Opus 4.7, but you’re actually comparing prompt wrappers, tool behavior, session handling, and product stability.

And once I realized that, the whole thread snapped into focus.

The complaint wasn’t “GPT 5.5 is dumb” — it was “why does it feel dead?”

The line that made the thread travel was brutal.

One user wrote that GPT 5.5 through Codex feels like “an incredibly smart person who has no desire to live, doesn’t want to do anything unless you tell it exactly what to do.”

That is not a capability complaint. That is a vibe complaint. But in agent workflows, vibe is not cosmetic. Vibe becomes throughput.

If your agent is supposed to help you think, suggest next steps, notice missing context, or keep momentum in a long-running task, then “waits to be told” feels terrible. Especially if you came from Claude Opus.

Another user said switching back to “opus 4.7” made the agent feel “alive” again immediately, even after trying both the default soul and another soul file in OpenClaw. That’s a huge clue. They weren’t testing a narrow coding benchmark. They were testing collaborative, initiative-heavy personal assistance.

And for that job, they felt GPT 5.5 lost badly.

Why this hits so hard in OpenClaw

OpenClaw is not just a textbox. People use it like an operator, a planner, a semi-autonomous companion, a coding helper, and sometimes a weird little chief of staff.

So when a model stops volunteering useful next moves, the experience doesn’t just get slightly worse. It collapses.

That’s why comments like “soulless,” “reluctant to execute,” and “less proactive” kept showing up. Not because GPT 5.5 can’t reason, but because the default behavior felt wrong for the job people had hired it to do.

But then the thread took a turn.

Is this actually a GPT 5.5 problem, or an OpenClaw problem?

One of the smartest comments in the thread pushed back on the whole premise.

A commenter argued that initiative is not purely a property of the model. They put it plainly: “The behavior you're describing — waiting to be told vs. taking initiative — is partly a training objective difference, but it shifts a lot with the system prompt.”

I think that commenter is mostly right.

People love to talk about models as if they arrive with fixed personalities. In practice, once you run them through a product like OpenClaw, you’re dealing with at least four layers:

  1. The base model — GPT 5.5, Claude Opus 4.7, whatever sits underneath
  2. The provider wrapper — in this case, Codex sitting between OpenClaw and GPT 5.5
  3. OpenClaw’s own prompting and soul files
  4. OpenClaw’s runtime behavior — tools, sessions, cron, auth, upgrades, failures

If layer 3 tells the agent to be conservative, ask before acting, avoid assumptions, and minimize initiative, users will absolutely experience that as “this model has no spark.”

And the thread had more than one hint that OpenClaw itself is shaping that behavior across models.

One commenter said Codex “still breaks too often, but that is on the Openclaw side.” Another said the “don’t do things on its own” behavior seems encoded in OpenClaw now across different models, not just Codex.

That’s not a small footnote. That’s the whole case.
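To make the four-layer point concrete, here is a minimal Python sketch of how I’d check whether a comparison is actually controlled. Everything in it is an assumption for illustration: the `AgentSetup` class, its field names, and the string labels are made up, and none of it is something OpenClaw or Codex exposes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentSetup:
    """One configuration of the four layers (all values are hypothetical labels)."""
    base_model: str        # layer 1: the underlying model
    provider_wrapper: str  # layer 2: what sits between OpenClaw and the model
    prompting: str         # layer 3: system prompt or soul file in use
    runtime: str           # layer 4: the OpenClaw build and its tools/sessions/cron


def differing_layers(a: AgentSetup, b: AgentSetup) -> list[str]:
    """Return the names of the layers on which two setups differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def is_controlled_comparison(a: AgentSetup, b: AgentSetup) -> bool:
    """A comparison isolates a single cause only if exactly one layer changed."""
    return len(differing_layers(a, b)) == 1


# Illustrative setups, not real configurations from the thread.
gpt_setup = AgentSetup("gpt-5.5", "codex", "default-soul", "openclaw-5.12")
opus_setup = AgentSetup("claude-opus-4.7", "direct", "default-soul", "openclaw-5.7")

print(differing_layers(gpt_setup, opus_setup))
print(is_controlled_comparison(gpt_setup, opus_setup))
```

With these made-up setups, three of the four layers differ, so whatever feels different between them can’t be pinned on the model alone.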

The weird part: some people are perfectly happy with Codex

This is where the thread gets interesting instead of predictable.

Not everyone hated GPT 5.5 via Codex. One commenter said they were “rather happy with Codex and Openclaw” and framed most of the breakage as an OpenClaw issue, not a Codex issue.

That matters because it suggests GPT 5.5 isn’t uniformly failing. It’s failing for a certain style of use.

Here’s the cleanest way I’d put it:

| Option | What users in the thread seemed to feel |
| --- | --- |
| GPT 5.5 via Codex in OpenClaw | Smart, capable, but often less proactive and more dependent on explicit instructions; better fit when you want controlled execution or coding help |
| Claude Opus 4.7 in OpenClaw | More intuitive, more emotionally legible, more likely to feel collaborative in long-running assistant-style work |
| OpenClaw itself | A huge confounder; prompting, souls, upgrades, cron behavior, and auth issues can make any model feel worse than it is |

That table is the thread in miniature.

If you want an agent that behaves like a thoughtful coworker, people in this discussion clearly preferred Claude Opus 4.7. If you want a model that can be steered precisely for coding or operator tasks, GPT 5.5 via Codex still had defenders.

And that leads to the best practical workaround anyone shared.

What do power users do when OpenClaw chat feels wrong?

They stop expecting OpenClaw’s default chat behavior to carry the whole experience.

One commenter described using Codex CLI directly to maintain OpenClaw itself. They keep a persistent session inside ~/.openclaw, point Codex at https://docs.openclaw.ai/ for context, and then resume the thread whenever they need to fix or update something.

That’s such a revealing workaround because it changes the role of Codex completely. Instead of asking OpenClaw chat to feel like Claude, they use Codex as an external coding/operator layer.

The commands mentioned were simple:

cd ~/.openclaw
codex resume

And the docs they referenced:

https://docs.openclaw.ai/

Why this workaround makes sense

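When people say a model feels passive, two different things can both be true:

  • It doesn’t proactively propose next steps in chat
  • It’s actually very good once dropped into a concrete repo, shell, or task context

Those are not the same thing. The first is a failure of initiative in open-ended chat; the second is a strength that only shows up once the task is concrete.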

A lot of coding-oriented models look mediocre in open-ended assistant chat and then suddenly become excellent once you give them files, commands, and a persistent working thread. That seems to be what some OpenClaw users discovered the hard way.

But even that wasn’t the biggest twist in the story.

What if the model comparison is being poisoned by reliability bugs?

While reading the main thread, I went looking for surrounding context on r/openclaw, and honestly, this is the part that changed my mind.

There are enough reliability complaints around OpenClaw right now that I don’t think you can cleanly judge GPT 5.5 inside it without an asterisk the size of a house.

In one related thread, a user said OpenClaw auto-upgraded from 5.7 to 5.12 and then cron jobs silently stopped firing, API key loading broke, and every request returned “Something went wrong.” That post had a score of 9, and the top comment claimed “75% of your time is spent fixing OC.”

That’s not just annoying. That completely contaminates model perception.

If your scheduled tasks fail, your auth breaks, or your agent stops executing after an upgrade, you will absolutely come away saying “this model sucks,” even if the real issue lives three layers above the model.

And there’s an even more absurd example floating around the subreddit: one OpenClaw mishap post reported 13,616 cron jobs created in one day. The same post claimed almost 14,000 requests to Ollama Deepseek V4 consumed only 7% of the user’s usage.

That is the kind of story that sounds fake until you’ve spent enough time around agent tooling.
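For a sense of scale, here is the back-of-the-envelope arithmetic on that cron number. The only input from the post is the 13,616 figure; spreading it evenly across 24 hours is my assumption, since the post doesn’t say how the jobs were distributed.

```python
# Figure reported in the subreddit post: cron jobs created in a single day.
cron_jobs_per_day = 13_616

# Assumption: jobs were created at a steady rate over the full day.
HOURS_PER_DAY = 24
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60

jobs_per_hour = cron_jobs_per_day / HOURS_PER_DAY
jobs_per_minute = cron_jobs_per_day / MINUTES_PER_DAY
seconds_between_jobs = SECONDS_PER_DAY / cron_jobs_per_day

print(f"{jobs_per_hour:.1f} jobs per hour")
print(f"{jobs_per_minute:.2f} jobs per minute")
print(f"one new job roughly every {seconds_between_jobs:.1f} seconds")
```

Under that even-spread assumption, it works out to about 567 jobs an hour, or a new cron job roughly every six seconds, all day.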

Why this matters for anyone comparing providers

The broader r/openclaw context also shows that people aren’t choosing between GPT 5.5 and Claude in a vacuum. They’re reacting to pricing and access changes.

One related discussion said Anthropic no longer lets users draw from subscription usage in OpenClaw and instead offers a credit. Another thread referenced ChatGPT pricing tiers of $20, $27, and $100 in the context of Codex and OpenClaw decisions.

So sometimes a user “switches to GPT 5.5” not because they think it’s better, but because the economics or access path changed under them. Then they discover the behavior is different, and the disappointment lands harder because it wasn’t really a voluntary experiment.

So is GPT 5.5 in OpenClaw a bad model?

My answer: no, but it may be a bad default experience for the kind of OpenClaw user who wants initiative-heavy collaboration.

That distinction matters.

If your ideal agent behaves like a smart teammate who notices things, suggests better options, and keeps momentum alive, the comments in this r/openclaw thread make a strong case that Claude Opus 4.7 currently feels better inside OpenClaw.

If your workflow is more like “here is the repo, here is the task, execute carefully,” then GPT 5.5 via Codex may be completely fine, maybe even preferable.

The real mistake is pretending this is a clean model bake-off.

It isn’t.

It’s a three-way interaction between GPT 5.5, Codex, and OpenClaw. Add soul files, system prompts, auth issues, upgrade bugs, and cron weirdness, and suddenly “bad model” is the least precise diagnosis available.

What should you actually test before picking a side?

If I were trying to settle this for my own workflow, I’d run three tests before forming any opinion:

  1. Same task, same prompt, different model inside OpenClaw
  2. Same model, different system prompt or soul file inside OpenClaw
  3. Same model outside OpenClaw using Codex CLI or a direct provider workflow

If GPT 5.5 feels inert in test 1 where another model doesn’t, and tests 2 and 3 don’t change that, it’s probably model behavior.

If it changes dramatically in test 2, that’s prompt architecture.

If it suddenly shines in test 3, then OpenClaw chat was the problem all along.
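The reading of those three tests can be sketched as a small decision function. This is only how I’d encode my own rule of thumb, not a rigorous method: the `Observations` class, its field names, and the cause labels are all hypothetical, and the inputs are subjective yes/no judgments you’d make yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observations:
    """Subjective yes/no results from the three tests (hypothetical field names)."""
    inert_vs_other_model: bool      # test 1: GPT 5.5 feels inert where another model does not
    improves_with_new_prompt: bool  # test 2: a different system prompt or soul file changes it a lot
    shines_outside_openclaw: bool   # test 3: the same model feels fine via Codex CLI or direct access


def likely_causes(obs: Observations) -> list[str]:
    """Map the observations to the layers most likely responsible."""
    causes = []
    if obs.improves_with_new_prompt:
        causes.append("prompt architecture")
    if obs.shines_outside_openclaw:
        causes.append("OpenClaw chat/runtime")
    # Blame the model only when neither the prompt nor the wrapper explains it.
    if obs.inert_vs_other_model and not causes:
        causes.append("model behavior")
    return causes or ["inconclusive"]


print(likely_causes(Observations(True, False, False)))
print(likely_causes(Observations(True, True, True)))
```

The design choice worth noticing is the order of blame: the model is only named when changing the prompt and leaving OpenClaw both fail to help, which matches the argument that “bad model” should be the last diagnosis, not the first.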

That’s my takeaway from the thread. Not that Reddit found the one true answer, but that the commenters accidentally mapped the real problem better than the title did.

GPT 5.5 in OpenClaw might feel soulless. That part seems real.

But “soulless” is not the same thing as “bad.” Sometimes it just means you’re asking a coding operator to play therapist, project manager, and cofounder inside a wrapper that keeps changing the rules.

And yeah, that usually ends badly.

Frequently Asked Questions

Is GPT 5.5 in OpenClaw actually a bad model?

Not necessarily. The Reddit discussion suggests the bigger issue is how GPT 5.5 behaves inside OpenClaw, especially around initiative, emotional intuition, and reliance on explicit instructions, rather than a simple lack of intelligence.

Why do some people prefer Claude Opus 4.7 over GPT 5.5 in OpenClaw?

Users in the thread described Claude Opus 4.7 as feeling more alive, intuitive, and collaborative for long-running assistant-style work. GPT 5.5 via Codex was seen as more capable when tightly directed, but less naturally proactive.

Could OpenClaw itself be causing the bad GPT 5.5 experience?

Yes. Multiple commenters said breakage and odd behavior often come from OpenClaw, including system prompts, soul files, upgrades, cron failures, and authentication problems, which can all shape how a model feels in practice.

What workaround did OpenClaw users suggest for Codex?

One practical workaround was to use Codex CLI directly to manage OpenClaw, keep a persistent session in ~/.openclaw, and resume it with `codex resume`. That approach treats Codex as an external coding layer instead of relying on OpenClaw chat defaults.

How should I compare GPT 5.5 and Claude in OpenClaw fairly?

Run the same task with the same prompt across both models, then change only the system prompt or soul file, and finally test the model outside OpenClaw if possible. That helps separate base model behavior from OpenClaw-specific prompting and reliability issues.
