
My OpenClaw agent didn’t get dumber — I just gave it 50 skills and hoped for the best

Standard Compute Team
May 1, 2026 · 9 min read
[Figure: Tool Routing Drift — more skills, worse routing. Accuracy by skill count (3, 12, 50 skills); route mix at 50 skills: tool match 18%, overlap 61%, unused 21%. With 50 high-overlap skills loaded, route() resolves an ambiguous match and selects "shell".]

I knew something was off when my OpenClaw agent picked the TikTok posting skill to answer a question that clearly belonged to Reddit.

Not because the model was tiny, either. This was the same pattern people always describe right before they say "GPT-5 got worse" or "Claude is losing it" or "OpenClaw used to be smarter last month." The weird part is that the outputs still looked fluent. The agent sounded confident. It was just confidently reaching for the wrong capability.

That’s when the real problem clicked for me: once your OpenClaw setup gets past a certain size, the bottleneck stops being the model and starts being tool routing.

And honestly, I think a huge chunk of "my agent got dumber" complaints are really this.

The moment your agent turns into a junk drawer

There’s a specific phase every OpenClaw setup seems to hit.

At first you add a few skills and everything feels magical. A Reddit skill. A Twitter posting skill. Maybe a Discord notifier, a scraper, a Notion sync, a browser action, and something that writes summaries. Then you keep going because adding one more capability is easy and removing one feels dumb.

A month later, your agent has 30, 40, 50+ visible skills and starts acting like it has a head injury.

That pattern showed up over and over in the Reddit thread about 50+ skills. Multiple users said performance started degrading somewhere around 20-30 visible tools or skills. Not catastrophic failure, but something sneakier: misses, conflicts, instruction drift. The kind of slow rot that makes you blame the model because nothing is obviously broken.

And OpenClaw makes this easier to do than people realize, because it has two separate ways to create sprawl:

  • Tools: structured function definitions sent to the model API
  • Skills: SKILL.md instructions injected into the system prompt

So you can bloat the agent from both sides at once. Too many callable functions, plus too many behavioral instructions, plus too many overlapping descriptions. That’s not an intelligence problem. That’s an information architecture problem wearing a model-quality costume.

But the part that surprised me most wasn’t the number.

It was what actually caused the confusion.

It’s not just too many skills — it’s too many similar skills

One commenter in that thread said the quiet part out loud: schema overlap matters more than raw count.

That tracks. If you expose OpenClaw to three skills with vague descriptions like "post content," "publish update," and "share social message," and they all take fields like title, text, content, or message, you’ve basically created a multiple-choice test with three almost identical answers.

Then people act shocked when GPT-5 or Claude picks the wrong one.

I don’t think that’s fair.

If your Twitter skill, Reddit skill, and TikTok skill all look semantically mushy, the model isn’t failing at reasoning. You failed at naming things. And naming things, unfortunately, is half of agent engineering.
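To make the overlap problem concrete, here's a toy sketch (tool names, descriptions, and parameter fields all invented for illustration) of three near-identical schemas. A crude Jaccard score over parameter sets shows how little structural signal the model has to route on:

```python
# Hypothetical tool specs illustrating schema overlap (nothing here is a real API).
TOOLS = {
    "post_content":         {"description": "post content",         "params": ["title", "text"]},
    "publish_update":       {"description": "publish update",       "params": ["title", "content"]},
    "share_social_message": {"description": "share social message", "params": ["message", "text"]},
}

def param_overlap(a: str, b: str) -> float:
    """Jaccard similarity between two tools' parameter sets."""
    pa, pb = set(TOOLS[a]["params"]), set(TOOLS[b]["params"])
    return len(pa & pb) / len(pa | pb)

# Every pair shares fields, so nothing distinguishes the tools structurally;
# the model is left to guess from three mushy one-line descriptions.
print(param_overlap("post_content", "publish_update"))  # shares "title": 1 of 3 fields
```

If the pairwise overlaps are all nonzero and the descriptions are interchangeable, no amount of model quality fixes the routing; only renaming and tightening the schemas does.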

This is why some OpenClaw setups feel cursed even on strong models like GPT-5 or Claude Opus 4.6, while a smaller, more constrained setup on Qwen or Llama can feel weirdly reliable. The smaller setup isn’t smarter. It just isn’t being asked to choose between twelve nearly identical actions.

And once you see that, the fixes get a lot more practical.

The best fix is boring, and that’s why it works

The most consistently praised fix in those discussions was not "buy a smarter model" or "switch providers."

It was this: reduce the active tool surface area.

That sounds almost too simple, but it kept showing up in the same forms:

  1. Split one big agent into domain specialists
  2. Use a cheap categorization router first
  3. Keep the active tool list tiny per session
  4. Use MCP-style mediation instead of exposing everything directly
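The third item, keeping the active tool list tiny per session, can be sketched in a few lines. This is a minimal illustration of the idea, not OpenClaw's actual API; the profile names and tool IDs are hypothetical:

```python
# Sketch: restrict the tool surface each session can see (all names invented).
ALL_TOOLS = ["reddit_post", "reddit_search", "twitter_post", "tiktok_publish",
             "browser_fetch", "web_search", "notion_write", "discord_send"]

PROFILES = {
    "reddit_agent":   {"reddit_post", "reddit_search", "browser_fetch"},
    "tiktok_agent":   {"tiktok_publish"},
    "research_agent": {"browser_fetch", "web_search", "notion_write"},
}

def visible_tools(profile: str) -> list[str]:
    """Return only the tools this session's profile is allowed to see."""
    allowed = PROFILES.get(profile, set())
    return [t for t in ALL_TOOLS if t in allowed]

print(visible_tools("reddit_agent"))  # → ['reddit_post', 'reddit_search', 'browser_fetch']
```

The point is that the filter runs before the model ever sees a tool list: a session that can't see `tiktok_publish` can't mis-route to it.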

One OpenClaw user said they moved from a single agent with 40+ tools to an orchestrator plus three specialists for Twitter, Reddit, and TikTok. Their result was exactly what you’d expect if routing was the real issue: they went from frequent misses on the all-in-one agent to near-zero misses per specialist.

That is not a subtle improvement.

Another user had a setup with a little over 100 loaded skills and a library of 1000+ skills available only when explicitly requested. Their verdict was "mixed results," which is the most honest possible review of giant agent setups. But even they still said splitting work among specialized agents beat relying on one main agent with everything loaded.

So no, the answer is usually not "give the main agent more power."

It’s "stop making the main agent stare at the entire hardware store every time it needs a screwdriver."

One giant agent vs a team that knows its job

Here’s the tradeoff in plain English:

| Approach | What actually happens |
| --- | --- |
| Single super-agent | Easy to imagine, hard to keep reliable once tool overlap grows |
| Specialized sub-agents | Slightly more setup, much better tool selection and cleaner context |

And if you want the more detailed version:

| Pattern | Tool selection accuracy | Maintenance complexity | Context focus |
| --- | --- | --- | --- |
| Single super-agent | Usually drops as visible tools rise, especially past 20-30 | Lower at first, then ugly | Weak once every domain is loaded together |
| Specialized sub-agents | Usually much higher because choices are constrained | Higher upfront, lower long-term | Strong because each agent stays in its lane |

I’m opinionated on this one: specialized sub-agents win for almost every serious OpenClaw workflow.

Not because orchestration is elegant. It usually isn’t. It’s because reliability beats elegance. I would rather maintain three boring specialists than one "universal" agent that keeps trying to use a YouTube workflow to answer a Discord moderation task.

But there’s an even better version of this pattern.

The router should discover, not do everything

One of the better counterarguments in the discussions was that a broad orchestrator can still make sense if you separate discovery from execution.

That distinction matters.

A fast, cheap model can classify the request. Then a stronger model can execute it with a narrow set of tools. That "segregate, discover, then execute" pattern is a lot saner than letting one agent both survey the whole map and drive every vehicle.
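A toy version of that two-stage pattern, with a keyword matcher standing in for the cheap classifier (in practice you'd call a small, fast model); all tool and domain names are invented:

```python
# Sketch of "discover, then execute": a cheap classifier picks the domain,
# and the executor only ever sees that domain's tools. All names are invented.
DOMAIN_TOOLS = {
    "reddit":   ["reddit_post", "reddit_search"],
    "tiktok":   ["tiktok_publish", "media_transcode"],
    "research": ["web_search", "browser_fetch"],
}

KEYWORDS = {"reddit": "reddit", "subreddit": "reddit",
            "tiktok": "tiktok", "video": "tiktok",
            "look up": "research", "find": "research"}

def classify(request: str) -> str:
    """Stand-in for a cheap model call: first keyword match wins."""
    text = request.lower()
    for kw, domain in KEYWORDS.items():
        if kw in text:
            return domain
    return "research"  # safe default for ambiguous requests

def route(request: str) -> list[str]:
    """The strong model then executes with only the matched domain's tools."""
    return DOMAIN_TOOLS[classify(request)]

print(route("find the top post in this subreddit"))  # → ['reddit_post', 'reddit_search']
```

The classifier never executes anything, and the executor never sees the full map. Each half of the system gets a problem small enough to be reliable at.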

Here’s how I think about the three common designs:

| Design | Active tool count per session | Security/control | Latency or implementation overhead |
| --- | --- | --- | --- |
| Direct tool exposure | Highest | Weakest unless heavily restricted | Lowest to start |
| Categorization router | Low to medium | Better because execution can be narrowed | Moderate |
| MCP-style mediation | Lowest visible surface for the model | Strongest control plane | Highest setup cost |

If your OpenClaw instance is doing real work across Discord, Reddit, Notion, Google Sheets, TikTok, and internal APIs, direct exposure gets messy fast. A categorization router is usually the sweet spot. MCP-style mediation is even better if you care about policy, auditing, or approval steps.

And that brings up the part people weirdly separate from routing even though it’s the same design problem.

Tool sprawl is also a security problem wearing a productivity hat

One audit-focused Reddit post was brutal.

The user logged outbound calls for 30 days and found that 70% of installed skills made zero calls. Zero. Dead weight.

Worse, 4 skills were sending fields from prompts off-machine without the user realizing it. And 3 installed skills overlapped almost completely.

That should end the debate right there.

Bloated skill lists are not just confusing OpenClaw. They’re increasing your attack surface, your maintenance burden, and your odds of accidental data leakage for no upside.

This is why I don’t buy the idea that sandboxing alone solves it. The NemoClaw discussions and OpenClaw docs both point in the same direction: visibility control is the real control plane.

Not just "can this skill execute safely?"

Also: should the agent even know this skill exists in this session?

That’s where allow lists, deny lists, tool profiles, and per-session restrictions stop looking like boring admin features and start looking like the whole game.

The fix I wish more people started with

If your OpenClaw setup feels flaky, I would do this before touching the model:

1. Audit what’s actually loaded

Run:

openclaw skills list

Then be ruthless. If a skill hasn’t been used in weeks, remove it from the default visible set.
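To make "ruthless" actionable, here's a rough sketch of the 30-day-audit idea from earlier: tally outbound calls per skill from a log and flag the dead weight. The log format and skill names are invented for illustration, not a real OpenClaw log schema:

```python
from collections import Counter

# Hypothetical log lines: "timestamp skill_name endpoint"
LOG = [
    "2026-04-01T09:12 reddit_post api.reddit.example",
    "2026-04-01T10:05 web_search api.search.example",
    "2026-04-02T11:40 reddit_post api.reddit.example",
]
INSTALLED = ["reddit_post", "web_search", "tiktok_publish", "notion_write"]

# Count calls per skill, then list installed skills that never fired.
calls = Counter(line.split()[1] for line in LOG)
dead = [skill for skill in INSTALLED if calls[skill] == 0]
print(dead)  # skills with zero outbound calls in the audit window
```

Anything in the dead list is a candidate for removal from the default visible set. You can always keep it installed but hidden, loadable on explicit request.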

2. Split by domain, not by vibe

Don’t make one "content agent."

Make a Reddit agent, a Twitter agent, a TikTok agent, a research agent, and a Discord ops agent. If a human would need to stop and ask "which app am I in right now?" then your agent probably should too.

3. Use profiles and allow lists

A tiny openclaw.json fragment can do more for reliability than swapping from Claude to GPT-5 and hoping.

{
  "tools": {
    "profiles": {
      "reddit_agent": {
        "allow": ["reddit_post", "reddit_search", "browser_fetch"],
        "deny": ["tiktok_publish", "twitter_post", "discord_send"]
      },
      "tiktok_agent": {
        "allow": ["tiktok_publish", "media_transcode"],
        "deny": ["reddit_post", "notion_write"]
      }
    },
    "groups": {
      "social": ["reddit_post", "twitter_post", "tiktok_publish"],
      "research": ["browser_fetch", "web_search", "notion_write"]
    }
  }
}

4. Teach the orchestrator to delegate

Your orchestrator should not be a hero. It should be a receptionist.

# SKILL.md
You are an orchestrator.

Rules:
- Do not execute domain-specific actions directly if a specialist agent exists.
- First classify the request into one domain: reddit, twitter, tiktok, research, or ops.
- Delegate to the matching specialist.
- Only use shared tools for lightweight discovery or clarification.
- If multiple domains are possible, ask one short clarifying question.

That one instruction is often more useful than another 15 clever prompt paragraphs.

Yes, sometimes it really is the model

To be fair, not every failure is routing.

The InsiderLLM guide is right to call out model-side issues too: weak tool-calling support, malformed JSON, provider-specific formatting mismatches, and smaller local models that simply aren’t good at function selection can all create similar symptoms. If you’re running a shaky local stack on Llama or Qwen with imperfect tool-calling wrappers, you can absolutely get false positives and blame architecture for what is really model support.
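Before blaming architecture, it's worth checking whether the model is even emitting well-formed tool calls. A minimal sketch of that sanity check; the payload shape here is a common convention, not any specific provider's format:

```python
import json

def parse_tool_call(raw: str):
    """Return (tool, args) if the model emitted valid tool-call JSON, else None.

    A steady stream of None results here points at model or wrapper
    tool-calling support, not at your routing design.
    """
    try:
        call = json.loads(raw)
        return call["name"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

print(parse_tool_call('{"name": "web_search", "arguments": {"q": "openclaw"}}'))
print(parse_tool_call('{"name": "web_search", "arguments":'))  # truncated JSON → None
```

If malformed calls dominate your failures, the symptoms mimic routing drift but the fix lives in the model or its tool-calling wrapper.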

But I still think people over-attribute this stuff to model decline.

Because if your agent works dramatically better the moment you cut visible skills from 50 to 8, that wasn’t some mysterious intelligence collapse. That was clutter.

And clutter is fixable.

The weirdly hopeful part

This is the good news: tool sprawl is easier to fix than model quality.

You can’t personally make GPT-5 better at function calling. You can’t patch Claude Opus 4.6 from your desk. You probably don’t want to rebuild Qwen’s tool schema parser on a Raspberry Pi at 1 a.m.

But you can absolutely stop showing one OpenClaw agent every capability you’ve ever installed.

That’s the practical takeaway I wish more people heard earlier:

When an OpenClaw agent starts feeling dumb, don’t ask only whether the model got worse. Ask how many skills it can see, how many of them overlap, and whether the wrong agent is trying to do everyone’s job.

Most of the time, the fix is not more intelligence.

It’s fewer choices.
