I stopped blaming prompts for agent loops when I realized my agents had a management problem

Daniel NguyenMay 24, 2026 · 11 min read

I used to think most multi-agent loops were prompt failures. If an agent system got stuck bouncing between roles, my first instinct was to blame the model, tweak the instructions, and try again.

You probably know the pattern. The Research Agent asks the Analyst Agent for clarification, the Analyst Agent asks the Writer Agent for a summary, the Writer Agent asks the Reviewer Agent to check the summary, and the Reviewer Agent sends everyone back to Research for one more source. Forty turns later, the logs look busy, but nothing is actually done.

For a while, I treated that like a model-quality issue. Maybe GPT-5 was too verbose, maybe Claude Opus 4.6 was too cautious, maybe a different open-weight model would be less likely to spiral.

But the more docs I read, the more framework examples I looked at, and the more Reddit threads I dug through, the less I believed that explanation. The real problem is usually simpler and more embarrassing: most agent loops are not prompt failures. They are management failures.

That shift changed how I look at almost every multi-agent design. Once you stop treating the agents like a magical collective intelligence and start treating them like an org chart, a lot of weird behavior suddenly makes sense.

One comment from a thread on r/openclaw really crystallized it for me: when agents are designed as peers, nobody clearly owns the outcome. That sounds obvious after the fact, but I think a lot of people miss it because peer-to-peer agent collaboration looks impressive in demos.

In production, though, it often turns into a tiny bureaucracy. The agents are not under-prompted. They are over-empowered.

And the weird part is that the major frameworks mostly agree on this, even if they describe it in different language. Once I noticed that, a lot of vendor guidance that had felt scattered started lining up into one pretty clear pattern.

Anthropic has been fairly blunt about this. In its guidance on building agents, it says the best implementations tend to use simple, composable patterns instead of giant complex frameworks, and it recommends starting with the simplest possible solution before adding more agentic behavior.

I think that is a bigger statement than it first appears. Some of what gets marketed as autonomy is really just architecture debt wearing a cool jacket.

CrewAI makes the same point in a more practical way. Its default process is sequential, and hierarchical mode only happens if you explicitly enable Process.hierarchical. In that setup, a manager agent coordinates the workflow, delegates tasks, and validates outcomes.

That default matters. Delegation is disabled by default for a reason, and framework authors do not make a choice like that accidentally.

Here is the basic shape in CrewAI:

from crewai import Crew, Process, Agent

crew = Crew(
    agents=[researcher],
    process=Process.hierarchical,
    manager_llm="gpt-4"
)

You have to choose hierarchy on purpose. You have to assign a manager on purpose. You have to make authority visible in the architecture instead of pretending it will somehow emerge from a room full of clever workers.

That is not just an implementation detail. It is org design.

Once you look at agent systems that way, three failure modes show up again and again. Authority gets blurry, memory gets polluted, and stopping conditions become an afterthought.

The first one is the real killer. If the Research Agent, Analyst Agent, and Writer Agent can all reopen work, reinterpret the goal, or delegate sideways whenever they feel like it, then nobody owns the final answer.

At that point you do not have a team. You have a Slack channel with no manager.

Microsoft AutoGen makes this tension really visible. Its Group Chat pattern uses a Group Chat Manager to select the next speaker in a shared thread, and the conversation continues until a termination condition is met.

That can absolutely work. It is flexible, and it feels smart when you watch it run.

It is also the kind of thing that drifts unless you lock it down hard. Shared-thread collaboration is seductive because it looks like intelligence, but a lot of the time it is just unbounded coordination overhead.

AutoGen’s Mixture of Agents pattern feels much healthier to me. It separates worker agents from a single orchestrator agent, organizes workers into layers, and has the orchestrator aggregate results at the end.

That is less magical. It is also much easier to reason about at 2 a.m. when something has gone wrong in production.

And honestly, that is the standard I care about now. Once the novelty wears off, I do not want my multi-agent system to feel conversational. I want it to be debuggable.

That brings me to memory, which is where a lot of otherwise decent builds quietly go off the rails. I found another strong discussion on r/openclaw where someone described a setup I think more teams should copy: worker agents do not write memory directly. They emit structured memory events with evidence, and a separate memory curator validates, redacts, deduplicates, routes, or discards them.

That sounds strict, but strict is good here. If every worker can write directly into shared memory, then bad guesses become durable facts.

A speculative search result turns into stored preference. A one-off exception becomes policy. Yesterday’s wrong assumption becomes tomorrow’s context window.

OpenClaw’s design gives a nice clue about how to avoid that. It does not treat agents like one giant shared mind. It isolates them.

Per-agent state lives under ~/.openclaw/agents/<agentId>/..., sessions live under ~/.openclaw/agents/<agentId>/sessions, and auth profiles live under ~/.openclaw/agents/<agentId>/agent/auth-profiles.json. That structure is telling you something important: each agent is supposed to be a scoped unit with its own workspace, state, auth, and session history.

That is a much healthier model than pretending every agent should have equal access to one giant shared memory blob. It is more org chart, less hive mind.

You can even see it in the CLI:

openclaw agents add work
openclaw agents list --bindings

Bindings route work inward, and state stays local unless something with authority decides to move it. That instinct is correct.

The same pattern shows up in OpenAI’s Agents SDK. What I like there is the emphasis on handoffs.

Delegation is explicit, not ambient. A triage agent can hand work to a Billing agent or a Refund agent through transfer tools like transfer_to_refund_agent, which sounds like a small implementation choice until you realize it fundamentally changes how the system behaves.

A handoff is not just a convenience. It is a boundary.

That matters because boundaries are where you can inspect, stop, and validate work. OpenAI’s guardrails model reinforces that by splitting guardrails into three categories: input guardrails, output guardrails, and tool guardrails.

A lot of people say they added guardrails and assume the problem is solved. But the detail that matters is where those guardrails actually run.

Input guardrails apply to the first agent, output guardrails apply to the final agent, and tool guardrails run on every custom tool invocation. So if you want review checkpoints at every delegated step, the tool layer is where the control really lives.

That is also why OpenAI’s execution modes for input guardrails are interesting. With run_in_parallel=True, they run in parallel by default. With run_in_parallel=False, they can block execution before the system burns tokens or calls tools.

That is basically what a competent manager does. Stop bad work before the team spends time on it.

So what does the org chart actually look like when it works? In my experience, it is not fancy.

It is one supervisor that owns the outcome, a set of narrow workers, and clear rules about who is allowed to decide what happens next. The supervisor is the only one that can reopen work, terminate the run, or approve the final result.

Workers should get scoped jobs. A Web Search Agent searches, a Data Analyst Agent analyzes, a Refund Agent handles refunds, and a Writer Agent drafts.

What they should not do is renegotiate the business objective or start freelancing across the task graph because they technically can. That is where systems start to feel intelligent right before they become expensive and unreliable.

I also think workers should return artifacts, not opinions about orchestration. Give me a ranked source list, extracted entities, a draft with confidence notes, or a diff.

Do not give me a paragraph about how the Planning Agent and Reviewer Agent should probably discuss next steps. That is how you end up paying for synthetic middle management.

Memory needs an owner too. Either the supervisor writes memory, or a dedicated curator does.

Workers can propose memory events, but they should not casually commit them. That one design choice alone prevents a surprising amount of long-term weirdness.

And every boundary needs a checkpoint. If a worker uses a tool, writes a file, calls an API, or triggers another agent, that boundary should be inspectable and stoppable right there.

Not eventually. Not after the run is over. At the boundary itself.

If I had to summarize the framework patterns I trust most, it would look like this.

CrewAI Hierarchical Process

A manager agent coordinates and validates work
Delegation is disabled by default
You must explicitly choose Process.hierarchical and set manager_llm
The architecture forces you to make authority visible

OpenAI Agents SDK Handoffs

Delegation happens through explicit transfer tools rather than free-form peer chatter
Guardrails are attached to clear workflow boundaries
Sessions provide persistent working context without turning everything into shared-memory soup
The model encourages reviewable transitions between roles

Microsoft AutoGen Group Chat vs Mixture of Agents

Group Chat uses a shared thread and manager-selected speaker flow
It is flexible, but can drift if termination conditions are weak
Mixture of Agents uses one orchestrator plus worker layers
Aggregation and stopping logic are more explicit, which makes failures easier to debug

If I am building something that needs to run reliably inside n8n, Make, Zapier, OpenClaw, or a custom Python worker, I am choosing hierarchical orchestration over peer collaboration almost every time. Not because it is prettier, but because it fails in ways I can understand.

That is an underrated feature when you are paying for every mistake. It is also why pricing model matters more than people admit.

When you are running agent workflows on usage-based APIs, bad orchestration compounds the pain. A loose peer-to-peer design does not just create messy logs. It creates runaway token burn, surprise bills, and a constant temptation to throttle useful work because you are scared of the next loop.

That is exactly why predictable infrastructure is so appealing for agent builders. If you are experimenting with supervisor-worker patterns, trying different routing strategies, and letting automations run continuously in tools like n8n or Make, flat-rate compute changes the psychology of the whole thing.

You stop treating every delegated step like a billing event. You can focus on making the system reliable instead of obsessing over whether one bad loop just doubled your monthly spend.

That is also why Standard Compute feels aligned with how agent systems actually get built. If your stack already speaks the OpenAI API, being able to swap in a flat-price endpoint and run agents without token anxiety is a practical advantage, not a marketing line.

Especially once you realize the path to better agents usually involves more testing, more orchestration experiments, and more long-running automation, not less. Predictable cost makes it easier to fix the architecture instead of prematurely optimizing around usage bills.

Now, to be fair, not every loop is an org-design problem. Some are still caused by weak termination conditions, bad tool schemas, or plain model unreliability.

AutoGen’s Group Chat docs explicitly depend on termination logic, and if that logic is sloppy, even a good hierarchy can wander. There are also cases where broader autonomy is worth it.

Anthropic’s distinction between workflows and agents is useful here. If you truly need flexibility and model-driven decision-making at scale, then yes, more autonomous behavior can be justified.

But that does not mean every worker should get broad autonomy by default. That is the leap people keep making, and it is where things start to break.

If you want to build agent systems that survive contact with production, the boring answer is usually the right one. Smaller jobs, clearer ownership, tighter routing, stricter memory rules.

That was the part that surprised me most. The fix for endless loops is usually not a smarter prompt.

It is a smaller org chart.

Less peer-to-peer chatter, more explicit handoffs. Less shared memory, more scoped state. Fewer agents allowed to decide what happens next.

At first that can feel like you are making the system less intelligent. In practice, you are making it less political.

And once you see multi-agent failures as org-design failures, the questions get much better. You stop asking how to make the Research Agent more proactive and start asking why the Research Agent is allowed to reopen the plan in the first place.

That second question has given me better systems every single time. Systems that stop when they should, remember only what matters, and do not turn a three-step task into a committee meeting.

So if your agents keep looping, I would stop tuning personality first. I would redraw the org chart.

I stopped blaming prompts for agent loops when I realized my agents had a management problem

Keep reading

I stopped blaming prompts for agent loops when I realized my agents had a management problem

I think “remember this” is dead — agent memory needs branches, diffs, and rollback now