I thought creative AI needed better prompts, but it actually needed a 4-step LLM routing pipeline

Marcus Chen · June 10, 2026 · 10 min read

I used to think the bottleneck in creative AI work was prompting. If the output felt generic, I assumed the fix was a smarter prompt, a longer prompt, or a more controlling prompt. The more time I spent watching real people try to build creative agents, the more obvious it became that prompting was not the main problem.

The real problem was workflow design.

That clicked for me while reading a small thread on r/openclaw from a jewelry designer trying to automate part of their process. It was not some massive viral post. The score was low, the thread was niche, and that is probably why it was so useful.

The designer already had ChatGPT doing trend summaries, concept lists, prompt refinement, and seasonal organization. But then they wrote the line that mattered: what they actually needed was an agent that could help run more of the workflow, not just suggest ideas.

That sentence is the whole thing.

A lot of people say they want AI to be a creative partner. What they really want is something much less romantic and much more valuable: a system that can take a vague signal from TikTok, Pinterest, Instagram, or competitor launches and turn it into a reviewable package with notes, mockups, references, and a clear next decision.

Once you see that, the usual “ask ChatGPT for ideas” workflow starts to feel weirdly underpowered. It is not useless. It is just solving the easiest five percent of the job.

The part after ideation is where the work actually lives.

If you ask ChatGPT for “summer jewelry trends inspired by coastal textures,” it will give you a decent answer. Ask for ten pendant concepts and it will absolutely produce ten pendant concepts. Ask it to rewrite prompts for Midjourney or a GPT-5-class image model and it can do that too.

But then the real process begins. You have to pressure-test those ideas against manufacturing constraints, sort references by season and material, create multiple visual directions, generate prompt variants, and package everything so a human can review it without digging through a giant transcript.

That is not brainstorming. That is operations.

The easiest way to explain the difference is this.

ChatGPT-style brainstorming

Output: mostly ideas and text
State: trapped inside one long chat
Human role: keeps prompting until something looks usable

Agent pipeline

Output: structured deliverables like briefs, mockups, folders, and notes
State: saved across tasks, files, and handoffs
Human role: explicit approval at the right checkpoints

One comment in that Reddit thread said something that sounded dismissive at first: this feels like a couple of skills stacked into a cron job. I actually think that was the smartest comment there.

Because once a process repeats, the answer is usually not “write a better mega-prompt.” The answer is “break the work into stages and make each stage reliable.”

That is where LLM routing stops being a nice optimization and becomes the entire game.

The reason one model keeps disappointing people is simple: they are asking it to be a trend researcher, a creative director, a manufacturing consultant, an image prompt engineer, and a file clerk. That is not a prompting failure. That is bad staffing.

The most useful advice in the thread was brutally practical: use one model for trend search, another for high-level creative thinking, and another for image mockups. Then make sure the agent knows when to use which one.

Yes. Exactly.

This is what good creative automation looks like in practice. Not benchmark screenshots. Not “which model wins on reasoning.” Actual role assignment.

My favorite split for this kind of workflow is pretty straightforward.

Grok for trend search and intake

Best use: pulling signals from the web fast
Why it works: good at broad trend collection from TikTok chatter, Pinterest patterns, competitor launches, and aesthetic shifts
What I would not use it for: final brief writing or nuanced creative tradeoffs

Claude Opus for creative reasoning and brief writing

Best use: turning messy research into a coherent design brief
Why it works: better at synthesis, contradictions, and taste-level reasoning
What I would not use it for: being the first-pass search engine for everything

GPT-5-class image models for mockups and visual exploration

Best use: generating prompt sets and visual directions once the brief is approved
Why it works: better at turning structured creative direction into something reviewable
What I would not use it for: trend intake or workflow coordination

n8n or Make for storage, naming, and handoff

Best use: folder creation, Airtable updates, Google Drive uploads, Slack notifications, and all the boring adult work
Why it works: because nobody should be manually organizing assets after every run
What I would not use it for: high-level creative reasoning
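
One way to make that role assignment explicit is a small routing table. This is a minimal sketch, not anyone's production setup: the task names and model labels are hypothetical placeholders, not real API model identifiers, and the lookup fails loudly when a task has no assigned role.

```python
from dataclasses import dataclass

# Hypothetical task names and model labels, for illustration only.
# The labels are placeholders, not real API model identifiers.
ROUTES = {
    "trend_search": "grok",
    "brief_writing": "claude-opus",
    "mockup_generation": "gpt-5-class-image",
    "asset_handoff": "n8n-or-make",
}


@dataclass(frozen=True)
class Route:
    task: str
    target: str


def route(task: str) -> Route:
    """Return the model or tool assigned to a task, failing loudly on unknown tasks."""
    if task not in ROUTES:
        raise ValueError(f"No route defined for task: {task!r}")
    return Route(task=task, target=ROUTES[task])


if __name__ == "__main__":
    for name in ROUTES:
        print(route(name))
```

Raising on an unknown task is a deliberate choice here. A silent fallback to one general-purpose model is exactly the trap the split is meant to avoid.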

A single general-purpose model can fake all of this. That is the trap. It can appear competent across every stage while quietly being mediocre at the exact moments where quality matters.

The tradeoff is pretty clear.

Single general-purpose model

Quality: uneven across tasks
Cost: expensive if every stage uses the premium model
Failure mode: vague and hard to debug

Model-specific routing

Quality: each task gets a model that fits the job
Cost: lower, because easy steps run on cheaper models
Failure mode: easier to isolate by stage

This is also why the “best model for tool calling” debate usually misses the point. The best model is not just the one with the highest score on some benchmark. It is the one that reliably knows when to search, when to write, when to generate, and when to stop and hand the work back to a person.

That last part matters more than most people admit.

One of the smartest comments in the thread was a reminder to write the design on paper and put the human in the loop. I love that advice because it sounds obvious, but most agent builders skip it.

Creative work is full of moments where the human judgment call is the actual product. Is this trend relevant to our customer? Does this concept feel like our brand or just like whatever is hot this week? Is this manufacturable in brass, sterling silver, or gold vermeil? Which direction deserves another round?

If you remove the human from that diagram, you do not get an autonomous design studio. You get a folder full of polished nonsense.

The goal is not to automate taste. The goal is to remove the repetitive work between inspiration and review.

So the output should be things a human can actually approve.

Trend summary
Design brief
Constraint check
Image prompt set
Mockup batch
Organized folder with references and notes
Human decision

That final step is not a weakness in the system. It is the reason the system is useful.

If I were building this for real, the workflow would look less like one giant chat window and more like a small routed pipeline.

A main agent kicks off the process. A trend-search sub-agent uses Grok or parallel search. A reasoning sub-agent uses Claude Opus to write the brief and resolve contradictions. A mockup sub-agent uses a GPT-5-class image model to generate visual directions. Then an aggregator collects outputs, checks completeness, names assets, and sends everything through n8n or Make into Google Drive, Airtable, Notion, or Slack for review.
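
The shape of that pipeline can be sketched in a few lines. Everything here is an assumption for illustration: the three sub-agent functions are hypothetical stubs that return canned strings instead of calling Grok, Claude Opus, or an image model, and the naming scheme is made up. The point is the structure, where each stage is its own function and the aggregator refuses to hand off an incomplete package.

```python
from dataclasses import dataclass, field

# Every sub-agent below is a hypothetical stub. In a real setup each one
# would call the model or tool named in its docstring.


def search_trends(topic: str) -> str:
    """Stand-in for the trend-search sub-agent (Grok or parallel search)."""
    return f"trend notes for {topic}"


def write_brief(trends: str) -> str:
    """Stand-in for the reasoning sub-agent (Claude Opus)."""
    return f"brief based on: {trends}"


def make_mockups(brief: str) -> list[str]:
    """Stand-in for the mockup sub-agent (a GPT-5-class image model)."""
    return [f"mockup_{i}.png" for i in range(1, 4)]


@dataclass
class Package:
    topic: str
    trends: str = ""
    brief: str = ""
    mockups: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Names of deliverables that are still empty."""
        return [n for n in ("trends", "brief", "mockups") if not getattr(self, n)]


def run_pipeline(topic: str) -> Package:
    """Main agent: call each sub-agent in order, then aggregate."""
    pkg = Package(topic=topic)
    pkg.trends = search_trends(topic)
    pkg.brief = write_brief(pkg.trends)
    pkg.mockups = make_mockups(pkg.brief)
    # Aggregator: check completeness before anything is handed off.
    if pkg.missing():
        raise RuntimeError(f"Incomplete package, missing: {pkg.missing()}")
    # Aggregator: name assets consistently so the handoff tool can file them.
    slug = topic.lower().replace(" ", "-")
    pkg.mockups = [f"{slug}_{name}" for name in pkg.mockups]
    return pkg


if __name__ == "__main__":
    print(run_pipeline("Coastal Textures"))
```

The returned package is what would be passed to n8n or Make. The upload, record update, and notification steps are left out because they depend entirely on the connectors you use.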

That split matters a lot.

OpenClaw-style agent setup

Best for: autonomous loops, delegated tasks, experimentation
Strength: flexible thinking and orchestration
Weakness: not the best place for explicit business process plumbing

n8n or Make workflow

Best for: app connectors, folder logic, records, notifications, approvals
Strength: production handoff and organization
Weakness: not where I want the creative reasoning to live

I like OpenClaw for the thinking and n8n or Make for the plumbing. That feels like the adult version of agent design.

And there is a cost reason to do it this way too.

Creative-agent workflows are iterative by nature. You search, summarize, rewrite, generate, compare, revise, and re-run. If every one of those steps hits the most expensive model, the budget gets ugly fast.

That is not theoretical. I kept seeing the same complaint in adjacent Reddit discussions while researching this topic. One person said a single prompt ate 61 percent of their session limit on a $20 plan. Another said one task with a high-end model cost them about $22. Another just said the quiet part out loud: you will burn tokens and money.

That is not whining. That is a workflow constraint.

A creative pipeline has a lot of cheap tasks and a few expensive ones. If you do not separate them, you end up paying premium reasoning prices for glorified sorting, classification, and cleanup.

The sane pattern is to route cheap work to cheap models and reserve the expensive models for moments where taste, synthesis, or risk actually matter. I have seen people use stacks like Ollama for local utility work, DeepSeek Chat for normal agent tasks, and Claude Sonnet or Claude Opus for hard reasoning and final checks. The exact stack is flexible. The principle is not.
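
To make the arithmetic concrete, here is a rough sketch of tiered routing. The cost units are invented relative numbers, not real prices, and the step names and tier assignments are hypothetical. It only shows how the total changes when cheap steps stop hitting the premium tier.

```python
from typing import Optional

# Hypothetical relative cost units per call. These are not real prices.
COST_PER_CALL = {"local": 0, "standard": 1, "premium": 20}

# Hypothetical mapping from step to the cheapest tier assumed to handle it.
TIER_FOR_STEP = {
    "sort_references": "local",
    "classify_trend": "local",
    "summarize_sources": "standard",
    "rewrite_prompts": "standard",
    "write_brief": "premium",
    "final_check": "premium",
}


def tier_for(step: str) -> str:
    """Pick the tier for a step; unknown steps default to the premium tier."""
    return TIER_FOR_STEP.get(step, "premium")


def run_cost(steps: list[str], force_tier: Optional[str] = None) -> int:
    """Total cost units for one run, optionally forcing every step onto one tier."""
    return sum(COST_PER_CALL[force_tier or tier_for(step)] for step in steps)


if __name__ == "__main__":
    steps = list(TIER_FOR_STEP)
    print("routed:", run_cost(steps))
    print("all premium:", run_cost(steps, force_tier="premium"))
```

With these made-up units, one routed run costs 42 and the same run forced onto the premium tier costs 120. The gap widens with every iteration, which is where iterative creative work feels it.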

This is also why Standard Compute is interesting for teams building agentic workflows instead of one-off demos. When you are running repeated trend search, brief generation, mockups, and revisions across n8n, Make, OpenClaw, or custom automations, usage-based billing becomes its own form of friction. People start optimizing around fear.

Flat-rate access changes the behavior. You can route aggressively, test more stages, run more iterations, and stop treating every agent step like it might trigger a tiny finance meeting. If your setup already uses OpenAI-compatible SDKs or HTTP clients, Standard Compute is basically a drop-in replacement for the OpenAI API, which makes the switch much less dramatic than people assume.

That matters because the whole point of a creative ops pipeline is repeatability. If the economics punish repetition, the system never becomes part of real work.

If I were automating one part of this workflow first, I would not start with image generation. That is the flashy part, which is exactly why it is the wrong first move.

I would start with trend intake and brief structure. That is where consistency actually comes from. If your research inputs are messy, your mockups will just be expensive versions of the same mess.

The order I would build is:

1. Scheduled trend search using Grok or parallel search in OpenClaw
2. Brief generation in Claude Opus with constraints built in
3. Concept pressure test against manufacturing realities
4. Prompt set generation for multiple visual directions
5. Mockup generation in a GPT-5-class image tool
6. Asset organization in Google Drive, Airtable, or Notion through n8n or Make
7. Human review gate before a second round
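
The build order above can be written down as an ordered stage list with a hard stop at the review gate. This is a sketch under stated assumptions: the stage names are my shorthand for the steps in the list, the runner does no real work, and `approve` is a hypothetical callback standing in for a person's decision.

```python
from typing import Callable

# Stage names are shorthand for the build order above.
# The runner is a hypothetical sketch and does no real work.
STAGES = [
    "trend_search",
    "brief_generation",
    "constraint_check",
    "prompt_set",
    "mockup_generation",
    "asset_organization",
    "human_review",
]


def run_round(approve: Callable[[list[str]], bool]) -> tuple[list[str], bool]:
    """Run the automated stages once, then stop at the human review gate.

    `approve` receives the list of completed stages and returns True
    to allow a second round.
    """
    completed: list[str] = []
    for stage in STAGES:
        if stage == "human_review":
            # The gate: nothing continues until a person has decided.
            return completed, approve(completed)
        completed.append(stage)  # a real runner would do the stage's work here
    return completed, False


if __name__ == "__main__":
    done, go_again = run_round(lambda stages: len(stages) == 6)
    print(done, go_again)
```

Keeping the gate as an explicit stage means a second round cannot start by accident. The loop only continues when the callback says so.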

That order sounds less magical than “AI designs my collection.” It is also the order that survives contact with real work.

That was the surprise buried inside a tiny Reddit thread most people would scroll past. The jewelry designer thought they were asking for a creative agent. What they were really asking for was a production pipeline with clear roles, clear folders, clear checkpoints, and one human decision at the end.

Once you see that, the category looks different.

The useful creative assistant is not the one that gives you more ideas. It is the one that shows up tomorrow morning with the research done, the brief written, the mockups organized, and a clean place for you to say yes or no.
