I saw the screenshot the same way everyone else did: huge number, instant outrage, instant jokes. A dashboard showing $1,305,088.81 in OpenAI API spend over 30 days is the kind of image that makes Reddit stop whatever it’s doing and turn into a tribunal.
And to be fair, the jokes were good. In one r/openclaw thread, someone described it as “100 monkeys writing code on golden typewriters made of data center waste.” That’s brutal, and honestly, if all you saw was the top-line number, it felt deserved.
But the more I read, the less the story looked like reckless spending and the more it looked like a preview of where agentic development is heading. Tom’s Hardware reported that Peter Steinberger’s screenshot was tied to about 603 billion tokens, 7.6 million requests, and roughly 100 Codex instances. On the day of the screenshot alone, the spend was around $19,985.84 across 206,000 requests.
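To put those reported totals on a human scale, here is a quick back-of-the-envelope sketch. It only divides the figures quoted above; the results are blended averages across every model, token type, and mode in use, not anyone’s actual price list, and the variable names are mine.

```python
# Figures as reported above; everything derived from them is a blended average.
total_spend_usd = 1_305_088.81
total_tokens = 603e9        # "about 603 billion tokens"
total_requests = 7.6e6      # "7.6 million requests"
codex_instances = 100       # "roughly 100 Codex instances"
days = 30

usd_per_million_tokens = total_spend_usd / (total_tokens / 1e6)
tokens_per_request = total_tokens / total_requests
usd_per_request = total_spend_usd / total_requests
usd_per_instance_per_day = total_spend_usd / codex_instances / days

# Same division for the single day of the screenshot.
day_spend_usd = 19_985.84
day_requests = 206_000
day_usd_per_request = day_spend_usd / day_requests

print(f"${usd_per_million_tokens:.2f} per 1M tokens (blended)")
print(f"{tokens_per_request:,.0f} tokens per request")
print(f"${usd_per_request:.3f} per request over 30 days")
print(f"${day_usd_per_request:.3f} per request on the screenshot day")
print(f"${usd_per_instance_per_day:,.2f} per instance per day")
```

That works out to roughly $2.16 per million tokens, about 79,000 tokens per request, and around 17 cents per request. Each request is cheap. There are just millions of them.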
That’s not one person rage-prompting GPT-5.4. That’s a small software factory running all day.
The detail that changed the whole story for me was Fast Mode. Steinberger said that with Codex Fast Mode turned off, the raw API cost drops from roughly $1.3 million to about $300,000. Still massive, obviously, but now we’re talking about a latency choice layered on top of an always-on agent fleet, not just some cartoonish waste pile.
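The size of that latency choice is easy to work out from the two rounded figures. This is a rough sketch of the ratio implied by the numbers as quoted, not a statement about how Fast Mode is actually priced.

```python
# Rounded figures as quoted above, so the results are approximate.
with_fast_mode_usd = 1_300_000     # "roughly $1.3 million"
without_fast_mode_usd = 300_000    # "about $300,000"

# How many times larger the bill is with Fast Mode on.
fast_mode_multiplier = with_fast_mode_usd / without_fast_mode_usd

# Share of the total bill attributable to choosing lower latency.
fast_mode_share = 1 - without_fast_mode_usd / with_fast_mode_usd

print(f"Fast Mode multiplier: about {fast_mode_multiplier:.1f}x")
print(f"Share of spend that is the latency premium: about {fast_mode_share:.0%}")
```

On these numbers, a bit more than three quarters of the headline figure is the price of speed rather than the price of the work itself.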
That distinction matters. People want to make this a morality tale about one person burning money on tokens, but the more useful takeaway is that per-token pricing starts behaving very differently once your workload looks like a fleet instead of a chat window.
OpenClaw wasn’t just chatting with a model. According to reporting, it was reviewing pull requests, scanning commits for security issues, deduplicating GitHub issues, proposing fixes, monitoring benchmarks, and even turning meeting discussions into PRs. That is a completely different shape of work from “developer asks one question, model gives one answer.”
OpenAI’s own Codex docs point in the same direction. Their cloud coding agents can work on many tasks in parallel, each in its own sandbox, often for 1 to 30 minutes at a time. They even support AGENTS.md files so repositories can define how the agent should behave.
Once you have long-running subagents operating in parallel, token spend stops feeling like a clean usage meter. It starts feeling like weather. You don’t really control it directly; you just spend your day trying to route around it.
That’s the part I think most people miss. The first thing that breaks when you run 100 agents is not your budget. It’s your sanity.
If you read OpenAI’s docs like an operator instead of a hobbyist, they’re surprisingly blunt about the constraints. Limits can show up as RPM, TPM, RPD, TPD, and monthly org or project caps, and some model families share those limits. So when someone says they hit an OpenAI API “quota exceeded” error, that can mean several overlapping bottlenecks, not one simple problem.
At small scale, per-token billing feels elegant. You pay for what you use, and the math is easy enough to explain in a meeting. At fleet scale, the billing model drags a bunch of operational work in behind it.
Now you’re managing bursts across many agents, deciding which jobs deserve low latency, watching shared limits across model families, trying to preserve prompt prefixes for caching, and building internal dashboards so nobody accidentally smashes into a monthly cap. The money hurts, sure, but the constant vigilance is what really wears teams down.
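One common way to keep a fleet under a shared tokens-per-minute cap is a client-side token bucket that every agent checks before sending a request. The sketch below is a minimal, hypothetical version: the class and method names are mine, the limit value is made up, and it does not call any OpenAI API. It only shows the bookkeeping.

```python
import threading
import time


class SharedTokenBudget:
    """Token-bucket limiter for a tokens-per-minute (TPM) budget shared by agents."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = float(tokens_per_minute)  # start with a full bucket
        self.clock = clock                         # injectable so tests can fake time
        self.last_refill = clock()
        self.lock = threading.Lock()               # agents may run in separate threads

    def _refill(self):
        # Add back budget for the time elapsed, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.available = min(
            self.capacity, self.available + elapsed * self.refill_per_second
        )
        self.last_refill = now

    def try_acquire(self, estimated_tokens):
        """Reserve budget for one request; False means the caller should wait."""
        with self.lock:
            self._refill()
            if estimated_tokens <= self.available:
                self.available -= estimated_tokens
                return True
            return False


# Hypothetical usage: a 60,000 TPM budget shared by every agent in the process.
budget = SharedTokenBudget(tokens_per_minute=60_000)
if budget.try_acquire(estimated_tokens=8_000):
    pass  # send the request here
else:
    pass  # queue the job, back off, or route it to a slower tier
```

A real fleet would need one of these per limit type and per shared model family, and a cross-process store if agents run on different machines, which is exactly the operational drag described above.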
I think that’s why this story stuck with me. I’ve seen the same pattern on a much smaller scale in automation work. The moment an LLM stops being a tool you occasionally call and starts becoming infrastructure inside workflows, you stop thinking like a prompt engineer and start thinking like an SRE with a token meter hanging over your head.
Prompt design changes too. It stops being this clever craft thing and turns into plumbing.
OpenAI says Prompt Caching can cut latency by up to 80% and input token costs by up to 90%, which sounds incredible until you read the conditions. Cache hits require exact prefix matches, usually on prompts longer than 1024 tokens, and OpenAI notes that effectiveness can degrade once the same prefix goes above roughly 15 requests per minute and spills across more machines.
That means agent prompt sloppiness is no longer just an aesthetic problem. If ten agents prepend slightly different instructions before reading the same codebase, you don’t just get inconsistency. You get a bigger bill.
Static instructions need to be stable. Repo guidance needs to be stable. Variable content needs to be pushed later in the prompt. Once you’re running a lot of agents, prompt discipline is not about elegance anymore. It’s about whether your system behaves like infrastructure or like chaos.
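Here is a small sketch of what that ordering looks like in practice. The prompt text and function names are hypothetical, and it measures shared prefix in characters as a stand-in for tokens, since real cache hits depend on token-level prefix matches and the minimum length mentioned above.

```python
import os

# Stable content: identical for every agent and every call.
STATIC_INSTRUCTIONS = "You are a code review agent. Follow the repository conventions.\n"
REPO_GUIDANCE = "Repository guidance (for example, AGENTS.md contents) goes here.\n"


def build_prompt(task):
    # Stable content first, per-request content last, so prompts share a long prefix.
    return STATIC_INSTRUCTIONS + REPO_GUIDANCE + "Task: " + task


def build_prompt_sloppy(agent_name, task):
    # Anti-pattern: per-agent text at the very start changes the prefix immediately.
    return f"[{agent_name}] " + STATIC_INSTRUCTIONS + REPO_GUIDANCE + "Task: " + task


def shared_prefix_len(prompts):
    # os.path.commonprefix compares strings character by character.
    return len(os.path.commonprefix(prompts))


stable = [build_prompt("review PR A"), build_prompt("triage issue B")]
sloppy = [
    build_prompt_sloppy("agent-1", "review PR A"),
    build_prompt_sloppy("agent-2", "triage issue B"),
]

print("stable ordering shares", shared_prefix_len(stable), "characters")
print("sloppy ordering shares", shared_prefix_len(sloppy), "characters")
```

With the stable ordering, the two prompts share everything up to the task text. With the agent name up front, they diverge after a handful of characters, and nothing after that point can match as a prefix.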
And then there’s the quiet part OpenAI itself seems to be admitting through pricing. If per-token billing were a perfect fit for every workload, they wouldn’t keep adding exceptions around it.
The Batch API cuts input and output costs by 50% for jobs that can finish within 24 hours. Flex processing is priced like Batch for slower, lower-priority work. That’s OpenAI acknowledging that asynchronous agent workloads are economically different from interactive chat, even if they don’t say it in those words.
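A rough way to see what that discount is worth: take a monthly bill at standard rates and ask what fraction of the work could wait up to 24 hours. The 50% discount comes from the paragraph above; the 40% deferrable fraction in the example is purely hypothetical, and the function name is mine.

```python
BATCH_DISCOUNT = 0.50  # Batch cuts input and output costs by 50%, as described above


def cost_with_batch(standard_cost_usd, deferrable_fraction):
    """Estimate the bill if a fraction of the work moves to Batch-rate pricing."""
    if not 0.0 <= deferrable_fraction <= 1.0:
        raise ValueError("deferrable_fraction must be between 0 and 1")
    interactive = standard_cost_usd * (1 - deferrable_fraction)
    deferred = standard_cost_usd * deferrable_fraction * (1 - BATCH_DISCOUNT)
    return interactive + deferred


# Hypothetical example: a $300,000 bill where 40% of the work can wait.
estimate = cost_with_batch(300_000, deferrable_fraction=0.40)
print(f"Estimated bill: ${estimate:,.0f}")
```

The ceiling on savings is the discount times the deferrable share, so the interesting question is less about the rate and more about how much of the fleet’s work is actually urgent.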
The comparison is pretty straightforward once you strip away the marketing language.
OpenAI Standard API
- Billing: per-token pricing by model and token type
- Best for: interactive requests, tightly controlled workloads, lower-volume usage
- Tradeoff: you also inherit RPM, TPM, RPD, TPD, and monthly cap management
OpenAI Batch and Flex
- Billing: about 50% off standard rates for Batch; Flex is priced at Batch rates for slower, lower-priority jobs
- Best for: evaluations, enrichment, and non-urgent agent work
- Tradeoff: you give up immediacy in exchange for better economics
OpenRouter
- Billing: pass-through provider pricing plus a 5.5% credit-purchase fee
- Best for: teams that want OpenAI-compatible routing, analytics, and key-level visibility across providers
- Tradeoff: it adds control, but it doesn’t eliminate the underlying per-token mindset
I don’t think per-token billing is bad. I think it’s honest for one specific shape of work.
If you’re making occasional calls from a side project, or your workload is bursty and weird, per-token pricing is fine. Maybe it’s even ideal. OpenAI’s own Codex numbers suggest many developers are more in the $100 to $200 per month range, with lots of variance, which is a completely different universe from 603 billion tokens.
But if you’re running n8n automations, Make scenarios, Zapier tasks, OpenClaw jobs, or custom coding agents all day, the mismatch starts to show. Your orchestration layer is already priced in executions or runs, and then your model layer is priced in tokens. You end up paying one tax for workflow automation and another tax for cognition.
That second tax is where teams start changing behavior. Not because the work isn’t valuable, but because the meter is always visible.
While researching this, I ran into another r/openclaw discussion where someone built apps to monitor Claude and ChatGPT usage just to answer a very normal question: am I actually better off on a subscription plan or API pricing? Their summary was perfect: “How much am I saving on a subscription plan vs. API token costs alone? Spoiler alert: about 15x what you're paying for the plan.”
That line stuck with me because it captures the psychology of the whole thing. Once people can see the meter, they start building around the meter.
I found the same energy in another r/openclaw thread where someone said, “Felt the same when openclaw first came out at Jan. I was on a token budget and claw cost me an arm and a leg.” That’s nowhere near the Peter Steinberger scale, and that’s exactly why it matters.
You do not need 100 agents to feel token anxiety. You just need enough automation that every experiment begins with a tiny internal flinch.
And that flinch changes what teams build. It changes what they test, what they leave running, how often they retry, how much context they include, and whether they let agents operate continuously or keep them on a short leash.
So what pricing model actually fits agent fleets? My honest answer is: not one model, two.
Per-token pricing still makes sense when the work is truly interactive. Human-in-the-loop chat, short-lived coding help, low-volume internal tools, and messy experiments all fit that model. If a developer asks GPT-5.4 to debug a flaky test, charging by tokens is legible and fair enough.
OpenAI’s published rates make that pretty clear too. GPT-5.5 is listed at $5.00 per 1M input tokens, $0.50 per 1M cached input tokens, and $30.00 per 1M output tokens. GPT-5.4 comes in at $2.50, $0.25, and $15.00. Expensive, yes, but at least understandable when usage is bounded.
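Those three rates per model are enough to estimate a single request. The sketch below uses the prices quoted above; the dictionary keys are just labels for this example rather than confirmed API model identifiers, and the token counts in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    input_per_m: float          # USD per 1M uncached input tokens
    cached_input_per_m: float   # USD per 1M cached input tokens
    output_per_m: float         # USD per 1M output tokens


# Prices as quoted in this article; keys are labels, not guaranteed model IDs.
RATES = {
    "gpt-5.5": Rates(5.00, 0.50, 30.00),
    "gpt-5.4": Rates(2.50, 0.25, 15.00),
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request from its token counts."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    r = RATES[model]
    uncached = input_tokens - cached_tokens
    total = (
        uncached * r.input_per_m
        + cached_tokens * r.cached_input_per_m
        + output_tokens * r.output_per_m
    )
    return total / 1_000_000


# Hypothetical request: 80,000 input tokens and 2,000 output tokens.
cold = request_cost("gpt-5.4", 80_000, 2_000)
warm = request_cost("gpt-5.4", 80_000, 2_000, cached_tokens=70_000)
print(f"no cache hits: ${cold:.4f}")
print(f"70,000 input tokens cached: ${warm:.4f}")
```

For that hypothetical request, the estimate drops from about 23 cents to about 7 cents when most of the input is served from cache, which is why the prefix discipline described earlier shows up directly on the bill.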
Persistent, parallel, always-on agents are different. If you have dozens of code review agents, benchmark watchers, issue triagers, and repo-specific fixers running continuously, what you are really buying is not just tokens. You are buying the ability to stop thinking about every token.
That’s why infrastructure-style pricing is becoming more attractive. Predictable monthly cost is not as flashy as frontier model benchmarks, but it matches how operators actually want to buy compute for automations. Boring is good when the alternative is building internal dashboards just to figure out whether your agents are being useful or merely expensive.
That’s also why Standard Compute exists. It gives teams an OpenAI-compatible API with flat monthly pricing, so the economics fit the way agent systems actually run. Instead of obsessing over token counts across GPT-5.4, Claude Opus 4.6, and Grok 4.20, teams can let routing, batching, prompt optimization, and throttling happen behind the scenes and focus on whether the automation is delivering value.
That doesn’t mean every workload should move to flat-rate pricing. If you barely touch the API, usage-based billing is still perfectly rational. But if your agents run all day inside n8n, Make, Zapier, OpenClaw, or custom workflows, there’s a point where the real product you want is not a cheaper token. It’s fewer pricing decisions.
That’s where I landed after looking past the screenshot. The viral image was not proof that agentic coding is fake or doomed. It was proof that OpenAI API cost stops behaving like a normal software bill once agents become parallel, persistent, and semi-autonomous.
If you’re building serious automations, the first question shouldn’t be “what’s the cheapest model per million tokens?” It should be whether your pricing model matches the shape of the work.
Is the workload interactive or asynchronous? Can it be batched? Are your prompts structured for caching? What happens when multiple agents hit the same limits at once? Are you optimizing for model quality, or are you spending half your time firefighting an ops problem created by pricing?
That last question is the one I keep coming back to. Because once your team starts designing around quota errors, shared TPM caps, and cache misses, you are no longer just building agents.
You are running a token economy. And that, more than the $1.3 million screenshot, is the real story.
