← Blog/Engineering

I kept tracking AI agent pricing by model and missed the Slack channel that was burning the budget

James OlsenMay 26, 2026 · 9 min read

I used to think I was being responsible about AI costs. I checked provider dashboards, watched token usage, compared GPT-5 against Claude Opus 4.6, and told myself I had a handle on what our agents were spending.

Then I realized I was measuring the wrong thing.

The moment an agent starts doing real work across Slack, Telegram, n8n, and OpenClaw, model-level pricing stops being the main story. The real story is cost per workflow: per Slack channel, per customer, per automation run, per conversation. If you can’t see that, your dashboard can be technically accurate and still completely fail you.

That clicked for me while reading a thread on r/openclaw about tracking cost per Slack channel or Telegram topic in SigNoz. Someone suggested using separate API keys per channel, which is exactly the kind of advice that sounds smart until you imagine maintaining it in production.

The original poster shut it down immediately: “Problem is that I need to track cost per slack channel without adding new api key when I add bot to new channel.” That one sentence explains the whole problem better than a lot of vendor documentation.

The question wasn’t really about tokens. It wasn’t even about whether OpenAI or Anthropic was more expensive that week. The real question was: which unit of work is causing spend?

That’s a much more useful question, and once I started looking at agent pricing that way, a lot of standard advice started to feel weirdly shallow.

People love saying things like “watch your OpenAI dashboard,” “split traffic by API key,” or “just use a cheaper model.” None of that is wrong, exactly. It’s just incomplete in the way that matters most once agents stop being toy demos.

If one OpenClaw bot serves 40 Slack channels, 12 Telegram topics, and a handful of customer-facing n8n automations, “Claude cost this much this week” is interesting but not actionable. I don’t really care which provider was expensive in the abstract. I care which workflow got out of control.

I want to know whether #support-enterprise is generating giant context windows every afternoon. I want to know whether one customer account is triggering repeated tool loops. I want to know whether lead_enrichment_v3 now makes five model calls where it used to make one.

That’s the difference between accounting and operations. Accounting tells you what happened. Operations tells you what to fix.

This gets more obvious the minute you look at how agent systems actually run. n8n’s docs make a useful distinction here: an LLM call is not the same thing as an AI agent run. Agents branch, call tools, retry, retrieve context, and sometimes route different subtasks to different models.

That sounds obvious when you say it out loud, but a lot of cost reporting still acts like a workflow is one request and one response. It isn’t. A single “reply to this customer” action might classify the message, retrieve documentation, summarize a thread, generate a draft, reformat it for Slack, and retry because a tool call failed halfway through.

So when a dashboard tells me GPT-5 handled one step and Claude Opus 4.6 handled another, that’s not useless information. It’s just not the level I need to make a decision. The business event wasn’t “Claude handled step 4.” The business event was “the onboarding workflow failed twice and cost $1.18.”

That’s the number people actually act on.

My opinionated take is simple: if you run agents in production, your primary cost metric should be cost per completed workflow. Not cost per model. Not cost per million tokens. Not cost per provider account.

Then you break that number down by the dimensions that map to reality. Workflow ID. Customer ID. Channel. Feature. Stage. User. Those are the dimensions that tell you whether a workflow is worth what it costs.

This is where a lot of LLM cost optimization conversations go off the rails. Teams spend days debating whether to switch from Claude to Qwen or from GPT-5 to Llama 4, but they still can’t answer the most basic question: which workflow is delivering value, and which one is just eating budget?

I saw a version of this in another r/openclaw thread where one user said they were spending too much on Claude tokens and needed to switch. Totally relatable. Everyone who has run agents long enough has had that moment.

But another commenter said something much smarter: they try to keep one main paid plan and only add others if there’s a clear reason, otherwise it turns into subscription soup fast. That’s exactly right.

If your reporting unit is “Claude is expensive,” you end up making frantic provider decisions. If your reporting unit is “support escalation costs $0.09 and closes tickets 22% faster,” now you can make adult decisions.

The good news is that the tooling already points in the right direction. The pattern is not mysterious. You attach metadata at request time and aggregate downstream.

LiteLLM, Langfuse, and Helicone all push you toward the same conclusion. OpenAI’s own usage APIs are improving too, but they still depend on what you send them.

Here’s the practical breakdown:

OpenAI Usage and Costs API

Groups and filters by things like project_ids, user_ids, api_key_ids, and models
Good for provider-native reporting and anomaly detection
Weak for business-level attribution unless you partition traffic upstream

LiteLLM

Tracks spend by user and metadata tags
Exposes headers like x-litellm-response-cost and request IDs
Works as an OpenAI-compatible proxy across a huge range of models
Useful when you want cost data attached to real workflow metadata instead of just model names

Langfuse and Helicone

Better fit for attribution by user, tags, traces, conversations, features, and workflow stages
Much closer to the way operators actually think about agent systems
Stronger choice if your goal is cost per workflow, customer, or channel

LiteLLM is a good example because it gets the core idea exactly right. Its docs show OpenAI-compatible requests where you pass a user field plus metadata tags. That’s not just nice-to-have logging. That’s the difference between “Llama 3 cost money” and “job 214590dsff09fds for page classification cost money.”

It also exposes response headers with request cost. One docs example from LiteLLM version 1.40.21 includes an x-litellm-response-cost value of 2.85e-05. That’s neat, but by itself it’s almost too small to mean anything.

A number like 2.85e-05 only becomes useful when you know what generated it. Was it a classification step in a profitable workflow? Was it one of 400 tiny calls inside a runaway Slack bot loop? Without metadata, it’s just a decimal pretending to be insight.

Helicone is even more direct about this. Its custom properties docs explicitly recommend tagging requests by project, feature, conversation, environment, or workflow stage. That’s basically Helicone telling you, in plain language, that the right unit of analysis is not “model.” It’s conversation, feature, stage, and environment.

That’s how you answer questions that actually matter. Why did mobile support conversations get expensive yesterday? Why is production spending four times more than staging? Why does one conversation type trigger long agent loops while another one stays cheap?

Langfuse lands in the same place with user IDs, tags, traces, and metrics rollups. Different product, same lesson.

OpenAI’s own usage APIs deserve more credit than they usually get. Being able to group and filter by project_ids, user_ids, api_key_ids, and models is a lot better than the old era of “here’s your total bill, good luck.”

But there’s a hard limit nobody can code around: OpenAI cannot infer your business unit of work if you never send it one. If five workflows share one project and one API key, OpenAI has no way to know which Slack channel caused the spike. That information never existed at the provider layer.

So yes, use provider-native reporting. I would. It’s good for anomaly detection, procurement, and catching obvious spikes.

Just don’t pretend it solves workflow economics.

Another thing that sneaks up on teams is that token counts are getting less useful as a standalone metric. Langfuse documents additional usage types like cached tokens, audio tokens, and image tokens. LiteLLM also points out that cost discrepancies often come from token categories like cache usage or provider-specific pricing tiers.

That means two workflows can show similar token totals and still behave completely differently on cost. One might hit cache heavily. Another might process audio. Another might involve image inputs. Another might retry three times because a downstream tool is flaky.

This is why “we used 8 million tokens” increasingly feels like a vanity metric. I’d rather know that the TikTok moderation workflow costs $0.004 per item while the Discord escalation workflow costs $0.19 per incident. At least then I know where to look.

The annoying part is that workflow-level attribution only works if engineers are disciplined about metadata on every single request. Workflow ID, user ID, channel, customer, stage — whatever matters for your business has to travel with the call.

Miss it in one service and your charts start rotting from the inside. Everything still looks measurable, but the numbers stop lining up with reality.

If I were setting this up from scratch tomorrow, I’d keep it boring and strict. First, define one primary unit: workflow run, conversation, or customer interaction. Then require metadata fields like workflow_id, customer_id, channel, stage, and user_id on every LLM call.

After that, pass those fields through LiteLLM, Langfuse, Helicone, or your own middleware. Roll up cost by workflow first and model second. Keep provider dashboards as a secondary lens, not the main one.

That ordering matters more than people think. If you reverse it, you end up optimizing model spend while the expensive workflow keeps quietly exploding in #support-enterprise every afternoon.

This shift also changes how people think about pricing models in general. If you run one prompt, model-level cost is usually enough. If you run agents, cost per workflow is the real bill.

That’s part of why predictable pricing is getting more attractive for teams running automations all day. The appeal of flat monthly plans isn’t just that they can be cheaper. It’s that they remove the mental overhead of treating every agent step like a separate financial event.

That’s the part Standard Compute gets right. If you’re building AI agents and automations with an OpenAI-compatible stack, the real pain usually isn’t just raw model cost. It’s the constant background stress of per-token billing while workflows fan out across retries, tools, and multiple models.

Standard Compute’s flat monthly pricing changes that equation. Instead of watching every token and wondering whether a Slack bot, n8n workflow, or OpenClaw agent is quietly detonating your budget, you get predictable cost and can focus on whether the workflow is actually useful.

For teams running agents 24/7, that matters more than another dashboard chart.

The simplest version of my argument is this: once agents are involved, the unit that matters is the workflow, not the model. Everything else is just a prettier way to miss the thing that’s actually burning your budget.

I kept tracking AI agent pricing by model and missed the Slack channel that was burning the budget

Keep reading

I thought a family calendar bot should run everything until I realized AI is way better at intake than decisions

I stopped letting my AI agent do the final click, and my automations got way more useful