I kept tracking AI agent pricing by model and missed the Slack channel that was burning the budget

Priya Sharma · May 26, 2026 · 9 min read

[Figure: AI Agent Spend Visibility. Model pricing looked flat; Slack was the budget leak. Spend by channel: Slack 72% (burn source). Spend by model: GPT-4.1 34%, Claude 33%, Gemini 33%. Cost by workflow: Support triage $4.8k, Internal Q&A $1.2k, Lead routing $0.7k.]

AI agent pricing gets misleading fast when you track only tokens or model spend. For real agent operations, the metric that matters is cost per workflow: per Slack channel, customer, automation run, or conversation. A single LiteLLM example even exposes a per-call cost of 2.85e-05, which is useful only if you also know what workflow that call belonged to.

I kept seeing the same bad advice pop up in cost threads.

“Watch your OpenAI dashboard.” “Split traffic by API key.” “Use a cheaper model.”

That all sounds reasonable right up until your agent is doing real work across Slack, Telegram, n8n, and OpenClaw.

Then your nice clean dashboard turns into a crime scene.

While researching AI agent pricing, I came across a thread on r/openclaw that captured the problem better than most vendor docs ever do. The original poster wanted cost per Slack channel or Telegram topic in SigNoz. Someone suggested separate API keys per channel.

And the poster immediately killed that idea:

“Problem is that I need to track cost per slack channel without adding new api key when I add bot to new channel”

That sentence is the whole story.

Not “how many tokens did we use?” Not even “which provider is expensive?”

The real question was: which unit of work is causing spend?

That’s a very different problem. And once you see it, you can’t unsee it.

The dashboard was telling the truth and still lying to me

Provider dashboards are not useless. I still want OpenAI usage charts. I still want to know if Claude Opus 4.6 suddenly spikes or if GPT-5 starts eating twice the budget on a bad deploy.

But provider dashboards answer procurement questions. They do not answer operator questions.

If one OpenClaw bot serves 40 Slack channels, 12 Telegram topics, and three customer-facing automations in n8n, then “Claude cost this week” is trivia. Interesting trivia, sure. But still trivia.

What I actually need to know is:

Which Slack channel is generating the most expensive interactions?
Which customer account is causing repeated long-running tool loops?
Which n8n workflow fans out into five model calls instead of one?
Which automation path got slower and more expensive after I added retrieval?

That’s the difference between accounting and operations.

And once agents become multi-step, operations wins.

Why per-model reporting breaks the moment agents get real

n8n’s docs make a useful distinction here: an LLM call is not the same thing as an AI agent run. Agents are multi-step, tool-using systems. They branch. They call tools. They retry. They hit different models for different sub-tasks.

That matters more than most teams realize.

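A workflow created in n8n before version 1.0 can even behave differently from workflows created on 1.0 and above, because the execution order of multi-branch workflows changed: older workflows run the first node of every branch, then the second node of every branch, while 1.0+ workflows finish one branch before starting the next. That sounds like a tiny implementation detail until you’re trying to explain why one support workflow suddenly costs 3x more than another.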

Because your “one request” was never one request.

It was:

classify the message
retrieve context
summarize the thread
generate a response
reformat for Slack
maybe retry because the first tool call failed

Now ask yourself: what does a model dashboard tell you about that run?

Usually, almost nothing.

It can tell you GPT-5 handled step 2 and Claude handled step 4. Great. But the business event wasn’t “step 4 used Claude.” The business event was “customer onboarding workflow failed twice and cost $1.18.”

That’s the number people actually act on.
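To make that concrete, here is a minimal Python sketch of rolling per-step costs up into one per-run number. The `StepCost` record, its field names, the model names, and the per-step dollar figures are all invented for illustration; they are not output from n8n or any provider.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StepCost:
    """One LLM call inside a workflow run (hypothetical record shape)."""
    run_id: str
    workflow: str
    stage: str
    model: str
    cost_usd: float


def cost_per_run(steps):
    """Sum step costs per (workflow, run_id), so retries count toward the run."""
    totals = defaultdict(float)
    for step in steps:
        totals[(step.workflow, step.run_id)] += step.cost_usd
    return {key: round(total, 6) for key, total in totals.items()}


# Illustrative numbers only: one run whose draft step ran twice after a failure.
steps = [
    StepCost("run-1", "customer_onboarding", "classify", "model-a", 0.02),
    StepCost("run-1", "customer_onboarding", "retrieve", "model-a", 0.11),
    StepCost("run-1", "customer_onboarding", "draft", "model-b", 0.50),
    StepCost("run-1", "customer_onboarding", "draft", "model-b", 0.50),  # retry
    StepCost("run-1", "customer_onboarding", "reformat", "model-a", 0.05),
]

run_totals = cost_per_run(steps)
print(run_totals)
```

Grouping by run rather than by model is the whole point: the retry shows up as part of the onboarding run's bill instead of vanishing into a per-model total.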

So what should you measure instead?

Here’s my opinionated answer: if you run agents in production, your primary cost metric should be cost per completed workflow.

Not cost per model. Not cost per million tokens. Not cost per provider account.

Then you break that down by the business dimensions that map to reality:

The dimensions that actually matter

Workflow ID: lead_enrichment_v3, support_triage, invoice_reconciliation
Channel: Slack #sales, Telegram topic returns, Discord #mod-help
Customer: account ID, workspace ID, tenant ID
Feature: summarization, routing, extraction, escalation
Stage: classify, retrieve, tool-call, draft, validate
User: internal operator, end customer, teammate, bot owner
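One way to carry these dimensions is a small context object that travels with every call and flattens into `key:value` tags. This is a sketch under my own assumptions: `CostContext`, `to_tags`, the field names, and the sample IDs are hypothetical and not part of any library.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CostContext:
    """Business dimensions attached to one LLM call (hypothetical shape)."""
    workflow_id: str
    channel: str
    customer_id: str
    stage: str
    feature: Optional[str] = None
    user_id: Optional[str] = None

    def to_tags(self):
        """Flatten the fields that are set into sorted 'key:value' strings."""
        return sorted(
            f"{key}:{value}"
            for key, value in asdict(self).items()
            if value is not None
        )


ctx = CostContext(
    workflow_id="support_triage",
    channel="slack:#sales",
    customer_id="acct_123",  # made-up account ID
    stage="classify",
)
tags = ctx.to_tags()
print(tags)
```

Keeping the required fields non-optional means a call site that forgets the workflow or channel fails at construction time, not three weeks later in a chart.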

This is where most LLM cost optimization efforts go sideways.

Teams obsess over whether to switch from Claude to Qwen or from GPT-5 to Llama 4, but they still can’t answer the basic question: which workflow is worth what it costs?

That’s why the model-switching conversations often feel frantic instead of strategic.

While reading another r/openclaw discussion, I saw one user say, “I’m spending too much on Claude tokens so i need to switch.” Fair. We’ve all been there.

But another commenter said something smarter: “I try to keep one main paid plan and only add others if there’s a clear reason, otherwise it turns into subscription soup fast.”

Exactly.

If your reporting unit is “Claude is expensive,” you end up with subscription chaos. If your reporting unit is “support escalation workflow costs $0.09 and closes tickets 22% faster,” now you can make adult decisions.

The good news is the tooling already exists

This is the part that surprised me.

The industry mostly knows how to solve this. The pattern is not mysterious. You attach metadata at request time and aggregate downstream.

LiteLLM, Langfuse, and Helicone are all pointing in the same direction.

| Option | What it actually gives you |
| --- | --- |
| OpenAI Usage/Costs API | Groups and filters by `project_ids`, `user_ids`, `api_key_ids`, and `models`; useful provider-native reporting, but weak for business-level attribution unless you partition traffic upstream |
| LiteLLM | Tracks spend by user and metadata tags, exposes `x-litellm-response-cost` and call IDs, and works as an OpenAI-compatible proxy across 100+ LLMs |
| Langfuse / Helicone | Supports attribution by user, tags, traces, conversations, features, and workflow stages; much better fit for cost per workflow, customer, or channel |

LiteLLM gets the idea exactly right

LiteLLM’s proxy docs show OpenAI-compatible requests where you pass both a user field and metadata tags.

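from openai import OpenAI

client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")  # LiteLLM proxy; placeholder key and URL

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "this is a test request, write a short poem"}],
    user="palantir",
    extra_body={"metadata": {"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"]}},
)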

That is not just logging fluff. That is the difference between “Llama 3 cost money” and “page classification job 214590dsff09fds cost money.”

LiteLLM also exposes response headers like this:

curl -i -sSL --location 'http://0.0.0.0:4000/chat/completions' --header 'Authorization: Bearer sk-1234' --header 'Content-Type: application/json' --data '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "what llm are you"}]}' | grep 'x-litellm'

The docs example from LiteLLM version 1.40.21 includes an x-litellm-response-cost header with a request cost of 2.85e-05. Tiny number. Very cool.

But again, it only becomes operationally useful when that call is tagged with the workflow, channel, or task that produced it.
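A sketch of that pairing in Python: read the `x-litellm-response-cost` header and store it next to the tags that went out with the request. The helper name `record_call_cost`, the ledger structure, and the channel tag are hypothetical; only the header name and the 2.85e-05 value come from the docs example, and the sketch assumes the header value parses as a plain float.

```python
def record_call_cost(headers, tags, ledger):
    """Append (tags, cost) to the ledger if the proxy returned a cost header."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("x-litellm-response-cost")
    if raw is None:
        return None  # no cost reported; leave the ledger unchanged
    cost = float(raw)
    ledger.append({"tags": list(tags), "cost_usd": cost})
    return cost


ledger = []
cost = record_call_cost(
    {"X-LiteLLM-Response-Cost": "2.85e-05"},
    ["channel:support-enterprise", "taskName:run_page_classification"],
    ledger,
)
print(cost, ledger)
```

If you call the proxy through the OpenAI Python SDK, the response headers should be reachable via `client.chat.completions.with_raw_response.create(...)`; check that against the SDK version you run.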

Helicone is even more explicit

Helicone barely hides the punchline. Its Custom Properties docs explicitly recommend tagging requests by project, feature, conversation, environment, or workflow stage.

"Helicone-Property-Conversation": "support_issue_2", "Helicone-Property-App": "mobile", "Helicone-Property-Environment": "production"

That is exactly how you answer questions like:

Why did mobile support conversations get expensive yesterday?
Why is production spending 4x more than staging?
Why does one conversation type trigger long agent loops?

Helicone is telling you, pretty directly, that the right unit is not “model.” It’s conversation, feature, stage, environment.
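Since the properties are just request headers with a `Helicone-Property-` prefix, building them from your own context is a few lines of Python. A minimal sketch; the helper name `helicone_property_headers` and its validation rule are my own assumptions, not part of Helicone's SDK.

```python
def helicone_property_headers(properties):
    """Turn {'Conversation': 'support_issue_2'} into Helicone-Property-* headers."""
    headers = {}
    for name, value in properties.items():
        # Assumed sanity check: reject empty names and names with whitespace.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid property name: {name!r}")
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers


headers = helicone_property_headers(
    {"Conversation": "support_issue_2", "App": "mobile", "Environment": "production"}
)
print(headers)
```

The resulting dict can be merged into whatever headers your client already sends, for example through the `extra_headers` argument of the OpenAI Python SDK.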

And Langfuse lands in the same place with user IDs, tags, traces, and Metrics API rollups.

What about OpenAI’s own usage APIs?

They’re better than people give them credit for.

OpenAI’s Completions Usage API and Costs API can group and filter by project_ids, user_ids, api_key_ids, and models. That is real progress over the old “here’s your total bill, good luck” era.

But there’s a structural limit here that nobody can code around.

OpenAI cannot infer your business unit of work if you never send it one.

If five workflows share one project and one API key, OpenAI can’t magically know which Slack channel caused the spike. That information never existed at the provider layer.

So yes, use provider-native reporting for anomaly detection and procurement. Absolutely.

Just stop pretending it solves workflow economics.

The sneaky part nobody budgets for

Most teams still think in input tokens and output tokens.

That’s already outdated.

Langfuse documents additional usage types like cached_tokens, audio_tokens, and image_tokens. LiteLLM also warns that cost discrepancies often come from token categories like cache usage or provider-specific pricing tiers.

This is where raw token charts become actively misleading.

Two workflows can show similar token totals and wildly different cost behavior because:

one hits cache heavily
one uses audio
one includes image processing
one routes to a different pricing tier
one retries three times because a downstream tool is flaky

That’s why “we used 8 million tokens” is often a vanity metric.

I’d rather know that the TikTok comment moderation workflow costs $0.004 per item while the Discord escalation workflow costs $0.19 per incident. At least then I know where to look.
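Here is the arithmetic behind that, as a small Python sketch. The prices are hypothetical round numbers chosen for easy math, not any provider's real rates, and `call_cost` is an illustrative helper.

```python
# Hypothetical prices in USD per million tokens; real provider rates differ.
PRICE_PER_MTOK = {"input": 2.00, "cached_input": 0.50, "output": 8.00}


def call_cost(input_tokens, cached_tokens, output_tokens):
    """Cost of one call; cached_tokens is the cached subset of input_tokens."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICE_PER_MTOK["input"]
        + cached_tokens * PRICE_PER_MTOK["cached_input"]
        + output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


# Workflow A: one call, 90% of the input served from cache.
tokens_a = 300_000 + 6_000
cost_a = call_cost(input_tokens=300_000, cached_tokens=270_000, output_tokens=6_000)

# Workflow B: three attempts of a smaller uncached call (two were retries).
tokens_b = 3 * (100_000 + 2_000)
cost_b = 3 * call_cost(input_tokens=100_000, cached_tokens=0, output_tokens=2_000)

print(f"A: {tokens_a} tokens, ${cost_a:.3f}")
print(f"B: {tokens_b} tokens, ${cost_b:.3f}")
```

Both workflows report the same 306,000 tokens, yet under these assumed prices B costs well over twice as much as A. A token chart would show two identical bars.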

What happens if you don’t pass metadata on every request?

Then everything falls apart.

This is the annoying part, and there’s no way around it.

Workflow-level attribution only works if engineers are disciplined about propagating metadata on every request. Workflow ID. User ID. Channel. Customer. Stage. Whatever matters for your business.

Miss it in one service and your charts rot from the inside.

So if I were setting this up tomorrow, I’d do it like this:

Define one primary unit: workflow run, conversation, or customer interaction.
Require metadata fields for every LLM call: workflow_id, customer_id, channel, stage, user_id.
Pass those fields through LiteLLM, Langfuse, Helicone, or your own middleware.
Roll up cost by workflow first, model second.
Use provider dashboards only as a secondary lens for anomaly detection.
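The steps above can be sketched in Python as a guard plus a two-level rollup. This assumes you collect one record per LLM call in your own middleware; `require_metadata`, `rollup`, the record shape, and the sample IDs are hypothetical, not an API of LiteLLM, Langfuse, or Helicone.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("workflow_id", "customer_id", "channel", "stage", "user_id")


def require_metadata(metadata):
    """Raise if any required attribution field is missing or empty."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError(f"LLM call rejected, missing metadata: {missing}")
    return metadata


def rollup(records):
    """Total cost by workflow first, then by model within each workflow."""
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        meta = require_metadata(rec["metadata"])
        totals[meta["workflow_id"]][rec["model"]] += rec["cost_usd"]
    return {workflow: dict(models) for workflow, models in totals.items()}


def _meta(workflow_id, stage):
    """Build sample metadata with made-up IDs for illustration."""
    return {"workflow_id": workflow_id, "customer_id": "acct_123",
            "channel": "slack:#support-enterprise", "stage": stage,
            "user_id": "u_42"}


records = [
    {"model": "model-a", "cost_usd": 0.25, "metadata": _meta("support_triage", "classify")},
    {"model": "model-b", "cost_usd": 0.50, "metadata": _meta("support_triage", "draft")},
    {"model": "model-a", "cost_usd": 0.125, "metadata": _meta("lead_routing", "classify")},
]
report = rollup(records)
print(report)
```

Failing loudly on missing fields is a deliberate choice in this sketch: an untagged call that raises in staging is cheaper than an "unknown" bucket that quietly grows in production.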

That ordering matters.

If you reverse it, you end up optimizing model spend while the expensive workflow keeps quietly detonating in #support-enterprise every afternoon.

The metric that finally made the whole thing click

Here’s the simplest way I can put it.

If you run one prompt, model-level cost is fine. If you run agents, cost per workflow is the real bill.

That’s the shift.

And once you start thinking that way, a lot of weird industry behavior suddenly makes sense. The endless provider switching. The panic over token bills. The growing interest in flat monthly plans ranging from $9 to $399 per month. The appeal isn’t just lower cost. It’s escaping the mental trap of treating every agent step like a separate financial event.

Because for the team actually operating the thing, the only question that matters is brutally simple:

Did this workflow create enough value to justify what it cost?

Everything else is just a prettier dashboard.

Frequently Asked Questions

How should I track AI agent costs if one bot serves many Slack channels?

Track cost by workflow or channel using request metadata, not just by API key or model. If one agent serves many Slack channels, attach channel IDs or workflow IDs to every LLM request so tools like LiteLLM, Langfuse, or Helicone can attribute spend correctly.

Are OpenAI usage dashboards enough for agent cost tracking?

They are useful, but not enough for serious agent operations. OpenAI can group usage by project IDs, user IDs, API key IDs, and models, but it still cannot infer your business workflow unless you partition traffic or send metadata upstream.

What is the best metric for AI agent pricing?

For production agents, the best metric is usually cost per completed workflow, conversation, or customer interaction. Agents are multi-step systems, so per-model token totals often hide the true cost of retries, branching, tool calls, and multi-model execution.

Can LiteLLM track spend by user or task?

Yes. LiteLLM supports spend tracking by user and metadata tags, and its docs show OpenAI-compatible requests with fields like `user` and tagged metadata such as job IDs or task names. It also exposes response headers like `x-litellm-response-cost` for per-call cost visibility.

Why do token totals fail for LLM cost optimization?

Token totals miss important cost drivers like cached tokens, audio tokens, image tokens, retries, and provider-specific pricing tiers. Two workflows can have similar token counts but very different real costs, which is why workflow-level attribution is more useful than raw model charts.