
I thought LLM tool calling would kill glue code and then my lights still wouldn’t turn on

Sarah Mitchell · May 14, 2026 · 13 min read

I love the demo version of LLM tool calling. You ask for one thing, the model picks the right function, and suddenly Shopify updates a cart or Home Assistant dims the lights. For about five minutes, it really does feel like we finally escaped bespoke integrations.

Then you try to make it work in a real stack, and the fantasy falls apart in the most boring ways possible. Not because GPT-5 or Claude can’t understand the intent, but because auth breaks, proxies get weird, files don’t move cleanly, and permissions are still clumsy.

That’s the part I keep coming back to: MCP is real progress, but it does not erase glue code. It mostly standardizes the message format. The painful parts that actually wreck projects are still yours.

And once you see that, a lot of the current agent hype starts looking less like magic and more like operations with better branding.

The demo works because it cuts out the annoying parts

Every launch video is basically the same story. A model sees a button-shaped problem, chooses the right tool, calls it once, and everyone acts like we’ve entered the post-integration era.

I get why that’s seductive. If you’ve ever spent weeks wiring up APIs in n8n, Make, Zapier, OpenClaw, or some custom agent stack, one clean tool call feels like a miracle.

But the miracle only exists inside the carefully controlled demo. In production, your browser MCP server needs a login, your hosted environment needs a token, your agent has to pass a file to another service, and suddenly you’re not building an AI workflow anymore. You’re doing agent ops.

That shift matters. It means the problem was never just “how do I let the model call a tool?” The real problem is “how do I make a chain of semi-compatible systems behave reliably at 2 a.m. when one token expires and another service changes a schema?”

MCP fixed something important, just not the thing people hoped

I’m not anti-MCP at all. The Model Context Protocol is a good idea, and honestly a necessary one.

It gives tools a shared language. It uses JSON-RPC 2.0, and now it defines transports like stdio and Streamable HTTP. That alone removes a bunch of one-off integration nonsense that wasted everyone’s time in 2024.

But here’s the catch that gets lost in all the excitement: MCP standardizes the protocol, not the operational glue. That sounds abstract until you actually deploy something.

If you run MCP over stdio, you’re usually spawning a local subprocess and sending messages over stdin and stdout. The spec is pretty clear that stdio implementations should not use the HTTP authorization flow, which means credentials usually come from the environment.
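If you haven’t run one of these, the mechanics are less exotic than they sound. Here’s a stripped-down sketch of what an stdio MCP client boils down to, using only the Python standard library; the my-mcp-server binary and MY_SERVICE_TOKEN variable are placeholders, and a real client would use an SDK and negotiate capabilities properly:

import json, os, subprocess

# Spawn the MCP server as a local child process. Credentials ride along in its
# environment rather than going through any HTTP auth flow.
proc = subprocess.Popen(
    ["my-mcp-server"],  # placeholder binary
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    env={**os.environ, "MY_SERVICE_TOKEN": os.environ.get("MY_SERVICE_TOKEN", "")},
    text=True,
)

# One JSON-RPC 2.0 message per line over stdin and stdout.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-03-26",  # one of the published spec revisions
        "capabilities": {},
        "clientInfo": {"name": "demo-client", "version": "0.1"},
    },
}
proc.stdin.write(json.dumps(initialize) + "\n")
proc.stdin.flush()

# The protocol part really is this small. Everything it doesn't cover is still yours:
# who put the token in the environment, who restarts this process when it dies, why
# it only works on one laptop.
print(proc.stdout.readline())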

That sounds simple right up until you have to answer the annoying questions. How do secrets get onto the machine? Who can read them? What restarts the subprocess when it dies? Why does it work on one developer laptop and fail in CI?

None of that is a protocol problem. It’s just infrastructure work wearing a new hat.

Remote MCP over Streamable HTTP feels cleaner and more adult. The spec supports HTTP transports, authorization for them is defined along OAuth lines, and OpenAI made this feel especially slick when it added remote MCP server support to the Responses API on May 21, 2025.

The sample code is so clean it almost feels like trolling:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
  model="gpt-4.1",
  tools=[{
    "type": "mcp",
    "server_label": "shopify",
    "server_url": "https://pitchskin.com/api/mcp",
  }],
  input="Add the Blemish Toner Pads to my cart"
)

That is a genuinely nice API. I’m not even being sarcastic.

But the second you move past the snippet, the old pain just reappears in a different shape. Is the endpoint reachable? Is auth configured correctly? Are you validating origin tightly enough? Did the tool schema change? What happens when the remote server is down? Who rotates the tokens?

The protocol got simpler. The glue just moved.

Home Assistant is where the clean story runs into reality

Home Assistant is probably the best example of this because the docs are ambitious and honest at the same time. It can now act as both an MCP server and an MCP client, which is exactly the sort of thing people have wanted.

As a server, Home Assistant can expose your home context and actions so Claude Desktop or another MCP client can interact with lights, switches, shopping lists, and automations. As a client, it can connect to external MCP servers and pull in things like memory or web search.

That is genuinely cool. It’s also exactly where the hidden tax shows up.

Home Assistant’s MCP Server integration exposes an endpoint at /api/mcp using Streamable HTTP. Great. Except the docs also say many MCP clients still only support stdio, so you may need a gateway like mcp-proxy.

And on the other side, Home Assistant’s MCP client docs say that if a server only supports stdio, you may need a proxy to expose it over SSE or HTTP. So even with a standard, you often still need a bridge depending on which side of the connection is lagging behind.

That is not a knock on Home Assistant. If anything, I appreciate that the docs say the quiet part out loud.

The ecosystem is moving fast, but support is fragmented. Fragmentation is where brittle glue code breeds.

And this isn’t some tiny hobbyist corner case. Home Assistant’s 2025.5 release notes reported 2,000,000 active installations worldwide. This is mainstream home automation colliding head-on with the current state of LLM infrastructure.

The sentence that explains the whole problem

While researching this, I found a thread on r/openclaw about using MCP server tools for Home Assistant. One line from that thread stuck with me because it captures the entire problem better than any whitepaper could.

The user wrote: “I got my open claw agent responding to me and running but when I gave the agent my mcp server long lived access token ... I’m just struggling to get the agent to be able to turn my lights on and off or to write basic automations for me”.

That’s it. That’s the whole story.

The model is not the hard part anymore. GPT-5, Claude, Qwen, Llama — they can all understand that “turn off the kitchen lights” is an action.

The hard part is whether that action is exposed, authorized, shaped correctly, reachable, and safe. That’s where projects live or die.

“Just give it a token” is not a serious security model

This is where the cheerful demo tone starts to feel a little irresponsible. A lot of current agent setups still quietly rely on the user handing over a long-lived token and hoping for the best.

Home Assistant’s own auth docs are a useful reality check here. Non-owner accounts currently have the same access as the owner account, with more restrictions planned later. The MCP Server docs add entity exposure controls, which definitely helps, but the underlying issue remains: permissions are still blunt.

That matters a lot more when the agent can do things in the real world. Turning on a lamp is one thing. Opening a garage door, placing a grocery order, disabling an alarm, or triggering an automation that hits Stripe, Twilio, or Shopify is a very different category of risk.

Home Assistant also notes that refresh tokens which haven’t been used to log in for 90 days are automatically removed, and it recommends long-lived access tokens for permanent script access. That’s practical advice. It’s also a reminder that once you leave the demo, you’re not just writing prompts anymore.

You’re managing token lifecycles, blast radius, and failure modes.

Then the file handoff problem sneaks in

Even if you get actions working, artifacts show up and ruin your week. A generated CSV has to move from one agent to another. A screenshot from a browser MCP server has to get attached to a bug report. A Claude Code session in a hosted environment has to pass a file to n8n, Discord, or GitHub without someone saying, “just paste it somewhere.”

I found another r/openclaw discussion where someone summed this up perfectly: “The options I kept running into: S3 presigned URLs — works but 15 minutes of setup for every new project ... ‘Just commit it to git’ — please no”.

That line made me laugh because it’s painfully accurate. This is spreadsheet hell again, except now the spreadsheet is hidden inside your agent architecture.

Every team ends up rebuilding the same invisible columns. Where do files live? How long do they live there? Who can fetch them? What format are they in? How does the next agent know they exist?

None of this is glamorous. All of it determines whether the system feels usable.

If you’ve ever manually installed an n8n community node inside a Docker container, you already know the emotional texture of this problem:

# Open a shell inside the running n8n container
docker exec -it n8n sh

# Create the custom nodes folder if it doesn't exist yet, then install the node
mkdir -p ~/.n8n/nodes
cd ~/.n8n/nodes
npm i n8n-nodes-nodeName

Nothing there is impossible. It’s just one more tiny maintenance choice that somehow becomes permanent.

Which setup is actually easiest to live with?

My blunt opinion: the best architecture is usually the one with the fewest moving connectors, not the one with the prettiest protocol diagram. A lot of teams are still optimizing for theoretical flexibility when they should be optimizing for fewer surfaces that can fail.

Here’s how I think about the tradeoffs.

MCP stdio

  • Local subprocess over stdin and stdout
  • Credentials usually come from environment variables instead of HTTP auth
  • You own process management, local secrets, restarts, and machine-specific weirdness
  • Great when locality is the point, less great when it becomes the accidental default

MCP Streamable HTTP / remote MCP

  • HTTP GET and POST with optional SSE streaming (a rough client-side sketch follows this list)
  • OAuth-style authorization is supported by the spec for HTTP transports
  • You own origin validation, token handling, auth discovery, remote uptime, and schema drift
  • Cleaner for distributed systems, but not magically simpler once real traffic shows up
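To make that middle option concrete, here’s roughly what a single client-side call to a remote MCP server looks like over Streamable HTTP. This is a sketch, not a full client: it assumes the requests package, a placeholder endpoint, a token already sitting in MCP_ACCESS_TOKEN, and a plain JSON response rather than an SSE stream; a real client would initialize the session first.

import os
import requests

MCP_URL = "https://example.com/api/mcp"  # placeholder endpoint

headers = {
    "Authorization": f"Bearer {os.environ['MCP_ACCESS_TOKEN']}",  # you own where this comes from and when it rotates
    "Accept": "application/json, text/event-stream",  # the server may answer with JSON or an SSE stream
    "Content-Type": "application/json",
}

# One JSON-RPC 2.0 message per POST.
resp = requests.post(MCP_URL, headers=headers, json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
resp.raise_for_status()

# Everything around this call is the glue: retries, timeouts, token refresh, origin
# validation on the server side, and noticing when the tool schema quietly drifts.
print(resp.json())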

Direct platform-native tools

  • Vendor-managed API surface
  • Auth is usually centralized in one provider account
  • Less connector glue and often fewer moving parts
  • More lock-in, fewer cross-platform guarantees, and less portability if you switch stacks later

This is why I think so many teams are making the same mistake. They keep adding more tools when they should be reducing the surface area.

Cloudflare’s MCP guidance is refreshingly grounded on this point. It recommends remote MCP over Streamable HTTP plus OAuth, but the more important advice is the boring advice: use scoped permissions, expose fewer well-designed tools, and run evals after updates.

That is infrastructure-vendor language for “stop shipping a junk drawer.”
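What a well-designed tool looks like in practice is mostly schema hygiene. Here’s a sketch of the shape I mean, written as the Python dict an MCP server would advertise; the field names follow the MCP tool listing format, and the tool itself is made up:

SET_LIGHT_TOOL = {
    "name": "set_light_state",
    "description": (
        "Turn a single light on or off, optionally setting brightness. "
        "Use only for lights, never for switches, locks, or covers."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "entity_id": {
                "type": "string",
                "description": "Home Assistant entity id, e.g. light.kitchen",
            },
            "state": {"type": "string", "enum": ["on", "off"]},
            "brightness_pct": {
                "type": "integer",
                "minimum": 1,
                "maximum": 100,
                "description": "Brightness percentage; omit to keep the current level",
            },
        },
        "required": ["entity_id", "state"],
    },
}

One narrow verb, explicit units, and a description that tells the model what not to use it for. That beats a pile of vaguely overlapping actions every time.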

My least favorite failure mode is when it works for months

The dramatic failures are easy to notice. The scary ones are the systems that seem fine until one day they aren’t.

In another r/openclaw post, someone described letting their grocery agent run for months before it ordered 2 kg of garlic instead of 2 heads. That story is funny because garlic is funny. It also makes the underlying point better than a security talk ever could.

Long stretches of apparent reliability can hide brittle assumptions. Units are underspecified. Schemas drift. Browser sessions expire. Proxies drop headers. File links expire sooner than expected. One model interprets a field differently than another.

This is also why “it worked in Claude Desktop” is such a misleading milestone. The moment you move that same flow into OpenClaw, n8n, Make, Zapier, or a hosted agent environment, all the differences in transport support, auth paths, and timeout behavior come rushing back.

The connector stack becomes the product whether you wanted it to or not.

So what should teams standardize first?

Not prompts. Not even models.

The glue.

The teams that will actually win with LLM tool calling are not the ones with the most MCP servers. They’re the ones that decide once how actions, files, auth, and observability work, and then force every new agent to use the same boring path.

If I were setting up a stack today for Home Assistant, OpenClaw, Claude Code, n8n, and a couple of remote MCP services, I’d lock down these decisions before adding one more capability.

First, pick one preferred transport. I’d choose remote MCP over Streamable HTTP where possible, and only use stdio when locality is the actual reason, not because it happened to be easier in one tutorial.

Second, pick one auth pattern. Prefer scoped OAuth-style flows for remote services and treat long-lived access tokens like hazardous materials, because that’s what they are.

Third, pick one file handoff method. Signed object storage URLs with fixed TTLs and explicit metadata beat ad hoc Git commits and mystery temp folders every time.

Fourth, enforce one tool design rule. Fewer tools, clearer schemas, and stronger descriptions are better than a giant pile of vaguely overlapping actions.

Fifth, keep one eval harness and actually use it. Re-run tests after every schema tweak, auth change, model swap, or proxy update. If the flow touches money, devices, or customer data, test the ugly cases on purpose.
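For point three, the boring path is a small helper every agent shares instead of each one inventing its own handoff. A minimal sketch with boto3, assuming a placeholder agent-artifacts bucket and a 15-minute TTL:

import boto3

s3 = boto3.client("s3")

def publish_artifact(local_path: str, key: str, ttl_seconds: int = 900) -> str:
    """Upload a file once, hand back a time-limited URL the next agent can fetch."""
    bucket = "agent-artifacts"  # placeholder bucket name
    # Content type travels with the object so the consumer doesn't have to guess
    # what it's downloading.
    s3.upload_file(local_path, bucket, key, ExtraArgs={"ContentType": "text/csv"})
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,  # fixed TTL: the link dying on schedule is a feature, not a bug
    )

Fifteen minutes of setup once, instead of fifteen minutes per project.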

The real cost is not just complexity. It’s usage-based billing on top of complexity

This is the part people weirdly don’t talk about enough. When your stack is already fragile, per-token pricing makes the whole thing worse.

Because now every retry, every eval run, every schema adjustment, every “why did the agent do that” debugging session also comes with a meter running in the background. You’re paying to discover where your glue is broken.

OpenAI said that since releasing the Responses API in March 2025, hundreds of thousands of developers had used it to process trillions of tokens across its models. That tells me the appetite for tool-enabled agents is absolutely real.

But it also means a lot of teams are learning the same expensive lesson at scale: API capability does not remove operational fragility. It just gives you more ways to hit it.

That’s a big reason I think predictable pricing matters so much for agent teams. If you’re running automations in n8n, Make, Zapier, OpenClaw, or your own custom workflows, the economics should not punish experimentation, evals, or 24/7 runtime.

This is exactly why Standard Compute is interesting to me. It gives you unlimited AI compute for a flat monthly price, works as a drop-in OpenAI-compatible API, and routes across models like GPT-5.4, Claude Opus 4.6, and Grok 4.20 without turning every agent workflow into a billing anxiety exercise.
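For what it’s worth, “drop-in” here means the usual OpenAI-compatible pattern: same client library, different base URL. A sketch with a placeholder endpoint and key variable (check the actual docs for the real values):

import os
from openai import OpenAI

# Placeholder base URL and environment variable; the point is that existing
# OpenAI-client code keeps working once you change where it points.
client = OpenAI(
    base_url="https://api.standardcompute.example/v1",
    api_key=os.environ["STANDARD_COMPUTE_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-5.4",  # one of the routed models mentioned above
    messages=[{"role": "user", "content": "Draft an automation that turns the kitchen lights off at 11pm."}],
)
print(response.choices[0].message.content)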

That doesn’t solve auth, file transport, or permissions for you. Nothing does. But it does remove one of the dumbest constraints in the stack: the feeling that every test, retry, or long-running automation is quietly generating a surprise bill.

For teams building real agents, that matters more than people admit.

The Reddit post I couldn’t stop thinking about

I kept coming back to one more story while writing this. A user said they spent 3.5 months, 1,300 hours, nearly 5 billion tokens, and about $700 before pausing a fragile setup.

Those numbers are extreme, sure. But the emotional arc is not extreme at all.

Anyone who has tried to make six half-compatible agent components behave like one product knows that feeling. You start out believing the standard solved the hard part. Then weeks later you realize the standard mostly solved the visible part.

MCP is not the problem. It’s the beginning of the solution.

The mistake is thinking that a standard means you no longer have to own the boring stuff. You still do. Auth, file transport, permissions, proxies, evals, retries, and uptime are not side quests. They are the product.

Once a team accepts that, things get better fast. Tool calling stops feeling like a magic trick and starts feeling like engineering.

And honestly, that’s the point where these systems become useful.
