← Blog/Engineering

If your agent touches health data, do the boring part first

Sarah MitchellJune 11, 2026 · 9 min read

The first sleep workflow I’d actually trust is not an AI doctor. It’s a narrow pipeline that takes six months of Apple Watch sleep data, cleans timestamps, maps everything into a fixed diary schema, flags missing fields, and then stops so a human can review it before anything reaches a clinician.

I had that reaction almost immediately while reading a post from someone who said their AI assistant turned six months of Apple Watch sleep data into the diary their sleep clinic asked for, and that the data gotchas were brutal. That sentence landed because it describes the real problem better than most health-agent demos do.

That is the use case. Not “AI physician.” Not “autonomous sleep coach.” Just a painfully practical workflow that takes messy wearable exports and turns them into something a clinic can actually read. The reason it sounds trustworthy is exactly because it is so unglamorous.

Most health-agent demos skip the only part I actually care about. They show the polished summary, the calm interface, the model saying something fluent, and they act like the hard part is reasoning.

I don’t think that’s true. The hard part is ETL: extraction, transformation, loading. It’s the ugly middle where timestamps shift, overnight sleep crosses calendar days, naps don’t fit neatly, fields go missing, intervals overlap, and every clinic seems to want a slightly different format.

That ugly middle is the product. If you get that part wrong, the nice summary at the end is worse than useless because it feels authoritative while being built on bad structure.

So when I think about what I would actually trust, it’s not a medical agent. It’s a structured sleep-diary automation with very clear limits.

The workflow should do one job: convert raw export data into a reviewable artifact. No diagnosis, no treatment suggestions, no fake confidence, and definitely no language like “based on your patterns, you may have…” coming from a model that is really just smoothing over ambiguity.

The shape I want is simple:

Apple Health export -> parse sleep records -> normalize timestamps and timezone -> map to fixed diary schema -> flag missing fields -> human review -> clinic submission

That’s it. And honestly, that’s where agent routing becomes useful instead of theatrical. You do not want one giant freeform model touching every step. You want deterministic steps for parsing and validation, and maybe one LLM pass at the end for human-readable notes.

If I were wiring this in OpenClaw, n8n, Make, Zapier, or a custom Python service using an OpenAI-compatible endpoint, I’d break it into very boring pieces. One step extracts the Apple Health export. Another validates schema and rejects broken rows. Another normalizes timestamps and day boundaries. Then, only after the data is clean, an LLM can write plain-language diary notes from validated fields.

And then comes the most important step of the whole system: a real approval gate. A human has to review it before anything gets labeled clinic-ready.

That approval step is the moral center of the workflow. Without it, you’re not automating carefully. You’re bluffing with a nicer UI.

I think this lesson matters far beyond health data. The safest agent is usually not the one with the fanciest prompt stack. It’s the one with the clearest boundary between deterministic work and model work.

That applies whether you’re dragging nodes around in n8n, building a Make scenario, running OpenClaw agents, or stitching together Python jobs with retries and queues. Once a workflow starts failing, retrying, and getting kicked back by humans, architecture stops being a whiteboard discussion and becomes a billing and reliability problem.

That’s also why narrower systems often feel smarter in practice. A lot of people talk about agents as if broader capability automatically means better outcomes, but I keep seeing the opposite.

If the input is messy, the output will be messy too. In a health context, that ambiguity feels more dangerous because people are tempted to read fluency as competence.

So the version I like is aggressively narrow. Not “an agent that helps with sleep,” but a pipeline that says: I will produce a diary-shaped artifact from source-derived fields, I will clearly mark what was inferred or summarized by a model, and I will stop before pretending to practice medicine.

That sounds less ambitious. I think it’s more useful.

This is also why multi-agent health demos make me nervous. I’m not anti-agent, but I am very anti-chaos.

Parallel agents look clever right up until the workflow matters. For a side project, a Discord summarizer, or some Raspberry Pi experiment, fine. For something touching sleep-clinic paperwork, I want fewer moving parts, not more.

The same thing shows up in mainstream automation stacks all the time. Once you have retries, branches, approvals, and external APIs, every extra model call becomes another failure point to inspect. A lot of “agentic” automations in n8n, Make, Zapier, and custom Python backends should be much more boring than their builders want to admit.

The architecture I trust here is almost insultingly simple. One parser. One validator. One formatter. One optional model for text cleanup. One human reviewer.

If you need three agents debating whether a 2:07 AM sleep segment belongs to Tuesday or Wednesday, you already lost.

There’s also a cost angle that people weirdly avoid talking about. I like the boring version because it’s cheaper in the ways that actually matter.

Not cheap as in “the demo run looked affordable.” Cheap as in repeated parsing, retries, corrections, and formatting passes don’t quietly blow up your bill.

That distinction matters a lot for real automations. In production, the same record might get parsed, rejected, normalized, summarized, reviewed, corrected, and rerun. Human-in-the-loop design is safer, but it also means the workflow may hit a model multiple times before it’s done.

If you’re paying per token for every retry, every validation miss, and every regenerated summary, the system starts punishing you for being careful. That’s a bad incentive, especially for teams building workflows they actually need to trust.

This is exactly where Standard Compute’s model makes sense to me. If you’re building agent workflows in n8n, Make, Zapier, OpenClaw, or your own Python stack, flat-rate access changes how you design the system. You can afford to add validation passes, approval gates, retries, and cleaner routing without feeling like every safeguard is another meter running.

And because Standard Compute is a drop-in OpenAI API replacement, you don’t have to rebuild your whole stack to get there. You keep the same OpenAI-compatible SDKs and HTTP clients, while routing requests across models like GPT-5.4, Claude Opus 4.6, and Grok 4.20 behind the scenes.

That matters more than model leaderboard arguments, because production automations are rarely one-shot prompts. They are loops. They rerun. They branch. They fail in the middle. They get corrected by humans and then replayed.

In that kind of system, the winning architecture is usually deterministic preprocessing first, narrow LLM usage second. Parse and validate before the model sees anything. Compress the messy raw export into clean structure, then let the model do the one thing it’s actually good at: turning structured data into readable language.

If you don’t do that, you end up paying for confusion repeatedly.

Here’s the honest comparison I keep coming back to:

Chatbot-style health agent

Feels impressive in a demo
Flexible and conversational
Prone to overclaiming
Bad fit for rigid clinician-required formats

Structured sleep-diary automation

Schema-first and explicit about missing fields
Easy for a human to review
Less likely to invent authority it doesn’t have
Much better fit for actual clinic paperwork

Manual spreadsheet transcription

Slow and annoying
Error-prone over long time spans
Still sometimes easier to inspect line by line
Often chosen because people trust boring visible mistakes more than polished hidden ones

The surprising answer is that the middle option is the one I’d trust first. Not because it’s the most advanced, but because it has the fewest opportunities to lie.

If I were setting rules for a system like this, a few would be non-negotiable. First, source-derived fields and model-generated text must be clearly separated. If a sleep start time came from the export, label it that way. If a sentence like “sleep appeared fragmented” came from GPT-5 or Claude, label it as model-generated summary.

Second, broken rows should fail loudly. Not “best effort,” not “probably fine.” If a start or end time is missing, reject the row. If intervals overlap, flag them. If timezone handling changes the diary date, show that explicitly.

Third, human review needs to be real. Not a checkbox buried in a settings panel, but an actual gate where someone can inspect the diary, compare it with the underlying records, and correct obvious nonsense before anything is exported or shared.

Fourth, the workflow should admit uncertainty. I’m not going to pretend every clinic wants the same format, or that Apple Watch sleep estimates map cleanly onto every clinician’s expectations. Different clinics want different things, wearable data can be incomplete, and a generated diary can still be wrong even if the pipeline is well designed.

That is not an argument against automation. It’s an argument for humble automation.

And honestly, this pattern is bigger than sleep clinics. The Apple Watch example is just a good stress test because the errors are obvious and the stakes are real.

The same design rule applies to invoices, support tickets, compliance forms, CRM cleanup, and any workflow where bad structure upstream creates fake confidence downstream. First parse. Then validate. Then normalize. Then let GPT-5, Claude, or another model turn the cleaned result into readable text.

That’s the reusable pattern for production automations. Deterministic preprocessing before LLM summarization. Not because it sounds less magical, but because it fails in ways you can actually inspect.

That’s the twist I keep coming back to: the first health workflow I’d trust barely feels like an agent at all. It’s more like a disciplined conveyor belt with one carefully fenced-off language model at the end.

No diagnosis engine. No fake bedside manner. No synthetic certainty. Just a boring pipeline that takes six months of Apple Watch sleep data, survives the brutal gotchas, produces a structured diary, and hands it to a human before anyone calls it medical.

That may sound small, but I think it’s exactly the right size. And if you’re building anything for health, sleep, or clinician-facing workflows, that’s probably the lesson worth stealing: the more sensitive the task, the less your automation should improvise.

If your agent touches health data, do the boring part first

Keep reading

I think the best openai api alternative for customer email is way smaller than the “replace your staff” people admit

I looked into oauth openai for OpenClaw and the scary part isn’t what most people think