AI in life-critical systems: what you can't afford to vibe-code

Engineering notes · Black Iris

The current AI moment is dominated by two genres of demo: the convincing chatbot, and the LLM "agent" that confidently does something it shouldn't. Both are fine in a sandbox. Neither survives contact with a system where a wrong answer dispatches an ambulance to the wrong address.

We've shipped AI features into production for an emergency medical dispatch platform under SOC 2-aligned controls and HIPAA-compliant data handling. The lessons aren't anti-AI — quite the opposite. They're about what changes when the cost of being wrong stops being theoretical.

The unspoken rule: AI proposes, humans dispose

In any system where errors cascade into real-world harm, the most useful framing for AI features is: this is a suggestion engine, not an action engine. LLMs are extraordinary at parsing unstructured text, summarizing context, and ranking candidates. They are unreliable at being the final word on anything consequential.

Our AI-assisted dispatch reads an inbound email, extracts a probable location and incident type, and pre-fills a dispatch form. A human confirms before any responder is notified. This isn't a paranoid design choice — it's the only design choice. Removing the human is a separate, multi-year evaluation problem most teams haven't even framed correctly yet.
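One way to make "AI proposes, humans dispose" hard to bypass is to encode it in the types. The sketch below is illustrative only and is not our production code: every name in it (Suggestion, ConfirmedDispatch, confirm, notify_responders) is hypothetical. The idea is that the function that reaches responders accepts only an object that can be created by a human confirmation step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Suggestion:
    """What the model proposes. Nothing here can notify a responder."""
    location: str
    incident_type: str


@dataclass(frozen=True)
class ConfirmedDispatch:
    """Exists only once an identified human has reviewed the form."""
    location: str
    incident_type: str
    confirmed_by: str


def confirm(suggestion: Suggestion, operator_id: str,
            location: Optional[str] = None,
            incident_type: Optional[str] = None) -> ConfirmedDispatch:
    """The operator accepts the pre-filled values or overrides them."""
    if not operator_id:
        raise ValueError("a dispatch must be confirmed by an identified operator")
    return ConfirmedDispatch(
        location=location or suggestion.location,
        incident_type=incident_type or suggestion.incident_type,
        confirmed_by=operator_id,
    )


def notify_responders(dispatch: ConfirmedDispatch) -> str:
    """Accepts only a ConfirmedDispatch, so a raw Suggestion cannot reach it."""
    if not isinstance(dispatch, ConfirmedDispatch):
        raise TypeError("refusing to dispatch without human confirmation")
    return (f"dispatch to {dispatch.location} ({dispatch.incident_type}), "
            f"confirmed by {dispatch.confirmed_by}")
```

The runtime isinstance check is a backstop for untyped callers. With a static type checker in the build, passing a Suggestion to notify_responders would also fail before the code ships.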

Eval harnesses are the actual product

You can't ship an AI feature into a life-critical system without a regression test suite that exercises the LLM the way you would any other dependency. We maintain a golden set of representative inputs — real anonymized incident emails, edge cases, adversarial inputs, multilingual variants — and we run them on every change to the prompt, the model, the retrieval layer, or the surrounding code.

The metrics that matter are not "accuracy." They are:

  • Recall on must-not-miss cases. If the system fails to extract a location from a clear emergency, that's a P0 regression.
  • False-confidence rate. When the model is wrong, does it say so, or does it return a plausible-looking lie? The second is far more dangerous than the first.
  • Drift across model upgrades. A new model version that scores 1% higher overall but breaks an entire category of edge cases is a worse release, not a better one.

If you don't have this harness, you don't have an AI feature. You have a liability with good UX.
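The three metrics above can be sketched in a few functions. This is a minimal illustration, assuming each golden-set case carries a category label, a must-not-miss flag, an expected location, and the model's prediction, with None standing for "the model abstained". The field names, the category labels, and the exact-match comparison are all assumptions made for the example. A real harness would likely need fuzzier location matching.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseResult:
    category: str             # hypothetical label, e.g. "multilingual"
    must_not_miss: bool       # a clear emergency where extraction must succeed
    expected: Optional[str]   # gold location; None if nothing should be extracted
    predicted: Optional[str]  # model output; None means the model abstained


def must_not_miss_recall(results):
    """Share of must-not-miss cases where the prediction matches the gold answer."""
    critical = [r for r in results if r.must_not_miss]
    if not critical:
        return 1.0
    return sum(r.predicted == r.expected for r in critical) / len(critical)


def false_confidence_rate(results):
    """Of the cases the model got wrong, the share where it answered instead of abstaining."""
    wrong = [r for r in results if r.predicted != r.expected]
    if not wrong:
        return 0.0
    return sum(r.predicted is not None for r in wrong) / len(wrong)


def accuracy_by_category(results):
    """Exact-match accuracy per category label."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.category] += 1
        correct[r.category] += int(r.predicted == r.expected)
    return {c: correct[c] / totals[c] for c in totals}


def regressed_categories(baseline, candidate, tolerance=0.0):
    """Categories where the candidate scores below the baseline by more than tolerance."""
    base = accuracy_by_category(baseline)
    cand = accuracy_by_category(candidate)
    return sorted(c for c in base if cand.get(c, 0.0) < base[c] - tolerance)
```

A release gate built on this would fail the build when must_not_miss_recall drops or regressed_categories returns anything, regardless of what the overall score does.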

The audit trail problem nobody talks about

Compliance frameworks like SOC 2 expect auditable, repeatable behavior. LLMs, as deployed, are not deterministic: temperature, sampling, model updates, prompt drift, and even silent provider-side changes all move the goalposts.

You can't make this go away, but you can contain it. Log the exact model version, prompt template, retrieval context, and raw response on every invocation. Treat the model output as input to a deterministic downstream system whose behavior is auditable. The chain of custody for any action a user takes should reconstruct to "human saw X, clicked Y" — not "the AI did it."
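As a rough sketch of what such a log entry might look like, the example below builds one record per invocation and a second record that ties a human action to the exact model output the operator was shown. The field names and helper functions are hypothetical, not a description of our schema. The SHA-256 digest is an integrity check against accidental corruption or truncation. It is not tamper-proofing on its own, and the stored prompt and response need the same data-handling controls as the source data.

```python
import hashlib
import json
from datetime import datetime, timezone


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(body: dict) -> str:
    # Sorted keys give a stable serialization, so the digest is reproducible.
    return json.dumps(body, sort_keys=True, ensure_ascii=False)


def build_audit_record(model_version, prompt_template, retrieval_context,
                       raw_response, timestamp=None):
    """One record per model invocation, with everything needed to reconstruct it."""
    record = {
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt_template": prompt_template,
        "prompt_template_sha256": _sha256(prompt_template),
        "retrieval_context": list(retrieval_context),
        "raw_response": raw_response,
    }
    record["record_sha256"] = _sha256(_canonical(record))
    return record


def verify_audit_record(record) -> bool:
    """Recompute the digest over everything except the digest field itself."""
    body = {k: v for k, v in record.items() if k != "record_sha256"}
    return record.get("record_sha256") == _sha256(_canonical(body))


def build_action_record(operator_id, action, shown_record):
    """'Human saw X, clicked Y': link the action to the record the operator saw."""
    return {
        "operator_id": operator_id,
        "action": action,
        "saw_record_sha256": shown_record["record_sha256"],
    }
```

Hashing the prompt template separately makes it cheap to answer "which invocations used this exact prompt?" without comparing full strings.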

Where LLMs earn their keep

The strongest fits we've found are the unglamorous ones:

  • Parsing unstructured inputs into structured ones with a human review step.
  • Search and retrieval over internal documentation, with citations to the source.
  • Summarization of long incident logs into a briefing a human reads.
  • Suggestion ranking — proposing the three most likely options, never executing on them.

Where we don't use them

Anywhere a model output directly triggers an external action without human review. Anywhere the consequences of a confidently wrong answer are irreversible. Anywhere we can't show our work in an audit. This isn't a stance against AI. It's a stance against the failure mode where someone ships a chat-style interface over a process that warranted a form, a workflow, and a signoff.

The actual takeaway

Building AI into life-critical systems is mostly an exercise in restraint. The interesting engineering is in everything around the model: the eval harness, the retrieval layer, the human-in-the-loop UX, the audit logs, the rollback story when the provider silently changes a model. If your team isn't excited about that work, they aren't ready to ship AI in this domain yet.
