Beyond the MVP
The Test Harness Regulated AI Products Actually Need
Why this post exists
A specific event sparked this reflection, but I am not going to name names here. The point is not to point fingers. The issue is bigger than any one company or team.
What matters is the pattern.
Context
Every so often, a story appears that feels like a glimpse of the future.
A tiny team uses AI to move at extraordinary speed. A product takes shape faster than most companies would have thought possible. Revenue arrives. Scale arrives. Attention arrives.
And then reality arrives.
Not because the builders were foolish. Not because speed is bad. Not because AI-assisted development is fake.
Because in regulated systems, getting the product to work is only the first threshold. The harder threshold is proving that it keeps working safely when real humans, real edge cases, and real adversaries show up.
Too many teams still treat that as a phase-two problem.
It is not.
Why this matters to me
I am not raising this to pass judgment on the people building at speed. In many ways, what they achieved is remarkable.
I am raising it because I am trying to learn from it.
As I get ready to begin a new chapter of my life, one that I hope includes building AI solutions of my own, I am thinking hard about what success should look like and what it actually requires. There are no true experts here, at least not in the sense of people who have fully solved this transition. Most of us are learning in public.
That is why stories like this matter to me. They surface the gap between building something impressive and building something trustworthy. They remind me that speed, polish, and early traction are not the whole story. If I want to build responsibly, I need to be honest about what success really means and about the disciplines required to get there.
The real lesson
We should be honest about two things at once.
First, getting an AI-assisted healthcare or telehealth product off the ground with a tiny team is an incredible feat. It signals that the cost of building useful digital systems has dropped in ways that would have sounded absurd not long ago.
Second, in regulated fields, an MVP is often just the first moment when the system becomes dangerous enough to matter.
That is not fearmongering. It is the nature of systems that touch health, money, identity, education, legal outcomes, insurance, or safety.
A pleasant demo proves that a workflow exists.
It does not prove that:
- one patient cannot see another patient’s data
- one user cannot escalate privileges by changing a URL or identifier
- audit logs are sufficient to investigate an incident
- alerts will fire when access patterns go sideways
- the organization can determine scope and disclose appropriately if something goes wrong
That is the gap.
And it is a familiar one.
We build the thing. We test that the happy path works. We call it an MVP. Then we quietly act as if working software and trustworthy software are close cousins.
They are not.
Why GenAI changes the stakes
GenAI lowers the cost of creation.
That is wonderful.
It also lowers the cost of creating something that looks complete before it is actually mature.
That is the part we need to absorb.
A coding agent can help assemble pages, forms, APIs, dashboards, queues, admin panels, prompts, and workflow glue at startling speed. It can even generate test cases. Left to itself, though, it tends to optimize for visible completion.
Humans do this too.
The problem is that regulated systems require invisible completion.
The dangerous questions are often not:
- does the page load?
- does the form submit?
- does the user get a result?
The dangerous questions are:
- what happens when the wrong user asks for the same object?
- what happens when identifiers are altered, replayed, guessed, cached, or reused?
- what happens when a support workflow bypasses the main permission model?
- what happens when a synthetic edge case collides with a real operational shortcut?
Those are not demo questions. They are trust questions.
The old trap: building the workflow and stopping there
This is not just a software problem. It is an organizational habit.
We are very good at celebrating the visible milestone:
- the portal works
- the dashboard works
- the claim is submitted
- the appointment is booked
- the patient can message the provider
But the real system is not the screen.
The real system is the whole chain:
- identity
- permissions
- object ownership
- logging
- anomaly detection
- escalation
- incident handling
- disclosure readiness
If any of those are weak, the system is weak.
This is why we have to move past the idea that an MVP is finished just because it can be used.
In regulated work, usable is the beginning of the burden, not the end of it.
What kind of test harness is actually needed?
Not one harness.
A layered harness.
A living harness.
A harness designed to assume that the code, the humans, and the AI helpers will all occasionally miss something important.
1. Authorization harness
This is the first wall.
Every endpoint, route, resolver, file handler, export, and background job that touches a regulated object should be tested against a permission matrix.
Not just for valid access.
For invalid access.
That means systematically testing:
- patient A requesting patient B’s record
- one clinic user requesting another clinic’s data
- support users touching objects outside their scope
- stale links reused by other sessions
- attachments, downloads, images, and PDFs fetched directly
- record identifiers changed by one digit, one character, or one format
If the system takes an object ID, the harness should assume an attacker will try another one.
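To make that concrete, here is a minimal sketch of what a permission-matrix check looks like in practice: enumerate every (user, object) pair and assert the outcome, hostile paths included. The handler, record store, and names here are all hypothetical stand-ins, not a prescription for any particular framework.

```python
# Minimal authorization-matrix sketch. `fetch_record` stands in for a
# real endpoint handler; RECORDS stands in for the data store.

RECORDS = {
    "rec-1": {"owner": "patient-a", "body": "chart A"},
    "rec-2": {"owner": "patient-b", "body": "chart B"},
}

def fetch_record(requesting_user, record_id):
    """Server-side ownership check: the rule under test."""
    record = RECORDS.get(record_id)
    if record is None or record["owner"] != requesting_user:
        return ("denied", None)  # same response for "missing" and "forbidden"
    return ("ok", record["body"])

def test_permission_matrix():
    # Every user is tried against every record, not just their own.
    for user in ("patient-a", "patient-b"):
        for record_id, record in RECORDS.items():
            status, _ = fetch_record(user, record_id)
            expected = "ok" if record["owner"] == user else "denied"
            assert status == expected

test_permission_matrix()
```

The point is the shape, not the code: valid and invalid access live in the same loop, so a new record type or role cannot quietly skip its hostile-path coverage.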
2. Synthetic data and tripwire records
Teams should not discover these failures on live human beings.
A proper environment needs synthetic records that behave like real data while exposing no real person. Better still, it should include tripwire objects that are never touched under normal operations.
If something accesses them unexpectedly, alarms should fire.
Think of them as digital dye packs.
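A dye pack can be sketched in a few lines: seed records nothing legitimate should ever read, and hook the data-access layer so any touch raises an alert. The seeding mechanism and alert sink below are illustrative placeholders.

```python
# "Digital dye pack" sketch: synthetic tripwire records plus a hook in
# the data-access layer. Names and storage are illustrative.

TRIPWIRE_IDS = {"trip-001", "trip-002"}  # seeded into the test database
alerts = []

def record_access(user, record_id):
    """Called on every read; tripwire hits alert immediately."""
    if record_id in TRIPWIRE_IDS:
        alerts.append({"user": user, "record": record_id})

# Normal traffic raises nothing.
record_access("clinician-1", "rec-42")
# Anything touching a tripwire record fires.
record_access("batch-job-7", "trip-001")

assert alerts == [{"user": "batch-job-7", "record": "trip-001"}]
```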
3. Property-based security testing
Traditional test cases ask whether the feature works.
Property-based tests ask whether core safety rules hold across many generated combinations.
Examples:
- a patient can only retrieve their own records
- a clinician can only access patients connected through approved workflow paths
- marketing or affiliate systems can never access clinical objects
- archive, restore, merge, and delete flows do not create back doors
This is where subtle edge cases get dragged into the light.
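A property test can be sketched with nothing but the standard library: generate every (user, record) combination and assert the invariants over all of them. A real project would typically reach for a library like Hypothesis to generate stranger inputs; the roles, appointment links, and access rule here are illustrative.

```python
import itertools

# Property-style sketch: assert safety invariants across generated
# combinations rather than hand-picked cases. All names illustrative.

APPOINTMENTS = {("dr-1", "patient-1"), ("dr-2", "patient-3")}
OWNERS = {f"rec-{i}": f"patient-{i % 4}" for i in range(20)}

def can_read(user, record_id):
    owner = OWNERS[record_id]
    return user == owner or (user, owner) in APPOINTMENTS

def check_isolation_properties():
    users = [f"patient-{i}" for i in range(4)] + ["dr-1", "dr-2", "mallory"]
    for user, rec in itertools.product(users, OWNERS):
        allowed = can_read(user, rec)
        owner = OWNERS[rec]
        if user.startswith("patient-"):
            # Property 1: a patient reads only their own records.
            assert allowed == (user == owner)
        else:
            # Property 2: anyone else needs an approved workflow link.
            assert allowed == ((user, owner) in APPOINTMENTS)

check_isolation_properties()
```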
4. Browser and replay harness
Security failures do not live only in APIs.
A browser-level harness should capture normal traffic, then replay and mutate it:
- changing object IDs
- reusing tokens
- switching sessions
- replaying downloads
- testing browser storage
- observing cache behavior
- checking for data leakage through frontend helpers and side channels
The interface can be beautiful and still betray the user through the plumbing.
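The replay idea reduces to a simple loop: capture a legitimate request, generate mutations of it, and assert that every mutation is rejected. In a real harness the dispatcher would be recorded browser traffic replayed over HTTP; the stub below is illustrative.

```python
import copy

# Capture-and-mutate sketch. `dispatch` stands in for sending a request
# to the running app; sessions and records are illustrative.

SESSIONS = {"sess-a": "patient-a"}
OWNERS = {"rec-1": "patient-a", "rec-2": "patient-b"}

def dispatch(request):
    user = SESSIONS.get(request["session"])
    if user is None or OWNERS.get(request["record_id"]) != user:
        return 403
    return 200

captured = {"session": "sess-a", "record_id": "rec-1"}
assert dispatch(captured) == 200  # legitimate traffic replays cleanly

def mutations(request):
    swapped_id = copy.copy(request)
    swapped_id["record_id"] = "rec-2"        # another user's object
    stale_session = copy.copy(request)
    stale_session["session"] = "sess-expired"  # reused dead session
    return [swapped_id, stale_session]

for mutated in mutations(captured):
    assert dispatch(mutated) == 403  # every mutation must be rejected
```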
5. Adversarial GenAI harness
If GenAI helped build the system, it can also help attack the assumptions.
Used carefully, an adversarial model can propose likely weak points:
- endpoints that smell like missing authorization checks
- administrative workflows that may bypass policy
- identifiers that look enumerable
- exports and attachments that may not inherit the main controls
- awkward transitions after retry, resend, merge, restore, or import
The model should not be trusted as the judge. It should be used as a case generator. Deterministic tooling should execute the tests.
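That division of labor can be sketched in a few lines. The `propose_attack_cases` stub below stands in for a real model call; the deterministic runner, not the model, decides pass or fail. Everything here is illustrative.

```python
# Division-of-labor sketch: a generative model proposes candidate attack
# cases, deterministic code executes and judges them.

def propose_attack_cases(endpoint_description):
    """In production this would prompt an LLM with the endpoint spec.
    Here it returns a fixed sample of model-style suggestions."""
    return [
        {"record_id": "rec-2", "note": "neighboring identifier"},
        {"record_id": "rec-1'--", "note": "injection-shaped identifier"},
    ]

def run_case(case):
    """Deterministic executor: the model never judges pass/fail."""
    allowed_ids = {"rec-1"}  # the simulated caller owns only rec-1
    return "denied" if case["record_id"] not in allowed_ids else "ok"

results = [
    (case["note"], run_case(case))
    for case in propose_attack_cases("GET /records/{id}")
]
assert all(outcome == "denied" for _, outcome in results)
```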
6. Contract-driven security tests
Every sensitive endpoint should declare its contract:
- what object it touches
- what roles may act
- what ownership rule applies
- whether regulated data is present
- whether audit logging is required
Once that contract exists, the pipeline can auto-generate mandatory negative tests.
This is where the feedback loop becomes structural rather than optional.
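Here is one possible shape for such a contract and the tests a pipeline could derive from it. The schema, role names, and generated test labels are all hypothetical; the point is that each declared rule mechanically implies a mandatory negative test.

```python
# Contract-to-negative-test sketch. The contract schema is illustrative.

CONTRACT = {
    "endpoint": "GET /records/{id}",
    "object": "patient_record",
    "roles": {"patient", "clinician"},
    "ownership": "requester_owns_record",
    "regulated_data": True,
    "audit_log": True,
}

def generate_negative_tests(contract):
    """Every declared rule implies a mandatory hostile-path test."""
    tests = []
    for role in ("anonymous", "marketing", "support"):
        if role not in contract["roles"]:
            tests.append(f"{contract['endpoint']}: deny role={role}")
    if contract["ownership"]:
        tests.append(f"{contract['endpoint']}: deny non-owner access")
    if contract["audit_log"]:
        tests.append(f"{contract['endpoint']}: verify audit entry written")
    return tests

required = generate_negative_tests(CONTRACT)
assert "GET /records/{id}: deny non-owner access" in required
```

A CI gate can then refuse any release where a declared contract lacks its generated tests, which is what makes the loop structural.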
7. Audit and anomaly harness
Prevention is not enough.
Teams also need proof.
A good harness verifies that the system logs sensitive access in a way that supports investigation without casually leaking more sensitive data into the logs. It should also confirm that alerts trigger on suspicious patterns:
- repeated requests against sequential identifiers
- broad access across unrelated records
- spikes in export or download behavior
- repeated authorization failures from a live account
The system should not only fail safely. It should fail observably.
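To ground one rule from that list, here is a sketch of detecting sequential-identifier probing in an access log. The window size, log shape, and threshold are illustrative choices, not tuned values.

```python
# Anomaly-rule sketch: flag an account walking consecutive record IDs.

def sequential_probe_alert(events, window=5):
    """events: list of (user, numeric_record_id) in time order.
    Returns the offending user, or None."""
    by_user = {}
    for user, rec_id in events:
        ids = by_user.setdefault(user, [])
        ids.append(rec_id)
        recent = ids[-window:]
        # Alert when the last `window` IDs form a strictly consecutive
        # increasing run -- classic enumeration behavior.
        if len(recent) == window and all(
            b == a + 1 for a, b in zip(recent, recent[1:])
        ):
            return user
    return None

normal = [("dr-1", 14), ("dr-1", 3), ("dr-1", 90), ("dr-1", 2), ("dr-1", 55)]
probing = [("scraper", n) for n in range(100, 105)]
assert sequential_probe_alert(normal) is None
assert sequential_probe_alert(probing) == "scraper"
```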
8. Breach-readiness harness
This is the one almost no one wants to build, which is exactly why it matters.
Run simulations.
Assume a likely exposure has occurred. Then test whether the organization can:
- preserve evidence
- determine scope
- identify affected records
- trace which users touched what
- notify the right stakeholders
- distinguish legal, compliance, engineering, and communications responsibilities
A mature regulated product is not just one that prevents incidents well. It is one that can respond coherently when prevention fails.
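The "determine scope" step, in particular, should be a rehearsed query, not a scramble. A toy version of that query, over an illustrative in-memory audit log, looks like this:

```python
# Scoping sketch: given a compromised account and a time window, which
# records were touched? Log shape is illustrative.

AUDIT_LOG = [
    {"ts": 100, "user": "support-9", "action": "read", "record": "rec-1"},
    {"ts": 105, "user": "support-9", "action": "export", "record": "rec-2"},
    {"ts": 110, "user": "dr-1", "action": "read", "record": "rec-1"},
    {"ts": 300, "user": "support-9", "action": "read", "record": "rec-3"},
]

def scope_incident(log, compromised_user, start, end):
    """Return the sorted set of records the account touched in-window."""
    return sorted({
        event["record"] for event in log
        if event["user"] == compromised_user and start <= event["ts"] <= end
    })

affected = scope_incident(AUDIT_LOG, "support-9", start=90, end=200)
assert affected == ["rec-1", "rec-2"]  # rec-3 falls outside the window
```

If the real audit logs cannot answer this question quickly, the breach-readiness simulation has already found its first finding.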
What the coding agent must be told
One of the quiet dangers of vibe coding is that the builder asks for features and forgets to ask for constraints.
If you prompt for:
- patient portal
- dashboard
- scheduling
- messaging
- payment flow
- records lookup
then the coding agent will try to deliver those things.
But if you do not explicitly include the nonfunctional rules, the agent may never treat them as first-class requirements.
So the prompt needs to include things like:
- every object access requires server-side authorization
- all object IDs are attacker-controlled input
- every sensitive read and write is audit logged
- every endpoint touching regulated data must have negative tests
- every new route must declare role and ownership rules
- implementation is not complete until enumeration and replay tests pass
In other words, the agent must be told that security, logging, and compliance are part of the feature, not decoration and not future hardening.
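One way to stop relying on memory for this is to make the constraints part of the tooling: a small helper that appends a fixed guardrail block to every feature prompt sent to the agent. The wording and helper below are a sketch, not a recommendation for any particular agent framework.

```python
# Sketch: guardrails appended to every coding-agent prompt so the
# nonfunctional rules travel with every feature request.

GUARDRAILS = [
    "Every object access requires server-side authorization.",
    "Treat all object IDs as attacker-controlled input.",
    "Audit-log every sensitive read and write.",
    "Every endpoint touching regulated data ships with negative tests.",
    "Declare role and ownership rules for every new route.",
]

def build_prompt(feature_request):
    rules = "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return f"{feature_request}\n\nNon-negotiable constraints:\n{rules}"

prompt = build_prompt("Build a patient records lookup page.")
assert "attacker-controlled" in prompt
```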
The maturity shift we need
The real shift is not from human coding to AI coding.
It is from building software to building feedback loops around software.
That is the deeper pattern.
If GenAI makes creation cheap, feedback becomes the scarce resource.
And in regulated fields, the scarce resource we should care about most is not code.
It is trustworthy verification.
The winning teams will not be the ones that can build a portal in a weekend. They will be the teams that can build the portal, surround it with adversarial checks, keep those checks current as the product evolves, and prove to auditors, leaders, and customers that trust is not resting on optimism.
That is the difference between a demo and a discipline.
A better definition of done
For regulated, AI-assisted products, done should mean more than:
- the feature works
- the user flow is complete
- the UI looks polished
- the first customer is live
Done should mean:
- authorization tests pass
- hostile-path tests pass
- audit logs are verified
- anomaly alerts are verified
- incident response and escalation paths are documented
- the system cannot ship without rerunning the relevant negative checks
That is not bureaucracy for its own sake.
That is what maturity looks like when real people can be harmed.
Closing thought
We should absolutely celebrate what AI is making possible.
Tiny teams can now build systems that once required large organizations and deep budgets. That matters. It opens doors. It creates opportunities. It allows experimentation that would have been out of reach even a few years ago.
But in regulated fields, speed is not the finish line.
Speed gets you to the starting line faster. Trust still has to be earned the slow way, through layered controls, repeated testing, transparent operations, and the humility to assume that a functioning MVP is only the first draft of a safe system.
That is the mindset shift.
Not anti-AI.
Not anti-speed.
Just grown-up enough to know that, in some domains, the hard part starts after the demo.
Potential skills, agents, and tools to add to our toolkit
This is here to spark ideas, not to serve as the permanent home for these concepts.
| # | Idea | Spark |
|---|---|---|
| 1 | Permission Matrix Generator | Converts roles, object types, ownership rules, and workflow states into a machine-readable authorization matrix for implementation and testing. |
| 2 | Negative Test Case Generator | Reads APIs, routes, or code and produces adversarial tests for unauthorized access, enumeration, replay, stale links, and role misuse. |
| 3 | Synthetic Regulated Data Factory | Generates realistic but safe test data sets with patients, providers, claims, merged accounts, archived records, and messy edge cases. |
| 4 | Tripwire Record Seeder | Plants honey records and decoy assets in test environments so unexpected access becomes visible fast. |
| 5 | Browser Mutation Harness | Captures normal browser and API traffic, then mutates identifiers, sessions, tokens, and requests to probe for leakage. |
| 6 | Contract-to-Test Compiler | Turns endpoint contracts, permission declarations, and sensitivity tags into required negative tests and audit-log checks. |
| 7 | Audit Log Quality Checker | Reviews logging to confirm events are recorded, correlated, searchable, and useful for incident response without oversharing. |
| 8 | Anomaly Rule Recommender | Suggests and simulates alert rules for probing, abnormal exports, repeated failures, and unusual cross-record access. |
| 9 | Adversarial Prompt Pack for Coding Agents | Uses prompts and review personas to push coding agents toward regulated edge cases, auditability, and hostile misuse thinking. |
| 10 | Security Regression Council | Combines multiple models and rule-based checks to review releases for access-control, logging, privacy, and disclosure regressions. |
| 11 | Incident Readiness Simulator | Rehearses scoping, evidence preservation, escalation, legal handoff, communications, and reporting after likely exposures. |
| 12 | Definition-of-Done Policy Enforcer | Blocks deployment until regulated-field checks such as authorization coverage, audit logging, anomaly detection, and playbook readiness pass. |
| 13 | Regulated-System Architecture Reviewer | Flags risky design patterns early, including weak object ownership, admin boundary problems, sensitive logging, and poor segregation. |
| 14 | Disclosure Readiness Checklist Builder | Maps likely incident types to the evidence, stakeholders, timelines, and communications needed if exposure occurs. |
| 15 | Trust Boundary Mapper | Visualizes where identities, tokens, objects, permissions, and regulated data cross services, vendors, workflows, and roles. |
These are not just coding aids. They are maturity aids.
And increasingly, that may be where the real value lives.