Beyond the MVP
The Test Harness Regulated AI Products Actually Need
Why this post exists
A specific event sparked this reflection, but I am not going to name names here. The point is not to point fingers. The issue is bigger than any one company or team.
What matters is the pattern.
Context
Every so often, a story appears that feels like a glimpse of the future.
A tiny team uses AI to move at extraordinary speed. A product takes shape faster than most companies would have thought possible. Revenue arrives. Scale arrives. Attention arrives.
And then reality arrives.
Not because the builders were foolish. Not because speed is bad. Not because AI-assisted development is fake.
Because in regulated systems, getting the product to work is only the first threshold. The harder threshold is proving that it keeps working safely when real humans, real edge cases, and real adversaries show up.
Too many teams still treat that as a phase-two problem.
It is not.
Why this matters to me
I am not raising this to pass judgment on the people building at speed. In many ways, what they achieved is remarkable.
I am raising it because I am trying to learn from it.
As I get ready to begin a new chapter of my life, one that I hope includes building AI solutions of my own, I am thinking hard about what success should look like and what it actually requires. There are no true experts here, at least not in the sense of people who have fully solved this transition. Most of us are learning in public.
That is why stories like this matter to me. They surface the gap between building something impressive and building something trustworthy. They remind me that speed, polish, and early traction are not the whole story. If I want to build responsibly, I need to be honest about what success really means and about the disciplines required to get there.
The real lesson
We should be honest about two things at once.
First, getting an AI-assisted healthcare or telehealth product off the ground with a tiny team is an incredible feat. It signals that the cost of building useful digital systems has dropped in ways that would have sounded absurd not long ago.
Second, in regulated fields, an MVP is often just the first moment when the system becomes dangerous enough to matter.
That is not fearmongering. It is the nature of systems that touch health, money, identity, education, legal outcomes, insurance, or safety.
A pleasant demo proves that a workflow exists.
It does not prove that:
- one patient cannot see another patient’s data
- one user cannot escalate privileges by changing a URL or identifier
- audit logs are sufficient to investigate an incident
- alerts will fire when access patterns go sideways
- the organization can determine scope and disclose appropriately if something goes wrong
That is the gap.
And it is a familiar one.
We build the thing. We test that the happy path works. We call it an MVP. Then we quietly act as if working software and trustworthy software are close cousins.
They are not.
Why GenAI changes the stakes
GenAI lowers the cost of creation.
That is wonderful.
It also lowers the cost of creating something that looks complete before it is actually mature.
That is the part we need to absorb.
A coding agent can help assemble pages, forms, APIs, dashboards, queues, admin panels, prompts, and workflow glue at startling speed. It can even generate test cases. Left to itself, though, it tends to optimize for visible completion.
Humans do this too.
The problem is that regulated systems require invisible completion.
The dangerous questions are often not:
- does the page load?
- does the form submit?
- does the user get a result?
The dangerous questions are:
- what happens when the wrong user asks for the same object?
- what happens when identifiers are altered, replayed, guessed, cached, or reused?
- what happens when a support workflow bypasses the main permission model?
- what happens when a synthetic edge case collides with a real operational shortcut?
Those are not demo questions. They are trust questions.
The old trap: building the workflow and stopping there
This is not just a software problem. It is an organizational habit.
We are very good at celebrating the visible milestone:
- the portal works
- the dashboard works
- the claim is submitted
- the appointment is booked
- the patient can message the provider
But the real system is not the screen.
The real system is the whole chain:
- identity
- permissions
- object ownership
- logging
- anomaly detection
- escalation
- incident handling
- disclosure readiness
If any of those are weak, the system is weak.
This is why we have to move past the idea that an MVP is finished just because it can be used.
In regulated work, usable is the beginning of the burden, not the end of it.
What kind of test harness is actually needed?
Not one harness.
A layered harness.
A living harness.
A harness designed to assume that the code, the humans, and the AI helpers will all occasionally miss something important.
1. Authorization harness
This is the first wall.
Every endpoint, route, resolver, file handler, export, and background job that touches a regulated object should be tested against a permission matrix.
Not just for valid access.
For invalid access.
That means systematically testing:
- patient A requesting patient B’s record
- one clinic user requesting another clinic’s data
- support users touching objects outside their scope
- stale links reused by other sessions
- attachments, downloads, images, and PDFs fetched directly
- record identifiers changed by one digit, one character, or one format
If the system takes an object ID, the harness should assume an attacker will try another one.
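To make that concrete, here is a minimal sketch of what a permission-matrix check looks like in practice: enumerate every (user, object) pair and assert the outcome, hostile paths included. The handler, record store, and names here are all hypothetical stand-ins, not a prescription for any particular framework.

```python
# Minimal authorization-matrix sketch. `fetch_record` stands in for a
# real endpoint handler; RECORDS stands in for the data store.

RECORDS = {
    "rec-1": {"owner": "patient-a", "body": "chart A"},
    "rec-2": {"owner": "patient-b", "body": "chart B"},
}

def fetch_record(requesting_user, record_id):
    """Server-side ownership check: the rule under test."""
    record = RECORDS.get(record_id)
    if record is None or record["owner"] != requesting_user:
        return ("denied", None)  # same response for "missing" and "forbidden"
    return ("ok", record["body"])

def test_permission_matrix():
    # Every user is tried against every record, not just their own.
    for user in ("patient-a", "patient-b"):
        for record_id, record in RECORDS.items():
            status, _ = fetch_record(user, record_id)
            expected = "ok" if record["owner"] == user else "denied"
            assert status == expected

test_permission_matrix()
```

The point is the shape, not the code: valid and invalid access live in the same loop, so a new record type or role cannot quietly skip its hostile-path coverage.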
2. Synthetic data and tripwire records
Teams should not discover these failures on live human beings.
A proper environment needs synthetic records that behave like real data while exposing no real person. Better still, it should include tripwire objects that are never touched under normal operations.
If something accesses them unexpectedly, alarms should fire.
Think of them as digital dye packs.
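A dye pack can be sketched in a few lines: seed records nothing legitimate should ever read, and hook the data-access layer so any touch raises an alert. The seeding mechanism and alert sink below are illustrative placeholders.

```python
# "Digital dye pack" sketch: synthetic tripwire records plus a hook in
# the data-access layer. Names and storage are illustrative.

TRIPWIRE_IDS = {"trip-001", "trip-002"}  # seeded into the test database
alerts = []

def record_access(user, record_id):
    """Called on every read; tripwire hits alert immediately."""
    if record_id in TRIPWIRE_IDS:
        alerts.append({"user": user, "record": record_id})

# Normal traffic raises nothing.
record_access("clinician-1", "rec-42")
# Anything touching a tripwire record fires.
record_access("batch-job-7", "trip-001")

assert alerts == [{"user": "batch-job-7", "record": "trip-001"}]
```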
3. Property-based security testing
Traditional test cases ask whether the feature works.
Property-based tests ask whether core safety rules hold across many generated combinations.
Examples:
- a patient can only retrieve their own records
- a clinician can only access patients connected through approved workflow paths
- marketing or affiliate systems can never access clinical objects
- archive, restore, merge, and delete flows do not create back doors
This is where subtle edge cases get dragged into the light.
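A property test can be sketched with nothing but the standard library: generate every (user, record) combination and assert the invariants over all of them. A real project would typically reach for a library like Hypothesis to generate stranger inputs; the roles, appointment links, and access rule here are illustrative.

```python
import itertools

# Property-style sketch: assert safety invariants across generated
# combinations rather than hand-picked cases. All names illustrative.

APPOINTMENTS = {("dr-1", "patient-1"), ("dr-2", "patient-3")}
OWNERS = {f"rec-{i}": f"patient-{i % 4}" for i in range(20)}

def can_read(user, record_id):
    owner = OWNERS[record_id]
    return user == owner or (user, owner) in APPOINTMENTS

def check_isolation_properties():
    users = [f"patient-{i}" for i in range(4)] + ["dr-1", "dr-2", "mallory"]
    for user, rec in itertools.product(users, OWNERS):
        allowed = can_read(user, rec)
        owner = OWNERS[rec]
        if user.startswith("patient-"):
            # Property 1: a patient reads only their own records.
            assert allowed == (user == owner)
        else:
            # Property 2: anyone else needs an approved workflow link.
            assert allowed == ((user, owner) in APPOINTMENTS)

check_isolation_properties()
```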
4. Browser and replay harness
Security failures do not live only in APIs.
A browser-level harness should capture normal traffic, then replay and mutate it:
- changing object IDs
- reusing tokens
- switching sessions
- replaying downloads
- testing browser storage
- observing cache behavior
- checking for data leakage through frontend helpers and side channels
The interface can be beautiful and still betray the user through the plumbing.
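The replay idea reduces to a simple loop: capture a legitimate request, generate mutations of it, and assert that every mutation is rejected. In a real harness the dispatcher would be recorded browser traffic replayed over HTTP; the stub below is illustrative.

```python
import copy

# Capture-and-mutate sketch. `dispatch` stands in for sending a request
# to the running app; sessions and records are illustrative.

SESSIONS = {"sess-a": "patient-a"}
OWNERS = {"rec-1": "patient-a", "rec-2": "patient-b"}

def dispatch(request):
    user = SESSIONS.get(request["session"])
    if user is None or OWNERS.get(request["record_id"]) != user:
        return 403
    return 200

captured = {"session": "sess-a", "record_id": "rec-1"}
assert dispatch(captured) == 200  # legitimate traffic replays cleanly

def mutations(request):
    swapped_id = copy.copy(request)
    swapped_id["record_id"] = "rec-2"        # another user's object
    stale_session = copy.copy(request)
    stale_session["session"] = "sess-expired"  # reused dead session
    return [swapped_id, stale_session]

for mutated in mutations(captured):
    assert dispatch(mutated) == 403  # every mutation must be rejected
```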
5. Adversarial GenAI harness
If GenAI helped build the system, it can also help attack the assumptions.
Used carefully, an adversarial model can propose likely weak points:
- endpoints that smell like missing authorization checks
- administrative workflows that may bypass policy
- identifiers that look enumerable
- exports and attachments that may not inherit the main controls
- awkward transitions after retry, resend, merge, restore, or import
The model should not be trusted as the judge. It should be used as a case generator. Deterministic tooling should execute the tests.
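That division of labor can be sketched in a few lines. The `propose_attack_cases` stub below stands in for a real model call; the deterministic runner, not the model, decides pass or fail. Everything here is illustrative.

```python
# Division-of-labor sketch: a generative model proposes candidate attack
# cases, deterministic code executes and judges them.

def propose_attack_cases(endpoint_description):
    """In production this would prompt an LLM with the endpoint spec.
    Here it returns a fixed sample of model-style suggestions."""
    return [
        {"record_id": "rec-2", "note": "neighboring identifier"},
        {"record_id": "rec-1'--", "note": "injection-shaped identifier"},
    ]

def run_case(case):
    """Deterministic executor: the model never judges pass/fail."""
    allowed_ids = {"rec-1"}  # the simulated caller owns only rec-1
    return "denied" if case["record_id"] not in allowed_ids else "ok"

results = [
    (case["note"], run_case(case))
    for case in propose_attack_cases("GET /records/{id}")
]
assert all(outcome == "denied" for _, outcome in results)
```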
6. Contract-driven security tests
Every sensitive endpoint should declare its contract:
- what object it touches
- what roles may act
- what ownership rule applies
- whether regulated data is present
- whether audit logging is required
Once that contract exists, the pipeline can auto-generate mandatory negative tests.
This is where the feedback loop becomes structural rather than optional.
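Here is one possible shape for such a contract and the tests a pipeline could derive from it. The schema, role names, and generated test labels are all hypothetical; the point is that each declared rule mechanically implies a mandatory negative test.

```python
# Contract-to-negative-test sketch. The contract schema is illustrative.

CONTRACT = {
    "endpoint": "GET /records/{id}",
    "object": "patient_record",
    "roles": {"patient", "clinician"},
    "ownership": "requester_owns_record",
    "regulated_data": True,
    "audit_log": True,
}

def generate_negative_tests(contract):
    """Every declared rule implies a mandatory hostile-path test."""
    tests = []
    for role in ("anonymous", "marketing", "support"):
        if role not in contract["roles"]:
            tests.append(f"{contract['endpoint']}: deny role={role}")
    if contract["ownership"]:
        tests.append(f"{contract['endpoint']}: deny non-owner access")
    if contract["audit_log"]:
        tests.append(f"{contract['endpoint']}: verify audit entry written")
    return tests

required = generate_negative_tests(CONTRACT)
assert "GET /records/{id}: deny non-owner access" in required
```

A CI gate can then refuse any release where a declared contract lacks its generated tests, which is what makes the loop structural.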
7. Audit and anomaly harness
Prevention is not enough.
Teams also need proof.
A good harness verifies that the system logs sensitive access in a way that supports investigation without casually leaking more sensitive data into the logs. It should also confirm that alerts trigger on suspicious patterns:
- repeated requests against sequential identifiers
- broad access across unrelated records
- spikes in export or download behavior
- repeated authorization failures from a live account
The system should not only fail safely. It should fail observably.
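To ground one rule from that list, here is a sketch of detecting sequential-identifier probing in an access log. The window size, log shape, and threshold are illustrative choices, not tuned values.

```python
# Anomaly-rule sketch: flag an account walking consecutive record IDs.

def sequential_probe_alert(events, window=5):
    """events: list of (user, numeric_record_id) in time order.
    Returns the offending user, or None."""
    by_user = {}
    for user, rec_id in events:
        ids = by_user.setdefault(user, [])
        ids.append(rec_id)
        recent = ids[-window:]
        # Alert when the last `window` IDs form a strictly consecutive
        # increasing run -- classic enumeration behavior.
        if len(recent) == window and all(
            b == a + 1 for a, b in zip(recent, recent[1:])
        ):
            return user
    return None

normal = [("dr-1", 14), ("dr-1", 3), ("dr-1", 90), ("dr-1", 2), ("dr-1", 55)]
probing = [("scraper", n) for n in range(100, 105)]
assert sequential_probe_alert(normal) is None
assert sequential_probe_alert(probing) == "scraper"
```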
8. Breach-readiness harness
This is the one almost no one wants to build, which is exactly why it matters.
Run simulations.
Assume a likely exposure has occurred. Then test whether the organization can:
- preserve evidence
- determine scope
- identify affected records
- trace which users touched what
- notify the right stakeholders
- distinguish legal, compliance, engineering, and communications responsibilities
A mature regulated product is not just one that prevents incidents well. It is one that can respond coherently when prevention fails.
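The "determine scope" step, in particular, should be a rehearsed query, not a scramble. A toy version of that query, over an illustrative in-memory audit log, looks like this:

```python
# Scoping sketch: given a compromised account and a time window, which
# records were touched? Log shape is illustrative.

AUDIT_LOG = [
    {"ts": 100, "user": "support-9", "action": "read", "record": "rec-1"},
    {"ts": 105, "user": "support-9", "action": "export", "record": "rec-2"},
    {"ts": 110, "user": "dr-1", "action": "read", "record": "rec-1"},
    {"ts": 300, "user": "support-9", "action": "read", "record": "rec-3"},
]

def scope_incident(log, compromised_user, start, end):
    """Return the sorted set of records the account touched in-window."""
    return sorted({
        event["record"] for event in log
        if event["user"] == compromised_user and start <= event["ts"] <= end
    })

affected = scope_incident(AUDIT_LOG, "support-9", start=90, end=200)
assert affected == ["rec-1", "rec-2"]  # rec-3 falls outside the window
```

If the real audit logs cannot answer this question quickly, the breach-readiness simulation has already found its first finding.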
What the coding agent must be told
One of the quiet dangers of vibe coding is that the builder asks for features and forgets to ask for constraints.
If you prompt for:
- patient portal
- dashboard
- scheduling
- messaging
- payment flow
- records lookup
then the coding agent will try to deliver those things.
But if you do not explicitly include the nonfunctional rules, the agent may never treat them as first-class requirements.
So the prompt needs to include things like:
- every object access requires server-side authorization
- all object IDs are attacker-controlled input
- every sensitive read and write is audit logged
- every endpoint touching regulated data must have negative tests
- every new route must declare role and ownership rules
- implementation is not complete until enumeration and replay tests pass
In other words, the agent must be told that security, logging, and compliance are part of the feature, not decoration and not future hardening.
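One way to stop relying on memory for this is to make the constraints part of the tooling: a small helper that appends a fixed guardrail block to every feature prompt sent to the agent. The wording and helper below are a sketch, not a recommendation for any particular agent framework.

```python
# Sketch: guardrails appended to every coding-agent prompt so the
# nonfunctional rules travel with every feature request.

GUARDRAILS = [
    "Every object access requires server-side authorization.",
    "Treat all object IDs as attacker-controlled input.",
    "Audit-log every sensitive read and write.",
    "Every endpoint touching regulated data ships with negative tests.",
    "Declare role and ownership rules for every new route.",
]

def build_prompt(feature_request):
    rules = "\n".join(f"- {rule}" for rule in GUARDRAILS)
    return f"{feature_request}\n\nNon-negotiable constraints:\n{rules}"

prompt = build_prompt("Build a patient records lookup page.")
assert "attacker-controlled" in prompt
```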
The maturity shift we need
The real shift is not from human coding to AI coding.
It is from building software to building feedback loops around software.
That is the deeper pattern.
If GenAI makes creation cheap, feedback becomes the scarce resource.
And in regulated fields, the scarce resource we should care about most is not code.
It is trustworthy verification.
The winning teams will not be the ones that can build a portal in a weekend. They will be the teams that can build the portal, surround it with adversarial checks, keep those checks current as the product evolves, and prove to auditors, leaders, and customers that trust is not resting on optimism.
That is the difference between a demo and a discipline.
A better definition of done
For regulated, AI-assisted products, done should mean more than:
- the feature works
- the user flow is complete
- the UI looks polished
- the first customer is live
Done should mean:
- authorization tests pass
- hostile-path tests pass
- audit logs are verified
- anomaly alerts are verified
- incident response and escalation paths are documented
- the system cannot ship without rerunning the relevant negative checks
That is not bureaucracy for its own sake.
That is what maturity looks like when real people can be harmed.
Closing thought
We should absolutely celebrate what AI is making possible.
Tiny teams can now build systems that once required large organizations and deep budgets. That matters. It opens doors. It creates opportunities. It allows experimentation that would have been out of reach even a few years ago.
But in regulated fields, speed is not the finish line.
Speed gets you to the starting line faster. Trust still has to be earned the slow way, through layered controls, repeated testing, transparent operations, and the humility to assume that a functioning MVP is only the first draft of a safe system.
That is the mindset shift.
Not anti-AI.
Not anti-speed.
Just grown-up enough to know that, in some domains, the hard part starts after the demo.
Potential skills, agents, and tools to add to our toolkit
This is here to spark ideas, not to serve as the permanent home for these concepts.
| # | Idea | Spark |
|---|---|---|
| 1 | Permission Matrix Generator | Converts roles, object types, ownership rules, and workflow states into a machine-readable authorization matrix for implementation and testing. |
| 2 | Negative Test Case Generator | Reads APIs, routes, or code and produces adversarial tests for unauthorized access, enumeration, replay, stale links, and role misuse. |
| 3 | Synthetic Regulated Data Factory | Generates realistic but safe test data sets with patients, providers, claims, merged accounts, archived records, and messy edge cases. |
| 4 | Tripwire Record Seeder | Plants honey records and decoy assets in test environments so unexpected access becomes visible fast. |
| 5 | Browser Mutation Harness | Captures normal browser and API traffic, then mutates identifiers, sessions, tokens, and requests to probe for leakage. |
| 6 | Contract-to-Test Compiler | Turns endpoint contracts, permission declarations, and sensitivity tags into required negative tests and audit-log checks. |
| 7 | Audit Log Quality Checker | Reviews logging to confirm events are recorded, correlated, searchable, and useful for incident response without oversharing. |
| 8 | Anomaly Rule Recommender | Suggests and simulates alert rules for probing, abnormal exports, repeated failures, and unusual cross-record access. |
| 9 | Adversarial Prompt Pack for Coding Agents | Uses prompts and review personas to push coding agents toward regulated edge cases, auditability, and hostile misuse thinking. |
| 10 | Security Regression Council | Combines multiple models and rule-based checks to review releases for access-control, logging, privacy, and disclosure regressions. |
| 11 | Incident Readiness Simulator | Rehearses scoping, evidence preservation, escalation, legal handoff, communications, and reporting after likely exposures. |
| 12 | Definition-of-Done Policy Enforcer | Blocks deployment until regulated-field checks such as authorization coverage, audit logging, anomaly detection, and playbook readiness pass. |
| 13 | Regulated-System Architecture Reviewer | Flags risky design patterns early, including weak object ownership, admin boundary problems, sensitive logging, and poor segregation. |
| 14 | Disclosure Readiness Checklist Builder | Maps likely incident types to the evidence, stakeholders, timelines, and communications needed if exposure occurs. |
| 15 | Trust Boundary Mapper | Visualizes where identities, tokens, objects, permissions, and regulated data cross services, vendors, workflows, and roles. |
These are not just coding aids. They are maturity aids.
And increasingly, that may be where the real value lives.