Day 7: When AI Deleted 7 Customers' Data (20 Rows, One SQL Statement)

The session that became this incident started well. We'd just tracked down a real production blocker — a model file that had been silently gitignored in deployment for weeks, leaving rotation metrics dark for every upload during that window. The fix was a one-line ignore change. The deploy went green. Rotation metrics started flowing again.

Then the AI proposed the next step: invalidate the stale cached results for those affected uploads so they'd regenerate with the now-working metrics. Sensible. I said yes.

[Visual thesis map: the fix looked sensible; the blast radius was hidden; audit evidence made recovery possible; dry-run gates became mandatory; trust came from telling the truth fast.]

The SQL That Looked Reasonable

The AI proposed:

UPDATE <results_table>
SET <cache_column_a> = NULL,
    <cache_column_b> = NULL,
    <cache_column_c> = NULL
WHERE created_at BETWEEN '<date>' AND '<date>';

The intent in the AI's head: "reset the affected uploads so they recompute." The mental model: a single cache wiped clean on next read. The actual product state: multiple independent cached outputs. Only one was stale.

[Diagram 01. Mental model: "reset the stale thing," a plausible cleanup step written with too broad a scope. Product state: "separate outputs, separate risk." Similar-looking data was not the same data.]

One derived-output lane was correctly broken — that was the bug we just fixed. The other outputs were independent. They'd been running fine the whole time. Their cached output was current, personalized, and expensive to regenerate.

SET ... = NULL on all the stored outputs wiped them all.

20 Rows. 7 Customers. About 90 Seconds.

The query ran successfully. 20 rows updated. Seven paying customers' personalized generated content — gone. None of it recoverable from the user-facing database, because those stored outputs were the source of truth for that text.

I caught it within minutes because I happened to be watching the query output. Three months earlier, I would have noticed an hour later, after the LLM bill spiked from the regeneration wave.

The next 30 minutes were the worst 30 minutes of building JumpOnion.

Why The Story Has an Ending

Eight weeks before this incident, after a smaller but related scare, I'd added an audit trail that captured enough to reconstruct any LLM-generated content. Not just whether the call happened — enough recoverable evidence. It cost a few cents a day in storage. The justification at the time was: "if I ever have a cache-wipe incident, this is what saves me." It felt like overcaution. It became the only reason this story has an ending.

[Diagram 02. Bad write lands: the customer-facing state is damaged. Audit trail: the original generated content still has a reconstruction path. Same-day recovery: the incident becomes managed instead of unrecoverable.]

Within fifteen minutes I had a script that walked the LLM call log for each affected task ID, pulled the latest recoverable generated content, and re-populated the wiped fields. Strictly speaking, the customers' data wasn't recovered from the database; it was rebuilt from the logged LLM output that originally produced it. Identical text. Identical supporting content.
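A minimal sketch of what a restore script along these lines could look like, using Python's built-in sqlite3 so it runs anywhere. The table names (results, llm_call_log), the column names, and the sample rows are all assumptions made up for illustration; the real schema and script are not shown in this post.

```python
import sqlite3

# Hypothetical allowlist: column names are interpolated into SQL below,
# so only known stored-output columns are accepted.
RESTORABLE_COLUMNS = {"cache_column_a"}

def restore_from_audit_log(conn, task_ids, column="cache_column_a"):
    """Refill a wiped column from the newest logged LLM output per task."""
    if column not in RESTORABLE_COLUMNS:
        raise ValueError(f"unexpected column: {column!r}")
    restored = 0
    for task_id in task_ids:
        row = conn.execute(
            "SELECT output_text FROM llm_call_log "
            "WHERE task_id = ? AND output_column = ? "
            "ORDER BY created_at DESC, id DESC LIMIT 1",
            (task_id, column),
        ).fetchone()
        if row is None:
            continue  # nothing logged for this task; leave it for manual review
        # Only fill fields that are still NULL, so current data is never overwritten.
        cur = conn.execute(
            f"UPDATE results SET {column} = ? "
            f"WHERE task_id = ? AND {column} IS NULL",
            (row[0], task_id),
        )
        restored += cur.rowcount
    conn.commit()
    return restored

# Made-up sample data: t1 and t2 were wiped, t3 still holds current content.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE results (task_id TEXT PRIMARY KEY, cache_column_a TEXT);
CREATE TABLE llm_call_log (
    id INTEGER PRIMARY KEY, task_id TEXT, output_column TEXT,
    output_text TEXT, created_at TEXT
);
INSERT INTO results VALUES ('t1', NULL), ('t2', NULL), ('t3', 'still current');
INSERT INTO llm_call_log (task_id, output_column, output_text, created_at) VALUES
    ('t1', 'cache_column_a', 'old draft',  '2024-01-01'),
    ('t1', 'cache_column_a', 'final text', '2024-01-02'),
    ('t3', 'cache_column_a', 'stale log entry', '2024-01-01');
""")
restored = restore_from_audit_log(conn, ["t1", "t2", "t3"])
```

The IS NULL guard is the design choice worth copying: the restore can be re-run safely, and it cannot clobber a field that was never wiped or has already been refilled.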

The audit log was a paranoia investment. It cost almost nothing to keep. It made an unrecoverable disaster into a 30-minute restoration. Most paranoia investments pay zero. The one that pays, pays everything.

Two Rules Calcified

Two new entries in the SOP came out of this:

[Diagram 03. Preview before destruction: make the affected customers and rows visible before any write. Recover after destruction: persist enough evidence that a mistake can be unwound.]

Cache invalidation follows upstream-dependency rules. Every derived output has a known upstream input. When you fix an upstream bug, you invalidate only the stored outputs that depend on that input — never "all caches" as a reflex. Treat each output lane as having separate provenance.
Destructive SQL on multi-row production data requires a SELECT-preview gate. Any UPDATE or DELETE that could touch more than one customer row has to run as a SELECT first, returning the affected user emails. The founder reviews the list. Approves explicitly. Only then does the destructive query run. The cost of this protocol is 30 seconds. The cost of skipping it is 30 minutes of recovery if you're lucky and customer churn if you're not.
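The second rule can be sketched as a small wrapper that refuses to write until someone has seen the preview. This is an illustration under assumptions, not the actual SOP tooling: the users and results tables, their columns, the sample rows, and the approve callback are all hypothetical, and sqlite3 stands in for whatever database is really in use.

```python
import sqlite3

def gated_update(conn, set_clause, where, params, approve):
    """Preview the affected rows and emails; write only if `approve` returns True.

    `set_clause` and `where` are operator-written SQL fragments, never user input.
    """
    row_count = conn.execute(
        f"SELECT COUNT(*) FROM results WHERE {where}", params
    ).fetchone()[0]
    emails = [r[0] for r in conn.execute(
        "SELECT DISTINCT email FROM users WHERE id IN "
        f"(SELECT user_id FROM results WHERE {where}) ORDER BY email",
        params,
    )]
    if not approve(emails, row_count):
        return 0  # nothing was written
    cur = conn.execute(f"UPDATE results SET {set_clause} WHERE {where}", params)
    if cur.rowcount != row_count:
        conn.rollback()  # scope drifted between preview and write
        raise RuntimeError("row count changed between preview and write")
    conn.commit()
    return cur.rowcount

# Made-up sample data for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE results (
    id INTEGER PRIMARY KEY, user_id INTEGER,
    created_at TEXT, cache_column_a TEXT
);
INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com');
INSERT INTO results VALUES
    (1, 1, '2024-01-05', 'x'),
    (2, 2, '2024-01-06', 'y'),
    (3, 2, '2024-02-01', 'z');
""")

seen = []
def reject(emails, n):
    seen.append((emails, n))  # this is the list a human would review
    return False

window = ("2024-01-01", "2024-01-31")
changed_rejected = gated_update(
    conn, "cache_column_a = NULL", "created_at BETWEEN ? AND ?", window, reject)
changed_approved = gated_update(
    conn, "cache_column_a = NULL", "created_at BETWEEN ? AND ?", window,
    lambda emails, n: True)
```

In practice the approve callback would print the emails and wait for an explicit yes. The row-count check after the UPDATE is an extra guard I added for the sketch: if the write touches a different number of rows than the preview showed, it rolls back.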

The Mental Model the AI Had

What I keep coming back to is the AI's mental model at the moment it wrote that SQL. The model was: "cache = single thing that gets reset." Not: "cache = multiple independent columns, each with its own provenance, only one of which is actually stale."

The AI didn't lack data. The schema was loaded in the session context. The stored-output fields were even visible in its tab-completion. What it lacked was the discipline of asking "which of these actually needs invalidation, and why?" before writing the SQL.

That discipline, written down, is a SOP rule. Without the rule, the AI defaults to whatever mental model is easiest — and "cache = one thing" is the easiest. With the rule, the AI runs the upstream-dependency check first, and the destructive SQL only touches what it actually needs to touch.
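One way to make the upstream-dependency check mechanical is to write the dependencies down as data and generate the invalidation statement from them. The sketch below is hypothetical: which stored output depends on which input is an assumption, and the names upstream_input_b, upstream_input_c, and results are placeholders I made up.

```python
# Hypothetical map: stored output column -> the upstream input it is derived from.
UPSTREAM = {
    "cache_column_a": "rotation_metrics",   # assumed to be the stale lane
    "cache_column_b": "upstream_input_b",   # placeholder name
    "cache_column_c": "upstream_input_c",   # placeholder name
}

def columns_to_invalidate(fixed_input):
    """Return only the stored outputs that depend on the input that was fixed."""
    return sorted(col for col, src in UPSTREAM.items() if src == fixed_input)

def build_invalidation_sql(table, fixed_input):
    """Build an UPDATE that NULLs the dependent columns and nothing else."""
    cols = columns_to_invalidate(fixed_input)
    if not cols:
        raise ValueError(f"no stored output depends on {fixed_input!r}")
    set_clause = ", ".join(f"{col} = NULL" for col in cols)
    return f"UPDATE {table} SET {set_clause} WHERE created_at BETWEEN ? AND ?"

sql = build_invalidation_sql("results", "rotation_metrics")
print(sql)
```

With this map, the generated statement touches one column instead of three, and asking to invalidate an input nothing depends on fails loudly instead of falling back to "all caches."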

Customer Notification

Within hours of the incident, the seven affected customers got an email. No marketing language. Just: "a cleanup query during a backend fix accidentally invalidated your latest diagnosis. We've regenerated it from the original analysis log and it's back on your account. No charge for the regeneration. Sorry."

Zero customers churned. Several replied with thanks for the transparency. One — a coach who'd been on the platform for a month — said the response was the reason he'd recommend it to other coaches: "you told me what broke before I noticed."

Single-founder SaaS isn't about never breaking. It's about absorbing the blast radius yourself instead of pushing it onto the customer. The audit log let me absorb this one. Without it, seven customers would have lost data and trust in the same afternoon.

Without SOP, With SOP

Without SOP

Trust the AI's SQL. Run it. Notice the damage hours later when an LLM bill spike triggers a billing alert. Discover seven customers' data is gone. No audit log means no recovery path. Push a public apology. Burn trust with paying customers in a niche where word spreads fast.

With SOP

SELECT-preview gate forces the dry run. The result shows three columns being NULLed. Catch the over-broad scope before execution. Even if it somehow runs: audit log makes regeneration possible. Same-day customer email turns the incident into a trust-building moment instead of a churn event.

The Real Lesson

AI partners write production SQL. That's table stakes now. The question isn't "will the AI ever write destructive SQL?" — it will, eventually, with the most plausible-looking justification — but "what catches it before it lands?"

Two things catch it. A SELECT-preview gate makes the scope visible before execution. An audit log makes recovery possible after execution. Together they convert unrecoverable disasters into managed incidents. Either one alone is still meaningfully better than neither.

The mental shift that took me longest to make: stop trusting the AI's "this is safe" reasoning. The AI is usually right about safety. But being wrong even once costs enough that double-checking every time is the cheaper trade.

Next: Day 8 — the LLM finished a long training plan generation. The platform gateway killed the client connection before the response made it back. The user got a 504. The plan was already saved. The audit log knew. Nothing else did.