Ciki Zeng
2026-04-25 · 9 min read · JumpOnion · Production Incident · Data Safety

Day 7: When AI Deleted 7 Customers' Data (in One SQL Statement)

The session that became this incident started well. We'd just tracked down a real production blocker — a model file that had been silently gitignored in deployment for weeks, leaving rotation metrics dark for every upload during that window. The fix was a one-line ignore change. The deploy went green. Rotation metrics started flowing again.

Then the AI proposed the next step: invalidate the stale cached results for those affected uploads so they'd regenerate with the now-working metrics. Sensible. I said yes.

The SQL That Looked Reasonable

The AI proposed:

UPDATE <results_table>
SET <cache_column_a> = NULL,
    <cache_column_b> = NULL,
    <cache_column_c> = NULL
WHERE created_at BETWEEN '<date>' AND '<date>';

The intent in the AI's head: "reset the affected uploads so they recompute." The mental model: a single cache wiped clean on next read. The actual table schema: multiple independent caches in separate columns, each populated by a different upstream pipeline.

One pipeline (the analytics one) was correctly broken — that was the bug we just fixed. The other pipelines were independent. They'd been running fine the whole time. Their cached output was current, personalized, and the result of expensive LLM calls.

SET ... = NULL on all the cache columns wiped them all.

20 Rows. 7 Customers. About 90 Seconds.

The query ran successfully. 20 rows updated. Seven paying customers' personalized diagnosis text — gone. Their training plans — gone. None of it recoverable from the user-facing database, because the cache columns were the source of truth for that text.

I caught it within minutes because I happened to be watching the query output. Three months earlier, I would have noticed an hour later, after the LLM bill spiked from the regeneration wave.

The next 30 minutes were the worst 30 minutes of building JumpOnion.

Why The Story Has an Ending

Eight weeks before this incident, after a smaller but related scare, I'd added a column to the LLM call log persisting the full text of every LLM response. Not just the prompt and metadata — the full output. It cost a few cents a day in storage. The justification at the time was: "if I ever have a cache-wipe incident, this is what saves me." It felt like overcaution. It became the only reason this story has an ending.
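
A minimal sketch of what that kind of logging could look like, assuming a SQLite-style store. The table name `llm_call_log`, its columns, and the function `log_llm_call` are hypothetical names chosen for illustration, not the actual JumpOnion schema. The point is only that the full response text is written next to the prompt and metadata.

```python
import sqlite3
import time


def log_llm_call(conn, task_id, kind, prompt, response_text):
    """Append one LLM call to the log, including the full response text.

    All table and column names here are assumptions for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_call_log ("
        "id INTEGER PRIMARY KEY, task_id TEXT NOT NULL, kind TEXT NOT NULL, "
        "prompt TEXT NOT NULL, response_text TEXT NOT NULL, "
        "created_at REAL NOT NULL)"
    )
    cur = conn.execute(
        "INSERT INTO llm_call_log "
        "(task_id, kind, prompt, response_text, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (task_id, kind, prompt, response_text, time.time()),
    )
    conn.commit()
    return cur.lastrowid  # id of the new log row
```

The log is append-only in this sketch: nothing updates or deletes a row, so a later cache wipe cannot touch it.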

Within fifteen minutes I had a script that walked the LLM call log for each affected task ID, pulled the latest diagnosis and training plan responses, and repopulated the wiped cache columns. The customers' data wasn't recovered from a backup; it was restored from the logged LLM output that originally produced it. Identical text. Identical training plans.
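
The restoration script itself isn't shown here, so the following is only a sketch of the shape such a script might take. It assumes a log table `llm_call_log(task_id, kind, response_text, created_at)` and a results table `results(task_id, diagnosis_text, plan_text)`; every one of those names is hypothetical.

```python
import sqlite3

# Hypothetical mapping from a logged response kind to the cache column
# it originally populated. Names are assumptions for illustration.
KIND_TO_COLUMN = {"diagnosis": "diagnosis_text", "training_plan": "plan_text"}


def restore_from_log(conn, task_ids):
    """Copy the latest logged response of each kind back into its cache column.

    Returns the number of column values restored.
    """
    restored = 0
    for task_id in task_ids:
        for kind, column in KIND_TO_COLUMN.items():
            row = conn.execute(
                "SELECT response_text FROM llm_call_log "
                "WHERE task_id = ? AND kind = ? "
                "ORDER BY created_at DESC, id DESC LIMIT 1",
                (task_id, kind),
            ).fetchone()
            if row is None:
                continue  # nothing logged for this kind; leave the column alone
            # `column` comes from the fixed dict above, never from user input.
            # The IS NULL guard means only wiped values are written.
            cur = conn.execute(
                f"UPDATE results SET {column} = ? "
                f"WHERE task_id = ? AND {column} IS NULL",
                (row[0], task_id),
            )
            restored += cur.rowcount
    conn.commit()
    return restored
```

The `IS NULL` guard is a design choice in this sketch: it makes the script safe to rerun and keeps it from overwriting any value that was not part of the wipe.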

The audit log was a paranoia investment. It cost almost nothing to keep. It made an unrecoverable disaster into a 30-minute restoration. Most paranoia investments pay zero. The one that pays, pays everything.

Two Rules Calcified

Two new entries in the SOP came out of this:

  1. Cache invalidation follows upstream-dependency rules. Every cache column has a known upstream input. When you fix an upstream bug, you invalidate only the cache columns that depend on that input — never "all caches" as a reflex. Each cache is a separate column because each has a separate upstream. Treat them that way.
  2. Destructive SQL on multi-row production data requires a SELECT-preview gate. Any UPDATE or DELETE that could touch more than one customer row has to run as a SELECT first, returning the affected user emails. The founder reviews the list. Approves explicitly. Only then does the destructive query run. The cost of this protocol is 30 seconds. The cost of skipping it is 30 minutes of recovery if you're lucky and customer churn if you're not.
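
The second rule can be sketched as a small wrapper. This is one possible shape rather than the actual enforcement hook: the `results` and `users` tables, their columns, and the `approve` callback (a stand-in for the founder's explicit review) are all assumptions made for illustration.

```python
import sqlite3


def preview_then_update(conn, set_clause, where_clause, params, approve):
    """Run the WHERE clause as a SELECT first; UPDATE only after approval.

    set_clause and where_clause are trusted SQL fragments written by the
    operator; params fill the placeholders in where_clause only.
    approve receives the list of (task_id, email) rows and returns a bool.
    """
    preview = conn.execute(
        "SELECT task_id, "
        "(SELECT email FROM users WHERE users.id = results.user_id) AS email "
        f"FROM results WHERE {where_clause} ORDER BY task_id",
        params,
    ).fetchall()
    if not approve(preview):
        return 0  # rejected: nothing was modified
    cur = conn.execute(
        f"UPDATE results SET {set_clause} WHERE {where_clause}", params
    )
    if cur.rowcount != len(preview):
        # The data changed between preview and update; undo and stop.
        conn.rollback()
        raise RuntimeError("row count differs from the approved preview")
    conn.commit()
    return cur.rowcount
```

The same `where_clause` string feeds both statements, so the list that gets reviewed is the list that gets updated. In an interactive session, `approve` could print the emails and wait for a typed confirmation.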

The Mental Model the AI Had

What I keep coming back to is the AI's mental model at the moment it wrote that SQL. The model was: "cache = single thing that gets reset." Not: "cache = multiple independent columns, each with its own provenance, only one of which is actually stale."

The AI didn't lack data. The schema was loaded in the session context. The cache column names were even in its tab-completion. What it lacked was the discipline of asking "which of these actually needs invalidation, and why?" before writing the SQL.

That discipline, written down, is a SOP rule. Without the rule, the AI defaults to whatever mental model is easiest — and "cache = one thing" is the easiest. With the rule, the AI runs the upstream-dependency check first, and the destructive SQL only touches what it actually needs to touch.
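
One way to make the upstream-dependency check mechanical is to write the column-to-pipeline mapping down as data and derive the UPDATE from it. The sketch below assumes three cache columns and three pipelines; the names `rotation_metrics`, `diagnosis_text`, `plan_text`, and the pipeline labels are hypothetical stand-ins for the placeholders in the original query.

```python
# Hypothetical mapping: each cache column and the upstream pipeline that fills it.
CACHE_UPSTREAM = {
    "rotation_metrics": "analytics",
    "diagnosis_text": "diagnosis_llm",
    "plan_text": "plan_llm",
}


def columns_to_invalidate(fixed_pipeline, upstream=CACHE_UPSTREAM):
    """Return only the cache columns fed by the pipeline that was just fixed."""
    cols = sorted(c for c, p in upstream.items() if p == fixed_pipeline)
    if not cols:
        raise ValueError(f"no cache column depends on {fixed_pipeline!r}")
    return cols


def build_invalidation_sql(table, fixed_pipeline):
    """Build a scoped UPDATE that NULLs only the dependent columns.

    `table` is a trusted identifier supplied by the operator, not user input.
    """
    set_clause = ", ".join(
        f"{c} = NULL" for c in columns_to_invalidate(fixed_pipeline)
    )
    return f"UPDATE {table} SET {set_clause} WHERE created_at BETWEEN ? AND ?"
```

With this mapping, `build_invalidation_sql("results", "analytics")` produces a statement that touches `rotation_metrics` and nothing else, and asking for a pipeline with no dependent columns fails loudly instead of falling back to "all caches".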

Customer Notification

Within hours of the incident, the seven affected customers got a same-day email. No marketing language. Just: "a cleanup query during a backend fix accidentally invalidated your latest diagnosis. We've regenerated it from the original analysis log and it's back on your account. No charge for the regeneration. Sorry."

Zero customers churned. Several replied with thanks for the transparency. One — a coach who'd been on the platform for a month — said the response was the reason he'd recommend it to other coaches: "you told me what broke before I noticed."

Single-founder SaaS isn't about never breaking. It's about absorbing the blast radius yourself instead of pushing it onto the customer. The audit log let me absorb this one. Without it, seven customers would have lost data and trust in the same afternoon.

Without SOP, With SOP

Without SOP

Trust the AI's SQL. Run it. Notice the damage hours later when an LLM bill spike triggers a billing alert. Discover seven customers' data is gone. No audit log means no recovery path. Push a public apology. Burn trust with paying customers in a niche where word spreads fast.

With SOP

SELECT-preview gate forces the dry run. The result shows three columns being NULLed. Catch the over-broad scope before execution. Even if it somehow runs: audit log makes regeneration possible. Same-day customer email turns the incident into a trust-building moment instead of a churn event.

The Real Lesson

AI partners write production SQL. That's table stakes now. The question isn't "will the AI ever write destructive SQL?" — it will, eventually, with the most plausible-looking justification — but "what catches it before it lands?"

Two things catch it. A SELECT-preview gate makes the scope visible before execution. An audit log makes recovery possible after execution. Together they convert unrecoverable disasters into managed incidents. Either one alone is still meaningfully better than neither.

The mental shift that took me longest to make: stop trusting the AI's "this is safe" reasoning. The AI is usually right about safety. But trusting it when it's wrong costs so much that double-checking every time is the cheaper trade.

Next: Day 8 — the LLM finished a 59-second training plan generation. The platform's 60-second gateway timeout killed the client connection 0.7 seconds before. The user got a 504. The plan was already saved. The audit log knew. Nothing else did.
