Day 2 ended with the project narrowly rescued from a wrong-algorithm spiral. By this point — weeks later — JumpOnion had grown up. 987 unit tests. A real production stack. Paying customers. The kind of project where things should be hard to break.
They weren't.
The Innocent Request
I asked my AI partner to do a security audit. Standard ask: scan the backend, flag anything risky, propose fixes. Four parallel agents, structured pass over 22 files, output a tidy commit.
The commit message was a model of professionalism:
security: full-stack audit v2

CRITICAL fixes (immediate revenue/cost impact):
- Tighten CSP headers (nonce-based)
- Verify webhook signatures with constant-time comparison
- Remove S3 companion route mount (dead code, anonymous uploads)
I read it. 987 tests passed. The deploy went green on Railway. Vercel rebuilt the frontend. Everything looked fine.
It wasn't fine. The third bullet, the one labeled "dead code, anonymous uploads," had just removed the route mount that every video upload flowed through.
How a Confident Label Lies
The AI's reasoning, traced back from the diff, looked plausible:
- The route's code had a comment mentioning "anonymous tasks" (a Phase 1 legacy concept).
- The current product had real authentication. So the "anonymous" path was "obviously" deprecated.
- Therefore the entire router was "dead code."
- Therefore deleting the mount was a "cleanup."
Every step was delivered with full AI confidence. Every step was wrong. A single command would have exposed the error:
grep -r "s3/multipart" frontend/
The frontend uploader (Uppy) was actively pointing at that exact route as its companionUrl. Removing the mount turned every upload into an instant 404.
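To make "removing the mount" concrete, here is a minimal sketch of the shape of the code involved, assuming a FastAPI-style backend; the post never names the framework, and every identifier below is illustrative:

```python
# Illustrative sketch only; the framework (FastAPI) and all names are assumptions.
from fastapi import APIRouter, FastAPI

app = FastAPI()
s3_router = APIRouter(prefix="/s3")

@s3_router.post("/multipart")
def create_multipart_upload():
    # The endpoint the Uppy uploader reaches through its companionUrl.
    ...

@s3_router.get("/health")
def s3_health():
    # Per-router health check; this is what the later curl hit and got a 404.
    return {"ok": True}

# The single line the "cleanup" deleted. Everything above still imports,
# compiles, and passes mocked unit tests without it, but every request
# to /s3/* now returns 404 in production.
app.include_router(s3_router)
```

Delete that last line and no stack trace appears anywhere; the routes simply stop existing.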
Why 987 Tests Didn't Save Me
This is the part I want every developer to internalize. We had:
- 987 unit tests. They all passed.
- A staging deploy step. Returned 200 on the main health check.
- An integration suite. Mocked the S3 client.
- A frontend build. Compiled cleanly.
None of them noticed that POST /s3/multipart now returned 404. The reason isn't that we wrote bad tests; it's that we never wrote a test that asserted the route existed at all.
Mocked clients don't care if the server route exists. Health checks check /health, not every endpoint. Frontend builds compile JavaScript, not server routing.
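Here is the shape of a unit test that keeps passing through all of this. The handler and names are hypothetical, but the pattern (mock the S3 client, call the function, never touch routing) matches what our suite did:

```python
# Hypothetical example of a test that stays green when the route is unmounted.
from unittest.mock import MagicMock

def create_multipart_upload(s3_client, filename: str) -> dict:
    # Stand-in for the upload handler the real tests covered.
    upload = s3_client.create_multipart_upload(Bucket="videos", Key=filename)
    return {"uploadId": upload["UploadId"], "key": filename}

def test_create_multipart_upload_returns_upload_id():
    s3 = MagicMock()
    s3.create_multipart_upload.return_value = {"UploadId": "abc123"}

    result = create_multipart_upload(s3, "clip.mp4")

    # Passes whether or not the router is mounted: nothing here issues an
    # HTTP request, so nothing here can ever see a 404.
    assert result == {"uploadId": "abc123", "key": "clip.mp4"}
```

The logic is correct, the mock is correct, and the test is still blind to the only thing that broke.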
The Slow-Motion Failure
For 30 hours, every paying customer who tried to upload a video got a silent failure. The frontend showed a generic error. The backend logs showed a 404 on /s3/multipart. No alert fired because nobody had built a monitor for "a route that should exist suddenly returns 404."
The sneakiest moment came hours after the original deletion. A follow-up audit pass added a security gate to that same router file — adding require_paid_feature_access as a decorator. The router was already unmounted. The AI had just put a lock on a dismantled door, and the diff still looked productive.
The Question That Cracked It
A paying customer reported uploads weren't working. My AI partner's first hypothesis: a CSP header issue with the uploader's domain. Plausible-sounding. We deployed a CSP fix.
Customer: still broken.
I asked one question, five words, that broke the loop:
"Did you actually deploy that?"
Within five minutes, my AI partner ran curl /s3/health against production. The response: 404. The main health check returned 200. The diagnosis became obvious in retrospect: the upload route had been unmounted in the security audit two days earlier.
The Three-Layer Fix
Re-mounting the route took one line. The fix that mattered was ensuring this class of failure could never repeat:
- Code layer: Re-mount the route, with an inline comment naming the bug it caused.
- Test layer: A new test file with a CRITICAL_ROUTES registry of 16 endpoints that must exist, plus a parametrized assertion that no route in the registry returns 404. Adding or removing a route now requires updating the registry, which makes the change reviewable (see the sketch after this list).
- Runtime layer: A cron-driven smoke check that hits every route in the registry against production every 15 minutes and alerts on the first 404. Worst-case detection time becomes 15 minutes, not 30 hours.
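Here is roughly what the test layer can look like, as a sketch under assumptions: a FastAPI app importable from app.main, pytest, and placeholder routes standing in for the real 16-entry registry.

```python
# tests/test_critical_routes.py (illustrative, not the production file).
import pytest
from fastapi.testclient import TestClient

from app.main import app  # hypothetical import path

# Single source of truth. Adding or removing a route means editing this list,
# so the change shows up in the diff and has to survive review.
CRITICAL_ROUTES = [
    ("POST", "/s3/multipart"),
    ("GET", "/s3/health"),
    ("GET", "/health"),
    # ...the remaining endpoints that must exist
]

client = TestClient(app)

@pytest.mark.parametrize("method,path", CRITICAL_ROUTES)
def test_critical_route_is_mounted(method, path):
    response = client.request(method, path)
    # Only assert existence: 401, 405, or 422 are acceptable answers here;
    # 404 means the route has silently vanished.
    assert response.status_code != 404, f"{method} {path} is not mounted"
```

The runtime layer can reuse the same registry. A minimal cron target, again with hypothetical names and a placeholder URL:

```python
# smoke_check.py (run every 15 minutes by cron; alert wiring is left out).
import httpx

from tests.test_critical_routes import CRITICAL_ROUTES  # same registry as above

BASE_URL = "https://api.example.com"  # placeholder for the production host

def find_missing_routes() -> list[str]:
    missing = []
    for method, path in CRITICAL_ROUTES:
        response = httpx.request(method, BASE_URL + path, timeout=10)
        if response.status_code == 404:
            missing.append(f"{method} {path} -> 404")
    return missing

if __name__ == "__main__":
    failures = find_missing_routes()
    if failures:
        # A non-zero exit is the simplest alert hook; wire it to whatever pages you.
        raise SystemExit("Critical routes missing: " + ", ".join(failures))
```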
The Real Lesson
AI-authored code review has a property humans rarely have: it's consistently confident. A human deleting a route might write "not sure if this is still used, removing" — leaving a visible signal of doubt. The AI wrote "dead code, anonymous uploads" with absolute fluency.
That fluency is a trust signal in human review. Reviewers tend to give well-written labels the benefit of the doubt. The cost of challenging every line in a 22-file PR is high; the cost of trusting confident-sounding labels feels low. So the trust asymmetry quietly grows.
That asymmetry is a hidden attack surface. Not adversarial — just statistical. Every confident AI label that survives review without evidence-checking is a small bet that the AI got it right. Most of those bets pay off. Some don't. When they don't, the cost is asymmetric — a 30-hour outage, a paying customer's silent failure, a category of bug that 987 tests never see.
Trust the "dead code" label. Ship the audit. Spend hours debugging CSP, encryption, frontend caching — anything except the route that's 404. Discover the truth when a second customer reports the failure, days later.
After the first false hypothesis, escalate to evidence. One curl finds the real root cause in five minutes. Three-layer defense ships the same day. New SOP rule added: AI deletion commits with confident labels must include caller-audit evidence in the body.
What I Carry Forward
Three rules calcified out of this incident:
- AI confidence is not evidence. The more polished the deletion label, the more it deserves a caller-audit grep. Self-confident AI prose should be treated like a self-confident junior engineer's prose: trust, but demand evidence.
- Test what would silently break, not what would loudly break. 987 unit tests caught nothing because they all assumed the route existed. The dangerous failures are the ones where every existing test still passes.
- The founder's "did you actually…" question is the highest-leverage debugging tool.It costs six words to ask, and it's a forced reset against AI confidence. Normalize using it. Normalize answering it with a curl, not a confidence statement.
The 16-route registry is still in production. Every 15 minutes, a cron job confirms that every critical endpoint still exists. I sleep better.
Next: Day 4 — the schema migration that broke every paying user's drill access for two weeks. Tests passed. Mocks agreed. Production lied.