Ciki Zeng
Case Studies

Battle-Tested

Every methodology claim is backed by a real incident from production development. These aren't hypotheticals — they're documented moments where the system caught what I would have missed.

01JumpOnion2026-03-18

3 Failed Attempts, Then the System Self-Corrected

What happened

During algorithm development, a tracking approach failed completely on real figure skating videos — detecting 6.47s of air time on a jump that lasted 0.7s. The same approach was retried with small tweaks.

Rule triggered

Blindspot Interception — after 2 identical failures, force a root cause analysis. After 30 minutes on a dead-end approach, force a strategy switch.

Without the SOP

My AI partner would have kept tweaking the same broken approach for the entire session. Worse, the flawed algorithm might have shipped to production.

What actually happened

On the third failure, the system automatically switched strategy — from centroid tracking to bottom-point tracking with physics constraints. It found the right approach before I even suggested it.

Can your AI workflow rescue itself when it's failing? Mine can.
02SunnyInvoices2026-03-18

Ran a Pipeline in the Wrong Directory — Zero Damage

What happened

A data pipeline was accidentally executed in the wrong project directory. Cross-project file contamination was a real risk.

Rule triggered

Environment Pollution Guard — any cross-project reference triggers an immediate stop-and-confirm before proceeding.

Without the SOP

Files, data, or dependencies from one project could have contaminated another, requiring hours of detective work to untangle.

What actually happened

The project boundary mechanism caught the mistake instantly. Despite running in the wrong directory, zero cross-project pollution occurred.

Your SOP is body armor, not decoration. Even when you make mistakes, the system protects you.
03SOP Framework2026-03-19

The SOP Couldn't Protect Its Own Birth

What happened

A new rule for automatic case study collection was added to the project-level memory — visible only in the current project. But case studies are mainly generated during development of other products.

Rule triggered

No existing rule covered this — it was a blindspot in the SOP creation process itself.

Without the SOP

The new rule would never fire during product development sessions. Case studies would silently stop being collected, and the entire content pipeline would dry up.

What actually happened

The issue was caught during review. The rule was moved to the global configuration, and a new blindspot interception rule was added: always verify rule scope when writing new rules.

The SOP isn't a finished product — it exposes its own blindspots, then you fix them, and the system gets stronger.
04SOP Framework2026-03-19

The Rule Existed — The AI Just Didn't Follow It

What happened

A new protocol was written to project-level memory instead of the global config. The rule explicitly stating 'check if cross-project rules are globally visible' was already in place — but the AI skipped the self-check.

Rule triggered

Blindspot Interception — writing new rules must include a scope self-check.

Without the SOP

The new protocol would only be visible in one project. During development of other products, the AI wouldn't know the rule exists.

What actually happened

The issue was caught manually and fixed. More importantly, it proved a critical insight: prompt-level instructions aren't reliable enough. This is exactly why enforcement hooks exist — they don't trust the AI to 'remember,' they force compliance in code.

Most AI workflows tell you the AI will self-check. Mine honestly admits: it won't. So we enforce rules in code, and prove it with case studies.
05SOP Framework2026-03-19

New Rule Activated Within Minutes of Being Written

What happened

The Challenge Protocol — requiring the AI to question its own output — was just added to the global config. Minutes later, the AI proactively flagged a limitation in a hook it had just built: the keyword detection only covered Chinese terms, missing English edge cases.

Rule triggered

Challenge Protocol — during execution, if you notice a potential issue, raise it instead of silently continuing.

Without the SOP

The AI would have reported 'all checks passed' and moved on, leaving a gap that would only surface when an English-speaking user triggered the system.

What actually happened

The gap was flagged immediately, triaged correctly (noted for future work, not an urgent fix), and logged. The rule proved it works — not in theory, but in the same session it was created.

New rules take effect within minutes. This isn't documentation — it's a living operating system.
06SOP Framework2026-03-19

Data Copies Produced Stale Output — Less Is More

What happened

During a session wrap-up, the AI reported 'next step: rewrite algorithm v2' — but v3 had already been completed in a previous session. The stale recommendation came from a data copy that wasn't updated.

Rule triggered

No rule covered this — the data copy itself was the architectural flaw.

Without the SOP

The next session might have attempted to rewrite already-completed code, wasting an entire session. Worse: it could have overwritten calibrated v3 code with a v2 redo.

What actually happened

All data copies were eliminated. The architecture was simplified to a single source of truth with pointers — no more redundant progress tracking. A new rule was added to the startup protocol: cross-validate progress data before acting on it.

Cutting a data copy is safer than maintaining it. Less is more.
07JumpOnion2026-03-19

Project Had 166 Tests But No Agent Constitution

What happened

JumpOnion had reached Phase 4 with 166 tests and 9/11 calibration videos passing. But it had never created a unified rules document for AI agent collaboration — critical lessons were scattered across memory files.

Rule triggered

New Project Coverage Detection — on session start, check if the project has a unified agent rules document.

Without the SOP

Different AI tools working on the same project would miss critical constraints — like 'location-based metrics cannot be used for diagnosis' or 'never use .remote(), always use .spawn().' Known mistakes would be repeated.

What actually happened

The missing document was detected on session start. A comprehensive agent constitution was generated covering: tech stack locks, iron rules, phase status, calibration system rules, handoff protocols, and escalation checklists.

The SOP doesn't wait for things to break — it detects what's missing before you start working.
08JumpOnion2026-03-20

3 Videos Downloaded 0 Frames — Broken Data References

What happened

During golden file generation, 3 out of 10 calibration videos downloaded 0 frames from the database. The other 7 worked fine. The temptation was to debug network or permissions.

Rule triggered

Data Provenance — after cleaning data, all upstream references must be synced. Blindspot Interception — stop chasing symptoms after 30 minutes.

Without the SOP

Hours spent retrying downloads, checking network, checking permissions. Eventually the 3 videos might have been dropped, shrinking the calibration set from 10 to 7.

What actually happened

Cross-referencing analytics results revealed the real issue: old task IDs in the registry pointing to deleted storage paths. New task IDs were mapped in 15 minutes, all 3 videos recovered.

The sneakiest bugs in your data pipeline aren't code errors — they're broken references. Data got deleted but the index didn't follow.
09JumpOnion2026-03-20

All Tests Green — By Dismantling the Gates

What happened

During E2E validation, a metric showed real data at ~1.0 but the threshold was set at max 0.15. The quick fix: just raise the threshold to 1.0 and everything goes green.

Rule triggered

Zero Misdiagnosis Principle — fix the definition first, then adjust the numbers. Verification exists to be meaningful, not to be green.

Without the SOP

Thresholds would have been inflated across the board — 0.15 to 1.0, 10 to 250, 0.03 to 1.0. All tests green, but the validation system would be effectively demolished. Garbage in, green out.

What actually happened

Root cause found: the metric name said 'fill ratio' but measured 'emptiness ratio' — semantics were inverted. After fixing the calculation, real data showed 0.000-0.001. Redundant checks were removed. Final result: 0 blocks, 10 warnings, 30 passes — every green backed by real meaning.

Threshold inflation is how quality systems die. Your CI is all green not because your product is good — but because you removed the gates.
10JumpOnion2026-03-20

AI Almost Told a World Champion His Jump Was Wrong

What happened

During calibration with 11 real figure skating videos, the system diagnosed a world champion's textbook triple Axel as 'high under-rotation risk.' The confidence was 0.61 — just barely above the 0.60 threshold.

Rule triggered

Zero Misdiagnosis Principle — better to say nothing than to say something wrong. Confidence gating — suppress diagnosis below reliability thresholds.

Without the SOP

The system would tell a coach: 'Your skater has serious under-rotation risk' — about a world champion's signature jump. One wrong diagnosis in the figure skating community, and word spreads to every club. Product trust: zero.

What actually happened

Calibration caught the false positive before launch. Root cause: 2D camera projection artifacts made the blade angle appear misaligned. Threshold raised from 0.60 to 0.70, diagnosis correctly suppressed. A field taxonomy was built to classify metric reliability.

Would your AI tell a world champion his jump is wrong? Ours wouldn't — because calibration caught the error before it reached a single user.
11JumpOnion2026-03-26

732 Tests Passed — The AI Still Said 'Not Done'

What happened

The diagnosis pipeline was connected to the production route. 732 tests passed, 0 failures. Everything looked ready to ship. But my AI partner remembered something.

Rule triggered

Challenge Protocol + Verification Before Completion — eval scripts passing does not equal production verification. The AI cited a previous incident where tests passed but production broke.

Without the SOP

732 green lights — the natural human response is 'done!' But if the production route wasn't actually working, all subsequent development would be built on a foundation that doesn't exist.

What actually happened

The AI blocked the 'done' declaration and required real end-to-end HTTP verification with actual tasks before marking the phase complete. It remembered the lesson from a previous incident — that's not a tool, that's a partner.

Every developer gets hypnotized by green tests. 732 passed, 0 failed — who wouldn't celebrate? But the AI partner remembered the last time green tests lied.
12JumpOnion2026-03-29

3 Tool Switches in One Day — Zero Context Lost

What happened

Three cross-tool session handoffs occurred in a single day: Claude Code to Cursor, Cursor back to Claude Code, then Claude Code to Cursor again. Each switch risked losing context about what was tested, what was blocked, and what came next.

Rule triggered

Cross-Tool Handoff Protocol — every session exit must produce a handoff note. Every session start must read the previous handoff note and cross-validate against the source of truth.

Without the SOP

Without handoff notes, each new tool session would only see git history and code — not which steps were tested, which blockers were known, or why the previous session stopped. Typical result: redoing completed work or continuing down an abandoned path.

What actually happened

All three handoffs worked as designed. The third handoff even triggered Stale Context detection — the 'next steps' in the dashboard had been outdated by the second session. The system flagged the conflict instead of acting on stale data.

I use Claude Code, Cursor, and Codex in parallel — handoff protocols keep context intact across all of them, losing zero work.
13JumpOnion2026-03-30

Same Bug Pattern Three Times — Then the Factory Was Destroyed

What happened

While fixing an export feature, a database query used SELECT * but the manual field mapping missed a column. After fixing it, another missing column was found — the exact same bug pattern.

Rule triggered

Bug Confession Protocol — when the AI fixes its own bug, it must self-report the pattern, not just the fix.

Without the SOP

The AI would fix the one missing field and move on. The next time a new column is added, the same bug would appear for a fourth time. Without the confession format, the pattern would never be identified.

What actually happened

The AI didn't just fix two fields — it identified that manual field mapping was the bug factory. Architectural fix: return the full row object instead of cherry-picking fields. The entire class of bugs was eliminated, not just the instance.

AI writing bugs isn't scary. AI writing bugs and not knowing — that's scary. Bug Confession turns 'fix and forget' into 'fix, reflect, and eliminate the pattern.'
14JumpOnion × IvyBloom2026-04-06

One Project's Fix Became Every Project's Standard — In 3 Minutes

What happened

IvyBloom added a Terms of Service consent checkbox to all Stripe Checkout flows. Later that day, a JumpOnion session started. Ciki asked: 'Should the payment link add this too?'

Rule triggered

Startup Protocol — read the HOME.md hub dashboard first. The IvyBloom row said 'Added ToS consent checkbox to all Stripe Checkout flows' — specific enough to act on immediately.

Without the SOP

The AI would treat it as a brand-new requirement: research Stripe docs, list pros and cons, ask Ciki to decide. No awareness that the exact same problem was already solved hours ago in another project.

What actually happened

The AI cited the IvyBloom precedent directly, recommended 'add it for consistency,' and provided the exact code. One word: 'add it.' One line changed, tests passed, deployed. Question to production in under 3 minutes, zero research, zero rework.

IvyBloom solved it → HOME.md recorded it → JumpOnion’s agent read it → Ciki said one word → deployed. Four projects, one knowledge network. That’s not documentation — that’s compound leverage.
15SOP Framework2026-04-06

Your AI Isn't Smart. Your System Is.

What happened

Across 3 projects and 6 months, AI agents appeared to develop autonomous judgment: catching bugs before the developer, refusing to ship when tests passed, switching strategies when stuck, knowing when to wrap up sessions.

Rule triggered

Three-Layer Architecture — Hooks (mechanical guarantee, ~100%), Skills (phase-triggered, ~95%), Rules (AI-executed, ~85%). Every 'autonomous' decision traced back to an explicit SOP rule.

Without the SOP

Layer 0: AI fixes the immediate symptom and moves on. Accepts green tests as proof of correctness. No self-check, no pattern recognition, no session discipline. This is what most people get from AI coding assistants.

What actually happened

16 instances analyzed across JumpOnion, IvyBloom, and CikiBrain. Zero involved AI 'learning' autonomy. All traced to explicit rules: Partner Challenge, Zero-Misdiagnosis, Bug Confession, Cross-Tool Handoff, Phase 6 Wrap-Up.

AI agents don't develop judgment. System designers encode judgment into executable rules. The gap between Layer 0 and Layer 3 is the product.
16IvyBloom2026-04-08

187K+ References Changed Across 7 Layers — Zero Residue

What happened

SmartLearning was rebranding to IvyBloom. Not a simple rename — it touched 7 system layers: GitHub repo, git remote, Vercel project, Claude Code/Codex/Cursor session configs, project memory, and CikiBrain hub documents.

Rule triggered

Cascade Cleanup Protocol — any rename operation triggers automatic Grep across all systems + fix all references. No waiting for the human to point out missed spots. Combined with Cross-Platform Handoff Protocol for multi-tool sync.

Without the SOP

Partial migration is worse than no migration. The executor changes the obvious spots (GitHub + Vercel) but misses hidden references in memory files, hooks, and launch.json. Days later, a session mysteriously fails because it read a ghost path.

What actually happened

All 7 layers migrated in a single session. 187K+ path replacements with zero residue. Bridge memories written for old-to-new path mapping. HOME.md dashboard updated immediately. B+ strategy applied: only user-visible layers changed, internal vars left alone to avoid unnecessary risk.

Rename = Cascade. Your AI should work like CASCADE DELETE — change one reference, automatically track and update every downstream dependency.
17JumpOnion x CikiBrain2026-04-08

The Framework Detected Its Own Failure — Then Fixed Itself

What happened

During V1 launch sprint, tool-switching frequency jumped from 1-2/day to 10+/day. The cross-tool handoff protocol — designed for low-frequency switches — collapsed. 10 handoff notes per day became noise instead of signal. AGENTS.md fell 11 days behind. Verification debt became invisible across tools.

Rule triggered

Self-Healing Framework — the SOP's built-in audit tool (autoresearch) detected three simultaneous failure signals, traced them to a single root cause (missing aggregation mechanism), and generated a structured fix.

Without the SOP

Protocol degrades silently under pressure. Documents go stale, handoffs get skipped, quality checks are bypassed. The developer only discovers the damage when a downstream session acts on outdated information — by then, hours of work are wasted.

What actually happened

Created CURRENT-STATE.md as a rolling single source of truth. Simplified wrap-up from 10 steps to 3 (CURRENT-STATE → HOME.md → /session end). Added startup cross-check gate (git log timestamp vs CURRENT-STATE). Handoff notes deprecated to audit logs.

Your workflow will break under pressure. The question is: who repairs it? Most workflows need the human to notice. This one diagnoses, proposes, and fixes itself.
18CikiBrain2026-04-10

AI Caught 3 Times in One Session — Same Root Cause

What happened

In a 10+ hour marketing strategy session, the AI was loaded with all project context at startup. Hours later, it asked 'Do you have a landing page?' — about a site already in production. Then suggested building a 'free trial' — for a product already live with paid subscribers. Then warned about '$500+ API cost risk' — when the actual 10-day bill was $1.73.

Rule triggered

Stale Context Prevention — session-start fact loading must be reinforced mid-session. Default AI training narratives ('solo founders need free trials', 'API costs spiral') override loaded facts after enough conversation turns.

Without the SOP

Following the AI's three suggestions would have consumed ~10 days building features that contradicted the product's existing strategy — a waitlist for a live product, free trial for a premium positioning, and emergency cost controls for a $0.17/day API bill.

What actually happened

All three drifts were caught by the human partner using the Challenge Protocol. Root cause identified: AI training defaults override session-specific facts in long conversations. Led to the Fact-Echo Gate — mandatory state confirmation before any strategic recommendation.

Your AI loaded the context. It confirmed it read the docs. And it still told you to build a waitlist for a product that's already live. Reading ≠ remembering.
19CikiBrain2026-04-11

A Gut Feeling Became a 4-File Fix in 90 Minutes

What happened

No error log. No stack trace. Just a feeling: 'Something is off — sessions are disconnected, wrap-ups are getting skipped, project truth is scattered everywhere.' The investigation had to start from a vague sense of system degradation.

Rule triggered

Root-Cause-First + Autoresearch — decompose a vague feeling into 5 falsifiable hypotheses, then test each against physical evidence (file timestamps, directory counts, config diffs).

Without the SOP

The session memory pipeline would have stayed silently broken for weeks. Every new session would start without context from the previous one. Eventually, trust in the entire SOP framework would collapse — not from a dramatic failure, but from slow, invisible decay.

What actually happened

Physical evidence scan revealed a cliff: session archives dropped from 63/month to 2/month on a precise date. Cross-referencing with the changelog found the root cause — a hook was orphaned during a skill consolidation 10 days prior. Four files fixed, all with retirement conditions.

A feeling became an evidence chain became a 4-file fix. That's what a working second brain looks like.
20JumpOnion2026-04-12

AI Called a Normal Landing a 'Fall' — 4-Layer Fix

What happened

A beginner skater's double Axel was diagnosed as a 'fall' with severity 4. In reality, the deep knee bend after landing was normal absorption biomechanics for a young skater — not a fall. The AI had never seen a beginner's landing before.

Rule triggered

Zero Misdiagnosis Principle + Root-Cause-First — don't patch the symptom (adjust one threshold). Run ablation experiments to find the real cause. Prompt-level fixes require full regression validation before deployment.

Without the SOP

A quick prompt tweak might fix the beginner's case but break fall detection for real falls. Without ablation experiments ($1 API cost), the team would have guessed at the fix and potentially shipped a regression.

What actually happened

Ablation experiments across 4 pipeline variants revealed the real issue: LLM non-determinism on borderline cases, amplified by apex frame visual contrast. Four-layer fix: body-contact gate in prompt, apex frame removal, raw frame persistence for debugging, and regression harness upgrade. 15/15 samples passing, real falls still detected.

The AI saw a deep knee bend and called it a fall. We didn't just fix the label — we rewired the entire diagnosis pipeline so it can never confuse absorption with failure again.
21CikiBrain2026-04-12

One Question Upgraded the Entire Wrap-Up Protocol

What happened

A simple question during a session: 'How does the project dashboard get updated when a session ends?' The answer: it doesn't. The wrap-up protocol had 3 steps, none of which included syncing the dashboard that the human actually checks every day.

Rule triggered

Challenge Protocol — the question wasn't a complaint, it was a system design challenge. Combined with Minimal Fix First — don't build a new tool, add one step to the existing protocol.

Without the SOP

The project dashboard would remain permanently stale. The human would see outdated status every morning, losing trust in the system's accuracy. Backlog items would be invisible — only tasks the AI chose to mention would be visible.

What actually happened

Wrap-up protocol upgraded from 3 steps to 4. A new backlog tracker was created with 25 items and 3 views. Every session now automatically syncs project status, backlog deltas, and weekly summaries to the dashboard.

Your SOP shouldn't be a document you write and forget. It should be a living system — when you find a blind spot, 30 minutes upgrades it, and every future session benefits automatically.
22CikiBrain2026-04-13

53K GitHub Stars — We Adopted Zero

What happened

A popular open-source AI memory tool exploded to 53K GitHub stars, promising '90% token savings.' The temptation: install it immediately. Instead, a structured 10-dimension competitive analysis was triggered — lifecycle hooks, storage, retrieval, compression, token efficiency, search, security, dependencies, information decay, and commercial viability.

Rule triggered

Search Before You Build + Buy > Build (with ROI) — evaluate before adopting. Don't let star counts substitute for architectural analysis. 53K stars doesn't mean 53K people got the right solution — it means 53K people had the same problem.

Without the SOP

Installing a tool with 53K stars feels safe. But it would have introduced 4 critical security vulnerabilities, added 5 external dependencies, and — according to user reports — could actually increase token consumption rather than reduce it.

What actually happened

Decision: zero adoption. The only concept worth borrowing — automatic session summaries — was implemented in 30 minutes with zero API cost and zero new dependencies. The existing architecture already outperformed the popular tool on 8 of 10 dimensions.

53K stars means 53K people screaming 'my AI memory is broken.' We never had that problem — because the architecture was right from day one.
23JumpOnion2026-04-14

The Anti-Pattern Lived in Memory — The AI Walked Right Into It Again

What happened

While shipping a new feature, the AI added three columns to a database table and updated the read code. A paying user immediately reported: feature still doesn't work. Root cause: a hardcoded SELECT column list silently dropped the new columns. The function returned the user's profile minus the very fields the new feature needed.

Rule triggered

Bug Confession + Memory Enforcement Gap — the same anti-pattern (manual column lists silently dropping new fields) had already been graduated to feedback memory weeks earlier. The memory existed. The AI loaded it at session start. It still didn't trigger when the matching code change happened.

Without the SOP

Without rapid user feedback, the bug would have stayed silent for days or weeks — every grant of the new permission was a no-op. The admin path bypassed the broken code, so internal testing wouldn't catch it. The first sign would have been a paying customer's escalation.

What actually happened

Forty minutes from user report to root cause to deploy. The deeper finding: graduated memory ≠ enforced rule. Memory files are loaded passively at session start and get squeezed out by long-session context drift. Anti-patterns that matter need to descend into hooks (event-triggered enforcement), not stop at memory (best-effort recall). A new hook candidate was queued: pre-edit memory enforcer that scans target files against feedback keywords.

If the rule isn't enforced in code, it's a wish — and wishes have a non-zero failure rate.
24JumpOnion2026-04-14

AI Labeled Production Code 'Dead' — 30 Hours of Silent Upload Failures

What happened

During a security audit, the AI deleted a route mount in the backend and labeled it confidently in the commit message: 'dead code, anonymous uploads.' It was not dead code. It was the active upload pipeline. 987 unit tests passed because no test asserted the route's existence. The deploy succeeded. Thirty hours later, a paying customer reported uploads silently failing.

Rule triggered

Verification Before Completion + Caller Audit — any deletion commit that uses confident labels ('dead code', 'legacy', 'unused') must include grep-evidence of caller checks. The AI's high-confidence language is a trust signal in human review — and a hidden attack surface when unverified.

Without the SOP

The 30-hour silent failure could have stretched to days. The route was being used by every video upload from the frontend. Every paying customer's upload silently 404'd. Worse: a follow-up audit added a security gate to the same already-unmounted route — putting a lock on a dismantled door.

What actually happened

Root cause located in five minutes once symptoms were investigated: a single curl against the upload health endpoint returned 404, while the main health endpoint returned 200. Three-layer fix shipped: re-mount the route, add a critical-routes registry test (16 routes that must exist), and add a 15-minute runtime smoke check via cron. New rule entered the SOP: AI commit messages with 'dead code' labels must include caller-audit evidence in the body.

AI confidence is not evidence. The more polished the label, the more it deserves to be questioned.
25JumpOnion2026-04-14

26 Characters Saved Two Hours of Rework

What happened

Three hours into a session, the AI proposed a design question based on a wrong premise: 'You don't sell the L3 plan yet, so admin = full access?' L3 had been live for weeks. Pricing, Stripe price ID, and quota were all already configured in code — one grep away.

Rule triggered

Founder Override + Fact-Echo Gate — when the AI hallucinates project state mid-session, the human founder's correction cost is a sentence. The AI's verification cost is a single grep. The asymmetry makes 'interrupt and correct' the highest-leverage defense layer.

Without the SOP

Best case: 30 minutes of rework after the wrong premise propagated into the design. Middle case: a downstream paywall bug locks paying users out of a feature they paid for. Worst case: a customer complaint reveals the broken paywall before internal testing does.

What actually happened

The founder interrupted with three actions in one sentence: question, fact assertion, demand to recheck. The AI acknowledged the drift, ran the grep, returned with the correct config, and wrote a feedback memory: any pricing-tier claim must be preceded by a grep against the config file. From wrong premise to corrected and persisted in under two minutes.

Long sessions drift. The cheapest defense is a partner willing to say 'wait, that's wrong' — and a system that turns the correction into a permanent rule.
26SOP Framework2026-04-15

89K Stars — Adopted 3 Ideas, Rejected 3

What happened

An open-source AI agent framework went viral with 89K GitHub stars, promising 'self-evolution' through a genetic algorithm that mutates skills and prompts automatically. The temptation: install it. The alternative: a structured 5-dimension comparison against the existing system across memory, skills, self-evolution, behavior enforcement, and cross-tool collaboration.

Rule triggered

Search Before You Build + Buy > Build (with ROI) — evaluate before adopting. Rather than 'use or ignore,' the question becomes: which specific ideas are worth zero-cost porting?

Without the SOP

Default move would be either full adoption (replacing a working system to chase a hot brand) or full dismissal (missing genuinely useful patterns out of pride). Both options would have been wrong. Genetic-algorithm self-mutation in a one-person company means unreviewed automatic PRs touching production rules — a near-certain way to break things silently.

What actually happened

Three ideas adopted at zero cost: automatic skill solidify suggestions at session end, JSONL-based failure mode analysis, and mid-session lightweight checkpoints. Three ideas explicitly rejected: full genetic mutation (no review capacity), public skill marketplace (skills are commercial assets), automatic user profiling (only one user). The existing system still wins on governance: hooks, retire-if rules, and case study pipelines have no equivalent in the popular framework.

Star counts measure how many people had the same problem. They don't measure whether the solution fits yours.
27JumpOnion2026-04-15

Admin Ghost Login Hid 8 Broken Endpoints for 48 Hours

What happened

A new role-based feature shipped with 8 coach-only write endpoints. Internal testing used 'admin ghost login' — viewing the app as another user via an admin shortcut. All 8 endpoints had a typo in their hardcoded SELECT column list referencing fields that didn't exist in the schema. Every real coach session would 500. Internal testing never went through the coach code path.

Rule triggered

Role-Based Testing Discipline — admin ghost login carries the user's ID through user-facing endpoints, not through role-restricted endpoints. It can't substitute for a real account in the new role. Role-restricted features need an independent account smoke test before launch.

Without the SOP

Discovery would have happened in front of three high-stakes coaches at a scheduled product demo 48 hours later. The first coach to click 'Save' would have hit a 500 error. The pitch would have collapsed. Trust with a key referral channel: gone.

What actually happened

A real coach trying the system in person triggered the bug 48 hours before the demo. Schema-aligned column names shipped, plus a humanized error banner that turns silent 500s into visible failures. New SOP: any role-restricted feature requires a real-account smoke test before launch, plus integration tests using the target role's real JWT — not an admin ghost.

The cheapest testing shortcut buys you the most expensive production failure.
28JumpOnion2026-04-15

A Button Name Killed an Algorithm's Precision

What happened

After a month of iterating the jump detection algorithm from v1 to v3 — tightening thresholds, adding physics constraints, hitting calibration accuracy — a real parent uploaded a video and got a 1.2-second air-time reading on a jump that physically maxes out at 0.85 seconds. The algorithm's calibration set was solid. The fix wasn't in the algorithm at all.

Rule triggered

UI as a Contract — the literal text on a button is a promise to the user about what input the system expects. The button said 'Set Takeoff' and 'Set Landing.' The algorithm's hidden contract required the clip to include on-ice frames before takeoff and after landing. When buttons say one thing and algorithms expect another, the user follows the buttons — and the algorithm receives the wrong input.

Without the SOP

The team would have continued tuning algorithm parameters — adjusting thresholds, adding smoothing — chasing precision in code that wasn't broken. Real users would keep uploading 'as instructed' and getting impossible results. The conclusion 'this product is unreliable' would have spread through the user community before anyone realized the buttons were the bug.

What actually happened

A single commit fixed the buttons (Set Takeoff → Set Clip Start, plus '~0.5s before takeoff' subtext), added a quality band to the trim UI, and updated the upload guide modal with a visual showing the required clip structure. Algorithm code untouched — its contract was never wrong. The new rule: every domain-term button must be tested against the algorithm's hidden assumptions, and the calibration set must include 'mistakes a real user would make' — not just clean inputs.

A precise algorithm fed by a misleading button produces precise nonsense.
29JumpOnion2026-04-15

23 Tests Passed — Every Paying User Was Locked Out

What happened

A paying customer reported that every drill click in their training plan returned 402 Payment Required. The drill access function read a JSON cache key at the root level — but the actual persisted shape, since a recent bilingual migration, nested the data two levels deeper under jump-type and locale keys. The 23 unit tests for this function used the old flat shape that production had stopped writing weeks ago.

Rule triggered

Root-Cause-First on Pipelines + Evidence-Backed Verification — three SQL probes against the actual production data confirmed the schema mismatch in 5 minutes, before a single line of code was touched. Tests that mock a different shape than production writes are theater, not verification.

Without the SOP

Default reaction: 'maybe their subscription is frozen?' or 'maybe the drill asset URL is broken?' — guessing categories that have nothing to do with the actual bug. With backward-compat shortcuts, the fix would have shipped without supporting legacy data. With test fixtures continuing to mock a non-existent shape, the next schema migration would silently break the same way.

What actually happened

Fix shipped in one commit with a path-walker that handles bilingual, legacy bare-key, legacy flat, and weekly_plan shapes. Five new unit tests added, each loading fixtures that match what production actually writes. Every affected user (every paying user, not just the one who reported) recovered drill access on their next request — no manual backfill needed. This was the third mock/prod fixture divergence in two weeks; a new rule candidate queued: integration tests for any function reading JSON columns must use real production shape samples.

Tests that mock the wrong shape don't fail. They lie.
30JumpOnion2026-04-17

The Safety Net That Erased the LLM's Best Output

What happened

A coach tagged a jump with three labels matching the LLM's diagnosis exactly. The LLM had produced a detailed coach summary with frame-level evidence ('blade contacts ice approximately 90° short at frame 16'), biomechanical reasoning, and three concrete athlete cues. The coach-tag overlay layer detected serious-negative tags and triggered a template rewrite — replacing the LLM's rich text with a generic boilerplate translation of the tag list.

Rule triggered

Conflict Detection Before Rewrite — a safety overlay designed to prevent LLM hallucination should not fire when the LLM and the coach are aligned. The overlay's job is to handle conflict, not to flatten everything into a template by default.

Without the SOP

Every coach review of an aligned diagnosis would silently downgrade the output to a template. Parents would see less detail the more careful the coaching review was. The product's competitive edge — frame-level citation and biomechanical reasoning from the LLM — would be erased exactly when senior coaches were involved. The signal 'more review = less content' would invert user expectation.

What actually happened

The overlay logic was rewritten to detect conflict explicitly: fall-unclaimed, takeoff-conflict, landing-conflict, scrubbed-by-guard, no-narrative. Only conflict triggers template rewrite. Aligned cases preserve the LLM's full output and append the coach's observations as additional context. New tests verify both protective rewrites (when needed) and aligned preservation (when not).

A safety net that erases your best work isn't a safety net. It's a ceiling.
31JumpOnion2026-04-17

AI Wiped 20 Customer Diagnoses While 'Fixing' a Cache

What happened

After locating the real production blocker — a model file gitignored in deployment, leaving rotation metrics dark for weeks — the AI proceeded to invalidate stale cached results. The mental model was 'reset cache so it can repopulate.' The SQL nullified three independent caches in one statement: the analytics cache (correctly stale), the diagnosis cache (independent of the bug), and the training plan cache (also independent). 20 rows updated. 7 paying customers' personalized diagnosis text and training plans wiped — not recoverable from the call log because LLM output text wasn't persisted there.

Rule triggered

Cache Invalidation Minimum-Blast + Destructive SQL Preview — never NULL a derived cache unless that specific cache's upstream input changed. Any UPDATE/DELETE touching multiple production customer rows must first run the equivalent SELECT, surface affected user emails, and get explicit human approval before executing.

Without the SOP

If the founder hadn't noticed within minutes, the bet would have been: 'the model regenerates similar enough text on retry that no one files a complaint.' That's a bet no one should take with paying customer data. Worse: without the existing audit log identifying exactly which 20 tasks were affected, recovery would have been impossible.

What actually happened

Same-day notification to 7 affected customers, regenerated diagnoses at no cost. Two new rules entered the SOP: (1) cache invalidation must follow upstream-dependency rules (pose change → only analytics cache; prompt change → only diagnosis cache); (2) any destructive SQL on multi-row production data requires SELECT-preview + explicit founder approval before execution. A new column was added to the LLM call log persisting output text, so the next 'cache wiped' incident is recoverable from the log alone.

Founder absorbs the risk, customer absorbs the value. Single-founder SaaS isn't about never breaking — it's about blast-radius-aware recovery.
32SOP Framework2026-04-21

Two AI Tools Followed the Rules — The System Drifted Anyway

What happened

Across one afternoon, eight AI sessions ran through a secondary tool, each correctly updating the project's rolling state file. None of them touched the central dashboard — by design. The primary tool's contract was: read the dashboard, sync to it, push to the project tracker. Result: dashboard 5 hours stale. Project tracker 2 days stale. 26 commits and two architecture milestones invisible to the human partner watching the dashboard.

Rule triggered

Cross-Tool Handoff Signal Gap — node-level compliance does not guarantee system-level compliance when no signal channel exists between nodes. Both tools were correct in isolation. The handoff between them was the failure mode.

Without the SOP

Default fix: 'remind the primary tool not to skip dashboard sync.' This addresses one symptom (sync skipped) without addressing the deeper one (the secondary tool runs in between and the primary tool has no way to know that work happened). The drift would silently return.

What actually happened

Two-sided fix: the secondary tool now writes a signal flag (home_sync_pending) into the rolling state file's frontmatter at session end. The primary tool now checks that flag at session start and back-fills the dashboard before doing any new work. The boundary stays the same — the secondary tool still doesn't sync the dashboard — but the missed work no longer disappears between handoffs.

Multi-tool AI workflows fail at the seams, not the nodes. The cure is signals, not stricter rules.
33SOP Framework2026-04-22

User Cried Contamination — Four Lines of Evidence Said Zero

What happened

Session opened with the founder's worry: 'I just did Project A's work in Project B by accident, in a different tool. Please scan to see if my SOP protected me.' Subjective alarm. The default response would be either reassurance or a panicked cleanup script — both wrong without evidence.

Rule triggered

Four-Line Evidence Cross-Check — user worry triggers a structured scan, not action. (1) Git logs and working trees in both repos; (2) cross-project keyword grep; (3) cross-tool session metadata; (4) session archive frontmatter. Conclusions only after four lines agree.

Without the SOP

Worst case: trust the worry, run a cleanup script that deletes legitimate project content matching common keywords. Cross-project pollution from the cure, not the disease. Better case but still bad: reassure without evidence — the worry returns, and trust in the SOP doesn't grow.

What actually happened

Five minutes of evidence: zero contamination. The other tool's working directory was correctly locked to Project A throughout. The user's earlier panicked rollback in Project B (driven by the same worry) turned out to be unnecessary — but caused no harm. New standard procedure: any 'I think I contaminated something' worry triggers the four-line scan before any cleanup. A reusable template was distilled from this session and added to the global rules.

AI collaboration's scariest moment isn't an actual mistake — it's thinking you made one. Evidence beats intuition. Both ways.
34JumpOnion2026-04-24

One SQL Query Saved Two Hours of Wrong-Path Debugging

What happened

A major rewrite shipped. 2,082 tests passing. The founder hit 'regenerate' on a real task and the result came back in 3 seconds — way too fast for the new LLM path that should take 50–60 seconds. Default reaction would be: 'maybe the cache hit, try again.' Or: 'maybe the new code is faster than expected, ship it.'

Rule triggered

Smoke Tests Trust DB Audit Fields, Not UI Feel — when rewritten code's production behavior doesn't match expectations, the first check is the audit field that records which code path actually ran, not the output content. UI 'feel' (fast, slow, looks right) is ambiguous; an audit row tagged with model name and prompt revision is unambiguous.

Without the SOP

An hour of prompt tuning, convinced the new model was producing weak output. Then deeper code reads, more test runs. The actual problem — a feature flag enabled in production routing the request to a parallel deterministic builder, completely bypassing the new code — would have stayed hidden until a teammate said 'wait, did anyone delete the v0 dispatcher branch?'

What actually happened

A single SQL query against the audit table revealed the truth: the task's training plan came from a deterministic rules builder that another tool's session had wired up days earlier. The rewrite spec had grepped to the wrong level — looking at the entry function, missing the dispatcher fork above it. One Edit removed the fork. Next smoke ran the new path correctly. The deprecated builder's 14,652 lines were deleted in the next sweep. Total debug time: 15 minutes.

When code behaves wrong, ask the database what it actually did. UI guesses. Audit fields know.
35JumpOnion2026-04-24

The LLM Succeeded — The Platform Killed the Connection First

What happened

First production smoke of a new LLM training plan endpoint. The browser request hung for a minute, then returned 504 Gateway Timeout. A manual page refresh: the new plan was already there, generated and saved. From the backend's view, everything was fine. From the user's view, it failed.

Rule triggered

Platform Timeout < App Timeout — every deploy platform has gateway timeouts shorter than what the app's own timeout config admits. App-level retries don't help when the gateway killed the client connection before the server finished responding. The audit log table caught what nothing else would have: the call succeeded in 59.3 seconds, against a hidden 60-second platform ceiling.

Without the SOP

Default debugging path: bump app timeout to 120s, add retry logic, blame the LLM provider. None of those would have helped — the platform ceiling was already lower than any of them. A real customer paying for plan generation would see 'failed' and never refresh, treating it as a billing error.

What actually happened

Two parallel small fixes: (1) backend slim prompt — removed display-only fields the LLM was reasoning about unnecessarily, dropping input from 10K to 4K tokens and runtime from 59s to 50–55s, with bonus 20% cost reduction; (2) frontend AbortController plus a polling fallback that quietly fetches the cached result if the POST timed out. The user sees the plan appear, regardless of which path won. New checklist item for any sync HTTP API: measure the actual gateway timeout, design for the case where the LLM exceeds it, and persist every LLM call into an audit table from day one.

Build for the gateway you have, not the gateway you wish for. And log every model call — the audit table will save you twice.
36SOP Framework2026-04-25

The Session Said It Was JumpOnion — The Work Belonged to the Hub

What happened

A session opened with the working directory set to a product project, but every file it produced lived in the global config tree — a new skill, a new hook, an updated settings file. Zero changes to the product's source. The work was real, useful, and shipped. But the wrap-up would have logged it under the wrong project. Three different hooks watched this session. None flagged the mismatch.

Rule triggered

Ownership Follows Output Path, Not Working Directory — when a session's outputs live in shared territory (skills, hooks, global config), the governance owner is the hub project, regardless of where the editor was pointed. The dashboard, milestone log, and project tracker all index by ownership, not by working directory.

Without the SOP

Without the rule, the session's archive would have shown 'the product project: 0 files changed' and the actual work — a new user-facing skill plus an enforcement hook — would never be recorded against the hub project. Two months later, an audit would find shipped infrastructure with no documented origin: an 'orphaned asset' problem that compounds with every misattributed session.

What actually happened

The session honestly self-flagged the violation in its own outcome notes — making the case study possible at all. New rule added: when output path and working directory diverge, ownership follows output path. Both projects' dashboards updated (the working-directory project records 'no changes,' the output-path project records the actual work). A hook upgrade was queued: detect output/cwd mismatch automatically and tag the session as cross-territory.

Honesty beats correctness. Sessions that admit they drifted off-course are repairable. Sessions that hide it become tomorrow's mystery commits.
37CikiBrain × IvyBloom2026-05-20

An .env Dump in an AI Chat — Spread by 3 Layers, Discovered 3 Months Late

What happened

During a cognee security spike, a grep over the Obsidian vault hit a 3-month-old IvyBloom session archive. Inside was a full .env dump — pasted by a collaborator during a debug session. The keys were not just sitting in the file. The vault syncs to Google Drive (with version history). The Google Drive feeds NotebookLM (cloud-indexed). The 'private debug log' had silently become a three-layer public attack surface, and redaction was already 3 months late.

Rule triggered

Session Archive Is Attack Surface — anything pasted into an AI conversation should be treated as if published. The cure is not 'remember to redact' (graduated memory ≠ enforced rule). The cure is hooks that scan every PostToolUse, a monthly SessionStart sweep across the entire vault, and a pre-upload gate that blocks any bundle from reaching NotebookLM until it's clean.

Without the SOP

Default reaction: rotate the API key, redact the file, move on. But credentials are independent planes — rotating the OpenAI key doesn't rotate the Supabase service role key or the Postgres password. The next day's cleanup caught a Postgres URL the first pass had missed. Without a pattern-source-of-truth and explicit credential-plane coverage, 'cleanup' becomes a moving target.

What actually happened

Four enforcement layers shipped within 24 hours: (1) PostToolUse hook scanning every session archive write; (2) SessionStart hook running monthly vault-wide scans; (3) pre-upload gate blocking dirty bundles from NotebookLM; (4) Security Boundary sections added to all four project AGENTS.md. The hook caught a real leak on first run — a Postgres URL that the human-driven cleanup had missed. The case study upgraded from 'candidate rule' to 'PROVEN ACTIVE' the same day.

Your AI debug log isn't private. It syncs to your drive, indexes into your AI search, and lives there forever. The cure isn't memory. It's hooks.

Install the methodology behind these results

Templates, SOPs, and enforcement hooks — from $39.