Ciki Zeng
Case Studies

Battle-Tested

Every methodology claim is backed by a real incident from production development. These aren't hypotheticals — they're documented moments where the system caught what I would have missed.

01 · JumpOnion · 2026-03-18

3 Failed Attempts, Then the System Self-Corrected

What happened

During algorithm development, a tracking approach failed completely on real figure skating videos — detecting 6.47s of air time on a jump that lasted 0.7s. The same approach was retried with small tweaks.

Rule triggered

Blindspot Interception — after 2 identical failures, force a root cause analysis. After 30 minutes on a dead-end approach, force a strategy switch.
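The rule above can be sketched as a small guard object. This is an illustrative sketch, not the product's actual hook; the class and signal names are invented for the example.

```python
import time

class BlindspotGuard:
    """Sketch of the Blindspot Interception rule (names illustrative):
    after 2 identical failures, demand a root-cause analysis; after 30
    minutes on one approach, demand a strategy switch."""

    def __init__(self, max_repeats=2, max_minutes=30):
        self.max_repeats = max_repeats
        self.max_seconds = max_minutes * 60
        self.failures = {}                 # failure signature -> count
        self.started = time.monotonic()    # when this approach began

    def record_failure(self, signature):
        self.failures[signature] = self.failures.get(signature, 0) + 1
        if self.failures[signature] >= self.max_repeats:
            return "ROOT_CAUSE_ANALYSIS_REQUIRED"
        if time.monotonic() - self.started > self.max_seconds:
            return "STRATEGY_SWITCH_REQUIRED"
        return None                        # keep going
```

The point of the design: the escalation is mechanical, so a session can't quietly retry the same broken approach a third time.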

Without the SOP

My AI partner would have kept tweaking the same broken approach for the entire session. Worse, the flawed algorithm might have shipped to production.

What actually happened

On the third failure, the system automatically switched strategy — from centroid tracking to bottom-point tracking with physics constraints. It found the right approach before I even suggested it.

Can your AI workflow rescue itself when it's failing? Mine can.

02 · SunnyInvoices · 2026-03-18

Ran a Pipeline in the Wrong Directory — Zero Damage

What happened

A data pipeline was accidentally executed in the wrong project directory. Cross-project file contamination was a real risk.

Rule triggered

Environment Pollution Guard — any cross-project reference triggers an immediate stop-and-confirm before proceeding.
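A minimal sketch of the boundary check behind this rule (illustrative only, not the product's actual guard): any write whose resolved path falls outside the active project root counts as a cross-project reference.

```python
from pathlib import Path

def check_write_allowed(target: Path, project_root: Path) -> bool:
    """Sketch of an Environment Pollution Guard: resolve the target
    path and verify it lives under the active project root. A False
    result means stop-and-confirm before proceeding."""
    try:
        target.resolve().relative_to(project_root.resolve())
        return True       # inside the project boundary: proceed
    except ValueError:
        return False      # outside the boundary: stop and confirm
```

Resolving both paths first defeats `..` tricks and symlinked shortcuts that would otherwise slip past a plain prefix check.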

Without the SOP

Files, data, or dependencies from one project could have contaminated another, requiring hours of detective work to untangle.

What actually happened

The project boundary mechanism caught the mistake instantly. Despite running in the wrong directory, zero cross-project pollution occurred.

Your SOP is body armor, not decoration. Even when you make mistakes, the system protects you.

03 · SOP Framework · 2026-03-19

The SOP Couldn't Protect Its Own Birth

What happened

A new rule for automatic case study collection was added to the project-level memory — visible only in the current project. But case studies are mainly generated during development of other products.

Rule triggered

No existing rule covered this — it was a blindspot in the SOP creation process itself.

Without the SOP

The new rule would never fire during product development sessions. Case studies would silently stop being collected, and the entire content pipeline would dry up.

What actually happened

The issue was caught during review. The rule was moved to the global configuration, and a new blindspot interception rule was added: always verify rule scope when writing new rules.

The SOP isn't a finished product — it exposes its own blindspots, then you fix them, and the system gets stronger.

04 · SOP Framework · 2026-03-19

The Rule Existed — The AI Just Didn't Follow It

What happened

A new protocol was written to project-level memory instead of the global config. The rule explicitly stating 'check if cross-project rules are globally visible' was already in place — but the AI skipped the self-check.

Rule triggered

Blindspot Interception — writing new rules must include a scope self-check.

Without the SOP

The new protocol would only be visible in one project. During development of other products, the AI wouldn't know the rule exists.

What actually happened

The issue was caught manually and fixed. More importantly, it proved a critical insight: prompt-level instructions aren't reliable enough. This is exactly why enforcement hooks exist — they don't trust the AI to 'remember,' they force compliance in code.
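An enforcement hook for exactly this failure could look like the sketch below. It is a hypothetical illustration (the marker strings and path convention are invented), but it shows the principle: the scope self-check runs in code before a rule file is written, rather than relying on the AI to remember it.

```python
def scope_check_hook(rule_text: str, target_path: str):
    """Sketch of a pre-write enforcement hook (names hypothetical):
    block cross-project rules from being written to project-level
    memory, where other sessions would never see them."""
    cross_project_markers = ("all projects", "cross-project", "every session")
    # Treat anything under a project's memory dir as project-scoped.
    is_project_level = "/memory/" in target_path and "global" not in target_path
    if is_project_level and any(m in rule_text.lower() for m in cross_project_markers):
        return False, "BLOCKED: cross-project rule must go in the global config"
    return True, "ok"
```

Because the check runs on every write, a skipped self-check becomes impossible instead of merely discouraged.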

Most AI workflows tell you the AI will self-check. Mine honestly admits: it won't. So we enforce rules in code, and prove it with case studies.

05 · SOP Framework · 2026-03-19

New Rule Activated Within Minutes of Being Written

What happened

The Challenge Protocol — requiring the AI to question its own output — had just been added to the global config. Minutes later, the AI proactively flagged a limitation in a hook it had just built: the keyword detection only covered Chinese terms, missing English edge cases.

Rule triggered

Challenge Protocol — during execution, if you notice a potential issue, raise it instead of silently continuing.

Without the SOP

The AI would have reported 'all checks passed' and moved on, leaving a gap that would only surface when an English-speaking user triggered the system.

What actually happened

The gap was flagged immediately, triaged correctly (noted for future work, not an urgent fix), and logged. The rule proved it works — not in theory, but in the same session it was created.

New rules take effect within minutes. This isn't documentation — it's a living operating system.

06 · SOP Framework · 2026-03-19

Data Copies Produced Stale Output — Less Is More

What happened

During a session wrap-up, the AI reported 'next step: rewrite algorithm v2' — but v3 had already been completed in a previous session. The stale recommendation came from a data copy that wasn't updated.

Rule triggered

No rule covered this — the data copy itself was the architectural flaw.

Without the SOP

The next session might have attempted to rewrite already-completed code, wasting an entire session. Worse: it could have overwritten calibrated v3 code with a v2 redo.

What actually happened

All data copies were eliminated. The architecture was simplified to a single source of truth with pointers — no more redundant progress tracking. A new rule was added to the startup protocol: cross-validate progress data before acting on it.

Cutting a data copy is safer than maintaining it. Less is more.

07 · JumpOnion · 2026-03-19

Project Had 166 Tests But No Agent Constitution

What happened

JumpOnion had reached Phase 4 with 166 tests and 9/11 calibration videos passing. But it had never created a unified rules document for AI agent collaboration — critical lessons were scattered across memory files.

Rule triggered

New Project Coverage Detection — on session start, check if the project has a unified agent rules document.

Without the SOP

Different AI tools working on the same project would miss critical constraints — like 'location-based metrics cannot be used for diagnosis' or 'never use .remote(), always use .spawn().' Known mistakes would be repeated.

What actually happened

The missing document was detected on session start. A comprehensive agent constitution was generated covering: tech stack locks, iron rules, phase status, calibration system rules, handoff protocols, and escalation checklists.

The SOP doesn't wait for things to break — it detects what's missing before you start working.

08 · JumpOnion · 2026-03-20

3 Videos Downloaded 0 Frames — Broken Data References

What happened

During golden file generation, 3 out of 10 calibration videos downloaded 0 frames from the database. The other 7 worked fine. The temptation was to debug network or permissions.

Rule triggered

Data Provenance — after cleaning data, all upstream references must be synced. Blindspot Interception — stop chasing symptoms after 30 minutes.

Without the SOP

Hours spent retrying downloads, checking network, checking permissions. Eventually the 3 videos might have been dropped, shrinking the calibration set from 10 to 7.

What actually happened

Cross-referencing analytics results revealed the real issue: old task IDs in the registry pointing to deleted storage paths. New task IDs were mapped in 15 minutes, all 3 videos recovered.

The sneakiest bugs in your data pipeline aren't code errors — they're broken references. Data got deleted but the index didn't follow.

09 · JumpOnion · 2026-03-20

All Tests Green — By Dismantling the Gates

What happened

During E2E validation, a metric showed real data at ~1.0 but the threshold was set at max 0.15. The quick fix: just raise the threshold to 1.0 and everything goes green.

Rule triggered

Zero Misdiagnosis Principle — fix the definition first, then adjust the numbers. Verification exists to be meaningful, not to be green.

Without the SOP

Thresholds would have been inflated across the board — 0.15 to 1.0, 10 to 250, 0.03 to 1.0. All tests green, but the validation system would be effectively demolished. Garbage in, green out.

What actually happened

Root cause found: the metric name said 'fill ratio' but measured 'emptiness ratio' — semantics were inverted. After fixing the calculation, real data showed 0.000-0.001. Redundant checks were removed. Final result: 0 blocks, 10 warnings, 30 passes — every green backed by real meaning.

Threshold inflation is how quality systems die. Your CI is all green not because your product is good — but because you removed the gates.

10 · JumpOnion · 2026-03-20

AI Almost Told a World Champion His Jump Was Wrong

What happened

During calibration with 11 real figure skating videos, the system diagnosed a world champion's textbook triple Axel as 'high under-rotation risk.' The confidence was 0.61 — just barely above the 0.60 threshold.

Rule triggered

Zero Misdiagnosis Principle — better to say nothing than to say something wrong. Confidence gating — suppress diagnosis below reliability thresholds.

Without the SOP

The system would tell a coach: 'Your skater has serious under-rotation risk' — about a world champion's signature jump. One wrong diagnosis in the figure skating community, and word spreads to every club. Product trust: zero.

What actually happened

Calibration caught the false positive before launch. Root cause: 2D camera projection artifacts made the blade angle appear misaligned. Threshold raised from 0.60 to 0.70, diagnosis correctly suppressed. A field taxonomy was built to classify metric reliability.
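The gating logic itself is tiny; what matters is where it sits. A sketch, using the thresholds from this incident for illustration (the function name is invented):

```python
def gate_diagnosis(label: str, confidence: float, threshold: float = 0.70):
    """Sketch of confidence gating: below the reliability threshold,
    return nothing rather than risk a wrong diagnosis. The 0.70 value
    is the post-calibration threshold from the incident above."""
    if confidence < threshold:
        return None          # suppress: better silent than wrong
    return label

# The false positive scored 0.61 — at the old 0.60 threshold it shipped;
# at 0.70 it is correctly suppressed.
```

Under the old threshold (0.60) the 0.61-confidence diagnosis would have passed the gate, which is why the fix was to raise the threshold rather than patch the single case.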

Would your AI tell a world champion his jump is wrong? Ours wouldn't — because calibration caught the error before it reached a single user.

11 · JumpOnion · 2026-03-26

732 Tests Passed — The AI Still Said 'Not Done'

What happened

The diagnosis pipeline was connected to the production route. 732 tests passed, 0 failures. Everything looked ready to ship. But my AI partner remembered something.

Rule triggered

Challenge Protocol + Verification Before Completion — eval scripts passing does not equal production verification. The AI cited a previous incident where tests passed but production broke.

Without the SOP

732 green lights — the natural human response is 'done!' But if the production route wasn't actually working, all subsequent development would be built on a foundation that doesn't exist.

What actually happened

The AI blocked the 'done' declaration and required real end-to-end HTTP verification with actual tasks before marking the phase complete. It remembered the lesson from a previous incident — that's not a tool, that's a partner.

Every developer gets hypnotized by green tests. 732 passed, 0 failed — who wouldn't celebrate? But the AI partner remembered the last time green tests lied.

12 · JumpOnion · 2026-03-29

3 Tool Switches in One Day — Zero Context Lost

What happened

Three cross-tool session handoffs occurred in a single day: Claude Code to Cursor, Cursor back to Claude Code, then Claude Code to Cursor again. Each switch risked losing context about what was tested, what was blocked, and what came next.

Rule triggered

Cross-Tool Handoff Protocol — every session exit must produce a handoff note. Every session start must read the previous handoff note and cross-validate against the source of truth.
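A handoff note can be as simple as a small structured file written on exit and read on the next start. This is a sketch with hypothetical field names, not the protocol's actual schema:

```python
import json
import datetime

def write_handoff_note(path, tested, blockers, next_steps):
    """Sketch of a cross-tool handoff note (field names illustrative):
    written at session exit; the next session reads it and
    cross-validates it against the source of truth before acting."""
    note = {
        "written_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tested": tested,          # what was actually verified this session
        "blockers": blockers,      # known dead-ends, so they aren't retried
        "next_steps": next_steps,  # the plan handed to the next tool
    }
    with open(path, "w") as f:
        json.dump(note, f, indent=2)
    return note
```

The timestamp matters: comparing it against the source of truth is what makes Stale Context detection (case 12's third handoff) possible.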

Without the SOP

Without handoff notes, each new tool session would only see git history and code — not which steps were tested, which blockers were known, or why the previous session stopped. Typical result: redoing completed work or continuing down an abandoned path.

What actually happened

All three handoffs worked as designed. The third handoff even triggered Stale Context detection — the 'next steps' in the dashboard had been outdated by the second session. The system flagged the conflict instead of acting on stale data.

I use Claude Code, Cursor, and Codex in parallel — handoff protocols keep context intact across all of them, losing zero work.

13 · JumpOnion · 2026-03-30

Same Bug Pattern Three Times — Then the Factory Was Destroyed

What happened

While fixing an export feature, a database query used SELECT * but the manual field mapping missed a column. After fixing it, another missing column was found — the exact same bug pattern.

Rule triggered

Bug Confession Protocol — when the AI fixes its own bug, it must self-report the pattern, not just the fix.

Without the SOP

The AI would fix the one missing field and move on. The next time a new column is added, the same bug would appear for a fourth time. Without the confession format, the pattern would never be identified.

What actually happened

The AI didn't just fix two fields — it identified that manual field mapping was the bug factory. Architectural fix: return the full row object instead of cherry-picking fields. The entire class of bugs was eliminated, not just the instance.
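The architectural fix generalizes to any row-mapping code. A minimal sketch (the table and column names are hypothetical, and the real pipeline is unlikely to be raw `sqlite3`):

```python
import sqlite3

def fetch_export_rows(conn: sqlite3.Connection):
    """Sketch of the 'return the full row' fix: key rows by column
    name instead of manually mapping fields, so a newly added column
    can never be silently dropped."""
    conn.row_factory = sqlite3.Row     # rows become name-addressable
    rows = conn.execute("SELECT * FROM exports").fetchall()
    return [dict(r) for r in rows]     # every column survives, by name
```

With manual mapping, each new column is a chance to reintroduce the bug; with the full-row shape, the mapping step that manufactured the bugs no longer exists.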

AI writing bugs isn't scary. AI writing bugs and not knowing — that's scary. Bug Confession turns 'fix and forget' into 'fix, reflect, and eliminate the pattern.'

Install the methodology behind these results

Templates, SOPs, and enforcement hooks — from $39.