Ciki Zeng
Case Study · JumpOnion

A computer-vision + rule-engine system that diagnoses figure-skating jumps — designed, directed, and verified end-to-end.

A product I designed and operate, live with paying subscribers. A skater uploads a phone video; the system measures the jump with computer vision, a deterministic rule engine decides the verdict, a human expert reviews the high-stakes calls, and an LLM only translates the result into language a parent and a nine-year-old can act on. Here's how it's built — and the calls I made.

Designed · Directed AI build · Verified end-to-end
A full upload → analysis → result walkthrough

All footage shown is of my own skater, recorded with consent and with faces mosaicked; it is the same scrubbed footage already public on the live product. No other customer's footage, data, or private reliability thresholds appear anywhere on this page. Throughout, “the skater” and “a world champion” are anonymized, and drill names are generalized.

The problem

A skater and parent can't see what's actually wrong with a jump, so they train blind; a coach can see it, but their time is expensive and limited. The hard part isn't reading one number; it's deciding which calls a machine is allowed to make at all. To a general-purpose model, rotation defects look almost identical to clean landings, and a confident wrong diagnosis is worse than none. The system is built around that boundary: measure what can be measured exactly, decide the verdict with deterministic rules plus a human, and let the LLM translate, never diagnose.

Architecture

Measurement decides. The LLM only translates.

The whole system is organized around one boundary. Everything that decides the diagnosis is deterministic measurement, rules, and a human; the LLM lives downstream of a frozen verdict and is only ever allowed to put it into words.

  • Phone video. Input · a skater uploads from their phone
  • Computer-vision pose estimation
  • Biomechanical measurement. Air time, axis, rotation timing: exact, repeatable
  • Deterministic rule engine. The truth layer · a pure function, no LLM; same input, byte-identical verdict
  • Human expert review. High-stakes rotation calls are routed here by construction, not by disclaimer

Verdict frozen: everything below only translates it

  • LLM translator. Narrative only · turns the verdict into parent / coach / athlete language and never invents a number
  • Post-validation safety gate. Forbidden-label + semantic-leak audit → falls back to a deterministic phrase bank on any violation
  • Personalized drill plan. Output · the LLM picks only from a rule-gated drill pool and can't invent a drill

A result close-up — one real jump

The skater's own double loop, faces mosaicked. The same engine runs on this jump as on every customer upload.

Walkthrough · 8 steps

How the system works — and why it's built this way.

Step 1 · The system

End-to-end figure-skating jump diagnosis. I designed the whole flow: phone video → computer-vision measurement → a deterministic verdict → human review of the high-stakes calls → an LLM translation → a personalized drill plan.

Step 2 · Measurement first, LLM as translator

The core architectural call: a deterministic rule engine owns the diagnosis; the LLM only translates it into words. The engine is a pure function — same input, byte-identical output, no model in the loop. The rule that sits at the top of the system: “可以少,不要错” — it can be less, but it cannot be wrong.
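One way to read "pure function, byte-identical output" is sketched below. This is a minimal illustration under stated assumptions, not the product's engine: the `Measurement` fields, the thresholds, and the tag names are all hypothetical placeholders.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    # Hypothetical fields; the real measurement schema is not shown here.
    air_time_s: float
    axis_tilt_deg: float


def decide_verdict(m: Measurement) -> bytes:
    """Pure function: no clock, no randomness, no network, no model call."""
    tags = []
    if m.air_time_s < 0.40:      # illustrative threshold only
        tags.append("low_air_time")
    if m.axis_tilt_deg > 15.0:   # illustrative threshold only
        tags.append("tilted_axis")
    verdict = {
        "tags": sorted(tags),
        "air_time_s": m.air_time_s,
        "axis_tilt_deg": m.axis_tilt_deg,
    }
    # Canonical serialization (sorted keys, fixed separators) keeps the
    # output bytes stable for the same input.
    return json.dumps(verdict, sort_keys=True, separators=(",", ":")).encode("utf-8")
```

Returning serialized bytes rather than a dict makes the "byte-identical" property directly testable: two runs on the same measurement can be compared with `==` or by hash.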

Step 3 · Expert calls go to a human, by construction

High-stakes rotation calls (under-rotation, cheated takeoff) are forbidden from auto-display and routed to a human coach. It's built into the label pipeline — a pre-filter, a post-validation gate, and a semantic-leak audit — not left to a disclaimer.
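The pre-filter and post-validation gate described above might look roughly like the sketch below. The label names, forbidden phrases, and phrase bank are hypothetical stand-ins, and plain substring matching is only a crude proxy for a real semantic-leak audit.

```python
# Hypothetical label names and phrases; the product's real lists are private.
HUMAN_ONLY_LABELS = frozenset({"under_rotation", "cheated_takeoff"})
FORBIDDEN_PHRASES = ("under-rotat", "underrotat", "cheated", "pre-rotat")
FALLBACK_PHRASES = {"tilted_axis": "The jump's axis leaned in the air."}
GENERIC_FALLBACK = "A coach will review this jump."


def pre_filter(labels):
    """Split labels into those safe to auto-display and those routed to a coach."""
    display = [label for label in labels if label not in HUMAN_ONLY_LABELS]
    review = [label for label in labels if label in HUMAN_ONLY_LABELS]
    return display, review


def post_validate(narrative, display_labels):
    """Pass the LLM narrative through only if it leaks no forbidden wording;
    otherwise rebuild the text from the deterministic phrase bank."""
    lowered = narrative.lower()
    if any(phrase in lowered for phrase in FORBIDDEN_PHRASES):
        phrases = [FALLBACK_PHRASES[l] for l in display_labels if l in FALLBACK_PHRASES]
        return " ".join(phrases) or GENERIC_FALLBACK
    return narrative
```

The point of the two-stage shape is that the gate does not trust the pre-filter: even if a human-only label never reaches the LLM, the output is still checked for wording that would imply it.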

Step 4 · Confidence gating — better silent than wrong

Early on, a world champion's textbook triple was flagged as an under-rotation risk. I raised the confidence threshold and suppressed the call; today that class of call is categorically routed to a human instead. The product's value is restraint — when the system isn't sure, it stays quiet.
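The difference between "raise the threshold" and "categorically route to a human" can be sketched as an ordering of checks. The label set and the `MIN_CONFIDENCE` value below are assumptions for illustration; the real thresholds are private.

```python
from enum import Enum


class Action(Enum):
    SHOW = "show"
    SILENT = "silent"
    HUMAN_REVIEW = "human_review"


HUMAN_ONLY = frozenset({"under_rotation"})  # hypothetical label set
MIN_CONFIDENCE = 0.9                        # hypothetical threshold


def gate(label: str, confidence: float) -> Action:
    # The category check comes first: for this class of call, no confidence
    # score is high enough to auto-display.
    if label in HUMAN_ONLY:
        return Action.HUMAN_REVIEW
    # Everything else stays quiet unless the system is sure.
    if confidence < MIN_CONFIDENCE:
        return Action.SILENT
    return Action.SHOW
```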

Step 5 · Every claim traced to a measurement

Every number the skater sees must appear verbatim in the captured evidence: the LLM is banned from inventing a value, deriving one by arithmetic, or referencing a frame the system never captured. Evidence, not hallucination.
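A verbatim-number check of this kind could be approximated as below. It is a sketch under assumptions: the evidence is taken to be a dict of already-formatted strings (hypothetical keys), and only digit-based tokens are audited, so spelled-out numbers such as "two" would need a separate rule.

```python
import re

# Matches integers and simple decimals such as "37" or "0.42".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(narrative: str, evidence: dict) -> list:
    """Return numeric tokens in the narrative that do not appear verbatim
    among the evidence values (evidence values are formatted strings)."""
    allowed = set(evidence.values())
    return [tok for tok in NUMBER_RE.findall(narrative) if tok not in allowed]
```

A non-empty result would trigger the fallback path rather than an attempt to repair the text, which keeps the check simple: a derived value like a doubled air time fails because it is not in the evidence, whether or not the arithmetic is right.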

Step 6 · Diagnosis becomes an action plan

The verdict's tags intersect a 54-drill library to form a candidate pool; the LLM then picks 3–5 and orders them into a weekly plan, but it can only choose from the rule-selected pool. It can't invent a drill.
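The pool-then-pick flow can be sketched as two small functions. The `Drill` shape, the tag names, and the "at least one shared tag" matching rule are assumptions; the source only says the verdict's tags intersect the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drill:
    # Hypothetical shape for a library entry.
    drill_id: str
    tags: frozenset


def candidate_pool(verdict_tags, library):
    """Rule-gated pool: drills sharing at least one tag with the verdict."""
    wanted = set(verdict_tags)
    return [d for d in library if wanted & d.tags]


def validate_plan(picked_ids, pool, min_drills=3, max_drills=5):
    """Accept the LLM's ordered picks only if the count is in range,
    there are no repeats, and every ID comes from the pool."""
    pool_ids = {d.drill_id for d in pool}
    if not (min_drills <= len(picked_ids) <= max_drills):
        return False
    if len(set(picked_ids)) != len(picked_ids):
        return False
    return all(pid in pool_ids for pid in picked_ids)
```

Having the LLM return drill IDs rather than free text is what makes "can't invent a drill" checkable: an ID outside the pool simply fails validation.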

Step 7 · Calibrated, not guessed

When a beginner's deep-knee landing was mis-called a “fall,” I didn't nudge a threshold — I ran an ablation across pipeline variants to find the root cause, fixed it at the root (a body-contact gate + apex-frame removal), and verified 15/15 with real falls still detected.

Step 8 · Operator-grade reliability

A deploy once removed a route labeled “dead code”; uploads silently failed for ~30 hours while the whole test suite stayed green — because nothing asserted the route existed. The fix wasn't just re-mounting it: a registry of critical routes plus a 15-minute runtime smoke, so that class of silent failure is caught in minutes, not hours.
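A critical-route registry plus smoke check might be sketched as follows. The route paths are hypothetical, and `probe` is an assumed callable that takes `(method, path)` and returns an HTTP status code; scheduling the check on an interval is left to whatever cron or job runner is in use.

```python
# Hypothetical route list; the real registry is not shown in this case study.
CRITICAL_ROUTES = frozenset({("POST", "/api/upload"), ("GET", "/api/result")})


def missing_critical_routes(mounted):
    """Deploy-time check: critical (method, path) pairs absent from the
    app's mounted routes."""
    return sorted(CRITICAL_ROUTES - set(mounted))


def smoke_check(probe):
    """Runtime check: call probe(method, path) -> HTTP status for every
    critical route and report the ones answering 404 (not mounted)."""
    return sorted(route for route in CRITICAL_ROUTES if probe(*route) == 404)
```

The first function is the assertion the original test suite lacked; the second covers the case where the code is correct but the deployed service is not.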

Architecture evolution

It wasn't written in a straight line.

The architecture was forced into shape by a string of setbacks. None of the turning points was “a cooler model” — every one was a judgment about trust boundaries. Each pain became a permanent guardrail.

01

Algorithm-first hit a physics ceiling

The pain: I first let the algorithm output the diagnosis directly, but rotation completeness is motion information, not single-frame appearance. No amount of tuning fixes that.

The call: I demoted raw measurement to an evidence layer and moved the product's value to problem identification and a training prescription.

02

“All tests green — but production was analyzing nothing”

The pain: The eval path and the production path quietly diverged; users were handed a “result” computed on empty input.

The call: I built production-parity verification and made it a rule: code exists ≠ the path is verified.

03

A world champion’s textbook jump was flagged as a severe defect

The pain: One false positive defined the entire product's safety philosophy.

The call: The first rule became “it can be less, but it cannot be wrong.” High-sensitivity rotation calls stay silent and route to a human coach rather than guess.

04

The biggest refactor: LLM-first → rules-first, human-reviewed, LLM-last

The pain: A general-purpose LLM hallucinates “plausible evidence” exactly where trust matters most.

The call: A deterministic rule engine plus a human expert own the verdict; the LLM only translates, never invents a number, and every output is validated against the frozen verdict.

05

A confident “dead-code” deletion silently broke uploads for ~30 hours

The pain: Every test passed, because none asserted the route existed.

The call: The fix wasn't just re-mounting it: a registry of critical routes plus a 15-minute runtime smoke, so that class of silent failure is caught in minutes.

06

A destructive “reset cache” operation damaged already-generated content

The pain: Multi-level caches that should have failed independently were wiped together by a single query.

The call: Two guardrails. Destructive operations must SELECT-and-confirm first, and cache invalidation gets the smallest possible blast radius. I also chose human review over auto-migration, matching risk to the size of the business.
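A SELECT-and-confirm guard could be sketched with the standard-library `sqlite3` module as below. The table layout and the `confirm` callback are assumptions for illustration; the product's actual storage is not described in the source.

```python
import sqlite3


def guarded_delete(conn, table, where, params, confirm):
    """SELECT the rows a DELETE would hit, ask for confirmation, then delete.

    `table` and `where` must be trusted constants, never user input;
    values are passed separately through `params`.
    """
    count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where}", params
    ).fetchone()[0]
    # Nothing to do, or the operator declined after seeing the blast radius.
    if count == 0 or not confirm(count):
        return 0
    with conn:  # commits on success, rolls back on exception
        conn.execute(f"DELETE FROM {table} WHERE {where}", params)
    return count
```

Keying the `where` clause on both cache level and entry key, rather than wiping a level wholesale, is one way to keep the blast radius to a single entry.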

Architecture & judgment

The four calls that define the system.

Measurement-first boundary

A deterministic engine owns the verdict; the LLM only translates it into words. It can't introduce a number the measurement didn't produce — a post-validation gate enforces that, not a prompt.

Human-in-the-loop, by construction

Rotation-defect calls are forbidden from auto-display and routed to a coach — enforced by a pre-filter, a post-validation gate, and a semantic-leak audit, not by a disclaimer.

Calibrated, not patched

False positives and over-calls were caught against real video and fixed at the root cause — an ablation to find why, not a threshold nudge to hide it — with real falls still detected.

Operator-grade reliability

A ~30-hour silent outage became a permanent guard: a registry of critical routes plus a runtime smoke, so that whole class of failure surfaces in minutes instead of hours.

What I owned
  • 01 · Designed the diagnosis architecture: the measurement/LLM trust boundary, the rules-first verdict flow, and the human-review routing for high-stakes calls.
  • 02 · Directed the AI build: wrote the specs, set what the LLM may and may not do, reviewed the output, and decided what shipped.
  • 03 · Verified it end-to-end: caught the false positive on a world champion's jump, the beginner-fall over-call, and the dead-code outage before they defined the product.

This is how I work: design the system, draw the trust boundary, own the verification.

If you're evaluating someone to design or operate AI-augmented systems, this is a representative piece of how I think. There's more in the collection.