Case Study · JumpOnion

A computer-vision + rule-engine system that diagnoses figure-skating jumps — designed, directed, and verified end-to-end.

A product I designed and operate, live with paying subscribers. A skater uploads a phone video; the system measures the jump with computer vision, a bounded verdict layer decides what can be shown, a human expert reviews the high-stakes calls, and an LLM only translates the result into language a parent and a nine-year-old can act on. Here's how it's built — and the calls I made.

Designed · Directed AI build · Verified end-to-end

A full upload → analysis → result walkthrough

Everything shown is my own skater, recorded with consent and with faces mosaicked — the same scrubbed footage already public on the live product. No other customer's footage, data, or private reliability thresholds appear anywhere on this page. Throughout, “the skater” and “a world champion” are anonymized, and drill names are generalized.

The problem

A skater and parent can't see what's actually wrong with a jump, so they train blind; a coach can see it, but their time is expensive and limited. The hard part isn't reading one number — it's deciding which calls a machine is allowed to make at all. Rotation defects look almost identical to clean landings to a general-purpose model, and a confident wrong diagnosis is worse than none. The system is built around that boundary: measure what can be measured, decide the high-stakes calls with rules plus a human, and let the LLM translate — never diagnose.

Architecture

Measurement decides. The LLM only translates.

The whole system is organized around one boundary. Everything that decides the diagnosis is measurement, rules, and a human; the LLM lives downstream of a frozen verdict and is only ever allowed to put it into words.

Phone video

Input · a skater uploads from their phone

Computer-vision pose estimation

Biomechanical measurement

Sport-specific motion measurements — captured before narrative

Verdict boundary

The trust layer · measurement and review decide before the LLM sees the result

Human expert review

High-stakes rotation calls are routed here — by construction, not by disclaimer

Verdict frozen — everything below only translates it

LLM translator

Narrative only · turns the verdict into parent / coach / athlete language — never invents a number

Safety review gate

Product guardrail → safer fallback on any violation

Personalized drill plan

Output · the LLM picks only from a rule-gated drill pool — it can't invent a drill

A result close-up — one real jump

The skater's own double loop, faces mosaicked. The same engine runs on this jump as on every customer upload.

Walkthrough · 8 steps

How the system works — and why it's built this way.

Step 1 · The system

End-to-end figure-skating jump diagnosis. I designed the whole flow: phone video → computer-vision measurement → a bounded verdict → human review of the high-stakes calls → an LLM translation → a personalized drill plan.

Step 2 · Measurement first, LLM as translator

The core architectural call: measurement and rules own the diagnosis; the LLM only translates it into words. The verdict is constrained before narrative starts, so the model cannot talk its way into a new diagnosis. The rule that sits at the top of the system: “it can be less, but it cannot be wrong.”
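One way to picture a verdict that is fixed before narrative starts is an immutable record handed to the translation step. This is a minimal sketch, not the product's code: the `Verdict` fields, the `freeze_verdict` and `translate` helpers, and the sample finding and measurement are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType


@dataclass(frozen=True)
class Verdict:
    """A diagnosis fixed by measurement and rules before any narrative is written."""
    jump: str
    finding: str
    measurements: MappingProxyType  # read-only view of the measured values


def freeze_verdict(jump: str, finding: str, measurements: dict) -> Verdict:
    # Copy the dict so later changes to the caller's data cannot leak in.
    return Verdict(jump, finding, MappingProxyType(dict(measurements)))


def translate(verdict: Verdict) -> str:
    # Stand-in for the LLM step: it reads the verdict and returns text only.
    return f"{verdict.jump}: {verdict.finding}"


# Illustrative values only, not real measurements.
verdict = freeze_verdict("double loop", "landing knee bend is shallow", {"knee_angle_deg": 152})
try:
    verdict.finding = "under-rotation"  # any attempt to rewrite the diagnosis fails
except FrozenInstanceError:
    print("verdict is frozen")
print(translate(verdict))
```

The point of the design is that the translation step has nothing to write to: both the record and its measurement map reject assignment.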

Step 3 · Expert calls go to a human, by construction

High-stakes rotation calls (under-rotation, cheated takeoff) are forbidden from auto-display and routed to a human coach. It's built into the product boundary with a safer fallback — not left to a disclaimer.
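Routing "by construction" can be sketched as a lookup that runs before anything is displayed. The call-type labels and the `Route` enum below are assumptions for illustration; only the two call names come from the text.

```python
from enum import Enum


class Route(Enum):
    AUTO_DISPLAY = "auto_display"
    HUMAN_REVIEW = "human_review"


# Hypothetical labels for the two high-stakes calls named in the text.
HUMAN_ONLY_CALLS = frozenset({"under_rotation", "cheated_takeoff"})


def route_call(call_type: str) -> Route:
    """High-stakes rotation calls never reach auto-display, whatever their confidence."""
    if call_type in HUMAN_ONLY_CALLS:
        return Route.HUMAN_REVIEW
    return Route.AUTO_DISPLAY


def displayable(calls: list[str]) -> list[str]:
    # Only calls routed to AUTO_DISPLAY are shown without a coach's sign-off.
    return [c for c in calls if route_call(c) is Route.AUTO_DISPLAY]
```

Because the decision depends only on the call type, no confidence score or prompt wording can move a rotation call onto the auto-display path.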

Step 4 · Confidence gating — better silent than wrong

Early on, a world champion's textbook triple was flagged as an under-rotation risk. I raised the confidence threshold and suppressed the call; today that class of call is categorically routed to a human instead. The product's value is restraint — when the system isn't sure, it stays quiet.
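The earlier threshold-based stage of this gating might look like the sketch below. The threshold value and the `gate` function are assumptions; the product's real thresholds are private and not shown here.

```python
from typing import Optional

# Assumed value for illustration only.
MIN_CONFIDENCE = 0.9


def gate(call_type: str, confidence: float, min_confidence: float = MIN_CONFIDENCE) -> Optional[str]:
    """Return the call if it clears the threshold, otherwise None (stay silent)."""
    if confidence < min_confidence:
        return None
    return call_type
```

Returning `None` instead of a softened message keeps "not sure" from ever turning into a weaker claim shown to the skater.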

Step 5 · Every claim traced to a measurement

Every number the skater sees must appear verbatim in the captured evidence. The LLM is banned from inventing a value, deriving one by arithmetic, or referencing a frame the system never captured. Evidence, not hallucination.
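A verbatim check of this kind can be approximated with a string-level comparison of numeric tokens. This is a sketch under assumptions: the regex, the `unsupported_numbers` helper, and the evidence format are hypothetical, and the sample values are invented.

```python
import re

# Matches integers and simple decimals such as "152" or "0.42".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(narrative: str, evidence: str) -> list[str]:
    """Return numbers in the narrative that do not appear verbatim in the evidence."""
    allowed = set(NUMBER.findall(evidence))
    return [n for n in NUMBER.findall(narrative) if n not in allowed]


# Illustrative evidence string, not real measurements.
evidence = "knee_angle_deg=152 air_time_s=0.42"
print(unsupported_numbers("Your knee angle was 152 degrees.", evidence))
print(unsupported_numbers("Two jumps would take 0.84 seconds.", evidence))
```

Comparing strings rather than parsed floats is deliberate: a value the model rounded or computed, such as 0.84 from 0.42, fails the check even though it is arithmetically related to the evidence.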

Step 6 · Diagnosis becomes an action plan

The diagnosis connects to a curated drill library; the LLM can sequence and explain, but it can only choose from approved training options. It can't invent a drill.
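The "choose only from approved options" constraint can be sketched as an allow-list applied to whatever the model proposes. The drill names, finding labels, and helper functions below are hypothetical.

```python
# Hypothetical drill names and finding labels, for illustration.
DRILL_POOL = {
    "shallow_knee_bend": ["off-ice landing holds", "backward glide knee bends"],
    "early_shoulder_opening": ["check-position walkthroughs"],
}


def approved_drills(finding: str) -> list[str]:
    # Rule-gated pool: the finding decides which drills are even eligible.
    return list(DRILL_POOL.get(finding, []))


def validate_plan(finding: str, proposed: list[str]) -> list[str]:
    """Keep the proposed order, but drop anything outside the approved pool."""
    pool = set(approved_drills(finding))
    return [d for d in proposed if d in pool]
```

The model keeps control of sequencing, since the proposed order is preserved, while an invented drill simply never reaches the plan.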

Step 7 · Calibrated, not guessed

When a beginner's deep-knee landing was mis-called a “fall,” I didn't nudge a threshold — I isolated the root cause, changed the product behavior where it mattered, and verified the fix against real falls.

Step 8 · Operator-grade reliability

A deploy once removed a route labeled “dead code”; uploads silently failed for ~30 hours while the whole test suite stayed green — because nothing asserted the route existed. The fix wasn't just re-mounting it: a registry of critical routes plus a 15-minute runtime smoke, so that class of silent failure is caught in minutes, not hours.
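A critical-route registry reduces to a set comparison between what must exist and what the running app actually exposes. The route paths and function names below are assumptions, and how the check is scheduled is left out.

```python
# Hypothetical route paths; the real registry is not shown in the text.
CRITICAL_ROUTES = frozenset({("POST", "/upload"), ("GET", "/result")})


def missing_critical_routes(mounted: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return every critical route that the running app does not expose."""
    return sorted(CRITICAL_ROUTES - mounted)


def smoke_check(mounted: set[tuple[str, str]]) -> None:
    # Fail loudly so a scheduler or alerting hook can react.
    missing = missing_critical_routes(mounted)
    if missing:
        raise RuntimeError(f"critical routes missing: {missing}")
```

The same comparison can run as a unit test at deploy time and as a periodic runtime check, which covers the gap where a green test suite says nothing about whether a route is mounted.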

Architecture evolution

It wasn't written in a straight line.

The architecture was forced into shape by a string of setbacks. None of the turning points was “a cooler model” — every one was a judgment about trust boundaries. Each pain became a permanent guardrail.

Algorithm-first hit a physics ceiling

The pain: I first let the algorithm output the diagnosis directly, but rotation completeness is motion information, not single-frame appearance. No amount of tuning fixes that.

The call: I demoted raw measurement to an evidence layer and moved the product's value to problem identification and a training prescription.

“All tests green — but production was analyzing nothing”

The pain: The eval path and the production path quietly diverged; users were handed a “result” computed on empty input.

The call: I built production-parity verification and made it a rule that “code exists” ≠ “the path is verified.”
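One possible reading of production-parity verification is to push the same real input through both paths and refuse to pass on empty input. The text does not describe the actual mechanism, so `check_parity` and the two toy paths below are assumptions.

```python
from typing import Callable, Sequence


def check_parity(frames: Sequence, eval_path: Callable, prod_path: Callable) -> None:
    """Run one real input through both paths and fail loudly if they disagree."""
    if len(frames) == 0:
        raise ValueError("refusing to verify against empty input")
    expected = eval_path(frames)
    actual = prod_path(frames)
    if expected != actual:
        raise AssertionError(f"paths diverged: eval={expected!r} prod={actual!r}")


def eval_path(frames):
    # Hypothetical analysis: report how many frames were analyzed.
    return {"frames_analyzed": len(frames)}


def broken_prod_path(frames):
    # Simulates the failure in the text: production analyzes nothing.
    return {"frames_analyzed": 0}
```

Calling `check_parity(frames, eval_path, broken_prod_path)` with any non-empty input raises, which is the signal a test suite that only exercises the eval path would never produce.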

A world champion’s textbook jump was flagged as a severe defect

The pain: One false positive defined the entire product's safety philosophy.

The call: The first rule, “it can be less, but it cannot be wrong”: high-sensitivity rotation calls stay silent and route to a human coach rather than guess.

The biggest refactor: LLM-first → measurement-first, human-reviewed, LLM-last

The pain: A general-purpose LLM hallucinates “plausible evidence” exactly where trust matters most.

The call: A measurement-and-review layer owns the verdict; the LLM only translates, never invents a number, and every output is validated against the frozen verdict.

A confident “dead-code” deletion silently broke uploads for ~30 hours

The pain: Every test passed, because none asserted the route existed.

The call: The fix wasn't just re-mounting it. I added a registry of critical routes plus a 15-minute runtime smoke, so that class of silent failure is caught in minutes.

A destructive “reset cache” operation damaged already-generated content

The pain: Multi-level caches that should have failed independently were wiped together by a single query.

The call: Two guardrails (destructive operations must SELECT-and-confirm first, and cache invalidation gets the smallest possible blast radius), plus choosing human review over auto-migration to match risk to the size of the business.
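The two guardrails can be illustrated together with an in-memory SQLite table: the delete is scoped to one cache level and one key, and it runs only after the matching rows have been selected and confirmed. The table layout, the `invalidate` helper, and the sample rows are hypothetical.

```python
import sqlite3


def invalidate(conn: sqlite3.Connection, level: str, key: str, confirm) -> int:
    """SELECT the rows a delete would touch; delete only if confirm(rows) is true."""
    rows = conn.execute(
        "SELECT level, key FROM cache WHERE level = ? AND key = ?", (level, key)
    ).fetchall()
    if not rows or not confirm(rows):
        return 0  # nothing matched, or the operator declined
    cur = conn.execute("DELETE FROM cache WHERE level = ? AND key = ?", (level, key))
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (level TEXT, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO cache VALUES (?, ?, ?)",
    [("pose", "jump-1", "cached"), ("narrative", "jump-1", "cached"), ("narrative", "jump-2", "cached")],
)

# Confirm only if exactly one row would be affected.
deleted = invalidate(conn, "narrative", "jump-1", confirm=lambda rows: len(rows) == 1)
remaining = conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
print(deleted, remaining)
```

Scoping by both level and key means the pose-level entry for the same jump survives, which is the independence the multi-level cache was supposed to have.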

Architecture & judgment

The four calls that define the system.

Measurement-first boundary

A measurement-first layer owns the verdict; the LLM only translates it into words. It can't introduce a number the measurement didn't produce because the product boundary enforces that, not a prompt.

Human-in-the-loop, by construction

Rotation-defect calls are forbidden from auto-display and routed to a coach — enforced by the product boundary, not by a disclaimer.

Calibrated, not patched

False positives and over-calls were caught against real video and fixed at the root cause — an ablation to find why, not a threshold nudge to hide it — with real falls still detected.

Operator-grade reliability

A ~30-hour silent outage became a permanent guard: a registry of critical routes plus a runtime smoke, so that whole class of failure surfaces in minutes instead of hours.

What I owned

01 · Designed the diagnosis architecture: the measurement/LLM trust boundary, the rules-first verdict flow, and the human-review routing for high-stakes calls.
02 · Directed the AI build: wrote the specs, set what the LLM may and may not do, reviewed the output, and decided what shipped.
03 · Verified it end-to-end: caught the false positive on a world champion's jump, the beginner-fall over-call, and the dead-code outage before they defined the product.

This is how I work: design the system, draw the trust boundary, own the verification.

If you're evaluating someone to design or operate AI-augmented systems, this is a representative piece of how I think. There's more in the collection.

Connect on LinkedIn → · See more work → · @cikibuilds