Ciki Zeng
← Back to Blog
2026-04-22· 7 min readJumpOnionUX DesignAI Engineering

Day 6: A Button Name Killed My Algorithm

By this point JumpOnion had been calibrated against a real-jump set sourced from coaches and senior skaters. The algorithm went through several iterations in a month — tightening thresholds, adding physics constraints, squeezing out false positives. By the end of that pass, every video in the calibration set produced a diagnosis that matched the coach's ground truth.

Then a real parent uploaded a video. Air time: 1.2 seconds.

A clean double jump tops out at about 0.85 seconds. Triples are in the high 0.6s. The number 1.2 doesn't exist in figure skating — it's either a quintuple jump that nobody has ever landed, or the algorithm just lied. The calibration consensus said: not the algorithm.

The Cardinal Engineering Sin

The first reaction was textbook wrong: assume the algorithm broke somehow on this specific video. Spend an hour adjusting thresholds. Add another smoothing filter. Tune the takeoff detector. None of it changed the output — the algorithm kept correctly extracting the wrong takeoff and landing framesthat the user had told it to use.

That last bit took an embarrassingly long time to notice.

The UI Said One Thing

The upload flow had two buttons: Set Takeoff and Set Landing. A user scrubbed the video, paused on the moment they thought was the takeoff, hit the button, scrubbed forward, paused on the landing, hit the second button. Submit. Diagnosis comes back in 60 seconds.

Every figure skating parent knows what "takeoff" and "landing" mean. They picked the moment the blade left the ice and the moment it touched down. Exactly what the buttons asked for. That's what they uploaded.

The Algorithm Wanted Something Else

The algorithm's internal contract was different. Its takeoff and landing detectors looked at signal patterns around the events, not at the events themselves — patterns that require context frames on either side of the event to exist. If you trim the clip flush against the takeoff and landing, the detectors lose their anchor.

When the user set the buttons to the literal takeoff and landing frames, they trimmed the clip flush against those events. The detectors had no context frames to anchor against, so they fell back to the first and last frames of the clip — which the user had chosen to be the takeoff and landing. Those got treated as clip boundaries. The algorithm calculated the duration from clip start to clip end: 1.2 seconds. Mathematically perfect. Physically nonsense.

The buttons were a promise: "tell us takeoff and landing." The algorithm was making a different promise: "tell us when the clip starts and ends, with takeoff and landing somewhere in the middle." Same words. Opposite contracts.

The Fix Wasn't in the Algorithm

Once the contract mismatch was visible, the fix was small. One commit changed three things:

  1. Button labels."Set Takeoff" became "Set Clip Start." "Set Landing" became "Set Clip End." Plus a subtext: "include about 0.5 seconds before the takeoff and after the landing."
  2. Visual guide.The trim UI got a quality band — a green "good clip" zone and red "too tight" edges, plus a tooltip showing the recommended clip structure.
  3. Onboarding modal. First upload now shows a one-screen diagram: a clip with the takeoff and landing in the middle, padding visible at both ends, labeled.

The algorithm code didn't change. Its contract was never wrong.

What The Calibration Set Couldn't Catch

The calibration set came from skating coaches and senior skaters. Those clips had been trimmed by people who already understood what the algorithm needed — padding on each end. The calibration set was perfectly clean. It was a laboratory dataset.

What it never contained: a clip trimmed by someone who took the button at its word. The very first real parent upload was outside the calibration distribution — not because the parent was wrong, but because the calibration set had filtered for users who already knew what the algorithm wanted.

New rule that calcified out of this: your calibration set must include mistakes a real user would make. Not just clean inputs from people who already know the system. "Mistakes" here doesn't mean "malicious" — it means "reasonable misinterpretation of what the UI promised."

UI as a Contract

I'd been treating the UI as cosmetic. Labels, layouts, spacing — things you tune for clarity. After this incident I promoted UI to a first-class contract layer. Every domain-term button on a critical input is a statement of what the system expects. If the user takes the statement literally, the system has to behave correctly.

That means UI text is part of the algorithm's test surface. If you change a button label, you should be able to re-run the algorithm's integration tests through the new label and confirm the contract still holds. If you don't have a way to test the path from UI label to algorithm output, you've got a silent failure waiting to happen.

Without SOP, With SOP

Without SOP

Keep tuning the algorithm. Add more smoothing. Lower a threshold. Re-run on calibration set, see "still clean", declare victory, ship. Real users keep uploading "as instructed" and getting impossible numbers. Conclusion spreads in the community: "this product is unreliable."

With SOP

When the algorithm's output is impossible, the first hypothesis is contract mismatch, not algorithm bug. Trace from output back to input back to UI. UI as a first-class contract layer. Calibration set expanded to include "reasonable misinterpretations."

The Real Lesson

Precision in code is wasted if the input that reaches the code doesn't mean what the code thinks it means. A calibration set built only from experts will not catch this class of failure. A laboratory dataset is not a production dataset, no matter how clean it looks.

Buttons are contracts. So are placeholder texts, help tooltips, confirmation dialogs. They tell the user what input the system expects. The user obeys the contract. If the contract's wording disagrees with the algorithm's assumption, the user's "mistake" is actually correct adherence to the wrong promise. Fix the promise.

Next: Day 7 — an AI tried to invalidate a stale cache and wiped 20 paying customers' diagnosis text and training plans in a single SQL statement. Recovery was possible only because of one audit decision made months earlier.

Want the "UI as Contract" checklist?

Templates, SOPs, and design rubrics — from $39.

See Pricing