Day 6: A Button Name Killed My Algorithm

By this point JumpOnion had been calibrated against a real-jump set sourced from coaches and senior skaters. The algorithm went through several iterations in a month — tightening reliability checks, closing off false positives. By the end of that pass, every video in the calibration set produced a diagnosis that matched the coach's ground truth.

Then a real parent uploaded a video. Air time: 1.2 seconds.

The number did not exist in figure skating. It was either a jump nobody has ever landed, or the algorithm just lied. The calibration consensus said: not the algorithm.

Visual thesis map

The algorithm looked wrong

The input contract was wrong

The user obeyed the UI

The fix lived in labels

UI became a contract layer

The Cardinal Engineering Sin

The first reaction was textbook wrong: assume the algorithm broke somehow on this specific video. Spend an hour adjusting thresholds. Add another smoothing filter. Tune the detector. None of it changed the output — the algorithm kept correctly extracting the wrong takeoff and landing framesthat the user had told it to use.

That last bit took an embarrassingly long time to notice.

The UI Said One Thing

The upload flow had two buttons: Set Takeoff and Set Landing. A user scrubbed the video, paused on the moment they thought was the takeoff, hit the button, scrubbed forward, paused on the landing, hit the second button. Submit. Diagnosis comes back in 60 seconds.

Every figure skating parent knows what "takeoff" and "landing" mean. They picked the moment the blade left the ice and the moment it touched down. Exactly what the buttons asked for. That's what they uploaded.

The Algorithm Wanted Something Else

The algorithm's internal contract was different. It expected a different kind of input than the button promised. The specifics don't matter. The user followed the label perfectly, and the label gave the algorithm the wrong shape of clip.

Diagram 01

button promise

pick the named event

The UI sounded obvious to a real parent.

system contract

needs a different clip shape

The model did exactly what the input implied.

result

precise nonsense

The calculation was clean; the promise was not.

When the user set the buttons to the literal takeoff and landing frames, they trimmed the clip flush against those events. The system treated the clip boundary as the event boundary, so a perfectly obedient user produced a perfectly wrong input. The algorithm calculated the duration from clip start to clip end: 1.2 seconds. Mathematically perfect. Physically nonsense.

The buttons were a promise: "tell us takeoff and landing." The algorithm was making a different promise: "tell us when the clip starts and ends, with takeoff and landing somewhere in the middle." Same words. Opposite contracts.

The Fix Wasn't in the Algorithm

Once the contract mismatch was visible, the fix was small. One commit changed three things:

Button labels."Set Takeoff" became "Set Clip Start." "Set Landing" became "Set Clip End." Plus a subtext: "leave a little padding before and after the jump."
Visual guide.The trim UI got a quality band — a green "good clip" zone and red "too tight" edges, plus a tooltip showing the recommended clip structure.
Onboarding modal. First upload now shows a one-screen diagram: a clip with the takeoff and landing in the middle, padding visible at both ends, labeled.

The algorithm code didn't change. Its contract was never wrong.

What The Calibration Set Couldn't Catch

The calibration set came from skating coaches and senior skaters. Those clips had been trimmed by people who already understood what the algorithm needed — padding on each end. The calibration set was perfectly clean. It was a laboratory dataset.

What it never contained: a clip trimmed by someone who took the button at its word. The very first real parent upload was outside the calibration distribution — not because the parent was wrong, but because the calibration set had filtered for users who already knew what the algorithm wanted.

Diagram 02

calibration set

clean expert inputs

Great for proving the core system, weak for real-user misunderstanding.

production upload

reasonable misread

The first real parent tested the contract the lab data never touched.

New rule that calcified out of this: your calibration set must include mistakes a real user would make. Not just clean inputs from people who already know the system. "Mistakes" here doesn't mean "malicious" — it means "reasonable misinterpretation of what the UI promised."

UI as a Contract

I'd been treating the UI as cosmetic. Labels, layouts, spacing — things you tune for clarity. After this incident I promoted UI to a first-class contract layer. Every domain-term button on a critical input is a statement of what the system expects. If the user takes the statement literally, the system has to behave correctly.

Diagram 03

label

what the user thinks

input

what the system receives

test

what the product proves

output

what the user trusts

That means UI text is part of the algorithm's test surface. If you change a button label, you should be able to re-run the algorithm's integration tests through the new label and confirm the contract still holds. If you don't have a way to test the path from UI label to algorithm output, you've got a silent failure waiting to happen.

Without SOP, With SOP

Without SOP

Keep tuning the algorithm. Add more smoothing. Lower a threshold. Re-run on calibration set, see "still clean", declare victory, ship. Real users keep uploading "as instructed" and getting impossible numbers. Conclusion spreads in the community: "this product is unreliable."

With SOP

When the algorithm's output is impossible, the first hypothesis is contract mismatch, not algorithm bug. Trace from output back to input back to UI. UI as a first-class contract layer. Calibration set expanded to include "reasonable misinterpretations."

The Real Lesson

Precision in code is wasted if the input that reaches the code doesn't mean what the code thinks it means. A calibration set built only from experts will not catch this class of failure. A laboratory dataset is not a production dataset, no matter how clean it looks.

Buttons are contracts.So are placeholder texts, help tooltips, confirmation dialogs. They tell the user what input the system expects. The user obeys the contract. If the contract's wording disagrees with the algorithm's assumption, the user's "mistake" is actually correct adherence to the wrong promise. Fix the promise.

Next: Day 7 — an AI tried to invalidate a stale cache and wiped 20 paying customers' diagnosis text and training plans in a single SQL statement. Recovery was possible only because of one audit decision made months earlier.