I'm one person. With AI, I built — and still maintain — a learning app for my two kids. Today it's 85 API routes and around 140,000 lines of code. In early April it was 48 routes. It keeps growing because they keep outgrowing it.
The app is called IvyBloom (ivybloom.app). My entire user base is two kids I see at breakfast.
What "AI-built" actually means here
The code is AI-generated. The judgment is mine. Those are two different jobs, and most of the work is the second one.
I decide what the app should do, how it should behave when it's unsure, and what "correct" means for a nine-year-old's math problem. The AI writes the implementation. Then I check whether it actually works — for these two users, on real problems, every day. When it doesn't, I'm the one who catches it. The line count is what the AI produced. The product is what survived my checking.
Writes implementation: Code volume, scaffolding, routes, and feature wiring move fast.
Sets correctness: Behavior, uncertainty, and age-appropriate answers are human calls.
Survives checking: Daily use decides what is real, not the line count.
Three decisions I made, and the AI carried out
Difficulty follows their actual scores. Each problem's difficulty calibrates to where each kid really tests (their MAP scores), not to a grade-level average. A nine-year-old who's two years ahead in reading and right on grade in math shouldn't get one blended "fourth-grade" setting for both. The app tracks each subject separately.
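Here is a minimal sketch of the per-subject idea, not IvyBloom's actual code. The names `Learner`, `levels`, `difficulty_for`, and `blended_difficulty` are hypothetical, and it assumes each subject's assessed level is already stored as a grade-equivalent number. How a MAP score turns into that number is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Learner:
    name: str
    # Hypothetical field: one assessed level per subject, stored as a
    # grade-equivalent number. The MAP-score conversion is not shown.
    levels: dict = field(default_factory=dict)


def difficulty_for(learner: Learner, subject: str) -> float:
    """Target difficulty for one subject, taken only from that subject's level."""
    if subject not in learner.levels:
        raise KeyError(f"no assessed level recorded for {subject!r}")
    return learner.levels[subject]


def blended_difficulty(learner: Learner) -> float:
    """The approach the text argues against: one average across all subjects."""
    return sum(learner.levels.values()) / len(learner.levels)


# A fourth grader who reads two years ahead and is on grade in math.
kid = Learner("example", {"reading": 6.0, "math": 4.0})
reading_target = difficulty_for(kid, "reading")  # 6.0
math_target = difficulty_for(kid, "math")        # 4.0
blended = blended_difficulty(kid)                # 5.0, a poor fit for both subjects
```

The blended number lands a year too easy for reading and a year too hard for math, which is the failure the separate lookup avoids.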
For math and science, a second pass checks the first. Before either kid sees a math or science question, the system independently re-solves it and compares. If the two solutions disagree, the question gets thrown out instead of shown. This only runs on math and science, the subjects where there's a single right answer to check against. It's not every question. The second pass isn't there to look clever; it's there to catch the cases where the system can't even agree with itself, because those are the ones that shouldn't reach a kid.
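The filter can be sketched roughly like this. It is an illustration under assumptions, not the app's real pipeline: `Question`, `CHECKED_SUBJECTS`, `filter_questions`, and the `resolve` callback are hypothetical names, and the stand-in resolver below is a lookup table where the real system would run an independent solve.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Question:
    prompt: str
    subject: str
    answer: str  # the answer produced by the first pass


# Only subjects with a single checkable answer get the second pass.
CHECKED_SUBJECTS = {"math", "science"}


def filter_questions(
    questions: Iterable[Question],
    resolve: Callable[[Question], str],
) -> List[Question]:
    """Keep a question unless it is in a checked subject and the passes disagree."""
    kept = []
    for q in questions:
        if q.subject in CHECKED_SUBJECTS and resolve(q).strip() != q.answer.strip():
            continue  # the two solutions disagree, so the question is dropped
        kept.append(q)
    return kept


# Stand-in for the independent re-solve (assumption: a fixed lookup table).
def fake_resolve(q: Question) -> str:
    return {"7 x 8": "56", "9 x 6": "54"}[q.prompt]


questions = [
    Question("7 x 8", "math", "56"),       # passes agree, kept
    Question("9 x 6", "math", "56"),       # first pass was wrong, dropped
    Question("Main idea of the passage?", "reading", "open"),  # not checked
]
kept = filter_questions(questions, fake_resolve)
```

Note the design choice: a disagreement does not try to decide which pass was right. It just removes the question, which is the cheaper and safer outcome when the only cost is one fewer problem on the page.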
Wrong answers get tagged by type. When a kid misses a question, the app classifies the mistake — computational, conceptual, procedural, or careless. A raw score tells me how many they got wrong. The tag tells me whether the same kind of mistake keeps coming back, which is the part I can actually do something about.
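To show why the tag is more useful than the raw score, here is a small sketch that counts tagged misses and surfaces the types that repeat. The four categories come from the text above; `MistakeType`, `recurring_mistakes`, and the `min_count` threshold are hypothetical, and the classification step itself (deciding which tag a miss gets) is assumed to have already happened.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class MistakeType(Enum):
    COMPUTATIONAL = "computational"
    CONCEPTUAL = "conceptual"
    PROCEDURAL = "procedural"
    CARELESS = "careless"


def recurring_mistakes(
    tags: Iterable[MistakeType], min_count: int = 2
) -> Dict[MistakeType, int]:
    """Return only the mistake types seen at least min_count times."""
    counts = Counter(tags)
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Five misses in one week: the raw score only says "five wrong".
week = [
    MistakeType.CARELESS,
    MistakeType.CONCEPTUAL,
    MistakeType.CARELESS,
    MistakeType.PROCEDURAL,
    MistakeType.CARELESS,
]
patterns = recurring_mistakes(week)
```

Here `patterns` contains only the careless tag with a count of three, which points at a different fix (slow down, re-read) than three conceptual misses would.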
Difficulty follows scores: Reading and math do not collapse into one blended grade setting.
STEM gets a second pass: Math and science questions get independently re-solved before a kid sees them.
Wrong answers get typed: Misses become computational, conceptual, procedural, or careless signals.
Why the line count isn't the point
85 routes and 140,000 lines sound like a team's output. They're not the point. The point is that the software has to work for two specific kids who will tell me, immediately and without diplomacy, when it doesn't.
That's a different bar than "ship it and watch the metrics." There are no metrics. There are two kids, and the question every day is whether the thing helped them learn something they couldn't yesterday. If it didn't, no dashboard is going to comfort me, and they'll let me know at dinner.
AI wrote the code. I decided what it was allowed to be wrong about. That second part is the whole job.
