Ciki Zeng
← Back to Blog
2026-04-18· 9 min readSOP FrameworkAI ToolingCompetitive Analysis

Day 5: 89K Stars. I Adopted 3 Ideas. Rejected 3.

A new AI agent framework hit GitHub trending. 89,000 stars in under a year. The pitch was perfect: agents that mutate their own skills via a genetic algorithm, marketplace of community skills, automatic user profiling. "Self-evolving" was the magic word. Of course it was.

My first reaction was the same as everyone's: install it. 89K people can't be wrong. My second reaction — trained into me by a similar moment a few weeks earlier with a 53K-star memory tool — was different. 89K stars measures how many people had the same problem. It doesn't measure whether the solution fits mine.

The 5-Dimension Comparison

Rather than ask "use or skip?" — a binary trap — I asked "which specific ideas are worth porting at zero cost?". Five axes, one pass:

  1. Memory architecture.Their model: a single evolving prompt blob. Mine: four explicit layers (asset, runtime, enforcement, training) with clear graduation paths. Verdict: mine is more legible. Don't adopt.
  2. Skill system. Their model: skills are first-class objects with metadata, callable by name, listed in a registry. Mine at the time: skills lived as ad-hoc commands. Verdict: borrow the registry pattern. Worth a half-day to port.
  3. Self-evolution. Their model: genetic algorithm auto-mutates skills based on session outcomes. Mine: a human-in-the-loop graduation pipeline. Verdict: see below.
  4. Behavior enforcement.Their model: prompt-level rules in the agent constitution. Mine: hooks + skills + memory (layered). Verdict: theirs is weaker. Don't adopt.
  5. Cross-tool collaboration. Their model: a shared session log. Mine: handoff notes + a rolling CURRENT-STATE.md + dashboard sync. Verdict: borrow their JSONL session-summary format. Cleaner than what I had.

Why I Rejected the Headline Feature

The genetic-algorithm self-mutation is the framework's most quoted feature. It's also the one I rejected first.

Here's what self-mutation looks like in a multi-person team with code review: the AI proposes a skill change, a human reads the diff, the diff lands or gets pushed back. That works. The human is the natural review checkpoint.

Here's what self-mutation looks like in a one-person company: the AI proposes a skill change, the AI executes it because there's no team, and now a production rule has silently shifted while you were doing something else. The next time the rule fires, it does something subtly different. You won't notice until a bug surfaces, and you won't know which mutation caused it.

Automation is a multiplier. In a team it multiplies oversight. In a one-person company it multiplies the absence of oversight. Same feature, opposite sign.

The Three Ideas I Stole

Three patterns ported, total cost about a day:

  1. Skill solidification prompts.Their framework prompts the agent at session end: "was anything in this session worth turning into a reusable skill?" I added the same nudge to my session-wrap protocol. Most sessions answer "no." The ones that say "yes" have given me three new skills in the last month.
  2. JSONL failure-mode log.Their framework appends every tool call outcome to a JSONL file. Mine now does the same — and a weekly review of the failures has already revealed two anti-patterns I'd been accommodating without naming.
  3. Mid-session checkpoints.Their framework triggers a self-reflection every N tool calls. Mine triggers every 15 turns: a one-line state echo ("what we set out to do / what changed / are we on track"). Cheap, and it catches drift before drift catches you.

The Three Ideas I Rejected

  1. Genetic skill mutation — for the reason above. No review capacity means no automation in this layer.
  2. Public skill marketplace. Their framework imagines users sharing skills the way packages share code. Skills in my system are commercial assets. I sell them. A public marketplace would commodify exactly the thing I monetize.
  3. Automatic user profiling. Their framework builds a profile of the user from session traces. Useful if you have many users. I have one user — me. The cost of building the system is higher than the benefit of profiling one person who can just tell the agent what they want.

The Generalized Rule

Out of this evaluation came a rule I now apply to every viral tool:

When a popular tool overlaps your domain:
1. Don't ask "should I adopt it?"
2. Ask "which 1-3 specific ideas are worth porting?"
3. For each idea, ask: "does the assumption behind this idea
   match my situation?" (team size, ownership, asset model, etc.)
4. Port the matches. Reject the mismatches. Explicitly.
5. Write the rejection down — it's the most valuable artifact.

The rejections are the valuable part. Reading a 89K-star framework and noting "these three features assume a team I don't have" is a sharper articulation of my own architecture than I could've produced in isolation. The popular tool became an evaluation mirror for the system I already had.

Without SOP, With SOP

Without SOP

Either install the 89K-star framework wholesale (chasing hype), or dismiss it because "mine is fine" (pride). Both options waste the opportunity to learn from a system that 89K people gave time to. Worse: full adoption replaces a working architecture with one whose assumptions don't match.

With SOP

Read the framework. Decompose into specific design choices. Match each against your situation. Port the matches at zero cost, reject the mismatches loudly. The process produces a better tool and a better-articulated own architecture as a byproduct.

The Real Lesson

Star counts measure popularity, not fit. A tool with 89K stars solves 89K people's problem in a way they're willing to vote for. It does not mean their problem is your problem, nor that their solution survives translation to your context.

What does survive translation is patterns. Specific design choices — "a registry of named skills," "a JSONL session log," "an end-of-session reflection prompt" — survive because they're context-independent. They're the kind of thing you can extract in an afternoon and slot in next to your own primitives. The headline features don't survive. The architectural specifics do.

This is why I now spend 30 minutes evaluating every viral AI tool that overlaps my stack. Not to adopt it. To mine it for the 1-3 specific design choices worth porting. The 30 minutes of evaluation is worth more than the 30 hours of half-using something I didn't need.

Next: Day 6 — a precise jump-detection algorithm reported a 1.2s air time on a jump that physically maxes out at 0.85s. The algorithm was right. The button label that fed it was wrong. A UX detail killed an engineering metric.

Want the full evaluation framework?

Templates, SOPs, and decision rubrics — from $39.

See Pricing