Ciki Zeng
Case Study · CikiBrain

The AI-agent operating system I architected to run a one-person, four-product company. Verification enforced in code, not prompts.

CikiBrain is the governance layer that lets one person plus AI run four shipped products without the wheels coming off. A four-layer memory architecture, ~19–20 enforcement hooks across five lifecycle events, and a self-evolving ledger that tracks its own operator's demonstrated capability from live signals. I designed it, directed AI to build it, and I'm the verification gate. Here's how it's built, and the calls I made.

Designed & architected · Directed the AI build · Operated daily
The hero mechanism — a self-evolving capability ledger

A system that learns its own operator's capability — from live signals, storing no prompt text.

Most of this system governs work. This part governs evidence: a silent, PII-safe loop that captures what the operator actually does, distills it into proof, and evolves a capability baseline from real signals instead of a once-a-year rewrite. It's the newest layer — and the reason this case study exists alongside the methodology.

Capture — silent, PII-safe
A hook logs capability signals as they happen — challenge, refuse-fake-done, demand-proof, architect, catch — as allow-listed keys plus a non-reversible hash. No prompt text is ever stored.
Enrich into evidence
A skill distills the high-signal events into structured capability evidence — raw signal becomes reviewable proof.
Propose a baseline delta
It proposes a concrete update to the operator's demonstrated-capability record.
The operator approves — the human is the decision gate
The baseline evolves
Only approved evidence moves the record — it grows from live signals, never a manual rewrite. The evolved baseline then informs what's worth capturing next, and the loop closes.
Raw → distilled → evolve — the same loop the rest of the system runs on. The system learns its operator's capability from live signals, by consent, storing no prompt text.
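The capture step could look roughly like the sketch below. It is an illustration, not the actual hook: the function name `capture_signal`, the record fields, and the use of salted SHA-256 are all assumptions. Only the five signal names come from the text above.

```python
import hashlib
import json
import time
from typing import Optional

# Signal names are taken from the text; everything else is hypothetical.
ALLOWED_SIGNALS = {"challenge", "refuse-fake-done", "demand-proof", "architect", "catch"}


def capture_signal(signal: str, prompt_text: str, salt: bytes) -> Optional[dict]:
    """Return a ledger record for an allow-listed signal, or None otherwise.

    The prompt text is only fed into a one-way hash; it is never stored.
    """
    if signal not in ALLOWED_SIGNALS:
        return None  # keys outside the allow-list are dropped, not logged
    digest = hashlib.sha256(salt + prompt_text.encode("utf-8")).hexdigest()
    return {"signal": signal, "hash": digest, "ts": int(time.time())}


record = capture_signal("demand-proof", "show me the prod logs", salt=b"local-secret")
print(json.dumps(record))
```

The salt is kept local in this sketch because short prompts could otherwise be guessed by hashing candidates; whether the real system salts its hashes is not stated in the source.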

There's no app to screen-record here — CikiBrain is the operating system that runs the company, not a product with a UI. Everything shown is the architecture and the mechanisms; every on-screen signal is synthetic or generic. No vault contents, client, or private data appears anywhere on this page. It's a solo build, dogfooded daily — this shows capability and judgment, not traction.

The problem

In an AI-leveraged company the bottleneck isn't writing code — AI does that fast, and platforms will only make it faster. The bottleneck is judgment: knowing when output is wrong, when “done” isn't done, when not to trust the model. That judgment doesn't scale by working harder, and it quietly drifts over a long session. CikiBrain is built around one idea: systematize judgment into rules the system runs on its own — so AI scales the work without scaling the recklessness, and the discipline survives my own attention.

Architecture

Four layers, one rule: the things that matter run in code.

Every kind of information has exactly one home. Knowledge lives in the vault; active rules in the runtime; mandatory checks in the enforcement layer; and graduating feedback in the training buffer. The enforcement layer is what turns a set of good intentions into an operating system.

Asset · Obsidian vault
Knowledge, decisions, case studies, sellable assets — the long-term memory.
Runtime · CLAUDE.md
Active behavioral rules and routing the agent must follow this session.
Enforcement · hooks that run in code · Prompts are suggestions, hooks are law
~19–20 mandatory checks across five lifecycle events. Not reminders the model may skip — code that runs whether the AI remembers to or not. This is the layer that makes the difference.
Training · memory buffer
Feedback that graduates into a rule on recurrence — then retires when it's obsolete.
Walkthrough · 6 steps

How the system works — and why it's built this way.

Step 1 · The system

A personal AI-agent operating system that runs a one-person, four-product software company. I designed the governance, directed AI to implement it, and I'm the verification gate. It's a system of files and config — deliberately legible, not a black box — so the judgment lives somewhere I can audit and the AI has to obey.

Step 2 · Prompts are suggestions. Hooks are law.

A rule written into a prompt is best-effort — and the insight that forced the next layer was watching a rule that existed and the AI still not follow it. So judgment that actually matters descends into hooks: mandatory checks that run in code on every session, not reminders the model is free to skip. The enforcement layer is the difference between a document and an operating system.
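To make the prompt-versus-hook distinction concrete, here is a minimal sketch of a hook runner. The registry, the `Verdict` type, the event name `before_write`, and the example check are all hypothetical; the source does not name its lifecycle events or describe its hook API.

```python
from typing import Callable, Dict, List, NamedTuple


class Verdict(NamedTuple):
    allowed: bool
    reason: str


Hook = Callable[[dict], Verdict]

# Hypothetical registry: lifecycle event name -> hooks that must all pass.
HOOKS: Dict[str, List[Hook]] = {}


def register(event: str):
    """Decorator that attaches a check to a lifecycle event."""
    def wrap(fn: Hook) -> Hook:
        HOOKS.setdefault(event, []).append(fn)
        return fn
    return wrap


@register("before_write")  # event name is illustrative only
def no_env_files(action: dict) -> Verdict:
    if action.get("path", "").endswith(".env"):
        return Verdict(False, "writes to .env files are blocked")
    return Verdict(True, "ok")


def run_hooks(event: str, action: dict) -> Verdict:
    """Run every hook for the event; the first failure blocks the action."""
    for hook in HOOKS.get(event, []):
        verdict = hook(action)
        if not verdict.allowed:
            return verdict
    return Verdict(True, "all hooks passed")
```

The point of the shape is that `run_hooks` is called by the harness around the model, so the check runs whether or not the model remembers the rule.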

Step 3 · Capture low, promote on recurrence, then retire

Observations start cheap. When one recurs (twice, across projects, or often enough that it hardens into a rule), it's promoted into an enforced hook. Every enforced rule also carries a retire-if condition, so it can die when it's obsolete. A rule that can't expire becomes a shackle; the lifecycle is part of the design, not an afterthought.
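The lifecycle above might be modelled as in the sketch below. The class names, the promotion threshold of two sightings, and the shape of the retire-if predicate are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Set

PROMOTE_AFTER = 2  # assumed threshold: an observation seen twice is promoted


@dataclass
class Observation:
    key: str
    count: int = 0
    projects: Set[str] = field(default_factory=set)

    def record(self, project: str) -> None:
        """Log one more sighting and remember which project it came from."""
        self.count += 1
        self.projects.add(project)

    def should_promote(self) -> bool:
        return self.count >= PROMOTE_AFTER


@dataclass
class Rule:
    key: str
    retire_if: Callable[[dict], bool]  # predicate over current system state

    def is_live(self, state: dict) -> bool:
        """A rule stays enforced until its retire-if condition holds."""
        return not self.retire_if(state)


obs = Observation("tests-pass-but-prod-unverified")
obs.record("product-a")
obs.record("product-b")
rule = Rule(obs.key, retire_if=lambda state: state.get("ci_verifies_prod", False))
```

Storing the retirement condition on the rule itself means obsolescence is checked mechanically instead of relying on someone remembering to prune.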

Step 4 · It catches its own operator

732 passing tests are not production-verified. A “done” can be false. Context drifts over a long session. The system encodes those failure modes as gates — verification, grounding, completeness — so it catches my own false-completions and drift, not just the AI's. Designing against your own blind spots is the part most workflows skip.
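A verification gate of this kind could be sketched as follows. `DoneClaim`, its fields, and `verify_done` are hypothetical names; the only idea taken from the text is that a passing test count is not accepted as proof of completion.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DoneClaim:
    task: str
    tests_passed: int
    # Assumed form of evidence: references to checks made on the running system.
    production_evidence: List[str] = field(default_factory=list)


def verify_done(claim: DoneClaim) -> Tuple[bool, str]:
    """Accept 'done' only with production evidence; test counts never suffice."""
    if not claim.production_evidence:
        return (False, f"{claim.tests_passed} passing tests are not production verification")
    return (True, "verified against production evidence")
```

The gate applies to whoever makes the claim, which is how it can catch the operator's false completions as well as the model's.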

Step 5 · Threat-modeling the AI pipeline

A non-obvious leak chain — an AI debug transcript that syncs to cloud storage and then gets indexed by AI search — isn't something a secret-scanner alone can catch. So the enforcement layer includes security hooks and a source-write quarantine designed around that specific chain. Correctness here is a safety property, not a nicety.
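One way a source-write quarantine could break the first link of that chain is sketched below. The directory paths, the `is_transcript` flag, and the function name are hypothetical; the source does not describe how its quarantine is implemented.

```python
from pathlib import PurePosixPath

# Hypothetical configuration: directories that a cloud client syncs.
SYNCED_ROOTS = [PurePosixPath("/home/op/CloudDrive")]
QUARANTINE = PurePosixPath("/home/op/.quarantine")


def route_write(target: str, is_transcript: bool) -> PurePosixPath:
    """Redirect transcript writes away from any synced directory."""
    path = PurePosixPath(target)
    in_synced = any(root == path or root in path.parents for root in SYNCED_ROOTS)
    if is_transcript and in_synced:
        return QUARANTINE / path.name  # keep the file, but off the sync path
    return path
```

This sketch compares paths lexically and does not resolve `..` segments or symlinks, which a real guard would need to handle before comparing.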

Step 6 · Self-evolution, then productized

The capability ledger closes the loop: the system now updates its operator's own capability record from live evidence, PII-safe by design. The whole methodology was then sanitized into a portable framework. The internal system was built first, then packaged so that someone else can install it.

Architecture & judgment

The four calls that define the system.

Judgment encoded as executable rules

Verification, scope, and grounding guards turn a recurring human-or-AI failure into a permanent, mechanical guarantee. The thesis, which a public case study proved 16 times, is that AI won't reliably self-enforce, so you force it in code and verify it with real incidents, not theory.

Self-evolving capability tracking

A hook plus a skill plus a PII-safe ledger that learns its operator's demonstrated capability from live signals. Novel and meta — the system observes the person operating it, by consent, storing allow-listed keys and a hash, never a word of prompt text.
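The consent step in that loop can be illustrated with a small sketch. `Baseline`, `Delta`, and `apply_delta` are hypothetical names, and counting evidence hashes per capability is an assumed representation, not the ledger's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Baseline:
    # Assumed shape: capability name -> number of approved evidence items.
    capabilities: Dict[str, int] = field(default_factory=dict)


@dataclass
class Delta:
    capability: str
    evidence_hashes: List[str]  # hashes only, never prompt text


def apply_delta(baseline: Baseline, delta: Delta, approved: bool) -> Baseline:
    """Return an updated baseline only when the operator approved the delta."""
    if not approved:
        return baseline  # rejected or unreviewed proposals change nothing
    updated = dict(baseline.capabilities)
    updated[delta.capability] = updated.get(delta.capability, 0) + len(delta.evidence_hashes)
    return Baseline(updated)
```

Returning a new `Baseline` instead of mutating the old one keeps the approved and unapproved states easy to compare during review.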

Anti-self-deception, by design

Grounding, completeness, and compliance gates catch the system's own false-completion, stale context, and drift. The point isn't catching the AI — it's designing against my failure modes as deliberately as the model's.

Lifecycle governance, not rule-piling

An auto-derived registry surfaces every mechanism's liveness from its own footprint — never hand-maintained, so it never rots. When capture began to outrun drain, the fix was drain mechanisms, not more rules. Systems thinking over discipline theater.
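Deriving liveness from a mechanism's own footprint might look like the sketch below. It assumes the footprint is a log of `(mechanism_name, timestamp)` events and that "stale" means no event within a chosen window; both are illustrative assumptions, as are the names.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def derive_registry(
    events: List[Tuple[str, datetime]], now: datetime, stale_after: timedelta
) -> Dict[str, str]:
    """Map each mechanism to 'live' or 'stale' from its most recent logged event."""
    last_seen: Dict[str, datetime] = {}
    for name, ts in events:
        if name not in last_seen or ts > last_seen[name]:
            last_seen[name] = ts
    return {
        name: "live" if now - ts <= stale_after else "stale"
        for name, ts in last_seen.items()
    }


events = [
    ("verify-gate", datetime(2024, 1, 30)),
    ("old-hook", datetime(2023, 11, 1)),
    ("verify-gate", datetime(2024, 1, 2)),
]
registry = derive_registry(events, now=datetime(2024, 1, 31), stale_after=timedelta(days=14))
```

Because the registry is recomputed from logs, nothing is hand-maintained. A mechanism that has never fired would not appear in this sketch, so a real version would also need the list of installed mechanisms.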

Outcomes

One system governs four shipped products — and catches its own operator's false-completion and context drift.

The verifiable surface is the system itself: a four-layer architecture, ~19–20 enforcement hooks across five lifecycle events, and a 45-entry methodology library — each entry a real incident → root cause → reusable rule, the case studies published on this site a curated subset of it. It's a system of files and config, not a versioned repo — so the honest evidence is what it does, not a commit count.

What I owned
  • 01 · Designed & architected the governance: the four-layer memory model, the enforcement layer, the capability-ledger loop, and the auto-derived registry that keeps it honest.
  • 02 · Directed the AI build: wrote the rules and specs, set where the model may and may not be trusted, and decided what graduated into an enforced hook and what retired.
  • 03 · Was the verification gate: validated behavior and domain-correctness on the running system, not “it compiles,” then encoded that judgment as rules so it survives my own attention drift.

This is how I work: design the system, encode the judgment, own the verification.

If you're evaluating someone to design or operate AI-augmented systems, this is the clearest piece of how I think — the system I run everything else on. There's more in the collection.