9 cycles of self-critique against a self-improving agent system

Most agent-framework repos sell a vision. CSIS ships a paper trail of its own failures: parallel red-team agents attack each cycle's fixes, findings land in code with regression tests, and the next cycle usually finds the previous one fixed the bug at the wrong abstraction layer.

phase 0 · runnable · 246 tests passing · 9 critique cycles · 99 findings · 0 open · MIT

The numbers

99 findings across 9 cycles. 96 closed in code, 3 honestly deferred.

Each cycle: parallel red-team agents attack the prior cycle's fixes, with file:line evidence and reproducible attack snippets. Fixes land in code with regression tests. Snapshots get written to brain/ so the trail survives any contributor — human or AI — picking up cold.

Cycle	Findings	Critical	High	Medium	Low	Open	Deferred	Tests after
1 · pre-impl	18	2	6	7	3	0	0	52
2 · post-impl	13	2	6	3	2	0	0	78
3 · deltas	11	2	4	4	1	0	0	114
4 · fixes	11	2	4	4	1	0	0	141
5 · fixes	11	3	3	4	1	0	0	165
6 · fixes	10	3	3	3	1	0	1	186
7 · fixes	7	1	2	3	1	0	0	195
8 · fixes	6	2	2	2	0	0	0	202
9 · fixes	12	4	3	2	3	0	2	213
Total	99	21	33	32	13	0	3	—

How a single iteration works

The 8-step loop, end-to-end, on every iteration.

Researcher proposes; Builder produces; Verifier signs a cert on a structurally different checkpoint (Sonnet-class verifying Opus-class); Librarian writes to candidate stores; Auditor signs a why-doc with a hash precondition; Promote is a CAS-style atomic flip. If the live store moved between why-doc signing and promote attempt, the promotion is rolled back atomically.
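The hash precondition on the final step can be sketched in a few lines. This is a minimal illustration, not the CSIS implementation: `SketchStore`, `content_hash`, and `StaleWhyDoc` are hypothetical names, and the store is a plain in-memory dict. The sketch refuses a stale promotion before any write happens, which is the simplest way to get the same all-or-nothing outcome the text describes.

```python
import hashlib
import json
import threading


def content_hash(store: dict) -> str:
    """Stable SHA-256 over a JSON-serialisable store (hypothetical helper)."""
    blob = json.dumps(store, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class StaleWhyDoc(Exception):
    """Raised when the live store changed after the why-doc was signed."""


class SketchStore:
    """Hypothetical stand-in for a memory store; not the CSIS MemoryStore."""

    def __init__(self, live: dict):
        self._lock = threading.Lock()  # one writer at a time
        self.live = dict(live)

    def promote(self, candidate: dict, expected_live_hash: str) -> None:
        with self._lock:
            # Precondition: live must still be what the Auditor signed against.
            if content_hash(self.live) != expected_live_hash:
                raise StaleWhyDoc("live store moved since signing")
            # Single reference swap: readers see the old or the new store.
            self.live = dict(candidate)


store = SketchStore({"rule": "v1"})
signed_against = content_hash(store.live)  # hash recorded in the why-doc

store.promote({"rule": "v2"}, signed_against)  # precondition holds

try:
    store.promote({"rule": "v3"}, signed_against)  # live moved to v2
    stale_rejected = False
except StaleWhyDoc:
    stale_rejected = True
```

The second promotion fails because it was signed against a store that no longer exists, so the live store stays at `v2`.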

What the cycles kept teaching

Two patterns showed up over and over.

Cycles 4-9 each found that the previous cycle's pivot was the right concept at the wrong abstraction layer, and the next cycle moved it.

cycle 8 → cycle 9

Identity beats timing

Cycle 8 detected "which iteration wrote this candidate?" via a pre-consolidate snapshot diff. Cycle 9 found the snapshot has a race window — a sibling iteration writing a same-id candidate between snapshot and cleanup is indistinguishable from "introduced by this iteration."

The fix that ended the arms race wasn't a wider snapshot — it was a writer_iteration_id field stamped on every candidate at write_candidate time. Cleanup filters by stamp. Race-free under any concurrency model.
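A minimal sketch of the stamp-and-filter idea, assuming an in-memory list as the candidate store. Only `writer_iteration_id` and `write_candidate` come from the text above; `SketchCandidateStore`, `Candidate`, and `cleanup` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    payload: str
    writer_iteration_id: str  # stamped once, at write time


class SketchCandidateStore:
    """Hypothetical store; the real CSIS store is not shown in this page."""

    def __init__(self):
        self._items = []

    def write_candidate(self, candidate_id, payload, iteration_id):
        self._items.append(Candidate(candidate_id, payload, iteration_id))

    def cleanup(self, iteration_id):
        """Drop only what this iteration wrote, regardless of timing."""
        kept = [c for c in self._items if c.writer_iteration_id != iteration_id]
        removed = len(self._items) - len(kept)
        self._items = kept
        return removed

    def writers(self):
        return [c.writer_iteration_id for c in self._items]


candidates = SketchCandidateStore()
candidates.write_candidate("c-1", "draft from A", iteration_id="iter-A")
# A sibling writes the same candidate id inside the old race window.
candidates.write_candidate("c-1", "draft from B", iteration_id="iter-B")

removed = candidates.cleanup("iter-A")
```

Because cleanup asks "who wrote this?" rather than "what appeared since my snapshot?", the sibling's same-id candidate survives no matter when it was written.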

cycle 8 → cycle 9

Chokepoints beat perimeters

Cycle 8 added a type(...) is _BackendTracker check at Daemon.__init__ to defeat subclass-shaped bypasses of LLM metering. Cycle 9 found three production scripts (burst.py, loop.py, demo_pr_scenario.py) constructed the inner Coordinator directly with a raw backend.

The fix moved the check into Coordinator.__init__ (the actual single chokepoint every LLM call passes through) and added property setters that re-validate on every setattr. Single chokepoint beats perimeter fencing.
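The chokepoint pattern can be sketched as a property whose setter repeats the exact-type check. The class bodies below are assumptions: only the names `Coordinator` and `_BackendTracker` and the `type(...) is` check come from the text, while `RawBackend` and `complete` are hypothetical.

```python
class _BackendTracker:
    """Metering wrapper (hypothetical body; only the name is from the text)."""

    def __init__(self, inner):
        self._inner = inner
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1  # every call is counted before delegating
        return self._inner.complete(prompt)


class RawBackend:
    """Hypothetical unmetered backend."""

    def complete(self, prompt):
        return f"echo: {prompt}"


class Coordinator:
    def __init__(self, backend):
        self.backend = backend  # routed through the setter below

    @property
    def backend(self):
        return self._backend

    @backend.setter
    def backend(self, value):
        # Exact-type check: a subclass of _BackendTracker is rejected too.
        if type(value) is not _BackendTracker:
            raise TypeError("backend must be wrapped in _BackendTracker")
        self._backend = value


coordinator = Coordinator(_BackendTracker(RawBackend()))
coordinator.backend.complete("hi")

subclass_instance = type("Sub", (_BackendTracker,), {})(RawBackend())
rejected = []
for bad in (RawBackend(), subclass_instance):
    try:
        coordinator.backend = bad  # reassignment is re-validated
    except TypeError:
        rejected.append(type(bad).__name__)
```

This sketch only covers construction and normal attribute assignment. Writing to the private `_backend` attribute directly is outside what an in-process check like this can stop.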

Beyond the nine cycles — distributional graders

Outcomes-based eval where the answer is a number with uncertainty.

Rubric eval (HealthBench-style, LLM-Rubric, CSIS V1) collapses every grader to passed: bool. That fits PR maintenance, lint pipelines, and CI gates. It does not fit medical image segmentation, orthopedic reconstruction, calibration, or any domain whose acceptance criterion is a continuous metric over a sample distribution. CSIS ships the missing layer.

the gap

Rubric eval can't carry CIs, slices, or sample size

A model with mean Dice 0.89 could be 0.89 ± 0.02 (excellent) or 0.89 ± 0.18 (one in three cases is dangerous). A rubric grader doesn't surface that. A global mean of 0.89 can hide a 0.71 on the pancreas — clinically lethal. A boolean grader can't see the slice.

the fix

DistributionalGraderResult with conservative CI semantics

Point estimate + 95% bootstrap CI + per-slice breakdown. Pass rule is conservative: lower CI bound must clear the threshold for higher-is-better metrics; upper CI bound must stay under for lower-is-better. A model whose true performance might be below the bar doesn't auto-promote.

# Per-case Dice with per-organ slice breakdown
# (pred and gold are the predicted and ground-truth masks, loaded elsewhere)
from csis.verification.distributional_graders import DiceGrader, Sample

grader = DiceGrader(threshold=0.85, n_bootstrap=1000)
result = grader.evaluate([
    Sample(case_id="c-042", payload={"pred_mask": pred, "true_mask": gold},
           slices={"organ": "liver", "modality": "CT"}),
    # ... 522 more cases
])

# result.point_estimate = 0.892
# result.ci_lower / ci_upper = 0.871 / 0.913
# result.passed = True  (lower CI bound clears the 0.85 threshold)
# result.slices = [organ=liver: 0.94 [0.91, 0.96] PASS,
#                  organ=pancreas: 0.71 [0.66, 0.76] FAIL, ...]

Concrete graders shipped: DiceGrader, IoUGrader, LandmarkErrorGrader, AssdGrader. Pure stdlib (no numpy). The cert (VerifierCertificate.distributional_results) carries both rubric and distributional results side-by-side — hash-preconditioned promotion semantics carry through unchanged.
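To make the conservative pass rule concrete, here is a stdlib-only sketch of a percentile bootstrap over per-case scores. It is not the `DiceGrader` implementation: the function names, the percentile method, the fixed seed, and the `>=` comparison are all assumptions chosen for illustration.

```python
import random
import statistics


def dice(pred: set, true: set) -> float:
    """Dice coefficient over sets of foreground voxel indices."""
    if not pred and not true:
        return 1.0
    return 2 * len(pred & true) / (len(pred) + len(true))


def bootstrap_ci(scores, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_bootstrap)
    )
    lo = means[round((alpha / 2) * n_bootstrap)]
    hi = means[round((1 - alpha / 2) * n_bootstrap) - 1]
    return lo, hi


def passes(scores, threshold, higher_is_better=True):
    """Conservative rule: the pessimistic CI bound must meet the threshold."""
    lo, hi = bootstrap_ci(scores)
    return lo >= threshold if higher_is_better else hi <= threshold


# Two hypothetical models with the same mean Dice of 0.89.
tight = [0.87, 0.89, 0.91] * 20       # every case is close to the mean
wide = [0.62, 0.98, 0.98, 0.98] * 2   # one case in four is far below it

tight_ok = passes(tight, threshold=0.85)
wide_ok = passes(wide, threshold=0.85)
```

Both score lists average 0.89, but only `tight` passes: the lower bound for `wide` falls below 0.85, so a rule that looked at the mean alone would have promoted it. A per-slice breakdown would apply the same function to each group of cases separately.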

Full design rationale, literature anchor, and what Anthropic's Managed Agents could ship to enable this natively: brain/research/02-distributional-graders.md

Quick start

Runs end-to-end on a mock backend. No API key required.

# Get the code
git clone https://github.com/jim4226/CSIS
cd CSIS
pip install pydantic pytest

# Run the test suite (246 passing, 4 skipped)
python -m pytest tests/ -v

# One full 8-step iteration end-to-end, mock backend
python -m csis.loop

# Walk through the 5-scenario PR-maintenance benchmark
python scripts/demo_pr_scenario.py --clean

# Run the 24/7 daemon (foreground; Ctrl-C to stop)
python -m csis.daemon --backend mock --rate-per-hour 60

Load-bearing safety properties

Each property is enforced as code and gated by a regression test.

Property	Implementation
Capability cannot grow faster than oversight	Phase-0 hard ceiling = T1; T2+ rejected at the call site
Memory mutation is reversible	Candidate stores + `MemoryStore.promote()` is the only path to live
Cross-checkpoint verification	`assert_cross_checkpoint` requires ≥2 distinct identity components
Grader integrity	Pinned source-hash check at every cert build
Audit-only structured query	`structured_query()` allow-lists trusted producers only
Shutdown enforced at substrate	`ShutdownToken.halt()` raises `HaltSignal` on next iteration
Atomic promotion under contention	Single-writer lock + hash-preconditioned CAS
Wrapped-backend invariant (LLM metering can't be bypassed)	`Coordinator.__init__` demands `_BackendTracker`; property setter re-validates on every reassignment
TierMismatch cleanup is race-free	`writer_iteration_id` stamp on every candidate at write_candidate time
No lost spend under lock contention	`record()` appends to WAL on LockUnavailable; next successful `record()` drains it
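The last row of the table can be sketched as follows. This is an assumption-heavy illustration: in-memory values stand in for the durable ledger and the WAL file, and `SketchLedger` and `ScriptedLock` are hypothetical names. Only `record()` and `LockUnavailable` are named in the table.

```python
class LockUnavailable(Exception):
    """Raised when the ledger lock cannot be acquired right now."""


class ScriptedLock:
    """Test double: acquire() fails while `busy` is True."""

    def __init__(self):
        self.busy = False

    def acquire(self):
        if self.busy:
            raise LockUnavailable()

    def release(self):
        pass


class SketchLedger:
    def __init__(self, lock):
        self._lock = lock
        self.total_cents = 0  # stands in for the durable ledger
        self.wal = []         # stands in for an append-only WAL file

    def record(self, cents):
        try:
            self._lock.acquire()
        except LockUnavailable:
            self.wal.append(cents)  # park the spend instead of dropping it
            return
        try:
            # Drain parked entries first so no earlier spend is lost.
            self.total_cents += sum(self.wal) + cents
            self.wal.clear()
        finally:
            self._lock.release()


lock = ScriptedLock()
ledger = SketchLedger(lock)

lock.busy = True
ledger.record(5)          # contended: goes to the WAL
parked = list(ledger.wal)

lock.busy = False
ledger.record(7)          # succeeds and drains the WAL
```

After the second call the ledger holds 12 cents and the WAL is empty. A file-backed version would also need durable appends and crash recovery, which this sketch leaves out.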

Cycle 9 also produced two honest deferrals: H2 (closure-cell mutation defeats any pure-Python in-process guard) and H11 (POSIX unlink-during-lock race, unverified on Windows). Both require process-level isolation to close, planned for Phase 1.
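The H2 deferral rests on a general property of Python rather than on anything specific to CSIS: state captured in a closure can be rewritten by any code running in the same process. The snippet below is a generic illustration with hypothetical names, not a reproduction of the actual finding.

```python
def make_metered_call(limit):
    """Return a function that refuses to spend past `limit`."""
    spent = 0

    def call(cost):
        nonlocal spent
        if spent + cost > limit:
            raise RuntimeError("budget exceeded")
        spent += cost
        return spent

    return call


call = make_metered_call(limit=10)
call(8)  # spent is now 8; another 8 would exceed the limit of 10

# Same-process code can reach into the closure and rewrite the guard's state.
cells = dict(zip(call.__code__.co_freevars, call.__closure__))
cells["limit"].cell_contents = 10**9  # cell contents are writable (Python 3.7+)

after_tamper = call(8)  # would have raised against the original limit
```

No check written in the same interpreter can rule this out, which is why the deferral points at process-level isolation.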

What this is NOT a claim about

Where the system genuinely doesn't deliver yet.

Not a proof of safety. 246 tests and 99 findings (96 closed, 3 deferred) are evidence of careful engineering, not certification of safe behavior under adversarial conditions. Cycle 9 H2/H11 deferrals are explicit about this.
The LLM backend is mock by default. Real Anthropic calls are gated behind --backend anthropic + an ANTHROPIC_API_KEY + cost ceilings + WAL-backed budget metering. No agent is currently improving itself at production scale on this codebase.
"Self-improving" in Phase 0 means the infrastructure for self-improvement is working and tested. Real learning happens via scripts/burst.py on demand. The longer-arc framing is in CSIS-architecture.html Appendix A.