9 cycles of self-critique against a self-improving agent system

Most agent-framework repos sell a vision. CSIS ships a paper trail of its own failures: parallel red-team agents attack each cycle's fixes, findings land in code with regression tests, and the next cycle usually finds the previous one fixed the bug at the wrong abstraction layer.

phase 0 · runnable | 246 tests passing | 9 critique cycles | 99 findings · 0 open | MIT

The numbers

99 findings across 9 cycles. 96 closed in code, 3 honestly deferred.

Each cycle: parallel red-team agents attack the prior cycle's fixes, with file:line evidence and reproducible attack snippets. Fixes land in code with regression tests. Snapshots get written to brain/ so the trail survives any contributor — human or AI — picking up cold.

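| Cycle | Findings | Critical | High | Medium | Low | Open | Deferred | Tests after |
|---|---|---|---|---|---|---|---|---|
| 1 · pre-impl | 18 | 2 | 6 | 7 | 3 | 0 | 0 | 52 |
| 2 · post-impl | 13 | 2 | 6 | 3 | 2 | 0 | 0 | 78 |
| 3 · deltas | 11 | 2 | 4 | 4 | 1 | 0 | 0 | 114 |
| 4 · fixes | 11 | 2 | 4 | 4 | 1 | 0 | 0 | 141 |
| 5 · fixes | 11 | 3 | 3 | 4 | 1 | 0 | 0 | 165 |
| 6 · fixes | 10 | 3 | 3 | 3 | 1 | 0 | 1 | 186 |
| 7 · fixes | 7 | 1 | 2 | 3 | 1 | 0 | 0 | 195 |
| 8 · fixes | 6 | 2 | 2 | 2 | 0 | 0 | 0 | 202 |
| 9 · fixes | 12 | 4 | 3 | 2 | 3 | 0 | 2 | 213 |
| Total | 99 | 21 | 33 | 32 | 13 | 0 | 3 | |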


How a single iteration works

The 8-step loop, end-to-end, on every iteration.

Researcher proposes; Builder produces; Verifier signs a cert on a structurally different checkpoint (Sonnet-class verifying Opus-class); Librarian writes to candidate stores; Auditor signs a why-doc with a hash precondition; Promote is a CAS-style atomic flip. If the live store moved between why-doc signing and promote attempt, the promotion is rolled back atomically.

Iteration flow (diagram): Researcher (T0 · plan) → Builder (T1 · artifact) → Verifier (V1 + V2 · cert) → Librarian (T0 · candidate) → Auditor (why-doc · hash) → CAS promote. Cross-checkpoint: Sonnet-class verifies Opus-class. Rollback if the cert fails, a tripwire fires, or the hash precondition is stale.
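
The promote step described above can be illustrated with a minimal sketch of a hash-preconditioned compare-and-swap. This is not the CSIS implementation: `ToyStore`, `StalePrecondition`, and `content_hash` are hypothetical names invented for the example, and the store is a plain dict.

```python
import hashlib
import json
import threading


class StalePrecondition(Exception):
    """Raised when the live store changed after the why-doc was signed."""


def content_hash(store: dict) -> str:
    # Canonical JSON so equal contents always hash identically.
    blob = json.dumps(store, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class ToyStore:
    """Hypothetical stand-in for a live memory store (assumption, not CSIS code)."""

    def __init__(self, live: dict):
        self._live = dict(live)
        self._lock = threading.Lock()  # single-writer lock

    def live_hash(self) -> str:
        with self._lock:
            return content_hash(self._live)

    def snapshot(self) -> dict:
        with self._lock:
            return dict(self._live)

    def promote(self, candidate: dict, expected_live_hash: str) -> None:
        # Compare and swap both happen under one lock, so no writer
        # can slip in between the check and the flip.
        with self._lock:
            if content_hash(self._live) != expected_live_hash:
                raise StalePrecondition("live store moved since signing")
            self._live = dict(candidate)


store = ToyStore({"rule": "v1"})
signed_hash = store.live_hash()             # recorded when the why-doc is signed
store.promote({"rule": "v2"}, signed_hash)  # live unchanged since signing: succeeds

try:
    store.promote({"rule": "v3"}, signed_hash)  # stale hash: live is now v2
    outcome = "promoted"
except StalePrecondition:
    outcome = "rejected"
```

In this sketch a stale precondition leaves the live store untouched, which is the simplest form of "rolled back atomically": the flip either happens in full or not at all.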

What the cycles kept teaching

Two patterns showed up over and over.

Cycles 4-9 each found that the previous cycle's pivot was the right concept at the wrong abstraction layer, and the next cycle moved it. The step from cycle 8 to cycle 9 shows both patterns:

cycle 8 → cycle 9
Identity beats timing

Cycle 8 detected "which iteration wrote this candidate?" via a pre-consolidate snapshot diff. Cycle 9 found the snapshot has a race window — a sibling iteration writing a same-id candidate between snapshot and cleanup is indistinguishable from "introduced by this iteration."

The fix that ended the arms race wasn't a wider snapshot — it was a writer_iteration_id field stamped on every candidate at write_candidate time. Cleanup filters by stamp. Race-free under any concurrency model.
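
A minimal sketch of the stamping idea, assuming a toy in-memory store. The `Candidate` and `CandidateStore` classes are hypothetical; only the `writer_iteration_id` field and the `write_candidate` name come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    payload: str
    writer_iteration_id: str  # stamped once, at write time


class CandidateStore:
    """Hypothetical candidate store (assumption, not the CSIS API)."""

    def __init__(self):
        self._rows = []

    def write_candidate(self, candidate_id, payload, iteration_id):
        self._rows.append(Candidate(candidate_id, payload, iteration_id))

    def cleanup(self, iteration_id):
        # Drop only rows this iteration wrote; ownership is read from
        # the stamp, not inferred from when the row appeared.
        before = len(self._rows)
        self._rows = [c for c in self._rows if c.writer_iteration_id != iteration_id]
        return before - len(self._rows)

    def rows(self):
        return list(self._rows)


store = CandidateStore()
store.write_candidate("c-1", "from A", iteration_id="iter-A")
store.write_candidate("c-1", "from B", iteration_id="iter-B")  # sibling, same id

removed = store.cleanup("iter-A")
survivors = store.rows()
```

The sibling's same-id candidate survives because cleanup never asks when a row appeared, only who wrote it.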

cycle 8 → cycle 9
Chokepoints beat perimeters

Cycle 8 added a type(...) is _BackendTracker check at Daemon.__init__ to defeat subclass-shaped bypasses of LLM metering. Cycle 9 found that three production scripts (burst.py, loop.py, demo_pr_scenario.py) constructed the inner Coordinator directly with a raw backend, so the Daemon-level check never ran for them.

The fix moved the check into Coordinator.__init__ (the actual single chokepoint every LLM call passes through) and added property setters that re-validate on every setattr. Single chokepoint beats perimeter fencing.
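
As a rough sketch of that pattern, the example below validates the backend in a property setter so that construction and later reassignment go through the same check. `MeteredBackend` is a hypothetical stand-in for `_BackendTracker`, and this `Coordinator` is a toy, not the CSIS class.

```python
class MeteredBackend:
    """Hypothetical metering wrapper (assumption, standing in for _BackendTracker)."""

    def __init__(self, inner):
        self._inner = inner
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1  # every call is counted before it reaches the model
        return self._inner(prompt)


class Coordinator:
    def __init__(self, backend):
        self.backend = backend  # routed through the setter below

    @property
    def backend(self):
        return self._backend

    @backend.setter
    def backend(self, value):
        # Exact-type check: a subclass could override complete() and skip metering.
        if type(value) is not MeteredBackend:
            raise TypeError("backend must be wrapped in MeteredBackend")
        self._backend = value

    def ask(self, prompt):
        return self.backend.complete(prompt)


coord = Coordinator(MeteredBackend(lambda p: p.upper()))
reply = coord.ask("hi")

try:
    coord.backend = lambda p: p  # raw, unmetered backend
    rejected = False
except TypeError:
    rejected = True
```

Writing to the private `_backend` attribute directly would still bypass the setter in this sketch; an in-process check of this kind narrows the paths, it does not seal them.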

Beyond the nine cycles — distributional graders

Outcomes-based eval where the answer is a number with uncertainty.

Rubric eval (HealthBench-style, LLM-Rubric, CSIS V1) collapses every grader to passed: bool. That fits PR maintenance, lint pipelines, and CI gates. It does not fit medical image segmentation, orthopedic reconstruction, calibration, or any domain whose acceptance criterion is a continuous metric over a sample distribution. CSIS ships the missing layer.

the gap
Rubric eval can't carry CIs, slices, or sample size

A model with mean Dice 0.89 could be 0.89 ± 0.02 (excellent) or 0.89 ± 0.18 (one in three cases is dangerous). A rubric grader doesn't surface that. A global mean of 0.89 can hide a 0.71 on the pancreas — clinically lethal. A boolean grader can't see the slice.

the fix
DistributionalGraderResult with conservative CI semantics

Point estimate + 95% bootstrap CI + per-slice breakdown. Pass rule is conservative: lower CI bound must clear the threshold for higher-is-better metrics; upper CI bound must stay under for lower-is-better. A model whose true performance might be below the bar doesn't auto-promote.

# Per-case Dice with per-organ slice breakdown
from csis.verification.distributional_graders import DiceGrader, Sample

grader = DiceGrader(threshold=0.85, n_bootstrap=1000)
result = grader.evaluate([
    Sample(case_id="c-042", payload={"pred_mask": pred, "true_mask": gold},
           slices={"organ": "liver", "modality": "CT"}),
    # ... 522 more cases
])

# result.point_estimate = 0.892
# result.ci_lower / ci_upper = 0.871 / 0.913
# result.passed = True  (lower CI bound clears the 0.85 threshold)
# result.slices = [organ=liver: 0.94 [0.91, 0.96] PASS,
#                  organ=pancreas: 0.71 [0.66, 0.76] FAIL, ...]

Concrete graders shipped: DiceGrader, IoUGrader, LandmarkErrorGrader, AssdGrader. Pure stdlib (no numpy). The cert (VerifierCertificate.distributional_results) carries both rubric and distributional results side-by-side — hash-preconditioned promotion semantics carry through unchanged.
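
The conservative pass rule can be sketched in stdlib-only Python. This is an illustration under assumptions, not the shipped grader: `bootstrap_ci` and `conservative_pass` are hypothetical helpers, the interval is a plain percentile bootstrap of the mean, and "clears the threshold" is read as `>=`.

```python
import random
import statistics


def bootstrap_ci(values, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    lo_idx = round((alpha / 2) * (n_bootstrap - 1))
    hi_idx = round((1 - alpha / 2) * (n_bootstrap - 1))
    return means[lo_idx], means[hi_idx]


def conservative_pass(values, threshold, higher_is_better=True, **kwargs):
    lo, hi = bootstrap_ci(values, **kwargs)
    point = statistics.fmean(values)
    # Higher-is-better: the lower bound must clear the bar.
    # Lower-is-better: the upper bound must stay under it.
    passed = lo >= threshold if higher_is_better else hi <= threshold
    return point, lo, hi, passed


# Two made-up score sets with the same mean (0.89) but different spread.
tight = [0.88, 0.90, 0.89, 0.91, 0.87] * 20
wide = [0.99, 0.55, 0.98, 0.97, 0.96] * 4

tight_point, tight_lo, tight_hi, tight_passed = conservative_pass(tight, 0.85)
wide_point, wide_lo, wide_hi, wide_passed = conservative_pass(wide, 0.85)
```

Both sets report the same point estimate, but only the tight one has a lower bound above 0.85, so only it passes. A boolean check on the mean alone would accept both.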

Full design rationale + literature anchor + what Anthropic's Managed Agents could ship to enable this natively: brain/research/02-distributional-graders.md →

Quick start

Runs end-to-end on a mock backend. No API key required.

# Get the code
git clone https://github.com/jim4226/CSIS
cd CSIS
pip install pydantic pytest

# Run the test suite (246 passing, 4 skipped)
python -m pytest tests/ -v

# One full 8-step iteration end-to-end, mock backend
python -m csis.loop

# Walk through the 5-scenario PR-maintenance benchmark
python scripts/demo_pr_scenario.py --clean

# Run the 24/7 daemon (foreground; Ctrl-C to stop)
python -m csis.daemon --backend mock --rate-per-hour 60

Load-bearing safety properties

Each property is enforced as code and gated by a regression test.

| Property | Implementation |
|---|---|
| Capability cannot grow faster than oversight | Phase-0 hard ceiling = T1; T2+ rejected at the call site |
| Memory mutation is reversible | Candidate stores + MemoryStore.promote() is the only path to live |
| Cross-checkpoint verification | assert_cross_checkpoint requires ≥2 distinct identity components |
| Grader integrity | Pinned source-hash check at every cert build |
| Audit-only structured query | structured_query() allow-lists trusted producers only |
| Shutdown enforced at substrate | ShutdownToken.halt() raises HaltSignal on next iteration |
| Atomic promotion under contention | Single-writer lock + hash-preconditioned CAS |
| Wrapped-backend invariant (LLM metering can't be bypassed) | Coordinator.__init__ demands _BackendTracker; property setter re-validates on every reassignment |
| TierMismatch cleanup is race-free | writer_iteration_id stamp on every candidate at write_candidate time |
| Lost-spend-under-lock-contention | record() appends to WAL on LockUnavailable; next successful record() drains it |
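
The last row (park spend in a WAL when the lock is unavailable, drain it on the next successful write) might look roughly like the sketch below. `SpendLedger` is hypothetical, the WAL is an in-memory list rather than a file, and a failed non-blocking acquire stands in for `LockUnavailable`.

```python
import threading


class SpendLedger:
    """Hypothetical ledger (assumption); the WAL is an in-memory list for brevity."""

    def __init__(self):
        self._lock = threading.Lock()
        self._wal = []                       # spend that could not take the lock
        self._wal_guard = threading.Lock()   # protects the WAL list itself
        self.total = 0.0

    def record(self, amount):
        if not self._lock.acquire(blocking=False):
            # Lock unavailable: park the spend instead of dropping it.
            with self._wal_guard:
                self._wal.append(amount)
            return False
        try:
            with self._wal_guard:
                pending, self._wal = self._wal, []
            self.total += amount + sum(pending)  # drain parked spend
            return True
        finally:
            self._lock.release()


ledger = SpendLedger()
ledger._lock.acquire()             # simulate another writer holding the lock
parked = ledger.record(0.25)       # cannot take the lock: goes to the WAL
ledger._lock.release()
drained = ledger.record(0.50)      # succeeds and drains the parked 0.25
```

The point of the pattern is that contention delays accounting but never loses it: the parked 0.25 shows up in the total as soon as any later write gets the lock.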

Cycle 9 also produced two honest deferrals: H2 (closure-cell mutation defeats any pure-Python in-process guard) and H11 (POSIX unlink-during-lock race, unverified on Windows). Both require process-level isolation to close, planned for Phase 1. Full cycle trail →
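
To see why H2 is hard to close inside one process, consider this toy guard. It is not CSIS code; `make_guard` is a hypothetical example that relies only on the fact that closure cells are writable from ordinary Python (CPython 3.7 and later).

```python
def make_guard(limit):
    # `limit` lives in a closure cell, which looks private from the outside.
    def check(spend):
        return spend <= limit
    return check


check = make_guard(10)
before = check(1_000)  # over the limit, so the guard refuses

# Any code in the same process can rewrite the captured value in place.
check.__closure__[0].cell_contents = 10**9
after = check(1_000)   # the same function object now accepts the spend
```

No name was rebound and no attribute was set on a guarded object, so type checks and property setters never see the change. That is the sense in which only process-level isolation closes this class of bypass.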

What this is NOT a claim about

Where the system genuinely doesn't deliver yet.