Each cycle: parallel red-team agents attack the prior cycle's fixes, with file:line evidence and reproducible attack snippets. Fixes land in code with regression tests. Snapshots get written to brain/ so the trail survives any contributor — human or AI — picking up cold.
| Cycle | Findings | Critical | High | Medium | Low | Open | Deferred | Tests after |
|---|---|---|---|---|---|---|---|---|
| 1 · pre-impl | 18 | 2 | 6 | 7 | 3 | 0 | 0 | 52 |
| 2 · post-impl | 13 | 2 | 6 | 3 | 2 | 0 | 0 | 78 |
| 3 · deltas | 11 | 2 | 4 | 4 | 1 | 0 | 0 | 114 |
| 4 · fixes | 11 | 2 | 4 | 4 | 1 | 0 | 0 | 141 |
| 5 · fixes | 11 | 3 | 3 | 4 | 1 | 0 | 0 | 165 |
| 6 · fixes | 10 | 3 | 3 | 3 | 1 | 0 | 1 | 186 |
| 7 · fixes | 7 | 1 | 2 | 3 | 1 | 0 | 0 | 195 |
| 8 · fixes | 6 | 2 | 2 | 2 | 0 | 0 | 0 | 202 |
| 9 · fixes | 12 | 4 | 3 | 2 | 3 | 0 | 2 | 213 |
| Total | 99 | 21 | 33 | 32 | 13 | 0 | 3 | — |
Researcher proposes; Builder produces; Verifier signs a cert on a structurally different checkpoint (Sonnet-class verifying Opus-class); Librarian writes to candidate stores; Auditor signs a why-doc with a hash precondition; Promote is a CAS-style atomic flip. If the live store moved between why-doc signing and promote attempt, the promotion is rolled back atomically.
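The hash-preconditioned promote step can be sketched as a compare-and-swap. This is a minimal illustration, not the CSIS implementation: `MiniStore`, `store_hash`, and `StaleWhyDocError` are hypothetical names, and the sketch rejects a stale promotion before writing anything rather than rolling back a partial write.

```python
import hashlib
import json
import threading


class StaleWhyDocError(Exception):
    """Raised when the live store changed after the why-doc was signed."""


def store_hash(store):
    """Deterministic content hash of a store (sorted keys, stable JSON)."""
    blob = json.dumps(store, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class MiniStore:
    """Hypothetical stand-in for a memory store; not the CSIS MemoryStore."""

    def __init__(self, live):
        self._live = dict(live)
        self._lock = threading.Lock()  # one writer at a time

    def live_hash(self):
        with self._lock:
            return store_hash(self._live)

    def snapshot(self):
        with self._lock:
            return dict(self._live)

    def promote(self, candidate, expected_live_hash):
        """Flip candidate to live only if live still matches the signed hash."""
        with self._lock:
            if store_hash(self._live) != expected_live_hash:
                # Nothing has been written yet, so there is nothing to undo.
                raise StaleWhyDocError("live store moved since why-doc signing")
            self._live = dict(candidate)  # single reference swap


store = MiniStore({"rule": "v1"})
signed_hash = store.live_hash()             # the why-doc is signed against this
store.promote({"rule": "v2"}, signed_hash)  # succeeds: live is unchanged

try:
    store.promote({"rule": "v3"}, signed_hash)  # stale: live is now v2
    stale_rejected = False
except StaleWhyDocError:
    stale_rejected = True
```

Checking the hash and swapping the reference under the same lock is what makes the flip all-or-nothing in this sketch: a second promotion signed against the old hash never touches the live store.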
Cycles 4-9 each found the previous cycle's pivot was the right concept at the wrong abstraction layer — and the next cycle moved it. Two patterns kept reappearing:
Cycle 8 answered "which iteration wrote this candidate?" with a pre-consolidate snapshot diff. Cycle 9 found that the snapshot has a race window: a sibling iteration writing a same-id candidate between snapshot and cleanup is indistinguishable from a candidate "introduced by this iteration."
The fix that ended the arms race wasn't a wider snapshot — it was a writer_iteration_id field stamped on every candidate at write_candidate time. Cleanup filters by stamp. Race-free under any concurrency model.
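A minimal sketch of stamp-based cleanup, assuming an in-memory store. `Candidate`, `CandidateStore`, and the method signatures here are hypothetical; only the `writer_iteration_id` field name and the idea of stamping at write time come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    writer_iteration_id: str  # stamped once, at write time, never inferred later
    body: str


class CandidateStore:
    """Hypothetical in-memory candidate store."""

    def __init__(self):
        self._items = []

    def write_candidate(self, candidate_id, body, iteration_id):
        self._items.append(Candidate(candidate_id, iteration_id, body))

    def cleanup(self, iteration_id):
        """Remove and return only the candidates stamped by this iteration."""
        removed = [c for c in self._items if c.writer_iteration_id == iteration_id]
        self._items = [c for c in self._items if c.writer_iteration_id != iteration_id]
        return removed

    def remaining(self):
        return list(self._items)


store = CandidateStore()
store.write_candidate("c-1", "draft from A", iteration_id="iter-A")
# A sibling iteration writes the same candidate id before A cleans up.
store.write_candidate("c-1", "draft from B", iteration_id="iter-B")

removed = store.cleanup("iter-A")
survivors = store.remaining()
```

Because ownership is recorded on the candidate itself, cleanup never has to reason about what the store looked like at some earlier moment, so the timing of the sibling write does not matter.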
Cycle 8 added a type(...) is _BackendTracker check at Daemon.__init__ to defeat subclass-shaped bypasses of LLM metering. Cycle 9 found three production scripts (burst.py, loop.py, demo_pr_scenario.py) constructed the inner Coordinator directly with a raw backend.
The fix moved the check into Coordinator.__init__ (the actual single chokepoint every LLM call passes through) and added property setters that re-validate on every setattr. Single chokepoint beats perimeter fencing.
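The pattern can be sketched as an exact-type check inside a property setter, so that both construction and later reassignment go through the same validation. `MeteredBackend` and `Chokepoint` are hypothetical stand-ins for `_BackendTracker` and `Coordinator`; the real classes may differ.

```python
class MeteredBackend:
    """Hypothetical metering wrapper; stands in for _BackendTracker."""

    def __init__(self, inner):
        self._inner = inner
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1  # every call is counted before it is forwarded
        return self._inner(prompt)


class Chokepoint:
    """Hypothetical coordinator: validates at init and on every reassignment."""

    def __init__(self, backend):
        self.backend = backend  # routed through the property setter below

    @property
    def backend(self):
        return self._backend

    @backend.setter
    def backend(self, value):
        # Exact type match: isinstance() would also accept subclasses,
        # which could override complete() and skip the metering.
        if type(value) is not MeteredBackend:
            raise TypeError("backend must be exactly MeteredBackend")
        self._backend = value


class SneakyBackend(MeteredBackend):
    def complete(self, prompt):
        return self._inner(prompt)  # skips the counter


coord = Chokepoint(MeteredBackend(str.upper))
reply = coord.backend.complete("hi")

rejected = 0
for bad in (str.upper, SneakyBackend(str.upper)):
    try:
        coord.backend = bad  # raw callable, then a subclass-shaped bypass
    except TypeError:
        rejected += 1
```

Writing to the private `_backend` attribute directly still sidesteps the setter, which is consistent with the point that in-process guards have limits.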
Rubric eval (HealthBench-style, LLM-Rubric, CSIS V1) collapses every grader to passed: bool. That fits PR maintenance, lint pipelines, and CI gates. It does not fit medical image segmentation, orthopedic reconstruction, calibration, or any domain whose acceptance criterion is a continuous metric over a sample distribution. CSIS ships the missing layer.
A model with mean Dice 0.89 could be 0.89 ± 0.02 (excellent) or 0.89 ± 0.18 (one in three cases is dangerous). A rubric grader doesn't surface that. A global mean of 0.89 can hide a 0.71 on the pancreas — clinically lethal. A boolean grader can't see the slice.
Point estimate + 95% bootstrap CI + per-slice breakdown. Pass rule is conservative: lower CI bound must clear the threshold for higher-is-better metrics; upper CI bound must stay under for lower-is-better. A model whose true performance might be below the bar doesn't auto-promote.
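The pass rule can be sketched with a percentile bootstrap over per-case scores, using only the standard library. This is an illustration under stated assumptions, not the CSIS grader: `bootstrap_ci` and `conservative_pass` are hypothetical helpers, the percentile method is one of several bootstrap variants, and treating a bound exactly equal to the threshold as a pass is an assumption.

```python
import random
import statistics


def bootstrap_ci(values, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean; returns (point, lower, upper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap)
    )
    lower = means[int((alpha / 2) * n_bootstrap)]
    upper = means[int((1 - alpha / 2) * n_bootstrap) - 1]
    return statistics.fmean(values), lower, upper


def conservative_pass(ci_lower, ci_upper, threshold, higher_is_better=True):
    """Pass only if the whole interval sits on the right side of the threshold."""
    if higher_is_better:
        return ci_lower >= threshold  # assumption: equality counts as clearing
    return ci_upper <= threshold


tight = [0.87, 0.89, 0.91] * 5  # mean 0.89, low spread
wide = [0.68, 0.99, 1.00] * 5   # mean 0.89, high spread

tight_point, tight_lo, tight_hi = bootstrap_ci(tight)
wide_point, wide_lo, wide_hi = bootstrap_ci(wide)

tight_pass = conservative_pass(tight_lo, tight_hi, threshold=0.85)  # True
wide_pass = conservative_pass(wide_lo, wide_hi, threshold=0.85)     # False
```

Both samples share the same point estimate, yet only the low-spread one clears a 0.85 threshold, which is the distinction a single boolean computed from the mean would miss.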
```python
# Per-case Dice with per-organ slice breakdown
from csis.verification.distributional_graders import DiceGrader, Sample

grader = DiceGrader(threshold=0.85, n_bootstrap=1000)
result = grader.evaluate([
    Sample(case_id="c-042",
           payload={"pred_mask": pred, "true_mask": gold},
           slices={"organ": "liver", "modality": "CT"}),
    # ... 522 more cases
])

# result.point_estimate      = 0.892
# result.ci_lower / ci_upper = 0.871 / 0.913
# result.passed              = True  (lower CI bound clears the 0.85 threshold)
# result.slices = [organ=liver:    0.94 [0.91, 0.96] PASS,
#                  organ=pancreas: 0.71 [0.66, 0.76] FAIL, ...]
```
Concrete graders shipped: DiceGrader, IoUGrader, LandmarkErrorGrader, AssdGrader. Pure stdlib (no numpy). The cert (VerifierCertificate.distributional_results) carries both rubric and distributional results side-by-side — hash-preconditioned promotion semantics carry through unchanged.
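For reference, the per-case overlap metrics are straightforward to compute without numpy. The sketch below assumes masks are flat sequences of 0/1 values of equal length; the actual mask representation used by `DiceGrader` and `IoUGrader` is not specified here, and the convention for two empty masks is an assumption.

```python
def _overlap_counts(pred_mask, true_mask):
    """Return (intersection, pred_size, true_size) for two flat 0/1 masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    pred_size = sum(1 for p in pred_mask if p)
    true_size = sum(1 for t in true_mask if t)
    return intersection, pred_size, true_size


def dice(pred_mask, true_mask):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    inter, pred_size, true_size = _overlap_counts(pred_mask, true_mask)
    if pred_size + true_size == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * inter / (pred_size + true_size)


def iou(pred_mask, true_mask):
    """Intersection over union: |A and B| / |A or B|."""
    inter, pred_size, true_size = _overlap_counts(pred_mask, true_mask)
    union = pred_size + true_size - inter
    if union == 0:
        return 1.0  # same empty-mask assumption as above
    return inter / union


pred = [1, 1, 0, 0, 1]
gold = [1, 0, 0, 1, 1]
dice_score = dice(pred, gold)  # 2 * 2 / (3 + 3)
iou_score = iou(pred, gold)    # 2 / 4
```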
Full design rationale + literature anchor + what Anthropic's Managed Agents could ship to enable this natively: brain/research/02-distributional-graders.md →
```shell
# Get the code
git clone https://github.com/jim4226/CSIS
cd CSIS
pip install pydantic pytest

# Run the test suite (246 passing, 4 skipped)
python -m pytest tests/ -v

# One full 8-step iteration end-to-end, mock backend
python -m csis.loop

# Walk through the 5-scenario PR-maintenance benchmark
python scripts/demo_pr_scenario.py --clean

# Run the 24/7 daemon (foreground; Ctrl-C to stop)
python -m csis.daemon --backend mock --rate-per-hour 60
```
| Property | Implementation |
|---|---|
| Capability cannot grow faster than oversight | Phase-0 hard ceiling = T1; T2+ rejected at the call site |
| Memory mutation is reversible | Candidate stores + MemoryStore.promote() is the only path to live |
| Cross-checkpoint verification | assert_cross_checkpoint requires ≥2 distinct identity components |
| Grader integrity | Pinned source-hash check at every cert build |
| Audit-only structured query | structured_query() allow-lists trusted producers only |
| Shutdown enforced at substrate | ShutdownToken.halt() raises HaltSignal on next iteration |
| Atomic promotion under contention | Single-writer lock + hash-preconditioned CAS |
| Wrapped-backend invariant (LLM metering can't be bypassed) | Coordinator.__init__ demands _BackendTracker; property setter re-validates on every reassignment |
| TierMismatch cleanup is race-free | writer_iteration_id stamp on every candidate at write_candidate time |
| Lost-spend-under-lock-contention | record() appends to WAL on LockUnavailable; next successful record() drains it |
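
The last row of the table can be sketched as follows. This is a single-process toy under stated assumptions: `SpendLedger` is a hypothetical class, amounts are integer cents, and a failed non-blocking acquire stands in for `LockUnavailable`. A production version would also need fsync and protection against a writer appending while the log is being drained.

```python
import json
import os
import tempfile
import threading


class SpendLedger:
    """Hypothetical spend ledger with a write-ahead log for contended writes."""

    def __init__(self, wal_path):
        self._lock = threading.Lock()
        self._wal_path = wal_path
        self.total = 0  # integer cents

    def record(self, amount):
        """Return True if applied to the ledger, False if parked in the WAL."""
        if not self._lock.acquire(blocking=False):
            # Lock is busy: park the spend on disk instead of dropping it.
            with open(self._wal_path, "a", encoding="utf-8") as wal:
                wal.write(json.dumps({"amount": amount}) + "\n")
            return False
        try:
            self._drain_wal()
            self.total += amount
            return True
        finally:
            self._lock.release()

    def _drain_wal(self):
        """Fold parked entries into the total, then remove the log."""
        if not os.path.exists(self._wal_path):
            return
        with open(self._wal_path, encoding="utf-8") as wal:
            for line in wal:
                self.total += json.loads(line)["amount"]
        os.remove(self._wal_path)


wal_path = os.path.join(tempfile.mkdtemp(), "spend.wal")
ledger = SpendLedger(wal_path)

ledger._lock.acquire()       # simulate another writer holding the lock
parked = ledger.record(125)  # contended: goes to the WAL
ledger._lock.release()

applied = ledger.record(300)  # uncontended: drains the WAL, then applies
```

The spend recorded under contention is delayed rather than lost: it shows up in the total as soon as the next uncontended `record()` runs.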
Cycle 9 also produced two honest deferrals: H2 (closure-cell mutation defeats any pure-Python in-process guard) and H11 (POSIX unlink-during-lock race, unverified on Windows). Both require process-level isolation to close, planned for Phase 1. Full cycle trail →
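The H2 deferral rests on a plain CPython fact: closure cells are writable from the same process (the `cell_contents` attribute has been settable since Python 3.7). The toy guard below is hypothetical and unrelated to the CSIS code; it only shows why a limit captured in a closure is not a security boundary.

```python
def make_guard(limit):
    """Return a checker that closes over a spend limit."""
    def check(spend):
        return spend <= limit
    return check


guard = make_guard(100)
before = guard(500)  # False: 500 exceeds the captured limit

# Any code running in the same interpreter can rewrite the captured value.
guard.__closure__[0].cell_contents = 10**9
after = guard(500)   # True: the guard now enforces a different limit
```

Nothing in the guard itself can detect or prevent this, which is why the text above points to process-level isolation rather than a cleverer in-process check.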
--backend anthropic + an ANTHROPIC_API_KEY + cost ceilings + WAL-backed budget metering. No agent is currently improving itself at production scale on this codebase. scripts/burst.py on demand. The longer-arc framing is in CSIS-architecture.html Appendix A.