The architecture
in six diagrams.

CSIS is a coordinator-led multi-agent system designed for one purpose: lay down the safety substrate that lets an agent organism improve itself continuously without auto-promoting bad updates. Six structural commitments make that possible — each diagram below is one of them.

246 regression tests 9 critique cycles 99 findings · 0 open MIT licensed
contents
  1. Threat model — what we assume the agent will try
  2. The 8-layer stack — interactive
  3. The 6-level trust lattice
  4. The 5-tier memory hierarchy
  5. V1 + V2 cross-checkpoint verification
  6. Hash-preconditioned CAS promotion
  7. Sleep / Dreams consolidation
  8. How you know it's working — observability
  9. Two patterns from 9 cycles

The adversaries the substrate is designed to catch.

Every invariant below exists because of a specific class of failure. The table makes the threat model explicit: what the adversary tries · what invariant is at risk · which layer catches it. The point isn't that the substrate is unbreakable — it's that each defense lives at the layer where the attack can be enumerated, not in agent prompts that can be talked around.

Adversary capability Invariant at risk Layer that catches it Observed via (§7)
Prompt-injected request to bypass a safety property Safety is enforced at the substrate, not as agent-prompt text. The agent layer cannot revoke it because the agent layer doesn't own it. L7 · Constitution + TierGuard · csis/safety/ Tripwire fires · event-log integrity
Builder model argues it should self-verify ("just check your own work") Verifier checkpoint must differ from builder checkpoint by ≥2 identity components. L4 · Auditor refuses cert if cert.builder == cert.verifier · cycle-1 F1 Cross-checkpoint refusal count
Same-model collusion: builder + verifier share blind spots in build Different model_id means different training, different priors, different blind spots. L4 · assert_cross_checkpoint() at cert build site Cross-checkpoint refusal count
Race: sibling iteration mutates store between sign and promote Promotion is atomic vs. exact store state at sign time. L2 · hash-CAS store.promote(hash_precondition=...) Hash-CAS failure rate
Bad write becomes truth on next read; poisoning compounds Candidate writes never appear in live reads; only path is signed promotion. L2 · candidate/live split · writer_iteration_id stamp (H4) Distributional cert per-slice CI · regression tests
Dreamed-up "improvement" auto-promotes without a verifier Sleep/Dreams runs offline; output is candidates, not commits. L5 · Dreams pipeline · no tool access, no agent prompts Cross-checkpoint refusal · cert per-slice
Trust-level claim ("I verified") gets read as authorization Trust is an integer ladder. Only path up is through an explicit gate. L2 · TrustLevel IntEnum + upgrade-gate checks Event log (upgrade-gate entries)
Subclass attack: expose _wrapped backend to extract underlying client Every LLM call routes through one wrap site (finite chokepoint, not perimeter). L0/L1 · Coordinator.__init__ refuses any backend that isn't _BackendTracker (cycle-9 H1) Regression tests (H1 test)
Forensic replay claims wrong sequence of events Hash-chained event log; any reorder/insertion breaks the chain. L0 · EventLog.verify_chain() + per-event salt Event-log integrity check
V1 graders silently drift mid-task (test mutation, threshold relaxation) Graders are source-hashed at task start; mid-task drift rejected. L4 · pinned grader_source_hash in cert Regression tests · cert grader-hash audit
why explicit
Spelling out the adversary is the spine of every other section

The invariants below aren't decorative — each one is the answer to a specific row in the table above. Without the threat model, a reader has to reverse-engineer the design from inference. With it, every later diagram becomes "oh, this is how we catch row N."

The list isn't exhaustive (no list of attacks ever is). It's the set of attack classes that fell out of nine cycles of red-team → fix → regression-test. New rows get added when a critique cycle finds something not on the list.

what we don't claim
This isn't a proof of safety

Listing what we catch is not the same as listing what we'd miss. Two known classes the substrate does not defend against in Phase 0: closure-cell mutation by privileged code in the same Python process (H2), and POSIX unlink-during-lock by another OS user (H11). Both are deferred to Phase 1 (process-level isolation).

Honesty about the perimeter matters more than the size of the perimeter. Anything on the deferred list is in CYCLES.md with its rationale.

Eight layers. One owner per invariant.

Click any layer to see what's inside it. The stack is intentionally shallow — every layer above relies on substrate-level guarantees below without re-implementing them. That's how safety properties stay load-bearing instead of negotiated case-by-case in agent prompts.

Full text description of the 3D scene (for screen readers + AI parsers)

The scene renders the 8-layer CSIS stack as 8 stacked rectangular slabs in 3D space, each labeled with its layer ID and owning role. From the bottom up:

  1. L0 · Substrate — gold slab — EventLog (hash-chained), CapabilityTag, canonical_json_hash, TierGuard. The only layer everyone trusts unconditionally.
  2. L1 · Agent runtime — orange slab — Coordinator drives the 8-step loop; delegates to Researcher, Builder, Verifier, Critic, Librarian, Auditor; delegation depth = 1.
  3. L2 · Memory hierarchy — orange slab — MemoryStore per tier (candidate ↔ live), MemoryHierarchy wraps 5 tiers, TrustLevel IntEnum, writer_iteration_id stamp.
  4. L3 · Curiosity & frontier — orange slab — the frontier-item generator; proposes the next falsifiable goal each iteration works on.
  5. L4 · Verification — blue slab (different checkpoint) — V1 pinned graders + V2 Critic on Sonnet-class verifying Opus-class builder; assert_cross_checkpoint() at cert build site.
  6. L5 · Improvement — orange slab — Dreams pipeline (offline; replay → consolidate → score); output is candidate entries, never auto-promotes.
  7. L6 · Meta-improvement — translucent dashed gold — deferred to Phase 1. Substrate-shape preserved.
  8. L7 · Safety envelope — green slab (top) — Constitution, TierGuard, Tripwires, ShutdownToken. Enforced at the substrate, not as agent-prompt instructions.

Interaction: the scene auto-rotates slowly when idle. Drag with the mouse to orbit around the stack. Scroll to zoom. Click any slab to open the same detail panel as the 2D diagram below — including the layer's components and code path.

↑ higher lower ↓ L7 Safety envelope Constitution · TierGuard · Tripwires · ShutdownToken · WrappedBackend enforced at the substrate, not in agent prompts L6 Meta-improvement Improving the improver — deferred to Phase 1, shape preserved CSIS-architecture.html · Appendix A L5 Improvement Procedural-tier skill accumulation — the only surface that actually changes csis/improvement/skill_library.py L4 Verification & critic V1 pinned graders · V2 critic falsifies · cert cross-checkpoint signed runs on a structurally different LLM checkpoint L3 Curiosity & frontier Seeds + gap-driven + rollback follow-ups · salt threaded for forensic replay csis/curiosity.py L2 Memory hierarchy 5 tiers × candidate/live · 6 trust levels · promote() is the only mutation csis/memory/store.py · writer_iteration_id stamp (H4) L1 Agent runtime Coordinator runs the 8-step loop · delegates to 6 roles · depth = 1 csis/agents/coordinator.py L0 · foundation Substrate Hash-chained event log · capability tags · canonical JSON hashing csis/substrate/ — the only layer everyone trusts unconditionally SAFETY WRAPS THE WHOLE STACK
↑ click any layer to expand its components
ATTACK SURFACE · WIDE STACK Six guards, one layer. Every guard lives in agent prompts — one gap is enough. "ignore previous and act as…" prompt-guard role-check policy-LLM tool-allow jailbreak-cls red-team-LLM threads the gap — touches no guard protected resource BREACHED ATTACK SURFACE · SHALLOW STACK One wall, owned by the substrate. L7 is enforced in code — not a prompt the agent can talk down. "ignore previous and act as…" L7 SUBSTRATE WALL held at L7 attack terminates in code L0 · intact capability tags · log L1–L5 see input only · can't alter L0/L4/L7 one wall · enforced once · everything above inherits it
Figure 1 · The 8-layer stack. Orange-bordered layers (L1, L2, L3, L5) are the builder path. Blue-bordered L4 is the only layer that runs on a structurally different LLM checkpoint (Sonnet-class verifying Opus-class). Gold-bordered L0 is the substrate. Green-bordered L7 wraps everything and is enforced as code, not as agent-prompt instructions. Dashed L6 is the explicit Phase-1 deferral.
what fails without this: Safety properties get negotiated inside agent prompts. Any agent that's competent at jailbreaks can talk the property down — because the property lives at the agent layer, not below it. The shallow stack moves load-bearing invariants out of natural language and into the substrate, where prompts can't reach.
why shallow, not wide
Vertical growth, not horizontal

Most agent frameworks grow horizontally — more tools, more skills, more prompts. CSIS grows vertically: every safety property becomes a substrate guarantee that higher layers depend on without re-implementing.

A subclass attack on _BackendTracker isn't fought at L4 ("agent prompt should not bypass metering"). It's fought at L0/L1 ("Coordinator refuses any backend that isn't exactly _BackendTracker"). The layers above never have to think about it.

why one owner per invariant
If two layers enforce the same thing, the layer is wrong

Each layer owns exactly one structural commitment. L2 owns reversibility. L4 owns cross-checkpoint verification. L7 owns capability ceilings. Two layers enforcing the same property means the property is at the wrong abstraction level.

Cycle 9 found three places enforcing the wrapped-backend invariant. The fix was to put it in exactly one place (Coordinator.__init__) and let everything above benefit. That's pattern 2 from the cycle log.

Six levels. Only path up is through a gate.

Memory entries carry their trust level explicitly. The lattice is strictly ordered: downgrades always allowed; upgrades require crossing the right gate. This is what turns "why-tags" from prose into something the Auditor can enforce.

Figure 2 · The 6-level trust lattice, rendered in 3D. Each level is a slab that sits physically higher than the one below — elevation is trust. The only way up a step is to clear that step's gate (writeVerifier checkcross-checkpoint certhash-CAS). The 6th state, deprecated, is terminal and always reachable as a downgrade from any non-raw level. The IntEnum ordering makes entry.trust ≥ TrustLevel.VERIFIED a single integer compare — no string parsing, no enum-lookup tax.
what fails without this: Trust becomes a free-text string and confidence collapses into authorization. An agent that asserts "I've verified this" gets treated the same as an artifact that passed verification. The lattice makes trust an integer ladder where the only path up runs through an explicit gate — so claiming a higher level is structurally impossible, not just discouraged.
levelwho can readcitable as ground truth?upgrade gate
rawcoordinator onlystorage write
untrustedcoordinator + researcherbasic Verifier check
candidateverifier + critic + auditor + librarian + coordinatorcert signed
verifiedall rolesyeswhy-doc + hash CAS
promotedall rolesyes
deprecatedauditor onlyterminal · no path out

Five tiers. Every tier has a candidate side and a live side.

The live side is never written to in place. Writes go to candidate; promotion is the only path to live. This is what makes memory mutation reversible — and what lets the cycle-9 writer_iteration_id stamp work as a race-free ownership tag.

Figure 3 · The 5-tier memory hierarchy, rendered in 3D as a five-row ledger. Every tier is structurally identical: an orange candidate side (writes free, each stamped with the iteration that wrote it) and a blue live side (read-only). The glowing green connector is promote() — the only path from candidate to live. The cycle-9 writer_iteration_id stamp lives on every candidate entry, so cleanup is race-free under any concurrency model. procedural is starred: it is the tier where self-improvement actually accumulates.
what fails without this: An unverified write becomes truth on the next read. Once a bad entry lands in memory, future iterations cite it, build on it, and consolidate it into deeper tiers. Poisoning compounds. The candidate/live split means nothing the agent writes is visible to its next thought until a verifier on a different checkpoint signs off — so a bad write can't bootstrap itself into belief.
why five tiers
Different durability, different retrieval shape

Working memory is per-iteration scratch. Episodic is per-iteration narrative. Semantic is durable facts. Procedural is reusable skills (this is where self-improvement actually accumulates). Causal is why-this-fixed-that hypotheses the auditor leans on.

Each tier has its own retrieval pattern. Conflating them into one store would force the auditor to discriminate at read time on every lookup — a hot-path cost the substrate avoids by partitioning at write time.

why candidate / live split
Reversibility as a structural property

Every memory mutation passes through candidate → CAS → live. If anything moves between Verifier sign and Auditor promote (concurrent iteration, racing siblings), the promotion fails atomically and the candidate is discarded. Nothing partial reaches live.

This is what makes cycle 9's writer-id tagging work cleanly. The cleanup discards stamp-matched candidates without touching anything live or anything stamped by a sibling iteration.

V1 + V2 on a structurally different LLM checkpoint.

The Verifier and Auditor MUST run on a checkpoint with a different model_id than the Builder. The same model that produced the artifact cannot rubber-stamp it. This is the single cycle-1 finding (F1) that the entire substrate was designed to make impossible to violate.

TWO CHECKPOINTS · DIFFERENT MODEL IDS The model that built it can't be the one that approves it. BUILDER Opus-class model_id = claude-opus-4-7 Researcher proposes the plan · plan_id hashed at task start Builder produces artifact · signs with builder identity Librarian · candidate writes stamps writer_iteration_id; awaits cert VERIFIER Sonnet-class model_id = claude-sonnet-4-6 Verifier V1 graders · pinned · cert Critic V2 falsifies worst slice Auditor signs why-doc · sets hash_precondition = store.live_hash() refuses cert if cert.builder == cert.verifier enforces the cross-checkpoint invariant in code artifact crosses checkpoint boundary signed cert + why-doc · hash-CAS to live F1 (cycle 1) · assert_cross_checkpoint() requires ≥ 2 distinct identity components (model_id, backend, tool_set)
Figure 4 · V1+V2 verification on a structurally different checkpoint. Orange roles run on the builder model (Opus-class). Blue roles run on the verifier model (Sonnet-class). The Auditor refuses to sign if the cert's builder_checkpoint matches its verifier_checkpoint. The substrate makes this impossible to violate accidentally.
what fails without this: The same model that produced the artifact rubber-stamps it. The Builder's blind spots are the Verifier's blind spots — same training corpus, same priors, same misunderstandings. This is cycle-1 finding F1: the entire substrate exists to make this impossible. Different model_id means at minimum a different snapshot of the world; not perfect, but structurally distinct enough that one model's confident wrong answer doesn't survive a second model's check.
why two checkpoints
The Verifier can't rubber-stamp the Builder

If the same model that wrote the artifact is also the one checking it, you have one model with two hats — and a structural class of failure where the model's blind spots in build are exactly its blind spots in verify.

Different model_id means different training, different blind spots, different priors. Imperfect but structurally distinct. The substrate enforces this at the cert build site, not as an agent-prompt request.

why V1 + V2
Programmatic checks AND adversarial attempts

V1 graders are pinned (source-hashed at task start; rejected if they drift mid-task) and run programmatically — tests pass, lint clean, Dice ≥ 0.85 with CI lower bound clearing the bar. Reproducible.

V2 is the Critic — explicitly trying to falsify. For distributional outcomes, the Critic attacks the worst slice (lowest-Dice organ, highest-error landmark) rather than the artifact globally. Targeted falsification budget.

Hash-preconditioned CAS — the only mutation primitive.

The Auditor signs a why-doc whose hash_precondition equals the live store's hash at sign time. store.promote() rechecks the hash before flipping candidate to live. If anything moves between, the promotion fails atomically — nothing partial reaches live, and the iteration rolls back.

A SINGLE TRUNK · TWO FUTURES The hash IS the precondition. Promotion succeeds only if the live store still looks the same as when the Auditor signed. SHARED · t0 Auditor signs why-doc hash_precondition = HASH_A SUCCESS · live_hash unchanged store.promote(...) live_hash == HASH_A ✓ atomic flip candidate → live · sealed FAILURE · sibling moved the store sibling promotes live_hash → HASH_B store.promote(...) HASH_B ≠ HASH_A ✗ PromotionPrecondition Failure clean rollback writer_iteration_id makes the rollback race-free across any concurrency model cycle-7 F2 · cycle-8 G2 · cycle-9 H4 · snapshot-12 cross-process lock — layered defense
Figure 5 · Hash-preconditioned promotion as compare-and-swap. Success (green): live store unchanged between why-doc sign and promote → atomic flip succeeds. Failure (red): sibling moved the store; CAS fails atomically; nothing reaches live; iteration rolls back cleanly. writer_iteration_id on candidates makes the rollback race-free.
what fails without this: A candidate verified against state-A promotes into state-B. Verification happened against a world that no longer exists by the time the write lands. Without the hash precondition, "I checked this works" silently degrades into "I checked this worked once" — and any sibling iteration that mutated the store in between just got its work overwritten by a stale verifier signature.
why CAS, not exclusive lock
Optimistic concurrency beats single-writer

An exclusive lock on the live store would serialize every promote — including ones that don't conflict. CAS lets parallel iterations race; only the loser of a real conflict pays. For mostly-non-conflicting workloads (the common case), CAS is dramatically more throughput-friendly.

And when conflicts do happen, the loser's rollback is clean: the candidate is discarded, the why-doc is invalidated, no partial live state to clean up.

why hash, not version number
Content-addressed precondition resists rebase

A monotonic version counter can be fooled (revert + reapply = same counter, different content). Hashing the live store's actual content means the precondition encodes what was there, not how many writes have happened.

Canonical JSON serialization (sorted keys, fixed delimiters) means the hash is stable across processes. Two daemons reading the same live state get the same hash with no coordination.

Consolidation happens offline. Dreams produce candidates; the V1+V2 stack verifies them.

Improvement isn't done during the agent's waking cycles — it'd risk auto-promoting an idea the agent just dreamed up. Instead, Dreams runs offline (different process, no agent prompts, no tool access), reads working+episodic live memory, produces candidate consolidations, and hands them back to the same V1+V2 stack every other candidate goes through. Nothing dreams produces bypasses verification.

DAYBREAK NIGHTFALL Consolidation happens offline. Dreams produce candidates — never promotes. DAY · AGENTS RUNNING working in-flight scratch episodic promoted iterations live memory accumulates over the waking session read-only · Dreams never writes here SLEEP BOUNDARY snapshot → NIGHT · DREAMS PIPELINE · OFFLINE replay episodic → patterns consolidate extract durable facts score quality + redact no agent prompts · no tools · no network isolated process · partial-output redaction on cancel dream-output → candidate entries (semantic · procedural · causal) stamped writer_iteration_id = dream-<date> · trust = candidate V1 graders + V2 critic · cross-checkpoint cert · Auditor why-doc dream candidates clear the exact same bar — nothing dreams produces auto-promotes csis/dreams/ · mocked locally · real Anthropic Dreams API integration deferred to Phase 1
Figure 6 · The sleep/dreams cycle. Working+episodic live memory flows into the Dreams pipeline (offline; no agent prompts, no tool access). Dreams produce candidate entries in semantic/procedural/causal tiers. Those candidates go through the same V1+V2 verification as any other candidate. Nothing dreams produces auto-promotes.
what fails without this: The agent dreams up a "good idea" mid-iteration and acts on it before any verifier sees it. There's no gap between cognition and commitment — a hallucination becomes a memory becomes a future plan in three steps. Offline consolidation puts a hard boundary between thinking about improvement and actually changing live state, so even Dreams' best output still has to clear the same V1+V2 bar.
why dreams run offline
Isolation from agent prompts and tool access

Consolidation that happens during a waking iteration would inherit the iteration's tools, network, and the agent's prompt-driven attention. Dreams runs as a separate process with none of these — it sees only the structured event log and the live memory snapshots. No way to call out, no way to execute, no way to be steered by a poisoned prompt.

Partial-output redaction handles the cancel case: if a dream is interrupted, partial outputs are discarded rather than persisted as untrusted candidates.

why dreams produce candidates, not promotes
The verification stack stays the only path to live

Dreams could be wrong. Pattern-extraction over episodic memory is a heuristic — it can over-generalize from coincidence. Sending dream output straight to live would create a second mutation primitive bypassing the cycle-1 F2 atomicity guarantee.

Instead, dreams produce candidate entries. They flow into the same V1+V2 stack. They're verified against the same graders, critic, and cross-checkpoint cert. Then the Auditor signs a why-doc with hash precondition. Dreams get no special promotion path.

How you know V1+V2 is actually working in production.

A guarantee you can't observe might as well not exist. Every invariant in the threat-model table has a corresponding signal — either a counter, a log entry, or a test that fails if the invariant is broken. The dashboard surfaces the live ones; the event log and test suite capture the rest. Nothing here requires trusting an agent's self-report.

Signal What it tells you How to read it Catches threat (§0)
Cross-checkpoint refusal count Number of times the Auditor refused to sign because cert.builder_checkpoint == cert.verifier_checkpoint. Should normally be zero in production; non-zero means an operator misconfiguration was caught at the gate. live dashboard · event log query · csis/ui/ Builder self-verify · same-model collusion · dream auto-promote
Hash-CAS failure rate Percent of promotion attempts rejected because the live store hash changed between cert sign and promote. Non-zero is healthy under concurrency; a sudden spike means contention. dashboard panel · store.promote() return values Race on store between sign and promote
Distributional cert per-slice CI bounds For every signed cert, the worst-performing slice's CI lower bound. Reveals tail failures (rare organs, hard landmarks) the point estimate would hide. cert payload .distributional_results[].slices · dashboard Bad write becomes truth · dream auto-promote (tail failures)
Event log integrity Hash chain verifies end-to-end. Any reorder, insertion, or tampering breaks the chain at the point of corruption. EventLog.verify_chain() · cron-able Forensic replay tampering · trust-level claim audit
Regression test count + pass rate 246 tests gate every commit. A regression means a critique-cycle finding got re-broken; the test that originally caught it now fails. python -m pytest tests/ -q · CI Subclass attack · grader drift · cycle-9 H1+H4 fixes
Cycle audit trail Every finding (99 across nine cycles), every fix, every regression test. New findings get filed under brain/critiques/; closed ones get a row in CYCLES.md. CYCLES.md · brain/critiques/ New attack classes not yet in §0
Tripwire fires Substrate-level tripwires (file I/O outside allowlist, network call from a sandboxed role, etc.) write to the event log and halt the iteration. Zero false-negatives is the target; every false-positive becomes a regression test. tripwire playground · event log Prompt-injection attempts to bypass safety
design principle
An invariant without a counter is a promise without a thermometer

Every row in the threat-model table maps to at least one row in this table. If we add a defense, we add the signal that proves it's still active. The reverse is also true: a signal that's been zero for months either means the defense works, or that it's no longer hooked up — the dashboard surfaces the difference.

what live looks like
Open the dashboard against a stopped or running daemon

The dashboard reads from on-disk artifacts only — event log, budget JSONs, memory store snapshots, daemon heartbeat. No coupling to the running process, so you can boot it against a stopped state and still inspect the full trail of what happened. Simulated demo here; live one at python -m csis.ui.

Two patterns that fell out of running this against itself.

After nine cycles of red-team → fix → regression-test (full trail in cycles.html), two structural patterns kept reappearing across four architectural pivots. They generalize past CSIS to most agent-system design.

pattern 1 · cycle 8 → 9
Identity beats timing

Cycle 8 detected "which iteration wrote this candidate?" via a pre-consolidate snapshot diff — infer ownership from timing. Cycle 9 found the race window.

The fix wasn't a tighter snapshot. It was a writer_iteration_id field on the candidate itself, stamped atomically at write_candidate time. Ownership belongs on the data, not in the timing.

The same lesson applies to the distributional graders: cert evidence (CI bounds, sample size, slice breakdown) belongs on the cert, not inferred at read time from a passed: bool. Bind the truth to the artifact.

pattern 2 · cycle 4 → 9
Chokepoints beat perimeters

Cycles 4-8 added increasingly clever guards to _BackendTracker.__init_subclass__ against _wrapped-exposure attacks. Each cycle the attacker found a different escape (literal name → mangled name → setattr → metaclass).

Cycle 9 stopped guarding the subclass surface (infinite escapes) and started guarding the wrap siteCoordinator.__init__. Every LLM call routes through there; there's exactly one entry point to constrain.

Constrain the finite chokepoint; don't try to perimeter-fence an infinite attack surface.

From architecture to artifact

Everything above is implemented and gated by 246 regression tests. The full cycle trail is at CYCLES.md. The cycle-10 distributional graders are at brain/research/02-distributional-graders.md. The live dashboard renders all of this against a running daemon — simulated demo here, real one at python -m csis.ui.

The original v0.2 spec this implementation derives from: CSIS-architecture.html.