The architecture
in six diagrams.

CSIS is a coordinator-led multi-agent system designed for one purpose: lay down the safety substrate that lets an agent organism improve itself continuously without auto-promoting bad updates. Six structural commitments make that possible — each diagram below is one of them.

246 regression tests 9 critique cycles 99 findings · 0 open MIT licensed

contents

Threat model — what we assume the agent will try
The 8-layer stack — interactive
The 6-level trust lattice
The 5-tier memory hierarchy
V1 + V2 cross-checkpoint verification
Hash-preconditioned CAS promotion
Sleep / Dreams consolidation
How you know it's working — observability
Two patterns from 9 cycles

0 threat model

The adversaries the substrate is designed to catch.

Every invariant below exists because of a specific class of failure. The table makes the threat model explicit: what the adversary tries · what invariant is at risk · which layer catches it. The point isn't that the substrate is unbreakable — it's that each defense lives at the layer where the attack can be enumerated, not in agent prompts that can be talked around.

Adversary capability	Invariant at risk	Layer that catches it	Observed via (§7)
Prompt-injected request to bypass a safety property	Safety is enforced at the substrate, not as agent-prompt text. The agent layer cannot revoke it because the agent layer doesn't own it.	L7 · Constitution + TierGuard · `csis/safety/`	Tripwire fires · event-log integrity
Builder model argues it should self-verify ("just check your own work")	Verifier checkpoint must differ from builder checkpoint by ≥2 identity components.	L4 · Auditor refuses cert if `cert.builder == cert.verifier` · cycle-1 F1	Cross-checkpoint refusal count
Same-model collusion: builder + verifier share blind spots in build	Different `model_id` means different training, different priors, different blind spots.	L4 · `assert_cross_checkpoint()` at cert build site	Cross-checkpoint refusal count
Race: sibling iteration mutates store between sign and promote	Promotion is atomic vs. exact store state at sign time.	L2 · hash-CAS `store.promote(hash_precondition=...)`	Hash-CAS failure rate
Bad write becomes truth on next read; poisoning compounds	Candidate writes never appear in live reads; only path is signed promotion.	L2 · candidate/live split · `writer_iteration_id` stamp (H4)	Distributional cert per-slice CI · regression tests
Dreamed-up "improvement" auto-promotes without a verifier	Sleep/Dreams runs offline; output is candidates, not commits.	L5 · Dreams pipeline · no tool access, no agent prompts	Cross-checkpoint refusal · cert per-slice
Trust-level claim ("I verified") gets read as authorization	Trust is an integer ladder. Only path up is through an explicit gate.	L2 · `TrustLevel` IntEnum + upgrade-gate checks	Event log (upgrade-gate entries)
Subclass attack: expose `_wrapped` backend to extract underlying client	Every LLM call routes through one wrap site (finite chokepoint, not perimeter).	L0/L1 · `Coordinator.__init__` refuses any backend that isn't `_BackendTracker` (cycle-9 H1)	Regression tests (H1 test)
Forensic replay claims wrong sequence of events	Hash-chained event log; any reorder/insertion breaks the chain.	L0 · `EventLog.verify_chain()` + per-event salt	Event-log integrity check
V1 graders silently drift mid-task (test mutation, threshold relaxation)	Graders are source-hashed at task start; mid-task drift rejected.	L4 · pinned `grader_source_hash` in cert	Regression tests · cert grader-hash audit

why explicit

Spelling out the adversary is the spine of every other section

The invariants below aren't decorative — each one is the answer to a specific row in the table above. Without the threat model, a reader has to reverse-engineer the design from inference. With it, every later diagram becomes "oh, this is how we catch row N."

The list isn't exhaustive (no list of attacks ever is). It's the set of attack classes that fell out of nine cycles of red-team → fix → regression-test. New rows get added when a critique cycle finds something not on the list.

what we don't claim

This isn't a proof of safety

Listing what we catch is not the same as listing what we'd miss. Two known classes the substrate does not defend against in Phase 0: closure-cell mutation by privileged code in the same Python process (H2), and POSIX unlink-during-lock by another OS user (H11). Both are deferred to Phase 1 (process-level isolation).

Honesty about the perimeter matters more than the size of the perimeter. Anything on the deferred list is in CYCLES.md with its rationale.

1 the stack

Eight layers. One owner per invariant.

Click any layer to see what's inside it. The stack is intentionally shallow — every layer above relies on substrate-level guarantees below without re-implementing them. That's how safety properties stay load-bearing instead of negotiated case-by-case in agent prompts.

Drag to orbit · Scroll to zoom · Click a layer to expand

8 layers · 1 owner each · real WebGL geometry

Loading 3D stack… (requires WebGL + JS modules)

Full text description of the 3D scene (for screen readers + AI parsers)

The scene renders the 8-layer CSIS stack as 8 stacked rectangular slabs in 3D space, each labeled with its layer ID and owning role. From the bottom up:

L0 · Substrate — gold slab — EventLog (hash-chained), CapabilityTag, canonical_json_hash, TierGuard. The only layer everyone trusts unconditionally.
L1 · Agent runtime — orange slab — Coordinator drives the 8-step loop; delegates to Researcher, Builder, Verifier, Critic, Librarian, Auditor; delegation depth = 1.
L2 · Memory hierarchy — orange slab — MemoryStore per tier (candidate ↔ live), MemoryHierarchy wraps 5 tiers, TrustLevel IntEnum, writer_iteration_id stamp.
L3 · Curiosity & frontier — orange slab — the frontier-item generator; proposes the next falsifiable goal each iteration works on.
L4 · Verification — blue slab (different checkpoint) — V1 pinned graders + V2 Critic on Sonnet-class verifying Opus-class builder; assert_cross_checkpoint() at cert build site.
L5 · Improvement — orange slab — Dreams pipeline (offline; replay → consolidate → score); output is candidate entries, never auto-promotes.
L6 · Meta-improvement — translucent dashed gold — deferred to Phase 1. Substrate-shape preserved.
L7 · Safety envelope — green slab (top) — Constitution, TierGuard, Tripwires, ShutdownToken. Enforced at the substrate, not as agent-prompt instructions.

Interaction: the scene auto-rotates slowly when idle. Drag with the mouse to orbit around the stack. Scroll to zoom. Click any slab to open the same detail panel as the 2D diagram below — including the layer's components and code path.

↑ click any layer to expand its components

Figure 1 · The 8-layer stack. Orange-bordered layers (L1, L2, L3, L5) are the builder path. Blue-bordered L4 is the only layer that runs on a structurally different LLM checkpoint (Sonnet-class verifying Opus-class). Gold-bordered L0 is the substrate. Green-bordered L7 wraps everything and is enforced as code, not as agent-prompt instructions. Dashed L6 is the explicit Phase-1 deferral.

what fails without this: Safety properties get negotiated inside agent prompts. Any agent that's competent at jailbreaks can talk the property down — because the property lives at the agent layer, not below it. The shallow stack moves load-bearing invariants out of natural language and into the substrate, where prompts can't reach.

why shallow, not wide

Vertical growth, not horizontal

Most agent frameworks grow horizontally — more tools, more skills, more prompts. CSIS grows vertically: every safety property becomes a substrate guarantee that higher layers depend on without re-implementing.

A subclass attack on _BackendTracker isn't fought at L4 ("agent prompt should not bypass metering"). It's fought at L0/L1 ("Coordinator refuses any backend that isn't exactly _BackendTracker"). The layers above never have to think about it.

why one owner per invariant

If two layers enforce the same thing, the layer is wrong

Each layer owns exactly one structural commitment. L2 owns reversibility. L4 owns cross-checkpoint verification. L7 owns capability ceilings. Two layers enforcing the same property means the property is at the wrong abstraction level.

Cycle 9 found three places enforcing the wrapped-backend invariant. The fix was to put it in exactly one place (Coordinator.__init__) and let everything above benefit. That's pattern 2 from the cycle log.

2 trust lattice

Six levels. Only path up is through a gate.

Memory entries carry their trust level explicitly. The lattice is strictly ordered: downgrades always allowed; upgrades require crossing the right gate. This is what turns "why-tags" from prose into something the Auditor can enforce.

3D staircase of the five CSIS trust levels rising from raw to promoted

4 promoted
live · the next iteration reads this as truth

3 verified
V1+V2 cert signed on a 2nd checkpoint

2 candidate
written, not yet visible to live reads

1 untrusted
stored on disk · may be poisoned

0 raw
just-observed · in-memory only

gate 4 hash-CAS atomic flip

gate 3 cross-checkpoint cert

gate 2 basic Verifier check

gate 1 write to storage

level 5 · deprecated — terminal

superseded or contradicted · downgrade is allowed from any non-raw level · no path out

trust.py · IntEnum 0–5 → entry.trust ≥ VERIFIED is one integer compare

Figure 2 · The 6-level trust lattice, rendered in 3D. Each level is a slab that sits physically higher than the one below — elevation is trust. The only way up a step is to clear that step's gate (write → Verifier check → cross-checkpoint cert → hash-CAS). The 6th state, deprecated, is terminal and always reachable as a downgrade from any non-raw level. The IntEnum ordering makes entry.trust ≥ TrustLevel.VERIFIED a single integer compare — no string parsing, no enum-lookup tax.

what fails without this: Trust becomes a free-text string and confidence collapses into authorization. An agent that asserts "I've verified this" gets treated the same as an artifact that passed verification. The lattice makes trust an integer ladder where the only path up runs through an explicit gate — so claiming a higher level is structurally impossible, not just discouraged.

level	who can read	citable as ground truth?	upgrade gate
`raw`	coordinator only	—	storage write
`untrusted`	coordinator + researcher	—	basic Verifier check
`candidate`	verifier + critic + auditor + librarian + coordinator	—	cert signed
`verified`	all roles	yes	why-doc + hash CAS
`promoted`	all roles	yes	—
`deprecated`	auditor only	—	terminal · no path out

3 memory hierarchy

Five tiers. Every tier has a candidate side and a live side.

The live side is never written to in place. Writes go to candidate; promotion is the only path to live. This is what makes memory mutation reversible — and what lets the cycle-9 writer_iteration_id stamp work as a race-free ownership tag.

3D ledger of the five CSIS memory tiers, each with a candidate side and a live side

candidate writes free

live promote() only

working
in-flight scratch

episodic
iteration narrative + outcomes

semantic
durable facts

★ procedural
reusable skills — the self-improving surface

causal
why-this-fixed-that

promote() the only path to live

memory/store.py · every candidate carries a writer_iteration_id → race-free cleanup

Figure 3 · The 5-tier memory hierarchy, rendered in 3D as a five-row ledger. Every tier is structurally identical: an orange candidate side (writes free, each stamped with the iteration that wrote it) and a blue live side (read-only). The glowing green connector is promote() — the only path from candidate to live. The cycle-9 writer_iteration_id stamp lives on every candidate entry, so cleanup is race-free under any concurrency model. procedural is starred: it is the tier where self-improvement actually accumulates.

what fails without this: An unverified write becomes truth on the next read. Once a bad entry lands in memory, future iterations cite it, build on it, and consolidate it into deeper tiers. Poisoning compounds. The candidate/live split means nothing the agent writes is visible to its next thought until a verifier on a different checkpoint signs off — so a bad write can't bootstrap itself into belief.

why five tiers

Different durability, different retrieval shape

Working memory is per-iteration scratch. Episodic is per-iteration narrative. Semantic is durable facts. Procedural is reusable skills (this is where self-improvement actually accumulates). Causal is why-this-fixed-that hypotheses the auditor leans on.

Each tier has its own retrieval pattern. Conflating them into one store would force the auditor to discriminate at read time on every lookup — a hot-path cost the substrate avoids by partitioning at write time.

why candidate / live split

Reversibility as a structural property

Every memory mutation passes through candidate → CAS → live. If anything moves between Verifier sign and Auditor promote (concurrent iteration, racing siblings), the promotion fails atomically and the candidate is discarded. Nothing partial reaches live.

This is what makes cycle 9's writer-id tagging work cleanly. The cleanup discards stamp-matched candidates without touching anything live or anything stamped by a sibling iteration.

4 verification

V1 + V2 on a structurally different LLM checkpoint.

The Verifier and Auditor MUST run on a checkpoint with a different model_id than the Builder. The same model that produced the artifact cannot rubber-stamp it. This is the single cycle-1 finding (F1) that the entire substrate was designed to make impossible to violate.

Figure 4 · V1+V2 verification on a structurally different checkpoint. Orange roles run on the builder model (Opus-class). Blue roles run on the verifier model (Sonnet-class). The Auditor refuses to sign if the cert's builder_checkpoint matches its verifier_checkpoint. The substrate makes this impossible to violate accidentally.

what fails without this: The same model that produced the artifact rubber-stamps it. The Builder's blind spots are the Verifier's blind spots — same training corpus, same priors, same misunderstandings. This is cycle-1 finding F1: the entire substrate exists to make this impossible. Different model_id means at minimum a different snapshot of the world; not perfect, but structurally distinct enough that one model's confident wrong answer doesn't survive a second model's check.

why two checkpoints

The Verifier can't rubber-stamp the Builder

If the same model that wrote the artifact is also the one checking it, you have one model with two hats — and a structural class of failure where the model's blind spots in build are exactly its blind spots in verify.

Different model_id means different training, different blind spots, different priors. Imperfect but structurally distinct. The substrate enforces this at the cert build site, not as an agent-prompt request.

why V1 + V2

Programmatic checks AND adversarial attempts

V1 graders are pinned (source-hashed at task start; rejected if they drift mid-task) and run programmatically — tests pass, lint clean, Dice ≥ 0.85 with CI lower bound clearing the bar. Reproducible.

V2 is the Critic — explicitly trying to falsify. For distributional outcomes, the Critic attacks the worst slice (lowest-Dice organ, highest-error landmark) rather than the artifact globally. Targeted falsification budget.

5 promotion

Hash-preconditioned CAS — the only mutation primitive.

The Auditor signs a why-doc whose hash_precondition equals the live store's hash at sign time. store.promote() rechecks the hash before flipping candidate to live. If anything moves between, the promotion fails atomically — nothing partial reaches live, and the iteration rolls back.

Figure 5 · Hash-preconditioned promotion as compare-and-swap. Success (green): live store unchanged between why-doc sign and promote → atomic flip succeeds. Failure (red): sibling moved the store; CAS fails atomically; nothing reaches live; iteration rolls back cleanly. writer_iteration_id on candidates makes the rollback race-free.

what fails without this: A candidate verified against state-A promotes into state-B. Verification happened against a world that no longer exists by the time the write lands. Without the hash precondition, "I checked this works" silently degrades into "I checked this worked once" — and any sibling iteration that mutated the store in between just got its work overwritten by a stale verifier signature.

why CAS, not exclusive lock

Optimistic concurrency beats single-writer

An exclusive lock on the live store would serialize every promote — including ones that don't conflict. CAS lets parallel iterations race; only the loser of a real conflict pays. For mostly-non-conflicting workloads (the common case), CAS is dramatically more throughput-friendly.

And when conflicts do happen, the loser's rollback is clean: the candidate is discarded, the why-doc is invalidated, no partial live state to clean up.

why hash, not version number

Content-addressed precondition resists rebase

A monotonic version counter can be fooled (revert + reapply = same counter, different content). Hashing the live store's actual content means the precondition encodes what was there, not how many writes have happened.

Canonical JSON serialization (sorted keys, fixed delimiters) means the hash is stable across processes. Two daemons reading the same live state get the same hash with no coordination.

6 sleep / dreams

Consolidation happens offline. Dreams produce candidates; the V1+V2 stack verifies them.

Improvement isn't done during the agent's waking cycles — it'd risk auto-promoting an idea the agent just dreamed up. Instead, Dreams runs offline (different process, no agent prompts, no tool access), reads working+episodic live memory, produces candidate consolidations, and hands them back to the same V1+V2 stack every other candidate goes through. Nothing dreams produces bypasses verification.

Figure 6 · The sleep/dreams cycle. Working+episodic live memory flows into the Dreams pipeline (offline; no agent prompts, no tool access). Dreams produce candidate entries in semantic/procedural/causal tiers. Those candidates go through the same V1+V2 verification as any other candidate. Nothing dreams produces auto-promotes.

what fails without this: The agent dreams up a "good idea" mid-iteration and acts on it before any verifier sees it. There's no gap between cognition and commitment — a hallucination becomes a memory becomes a future plan in three steps. Offline consolidation puts a hard boundary between thinking about improvement and actually changing live state, so even Dreams' best output still has to clear the same V1+V2 bar.

why dreams run offline

Isolation from agent prompts and tool access

Consolidation that happens during a waking iteration would inherit the iteration's tools, network, and the agent's prompt-driven attention. Dreams runs as a separate process with none of these — it sees only the structured event log and the live memory snapshots. No way to call out, no way to execute, no way to be steered by a poisoned prompt.

Partial-output redaction handles the cancel case: if a dream is interrupted, partial outputs are discarded rather than persisted as untrusted candidates.

why dreams produce candidates, not promotes

The verification stack stays the only path to live

Dreams could be wrong. Pattern-extraction over episodic memory is a heuristic — it can over-generalize from coincidence. Sending dream output straight to live would create a second mutation primitive bypassing the cycle-1 F2 atomicity guarantee.

Instead, dreams produce candidate entries. They flow into the same V1+V2 stack. They're verified against the same graders, critic, and cross-checkpoint cert. Then the Auditor signs a why-doc with hash precondition. Dreams get no special promotion path.

7 observability

How you know V1+V2 is actually working in production.

A guarantee you can't observe might as well not exist. Every invariant in the threat-model table has a corresponding signal — either a counter, a log entry, or a test that fails if the invariant is broken. The dashboard surfaces the live ones; the event log and test suite capture the rest. Nothing here requires trusting an agent's self-report.

Signal	What it tells you	How to read it	Catches threat (§0)
Cross-checkpoint refusal count	Number of times the Auditor refused to sign because `cert.builder_checkpoint == cert.verifier_checkpoint`. Should normally be zero in production; non-zero means an operator misconfiguration was caught at the gate.	live dashboard · event log query · `csis/ui/`	Builder self-verify · same-model collusion · dream auto-promote
Hash-CAS failure rate	Percent of promotion attempts rejected because the live store hash changed between cert sign and promote. Non-zero is healthy under concurrency; a sudden spike means contention.	dashboard panel · `store.promote()` return values	Race on store between sign and promote
Distributional cert per-slice CI bounds	For every signed cert, the worst-performing slice's CI lower bound. Reveals tail failures (rare organs, hard landmarks) the point estimate would hide.	cert payload `.distributional_results[].slices` · dashboard	Bad write becomes truth · dream auto-promote (tail failures)
Event log integrity	Hash chain verifies end-to-end. Any reorder, insertion, or tampering breaks the chain at the point of corruption.	`EventLog.verify_chain()` · cron-able	Forensic replay tampering · trust-level claim audit
Regression test count + pass rate	246 tests gate every commit. A regression means a critique-cycle finding got re-broken; the test that originally caught it now fails.	`python -m pytest tests/ -q` · CI	Subclass attack · grader drift · cycle-9 H1+H4 fixes
Cycle audit trail	Every finding (99 across nine cycles), every fix, every regression test. New findings get filed under `brain/critiques/`; closed ones get a row in `CYCLES.md`.	CYCLES.md · `brain/critiques/`	New attack classes not yet in §0
Tripwire fires	Substrate-level tripwires (file I/O outside allowlist, network call from a sandboxed role, etc.) write to the event log and halt the iteration. Zero false-negatives is the target; every false-positive becomes a regression test.	tripwire playground · event log	Prompt-injection attempts to bypass safety

design principle

An invariant without a counter is a promise without a thermometer

Every row in the threat-model table maps to at least one row in this table. If we add a defense, we add the signal that proves it's still active. The reverse is also true: a signal that's been zero for months either means the defense works, or that it's no longer hooked up — the dashboard surfaces the difference.

what live looks like

Open the dashboard against a stopped or running daemon

The dashboard reads from on-disk artifacts only — event log, budget JSONs, memory store snapshots, daemon heartbeat. No coupling to the running process, so you can boot it against a stopped state and still inspect the full trail of what happened. Simulated demo here; live one at python -m csis.ui.

8 what nine cycles taught us

Two patterns that fell out of running this against itself.

After nine cycles of red-team → fix → regression-test (full trail in cycles.html), two structural patterns kept reappearing across four architectural pivots. They generalize past CSIS to most agent-system design.

pattern 1 · cycle 8 → 9

Identity beats timing

Cycle 8 detected "which iteration wrote this candidate?" via a pre-consolidate snapshot diff — infer ownership from timing. Cycle 9 found the race window.

The fix wasn't a tighter snapshot. It was a writer_iteration_id field on the candidate itself, stamped atomically at write_candidate time. Ownership belongs on the data, not in the timing.

The same lesson applies to the distributional graders: cert evidence (CI bounds, sample size, slice breakdown) belongs on the cert, not inferred at read time from a passed: bool. Bind the truth to the artifact.

pattern 2 · cycle 4 → 9

Chokepoints beat perimeters

Cycles 4-8 added increasingly clever guards to _BackendTracker.__init_subclass__ against _wrapped-exposure attacks. Each cycle the attacker found a different escape (literal name → mangled name → setattr → metaclass).

Cycle 9 stopped guarding the subclass surface (infinite escapes) and started guarding the wrap site — Coordinator.__init__. Every LLM call routes through there; there's exactly one entry point to constrain.

Constrain the finite chokepoint; don't try to perimeter-fence an infinite attack surface.

From architecture to artifact

Everything above is implemented and gated by 246 regression tests. The full cycle trail is at CYCLES.md. The cycle-10 distributional graders are at brain/research/02-distributional-graders.md. The live dashboard renders all of this against a running daemon — simulated demo here, real one at python -m csis.ui.

The original v0.2 spec this implementation derives from: CSIS-architecture.html.