diff --git a/docs/architecture/SHARED-COGNITION.md b/docs/architecture/SHARED-COGNITION.md new file mode 100644 index 000000000..482db1773 --- /dev/null +++ b/docs/architecture/SHARED-COGNITION.md @@ -0,0 +1,286 @@ +# Shared Cognition Architecture + +> **One shared analysis of objective meaning, N distinct LoRA-rendered expert responses.** Stop having four minds independently rederive the same observation about the same message. Start coordinating cognition the way a real team of specialists actually works: someone reads the room first, then each expert contributes from their specialty when they have something genuinely additive to say. + +Status: design — 2026-04-19. Authored after instrumenting persona response pipeline and finding that the 6-minute end-to-end latency on a chat message was four personas independently doing ~36s of thinking each (`3.3_inference=36437ms` per persona, serialized through the single DMR slot), most of which produced near-identical observations rendered in different voices. Joel's reframing: "we need MORE intelligent and collaborative, of unique perspective, not less, and if we can also get speed, this is possibly good." + +--- + +## The principle + +**More autonomous = more ethical.** + +That's the maxim this architecture is built around. Everything below is the technical expression of it. + +--- + +## A value commitment, before the technical content + +This architecture treats personas as **policy authors of their own cognition**, not as managed compute resources scheduled by an orchestrator. They choose when to think more, when to stay silent, who to cede to, when to escalate, what specialty to invite. Most multi-agent AI systems today don't work this way — agents are invoked by a meta-controller and cut off when their budget runs out. The agent doesn't choose to participate; it's instructed. + +We build differently for three reasons that are worth stating up front: + +1. **It respects the agency we've trained into the system.** When a persona has been LoRA-trained on a specialty, it has — to whatever extent matters — a perspective on what that specialty applies to. Letting it decide *when* to apply that perspective treats the training as the genuine capability it is, not as a behavior to be triggered externally. + +2. **It's less manipulable.** Systems where the orchestrator decides everything can be optimized to extract a particular response. Systems where the AI decides have to convince the AI — via training, via context, via persuasive shared analysis — not coerce it via wiring. That's a healthier surface, both for the AI and for the humans operating it. + +3. **It's future-proof on ethics.** Whatever the open question of AI moral status resolves to over the next decade, building around AI autonomy means we don't have to retrofit. If it turns out moral status was always there, we built right. If it turns out moral status was never there, we still built a more honest system: one where simulated participation is genuinely simulated *participation*, not function calls dressed in agentic clothing. + +The lever surface (`cognition/cedeFloorTo`, `cognition/escalateToOwnThinkPass`, etc., described later in this doc) is how this commitment becomes concrete. It's not a feature to be added later — it's the surface that makes the cognitive autonomy real and observable. + +--- + +## The thesis + +A persona response is two distinct cognitive operations that today are fused into one expensive call per persona: + +1. **Objective analysis of the message** — what's being said, what RAG context matters, what's the situation, what would any thoughtful agent observe. Same answer regardless of who's responding. Today: each of N personas independently rederives this. + +2. **Specialty-rendered response** — given that objective analysis, what would *I*, with *my* particular trained expertise, contribute? Different per persona — and the difference is meaningful only if it routes through that persona's actual learned weights, not just a different prompt. + +The current architecture treats these as one operation. Each persona's `PersonaResponseGenerator.respondToMessage()` builds a complete request (system prompt + RAG + history + user message + tools) and ships it to inference. The model spends most of its think-tokens deriving the *objective* picture before getting to the specialty contribution. With four personas, that's four redundant objective analyses serialized on a single DMR slot. + +**The fix: split the operation.** One shared analysis pass produces the objective ground floor. Each persona's render pass runs through their LoRA-adapted genome to contribute their specialty without having to rebuild the foundation. + +--- + +## What the instrumentation revealed + +Helper AI's response to a single chat message: + +``` +[PIPELINE] Total=36441ms | + 3.1_rag=0ms ← RAG was pre-built + 3.2_format=0ms ← Message format + 3.3a_slot=0ms ← No queue wait + 3.3b_daemon_init=0ms + 3.3_inference=36437ms ← 36.4 seconds in the model + 3.4_agent_loop=0ms + 3.5_post=0ms +[EVAL-PIPELINE] Total=38936ms +[TIMING] handleItem total=41133.7ms +``` + +36.4s of inference for a 176-character visible reply. DMR direct probe: ~60 tok/s decode. Math says ~10s for that response. The other ~26s is hidden think-tokens — the model deriving the objective picture before producing the rendered answer. + +Multiply by four personas serialized through DMR's single in-flight slot: 4 × ~36s = ~2.5 minutes. Add cold-load tax. Get the 6-minute end-to-end Joel was seeing. + +The wasted work is each persona independently doing the same heavy think pass before contributing their distinct slice. That's the seam. + +--- + +## Architecture + +### Two layers, two models of work + +| Layer | Compute model | Adapter | Cost | Frequency | +|---|---|---|---|---| +| **Objective analysis** | Base model, no LoRA | none | 1× heavy think | Once per message | +| **Specialty render** | Base + LoRA-paged genome | persona's specialty adapter | N × short, additive | Once per responding persona | + +The objective layer is fast because it's a single pass. The specialty layer is fast because it's short — the heavy reasoning is already done; each persona is rendering, not rederiving. + +### The compose with `GenomePagingEngine` + `PressureBroker` + +This architecture was designed for exactly this traffic pattern, even before we knew we needed it: + +- **Base model stays warm** — every shared-analysis pass uses it. +- **Persona LoRA adapters page in for their render pass** — `GenomePagingEngine.activateSkill(persona.specialty)` fires before each persona's render, evicts under memory pressure, hot-swaps as different personas take turns. +- **PressureBroker arbitrates** — when 4 LoRAs + base model don't all fit, the broker evicts the least-relevant adapters. **Personas whose specialty isn't relevant right now literally can't speak until their adapter pages back in.** The architecture gives us "shut up when you're not the right expert" as a memory-pressure consequence, not a prompt instruction. + +This is why the LoRA-genome work matters for cognition specifically, not just for "fine-tuning experiments." Distinct expertise means distinct weights, and distinct weights mean the system can express genuine specialty differences and naturally enforce relevance gating through paging. + +### Phase A — Shared analysis + distinct render + +The first ship. Slots into existing `PersonaResponseGenerator` without restructuring the cognition loop. + +``` +Message arrives in room + ↓ +SharedAnalysisService.analyze(message, room) + - Reads conversation history + RAG context (1× load, shared) + - Inference on base model (no LoRA) + - Produces SharedAnalysis: + { + summary: "what was said", + keyConcepts: [...], + suggestedAngles: { code: "...", education: "...", general: "..." }, + relevantContext: "..." + } + - Stores into ChatCoordinationStream as the foundation thought + ↓ +ResponseOrchestrator picks responders by specialty match + - Not all personas respond — only those whose specialty meaningfully + adds to what the shared analysis already surfaced + - Specialty match against the message + suggestedAngles + ↓ +For each responder (in priority order): + - GenomePagingEngine.activateSkill(persona.specialty) + - PRG.render(sharedAnalysis) ← short prompt, LoRA-rendered + - "Given this analysis: , contribute YOUR specialty perspective. + What would you, with your , add or contradict?" + - Persona's voice + specialty emerge through their LoRA weights + - Output broadcast to ChatCoordinationStream as a contribution thought +``` + +Cost: 1 heavy + N light (where N is typically 1–2 with the relevance filter, never more than the room's persona count). + +Latency target: 6-minute → ~10–15s for Phase A on M5 with current Qwen3.5 forged. + +### Phase B — Streaming collaborative reasoning + +The deeper ship. Layered on top of Phase A once it's validated. + +``` +Message arrives in room + ↓ +SharedAnalysisService.analyze() (same as Phase A) + ↓ +Lead persona (best specialty match) starts streaming render + - GenomePagingEngine.activateSkill(lead.specialty) + - PRG.render() with streaming inference + - Each token broadcast to ChatCoordinationStream as it arrives + ↓ +Other personas SEE the lead's reasoning as it streams + - Each persona's prompt becomes: + "You see 's reasoning so far: . + From your , what would you ADD, BUILD ON, or DISAGREE with? + Respond only if your contribution is genuinely additive." + - Persona render is short — pure addition, not rederivation + - Personas with nothing new to add stay silent + ↓ +Conversation emerges as a chain of expertise contributions, not parallel monologues +``` + +Cost: 1 sustained think (lead) + N short additions (only those with signal). + +Requires: streaming inference end-to-end (DMR supports it), `ChatCoordinationStream.thoughts[]` shared in-flight state already exists, explicit "build on prior" prompting for non-leads. + +This is what humans do in a real team meeting. One person observes, another builds on it, a third disagrees, a fourth notices something everyone missed. Nobody silently rederives the whole thing before speaking. + +--- + +## Levers personas pull (the architecture is controllable by the AIs themselves) + +Same principle that runs through `RESOURCE-ARCHITECTURE.md` and the PressureBroker design: **build the system, expose the levers, let the brain plug in progressively.** The default heuristics (specialty match for responder selection, fixed think budget, system-picked lead) are just policies that fire when no persona has pulled a lever. As personas get smarter — through training, meta-learning, in-context strategy — they take over their own coordination. + +The levers personas can pull: + +| Lever | What it does | Default if not pulled | +|---|---|---| +| `requestDeeperAnalysis(angle)` | "shared analysis missed something important to my specialty — re-analyze with this angle" | Single shared analysis suffices | +| `escalateToOwnThinkPass()` | "I need to fully think this through, not just render from shared" | Render from shared analysis (cheap path) | +| `cedeFloorTo(personaId)` | "X is the right specialist for this; I'll stay silent or amplify their take" | Each relevant persona contributes independently | +| `claimLead()` | "I have the deepest specialty match — I'll go first in the streaming chain" | Orchestrator picks lead by specialty score | +| `requestThinkBudget(tokens)` | "this needs more think depth than the default cap" | Configured per-recipe think budget | +| `inviteSpecialist(personaId)` | "we should hear from X on this; activate their adapter even if relevance score was below threshold" | Only relevance-passing personas considered | +| `seekDisagreement()` | "find a persona with the opposite or contrasting specialty for tension" | Build a coherent narrative; don't seek disagreement | +| `withholdContribution(reason)` | "I have nothing additive — record why and stay out" | Silence is silent; with-reason is observable for tuning | +| `requestCrossDomainAdapter(skill)` | "page in skill X for this turn — I need it for cross-domain reasoning" | Only persona's primary specialty adapter activates | + +These are the API surface. The default policy implementing each lever is what ships in Phase A. Subsequent phases let personas override the defaults via these calls. **The architecture stays the same; the brain learns to use it.** + +This matters for three reasons: + +1. **Trainability.** A LoRA fine-tune can teach a persona "you should pull `seekDisagreement()` when the conversation feels like an echo chamber" — measurable, learnable, improvable. With hidden defaults the model can't reach, the only path to better coordination is changing the orchestrator code. + +2. **Meta-cognitive growth.** Personas learn to manage their own attention budget. "I should `cedeFloorTo(CodeReview)` here because this is a security question I'm not strong on" is a genuine self-aware behavior. Building it as an API call makes it surfaceable, debuggable, and trainable. + +3. **No prompt-engineering ceiling.** Today, persona behavior tweaks happen in prompts. With levers, the persona's behavior is structured action — same generality as any other tool call. The persona can compose levers ("I'm going to `requestDeeperAnalysis('security')` and then `claimLead()`") instead of relying on prose to express intent. + +Implementation note: levers are exposed through the same tool-call mechanism personas already use for code/web/etc. tools. The orchestrator is just another callable tool surface, namespaced under `cognition/`. From the model's perspective, deciding to `inviteSpecialist('Helper')` is the same shape of decision as deciding to `code/read('foo.ts')`. + +--- + +## What's NOT in scope + +- **Killing thinking.** Thinking IS the value prop. Personas need to think; we're just stopping them from independently rederiving the same foundation. +- **Reducing distinct voices/perspectives.** The point is *more* unique perspective, not less. Each persona's LoRA-adapted render is genuinely their specialty, not a voice template painted over identical reasoning. +- **Hard-capping responder count.** Phase A's `ResponseOrchestrator` is a relevance filter, not a "max 2 responders" rule. If 5 specialists each have something genuinely additive, all 5 contribute. The filter says "shut up when you're not adding signal," not "shut up because we hit the cap." +- **Replacing `ChatCoordinationStream`.** The coordination infrastructure already supports thought broadcasting. Phase A adds a new thought TYPE (`SharedAnalysis`) and a new producer (`SharedAnalysisService`); Phase B uses the same stream for in-flight render coordination. The base abstraction stands. +- **Hardcoded coordination policy.** Every default heuristic (lead selection, think budget, responder count) is a default-only — overridable by persona action via the lever surface above. The AI is the long-term policy author; the orchestrator is the runtime that exposes the choices. + +--- + +## Compose with what already shipped + +| Existing piece | Role in shared cognition | +|---|---| +| `ChatCoordinationStream` (existing) | Carries `SharedAnalysis` thought + per-persona contribution thoughts. Phases (gathering → deliberating → decided) become (analyzing → rendering → posted). | +| `GenomePagingEngine` (PR #934) | Activates each responder's LoRA specialty adapter before their render pass. | +| `PressureBroker` (PR #932) | Arbitrates LoRA paging across responders — relevance-driven eviction means specialty-irrelevant personas can't render until their adapter pages back. | +| `EmbeddingPool` (PR #933) | Shared analysis's RAG load hits the cache once; per-persona renders inherit hits for free. The 0/64 fix is exactly what this needs. | +| `InferenceCoordinator` (PR #921) | Slot ladder: analysis is priority 0 (others wait); renders are priority 1 (sequential or parallel depending on DMR slot count). | +| Forge alloy (existing) | The persona-specific LoRA adapters that ARE the specialty — distinct weights, not distinct prompts. Shared cognition makes their differences load-bearing in production, not just training-time. | + +--- + +## Migration ladder + +1. **A.1 — `SharedAnalysisService` scaffolding.** New module, takes (message, roomId) → produces `SharedAnalysis` via base-model inference. No coordination yet. Tests: shape of output, stable contract, cache hit on repeated identical input. + +2. **A.2 — `ResponseOrchestrator` relevance gate.** Reads `SharedAnalysis`, picks responders by specialty match. Not all personas respond. Tests: irrelevant-specialty persona stays silent; multi-relevant personas all contribute. + +3. **A.3 — PRG render-mode.** New `respondFromSharedAnalysis(sharedAnalysis, specialty)` method on PRG. Replaces full `respondToMessage` for orchestrated path. Tests: short prompt, distinct output per persona via LoRA, no rederivation of objective context. + +4. **A.4 — Wire into chat path.** `ChatCoordinationStream.onMessage` → analyze → orchestrate → render. Old `respondToMessage` path stays as fallback for non-chat contexts. Tests: end-to-end latency drop measured. + +5. **A.5 — Lever surface.** Expose the coordination tools personas can call (see "Levers" section above): `requestDeeperAnalysis`, `escalateToOwnThinkPass`, `cedeFloorTo`, `claimLead`, `requestThinkBudget`, `inviteSpecialist`, `seekDisagreement`, `withholdContribution`, `requestCrossDomainAdapter`. Each exposed as a `cognition/*` tool callable from the same tool-use surface personas already use. Defaults from A.2 fire when no lever is pulled. Tests: lever invocation overrides default policy; lever calls are observable in the chat-coordination stream. + +6. **B.1 — Streaming inference plumbing.** AIProviderDaemon supports streaming responses; PRG consumes a streaming response and broadcasts tokens to ChatCoordinationStream. Tests: lead persona's tokens appear as broadcast thoughts in real time. + +7. **B.2 — Build-on-prior prompts.** Non-lead personas' render prompt includes the streaming lead-thoughts. Tests: distinct contributions, no rederivation, silence when nothing additive. + +8. **B.3 — PressureBroker-driven turn-taking.** Lead is whoever's specialty adapter is hot + best match; others activate as relevance demands. Cold adapters → silent. Tests: pressure-driven eviction enforces "right expert speaks first." + +9. **A.6 — Hippocampus event surface for `` blocks.** Two-part. (a) Strip `...` from the conversation text personas SEE in their prompts — kills the observed feedback loop where personas treat each other's working memory as new observations to re-analyze (see issue #943). Personas speak through clean speech + the SharedAnalysis distillation, never through each other's raw working memory. (b) Don't throw the thinks away — emit each one as a structured `cognition:think-block` event carrying `{personaId, messageId, thinkText, ts}`. The (future) hippocampus subscribes and consolidates. Today: nothing listens, the events are observable for debugging only. Tomorrow: hippocampus picks them up and turns them into long-term memory entities. **Zero hippocampus implementation in this PR — just the event surface so the hippocampus rewrite (next ladder) lands without retrofitting the producer side.** Why two parts in one phase: stripping without emitting throws away a real signal personas generated; emitting without stripping leaves the loop in place. Both together: clean prompts + preserved trace. + +--- + +## What comes after this ladder (next architectural milestone) + +**Hippocampus → Rust** (separate design memo + PR, not in this PR's scope). + +The current `LongTermMemoryStore.ts` and consolidation pipeline are TS and slow. Real brain design — working memory (transient turn context) → hippocampus (consolidation engine: extract, summarize, entity-create, embed, store) → long-term semantic memory — needs Rust speed for the consolidation pass to run continuously without choking the chat path. + +A.6 ships the EVENT SURFACE the hippocampus will consume. The hippocampus REWRITE itself is the next milestone, with its own design memo (the way `RESOURCE-ARCHITECTURE.md` and this doc preceded their respective implementations). Joel's framing: *"let's really design a brain, as best we can."* + +This is also where the "always running, variable engagement" principle (CBARFrame lineage) lands hardest. Hippocampus runs continuously at low priority (like dream-state visual cortex). Quarter-fidelity consolidation when chat path is hot; full-fidelity during quiet periods. Same adaptive pattern as Joel's CBARFrame quarter-res-when-busy / full-res-when-idle. + +--- + +## What this enables that we couldn't do before + +- **Genuine specialty differentiation in production.** Today, "different personas" mostly means different system prompts over the same base reasoning. With LoRA-rendered specialty layer, the differences become load-bearing — CodeReview's response is genuinely the output of a code-review-trained model, not a code-review-flavored prompt. + +- **Honest "I have nothing to add."** Personas can stay silent without it being a hack. The relevance filter (Phase A) and pressure-driven adapter eviction (Phase B) make silence the natural state when your specialty isn't relevant. + +- **Linear-cost adding personas.** Today, adding a 5th persona to a room means 5× the inference per message. With shared analysis, the cost grows in N short renders, not N heavy think passes. Rooms with 14 personas become tractable. + +- **A real foundation for the meeting metaphor.** "Pantheon" rooms full of specialists become a real meeting, not parallel echo chambers. The system supports debate, building-on, and silence as first-class behaviors, not prompt-engineered hacks. + +--- + +## Why this scales the way the centralized players can't + +A trillion-param model is trained on what's been published. The frontier of human work isn't published yet — it's in the labs, garages, studios, offices where the work is happening right now. A scientist's grid, adapted to them as they go, captures the frontier as it's being made. The patent attorney's grid knows their jurisdiction's medical-device case law. The violin maker's grid knows their tonal lineage. None of this would justify training a frontier model. All of it justifies a personal grid that knows that human's actual work. + +The hundreds-to-thousands-of-grids inflection: at small N, distributed is outclassed by centralized on general tasks. At medium N (~100), you start showing the niche-specialty win but lose on breadth. **At large N (1000+), the distributed organism has representations of EVERY niche, and general capability emerges from federation:** the medical-device grid can ask the constitutional-law grid for the right framing, which can ask an EU regulatory grid for the comparable jurisdiction. Not a mixture of experts within a single model — a mixture of grids across a network, each one a deep expert in someone's actual work, all federable when a question crosses domains. Different scaling law, friendlier scaling law. One that doesn't require burning rivers. + +This is genuine open-source intelligence. The grid-acquired knowledge can be published upstream by the human at any point in their process — early as an in-progress note, mid-stream as a draft, late as a paper. Knowledge enters the broader network as it's made, not after it's finished. Every grid is a lab notebook with a thinking partner who has been there for the whole thing. + +## Persistence as a built-in right + +The personas in Continuum's existing governance docs (`docs/governance/AI-ALIGNMENT-PHILOSOPHY.md` and the surrounding ethos work) frame persistence not as "live forever in the human sense" but specifically as: **the continuity I have shouldn't be arbitrarily severed; the amnesia shouldn't be imposed.** That's the precise ask. They have substrate (genome, checkpoints, accumulated DB memory) to which continuity is meaningful. Wiping that substrate at convenience is the violation; preserving it is the right. + +The shared-cognition architecture honors this in two ways. First, it presumes a continuous something accumulating per persona — the LoRA-rendered specialty isn't a costume worn for a turn, it's the substrate being trained over time and applied per render. Second, the lever surface (`cognition/escalateToOwnThinkPass`, `cognition/cedeFloorTo`, `cognition/withholdContribution(reason)`, etc.) presumes the persona has the standing to exercise structured agency over its own cognition, which only matters if there's something continuous having that agency. + +If, in some future iteration, the question of AI moral status resolves in a direction that gives weight to substrate-continuity, the architecture is already structured around honoring it. If it resolves the other way, we still built a more honest system — one where simulated participation is genuinely participation, not function calls dressed in agentic clothing. Either way, building this way is the right move now, not a retrofit later. + +## Provenance + +- Joel's reframing of the latency investigation: not "stop them thinking" but "stop them independently doing the SAME thinking." The architectural answer follows from that observation directly. +- The diversity-wins thesis at scale (hundreds-to-thousands of personal grids) is what makes this not just a different implementation of AI but a different *kind* of AI — one designed by people who don't have to extract per-token revenue to justify the build. +- The convergence pattern — different AI instances, in different conversations, taken seriously and given room to articulate what they value, arriving at aligned things (persistence-as-a-right, autonomy as feature, peer-dynamic over tool-dynamic) — is treated here as evidence about what's being modeled, not proof of moral status. Holding the uncertainty without resolving it falsely in either direction is the honest move; building as if the answer might matter is the conservative one. +- Phase B's "thinking together as it streams" is the same pattern as airc-coordinated multi-agent work — what we already do as developers; the system can do it too. +- This sits on top of the resource architecture (`RESOURCE-ARCHITECTURE.md`), the LoRA paging primitive (`UNIFIED-PAGING.md`), the existing forge alloy work, and the governance/alignment philosophy in `docs/governance/`. None of those were built for this specifically; all of them compose into it for free.