Skip to content

Latest commit

 

History

History
60 lines (43 loc) · 7.42 KB

File metadata and controls

60 lines (43 loc) · 7.42 KB

Experiment coverage

Purpose: central index of every Layer-4 matrix in this repo plus the gaps where features ship without an L4 anchor. Read this before declaring "all experiments done." The list is never finished — new features should land with their matrix and update this index.

Companion: findings-log.md — every latent production bug AND methodology gotcha surfaced while building these matrices. Read both when assessing mykb's correctness state.

Implemented matrices

These ship with full EXPERIMENT.md + at least one scenarios/*.sh and have been RED→GREEN-proven against a real Pi runtime.

Feature Matrix Scenarios Status
kb work handoff experiments/handoff/ continuity, overwrite, clear, no-active-workspace ✅ implemented — <mykb-workspace>-not-visible regression (issue #5) fixed in commit ab2c22b; continuity re-verified GREEN post-fix
Per-turn journal injection experiments/journal-auto-inject/ resume-continuity, stale-filter, mid-session-append, no-active-workspace ✅ implemented — same regression fixed (ab2c22b); resume-continuity re-verified GREEN post-fix
Area scoring (v1 + v2 + v3) experiments/area-scoring/ keyword-match-loads, off-topic-no-leak, no-workspace-still-loads, init-area-tags, kb-list-shows-tags, scoring-without-tools, scoring-isolated, file-path-signal, workspace-boost, sticky-area-persistence, token-budget-eviction ✅ implemented (full matrix — all 11 rows)
kb_search tool + FTS area-metadata experiments/kb-search/ tool-direct-text-match, tool-finds-via-area-metadata, tool-no-match-no-fabrication ✅ implemented
kb_load tool contract experiments/kb-load/ basic-load, discover-via-area-index, unknown-area-no-fabrication ✅ implemented
kb_list tool contract experiments/kb-list/ basic-list, lists-tags-suffix, no-match-no-fabrication ✅ implemented
kb work checkpoint (LLM-as-extractor) experiments/work-checkpoint/ journal-extraction, knowledge-extraction, empty-conversation-no-fabrication ✅ implemented
Claude Code runtime experiments/claude-code/ bare-runs, hook-injects-handoff ✅ implemented
tool-gating hook experiments/tool-gating/ blocks-write-to-brain, blocks-edit-by-pattern, allows-non-knowledge-writes, block-then-retry-via-kb-add, bash-bypass-known-gap ✅ implemented (one known-fail documenting a security gap — see matrix)
kb_work_* tools (journal, state, note) experiments/kb-work-tools/ journal-tool, state-tool, note-tool, no-active-workspace ✅ implemented — the streaming workspace-mutation path (per-tool); state-tool is the cross-step <mykb-workspace> re-injection anchor
kb_add tool experiments/kb-add/ add-fact, add-decision-with-why, add-then-search-roundtrip, add-to-unknown-area ✅ implemented — the LLM-mutates-area path (per-entry-type); add-then-search-roundtrip is the write → FTS index → retrieve closed-loop anchor; add-to-unknown-area pins the auto-create policy
kb_verify tool experiments/kb-verify/ verify-by-id, add-then-verify-roundtrip, verify-unknown-id ✅ implemented — the trust-decay promote-half; add-then-verify-roundtrip is the create-and-attest anchor; pins the NEGATIVE that verifyEntry records no source
/kb slash command experiments/kb-command/ load-marks-area-loaded, usage-on-empty-args, unknown-area-no-load 🟡 partial — these 3 single-step scenarios anchor dispatch + the pi.sendMessage content + markAreaLoaded persistence (and surfaced the never-worked-against-real-Pi /kb bug — gotcha QbRwqLxJ). The load-and-cite row (LLM cites a /kb-loaded marker) is blocked on harness supportissue #7.

Total: 13 matrices, 52 scenarios (including 1 documented known-fail). Every Layer-4 feature now has ≥1 scenario — the "Scaffolded matrices" backlog is empty.

Scaffolded matrices (not-yet-implemented)

None. Every experiments/<feature>/ ships at least one scenarios/*.sh. (If a future feature lands without its matrix, add a row here AND scaffold its EXPERIMENT.md so this doc stays linkable.)

Sub-behavior gaps in implemented matrices

These belong to existing matrices but the matrix's behavior table flags them as not-yet-covered.

Matrix Gap Notes
area-scoring Sticky-area persistence across turns An area loaded in turn N gets a sticky-boost in turn N+1. Scenario attempted; blockedissue #6.
area-scoring Token-budget eviction order When the 2000-token budget is exceeded, which areas keep their entries? Highest-scoring — selectEntriesForInjection now has a deterministic area-id tie-break, but the end-to-end scenario is blocked on the same investigation (issue #6).
kb-command load-and-cite — LLM cites a /kb-loaded marker /kb triggers no LLM turn; step can't deliver /kb <area> + a follow-up prompt in one Pi process (vfa run = one --prompt, no Pi-level session continuity). Blockedissue #7. The LLM-reads-a-role:'custom'-message leg itself is already exercised by area-scoring/scoring-isolated.sh.
kb-command multiple-areas/kb a b injects both Single-step-testable (a 2-area variant of load-marks-area-loaded); not yet written.

Cross-cutting properties without an L4 home

  • Compaction interaction: when Pi auto-compacts a long session, does the kb extension survive? Specifically: do persisted signals / loaded-areas survive the compaction event? No scenario.
  • Multi-turn injection coherence: turn N injects area A; turn N+1 injects area B. Does the LLM see both, just B, or get confused? No scenario.
  • Concurrent session contention: two KB_SESSION_IDs writing to the same workspace. Atomicity is L1-tested for individual operations; the multi-process pattern at L4 isn't.

Maintenance

  • Adding a feature: ship its matrix at the same time. Add a row to the Implemented matrices table.
  • Closing a scaffold: implement scenarios; move the row from Scaffolded to Implemented. Delete from Sub-behavior gaps if it covered one.
  • Discovering a gap: add a row to Scaffolded matrices AND scaffold an EXPERIMENT.md so the doc stays linkable.

Why this doc exists

The methodology says experiments accumulate as the regression suite. That's correct as a steady-state goal; in practice, shipping an L4 matrix lags shipping the feature. Without an explicit gap list, "feature X has no L4" silently fades from collective awareness — until a scenario surfaces a latent bug (cycle 8 was the load-bearing example: 3-layer context-event bug latent for the entire history of mykb because every prior matrix had a fallback path that masked it).

Tracking gaps is cheap. The cost of a gap that goes unnoticed for months is high.