Version: 2.0
Status: Mandatory (all development work must follow this document)
These rules govern how code is written in this project. No exceptions.
Every feature starts with a failing test.
- RED — Write a test that describes the desired behavior. Run it. Watch it fail. If it passes, the test is wrong or the feature already exists.
- GREEN — Write the minimum code to make the test pass. No more. Resist the urge to add "while I'm here" improvements.
- REFACTOR — Clean up the implementation. Tests must still pass. No new behavior in this step.
Rules:
- No implementation code without a failing test driving it
- Tests describe behavior, not implementation details
- Table-driven tests for functions with multiple input/output cases
- Test names describe the scenario: `getActiveWorkspaceId with KB_SESSION_ID set + session file exists returns file content`
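The table-driven rule and the scenario-style naming can be sketched together. This is a framework-agnostic illustration using Node's built-in `assert`; in a Vitest suite the same rows could be passed to `it.each`. `slugify` is a hypothetical function under test, not part of this codebase.

```typescript
import { strict as assert } from "node:assert";

// Hypothetical function under test (illustration only, not project code).
function slugify(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// One row per input/output case. Each name describes the scenario,
// so a failure message says which behavior broke.
const cases: Array<{ name: string; input: string; expected: string }> = [
  { name: "slugify with plain words returns hyphenated lowercase", input: "Hello World", expected: "hello-world" },
  { name: "slugify with surrounding whitespace trims it", input: "  Hello  ", expected: "hello" },
  { name: "slugify with punctuation runs collapses them to one hyphen", input: "a -- b!", expected: "a-b" },
  { name: "slugify with empty input returns empty string", input: "", expected: "" },
];

function runCases(): string[] {
  const passed: string[] = [];
  for (const c of cases) {
    assert.equal(slugify(c.input), c.expected, c.name);
    passed.push(c.name);
  }
  return passed;
}

const passedCases = runCases();
```

Adding a new case is one new row; the assertion logic stays untouched.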
Commit granularity:
- Default: one commit per step (`test:` RED, `feat:` GREEN, `refactor:` optional)
- Combined RED+GREEN is acceptable when the change is small and cohesive (e.g., <50 lines of implementation, tests and code tell one story). Use a single `feat:` commit.
- Never combine across features. Each feature gets its own RED/GREEN cycle.
What NOT to test:
- Pi's internal behavior (it's not our code)
- SQLite's correctness (trust the engine)
- Third-party libraries
Every layer has specific things it MUST verify. No layer is optional when triggered.
| Layer | What it tests | Speed | Runs on |
|---|---|---|---|
| 1. Unit | Core library logic, exact assertions | ms | Every commit |
| 2. CLI Integration | Commands through the real CLI binary, output + side effects | seconds | Every commit |
| 3. E2E Isolation | Concurrent access, multi-instance state, cross-process behavior | seconds | When triggered (see rules below) |
| 4. Behavioral | LLM-facing output quality via string matching or LLM-as-judge | seconds | When triggered (see rules below) |
Layers 1-2 are deterministic and must never be flaky. Layers 3-4 may involve real processes or LLM calls.
Location (Layer 1, Unit): `tests/core/`, `tests/extension/`, `tests/tools/`
Must verify:
- Method returns correct data for valid input
- Method returns null or throws typed error for invalid input
- Edge cases: empty input, null, boundary values
- Env var save/restore in tests that modify `process.env`
Must NOT rely on:
- Real brain directory (use `withTempBrain` helper)
- Network, Docker, or git
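A helper like `withTempBrain` might look roughly like the sketch below. The name comes from this document, but the signature and behavior shown here are assumptions: the real helper may be async or may seed more files.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed shape: run a callback against a throwaway brain directory and
// always remove it afterwards, even if the callback throws.
function withTempBrain<T>(fn: (brainDir: string) => T): T {
  const brainDir = mkdtempSync(join(tmpdir(), "kb-test-brain-"));
  try {
    return fn(brainDir);
  } finally {
    rmSync(brainDir, { recursive: true, force: true });
  }
}

// Usage: the test body only ever sees the temporary directory.
let seenDir = "";
const existedDuringTest = withTempBrain((brainDir) => {
  seenDir = brainDir;
  writeFileSync(join(brainDir, "area.json"), JSON.stringify({ name: "demo" }));
  return existsSync(join(brainDir, "area.json"));
});
const existsAfterTest = existsSync(seenDir);
```

Because cleanup sits in `finally`, a failing assertion inside the callback still leaves no directory behind.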
Location (Layer 2, CLI Integration): `tests/cli/`
Must verify:
- Command exits 0 on success, non-zero on failure
- Output contains expected content
- File mutations are durable: write → read back → data present
- Cross-command round-trips: `kb work start` → `kb work show` → workspace visible
- Error messages are actionable (tell the user what to do)
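The exit-code and output checks above can be sketched by spawning a real process. To stay self-contained, this example spawns a tiny inline Node script as a stand-in for the CLI; the real tests would presumably spawn the built `kb` binary instead. `runCli` and the stand-in's messages are hypothetical.

```typescript
import { spawnSync } from "node:child_process";

// Stand-in for the real CLI binary (assumption). It prints on success and
// sets a non-zero exit code with an actionable message on failure.
const fakeCli = `
  const cmd = process.argv[1];
  if (cmd === "start") {
    console.log("workspace started: demo");
  } else {
    console.error("unknown command '" + cmd + "'. Run 'start' to create a workspace.");
    process.exitCode = 1;
  }
`;

interface CliResult {
  code: number | null;
  stdout: string;
  stderr: string;
}

// Run the command as a separate process so argument parsing, exit codes,
// and output streams are exercised for real.
function runCli(args: string[]): CliResult {
  const r = spawnSync(process.execPath, ["-e", fakeCli, ...args], { encoding: "utf8" });
  return { code: r.status, stdout: r.stdout, stderr: r.stderr };
}

const started = runCli(["start"]);
const rejected = runCli(["bogus"]);
```

A Layer 2 test would then assert on `code`, `stdout`, and `stderr`, and read back any files the command was supposed to write.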
How to run:
```bash
npm test                                              # all tests
npx vitest run tests/cli/work.test.ts                 # single file
npx vitest run tests/cli/ -t "creates a workspace"    # single test
```
Location (Layer 3, E2E Isolation): `tests/cli/session-isolation.test.ts` (and similar)
Must verify:
- Concurrent access to shared state produces correct results per actor
- One actor's operations do not corrupt another actor's state
- Backward compatibility: absence of isolation mechanism falls back correctly
- Cleanup: ephemeral state is removed on stop/exit
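The four properties above can be pinned down as concrete expected results. The sketch below runs in a single process, so it only shows what each actor should observe; it does not prove concurrency safety. A real Layer 3 test would drive the same checks through separate CLI processes running at the same time. The file layout (`sessions/<id>` next to `.active`) and all function names are assumptions for illustration.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical layout: <brain>/.active is the shared pointer used when no
// session id is set; <brain>/sessions/<id> is the per-session pointer.
function pointerPath(brain: string, sessionId?: string): string {
  return sessionId ? join(brain, "sessions", sessionId) : join(brain, ".active");
}

function setActive(brain: string, workspaceId: string, sessionId?: string): void {
  if (sessionId) mkdirSync(join(brain, "sessions"), { recursive: true });
  writeFileSync(pointerPath(brain, sessionId), workspaceId);
}

function getActive(brain: string, sessionId?: string): string | null {
  if (sessionId && existsSync(pointerPath(brain, sessionId))) {
    return readFileSync(pointerPath(brain, sessionId), "utf8");
  }
  const shared = pointerPath(brain);
  return existsSync(shared) ? readFileSync(shared, "utf8") : null; // backward-compatible fallback
}

function stopSession(brain: string, sessionId: string): void {
  rmSync(pointerPath(brain, sessionId), { force: true }); // ephemeral state removed on stop
}

const brain = mkdtempSync(join(tmpdir(), "kb-isolation-"));
setActive(brain, "ws-shared"); // actor without a session id
setActive(brain, "ws-a", "actor-a");
setActive(brain, "ws-b", "actor-b");

const actorA = getActive(brain, "actor-a");
const actorB = getActive(brain, "actor-b");
const noSession = getActive(brain);

stopSession(brain, "actor-a");
const pointerLeftBehind = existsSync(pointerPath(brain, "actor-a"));
const afterStop = getActive(brain, "actor-a");

rmSync(brain, { recursive: true, force: true });
```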
How to run:
```bash
npx vitest run tests/cli/session-isolation.test.ts
```
Location (Layer 4, Behavioral): `experiments/<feature>/` (per-feature spec + scenario scripts), driven by `scripts/spike/kb-spike`.
Methodology: see docs/experimentation-METHODOLOGY.md for the full spec.
This layer is for features whose behavior emerges only when Pi runs against a real brain: hooks, context injection, scorer changes, journal-auto-inject, signal capture. In these cases unit tests prove "the function returns the right value", but the user-visible question is "does the LLM act correctly when this fires." Each such feature gets a versioned experiment in `experiments/<feature>/` with:
- An `EXPERIMENT.md` that includes a behavior matrix (stimulus → expected behavior → scenario), which serves as the spec.
- One bash scenario per row of the matrix, each shaped as prepare → stimulate → observe.
- Paired positive/negative scenarios for any behavior-counting feature (fires when it should + does NOT fire when it shouldn't).
Run via `bash scripts/spike/kb-spike run-scenario <exp-id> <scenario>`. The harness clones the brain into an isolated instance, snapshots the bundle being tested, runs Pi against the instance, and asserts on whatever surface the feature affects (LLM output, state diffs, counters, event logs).
The kb-spike harness itself is not part of kb — it's a separate operator tool in scripts/spike/ so testing kb does not require kb to work.
Lightweight behavioral assertions (string matching, LLM-as-judge) inside unit/integration tests are still valid for renderer-level checks. They live alongside the unit test, not under experiments/. Use them when the assertion is on the rendered string, not on the running LLM's behavior.
Different changes demand different test layers. Use this table to determine which tests are mandatory.
| Change type | Layer 1 (Unit) | Layer 2 (CLI) | Layer 3 (E2E Isolation) | Layer 4 (kb-spike experiment) |
|---|---|---|---|---|
| Pure logic (scorer, parser, renderer) | Required | — | — | If LLM-facing output |
| New CLI command or flag | Required | Required | — | — |
| Shared mutable state (files read/written by multiple processes) | Required | Required | Required | — |
| Data format change (JSONL schema, workspace.json) | Required | Required | — | — |
| LLM-facing output (context delivery, area index, rendered markdown) | Required | — | — | Required |
| Hook behavior (signal capture, context injection, side-effect counters, anything observable only in the full Pi+brain loop) | Required | — | — | Required (paired pos/neg scenarios) |
| Bug fix | Required (reproduce first) | If CLI-visible | If concurrency-related | If the bug was Layer-4-observable |
| Cross-repo feature | Required per repo | Required per repo | Required (end-to-end) | — |
The rule: If your change touches shared mutable state — any file, database, or resource that multiple processes may access simultaneously — you must write an E2E isolation test proving concurrent access is safe. Unit tests with mocked filesystems are not sufficient.
Before a feature is considered done, every applicable item must be checked.
- Unit test for the happy path
- Unit test for each error case
- If the feature adds/changes a CLI command → CLI integration test (exit code, output, side effects)
- If the feature touches shared mutable state → E2E isolation test with concurrent actors
- If the feature changes LLM-facing output → behavioral validation (string matching or LLM-as-judge in unit tests)
- If the feature is Layer-4-required (hooks, context injection, scorer changes, anything observable only in the full Pi+brain loop) → `experiments/<feature>/` exists with `EXPERIMENT.md` (behavior matrix) and at least one scenario per matrix row, and `bash scripts/spike/kb-spike run-scenario <exp-id> <scenario>` passes for every scenario. See `docs/experimentation-METHODOLOGY.md`.
- If the feature spans multiple repos → integration test in each repo, plus a cross-boundary test
- If the feature uses env vars → tests save/restore `process.env` to prevent cross-test pollution
- Full test suite passes: `npm test`
- Every public function has a test
- No `any` types
- Interfaces defined before implementations
- Dependencies injected, not hardcoded
- Error cases handled — not swallowed, not ignored
- No dead code — if it's not called, delete it
- `npm run build` succeeds
- `npm test` passes (all tests, all layers)
These are things that feel productive but cause bugs. Do not do them.
| Anti-Pattern | Why It's Bad | Do This Instead |
|---|---|---|
| Write code first, add tests after | Tests verify the implementation, not the requirement. They pass by construction and miss edge cases. | RED → GREEN → REFACTOR. Always. |
| Unit test shared mutable state with mocks only | Mocks prove method logic, not concurrent behavior. Two tests passing independently doesn't prove two processes won't corrupt each other. | Unit tests for logic AND E2E tests with concurrent actors through the real CLI. |
| Test the method but not the CLI path | A method that works in isolation may not be wired correctly. The CLI may parse args wrong, pass the wrong env, or skip the method entirely. | Layer 2 CLI integration tests for every user-facing behavior. |
| Skip E2E tests because "unit tests cover it" | Unit tests for `getActiveWorkspaceId()` passed. But the actual bug (two sessions clobbering `.active`) only manifests when two CLI processes run concurrently. Unit tests can't catch this. | Write E2E tests for the scenario the feature is designed to handle. |
| Modify `process.env` without save/restore | Test A sets `KB_SESSION_ID`. Test B runs without it but inherits A's value. Test B passes for the wrong reason. | `beforeEach`/`afterEach` save and restore env vars. Use unique IDs (`test-${randomUUID()}`) per test. |
| Combine RED+GREEN across features | Mixing two features in one commit makes it impossible to revert one without the other. | One RED/GREEN cycle per feature. Combined commits only within a single feature. |
| Manifesto exists but CLAUDE.md doesn't reference it | An AI agent will never read a manifesto it doesn't know about. The rules are followed only when the human remembers to enforce them. | Reference the manifesto from CLAUDE.md so it's loaded into every session. |
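The `process.env` save/restore row can be sketched as a small helper. `saveEnv` is a hypothetical name; in a test file the snapshot would typically be taken in `beforeEach` and the returned function called in `afterEach`.

```typescript
import { randomUUID } from "node:crypto";

// Snapshot the named env vars and return a function that restores them
// exactly, including deleting variables that were originally unset.
function saveEnv(keys: string[]): () => void {
  const saved: Record<string, string | undefined> = {};
  for (const key of keys) saved[key] = process.env[key];
  return () => {
    for (const key of keys) {
      const value = saved[key];
      if (value === undefined) delete process.env[key];
      else process.env[key] = value;
    }
  };
}

const originalValue = process.env.KB_SESSION_ID;

const restore = saveEnv(["KB_SESSION_ID"]); // beforeEach
const sessionId = `test-${randomUUID()}`;   // unique per test, so tests cannot collide
process.env.KB_SESSION_ID = sessionId;      // test body
const duringTest = process.env.KB_SESSION_ID;
restore();                                  // afterEach
const afterTest = process.env.KB_SESSION_ID;
```

Deleting the key rather than assigning `undefined` matters: assigning to `process.env` coerces the value to a string, which would leave the literal text `"undefined"` behind.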
Every module does one thing. If you can't describe what it does in one sentence without "and", split it.
| Module | Responsibility |
|---|---|
| `store.ts` | Read/write JSONL files and area.json metadata |
| `db.ts` | SQLite queries and FTS5 search |
| `hydrate.ts` | Rebuild SQLite from JSONL files |
| `config.ts` | Resolve brain location and settings |
| `types.ts` | Type definitions and validation |
| `render.ts` | Format knowledge entries as markdown for LLM consumption |
| `scorer.ts` | Score area relevance from signals |
| `state.ts` | Track session state (loaded areas, turn count) |
| `workspace.ts` | Workspace CRUD, active workspace tracking, session isolation |
Modules are open for extension, closed for modification.
- The scorer accepts new signal sources via a `SignalProvider` interface, so adding file-path scoring doesn't modify keyword scoring
- Knowledge types (fact, decision, gotcha, pattern, link) are defined in a type map, so adding a new type doesn't modify existing handlers
- Output renderers implement a `Renderer` interface, so adding YAML output doesn't modify the markdown renderer
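As a rough sketch of the first bullet: new scoring behavior arrives as a new provider, and existing providers are not edited. The document names `SignalProvider` but not its members, so the `Signal` shape, the `score` method, and both provider classes here are assumptions.

```typescript
// Hypothetical shapes for illustration only.
interface Signal {
  text: string;
  filePaths: string[];
}

interface SignalProvider {
  score(area: string, signal: Signal): number;
}

class KeywordProvider implements SignalProvider {
  score(area: string, signal: Signal): number {
    return signal.text.toLowerCase().includes(area.toLowerCase()) ? 1 : 0;
  }
}

// Added later: no change to KeywordProvider or scoreArea was needed.
class FilePathProvider implements SignalProvider {
  score(area: string, signal: Signal): number {
    return signal.filePaths.some((p) => p.split("/").includes(area)) ? 1 : 0;
  }
}

// The scorer only knows the interface and sums whatever providers it is given.
function scoreArea(area: string, signal: Signal, providers: SignalProvider[]): number {
  return providers.reduce((total, p) => total + p.score(area, signal), 0);
}

const signal: Signal = { text: "fix the auth token refresh", filePaths: ["src/billing/invoice.ts"] };
const keywordOnly: SignalProvider[] = [new KeywordProvider()];
const withPaths: SignalProvider[] = [...keywordOnly, new FilePathProvider()];

const authScore = scoreArea("auth", signal, withPaths);
const billingBefore = scoreArea("billing", signal, keywordOnly);
const billingAfter = scoreArea("billing", signal, withPaths);
```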
Any implementation of an interface can replace another without breaking callers.
- `KnowledgeStore` interface: JSONL+SQLite implementation today, could be pure SQLite or remote API tomorrow
- `SearchEngine` interface: FTS5 today, could be embeddings later
- Tests use in-memory implementations that satisfy the same interfaces
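A minimal sketch of an in-memory implementation, plus a contract check written only against the interface. The `KnowledgeStore` methods mirror the interface defined in this document; the `KnowledgeEntry` and `EntryFilter` shapes, `InMemoryStore`, and `checkStoreContract` are assumptions for illustration.

```typescript
// Assumed entry and filter shapes; the real ones live in the project's types.
interface KnowledgeEntry {
  id?: string;
  type: string;
  text: string;
}
interface EntryFilter {
  type?: string;
}

interface KnowledgeStore {
  addEntry(area: string, entry: KnowledgeEntry): string;
  getEntry(area: string, id: string): KnowledgeEntry | null;
  listEntries(area: string, filter?: EntryFilter): KnowledgeEntry[];
  deleteEntry(area: string, id: string): void;
}

class InMemoryStore implements KnowledgeStore {
  private areas = new Map<string, Map<string, KnowledgeEntry>>();
  private nextId = 1;

  addEntry(area: string, entry: KnowledgeEntry): string {
    const id = `e${this.nextId++}`;
    const entries = this.areas.get(area) ?? new Map<string, KnowledgeEntry>();
    entries.set(id, { ...entry, id });
    this.areas.set(area, entries);
    return id;
  }
  getEntry(area: string, id: string): KnowledgeEntry | null {
    return this.areas.get(area)?.get(id) ?? null;
  }
  listEntries(area: string, filter?: EntryFilter): KnowledgeEntry[] {
    const entries = this.areas.get(area);
    const all: KnowledgeEntry[] = entries ? Array.from(entries.values()) : [];
    const wanted = filter?.type;
    return wanted ? all.filter((e) => e.type === wanted) : all;
  }
  deleteEntry(area: string, id: string): void {
    this.areas.get(area)?.delete(id);
  }
}

// Any implementation that passes this check can replace another for callers
// that rely only on the interface.
function checkStoreContract(store: KnowledgeStore): string[] {
  const failures: string[] = [];
  const id = store.addEntry("demo", { type: "fact", text: "WAL mode is on" });
  if (store.getEntry("demo", id)?.text !== "WAL mode is on") failures.push("round-trip");
  if (store.listEntries("demo", { type: "gotcha" }).length !== 0) failures.push("filter");
  store.deleteEntry("demo", id);
  if (store.getEntry("demo", id) !== null) failures.push("delete");
  if (store.getEntry("missing-area", "nope") !== null) failures.push("not-found");
  return failures;
}

const contractFailures = checkStoreContract(new InMemoryStore());
```

Running the same check against every implementation is one way to keep substitutability honest as new backends appear.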
Consumers depend only on what they use.
| Consumer | Needs | Does NOT need |
|---|---|---|
| CLI | `KnowledgeStore`, `SearchEngine`, `Renderer` | Scorer, session state, Pi events |
| Extension hooks | `KnowledgeStore`, `SearchEngine`, `Scorer`, `State` | CLI arg parsing, renderer |
| Registered tools | `KnowledgeStore`, `SearchEngine` | Scorer, session hooks |
| Scorer | `SearchEngine`, `State` | Store writes, CLI, renderer |
High-level modules depend on abstractions, not concrete implementations.
```typescript
// YES: depend on interface
function addFact(store: KnowledgeStore, entry: FactEntry): string
// NO: depend on concrete
function addFact(jsonlPath: string, sqliteDb: Database, entry: FactEntry): string
```
- Core library defines interfaces in `types.ts`
- Implementations satisfy interfaces
- CLI and extension receive dependencies via constructor injection or factory functions
- Tests inject mocks/stubs that satisfy the same interfaces
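A small sketch of the injection points listed above: a factory wires the dependency once, and a test passes a stub that satisfies the same interface. The `FactEntry` fields, the narrowed `KnowledgeStore`, `createApi`, and `RecordingStore` are all assumptions made for the example.

```typescript
// Assumed shapes; the real definitions would live in types.ts.
interface FactEntry {
  area: string;
  text: string;
}
interface KnowledgeStore {
  addEntry(area: string, entry: { type: string; text: string }): string;
}

// High-level function: knows the interface, not files or databases.
function addFact(store: KnowledgeStore, entry: FactEntry): string {
  return store.addEntry(entry.area, { type: "fact", text: entry.text });
}

// Factory function: the concrete store is chosen once, at the edge.
function createApi(store: KnowledgeStore) {
  return {
    addFact: (entry: FactEntry) => addFact(store, entry),
  };
}

// Test stub that satisfies the interface and records what it was asked to do.
class RecordingStore implements KnowledgeStore {
  calls: Array<{ area: string; type: string; text: string }> = [];
  addEntry(area: string, entry: { type: string; text: string }): string {
    this.calls.push({ area, type: entry.type, text: entry.text });
    return `stub-${this.calls.length}`;
  }
}

const stub = new RecordingStore();
const api = createApi(stub);
const returnedId = api.addFact({ area: "db", text: "FTS5 is enabled" });
```

Swapping `RecordingStore` for a production store changes one line at the call to `createApi` and nothing in `addFact`.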
| Pattern | Where | Why |
|---|---|---|
| Repository | `store.ts` | Abstracts JSONL + SQLite behind a clean `KnowledgeStore` interface. Callers don't know about files or databases. |
| Strategy | `scorer.ts` | Signal providers are pluggable. File-path scoring, command scoring, keyword scoring are independent strategies. |
| Facade | Core library API | Simple high-level functions (addFact, loadArea, search) hide the JSONL → SQLite → FTS5 complexity. |
| Observer | Extension hooks | Pi events are subscribed to, not polled. Each hook is an independent observer. |
| Factory | DB initialization | createDatabase() handles schema creation, WAL mode, FTS5 setup. Callers get a ready-to-use database. |
| Null Object | Missing brain | When brain doesn't exist, return empty results instead of throwing. Auto-init handles creation. |
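The Null Object row can be illustrated as follows. The document names `SearchEngine` but not its methods, so the `search` signature, both classes, and `openSearch` are hypothetical; `FixedSearch` is only a stand-in for a real engine.

```typescript
import { randomUUID } from "node:crypto";
import { existsSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed minimal shape for the sketch.
interface SearchEngine {
  search(query: string): string[];
}

// Null Object: satisfies the interface with no brain behind it.
class EmptySearch implements SearchEngine {
  search(_query: string): string[] {
    return [];
  }
}

// Stand-in for a real engine backed by an existing brain.
class FixedSearch implements SearchEngine {
  private readonly docs: string[];
  constructor(docs: string[]) {
    this.docs = docs;
  }
  search(query: string): string[] {
    return this.docs.filter((d) => d.includes(query));
  }
}

// Callers never branch on "does the brain exist"; they just search.
function openSearch(brainDir: string): SearchEngine {
  return existsSync(brainDir) ? new FixedSearch(["sqlite uses WAL mode"]) : new EmptySearch();
}

const missingBrain = join(tmpdir(), `kb-missing-${randomUUID()}`);
const missingResults = openSearch(missingBrain).search("sqlite");
const existingResults = openSearch(tmpdir()).search("sqlite");
```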
Strict typing:
- `strict: true` in `tsconfig.json`
- No `any`: use `unknown` + type guards if the type is truly unknown
- No type assertions (`as`) unless provably safe with a comment explaining why
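A short sketch of the `unknown` plus type guard rule, including the one kind of assertion the last bullet allows. The `FactEntry` shape and both function names are assumptions for illustration.

```typescript
// Hypothetical entry shape, used only to show the guard.
interface FactEntry {
  type: "fact";
  text: string;
}

// Type guard: narrows unknown to FactEntry by checking the runtime shape.
function isFactEntry(value: unknown): value is FactEntry {
  if (typeof value !== "object" || value === null) return false;
  // Safe assertion: value was just checked to be a non-null object.
  const candidate = value as Record<string, unknown>;
  return candidate.type === "fact" && typeof candidate.text === "string";
}

function parseFactLine(line: string): FactEntry {
  // JSON.parse returns any; binding it to unknown forces a check before use.
  const parsed: unknown = JSON.parse(line);
  if (!isFactEntry(parsed)) throw new TypeError(`not a fact entry: ${line}`);
  return parsed;
}

const fact = parseFactLine('{"type":"fact","text":"WAL mode is on"}');

let rejected = false;
try {
  parseFactLine('{"type":"fact"}'); // missing text
} catch (err) {
  rejected = err instanceof TypeError;
}
```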
Interfaces before implementations:
```typescript
// Define the contract first
interface KnowledgeStore {
  addEntry(area: string, entry: KnowledgeEntry): string;
  getEntry(area: string, id: string): KnowledgeEntry | null;
  listEntries(area: string, filter?: EntryFilter): KnowledgeEntry[];
  deleteEntry(area: string, id: string): void;
}
// Then implement
class JsonlSqliteStore implements KnowledgeStore {
  // ...
}
```
Naming conventions:
- Interfaces: `PascalCase`, no `I` prefix (`KnowledgeStore`, not `IKnowledgeStore`)
- Types: `PascalCase` (`FactEntry`, `ProvenanceStatus`)
- Functions: `camelCase` (`addFact`, `loadArea`)
- Files: `kebab-case` (`knowledge-store.ts`, `fact-entry.ts`)
- Constants: `UPPER_SNAKE_CASE` (`DEFAULT_BRAIN_PATH`, `MAX_TIER2_TOKENS`)
- Enums: `PascalCase` members (`ProvenanceStatus.Verified`)
Module structure:
- One public interface/class per file
- Export from barrel `index.ts` files per directory
- Internal helpers are not exported
Error handling:
- Define domain-specific error classes (`AreaNotFoundError`, `EntryValidationError`)
- Never catch and swallow errors silently
- Return `null` for "not found" cases, throw for "something is wrong" cases
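One way these rules might fit together is sketched below. The error class names come from this document; their constructor arguments, the messages, and the choice to treat a missing area as "something is wrong" while a missing entry is "not found" are assumptions.

```typescript
class AreaNotFoundError extends Error {
  readonly area: string;
  constructor(area: string) {
    super(`Area "${area}" not found. Check the name or create the area first.`);
    this.name = "AreaNotFoundError";
    this.area = area;
    // Keeps instanceof working if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class EntryValidationError extends Error {
  readonly field: string;
  constructor(field: string, reason: string) {
    super(`Invalid entry: ${field} ${reason}.`);
    this.name = "EntryValidationError";
    this.field = field;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Tiny in-memory data set so the sketch is self-contained.
const areas = new Map<string, Map<string, string>>([
  ["db", new Map([["e1", "WAL mode is on"]])],
]);

function getEntryText(area: string, id: string): string | null {
  const entries = areas.get(area);
  if (!entries) throw new AreaNotFoundError(area); // something is wrong
  return entries.get(id) ?? null;                  // not found
}

function validateText(text: string): string {
  if (text.trim() === "") throw new EntryValidationError("text", "must not be empty");
  return text;
}

const found = getEntryText("db", "e1");
const notFound = getEntryText("db", "e404");

let areaError: unknown = null;
try {
  getEntryText("billing", "e1");
} catch (err) {
  areaError = err; // captured for inspection, not swallowed
}

let validationError: unknown = null;
try {
  validateText("   ");
} catch (err) {
  validationError = err;
}
```

Typed errors let callers branch with `instanceof` instead of matching on message strings.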
One logical change per commit. Each commit should be independently understandable and revertable.
TDD commit sequence:
```
test: add failing test for addFact with verified provenance
feat: implement addFact with provenance support
refactor: extract provenance builder into separate function
```
Combined commits: RED+GREEN may be combined into a single feat: commit when the change is small (<50 lines implementation) and cohesive. Never combine across features.
Conventional commits:
- `test:` for test code (RED step, or adding missing test coverage)
- `feat:` for implementation code (GREEN step, or combined RED+GREEN)
- `refactor:` for refactoring (REFACTOR step)
- `fix:` for a bug fix
- `docs:` for documentation
- `chore:` for build, tooling, dependencies
No AI attribution in commits. No "Generated with", no "Co-Authored-By", no mentions of AI assistance.
This manifesto is a living document. When a bug escapes that should have been caught:
- Identify which checklist item would have prevented it
- If no item exists, add one to the Feature Completion Checklist
- If an anti-pattern caused it, add it to the Anti-Patterns table
- Increment the version number
v2.0 (2026-03-18) — Major revision. Added testing pyramid (4 layers), change type → test requirements mapping, feature completion checklist, and anti-patterns table. Triggered by session isolation implementation where unit tests passed but the actual multi-instance bug was only caught by E2E tests that weren't written until prompted. The manifesto now mandates E2E isolation tests for any feature touching shared mutable state. Relaxed commit discipline to allow combined RED+GREEN for small changes. Added workspace.ts to module responsibility table.
v1.0 (initial) — TDD protocol, SOLID principles, design patterns, TypeScript standards, commit discipline, code review checklist.