v3.5.0 — the read side gets intelligent#14
Merged
The doctor check rolled its own crude rule (missing config OR metadata = ghost), counting empty stale leftover dirs as ghosts: doctor reported "29 ghosts" while `codevira projects` reported "0 ghost · 29 stale" on the same machine. It now delegates to _project_inventory.enumerate_projects / summarize — the single source of truth the projects/clean commands already read — so the two surfaces agree by construction. Stale dirs are surfaced informationally, never warned. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
decision_lock was presence-based: any non-additive edit to a file holding a do_not_revert decision hard-blocked, even when the change had nothing to do with what was locked (e.g. editing a tool description on server.py while the locked decisions are about the background watcher). It now compares the edit's diff envelope against each locked decision's text: a non-additive edit blocks only when it shares >=2 salient tokens with a locked decision's subject; a provably-orthogonal edit downgrades block->warn, decisions still surfaced for self-check. Conservative — unparseable/empty diffs still block, and CODEVIRA_DECISION_LOCK_CONTENT_AWARE=0 restores strict file-level locking. The destructive-Write moat becomes content-aware too. Reframed the lock / anti-regression / diff-envelope tests to the new contract (orthogonal modify -> warn; subject-touching modify -> block). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
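The block-versus-warn rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the tokenizer, the stop-word list, and the names `salient_tokens` and `lock_verdict` are hypothetical and are not taken from the codevira source; only the ">= 2 shared salient tokens" threshold and the conservative handling of empty diffs come from the commit message.

```python
import re

# Hypothetical stop-word list; the commit message does not say how "salient" is defined.
STOPWORDS = {"the", "and", "for", "with", "must", "not", "all", "are", "this", "that"}
MIN_SHARED = 2  # threshold stated in the commit message


def salient_tokens(text: str) -> set[str]:
    """Lowercased identifier-like tokens of length >= 3, minus stop words."""
    tokens = re.findall(r"[a-z_][a-z0-9_]{2,}", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def lock_verdict(diff_text: str, locked_decisions: list[str]) -> str:
    """Return 'block' or 'warn' for a non-additive edit to a file holding locked decisions."""
    diff_tokens = salient_tokens(diff_text)
    if not diff_tokens:
        # Conservative default: an empty or unparseable diff still blocks.
        return "block"
    for decision in locked_decisions:
        if len(diff_tokens & salient_tokens(decision)) >= MIN_SHARED:
            return "block"  # the edit touches the subject of a locked decision
    return "warn"  # orthogonal edit: decisions are surfaced but do not block


locked = ["Background watcher must debounce file events"]
print(lock_verdict("- description='List tools'\n+ description='List every tool'", locked))
print(lock_verdict("- watcher.debounce(0.5)\n+ watcher.start()", locked))
```

The first call shares no salient token with the locked decision and downgrades to a warning; the second shares `watcher` and `debounce` and blocks.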
search_decisions / list_decisions now default to compact rows — a one-line decision summary + key fields (do_not_revert, file_path, tags, score) — dropping the heavy per-row snippet/origin that dominated the old default. The new expand(ids=[...]) tool fetches full records on demand: scan cheap summaries, expand only what matters. Backward-compatible: full=true still returns untruncated rows, summary_only is unchanged, and CODEVIRA_DECISION_DETAIL=full restores the pre-E1 verbose default machine-wide. The compact rows keep the 'decision' key (a one-line summary) rather than renaming to 'summary', so existing callers don't break. get_session_context's recent_decisions now collapse to a true one-line summary (was a hard char-cut that could keep newlines). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
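A minimal sketch of the scan-then-expand pattern described above. The key names (`decision`, `do_not_revert`, `file_path`, `tags`, `score`) come from the commit message; the in-memory `store` dict, the 120-character cap, and the helper names are assumptions made for illustration only.

```python
SUMMARY_MAX = 120  # hypothetical cap; the real limit is not stated in the commit message
COMPACT_KEYS = ("id", "do_not_revert", "file_path", "tags", "score")


def one_line(text: str, limit: int = SUMMARY_MAX) -> str:
    """Collapse all whitespace (including newlines) and truncate to one line."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[: limit - 3].rstrip() + "..."


def compact_row(record: dict) -> dict:
    """Keep the key fields and a one-line 'decision'; drop heavy fields such as snippet/origin."""
    row = {key: record.get(key) for key in COMPACT_KEYS}
    row["decision"] = one_line(record.get("decision", ""))
    return row


def expand(ids: list[str], store: dict[str, dict]) -> list[dict]:
    """Fetch full records on demand for the ids that matter."""
    return [store[i] for i in ids if i in store]


full = {
    "id": "D000001",
    "decision": "All writes go through atomic.py\nso partial files never appear",
    "do_not_revert": True,
    "file_path": "atomic.py",
    "tags": ["io"],
    "score": 1.5,
    "snippet": "x" * 500,
    "origin": "session",
}
row = compact_row(full)
print(row["decision"])
print(expand(["D000001"], {"D000001": full})[0]["origin"])
```

Keeping the `decision` key in the compact row, rather than renaming it, mirrors the backward-compatibility choice described in the commit message.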
The autouse _isolate_global_home fixture only redirected get_global_home, so
the PER-PROJECT data dir kept resolving from the cwd (= the repo root under
pytest). Any test that called decisions_store.record() without its own project
fixture wrote to the REAL repo's .codevira/decisions.jsonl. Over three weeks
this leaked 1240 'Use bcrypt for password hashing' fixtures into local memory
(found by the new read-side relevance eval: recall was 16% on the polluted
corpus, 100% after cleanup).
The fixture now chdir's into a throwaway tmp project (kept OUTSIDE the test's
tmp_path so a test doing tmp_path.rglob('*') can't pick up the injected
config.yaml). chdir composes with test-local project fixtures and leaves
project-resolution logic exercisable, unlike forcing _project_dir_override.
test_cli_version now passes cwd=repo to its subprocess (chdir otherwise made
'python -m mcp_server' resolve a stale installed package).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The intelligent read-side cluster (E2 + E3 + Phase 13), grouped because they co-evolve the CLI surface.
E2: read-only session-transcript ingest. mcp_server/ingest/ scans local Claude Code / Codex / Gemini logs, heuristically flags tool failures + user corrections (no LLM), and folds a sanitized, capped digest into the existing reflect pipeline via 'codevira reflect --from-sessions'. Candidates only.
E3: read-side relevance eval. 'codevira eval' measures recall@k / MRR / precision of search_decisions on cases self-derived from real .codevira/ memory. Lexical by default; LLM-as-judge offline + opt-in. Non-gating.
Phase 13: learned hot-path weight tuning. 'codevira tune-weights' grid-searches relevance_inject's ranking weights to maximize the E3 objective, persisting only a meaningful win. Hot path reads them OPT-IN (CODEVIRA_LEARNED_WEIGHTS, default off), cached for cache-stability, with transparent fallback to the shipped defaults.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
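For readers unfamiliar with the E3 metrics, recall@k and MRR can be computed as below. This uses the standard textbook definitions; the function names and the `(ranked_ids, relevant_ids)` case shape are assumptions for illustration and do not describe how `codevira eval` is actually structured.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over (ranked_ids, relevant_ids) cases."""
    n = len(cases)
    if n == 0:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in cases) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in cases) / n,
    }


cases = [
    (["D1", "D2", "D3"], {"D2"}),  # relevant decision found at rank 2
    (["D4", "D5"], {"D9"}),        # relevant decision never returned
]
print(evaluate(cases, k=2))
```

A weight tuner in the spirit of Phase 13 would re-run such an evaluation for each candidate weight setting and keep a setting only if the objective improves by a meaningful margin.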
Document the v3.5.0 work accumulated on this branch: summary-first payloads (E1), session-transcript ingest (E2), read-side relevance eval (E3), learned weight tuning (Phase 13), content-aware decision lock (Phase 18), the doctor ghost_projects fix, and the test-isolation/memory-integrity fix. AGENTS.md is the regenerated decision-tail. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The marker-block writer (previously AGENTS.md only) now maintains a configured set of memory files: CLAUDE.md, GEMINI.md, and .cursor/rules/codevira.mdc, selected via the managed_files key in .codevira/config.yaml. Default stays AGENTS.md-only: writing to a user's CLAUDE.md / .cursor unprompted is surprising, so the rest are opt-in.
regenerate_all() loops the configured targets with two modes: shared_md (AGENTS/CLAUDE/GEMINI: reuse the atomic _merge_into_file, preserve everything outside the markers byte-for-byte) and owned_mdc (.cursor/rules/codevira.mdc: a dedicated codevira-owned file written as YAML frontmatter [byte 0, as Cursor requires] + block). Re-runs are idempotent (no mtime churn); one failing target never blocks the others. sync_after_write() now drives regenerate_all (default config = AGENTS.md only, byte-identical to before).
Echo-safety: the E2 ingest scanner skips any text containing the codevira managed-block marker, so an injected block echoed into a transcript is never re-ingested as a user correction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
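The shared_md behaviour (replace only what sits between the markers, leave the rest untouched, and stay idempotent) can be sketched as a pure string function. The marker strings and the name `merge_block` are hypothetical; the real markers and the atomic `_merge_into_file` are not shown in this conversation.

```python
# Hypothetical markers; codevira's actual marker text is not shown in this PR conversation.
BEGIN = "<!-- codevira:begin -->"
END = "<!-- codevira:end -->"


def merge_block(existing: str, block: str) -> str:
    """Replace the managed block if present, else append it; text outside the markers is kept."""
    managed = f"{BEGIN}\n{block}\n{END}"
    start = existing.find(BEGIN)
    end = existing.find(END, start + len(BEGIN)) if start != -1 else -1
    if start != -1 and end != -1:
        return existing[:start] + managed + existing[end + len(END):]
    separator = "" if not existing or existing.endswith("\n") else "\n"
    return existing + separator + managed + "\n"


user_text = "# My notes\nHand-written content stays as is.\n"
first = merge_block(user_text, "decision tail v1")
second = merge_block(first, "decision tail v1")   # re-run with the same block
updated = merge_block(first, "decision tail v2")  # block content changed

print(first.startswith(user_text))  # user content precedes the block unchanged
print(second == first)              # idempotent re-run
print("v1" in updated)              # old block content fully replaced
```

A caller can compare the merged text with the existing file content and skip the write when they are equal, which is one way to avoid mtime churn on re-runs.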
Phase 16 was scoped as 'add TypeScript to get_signature (Python-only)', but get_signature already dispatches .ts/.tsx/.js/.jsx/.go/.rs to tree-sitter and extracts symbols — the 'Python-only' phase note was stale roadmap drift. Verified empirically (TS: functions/classes/methods; JS: functions) and closed the real gaps: the module docstring, the get_signature docstring, and the unsupported-extension error message all omitted JavaScript/.js/.jsx even though they work. Added a regression test class pinning TS/TSX/JS/JSX get_signature so the support can't silently regress (or be mistaken for Python-only again); the web-language cases skip when the optional tree-sitter grammars aren't installed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
outcome_tracker (SQLite → confidence) and storage/outcomes_writer (JSONL →
digest/replay/skills) each ran an INDEPENDENT git analysis to label a
decision kept/modified/reverted, and could disagree — e.g. the tracker
flagged a revert-message commit 'reverted' while the writer (which only
checked file deletion) called it 'modified', and they anchored differently
(--since vs anchor..HEAD).
Both now delegate to one shared indexer.outcome_classifier.classify_outcome,
so the two surfaces agree by construction. The shared heuristic merges the
best of both: the writer's precise anchor (commit at/before the decision ts)
plus the tracker's revert-keyword detection. A correctness win: a git failure
now yields None ('can't classify') instead of the old optimistic 'kept'.
Lives in indexer/ so mcp_server.storage imports it in the normal direction.
The old per-surface git helpers are now dead but left in place (removing them
trips the blast-radius guard on the high-fan-in outcomes_writer; not worth it
for cosmetic cleanup). New tests assert both surfaces agree across all four
scenarios on a real git repo; existing surface tests now mock the shared
classifier boundary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
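The merged heuristic can be pictured as one pure function that both surfaces call. This is a sketch under assumptions: the real `classify_outcome` signature, its keyword list, and how it talks to git are not shown here, so the inputs below (commit subjects after the anchor and a file-exists flag) and the name `classify_outcome_sketch` are hypothetical.

```python
from typing import Optional, Sequence

# Hypothetical keyword list; the actual revert-keyword detection is not shown in the commit.
REVERT_KEYWORDS = ("revert", "rollback", "roll back")


def classify_outcome_sketch(
    commit_subjects: Optional[Sequence[str]],
    file_exists: bool,
) -> Optional[str]:
    """Label a decision 'kept', 'modified', or 'reverted'; None means it cannot be classified.

    commit_subjects: subjects of commits touching the file after the anchor commit,
    or None when the git query itself failed.
    """
    if commit_subjects is None:
        return None  # git failure: no optimistic 'kept'
    if not file_exists:
        return "reverted"  # the writer's original signal: the file was deleted
    lowered = [subject.lower() for subject in commit_subjects]
    if any(keyword in subject for subject in lowered for keyword in REVERT_KEYWORDS):
        return "reverted"  # the tracker's original signal: a revert-style commit message
    if commit_subjects:
        return "modified"
    return "kept"


print(classify_outcome_sketch([], True))
print(classify_outcome_sketch(["Tune cache size"], True))
print(classify_outcome_sketch(['Revert "Use bcrypt"'], True))
print(classify_outcome_sketch(None, True))
```

Because both the SQLite tracker and the JSONL writer would call the same function with the same inputs, they cannot disagree on a label.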
The model-free half of E5 (the [semantic] embedding extra stays deferred; the on-device model is on hold). OR-joining and porter stemming were already in the FTS5 path; the remaining recall win was synonym expansion for concept-vocabulary mismatch.
New mcp_server/storage/synonyms.py carries a curated dev-domain synonym map (auth/login, db/postgres/sql, config/settings, async/await, ...). When CODEVIRA_SYNONYM_WIDENING is set, _sanitize_fts_query OR-expands each token with its group, so a query in one vocabulary recalls a decision recorded in another (e.g. 'database' -> 'postgres').
Default OFF, by evidence: the E3 relevance eval showed widening keeps recall@5 at 1.000 on the self-derived corpus but slips MRR 0.942->0.935 (more OR terms dilute ranking), so it doesn't win where queries already use the decision's own words. Its value is the synonym-mismatch case, pinned by a test where 'database' recalls a 'postgres' decision only with widening on. Same discipline as Phase 13: ship a read-surface change only where it measurably helps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
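A rough sketch of OR-expanding query tokens with a synonym group before handing the string to FTS5. The groups below are only partly taken from the commit message (placing 'database' in the db group is an assumption inferred from the 'database' -> 'postgres' example), and `widen_query` is a hypothetical stand-in for whatever `_sanitize_fts_query` really does.

```python
import re

# Illustrative groups; the curated map in synonyms.py is not shown in this conversation.
SYNONYM_GROUPS = [
    {"auth", "login"},
    {"db", "database", "postgres", "sql"},
    {"config", "settings"},
]
_LOOKUP = {term: group for group in SYNONYM_GROUPS for term in group}


def widen_query(query: str, enabled: bool) -> str:
    """Build an OR-joined FTS5 query; with widening on, each token expands to its group."""
    parts = []
    for token in re.findall(r"\w+", query.lower()):
        group = sorted(_LOOKUP.get(token, {token})) if enabled else [token]
        quoted = " OR ".join(f'"{term}"' for term in group)
        parts.append(f"({quoted})" if len(group) > 1 else quoted)
    return " OR ".join(parts)


print(widen_query("database schema", enabled=False))
print(widen_query("database schema", enabled=True))
```

In real use the `enabled` flag would presumably be derived from the `CODEVIRA_SYNONYM_WIDENING` environment variable; it is passed explicitly here so the sketch stays testable. The extra OR terms are also what dilutes ranking, which matches the small MRR slip reported above.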
Adds tests/e2e/test_v350_acceptance.py: one end-to-end check that walks every v3.5.0 deliverable through its real, user-facing surface (real stores, signals, policies, CLI), in the same idiom as test_cross_tool_universality.py rather than unit mocks. Green = v3.5.0 is coherent and releasable; a red class names exactly which shipped feature regressed.
Covered:
- the doctor ghost/stale fix (D0000Z4)
- content-aware decision lock, orthogonal-vs-conflict (D00010B)
- summary-first + expand() round-trip (D0000ZQ)
- read-only session ingest (D00010W)
- self-derived relevance eval (D00010Y)
- opt-in learned-weights round-trip (D00010Z)
- managed files sharing one canonical block (D000110)
- get_signature multi-language surface (P16)
- the unified git outcome classifier kept/modified/reverted (D000112)
- opt-in synonym recall (D000113)
- a release-coherence class asserting CHANGELOG completeness, the env-flag default-off contract, and CLI/MCP surface registration
Wired into `make test-e2e` so it runs as part of G2 in the release gauntlet: the gate fails loudly if any v3.5.0 surface regresses before a release can be cut.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cuts the v3.5.0 release across every version-bearing surface:
- version 3.4.0 -> 3.5.0 (pyproject.toml + mcp_server/__init__.py)
- CHANGELOG [Unreleased] promoted to [3.5.0] - 2026-06-17
- README: new "What's new in v3.5.0" section + corrected long-standing tool-count drift. The full tools/list is 50 tools / ~8K tokens (it was documented as both a 24-tool and a 49-tool surface in different spots), the lean profile is 12 tools / 71% smaller, and the CLI is 23 commands (eval + tune-weights landed in v3.5.0).
- ROADMAP: v3.5.0 promoted to current. The prior "what's next" table (E1-E5, TS get_signature, learned tuning, single outcome store) ALL shipped in v3.5.0, so it's rewritten as an honest forward roadmap (opt-in [semantic], symbol-level locking, cross-project search).
- website/index.html: un-stuck from v3.3.0 -> v3.5.0 (it had missed v3.4.0 too), 49 -> 50 tools, content-aware decision-lock card.
- docs/architecture.md: version header + 49 -> 50 AI-facing tools.
- AGENTS.md: regenerated via `codevira sync`.
- test_v350_acceptance: the CHANGELOG-completeness check now reads the [3.5.0] section (features move out of [Unreleased] on promotion).
Counts verified against the real tools/list handler (50 default, 12 lean) and argparse subcommands (23). Full suite: 2883 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FAQ.md and PROTOCOL.md still documented features removed one to three versions ago: semantic search / ChromaDB / search_codebase (gone in v2.2.0) and the whole changeset workflow + analyze_changes / refresh_index / update_node (gone in v3.0.0). An agent following PROTOCOL.md verbatim would call tools that no longer exist.
- FAQ: the ChromaDB/embeddings Q&As rewritten to the honest keyword-only reality (FTS5/BM25, no vectors, no model download, now a selling point); changesets dropped from feature lists; the cross-tool-continuity example updated to the real get_session_context panels (next_action, working memory, style); call-graph + troubleshooting answers point at get_impact / the watcher instead of deleted tools.
- PROTOCOL: session start/end flows rewired to the current surface (get_session_context, get_phase / update_phase_status, complete_phase); multi-file work tracked via roadmap phases + working memory, not changesets; get_signature language list adds JS/JSX.
- docs/roadmap.md: drop the removed open_changesets bullet.
Verified zero removed-tool references remain (find_hotspots / analyze_changes confirmed deleted in mcp_server/server.py, not merely hidden).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es])
A deep completeness sweep surfaced stale "current" claims the first release-sync pass didn't catch:
- README: the Production-stable table still said 49 tools / 21 CLI (now 50 / 23); the README's OWN Roadmap section still listed the v3.5.0 features (E1-E5 etc.) as "next up" and named v3.4.0 as current -> rewritten to v3.5.0 + the real forward roadmap.
- The `[all-languages]` tree-sitter extra was removed in v2.2.0 but was still advertised in README (language section), website, docs/architecture, AND a self-contradicting pyproject.toml comment (one line said it was available, another that it was removed). All four corrected to "removed; agents Read those files directly".
Engine-policy count (8) verified against register_default_policies and left unchanged. Counts re-derived from the running handlers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…isions
jsonl_store._compute_next_id_locked tail-reads only the last 4 KB on files >100 KB, and (correctly) skips amendment records when scanning back for the last issued id. But that 4 KB window can contain ONLY amendments, e.g. right after `observe-git` appends a burst of small outcome-amendment records. The reversed scan then found no real id and fell through to "D000001", silently COLLIDING with the existing D000001 and clobbering it in the merged read view, including do_not_revert decisions.
Observed live this session: a record_decision call right after a sync got D000001 and shadowed the locked "all writes go through atomic.py" decision in a committed AGENTS.md.
Fix: when the tail-window scan comes up empty on a >100 KB file, re-scan the FULL file before falling back to D000001. The scan is factored into a helper so both paths share it. Regression test reproduces the exact shape (>100 KB file whose last 4 KB are pure amendments) and pins next-id = max+1. Also gives the binary tail-read its own variable so the file type-checks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
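The tail-read with full-file fallback can be sketched as follows. Several details are assumptions made for illustration: amendments are marked here with a hypothetical `"type": "amendment"` field, ids are treated as a plain decimal counter (`D000001`, `D000002`, ...), and the function names are not the ones in jsonl_store.

```python
import json
import os

TAIL_BYTES = 4 * 1024        # tail window size from the commit message
TAIL_THRESHOLD = 100 * 1024  # only files larger than this are tail-read


def _last_issued(lines: list[str]):
    """Scan lines newest-first for the last real id, skipping amendments and partial lines."""
    for line in reversed(lines):
        try:
            record = json.loads(line)
        except ValueError:
            continue  # e.g. the first, cut-off line of a tail window
        if not isinstance(record, dict) or record.get("type") == "amendment":
            continue
        record_id = str(record.get("id", ""))
        if record_id.startswith("D") and record_id[1:].isdigit():
            return int(record_id[1:])
    return None


def _read_lines(handle) -> list[str]:
    return handle.read().decode("utf-8", errors="ignore").splitlines()


def next_id(path: str) -> str:
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return "D000001"
    with open(path, "rb") as handle:
        last = None
        if os.path.getsize(path) > TAIL_THRESHOLD:
            handle.seek(-TAIL_BYTES, os.SEEK_END)
            last = _last_issued(_read_lines(handle))
        if last is None:
            # The fix: an empty tail scan triggers a full re-scan before any fallback.
            handle.seek(0)
            last = _last_issued(_read_lines(handle))
    return f"D{(last or 0) + 1:06d}"
```

Without the full re-scan branch, a tail window made entirely of amendments would yield `None` and the function would hand out `D000001` again, which is the collision described above.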
The 7 agent-persona files (builder/developer/documenter/orchestrator/planner/reviewer/tester), duplicated in agents/ and the shipped mcp_server/data/agents/, were orphaned when the `agents` CLI command was removed in v3.0.0: nothing loads them, and they still referenced removed tools (list_open_changesets, search_codebase, refresh_index, add_node). Deleting all 14 drops dead weight from the wheel (data/** is package data) and removes misleading content.
docs/troubleshooting/antigravity.md troubleshot a torch/dlopen sandbox failure that became impossible in v2.2.0 (semantic search + torch removed). Replaced with an obsolete-notice stub (kept as a file because the CHANGELOG references it historically).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The top of the README stacked three "What's new" sections (v3.5.0 + v3.4.0 + v3.0.0, ~75 lines) before Quick Start, pushing the how-do-I-start content down to line 169 — bad for first-read comprehension. - Quick Start now follows The Problem directly (line 96). - The three changelog sections collapse to one concise "What's new in v3.5.0" (relocated below Quick Start) plus a one-line "Earlier releases" pointer to the CHANGELOG. No content invented; the per-release detail already lives in CHANGELOG.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The landing page sold codevira but gave developers no on-site path to actually use it. New website/docs.html — self-contained, matches the site's design tokens + theme toggle (shared cv-theme localStorage key) — covers: install + 3-command quick start, per-IDE MCP setup, the daily CLI commands, the 50-tool MCP surface by category, the opt-in env flags, and troubleshooting. Linked from the landing nav. Content verified against the running code: 50 AI-facing tools, 23 CLI commands, and the authoritative CODEVIRA_* env-var list. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
get_signature / get_code resolved any path string to file contents —
_resolve() returned absolute paths verbatim and never checked a relative
path stayed under the project root, so get_code("/etc/passwd") or
"../../../../etc/shadow" read arbitrary files. These two tools are the one
MCP surface that turns a path string into file contents, and they bypassed
the engine's edit-path containment (which only covers writes).
Added _within_root(): the readers now refuse paths that resolve outside
the project root (symlinks + .. resolved), returning the existing
found:False error shape. Bounded severity for a local single-user tool (the
user already has FS access), but it closes a prompt-injection vector where a
poisoned instruction makes the agent read secrets outside the repo.
Found by the v3.5.0 MCP-server audit. Regression-tested.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
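A containment check of the kind described above can be written with pathlib alone. This is a sketch, not the project's `_within_root()`: the function names and the `error` key are hypothetical; only the `found: False` shape and the "resolve symlinks and `..` first" behaviour come from the commit message.

```python
from pathlib import Path


def within_root(candidate: str, root: Path) -> bool:
    """True only if candidate, with symlinks and '..' resolved, stays under root."""
    resolved_root = root.resolve()
    # Joining with an absolute candidate discards the root, so "/etc/passwd" stays "/etc/passwd".
    target = (resolved_root / candidate).resolve()
    return target == resolved_root or resolved_root in target.parents


def read_code(candidate: str, root: Path) -> dict:
    """Reader that refuses out-of-root paths before touching the filesystem contents."""
    if not within_root(candidate, root):
        return {"found": False, "error": "path resolves outside the project root"}
    target = (root.resolve() / candidate).resolve()
    if not target.is_file():
        return {"found": False, "error": "file not found"}
    return {"found": True, "content": target.read_text(encoding="utf-8")}
```

Resolving before comparing is the important step: a plain string-prefix check on the raw input would accept `src/../../secret` and would miss symlinks that point outside the repository.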
MCP-server audit improvements (the protocol layer was otherwise in good shape: prompts, resources, tools, and sampling/createMessage all in use):
- Tool annotations: ToolAnnotations were set on 1/50 tools. All 50 now declare readOnlyHint (28 reads) / idempotentHint / destructiveHint=False, so hosts can run read tools without a confirmation prompt and reason about side effects. Applied via one centralized pass keyed off a _READ_ONLY set.
- read_resource returned a bare str, which is deprecated in the SDK, and it defaulted the content mimeType to text/plain, so the text/html declared in list_resources never reached the renderer (Claude Desktop showed escaped HTML instead of the decision timeline). Now returns ReadResourceContents with mime_type=text/html: fixes the DeprecationWarning and the render bug.
- ToolAnnotations import is resilient (nested try) so an older mcp degrades gracefully instead of failing the whole import; SDK pinned mcp>=1.0,<2 so a future major can't silently break installs.
- http_server docstring tool count 36 -> 50.
Tests: the mcp.types mock stubs ToolAnnotations; read_resource callers extract .content from the new return shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
read_resource already serves the parameterized decisions resource
(codevira://decisions/<query> filters the timeline by substring), but it
was never declared via resources/templates/list — so resources/list only
advertised the static codevira://decisions URI and an MCP client had no way
to discover the query-able form.
Added a list_resource_templates handler returning the
codevira://decisions/{query} template. Surfaced by the v3.5.0 MCP-server
deep review. Test pins the declaration; the test-server/http mocks gain the
new decorator passthrough (per the documented "add new @server decorators
here too" rule).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rint
doctor's C12 read ~/.codevira/crash.log while crash_logger writes crashes.log, so it never saw a real crash and always passed vacuously; crashes were invisible. C12 now reads the canonical path and surfaces what's failing: "N crash(es) recorded, M distinct; most recent: <ExcType>". Also fixed a fix_command that pointed at the removed `codevira report` command.
Added crash_fingerprint(): a stable 12-char key = sha(exc_type + normalized top stack frames [basename:func, no line numbers/paths/values] + major.minor). Same root cause on any machine -> same key. Written into every crash entry (FINGERPRINT:) and grouped by crash_digest(), the dedup foundation for any future opt-in reporting. Distinct from the message-based _crash_signature used for the 60s local rate-limit.
Self-restart already works: install_global_handler logs then re-raises via the original excepthook, so the process exits cleanly and the IDE respawns.
(pre-commit: exclude test_crash_logger.py from detect-private-key; its fake PEM blocks are fixtures for the redactor test.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
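The fingerprint recipe above can be approximated as follows. The commit message does not say which SHA variant is used, how many frames count as "top", or which version supplies major.minor, so SHA-256, the innermost five frames, and an explicit `version` argument are assumptions of this sketch, and `crash_fingerprint_sketch` is a hypothetical name.

```python
import hashlib
import os
import traceback


def crash_fingerprint_sketch(exc: BaseException, version: str, top_n: int = 5) -> str:
    """12-char key from exception type, normalized frames, and major.minor version."""
    frames = traceback.extract_tb(exc.__traceback__)
    # Keep only basename:function for the innermost frames; no line numbers, paths, or values.
    normalized = [f"{os.path.basename(f.filename)}:{f.name}" for f in frames[-top_n:]]
    major_minor = ".".join(version.split(".")[:2])
    payload = "|".join([type(exc).__name__, *normalized, major_minor])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def _parse(value: str) -> int:
    return int(value)  # raises ValueError with a message that contains the input


def _capture(value: str) -> BaseException:
    try:
        _parse(value)
    except ValueError as exc:
        return exc
    raise AssertionError("expected a ValueError")


first = crash_fingerprint_sketch(_capture("abc"), "3.5.0")
second = crash_fingerprint_sketch(_capture("xyz"), "3.5.1")
print(first == second)  # same root cause, different message and patch version
```

Because the exception message is excluded from the payload, two crashes with the same type and call path collapse to one key even when the offending values differ, which is what makes the key usable for grouping.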
An adversarial review of a proposed "auto-apply learned weights by default +
Stop-hook auto-tune" change found it unsafe, so that prototype was reverted
and the shipped opt-in behavior kept. This corrects the learned_weights
docstring, which had drifted ahead of reality:
- Removed the false "(debounced at the Stop hook)" claim. There is NO
automatic re-tune; the file changes only when `codevira tune-weights` is
run by hand. Claude Code's Stop hook fires PER-TURN, so an auto-tune loop
would rewrite learned_weights.json mid-conversation and bust the prompt
cache; this is recorded inline so the gap is not silently reopened.
- Stated the honest caveat that the tuner's "win" is on the E3 OFFLINE
proxy (recall@k + MRR): it does not model precision/noise (blind to the
D00005N signal-to-noise failure mode) and diverges from the online
relevance_inject._score_candidates scorer (token vs substring; FTS-only
pool vs the tag/file/FTS union; synthesized vs real prompts), so a
proxy-win can rank WORSE online. That is why apply stays opt-in via
CODEVIRA_LEARNED_WEIGHTS, not a hard "never worse than defaults" promise.
- Noted the win is machine-LOCAL (proven against the producing machine's
corpus + FTS index); re-run the tuner per machine.
No behavior change: relevance_inject and the Claude Code Stop-hook wiring are
reverted to their committed opt-in state; this commit is docstring + caveats
only. CHANGELOG already states "default off, fallback to shipped defaults".
Verified green: tests/test_weight_tuning_p13.py (14), the learned-weights +
env-flag-contract acceptance tests, the full v3.5.0 acceptance gate (27), and
the relevance_inject hero-policy suite (19).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TestAcceptanceScenarios::test_7_hero_2_plus_hero_1_simultaneous_fire asserts that decision_lock (priority 100) is the PRIMARY blocker when it co-fires with anti_regression (priority 80). That holds only when decision_lock is in its default BLOCK mode. The module's autouse _clear_env fixture cleared CODEVIRA_ANTI_REGRESSION_MODE but NOT CODEVIRA_DECISION_LOCK_MODE, so a `CODEVIRA_DECISION_LOCK_MODE=warn` left in a developer's shell downgraded decision_lock to a warn: anti_regression became the sole blocker and the priority assertion failed (got 'anti_regression', expected 'decision_lock').
This was a non-hermetic-test false failure, not a code regression. GitHub CI (clean env) is green, but `make release-gauntlet` / `make test-unit` failed locally whenever the var lingered.
Extend _clear_env to also clear CODEVIRA_DECISION_LOCK_MODE and CODEVIRA_DECISION_LOCK_CONTENT_AWARE, mirroring the v3.5.0 acceptance suite's _clean_v350_env. No test in this file sets either var, so default (block) behavior is asserted regardless of ambient env.
Verified with CODEVIRA_DECISION_LOCK_MODE=warn STILL exported:
- tests/engine/test_anti_regression.py: 19 passed, 4 skipped
- make test-unit: 2778 passed, 12 skipped (previously 1 failed)
- make test-e2e: 70 passed, 9 skipped
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cut date moved to the actual release day. Updates the CHANGELOG [3.5.0] heading and the README + ROADMAP anchor links that derive from it (#350--2026-06-19), so the cross-references stay valid. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v3.5.0 — the read side gets intelligent
Ships the v3.5.0 roadmap (E1–E5 + Phases 13/16/17/18) plus this session's release-hardening. No new runtime dependencies.
Highlights
- Summary-first payloads: `search_decisions` / `list_decisions` + `expand(ids)`
- Session-transcript ingest (`reflect --from-sessions`)
- Read-side relevance eval (`codevira eval`: recall@k / MRR / precision)
- Learned weight tuning (`tune-weights`); opt-in via `CODEVIRA_LEARNED_WEIGHTS`
- Managed memory files (`CLAUDE.md` / `GEMINI.md` / `.cursor`)
- Synonym widening (`CODEVIRA_SYNONYM_WIDENING=1`)
- `get_signature` JS/JSX accuracy; reconciled the two git-outcome stores
- `ghost_projects` false-positive fix
Release-readiness (this session)
- … (32e893c).
- … `CODEVIRA_DECISION_LOCK_MODE` lingered in the shell (cec60ec).
- … `2026-06-19` (651d596).
Validation
`make release-gauntlet` + build + twine, all green on a clean env:
G1 unit 2778 · G1.5 MCP round-trip · G1.6 help-text · G1.7 sandboxed-parent · G2 e2e 70 · G2.5 cold-install smoke · G3 50-tool MCP handshake · G4 no crashes.
Wheel + sdist build; `twine check` passes.
🤖 Generated with Claude Code