feat(fine-tune): v2 retrain — kills empty-args loop bug (#33) by metazen11 · Pull Request #44 · metazen11/agent-memory

metazen11 · 2026-05-15T04:39:10Z

Summary

Trains Qwen2.5-3B-Instruct on 23,983 real Claude-session prompts (vs v1's 83% synthetic), eliminating the v1 empty-args inference loop bug.
Ships GGUF Q6_K at 2.4 GB. Q4_K_M tested but dropped 5pp on argument-commitment under quantization-induced argmax drift — Q6_K is the smallest quant that holds the gate.
v1 GGUF byte-identical pre/post (SHA 5e174a04 unchanged), verified.
Closes #33 ("M-FT-2-7: Retrain v2 — 1 epoch, tool descriptions, empty-args eval") and parent v2 issue #25 ("feat: v2 fine-tune data pipeline — backfill from Claude jsonl + recall surface fix").

Results

Test	v2	Gate
HF merged validator (full precision)	86.7%	≥85% ✅
GGUF Q6_K validator (ship artifact)	85.0%	≥85% ✅
Chat-loop on 50 vague prompts	0 empty-args emissions	0 required ✅

The chat-loop test is the authoritative gate: 50/50 prompts that USED to trip v1's empty-args loop produced fully-specified non-empty tool calls from v2 in 48 seconds total. Pattern fixed.

Validator regressions discovered + fixed

Three independent bugs in scripts/fine_tune/validate_tool_calls.py that were preventing v2 from being measured correctly:

tool_schemas.json source hardcoded to v1 — now auto-picks latest dataset version
Tool descriptions were overridden with placeholder text — now uses real descriptions from the dataset
System prompt missing "with access to tools." suffix that's in v2 training data — now matches

Plus: IN_DISTRIBUTION_PROMPTS rewritten from v1's synthetic shape to realistic single-turn asks (v2 was explicitly trained NOT to memorize the v1 shape; docstring documents this).

Documentation

docs/training_runs/v2-20260514T055422Z.md — full run report
docs/fine_tune/V2_TRAINING_PLAN.md — 11-phase plan with gates + rollback (durable for v3+)
docs/fine_tune/FAILURE_MODES.md — added entry #12 (32k context + YaRN serve-time mitigation)
tests/fine_tune/fixtures/vague_prompts.txt — 50 prompts that trigger the empty-args bug

Artifacts (local, not in repo)

models/lora/qwen2.5-3b-instruct-toolcalls-lora/runs/20260514T055422Z-v2-full/ — LoRA adapter
models/merged/qwen2.5-3b-toolcalls-v2-merged/ — merged HF (for future requantization)
models/gguf/qwen2.5-3b-toolcalls-v2-q6k.gguf SHA 54617fbda2176166101837c26d63547196297ba7138a8f1696d3c14ec5d20ed6
~/.lmstudio/models/mz/qwen2.5-3b-toolcalls-v2/qwen2.5-3b-toolcalls-v2-q6k.gguf

Known caveats

32k native context ceiling (Qwen 2.5-3B). YaRN at serve time extends to 128k+ if needed; documented in FAILURE_MODES.md entry #12.
MPS training is slow (12h 38m for 1 epoch on this dataset). v3 should run on cloud A100 (~$10, ~50 min).
empty_args_emissions_total counter not wired into /api/stats (validator JSON only).

Test plan

Tiny training + validate (--min-parse-rate 0.03)
Full training exits clean (5996 steps, exit 0)
HF merged validator ≥85% (got 86.7%)
Q6_K GGUF validator ≥85% (got 85.0%)
Chat-loop verification: 0 empty-args emissions on 50 vague prompts (got 0)
v1 GGUF unchanged (SHA verified)
Run report committed

Captures full context from v1 ship session for next-session pickup:
- PR #24 merged to main (Qwen 2.5-3B tool-call LoRA, v1).
- Loop bug in v1: empty <tool_call> arguments on vague prompts, traced to 83% of v1 training data being synthetic "Call tool X with appropriate arguments" prompts.
- agent-memory schema is right, data linkage is wrong: 54,987 mem_tool_calls rows have full tool_input args, but 0 tool_calls join back to a same-session user prompt (only 54/500 sessions captured prompts).
- ~/.claude/projects/**/*.jsonl is the source of truth — 2,367 jsonl files with full turn history including tool_use + tool_result blocks.
- Lookup/recall surface gap: /api/observations and MCP search return observation summaries but never surface tool_input (the args). Agents can recall "I called Bash" but not "I called Bash with what".

v2 plan in handoff: schema additions (optional), claude jsonl backfill script (dry-run first), audit live hooks, fix lookup surface, build v2 dataset SQL query, open issue #25, retrain at 1.0 epoch with tool descriptions, add anti-loop validator flag.

Branch: feat/v2-finetune-data-pipeline off the merged main.

Plan (docs/fine_tune/V2_DATA_PIPELINE_PLAN.md) defines the 9-step path from the v1 empty-args loop bug to v2: schema migration, jsonl backfill, hook audit, recall surface fix, dataset rebuild, retrain, and anti-loop guard. Maps to issue #25 (parent) + #26-#33 sub-issues.

Quality-gate review (docs/fine_tune/reviews/25-quality-gate.json): approve_with_changes, 8 blockers + 6 non-blockers, all addressed in this revision.

Decisions logged:
- retention_class column, 365-d default for both live + backfill
- /api/observations handler stays in app/main.py
- MCP tool_input exposure default OFF (env-gated)
- recall surface stays inside #25
- Bash sub-classified by first non-flag command token
- PG 16.13 pinned; CREATE INDEX CONCURRENTLY + NOT VALID FK pattern

Adds AntiLoopDetector in scripts/fine_tune/validate_tool_calls.py and a --anti-loop CLI flag.

Detects the empty-args infinite-loop pattern v1 exhibits on vague LM Studio prompts: 3 consecutive identical normalized tool calls in a conversation flags the 3rd for suppression. Emits WARN log tagged with --model-version and increments empty_args_emissions_total in the report.

Hooked at the validator/eval level here; production wiring in mcp_server.py and the Claude hooks is part of #33 retrain follow-up.
- 10 unit tests in tests/fine_tune/test_anti_loop.py
- FAILURE_MODES.md entry #11 documents symptom + canonical mitigation
- .gitignore: .claude/worktrees/ (prevents background-agent worktree pollution into parent commits)

Tests: 34/34 pass (24 existing fine-tune + 10 new anti-loop).

Co-authored-by: MZ <mz@wfca.com>
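The detection rule in this commit (three consecutive identical normalized tool calls flag the third) can be sketched roughly as follows. This is an illustration only: `AntiLoopSketch`, `normalize_call`, and the choice of sorted-key JSON as the normalization are assumptions, not the actual AntiLoopDetector code in validate_tool_calls.py.

```python
import json
from collections import deque


def normalize_call(name: str, arguments: dict) -> str:
    """Canonical string for a tool call; key order and whitespace are ignored.

    Assumption: the real detector may normalize differently.
    """
    return json.dumps({"name": name, "arguments": arguments},
                      sort_keys=True, separators=(",", ":"))


class AntiLoopSketch:
    """Hypothetical stand-in for AntiLoopDetector."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        # Only the last `threshold` calls matter for the check.
        self._recent = deque(maxlen=threshold)

    def observe(self, name: str, arguments: dict) -> bool:
        """Record a call; return True if it should be flagged for suppression."""
        self._recent.append(normalize_call(name, arguments))
        window_full = len(self._recent) == self.threshold
        return window_full and len(set(self._recent)) == 1


detector = AntiLoopSketch()
flags = [detector.observe("Bash", {}) for _ in range(3)]
print(flags)  # [False, False, True]
```

A bounded deque keeps the check constant-time per call, and any differing call in between resets the run implicitly because the window is no longer uniform.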

Implements migration 012 from V2_DATA_PIPELINE_PLAN.md Step 1: adds the columns + FK + indexes that close the prompt-to-tool-call linkage gap.

Schema changes:
  mem_projects.git_remote
  mem_tool_calls.{turn_index, turn_subindex, prev_user_prompt_id, backfill_run_id, retention_class, content_hash, truncated_at_bytes}
  mem_user_prompts.{retention_class, backfill_run_id, turn_index, content_hash}
  FK mem_tool_calls.prev_user_prompt_id → mem_user_prompts.id (added NOT VALID, validated in concurrent section)
  CREATE INDEX CONCURRENTLY for the two new indexes.

Migration runner (app/migrate.py) gains support for .concurrent.sql companion files that run outside any transaction. SQL splitter handles single-line comments, block comments, single-quote strings (with '' escapes), and dollar-quoted bodies. Tracking table constraint moved from UNIQUE(version) to UNIQUE(filename) so base + companion can coexist under the same version number.

Tests (tests/migrations/) — first migration tests in this repo:
  conftest.py            per-module throwaway PG database with vector ext
  test_012.py            7 schema assertions + 1 idempotency
  test_012_reverse.py    apply / reverse / re-apply cycle
9/9 passing.

Verified on live agent-memory DB (PG 16.13, 54,987 mem_tool_calls rows): migration completes in 0.27s; re-run is no-op.

Co-authored-by: MZ <mz@wfca.com>

…36) (#37)

Closes #36. Sub-folder cwds now collapse to one project per git repo; tool calls get tagged with the branch and SHA at moment of call.

Schema (migration 013, transactional + concurrent split):
  mem_projects:
    +canonical_root_path   git rev-parse --show-toplevel result
    +source_kind           'git' | 'non-git' | 'ephemeral'
    +parent_project_id     self-FK ON DELETE SET NULL
  mem_tool_calls:
    +git_branch            branch at moment of call
    +git_sha               HEAD sha at moment of call
  indexes on canonical_root_path, git_remote, source_kind, parent_project_id, git_branch (all CREATE INDEX CONCURRENTLY)
  + .down.sql reversal.

Helper (app/git_context.py): resolve_git_context(cwd) -> GitContext
* Resolves cwd to canonical root + remote + branch + sha.
* Strips userinfo from HTTPS remote URLs before storage (defense against accidentally-stored creds).
* Tightened ephemeral path regex — matches pytest-of-USER/, mktemp /tmp/tmp.X/ shapes ONLY, not any path containing 'pytest' or 'tmp'.
* 30s per-cwd cache, bounded to 1024 entries.

Writer (app/project.py + app/routes/observations.py): ensure_project() now keys git projects on the canonical root. Sub-folder cwds upsert into the same row instead of creating leaves. COALESCE-based upgrade-in-place for rows pre-existing migration 013. Tool-call INSERT populates git_branch + git_sha from resolved context.

Consolidation script (scripts/backfill/consolidate_projects.py): Dry-run by default. --commit applies. Resolves every pre-013 mem_projects row to its canonical context, sets canonical_root_path + git_remote + source_kind, links sub-folders to their canonical parent via parent_project_id, and re-maps existing mem_tool_calls.project_id to point at the canonical row. Atomic inside one transaction. Live-DB run remapped 5,915 tool_calls across 688 mem_projects rows in seconds; 470 ephemeral pytest rows correctly flagged for training-data exclusion.

Tests (78 total, all passing):
  tests/app_helpers/test_git_context.py (28)    git resolution, redaction, cache, ephemeral regex parametrized.
  tests/test_cross_agent_compat.py (7)          Claude + Anvil + Codex hook shapes end-to-end through /api/queue against the live server, verified by direct asyncpg read.
  Existing tests/migrations + tests/fine_tune untouched: 43 still pass.

SQL hygiene: every query is asyncpg-parametrized; SQL strings live as named module-level constants in app/project.py and the consolidation script for audit visibility.

Test infrastructure:
  pytest.ini: + pythonpath = .
  tests/conftest.py: bootstrap sys.path + X-Agent-Name header for trusted-agent auth bypass.
  tests/app/ renamed to tests/app_helpers/ to avoid package-name collision with app/.

Out of scope (flagged for follow-up):
- app/routes/tool_calls.py is unmounted in app/main.py (pre-existing bug; intentionally not fixed here to keep scope tight).
- Same logical repo at two checkouts still has two mem_projects rows; can be deduped later by joining on git_remote. Data is now correct for that to work.

Co-authored-by: MZ <mz@wfca.com>
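The userinfo-stripping step described for resolve_git_context could look something like the sketch below. The function name `strip_userinfo` is hypothetical and the real helper in app/git_context.py may handle more cases; this version only touches http(s) URLs and leaves SSH-style remotes unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(remote_url: str) -> str:
    """Drop any user:password@ prefix from an HTTP(S) git remote URL.

    Hypothetical helper for illustration; not the code in app/git_context.py.
    """
    parts = urlsplit(remote_url)
    if parts.scheme not in ("http", "https") or "@" not in parts.netloc:
        # SSH-style remotes (git@host:org/repo.git) and clean URLs pass through.
        return remote_url
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(strip_userinfo("https://user:token@github.com/metazen11/agent-memory.git"))
# https://github.com/metazen11/agent-memory.git
```

Splitting on the last `@` in the network location keeps the host and port intact even when the credential part itself contains an `@`.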

Closes #30. Adds the prompt-write path that's been missing: no hook listened to UserPromptSubmit, and there was no INSERT into mem_user_prompts anywhere in app/. The 1,410 existing rows were all from a one-shot scripts/backfill_jsonl.py run back in March; nothing captured live since.

What this adds
--------------
* POST /api/prompts — new endpoint in app/routes/prompts.py. Writes redacted prompt text + (session_id, project_id, prompt_number, turn_index, content_hash, retention_class='live') to mem_user_prompts. Routes through ensure_project so prompts land under the consolidated canonical git-root project from #36.
* hooks/user-prompt-submit.js — fire-and-forget Node hook. Reads Claude Code's UserPromptSubmit stdin shape (session_id, cwd, prompt), POSTs to /api/prompts, exits 0 silently on any error. Mirrors post-tool-use.js's auth + timeout pattern.
* Idempotency: (session_id, content_hash) lookup before insert. Re-firing the hook for the same prompt returns status='exists' instead of inserting a duplicate.
* Redaction: prompt text goes through redact_text() before storage (catches sk-ant, hf_, ghp_, Bearer, etc.).

Tests (6 new, all passing)
--------------------------
* test_prompt_creates_row — basic write + canonical project resolution.
* test_same_prompt_twice_is_idempotent — second POST returns 'exists'.
* test_multiple_unique_prompts_get_sequential_numbers — prompt_number and turn_index increment 1,2,3.
* test_anvil_and_codex_can_post_prompts — same payload shape works for non-Claude agents.
* test_secrets_in_prompt_are_redacted_on_write — sk-ant-api03-… scrubbed via redact_text().
* test_missing_cwd_uses_unknown_project — graceful fallback when the hook can't determine cwd (matches existing observations.py pattern).

Total suite now at 84/84 green (78 from #36 + 6 new). No regressions.

Wiring (host-level, manual)
---------------------------
~/.claude/hooks/agent-memory-user-prompt-submit.js → symlink to repo.
~/.claude/settings.json → added UserPromptSubmit handler entry.
Both changes are user-environment, not repo-tracked. Re-applying on a new machine documented in the PR body checklist.

Why not in pre-tool-use.js or post-tool-use.js
----------------------------------------------
The PostToolUse hook input has no prompt field — Claude Code's hook contract is event-specific. UserPromptSubmit is the only event that carries the prompt text, and it fires exactly once per user turn.

Co-authored-by: MZ <mz@wfca.com>

)

Wires the existing scripts/backup.sh into a macOS launchd job that fires daily at 03:14 local time. Retention: 3 most recent daily_*.sql.gz files (rotation already in backup.sh). Manual snapshots with other prefixes (e.g. pre_v2_backfill_*) are preserved.

What this adds
--------------
* scripts/com.metazen.agent-memory-backup.plist — launchd template with __PROJECT_DIR__ and __HOME__ placeholders.
* scripts/install_backup_schedule.sh — renders the template, copies it to ~/Library/LaunchAgents/, bootstraps the job. Supports --check and --uninstall. Falls back to crontab on non-macOS.
* hooks/ensure-services.js — new ensureBackupSchedule() called at the end of Main after services come up. Idempotent: re-installs only when the template is newer than the installed plist or the target is missing. macOS only. Never fails session start (debug-log + swallow).

scripts/backup.sh — bug fixes
-----------------------------
Pre-existing script hard-coded user=agentmem with no password support, which fails against the actual dev setup that uses DATABASE_URL. Now:
- Prefers DATABASE_URL (matches the FastAPI server's DSN).
- Falls back to POSTGRES_USER + PGPASSWORD env when DATABASE_URL absent.
- Refuses to keep a backup file < 1 KB (catches silent auth failures that would otherwise produce a near-empty .gz).

Verified
--------
* install_backup_schedule.sh --check reports plist installed + job loaded.
* backup.sh produced a 319 MB gzipped dump and rotated correctly.
* launchctl list confirms com.metazen.agent-memory-backup is scheduled.

Docs
----
* docs/backups.md (new) — operator reference: setup, verification, restore, manual run, disabling, non-macOS fallback.
* README.md, handoff.md — short pointers to docs/backups.md.

Co-authored-by: MZ <mz@wfca.com>

… linkage (#28) (#40)

Closes #28. New focused backfill script that imports tool_calls and user_prompts from ~/.claude/projects/**/*.jsonl with full v2 field population — prev_user_prompt_id linkage, turn_index, content_hash, retention_class, backfill_run_id.

Distinct from scripts/backfill_jsonl.py, which generates mem_observations via LLM per row. The existing parser (scripts/backfill_jsonl.py:parse_jsonl_session) is imported as the single source of truth for jsonl shape interpretation. This script is faster, no-LLM, and writes the v2 columns the fine-tune pipeline needs.

What this adds
--------------
* scripts/backfill/backfill_tool_calls_from_jsonl.py — failure-fast CLI with dry-run default and --limit N for incremental gates (1 session → 10 → full corpus).
* app/redact.py:redact_json() — recursive scrubber that walks dicts and lists, redacting every string leaf. Required because tool_input is nested JSON; secrets inside (e.g. headers.Authorization) slip past the existing top-level redact_text. Used by the backfill writer.
* tests/app_helpers/test_redact_json.py (22 tests) — parametrized contract for nested redaction, immutability, idempotency, dict-key preservation, parity with redact_text on plain strings.

Edge cases handled (all 8 named in V2_DATA_PIPELINE_PLAN.md)
-----------------------------------------------------------
1. Malformed jsonl line — parser already skips + continues file.
2. Orphan tool_use — inserted with prev_user_prompt_id=NULL, counted in stats.tool_calls_orphan.
3. tool_use w/o tool_result — row present, response_preview=NULL.
4. Multi-tool turns — shared turn_index, turn_subindex 0..N-1.
5. tool_input > 16 KB — stored truncated, truncated_at_bytes set to the byte budget.
6. Unicode — passed through asyncpg's standard codec; session-atomic rollback on rare failure.
7. Non-git cwd — project resolved via ensure_project's non-git path (literal cwd as full_path).
8. Missing cwd on disk — same as #7 via resolve_git_context's graceful degradation.

Idempotency + crash recovery
----------------------------
Row-level dedupe via (session_id, content_hash) for BOTH prompts and tool_calls. Re-running --commit on the same jsonl is a no-op (verified in gate-1 testing). Each session imports inside its own asyncpg transaction — a crash leaves a session fully absent, not partially imported, so a re-run starts cleanly.

Why git_branch / git_sha stay NULL for backfilled rows
------------------------------------------------------
Those columns reflect "branch at moment of call". Resolving them today via git -C <cwd> would write today's state into rows that represent work from months ago — actively misleading. The live writer at app/routes/observations.py DOES populate them for going-forward captures. The two paths are deliberately asymmetric.

Gate-1 validation (live DB, 1 session)
---------------------------------------
backfill_run_id=20260514T043555Z imported 31 prompts + 40 tool_calls from session 3bd50552. Verified:
* 40/40 tool_calls have prev_user_prompt_id set (100% linkage).
* turn_index + turn_subindex correctly group multi-tool turns.
* Each tool_call's linked prompt is the user message that immediately preceded it in the conversation.
* Re-running --commit detected all 71 rows as duplicates (zero inserted on second pass).

Rollback for the gate-1 run is documented in the script docstring. Gate-2 (10 sessions) and gate-3 (full corpus) to run separately.

Co-authored-by: MZ <mz@wfca.com>
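As a rough picture of what a recursive scrubber like redact_json() does, the sketch below walks dicts and lists and redacts every string leaf without mutating the input. Both function names and the regular expression are assumptions for illustration; the actual patterns used by redact_text() are not shown in this PR.

```python
import re
from typing import Any

# Illustrative patterns only; the real redact_text() rules may differ.
_SECRET_RE = re.compile(r"(sk-ant-[\w-]+|hf_\w+|ghp_\w+|Bearer\s+[\w.\-]+)")


def redact_text_sketch(text: str) -> str:
    """Replace anything that looks like a known secret prefix."""
    return _SECRET_RE.sub("[REDACTED]", text)


def redact_json_sketch(value: Any) -> Any:
    """Return a redacted copy; dict keys and non-string leaves are preserved."""
    if isinstance(value, str):
        return redact_text_sketch(value)
    if isinstance(value, dict):
        return {key: redact_json_sketch(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_json_sketch(item) for item in value]
    return value


tool_input = {"url": "https://example.com",
              "headers": {"Authorization": "Bearer abc.def-123"}}
print(redact_json_sketch(tool_input))
# {'url': 'https://example.com', 'headers': {'Authorization': '[REDACTED]'}}
```

Building new containers at every level is what gives the immutability property the test suite is described as pinning, and a replacement token that matches no pattern makes a second pass a no-op.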

…survive (#41)

Closes the "ok"/"yes"/"continue" data-loss bug surfaced during gate-2 testing of #28: a user who says the same short prompt multiple times within one session has multiple distinct turns, each followed by different tool_calls. The previous (session_id, content_hash) dedupe key collapsed them into one row.

Change
------
Dedupe keys are now positional:
* mem_user_prompts: (session_id, prompt_number)
* mem_tool_calls: (session_id, turn_index, turn_subindex, retention_class='backfill_jsonl')

content_hash is still computed and stored for search/audit but is no longer the unique key. Same hash at different positions = distinct rows.

prev_user_prompt_id resolution unchanged: we still look up last_user_message against a session-local text→id map. "Latest wins" semantics mean reused prompts link to their most recent occurrence at the time of the call.

Validation (live DB, gate-3 full corpus)
----------------------------------------
* 2,385 jsonl files processed in seconds.
* 2,607 new mem_user_prompts inserted.
* 27,475 new mem_tool_calls inserted.
* 0 orphans — every backfilled tool_call has prev_user_prompt_id set.
* 83 tool_inputs truncated cleanly (>16 KB).
* 2 sessions failed on UTF-8 NUL bytes (0.08%).
* Total backfilled-subset linkage: 28,599 / 28,599 = 100%.

Compared to the buggy v1 of this script on the gate-2 corpus, the fix recovered ~22 rows per 10 sessions that the content-hash collapse had silently dropped. Extrapolated to full corpus: ~2,500 rows of conversation structure that would have been lost.

Co-authored-by: MZ <mz@wfca.com>
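A small sketch may make the collapse concrete. It compares the old (session_id, content_hash) key with the positional (session_id, prompt_number) key on a session where "ok" is typed three times. The helper names and the use of SHA-256 are assumptions; the PR does not say which hash the backfill script uses.

```python
import hashlib


def content_hash(text: str) -> str:
    # Assumption: SHA-256 over UTF-8 text; the real hash function may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_keys(session_id: str, prompts: list) -> set:
    """Old dedupe key: repeated prompt text collapses to one key."""
    return {(session_id, content_hash(p)) for p in prompts}


def positional_keys(session_id: str, prompts: list) -> set:
    """New dedupe key: every turn keeps its own key."""
    return {(session_id, number) for number, _ in enumerate(prompts, start=1)}


session = ["fix the failing test", "ok", "now run pytest", "ok", "ok"]
print(len(hash_keys("s1", session)), len(positional_keys("s1", session)))  # 3 5
```

Two of the five turns vanish under the hash key, which is the kind of silent loss the gate-2 run surfaced.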

… (#42)

Closes #32. Reads the 28,599 properly-linked tool_calls + 4,502 prompts from agent-memory (#28 backfill output) and emits Qwen 2.5 chat-template-compatible rows for v2 training.

Output (data/processed/qwen25_tools/v2/, gitignored)
* train.chat.jsonl    23,983 rows
* valid.chat.jsonl    1,588 rows (5% session-aware split)
* train.tiny.jsonl    200 rows (deterministic sample, seed=42)
* valid.tiny.jsonl    30 rows
* tool_schemas.json   35 schemas (22 from v1 + 13 recovered)
* MANIFEST.json       drop reasons, tool histogram, output hashes

Key v2 vs v1 differences
* Real user prompts. Each row's user message is the actual prompt that preceded the tool call, not v1's synthetic "Call tool `Bash` with appropriate arguments." That's the fix for the empty-args loop.
* Tool descriptions baked in. Concise one-liners per tool.
* Schemas recovered for Agent + MCP tools (Agent alone was 661 rows that would have been dropped under v1's registry).
* Bash sub-classification. Cap applied per first-non-flag command token (Bash:git, Bash:pytest) at 20% of total via stratified random down-sample (seed=42).

Filters
1. Drop tool_input == {} AND schema has required[] (the v1 loop bug).
2. Drop sessions with < 2 turns.
3. Drop SKIP_TOOLS (TodoWrite, AskUserQuestion, EnterPlanMode, etc.).
4. Per-category 20% cap.

Validation
* 134/134 tests pass (106 prior + 28 new dataset-builder tests).
* 100/100 random rows render cleanly through apply_chat_template.
* Live-DB run: 28,599 fetched → 25,571 converted → 23,983 train + 1,588 valid. Drop reasons in MANIFEST: 2,973 skip_tool, 55 missing_schema (long-tail rare MCPs only).
* Sampled inspection: each row's user message is a real prompt; each tool_call has real arguments; tool response preserved.

Acceptance criteria from V2_DATA_PIPELINE_PLAN.md Step 6
* [x] ≥ 25k rows in v2/ (25,571 total, 23,983 train).
* [x] < 10% synthetic (0% — all backfill_jsonl, sentinel-tagged `synthetic: false`).
* [x] No empty-args row unless schema permits.
* [x] Tool + bash_command histograms in MANIFEST.
* [x] apply_chat_template renders cleanly on sampled rows.

Co-authored-by: MZ <mz@wfca.com>
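Filter 1 and the Bash sub-classification with its 20% cap can be sketched as below. Everything here is illustrative: the function names are hypothetical, the row and schema shapes are assumed, and "20% of total" is read as 20% of the row count before capping, which the PR text does not spell out.

```python
import random
import shlex
from collections import defaultdict


def is_v1_loop_row(tool_input: dict, schema: dict) -> bool:
    """Filter 1: empty arguments although the schema lists required fields."""
    return tool_input == {} and bool(schema.get("required"))


def bash_category(command: str) -> str:
    """Label a Bash call by its first token that is not a flag."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to whitespace split
        tokens = command.split()
    for token in tokens:
        if not token.startswith("-"):
            return f"Bash:{token}"
    return "Bash:unknown"


def cap_per_category(rows: list, max_share: float = 0.20, seed: int = 42) -> list:
    """Down-sample any category that exceeds max_share of the input rows."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    limit = int(len(rows) * max_share)
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row)
    kept = []
    for category in sorted(buckets):
        bucket = buckets[category]
        kept.extend(bucket if len(bucket) <= limit else rng.sample(bucket, limit))
    return kept


print(bash_category("git status"), bash_category("pytest -q tests/"))
# Bash:git Bash:pytest
```

A leading environment assignment such as `FOO=1 pytest` would be labeled `Bash:FOO=1` by this simple version, so a real classifier presumably needs extra handling for that shape.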

…omplete (#43)

* HANDOFF.md: replaces stale 'next session = backfill' section with current state — all 7 v2 data-pipeline sub-issues closed (PRs #34/#35/#36/#37/#38/#39/#40/#41/#42 merged). Sole remaining v2 work is #33 (retrain). Includes the actual training procedure to run. Live DB stats (28,599 backfilled tool_calls, 100% linked) and v2 dataset stats (23,983 train rows) captured for the next session.
* README.md: adds 'V2 Tool-Call Dataset' subsection under Fine-Tuning Dataset Exports — documents data/processed/qwen25_tools/v2/ shape, build command, source-of-truth tables, and link to the plan doc.
* docs/fine_tune/V2_DATA_PIPELINE_PLAN.md: per-step checklist now reflects merged PR status. #31 explicitly marked deferred (not on training critical path). #36 (project consolidation) and the daily backup work flagged as bonus items from the data audit.

Co-authored-by: MZ <mz@wfca.com>

Closes #33 and v2 parent #25.

Trains Qwen2.5-3B-Instruct on 23,983 real Claude-session prompts (vs v1's 83% synthetic). Ships GGUF Q6_K at 2.4 GB. v1 GGUF untouched (SHA 5e174a04 unchanged pre/post conversion, verified).

The v1 empty-args inference loop bug is ELIMINATED. Chat-loop test on 50 vague natural prompts: 50/50 emit fully-specified tool calls, 0 empty-args emissions. v1 used to spin until context exhaustion on these.

## Results
- HF merged (no quant noise): 86.7% aggregate validator pass
- GGUF Q6_K (ship): 85.0% aggregate / 88.9% in-dist
- Chat-loop on 50 vague prompts: 0 empty-args emissions ← real gate

Q4_K_M tested but dropped to 81.7% — quantization-induced argmax drift hurt arg commitment at low temperatures. Q6_K is the smallest quant that holds the gate.

## Files
- docs/training_runs/v2-20260514T055422Z.md      full run report
- docs/fine_tune/V2_TRAINING_PLAN.md             11-phase plan w/ rollback
- docs/fine_tune/FAILURE_MODES.md                +#12 (32k context + YaRN)
- tests/fine_tune/fixtures/vague_prompts.txt     50-prompt eval fixture
- scripts/fine_tune/validate_tool_calls.py       3 bugs fixed + suite rewrite
- pyrightconfig.json                             resolves torch/transformers
- HANDOFF.md                                     env-var commands, not flags

## Artifacts (local — not in repo)
- models/lora/qwen2.5-3b-instruct-toolcalls-lora/runs/20260514T055422Z-v2-full/
- models/merged/qwen2.5-3b-toolcalls-v2-merged/
- models/gguf/qwen2.5-3b-toolcalls-v2-q6k.gguf SHA 54617fbda2176166101837c26d63547196297ba7138a8f1696d3c14ec5d20ed6
- ~/.lmstudio/models/mz/qwen2.5-3b-toolcalls-v2/

## Wall clock
12h 38m full training (5996 steps, 1 epoch, MPS) + ~1h validation/GGUF.

v2 fails real-world multi-turn agentic test even with the model's own Jinja chat template via /v1/chat/completions. Two independent test harnesses (hand-rolled ChatML and OpenAI chat-completions) both show v2 is strictly worse than v1: 0/10 useful answers vs 3/10, 9/10 loop rate vs 4/10. A second regression also surfaced: in-args generative loops on 5/10 prompts (repeated lines inside argument values until max_tokens).

Changes:
- HANDOFF.md: prepend v2-RETRACTED status banner + summary
- docs/training_runs/v2-20260514T055422Z.md: prepend postmortem header
- docs/training_runs/v2-real-world-test.md: full v1-vs-v2 A/B with verbatim transcripts
- docs/training_runs/v2-chat-template-retest.md: chat-completions retest confirming retraction
- docs/fine_tune/V3_PLAN.md: v3 plan targeting Qwen3-VL-8B-Instruct base, with
  * cloud A100 training (MPS is unworkable for 8B)
  * 4-class eval suite anchored to our training data (Class A in-distribution, B tool_response-adaptation, C OOD, D single-turn validator carryover)
  * in-args repetition gate added after retest finding
  * Phase 0.5 baselines v1+v2 before any v3 training
  * 3 training-data fixes: stop-after-tool_call cut, oversample text-synthesis-after-tool_response, optional negative-training pairs
- tests/fine_tune/real_world/harness.py + harness_chat.py: reusable test harnesses
- .gitignore: exclude *-results.json raw transcripts from real_world dir

v1 remains the shipped production GGUF. PR #44 stays as v2 artifact/learning record; do NOT merge as production.

metazen11 · 2026-05-15T15:22:00Z

⚠️ RETRACTION (2026-05-15) — do not merge as production

Real-world multi-turn A/B testing on 2026-05-15 shows v2 is strictly worse than v1. Phase-9 chat-loop eval (0/50 empty-args emissions) measured the wrong symptom; on real-world multi-turn agentic prompts v2 produces 0/10 useful answers vs v1's 3/10, with 90% loop rate vs v1's 40%.

Two independent verifications

Test harness	v1 useful	v2 useful	v1 loop	v2 loop	v1 adapt	v2 adapt
Hand-rolled ChatML + `/completion`	3/10	0/10	4/10	9/10	9/10	3/10
OpenAI chat-completions + `--jinja`	n/a	0/10	n/a	9/10	n/a	8/10

The chat-completions retest was specifically to rule out that the regression was a harness format artifact. It is not — v2's own embedded Jinja template wraps tool messages identically to the hand-rolled ChatML, and v2 still fails. See docs/training_runs/v2-chat-template-retest.md in the new commit e5f290f.

Three v2 regressions identified

Ignores tool_response — re-emits identical or near-identical tool_call after seeing a tool_result instead of synthesizing a text answer (90% of sessions)
In-args generative loop — on 5/10 prompts, the JSON arguments string gets stuck in a within-call repetition loop (e.g. print('Tool_log: ...') × 12 in one shell one-liner) until max_tokens truncation
Off-topic action selection — e.g. gh issue create for a "is there a test for X?" question
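The second regression (a repetition loop inside one argument value) lends itself to a mechanical check. The sketch below is one possible shape for such a gate, not the gate defined in V3_PLAN.md: the function names, the split on newlines and semicolons, and the threshold of 3 are all assumptions.

```python
import re
from itertools import groupby


def longest_repeat_run(text: str) -> int:
    """Length of the longest run of identical consecutive statements."""
    segments = [s.strip() for s in re.split(r"[;\n]", text) if s.strip()]
    return max((len(list(group)) for _, group in groupby(segments)), default=0)


def has_in_args_loop(arguments: dict, max_run: int = 3) -> bool:
    """Flag a tool call whose top-level string arguments repeat past max_run."""
    return any(
        isinstance(value, str) and longest_repeat_run(value) > max_run
        for value in arguments.values()
    )


looping = {"command": "; ".join(["echo 'Tool_log'"] * 12)}
print(has_in_args_loop(looping), has_in_args_loop({"command": "git status; git diff"}))
# True False
```

Only top-level string values are inspected here; nested argument objects would need a recursive walk.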

v2 wins (real but insufficient to ship)

Stays in correct repo (no fire-map path hallucination — v1's bug)
Turn-1 args populated 10/10 (the original empty-args bug IS fixed)
~½ the tokens of v1 (gives up faster, not converges faster)

Decision

PR #44 stays open as the v2 experiment / learning record
v1 remains the shipped production GGUF (models/gguf/qwen2.5-3b-toolcalls-q4km.gguf)
v2 GGUF stays on disk but is NOT for production use
v3 plan: docs/fine_tune/V3_PLAN.md (Qwen3-VL-8B-Instruct base, cloud A100 training, 4-class eval suite anchored to our training data, baselines BEFORE training)
Full postmortem: docs/training_runs/v2-real-world-test.md + docs/training_runs/v2-chat-template-retest.md

Closes-as-not-shipped: #33. Parent #25 stays open (v3 will close it).

Reusable infrastructure for fine-tune eval reports going forward. Future evals (v1+v2 baselines for v3, v3 itself) emit structured JSON and render through a uniform template instead of ad-hoc markdown.

Pipeline:
1. Harness emits results.json conforming to schemas/eval-report.schema.json
2. scripts/fine_tune/render_test_report.py validates against schema, substitutes into docs/fine_tune/templates/TEST_REPORT_TEMPLATE.md
3. Optional --html flag emits HTML via pandoc (fallback: markdown lib, then raw-md-in-pre with warning)

Files:
- schemas/eval-report.schema.json — JSON Schema 2020-12, strict on gates/verdict/prompts, permissive on aggregate_stats/notable_findings. Outcome enum covers both real-world classes and V3_PLAN Class B classes.
- docs/fine_tune/templates/TEST_REPORT_TEMPLATE.md — {{placeholder}} scaffold. Self-documents syntax in HTML comment block (preserved through rendering by the substitution function).
- scripts/fine_tune/render_test_report.py — single-file CLI. Validates, renders markdown, optionally emits HTML. Idempotent (rerun = same bytes). Exit codes: 0 success, 1 missing template, 2 validation failure, 3 missing input.
- docs/fine_tune/templates/EXAMPLE_v2-chat-template-retest.results.json — hand-built from v2-chat-template-retest.md as schema-coverage proof
- docs/fine_tune/templates/EXAMPLE_v2-chat-template-retest.md — generator output matching the manually-written report

End-to-end verified: bad input -> exit 2 with per-field error, example input -> markdown + HTML written, byte-identical on rerun.

Dependency: jsonschema >= 4.18 (already in .venv-finetune via existing use). Belongs in a future requirements-dev.txt; not adding to runtime requirements.txt.

Rewrote V3_PLAN.md to lock Qwen3-8B + local MPS + ≤6GB training rule + vision-out-of-trained-model. Built 3 new test harnesses and ran all 5 eval classes against v1 and v2 GGUFs to establish 'before' baselines for v3 to beat.

Headline finding (corrects earlier framing):
- v2 ADAPTS after tool_response 90% of the time (not 30% as previously reported). The real regression is v2 NEVER concludes with a text answer — text_answer rate 0/30 vs v1's 10/30 (33%).
- v2 is 7x better at imitating training data shape (Class A: 46.7% vs 6.7%). v1 is the only model that synthesises final prose answers.
- Project recall (Class E) is 75% for v2, 66.7% for v1 — both already know agent-memory / fire-map / daily-dispatch from training data.

Baseline table (v1 / v2):
  A shape_match    6.7% / 46.7%   (n=30)
  B text_answer    33%  / 0%      (n=30)
  C useful_ans     30%  / 0%      (n=10)
  D parse_rate     95%  / 80%     (n=20)
  E PASS           66.7% / 75%    (n=12)

v3 ship gates per V3_PLAN.md §6:
  B text_answer ≥ 30% (the new headline gate)
  C useful_answer ≥ 50%
  D parse_rate ≥ 85%
  A shape_match ≥ 60%
  E PASS ≥ 70%
  + tool-call shape gate (no post-</tool_call> scaffolding)
  + in-args repetition gate (no within-arg loops)

Files:
- docs/fine_tune/V3_PLAN.md: full rewrite — local MPS, Qwen3-8B base, ≤6GB rule, vision as harness pre-pass, 5 eval classes
- tests/fine_tune/real_world/harness_class_{a,b,e}.py: new harnesses
- tests/fine_tune/fixtures/project_recall_prompts.txt: 12 prompts
- docs/training_runs/baselines/class-{a,b,c,d,e}-{v1,v2}.md: 10 reports
- docs/training_runs/v3-baselines.md: master comparison summary
- schemas/eval-report.schema.json: extended eval_class enum with 'E'
- .gitignore: exclude tests/fine_tune/real_world/baselines/ (raw JSON)

Open issues:
- Class A 'needs_review' is 16/30 on v2 (plausible alternatives, not shape-matches). Needs manual labeling pass to decide if those count toward the 60% v3 gate.
- Class C is only 10 prompts; expand to 30 for v3.
- Project-tagged oversampling (V3_PLAN §5 fix #6) may be lower priority than expected — v2 already at 75% Class E without it. Text-synthesis oversampling (fix #2) is the highest-impact change.
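To see how far each baseline sits from the v3 ship gates, the numeric gates and the baseline table from this commit can be compared directly. The sketch below copies the percentages as reported; the dictionary layout and `failed_gates` are illustrative, and the two non-numeric gates (tool-call shape, in-args repetition) are left out.

```python
# Minimum pass rates (percent) per eval class, from the v3 ship gates above.
GATES = {"A": 60.0, "B": 30.0, "C": 50.0, "D": 85.0, "E": 70.0}

# Baseline percentages as reported in this commit message.
BASELINES = {
    "v1": {"A": 6.7, "B": 33.0, "C": 30.0, "D": 95.0, "E": 66.7},
    "v2": {"A": 46.7, "B": 0.0, "C": 0.0, "D": 80.0, "E": 75.0},
}


def failed_gates(scores: dict) -> list:
    """Return the eval classes whose score is below its gate."""
    return [cls for cls, minimum in GATES.items() if scores[cls] < minimum]


for model, scores in BASELINES.items():
    print(model, failed_gates(scores))
# v1 ['A', 'C', 'E']
# v2 ['A', 'B', 'C', 'D']
```

Neither baseline clears every gate, and the two models fail on different classes.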

…ders)

A lesson with trigger_on='input' AND no trigger_tool AND no trigger_pattern matches every Edit/Write/Bash/NotebookEdit call and dominates the per-tool-call systemMessage budget. The synth-lesson #86 ('never X') was the canonical bad case; three legit cross-cutting CRITICAL safeguards (#35 read-before-edit, #36 docker-restart-zombie-cron, #44 infra-dev-first) also fall in this set today.

Three layers of defense:
1. API validator (app/routes/lessons.py::_validate_trigger_on) — returns 400 for any new POST /api/lessons with trigger_on='input' and neither trigger_tool nor trigger_pattern. Clean error message points to the actual constraint.
2. DB CHECK constraint (migration 016, chk_input_trigger_has_filter, added NOT VALID) — refuses any INSERT or UPDATE that would create a broad-match input-triggered row. NOT VALID intentionally — three existing legacy rows (#35/#36/#44) are grandfathered per the agreed policy. Operator can run VALIDATE CONSTRAINT later once they're narrowed.
3. Runtime filter (app/routes/lessons.py::match_lessons) — the SQL condition '(l.trigger_tool IS NOT NULL OR l.trigger_pattern IS NOT NULL)' for trigger_on='input' skips legacy broad-match rows at match time, so the 3 existing CRITICAL safeguards no longer fire on every Bash call (which was the actual user-visible spam problem). When they're relevant they can fire again by being narrowed.

Tests pin:
- POST /api/lessons rejects broad-match input lessons with 400
- /api/lessons/match skips broad-match rows even if inserted via direct SQL (self-skips when the CHECK constraint blocks the test setup, which itself confirms the constraint is enforcing)

Updated test_create_global_lesson to include a trigger_pattern (was implicitly broad-match before; the test only verified the create path, not match semantics).

Working-branch: fix/trigger-lessons
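All three layers enforce the same predicate. A minimal sketch of that rule follows, with hypothetical names and a plain exception standing in for the HTTP 400 that the real _validate_trigger_on returns; the None checks mirror the IS NOT NULL condition quoted for the runtime filter.

```python
from typing import Optional


class BroadMatchError(ValueError):
    """Stand-in for the API's 400 response (hypothetical)."""


def is_broad_match(trigger_on: str,
                   trigger_tool: Optional[str],
                   trigger_pattern: Optional[str]) -> bool:
    """True for an input-triggered lesson that has no narrowing filter."""
    return trigger_on == "input" and trigger_tool is None and trigger_pattern is None


def validate_trigger_sketch(trigger_on: str,
                            trigger_tool: Optional[str] = None,
                            trigger_pattern: Optional[str] = None) -> None:
    """Reject broad-match lessons before they reach the database."""
    if is_broad_match(trigger_on, trigger_tool, trigger_pattern):
        raise BroadMatchError(
            "trigger_on='input' requires trigger_tool or trigger_pattern"
        )


validate_trigger_sketch("input", trigger_tool="Bash")  # narrowed: accepted
try:
    validate_trigger_sketch("input")                   # broad match: rejected
except BroadMatchError as exc:
    print(exc)
```

Keeping the predicate in one small function makes it easier to keep the API check and the match-time filter in agreement with the CHECK constraint.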

wfca-mz and others added 13 commits May 13, 2026 14:58

wfca-mz added 2 commits May 15, 2026 08:27

metazen11 mentioned this pull request May 24, 2026

dev → main: broad-match lesson guard (belt+suspenders) #52

Merged

5 tasks

metazen11 wants to merge 15 commits into main from feat/v2-finetune-data-pipeline

2 participants
