feat(fine-tune): v2 retrain — kills empty-args loop bug (#33)#44
Open
metazen11 wants to merge 15 commits into
Captures full context from v1 ship session for next-session pickup:
- PR #24 merged to main (Qwen 2.5-3B tool-call LoRA, v1).
- Loop bug in v1: empty <tool_call> arguments on vague prompts, traced to 83% of v1 training data being synthetic "Call tool X with appropriate arguments" prompts.
- agent-memory schema is right, data linkage is wrong: 54,987 mem_tool_calls rows have full tool_input args, but 0 tool_calls join back to a same-session user prompt (only 54/500 sessions captured prompts).
- ~/.claude/projects/**/*.jsonl is the source of truth: 2,367 jsonl files with full turn history including tool_use + tool_result blocks.
- Lookup/recall surface gap: /api/observations and MCP search return observation summaries but never surface tool_input (the args). Agents can recall "I called Bash" but not "I called Bash with what".

v2 plan in handoff: schema additions (optional), claude jsonl backfill script (dry-run first), audit live hooks, fix lookup surface, build v2 dataset SQL query, open issue #25, retrain at 1.0 epoch with tool descriptions, add anti-loop validator flag.

Branch: feat/v2-finetune-data-pipeline off the merged main.
Plan (docs/fine_tune/V2_DATA_PIPELINE_PLAN.md) defines the 9-step path from the v1 empty-args loop bug to v2: schema migration, jsonl backfill, hook audit, recall surface fix, dataset rebuild, retrain, and anti-loop guard. Maps to issue #25 (parent) + #26-#33 sub-issues.

Quality-gate review (docs/fine_tune/reviews/25-quality-gate.json): approve_with_changes, 8 blockers + 6 non-blockers, all addressed in this revision.

Decisions logged:
- retention_class column, 365-d default for both live + backfill
- /api/observations handler stays in app/main.py
- MCP tool_input exposure default OFF (env-gated)
- recall surface stays inside #25
- Bash sub-classified by first non-flag command token
- PG 16.13 pinned; CREATE INDEX CONCURRENTLY + NOT VALID FK pattern
Adds AntiLoopDetector in scripts/fine_tune/validate_tool_calls.py and a --anti-loop CLI flag. Detects the empty-args infinite-loop pattern v1 exhibits on vague LM Studio prompts: 3 consecutive identical normalized tool calls in a conversation flags the 3rd for suppression. Emits WARN log tagged with --model-version and increments empty_args_emissions_total in the report.

Hooked at the validator/eval level here; production wiring in mcp_server.py and the Claude hooks is part of #33 retrain follow-up.

- 10 unit tests in tests/fine_tune/test_anti_loop.py
- FAILURE_MODES.md entry #11 documents symptom + canonical mitigation
- .gitignore: .claude/worktrees/ (prevents background-agent worktree pollution into parent commits)

Tests: 34/34 pass (24 existing fine-tune + 10 new anti-loop).

Co-authored-by: MZ <mz@wfca.com>
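The detection rule described above (three consecutive identical normalized tool calls flag the third) can be sketched as follows. This is a minimal illustration, not the AntiLoopDetector from the repo: the class name `AntiLoopSketch`, the `normalize_call` helper, and the choice of sorted-key JSON as the normalization are all assumptions.

```python
import json
from collections import deque


def normalize_call(name, arguments):
    """Hypothetical normalization: tool name plus arguments as sorted-key JSON."""
    return name + ":" + json.dumps(arguments or {}, sort_keys=True)


class AntiLoopSketch:
    """Flags a call when it is the Nth consecutive identical normalized call."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # Only the last `threshold` calls matter, so a bounded deque is enough.
        self.recent = deque(maxlen=threshold)

    def observe(self, name, arguments):
        """Record one tool call; return True if it should be suppressed."""
        self.recent.append(normalize_call(name, arguments))
        return len(self.recent) == self.threshold and len(set(self.recent)) == 1


detector = AntiLoopSketch()
flags = [detector.observe("Bash", {}) for _ in range(3)]
```

With the default threshold only the third repeat is flagged; any differing call in between resets the run because the window no longer holds three identical entries.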
Implements migration 012 from V2_DATA_PIPELINE_PLAN.md Step 1: adds the
columns + FK + indexes that close the prompt-to-tool-call linkage gap.
Schema changes:
  mem_projects.git_remote
  mem_tool_calls.{turn_index, turn_subindex, prev_user_prompt_id,
                  backfill_run_id, retention_class, content_hash,
                  truncated_at_bytes}
  mem_user_prompts.{retention_class, backfill_run_id, turn_index,
                    content_hash}
  FK mem_tool_calls.prev_user_prompt_id → mem_user_prompts.id
     (added NOT VALID, validated in concurrent section)
  CREATE INDEX CONCURRENTLY for the two new indexes.
Migration runner (app/migrate.py) gains support for .concurrent.sql
companion files that run outside any transaction. SQL splitter handles
single-line comments, block comments, single-quote strings (with ''
escapes), and dollar-quoted bodies. Tracking table constraint moved
from UNIQUE(version) to UNIQUE(filename) so base + companion can
coexist under the same version number.
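A statement splitter covering the four cases named above (single-line comments, block comments, single-quoted strings with '' escapes, dollar-quoted bodies) might look roughly like this. It is a sketch under assumptions, not the code in app/migrate.py: the function name `split_sql` is hypothetical, and nested block comments are not handled.

```python
import re

# Matches an opening dollar-quote tag such as $$ or $body$.
_DOLLAR_TAG = re.compile(r"\$(?:[A-Za-z_]\w*)?\$")


def split_sql(script):
    """Split a SQL script on semicolons that sit outside comments and quotes."""
    statements, buf = [], []
    i, n = 0, len(script)
    while i < n:
        ch = script[i]
        two = script[i:i + 2]
        if two == "--":                      # single-line comment
            j = script.find("\n", i)
            j = n if j == -1 else j
            buf.append(script[i:j])
            i = j
        elif two == "/*":                    # block comment (non-nested)
            j = script.find("*/", i + 2)
            j = n if j == -1 else j + 2
            buf.append(script[i:j])
            i = j
        elif ch == "'":                      # string literal, '' is an escape
            j = i + 1
            while j < n:
                if script[j] == "'":
                    if script[j + 1:j + 2] == "'":
                        j += 2
                        continue
                    break
                j += 1
            buf.append(script[i:j + 1])
            i = j + 1
        elif ch == "$":                      # possible dollar-quoted body
            m = _DOLLAR_TAG.match(script, i)
            if m:
                tag = m.group(0)
                end = script.find(tag, m.end())
                end = n if end == -1 else end + len(tag)
                buf.append(script[i:end])
                i = end
            else:
                buf.append(ch)
                i += 1
        elif ch == ";":                      # real statement boundary
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements
```

Splitting matters here because statements such as CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so a runner for the companion files has to send them one at a time rather than as one batch.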
Tests (tests/migrations/), the first migration tests in this repo:
  conftest.py           per-module throwaway PG database with vector ext
  test_012.py           7 schema assertions + 1 idempotency
  test_012_reverse.py   apply / reverse / re-apply cycle
9/9 passing.
Verified on live agent-memory DB (PG 16.13, 54,987 mem_tool_calls rows):
migration completes in 0.27s; re-run is no-op.
Co-authored-by: MZ <mz@wfca.com>
…36) (#37)

Closes #36. Sub-folder cwds now collapse to one project per git repo; tool calls get tagged with the branch and SHA at moment of call.

Schema (migration 013, transactional + concurrent split):
  mem_projects:
    +canonical_root_path   git rev-parse --show-toplevel result
    +source_kind           'git' | 'non-git' | 'ephemeral'
    +parent_project_id     self-FK ON DELETE SET NULL
  mem_tool_calls:
    +git_branch            branch at moment of call
    +git_sha               HEAD sha at moment of call
  indexes on canonical_root_path, git_remote, source_kind, parent_project_id, git_branch (all CREATE INDEX CONCURRENTLY)
  + .down.sql reversal.

Helper (app/git_context.py): resolve_git_context(cwd) -> GitContext
* Resolves cwd to canonical root + remote + branch + sha.
* Strips userinfo from HTTPS remote URLs before storage (defense against accidentally-stored creds).
* Tightened ephemeral path regex: matches pytest-of-USER/, mktemp /tmp/tmp.X/ shapes ONLY, not any path containing 'pytest' or 'tmp'.
* 30s per-cwd cache, bounded to 1024 entries.

Writer (app/project.py + app/routes/observations.py):
ensure_project() now keys git projects on the canonical root. Sub-folder cwds upsert into the same row instead of creating leaves. COALESCE-based upgrade-in-place for rows pre-existing migration 013. Tool-call INSERT populates git_branch + git_sha from resolved context.

Consolidation script (scripts/backfill/consolidate_projects.py):
Dry-run by default. --commit applies. Resolves every pre-013 mem_projects row to its canonical context, sets canonical_root_path + git_remote + source_kind, links sub-folders to their canonical parent via parent_project_id, and re-maps existing mem_tool_calls.project_id to point at the canonical row. Atomic inside one transaction. Live-DB run remapped 5,915 tool_calls across 688 mem_projects rows in seconds; 470 ephemeral pytest rows correctly flagged for training-data exclusion.

Tests (78 total, all passing):
  tests/app_helpers/test_git_context.py (28)   git resolution, redaction, cache, ephemeral regex parametrized.
  tests/test_cross_agent_compat.py (7)         Claude + Anvil + Codex hook shapes end-to-end through /api/queue against the live server, verified by direct asyncpg read.
  Existing tests/migrations + tests/fine_tune untouched: 43 still pass.

SQL hygiene: every query is asyncpg-parametrized; SQL strings live as named module-level constants in app/project.py and the consolidation script for audit visibility.

Test infrastructure:
  pytest.ini: + pythonpath = .
  tests/conftest.py: bootstrap sys.path + X-Agent-Name header for trusted-agent auth bypass.
  tests/app/ renamed to tests/app_helpers/ to avoid package-name collision with app/.

Out of scope (flagged for follow-up):
- app/routes/tool_calls.py is unmounted in app/main.py (pre-existing bug; intentionally not fixed here to keep scope tight).
- Same logical repo at two checkouts still has two mem_projects rows; can be deduped later by joining on git_remote. Data is now correct for that to work.

Co-authored-by: MZ <mz@wfca.com>
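The userinfo-stripping step in the helper above could be done with the standard library alone. The sketch below is an assumption about one reasonable approach, not the code in app/git_context.py; the function name `strip_userinfo` is hypothetical, and SSH-style remotes are deliberately left untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(remote_url):
    """Drop any user:password@ prefix from an HTTP(S) git remote URL."""
    parts = urlsplit(remote_url)
    if parts.scheme not in ("http", "https") or "@" not in parts.netloc:
        # SSH-style remotes (git@host:owner/repo.git) and clean URLs pass through.
        return remote_url
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


cleaned = strip_userinfo("https://user:token@github.com/owner/repo.git")
```

Stripping before storage rather than at read time means a credential embedded in a remote URL never reaches the mem_projects.git_remote column at all.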
Closes #30. Adds the prompt-write path that's been missing: no hook listened to UserPromptSubmit, and there was no INSERT into mem_user_prompts anywhere in app/. The 1,410 existing rows were all from a one-shot scripts/backfill_jsonl.py run back in March; nothing captured live since.

What this adds
--------------
* POST /api/prompts: new endpoint in app/routes/prompts.py. Writes redacted prompt text + (session_id, project_id, prompt_number, turn_index, content_hash, retention_class='live') to mem_user_prompts. Routes through ensure_project so prompts land under the consolidated canonical git-root project from #36.
* hooks/user-prompt-submit.js: fire-and-forget Node hook. Reads Claude Code's UserPromptSubmit stdin shape (session_id, cwd, prompt), POSTs to /api/prompts, exits 0 silently on any error. Mirrors post-tool-use.js's auth + timeout pattern.
* Idempotency: (session_id, content_hash) lookup before insert. Re-firing the hook for the same prompt returns status='exists' instead of inserting a duplicate.
* Redaction: prompt text goes through redact_text() before storage (catches sk-ant, hf_, ghp_, Bearer, etc.).

Tests (6 new, all passing)
--------------------------
* test_prompt_creates_row: basic write + canonical project resolution.
* test_same_prompt_twice_is_idempotent: second POST returns 'exists'.
* test_multiple_unique_prompts_get_sequential_numbers: prompt_number and turn_index increment 1,2,3.
* test_anvil_and_codex_can_post_prompts: same payload shape works for non-Claude agents.
* test_secrets_in_prompt_are_redacted_on_write: sk-ant-api03-… scrubbed via redact_text().
* test_missing_cwd_uses_unknown_project: graceful fallback when the hook can't determine cwd (matches existing observations.py pattern).

Total suite now at 84/84 green (78 from #36 + 6 new). No regressions.

Wiring (host-level, manual)
---------------------------
~/.claude/hooks/agent-memory-user-prompt-submit.js → symlink to repo.
~/.claude/settings.json → added UserPromptSubmit handler entry.
Both changes are user-environment, not repo-tracked. Re-applying on a new machine documented in the PR body checklist.

Why not in pre-tool-use.js or post-tool-use.js
----------------------------------------------
The PostToolUse hook input has no prompt field: Claude Code's hook contract is event-specific. UserPromptSubmit is the only event that carries the prompt text, and it fires exactly once per user turn.

Co-authored-by: MZ <mz@wfca.com>
)

Wires the existing scripts/backup.sh into a macOS launchd job that fires daily at 03:14 local time. Retention: 3 most recent daily_*.sql.gz files (rotation already in backup.sh). Manual snapshots with other prefixes (e.g. pre_v2_backfill_*) are preserved.

What this adds
--------------
* scripts/com.metazen.agent-memory-backup.plist: launchd template with __PROJECT_DIR__ and __HOME__ placeholders.
* scripts/install_backup_schedule.sh: renders the template, copies it to ~/Library/LaunchAgents/, bootstraps the job. Supports --check and --uninstall. Falls back to crontab on non-macOS.
* hooks/ensure-services.js: new ensureBackupSchedule() called at the end of Main after services come up. Idempotent: re-installs only when the template is newer than the installed plist or the target is missing. macOS only. Never fails session start (debug-log + swallow).

scripts/backup.sh bug fixes
---------------------------
Pre-existing script hard-coded user=agentmem with no password support, which fails against the actual dev setup that uses DATABASE_URL. Now:
- Prefers DATABASE_URL (matches the FastAPI server's DSN).
- Falls back to POSTGRES_USER + PGPASSWORD env when DATABASE_URL absent.
- Refuses to keep a backup file < 1 KB (catches silent auth failures that would otherwise produce a near-empty .gz).

Verified
--------
* install_backup_schedule.sh --check reports plist installed + job loaded.
* backup.sh produced a 319 MB gzipped dump and rotated correctly.
* launchctl list confirms com.metazen.agent-memory-backup is scheduled.

Docs
----
* docs/backups.md (new): operator reference covering setup, verification, restore, manual run, disabling, non-macOS fallback.
* README.md, handoff.md: short pointers to docs/backups.md.

Co-authored-by: MZ <mz@wfca.com>
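The retention rule described above (keep the 3 most recent daily_*.sql.gz files, leave other prefixes alone, reject dumps under 1 KB) is implemented in backup.sh; the Python below only illustrates the selection logic. The function names are hypothetical, and it assumes the filenames embed a timestamp that sorts lexicographically.

```python
def files_to_prune(filenames, keep=3, prefix="daily_", suffix=".sql.gz"):
    """Return the daily backups that fall outside the retention window."""
    # Assumption: names like daily_20260514.sql.gz sort oldest-first as strings.
    daily = sorted(f for f in filenames if f.startswith(prefix) and f.endswith(suffix))
    return daily[:-keep] if keep > 0 else daily


def is_plausible_dump(size_bytes, minimum=1024):
    """A dump under 1 KB most likely means authentication failed silently."""
    return size_bytes >= minimum


present = [
    "daily_20260510.sql.gz",
    "daily_20260511.sql.gz",
    "daily_20260512.sql.gz",
    "daily_20260513.sql.gz",
    "pre_v2_backfill_20260501.sql.gz",
]
pruned = files_to_prune(present)
```

Filtering on the prefix before sorting is what keeps manual snapshots such as pre_v2_backfill_* out of the rotation entirely.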
… linkage (#28) (#40)

Closes #28. New focused backfill script that imports tool_calls and user_prompts from ~/.claude/projects/**/*.jsonl with full v2 field population: prev_user_prompt_id linkage, turn_index, content_hash, retention_class, backfill_run_id.

Distinct from scripts/backfill_jsonl.py, which generates mem_observations via LLM per row. The existing parser (scripts/backfill_jsonl.py:parse_jsonl_session) is imported as the single source of truth for jsonl shape interpretation. This script is faster, no-LLM, and writes the v2 columns the fine-tune pipeline needs.

What this adds
--------------
* scripts/backfill/backfill_tool_calls_from_jsonl.py: failure-fast CLI with dry-run default and --limit N for incremental gates (1 session → 10 → full corpus).
* app/redact.py:redact_json(): recursive scrubber that walks dicts and lists, redacting every string leaf. Required because tool_input is nested JSON; secrets inside (e.g. headers.Authorization) slip past the existing top-level redact_text. Used by the backfill writer.
* tests/app_helpers/test_redact_json.py (22 tests): parametrized contract for nested redaction, immutability, idempotency, dict-key preservation, parity with redact_text on plain strings.

Edge cases handled (all 8 named in V2_DATA_PIPELINE_PLAN.md)
------------------------------------------------------------
1. Malformed jsonl line: parser already skips + continues file.
2. Orphan tool_use: inserted with prev_user_prompt_id=NULL, counted in stats.tool_calls_orphan.
3. tool_use w/o tool_result: row present, response_preview=NULL.
4. Multi-tool turns: shared turn_index, turn_subindex 0..N-1.
5. tool_input > 16 KB: stored truncated, truncated_at_bytes set to the byte budget.
6. Unicode: passed through asyncpg's standard codec; session-atomic rollback on rare failure.
7. Non-git cwd: project resolved via ensure_project's non-git path (literal cwd as full_path).
8. Missing cwd on disk: same as #7 via resolve_git_context's graceful degradation.

Idempotency + crash recovery
----------------------------
Row-level dedupe via (session_id, content_hash) for BOTH prompts and tool_calls. Re-running --commit on the same jsonl is a no-op (verified in gate-1 testing). Each session imports inside its own asyncpg transaction: a crash leaves a session fully absent, not partially imported, so a re-run starts cleanly.

Why git_branch / git_sha stay NULL for backfilled rows
------------------------------------------------------
Those columns reflect "branch at moment of call". Resolving them today via git -C <cwd> would write today's state into rows that represent work from months ago, which would be actively misleading. The live writer at app/routes/observations.py DOES populate them for going-forward captures. The two paths are deliberately asymmetric.

Gate-1 validation (live DB, 1 session)
--------------------------------------
backfill_run_id=20260514T043555Z imported 31 prompts + 40 tool_calls from session 3bd50552. Verified:
* 40/40 tool_calls have prev_user_prompt_id set (100% linkage).
* turn_index + turn_subindex correctly group multi-tool turns.
* Each tool_call's linked prompt is the user message that immediately preceded it in the conversation.
* Re-running --commit detected all 71 rows as duplicates (zero inserted on second pass).

Rollback for the gate-1 run is documented in the script docstring. Gate-2 (10 sessions) and gate-3 (full corpus) to run separately.

Co-authored-by: MZ <mz@wfca.com>
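The recursive scrubber described above can be sketched in a few lines. This is not the code in app/redact.py: the `redact_text` below is a stand-in with hypothetical patterns chosen only so the example runs, and the `[REDACTED]` marker is likewise an assumption.

```python
import re

# Hypothetical stand-in patterns; the real redact_text() rules are not shown in the PR.
_SECRET = re.compile(r"sk-ant-[\w-]+|hf_\w+|ghp_\w+|Bearer\s+\S+")


def redact_text(text):
    """Stand-in for the top-level string scrubber."""
    return _SECRET.sub("[REDACTED]", text)


def redact_json(value):
    """Return a redacted copy: walk dicts and lists, scrub every string leaf."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        # Keys are preserved as-is; only values are scrubbed.
        return {key: redact_json(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_json(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


tool_input = {"url": "https://example.com", "headers": {"Authorization": "Bearer abc123"}}
scrubbed = redact_json(tool_input)
```

Building new containers instead of mutating in place gives the immutability property the test suite checks for, and because the marker itself matches no pattern, running the scrubber twice yields the same result.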
…survive (#41)

Closes the "ok"/"yes"/"continue" data-loss bug surfaced during gate-2 testing of #28: a user who says the same short prompt multiple times within one session has multiple distinct turns, each followed by different tool_calls. The previous (session_id, content_hash) dedupe key collapsed them into one row.

Change
------
Dedupe keys are now positional:
* mem_user_prompts: (session_id, prompt_number)
* mem_tool_calls: (session_id, turn_index, turn_subindex, retention_class='backfill_jsonl')

content_hash is still computed and stored for search/audit but is no longer the unique key. Same hash at different positions = distinct rows.

prev_user_prompt_id resolution unchanged: we still look up last_user_message against a session-local text→id map. "Latest wins" semantics mean reused prompts link to their most recent occurrence at the time of the call.

Validation (live DB, gate-3 full corpus)
----------------------------------------
* 2,385 jsonl files processed in seconds.
* 2,607 new mem_user_prompts inserted.
* 27,475 new mem_tool_calls inserted.
* 0 orphans: every backfilled tool_call has prev_user_prompt_id set.
* 83 tool_inputs truncated cleanly (>16 KB).
* 2 sessions failed on UTF-8 NUL bytes (0.08%).
* Total backfilled-subset linkage: 28,599 / 28,599 = 100%.

Compared to the buggy v1 of this script on the gate-2 corpus, the fix recovered ~22 rows per 10 sessions that the content-hash collapse had silently dropped. Extrapolated to full corpus: ~2,500 rows of conversation structure that would have been lost.

Co-authored-by: MZ <mz@wfca.com>
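To see why the content-hash key lost rows, consider a session in which the user says "ok" twice. The sketch below uses an in-memory list and a hypothetical `dedupe` helper in place of the real database lookups; SHA-256 as the hash function is also an assumption.

```python
import hashlib


def content_hash(text):
    """Assumed hash: SHA-256 of the UTF-8 prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe(rows, key):
    """Keep the first row seen for each key, mimicking lookup-before-insert."""
    seen, kept = set(), []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


session = [
    {"session_id": "s1", "prompt_number": 1, "text": "run the tests"},
    {"session_id": "s1", "prompt_number": 2, "text": "ok"},
    {"session_id": "s1", "prompt_number": 3, "text": "ok"},
]

# Old key: the second "ok" hashes identically and is treated as a duplicate.
by_hash = dedupe(session, lambda r: (r["session_id"], content_hash(r["text"])))
# New key: position in the session distinguishes the two turns.
by_position = dedupe(session, lambda r: (r["session_id"], r["prompt_number"]))
```

Both keys remain idempotent on re-import, since re-reading the same jsonl reproduces the same positions; only the positional key also preserves repeated short prompts.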
… (#42)

Closes #32. Reads the 28,599 properly-linked tool_calls + 4,502 prompts from agent-memory (#28 backfill output) and emits Qwen 2.5 chat-template-compatible rows for v2 training.

Output (data/processed/qwen25_tools/v2/, gitignored)
* train.chat.jsonl    23,983 rows
* valid.chat.jsonl    1,588 rows (5% session-aware split)
* train.tiny.jsonl    200 rows (deterministic sample, seed=42)
* valid.tiny.jsonl    30 rows
* tool_schemas.json   35 schemas (22 from v1 + 13 recovered)
* MANIFEST.json       drop reasons, tool histogram, output hashes

Key v2 vs v1 differences
* Real user prompts. Each row's user message is the actual prompt that preceded the tool call, not v1's synthetic "Call tool `Bash` with appropriate arguments." That's the fix for the empty-args loop.
* Tool descriptions baked in. Concise one-liners per tool.
* Schemas recovered for Agent + MCP tools (Agent alone was 661 rows that would have been dropped under v1's registry).
* Bash sub-classification. Cap applied per first-non-flag command token (Bash:git, Bash:pytest) at 20% of total via stratified random down-sample (seed=42).

Filters
1. Drop tool_input == {} AND schema has required[] (the v1 loop bug).
2. Drop sessions with < 2 turns.
3. Drop SKIP_TOOLS (TodoWrite, AskUserQuestion, EnterPlanMode, etc.).
4. Per-category 20% cap.

Validation
* 134/134 tests pass (106 prior + 28 new dataset-builder tests).
* 100/100 random rows render cleanly through apply_chat_template.
* Live-DB run: 28,599 fetched → 25,571 converted → 23,983 train + 1,588 valid. Drop reasons in MANIFEST: 2,973 skip_tool, 55 missing_schema (long-tail rare MCPs only).
* Sampled inspection: each row's user message is a real prompt; each tool_call has real arguments; tool response preserved.

Acceptance criteria from V2_DATA_PIPELINE_PLAN.md Step 6
* [x] ≥ 25k rows in v2/ (25,571 total, 23,983 train).
* [x] < 10% synthetic (0%; all backfill_jsonl, sentinel-tagged `synthetic: false`).
* [x] No empty-args row unless schema permits.
* [x] Tool + bash_command histograms in MANIFEST.
* [x] apply_chat_template renders cleanly on sampled rows.

Co-authored-by: MZ <mz@wfca.com>
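Two of the rules above lend themselves to a short illustration: sub-classifying Bash calls by their first non-flag token, and filter 1 (drop empty arguments when the schema requires some). The function names and the `Bash:unknown` fallback label are hypothetical, and the real builder may tokenize commands differently.

```python
import shlex


def bash_category(command):
    """Label a Bash call by its first token that does not start with '-'."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: fall back to plain whitespace splitting.
        tokens = command.split()
    for token in tokens:
        if not token.startswith("-"):
            return "Bash:" + token
    return "Bash:unknown"  # hypothetical label for commands with no usable token


def is_empty_args_row(tool_input, schema):
    """Filter 1: empty arguments are only a defect if the schema requires some."""
    return tool_input == {} and bool(schema.get("required"))


categories = [bash_category(c) for c in ("git status", "pytest -q tests/", "git log -1")]
```

Grouping by this label is what lets a per-category cap stop a handful of very common commands from dominating the Bash portion of the training set.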
…omplete (#43)

* HANDOFF.md: replaces stale 'next session = backfill' section with current state. All 7 v2 data-pipeline sub-issues closed (PRs #34/#35/#36/#37/#38/#39/#40/#41/#42 merged). Sole remaining v2 work is #33 (retrain). Includes the actual training procedure to run. Live DB stats (28,599 backfilled tool_calls, 100% linked) and v2 dataset stats (23,983 train rows) captured for the next session.
* README.md: adds 'V2 Tool-Call Dataset' subsection under Fine-Tuning Dataset Exports. Documents data/processed/qwen25_tools/v2/ shape, build command, source-of-truth tables, and link to the plan doc.
* docs/fine_tune/V2_DATA_PIPELINE_PLAN.md: per-step checklist now reflects merged PR status. #31 explicitly marked deferred (not on training critical path). #36 (project consolidation) and the daily backup work flagged as bonus items from the data audit.

Co-authored-by: MZ <mz@wfca.com>
Closes #33 and v2 parent #25.

Trains Qwen2.5-3B-Instruct on 23,983 real Claude-session prompts (vs v1's 83% synthetic). Ships GGUF Q6_K at 2.4 GB. v1 GGUF untouched (SHA 5e174a04 unchanged pre/post conversion, verified).

The v1 empty-args inference loop bug is ELIMINATED. Chat-loop test on 50 vague natural prompts: 50/50 emit fully-specified tool calls, 0 empty-args emissions. v1 used to spin until context exhaustion on these.

## Results
- HF merged (no quant noise): 86.7% aggregate validator pass
- GGUF Q6_K (ship): 85.0% aggregate / 88.9% in-dist
- Chat-loop on 50 vague prompts: 0 empty-args emissions ← real gate

Q4_K_M tested but dropped to 81.7%: quantization-induced argmax drift hurt arg commitment at low temperatures. Q6_K is the smallest quant that holds the gate.

## Files
- docs/training_runs/v2-20260514T055422Z.md     full run report
- docs/fine_tune/V2_TRAINING_PLAN.md            11-phase plan w/ rollback
- docs/fine_tune/FAILURE_MODES.md               +#12 (32k context + YaRN)
- tests/fine_tune/fixtures/vague_prompts.txt    50-prompt eval fixture
- scripts/fine_tune/validate_tool_calls.py      3 bugs fixed + suite rewrite
- pyrightconfig.json                            resolves torch/transformers
- HANDOFF.md                                    env-var commands, not flags

## Artifacts (local, not in repo)
- models/lora/qwen2.5-3b-instruct-toolcalls-lora/runs/20260514T055422Z-v2-full/
- models/merged/qwen2.5-3b-toolcalls-v2-merged/
- models/gguf/qwen2.5-3b-toolcalls-v2-q6k.gguf
  SHA 54617fbda2176166101837c26d63547196297ba7138a8f1696d3c14ec5d20ed6
- ~/.lmstudio/models/mz/qwen2.5-3b-toolcalls-v2/

## Wall clock
12h 38m full training (5996 steps, 1 epoch, MPS) + ~1h validation/GGUF.
v2 fails real-world multi-turn agentic test even with the model's own
Jinja chat template via /v1/chat/completions. Two independent test
harnesses (hand-rolled ChatML and OpenAI chat-completions) both show
v2 is strictly worse than v1: 0/10 useful answers vs 3/10, 9/10 loop
rate vs 4/10. A second regression also surfaced: in-args generative
loops on 5/10 prompts (repeated lines inside argument values until
max_tokens).
Changes:
- HANDOFF.md: prepend v2-RETRACTED status banner + summary
- docs/training_runs/v2-20260514T055422Z.md: prepend postmortem header
- docs/training_runs/v2-real-world-test.md: full v1-vs-v2 A/B with verbatim transcripts
- docs/training_runs/v2-chat-template-retest.md: chat-completions retest confirming retraction
- docs/fine_tune/V3_PLAN.md: v3 plan targeting Qwen3-VL-8B-Instruct base, with
* cloud A100 training (MPS is unworkable for 8B)
* 4-class eval suite anchored to our training data (Class A in-distribution,
B tool_response-adaptation, C OOD, D single-turn validator carryover)
* in-args repetition gate added after retest finding
* Phase 0.5 baselines v1+v2 before any v3 training
* 3 training-data fixes: stop-after-tool_call cut, oversample
text-synthesis-after-tool_response, optional negative-training pairs
- tests/fine_tune/real_world/harness.py + harness_chat.py: reusable test harnesses
- .gitignore: exclude *-results.json raw transcripts from real_world dir
v1 remains the shipped production GGUF. PR #44 stays as v2 artifact/learning
record; do NOT merge as production.
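The in-args generative loop noted above (the same line repeated inside one argument value until max_tokens) can be screened for mechanically. The sketch below is one possible check, not the gate defined in V3_PLAN.md: the function names, the choice to split on newlines and semicolons, and the `max_run=3` threshold are all assumptions.

```python
import re


def longest_repeat_run(text):
    """Length of the longest run of identical consecutive segments in a string."""
    best = run = 0
    previous = None
    # Assumption: shell one-liners separate repeated statements with ';' or newlines.
    for segment in re.split(r"[;\n]", text):
        segment = segment.strip()
        if not segment:
            continue
        run = run + 1 if segment == previous else 1
        previous = segment
        best = max(best, run)
    return best


def iter_strings(value):
    """Yield every string leaf inside nested dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def has_in_args_loop(arguments, max_run=3):
    """True if any argument string repeats one segment more than max_run times."""
    return any(longest_repeat_run(s) > max_run for s in iter_strings(arguments))


looping = {"command": "; ".join(["print('Tool_log: ...')"] * 12)}
```

A check like this is independent of the cross-call loop rate in the table: it looks only inside a single tool call, which is where this second regression shows up.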
| Test harness | v1 useful | v2 useful | v1 loop | v2 loop | v1 adapt | v2 adapt |
|---|---|---|---|---|---|---|
| Hand-rolled ChatML + /completion | 3/10 | 0/10 | 4/10 | 9/10 | 9/10 | 3/10 |
| OpenAI chat-completions + --jinja | n/a | 0/10 | n/a | 9/10 | n/a | 8/10 |
The chat-completions retest was specifically to rule out that the regression was a harness format artifact. It is not — v2's own embedded Jinja template wraps tool messages identically to the hand-rolled ChatML, and v2 still fails. See docs/training_runs/v2-chat-template-retest.md in the new commit e5f290f.
Three v2 regressions identified
- Ignores tool_response: re-emits identical or near-identical tool_call after seeing a tool_result instead of synthesizing a text answer (90% of sessions)
- In-args generative loop: on 5/10 prompts, the JSON arguments string gets stuck in a within-call repetition loop (e.g. print('Tool_log: ...') × 12 in one shell one-liner) until max_tokens truncation
- Off-topic action selection: e.g. gh issue create for a "is there a test for X?" question
v2 wins (real but insufficient to ship)
- Stays in correct repo (no fire-map path hallucination — v1's bug)
- Turn-1 args populated 10/10 (the original empty-args bug IS fixed)
- ~½ the tokens of v1 (gives up faster, not converges faster)
Decision
- PR #44 stays open as the v2 experiment / learning record
- v1 remains the shipped production GGUF (models/gguf/qwen2.5-3b-toolcalls-q4km.gguf)
- v2 GGUF stays on disk but is NOT for production use
- v3 plan: docs/fine_tune/V3_PLAN.md (Qwen3-VL-8B-Instruct base, cloud A100 training, 4-class eval suite anchored to our training data, baselines BEFORE training)
- Full postmortem: docs/training_runs/v2-real-world-test.md + docs/training_runs/v2-chat-template-retest.md
Closes-as-not-shipped: #33. Parent #25 stays open (v3 will close it).
Reusable infrastructure for fine-tune eval reports going forward.
Future evals (v1+v2 baselines for v3, v3 itself) emit structured JSON
and render through a uniform template instead of ad-hoc markdown.
Pipeline:
1. Harness emits results.json conforming to schemas/eval-report.schema.json
2. scripts/fine_tune/render_test_report.py validates against schema,
substitutes into docs/fine_tune/templates/TEST_REPORT_TEMPLATE.md
3. Optional --html flag emits HTML via pandoc (fallback: markdown lib,
then raw-md-in-pre with warning)
Files:
- schemas/eval-report.schema.json — JSON Schema 2020-12, strict on
gates/verdict/prompts, permissive on aggregate_stats/notable_findings.
Outcome enum covers both real-world classes and V3_PLAN Class B classes.
- docs/fine_tune/templates/TEST_REPORT_TEMPLATE.md — {{placeholder}}
scaffold. Self-documents syntax in HTML comment block (preserved
through rendering by the substitution function).
- scripts/fine_tune/render_test_report.py — single-file CLI. Validates,
renders markdown, optionally emits HTML. Idempotent (rerun = same
bytes). Exit codes: 0 success, 1 missing template, 2 validation
failure, 3 missing input.
- docs/fine_tune/templates/EXAMPLE_v2-chat-template-retest.results.json
— hand-built from v2-chat-template-retest.md as schema-coverage proof
- docs/fine_tune/templates/EXAMPLE_v2-chat-template-retest.md —
generator output matching the manually-written report
End-to-end verified: bad input -> exit 2 with per-field error,
example input -> markdown + HTML written, byte-identical on rerun.
Dependency: jsonschema >= 4.18 (already in .venv-finetune via existing
use). Belongs in a future requirements-dev.txt; not adding to
runtime requirements.txt.
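The substitution step of the pipeline above, including the detail that HTML comment blocks survive rendering untouched, could be written along these lines. This is a sketch rather than the contents of render_test_report.py: the `render` name, the placeholder grammar, and the decision to leave unknown placeholders in place are assumptions, and schema validation is omitted.

```python
import re

_COMMENT = re.compile(r"(<!--.*?-->)", re.S)
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, values):
    """Fill {{name}} placeholders everywhere except inside HTML comments."""
    # Splitting on a capturing group puts the comment blocks at odd indexes.
    parts = _COMMENT.split(template)
    rendered = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            rendered.append(part)  # comment block: copied through verbatim
        else:
            rendered.append(
                _PLACEHOLDER.sub(
                    # Unknown names are left as-is rather than blanked out.
                    lambda m: str(values.get(m.group(1), m.group(0))),
                    part,
                )
            )
    return "".join(rendered)


template = "<!-- write {{verdict}} to insert the verdict -->\nVerdict: {{verdict}}"
report = render(template, {"verdict": "FAIL"})
```

Because the function is a pure mapping from template and values to text, rerunning it on the same inputs produces the same bytes, which is the idempotency property the commit verifies.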
Rewrote V3_PLAN.md to lock Qwen3-8B + local MPS + ≤6GB training rule
+ vision-out-of-trained-model. Built 3 new test harnesses and ran all
5 eval classes against v1 and v2 GGUFs to establish 'before' baselines
for v3 to beat.
Headline finding (corrects earlier framing):
- v2 ADAPTS after tool_response 90% of the time (not 30% as
previously reported). The real regression is v2 NEVER concludes
with a text answer — text_answer rate 0/30 vs v1's 10/30 (33%).
- v2 is 7x better at imitating training data shape (Class A: 46.7% vs
6.7%). v1 is the only model that synthesises final prose answers.
- Project recall (Class E) is 75% for v2, 66.7% for v1 — both already
know agent-memory / fire-map / daily-dispatch from training data.
Baseline table (v1 / v2):
A shape_match 6.7% / 46.7% (n=30)
B text_answer 33% / 0% (n=30)
C useful_ans 30% / 0% (n=10)
D parse_rate 95% / 80% (n=20)
E PASS 66.7% / 75% (n=12)
v3 ship gates per V3_PLAN.md §6:
B text_answer ≥ 30% (the new headline gate)
C useful_answer ≥ 50%
D parse_rate ≥ 85%
A shape_match ≥ 60%
E PASS ≥ 70%
+ tool-call shape gate (no post-</tool_call> scaffolding)
+ in-args repetition gate (no within-arg loops)
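Checking the five numeric gates against the baseline table makes the gap for v3 concrete. The snippet below copies the thresholds and percentages from this commit message; the dictionary layout and the `failed_gates` helper are illustrative, and the two qualitative gates (tool-call shape, in-args repetition) are not modelled.

```python
# Minimum percentage per eval class, from the ship-gate list.
GATES = {"A": 60.0, "B": 30.0, "C": 50.0, "D": 85.0, "E": 70.0}

# Baseline percentages per eval class, from the baseline table.
BASELINES = {
    "v1": {"A": 6.7, "B": 33.0, "C": 30.0, "D": 95.0, "E": 66.7},
    "v2": {"A": 46.7, "B": 0.0, "C": 0.0, "D": 80.0, "E": 75.0},
}


def failed_gates(scores):
    """Return the eval classes whose score is below the ship threshold."""
    return sorted(cls for cls, minimum in GATES.items() if scores[cls] < minimum)


failures = {model: failed_gates(scores) for model, scores in BASELINES.items()}
```

Neither baseline clears every gate: v1 misses A, C and E, while v2 misses A, B, C and D, so v3 has to improve on both models rather than simply recover v1's behaviour.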
Files:
- docs/fine_tune/V3_PLAN.md: full rewrite — local MPS, Qwen3-8B base,
≤6GB rule, vision as harness pre-pass, 5 eval classes
- tests/fine_tune/real_world/harness_class_{a,b,e}.py: new harnesses
- tests/fine_tune/fixtures/project_recall_prompts.txt: 12 prompts
- docs/training_runs/baselines/class-{a,b,c,d,e}-{v1,v2}.md: 10 reports
- docs/training_runs/v3-baselines.md: master comparison summary
- schemas/eval-report.schema.json: extended eval_class enum with 'E'
- .gitignore: exclude tests/fine_tune/real_world/baselines/ (raw JSON)
Open issues:
- Class A 'needs_review' is 16/30 on v2 (plausible alternatives, not
shape-matches). Needs manual labeling pass to decide if those count
toward the 60% v3 gate.
- Class C is only 10 prompts; expand to 30 for v3.
- Project-tagged oversampling (V3_PLAN §5 fix #6) may be lower priority
than expected — v2 already at 75% Class E without it. Text-synthesis
oversampling (fix #2) is the highest-impact change.
metazen11 pushed a commit that referenced this pull request on May 24, 2026:
…ders)
A lesson with trigger_on='input' AND no trigger_tool AND no trigger_pattern
matches every Edit/Write/Bash/NotebookEdit call and dominates the
per-tool-call systemMessage budget. The synth-lesson #86 ('never X') was
the canonical bad case; three legit cross-cutting CRITICAL safeguards
(#35 read-before-edit, #36 docker-restart-zombie-cron, #44 infra-dev-first)
also fall in this set today.
Three layers of defense:
1. API validator (app/routes/lessons.py::_validate_trigger_on) — returns
400 for any new POST /api/lessons with trigger_on='input' and neither
trigger_tool nor trigger_pattern. Clean error message points to the
actual constraint.
2. DB CHECK constraint (migration 016, chk_input_trigger_has_filter,
added NOT VALID) — refuses any INSERT or UPDATE that would create a
broad-match input-triggered row. NOT VALID intentionally — three
existing legacy rows (#35/#36/#44) are grandfathered per the agreed
policy. Operator can run VALIDATE CONSTRAINT later once they're
narrowed.
3. Runtime filter (app/routes/lessons.py::match_lessons) — the SQL
condition '(l.trigger_tool IS NOT NULL OR l.trigger_pattern IS NOT NULL)'
for trigger_on='input' skips legacy broad-match rows at match time,
so the 3 existing CRITICAL safeguards no longer fire on every Bash
call (which was the actual user-visible spam problem). When they're
relevant they can fire again by being narrowed.
Tests pin:
- POST /api/lessons rejects broad-match input lessons with 400
- /api/lessons/match skips broad-match rows even if inserted via direct
SQL (self-skips when the CHECK constraint blocks the test setup, which
itself confirms the constraint is enforcing)
Updated test_create_global_lesson to include a trigger_pattern (was
implicitly broad-match before; the test only verified the create path,
not match semantics).
Working-branch: fix/trigger-lessons
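The rule shared by all three layers (an input-triggered lesson must carry a tool filter or a pattern filter) reduces to one predicate. The sketch below is not the `_validate_trigger_on` function from app/routes/lessons.py; the function name and error wording are hypothetical, and it mirrors the SQL condition by testing for None rather than for empty strings.

```python
def broad_match_error(trigger_on, trigger_tool=None, trigger_pattern=None):
    """Return an error message for a broad-match input lesson, else None."""
    if trigger_on == "input" and trigger_tool is None and trigger_pattern is None:
        # Hypothetical wording; a caller would turn this into an HTTP 400.
        return "trigger_on='input' requires trigger_tool or trigger_pattern"
    return None


checks = [
    broad_match_error("input"),
    broad_match_error("input", trigger_tool="Bash"),
    broad_match_error("input", trigger_pattern="docker restart"),
]
```

Keeping the same condition in the API validator, the CHECK constraint and the match query means a row that slips past one layer, for example through direct SQL, is still caught or skipped by another.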
Summary
… 5e174a04 unchanged), verified.

Results
The chat-loop test is the authoritative gate: 50/50 prompts that USED to trip v1's empty-args loop produced fully-specified non-empty tool calls from v2 in 48 seconds total. Pattern fixed.

Validator regressions discovered + fixed
Three independent bugs in scripts/fine_tune/validate_tool_calls.py that were preventing v2 from being measured correctly:
- tool_schemas.json source hardcoded to v1; now auto-picks latest dataset version
- … "with access to tools." suffix that's in v2 training data; now matches

Plus: IN_DISTRIBUTION_PROMPTS rewritten from v1's synthetic shape to realistic single-turn asks (v2 was explicitly trained NOT to memorize the v1 shape; docstring documents this).

Documentation
- docs/training_runs/v2-20260514T055422Z.md: full run report
- docs/fine_tune/V2_TRAINING_PLAN.md: 11-phase plan with gates + rollback (durable for v3+)
- docs/fine_tune/FAILURE_MODES.md: added #12 (32k context + YaRN serve-time mitigation)
- tests/fine_tune/fixtures/vague_prompts.txt: 50 prompts that trigger the empty-args bug

Artifacts (local, not in repo)
- models/lora/qwen2.5-3b-instruct-toolcalls-lora/runs/20260514T055422Z-v2-full/: LoRA adapter
- models/merged/qwen2.5-3b-toolcalls-v2-merged/: merged HF (for future requantization)
- models/gguf/qwen2.5-3b-toolcalls-v2-q6k.gguf, SHA 54617fbda2176166101837c26d63547196297ba7138a8f1696d3c14ec5d20ed6
- ~/.lmstudio/models/mz/qwen2.5-3b-toolcalls-v2/qwen2.5-3b-toolcalls-v2-q6k.gguf

Known caveats
- empty_args_emissions_total counter not wired into /api/stats (validator JSON only).

Test plan
- … --min-parse-rate 0.03)