feat(fine-tune): v2 retrain — kills empty-args loop bug (#33)#44

Open
metazen11 wants to merge 15 commits into main from feat/v2-finetune-data-pipeline

Conversation

@metazen11
Owner

Summary

Results

| Test | v2 | Gate |
|---|---|---|
| HF merged validator (full precision) | 86.7% | ≥85% ✅ |
| GGUF Q6_K validator (ship artifact) | 85.0% | ≥85% ✅ |
| Chat-loop on 50 vague prompts | 0 empty-args emissions | 0 required |

The chat-loop test is the authoritative gate: 50/50 prompts that USED to trip v1's empty-args loop produced fully-specified non-empty tool calls from v2 in 48 seconds total. Pattern fixed.

Validator regressions discovered + fixed

Three independent bugs in scripts/fine_tune/validate_tool_calls.py that were preventing v2 from being measured correctly:

  1. tool_schemas.json source hardcoded to v1 — now auto-picks latest dataset version
  2. Tool descriptions were overridden with placeholder text — now uses real descriptions from the dataset
  3. System prompt missing "with access to tools." suffix that's in v2 training data — now matches

Plus: IN_DISTRIBUTION_PROMPTS rewritten from v1's synthetic shape to realistic single-turn asks (v2 was explicitly trained NOT to memorize the v1 shape; docstring documents this).

Documentation

  • docs/training_runs/v2-20260514T055422Z.md — full run report
  • docs/fine_tune/V2_TRAINING_PLAN.md — 11-phase plan with gates + rollback (durable for v3+)
  • docs/fine_tune/FAILURE_MODES.md — added entry #12 (32k context + YaRN serve-time mitigation)
  • tests/fine_tune/fixtures/vague_prompts.txt — 50 prompts that trigger the empty-args bug

Artifacts (local, not in repo)

  • models/lora/qwen2.5-3b-instruct-toolcalls-lora/runs/20260514T055422Z-v2-full/ — LoRA adapter
  • models/merged/qwen2.5-3b-toolcalls-v2-merged/ — merged HF (for future requantization)
  • models/gguf/qwen2.5-3b-toolcalls-v2-q6k.gguf SHA 54617fbda2176166101837c26d63547196297ba7138a8f1696d3c14ec5d20ed6
  • ~/.lmstudio/models/mz/qwen2.5-3b-toolcalls-v2/qwen2.5-3b-toolcalls-v2-q6k.gguf

Known caveats

  • 32k native context ceiling (Qwen 2.5-3B). YaRN at serve time extends to 128k+ if needed; documented in FAILURE_MODES.md entry #12.
  • MPS training is slow (12h 38m for 1 epoch on this dataset). v3 should run on cloud A100 (~$10, ~50 min).
  • empty_args_emissions_total counter not wired into /api/stats (validator JSON only).

Test plan

  • Tiny training + validate (--min-parse-rate 0.03)
  • Full training exits clean (5996 steps, exit 0)
  • HF merged validator ≥85% (got 86.7%)
  • Q6_K GGUF validator ≥85% (got 85.0%)
  • Chat-loop verification: 0 empty-args emissions on 50 vague prompts (got 0)
  • v1 GGUF unchanged (SHA verified)
  • Run report committed

wfca-mz and others added 13 commits May 13, 2026 14:58
Captures full context from v1 ship session for next-session pickup:

- PR #24 merged to main (Qwen 2.5-3B tool-call LoRA, v1).
- Loop bug in v1: empty <tool_call> arguments on vague prompts, traced to
  83% of v1 training data being synthetic "Call tool X with appropriate
  arguments" prompts.
- agent-memory schema is right, data linkage is wrong: 54,987 mem_tool_calls
  rows have full tool_input args, but 0 tool_calls join back to a
  same-session user prompt (only 54/500 sessions captured prompts).
- ~/.claude/projects/**/*.jsonl is the source of truth — 2,367 jsonl files
  with full turn history including tool_use + tool_result blocks.
- Lookup/recall surface gap: /api/observations and MCP search return
  observation summaries but never surface tool_input (the args).
  Agents can recall "I called Bash" but not "I called Bash with what".

v2 plan in handoff: schema additions (optional), claude jsonl backfill
script (dry-run first), audit live hooks, fix lookup surface, build v2
dataset SQL query, open issue #25, retrain at 1.0 epoch with tool
descriptions, add anti-loop validator flag.

Branch: feat/v2-finetune-data-pipeline off the merged main.
Plan (docs/fine_tune/V2_DATA_PIPELINE_PLAN.md) defines the 9-step path
from the v1 empty-args loop bug to v2: schema migration, jsonl backfill,
hook audit, recall surface fix, dataset rebuild, retrain, and anti-loop
guard. Maps to issue #25 (parent) + #26-#33 sub-issues.

Quality-gate review (docs/fine_tune/reviews/25-quality-gate.json):
approve_with_changes, 8 blockers + 6 non-blockers, all addressed in
this revision. Decisions logged:
- retention_class column, 365-d default for both live + backfill
- /api/observations handler stays in app/main.py
- MCP tool_input exposure default OFF (env-gated)
- recall surface stays inside #25
- Bash sub-classified by first non-flag command token
- PG 16.13 pinned; CREATE INDEX CONCURRENTLY + NOT VALID FK pattern

Adds AntiLoopDetector in scripts/fine_tune/validate_tool_calls.py and a
--anti-loop CLI flag. Detects the empty-args infinite-loop pattern v1
exhibits on vague LM Studio prompts: 3 consecutive identical normalized
tool calls in a conversation flags the 3rd for suppression. Emits WARN log
tagged with --model-version and increments empty_args_emissions_total in
the report.

Hooked at the validator/eval level here; production wiring in mcp_server.py
and the Claude hooks is part of #33 retrain follow-up.
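
The detection rule described above can be sketched roughly as follows. This is an illustration only, not the code in scripts/fine_tune/validate_tool_calls.py: the class name `AntiLoopSketch`, the `observe` method, the `flagged_total` counter, and the normalization (tool name plus sorted-key JSON) are all assumptions.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def normalize_call(name: str, arguments: dict[str, Any]) -> str:
    """Canonical form (assumed): tool name plus arguments with sorted keys."""
    body = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return f"{name}:{body}"


class AntiLoopSketch:
    """Flags a tool call once it is the Nth consecutive identical call."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._last: Optional[str] = None
        self._streak = 0
        self.flagged_total = 0  # hypothetical stand-in for the report counter

    def observe(self, name: str, arguments: dict[str, Any]) -> bool:
        """Return True when this call should be suppressed."""
        key = normalize_call(name, arguments)
        if key == self._last:
            self._streak += 1
        else:
            # A different call resets the streak.
            self._last = key
            self._streak = 1
        flagged = self._streak >= self.threshold
        if flagged:
            self.flagged_total += 1
        return flagged
```

With the default threshold, two identical calls pass and the third is flagged; any different call in between resets the streak.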

- 10 unit tests in tests/fine_tune/test_anti_loop.py
- FAILURE_MODES.md entry #11 documents symptom + canonical mitigation
- .gitignore: .claude/worktrees/ (prevents background-agent worktree
  pollution into parent commits)

Tests: 34/34 pass (24 existing fine-tune + 10 new anti-loop).

Co-authored-by: MZ <mz@wfca.com>

Implements migration 012 from V2_DATA_PIPELINE_PLAN.md Step 1: adds the
columns + FK + indexes that close the prompt-to-tool-call linkage gap.

Schema changes:
  mem_projects.git_remote
  mem_tool_calls.{turn_index, turn_subindex, prev_user_prompt_id,
                  backfill_run_id, retention_class, content_hash,
                  truncated_at_bytes}
  mem_user_prompts.{retention_class, backfill_run_id, turn_index,
                    content_hash}
  FK mem_tool_calls.prev_user_prompt_id → mem_user_prompts.id
    (added NOT VALID, validated in concurrent section)
  CREATE INDEX CONCURRENTLY for the two new indexes.

Migration runner (app/migrate.py) gains support for .concurrent.sql
companion files that run outside any transaction. SQL splitter handles
single-line comments, block comments, single-quote strings (with ''
escapes), and dollar-quoted bodies. Tracking table constraint moved
from UNIQUE(version) to UNIQUE(filename) so base + companion can
coexist under the same version number.
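
A splitter with the four cases named above might look like the sketch below. It is not the code in app/migrate.py; the function name `split_sql` is hypothetical, and it simplifies in two known ways: block comments are treated as non-nesting, and comments are dropped from the output rather than preserved.

```python
from __future__ import annotations

import re

# Dollar-quote opener such as $$ or $body$ (tag is optional).
_DOLLAR_TAG = re.compile(r"\$(?:[A-Za-z_][A-Za-z0-9_]*)?\$")


def split_sql(script: str) -> list[str]:
    """Split a SQL script on semicolons that sit outside comments and quotes."""
    statements: list[str] = []
    buf: list[str] = []
    i, n = 0, len(script)
    while i < n:
        ch = script[i]
        two = script[i:i + 2]
        if two == "--":  # line comment: skip to end of line
            end = script.find("\n", i)
            i = n if end == -1 else end
        elif two == "/*":  # block comment: skip past the closing */
            end = script.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch == "'":  # string literal; '' inside it is an escaped quote
            j = i + 1
            while j < n:
                if script[j] == "'":
                    if script[j:j + 2] == "''":
                        j += 2
                        continue
                    break
                j += 1
            buf.append(script[i:j + 1])
            i = j + 1
        elif ch == "$" and (m := _DOLLAR_TAG.match(script, i)):
            # Dollar-quoted body: copy through the matching closing tag.
            tag = m.group(0)
            end = script.find(tag, m.end())
            stop = n if end == -1 else end + len(tag)
            buf.append(script[i:stop])
            i = stop
        elif ch == ";":  # statement boundary
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements
```

Splitting matters for the .concurrent.sql companions because each CREATE INDEX CONCURRENTLY has to be sent as its own statement outside a transaction.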

Tests (tests/migrations/) — first migration tests in this repo:
  conftest.py         per-module throwaway PG database with vector ext
  test_012.py         7 schema assertions + 1 idempotency
  test_012_reverse.py apply / reverse / re-apply cycle
9/9 passing.

Verified on live agent-memory DB (PG 16.13, 54,987 mem_tool_calls rows):
migration completes in 0.27s; re-run is no-op.

Co-authored-by: MZ <mz@wfca.com>

…36) (#37)

Closes #36. Sub-folder cwds now collapse to one project per git repo;
tool calls get tagged with the branch and SHA at moment of call.

Schema (migration 013, transactional + concurrent split):
  mem_projects:
    +canonical_root_path     git rev-parse --show-toplevel result
    +source_kind             'git' | 'non-git' | 'ephemeral'
    +parent_project_id       self-FK ON DELETE SET NULL
  mem_tool_calls:
    +git_branch              branch at moment of call
    +git_sha                 HEAD sha at moment of call
  indexes on canonical_root_path, git_remote, source_kind,
  parent_project_id, git_branch (all CREATE INDEX CONCURRENTLY)
  + .down.sql reversal.

Helper (app/git_context.py):
  resolve_git_context(cwd) -> GitContext
  * Resolves cwd to canonical root + remote + branch + sha.
  * Strips userinfo from HTTPS remote URLs before storage (defense
    against accidentally-stored creds).
  * Tightened ephemeral path regex — matches pytest-of-USER/, mktemp
    /tmp/tmp.X/ shapes ONLY, not any path containing 'pytest' or 'tmp'.
  * 30s per-cwd cache, bounded to 1024 entries.
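
The userinfo stripping can be illustrated with the standard library alone. This is a sketch under assumptions, not the code in app/git_context.py; the function name `strip_userinfo` is hypothetical, and SSH-style remotes are deliberately left untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(remote_url: str) -> str:
    """Drop any user:password@ prefix from an HTTP(S) remote URL."""
    parts = urlsplit(remote_url)
    if parts.scheme not in ("http", "https") or "@" not in parts.netloc:
        # scp-style remotes (git@host:org/repo.git) have no URL scheme
        # and are returned unchanged.
        return remote_url
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `strip_userinfo("https://user:tok@github.com/o/r.git")` yields `https://github.com/o/r.git`, so a token embedded in a remote never reaches mem_projects.git_remote.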

Writer (app/project.py + app/routes/observations.py):
  ensure_project() now keys git projects on the canonical root.
  Sub-folder cwds upsert into the same row instead of creating leaves.
  COALESCE-based upgrade-in-place for rows pre-existing migration 013.
  Tool-call INSERT populates git_branch + git_sha from resolved context.

Consolidation script (scripts/backfill/consolidate_projects.py):
  Dry-run by default. --commit applies. Resolves every pre-013
  mem_projects row to its canonical context, sets canonical_root_path
  + git_remote + source_kind, links sub-folders to their canonical
  parent via parent_project_id, and re-maps existing
  mem_tool_calls.project_id to point at the canonical row. Atomic
  inside one transaction. Live-DB run remapped 5,915 tool_calls
  across 688 mem_projects rows in seconds; 470 ephemeral pytest
  rows correctly flagged for training-data exclusion.

Tests (78 total, all passing):
  tests/app_helpers/test_git_context.py     (28) git resolution,
                                                  redaction, cache,
                                                  ephemeral regex
                                                  parametrized.
  tests/test_cross_agent_compat.py          (7)  Claude + Anvil +
                                                  Codex hook shapes
                                                  end-to-end through
                                                  /api/queue against
                                                  the live server,
                                                  verified by direct
                                                  asyncpg read.
  Existing tests/migrations + tests/fine_tune untouched: 43 still pass.

SQL hygiene: every query is asyncpg-parametrized; SQL strings live
as named module-level constants in app/project.py and the
consolidation script for audit visibility.

Test infrastructure:
  pytest.ini: + pythonpath = .
  tests/conftest.py: bootstrap sys.path + X-Agent-Name header for
                     trusted-agent auth bypass.
  tests/app/ renamed to tests/app_helpers/ to avoid package-name
  collision with app/.

Out of scope (flagged for follow-up):
  - app/routes/tool_calls.py is unmounted in app/main.py (pre-existing
    bug; intentionally not fixed here to keep scope tight).
  - Same logical repo at two checkouts still has two mem_projects rows;
    can be deduped later by joining on git_remote. Data is now correct
    for that to work.

Co-authored-by: MZ <mz@wfca.com>

Closes #30. Adds the prompt-write path that's been missing: no hook
listened to UserPromptSubmit, and there was no INSERT into
mem_user_prompts anywhere in app/. The 1,410 existing rows were all
from a one-shot scripts/backfill_jsonl.py run back in March; nothing
captured live since.

What this adds
--------------
* POST /api/prompts — new endpoint in app/routes/prompts.py. Writes
  redacted prompt text + (session_id, project_id, prompt_number,
  turn_index, content_hash, retention_class='live') to
  mem_user_prompts. Routes through ensure_project so prompts land
  under the consolidated canonical git-root project from #36.
* hooks/user-prompt-submit.js — fire-and-forget Node hook. Reads
  Claude Code's UserPromptSubmit stdin shape (session_id, cwd,
  prompt), POSTs to /api/prompts, exits 0 silently on any error.
  Mirrors post-tool-use.js's auth + timeout pattern.
* Idempotency: (session_id, content_hash) lookup before insert.
  Re-firing the hook for the same prompt returns status='exists'
  instead of inserting a duplicate.
* Redaction: prompt text goes through redact_text() before storage
  (catches sk-ant, hf_, ghp_, Bearer, etc.).

Tests (6 new, all passing)
--------------------------
* test_prompt_creates_row — basic write + canonical project resolution.
* test_same_prompt_twice_is_idempotent — second POST returns 'exists'.
* test_multiple_unique_prompts_get_sequential_numbers — prompt_number
  and turn_index increment 1,2,3.
* test_anvil_and_codex_can_post_prompts — same payload shape works
  for non-Claude agents.
* test_secrets_in_prompt_are_redacted_on_write — sk-ant-api03-…
  scrubbed via redact_text().
* test_missing_cwd_uses_unknown_project — graceful fallback when the
  hook can't determine cwd (matches existing observations.py pattern).

Total suite now at 84/84 green (78 from #36 + 6 new). No regressions.

Wiring (host-level, manual)
---------------------------
~/.claude/hooks/agent-memory-user-prompt-submit.js → symlink to repo.
~/.claude/settings.json → added UserPromptSubmit handler entry.

Both changes are user-environment, not repo-tracked. Re-applying on a
new machine documented in the PR body checklist.

Why not in pre-tool-use.js or post-tool-use.js
----------------------------------------------
The PostToolUse hook input has no prompt field — Claude Code's hook
contract is event-specific. UserPromptSubmit is the only event that
carries the prompt text, and it fires exactly once per user turn.

Co-authored-by: MZ <mz@wfca.com>

)

Wires the existing scripts/backup.sh into a macOS launchd job that fires
daily at 03:14 local time. Retention: 3 most recent daily_*.sql.gz files
(rotation already in backup.sh). Manual snapshots with other prefixes
(e.g. pre_v2_backfill_*) are preserved.

What this adds
--------------
* scripts/com.metazen.agent-memory-backup.plist — launchd template with
  __PROJECT_DIR__ and __HOME__ placeholders.
* scripts/install_backup_schedule.sh — renders the template, copies it
  to ~/Library/LaunchAgents/, bootstraps the job. Supports --check and
  --uninstall. Falls back to crontab on non-macOS.
* hooks/ensure-services.js — new ensureBackupSchedule() called at the
  end of Main after services come up. Idempotent: re-installs only when
  the template is newer than the installed plist or the target is
  missing. macOS only. Never fails session start (debug-log + swallow).

scripts/backup.sh — bug fixes
-----------------------------
Pre-existing script hard-coded user=agentmem with no password support,
which fails against the actual dev setup that uses DATABASE_URL. Now:
- Prefers DATABASE_URL (matches the FastAPI server's DSN).
- Falls back to POSTGRES_USER + PGPASSWORD env when DATABASE_URL absent.
- Refuses to keep a backup file < 1 KB (catches silent auth failures
  that would otherwise produce a near-empty .gz).

Verified
--------
* install_backup_schedule.sh --check reports plist installed + job loaded.
* backup.sh produced a 319 MB gzipped dump and rotated correctly.
* launchctl list confirms com.metazen.agent-memory-backup is scheduled.

Docs
----
* docs/backups.md (new) — operator reference: setup, verification,
  restore, manual run, disabling, non-macOS fallback.
* README.md, handoff.md — short pointers to docs/backups.md.

Co-authored-by: MZ <mz@wfca.com>

… linkage (#28) (#40)

Closes #28. New focused backfill script that imports tool_calls and
user_prompts from ~/.claude/projects/**/*.jsonl with full v2 field
population — prev_user_prompt_id linkage, turn_index, content_hash,
retention_class, backfill_run_id.

Distinct from scripts/backfill_jsonl.py, which generates
mem_observations via LLM per row. The existing parser
(scripts/backfill_jsonl.py:parse_jsonl_session) is imported as the
single source of truth for jsonl shape interpretation. This script is
faster, no-LLM, and writes the v2 columns the fine-tune pipeline needs.

What this adds
--------------
* scripts/backfill/backfill_tool_calls_from_jsonl.py — fail-fast CLI
  with dry-run default and --limit N for incremental gates (1 session →
  10 → full corpus).
* app/redact.py:redact_json() — recursive scrubber that walks dicts and
  lists, redacting every string leaf. Required because tool_input is
  nested JSON; secrets inside (e.g. headers.Authorization) slip past
  the existing top-level redact_text. Used by the backfill writer.
* tests/app_helpers/test_redact_json.py (22 tests) — parametrized
  contract for nested redaction, immutability, idempotency, dict-key
  preservation, parity with redact_text on plain strings.
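
The recursive walk can be sketched as below. The `redact_text` here is a simplified stand-in with assumed patterns and an assumed `[REDACTED]` placeholder; the real patterns in app/redact.py are not shown in this PR.

```python
import re
from typing import Any

# Assumed patterns for illustration only.
_SECRET = re.compile(r"sk-ant-[A-Za-z0-9_-]+|hf_[A-Za-z0-9]+|ghp_[A-Za-z0-9]+|Bearer\s+\S+")


def redact_text(text: str) -> str:
    """Stand-in for the existing top-level scrubber."""
    return _SECRET.sub("[REDACTED]", text)


def redact_json(value: Any) -> Any:
    """Return a redacted copy; every string leaf goes through redact_text."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        # Keys are preserved as-is; only values are scrubbed.
        return {key: redact_json(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_json(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Because the function builds new containers instead of mutating in place, the caller's original tool_input is left intact, and running it twice gives the same result as running it once.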

Edge cases handled (all 8 named in V2_DATA_PIPELINE_PLAN.md)
-----------------------------------------------------------
1. Malformed jsonl line     — parser already skips + continues file.
2. Orphan tool_use          — inserted with prev_user_prompt_id=NULL,
                              counted in stats.tool_calls_orphan.
3. tool_use w/o tool_result — row present, response_preview=NULL.
4. Multi-tool turns         — shared turn_index, turn_subindex 0..N-1.
5. tool_input > 16 KB       — stored truncated, truncated_at_bytes set
                              to the byte budget.
6. Unicode                  — passed through asyncpg's standard codec;
                              session-atomic rollback on rare failure.
7. Non-git cwd              — project resolved via ensure_project's
                              non-git path (literal cwd as full_path).
8. Missing cwd on disk      — same as #7 via resolve_git_context's
                              graceful degradation.
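
Edge cases 4 and 5 are the most mechanical, so a rough sketch may help. Everything here is an assumption made for illustration: the function names, the 1-based turn_index, treating one assistant message as one turn, and truncating the UTF-8 encoded JSON at a 16 KiB budget.

```python
from __future__ import annotations

import json
from typing import Any, Optional

BYTE_BUDGET = 16 * 1024  # assumed reading of "16 KB"


def index_tool_calls(assistant_turns: list[list[str]]) -> list[tuple[str, int, int]]:
    """Give tool calls in the same turn a shared turn_index and subindex 0..N-1."""
    rows = []
    for turn_index, tool_names in enumerate(assistant_turns, start=1):
        for turn_subindex, name in enumerate(tool_names):
            rows.append((name, turn_index, turn_subindex))
    return rows


def truncate_tool_input(tool_input: dict[str, Any]) -> tuple[str, Optional[int]]:
    """Return (stored_text, truncated_at_bytes); the marker is None if it fit."""
    raw = json.dumps(tool_input, ensure_ascii=False).encode("utf-8")
    if len(raw) <= BYTE_BUDGET:
        return raw.decode("utf-8"), None
    # Cut at the budget; drop a trailing partial multi-byte character if any.
    return raw[:BYTE_BUDGET].decode("utf-8", errors="ignore"), BYTE_BUDGET
```

Note that a truncated value is no longer valid JSON, which is why a truncated_at_bytes marker is useful downstream.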

Idempotency + crash recovery
----------------------------
Row-level dedupe via (session_id, content_hash) for BOTH prompts and
tool_calls. Re-running --commit on the same jsonl is a no-op (verified
in gate-1 testing). Each session imports inside its own asyncpg
transaction — a crash leaves a session fully absent, not partially
imported, so a re-run starts cleanly.

Why git_branch / git_sha stay NULL for backfilled rows
------------------------------------------------------
Those columns reflect "branch at moment of call". Resolving them today
via git -C <cwd> would write today's state into rows that represent
work from months ago — actively misleading. The live writer at
app/routes/observations.py DOES populate them for going-forward
captures. The two paths are deliberately asymmetric.

Gate-1 validation (live DB, 1 session)
---------------------------------------
backfill_run_id=20260514T043555Z imported 31 prompts + 40 tool_calls
from session 3bd50552. Verified:
  * 40/40 tool_calls have prev_user_prompt_id set (100% linkage).
  * turn_index + turn_subindex correctly group multi-tool turns.
  * Each tool_call's linked prompt is the user message that
    immediately preceded it in the conversation.
  * Re-running --commit detected all 71 rows as duplicates (zero
    inserted on second pass).

Rollback for the gate-1 run is documented in the script docstring.
Gate-2 (10 sessions) and gate-3 (full corpus) to run separately.

Co-authored-by: MZ <mz@wfca.com>

…survive (#41)

Closes the "ok"/"yes"/"continue" data-loss bug surfaced during gate-2
testing of #28: a user who says the same short prompt multiple times
within one session has multiple distinct turns, each followed by
different tool_calls. The previous (session_id, content_hash) dedupe
key collapsed them into one row.

Change
------
Dedupe keys are now positional:
* mem_user_prompts: (session_id, prompt_number)
* mem_tool_calls:   (session_id, turn_index, turn_subindex,
                     retention_class='backfill_jsonl')

content_hash is still computed and stored for search/audit but is
no longer the unique key. Same hash at different positions = distinct
rows.

prev_user_prompt_id resolution unchanged: we still look up
last_user_message against a session-local text→id map. "Latest wins"
semantics mean reused prompts link to their most recent occurrence
at the time of the call.
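
The difference between the two dedupe keys is easy to see on a toy session. The sketch below is illustrative only: the row shape, the `dedupe` helper, and the use of SHA-256 for content_hash are assumptions, not the backfill script's code.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe(rows, key_fn):
    """Keep the first row seen for each key."""
    seen = {}
    for row in rows:
        seen.setdefault(key_fn(row), row)
    return list(seen.values())


session = [
    {"session_id": "s1", "prompt_number": 1, "text": "run the tests"},
    {"session_id": "s1", "prompt_number": 2, "text": "ok"},
    {"session_id": "s1", "prompt_number": 3, "text": "ok"},
]

# Old key: the two "ok" turns collapse into one row.
by_hash = dedupe(session, lambda r: (r["session_id"], content_hash(r["text"])))
# New key: position in the session keeps them distinct.
by_position = dedupe(session, lambda r: (r["session_id"], r["prompt_number"]))

# "Latest wins" text-to-id map used for prev_user_prompt_id resolution.
latest = {}
for row in session:
    latest[row["text"]] = row["prompt_number"]

print(len(by_hash), len(by_position), latest["ok"])  # prints: 2 3 3
```

The last line shows the "latest wins" behaviour: after the third turn, a tool call following "ok" links to prompt 3 rather than prompt 2.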

Validation (live DB, gate-3 full corpus)
----------------------------------------
* 2,385 jsonl files processed in seconds.
* 2,607 new mem_user_prompts inserted.
* 27,475 new mem_tool_calls inserted.
* 0 orphans — every backfilled tool_call has prev_user_prompt_id set.
* 83 tool_inputs truncated cleanly (>16 KB).
* 2 sessions failed on UTF-8 NUL bytes (0.08%).
* Total backfilled-subset linkage: 28,599 / 28,599 = 100%.

Compared to the buggy v1 of this script on the gate-2 corpus, the fix
recovered ~22 rows per 10 sessions that the content-hash collapse had
silently dropped. Extrapolated to full corpus: ~2,500 rows of
conversation structure that would have been lost.

Co-authored-by: MZ <mz@wfca.com>

… (#42)

Closes #32. Reads the 28,599 properly-linked tool_calls + 4,502 prompts
from agent-memory (#28 backfill output) and emits Qwen 2.5
chat-template-compatible rows for v2 training.

Output (data/processed/qwen25_tools/v2/, gitignored)
* train.chat.jsonl       23,983 rows
* valid.chat.jsonl        1,588 rows (5% session-aware split)
* train.tiny.jsonl          200 rows (deterministic sample, seed=42)
* valid.tiny.jsonl           30 rows
* tool_schemas.json      35 schemas (22 from v1 + 13 recovered)
* MANIFEST.json          drop reasons, tool histogram, output hashes

Key v2 vs v1 differences
* Real user prompts. Each row's user message is the actual prompt that
  preceded the tool call, not v1's synthetic "Call tool `Bash` with
  appropriate arguments." That's the fix for the empty-args loop.
* Tool descriptions baked in. Concise one-liners per tool.
* Schemas recovered for Agent + MCP tools (Agent alone was 661 rows
  that would have been dropped under v1's registry).
* Bash sub-classification. Cap applied per first-non-flag command
  token (Bash:git, Bash:pytest) at 20% of total via stratified random
  down-sample (seed=42).
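
The Bash sub-classification and cap could be sketched like this. The function names are hypothetical, the cap is assumed to be 20% of the pre-cap row count, and the classifier is deliberately naive (a leading `FOO=1` environment assignment would be treated as the command token).

```python
import random
import shlex
from collections import defaultdict


def bash_category(command: str) -> str:
    """Label a Bash call by its first token that is not a flag."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to whitespace split
        tokens = command.split()
    for token in tokens:
        if not token.startswith("-"):
            return f"Bash:{token}"
    return "Bash:unknown"


def cap_categories(rows, category_of, max_share=0.20, seed=42):
    """Down-sample any category that exceeds max_share of the input rows."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    limit = int(len(rows) * max_share)
    groups = defaultdict(list)
    for row in rows:
        groups[category_of(row)].append(row)
    kept = []
    for category in sorted(groups):  # sorted for a deterministic draw order
        members = groups[category]
        if len(members) > limit:
            members = rng.sample(members, limit)
        kept.extend(members)
    return kept
```

On 100 rows split 60 `git`, 30 `ls`, 10 `pytest`, this keeps 20 + 20 + 10 = 50 rows, so no single command family dominates the training mix.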

Filters
1. Drop tool_input == {} AND schema has required[] (the v1 loop bug).
2. Drop sessions with < 2 turns.
3. Drop SKIP_TOOLS (TodoWrite, AskUserQuestion, EnterPlanMode, etc.).
4. Per-category 20% cap.
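
Filters 1 and 3 reduce to a per-row predicate, sketched below under assumptions: the function name `keep_row`, the schema shape (a top-level `required` list), and the `empty_args` reason label are hypothetical, while `skip_tool` and `missing_schema` are the drop reasons reported in the MANIFEST. The SKIP_TOOLS set lists only the three tools named above; the full set is not shown here.

```python
from __future__ import annotations

from typing import Any, Optional

# Only the tools named in the commit message; the real set is longer.
SKIP_TOOLS = {"TodoWrite", "AskUserQuestion", "EnterPlanMode"}


def keep_row(
    tool_name: str,
    tool_input: dict[str, Any],
    schemas: dict[str, dict[str, Any]],
) -> tuple[bool, Optional[str]]:
    """Return (keep, drop_reason) for one candidate training row."""
    if tool_name in SKIP_TOOLS:
        return False, "skip_tool"
    schema = schemas.get(tool_name)
    if schema is None:
        return False, "missing_schema"
    # The v1 loop pattern: empty arguments although the schema requires some.
    if tool_input == {} and schema.get("required"):
        return False, "empty_args"
    return True, None
```

An empty tool_input is still kept when the schema has no required fields, which matches the "unless schema permits" acceptance criterion.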

Validation
* 134/134 tests pass (106 prior + 28 new dataset-builder tests).
* 100/100 random rows render cleanly through apply_chat_template.
* Live-DB run: 28,599 fetched → 25,571 converted → 23,983 train +
  1,588 valid. Drop reasons in MANIFEST: 2,973 skip_tool, 55
  missing_schema (long-tail rare MCPs only).
* Sampled inspection: each row's user message is a real prompt; each
  tool_call has real arguments; tool response preserved.

Acceptance criteria from V2_DATA_PIPELINE_PLAN.md Step 6
* [x] ≥ 25k rows in v2/ (25,571 total, 23,983 train).
* [x] < 10% synthetic (0% — all backfill_jsonl, sentinel-tagged
  `synthetic: false`).
* [x] No empty-args row unless schema permits.
* [x] Tool + bash_command histograms in MANIFEST.
* [x] apply_chat_template renders cleanly on sampled rows.

Co-authored-by: MZ <mz@wfca.com>

…omplete (#43)

* HANDOFF.md: replaces stale 'next session = backfill' section with
  current state — all 7 v2 data-pipeline sub-issues closed (PRs
  #34/#35/#36/#37/#38/#39/#40/#41/#42 merged). Sole remaining v2 work
  is #33 (retrain). Includes the actual training procedure to run.
  Live DB stats (28,599 backfilled tool_calls, 100% linked) and v2
  dataset stats (23,983 train rows) captured for the next session.

* README.md: adds 'V2 Tool-Call Dataset' subsection under Fine-Tuning
  Dataset Exports — documents data/processed/qwen25_tools/v2/ shape,
  build command, source-of-truth tables, and link to the plan doc.

* docs/fine_tune/V2_DATA_PIPELINE_PLAN.md: per-step checklist now
  reflects merged PR status. #31 explicitly marked deferred (not on
  training critical path). #36 (project consolidation) and the daily
  backup work flagged as bonus items from the data audit.

Co-authored-by: MZ <mz@wfca.com>

Closes #33 and v2 parent #25.

Trains Qwen2.5-3B-Instruct on 23,983 real Claude-session prompts (vs v1's
83% synthetic). Ships GGUF Q6_K at 2.4 GB. v1 GGUF untouched
(SHA 5e174a04 unchanged pre/post conversion, verified).

The v1 empty-args inference loop bug is ELIMINATED. Chat-loop test on
50 vague natural prompts: 50/50 emit fully-specified tool calls,
0 empty-args emissions. v1 used to spin until context exhaustion on these.

## Results

- HF merged (no quant noise):    86.7% aggregate validator pass
- GGUF Q6_K (ship):              85.0% aggregate / 88.9% in-dist
- Chat-loop on 50 vague prompts: 0 empty-args emissions  ← real gate

Q4_K_M tested but dropped to 81.7% — quantization-induced argmax drift
hurt arg commitment at low temperatures. Q6_K is the smallest quant
that holds the gate.

## Files

- docs/training_runs/v2-20260514T055422Z.md  full run report
- docs/fine_tune/V2_TRAINING_PLAN.md         11-phase plan w/ rollback
- docs/fine_tune/FAILURE_MODES.md            +#12 (32k context + YaRN)
- tests/fine_tune/fixtures/vague_prompts.txt 50-prompt eval fixture
- scripts/fine_tune/validate_tool_calls.py   3 bugs fixed + suite rewrite
- pyrightconfig.json                         resolves torch/transformers
- HANDOFF.md                                 env-var commands, not flags

## Artifacts (local — not in repo)

- models/lora/qwen2.5-3b-instruct-toolcalls-lora/runs/20260514T055422Z-v2-full/
- models/merged/qwen2.5-3b-toolcalls-v2-merged/
- models/gguf/qwen2.5-3b-toolcalls-v2-q6k.gguf
  SHA 54617fbda2176166101837c26d63547196297ba7138a8f1696d3c14ec5d20ed6
- ~/.lmstudio/models/mz/qwen2.5-3b-toolcalls-v2/

## Wall clock

12h 38m full training (5996 steps, 1 epoch, MPS) + ~1h validation/GGUF.

v2 fails real-world multi-turn agentic test even with the model's own
Jinja chat template via /v1/chat/completions. Two independent test
harnesses (hand-rolled ChatML and OpenAI chat-completions) both show
v2 is strictly worse than v1: 0/10 useful answers vs 3/10, 9/10 loop
rate vs 4/10. A second regression also surfaced: in-args generative
loops on 5/10 prompts (repeated lines inside argument values until
max_tokens).

Changes:
- HANDOFF.md: prepend v2-RETRACTED status banner + summary
- docs/training_runs/v2-20260514T055422Z.md: prepend postmortem header
- docs/training_runs/v2-real-world-test.md: full v1-vs-v2 A/B with verbatim transcripts
- docs/training_runs/v2-chat-template-retest.md: chat-completions retest confirming retraction
- docs/fine_tune/V3_PLAN.md: v3 plan targeting Qwen3-VL-8B-Instruct base, with
    * cloud A100 training (MPS is unworkable for 8B)
    * 4-class eval suite anchored to our training data (Class A in-distribution,
      B tool_response-adaptation, C OOD, D single-turn validator carryover)
    * in-args repetition gate added after retest finding
    * Phase 0.5 baselines v1+v2 before any v3 training
    * 3 training-data fixes: stop-after-tool_call cut, oversample
      text-synthesis-after-tool_response, optional negative-training pairs
- tests/fine_tune/real_world/harness.py + harness_chat.py: reusable test harnesses
- .gitignore: exclude *-results.json raw transcripts from real_world dir

v1 remains the shipped production GGUF. PR #44 stays as v2 artifact/learning
record; do NOT merge as production.

@metazen11
Owner Author

⚠️ RETRACTION (2026-05-15) — do not merge as production

Real-world multi-turn A/B testing on 2026-05-15 shows v2 is strictly worse than v1. Phase-9 chat-loop eval (0/50 empty-args emissions) measured the wrong symptom; on real-world multi-turn agentic prompts v2 produces 0/10 useful answers vs v1's 3/10, with 90% loop rate vs v1's 40%.

Two independent verifications

| Test harness | v1 useful | v2 useful | v1 loop | v2 loop | v1 adapt | v2 adapt |
|---|---|---|---|---|---|---|
| Hand-rolled ChatML + /completion | 3/10 | 0/10 | 4/10 | 9/10 | 9/10 | 3/10 |
| OpenAI chat-completions + --jinja | n/a | 0/10 | n/a | 9/10 | n/a | 8/10 |

The chat-completions retest was run specifically to rule out a harness-format artifact as the cause of the regression. It is not one: v2's own embedded Jinja template wraps tool messages identically to the hand-rolled ChatML, and v2 still fails. See docs/training_runs/v2-chat-template-retest.md in the new commit e5f290f.

Three v2 regressions identified

  1. Ignores tool_response — re-emits identical or near-identical tool_call after seeing a tool_result instead of synthesizing a text answer (90% of sessions)
  2. In-args generative loop — on 5/10 prompts, the JSON arguments string gets stuck in a within-call repetition loop (e.g. print('Tool_log: ...') × 12 in one shell one-liner) until max_tokens truncation
  3. Off-topic action selection — e.g. gh issue create for a "is there a test for X?" question

v2 wins (real but insufficient to ship)

  • Stays in correct repo (no fire-map path hallucination — v1's bug)
  • Turn-1 args populated 10/10 (the original empty-args bug IS fixed)
  • ~½ the tokens of v1 (it gives up faster; it does not converge faster)

Decision

  • PR #44 stays open as the v2 experiment / learning record
  • v1 remains the shipped production GGUF (models/gguf/qwen2.5-3b-toolcalls-q4km.gguf)
  • v2 GGUF stays on disk but is NOT for production use
  • v3 plan: docs/fine_tune/V3_PLAN.md (Qwen3-VL-8B-Instruct base, cloud A100 training, 4-class eval suite anchored to our training data, baselines BEFORE training)
  • Full postmortem: docs/training_runs/v2-real-world-test.md + docs/training_runs/v2-chat-template-retest.md

Closes-as-not-shipped: #33. Parent #25 stays open (v3 will close it).

wfca-mz added 2 commits May 15, 2026 08:27
Reusable infrastructure for fine-tune eval reports going forward.
Future evals (v1+v2 baselines for v3, v3 itself) emit structured JSON
and render through a uniform template instead of ad-hoc markdown.

Pipeline:
1. Harness emits results.json conforming to schemas/eval-report.schema.json
2. scripts/fine_tune/render_test_report.py validates against schema,
   substitutes into docs/fine_tune/templates/TEST_REPORT_TEMPLATE.md
3. Optional --html flag emits HTML via pandoc (fallback: markdown lib,
   then raw-md-in-pre with warning)

Files:
- schemas/eval-report.schema.json — JSON Schema 2020-12, strict on
  gates/verdict/prompts, permissive on aggregate_stats/notable_findings.
  Outcome enum covers both real-world classes and V3_PLAN Class B classes.
- docs/fine_tune/templates/TEST_REPORT_TEMPLATE.md — {{placeholder}}
  scaffold. Self-documents syntax in HTML comment block (preserved
  through rendering by the substitution function).
- scripts/fine_tune/render_test_report.py — single-file CLI. Validates,
  renders markdown, optionally emits HTML. Idempotent (rerun = same
  bytes). Exit codes: 0 success, 1 missing template, 2 validation
  failure, 3 missing input.
- docs/fine_tune/templates/EXAMPLE_v2-chat-template-retest.results.json
  — hand-built from v2-chat-template-retest.md as schema-coverage proof
- docs/fine_tune/templates/EXAMPLE_v2-chat-template-retest.md —
  generator output matching the manually-written report

End-to-end verified: bad input -> exit 2 with per-field error,
example input -> markdown + HTML written, byte-identical on rerun.

Dependency: jsonschema >= 4.18 (already in .venv-finetune via existing
use). Belongs in a future requirements-dev.txt; not adding to
runtime requirements.txt.

Rewrote V3_PLAN.md to lock Qwen3-8B + local MPS + ≤6GB training rule
+ vision-out-of-trained-model. Built 3 new test harnesses and ran all
5 eval classes against v1 and v2 GGUFs to establish 'before' baselines
for v3 to beat.

Headline finding (corrects earlier framing):
- v2 ADAPTS after tool_response 90% of the time (not 30% as
  previously reported). The real regression is v2 NEVER concludes
  with a text answer — text_answer rate 0/30 vs v1's 10/30 (33%).
- v2 is 7x better at imitating training data shape (Class A: 46.7% vs
  6.7%). v1 is the only model that synthesises final prose answers.
- Project recall (Class E) is 75% for v2, 66.7% for v1 — both already
  know agent-memory / fire-map / daily-dispatch from training data.

Baseline table (v1 / v2):
  A shape_match  6.7%  / 46.7%  (n=30)
  B text_answer  33%   /  0%    (n=30)
  C useful_ans   30%   /  0%    (n=10)
  D parse_rate   95%   / 80%    (n=20)
  E PASS         66.7% / 75%    (n=12)

v3 ship gates per V3_PLAN.md §6:
  B text_answer ≥ 30% (the new headline gate)
  C useful_answer ≥ 50%
  D parse_rate ≥ 85%
  A shape_match ≥ 60%
  E PASS ≥ 70%
  + tool-call shape gate (no post-</tool_call> scaffolding)
  + in-args repetition gate (no within-arg loops)
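
To make the distance to the numeric gates concrete, the baseline table can be checked against them directly. The sketch below uses only the numbers quoted in this commit message; the names `GATES`, `BASELINES`, and `failed_gates` are illustrative, and the two qualitative gates (tool-call shape, in-args repetition) are not modelled.

```python
# Thresholds and baselines in percent, copied from the tables above.
GATES = {"A": 60.0, "B": 30.0, "C": 50.0, "D": 85.0, "E": 70.0}

BASELINES = {
    "v1": {"A": 6.7, "B": 33.0, "C": 30.0, "D": 95.0, "E": 66.7},
    "v2": {"A": 46.7, "B": 0.0, "C": 0.0, "D": 80.0, "E": 75.0},
}


def failed_gates(scores: dict) -> list:
    """Return the eval classes whose score is below the ship threshold."""
    return [cls for cls, threshold in GATES.items() if scores[cls] < threshold]


for model, scores in BASELINES.items():
    print(model, failed_gates(scores))
# prints: v1 ['A', 'C', 'E']
# prints: v2 ['A', 'B', 'C', 'D']
```

Neither baseline clears the numeric gates: v1 misses three and v2 misses four, with Class E the only gate v2 already passes.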

Files:
- docs/fine_tune/V3_PLAN.md: full rewrite — local MPS, Qwen3-8B base,
  ≤6GB rule, vision as harness pre-pass, 5 eval classes
- tests/fine_tune/real_world/harness_class_{a,b,e}.py: new harnesses
- tests/fine_tune/fixtures/project_recall_prompts.txt: 12 prompts
- docs/training_runs/baselines/class-{a,b,c,d,e}-{v1,v2}.md: 10 reports
- docs/training_runs/v3-baselines.md: master comparison summary
- schemas/eval-report.schema.json: extended eval_class enum with 'E'
- .gitignore: exclude tests/fine_tune/real_world/baselines/ (raw JSON)

Open issues:
- Class A 'needs_review' is 16/30 on v2 (plausible alternatives, not
  shape-matches). Needs manual labeling pass to decide if those count
  toward the 60% v3 gate.
- Class C is only 10 prompts; expand to 30 for v3.
- Project-tagged oversampling (V3_PLAN §5 fix #6) may be lower priority
  than expected — v2 already at 75% Class E without it. Text-synthesis
  oversampling (fix #2) is the highest-impact change.

metazen11 pushed a commit that referenced this pull request May 24, 2026
…ders)

A lesson with trigger_on='input' AND no trigger_tool AND no trigger_pattern
matches every Edit/Write/Bash/NotebookEdit call and dominates the
per-tool-call systemMessage budget. The synth-lesson #86 ('never X') was
the canonical bad case; three legit cross-cutting CRITICAL safeguards
(#35 read-before-edit, #36 docker-restart-zombie-cron, #44 infra-dev-first)
also fall in this set today.

Three layers of defense:

1. API validator (app/routes/lessons.py::_validate_trigger_on) — returns
   400 for any new POST /api/lessons with trigger_on='input' and neither
   trigger_tool nor trigger_pattern. Clean error message points to the
   actual constraint.

2. DB CHECK constraint (migration 016, chk_input_trigger_has_filter,
   added NOT VALID) — refuses any INSERT or UPDATE that would create a
   broad-match input-triggered row. NOT VALID intentionally — three
   existing legacy rows (#35/#36/#44) are grandfathered per the agreed
   policy. Operator can run VALIDATE CONSTRAINT later once they're
   narrowed.

3. Runtime filter (app/routes/lessons.py::match_lessons) — the SQL
   condition '(l.trigger_tool IS NOT NULL OR l.trigger_pattern IS NOT NULL)'
   for trigger_on='input' skips legacy broad-match rows at match time,
   so the 3 existing CRITICAL safeguards no longer fire on every Bash
   call (which was the actual user-visible spam problem). When they're
   relevant they can fire again by being narrowed.
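
Layers 1 and 3 share one predicate, which the sketch below makes explicit. It is not the code in app/routes/lessons.py: the function names, the dict-based lesson rows, and the 200 status for the accepted case are assumptions; only the 400 rejection and the IS NOT NULL condition come from the description above.

```python
from __future__ import annotations

from typing import Optional


def is_broad_match(
    trigger_on: str,
    trigger_tool: Optional[str],
    trigger_pattern: Optional[str],
) -> bool:
    """True for an input-triggered lesson with no tool or pattern filter."""
    return trigger_on == "input" and trigger_tool is None and trigger_pattern is None


def validate_new_lesson(
    trigger_on: str,
    trigger_tool: Optional[str] = None,
    trigger_pattern: Optional[str] = None,
) -> tuple[int, str]:
    """Layer 1: reject broad-match lessons at create time."""
    if is_broad_match(trigger_on, trigger_tool, trigger_pattern):
        return 400, "trigger_on='input' requires trigger_tool or trigger_pattern"
    return 200, "ok"  # success status is a placeholder


def match_candidates(lessons: list[dict]) -> list[dict]:
    """Layer 3: skip legacy broad-match rows at match time."""
    return [
        lesson
        for lesson in lessons
        if not is_broad_match(
            lesson["trigger_on"],
            lesson.get("trigger_tool"),
            lesson.get("trigger_pattern"),
        )
    ]
```

Keeping the runtime filter even with the validator in place is what silences the grandfathered rows, since they predate both the validator and the CHECK constraint.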

Tests pin:
- POST /api/lessons rejects broad-match input lessons with 400
- /api/lessons/match skips broad-match rows even if inserted via direct
  SQL (self-skips when the CHECK constraint blocks the test setup, which
  itself confirms the constraint is enforcing)

Updated test_create_global_lesson to include a trigger_pattern (was
implicitly broad-match before; the test only verified the create path,
not match semantics).

Working-branch: fix/trigger-lessons

Development

Successfully merging this pull request may close these issues.

M-FT-2-7: Retrain v2 — 1 epoch, tool descriptions, empty-args eval

2 participants