feat: activate inference feedback loop #14
Merged
Conversation
…s back

Phase 1 — Critical write-back:
- inference-tuner.ts: addKeywordMapping() is no longer a no-op. It reads routing-mappings.json, checks for conflicts, appends new keyword mappings, and writes back to disk. This single change activates the entire 3000-line analytics pipeline as an active learning system.
- inference-tuner.ts: wired routingRefiner into performTuning(). The tuning cycle now consumes refiner suggestions and applies them.
- inference-tuner.ts: imported the routingRefiner singleton.

Phase 2 — Runtime feedback:
- agent-delegator.ts: when top agent confidence < 0.85, consults predictiveAnalytics.predictSync() for a historically better routing suggestion. Logs the refinement and promotes the predicted agent.
- predictive-analytics.ts: added a predictSync() method for hot-path usage (no disk reload; operates on in-memory data).

Phase 3 — Dead code and stubs:
- predictive-analytics.ts: replaced an 11-line stub with a 190-line implementation. predict() uses keyword overlap plus historical success rate; predictOptimalAgent() returns the best agent with >= 3 samples.
- kernel/ directory: deleted. Standalone package with zero imports.
- package.json: removed the strray-analytics bin and analytics:daily scripts (both pointed to non-existent files).
- kernel-patterns.ts: learn() now writes to the this.assumptions and this.cascades Maps instead of the empty this.patterns Map. Includes a confidence increment (+0.05 on match, capped at 1.0) and decay (-0.02 on miss, floored at 0.1).

Phase 4 — Data quality:
- rule-registry.ts: addRule() is now idempotent — silently updates on duplicates instead of throwing. Fixes the codex-1 duplicate-registration error that fired 5x per test run.
- rule-registry.test.ts: updated tests for the idempotent behavior.
- outcome-tracker.ts: getPromptData() now computes real complexity (description length / 5, capped at 100) and extracts keywords (words > 3 chars, deduplicated, max 10) instead of returning 0/[].

All 2399 tests pass. Zero TS errors.
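The Phase 1 write-back can be sketched as a load–check–append–persist cycle. The function below is an illustration only: the KeywordMapping shape, the file path parameter, and the conflict rule (existing keyword wins) are assumptions, since the real method lives on the inference tuner and its schema is not shown in the PR.

```typescript
import * as fs from "fs";

// Hypothetical shape of one routing-mappings.json entry (an assumption;
// the real schema is not shown in the PR).
interface KeywordMapping {
  keyword: string;
  agent: string;
  confidence: number;
}

// Minimal sketch of the write-back: load existing mappings, reject on
// conflict, append the new mapping, persist to disk.
export function addKeywordMapping(
  filePath: string,
  mapping: KeywordMapping
): boolean {
  let mappings: KeywordMapping[] = [];
  if (fs.existsSync(filePath)) {
    mappings = JSON.parse(fs.readFileSync(filePath, "utf8"));
  }
  // Conflict check: an existing mapping for the same keyword wins.
  if (mappings.some((m) => m.keyword === mapping.keyword)) {
    return false;
  }
  mappings.push(mapping);
  fs.writeFileSync(filePath, JSON.stringify(mappings, null, 2));
  return true;
}
```

Returning a boolean instead of throwing on conflict keeps the tuning cycle able to skip already-learned keywords without exception handling on the hot path.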
OpenCode plugin (strray-codex-injection.ts):
- Added a module-level tool-call counter.
- After every tool.execute.after hook, increments the counter.
- Every 100 calls, dynamically imports inferenceTuner and runs a single tuning cycle (fire-and-forget, non-blocking).

Hermes plugin (__init__.py):
- Added an _INFERENCE_TUNE_INTERVAL = 100 counter.
- After every post_tool_call hook, checks the threshold.
- Shells out to npx strray-ai inference:tuner --run-once in a background daemon thread (30s timeout).
- Logs the result to activity.log.
- The counter resets on session_start.

Both plugins now auto-calibrate the routing feedback loop without manual intervention. 127 test files, 2399 tests green.
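The OpenCode side of this trigger reduces to a modulo counter plus a fire-and-forget promise. A minimal sketch, assuming illustrative names (TUNE_INTERVAL, onToolExecuteAfter, and the injected runTuningCycle callback are not the real plugin API):

```typescript
// Interval after which a tuning cycle is kicked off (mirrors the PR's
// every-100-calls behavior; the constant name is an assumption).
const TUNE_INTERVAL = 100;
let toolCallCount = 0;

// Called from the tool.execute.after hook. Returns true when a tuning
// cycle was triggered, which also makes the logic easy to test.
export function onToolExecuteAfter(
  runTuningCycle: () => Promise<void>
): boolean {
  toolCallCount += 1;
  if (toolCallCount % TUNE_INTERVAL !== 0) return false;
  // Fire-and-forget: never block the tool-call hot path on tuning.
  runTuningCycle().catch(() => {
    // Tuning failures are swallowed here; they must not surface to the tool call.
  });
  return true;
}
```

Injecting runTuningCycle as a parameter stands in for the dynamic import described above and keeps the counter logic testable in isolation.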
The inference tuner was dry — only the MCP orchestrator recorded outcomes, so the auto-tune at call #100 always hit the 'insufficient data' guard. Normal tool calls (write, edit, search, etc.) never fed into the analytics pipeline.

OpenCode plugin (strray-codex-injection.ts):
- Added TOOL_AGENT_MAP: maps tool names (write, edit, bash, search, read, glob, grep, ls) to agent/skill identifiers.
- After every tool.execute.after, imports routingOutcomeTracker and records the outcome with the tool name, args description, agent/skill mapping, confidence, and success status.

Hermes plugin (__init__.py):
- Added _TOOL_AGENT_MAP: the same mapping for Hermes tool names (write_file, patch, execute_code, terminal, search_files, etc.).
- Added _record_tool_outcome(): writes directly to logs/framework/routing-outcomes.json (same format as the TS tracker).
- Called from _on_post_tool_call after error detection.
- Circular buffer: keeps the last 1000 outcomes.
- Supports wildcard patterns (browser_*).

Both plugins now feed real data into the analytics pipeline. By call #100, the tuner has ~100 outcomes to analyze. Instance-level tuning is fully functional. Upstream tuning (sending calibration data to Jelly) still requires the Jelly API — tracked separately. 127 test files, 2399 tests green.
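Two of the mechanisms above — wildcard tool lookup and the 1000-entry circular buffer — can be sketched briefly. The mapping values below are illustrative placeholders, not the real agent/skill identifiers:

```typescript
// Illustrative tool→agent mapping; real identifiers are assumptions.
const TOOL_AGENT_MAP: Record<string, string> = {
  write: "code-lead/editing",
  edit: "code-lead/editing",
  bash: "testing-lead/execution",
  "browser_*": "research-lead/browsing", // wildcard pattern
};

// Exact match first, then prefix wildcards like "browser_*".
export function resolveAgent(toolName: string): string | undefined {
  if (TOOL_AGENT_MAP[toolName]) return TOOL_AGENT_MAP[toolName];
  for (const pattern of Object.keys(TOOL_AGENT_MAP)) {
    if (pattern.endsWith("*") && toolName.startsWith(pattern.slice(0, -1))) {
      return TOOL_AGENT_MAP[pattern];
    }
  }
  return undefined;
}

// Circular buffer: keep only the most recent `cap` outcomes, matching
// the last-1000 retention described in the PR.
export function appendOutcome<T>(buffer: T[], outcome: T, cap = 1000): T[] {
  buffer.push(outcome);
  return buffer.length > cap ? buffer.slice(buffer.length - cap) : buffer;
}
```

Unmapped tools return undefined so callers can skip recording rather than attribute an outcome to the wrong agent.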
Three changes to unblock the inference feedback loop:
1. determineAgents() now loads routing-mappings.json fresh on each call, keyword-matches against the operation string, and uses the learned mapping if confidence > 0.7. Falls back to the hardcoded mapping if nothing hits.
2. The predictive-analytics threshold dropped from 0.85 to 0.7 so the prediction layer actually fires instead of being suppressed by hardcoded high-confidence values.
3. Task-type classification added to outcome recording in both the OpenCode and Hermes plugins. Tool calls are now classified (testing, build, security, lint, git, etc.) instead of every terminal call being recorded as 'testing-lead/execution'. The RoutingOutcome interface gains an optional taskType field.

All backwards-compatible. 2399 tests passing, 5 pipelines green.
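Changes 1 and 3 above can be sketched together: a learned-mapping lookup gated on the 0.7 confidence threshold, and a first-match task-type classifier. The keyword lists, regexes, and category set are illustrative, not the real implementation:

```typescript
// Assumed shape of a learned entry from routing-mappings.json.
interface LearnedMapping {
  keyword: string;
  agent: string;
  confidence: number;
}

// Sketch of the learned-mapping path in determineAgents(): substring
// keyword match against the operation, gated on confidence > threshold,
// with a hardcoded fallback when nothing hits.
export function determineAgent(
  operation: string,
  learned: LearnedMapping[],
  fallback: string,
  threshold = 0.7
): string {
  const op = operation.toLowerCase();
  const hit = learned.find(
    (m) => op.includes(m.keyword) && m.confidence > threshold
  );
  return hit ? hit.agent : fallback;
}

// First-match task-type classifier for terminal commands; patterns and
// the "general" default are placeholders.
export function classifyTaskType(command: string): string {
  const rules: [RegExp, string][] = [
    [/\b(jest|vitest|pytest|test)\b/, "testing"],
    [/\b(build|tsc|webpack)\b/, "build"],
    [/\b(audit|snyk)\b/, "security"],
    [/\b(eslint|lint)\b/, "lint"],
    [/\bgit\b/, "git"],
  ];
  for (const [re, type] of rules) {
    if (re.test(command)) return type;
  }
  return "general";
}
```

Rule order matters in the classifier: testing patterns are checked before git so that a command like "git bisect run npm test" still lands in testing, which is usually the more useful label for routing.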
Summary
The analytics pipeline (3000+ lines across 15 files) was a one-way pipe — it collected data, analyzed it, and generated suggestions, but could never write anything back. This PR closes the loop.
What Changed
Phase 1: Critical write-back (the big one)
Phase 2: Runtime feedback
Phase 3: Dead code and stubs
Phase 4: Data quality
Test Results