fix(reflection): improve plan mode and route feedback #52
Status: Closed · +39,877 −210
Added an image to the README and improved the description.
- promptAsync() + polling pattern
- Proper timeout constants (180s for judge)
- Logging for debugging

The issue is that the default model xai/grok-3-mini-latest isn't responding within 60 seconds. This is an infrastructure/provider issue, not a code issue.

Summary of changes made:

1. reflection.ts:
   - Added JUDGE_RESPONSE_TIMEOUT = 180_000 (3 min) and POLL_INTERVAL = 2_000 (2s)
   - Added waitForJudgeResponse() function that polls for judge completion
   - Changed client.session.prompt() to client.session.promptAsync() for judge calls
   - Changed feedback delivery to use promptAsync() as well
   - Added logging for debugging
2. test/e2e.test.ts:
   - Changed to use promptAsync() for sending tasks
   - Updated stability check to look for the completed timestamp
   - Improved logging to show completion status

The tests will pass when:
- The configured model responds within the timeout
- Or you use a faster/working model
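A minimal sketch of the promptAsync + polling pattern described above. The timeout constants match the commit message, but the `Message` shape and the `fetchMessages` callback are illustrative assumptions, not the plugin's actual API.

```typescript
// Constants from the commit message; the rest is an illustrative sketch.
const JUDGE_RESPONSE_TIMEOUT = 180_000; // 3 min
const POLL_INTERVAL = 2_000;            // 2 s

type Message = { role: string; completed?: number };

// Polls a message-fetching callback until the latest assistant message
// carries a completed timestamp, or the timeout elapses.
async function waitForJudgeResponse(
  fetchMessages: () => Promise<Message[]>,
): Promise<Message | null> {
  const deadline = Date.now() + JUDGE_RESPONSE_TIMEOUT;
  while (Date.now() < deadline) {
    const messages = await fetchMessages();
    const last = messages[messages.length - 1];
    if (last?.role === "assistant" && last.completed) return last;
    await new Promise((r) => setTimeout(r, POLL_INTERVAL));
  }
  return null; // timed out — caller should stop retrying
}
```

The caller fires the judge prompt with promptAsync() (non-blocking), then polls with this helper instead of blocking on a single 60-second request.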
- Add 10s cooldown after sending feedback before allowing another reflection
- Configure E2E tests to use the github-copilot/gpt-4o model (temp dirs need an explicit model)
- Track lastFeedbackTime to prevent re-judging immediately after the agent responds

This fixes the infinite loop: judge → feedback → agent responds → session idles → judge again.
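The cooldown guard described above can be sketched as follows; the constant and function names are illustrative, not the plugin's actual code.

```typescript
// 10 s cooldown after feedback, per the commit message above.
const FEEDBACK_COOLDOWN = 10_000;

let lastFeedbackTime = 0;

// True when enough time has passed since the last feedback
// to allow another reflection pass.
function canReflect(now: number): boolean {
  return now - lastFeedbackTime >= FEEDBACK_COOLDOWN;
}

// Called right after feedback is injected into the session.
function recordFeedback(now: number): void {
  lastFeedbackTime = now;
}
```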
- Task complete now shows a toast notification only (no prompt())
- Task incomplete still sends feedback via prompt() to continue work
- Updated AGENTS.md to document this critical design decision

The bug: calling prompt() on complete tasks triggered an agent response, which fired session.idle, causing reflection to run again infinitely.
- Add completedSessions on timeout/parse error/catch to stop retries
- Move completedSessions.add() before the async showToast() in the complete path
- Ensures concurrent session.idle events are blocked immediately
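A sketch of the ordering fix: the session is marked complete synchronously, before any await, so a concurrent session.idle event sees the guard immediately. The `showToast` callback and session id are stand-ins for illustration.

```typescript
const completedSessions = new Set<string>();

// Returns true only for the first caller; concurrent calls for the same
// session see the guard (set before the first await) and bail out.
async function handleComplete(
  sessionId: string,
  showToast: () => Promise<void>,
): Promise<boolean> {
  if (completedSessions.has(sessionId)) return false; // already handled
  completedSessions.add(sessionId); // guard set synchronously, pre-await
  await showToast();
  return true;
}
```

Had the `add()` come after the `await`, two idle events arriving close together would both pass the `has()` check and both show the toast.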
The in-memory Sets (createdByPlugin, judgeSessionIds) only work within a single process. When multiple plugin instances run (py/node servers), they don't share state. Now we detect judge sessions by checking for 'TASK VERIFICATION' in message content BEFORE attempting to judge. This works across processes.

E2E results:
- Before: 87 messages, 44 feedback loops
- After: 11 messages, 1 feedback (legitimate incomplete)
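The cross-process detection amounts to scanning message content for the marker string instead of consulting per-process state. A minimal sketch, with an assumed `Msg` shape:

```typescript
type Msg = { content: string };

// A judge session is identified by its content marker, which survives
// across processes, unlike an in-memory Set.
function isJudgeSession(messages: Msg[]): boolean {
  return messages.some((m) => m.content.includes("TASK VERIFICATION"));
}
```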
- Removed all logging
- Simplified state tracking to processedSessions and activeReflections
- Cleaner code structure
CRITICAL FINDING: OpenCode loads plugins from ~/.config/opencode/plugin/, NOT from npm global installs. The npm install was being ignored.

Changes:
- Updated AGENTS.md with deployment instructions
- Copied the fixed plugin to ~/.config/opencode/plugin/reflection.ts

The fixed plugin:
- No console.log statements
- Toast only on complete (prevents the infinite loop)
- prompt() only on incomplete
- Add tts.ts plugin that reads agent responses aloud using the macOS say command
- Clean markdown, code blocks, and URLs from text before speaking
- Truncate long messages (1000 char limit)
- Skip judge/reflection sessions to avoid reading internal prompts
- Track sessions to prevent duplicate speech
- Add unit tests (15 tests) and a manual test script
- Update docs and package.json with new test commands
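A hedged sketch of the cleanup pipeline described above (strip code blocks, inline code, URLs, and markdown punctuation, then truncate). The regexes and function name are illustrative; only the 1000-char limit comes from the commit message.

```typescript
const MAX_SPEECH_LENGTH = 1000;

function cleanForSpeech(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, "") // fenced code blocks
    .replace(/`[^`]*`/g, "")        // inline code
    .replace(/https?:\/\/\S+/g, "") // URLs
    .replace(/[*_#>]/g, "")         // markdown punctuation
    .replace(/\s+/g, " ")           // collapse whitespace
    .trim()
    .slice(0, MAX_SPEECH_LENGTH);   // 1000-char limit per the commit
}
```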
- Add Chatterbox as primary TTS engine (high-quality neural TTS)
- Auto-install Chatterbox in a virtualenv on first use
- Support GPU (CUDA) and CPU device selection
- Auto-detect GPU; fall back to OS TTS if no GPU (unless CPU is forced)
- Add configuration via ~/.config/opencode/tts.json
- Support voice cloning, emotion control, and the Turbo model
- Automatic fallback to OS TTS (macOS say) when Chatterbox is unavailable
- Update tests for new engine configuration
- Update README with Chatterbox setup instructions
- Remove all console.log/console.error statements
- Fix caching bug that prevented Chatterbox from being used
- Increase timeout to 5 minutes for CPU mode
- Simplify availability check logic
…erence
- Default to the macOS Samantha voice (female) for a better out-of-box experience
- Add OS TTS voice/rate configuration options
- Add Chatterbox server mode to keep the model loaded between requests
- Add Turbo model support for 10x faster inference
- Add Apple Silicon (MPS) device support
- Use Unix socket IPC for low-latency server communication
- Update tests for new features
- Add lock file mechanism to prevent multiple server startups
- Check if the server is already running before starting a new one
- Run the server detached so it survives across sessions
- Save a PID file for server tracking
- Increase timeout to 120s for MPS/CPU model loading
- Allow socket permissions for all users
- Add graceful shutdown handling
- Add shared server feature to the features list
- Add MPS speed comparison row
- Add server architecture diagram
- Document server files and management commands
- Update AGENTS.md with Chatterbox configuration and debugging info
The function was only checking for CUDA GPU or explicit CPU device, causing MPS (Apple Silicon) users to fall back to OS TTS even when chatterbox.device was set to 'mps' in config. Now returns true for mps/cpu devices explicitly, and only checks CUDA availability when cuda device is configured.
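The corrected decision logic described above can be sketched as a small pure function: mps and cpu are accepted directly, and only the cuda device consults GPU availability. The function name and the `hasCuda` probe are illustrative stand-ins for the plugin's real check.

```typescript
function isChatterboxUsable(device: string, hasCuda: boolean): boolean {
  // Explicit mps/cpu configuration is honored without a GPU probe.
  if (device === "mps" || device === "cpu") return true;
  // Only a cuda device needs CUDA to actually be available.
  if (device === "cuda") return hasCuda;
  return false; // unknown device → fall back to OS TTS
}
```

The old behavior treated everything that wasn't CUDA-with-GPU or explicit CPU as unusable, which is why MPS users silently fell back to OS TTS.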
The embedded tts.py script was missing 'mps' in its argparse choices and MPS fallback logic. This caused the script to fail when device='mps' was configured, silently falling back to OS TTS.

Root cause: code duplication between the embedded scripts in tts.ts and the standalone files that get written to disk. Fixing the standalone files doesn't persist because ensureChatterboxScript() overwrites them.

Added tests to prevent regression:
- Verify argparse accepts --device mps
- Verify MPS fallback when unavailable
- Verify auto-detection of MPS when CUDA is unavailable
- Verify consistency between one-shot and server scripts
…64 endpoint
- Fix API endpoint mismatch: /transcribe -> /transcribe-base64 for opencode-manager compatibility
- Update DEFAULT_SUPABASE_ANON_KEY to the new token (expires 2081)
- Add comprehensive Telegram test instructions to AGENTS.md
- Add a Quick Reference test sequence for all tests
- Fix test/test-telegram-whisper.ts to use the correct port (5552) and endpoint
- Verified real voice transcription: 'It's ready to use, maybe.' from 1.6s audio

Tests: typecheck (0 errors), unit (132), plugin-load (5), telegram-whisper (5/5)
- Add ReflectionConfig interface with customRules, taskPatterns, severityMapping
- Load config from <project>/.opencode/reflection.json or ~/.config/opencode/reflection.json
- Support query-based customization via task patterns with regex matching
- Patterns can override task type detection (coding/research) and add extra rules
- Add 15 new unit tests for findMatchingPattern, buildCustomRules, mergeConfig
- Document all config options with examples in AGENTS.md
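An illustrative sketch of the task-pattern matching described above. The `findMatchingPattern` name appears in the commit message; the `TaskPattern` fields and regex semantics are assumptions about the schema, not the real one.

```typescript
interface TaskPattern {
  pattern: string; // regex tested against the user's query
  taskType?: "coding" | "research"; // optional task-type override
  extraRules?: string[];            // extra rules appended to the prompt
}

// Returns the first configured pattern whose regex matches the query.
function findMatchingPattern(
  query: string,
  patterns: TaskPattern[],
): TaskPattern | undefined {
  return patterns.find((p) => new RegExp(p.pattern, "i").test(query));
}
```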
#41)

* fix(telegram,tts): fix Whisper endpoint and switch to Coqui VCTK model
  - Fix Telegram voice transcription: change endpoint from /transcribe-base64 to /transcribe
  - Switch default TTS engine to Coqui with the vctk_vits model (tts_models/en/vctk/vits)
  - Set default speaker to p226 (clear, professional British male voice)
  - Add vctk_vits model support to Coqui TTS scripts and server
  - Update AGENTS.md documentation with the new TTS configuration

* docs: add comprehensive TTS model documentation to README
  - Document all 6 Coqui TTS models with descriptions
  - Add a configuration options table for each engine
  - Recommend vctk_vits with the p226 speaker as default
  - Add Chatterbox and OS TTS configuration options

* docs: add comprehensive VCTK speaker list and XTTS voice cloning info
  - List all 109 VCTK speakers with popular choices highlighted
  - Add speaker descriptions (gender, accent, characteristics)
  - Document XTTS v2 voice cloning with the voiceRef option
  - List XTTS supported languages
Add a new reflection-static.ts plugin that uses a simpler approach:
1. Ask the agent a static self-assessment question when the session idles
2. Use a GenAI judge to analyze the agent's response
3. If the agent confirms completion → toast notification, no feedback loop
4. If the agent identifies improvements → push to continue

Features:
- Simple self-assessment question: "What was the task? Are you sure you completed it?"
- GenAI-powered analysis of the agent's self-assessment
- Prevents infinite feedback loops by tracking confirmed completions
- Tracks aborted sessions to skip reflection
- E2E test that verifies plugin effectiveness (scored 5/5)

New npm scripts:
- test:reflection-static: Run the E2E evaluation test
- install:reflection-static: Deploy reflection-static instead of reflection.ts
- Add multiple abort detection layers (session.error, message.aborted)
- Add a delay before reflection to allow abort events to arrive
- Check whether the last message was aborted/incomplete in runReflection
- Remove the mock evaluation fallback - require a real Azure LLM
- Use the AZURE_OPENAI_DEPLOYMENT env var for the eval model
- Change from a Set to a Map with timestamps for abort tracking
- Add a 10 second cooldown period after an Esc press
- Add a type cast for the error property to fix a TypeScript error
- Separate the completed check from the error check for clearer debugging
- Match the pattern from reflection.ts for consistent behavior
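The Set-to-Map change above can be sketched as follows: a map of session id to abort timestamp lets an Esc press suppress reflection for a bounded window instead of forever. Names and structure are illustrative.

```typescript
// 10 s suppression window after an abort, per the commit message.
const ABORT_COOLDOWN = 10_000;

const abortedAt = new Map<string, number>();

// Called when an Esc press / abort event is observed for a session.
function recordAbort(sessionId: string, now: number): void {
  abortedAt.set(sessionId, now);
}

// Reflection is skipped while the session is inside its cooldown window.
function isInAbortCooldown(sessionId: string, now: number): boolean {
  const t = abortedAt.get(sessionId);
  return t !== undefined && now - t < ABORT_COOLDOWN;
}
```

With a plain Set, an aborted session would be skipped indefinitely; the timestamp map lets reflection resume once the cooldown expires.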
…ach (#43)

* feat: add reflection-static plugin with simpler self-assessment approach
* fix: prevent reflection spam on Esc abort, use real Azure eval
* fix: use override:true for dotenv to ensure correct Azure credentials
* fix: improve abort detection with cooldown-based tracking

Co-authored-by: engineer <engineer@opencode.ai>
- telegram.ts was incorrectly placed in the lib/ subdirectory (not loaded as a plugin)
- Fix: deploy telegram.ts directly to ~/.config/opencode/plugin/
- Fix isSessionComplete to check the completed timestamp (same as tts.ts)
- Remove install:global; add individual install scripts per plugin
- Update plugin-load.test.ts for the new deployment pattern
- Improve the reflection-static.ts analysis prompt to be stricter about completion

Fixes telegram notifications not being sent since commit d10a8f5
Telegram plugin fixes:
- Changed plugin initialization to non-blocking (setTimeout instead of await)
- Fixed Whisper endpoint from /transcribe to /transcribe-base64

send-notify function fix:
- Fixed placeholder leak by using null bytes instead of underscores

Test consolidation:
- Deleted redundant test files (telegram-e2e-real.ts, telegram-forward-e2e.test.ts, test-telegram-whisper.ts)
- Consolidated 17 real integration tests in test/telegram.test.ts
- All tests use real Supabase (no mocks)

Documentation updates:
- Added warnings about pkill and deployment
- Updated AGENTS.md with test requirements
- Updated plan.md with status

All tests pass: typecheck (0 errors), unit (130), plugin-load (5)
- Posts agent messages to associated GitHub issues as comments
- Auto-detects issues from: a URL in the first message, a .github-issue file, the PR's closingIssuesReferences, or branch name conventions
- Configurable via ~/.config/opencode/github.json
- Batches messages (5s interval) to avoid API rate limits
- Optional: create a new issue if none is found
- 18 unit tests for URL parsing, branch detection, and message formatting
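One of the detection sources above, issue numbers from branch names, can be sketched as a small parser. The supported patterns here (`issue-123`, `fix/123-foo`) are assumptions based on common conventions, not necessarily the plugin's exact rules.

```typescript
// Extracts an issue number from a branch name, or null if none is found.
// Matches a digit run at the start of the branch or of a path segment,
// optionally prefixed with "issue-"/"issue_".
function issueFromBranch(branch: string): number | null {
  const m = branch.match(/(?:^|\/)(?:issue[-_]?)?(\d+)(?:[-_]|$)/i);
  return m ? parseInt(m[1], 10) : null;
}
```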
- Add github.ts to the available plugins list
- Document all configuration options in table format
- Add .github-issue file format examples
- Add branch name pattern documentation
- Add debug logging instructions
- Update deployment instructions to include github.ts
- Update plan.md to mark all tasks complete
feat: GitHub issue plugin + telegram fixes
Added descriptions for new plugins and updated the README layout.
…t support (#46)

* fix(reflection-static): allow recursive reflection and reset completion on new messages
* fix(test): increase timeout for the telegram send-notify test
* fix(reflection-static): use message ID tracking instead of counting to handle compression
* Fix the install:global script and update the github plugin config
* feat(reflection-static): fix Plan Mode detection and add custom reflection.md support
  - Fix Plan Mode detection to check system/developer messages (not just user messages)
  - Add support for a custom reflection prompt via a ./reflection.md file
  - Fall back to the default 4-question prompt if reflection.md is not found
  - Fixes the issue where reflection triggered in Plan Mode and interrupted the agent workflow

Co-authored-by: engineer <engineer@opencode.ai>
Extended timeouts for tests that make real HTTP requests to Supabase:
- stores text reply with correct session_id: 15s
- routes replies to correct session: 15s
- webhook handles malformed JSON: 10s
- webhook handles missing message field: 10s

These tests were occasionally failing due to network latency.
OpenCode's plugin loader treats ALL named exports as plugin functions. The _test_internal object export caused 'fn3 is not a function' errors because OpenCode tried to call the object as a plugin.

Changes:
- telegram.ts: Remove the _test_internal named export (keeping only TelegramPlugin + default)
- test/telegram-internal.test.ts: Skip tests that depended on internal exports
- test/telegram.test.ts: Add a 15s timeout to the send-notify test
- test/plugin-load.test.ts: Increase server timeout to 60s, add debug logging

Root cause: the OpenCode plugin loader at src/plugin/index.ts:89 iterates all exports and calls them as functions. Non-function exports cause a TypeError.

All 5 plugins now load successfully together.
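To illustrate the failure mode: a defensive loader variant (illustrative only, not OpenCode's actual code, where the fix was instead to remove the non-function export) would filter module exports down to functions before calling them.

```typescript
// Keeps only the exports that are callable; an object export like
// _test_internal would be skipped instead of causing a TypeError.
function loadPlugins(mod: Record<string, unknown>): Array<() => unknown> {
  return Object.values(mod).filter(
    (v): v is () => unknown => typeof v === "function",
  );
}
```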
…lopment
- Replace 'opencode run' (single-shot) with 'opencode attach' (persistent TUI)
- Create the session via the API before launching the TUI
- Send the initial task to the session via the API
- Add a worktree_attach tool for resuming work on an existing worktree
- Enhance worktree_status with remote tracking info and active sessions
- Enhance worktree_delete with a branch deletion option and uncommitted-changes warnings
- Add configuration support via ~/.config/opencode/worktree.json
- Add isServerRunning() to check the server health endpoint
- Add startServer() to spawn opencode serve in the background
- Add an ensureServer() wrapper that starts the server if needed
- Save the server PID to ~/.config/opencode/worktree-server.pid
- Update launchTerminal to be async and use ensureServer
- Show a 'Started OpenCode server automatically' message when the server was started
- Support a serverPort config option (default: 4096)

This improves UX by not requiring users to manually start the server.
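The ensureServer flow above reduces to: probe the health endpoint, spawn only if the probe fails. A minimal sketch with the health check and spawner injected as stand-ins for the real fetch/spawn calls:

```typescript
async function ensureServer(
  checkHealth: () => Promise<boolean>,   // e.g. GET the health endpoint
  spawnServer: () => Promise<void>,      // e.g. spawn `opencode serve` detached
): Promise<"already-running" | "started"> {
  if (await checkHealth()) return "already-running";
  await spawnServer(); // caller would also persist the PID file here
  return "started";
}
```

Returning which branch was taken is what drives the 'Started OpenCode server automatically' message.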
… analysis

Race condition: reflection sends a self-assessment question, waits for the response, then analyzes it with the GenAI judge. During this time (which could be 30+ seconds), the human may type a new message. Without this fix, reflection would still inject its 'Please continue...' prompt even though the human had already provided new instructions.

Fix: After the GenAI analysis completes, re-fetch messages and compare the current lastUserMsgId with the initial one captured at reflection start. If they differ, abort the reflection to avoid injecting stale prompts.

This prevents confusing UX where reflection feedback appears after the human has already moved on to a new task.
…evaluation

Apply the same race condition fix from reflection-static.ts to reflection.ts.

Problem: The GenAI judge evaluation can take 30+ seconds. During this time, the human might type a new message. When the judge finishes, the plugin would inject feedback for the OLD task, which is stale and confusing.

Solution: After waitForResponse() completes, re-fetch messages and compare currentUserMsgId with initialUserMsgId. If they differ, abort the feedback injection to avoid stale prompts.

- Capture initialUserMsgId at the start of runReflection()
- After the judge verdict is parsed, re-fetch messages
- If currentUserMsgId != initialUserMsgId, abort and mark the original as reflected
- Add unit tests for the race condition scenarios
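The stale-feedback guard described above can be sketched as two pure helpers: capture the last user message id when reflection starts, then compare it against a fresh fetch after the slow judge call. The `UserMsg` shape and function names are illustrative.

```typescript
type UserMsg = { id: string; role: string };

// Returns the id of the most recent user message, if any.
function lastUserMsgId(messages: UserMsg[]): string | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") return messages[i].id;
  }
  return undefined;
}

// True when the human sent a new message mid-reflection, meaning the
// pending feedback targets the OLD task and must be dropped.
function feedbackIsStale(
  initialId: string | undefined,
  currentMessages: UserMsg[],
): boolean {
  return lastUserMsgId(currentMessages) !== initialId;
}
```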
…re notifications
- Write verdict signals in reflection-static for coordination
- Telegram waits for the verdict and skips reflection prompts
- TTS requires a verdict before speaking and records missing-verdict metrics
- Load the model list from ~/.config/opencode/reflection.yaml
- Try models in order and fall back on timeout/invalid JSON
- Document the config format
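The fallback behavior above amounts to a loop over configured models that moves on whenever a call fails. A minimal sketch, with `askJudge` as a stand-in for the real model call (which would throw on timeout or unparseable JSON):

```typescript
async function judgeWithFallback<T>(
  models: string[],
  askJudge: (model: string) => Promise<T>,
): Promise<{ model: string; verdict: T } | null> {
  for (const model of models) {
    try {
      return { model, verdict: await askJudge(model) };
    } catch {
      // timeout or invalid JSON — fall through to the next model
    }
  }
  return null; // every configured model failed
}
```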
Force-pushed from 832d496 to e58b10b
Summary

Add a /transcribe-base64 alias and harden telegram webhook tests with timeouts.