fix: make ParallelExecutor the single timeout authority (#760)#911
Open
SBALAVIGNESH123 wants to merge 9 commits intofrontman-ai:mainfrom
Open
fix: make ParallelExecutor the single timeout authority (#760)#911SBALAVIGNESH123 wants to merge 9 commits intofrontman-ai:mainfrom
SBALAVIGNESH123 wants to merge 9 commits intofrontman-ai:mainfrom
Conversation
…iour Implements frontman-ai#784 — replaces the closure-based tool_executor interface with a composable {module, opts} ExecutionStrategy behaviour. SwarmAI gains: - SwarmAi.ExecutionStrategy behaviour (4 callbacks) - SwarmAi.DefaultExecutionStrategy (batteries-included + use macro) - SwarmAi.Runtime.deliver_tool_result/5 and await_tool_result/3 - Internal tool result Registry inside Runtime's supervision subtree Domain loses: - FrontmanServer.ToolCallRegistry (removed from application.ex) - Execution.notify_tool_result/4 (replaced by Runtime.deliver_tool_result) - :notified/:no_executor leak through Tasks.add_tool_result Domain gains: - FrontmanServer.Tasks.Execution.AgentStrategy (plain struct, testable) - Clean separation: Tasks.add_tool_result returns {:ok, interaction} only Files: 5 new, 8 modified (4 swarm_ai + 4 frontman_server)
…tration, dead code removal
Critical fixes found during audit:
1. deliver_tool_result used {:tool_result_delivered, ...} but PE expects {:tool_result, ...} — unified to {:tool_result, ...}
2. start_mcp_tool_mfa was not registering PE's pid in Runtime.ToolRegistry — deliver_tool_result would silently return :no_executor
3. execution.ex still passed strategy: opt but Runtime only understands tool_executor: — bridged via AgentStrategy.init + closure adapter
4. Removed dead encode_result_for_swarm function (orphaned by notify_tool_result removal)
Devin Review caught that process_result: nil dropped maybe_enrich_with_images for screenshot tools. PE's build_await_result needs the MFA to convert base64 image data URLs into ContentPart.image parts that the LLM can process. Added AgentStrategy.make_mcp_tool_result/4 and wired it in both the adapter closure (execution.ex) and the legacy executor (build_legacy_executor).
MCP.parse_tool_result can return maps when tool output is valid JSON. maybe_enrich_with_images has a 'when is_binary(content)' guard that skips image extraction for non-string content. The old notify_tool_result called encode_result_for_swarm before sending. Added encode_for_delivery/1 in TaskChannel to JSON-encode map results at the boundary before delivery to PE.
…e return
ToolExecutor is dead code (replaced by AgentStrategy) but must compile
cleanly. Updated pattern match from {:ok, _interaction, _executor_status}
to {:ok, _interaction} to match the new add_tool_result return type.
execute_mcp_tool published the tool call before registering in the ToolRegistry, creating a race where the client could respond before we were listening. Fixed by inlining the receive and registering BEFORE publish, matching the pattern in start_mcp_tool_mfa. Uses raw receive instead of await_tool_result to avoid double Registry.register (await_tool_result registers internally).
) Remove hardcoded 600s MCP timeout from AgentStrategy.execute_mcp_tool. PE now owns all per-tool deadlines via Process.send_after — no other layer enforces timeouts, eliminating the double-timeout race condition. Changes: - agent_strategy.ex: bare receive, no internal after clause - runtime.ex: await_tool_result default timeout -> :infinity - default_execution_strategy.ex: remove dead {:error, :timeout} branch - parallel_executor.ex: document Timeout Ownership Contract - execution_strategy.ex: document timeout contract in execute_tool - timeout_ownership_test.exs: 6 new tests for timeout ownership
Devin Review caught that handle_timeout_mfa(:error, :triggered) was missing the Sentry.capture_message call that the old ToolExecutor had. Without this, tool timeouts in production would not be captured in error monitoring.
… limit Moves the tool_executor closure from submit_to_runtime into a separate build_tool_executor/1 helper. submit_to_runtime is now ~55 lines, well under the elixir-style.md 70-line limit.
Author
|
Addressed the two new findings from Devin Review: 1. ToolExecutor.register_mcp_tool references removed ToolCallRegistry — This is pre-existing from PR #902 (ExecutionStrategy migration). ToolExecutor is legacy code that's only reachable through sub-agent paths now. The primary production path routes through AgentStrategy which uses the Runtime's ToolRegistry instead. Migrating sub-agent paths off ToolExecutor is planned as follow-up work — not in scope for this timeout-ownership PR. 2. submit_to_runtime exceeds 70-line limit — Fixed in ca363fe. Extracted the tool_executor closure into a separate |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Removes the hardcoded 600s MCP timeout from
AgentStrategy.execute_mcp_tooland establishes PE as the single source of truth for all tool deadlines.Why
After #755 introduced parallel execution, timeout enforcement ended up scattered across two layers:
Process.send_afterdeadline per tool (fromToolExecution.timeout_ms)AgentStrategy.execute_mcp_toolhas its ownreceive..after 600_000These race each other. If PE's deadline fires first, it kills the task — but the inline
aftermight also fire in edge cases. Worse, the 600s value is hardcoded and ignores the tool definition entirely, so interactive tools that should timeout at 120s actually wait 600s.The fix is simple: PE already has everything it needs. The strategy just needs to get out of the way.
Changes
agent_strategy.ex— the actual fixreceive..after 600_000with barereceive(noafter)runtime.ex— default timeout to infinityawait_tool_resultdefault changed from600_000to:infinitytimeout:opt still works for tests and escape hatchesdefault_execution_strategy.ex— dead code removal{:error, :timeout}branch that is now unreachableparallel_executor.ex+execution_strategy.ex— docstimeout_ownership_test.exs— new testsreceiveon_timeoutcallback fires correctly:pause_agentworks with barereceiveHow to verify
All existing PE and delivery tests pass unchanged — they already use explicit
timeout:values.Answers to the questions from #760
No. PE is type-agnostic — it reads
timeout_msandon_timeout_policyfrom theToolExecutionstruct. The strategy decides what those values are.For now, yes — sub-agents run inside the parent's Sync task, so PE's deadline covers the whole tree. Decoupling sub-agent timeouts is a bigger change and should be separate.
Not relevant anymore — PE uses
Task.Supervisor.async_nolink, which is correct.SwarmAi.Tool.timeout_msalready does this. The problem was thatAgentStrategywas not using it. Now it does not need to — PE reads it fromToolExecution.timeout_msand enforces it.Closes #760