fix: make ParallelExecutor the single timeout authority (#760) by SBALAVIGNESH123 · Pull Request #911 · frontman-ai/frontman

SBALAVIGNESH123 · 2026-04-20T19:36:17Z

What

Removes the hardcoded 600s MCP timeout from AgentStrategy.execute_mcp_tool and establishes PE as the single source of truth for all tool deadlines.

Why

After #755 introduced parallel execution, timeout enforcement ended up scattered across two layers:

PE sets a Process.send_after deadline per tool (from ToolExecution.timeout_ms)
AgentStrategy.execute_mcp_tool has its own receive..after 600_000

These race each other. If PE's deadline fires first, it kills the task — but the inline after might also fire in edge cases. Worse, the 600s value is hardcoded and ignores the tool definition entirely, so interactive tools that should timeout at 120s actually wait 600s.

The fix is simple: PE already has everything it needs. The strategy just needs to get out of the way.

Changes

agent_strategy.ex — the actual fix

Replaced receive..after 600_000 with bare receive (no after)
PE kills the process on deadline, Registry auto-cleans up the key — no manual timeout needed
Added comment explaining the contract so nobody re-adds a timeout here

runtime.ex — default timeout to infinity

await_tool_result default changed from 600_000 to :infinity
Same reasoning: PE owns the deadline, not the callee
Explicit timeout: opt still works for tests and escape hatches

default_execution_strategy.ex — dead code removal

Removed the {:error, :timeout} branch that is now unreachable

parallel_executor.ex + execution_strategy.ex — docs

Added Timeout Ownership Contract section documenting PE as the single authority
Documents the end-to-end flow: tool definition -> ToolExecution struct -> PE deadline -> on_timeout callback

timeout_ownership_test.exs — new tests

PE kills a process stuck in bare receive
on_timeout callback fires correctly
Per-tool timeout isolation (fast tool times out, slow tool succeeds)
Registry key auto-cleanup on process kill
:pause_agent works with bare receive

How to verify

mix test apps/swarm_ai/test/swarm_ai/timeout_ownership_test.exs
mix test apps/swarm_ai/test/swarm_ai/parallel_executor_test.exs
mix test apps/swarm_ai/test/swarm_ai/runtime/tool_delivery_test.exs

All existing PE and delivery tests pass unchanged — they already use explicit timeout: values.

Answers to the questions from #760

Should PE know about tool types at all?

No. PE is type-agnostic — it reads timeout_ms and on_timeout_policy from the ToolExecution struct. The strategy decides what those values are.

Should the 60s timeout cover the full sub-agent tree?

For now, yes — sub-agents run inside the parent's Sync task, so PE's deadline covers the whole tree. Decoupling sub-agent timeouts is a bigger change and should be separate.

Is Task.async (linked) the right primitive?

Not relevant anymore — PE uses Task.Supervisor.async_nolink, which is correct.

Could we give each tool a timeout hint as metadata?

SwarmAi.Tool.timeout_ms already does this. The problem was that AgentStrategy was not using it. Now it does not need to — PE reads it from ToolExecution.timeout_ms and enforces it.

Closes #760

…iour Implements frontman-ai#784 — replaces the closure-based tool_executor interface with a composable {module, opts} ExecutionStrategy behaviour. SwarmAI gains: - SwarmAi.ExecutionStrategy behaviour (4 callbacks) - SwarmAi.DefaultExecutionStrategy (batteries-included + use macro) - SwarmAi.Runtime.deliver_tool_result/5 and await_tool_result/3 - Internal tool result Registry inside Runtime's supervision subtree Domain loses: - FrontmanServer.ToolCallRegistry (removed from application.ex) - Execution.notify_tool_result/4 (replaced by Runtime.deliver_tool_result) - :notified/:no_executor leak through Tasks.add_tool_result Domain gains: - FrontmanServer.Tasks.Execution.AgentStrategy (plain struct, testable) - Clean separation: Tasks.add_tool_result returns {:ok, interaction} only Files: 5 new, 8 modified (4 swarm_ai + 4 frontman_server)

…tration, dead code removal Critical fixes found during audit: 1. deliver_tool_result used {:tool_result_delivered, ...} but PE expects {:tool_result, ...} — unified to {:tool_result, ...} 2. start_mcp_tool_mfa was not registering PE's pid in Runtime.ToolRegistry — deliver_tool_result would silently return :no_executor 3. execution.ex still passed strategy: opt but Runtime only understands tool_executor: — bridged via AgentStrategy.init + closure adapter 4. Removed dead encode_result_for_swarm function (orphaned by notify_tool_result removal)

Devin Review caught that process_result: nil dropped maybe_enrich_with_images for screenshot tools. PE's build_await_result needs the MFA to convert base64 image data URLs into ContentPart.image parts that the LLM can process. Added AgentStrategy.make_mcp_tool_result/4 and wired it in both the adapter closure (execution.ex) and the legacy executor (build_legacy_executor).

MCP.parse_tool_result can return maps when tool output is valid JSON. maybe_enrich_with_images has a 'when is_binary(content)' guard that skips image extraction for non-string content. The old notify_tool_result called encode_result_for_swarm before sending. Added encode_for_delivery/1 in TaskChannel to JSON-encode map results at the boundary before delivery to PE.

…e return ToolExecutor is dead code (replaced by AgentStrategy) but must compile cleanly. Updated pattern match from {:ok, _interaction, _executor_status} to {:ok, _interaction} to match the new add_tool_result return type.

execute_mcp_tool published the tool call before registering in the ToolRegistry, creating a race where the client could respond before we were listening. Fixed by inlining the receive and registering BEFORE publish, matching the pattern in start_mcp_tool_mfa. Uses raw receive instead of await_tool_result to avoid double Registry.register (await_tool_result registers internally).

) Remove hardcoded 600s MCP timeout from AgentStrategy.execute_mcp_tool. PE now owns all per-tool deadlines via Process.send_after — no other layer enforces timeouts, eliminating the double-timeout race condition. Changes: - agent_strategy.ex: bare receive, no internal after clause - runtime.ex: await_tool_result default timeout -> :infinity - default_execution_strategy.ex: remove dead {:error, :timeout} branch - parallel_executor.ex: document Timeout Ownership Contract - execution_strategy.ex: document timeout contract in execute_tool - timeout_ownership_test.exs: 6 new tests for timeout ownership

Devin Review caught that handle_timeout_mfa(:error, :triggered) was missing the Sentry.capture_message call that the old ToolExecutor had. Without this, tool timeouts in production would not be captured in error monitoring.

… limit Moves the tool_executor closure from submit_to_runtime into a separate build_tool_executor/1 helper. submit_to_runtime is now ~55 lines, well under the elixir-style.md 70-line limit.

SBALAVIGNESH123 · 2026-04-20T21:00:29Z

Addressed the two new findings from Devin Review:

1. ToolExecutor.register_mcp_tool references removed ToolCallRegistry — This is pre-existing from PR #902 (ExecutionStrategy migration). ToolExecutor is legacy code that's only reachable through sub-agent paths now. The primary production path routes through AgentStrategy which uses the Runtime's ToolRegistry instead. Migrating sub-agent paths off ToolExecutor is planned as follow-up work — not in scope for this timeout-ownership PR.

2. submit_to_runtime exceeds 70-line limit — Fixed in ca363fe. Extracted the tool_executor closure into a separate build_tool_executor/1 helper. submit_to_runtime is now ~55 lines, well under the limit.

SBALAVIGNESH123 added 7 commits April 20, 2026 13:41

This comment was marked as resolved.

Sign in to view

fix: add Sentry reporting for tool timeouts in AgentStrategy

7179bd5

Devin Review caught that handle_timeout_mfa(:error, :triggered) was missing the Sentry.capture_message call that the old ToolExecutor had. Without this, tool timeouts in production would not be captured in error monitoring.

This comment was marked as resolved.

Sign in to view

refactor: extract build_tool_executor to comply with 70-line function…

ca363fe

… limit Moves the tool_executor closure from submit_to_runtime into a separate build_tool_executor/1 helper. submit_to_runtime is now ~55 lines, well under the elixir-style.md 70-line limit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make ParallelExecutor the single timeout authority (#760)#911

fix: make ParallelExecutor the single timeout authority (#760)#911
SBALAVIGNESH123 wants to merge 9 commits intofrontman-ai:mainfrom
SBALAVIGNESH123:feat/760-per-tool-timeout-ownership

SBALAVIGNESH123 commented Apr 20, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

SBALAVIGNESH123 commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SBALAVIGNESH123 commented Apr 20, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Changes

How to verify

Answers to the questions from #760

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

SBALAVIGNESH123 commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SBALAVIGNESH123 commented Apr 20, 2026 •

edited by devin-ai-integration Bot

Loading