Skip to content

fix: make ParallelExecutor the single timeout authority (#760)#911

Open
SBALAVIGNESH123 wants to merge 9 commits intofrontman-ai:mainfrom
SBALAVIGNESH123:feat/760-per-tool-timeout-ownership
Open

fix: make ParallelExecutor the single timeout authority (#760)#911
SBALAVIGNESH123 wants to merge 9 commits intofrontman-ai:mainfrom
SBALAVIGNESH123:feat/760-per-tool-timeout-ownership

Conversation

@SBALAVIGNESH123
Copy link
Copy Markdown

@SBALAVIGNESH123 SBALAVIGNESH123 commented Apr 20, 2026

What

Removes the hardcoded 600s MCP timeout from AgentStrategy.execute_mcp_tool and establishes PE as the single source of truth for all tool deadlines.

Why

After #755 introduced parallel execution, timeout enforcement ended up scattered across two layers:

  1. PE sets a Process.send_after deadline per tool (from ToolExecution.timeout_ms)
  2. AgentStrategy.execute_mcp_tool has its own receive..after 600_000

These race each other. If PE's deadline fires first, it kills the task — but the inline after might also fire in edge cases. Worse, the 600s value is hardcoded and ignores the tool definition entirely, so interactive tools that should timeout at 120s actually wait 600s.

The fix is simple: PE already has everything it needs. The strategy just needs to get out of the way.

Changes

agent_strategy.ex — the actual fix

  • Replaced receive..after 600_000 with bare receive (no after)
  • PE kills the process on deadline, Registry auto-cleans up the key — no manual timeout needed
  • Added comment explaining the contract so nobody re-adds a timeout here

runtime.ex — default timeout to infinity

  • await_tool_result default changed from 600_000 to :infinity
  • Same reasoning: PE owns the deadline, not the callee
  • Explicit timeout: opt still works for tests and escape hatches

default_execution_strategy.ex — dead code removal

  • Removed the {:error, :timeout} branch that is now unreachable

parallel_executor.ex + execution_strategy.ex — docs

  • Added Timeout Ownership Contract section documenting PE as the single authority
  • Documents the end-to-end flow: tool definition -> ToolExecution struct -> PE deadline -> on_timeout callback

timeout_ownership_test.exs — new tests

  • PE kills a process stuck in bare receive
  • on_timeout callback fires correctly
  • Per-tool timeout isolation (fast tool times out, slow tool succeeds)
  • Registry key auto-cleanup on process kill
  • :pause_agent works with bare receive

How to verify

mix test apps/swarm_ai/test/swarm_ai/timeout_ownership_test.exs
mix test apps/swarm_ai/test/swarm_ai/parallel_executor_test.exs
mix test apps/swarm_ai/test/swarm_ai/runtime/tool_delivery_test.exs

All existing PE and delivery tests pass unchanged — they already use explicit timeout: values.

Answers to the questions from #760

Should PE know about tool types at all?

No. PE is type-agnostic — it reads timeout_ms and on_timeout_policy from the ToolExecution struct. The strategy decides what those values are.

Should the 60s timeout cover the full sub-agent tree?

For now, yes — sub-agents run inside the parent's Sync task, so PE's deadline covers the whole tree. Decoupling sub-agent timeouts is a bigger change and should be separate.

Is Task.async (linked) the right primitive?

Not relevant anymore — PE uses Task.Supervisor.async_nolink, which is correct.

Could we give each tool a timeout hint as metadata?

SwarmAi.Tool.timeout_ms already does this. The problem was that AgentStrategy was not using it. Now it does not need to — PE reads it from ToolExecution.timeout_ms and enforces it.

Closes #760


Open in Devin Review

…iour

Implements frontman-ai#784 — replaces the closure-based tool_executor interface with a
composable {module, opts} ExecutionStrategy behaviour.

SwarmAI gains:
- SwarmAi.ExecutionStrategy behaviour (4 callbacks)
- SwarmAi.DefaultExecutionStrategy (batteries-included + use macro)
- SwarmAi.Runtime.deliver_tool_result/5 and await_tool_result/3
- Internal tool result Registry inside Runtime's supervision subtree

Domain loses:
- FrontmanServer.ToolCallRegistry (removed from application.ex)
- Execution.notify_tool_result/4 (replaced by Runtime.deliver_tool_result)
- :notified/:no_executor leak through Tasks.add_tool_result

Domain gains:
- FrontmanServer.Tasks.Execution.AgentStrategy (plain struct, testable)
- Clean separation: Tasks.add_tool_result returns {:ok, interaction} only

Files: 5 new, 8 modified (4 swarm_ai + 4 frontman_server)
…tration, dead code removal

Critical fixes found during audit:
1. deliver_tool_result used {:tool_result_delivered, ...} but PE expects {:tool_result, ...} — unified to {:tool_result, ...}
2. start_mcp_tool_mfa was not registering PE's pid in Runtime.ToolRegistry — deliver_tool_result would silently return :no_executor
3. execution.ex still passed strategy: opt but Runtime only understands tool_executor: — bridged via AgentStrategy.init + closure adapter
4. Removed dead encode_result_for_swarm function (orphaned by notify_tool_result removal)
Devin Review caught that process_result: nil dropped maybe_enrich_with_images
for screenshot tools. PE's build_await_result needs the MFA to convert base64
image data URLs into ContentPart.image parts that the LLM can process.

Added AgentStrategy.make_mcp_tool_result/4 and wired it in both the adapter
closure (execution.ex) and the legacy executor (build_legacy_executor).
MCP.parse_tool_result can return maps when tool output is valid JSON.
maybe_enrich_with_images has a 'when is_binary(content)' guard that
skips image extraction for non-string content. The old notify_tool_result
called encode_result_for_swarm before sending.

Added encode_for_delivery/1 in TaskChannel to JSON-encode map results
at the boundary before delivery to PE.
…e return

ToolExecutor is dead code (replaced by AgentStrategy) but must compile
cleanly. Updated pattern match from {:ok, _interaction, _executor_status}
to {:ok, _interaction} to match the new add_tool_result return type.
execute_mcp_tool published the tool call before registering in the
ToolRegistry, creating a race where the client could respond before
we were listening. Fixed by inlining the receive and registering
BEFORE publish, matching the pattern in start_mcp_tool_mfa.

Uses raw receive instead of await_tool_result to avoid double
Registry.register (await_tool_result registers internally).
)

Remove hardcoded 600s MCP timeout from AgentStrategy.execute_mcp_tool.
PE now owns all per-tool deadlines via Process.send_after — no other
layer enforces timeouts, eliminating the double-timeout race condition.

Changes:
- agent_strategy.ex: bare receive, no internal after clause
- runtime.ex: await_tool_result default timeout -> :infinity
- default_execution_strategy.ex: remove dead {:error, :timeout} branch
- parallel_executor.ex: document Timeout Ownership Contract
- execution_strategy.ex: document timeout contract in execute_tool
- timeout_ownership_test.exs: 6 new tests for timeout ownership
devin-ai-integration[bot]

This comment was marked as resolved.

Devin Review caught that handle_timeout_mfa(:error, :triggered) was
missing the Sentry.capture_message call that the old ToolExecutor had.
Without this, tool timeouts in production would not be captured in
error monitoring.
devin-ai-integration[bot]

This comment was marked as resolved.

… limit

Moves the tool_executor closure from submit_to_runtime into a separate
build_tool_executor/1 helper. submit_to_runtime is now ~55 lines,
well under the elixir-style.md 70-line limit.
@SBALAVIGNESH123
Copy link
Copy Markdown
Author

Addressed the two new findings from Devin Review:

1. ToolExecutor.register_mcp_tool references removed ToolCallRegistry — This is pre-existing from PR #902 (ExecutionStrategy migration). ToolExecutor is legacy code that's only reachable through sub-agent paths now. The primary production path routes through AgentStrategy which uses the Runtime's ToolRegistry instead. Migrating sub-agent paths off ToolExecutor is planned as follow-up work — not in scope for this timeout-ownership PR.

2. submit_to_runtime exceeds 70-line limit — Fixed in ca363fe. Extracted the tool_executor closure into a separate build_tool_executor/1 helper. submit_to_runtime is now ~55 lines, well under the limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rearchitect: per-tool timeout ownership in parallel executor

1 participant