
Production-Readiness for Run Skill Script Tool #5079

@haiyuan-eng-google

Summary

RunSkillScriptTool enables agents to execute Python and shell scripts within skills. While functional, two critical gaps prevent its safe use in production or shared environments:

  1. Python scripts can hang indefinitely: Unlike shell scripts, Python scripts executed via runpy.run_path() lack timeout enforcement, risking process-wide deadlocks.
  2. Unsafe local execution: The default zero-dependency UnsafeLocalCodeExecutor runs code with full host privileges, posing significant security risks.

This RFC proposes two P0 changes to address these gaps:

  • P0-A: Uniform Timeout Support: Introduce a consistent timeout mechanism across all code executors.
  • P0-B: Isolated Subprocess Mode: Enhance the local executor to use subprocess-based isolation by default, leveraging patterns from PR #3225.

Motivation

The current state of script execution in ADK presents reliability and security challenges:

  • Reliability: A single buggy Python skill script without a timeout can block all code execution instances sharing the UnsafeLocalCodeExecutor lock, leading to a denial-of-service.
  • Security: UnsafeLocalCodeExecutor allows arbitrary code execution on the host, making it unsuitable for multi-user environments, CI/CD pipelines running untrusted scripts, or any production-like deployment. The ease of use of UnsafeLocalCodeExecutor means it's often used in scenarios where its risks are not acceptable.

These issues must be resolved before skills leveraging RunSkillScriptTool can be considered production-ready.

Proposal

P0-A: Uniform Timeout Support

Goal: Ensure all execute_code() calls have a configurable, bounded execution time.

Design:

  1. Add timeout_seconds: Optional[int] to BaseCodeExecutor. This provides a fallback timeout when the input does not specify one.
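
The fallback rule can be illustrated with a minimal sketch. Both classes below are simplified stand-ins rather than the real ADK types, and resolve_timeout is a hypothetical helper name; the sketch only assumes that a per-call value, when present, takes precedence over the executor-level default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeExecutionInput:
    """Simplified stand-in for the real input type (assumption)."""
    code: str
    timeout_seconds: Optional[int] = None


@dataclass
class BaseCodeExecutor:
    """Simplified stand-in carrying the proposed executor-level default."""
    timeout_seconds: Optional[int] = None

    def resolve_timeout(self, code_input: CodeExecutionInput) -> Optional[int]:
        # Hypothetical helper: the per-call value wins, otherwise the
        # executor default applies. None from both means "no timeout".
        if code_input.timeout_seconds is not None:
            return code_input.timeout_seconds
        return self.timeout_seconds
```

Comparing against None rather than relying on truthiness keeps an explicit value of 0 distinct from an unset value.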

Executor Implementation:

  • UnsafeLocalCodeExecutor: Execute exec() in a daemon thread and wait with thread.join(timeout). If a timeout occurs, mark the executor as unhealthy (self._healthy = False) so that it rejects further executions until reinitialize() is called. Python cannot forcibly stop a thread, so the timed-out thread may keep running; the unhealthy flag mitigates the risk of that lingering thread mutating shared state.
  • ContainerCodeExecutor: Run exec_start in a thread and, on timeout, terminate the process with os.kill on the host PID (obtained via exec_inspect). If that fails, restart the container as a fallback.
  • RunSkillScriptTool: Wire the SkillToolset(script_timeout=N) value to CodeExecutionInput.timeout_seconds.
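
The thread-based approach for the in-process executor might look roughly like the sketch below. The class name ThreadTimeoutExecutor is hypothetical and the real executor's interface will differ; only exec(), thread.join(timeout), the _healthy flag, and reinitialize() come from the design above.

```python
import threading
from typing import Optional


class ThreadTimeoutExecutor:
    """Illustrative only; not the ADK UnsafeLocalCodeExecutor."""

    def __init__(self, timeout_seconds: Optional[float] = None):
        self.timeout_seconds = timeout_seconds
        self._healthy = True

    def reinitialize(self) -> None:
        # Explicit opt-in to resume after a timeout.
        self._healthy = True

    def execute_code(self, code: str) -> dict:
        if not self._healthy:
            raise RuntimeError("executor is unhealthy; call reinitialize() first")

        scope: dict = {}
        errors: list = []

        def target() -> None:
            try:
                exec(code, scope)
            except Exception as exc:  # surface script errors to the caller
                errors.append(exc)

        worker = threading.Thread(target=target, daemon=True)
        worker.start()
        worker.join(self.timeout_seconds)

        if worker.is_alive():
            # The thread cannot be killed and may still be running, so
            # refuse further work until the caller reinitializes.
            self._healthy = False
            raise TimeoutError(f"script exceeded {self.timeout_seconds}s")
        if errors:
            raise errors[0]
        return scope
```

A daemon thread does not block interpreter shutdown, but it keeps consuming CPU until the process exits. That limitation is one reason the subprocess mode in P0-B is the stronger guarantee.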

P0-B: Isolated Subprocess Mode via in_process in UnsafeLocalCodeExecutor

Goal: Improve local execution safety by moving from exec() in a shared process to a separate subprocess, while keeping the "Unsafe" naming to avoid a false sense of security.

Design:

Refined Design: Instead of a use_subprocess flag, we will introduce an in_process parameter (defaulting to False) to the UnsafeLocalCodeExecutor.

Logic: When in_process is False (the default), the executor uses the subprocess isolation logic. When in_process is True, it reverts to runpy.run_path() in the host process, for debugging or low-risk local tasks.

Rationale: Because None and False are both falsy in Python, leaving in_process unset or passing False selects the same (now safe) default path.

This incorporates the logic from PR #3225, a sandboxed subprocess built on Python's standard library:

  • Mechanism: Use subprocess.Popen([sys.executable, "-c", code]) for execution.
  • Isolation: Ensures a separate memory space and prevents a script crash from taking down the main agent runtime.
  • Resource Limits: Utilize the resource module (Unix-only) to set RLIMIT_CPU and RLIMIT_AS (memory) within the child process before execution.
  • Environment: Pass only an explicit allowlist of environment variables to the subprocess.
  • Cleanup: Use proc.communicate(timeout=N). On timeout, kill the entire process group using os.killpg.
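
Taken together, the five points above could be sketched as follows. This is not the code from PR #3225: the function name run_isolated, the allowlist contents, and the default limits are assumptions chosen for illustration, and the sketch is Unix-only because it relies on the resource module and process groups.

```python
import os
import resource
import signal
import subprocess
import sys

# Assumed allowlist; the real list would be a design decision.
ENV_ALLOWLIST = ("PATH", "HOME", "LANG")


def _limit(kind: int, value: int) -> None:
    # Clamp to the current hard limit so setrlimit cannot fail on it.
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_isolated(code, timeout_seconds, cpu_seconds=10, memory_bytes=1 << 31):
    """Run `code` in a child interpreter; returns (returncode, stdout, stderr)."""

    def preexec() -> None:
        # Runs in the child after fork and before exec.
        _limit(resource.RLIMIT_CPU, cpu_seconds)
        _limit(resource.RLIMIT_AS, memory_bytes)

    env = {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        env=env,                 # only allowlisted variables reach the child
        start_new_session=True,  # child leads its own process group
        preexec_fn=preexec,      # note: not safe if the parent uses threads
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group so grandchildren do not survive.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited
        proc.communicate()  # reap the child and drain the pipes
        raise
    return proc.returncode, stdout, stderr
```

Because start_new_session=True makes the child the leader of a new process group, its PID doubles as the group ID passed to os.killpg. The Python documentation warns that preexec_fn is unsafe when the parent process has multiple threads, so a production version may need a different way to apply the limits.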

Comparison of Execution Modes:

| Threat | Current (In-Process exec) | Proposed (Default Subprocess) |
|---|---|---|
| Infinite loops | Blocked threads/Deadlocks | Terminated via OS timeout |
| Memory Exhaustion | Crashes main host process | Process-limited via RLIMIT_AS |
| Host Process Crash | High risk | Isolated to child process |
| Filesystem Access | Full access | Partial (CWD restricted, but absolute paths open) |

Timeline

Phase 1: Timeout & Isolation Foundation

  • Integrate subprocess execution logic into UnsafeLocalCodeExecutor and set as the default mode.
  • Implement timeout_seconds in BaseCodeExecutor.
  • Update RunSkillScriptTool to support global timeout configuration.

Phase 2: Security Hardening

  • Emit an explicit SecurityWarning when subprocess mode is disabled (in_process=True).
  • Update documentation to clearly define the boundary between Subprocess Isolation and full Container Isolation.
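
One possible shape for the warning is sketched below. Python has no built-in SecurityWarning category, so the sketch assumes a custom subclass of UserWarning; select_mode is a hypothetical helper standing in for the executor's constructor logic.

```python
import warnings


class SecurityWarning(UserWarning):
    """Assumed custom category; not part of Python's built-in warnings."""


def select_mode(in_process: bool = False) -> str:
    # Hypothetical helper: returns the execution mode that would be used.
    if in_process:
        warnings.warn(
            "in_process=True runs scripts inside the host process with no "
            "isolation; use only for debugging or trusted code.",
            SecurityWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
        return "in_process"
    return "subprocess"
```

A dedicated category lets deployments escalate the warning to an error with warnings.simplefilter("error", SecurityWarning) without affecting unrelated warnings.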

Cross-workstream impacts or dependencies

  • This proposal directly enhances the security and reliability of the ADK agent runtime.
  • It affects all skills that use RunSkillScriptTool.

Outcome

  • All code executors support configurable instance-level timeouts.
  • RunSkillScriptTool defaults to subprocess-based isolation, preventing skill scripts from crashing the host or leaking process memory.
  • Consistent patterns for process-level safety across the ADK.

Metadata

Labels

tools: [Component] This issue is related to tools