Production-Readiness for Run Skill Script Tool #5079
Summary
RunSkillScriptTool enables agents to execute Python and shell scripts within skills. While functional, two critical gaps prevent its safe use in production or shared environments:
- Python scripts can hang indefinitely: Unlike shell scripts, Python scripts executed via `runpy.run_path()` lack timeout enforcement, risking process-wide deadlocks.
- Unsafe local execution: The default zero-dependency `UnsafeLocalCodeExecutor` runs code with full host privileges, posing significant security risks.
This RFC proposes two P0 changes to address these gaps:
- P0-A: Uniform Timeout Support: Introduce a consistent timeout mechanism across all code executors.
- P0-B: Isolated Subprocess Mode: Enhance the local executor to use subprocess-based isolation by default, leveraging patterns from PR #3225.
Motivation
The current state of script execution in ADK presents reliability and security challenges:
- Reliability: A single buggy Python skill script without a timeout can block all code execution instances sharing the `UnsafeLocalCodeExecutor` lock, leading to a denial of service.
- Security: `UnsafeLocalCodeExecutor` allows arbitrary code execution on the host, making it unsuitable for multi-user environments, CI/CD pipelines running untrusted scripts, or any production-like deployment. Because `UnsafeLocalCodeExecutor` is so easy to use, it is often chosen in scenarios where its risks are not acceptable.
These issues must be resolved before skills leveraging RunSkillScriptTool can be considered production-ready.
Proposal
P0-A: Uniform Timeout Support
Goal: Ensure all execute_code() calls have a configurable, bounded execution time.
Design:
- Add `timeout_seconds: Optional[int]` to `BaseCodeExecutor`: Provides a fallback timeout when the input does not specify one.
Executor Implementation:
- `UnsafeLocalCodeExecutor`: Execute `exec()` in a daemon thread, using `thread.join(timeout)`. Mark the executor as unhealthy (`self._healthy = False`) if a timeout occurs, preventing further executions until `reinitialize()` is called. This mitigates risks from lingering threads mutating shared state.
- `ContainerCodeExecutor`: Run `exec_start` in a thread, using `os.kill` on the host PID (obtained via `exec_inspect`) to terminate on timeout. Include a container restart as a fallback.
- `RunSkillScriptTool`: Wire the `SkillToolset(script_timeout=N)` value to `CodeExecutionInput.timeout_seconds`.
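The thread-based timeout described for the in-process path can be sketched roughly as follows. This is a minimal illustration, not ADK's implementation: the class `ThreadTimeoutExecutor`, the `ExecutorUnhealthyError` exception, and the convention of reading a `result` variable from the executed code are all assumptions made for the example. Only `timeout_seconds`, `_healthy`, and `reinitialize()` come from the proposal.

```python
import threading


class ExecutorUnhealthyError(RuntimeError):
    """Hypothetical error raised while the executor refuses new work."""


class ThreadTimeoutExecutor:
    """Illustrative stand-in for the in-process path; not an ADK class."""

    def __init__(self, timeout_seconds=None):
        self.timeout_seconds = timeout_seconds  # executor-level fallback
        self._healthy = True

    def execute_code(self, code, timeout_seconds=None):
        if not self._healthy:
            raise ExecutorUnhealthyError("call reinitialize() before reuse")
        # A per-call value wins; otherwise use the executor-level fallback.
        timeout = (
            timeout_seconds if timeout_seconds is not None else self.timeout_seconds
        )
        outcome = {}

        def target():
            try:
                scope = {}
                exec(code, scope)
                # Assumed convention: the script leaves its answer in `result`.
                outcome["value"] = scope.get("result")
            except Exception as exc:
                outcome["error"] = exc

        worker = threading.Thread(target=target, daemon=True)
        worker.start()
        worker.join(timeout)  # timeout=None waits indefinitely
        if worker.is_alive():
            # Python cannot kill a thread, so the worker keeps running.
            # Refuse further work until the caller explicitly resets us.
            self._healthy = False
            raise TimeoutError(f"code did not finish within {timeout} seconds")
        if "error" in outcome:
            raise outcome["error"]
        return outcome.get("value")

    def reinitialize(self):
        self._healthy = True
```

Because the timed-out thread is abandoned rather than stopped, the health flag is what keeps later calls from racing against it. That limitation is one reason the subprocess mode below is preferable as a default.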
P0-B: Isolated Subprocess Mode via in_process in UnsafeLocalCodeExecutor
Goal: Improve local execution safety by moving from exec() in a shared process to a separate subprocess, while keeping the "Unsafe" naming to avoid a false sense of security.
Design:
Refined Design: Instead of a use_subprocess flag, we will introduce an in_process parameter (defaulting to False) to the UnsafeLocalCodeExecutor.
Logic: When False (default), the executor uses the subprocess isolation logic. When True, it reverts to runpy.run_path() for debugging or low-risk local tasks.
Rationale: Because `None` and `False` are both falsy in Python, an unset value and an explicit `False` lead to the same (now safe) default path.
This incorporates logic from PR #3225: sandboxed subprocess using Python's standard library:
- Mechanism: Use `subprocess.Popen([sys.executable, "-c", code])` for execution.
- Isolation: Ensures a separate memory space and prevents a script crash from taking down the main agent runtime.
- Resource Limits: Utilize the `resource` module (Unix-only) to set `RLIMIT_CPU` and `RLIMIT_AS` (memory) within the child process before execution.
- Environment: Pass only an explicit allowlist of environment variables to the subprocess.
- Cleanup: Use `proc.communicate(timeout=N)`. On timeout, kill the entire process group using `os.killpg`.
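The five points above could fit together along these lines. This is a hedged sketch using only the standard library, not the code from PR #3225: the function name `run_isolated`, the `ENV_ALLOWLIST` contents, and the default limits are assumptions chosen for illustration.

```python
import os
import signal
import subprocess
import sys

try:
    import resource  # Unix-only; absent on Windows
except ImportError:
    resource = None

# Hypothetical allowlist; a real deployment would choose its own.
ENV_ALLOWLIST = ("PATH", "HOME", "LANG")


def _limit(which, value):
    """Best-effort soft limit, clamped to the existing hard limit."""
    try:
        _, hard = resource.getrlimit(which)
        if hard != resource.RLIM_INFINITY:
            value = min(value, hard)
        resource.setrlimit(which, (value, hard))
    except (ValueError, OSError):
        pass  # some platforms reject certain limits


def run_isolated(code, timeout_seconds=10, cpu_seconds=10, memory_bytes=1 << 30):
    """Run `code` in a child interpreter; return (returncode, stdout, stderr)."""
    env = {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}
    kwargs = {}
    if resource is not None:
        def apply_limits():
            # Runs in the child before the interpreter starts.
            # Note: preexec_fn is not safe if the parent uses threads heavily.
            _limit(resource.RLIMIT_CPU, cpu_seconds)
            _limit(resource.RLIMIT_AS, memory_bytes)

        kwargs["preexec_fn"] = apply_limits
        # New session => the child leads its own process group (pgid == pid).
        kwargs["start_new_session"] = True
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        env=env,
        **kwargs,
    )
    try:
        out, err = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            if "start_new_session" in kwargs:
                os.killpg(proc.pid, signal.SIGKILL)  # kill child and descendants
            else:
                proc.kill()
        except ProcessLookupError:
            pass  # the child exited on its own in the meantime
        proc.communicate()  # reap the child and drain the pipes
        raise TimeoutError(f"script exceeded {timeout_seconds} seconds")
    return proc.returncode, out, err
```

Starting the child in its own session is what makes `os.killpg(proc.pid, ...)` safe here: without it, the child would share the parent's process group and the kill would also hit the agent runtime.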
Comparison of Execution Modes:
| Threat | Current (In-Process exec) | Proposed (Default Subprocess) |
|---|---|---|
| Infinite loops | Blocked threads/Deadlocks | Terminated via OS timeout |
| Memory Exhaustion | Crashes main host process | Process-limited via RLIMIT_AS |
| Host Process Crash | High risk | Isolated to child process |
| Filesystem Access | Full access | Partial (CWD restricted, but absolute paths open) |
Timeline
Phase 1: Timeout & Isolation Foundation
- Integrate subprocess execution logic into `UnsafeLocalCodeExecutor` and set it as the default mode.
- Implement `timeout_seconds` in `BaseCodeExecutor`.
- Update `RunSkillScriptTool` to support global timeout configuration.
Phase 2: Security Hardening
- Add an explicit `SecurityWarning` when subprocess mode is disabled.
- Update documentation to clearly define the boundary between Subprocess Isolation and full Container Isolation.
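One way the warning could be raised is sketched below. Python has no built-in `SecurityWarning`, so the example defines one as a `UserWarning` subclass. The `LocalExecutorConfig` class is a hypothetical stand-in for wherever the `in_process` parameter ends up living; neither name is confirmed by the proposal beyond the warning's name.

```python
import warnings


class SecurityWarning(UserWarning):
    """Hypothetical category; not part of the Python standard library."""


class LocalExecutorConfig:
    """Illustrative holder for the in_process flag; not an ADK class."""

    def __init__(self, in_process=False):
        # None and False are both falsy, so both select subprocess isolation.
        self.in_process = bool(in_process)
        if self.in_process:
            warnings.warn(
                "in_process=True runs code inside the host process "
                "without subprocess isolation",
                SecurityWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
```

Using a dedicated category lets operators escalate it to an error in production, for example with `warnings.simplefilter("error", SecurityWarning)`.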
Cross-workstream impacts or dependencies
- This proposal directly enhances the security and reliability of the ADK agent runtime.
- Affects all skills utilizing the `RunSkillScriptTool`.
Outcome
- All code executors support configurable instance-level timeouts.
- RunSkillScriptTool defaults to subprocess-based isolation, preventing skill scripts from crashing the host or leaking process memory.
- Consistent patterns for process-level safety across the ADK