

feat: Add ExecuteSkillScriptTool for running skill scripts #4575

Draft

caohy1988 wants to merge 46 commits into google:main from caohy1988:feat/execute-skill-script

Conversation

@caohy1988

Summary

  • Adds ExecuteSkillScriptTool to SkillToolset, enabling agents to execute scripts from a skill's scripts/ directory via ADK's BaseCodeExecutor infrastructure
  • Supports Python (.py) and shell (.sh/.bash) scripts with optional input_args for parameterized execution
  • Implements executor resolution chain: toolset-level code_executor → agent fallback → actionable error
  • Truncates execution error messages to 200 chars for LLM-friendly responses (full traces logged via logger.exception)
  • Updates DEFAULT_SKILL_SYSTEM_INSTRUCTION to guide the LLM on execute_skill_script usage
  • Adds code_executor: Optional[BaseCodeExecutor] = None keyword-only parameter to SkillToolset.__init__ (backward compatible)
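The error-message truncation described above can be sketched as follows. This is a hypothetical helper, not the PR's actual code; the function name and suffix text are assumptions.

```python
# Hypothetical sketch of the 200-char truncation for LLM-facing errors;
# the name and suffix wording are illustrative, not the PR's actual code.
_MAX_ERROR_LEN = 200

def truncate_error(message: str, limit: int = _MAX_ERROR_LEN) -> str:
    """Return an LLM-friendly error string, truncated to `limit` chars."""
    if len(message) <= limit:
        return message
    return message[:limit] + "... (truncated; full trace in logs)"
```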

Changes

Core implementation (src/google/adk/tools/skill_toolset.py)

  • New ExecuteSkillScriptTool class with _prepare_code() for Python/shell dispatch
  • Error codes: MISSING_SKILL_NAME, MISSING_SCRIPT_NAME, SKILL_NOT_FOUND, SCRIPT_NOT_FOUND, NO_CODE_EXECUTOR, UNSUPPORTED_SCRIPT_TYPE, EXECUTION_ERROR
  • scripts/ prefix auto-stripping for consistency with LoadSkillResourceTool path format

Unit tests (tests/unittests/tools/test_skill_toolset.py)

  • 19 new tests (37 total) covering: missing params, skill/script not found, no executor, unsupported types, Python/shell success, input_args injection, prefix stripping, executor priority, agent fallback, execution error, stderr status, error truncation

Sample agent (contributing/samples/skill_script_demo/)

  • Loads Anthropic's official mcp-builder skill (from anthropics/skills) and a python-helper skill with self-contained scripts (fibonacci, word_count, json_format)
  • 14 compatibility tests verifying all 4 tools against the real mcp-builder skill
  • Live integration test with 9 scenarios (15 steps) exercising discovery, loading, reference reading, script execution, multi-turn conversations, cross-skill routing, and negative cases — all verified passing with Gemini 2.5 Flash

Test plan

  • pytest tests/unittests/tools/test_skill_toolset.py -v — 37/37 passed
  • pytest contributing/samples/skill_script_demo/test_skill_compat.py -v — 14/14 passed
  • Live test with Gemini 2.5 Flash via Vertex AI — 15/15 passed
  • Autoformat with ./autoformat.sh
  • Backward compatible: existing SkillToolset(skills=[...]) works unchanged

🤖 Generated with Claude Code

…xecutor

Adds execute_skill_script tool to SkillToolset, enabling agents to run
scripts from a skill's scripts/ directory through ADK's BaseCodeExecutor
infrastructure. Supports Python (.py) and shell (.sh/.bash) scripts with
optional input_args, executor resolution chain (toolset → agent fallback),
and LLM-friendly error truncation.

Includes 37 unit tests, a sample agent with Anthropic's mcp-builder skill
and a python-helper skill, compatibility tests, and a live integration test
exercising all four skill tools end-to-end with Gemini.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@google-cla

google-cla bot commented Feb 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @caohy1988, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SkillToolset by integrating a new ExecuteSkillScriptTool. This tool empowers agents to directly execute Python and shell scripts residing within a skill's scripts/ directory, providing a powerful mechanism for extending agent capabilities with custom logic and automation. The implementation includes robust error handling, flexible argument passing, and a clear hierarchy for code executor selection, making it easier for agents to leverage dynamic scripting for complex tasks.

Highlights

  • New Script Execution Tool: Introduced ExecuteSkillScriptTool to enable agents to run scripts from skill scripts/ directories, supporting Python (.py) and shell (.sh, .bash) scripts with optional input_args.
  • Code Executor Resolution: Implemented a flexible code executor resolution chain where a toolset-level executor takes precedence, falling back to the agent's executor if not explicitly provided.
  • Enhanced Error Handling: Improved error handling for script execution, including specific error codes and truncation of long error messages to ensure LLM-friendly responses.
  • System Instruction Update: Updated the DEFAULT_SKILL_SYSTEM_INSTRUCTION to guide LLMs on the proper usage of the new execute_skill_script tool.
  • Backward Compatibility: Ensured backward compatibility by adding code_executor as a keyword-only parameter to the SkillToolset.__init__ method.
  • Comprehensive Demo and Tests: Added a new sample agent, a live integration test, and extensive unit tests to demonstrate and validate the new script execution capabilities across various scenarios and external skills.


Changelog
  • contributing/samples/skill_script_demo/__init__.py
    • Initialized the skill script demo package.
  • contributing/samples/skill_script_demo/agent.py
    • Defined a sample agent for skill script demonstration.
  • contributing/samples/skill_script_demo/run_live_test.py
    • Added a live integration test for skill script execution.
  • contributing/samples/skill_script_demo/skills/mcp-builder/LICENSE.txt
    • Added the Apache License 2.0 for the mcp-builder skill.
  • contributing/samples/skill_script_demo/skills/mcp-builder/SKILL.md
    • Documented the mcp-builder skill's purpose and workflow.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/evaluation.md
    • Provided a guide for creating MCP server evaluations.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/mcp_best_practices.md
    • Documented best practices for MCP server development.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/node_mcp_server.md
    • Provided a Node/TypeScript MCP server implementation guide.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/python_mcp_server.md
    • Provided a Python MCP server implementation guide.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/connections.py
    • Implemented lightweight connection handling for MCP servers.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/evaluation.py
    • Added an MCP server evaluation harness script.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/example_evaluation.xml
    • Added an example XML file for MCP server evaluations.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/requirements.txt
    • Listed Python dependencies for the mcp-builder evaluation script.
  • contributing/samples/skill_script_demo/skills/python-helper/SKILL.md
    • Documented the python-helper skill and its scripts.
  • contributing/samples/skill_script_demo/skills/python-helper/references/usage.md
    • Provided a usage guide for python-helper scripts.
  • contributing/samples/skill_script_demo/skills/python-helper/scripts/fibonacci.py
    • Added a Python script to generate Fibonacci sequences.
  • contributing/samples/skill_script_demo/skills/python-helper/scripts/json_format.py
    • Added a Python script to format and validate JSON.
  • contributing/samples/skill_script_demo/skills/python-helper/scripts/word_count.py
    • Added a Python script to perform word frequency analysis.
  • contributing/samples/skill_script_demo/test_skill_compat.py
    • Added compatibility tests for the mcp-builder skill.
  • src/google/adk/tools/skill_toolset.py
    • Implemented ExecuteSkillScriptTool for script execution within skills.
    • Added code preparation logic for Python and shell scripts.
    • Introduced error handling, including specific error codes and message truncation for LLMs.
    • Modified SkillToolset constructor to accept an optional code_executor parameter.
    • Updated DEFAULT_SKILL_SYSTEM_INSTRUCTION to include guidance on execute_skill_script usage.
  • tests/unittests/tools/test_skill_toolset.py
    • Added unit tests for the new ExecuteSkillScriptTool functionality.
    • Expanded test coverage for various error conditions, script types, argument passing, and executor resolution.
Activity
  • A new ExecuteSkillScriptTool was introduced to the SkillToolset to enable agents to execute scripts.
  • The DEFAULT_SKILL_SYSTEM_INSTRUCTION was updated to provide guidance to LLMs on how to use the new script execution tool.
  • Extensive unit tests for ExecuteSkillScriptTool were added, covering scenarios such as missing parameters, skill/script not found, absence of a code executor, unsupported script types, successful execution, argument injection, script prefix stripping, executor priority, and error handling with message truncation.
  • A new sample agent (agent.py) and a comprehensive live integration test (run_live_test.py) were added to demonstrate and validate the new script execution capabilities with external skills like mcp-builder and python-helper.
  • The SkillToolset constructor was made backward compatible by introducing code_executor as a keyword-only parameter.

@adk-bot adk-bot added the tools [Component] This issue is related to tools label Feb 21, 2026
@adk-bot
Collaborator

adk-bot commented Feb 21, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

Also, this PR is a new feature, could you please associate the github issue with this PR? If there is no existing issue, could you please create one?

This information will help reviewers to review your PR more efficiently. Thanks!

Contributor

@gemini-code-assist bot left a comment


Code Review

The addition of ExecuteSkillScriptTool significantly enhances the SkillToolset by allowing agents to run scripts directly from skills. The implementation is well-structured and includes comprehensive tests. However, there are a few security and correctness issues in the script execution logic, particularly regarding shell injection vulnerabilities and argument parsing. Addressing these will make the tool more robust and secure.

Comment on lines 403 to 407
"import subprocess\n"
"_result = subprocess.run(\n"
f" ['bash', '-c', {script_src!r}"
+ (f" + ' ' + {input_args!r}" if input_args else "")
+ f"],\n"
Contributor

Severity: high (security)

The current implementation is vulnerable to shell injection because input_args is concatenated directly into the command string passed to bash -c. An LLM could provide arguments containing shell metacharacters (e.g., ;, $(...)) to execute arbitrary commands. Additionally, this approach does not correctly pass arguments as positional parameters ($1, $2, etc.) to the script. Arguments should be passed as separate elements to subprocess.run.

Suggested change

- "import subprocess\n"
- "_result = subprocess.run(\n"
- f" ['bash', '-c', {script_src!r}"
- + (f" + ' ' + {input_args!r}" if input_args else "")
- + f"],\n"
+ "import subprocess, shlex\n"
+ "_result = subprocess.run(\n"
+ f" ['bash', '-c', {script_src!r}, {script_name!r}] + shlex.split({input_args!r}),\n"

Author
Good catch — this was a real shell injection vulnerability. Fixed in 06d995e.

Applied the suggested pattern: args are now passed as separate list elements via shlex.split() rather than concatenated into the bash -c command string:

["bash", "-c", script_src, script_name] + shlex.split(input_args)

This way $0 gets the script name and $1, $2, ... get the individual arguments, with no shell metacharacter interpretation on the args.
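The `bash -c` positional-parameter behavior described above can be demonstrated directly. This is a standalone illustration of the pattern, assuming a POSIX system with `bash` on PATH; `demo.sh` is a placeholder script name.

```python
import shlex
import subprocess

# Illustration of the fix: the script source is the single `bash -c`
# argument, the script name becomes $0, and shlex-split args arrive as
# $1, $2, ... with no shell metacharacter interpretation on the args.
script_src = 'echo "argv0=$0 argv1=$1 argv2=$2"'
input_args = '--name "John Doe"'

result = subprocess.run(
    ["bash", "-c", script_src, "demo.sh"] + shlex.split(input_args),
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # argv0=demo.sh argv1=--name argv2=John Doe
```

Note that the quoted `"John Doe"` arrives as a single positional parameter, and a payload like `; rm -rf /` would be passed to the script as inert argument text rather than executed.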

"script_name": script_name,
"stdout": result.stdout,
"stderr": result.stderr,
"status": "error" if result.stderr else "success",
Contributor

Severity: medium

The status detection logic relies on result.stderr being non-empty. However, the default UnsafeLocalCodeExecutor does not capture sys.stderr. Since the generated wrapper for shell scripts (lines 412-414) prints errors to sys.stderr, these will not be captured by the executor, and the tool will return status: "success" even if the script failed. Consider using check=True in subprocess.run to raise an exception on failure, which the executor will catch and report in stderr.

Author

Agreed. Fixed in 06d995e.

Added check=True to the generated subprocess.run call. Now when a shell script exits with non-zero status, CalledProcessError is raised inside the exec'd code, which UnsafeLocalCodeExecutor captures and surfaces in result.stderr. This means the "status": "error" if result.stderr else "success" check works correctly.

Also removed the manual stderr printing from the generated wrapper since capture_output=True + the exception path now handles both cases (non-zero exit and stderr output).

Comment on lines 395 to 396
"import sys\n"
f"sys.argv = [{script_name!r}] + {input_args!r}.split()\n"
Contributor

Severity: medium

Using .split() for parsing input_args is fragile as it does not handle quoted arguments with spaces correctly (e.g., --name "John Doe"). It is recommended to use shlex.split() for robust argument parsing.

Suggested change

- "import sys\n"
- f"sys.argv = [{script_name!r}] + {input_args!r}.split()\n"
+ "import sys, shlex\n"
+ f"sys.argv = [{script_name!r}] + shlex.split({input_args!r})\n"

Author

Agreed — .split() silently breaks on quoted arguments. Fixed in 06d995e.

Both Python and shell script paths now use shlex.split() for argument parsing:

# Python scripts
sys.argv = ["script.py"] + shlex.split(input_args)

# Shell scripts
["bash", "-c", script_src, script_name] + shlex.split(input_args)

This correctly handles --name "John Doe", paths with spaces, and other shell-style quoting.
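The difference between `str.split()` and `shlex.split()` on quoted arguments is easy to show in isolation:

```python
import shlex

# Quoted arguments survive shlex.split() but break with str.split(),
# which is exactly the failure mode the review flagged.
args = '--name "John Doe" --out "My Docs/report.txt"'

print(args.split())
# ['--name', '"John', 'Doe"', '--out', '"My', 'Docs/report.txt"']

print(shlex.split(args))
# ['--name', 'John Doe', '--out', 'My Docs/report.txt']
```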

@caohy1988 caohy1988 marked this pull request as draft February 21, 2026 08:25
caohy1988 and others added 2 commits February 21, 2026 00:28
…k=True

1. Shell injection (HIGH): Use shlex.split() and pass args as separate
   list elements to subprocess.run instead of concatenating into bash -c
   command string. Args now arrive as positional params ($1, $2, ...).

2. Fragile arg parsing (MEDIUM): Replace .split() with shlex.split() for
   Python script sys.argv injection — correctly handles quoted arguments
   like --name "John Doe".

3. Status detection (MEDIUM): Add check=True to subprocess.run in shell
   wrapper so non-zero exit codes raise CalledProcessError, which the
   code executor captures as stderr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix sys.exit(0) treated as error: zero/None exit codes now return success
- Fix shell stderr discarding stdout: serialize both streams as JSON envelope
  through stdout to work around UnsafeLocalCodeExecutor losing stdout on exception
- Add early shlex.split() validation for input_args before executor dispatch
- Add script_timeout parameter to SkillToolset for subprocess.run timeout
- Reject extensionless scripts (previously treated as Python)
- Replace os.popen() with subprocess.run() in run_live_test.py
- Add UnsafeLocalCodeExecutor safety warning to sample agent
- Fix broken reference/ → references/ links in mcp-builder SKILL.md
- Remove duplicate @pytest.mark.asyncio decorator
- Add 4 integration tests using real UnsafeLocalCodeExecutor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988
Author

ExecuteSkillScriptTool — Design Details & Considerations

Overview

ExecuteSkillScriptTool enables ADK agents to execute scripts bundled inside a skill's scripts/ directory via ADK's BaseCodeExecutor infrastructure. This closes the gap between the Agent Skills spec (which defines scripts/ as an optional skill resource) and ADK's runtime capabilities.

Architecture

LLM calls execute_skill_script(skill_name, script_name, input_args)
        │
        ▼
┌─ ExecuteSkillScriptTool.run_async() ─────────────────────┐
│  1. Validate params & resolve skill/script                │
│  2. Resolve code executor (toolset → agent fallback)      │
│  3. Validate input_args via shlex.split() (early check)   │
│  4. _prepare_code() → generate Python code string         │
│  5. code_executor.execute_code(generated_code)            │
│  6. Parse result (JSON envelope for shell scripts)        │
│  7. Return {stdout, stderr, status} to LLM               │
└───────────────────────────────────────────────────────────┘

Script Type Handling

| Type   | Extension    | Execution method                                   | Timeout                | Args injection                                       |
| ------ | ------------ | -------------------------------------------------- | ---------------------- | ---------------------------------------------------- |
| Python | `.py`        | Direct `exec()` via code executor                  | No (executor-level)    | `sys.argv = [script_name] + shlex.split(input_args)` |
| Shell  | `.sh`, `.bash` | `subprocess.run(['bash', '-c', src, name] + args)` | Yes (`script_timeout`) | `shlex.split(input_args)` as list elements           |
| Other  | any          | Rejected                                           | N/A                    | N/A                                                  |

Extensionless files are rejected (not silently treated as Python) to avoid unexpected behavior.
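The dispatch logic implied by the table above can be sketched as follows. The function and constant names are illustrative assumptions, not the actual `_prepare_code()` implementation.

```python
from pathlib import PurePosixPath

# Hypothetical dispatch sketch of the script-type table; names are
# illustrative, not the PR's actual _prepare_code() code.
_PYTHON_EXTS = {".py"}
_SHELL_EXTS = {".sh", ".bash"}

def classify_script(script_name: str) -> str:
    suffix = PurePosixPath(script_name).suffix
    if suffix in _PYTHON_EXTS:
        return "python"
    if suffix in _SHELL_EXTS:
        return "shell"
    # Extensionless and unknown types are rejected, never assumed Python.
    return "UNSUPPORTED_SCRIPT_TYPE"
```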

Code Executor Resolution Chain

1. SkillToolset(code_executor=...)    ← explicit, highest priority
2. agent.code_executor                ← fallback to agent's executor
3. None → return NO_CODE_EXECUTOR     ← actionable error

This design allows a single toolset-level executor to be shared across all skills, or per-agent executors for different isolation levels.
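The resolution chain above reduces to a few lines. This sketch assumes the agent exposes a `code_executor` attribute as the PR describes; the function itself is illustrative.

```python
# Sketch of the executor resolution chain; attribute names mirror the
# PR's description but the function itself is an assumption.
def resolve_executor(toolset_executor, agent):
    if toolset_executor is not None:
        return toolset_executor   # 1. explicit toolset-level executor wins
    agent_executor = getattr(agent, "code_executor", None)
    if agent_executor is not None:
        return agent_executor     # 2. fall back to the agent's executor
    return None                   # 3. caller surfaces NO_CODE_EXECUTOR
```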

Shell Script JSON Envelope

Shell scripts face a unique challenge: UnsafeLocalCodeExecutor captures stdout via redirect_stdout(StringIO), but if the generated code raises an exception, stdout.getvalue() is never called and stdout is lost. This means a naive raise RuntimeError(stderr) approach discards any stdout the script produced.

Solution: The generated shell wrapper serializes both stdout and stderr as a JSON envelope through the single stdout channel:

# Generated code for shell scripts:
import subprocess, shlex, json as _json
try:
    _r = subprocess.run(
        ['bash', '-c', SCRIPT_SRC, SCRIPT_NAME] + shlex.split(INPUT_ARGS),
        capture_output=True, text=True,
        timeout=SCRIPT_TIMEOUT,
    )
    print(_json.dumps({
        '__shell_result__': True,
        'stdout': _r.stdout,
        'stderr': _r.stderr,
        'returncode': _r.returncode,
    }))
except subprocess.TimeoutExpired as _e:
    print(_json.dumps({
        '__shell_result__': True,
        'stdout': _e.stdout or '',
        'stderr': 'Timed out after Ns',
        'returncode': -1,
    }))

run_async() then parses this JSON envelope (only for shell scripts, keyed on __shell_result__) to extract both streams and the return code. This works reliably with both UnsafeLocalCodeExecutor and container-based executors.
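The envelope-parsing side can be sketched like this. It is an illustrative reconstruction of what `run_async()` does, not the actual code; details such as multi-line stdout handling may differ.

```python
import json

# Illustrative parser for the shell JSON envelope; the real run_async()
# logic may differ in detail.
def parse_shell_envelope(raw_stdout: str):
    # Scan lines from the end: the envelope is the last thing printed.
    for line in reversed(raw_stdout.strip().splitlines()):
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and payload.get("__shell_result__"):
            return payload["stdout"], payload["stderr"], payload["returncode"]
    # Not an envelope (e.g. a Python script): treat as plain stdout.
    return raw_stdout, "", 0
```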

Three-State Status Model

| Status  | Condition                                                      |
| ------- | -------------------------------------------------------------- |
| success | No stderr                                                       |
| warning | Both stdout and stderr present (e.g., deprecation warnings)     |
| error   | Stderr only (no stdout), or non-zero returncode with no stdout  |
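The three-state model reads naturally as a small classifier. This is a sketch derived from the table; how a non-zero exit combined with stdout is treated is an assumption, not verified against the PR's code.

```python
# Sketch of the three-state status model; the handling of a non-zero
# returncode when stdout is present is an assumption from the table.
def classify_status(stdout: str, stderr: str, returncode: int = 0) -> str:
    if stderr and stdout:
        return "warning"   # partial success, e.g. deprecation warnings
    if stderr or (returncode != 0 and not stdout):
        return "error"
    return "success"
```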

Security Considerations

Shell injection prevention:

  • input_args are parsed via shlex.split() and passed as separate list elements to subprocess.run, never interpolated into the shell command string
  • The script source goes to bash -c as a single argument; positional parameters ($0, $1, ...) receive the args
  • input_args validation happens eagerly before executor dispatch to fail fast on malformed input (e.g., unclosed quotes)
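The eager validation step can be sketched as a thin wrapper over `shlex.split()`; the helper name and error wrapping are illustrative assumptions.

```python
import shlex

# Eager validation sketch: malformed quoting fails before any executor
# dispatch, mapping to the INVALID_INPUT_ARGS error code. The helper
# name is illustrative.
def validate_input_args(input_args: str):
    try:
        return shlex.split(input_args)
    except ValueError as exc:  # e.g. "No closing quotation"
        raise ValueError(f"INVALID_INPUT_ARGS: {exc}") from exc
```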

SystemExit handling:

  • sys.exit() raises SystemExit, a BaseException subclass that is NOT caught by except Exception in executors
  • run_async() explicitly catches SystemExit to prevent skill scripts from terminating the host process
  • sys.exit(0) and sys.exit(None) are treated as successful termination
  • Non-zero exit codes return EXECUTION_ERROR
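The SystemExit guard described above can be sketched in a few lines. This is a minimal illustration of the pattern, not the PR's actual `run_async()` code.

```python
# Minimal sketch of the SystemExit guard: exec()'d skill code must not
# be able to terminate the host process. Illustrative, not actual code.
def run_guarded(source: str) -> str:
    try:
        exec(source, {})
    except SystemExit as exc:
        # sys.exit() / sys.exit(0) / sys.exit(None) count as success.
        if exc.code in (None, 0):
            return "success"
        return "EXECUTION_ERROR"  # sys.exit(N), N != 0
    except Exception:
        return "EXECUTION_ERROR"  # ordinary runtime errors
    return "success"
```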

Executor security:

  • UnsafeLocalCodeExecutor runs code in the host process via exec() — suitable only for trusted, first-party skills
  • For third-party or untrusted skills, a sandboxed executor (e.g., ContainerCodeExecutor) should be used
  • The sample agent includes explicit warnings about this

Known Limitations & Future Work

  1. No timeout for Python scripts: exec() provides no built-in timeout mechanism. A malicious/buggy Python script can hang indefinitely. This is an executor-level concern — solving it properly requires running Python scripts in a subprocess or implementing executor-level cancellation.

  2. Python script stdout lost on exception: When a Python script writes to stdout and then raises, UnsafeLocalCodeExecutor loses the stdout (same root cause as the shell fix). This is less critical for Python since exceptions are the natural error mechanism, but could be improved at the executor level.

  3. input_args is a string, not an array: The LLM passes arguments as a space-separated string, which is then parsed by shlex.split(). An array type would be more explicit but adds complexity for the LLM to construct correctly. The current string approach matches how humans think about CLI arguments.

Error Codes Reference

| Error code              | Meaning                                        |
| ----------------------- | ---------------------------------------------- |
| MISSING_SKILL_NAME      | skill_name parameter not provided              |
| MISSING_SCRIPT_NAME     | script_name parameter not provided             |
| SKILL_NOT_FOUND         | No skill with that name registered             |
| SCRIPT_NOT_FOUND        | No script with that name in the skill          |
| NO_CODE_EXECUTOR        | No code executor configured (toolset or agent) |
| UNSUPPORTED_SCRIPT_TYPE | File extension not .py, .sh, or .bash          |
| INVALID_INPUT_ARGS      | input_args failed shlex.split() validation     |
| EXECUTION_ERROR         | Runtime error, non-zero exit, or sys.exit(N)   |

Test Coverage

  • 44 unit tests with mocked executor covering all error paths, script types, argument injection, executor resolution, status model, SystemExit handling
  • 4 integration tests using real UnsafeLocalCodeExecutor verifying: Python stdout capture, sys.exit(0) as success, shell stdout+stderr preservation via JSON envelope, shell stderr-only error status
  • 14 compatibility tests against Anthropic's mcp-builder skill structure
  • 15-step live integration test verified passing with Gemini 2.5 Flash

caohy1988 and others added 20 commits February 21, 2026 18:48
Design doc covering three high-priority improvements to the ADK
code executor infrastructure:

1. Uniform timeout support across all executors (currently only
   GkeCodeExecutor has it)
2. Stateful Python execution in ContainerCodeExecutor (currently
   frozen to stateful=False)
3. Security hardening with a new LocalSandboxCodeExecutor and
   UnsafeLocalCodeExecutor restrictions

Also removes dead default parameter from _prepare_code(timeout=)
since timeout is now read directly from self._toolset._script_timeout
inside the shell branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… doc

Add Non-Goals & Invariants section, fix append-before-execute state
poisoning, redesign timeout as per-invocation field on CodeExecutionInput,
use Docker exec kill instead of thread+join for container timeout,
replace preexec_fn with process_group, pin default container image digest,
reframe restricted builtins as best-effort friction, document execution_id
gap in ExecuteSkillScriptTool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, timeout API

- Fix execution_id to use session-stable key (session.id), not
  per-turn invocation_id
- Fix container timeout kill: document PID namespace mismatch, use
  pkill -f inside container instead of host-PID kill
- Fix Python version: ADK minimum is >=3.10 (pyproject.toml), not
  3.11; gate process_group with version check, preexec_fn fallback
- Fix LocalSandboxCodeExecutor to use default_timeout_seconds
  (consistent with BaseCodeExecutor API)
- Replace all `or` and `??` timeout resolution with explicit
  `is not None` checks to allow timeout=0
- Fix test strategy to document missing ContainerCodeExecutor tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Redesign container timeout: run exec_start in thread + join(timeout),
  kill via host-side os.kill(host_pid) instead of in-container pkill.
  Eliminates blocking exec_start DoS and overbroad pkill -f pattern.
- Remove contradictory "RECOMMENDED" label from Option C; Option A
  (persistent process) is the single recommended approach.
- Remove Windows implementation guidance (contradicts Non-Goals §2.4);
  raise NotImplementedError instead.
- Guard resource import in LocalSandboxCodeExecutor wrapper script
  for platforms where resource module is unavailable.
- Fix remote executor timeout to use synchronous thread+join, not
  asyncio.wait_for (execute_code APIs are synchronous).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…k gaps

- Roadmap Phase 2 now targets Option A (persistent process) directly
  instead of cumulative replay (Option C) first
- Unify recovery policy: timeout and crash both return error indicating
  state loss, no automatic replay (consistent across §4.2.3 and §9)
- Add Windows NotImplementedError gate to LocalSandboxCodeExecutor
- Guard resource import in 3.10 preexec_fn fallback path
- Document that PermissionError → container.restart() is the primary
  timeout path in practice (container runs as root, ADK as non-root)
- Add explicit test cases for PermissionError fallback and kill paths
- Update open question #1 to focus on I/O boundary protocol design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- PermissionError from os.kill now triggers container.restart() instead
  of being silently swallowed (fixes timeout DoS for the common case
  where container runs as root and ADK does not)
- ProcessLookupError remains the only silently-ignored exception
  (process already exited, no action needed)
- Non-Goals §2.2 updated to describe persistent REPL crash recovery
  instead of stale cumulative-replay language

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d join

- If both os.kill and container.restart() fail, set _healthy=False
  and return distinct "cleanup failed" error instead of silently
  returning a normal timeout result
- Add thread.join(timeout=2) after kill/restart to prevent daemon
  thread leaks on repeated timeout failures
- Log warning if worker thread is still alive after post-kill join
- Add test cases for total cleanup failure and thread leak scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _healthy check at execute_code entry: raise RuntimeError if
  executor is unhealthy from a prior cleanup failure
- Add python3 --version readiness check after container.restart(),
  mirroring the init-time validation pattern
- Update roadmap Phase 1 step 4 to include _healthy guard,
  post-restart validation, and post-kill thread join

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define the recovery path for unhealthy executors: reinitialize()
stops the container, creates a new one, validates readiness, and
sets _healthy=True. Referenced in error message, _healthy lifecycle,
and roadmap Phase 1 step 5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-zero exit from python3 --version does not raise in Docker SDK,
so check exit_code explicitly and raise to trigger the _healthy=False
path, mirroring container_code_executor.py:169.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a complete SkillsBench benchmark runner with Docker-based execution
and real pytest scoring, matching the official SkillsBench methodology.

Key components:
- TaskContainerExecutor: BaseCodeExecutor subclass that runs code inside
  per-task Docker containers built from each task's Dockerfile
- ContainerBashTool: Bash tool enabling the agent to read/write files
  and run shell commands inside the task container
- Lenient skill loader: Handles malformed SKILL.md frontmatter (invalid
  names, list-typed allowed-tools, dict-typed compatibility/metadata)
  loading 226/232 skills across all 87 tasks
- Real pytest scoring: Copies tests into the container, runs test.sh,
  reads reward.txt for binary pass/fail — matching SkillsBench scoring
- Per-task config from task.toml (agent/build/verifier timeouts)
- Exponential backoff retry for 429/503 API errors
- CLI with --filter, --rebuild, --build-only, --skip-tests flags

Also includes the standalone eval runner, agent definition, metrics,
8 bundled skills, and evaluation set for local development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap all Docker exec_run calls in a ThreadPoolExecutor with a 300s
default timeout. This prevents the asyncio event loop from blocking
indefinitely when a script runs forever inside the container, which
was causing the full evaluation to hang on long-running tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
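The wrapping described above can be sketched with the stdlib alone. This is a minimal synchronous sketch of the pattern, with invented names; in an async context the returned future would instead be awaited (e.g. via `loop.run_in_executor`) so the event loop stays free while the container call blocks.

```python
import concurrent.futures

# Shared pool for blocking Docker calls; size is an illustrative choice.
_EXEC_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def run_with_timeout(fn, *args, timeout=300.0):
    """Run a blocking call (e.g. a Docker exec_run) in a worker thread.

    future.result(timeout=...) raises concurrent.futures.TimeoutError if
    the call does not finish in time, so a script that runs forever inside
    the container can no longer hang the whole evaluation.
    """
    future = _EXEC_POOL.submit(fn, *args)
    return future.result(timeout=timeout)
```

Note the worker thread itself keeps running after a timeout; that is exactly why the later commits add the unhealthy-guard pattern on top.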
Refactor code execution to use a temporary directory for input files and change the working directory as needed.

This document outlines the design for integrating pre-registered function execution and script execution within ADK's non-bash-based skills toolset. It details the motivation, the proposal, how ADK tools are enabled, and the architecture of the RunSkillScriptTool.
caohy1988 and others added 8 commits February 23, 2026 22:59
- full_runner.py: Add retry loop (max 2 retries) with Docker prune
  between attempts, forcerm=True for intermediate containers, periodic
  prune every N tasks (--prune-interval flag), and build summary stats
- test_skill_toolset.py: Add 8 tests covering shell JSON envelope
  parsing, non-zero returncode, TimeoutExpired path, input_files
  packaging with correct paths/content, and working_dir validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix system instruction referencing 'execute_skill_script' instead of
  actual tool name 'run_skill_script'
- Use 'is not None' checks when packaging input_files so empty files
  (zero-byte configs, sentinels) are mounted instead of silently dropped
- Add defensive validation for 'args' parameter type, returning clear
  INVALID_ARGS_TYPE error instead of crashing on non-dict input
- Add tests for all three findings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
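The `is not None` fix above is easy to get wrong because `b""` is falsy. A minimal sketch of the corrected packaging logic (function and argument names are illustrative):

```python
def package_input_files(resources):
    """Select which skill resources to mount into the sandbox.

    A truthiness check (`if content:`) would silently drop zero-byte
    files such as empty configs or sentinel files; only an absent value
    (None) should be skipped.
    """
    mounted = []
    for path, content in resources.items():
        if content is not None:  # b"" is falsy but must still be mounted
            mounted.append((path, content))
    return mounted
```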
Remove Feature B (additional_tools parameter and allowed_tools dynamic
resolution in get_tools()) to keep the PR focused on script execution.
Remove global _execution_lock from UnsafeLocalCodeExecutor as it was an
arbitrary side-effect unrelated to the script execution feature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
os.chdir() and redirect_stdout() mutate process-global state, so
concurrent execute_code() calls without synchronization cause stdout
bleed and cwd contamination between executions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
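The hazard named above, and the lock that resolves it, can be sketched as follows. This is a simplified stand-in for the executor's behavior, not its real code: both `os.chdir()` and `contextlib.redirect_stdout()` mutate process-global state, so every execution must hold the same lock.

```python
import contextlib
import io
import os
import threading

_execution_lock = threading.Lock()

def run_serialized(code, workdir=None):
    """Execute `code` with stdout captured, serialized by a global lock.

    Without the lock, two concurrent calls could interleave: one call's
    redirect_stdout would capture the other's prints (stdout bleed), and
    a chdir from one call would leak into the other (cwd contamination).
    """
    with _execution_lock:
        buf = io.StringIO()
        prev_cwd = os.getcwd()
        try:
            if workdir is not None:
                os.chdir(workdir)
            with contextlib.redirect_stdout(buf):
                exec(code, {})
        finally:
            os.chdir(prev_cwd)  # always restore the process-wide cwd
        return buf.getvalue()
```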
Preserve original bare exec() path for existing callers that pass no
input_files and no working_dir. The temp directory, file mounting, chdir,
and execution lock now only activate when needed by the new script
execution feature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
redirect_stdout mutates process-global sys.stdout, so both the sandbox
path (tempdir + chdir) and the plain exec() path must be serialized.
The previous commit only locked the sandbox branch, allowing concurrent
plain calls to bleed stdout with sandbox calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs 50 iterations of concurrent sandbox (with input_files + working_dir)
and plain (bare exec) calls to verify no stdout bleed between them.
Protects the execution lock fix from future regressions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strengthen the mixed-concurrency test to also verify that the plain
execution path observes the original working directory, not a sandbox
temp dir leaked from a concurrent sandbox call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988 (Author)

Thanks for the review! I went through every behavior change in this PR against main to validate your concern. You were right — there were arbitrary changes that didn't belong here. Here's what I did and what remains:

Already removed

  • additional_tools parameter + allowed_tools resolution in get_tools() — This was the main offender. get_tools() had ~70 lines dynamically resolving built-in ADK tools from skill frontmatter. Removed entirely; get_tools() is back to a simple return self._tools. This should be a separate feature/PR if needed.

Remaining changes — all necessary for script execution

| # | Change | Backward compatible? | Why needed |
|---|--------|----------------------|------------|
| 1 | RunSkillScriptTool always in get_tools() (4 tools instead of 3) | N/A — new feature | Core feature of this PR. Returns NO_CODE_EXECUTOR gracefully if no executor is configured. Skills are designed to have scripts, so the tool should always be available. |
| 2 | System instruction line 4 | Yes — additive | Model needs to know run_skill_script exists. |
| 3 | SkillToolset.__init__ adds code_executor and script_timeout kwargs | Yes — both optional with defaults | RunSkillScriptTool needs an executor to run against. |
| 4 | File.path field | Yes — defaults to None | Scripts need to be mounted at relative paths (e.g., references/data.csv), not flat basenames. |
| 5 | CodeExecutionInput.working_dir field | Yes — defaults to None | Executor needs to know where to chdir inside the temp directory. |
| 6 | UnsafeLocalCodeExecutor sandbox path | Yes — gated by if input_files or working_dir | Temp dir + file mounting + chdir only activate when the new fields are used. Existing callers (no files, no working_dir) hit the original bare exec() path unchanged. Both paths share an _execution_lock since redirect_stdout mutates process-global sys.stdout. |
| 7 | VertexAiCodeExecutor working_dir + f.path | Yes — no-op when None, falls back to f.name | Both executors should honor the same contract. |

Happy to discuss further — let me know if any of these still feel out of scope.

caohy1988 and others added 15 commits February 24, 2026 00:23
Review the design doc against the actual RunSkillScriptTool and
executor implementations, fixing factual inaccuracies, missing
details, and underspecified areas:

- §3.1: UnsafeLocalCodeExecutor isolation is partial (temp-dir sandbox),
  not "None"; add BuiltInCodeExecutor clarification
- §3.2: Add optimize_data_file and actual default delimiter values
- §3.3: Document binary content handling and _prepare_globals helper
- §3.4: Document asyncio.to_thread() wrapping (not direct call)
- §3.4.1 (new): Add RunSkillScriptTool implementation details:
  executor resolution chain, script_timeout param (shell-only),
  scripts/ prefix normalization, resource packaging (incl. empty
  files), SystemExit handling, error truncation >200 chars, shell
  JSON envelope edge cases, duplicate skill name validation, and
  process_llm_request system instruction injection
- §4.2.2: Add _execution_lock interaction analysis for timeout thread
- §4.4: Expand Python timeout gap (zero protection at any layer)
- §6.2: Update threat table with current mitigations
- §7.1: Add missing fields to combined API changes snippet
- §7.3: Expand impact table with output_files, system instructions,
  error truncation rows
- §7.5.1: Document status derivation asymmetry (shell vs Python)
- §7.5.4: Document current str(v) behavior before proposed rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5-page RFC targeting ADK TL/UTL audience, covering the two P0 gaps
that block production use of RunSkillScriptTool:

- P0-A: Uniform timeout support — add timeout_seconds to
  CodeExecutionInput and default_timeout_seconds to BaseCodeExecutor,
  with per-executor implementations (thread+join for local, Docker
  exec kill for container).
- P0-B: LocalSandboxCodeExecutor — new stdlib-only executor using
  subprocess with resource limits, replacing UnsafeLocalCodeExecutor
  as the recommended local default.

Includes alternatives considered, phased rollout plan, success
metrics, and execution flow appendix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. High: UnsafeLocalCodeExecutor timeout — replace naive lock-release
   with unhealthy-guard pattern. On timeout, mark executor unhealthy
   and fail fast until reinitialize(), preventing cross-execution
   stdout/cwd contamination from lingering daemon threads.

2. High: LocalSandboxCodeExecutor — replace subprocess.run(timeout)
   with Popen + os.killpg(SIGKILL) to kill entire process group on
   timeout. Adjust threat table: fork bomb protection is partial
   (no RLIMIT_NPROC), child-proc infinite loops are covered.

3. Medium: Add AgentEngineSandboxCodeExecutor to timeout landscape
   table (opaque Vertex AI Sandbox internal timeout).

4. Medium: Rephrase "de facto default" claim — UnsafeLocalCodeExecutor
   is common in samples/dev setups, not a global LlmAgent default.

5. Medium: Soften "drop-in replacement" exit criterion — define
   supported script profile (stdout/stderr, sandbox-local files,
   explicit env var allowlist) and compatibility envelope.

6. Low: Make the test line count approximate (~1170 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
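Item 2 above (Popen + os.killpg instead of subprocess.run(timeout)) can be sketched like this. POSIX-only sketch with invented names: `subprocess.run(timeout=...)` kills only the direct child, so a grandchild spawned by the script survives; starting the child in its own session lets us SIGKILL the whole process group.

```python
import os
import signal
import subprocess

def run_with_group_kill(argv, timeout):
    """Run argv in a fresh process group; on timeout, kill the group.

    start_new_session=True makes the child a session (and process-group)
    leader, so os.killpg reaches every descendant, closing the gap where
    child processes of a timed-out script linger after subprocess.run.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return proc.returncode, out, err
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()  # reap the killed child before propagating
        raise
```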
Fix 3 High, 3 Medium, 1 Low severity issues:
- LocalSandboxCodeExecutor: materialize input_files, honor working_dir,
  use Popen+killpg for process-group cleanup on timeout
- UnsafeLocalCodeExecutor: replace lock-release pattern with unhealthy-guard
- Stateful execution: split Option A (final target) from Option C (interim)
  with acceptance criteria and deprecation plan
- Container security: make network/read-only defaults opt-in first
- Backward compat: clarify "Breaking + Minor" with explicit versioning policy
- Open Question 1: align phase reference to Phase 4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- High: Align timeout/lock model — lock releases normally via `with`,
  unhealthy flag prevents new executions (consistent across both docs)
- High: Unify Python 3.10 fallback — preexec_fn calls os.setpgrp()
  AND sets resource limits in both design doc and RFC
- Medium: Phase 3 roadmap now says opt-in for container hardening,
  matching §6.3.2-C graduation plan; default flip deferred to later
- Medium: Fix duplicate §5.2.4 numbering (→5.2.5), deduplicate
  execution_id wiring (Phase 2 owns it, Phase 4 cross-references)
- Low: Add scope note linking RFC (8-11d, P0 only) to full design
  doc (15-22d, P0-P2)
- Low: RFC threat table specifies temp-dir condition; lock impact
  clarified as process-global

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- High: Close race in unhealthy-guard — add double-check of _healthy
  after acquiring _execution_lock (both docs)
- Medium: Normalize 3.10 fallback to _setup_child (os.setpgrp +
  rlimits) in design doc platform section
- Low: Fix cross-reference §6.3.2-C → §6.3.3-C in roadmap
- Low: Fix RFC docstring subprocess.run → Popen + killpg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
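The double-checked unhealthy guard described above can be sketched as a small class. This is an illustrative reduction of the pattern, not the executor's real code: the fast-path check avoids lock contention, and re-checking `_healthy` after acquiring the lock closes the race where a timeout poisons state between the first check and the acquisition.

```python
import threading

class GuardedExecutor:
    """Fail fast after a timeout poisons process-global state."""

    def __init__(self):
        self._healthy = True
        self._execution_lock = threading.Lock()

    def execute(self, fn):
        if not self._healthy:  # fast path, no lock held
            raise RuntimeError("executor unhealthy; call reinitialize()")
        with self._execution_lock:
            if not self._healthy:  # double-check under the lock
                raise RuntimeError("executor unhealthy; call reinitialize()")
            try:
                return fn()
            except TimeoutError:
                # A lingering daemon thread may still mutate stdout/cwd,
                # so refuse further executions until reinitialized.
                self._healthy = False
                raise

    def reinitialize(self):
        with self._execution_lock:
            self._healthy = True
```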
- Medium: Reconcile breaking default-flips with SemVer — pre-1.0 may
  ship in minor with DeprecationWarning; post-1.0 reserve for major
- Low: Add missing `import os` in LocalSandboxCodeExecutor snippet
- Low: Specify reinitialize() failure mode — stays unhealthy and
  raises RuntimeError if daemon thread does not exit within grace
  period (both docs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change "future minor" to "future release: minor pre-1.0 / major
post-1.0" to match the policy stated in §6.3.3-C.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reusable end-to-end evaluation framework for any BigQuery skill or
agent built with ADK BigQueryToolset. Mirrors SkillsBench architecture
with BigQuery-specific metrics and public dataset eval cases.

Contents:
- agent.py: Root agent with BigQueryToolset (read-only, env-configurable)
- metrics.py: 3 custom metrics (schema_discovery, tool_usage, output_correctness)
- runner.py: Standalone runner with filter, dry-run, multi-run support
- eval_sets/bigquerybench_eval.json: 7 eval cases across 3 tiers
  (schema exploration, SQL generation, multi-step analysis)
- README.md: Full pipeline docs with "add new eval case" guide and
  AI/ML tool templates (forecast, detect_anomalies, analyze_contribution)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…evaluation

Adds a detailed "Complete Walkthrough" section to the BigQueryBench
README with a concrete cluster_data (K-Means) tool example covering:

- Step 1: Register tool in BigQueryToolset (auto-discovered)
- Step 2: Evaluate whether existing metrics suffice
- Step 2b: Adding a custom metric (clustering_quality_score example)
- Step 3: Pick a public dataset (ml_datasets.penguins)
- Step 4: Write the eval case JSON with structural reference lines
- Step 5: Configure write mode for ML tools
- Step 6: Validate with dry-run and multi-run
- Step 7: Commit

Includes a summary table showing which files to touch per scenario
(new eval case only, new tool with existing metrics, new tool needing
custom metric, agent config change).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace 3-metric response-text-matching approach with 2 trace-only
metrics that are deterministic and easy to maintain:

- tool_invocation_score: all expected tool names appear in trace
- tool_args_score: all expected (tool, project/dataset/table) pairs
  appear in trace

Key changes:
- metrics.py: Replace schema_discovery + tool_usage +
  output_correctness with tool_invocation + tool_args
- eval JSON: Remove final_response fields — only tool_uses matter
- runner.py: Print tool-call trace for debugging, simplified output
- README.md: Rewritten for trace-based approach with concrete
  cluster_data walkthrough showing JSON-only eval case addition

Adding a new eval case now requires zero code changes — just add
a JSON object specifying the expected tool calls and their key args.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
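A trace-only metric of the kind described above is deterministic because it never inspects response text. A minimal sketch, assuming a trace represented as a list of dicts with a "tool" key (the real metric's data shapes may differ):

```python
def tool_invocation_score(expected_tools, trace):
    """Fraction of expected tool names that appear anywhere in the trace.

    1.0 means every expected tool was called at least once; no LLM or
    text matching is involved, so reruns give identical scores.
    """
    if not expected_tools:
        return 1.0
    called = {call["tool"] for call in trace}
    hits = sum(1 for name in expected_tools if name in called)
    return hits / len(expected_tools)
```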
…adherence evaluation

Three-metric evaluation pipeline:
1. tool_invocation_score — trace-based, verifies correct skill tools called
2. tool_args_score — trace-based, verifies correct skill/resource/script args
3. instruction_adherence_score — LLM-as-judge, checks natural-language rubrics

Includes bq-sql-analyst skill, 5 eval cases with 12 rubrics total, and
14 unit tests covering all metrics (mocked LLM for judge tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ptional BigQuery toolset

- Agent model uses Vertex AI API key (GOOGLE_CLOUD_API_KEY) via
  custom _VertexGemini subclass with HttpRetryOptions (5 attempts,
  2x exponential backoff)
- LLM judge uses vertexai=True + api_key with manual retry backoff
  on 429 rate limits (5 attempts, 2s→4s→8s→16s)
- BigQuery toolset is now optional — gracefully skipped when GCP
  project/ADC is unavailable, enabling skill-only evaluation
- Agent instruction adapts to available tools (skill-only vs full)
- Updated eval case skill_query_with_reference to be skill-only
- Updated test mocks for client.aio.models.generate_content() API
- E2E verified: 5/5 cases pass (100%), all metrics at 1.00

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
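The manual retry backoff described above (5 attempts, 2s→4s→8s→16s) follows a standard exponential pattern. A minimal stdlib sketch with invented names; `sleep` is injectable so the delays can be observed without waiting:

```python
import time

def with_backoff(call, is_retryable, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Retry `call` with exponential backoff on retryable errors.

    Delays double each attempt (2s, 4s, 8s, 16s by default); the last
    attempt re-raises, and non-retryable errors propagate immediately.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            sleep(base_delay * (2 ** attempt))
```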
Labels

tools [Component] This issue is related to tools