

feat: Add ExecuteSkillScriptTool for running skill scripts #4575

Draft

caohy1988 wants to merge 46 commits into google:main from caohy1988:feat/execute-skill-script

Conversation

@caohy1988

Summary

  • Adds ExecuteSkillScriptTool to SkillToolset, enabling agents to execute scripts from a skill's scripts/ directory via ADK's BaseCodeExecutor infrastructure
  • Supports Python (.py) and shell (.sh/.bash) scripts with optional input_args for parameterized execution
  • Implements executor resolution chain: toolset-level code_executor → agent fallback → actionable error
  • Truncates execution error messages to 200 chars for LLM-friendly responses (full traces logged via logger.exception)
  • Updates DEFAULT_SKILL_SYSTEM_INSTRUCTION to guide the LLM on execute_skill_script usage
  • Adds code_executor: Optional[BaseCodeExecutor] = None keyword-only parameter to SkillToolset.__init__ (backward compatible)
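The error-message truncation described above can be sketched as follows. This is a hypothetical helper, not the PR's actual code; the function name and suffix text are assumptions.

```python
# Hypothetical sketch of the 200-char truncation for LLM-facing errors;
# the name and suffix wording are illustrative, not the PR's actual code.
_MAX_ERROR_LEN = 200

def truncate_error(message: str, limit: int = _MAX_ERROR_LEN) -> str:
    """Return an LLM-friendly error string, truncated to `limit` chars."""
    if len(message) <= limit:
        return message
    return message[:limit] + "... (truncated; full trace in logs)"
```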

Changes

Core implementation (src/google/adk/tools/skill_toolset.py)

  • New ExecuteSkillScriptTool class with _prepare_code() for Python/shell dispatch
  • Error codes: MISSING_SKILL_NAME, MISSING_SCRIPT_NAME, SKILL_NOT_FOUND, SCRIPT_NOT_FOUND, NO_CODE_EXECUTOR, UNSUPPORTED_SCRIPT_TYPE, EXECUTION_ERROR
  • scripts/ prefix auto-stripping for consistency with LoadSkillResourceTool path format

Unit tests (tests/unittests/tools/test_skill_toolset.py)

  • 19 new tests (37 total) covering: missing params, skill/script not found, no executor, unsupported types, Python/shell success, input_args injection, prefix stripping, executor priority, agent fallback, execution error, stderr status, error truncation

Sample agent (contributing/samples/skill_script_demo/)

  • Loads Anthropic's official mcp-builder skill (from anthropics/skills) and a python-helper skill with self-contained scripts (fibonacci, word_count, json_format)
  • 14 compatibility tests verifying all 4 tools against the real mcp-builder skill
  • Live integration test with 9 scenarios (15 steps) exercising discovery, loading, reference reading, script execution, multi-turn conversations, cross-skill routing, and negative cases — all verified passing with Gemini 2.5 Flash

Test plan

  • pytest tests/unittests/tools/test_skill_toolset.py -v — 37/37 passed
  • pytest contributing/samples/skill_script_demo/test_skill_compat.py -v — 14/14 passed
  • Live test with Gemini 2.5 Flash via Vertex AI — 15/15 passed
  • Autoformat with ./autoformat.sh
  • Backward compatible: existing SkillToolset(skills=[...]) works unchanged

🤖 Generated with Claude Code

…xecutor

Adds execute_skill_script tool to SkillToolset, enabling agents to run
scripts from a skill's scripts/ directory through ADK's BaseCodeExecutor
infrastructure. Supports Python (.py) and shell (.sh/.bash) scripts with
optional input_args, executor resolution chain (toolset → agent fallback),
and LLM-friendly error truncation.

Includes 37 unit tests, a sample agent with Anthropic's mcp-builder skill
and a python-helper skill, compatibility tests, and a live integration test
exercising all four skill tools end-to-end with Gemini.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@google-cla

google-cla bot commented Feb 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @caohy1988, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SkillToolset by integrating a new ExecuteSkillScriptTool. This tool empowers agents to directly execute Python and shell scripts residing within a skill's scripts/ directory, providing a powerful mechanism for extending agent capabilities with custom logic and automation. The implementation includes robust error handling, flexible argument passing, and a clear hierarchy for code executor selection, making it easier for agents to leverage dynamic scripting for complex tasks.

Highlights

  • New Script Execution Tool: Introduced ExecuteSkillScriptTool to enable agents to run scripts from skill scripts/ directories, supporting Python (.py) and shell (.sh, .bash) scripts with optional input_args.
  • Code Executor Resolution: Implemented a flexible code executor resolution chain where a toolset-level executor takes precedence, falling back to the agent's executor if not explicitly provided.
  • Enhanced Error Handling: Improved error handling for script execution, including specific error codes and truncation of long error messages to ensure LLM-friendly responses.
  • System Instruction Update: Updated the DEFAULT_SKILL_SYSTEM_INSTRUCTION to guide LLMs on the proper usage of the new execute_skill_script tool.
  • Backward Compatibility: Ensured backward compatibility by adding code_executor as a keyword-only parameter to the SkillToolset.__init__ method.
  • Comprehensive Demo and Tests: Added a new sample agent, a live integration test, and extensive unit tests to demonstrate and validate the new script execution capabilities across various scenarios and external skills.


Changelog
  • contributing/samples/skill_script_demo/__init__.py
    • Initialized the skill script demo package.
  • contributing/samples/skill_script_demo/agent.py
    • Defined a sample agent for skill script demonstration.
  • contributing/samples/skill_script_demo/run_live_test.py
    • Added a live integration test for skill script execution.
  • contributing/samples/skill_script_demo/skills/mcp-builder/LICENSE.txt
    • Added the Apache License 2.0 for the mcp-builder skill.
  • contributing/samples/skill_script_demo/skills/mcp-builder/SKILL.md
    • Documented the mcp-builder skill's purpose and workflow.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/evaluation.md
    • Provided a guide for creating MCP server evaluations.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/mcp_best_practices.md
    • Documented best practices for MCP server development.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/node_mcp_server.md
    • Provided a Node/TypeScript MCP server implementation guide.
  • contributing/samples/skill_script_demo/skills/mcp-builder/references/python_mcp_server.md
    • Provided a Python MCP server implementation guide.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/connections.py
    • Implemented lightweight connection handling for MCP servers.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/evaluation.py
    • Added an MCP server evaluation harness script.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/example_evaluation.xml
    • Added an example XML file for MCP server evaluations.
  • contributing/samples/skill_script_demo/skills/mcp-builder/scripts/requirements.txt
    • Listed Python dependencies for the mcp-builder evaluation script.
  • contributing/samples/skill_script_demo/skills/python-helper/SKILL.md
    • Documented the python-helper skill and its scripts.
  • contributing/samples/skill_script_demo/skills/python-helper/references/usage.md
    • Provided a usage guide for python-helper scripts.
  • contributing/samples/skill_script_demo/skills/python-helper/scripts/fibonacci.py
    • Added a Python script to generate Fibonacci sequences.
  • contributing/samples/skill_script_demo/skills/python-helper/scripts/json_format.py
    • Added a Python script to format and validate JSON.
  • contributing/samples/skill_script_demo/skills/python-helper/scripts/word_count.py
    • Added a Python script to perform word frequency analysis.
  • contributing/samples/skill_script_demo/test_skill_compat.py
    • Added compatibility tests for the mcp-builder skill.
  • src/google/adk/tools/skill_toolset.py
    • Implemented ExecuteSkillScriptTool for script execution within skills.
    • Added code preparation logic for Python and shell scripts.
    • Introduced error handling, including specific error codes and message truncation for LLMs.
    • Modified SkillToolset constructor to accept an optional code_executor parameter.
    • Updated DEFAULT_SKILL_SYSTEM_INSTRUCTION to include guidance on execute_skill_script usage.
  • tests/unittests/tools/test_skill_toolset.py
    • Added unit tests for the new ExecuteSkillScriptTool functionality.
    • Expanded test coverage for various error conditions, script types, argument passing, and executor resolution.
Activity
  • A new ExecuteSkillScriptTool was introduced to the SkillToolset to enable agents to execute scripts.
  • The DEFAULT_SKILL_SYSTEM_INSTRUCTION was updated to provide guidance to LLMs on how to use the new script execution tool.
  • Extensive unit tests for ExecuteSkillScriptTool were added, covering scenarios such as missing parameters, skill/script not found, absence of a code executor, unsupported script types, successful execution, argument injection, script prefix stripping, executor priority, and error handling with message truncation.
  • A new sample agent (agent.py) and a comprehensive live integration test (run_live_test.py) were added to demonstrate and validate the new script execution capabilities with external skills like mcp-builder and python-helper.
  • The SkillToolset constructor was made backward compatible by introducing code_executor as a keyword-only parameter.

@adk-bot adk-bot added the tools [Component] This issue is related to tools label Feb 21, 2026
@adk-bot
Collaborator

adk-bot commented Feb 21, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

Also, this PR is a new feature, could you please associate the github issue with this PR? If there is no existing issue, could you please create one?

This information will help reviewers to review your PR more efficiently. Thanks!

Contributor

@gemini-code-assist bot left a comment


Code Review

The addition of ExecuteSkillScriptTool significantly enhances the SkillToolset by allowing agents to run scripts directly from skills. The implementation is well-structured and includes comprehensive tests. However, there are a few security and correctness issues in the script execution logic, particularly regarding shell injection vulnerabilities and argument parsing. Addressing these will make the tool more robust and secure.

Comment on lines 403 to 407
"import subprocess\n"
"_result = subprocess.run(\n"
f" ['bash', '-c', {script_src!r}"
+ (f" + ' ' + {input_args!r}" if input_args else "")
+ f"],\n"
Contributor

Severity: high (security)

The current implementation is vulnerable to shell injection because input_args is concatenated directly into the command string passed to bash -c. An LLM could provide arguments containing shell metacharacters (e.g., ;, $(...)) to execute arbitrary commands. Additionally, this approach does not correctly pass arguments as positional parameters ($1, $2, etc.) to the script. Arguments should be passed as separate elements to subprocess.run.

Suggested change

- "import subprocess\n"
- "_result = subprocess.run(\n"
- f" ['bash', '-c', {script_src!r}"
- + (f" + ' ' + {input_args!r}" if input_args else "")
- + f"],\n"
+ "import subprocess, shlex\n"
+ "_result = subprocess.run(\n"
+ f" ['bash', '-c', {script_src!r}, {script_name!r}] + shlex.split({input_args!r}),\n"

Author
Good catch — this was a real shell injection vulnerability. Fixed in 06d995e.

Applied the suggested pattern: args are now passed as separate list elements via shlex.split() rather than concatenated into the bash -c command string:

["bash", "-c", script_src, script_name] + shlex.split(input_args)

This way $0 gets the script name and $1, $2, ... get the individual arguments, with no shell metacharacter interpretation on the args.
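The `bash -c` positional-parameter behavior described above can be demonstrated directly. This is a standalone illustration of the pattern, assuming a POSIX system with `bash` on PATH; `demo.sh` is a placeholder script name.

```python
import shlex
import subprocess

# Illustration of the fix: the script source is the single `bash -c`
# argument, the script name becomes $0, and shlex-split args arrive as
# $1, $2, ... with no shell metacharacter interpretation on the args.
script_src = 'echo "argv0=$0 argv1=$1 argv2=$2"'
input_args = '--name "John Doe"'

result = subprocess.run(
    ["bash", "-c", script_src, "demo.sh"] + shlex.split(input_args),
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # argv0=demo.sh argv1=--name argv2=John Doe
```

Note that the quoted `"John Doe"` arrives as a single positional parameter, and a payload like `; rm -rf /` would be passed to the script as inert argument text rather than executed.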

"script_name": script_name,
"stdout": result.stdout,
"stderr": result.stderr,
"status": "error" if result.stderr else "success",
Contributor

Severity: medium

The status detection logic relies on result.stderr being non-empty. However, the default UnsafeLocalCodeExecutor does not capture sys.stderr. Since the generated wrapper for shell scripts (lines 412-414) prints errors to sys.stderr, these will not be captured by the executor, and the tool will return status: "success" even if the script failed. Consider using check=True in subprocess.run to raise an exception on failure, which the executor will catch and report in stderr.

Author

Agreed. Fixed in 06d995e.

Added check=True to the generated subprocess.run call. Now when a shell script exits with non-zero status, CalledProcessError is raised inside the exec'd code, which UnsafeLocalCodeExecutor captures and surfaces in result.stderr. This means the "status": "error" if result.stderr else "success" check works correctly.

Also removed the manual stderr printing from the generated wrapper since capture_output=True + the exception path now handles both cases (non-zero exit and stderr output).

Comment on lines 395 to 396
"import sys\n"
f"sys.argv = [{script_name!r}] + {input_args!r}.split()\n"
Contributor

Severity: medium

Using .split() for parsing input_args is fragile as it does not handle quoted arguments with spaces correctly (e.g., --name "John Doe"). It is recommended to use shlex.split() for robust argument parsing.

Suggested change

- "import sys\n"
- f"sys.argv = [{script_name!r}] + {input_args!r}.split()\n"
+ "import sys, shlex\n"
+ f"sys.argv = [{script_name!r}] + shlex.split({input_args!r})\n"

Author

Agreed — .split() silently breaks on quoted arguments. Fixed in 06d995e.

Both Python and shell script paths now use shlex.split() for argument parsing:

# Python scripts
sys.argv = ["script.py"] + shlex.split(input_args)

# Shell scripts
["bash", "-c", script_src, script_name] + shlex.split(input_args)

This correctly handles --name "John Doe", paths with spaces, and other shell-style quoting.
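The difference between `str.split()` and `shlex.split()` on quoted arguments is easy to show in isolation:

```python
import shlex

# Quoted arguments survive shlex.split() but break with str.split(),
# which is exactly the failure mode the review flagged.
args = '--name "John Doe" --out "My Docs/report.txt"'

print(args.split())
# ['--name', '"John', 'Doe"', '--out', '"My', 'Docs/report.txt"']

print(shlex.split(args))
# ['--name', 'John Doe', '--out', 'My Docs/report.txt']
```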

@caohy1988 caohy1988 marked this pull request as draft February 21, 2026 08:25
caohy1988 and others added 2 commits February 21, 2026 00:28
…k=True

1. Shell injection (HIGH): Use shlex.split() and pass args as separate
   list elements to subprocess.run instead of concatenating into bash -c
   command string. Args now arrive as positional params ($1, $2, ...).

2. Fragile arg parsing (MEDIUM): Replace .split() with shlex.split() for
   Python script sys.argv injection — correctly handles quoted arguments
   like --name "John Doe".

3. Status detection (MEDIUM): Add check=True to subprocess.run in shell
   wrapper so non-zero exit codes raise CalledProcessError, which the
   code executor captures as stderr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix sys.exit(0) treated as error: zero/None exit codes now return success
- Fix shell stderr discarding stdout: serialize both streams as JSON envelope
  through stdout to work around UnsafeLocalCodeExecutor losing stdout on exception
- Add early shlex.split() validation for input_args before executor dispatch
- Add script_timeout parameter to SkillToolset for subprocess.run timeout
- Reject extensionless scripts (previously treated as Python)
- Replace os.popen() with subprocess.run() in run_live_test.py
- Add UnsafeLocalCodeExecutor safety warning to sample agent
- Fix broken reference/ → references/ links in mcp-builder SKILL.md
- Remove duplicate @pytest.mark.asyncio decorator
- Add 4 integration tests using real UnsafeLocalCodeExecutor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988
Author

ExecuteSkillScriptTool — Design Details & Considerations

Overview

ExecuteSkillScriptTool enables ADK agents to execute scripts bundled inside a skill's scripts/ directory via ADK's BaseCodeExecutor infrastructure. This closes the gap between the Agent Skills spec (which defines scripts/ as an optional skill resource) and ADK's runtime capabilities.

Architecture

LLM calls execute_skill_script(skill_name, script_name, input_args)
        │
        ▼
┌─ ExecuteSkillScriptTool.run_async() ─────────────────────┐
│  1. Validate params & resolve skill/script                │
│  2. Resolve code executor (toolset → agent fallback)      │
│  3. Validate input_args via shlex.split() (early check)   │
│  4. _prepare_code() → generate Python code string         │
│  5. code_executor.execute_code(generated_code)            │
│  6. Parse result (JSON envelope for shell scripts)        │
│  7. Return {stdout, stderr, status} to LLM               │
└───────────────────────────────────────────────────────────┘

Script Type Handling

| Type   | Extension    | Execution method                                   | Timeout                | Args injection                                       |
| ------ | ------------ | -------------------------------------------------- | ---------------------- | ---------------------------------------------------- |
| Python | `.py`        | Direct `exec()` via code executor                  | No (executor-level)    | `sys.argv = [script_name] + shlex.split(input_args)` |
| Shell  | `.sh`, `.bash` | `subprocess.run(['bash', '-c', src, name] + args)` | Yes (`script_timeout`) | `shlex.split(input_args)` as list elements           |
| Other  | any          | Rejected                                           | N/A                    | N/A                                                  |

Extensionless files are rejected (not silently treated as Python) to avoid unexpected behavior.
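The dispatch logic implied by the table above can be sketched as follows. The function and constant names are illustrative assumptions, not the actual `_prepare_code()` implementation.

```python
from pathlib import PurePosixPath

# Hypothetical dispatch sketch of the script-type table; names are
# illustrative, not the PR's actual _prepare_code() code.
_PYTHON_EXTS = {".py"}
_SHELL_EXTS = {".sh", ".bash"}

def classify_script(script_name: str) -> str:
    suffix = PurePosixPath(script_name).suffix
    if suffix in _PYTHON_EXTS:
        return "python"
    if suffix in _SHELL_EXTS:
        return "shell"
    # Extensionless and unknown types are rejected, never assumed Python.
    return "UNSUPPORTED_SCRIPT_TYPE"
```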

Code Executor Resolution Chain

1. SkillToolset(code_executor=...)    ← explicit, highest priority
2. agent.code_executor                ← fallback to agent's executor
3. None → return NO_CODE_EXECUTOR     ← actionable error

This design allows a single toolset-level executor to be shared across all skills, or per-agent executors for different isolation levels.
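The resolution chain above reduces to a few lines. This sketch assumes the agent exposes a `code_executor` attribute as the PR describes; the function itself is illustrative.

```python
# Sketch of the executor resolution chain; attribute names mirror the
# PR's description but the function itself is an assumption.
def resolve_executor(toolset_executor, agent):
    if toolset_executor is not None:
        return toolset_executor   # 1. explicit toolset-level executor wins
    agent_executor = getattr(agent, "code_executor", None)
    if agent_executor is not None:
        return agent_executor     # 2. fall back to the agent's executor
    return None                   # 3. caller surfaces NO_CODE_EXECUTOR
```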

Shell Script JSON Envelope

Shell scripts face a unique challenge: UnsafeLocalCodeExecutor captures stdout via redirect_stdout(StringIO), but if the generated code raises an exception, stdout.getvalue() is never called and stdout is lost. This means a naive raise RuntimeError(stderr) approach discards any stdout the script produced.

Solution: The generated shell wrapper serializes both stdout and stderr as a JSON envelope through the single stdout channel:

# Generated code for shell scripts:
import subprocess, shlex, json as _json
try:
    _r = subprocess.run(
        ['bash', '-c', SCRIPT_SRC, SCRIPT_NAME] + shlex.split(INPUT_ARGS),
        capture_output=True, text=True,
        timeout=SCRIPT_TIMEOUT,
    )
    print(_json.dumps({
        '__shell_result__': True,
        'stdout': _r.stdout,
        'stderr': _r.stderr,
        'returncode': _r.returncode,
    }))
except subprocess.TimeoutExpired as _e:
    print(_json.dumps({
        '__shell_result__': True,
        'stdout': _e.stdout or '',
        'stderr': 'Timed out after Ns',
        'returncode': -1,
    }))

run_async() then parses this JSON envelope (only for shell scripts, keyed on __shell_result__) to extract both streams and the return code. This works reliably with both UnsafeLocalCodeExecutor and container-based executors.
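The envelope-parsing side can be sketched like this. It is an illustrative reconstruction of what `run_async()` does, not the actual code; details such as multi-line stdout handling may differ.

```python
import json

# Illustrative parser for the shell JSON envelope; the real run_async()
# logic may differ in detail.
def parse_shell_envelope(raw_stdout: str):
    # Scan lines from the end: the envelope is the last thing printed.
    for line in reversed(raw_stdout.strip().splitlines()):
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and payload.get("__shell_result__"):
            return payload["stdout"], payload["stderr"], payload["returncode"]
    # Not an envelope (e.g. a Python script): treat as plain stdout.
    return raw_stdout, "", 0
```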

Three-State Status Model

| Status  | Condition                                                      |
| ------- | -------------------------------------------------------------- |
| success | No stderr                                                       |
| warning | Both stdout and stderr present (e.g., deprecation warnings)     |
| error   | Stderr only (no stdout), or non-zero returncode with no stdout  |
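The three-state model reads naturally as a small classifier. This is a sketch derived from the table; how a non-zero exit combined with stdout is treated is an assumption, not verified against the PR's code.

```python
# Sketch of the three-state status model; the handling of a non-zero
# returncode when stdout is present is an assumption from the table.
def classify_status(stdout: str, stderr: str, returncode: int = 0) -> str:
    if stderr and stdout:
        return "warning"   # partial success, e.g. deprecation warnings
    if stderr or (returncode != 0 and not stdout):
        return "error"
    return "success"
```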

Security Considerations

Shell injection prevention:

  • input_args are parsed via shlex.split() and passed as separate list elements to subprocess.run, never interpolated into the shell command string
  • The script source goes to bash -c as a single argument; positional parameters ($0, $1, ...) receive the args
  • input_args validation happens eagerly before executor dispatch to fail fast on malformed input (e.g., unclosed quotes)
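The eager validation step can be sketched as a thin wrapper over `shlex.split()`; the helper name and error wrapping are illustrative assumptions.

```python
import shlex

# Eager validation sketch: malformed quoting fails before any executor
# dispatch, mapping to the INVALID_INPUT_ARGS error code. The helper
# name is illustrative.
def validate_input_args(input_args: str):
    try:
        return shlex.split(input_args)
    except ValueError as exc:  # e.g. "No closing quotation"
        raise ValueError(f"INVALID_INPUT_ARGS: {exc}") from exc
```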

SystemExit handling:

  • sys.exit() raises SystemExit, a BaseException subclass that is NOT caught by except Exception in executors
  • run_async() explicitly catches SystemExit to prevent skill scripts from terminating the host process
  • sys.exit(0) and sys.exit(None) are treated as successful termination
  • Non-zero exit codes return EXECUTION_ERROR
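The SystemExit guard described above can be sketched in a few lines. This is a minimal illustration of the pattern, not the PR's actual `run_async()` code.

```python
# Minimal sketch of the SystemExit guard: exec()'d skill code must not
# be able to terminate the host process. Illustrative, not actual code.
def run_guarded(source: str) -> str:
    try:
        exec(source, {})
    except SystemExit as exc:
        # sys.exit() / sys.exit(0) / sys.exit(None) count as success.
        if exc.code in (None, 0):
            return "success"
        return "EXECUTION_ERROR"  # sys.exit(N), N != 0
    except Exception:
        return "EXECUTION_ERROR"  # ordinary runtime errors
    return "success"
```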

Executor security:

  • UnsafeLocalCodeExecutor runs code in the host process via exec() — suitable only for trusted, first-party skills
  • For third-party or untrusted skills, a sandboxed executor (e.g., ContainerCodeExecutor) should be used
  • The sample agent includes explicit warnings about this

Known Limitations & Future Work

  1. No timeout for Python scripts: exec() provides no built-in timeout mechanism. A malicious/buggy Python script can hang indefinitely. This is an executor-level concern — solving it properly requires running Python scripts in a subprocess or implementing executor-level cancellation.

  2. Python script stdout lost on exception: When a Python script writes to stdout and then raises, UnsafeLocalCodeExecutor loses the stdout (same root cause as the shell fix). This is less critical for Python since exceptions are the natural error mechanism, but could be improved at the executor level.

  3. input_args is a string, not an array: The LLM passes arguments as a space-separated string, which is then parsed by shlex.split(). An array type would be more explicit but adds complexity for the LLM to construct correctly. The current string approach matches how humans think about CLI arguments.

Error Codes Reference

| Error code              | Meaning                                        |
| ----------------------- | ---------------------------------------------- |
| MISSING_SKILL_NAME      | skill_name parameter not provided              |
| MISSING_SCRIPT_NAME     | script_name parameter not provided             |
| SKILL_NOT_FOUND         | No skill with that name registered             |
| SCRIPT_NOT_FOUND        | No script with that name in the skill          |
| NO_CODE_EXECUTOR        | No code executor configured (toolset or agent) |
| UNSUPPORTED_SCRIPT_TYPE | File extension not .py, .sh, or .bash          |
| INVALID_INPUT_ARGS      | input_args failed shlex.split() validation     |
| EXECUTION_ERROR         | Runtime error, non-zero exit, or sys.exit(N)   |

Test Coverage

  • 44 unit tests with mocked executor covering all error paths, script types, argument injection, executor resolution, status model, SystemExit handling
  • 4 integration tests using real UnsafeLocalCodeExecutor verifying: Python stdout capture, sys.exit(0) as success, shell stdout+stderr preservation via JSON envelope, shell stderr-only error status
  • 14 compatibility tests against Anthropic's mcp-builder skill structure
  • 15-step live integration test verified passing with Gemini 2.5 Flash

caohy1988 and others added 20 commits February 21, 2026 18:48
Design doc covering three high-priority improvements to the ADK
code executor infrastructure:

1. Uniform timeout support across all executors (currently only
   GkeCodeExecutor has it)
2. Stateful Python execution in ContainerCodeExecutor (currently
   frozen to stateful=False)
3. Security hardening with a new LocalSandboxCodeExecutor and
   UnsafeLocalCodeExecutor restrictions

Also removes dead default parameter from _prepare_code(timeout=)
since timeout is now read directly from self._toolset._script_timeout
inside the shell branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… doc

Add Non-Goals & Invariants section, fix append-before-execute state
poisoning, redesign timeout as per-invocation field on CodeExecutionInput,
use Docker exec kill instead of thread+join for container timeout,
replace preexec_fn with process_group, pin default container image digest,
reframe restricted builtins as best-effort friction, document execution_id
gap in ExecuteSkillScriptTool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, timeout API

- Fix execution_id to use session-stable key (session.id), not
  per-turn invocation_id
- Fix container timeout kill: document PID namespace mismatch, use
  pkill -f inside container instead of host-PID kill
- Fix Python version: ADK minimum is >=3.10 (pyproject.toml), not
  3.11; gate process_group with version check, preexec_fn fallback
- Fix LocalSandboxCodeExecutor to use default_timeout_seconds
  (consistent with BaseCodeExecutor API)
- Replace all `or` and `??` timeout resolution with explicit
  `is not None` checks to allow timeout=0
- Fix test strategy to document missing ContainerCodeExecutor tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Redesign container timeout: run exec_start in thread + join(timeout),
  kill via host-side os.kill(host_pid) instead of in-container pkill.
  Eliminates blocking exec_start DoS and overbroad pkill -f pattern.
- Remove contradictory "RECOMMENDED" label from Option C; Option A
  (persistent process) is the single recommended approach.
- Remove Windows implementation guidance (contradicts Non-Goals §2.4);
  raise NotImplementedError instead.
- Guard resource import in LocalSandboxCodeExecutor wrapper script
  for platforms where resource module is unavailable.
- Fix remote executor timeout to use synchronous thread+join, not
  asyncio.wait_for (execute_code APIs are synchronous).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…k gaps

- Roadmap Phase 2 now targets Option A (persistent process) directly
  instead of cumulative replay (Option C) first
- Unify recovery policy: timeout and crash both return error indicating
  state loss, no automatic replay (consistent across §4.2.3 and §9)
- Add Windows NotImplementedError gate to LocalSandboxCodeExecutor
- Guard resource import in 3.10 preexec_fn fallback path
- Document that PermissionError → container.restart() is the primary
  timeout path in practice (container runs as root, ADK as non-root)
- Add explicit test cases for PermissionError fallback and kill paths
- Update open question #1 to focus on I/O boundary protocol design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- PermissionError from os.kill now triggers container.restart() instead
  of being silently swallowed (fixes timeout DoS for the common case
  where container runs as root and ADK does not)
- ProcessLookupError remains the only silently-ignored exception
  (process already exited, no action needed)
- Non-Goals §2.2 updated to describe persistent REPL crash recovery
  instead of stale cumulative-replay language

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d join

- If both os.kill and container.restart() fail, set _healthy=False
  and return distinct "cleanup failed" error instead of silently
  returning a normal timeout result
- Add thread.join(timeout=2) after kill/restart to prevent daemon
  thread leaks on repeated timeout failures
- Log warning if worker thread is still alive after post-kill join
- Add test cases for total cleanup failure and thread leak scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _healthy check at execute_code entry: raise RuntimeError if
  executor is unhealthy from a prior cleanup failure
- Add python3 --version readiness check after container.restart(),
  mirroring the init-time validation pattern
- Update roadmap Phase 1 step 4 to include _healthy guard,
  post-restart validation, and post-kill thread join

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define the recovery path for unhealthy executors: reinitialize()
stops the container, creates a new one, validates readiness, and
sets _healthy=True. Referenced in error message, _healthy lifecycle,
and roadmap Phase 1 step 5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-zero exit from python3 --version does not raise in Docker SDK,
so check exit_code explicitly and raise to trigger the _healthy=False
path, mirroring container_code_executor.py:169.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a complete SkillsBench benchmark runner with Docker-based execution
and real pytest scoring, matching the official SkillsBench methodology.

Key components:
- TaskContainerExecutor: BaseCodeExecutor subclass that runs code inside
  per-task Docker containers built from each task's Dockerfile
- ContainerBashTool: Bash tool enabling the agent to read/write files
  and run shell commands inside the task container
- Lenient skill loader: Handles malformed SKILL.md frontmatter (invalid
  names, list-typed allowed-tools, dict-typed compatibility/metadata)
  loading 226/232 skills across all 87 tasks
- Real pytest scoring: Copies tests into the container, runs test.sh,
  reads reward.txt for binary pass/fail — matching SkillsBench scoring
- Per-task config from task.toml (agent/build/verifier timeouts)
- Exponential backoff retry for 429/503 API errors
- CLI with --filter, --rebuild, --build-only, --skip-tests flags

Also includes the standalone eval runner, agent definition, metrics,
8 bundled skills, and evaluation set for local development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap all Docker exec_run calls in a ThreadPoolExecutor with a 300s
default timeout. This prevents the asyncio event loop from blocking
indefinitely when a script runs forever inside the container, which
was causing the full evaluation to hang on long-running tasks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
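The wrapping described above can be sketched with the stdlib alone. This is a minimal synchronous sketch of the pattern, with invented names; in an async context the returned future would instead be awaited (e.g. via `loop.run_in_executor`) so the event loop stays free while the container call blocks.

```python
import concurrent.futures

# Shared pool for blocking Docker calls; size is an illustrative choice.
_EXEC_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def run_with_timeout(fn, *args, timeout=300.0):
    """Run a blocking call (e.g. a Docker exec_run) in a worker thread.

    future.result(timeout=...) raises concurrent.futures.TimeoutError if
    the call does not finish in time, so a script that runs forever inside
    the container can no longer hang the whole evaluation.
    """
    future = _EXEC_POOL.submit(fn, *args)
    return future.result(timeout=timeout)
```

Note the worker thread itself keeps running after a timeout; that is exactly why the later commits add the unhealthy-guard pattern on top.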
Refactor code execution to use a temporary directory for input files and change the working directory as needed.

This document outlines the design for integrating pre-registered function execution and script execution within ADK's non-bash-based skills toolset. It details the motivation, the proposal, how ADK tools are enabled, and the architecture of the RunSkillScriptTool.
caohy1988 and others added 8 commits February 23, 2026 22:59
- full_runner.py: Add retry loop (max 2 retries) with Docker prune
  between attempts, forcerm=True for intermediate containers, periodic
  prune every N tasks (--prune-interval flag), and build summary stats
- test_skill_toolset.py: Add 8 tests covering shell JSON envelope
  parsing, non-zero returncode, TimeoutExpired path, input_files
  packaging with correct paths/content, and working_dir validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix system instruction referencing 'execute_skill_script' instead of
  actual tool name 'run_skill_script'
- Use 'is not None' checks when packaging input_files so empty files
  (zero-byte configs, sentinels) are mounted instead of silently dropped
- Add defensive validation for 'args' parameter type, returning clear
  INVALID_ARGS_TYPE error instead of crashing on non-dict input
- Add tests for all three findings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
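The `is not None` fix above is easy to get wrong because `b""` is falsy. A minimal sketch of the corrected packaging logic (function and argument names are illustrative):

```python
def package_input_files(resources):
    """Select which skill resources to mount into the sandbox.

    A truthiness check (`if content:`) would silently drop zero-byte
    files such as empty configs or sentinel files; only an absent value
    (None) should be skipped.
    """
    mounted = []
    for path, content in resources.items():
        if content is not None:  # b"" is falsy but must still be mounted
            mounted.append((path, content))
    return mounted
```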
Remove Feature B (additional_tools parameter and allowed_tools dynamic
resolution in get_tools()) to keep the PR focused on script execution.
Remove global _execution_lock from UnsafeLocalCodeExecutor as it was an
arbitrary side-effect unrelated to the script execution feature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
os.chdir() and redirect_stdout() mutate process-global state, so
concurrent execute_code() calls without synchronization cause stdout
bleed and cwd contamination between executions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
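The hazard named above, and the lock that resolves it, can be sketched as follows. This is a simplified stand-in for the executor's behavior, not its real code: both `os.chdir()` and `contextlib.redirect_stdout()` mutate process-global state, so every execution must hold the same lock.

```python
import contextlib
import io
import os
import threading

_execution_lock = threading.Lock()

def run_serialized(code, workdir=None):
    """Execute `code` with stdout captured, serialized by a global lock.

    Without the lock, two concurrent calls could interleave: one call's
    redirect_stdout would capture the other's prints (stdout bleed), and
    a chdir from one call would leak into the other (cwd contamination).
    """
    with _execution_lock:
        buf = io.StringIO()
        prev_cwd = os.getcwd()
        try:
            if workdir is not None:
                os.chdir(workdir)
            with contextlib.redirect_stdout(buf):
                exec(code, {})
        finally:
            os.chdir(prev_cwd)  # always restore the process-wide cwd
        return buf.getvalue()
```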
Preserve original bare exec() path for existing callers that pass no
input_files and no working_dir. The temp directory, file mounting, chdir,
and execution lock now only activate when needed by the new script
execution feature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
redirect_stdout mutates process-global sys.stdout, so both the sandbox
path (tempdir + chdir) and the plain exec() path must be serialized.
The previous commit only locked the sandbox branch, allowing concurrent
plain calls to bleed stdout with sandbox calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs 50 iterations of concurrent sandbox (with input_files + working_dir)
and plain (bare exec) calls to verify no stdout bleed between them.
Protects the execution lock fix from future regressions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strengthen the mixed-concurrency test to also verify that the plain
execution path observes the original working directory, not a sandbox
temp dir leaked from a concurrent sandbox call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988 (Author)

Thanks for the review! I went through every behavior change in this PR against main to validate your concern. You were right — there were arbitrary changes that didn't belong here. Here's what I did and what remains:

Already removed

  • additional_tools parameter + allowed_tools resolution in get_tools() — This was the main offender. get_tools() had ~70 lines dynamically resolving built-in ADK tools from skill frontmatter. Removed entirely; get_tools() is back to a simple return self._tools. This should be a separate feature/PR if needed.

Remaining changes — all necessary for script execution

| # | Change | Backward compatible? | Why needed |
|---|--------|----------------------|------------|
| 1 | RunSkillScriptTool always in get_tools() (4 tools instead of 3) | N/A — new feature | Core feature of this PR. Returns NO_CODE_EXECUTOR gracefully if no executor is configured. Skills are designed to have scripts, so the tool should always be available. |
| 2 | System instruction line 4 | Yes — additive | Model needs to know run_skill_script exists. |
| 3 | SkillToolset.__init__ adds code_executor and script_timeout kwargs | Yes — both optional with defaults | RunSkillScriptTool needs an executor to run against. |
| 4 | File.path field | Yes — defaults to None | Scripts need to be mounted at relative paths (e.g., references/data.csv), not flat basenames. |
| 5 | CodeExecutionInput.working_dir field | Yes — defaults to None | Executor needs to know where to chdir inside the temp directory. |
| 6 | UnsafeLocalCodeExecutor sandbox path | Yes — gated by if input_files or working_dir | Temp dir + file mounting + chdir only activate when the new fields are used. Existing callers (no files, no working_dir) hit the original bare exec() path unchanged. Both paths share an _execution_lock since redirect_stdout mutates process-global sys.stdout. |
| 7 | VertexAiCodeExecutor working_dir + f.path | Yes — no-op when None, falls back to f.name | Both executors should honor the same contract. |

Happy to discuss further — let me know if any of these still feel out of scope.

caohy1988 and others added 15 commits February 24, 2026 00:23
Review the design doc against the actual RunSkillScriptTool and
executor implementations, fixing factual inaccuracies, missing
details, and underspecified areas:

- §3.1: UnsafeLocalCodeExecutor isolation is partial (temp-dir sandbox),
  not "None"; add BuiltInCodeExecutor clarification
- §3.2: Add optimize_data_file and actual default delimiter values
- §3.3: Document binary content handling and _prepare_globals helper
- §3.4: Document asyncio.to_thread() wrapping (not direct call)
- §3.4.1 (new): Add RunSkillScriptTool implementation details:
  executor resolution chain, script_timeout param (shell-only),
  scripts/ prefix normalization, resource packaging (incl. empty
  files), SystemExit handling, error truncation >200 chars, shell
  JSON envelope edge cases, duplicate skill name validation, and
  process_llm_request system instruction injection
- §4.2.2: Add _execution_lock interaction analysis for timeout thread
- §4.4: Expand Python timeout gap (zero protection at any layer)
- §6.2: Update threat table with current mitigations
- §7.1: Add missing fields to combined API changes snippet
- §7.3: Expand impact table with output_files, system instructions,
  error truncation rows
- §7.5.1: Document status derivation asymmetry (shell vs Python)
- §7.5.4: Document current str(v) behavior before proposed rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5-page RFC targeting ADK TL/UTL audience, covering the two P0 gaps
that block production use of RunSkillScriptTool:

- P0-A: Uniform timeout support — add timeout_seconds to
  CodeExecutionInput and default_timeout_seconds to BaseCodeExecutor,
  with per-executor implementations (thread+join for local, Docker
  exec kill for container).
- P0-B: LocalSandboxCodeExecutor — new stdlib-only executor using
  subprocess with resource limits, replacing UnsafeLocalCodeExecutor
  as the recommended local default.

Includes alternatives considered, phased rollout plan, success
metrics, and execution flow appendix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. High: UnsafeLocalCodeExecutor timeout — replace naive lock-release
   with unhealthy-guard pattern. On timeout, mark executor unhealthy
   and fail fast until reinitialize(), preventing cross-execution
   stdout/cwd contamination from lingering daemon threads.

2. High: LocalSandboxCodeExecutor — replace subprocess.run(timeout)
   with Popen + os.killpg(SIGKILL) to kill entire process group on
   timeout. Adjust threat table: fork bomb protection is partial
   (no RLIMIT_NPROC), child-proc infinite loops are covered.

3. Medium: Add AgentEngineSandboxCodeExecutor to timeout landscape
   table (opaque Vertex AI Sandbox internal timeout).

4. Medium: Rephrase "de facto default" claim — UnsafeLocalCodeExecutor
   is common in samples/dev setups, not a global LlmAgent default.

5. Medium: Soften "drop-in replacement" exit criterion — define
   supported script profile (stdout/stderr, sandbox-local files,
   explicit env var allowlist) and compatibility envelope.

6. Low: Make the test line count approximate (~1170 lines).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
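Item 2 above (Popen + os.killpg instead of subprocess.run(timeout)) can be sketched like this. POSIX-only sketch with invented names: `subprocess.run(timeout=...)` kills only the direct child, so a grandchild spawned by the script survives; starting the child in its own session lets us SIGKILL the whole process group.

```python
import os
import signal
import subprocess

def run_with_group_kill(argv, timeout):
    """Run argv in a fresh process group; on timeout, kill the group.

    start_new_session=True makes the child a session (and process-group)
    leader, so os.killpg reaches every descendant, closing the gap where
    child processes of a timed-out script linger after subprocess.run.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return proc.returncode, out, err
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()  # reap the killed child before propagating
        raise
```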
Fix 3 High, 3 Medium, 1 Low severity issues:
- LocalSandboxCodeExecutor: materialize input_files, honor working_dir,
  use Popen+killpg for process-group cleanup on timeout
- UnsafeLocalCodeExecutor: replace lock-release pattern with unhealthy-guard
- Stateful execution: split Option A (final target) from Option C (interim)
  with acceptance criteria and deprecation plan
- Container security: make network/read-only defaults opt-in first
- Backward compat: clarify "Breaking + Minor" with explicit versioning policy
- Open Question 1: align phase reference to Phase 4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- High: Align timeout/lock model — lock releases normally via `with`,
  unhealthy flag prevents new executions (consistent across both docs)
- High: Unify Python 3.10 fallback — preexec_fn calls os.setpgrp()
  AND sets resource limits in both design doc and RFC
- Medium: Phase 3 roadmap now says opt-in for container hardening,
  matching §6.3.2-C graduation plan; default flip deferred to later
- Medium: Fix duplicate §5.2.4 numbering (→5.2.5), deduplicate
  execution_id wiring (Phase 2 owns it, Phase 4 cross-references)
- Low: Add scope note linking RFC (8-11d, P0 only) to full design
  doc (15-22d, P0-P2)
- Low: RFC threat table specifies temp-dir condition; lock impact
  clarified as process-global

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- High: Close race in unhealthy-guard — add double-check of _healthy
  after acquiring _execution_lock (both docs)
- Medium: Normalize 3.10 fallback to _setup_child (os.setpgrp +
  rlimits) in design doc platform section
- Low: Fix cross-reference §6.3.2-C → §6.3.3-C in roadmap
- Low: Fix RFC docstring subprocess.run → Popen + killpg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
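The double-checked unhealthy guard described above can be sketched as a small class. This is an illustrative reduction of the pattern, not the executor's real code: the fast-path check avoids lock contention, and re-checking `_healthy` after acquiring the lock closes the race where a timeout poisons state between the first check and the acquisition.

```python
import threading

class GuardedExecutor:
    """Fail fast after a timeout poisons process-global state."""

    def __init__(self):
        self._healthy = True
        self._execution_lock = threading.Lock()

    def execute(self, fn):
        if not self._healthy:  # fast path, no lock held
            raise RuntimeError("executor unhealthy; call reinitialize()")
        with self._execution_lock:
            if not self._healthy:  # double-check under the lock
                raise RuntimeError("executor unhealthy; call reinitialize()")
            try:
                return fn()
            except TimeoutError:
                # A lingering daemon thread may still mutate stdout/cwd,
                # so refuse further executions until reinitialized.
                self._healthy = False
                raise

    def reinitialize(self):
        with self._execution_lock:
            self._healthy = True
```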
- Medium: Reconcile breaking default-flips with SemVer — pre-1.0 may
  ship in minor with DeprecationWarning; post-1.0 reserve for major
- Low: Add missing `import os` in LocalSandboxCodeExecutor snippet
- Low: Specify reinitialize() failure mode — stays unhealthy and
  raises RuntimeError if daemon thread does not exit within grace
  period (both docs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change "future minor" to "future release: minor pre-1.0 / major
post-1.0" to match the policy stated in §6.3.3-C.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reusable end-to-end evaluation framework for any BigQuery skill or
agent built with ADK BigQueryToolset. Mirrors SkillsBench architecture
with BigQuery-specific metrics and public dataset eval cases.

Contents:
- agent.py: Root agent with BigQueryToolset (read-only, env-configurable)
- metrics.py: 3 custom metrics (schema_discovery, tool_usage, output_correctness)
- runner.py: Standalone runner with filter, dry-run, multi-run support
- eval_sets/bigquerybench_eval.json: 7 eval cases across 3 tiers
  (schema exploration, SQL generation, multi-step analysis)
- README.md: Full pipeline docs with "add new eval case" guide and
  AI/ML tool templates (forecast, detect_anomalies, analyze_contribution)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…evaluation

Adds a detailed "Complete Walkthrough" section to the BigQueryBench
README with a concrete cluster_data (K-Means) tool example covering:

- Step 1: Register tool in BigQueryToolset (auto-discovered)
- Step 2: Evaluate whether existing metrics suffice
- Step 2b: Adding a custom metric (clustering_quality_score example)
- Step 3: Pick a public dataset (ml_datasets.penguins)
- Step 4: Write the eval case JSON with structural reference lines
- Step 5: Configure write mode for ML tools
- Step 6: Validate with dry-run and multi-run
- Step 7: Commit

Includes a summary table showing which files to touch per scenario
(new eval case only, new tool with existing metrics, new tool needing
custom metric, agent config change).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace 3-metric response-text-matching approach with 2 trace-only
metrics that are deterministic and easy to maintain:

- tool_invocation_score: all expected tool names appear in trace
- tool_args_score: all expected (tool, project/dataset/table) pairs
  appear in trace

Key changes:
- metrics.py: Replace schema_discovery + tool_usage +
  output_correctness with tool_invocation + tool_args
- eval JSON: Remove final_response fields — only tool_uses matter
- runner.py: Print tool-call trace for debugging, simplified output
- README.md: Rewritten for trace-based approach with concrete
  cluster_data walkthrough showing JSON-only eval case addition

Adding a new eval case now requires zero code changes — just add
a JSON object specifying the expected tool calls and their key args.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
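A trace-only metric of the kind described above is deterministic because it never inspects response text. A minimal sketch, assuming a trace represented as a list of dicts with a "tool" key (the real metric's data shapes may differ):

```python
def tool_invocation_score(expected_tools, trace):
    """Fraction of expected tool names that appear anywhere in the trace.

    1.0 means every expected tool was called at least once; no LLM or
    text matching is involved, so reruns give identical scores.
    """
    if not expected_tools:
        return 1.0
    called = {call["tool"] for call in trace}
    hits = sum(1 for name in expected_tools if name in called)
    return hits / len(expected_tools)
```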
…adherence evaluation

Three-metric evaluation pipeline:
1. tool_invocation_score — trace-based, verifies correct skill tools called
2. tool_args_score — trace-based, verifies correct skill/resource/script args
3. instruction_adherence_score — LLM-as-judge, checks natural-language rubrics

Includes bq-sql-analyst skill, 5 eval cases with 12 rubrics total, and
14 unit tests covering all metrics (mocked LLM for judge tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ptional BigQuery toolset

- Agent model uses Vertex AI API key (GOOGLE_CLOUD_API_KEY) via
  custom _VertexGemini subclass with HttpRetryOptions (5 attempts,
  2x exponential backoff)
- LLM judge uses vertexai=True + api_key with manual retry backoff
  on 429 rate limits (5 attempts, 2s→4s→8s→16s)
- BigQuery toolset is now optional — gracefully skipped when GCP
  project/ADC is unavailable, enabling skill-only evaluation
- Agent instruction adapts to available tools (skill-only vs full)
- Updated eval case skill_query_with_reference to be skill-only
- Updated test mocks for client.aio.models.generate_content() API
- E2E verified: 5/5 cases pass (100%), all metrics at 1.00

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
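The manual retry backoff described above (5 attempts, 2s→4s→8s→16s) follows a standard exponential pattern. A minimal stdlib sketch with invented names; `sleep` is injectable so the delays can be observed without waiting:

```python
import time

def with_backoff(call, is_retryable, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Retry `call` with exponential backoff on retryable errors.

    Delays double each attempt (2s, 4s, 8s, 16s by default); the last
    attempt re-raises, and non-retryable errors propagate immediately.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            sleep(base_delay * (2 ** attempt))
```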
Labels

tools [Component] This issue is related to tools