feat: Add ExecuteSkillScriptTool for running skill scripts #4575
caohy1988 wants to merge 46 commits into google:main
Conversation
…xecutor Adds execute_skill_script tool to SkillToolset, enabling agents to run scripts from a skill's scripts/ directory through ADK's BaseCodeExecutor infrastructure. Supports Python (.py) and shell (.sh/.bash) scripts with optional input_args, executor resolution chain (toolset → agent fallback), and LLM-friendly error truncation. Includes 37 unit tests, a sample agent with Anthropic's mcp-builder skill and a python-helper skill, compatibility tests, and a live integration test exercising all four skill tools end-to-end with Gemini. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Response from ADK Triaging Agent: Hello @caohy1988, thank you for creating this PR! Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). Also, this PR is a new feature — could you please associate the GitHub issue with this PR? If there is no existing issue, could you please create one? This information will help reviewers review your PR more efficiently. Thanks!
Code Review
The addition of ExecuteSkillScriptTool significantly enhances the SkillToolset by allowing agents to run scripts directly from skills. The implementation is well-structured and includes comprehensive tests. However, there are a few security and correctness issues in the script execution logic, particularly regarding shell injection vulnerabilities and argument parsing. Addressing these will make the tool more robust and secure.
```python
"import subprocess\n"
"_result = subprocess.run(\n"
f"    ['bash', '-c', {script_src!r}"
+ (f" + ' ' + {input_args!r}" if input_args else "")
+ f"],\n"
```
The current implementation is vulnerable to shell injection because input_args is concatenated directly into the command string passed to bash -c. An LLM could provide arguments containing shell metacharacters (e.g., ;, $(...)) to execute arbitrary commands. Additionally, this approach does not correctly pass arguments as positional parameters ($1, $2, etc.) to the script. Arguments should be passed as separate elements to subprocess.run.
Suggested change:

```diff
-"import subprocess\n"
-"_result = subprocess.run(\n"
-f"    ['bash', '-c', {script_src!r}"
-+ (f" + ' ' + {input_args!r}" if input_args else "")
-+ f"],\n"
+"import subprocess, shlex\n"
+"_result = subprocess.run(\n"
+f"    ['bash', '-c', {script_src!r}, {script_name!r}] + shlex.split({input_args!r}),\n"
```
Good catch — this was a real shell injection vulnerability. Fixed in 06d995e.
Applied the suggested pattern: args are now passed as separate list elements via shlex.split() rather than concatenated into the bash -c command string:
```python
["bash", "-c", script_src, script_name] + shlex.split(input_args)
```

This way `$0` gets the script name and `$1`, `$2`, ... get the individual arguments, with no shell metacharacter interpretation on the args.
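The difference is easy to demonstrate. A minimal sketch (POSIX `bash` assumed; the script body and the hostile argument are made up for illustration):

```python
import subprocess

script = 'echo "first arg: $1"'
user_args = "hello; touch /tmp/pwned"  # hostile input with a shell metacharacter

# Unsafe (do NOT do this): concatenating user_args into the command string
# lets bash interpret the ";" and run the injected command:
#   subprocess.run(["bash", "-c", script + " " + user_args])

# Safe: each argument is a separate argv element. bash binds them to
# $0, $1, ... and never re-parses them as shell syntax.
result = subprocess.run(
    ["bash", "-c", script, "script.sh", user_args],
    capture_output=True,
    text=True,
)
```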
```python
"script_name": script_name,
"stdout": result.stdout,
"stderr": result.stderr,
"status": "error" if result.stderr else "success",
```
The status detection logic relies on result.stderr being non-empty. However, the default UnsafeLocalCodeExecutor does not capture sys.stderr. Since the generated wrapper for shell scripts (lines 412-414) prints errors to sys.stderr, these will not be captured by the executor, and the tool will return status: "success" even if the script failed. Consider using check=True in subprocess.run to raise an exception on failure, which the executor will catch and report in stderr.
Agreed. Fixed in 06d995e.
Added check=True to the generated subprocess.run call. Now when a shell script exits with non-zero status, CalledProcessError is raised inside the exec'd code, which UnsafeLocalCodeExecutor captures and surfaces in result.stderr. This means the "status": "error" if result.stderr else "success" check works correctly.
Also removed the manual stderr printing from the generated wrapper since capture_output=True + the exception path now handles both cases (non-zero exit and stderr output).
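A minimal sketch of the `check=True` behavior the fix relies on — a non-zero exit raises `CalledProcessError`, carrying the captured stderr and exit code (the script body is a made-up failure case):

```python
import subprocess

# A script that writes to stderr and exits non-zero.
script = 'echo "boom" >&2; exit 3'

try:
    subprocess.run(
        ["bash", "-c", script, "fail.sh"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on non-zero exit
    )
    status = "success"
    captured, rc = "", 0
except subprocess.CalledProcessError as e:
    # e.stderr holds the captured stderr; e.returncode the exit code.
    status = "error"
    captured, rc = e.stderr, e.returncode
```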
```python
"import sys\n"
f"sys.argv = [{script_name!r}] + {input_args!r}.split()\n"
```
Using .split() for parsing input_args is fragile as it does not handle quoted arguments with spaces correctly (e.g., --name "John Doe"). It is recommended to use shlex.split() for robust argument parsing.
Suggested change:

```diff
-"import sys\n"
-f"sys.argv = [{script_name!r}] + {input_args!r}.split()\n"
+"import sys, shlex\n"
+f"sys.argv = [{script_name!r}] + shlex.split({input_args!r})\n"
```
Agreed — .split() silently breaks on quoted arguments. Fixed in 06d995e.
Both Python and shell script paths now use shlex.split() for argument parsing:

```python
# Python scripts
sys.argv = ["script.py"] + shlex.split(input_args)

# Shell scripts
["bash", "-c", script_src, script_name] + shlex.split(input_args)
```

This correctly handles `--name "John Doe"`, paths with spaces, and other shell-style quoting.
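The two parsing behaviors side by side (illustrative input):

```python
import shlex

args = '--name "John Doe" --path "/tmp/my dir"'

# Naive str.split(): quotes are kept and quoted groups are broken apart.
naive = args.split()

# shlex.split(): shell-style parsing, quoted groups stay together.
parsed = shlex.split(args)
```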
…k=True

1. Shell injection (HIGH): Use shlex.split() and pass args as separate list elements to subprocess.run instead of concatenating into the bash -c command string. Args now arrive as positional params ($1, $2, ...).
2. Fragile arg parsing (MEDIUM): Replace .split() with shlex.split() for Python script sys.argv injection — correctly handles quoted arguments like --name "John Doe".
3. Status detection (MEDIUM): Add check=True to subprocess.run in the shell wrapper so non-zero exit codes raise CalledProcessError, which the code executor captures as stderr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix sys.exit(0) treated as error: zero/None exit codes now return success
- Fix shell stderr discarding stdout: serialize both streams as a JSON envelope through stdout to work around UnsafeLocalCodeExecutor losing stdout on exception
- Add early shlex.split() validation for input_args before executor dispatch
- Add script_timeout parameter to SkillToolset for subprocess.run timeout
- Reject extensionless scripts (previously treated as Python)
- Replace os.popen() with subprocess.run() in run_live_test.py
- Add UnsafeLocalCodeExecutor safety warning to sample agent
- Fix broken reference/ → references/ links in mcp-builder SKILL.md
- Remove duplicate @pytest.mark.asyncio decorator
- Add 4 integration tests using real UnsafeLocalCodeExecutor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ExecuteSkillScriptTool — Design Details & Considerations

Overview

Architecture

Script Type Handling
Extensionless files are rejected (not silently treated as Python) to avoid unexpected behavior.

Code Executor Resolution Chain

This design allows a single toolset-level executor to be shared across all skills, or per-agent executors for different isolation levels.

Shell Script JSON Envelope

Shell scripts face a unique challenge: the executor reliably captures only a single stdout channel, and loses stdout when an exception propagates. Solution: the generated shell wrapper serializes both stdout and stderr as a JSON envelope through the single stdout channel:

```python
# Generated code for shell scripts:
import subprocess, shlex, json as _json
try:
    _r = subprocess.run(
        ['bash', '-c', SCRIPT_SRC, SCRIPT_NAME] + shlex.split(INPUT_ARGS),
        capture_output=True, text=True,
        timeout=SCRIPT_TIMEOUT,
    )
    print(_json.dumps({
        '__shell_result__': True,
        'stdout': _r.stdout,
        'stderr': _r.stderr,
        'returncode': _r.returncode,
    }))
except subprocess.TimeoutExpired as _e:
    print(_json.dumps({
        '__shell_result__': True,
        'stdout': _e.stdout or '',
        'stderr': 'Timed out after Ns',
        'returncode': -1,
    }))
```
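On the tool side, the envelope can be unpacked from the executor's captured stdout. A sketch with a hypothetical `parse_shell_envelope` helper — the actual parsing code in the tool may differ:

```python
import json

def parse_shell_envelope(executor_stdout: str) -> dict:
    """Unpack the JSON envelope printed by the generated shell wrapper.

    executor_stdout is the raw stdout the code executor captured; the
    last line is expected to be the envelope. (Hypothetical helper name.)
    """
    last_line = executor_stdout.strip().splitlines()[-1]
    payload = json.loads(last_line)
    if not payload.get("__shell_result__"):
        raise ValueError("no shell envelope found in executor output")
    return {
        "stdout": payload["stdout"],
        "stderr": payload["stderr"],
        "returncode": payload["returncode"],
        "status": "success" if payload["returncode"] == 0 else "error",
    }

raw = '{"__shell_result__": true, "stdout": "hi\\n", "stderr": "", "returncode": 0}'
result = parse_shell_envelope(raw)
```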
Three-State Status Model
Security Considerations

Shell injection prevention:
Executor security:
Known Limitations & Future Work
Error Codes Reference
Test Coverage
Design doc covering three high-priority improvements to the ADK code executor infrastructure:

1. Uniform timeout support across all executors (currently only GkeCodeExecutor has it)
2. Stateful Python execution in ContainerCodeExecutor (currently frozen to stateful=False)
3. Security hardening with a new LocalSandboxCodeExecutor and UnsafeLocalCodeExecutor restrictions

Also removes dead default parameter from _prepare_code(timeout=) since timeout is now read directly from self._toolset._script_timeout inside the shell branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… doc Add Non-Goals & Invariants section, fix append-before-execute state poisoning, redesign timeout as per-invocation field on CodeExecutionInput, use Docker exec kill instead of thread+join for container timeout, replace preexec_fn with process_group, pin default container image digest, reframe restricted builtins as best-effort friction, document execution_id gap in ExecuteSkillScriptTool. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, timeout API

- Fix execution_id to use session-stable key (session.id), not per-turn invocation_id
- Fix container timeout kill: document PID namespace mismatch, use pkill -f inside container instead of host-PID kill
- Fix Python version: ADK minimum is >=3.10 (pyproject.toml), not 3.11; gate process_group with version check, preexec_fn fallback
- Fix LocalSandboxCodeExecutor to use default_timeout_seconds (consistent with BaseCodeExecutor API)
- Replace all `or` and `??` timeout resolution with explicit `is not None` checks to allow timeout=0
- Fix test strategy to document missing ContainerCodeExecutor tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Redesign container timeout: run exec_start in thread + join(timeout), kill via host-side os.kill(host_pid) instead of in-container pkill. Eliminates blocking exec_start DoS and overbroad pkill -f pattern.
- Remove contradictory "RECOMMENDED" label from Option C; Option A (persistent process) is the single recommended approach.
- Remove Windows implementation guidance (contradicts Non-Goals §2.4); raise NotImplementedError instead.
- Guard resource import in LocalSandboxCodeExecutor wrapper script for platforms where the resource module is unavailable.
- Fix remote executor timeout to use synchronous thread+join, not asyncio.wait_for (execute_code APIs are synchronous).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…k gaps

- Roadmap Phase 2 now targets Option A (persistent process) directly instead of cumulative replay (Option C) first
- Unify recovery policy: timeout and crash both return error indicating state loss, no automatic replay (consistent across §4.2.3 and §9)
- Add Windows NotImplementedError gate to LocalSandboxCodeExecutor
- Guard resource import in 3.10 preexec_fn fallback path
- Document that PermissionError → container.restart() is the primary timeout path in practice (container runs as root, ADK as non-root)
- Add explicit test cases for PermissionError fallback and kill paths
- Update open question #1 to focus on I/O boundary protocol design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- PermissionError from os.kill now triggers container.restart() instead of being silently swallowed (fixes timeout DoS for the common case where container runs as root and ADK does not)
- ProcessLookupError remains the only silently-ignored exception (process already exited, no action needed)
- Non-Goals §2.2 updated to describe persistent REPL crash recovery instead of stale cumulative-replay language

…d join

- If both os.kill and container.restart() fail, set _healthy=False and return distinct "cleanup failed" error instead of silently returning a normal timeout result
- Add thread.join(timeout=2) after kill/restart to prevent daemon thread leaks on repeated timeout failures
- Log warning if worker thread is still alive after post-kill join
- Add test cases for total cleanup failure and thread leak scenarios

- Add _healthy check at execute_code entry: raise RuntimeError if executor is unhealthy from a prior cleanup failure
- Add python3 --version readiness check after container.restart(), mirroring the init-time validation pattern
- Update roadmap Phase 1 step 4 to include _healthy guard, post-restart validation, and post-kill thread join
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define the recovery path for unhealthy executors: reinitialize() stops the container, creates a new one, validates readiness, and sets _healthy=True. Referenced in error message, _healthy lifecycle, and roadmap Phase 1 step 5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-zero exit from python3 --version does not raise in Docker SDK, so check exit_code explicitly and raise to trigger the _healthy=False path, mirroring container_code_executor.py:169. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a complete SkillsBench benchmark runner with Docker-based execution and real pytest scoring, matching the official SkillsBench methodology.

Key components:
- TaskContainerExecutor: BaseCodeExecutor subclass that runs code inside per-task Docker containers built from each task's Dockerfile
- ContainerBashTool: Bash tool enabling the agent to read/write files and run shell commands inside the task container
- Lenient skill loader: Handles malformed SKILL.md frontmatter (invalid names, list-typed allowed-tools, dict-typed compatibility/metadata), loading 226/232 skills across all 87 tasks
- Real pytest scoring: Copies tests into the container, runs test.sh, reads reward.txt for binary pass/fail — matching SkillsBench scoring
- Per-task config from task.toml (agent/build/verifier timeouts)
- Exponential backoff retry for 429/503 API errors
- CLI with --filter, --rebuild, --build-only, --skip-tests flags

Also includes the standalone eval runner, agent definition, metrics, 8 bundled skills, and evaluation set for local development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap all Docker exec_run calls in a ThreadPoolExecutor with a 300s default timeout. This prevents the asyncio event loop from blocking indefinitely when a script runs forever inside the container, which was causing the full evaluation to hang on long-running tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
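The pattern can be sketched generically — a blocking call (a stand-in for the Docker SDK's `exec_run`) dispatched to a thread pool so the event loop stays responsive, with `asyncio.wait_for` bounding the wait; all names here are illustrative:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def blocking_call(seconds: float) -> str:
    # Stand-in for a blocking Docker SDK call such as exec_run().
    time.sleep(seconds)
    return "done"

async def run_with_timeout(seconds: float, timeout: float) -> str:
    loop = asyncio.get_running_loop()
    # The blocking call runs in a pool thread; the event loop stays free.
    # Note: on timeout the worker thread keeps running — wait_for only
    # stops the waiting, it does not kill the thread.
    return await asyncio.wait_for(
        loop.run_in_executor(_pool, blocking_call, seconds), timeout
    )

result = asyncio.run(run_with_timeout(0.01, timeout=5.0))
```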
Refactor code execution to use a temporary directory for input files and change working directory as needed.
This document outlines the design for integrating pre-registered function execution and script execution within ADK's non-bash based skills toolset. It details the motivation, proposal, enabling usage of ADK tools, and the architecture of the RunSkillScriptTool.
- full_runner.py: Add retry loop (max 2 retries) with Docker prune between attempts, forcerm=True for intermediate containers, periodic prune every N tasks (--prune-interval flag), and build summary stats
- test_skill_toolset.py: Add 8 tests covering shell JSON envelope parsing, non-zero returncode, TimeoutExpired path, input_files packaging with correct paths/content, and working_dir validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix system instruction referencing 'execute_skill_script' instead of actual tool name 'run_skill_script'
- Use 'is not None' checks when packaging input_files so empty files (zero-byte configs, sentinels) are mounted instead of silently dropped
- Add defensive validation for 'args' parameter type, returning clear INVALID_ARGS_TYPE error instead of crashing on non-dict input
- Add tests for all three findings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove Feature B (additional_tools parameter and allowed_tools dynamic resolution in get_tools()) to keep the PR focused on script execution. Remove global _execution_lock from UnsafeLocalCodeExecutor as it was an arbitrary side-effect unrelated to the script execution feature. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
os.chdir() and redirect_stdout() mutate process-global state, so concurrent execute_code() calls without synchronization cause stdout bleed and cwd contamination between executions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
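A minimal sketch of the failure mode and the fix — `_execution_lock` is an assumed name; without the lock, two threads swapping the process-global `sys.stdout` concurrently can capture each other's output:

```python
import io
import threading
from contextlib import redirect_stdout

# redirect_stdout mutates the process-global sys.stdout, so concurrent
# uses can bleed output between executions. A module-level lock
# (hypothetical name) serializes them.
_execution_lock = threading.Lock()

def run_captured(code: str) -> str:
    buf = io.StringIO()
    with _execution_lock:           # serialize: one execution at a time
        with redirect_stdout(buf):  # now safe to swap the global stdout
            exec(code, {})
    return buf.getvalue()

outputs = []
threads = [
    threading.Thread(target=lambda i=i: outputs.append(run_captured(f"print({i})")))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```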
Preserve original bare exec() path for existing callers that pass no input_files and no working_dir. The temp directory, file mounting, chdir, and execution lock now only activate when needed by the new script execution feature. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
redirect_stdout mutates process-global sys.stdout, so both the sandbox path (tempdir + chdir) and the plain exec() path must be serialized. The previous commit only locked the sandbox branch, allowing concurrent plain calls to bleed stdout with sandbox calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs 50 iterations of concurrent sandbox (with input_files + working_dir) and plain (bare exec) calls to verify no stdout bleed between them. Protects the execution lock fix from future regressions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strengthen the mixed-concurrency test to also verify that the plain execution path observes the original working directory, not a sandbox temp dir leaked from a concurrent sandbox call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the review! I went through every behavior change in this PR.

Already removed

Remaining changes — all necessary for script execution

Happy to discuss further — let me know if any of these still feel out of scope.
Review the design doc against the actual RunSkillScriptTool and executor implementations, fixing factual inaccuracies, missing details, and underspecified areas:

- §3.1: UnsafeLocalCodeExecutor isolation is partial (temp-dir sandbox), not "None"; add BuiltInCodeExecutor clarification
- §3.2: Add optimize_data_file and actual default delimiter values
- §3.3: Document binary content handling and _prepare_globals helper
- §3.4: Document asyncio.to_thread() wrapping (not direct call)
- §3.4.1 (new): Add RunSkillScriptTool implementation details: executor resolution chain, script_timeout param (shell-only), scripts/ prefix normalization, resource packaging (incl. empty files), SystemExit handling, error truncation >200 chars, shell JSON envelope edge cases, duplicate skill name validation, and process_llm_request system instruction injection
- §4.2.2: Add _execution_lock interaction analysis for timeout thread
- §4.4: Expand Python timeout gap (zero protection at any layer)
- §6.2: Update threat table with current mitigations
- §7.1: Add missing fields to combined API changes snippet
- §7.3: Expand impact table with output_files, system instructions, error truncation rows
- §7.5.1: Document status derivation asymmetry (shell vs Python)
- §7.5.4: Document current str(v) behavior before proposed rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

5-page RFC targeting ADK TL/UTL audience, covering the two P0 gaps that block production use of RunSkillScriptTool:

- P0-A: Uniform timeout support — add timeout_seconds to CodeExecutionInput and default_timeout_seconds to BaseCodeExecutor, with per-executor implementations (thread+join for local, Docker exec kill for container).
- P0-B: LocalSandboxCodeExecutor — new stdlib-only executor using subprocess with resource limits, replacing UnsafeLocalCodeExecutor as the recommended local default.

Includes alternatives considered, phased rollout plan, success metrics, and execution flow appendix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1. High: UnsafeLocalCodeExecutor timeout — replace naive lock-release with unhealthy-guard pattern. On timeout, mark executor unhealthy and fail fast until reinitialize(), preventing cross-execution stdout/cwd contamination from lingering daemon threads.
2. High: LocalSandboxCodeExecutor — replace subprocess.run(timeout) with Popen + os.killpg(SIGKILL) to kill entire process group on timeout. Adjust threat table: fork bomb protection is partial (no RLIMIT_NPROC), child-proc infinite loops are covered.
3. Medium: Add AgentEngineSandboxCodeExecutor to timeout landscape table (opaque Vertex AI Sandbox internal timeout).
4. Medium: Rephrase "de facto default" claim — UnsafeLocalCodeExecutor is common in samples/dev setups, not a global LlmAgent default.
5. Medium: Soften "drop-in replacement" exit criterion — define supported script profile (stdout/stderr, sandbox-local files, explicit env var allowlist) and compatibility envelope.
6. Low: Make test line count approximate (~1170-line).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
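The Popen + `os.killpg` pattern can be sketched as follows (POSIX-only; `start_new_session=True` is used as the portable way to make the child a process-group leader, so the kill reaches any grandchildren the shell forks):

```python
import os
import signal
import subprocess

# The shell forks a grandchild ("sleep"); a plain proc.kill() would leave
# it running. A new session makes the child a process-group leader, so
# the whole tree can be killed at once.
proc = subprocess.Popen(
    ["bash", "-c", "sleep 30 & wait"],
    start_new_session=True,
)
try:
    proc.wait(timeout=0.2)
except subprocess.TimeoutExpired:
    # Kill the entire process group, not just the immediate child.
    os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    proc.wait()

returncode = proc.returncode
```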
Fix 3 High, 3 Medium, 1 Low severity issues:

- LocalSandboxCodeExecutor: materialize input_files, honor working_dir, use Popen+killpg for process-group cleanup on timeout
- UnsafeLocalCodeExecutor: replace lock-release pattern with unhealthy-guard
- Stateful execution: split Option A (final target) from Option C (interim) with acceptance criteria and deprecation plan
- Container security: make network/read-only defaults opt-in first
- Backward compat: clarify "Breaking + Minor" with explicit versioning policy
- Open Question 1: align phase reference to Phase 4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- High: Align timeout/lock model — lock releases normally via `with`, unhealthy flag prevents new executions (consistent across both docs)
- High: Unify Python 3.10 fallback — preexec_fn calls os.setpgrp() AND sets resource limits in both design doc and RFC
- Medium: Phase 3 roadmap now says opt-in for container hardening, matching §6.3.2-C graduation plan; default flip deferred to later
- Medium: Fix duplicate §5.2.4 numbering (→5.2.5), deduplicate execution_id wiring (Phase 2 owns it, Phase 4 cross-references)
- Low: Add scope note linking RFC (8-11d, P0 only) to full design doc (15-22d, P0-P2)
- Low: RFC threat table specifies temp-dir condition; lock impact clarified as process-global

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- High: Close race in unhealthy-guard — add double-check of _healthy after acquiring _execution_lock (both docs)
- Medium: Normalize 3.10 fallback to _setup_child (os.setpgrp + rlimits) in design doc platform section
- Low: Fix cross-reference §6.3.2-C → §6.3.3-C in roadmap
- Low: Fix RFC docstring subprocess.run → Popen + killpg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Medium: Reconcile breaking default-flips with SemVer — pre-1.0 may ship in minor with DeprecationWarning; post-1.0 reserve for major
- Low: Add missing `import os` in LocalSandboxCodeExecutor snippet
- Low: Specify reinitialize() failure mode — stays unhealthy and raises RuntimeError if daemon thread does not exit within grace period (both docs)
Change "future minor" to "future release: minor pre-1.0 / major post-1.0" to match the policy stated in §6.3.3-C. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reusable end-to-end evaluation framework for any BigQuery skill or agent built with ADK BigQueryToolset. Mirrors SkillsBench architecture with BigQuery-specific metrics and public dataset eval cases.

Contents:
- agent.py: Root agent with BigQueryToolset (read-only, env-configurable)
- metrics.py: 3 custom metrics (schema_discovery, tool_usage, output_correctness)
- runner.py: Standalone runner with filter, dry-run, multi-run support
- eval_sets/bigquerybench_eval.json: 7 eval cases across 3 tiers (schema exploration, SQL generation, multi-step analysis)
- README.md: Full pipeline docs with "add new eval case" guide and AI/ML tool templates (forecast, detect_anomalies, analyze_contribution)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…evaluation

Adds a detailed "Complete Walkthrough" section to the BigQueryBench README with a concrete cluster_data (K-Means) tool example covering:

- Step 1: Register tool in BigQueryToolset (auto-discovered)
- Step 2: Evaluate whether existing metrics suffice
- Step 2b: Adding a custom metric (clustering_quality_score example)
- Step 3: Pick a public dataset (ml_datasets.penguins)
- Step 4: Write the eval case JSON with structural reference lines
- Step 5: Configure write mode for ML tools
- Step 6: Validate with dry-run and multi-run
- Step 7: Commit

Includes a summary table showing which files to touch per scenario (new eval case only, new tool with existing metrics, new tool needing custom metric, agent config change).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace 3-metric response-text-matching approach with 2 trace-only metrics that are deterministic and easy to maintain:

- tool_invocation_score: all expected tool names appear in trace
- tool_args_score: all expected (tool, project/dataset/table) pairs appear in trace

Key changes:
- metrics.py: Replace schema_discovery + tool_usage + output_correctness with tool_invocation + tool_args
- eval JSON: Remove final_response fields — only tool_uses matter
- runner.py: Print tool-call trace for debugging, simplified output
- README.md: Rewritten for trace-based approach with concrete cluster_data walkthrough showing JSON-only eval case addition

Adding a new eval case now requires zero code changes — just add a JSON object specifying the expected tool calls and their key args.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…adherence evaluation

Three-metric evaluation pipeline:

1. tool_invocation_score — trace-based, verifies correct skill tools called
2. tool_args_score — trace-based, verifies correct skill/resource/script args
3. instruction_adherence_score — LLM-as-judge, checks natural-language rubrics

Includes bq-sql-analyst skill, 5 eval cases with 12 rubrics total, and 14 unit tests covering all metrics (mocked LLM for judge tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ptional BigQuery toolset

- Agent model uses Vertex AI API key (GOOGLE_CLOUD_API_KEY) via custom _VertexGemini subclass with HttpRetryOptions (5 attempts, 2x exponential backoff)
- LLM judge uses vertexai=True + api_key with manual retry backoff on 429 rate limits (5 attempts, 2s→4s→8s→16s)
- BigQuery toolset is now optional — gracefully skipped when GCP project/ADC is unavailable, enabling skill-only evaluation
- Agent instruction adapts to available tools (skill-only vs full)
- Updated eval case skill_query_with_reference to be skill-only
- Updated test mocks for client.aio.models.generate_content() API
- E2E verified: 5/5 cases pass (100%), all metrics at 1.00

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Adds `ExecuteSkillScriptTool` to `SkillToolset`, enabling agents to execute scripts from a skill's `scripts/` directory via ADK's `BaseCodeExecutor` infrastructure
- Supports Python (`.py`) and shell (`.sh`/`.bash`) scripts with optional `input_args` for parameterized execution
- Executor resolution chain: toolset `code_executor` → agent fallback → actionable error
- LLM-friendly error truncation (full details via `logger.exception`)
- Extends `DEFAULT_SKILL_SYSTEM_INSTRUCTION` to guide the LLM on `execute_skill_script` usage
- Adds a `code_executor: Optional[BaseCodeExecutor] = None` keyword-only parameter to `SkillToolset.__init__` (backward compatible)

Changes

Core implementation (`src/google/adk/tools/skill_toolset.py`)
- `ExecuteSkillScriptTool` class with `_prepare_code()` for Python/shell dispatch
- Error codes: `MISSING_SKILL_NAME`, `MISSING_SCRIPT_NAME`, `SKILL_NOT_FOUND`, `SCRIPT_NOT_FOUND`, `NO_CODE_EXECUTOR`, `UNSUPPORTED_SCRIPT_TYPE`, `EXECUTION_ERROR`
- `scripts/` prefix auto-stripping for consistency with `LoadSkillResourceTool` path format

Unit tests (`tests/unittests/tools/test_skill_toolset.py`)
- Covers `input_args` injection, prefix stripping, executor priority, agent fallback, execution error, stderr status, error truncation

Sample agent (`contributing/samples/skill_script_demo/`)
- `mcp-builder` skill (from anthropics/skills) and a `python-helper` skill with self-contained scripts (fibonacci, word_count, json_format)

Test plan
- `pytest tests/unittests/tools/test_skill_toolset.py -v` — 37/37 passed
- `pytest contributing/samples/skill_script_demo/test_skill_compat.py -v` — 14/14 passed
- `./autoformat.sh`
- `SkillToolset(skills=[...])` works unchanged

🤖 Generated with Claude Code