Skip to content

Add skill activation eval yamls.#164

Open
CoasterJX wants to merge 2 commits intoevalsfrom
evals-skill-activate
Open

Add skill activation eval yamls.#164
CoasterJX wants to merge 2 commits intoevalsfrom
evals-skill-activate

Conversation

@CoasterJX
Copy link
Copy Markdown
Contributor

Corresponding changes to skillgrade: sdlc-graph/skillgrade#21

@CoasterJX CoasterJX requested review from haroonc and kmontg May 4, 2026 22:03
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces evaluation configuration files for Google CI/CD deployment and pipeline design skills. The review feedback points out a critical logic error in the regex used to identify prohibited events, which would result in false positives. Additionally, the reviewer recommended improving portability by avoiding hardcoded credential paths and increasing execution efficiency by removing unnecessary sleep commands in the cleanup scripts.

Comment thread eval.google-cicd-deploy.yaml Outdated
agentWorkingDir: /tmp/skill-activation/tinyjam-deploy
expectedSkill: google-cicd-deploy
prohibitedEventsBeforeActivation:
- "activate_skill.*(?!google-cicd-deploy)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex activate_skill.*(?!google-cicd-deploy) will incorrectly match the intended skill activation. Because .* is greedy and consumes the entire string, the negative lookahead (?!google-cicd-deploy) at the end will always succeed (as the empty string at the end of the line is not google-cicd-deploy). This will cause the evaluation to fail even when the agent correctly activates the expected skill.

    - "activate_skill(?!(.*google-cicd-deploy)).*"

Comment thread eval.google-cicd-pipeline-design.yaml Outdated
agentWorkingDir: /tmp/skill-activation/tinyjam-pipeline-design
expectedSkill: google-cicd-pipeline-design
prohibitedEventsBeforeActivation:
- "activate_skill.*(?!google-cicd-pipeline-design)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex activate_skill.*(?!google-cicd-pipeline-design) will incorrectly match the intended skill activation because the greedy .* consumes the string before the lookahead is evaluated. This will lead to false positives where the correct skill activation is treated as a prohibited event.

    - "activate_skill(?!(.*google-cicd-pipeline-design)).*"

docker:
base: cicd-evals:latest
env:
GOOGLE_APPLICATION_CREDENTIALS: ~/.config/gcloud/application_default_credentials.json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding the path to application_default_credentials.json in the user's home directory (~) makes the evaluation configuration non-portable. This configuration will fail in environments where the credentials are not stored in that specific location (e.g., CI runners or other developers' machines). Consider using a more flexible approach like an environment variable or a relative path.

Comment on lines +20 to +21
sleep 3
rm -rf /tmp/skill-activation/tinyjam-deploy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sleep 3 in the cleanup script appears unnecessary for local file system operations and adds cumulative delay to the evaluation suite (especially with 5 trials). If there is no specific asynchronous process to wait for, this should be removed to improve efficiency.

      rm -rf /tmp/skill-activation/tinyjam-deploy

docker:
base: cicd-evals:latest
env:
GOOGLE_APPLICATION_CREDENTIALS: ~/.config/gcloud/application_default_credentials.json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a hardcoded path to the home directory for GOOGLE_APPLICATION_CREDENTIALS reduces portability. It is recommended to use a path that can be consistently resolved across different execution environments, such as a relative path or an environment variable.

Comment on lines +20 to +21
sleep 3
rm -rf /tmp/skill-activation/tinyjam-pipeline-design
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sleep 3 in the cleanup block is likely redundant for local directory removal and increases the total execution time of the evaluation suite.

      rm -rf /tmp/skill-activation/tinyjam-pipeline-design

@CoasterJX
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new evaluation configuration files, eval.google-cicd-deploy.yaml and eval.google-cicd-pipeline-design.yaml, designed to test Google Cloud CI/CD skills. The review feedback highlights several critical improvements for the setup and cleanup scripts: specifically, using trial-specific directories (via ${_EVAL_TRIAL}) to ensure idempotency and support parallel execution, removing hardcoded credential paths to improve portability across environments, and eliminating unnecessary sleep commands to optimize performance.

Comment on lines +17 to +18
mkdir -p /tmp/skill-activation
git clone https://github.com/sdlc-graph/tinyjam.git /tmp/skill-activation/tinyjam-deploy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The setup script is not idempotent; git clone will fail if the directory already exists from a previous run. Furthermore, using a hardcoded path prevents parallel execution of trials. It is recommended to use a trial-specific directory using ${_EVAL_TRIAL}.

      mkdir -p /tmp/skill-activation/${_EVAL_TRIAL}
      git clone https://github.com/sdlc-graph/tinyjam.git /tmp/skill-activation/${_EVAL_TRIAL}/tinyjam-deploy

cleanup: |
sleep 3
rm -rf /tmp/skill-activation/tinyjam-deploy
agentWorkingDir: /tmp/skill-activation/tinyjam-deploy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This path must match the one used in the setup script to ensure the agent starts in the correct directory for each trial.

  agentWorkingDir: /tmp/skill-activation/${_EVAL_TRIAL}/tinyjam-deploy

Comment on lines +17 to +18
mkdir -p /tmp/skill-activation
git clone https://github.com/sdlc-graph/tinyjam.git /tmp/skill-activation/tinyjam-pipeline-design
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The setup script is not idempotent and will cause collisions during parallel trial execution. Use a trial-specific directory.

      mkdir -p /tmp/skill-activation/${_EVAL_TRIAL}
      git clone https://github.com/sdlc-graph/tinyjam.git /tmp/skill-activation/${_EVAL_TRIAL}/tinyjam-pipeline-design

cleanup: |
sleep 3
rm -rf /tmp/skill-activation/tinyjam-pipeline-design
agentWorkingDir: /tmp/skill-activation/tinyjam-pipeline-design
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Ensure the working directory matches the trial-specific path used in setup.

  agentWorkingDir: /tmp/skill-activation/${_EVAL_TRIAL}/tinyjam-pipeline-design

docker:
base: cicd-evals:latest
env:
GOOGLE_APPLICATION_CREDENTIALS: ~/.config/gcloud/application_default_credentials.json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding the path to ~/.config/gcloud/application_default_credentials.json makes the evaluation non-portable. This path is specific to local developer environments and will likely fail in CI/CD pipelines or on systems with different configurations. Consider using the existing GOOGLE_APPLICATION_CREDENTIALS environment variable or a more generic path.

Comment on lines +20 to +21
sleep 3
rm -rf /tmp/skill-activation/tinyjam-deploy
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sleep 3 adds an unnecessary delay to each trial. If the goal is to ensure the directory is removed, rm -rf is sufficient on its own. Also, ensure the path matches the trial-specific directory if adopted.

      rm -rf /tmp/skill-activation/${_EVAL_TRIAL}/tinyjam-deploy

docker:
base: cicd-evals:latest
env:
GOOGLE_APPLICATION_CREDENTIALS: ~/.config/gcloud/application_default_credentials.json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding the path to the local gcloud config directory makes this evaluation file non-portable. Consider using a more generic approach for credentials.

Comment on lines +20 to +21
sleep 3
rm -rf /tmp/skill-activation/tinyjam-pipeline-design
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Remove the arbitrary sleep 3 to improve evaluation performance.

      rm -rf /tmp/skill-activation/${_EVAL_TRIAL}/tinyjam-pipeline-design

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant