feat: migrate trigger tests to Vally eval configs (22 skills, 722 stimuli)#2282

Open
wbreza wants to merge 10 commits into main from feature/trigger-test-migration

Collaborator

@wbreza wbreza commented May 15, 2026

Replaces #1899 (migrated from fork to org branch for CI secret access).


Summary

Migrates trigger routing tests to Vally eval configs for all 22 skills, building on the foundation established in #1912. Each skill gets a dedicated triggers.eval.yaml with positive and negative routing stimuli.

Skills Migrated

| Skill | Stimuli | Key Graders |
| --- | --- | --- |
| appinsights-instrumentation | 17 | skill-invocation, completed, output-not-matches |
| azure-ai | 25 | skill-invocation, completed, output-not-matches |
| azure-aigateway | 25 | skill-invocation, completed, output-not-matches |
| azure-cloud-migrate | 24 | skill-invocation, completed, output-not-matches |
| azure-compliance | 36 | skill-invocation, completed, output-not-matches |
| azure-compute | 62 | skill-invocation, completed, output-not-matches |
| azure-cost | 60 | skill-invocation, completed, output-not-matches |
| azure-deploy | 23 | skill-invocation, completed, output-not-matches |
| azure-diagnostics | 40 | skill-invocation, completed, output-not-matches |
| azure-enterprise-infra-planner | 39 | skill-invocation, completed, output-not-matches |
| azure-hosted-copilot-sdk | 18 | skill-invocation, completed, output-not-matches |
| azure-kubernetes | 44 | skill-invocation, completed, output-not-matches |
| azure-messaging | 32 | skill-invocation, completed, output-not-matches |
| azure-prepare | 38 | skill-invocation, completed, output-not-matches |
| azure-quotas | 25 | skill-invocation, completed, output-not-matches |
| azure-rbac | 35 | skill-invocation, completed, output-not-matches |
| azure-resource-lookup | 46 | skill-invocation, completed, output-not-matches |
| azure-resource-visualizer | 37 | skill-invocation, completed, output-not-matches |
| azure-upgrade | 16 | skill-invocation, completed, output-not-matches |
| azure-validate | 23 | skill-invocation, completed, output-not-matches |
| entra-app-registration | 19 | skill-invocation, completed, output-not-matches |
| microsoft-foundry | 38 | skill-invocation, completed, output-not-matches |

Total: 722 stimuli across 22 skills
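
The total can be cross-checked against the per-skill counts in the table above. The following is a minimal TypeScript sketch; `stimuliBySkill` and `totalStimuli` are illustrative names, not part of this PR or of Vally.

```typescript
// Per-skill stimulus counts, copied from the "Skills Migrated" table.
const stimuliBySkill: Record<string, number> = {
  "appinsights-instrumentation": 17,
  "azure-ai": 25,
  "azure-aigateway": 25,
  "azure-cloud-migrate": 24,
  "azure-compliance": 36,
  "azure-compute": 62,
  "azure-cost": 60,
  "azure-deploy": 23,
  "azure-diagnostics": 40,
  "azure-enterprise-infra-planner": 39,
  "azure-hosted-copilot-sdk": 18,
  "azure-kubernetes": 44,
  "azure-messaging": 32,
  "azure-prepare": 38,
  "azure-quotas": 25,
  "azure-rbac": 35,
  "azure-resource-lookup": 46,
  "azure-resource-visualizer": 37,
  "azure-upgrade": 16,
  "azure-validate": 23,
  "entra-app-registration": 19,
  "microsoft-foundry": 38,
};

// Sum the stimulus counts across all skills.
function totalStimuli(counts: Record<string, number>): number {
  return Object.values(counts).reduce((sum, n) => sum + n, 0);
}

const skillCount = Object.keys(stimuliBySkill).length; // number of skills listed
const total = totalStimuli(stimuliBySkill); // should agree with the stated total
```

If a skill's stimuli change in a later commit, rerunning a check like this would show whether the stated total is still accurate.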

Changes

  • Adds 22 new evals/*/triggers.eval.yaml configs
  • Each config includes positive triggers (skill SHOULD be invoked) and negative triggers (skill should NOT be invoked)
  • Consistent grader pattern: skill-invocation + completed + output-not-matches (fatal error guard)
  • Updates CI workflow (eval.yml) for new eval paths
  • Cleans up legacy test configs
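
To make the grader pattern listed above concrete, here is a hedged TypeScript sketch of how the three checks could combine for a single run. It is not Vally's implementation: the `RunResult` shape, the `gradeTrigger` function, and the `FATAL_ERROR` pattern are all hypothetical, and the PR does not show the real fatal-error regex.

```typescript
// Hypothetical shape of one run's outcome; field names are illustrative only.
interface RunResult {
  invokedSkills: string[]; // skills the agent invoked during the run
  completed: boolean; // whether the run finished
  output: string; // final text output of the run
}

type Polarity = "positive" | "negative";

// Placeholder pattern; stands in for the PR's fatal-error guard regex.
const FATAL_ERROR = /fatal error/i;

function gradeTrigger(run: RunResult, skill: string, polarity: Polarity): boolean {
  const invoked = run.invokedSkills.includes(skill);
  // skill-invocation: required for positive stimuli, disallowed for negative ones.
  const routingOk = polarity === "positive" ? invoked : !invoked;
  // completed: the run must finish.
  // output-not-matches: the output must not match the fatal-error pattern.
  return routingOk && run.completed && !FATAL_ERROR.test(run.output);
}
```

In this sketch a negative stimulus passes only when the skill is absent from `invokedSkills`, which mirrors the "should NOT be invoked" wording above.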

Eval Patterns

All specs follow the validated patterns from #1912:

  • Model: claude-sonnet-4.5
  • Executor: copilot-sdk
  • Duration format: human-friendly ("2m")
  • 3 runs per stimulus for routing reliability
  • Valid grader types only (per published @microsoft/vally-cli@0.4.0)
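
As an illustration of the human-friendly duration format, the sketch below converts a string such as "2m" to seconds. It assumes only the suffixes `s`, `m`, and `h`; the set of units Vally actually accepts is not stated in this PR, and `parseDuration` is a hypothetical helper, not a Vally API.

```typescript
// Seconds per unit; the supported suffixes are an assumption of this sketch.
const UNIT_SECONDS: Record<string, number> = { s: 1, m: 60, h: 3600 };

// Parse "<integer><unit>" (for example "2m") into a number of seconds.
function parseDuration(text: string): number {
  const match = /^(\d+)([smh])$/.exec(text.trim());
  if (match === null) {
    throw new Error(`Unrecognized duration: ${text}`);
  }
  return Number(match[1]) * UNIT_SECONDS[match[2]];
}
```

Under these assumptions, "2m" corresponds to 120 seconds.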

Dependencies

wbreza and others added 9 commits May 15, 2026 15:26
The eval workflow was invoking npx @microsoft/vally-cli without any
npm auth setup, so npm fell back to the public registry and the package
(published to GitHub Packages) could not be resolved.

- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the
  @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev
  and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate
  token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for
  defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it
  explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from
  azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files
  (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update ai-bench references in evals/README.md to microsoft/evaluate
  (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals
  locally — the @microsoft/vally-cli package from GitHub Packages is
  sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/vally

- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert trigger test prompt lists from Jest triggers.test.ts files
and Waza trigger_tests.yaml files into Vally eval configs using
skill-invocation grader with required/disallowed assertions.

22 skills covered, 722 total stimuli:
  - 458 positive triggers (skill-invocation required)
  - 264 negative triggers (skill-invocation disallowed)

Merged Waza YAML prompts for azure-deploy, azure-prepare,
and azure-enterprise-infra-planner (deduplicated).

Tags: type=trigger, polarity=positive/negative, tier=full,
cost=free, area=routing

Config: runs=3, timeout=120, model=claude-sonnet-4, threshold=0.8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
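
The commit above does not say how the threshold is applied. Assuming, purely for illustration, that it is compared against each stimulus's pass rate over its runs, the TypeScript sketch below shows what runs=3 and threshold=0.8 would imply. `passRate` and `meetsThreshold` are hypothetical helpers, not Vally functions.

```typescript
// Fraction of runs that passed for one stimulus.
function passRate(runs: boolean[]): number {
  if (runs.length === 0) {
    throw new Error("at least one run is required");
  }
  return runs.filter(Boolean).length / runs.length;
}

// True when the pass rate reaches the threshold (assumed per-stimulus semantics).
function meetsThreshold(runs: boolean[], threshold: number): boolean {
  return passRate(runs) >= threshold;
}
```

Under that assumption, two passes out of three runs give a rate of about 0.67, which is below 0.8, so a stimulus would need all three runs to pass.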
- Correct model identifiers, grader types, tag format, and scoring weights
  per verified patterns from PR #1912

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change cost: free to cost: llm for copilot-sdk trigger stimuli
- Add scope note for Foundry sub-skills deferral
- Fix README: @microsoft/vally -> @microsoft/vally-cli, distinct mock/copilot-sdk commands
- Add scoping comment for nodejs_entry_point grader

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR migrates trigger routing test coverage for 22 skills from the Jest-based approach to Vally eval specs, by adding per-skill triggers.eval.yaml files containing positive/negative routing stimuli and a consistent grader pattern.

Changes:

  • Added 22 new evals/*/triggers.eval.yaml specs with positive (must invoke skill) and negative (must not invoke skill) prompts.
  • Standardized routing graders across the new specs (skill-invocation, completed, and a fatal-error guard via output-not-matches).
  • Removed Vally-related npm scripts from tests/package.json (leaving only Jest/lint/typecheck scripts).
| File | Description |
| --- | --- |
| tests/package.json | Removes Vally eval / eval:lint scripts from the tests package scripts. |
| evals/appinsights-instrumentation/triggers.eval.yaml | Adds Vally trigger routing stimuli for appinsights-instrumentation. |
| evals/azure-ai/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-ai. |
| evals/azure-aigateway/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-aigateway. |
| evals/azure-cloud-migrate/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-cloud-migrate. |
| evals/azure-compliance/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-compliance. |
| evals/azure-compute/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-compute. |
| evals/azure-cost/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-cost. |
| evals/azure-deploy/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-deploy. |
| evals/azure-diagnostics/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-diagnostics. |
| evals/azure-enterprise-infra-planner/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-enterprise-infra-planner. |
| evals/azure-hosted-copilot-sdk/triggers.eval.yaml | Adds/extends Vally trigger routing stimuli for azure-hosted-copilot-sdk. |
| evals/azure-kubernetes/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-kubernetes. |
| evals/azure-messaging/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-messaging. |
| evals/azure-prepare/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-prepare. |
| evals/azure-quotas/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-quotas. |
| evals/azure-rbac/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-rbac. |
| evals/azure-resource-lookup/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-resource-lookup. |
| evals/azure-resource-visualizer/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-resource-visualizer. |
| evals/azure-upgrade/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-upgrade. |
| evals/azure-validate/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-validate. |
| evals/entra-app-registration/triggers.eval.yaml | Adds Vally trigger routing stimuli for entra-app-registration. |
| evals/microsoft-foundry/triggers.eval.yaml | Adds Vally trigger routing stimuli for microsoft-foundry. |

Copilot's findings

  • Files reviewed: 23/23 changed files
  • Comments generated: 1

Comment thread tests/package.json
Address PR #2282 review feedback. Update note to recommend
npx @microsoft/vally-cli exclusively (matching .github/workflows/eval.yml)
since the eval / eval:lint npm scripts were removed during Vally migration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

🔍 Token Analysis Report

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts compare --base origin/main --head HEAD --markdown

📊 Token Change Report

Comparing origin/main → HEAD

Summary

| Metric | Value |
| --- | --- |
| 📈 Total Change | +3 tokens (0%) |
| Before | 1,465 tokens |
| After | 1,468 tokens |
| Files Changed | 1 |

Changed Files

| File | Before | After | Change |
| --- | --- | --- | --- |
| tests/azure-prepare/eval/README.md | 1,465 | 1,468 | +3 (0%) |

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts check --markdown

📊 Token Limit Check Report

Checked: 644 files
Exceeded: 94 files

⚠️ Files Exceeding Token Limits

| File | Tokens | Limit | Over By |
| --- | --- | --- | --- |
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/airunway-aks-setup/SKILL.md | 1025 | 500 | +525 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 937 | 500 | +437 |
| plugin/skills/azure-ai/SKILL.md | 820 | 500 | +320 |
| plugin/skills/azure-aigateway/SKILL.md | 1261 | 500 | +761 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 1085 | 500 | +585 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md | 2458 | 2000 | +458 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md | 2587 | 2000 | +587 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md | 3871 | 2000 | +1871 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1188 | 500 | +688 |
| plugin/skills/azure-compute/SKILL.md | 1370 | 500 | +870 |
| plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md | 2344 | 2000 | +344 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1980 | 500 | +1480 |
| plugin/skills/azure-deploy/SKILL.md | 1645 | 500 | +1145 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4692 | 2000 | +2692 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4004 | 2000 | +2004 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1423 | 500 | +923 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 1002 | 500 | +502 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1332 | 500 | +832 |
| plugin/skills/azure-kubernetes/SKILL.md | 2606 | 500 | +2106 |
| plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md | 3609 | 500 | +3109 |
| plugin/skills/azure-kusto/SKILL.md | 2152 | 500 | +1652 |
| plugin/skills/azure-messaging/SKILL.md | 821 | 500 | +321 |
| plugin/skills/azure-prepare/SKILL.md | 3375 | 500 | +2875 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2617 | 2000 | +617 |
| plugin/skills/azure-prepare/references/recipes/azd/aspire.md | 2275 | 2000 | +275 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3555 | 2000 | +1555 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3127 | 2000 | +1127 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 2813 | 2000 | +813 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3404 | 2000 | +1404 |
| plugin/skills/azure-prepare/references/services/sql-database/bicep.md | 2037 | 2000 | +37 |
| plugin/skills/azure-quotas/SKILL.md | 2821 | 500 | +2321 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-reliability/SKILL.md | 5659 | 500 | +5159 |
| plugin/skills/azure-reliability/references/configure-multi-region.md | 4729 | 2000 | +2729 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1394 | 500 | +894 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2122 | 500 | +1622 |
| plugin/skills/azure-storage/SKILL.md | 1228 | 500 | +728 |
| plugin/skills/azure-upgrade/SKILL.md | 1542 | 500 | +1042 |
| plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md | 2724 | 2000 | +724 |
| plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md | 2215 | 2000 | +215 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md | 2411 | 2000 | +411 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md | 2315 | 2000 | +315 |
| plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md | 2190 | 2000 | +190 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 950 | 500 | +450 |
| plugin/skills/entra-agent-id/SKILL.md | 4001 | 500 | +3501 |
| plugin/skills/entra-app-registration/SKILL.md | 2070 | 500 | +1570 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 3955 | 500 | +3455 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md | 4824 | 2000 | +2824 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 6203 | 2000 | +4203 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2494 | 2000 | +494 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | 2088 | 2000 | +88 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4325 | 2000 | +2325 |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | 3436 | 2000 | +1436 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 3190 | 2000 | +1190 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | 3860 | 2000 | +1860 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | 2072 | 2000 | +72 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | 2373 | 2000 | +373 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |

Consider moving content to references/ subdirectories.
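
The limits in the report above appear to follow a simple rule: 500 tokens for files named SKILL.md and 2000 tokens for other Markdown files. That rule is inferred from the reported rows, not taken from the checker's source, so the TypeScript sketch below is only an approximation; `limitFor` and `checkLimit` are hypothetical names.

```typescript
// One row of the report: a file that exceeds its token limit.
interface LimitViolation {
  file: string;
  tokens: number;
  limit: number;
  overBy: number;
}

// Assumed rule, inferred from the report: SKILL.md gets 500, everything else 2000.
function limitFor(file: string): number {
  const name = file.split("/").pop() ?? file;
  return name === "SKILL.md" ? 500 : 2000;
}

// Returns a violation when the token count exceeds the limit, otherwise null.
function checkLimit(file: string, tokens: number): LimitViolation | null {
  const limit = limitFor(file);
  return tokens > limit ? { file, tokens, limit, overBy: tokens - limit } : null;
}
```

For example, `checkLimit("plugin/skills/azure-ai/SKILL.md", 820)` yields an `overBy` of 320, matching the corresponding row in the report.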


Automated token analysis. See skill authoring guidelines for best practices.

Collaborator

@jongio jongio left a comment


Verified all 22 eval configs structurally - skill paths, grader configs (required/disallowed), and positive/negative trigger balance all check out. Consistent config across all files. CI green. Clean migration.
