feat: migrate trigger tests to Vally eval configs (22 skills, 722 stimuli) #2282
wbreza wants to merge 10 commits
The eval workflow was invoking npx @microsoft/vally-cli without any npm auth setup, so npm fell back to the public registry and the package (published to GitHub Packages) could not be resolved.

- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update ai-bench references in evals/README.md to microsoft/evaluate (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals locally; the @microsoft/vally-cli package from GitHub Packages is sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/vally

- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert trigger test prompt lists from Jest triggers.test.ts files and Waza trigger_tests.yaml files into Vally eval configs using skill-invocation grader with required/disallowed assertions.

22 skills covered, 722 total stimuli:
- 458 positive triggers (skill-invocation required)
- 264 negative triggers (skill-invocation disallowed)

Merged Waza YAML prompts for azure-deploy, azure-prepare, and azure-enterprise-infra-planner (deduplicated).

Tags: type=trigger, polarity=positive/negative, tier=full, cost=free, area=routing
Config: runs=3, timeout=120, model=claude-sonnet-4, threshold=0.8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
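For readers unfamiliar with the required/disallowed split, here is a minimal sketch of how a positive or negative trigger could be graded. The `TriggerStimulus` shape and all function names are hypothetical and are not Vally's schema or API. It also assumes that `threshold` is a minimum pass rate across `runs`, which the commit message does not confirm.

```typescript
// Hypothetical shapes for illustration only; not Vally's actual schema.
type Polarity = "positive" | "negative";

interface TriggerStimulus {
  prompt: string;
  polarity: Polarity;
}

// A positive stimulus passes when the skill was invoked (required);
// a negative stimulus passes when it was not invoked (disallowed).
function gradeInvocation(stimulus: TriggerStimulus, skillInvoked: boolean): boolean {
  return stimulus.polarity === "positive" ? skillInvoked : !skillInvoked;
}

// Fraction of runs that passed. ASSUMPTION: this is what the threshold is compared to.
function passRate(runResults: boolean[]): number {
  if (runResults.length === 0) return 0;
  return runResults.filter((passed) => passed).length / runResults.length;
}

function meetsThreshold(runResults: boolean[], threshold: number): boolean {
  return passRate(runResults) >= threshold;
}

// Counts taken from the commit message above.
const positiveCount = 458;
const negativeCount = 264;
const totalStimuli = positiveCount + negativeCount;

console.log(totalStimuli); // 722
console.log(meetsThreshold([true, true, false], 0.8)); // false
console.log(meetsThreshold([true, true, true], 0.8)); // true
```

Under that assumed reading, runs=3 with threshold=0.8 would require all three runs to pass, since 2 of 3 is below 0.8.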
- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change cost: free to cost: llm for copilot-sdk trigger stimuli
- Add scope note for Foundry sub-skills deferral
- Fix README: @microsoft/vally -> @microsoft/vally-cli, distinct mock/copilot-sdk commands
- Add scoping comment for nodejs_entry_point grader

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR migrates trigger routing test coverage for 22 skills from the Jest-based approach to Vally eval specs, by adding per-skill triggers.eval.yaml files containing positive/negative routing stimuli and a consistent grader pattern.
Changes:
- Added 22 new `evals/*/triggers.eval.yaml` specs with positive (must invoke skill) and negative (must not invoke skill) prompts.
- Standardized routing graders across the new specs (`skill-invocation`, `completed`, and a fatal-error guard via `output-not-matches`).
- Removed Vally-related npm scripts from `tests/package.json` (leaving only Jest/lint/typecheck scripts).
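To illustrate why a narrow fatal-error guard is preferable to the broad "error"/"failed" checks that an earlier commit in this PR removed, here is a minimal sketch of an output-not-matches style check. The pattern and the function name are hypothetical. The PR does not show the actual regex used in the specs.

```typescript
// HYPOTHETICAL pattern for illustration; the real specs' regex is not shown in this PR.
const FATAL_ERROR_PATTERN = /\bfatal error\b/i;

// Passes (returns true) when the output does NOT match the pattern.
function outputNotMatches(output: string, pattern: RegExp): boolean {
  return !pattern.test(output);
}

const benignOutput = "Validation complete. No errors found.";
const fatalOutput = "Fatal error: out of memory";

console.log(outputNotMatches(benignOutput, FATAL_ERROR_PATTERN)); // true
console.log(outputNotMatches(fatalOutput, FATAL_ERROR_PATTERN)); // false
// A broad check flags the benign output as a failure:
console.log(outputNotMatches(benignOutput, /error|failed/i)); // false
```

The broad pattern produces a false failure on output that merely mentions the word "errors", which is the kind of noise a specific guard avoids.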
Show a summary per file
| File | Description |
|---|---|
| tests/package.json | Removes Vally eval / eval:lint scripts from the tests package scripts. |
| evals/appinsights-instrumentation/triggers.eval.yaml | Adds Vally trigger routing stimuli for appinsights-instrumentation. |
| evals/azure-ai/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-ai. |
| evals/azure-aigateway/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-aigateway. |
| evals/azure-cloud-migrate/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-cloud-migrate. |
| evals/azure-compliance/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-compliance. |
| evals/azure-compute/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-compute. |
| evals/azure-cost/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-cost. |
| evals/azure-deploy/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-deploy. |
| evals/azure-diagnostics/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-diagnostics. |
| evals/azure-enterprise-infra-planner/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-enterprise-infra-planner. |
| evals/azure-hosted-copilot-sdk/triggers.eval.yaml | Adds/extends Vally trigger routing stimuli for azure-hosted-copilot-sdk. |
| evals/azure-kubernetes/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-kubernetes. |
| evals/azure-messaging/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-messaging. |
| evals/azure-prepare/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-prepare. |
| evals/azure-quotas/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-quotas. |
| evals/azure-rbac/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-rbac. |
| evals/azure-resource-lookup/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-resource-lookup. |
| evals/azure-resource-visualizer/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-resource-visualizer. |
| evals/azure-upgrade/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-upgrade. |
| evals/azure-validate/triggers.eval.yaml | Adds Vally trigger routing stimuli for azure-validate. |
| evals/entra-app-registration/triggers.eval.yaml | Adds Vally trigger routing stimuli for entra-app-registration. |
| evals/microsoft-foundry/triggers.eval.yaml | Adds Vally trigger routing stimuli for microsoft-foundry. |
Copilot's findings
- Files reviewed: 23/23 changed files
- Comments generated: 1
Address PR #2282 review feedback.

Update note to recommend npx @microsoft/vally-cli exclusively (matching .github/workflows/eval.yml) since the eval / eval:lint npm scripts were removed during Vally migration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔍 Token Analysis Report

📊 Token Change Report
Comparing
Summary
Changed Files

📊 Token Limit Check Report
Checked: 644 files
| File | Tokens | Limit | Over By |
|---|---|---|---|
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/airunway-aks-setup/SKILL.md | 1025 | 500 | +525 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 937 | 500 | +437 |
| plugin/skills/azure-ai/SKILL.md | 820 | 500 | +320 |
| plugin/skills/azure-aigateway/SKILL.md | 1261 | 500 | +761 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 1085 | 500 | +585 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md | 2458 | 2000 | +458 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md | 2587 | 2000 | +587 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md | 3871 | 2000 | +1871 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1188 | 500 | +688 |
| plugin/skills/azure-compute/SKILL.md | 1370 | 500 | +870 |
| plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md | 2344 | 2000 | +344 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1980 | 500 | +1480 |
| plugin/skills/azure-deploy/SKILL.md | 1645 | 500 | +1145 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4692 | 2000 | +2692 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4004 | 2000 | +2004 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1423 | 500 | +923 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 1002 | 500 | +502 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1332 | 500 | +832 |
| plugin/skills/azure-kubernetes/SKILL.md | 2606 | 500 | +2106 |
| plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md | 3609 | 500 | +3109 |
| plugin/skills/azure-kusto/SKILL.md | 2152 | 500 | +1652 |
| plugin/skills/azure-messaging/SKILL.md | 821 | 500 | +321 |
| plugin/skills/azure-prepare/SKILL.md | 3375 | 500 | +2875 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2617 | 2000 | +617 |
| plugin/skills/azure-prepare/references/recipes/azd/aspire.md | 2275 | 2000 | +275 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3555 | 2000 | +1555 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3127 | 2000 | +1127 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 2813 | 2000 | +813 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3404 | 2000 | +1404 |
| plugin/skills/azure-prepare/references/services/sql-database/bicep.md | 2037 | 2000 | +37 |
| plugin/skills/azure-quotas/SKILL.md | 2821 | 500 | +2321 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-reliability/SKILL.md | 5659 | 500 | +5159 |
| plugin/skills/azure-reliability/references/configure-multi-region.md | 4729 | 2000 | +2729 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1394 | 500 | +894 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2122 | 500 | +1622 |
| plugin/skills/azure-storage/SKILL.md | 1228 | 500 | +728 |
| plugin/skills/azure-upgrade/SKILL.md | 1542 | 500 | +1042 |
| plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md | 2724 | 2000 | +724 |
| plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md | 2215 | 2000 | +215 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md | 2411 | 2000 | +411 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md | 2315 | 2000 | +315 |
| plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md | 2190 | 2000 | +190 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 950 | 500 | +450 |
| plugin/skills/entra-agent-id/SKILL.md | 4001 | 500 | +3501 |
| plugin/skills/entra-app-registration/SKILL.md | 2070 | 500 | +1570 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 3955 | 500 | +3455 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md | 4824 | 2000 | +2824 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 6203 | 2000 | +4203 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2494 | 2000 | +494 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | 2088 | 2000 | +88 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4325 | 2000 | +2325 |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | 3436 | 2000 | +1436 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 3190 | 2000 | +1190 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | 3860 | 2000 | +1860 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | 2072 | 2000 | +72 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | 2373 | 2000 | +373 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |
Consider moving content to `references/` subdirectories.
Automated token analysis. See skill authoring guidelines for best practices.
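The Limit and Over By columns in the report follow a simple pattern that can be sketched as below. The rule is inferred from the rows listed (files named SKILL.md get a 500-token limit, every other listed file 2000). It is an assumption about the checker, not its actual implementation, and both function names are hypothetical.

```typescript
// ASSUMPTION inferred from the report rows: SKILL.md files are limited to
// 500 tokens and all other listed files to 2000. The real checker may differ.
function tokenLimitFor(path: string): number {
  const fileName = path.split("/").pop() ?? "";
  return fileName === "SKILL.md" ? 500 : 2000;
}

// Tokens above the limit; 0 when the file is within its limit.
function overBy(path: string, tokens: number): number {
  return Math.max(0, tokens - tokenLimitFor(path));
}

console.log(overBy(".github/skills/analyze-skill-issues/SKILL.md", 2109)); // 1609
console.log(overBy("plugin/skills/azure-deploy/references/troubleshooting.md", 2038)); // 38
```

Both printed values match the corresponding Over By entries in the report.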
jongio
left a comment
Verified all 22 eval configs structurally - skill paths, grader configs (required/disallowed), and positive/negative trigger balance all check out. Consistent config across all files. CI green. Clean migration.
Replaces #1899 (migrated from fork to org branch for CI secret access).
Summary
Migrates trigger routing tests to Vally eval configs for all 22 skills, building on the foundation established in #1912. Each skill gets a dedicated `triggers.eval.yaml` with positive and negative routing stimuli.

Skills Migrated
- appinsights-instrumentation
- azure-ai
- azure-aigateway
- azure-cloud-migrate
- azure-compliance
- azure-compute
- azure-cost
- azure-deploy
- azure-diagnostics
- azure-enterprise-infra-planner
- azure-hosted-copilot-sdk
- azure-kubernetes
- azure-messaging
- azure-prepare
- azure-quotas
- azure-rbac
- azure-resource-lookup
- azure-resource-visualizer
- azure-upgrade
- azure-validate
- entra-app-registration
- microsoft-foundry

Total: 722 stimuli across 22 skills
Changes
- `evals/*/triggers.eval.yaml` configs
- `skill-invocation` + `completed` + `output-not-matches` (fatal error guard)
- … `eval.yml`) for new eval paths

Eval Patterns
All specs follow the validated patterns from #1912:
- `claude-sonnet-4.5`
- `copilot-sdk`
- … `"2m"`)
- … `@microsoft/vally-cli@0.4.0`)

Dependencies