feat: migrate Batch 2 Jest integration tests to Vally (6 medium skills) #2237
wbreza wants to merge 24 commits into
# 🔍 Token Analysis Report
## 📊 Token Limit Check Report
Checked: 645 files
| File | Tokens | Limit | Over By |
|---|---|---|---|
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/airunway-aks-setup/SKILL.md | 1025 | 500 | +525 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 937 | 500 | +437 |
| plugin/skills/azure-ai/SKILL.md | 820 | 500 | +320 |
| plugin/skills/azure-aigateway/SKILL.md | 1261 | 500 | +761 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 1085 | 500 | +585 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md | 2458 | 2000 | +458 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md | 2587 | 2000 | +587 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md | 3871 | 2000 | +1871 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1188 | 500 | +688 |
| plugin/skills/azure-compute/SKILL.md | 1370 | 500 | +870 |
| plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md | 2344 | 2000 | +344 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1980 | 500 | +1480 |
| plugin/skills/azure-deploy/SKILL.md | 1645 | 500 | +1145 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4692 | 2000 | +2692 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4004 | 2000 | +2004 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1423 | 500 | +923 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 1002 | 500 | +502 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1332 | 500 | +832 |
| plugin/skills/azure-kubernetes/SKILL.md | 2606 | 500 | +2106 |
| plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md | 3609 | 500 | +3109 |
| plugin/skills/azure-kusto/SKILL.md | 2152 | 500 | +1652 |
| plugin/skills/azure-messaging/SKILL.md | 821 | 500 | +321 |
| plugin/skills/azure-prepare/SKILL.md | 3375 | 500 | +2875 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2617 | 2000 | +617 |
| plugin/skills/azure-prepare/references/recipes/azd/aspire.md | 2275 | 2000 | +275 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3555 | 2000 | +1555 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3127 | 2000 | +1127 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 2813 | 2000 | +813 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3404 | 2000 | +1404 |
| plugin/skills/azure-prepare/references/services/sql-database/bicep.md | 2037 | 2000 | +37 |
| plugin/skills/azure-quotas/SKILL.md | 2821 | 500 | +2321 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-reliability/SKILL.md | 5659 | 500 | +5159 |
| plugin/skills/azure-reliability/references/configure-multi-region.md | 4729 | 2000 | +2729 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1394 | 500 | +894 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2122 | 500 | +1622 |
| plugin/skills/azure-storage/SKILL.md | 1228 | 500 | +728 |
| plugin/skills/azure-upgrade/SKILL.md | 1542 | 500 | +1042 |
| plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md | 2724 | 2000 | +724 |
| plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md | 2215 | 2000 | +215 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md | 2411 | 2000 | +411 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md | 2315 | 2000 | +315 |
| plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md | 2190 | 2000 | +190 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 950 | 500 | +450 |
| plugin/skills/entra-agent-id/SKILL.md | 4001 | 500 | +3501 |
| plugin/skills/entra-app-registration/SKILL.md | 2070 | 500 | +1570 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 3955 | 500 | +3455 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md | 4824 | 2000 | +2824 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 6203 | 2000 | +4203 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2494 | 2000 | +494 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | 2088 | 2000 | +88 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4325 | 2000 | +2325 |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | 3436 | 2000 | +1436 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 3190 | 2000 | +1190 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | 3860 | 2000 | +1860 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | 2072 | 2000 | +72 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | 2373 | 2000 | +373 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |
Consider moving content to `references/` subdirectories.
Automated token analysis. See skill authoring guidelines for best practices.
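The limit check in the report can be sketched in a few lines of Python. This is a minimal illustration, not the bot's actual implementation: the rule that `SKILL.md` files are capped at 500 tokens and every other listed file at 2000 is inferred from the table, and the function names `limit_for` and `over_by` are hypothetical. Token counts are taken as inputs because the report does not say which tokenizer is used.

```python
from pathlib import PurePosixPath

# Limits inferred from the report table (an assumption, not documented behavior):
# SKILL.md files are capped at 500 tokens, every other listed file at 2000.
SKILL_LIMIT = 500
DEFAULT_LIMIT = 2000


def limit_for(path: str) -> int:
    """Return the token limit that appears to apply to a file path."""
    if PurePosixPath(path).name == "SKILL.md":
        return SKILL_LIMIT
    return DEFAULT_LIMIT


def over_by(path: str, tokens: int) -> int:
    """Return how many tokens a file exceeds its limit by (0 if within the limit)."""
    return max(0, tokens - limit_for(path))


if __name__ == "__main__":
    # Token counts copied from three rows of the report.
    rows = {
        ".github/skills/file-test-bug/SKILL.md": 628,
        ".github/skills/sensei/README.md": 3531,
        "plugin/skills/azure-deploy/references/troubleshooting.md": 2038,
    }
    for path, tokens in rows.items():
        print(f"{path}: limit {limit_for(path)}, over by +{over_by(path, tokens)}")
```

Under these assumed limits, the three sample rows reproduce the report's "Over By" values (+128, +1531, +38).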
Pull request overview
This PR continues the repo’s Jest → Vally migration by adding Vally eval specs for 6 additional “medium” Azure skills and updating CI/auth wiring so those evals can run under the Vally framework.
Changes:
- Added 6 new Vally eval specs under `evals/**/eval.yaml` for Batch 2 skills.
- Updated the eval workflow to use `COPILOT_GITHUB_TOKEN` for Copilot SDK-backed execution.
- Cleaned up/modernized some legacy per-skill eval configs under `tests/**` and removed now-unused `tests/package.json` eval scripts.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/package.json | Removes Vally-related npm scripts from the tests package scripts list. |
| tests/azure-prepare/eval/README.md | Updates local-run instructions for running the azure-prepare eval via Vally CLI. |
| tests/azure-prepare/eval/eval.yaml | Updates legacy azure-prepare eval config fields (runs/timeout/model) and graders. |
| tests/azure-enterprise-infra-planner/evals/eval.yaml | Updates legacy eval config fields/model and graders to newer Vally-style keys. |
| tests/azure-deploy/eval/eval.yaml | Updates legacy eval config fields/model and graders to newer Vally-style keys. |
| evals/azure-rbac/eval.yaml | New Vally stimulus-based integration eval for azure-rbac. |
| evals/azure-quotas/eval.yaml | New Vally stimulus-based integration eval for azure-quotas. |
| evals/azure-messaging/eval.yaml | New Vally stimulus-based integration eval for azure-messaging. |
| evals/azure-kubernetes/eval.yaml | New Vally stimulus-based integration eval for azure-kubernetes. |
| evals/azure-cost/eval.yaml | New Vally stimulus-based integration eval for azure-cost. |
| evals/azure-compute/eval.yaml | New Vally stimulus-based integration eval for azure-compute. |
| evals/_template/eval.yaml | Adds a template file to standardize creation of future eval specs. |
| .github/workflows/eval.yml | Updates workflow auth env var to COPILOT_GITHUB_TOKEN for eval runs. |
jongio left a comment
Clean migration - 67 stimuli across 6 skills following the Batch 1 pattern from #1912. The eval configs are structurally sound, CI passes, and the known coverage gaps (tool-call arg matching) are documented in the comments. One pattern deviation worth noting inline.
jongio left a comment
The new build steps look right, but the workflow now fails because nbgv (Nerdbank.GitVersioning) can't calculate version height from a shallow clone. See inline comment.
Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)
Add .vally.yaml project config with paths for skills and evals.
Add evals/_base/common-graders.yaml as shared grader reference.
Grader mappings: regex->output-matches, file->file-exists/file-matches, code->completed, behavior->constraints.
Global graders duplicated per stimulus as workaround for evaluate#125.
All prompts, regex patterns, and expected outputs preserved verbatim. Zero test case coverage loss.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…executor
Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)
Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full
Switch executor from mock to copilot-sdk for real agent evaluation.
Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4 (SDK rejects version-pinned model names)
Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout. Failures are grader calibration issues (brittle output-contains substrings, file-exists for files the agent doesn't write to disk), not migration bugs. Grader tuning tracked as follow-up work.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to repo root)
Skill paths in environment.skills are resolved relative to the eval.yaml file location, not the repo root. Added ../../ prefix to climb from evals/<skill>/ to the repo root.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
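The effect of the `../../` prefix can be illustrated with plain path arithmetic. This is a sketch of the resolution rule described in the commit message, not Vally's own resolver; the function name `resolve_skill_path` and the sample skill path `plugin/skills/azure-deploy` are illustrative assumptions (the real `environment.skills` values are not shown on this page).

```python
import posixpath


def resolve_skill_path(eval_file: str, skill_ref: str) -> str:
    """Resolve skill_ref relative to the directory that contains eval_file."""
    base_dir = posixpath.dirname(eval_file)
    return posixpath.normpath(posixpath.join(base_dir, skill_ref))


if __name__ == "__main__":
    eval_file = "evals/azure-deploy/eval.yaml"  # hypothetical eval spec location

    # Without the prefix, the reference stays inside the eval directory.
    print(resolve_skill_path(eval_file, "plugin/skills/azure-deploy"))

    # With ../../ the reference climbs two levels to the repo root first.
    print(resolve_skill_path(eval_file, "../../plugin/skills/azure-deploy"))
```

The first call yields a path under `evals/azure-deploy/`, which would not exist; the second yields a repo-root-relative path, which matches the fix the commit describes.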
The eval workflow was invoking npx @microsoft/vally-cli without any npm auth setup, so npm fell back to the public registry and the package (published to GitHub Packages) could not be resolved.
- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev and CI both resolve it through a single config path
This mirrors the working setup in wbreza/skills.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth
Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency
Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics
Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory
Follow-up issue #1920 tracks wiring CI to a curated medium suite.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update ai-bench references in evals/README.md to microsoft/evaluate (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals locally; the @microsoft/vally-cli package from GitHub Packages is sufficient
Addresses JasonYeMSFT's review question on evals/README.md.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/vally
- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert bare numeric timeout values to human-friendly duration strings required by Vally v0.4.0:
- azure-prepare: timeout: 600 -> "10m"
- azure-deploy: timeout: 420 -> "7m"
- azure-hosted-copilot-sdk: timeout: 600 -> "10m"
- azure-enterprise-infra-planner: timeout: 600 -> "10m"
No max_duration_ms fields found in current eval specs (P0-2 verified clean).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
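The conversion above is simple enough to sketch as a helper. This is an illustration only: the commit confirms the minute form (`"10m"`, `"7m"`), while the `h` and `s` suffixes used below are assumptions about the duration grammar, and `seconds_to_duration` is a hypothetical name rather than part of Vally.

```python
def seconds_to_duration(seconds: int) -> str:
    """Convert a bare timeout in seconds to a compact duration string.

    Only the "m" suffix is confirmed by the commit message; the "h" and "s"
    suffixes are assumed here for completeness.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    if seconds % 3600 == 0:
        return f"{seconds // 3600}h"
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    return f"{seconds}s"


if __name__ == "__main__":
    # The two values migrated in this commit.
    for old_timeout in (600, 420):
        print(f"timeout: {old_timeout} -> \"{seconds_to_duration(old_timeout)}\"")
```

Picking the largest unit that divides the value evenly keeps the output lossless, so a timeout such as 90 seconds stays `"90s"` rather than being rounded to minutes.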
…, P1-1, P1-3, P1-6, P1-7)
Add new graders to all eval specs:
- token-budget: caps token usage per eval session (P0-5)
- turn-count: enforces max conversational turns (P0-5)
- wall-time: enforces execution time limits (P0-5)
- error-count: zero-error tolerance policy (P1-6)
- skill-invocation: validates correct skill routing (P1-3)
Add scoring weights to all eval specs (P1-1):
- Security graders (output-not-matches): weight 2.0
- Skill-invocation: weight 2.0
- Metric-threshold graders: weight 0.5
- Standard graders: weight 0.5-1.0
Add consistent tags across all stimuli (P1-7):
- All stimuli now include skill tag for tag-based filtering
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace invalid output-not-contains grader with output-not-matches in azure-hosted-copilot-sdk eval
- Add missing weights section to azure-enterprise-infra-planner eval
- Add set -e to suite determination step in CI workflow
- Standardize model identifier to claude-sonnet-4 in azure-enterprise-infra-planner
- Fix .gitignore comment to clarify .waza-cache/ is legacy pre-Vally
- Update .waza.yaml reference to .vally.yaml in copilot-instructions.md
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…names
- Remove token-budget, turn-count, wall-time, error-count graders (not in published v0.4.0)
- Revert output-not-contains → output-not-matches (output-not-contains is valid)
- Update scoring weights to match available grader types
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Convert YAML tag arrays to scalar strings in enterprise-infra-planner (7 instances)
- Restore output-not-contains graders for error/failed in hosted-copilot-sdk (5 stimuli)
- Switch npm install to --ignore-scripts in eval workflow (CI parity)
- Add pr_number input to workflow_dispatch for fork PR eval support
- .npmrc already deleted in prior commit (no action needed)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate 6 medium-complexity Jest integration test files to Vally eval configs:
- azure-messaging (10 stimuli, 3x runs)
- azure-kubernetes (15 stimuli, 5x runs)
- azure-rbac (6 stimuli, 5x runs)
- azure-compute (7 stimuli, 5x runs, tool-calls grader)
- azure-quotas (6 stimuli, 5x runs)
- azure-cost (16 stimuli, 3x runs)
Total: 60 stimuli across 6 eval configs.
Extended grader types: output-contains, output-matches, tool-calls.
All prompts preserved verbatim from Jest test files. Tags applied per UX design taxonomy.
Part of #1818
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…template
- Change cost: free to cost: llm for copilot-sdk stimuli
- Remove overly broad output-not-contains for 'error'/'failed'
- Remove low-signal tool-calls graders from azure-compute
- Add limitation comments for azure-quotas output-contains
- Add _template/eval.yaml with <skill-name> placeholder
- Fix README package name (@microsoft/vally -> @microsoft/vally-cli)
- Add nodejs_entry_point scope limitation comment
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Remove pr_number workflow dispatch input
- Fix tier: smoke on slow azure-enterprise-infra-planner stimulus
- Fix auth docs for local development
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
809964b to c09d2e2
- Add fetch-depth: 0 to eval.yml checkout so nbgv can compute version height (jongio)
- Add scoring.weights to all 6 Batch 2 eval.yaml files matching Batch 1 pattern (jongio)
- Update azure-compute eval header comments to accurately reflect coverage (copilot-reviewer)
- Move nodejs_entry_point from global to per-task graders (TS/Node only) in azure-prepare (copilot-reviewer)
- Fix azure-prepare README: distinguish mock vs copilot-sdk executor instructions, use COPILOT_GITHUB_TOKEN, align package name to @microsoft/vally-cli (copilot-reviewer)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio left a comment
Prior feedback addressed - the README, nbgv fetch-depth, and tool-calls gap notes all look good. One remaining consistency item: the Batch 1 eval (azure-hosted-copilot-sdk) defines scoring.weights to prioritize certain graders (e.g., skill-invocation: 2.0, output-not-matches: 2.0). All 6 Batch 2 evals omit this, so every grader carries equal weight. If that's intentional, no action needed - just flagging the deviation.
Adds the Batch 1 scoring.weights block (skill-invocation: 2.0, output-not-matches: 2.0, output-not-contains: 2.0, output-contains: 0.5) to all 6 Batch 2 eval.yaml files. Previously omitted, causing all graders to carry equal weight and diluting routing signal. Addresses review feedback from jongio on PR #2237.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
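To see why equal weights dilute the routing signal, consider a small worked example. It assumes the overall score is a weighted mean of pass/fail grader results; the page does not state how Vally actually aggregates grader results, so both the aggregation rule and the `weighted_score` helper are hypothetical. The weights themselves are the ones listed in the commit message.

```python
from typing import Dict, List, Optional, Tuple

# Weights from the commit message above.
BATCH1_WEIGHTS: Dict[str, float] = {
    "skill-invocation": 2.0,
    "output-not-matches": 2.0,
    "output-not-contains": 2.0,
    "output-contains": 0.5,
}


def weighted_score(
    results: List[Tuple[str, bool]],
    weights: Optional[Dict[str, float]] = None,
    default_weight: float = 1.0,
) -> float:
    """Return the weighted fraction of passing graders (assumed aggregation rule)."""
    total = 0.0
    earned = 0.0
    for grader_type, passed in results:
        weight = weights.get(grader_type, default_weight) if weights else default_weight
        total += weight
        if passed:
            earned += weight
    return earned / total if total else 0.0


if __name__ == "__main__":
    # One stimulus where routing failed but three substring checks passed.
    results = [
        ("skill-invocation", False),
        ("output-contains", True),
        ("output-contains", True),
        ("output-contains", True),
    ]
    print("equal weights:  ", round(weighted_score(results), 2))
    print("batch 1 weights:", round(weighted_score(results, BATCH1_WEIGHTS), 2))
```

With equal weights the failed routing check is outvoted three to one and the stimulus scores 0.75; with the Batch 1 weights the same results score roughly 0.43, so a routing failure is much harder to mask with easy substring matches.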
Summary
Migrates 6 Jest integration test suites to Vally eval configs (Batch 2), building on the foundation established in #1912.
Skills Migrated
- azure-compute
- azure-cost
- azure-kubernetes
- azure-messaging
- azure-quotas
- azure-rbac
Total: 67 stimuli across 6 skills
Changes
- `evals/*/eval.yaml` configs
- `azure-hosted-copilot-sdk` shared eval config
- Workflow (`eval.yml`) for new eval paths
- Legacy eval configs (`azure-deploy`, `azure-enterprise-infra-planner`, `azure-prepare`)
Eval Patterns
All specs follow the validated patterns from #1912:
- Model: `claude-sonnet-4.5`
- Executor: `copilot-sdk`
- Timeouts: duration strings (`"10m"`, `"7m"`)
- CLI: `@microsoft/vally-cli@0.4.0`
Dependencies