feat: migrate Batch 3 Jest integration tests to Vally (10 complex skills) #2236
wbreza wants to merge 37 commits into
Conversation
Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals. Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches, code->completed, behavior->constraints. Global graders duplicated per stimulus as workaround for evaluate#125.

All prompts, regex patterns, and expected outputs preserved verbatim. Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
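The grader mappings in this commit message can be read as a small lookup table. The following is only an illustrative sketch: `GRADER_MAP` and `map_grader` are hypothetical names, not part of Waza or Vally, and the table contains just the four mappings stated above.

```python
# Legacy Waza grader type -> Vally grader type(s), copied from the commit message.
# GRADER_MAP and map_grader are hypothetical helper names used for illustration.
GRADER_MAP = {
    "regex": ["output-matches"],
    "file": ["file-exists", "file-matches"],
    "code": ["completed"],
    "behavior": ["constraints"],
}


def map_grader(waza_type: str) -> list[str]:
    """Return the Vally grader type(s) that replace a Waza grader type."""
    try:
        return GRADER_MAP[waza_type]
    except KeyError:
        # Fail loudly so an unmapped grader is not silently dropped.
        raise ValueError(f"no Vally mapping for Waza grader type {waza_type!r}") from None
```

Raising on an unknown type, rather than returning an empty list, is one way to keep the "zero test case coverage loss" claim checkable during a migration.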
Replace azd waza run with npx @microsoft/vally-cli eval. Add setup-node with GitHub Packages registry for @microsoft/vally-cli. Add packages:read permission for GitHub Packages auth. Preserve trigger paths, artifact upload, and retention settings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…executor

Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation. Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4 (SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout. Failures are grader calibration issues (brittle output-contains substrings, file-exists for files the agent doesn't write to disk), not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to repo root) Skill paths in environment.skills are resolved relative to the eval.yaml file location, not the repo root. Added ../../ prefix to climb from evals/<skill>/ to the repo root. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
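The path rule described in this commit can be illustrated with plain path arithmetic. This is a sketch, not Vally's resolver: `resolve_skill_path` is a hypothetical helper, and the `plugin/skills/azure-deploy` location is used only as an example directory.

```python
import posixpath


def resolve_skill_path(eval_file: str, skill_path: str) -> str:
    """Resolve a skill path relative to the directory containing eval.yaml.

    Hypothetical helper: it mirrors the rule stated in the commit message,
    not the actual implementation inside the eval tool.
    """
    base_dir = posixpath.dirname(eval_file)
    return posixpath.normpath(posixpath.join(base_dir, skill_path))


# Without the ../../ prefix the path stays under evals/<skill>/.
wrong = resolve_skill_path("evals/azure-deploy/eval.yaml", "plugin/skills/azure-deploy")
# With the prefix it climbs two levels, back to the repo root.
right = resolve_skill_path("evals/azure-deploy/eval.yaml", "../../plugin/skills/azure-deploy")
```

Here `wrong` ends up inside `evals/azure-deploy/`, while `right` points at the skill directory relative to the repo root.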
The eval workflow was invoking npx @microsoft/vally-cli without any npm auth setup, so npm fell back to the public registry and the package (published to GitHub Packages) could not be resolved.
- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default GITHUB_TOKEN lacks read:packages access to the microsoft org's private @microsoft/vally-cli package, yielding 403s. Switch to a dedicated VALLY_NPM_TOKEN repo secret (PAT with read:packages, SSO-authorized for the microsoft org). Note: fork-originated pull_request runs do not receive secrets, so fork PRs will still fail auth until the package is made public or the trigger is reworked. Internal branches / workflow_dispatch / merges will resolve correctly once the secret is provisioned. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow maintainers to manually invoke the eval workflow from the Actions UI. This is needed to bypass the fork-PR secrets restriction: pull_request workflows triggered from a fork cannot access repository secrets, so the @microsoft/vally-cli install fails. Manual workflow_dispatch runs execute in the base repo context where secrets are available. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor in @microsoft/vally-cli reads GITHUB_TOKEN to create a Copilot session. The default Actions GITHUB_TOKEN doesn't have Copilot API scope, causing "Session was not created with authentication info or custom provider" at eval execution time. Reuse the existing repo secret COPILOT_CLI_TOKEN (a Copilot-enabled PAT) and expose it as GITHUB_TOKEN to the eval run step. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update ai-bench references in evals/README.md to microsoft/evaluate (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals locally; the @microsoft/vally-cli package from GitHub Packages is sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/vally
- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert bare numeric timeout values to human-friendly duration strings required by Vally v0.4.0:
- azure-prepare: timeout: 600 -> "10m"
- azure-deploy: timeout: 420 -> "7m"
- azure-hosted-copilot-sdk: timeout: 600 -> "10m"
- azure-enterprise-infra-planner: timeout: 600 -> "10m"

No max_duration_ms fields found in current eval specs (P0-2 verified clean).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
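The numeric-to-duration conversions listed in this commit follow simple arithmetic, sketched below. `seconds_to_duration` is a hypothetical helper; only the minute form (`"10m"`, `"7m"`) appears in this PR, so the `"s"` fallback for values that are not whole minutes is an assumption about the accepted format.

```python
def seconds_to_duration(seconds: int) -> str:
    """Convert a bare timeout in seconds to a duration string such as "10m".

    Hypothetical helper. The "m" form matches the values in the commit
    message; the "s" fallback is an assumption, not confirmed by the PR.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    minutes, remainder = divmod(seconds, 60)
    if remainder == 0:
        return f"{minutes}m"
    return f"{seconds}s"  # assumed suffix for non-whole-minute values
```

For the values in this PR, 600 maps to `"10m"`, 420 to `"7m"`, and 2700 to `"45m"`.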
…4, P1-5)
- Add vally lint job that runs before eval (P0-4)
- Add --junit flag for JUnit XML output (P0-3)
- Use --suite flag with dynamic suite selection (P1-2):
  - PR events run 'triggers' suite
  - Scheduled runs use 'full' suite
  - Manual dispatch allows suite selection
- Leverage .vally.yaml project config via --suite (P1-4)
- Add --threshold 0.8 to fail CI below score threshold (P1-5)
- Add PR trigger for skills/evals changes
- Add nightly schedule for full regression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
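The dynamic suite selection described in this commit amounts to a small decision on the triggering event. The sketch below restates that logic; `select_suite` is a hypothetical function, the event names are the standard GitHub Actions event names, and the `"smoke"` default for a manual run with no input is an assumption.

```python
def select_suite(event_name: str, requested: str = "") -> str:
    """Pick the eval suite for a workflow run, following the commit message.

    Hypothetical helper: the real workflow does this in a shell step.
    """
    if event_name == "pull_request":
        return "triggers"
    if event_name == "schedule":
        return "full"
    if event_name == "workflow_dispatch":
        # Manual runs use the suite the maintainer selected.
        # Falling back to "smoke" is an assumption for illustration.
        return requested or "smoke"
    raise ValueError(f"unsupported event: {event_name!r}")
```

Rejecting unknown events explicitly avoids silently running an unintended suite if a new trigger is added later.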
…, P1-1, P1-3, P1-6, P1-7)

Add new graders to all eval specs:
- token-budget: caps token usage per eval session (P0-5)
- turn-count: enforces max conversational turns (P0-5)
- wall-time: enforces execution time limits (P0-5)
- error-count: zero-error tolerance policy (P1-6)
- skill-invocation: validates correct skill routing (P1-3)

Add scoring weights to all eval specs (P1-1):
- Security graders (output-not-matches): weight 2.0
- Skill-invocation: weight 2.0
- Metric-threshold graders: weight 0.5
- Standard graders: weight 0.5-1.0

Add consistent tags across all stimuli (P1-7):
- All stimuli now include skill tag for tag-based filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
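To see how the scoring weights above might interact with a pass threshold such as the `--threshold 0.8` flag mentioned earlier, here is a minimal sketch of a weighted pass rate. How Vally actually aggregates grader results is not stated in this PR, so the weighted-average formula and the `weighted_score` helper are assumptions.

```python
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """Weighted pass rate: sum of weights of passing graders / sum of all weights.

    Assumed aggregation formula, for illustration only.
    """
    total = sum(weight for _, weight in results)
    if total == 0:
        raise ValueError("at least one grader must have a non-zero weight")
    return sum(weight for passed, weight in results if passed) / total


# (passed, weight) pairs using the weights from the commit message.
example = [
    (True, 2.0),   # output-not-matches (security)
    (True, 2.0),   # skill-invocation
    (False, 1.0),  # a standard grader that failed
    (True, 0.5),   # a metric-threshold grader
]
score = weighted_score(example)  # 4.5 / 5.5
```

Under this assumed formula, one failing standard grader still leaves the stimulus just above 0.8, whereas a failing security or skill-invocation grader would drop it well below.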
- Replace invalid output-not-contains grader with output-not-matches in azure-hosted-copilot-sdk eval
- Add missing weights section to azure-enterprise-infra-planner eval
- Add set -e to suite determination step in CI workflow
- Standardize model identifier to claude-sonnet-4 in azure-enterprise-infra-planner
- Fix .gitignore comment to clarify .waza-cache/ is legacy pre-Vally
- Update .waza.yaml reference to .vally.yaml in copilot-instructions.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…names
- Remove token-budget, turn-count, wall-time, error-count graders (not in published v0.4.0)
- Revert the output-not-contains → output-not-matches replacement (output-not-contains is valid)
- Update scoring weights to match available grader types

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Convert YAML tag arrays to scalar strings in enterprise-infra-planner (7 instances)
- Restore output-not-contains graders for error/failed in hosted-copilot-sdk (5 stimuli)
- Switch npm install to --ignore-scripts in eval workflow (CI parity)
- Add pr_number input to workflow_dispatch for fork PR eval support
- .npmrc already deleted in prior commit (no action needed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable - Run npm install to sync package-lock.json with package.json Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CLI executable is in the vally-cli package, not the vally library package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lls)

Migrate 10 complex-complexity Jest integration test files to Vally eval configs:

Batch 3A (7 smaller complex skills):
- _template (4 stimuli, workspace setup)
- azure-hosted-copilot-sdk (8 stimuli, security regression checks, 20min timeout)
- azure-resource-visualizer (4 stimuli, file pattern assertions, 30min timeout)
- azure-enterprise-infra-planner (8 stimuli, file validation)
- appinsights-instrumentation (4 stimuli, resource copying)
- azure-cloud-migrate (2 stimuli, cloneRepo, 45min timeout)
- azure-resource-lookup (7 stimuli, keyword validation)

Batch 3B (3 large complex skills):
- azure-deploy (39 stimuli, custom deploy validators, 60min timeout)
- azure-validate (11 stimuli, sparse checkout, 45min timeout)
- azure-prepare (38 stimuli, multi-language scaffolding, 60min timeout)

Total: 125 stimuli across 10 eval configs.

Extended grader types: file-exists, file-matches, output-not-matches (security).

Custom helpers mapped: hasDeployLinks, expectFiles, matchesCommand, countSecretsInCode, matchesFileEdit, doesWorkspaceFileIncludePattern.

All prompts preserved verbatim from Jest test files.

Part of #1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…re-validate evals Fix lint errors: prompt was nested under input.prompt instead of being a top-level stimulus field. Corrected in all 88 stimuli across 3 files. No prompt text changed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
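The lint fix in this commit moves `prompt` from `input.prompt` to the top level of each stimulus without changing its text. A minimal sketch of that transformation on a parsed stimulus follows; `hoist_prompt` and the `name` field in the example are hypothetical, and the real change was made directly in the YAML files.

```python
def hoist_prompt(stimulus: dict) -> dict:
    """Return a copy of a stimulus with input.prompt moved to a top-level prompt.

    Hypothetical helper illustrating the shape change; the prompt text is
    carried over unchanged.
    """
    fixed = dict(stimulus)
    inner = dict(fixed.get("input") or {})
    if "prompt" in inner and "prompt" not in fixed:
        fixed["prompt"] = inner.pop("prompt")
        if inner:
            fixed["input"] = inner   # keep any other input fields
        else:
            fixed.pop("input", None)  # drop the now-empty input mapping
    return fixed
```

For example, `{"name": "s1", "input": {"prompt": "Deploy my app"}}` becomes `{"name": "s1", "prompt": "Deploy my app"}`, and a stimulus that already has a top-level `prompt` is returned unchanged.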
The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix model identifier: claude-sonnet-4.5-20250514 -> claude-sonnet-4.5
- Fix legacy Waza grader type: regex -> output-not-matches
- Fix duration format: timeout_seconds: 2700 -> timeout: 45m
- Convert Waza config fields to Vally format (trials_per_task -> runs, etc.)
- Remove Waza-only top-level fields (skill, version, metrics)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Lint Eval Specs CI check failed because azure-deploy and azure-prepare eval specs were missing the required 'scoring' section, triggering scoring-defaults-applied warnings that are treated as errors in strict mode. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Convert azure-deploy, azure-prepare from Waza to Vally schema
- Change cost: free to cost: llm for copilot-sdk stimuli
- Add pending notes for workspace-dependent stimuli
- Fix template placeholder and README docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The @github/copilot-sdk checks COPILOT_GITHUB_TOKEN first (priority 1), before GH_TOKEN and GITHUB_TOKEN. Using GITHUB_TOKEN can conflict with the built-in Actions token. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
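The lookup order described in this commit can be sketched as a first-match scan over environment variables. This only illustrates the stated precedence (COPILOT_GITHUB_TOKEN, then GH_TOKEN, then GITHUB_TOKEN); `pick_token` is a hypothetical function, not the SDK's code.

```python
from typing import Mapping, Optional, Tuple

# Precedence as stated in the commit message, highest priority first.
TOKEN_VARS = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def pick_token(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (variable name, value) for the first non-empty token variable."""
    for name in TOKEN_VARS:
        value = env.get(name)
        if value:
            return name, value
    return None
```

With both `GITHUB_TOKEN` (the built-in Actions token) and `COPILOT_GITHUB_TOKEN` set, the dedicated variable wins, which is why the workflow no longer needs to overwrite `GITHUB_TOKEN`.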
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# 🔍 Token Analysis Report
📊 Token Change Report
Comparing
Summary
Changed Files
📊 Token Limit Check Report
Checked: 645 files
| File | Tokens | Limit | Over By |
|---|---|---|---|
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/airunway-aks-setup/SKILL.md | 1025 | 500 | +525 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 937 | 500 | +437 |
| plugin/skills/azure-ai/SKILL.md | 820 | 500 | +320 |
| plugin/skills/azure-aigateway/SKILL.md | 1261 | 500 | +761 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 1085 | 500 | +585 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md | 2458 | 2000 | +458 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md | 2587 | 2000 | +587 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md | 3871 | 2000 | +1871 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1188 | 500 | +688 |
| plugin/skills/azure-compute/SKILL.md | 1370 | 500 | +870 |
| plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md | 2344 | 2000 | +344 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1980 | 500 | +1480 |
| plugin/skills/azure-deploy/SKILL.md | 1645 | 500 | +1145 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4692 | 2000 | +2692 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4004 | 2000 | +2004 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1423 | 500 | +923 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 1002 | 500 | +502 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1332 | 500 | +832 |
| plugin/skills/azure-kubernetes/SKILL.md | 2606 | 500 | +2106 |
| plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md | 3609 | 500 | +3109 |
| plugin/skills/azure-kusto/SKILL.md | 2152 | 500 | +1652 |
| plugin/skills/azure-messaging/SKILL.md | 821 | 500 | +321 |
| plugin/skills/azure-prepare/SKILL.md | 3375 | 500 | +2875 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2617 | 2000 | +617 |
| plugin/skills/azure-prepare/references/recipes/azd/aspire.md | 2275 | 2000 | +275 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3555 | 2000 | +1555 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3127 | 2000 | +1127 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 2813 | 2000 | +813 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3404 | 2000 | +1404 |
| plugin/skills/azure-prepare/references/services/sql-database/bicep.md | 2037 | 2000 | +37 |
| plugin/skills/azure-quotas/SKILL.md | 2821 | 500 | +2321 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-reliability/SKILL.md | 5659 | 500 | +5159 |
| plugin/skills/azure-reliability/references/configure-multi-region.md | 4729 | 2000 | +2729 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1394 | 500 | +894 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2122 | 500 | +1622 |
| plugin/skills/azure-storage/SKILL.md | 1228 | 500 | +728 |
| plugin/skills/azure-upgrade/SKILL.md | 1542 | 500 | +1042 |
| plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md | 2724 | 2000 | +724 |
| plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md | 2215 | 2000 | +215 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md | 2411 | 2000 | +411 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md | 2315 | 2000 | +315 |
| plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md | 2190 | 2000 | +190 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 950 | 500 | +450 |
| plugin/skills/entra-agent-id/SKILL.md | 4001 | 500 | +3501 |
| plugin/skills/entra-app-registration/SKILL.md | 2070 | 500 | +1570 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 3955 | 500 | +3455 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md | 4824 | 2000 | +2824 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 6203 | 2000 | +4203 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2494 | 2000 | +494 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | 2088 | 2000 | +88 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4325 | 2000 | +2325 |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | 3436 | 2000 | +1436 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 3190 | 2000 | +1190 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | 3860 | 2000 | +1860 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | 2072 | 2000 | +72 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | 2373 | 2000 | +373 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |
Consider moving content to references/ subdirectories.
Automated token analysis. See skill authoring guidelines for best practices.
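The "Over By" column in the report is the token count minus the file's limit. Judging only from the rows shown, files named SKILL.md have a limit of 500 and every other listed file has a limit of 2000; that rule is inferred from the table, not taken from the checker's source. `token_limit` and `over_by` below are hypothetical helpers.

```python
import posixpath


def token_limit(path: str) -> int:
    """Limit inferred from the report: 500 for SKILL.md, 2000 otherwise."""
    return 500 if posixpath.basename(path) == "SKILL.md" else 2000


def over_by(path: str, tokens: int) -> int:
    """Number of tokens above the limit, or 0 when the file is within it."""
    return max(0, tokens - token_limit(path))
```

For instance, `.github/skills/sensei/SKILL.md` at 3026 tokens is 2526 over, matching the report.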
Pull request overview
Migrates a batch of complex skill integration tests from Jest into Vally eval specs under evals/**, with accompanying updates to legacy per-skill eval configs and the eval CI workflow.
Changes:
- Adds new Vally `evals/<skill>/eval.yaml` specs for multiple skills (routing/behavior/file graders, including template scaffolding).
- Updates legacy test-side eval configs under `tests/**/eval(s)/` to newer config/scoring/grader shapes.
- Modifies `.github/workflows/eval.yml` to lint evals/skills and run Vally in CI.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/package.json | Removes Vally-related npm scripts from the tests package. |
| tests/azure-prepare/eval/README.md | Updates local-running instructions for the azure-prepare eval suite. |
| tests/azure-prepare/eval/eval.yaml | Updates legacy azure-prepare eval config (runs/timeout/scoring/graders). |
| tests/azure-enterprise-infra-planner/evals/eval.yaml | Updates legacy eval config (runs/timeout/scoring/graders). |
| tests/azure-deploy/eval/eval.yaml | Updates legacy eval config (runs/timeout/scoring/graders). |
| evals/_template/eval.yaml | Adds a reference/template Vally eval spec for future migrations. |
| evals/appinsights-instrumentation/eval.yaml | Adds a new Vally eval spec migrated from Jest for appinsights instrumentation. |
| evals/azure-cloud-migrate/eval.yaml | Adds a new Vally eval spec migrated from Jest for cloud-migrate scenarios. |
| evals/azure-deploy/eval.yaml | Replaces/expands azure-deploy eval spec with a large Jest-migrated stimulus set. |
| evals/azure-enterprise-infra-planner/eval.yaml | Replaces/rewrites infra-planner eval spec migrated from Jest. |
| evals/azure-hosted-copilot-sdk/eval.yaml | Replaces/rewrites hosted-copilot-sdk eval spec migrated from Jest. |
| evals/azure-prepare/eval.yaml | Replaces/rewrites azure-prepare eval spec migrated from Jest with many stimuli and file graders. |
| evals/azure-resource-lookup/eval.yaml | Adds a new Vally eval spec migrated from Jest for resource-lookup prompts. |
| evals/azure-resource-visualizer/eval.yaml | Adds a new Vally eval spec migrated from Jest for resource-visualizer routing and file output. |
| evals/azure-validate/eval.yaml | Adds a new Vally eval spec migrated from Jest for validate scenarios (routing + file/output checks). |
| .github/workflows/eval.yml | Updates eval workflow triggers/inputs and switches CI execution behavior for Vally. |
jongio left a comment
One workflow item not covered by the existing review.
- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Fix auth docs for local development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio left a comment
My setup-node downgrade comment is addressed - restored to v6.4.0. The new commit also adds a proper build step, smoke eval PR gate, and switches all configs to the evals/ directory instead of a single hard-coded spec. Looks good.
ecc141d to 36fcc44
- eval.yml: wire suite input through to eval invocation (was unused)
- README: align Copilot SDK auth env var to COPILOT_GITHUB_TOKEN
- azure-resource-lookup: switch web|app|site grader to output-matches regex
- appinsights-instrumentation: replace file-matches with output-matches against the original Jest regex patterns (workspace seeding no longer available)
- azure-validate: replace AppHost.cs file-matches with output-contains for AzureWebJobsSecretStorageType guidance (no Aspire sample seeding in eval)
- azure-hosted-copilot-sdk: tighten 2 prompts to mention "Copilot SDK" explicitly so skill-invocation routing doesn't depend on missing workspace seeding

Refs #2236

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
36fcc44 to dffe4b4
jongio left a comment
The migration looks faithful and the workflow improvements are solid - build step, smoke gate, suite wiring all make sense. One gap worth checking: the PR gate's --tier smoke filter vs what's actually tagged in the eval configs.
Resolves conflicts in .vally.yaml and 4 eval YAMLs by preserving both sides' intent: keeping batch3 grader/prompt fixes from dffe4b4 while incorporating new stimuli and config from main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re-tags one cheap routing/invocation stimulus in each of the 10 eval files migrated in this PR with tier: smoke so the --suite smoke PR gate exercises real routing coverage for these skills (matching the convention already used by azure-ai, azure-aigateway, azure-diagnostics, azure-kusto, azure-storage, azure-upgrade, entra-app-registration, and microsoft-foundry on main). Addresses jongio's review feedback that the smoke gate had no matching stimuli for newly-migrated skills.

Refs #2236

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Migrates 10 Jest integration test suites to Vally eval configs (Batch 3), building on the foundation established in #1912. Includes template, complex multi-section eval specs, and skills with file-based graders.
Skills Migrated
- _template
- appinsights-instrumentation
- azure-cloud-migrate
- azure-deploy
- azure-enterprise-infra-planner
- azure-hosted-copilot-sdk
- azure-prepare
- azure-resource-lookup
- azure-resource-visualizer
- azure-validate

Total: 56 stimuli across 10 skills
Changes
- evals/*/eval.yaml configs (including _template reference spec)
- file-exists, file-matches, regex
- eval.yml) for new eval paths
- azure-deploy, azure-enterprise-infra-planner, azure-prepare)

Eval Patterns
All specs follow the validated patterns from #1912:
- claude-sonnet-4.5
- copilot-sdk
- "10m", "7m")
- @microsoft/vally-cli@0.4.0)

Dependencies