feat: migrate Batch 3 Jest integration tests to Vally (10 complex skills) by wbreza · Pull Request #2236 · microsoft/GitHub-Copilot-for-Azure

wbreza · 2026-05-11T20:42:22Z

Summary

Migrates 10 Jest integration test suites to Vally eval configs (Batch 3), building on the foundation established in #1912. The batch includes the `_template` reference spec, complex multi-section eval specs, and skills with file-based graders.

Skills Migrated

Skill	Stimuli	Key Graders
`_template`	5	skill-invocation, completed, output-contains, output-not-matches, tool-calls
`appinsights-instrumentation`	4	skill-invocation, completed, file-matches, output-contains, output-not-matches
`azure-cloud-migrate`	2	skill-invocation, completed, file-exists, output-not-matches
`azure-deploy`	3	skill-invocation, completed, file-exists, file-matches, output-matches, regex
`azure-enterprise-infra-planner`	8	skill-invocation, completed, file-exists, output-not-matches
`azure-hosted-copilot-sdk`	8	skill-invocation, completed, output-contains, output-not-matches
`azure-prepare`	3	skill-invocation, completed, file-exists, file-matches, output-matches, regex
`azure-resource-lookup`	7	skill-invocation, completed, output-contains, output-not-matches
`azure-resource-visualizer`	4	skill-invocation, completed, file-matches, output-not-matches
`azure-validate`	12	skill-invocation, completed, file-matches, output-matches, output-not-matches

Total: 56 stimuli across 10 skills

Changes

Adds/updates 10 evals/*/eval.yaml configs (including _template reference spec)
Uses advanced grader types: file-exists, file-matches, regex
Updates CI workflow (eval.yml) for new eval paths
Cleans up legacy test configs (azure-deploy, azure-enterprise-infra-planner, azure-prepare)

Eval Patterns

All specs follow the validated patterns from #1912:

Model: claude-sonnet-4.5
Executor: copilot-sdk
Duration format: human-friendly ("10m", "7m")
Valid grader types only (per published @microsoft/vally-cli@0.4.0)

Dependencies

Requires ci(eval): migrate to Vally eval framework with v0.4.0 features #1912 (merged ✅)

Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals. Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches, code->completed, behavior->constraints. Global graders duplicated per stimulus as workaround for evaluate#125. All prompts, regex patterns, and expected outputs preserved verbatim. Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
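The grader mappings listed in this commit can be read as a lookup table. The sketch below is illustrative only: `WAZA_TO_VALLY` and `map_grader` are hypothetical names, not part of Waza or Vally, and the rule for choosing between `file-exists` and `file-matches` is an assumption.

```python
# Waza grader type -> candidate Vally grader types, as listed in the commit message.
# WAZA_TO_VALLY and map_grader are hypothetical names used only for illustration.
WAZA_TO_VALLY = {
    "regex": ["output-matches"],
    "file": ["file-exists", "file-matches"],
    "code": ["completed"],
    "behavior": ["constraints"],
}


def map_grader(waza_type: str, checks_content: bool = False) -> str:
    """Return a Vally grader type for a Waza grader type.

    Assumption: a Waza `file` grader becomes `file-matches` when it inspects
    file content and `file-exists` when it only checks for presence.
    """
    candidates = WAZA_TO_VALLY[waza_type]  # raises KeyError for unmapped types
    if waza_type == "file":
        return "file-matches" if checks_content else "file-exists"
    return candidates[0]
```

Raising `KeyError` on an unmapped type keeps an unknown grader from being silently dropped, which matters when the goal is zero coverage loss.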

Replace azd waza run with npx @microsoft/vally-cli eval. Add setup-node with GitHub Packages registry for @microsoft/vally-cli. Add packages:read permission for GitHub Packages auth. Preserve trigger paths, artifact upload, and retention settings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…executor

Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation.

Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4 (SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout. Failures are grader calibration issues (brittle output-contains substrings, file-exists for files the agent doesn't write to disk), not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…to repo root) Skill paths in environment.skills are resolved relative to the eval.yaml file location, not the repo root. Added ../../ prefix to climb from evals/<skill>/ to the repo root. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
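The path fix described here can be illustrated with plain path arithmetic. This is a minimal sketch, assuming POSIX-style paths; `resolve_skill_path` is a hypothetical helper and does not reproduce Vally's own resolution code.

```python
import posixpath


def resolve_skill_path(eval_yaml: str, skill_path: str) -> str:
    """Resolve a skill path relative to the directory that holds eval.yaml."""
    base_dir = posixpath.dirname(eval_yaml)
    return posixpath.normpath(posixpath.join(base_dir, skill_path))


# Without the prefix, the path stays under evals/<skill>/ and misses the skill.
without_prefix = resolve_skill_path(
    "evals/azure-deploy/eval.yaml", "plugin/skills/azure-deploy"
)

# With ../../ the path climbs from evals/<skill>/ back to the repo root.
with_prefix = resolve_skill_path(
    "evals/azure-deploy/eval.yaml", "../../plugin/skills/azure-deploy"
)
```

Here `without_prefix` is `evals/azure-deploy/plugin/skills/azure-deploy`, while `with_prefix` is `plugin/skills/azure-deploy`, which is why the `../../` prefix was needed.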

The eval workflow was invoking npx @microsoft/vally-cli without any npm auth setup, so npm fell back to the public registry and the package (published to GitHub Packages) could not be resolved.
- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The default GITHUB_TOKEN lacks read:packages access to the microsoft org's private @microsoft/vally-cli package, yielding 403s. Switch to a dedicated VALLY_NPM_TOKEN repo secret (PAT with read:packages, SSO-authorized for the microsoft org). Note: fork-originated pull_request runs do not receive secrets, so fork PRs will still fail auth until the package is made public or the trigger is reworked. Internal branches / workflow_dispatch / merges will resolve correctly once the secret is provisioned. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Allow maintainers to manually invoke the eval workflow from the Actions UI. This is needed to bypass the fork-PR secrets restriction: pull_request workflows triggered from a fork cannot access repository secrets, so the @microsoft/vally-cli install fails. Manual workflow_dispatch runs execute in the base repo context where secrets are available. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The copilot-sdk executor in @microsoft/vally-cli reads GITHUB_TOKEN to create a Copilot session. The default Actions GITHUB_TOKEN doesn't have Copilot API scope, causing "Session was not created with authentication info or custom provider" at eval execution time. Reuse the existing repo secret COPILOT_CLI_TOKEN (a Copilot-enabled PAT) and expose it as GITHUB_TOKEN to the eval run step. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
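The reason for dropping broad "error"/"failed" substring graders in favor of a specific regex can be shown with a small comparison. The sketch below is hypothetical: the `FATAL` pattern and both helper functions are invented for illustration, and the PR's actual fatal-error regex is not shown on this page.

```python
import re

# Hypothetical pattern; stands in for the PR's unspecified fatal-error regex.
FATAL = re.compile(r"\b(fatal error|unhandled exception)\b", re.IGNORECASE)


def broad_not_contains(output: str, needle: str = "error") -> bool:
    """Pass only if the substring is absent anywhere in the output."""
    return needle.lower() not in output.lower()


def specific_not_matches(output: str) -> bool:
    """Pass unless the output matches the narrow fatal-error pattern."""
    return FATAL.search(output) is None


benign = "Deployment finished with no errors."
fatal = "Fatal error: session could not be created"
```

The broad check fails on the benign message because "errors" contains "error", whereas the narrow pattern passes the benign message and still fails the fatal one.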

- Update ai-bench references in evals/README.md to microsoft/evaluate (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals locally; the @microsoft/vally-cli package from GitHub Packages is sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…/vally

- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Convert bare numeric timeout values to human-friendly duration strings required by Vally v0.4.0:
- azure-prepare: timeout: 600 -> "10m"
- azure-deploy: timeout: 420 -> "7m"
- azure-hosted-copilot-sdk: timeout: 600 -> "10m"
- azure-enterprise-infra-planner: timeout: 600 -> "10m"

No max_duration_ms fields found in current eval specs (P0-2 verified clean).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
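The timeout conversion above is simple arithmetic on seconds. The following sketch assumes the bare numbers were seconds (consistent with 600 -> "10m"); `to_duration` is a hypothetical helper, and the `s` suffix for values that are not whole minutes is an assumption, since only minute-based strings appear in this PR.

```python
def to_duration(seconds: int) -> str:
    """Convert a bare timeout in seconds to a short duration string."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    # Assumption: a seconds suffix is acceptable when minutes do not divide evenly.
    return f"{seconds}s"
```

For the values in this commit, `to_duration(600)` gives `"10m"` and `to_duration(420)` gives `"7m"`.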

…4, P1-5)

- Add vally lint job that runs before eval (P0-4)
- Add --junit flag for JUnit XML output (P0-3)
- Use --suite flag with dynamic suite selection (P1-2):
  - PR events run 'triggers' suite
  - Scheduled runs use 'full' suite
  - Manual dispatch allows suite selection
- Leverage .vally.yaml project config via --suite (P1-4)
- Add --threshold 0.8 to fail CI below score threshold (P1-5)
- Add PR trigger for skills/evals changes
- Add nightly schedule for full regression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
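The dynamic suite selection and score threshold described in this commit amount to a small decision table. The sketch below is not the workflow's actual shell logic: `select_suite` and `passes_threshold` are hypothetical names, the `smoke` fallback for manual dispatch is an assumption, and treating a score equal to the threshold as passing is inferred from "fail CI below score threshold".

```python
from typing import Optional


def select_suite(event_name: str, requested: Optional[str] = None) -> str:
    """Pick a suite name from the GitHub Actions event that triggered the run."""
    if event_name == "pull_request":
        return "triggers"
    if event_name == "schedule":
        return "full"
    if event_name == "workflow_dispatch":
        # Assumption: fall back to "smoke" when no suite is chosen manually.
        return requested or "smoke"
    raise ValueError(f"unsupported event: {event_name}")


def passes_threshold(score: float, threshold: float = 0.8) -> bool:
    """CI fails only when the score is strictly below the threshold."""
    return score >= threshold
```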

…, P1-1, P1-3, P1-6, P1-7)

Add new graders to all eval specs:
- token-budget: caps token usage per eval session (P0-5)
- turn-count: enforces max conversational turns (P0-5)
- wall-time: enforces execution time limits (P0-5)
- error-count: zero-error tolerance policy (P1-6)
- skill-invocation: validates correct skill routing (P1-3)

Add scoring weights to all eval specs (P1-1):
- Security graders (output-not-matches): weight 2.0
- Skill-invocation: weight 2.0
- Metric-threshold graders: weight 0.5
- Standard graders: weight 0.5-1.0

Add consistent tags across all stimuli (P1-7):
- All stimuli now include skill tag for tag-based filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
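To see how the weights above could influence a stimulus score, consider a weighted pass rate. This is only a sketch under an assumption: the PR does not state how Vally aggregates weighted graders, so the weighted mean used here, and the `weighted_score` helper itself, are hypothetical.

```python
from typing import List, Tuple


def weighted_score(results: List[Tuple[bool, float]]) -> float:
    """Weighted pass rate over (passed, weight) pairs.

    Assumption: the score is earned weight divided by total weight.
    """
    total = sum(weight for _, weight in results)
    if total == 0:
        raise ValueError("at least one grader needs a non-zero weight")
    earned = sum(weight for passed, weight in results if passed)
    return earned / total


example = [
    (True, 2.0),   # skill-invocation
    (True, 2.0),   # output-not-matches (security)
    (False, 0.5),  # metric-threshold grader
    (True, 1.0),   # standard grader
]
score = weighted_score(example)
```

Under this assumed aggregation, a failed low-weight metric grader costs 0.5 of 5.5 total weight, so the example still clears a 0.8 threshold, while a failed security or skill-invocation grader would not.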

- Replace invalid output-not-contains grader with output-not-matches in azure-hosted-copilot-sdk eval
- Add missing weights section to azure-enterprise-infra-planner eval
- Add set -e to suite determination step in CI workflow
- Standardize model identifier to claude-sonnet-4 in azure-enterprise-infra-planner
- Fix .gitignore comment to clarify .waza-cache/ is legacy pre-Vally
- Update .waza.yaml reference to .vally.yaml in copilot-instructions.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…names

- Remove token-budget, turn-count, wall-time, error-count graders (not in published v0.4.0)
- Revert the earlier output-not-contains → output-not-matches replacement (output-not-contains is valid)
- Update scoring weights to match available grader types

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Convert YAML tag arrays to scalar strings in enterprise-infra-planner (7 instances)
- Restore output-not-contains graders for error/failed in hosted-copilot-sdk (5 stimuli)
- Switch npm install to --ignore-scripts in eval workflow (CI parity)
- Add pr_number input to workflow_dispatch for fork PR eval support
- .npmrc already deleted in prior commit (no action needed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The CLI executable is in the vally-cli package, not the vally library package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lls)

Migrate 10 complex Jest integration test files to Vally eval configs:

Batch 3A (7 smaller complex skills):
- _template (4 stimuli, workspace setup)
- azure-hosted-copilot-sdk (8 stimuli, security regression checks, 20min timeout)
- azure-resource-visualizer (4 stimuli, file pattern assertions, 30min timeout)
- azure-enterprise-infra-planner (8 stimuli, file validation)
- appinsights-instrumentation (4 stimuli, resource copying)
- azure-cloud-migrate (2 stimuli, cloneRepo, 45min timeout)
- azure-resource-lookup (7 stimuli, keyword validation)

Batch 3B (3 large complex skills):
- azure-deploy (39 stimuli, custom deploy validators, 60min timeout)
- azure-validate (11 stimuli, sparse checkout, 45min timeout)
- azure-prepare (38 stimuli, multi-language scaffolding, 60min timeout)

Total: 125 stimuli across 10 eval configs.

Extended grader types: file-exists, file-matches, output-not-matches (security).
Custom helpers mapped: hasDeployLinks, expectFiles, matchesCommand, countSecretsInCode, matchesFileEdit, doesWorkspaceFileIncludePattern.
All prompts preserved verbatim from Jest test files.

Part of #1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…re-validate evals Fix lint errors: prompt was nested under input.prompt instead of being a top-level stimulus field. Corrected in all 88 stimuli across 3 files. No prompt text changed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
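The lint fix described here moves `prompt` from `input.prompt` to a top-level stimulus field. The sketch below models a stimulus as a plain dictionary; `hoist_prompt` is a hypothetical helper, the `name` field is invented for the example, and dropping an emptied `input` mapping is an assumption.

```python
import copy


def hoist_prompt(stimulus: dict) -> dict:
    """Return a copy of the stimulus with input.prompt moved to the top level."""
    fixed = copy.deepcopy(stimulus)
    nested = fixed.get("input")
    if isinstance(nested, dict) and "prompt" in nested and "prompt" not in fixed:
        fixed["prompt"] = nested.pop("prompt")  # prompt text itself is unchanged
        if not nested:
            # Assumption: an input mapping left empty can be removed.
            del fixed["input"]
    return fixed


before = {"name": "deploy-basic", "input": {"prompt": "Deploy my app to Azure"}}
after = hoist_prompt(before)
```

The helper copies the stimulus before editing it, so the original structure stays available for comparison and no prompt text is altered.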

The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix model identifier: claude-sonnet-4.5-20250514 -> claude-sonnet-4.5
- Fix legacy Waza grader type: regex -> output-not-matches
- Fix duration format: timeout_seconds: 2700 -> timeout: 45m
- Convert Waza config fields to Vally format (trials_per_task -> runs, etc.)
- Remove Waza-only top-level fields (skill, version, metrics)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
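The model identifier fix removes a trailing date pin. A minimal sketch of that normalization follows, assuming the pin is always a hyphen plus eight digits as in the identifiers shown in this PR; `strip_version_pin` is a hypothetical helper.

```python
import re

# Matches a trailing -YYYYMMDD style pin such as -20250514.
_DATE_SUFFIX = re.compile(r"-\d{8}$")


def strip_version_pin(model: str) -> str:
    """Drop a trailing eight-digit date suffix from a model identifier."""
    return _DATE_SUFFIX.sub("", model)
```

Identifiers without a date suffix, such as `claude-sonnet-4.5`, pass through unchanged.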

The Lint Eval Specs CI check failed because azure-deploy and azure-prepare eval specs were missing the required 'scoring' section, triggering scoring-defaults-applied warnings that are treated as errors in strict mode. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Convert azure-deploy, azure-prepare from Waza to Vally schema
- Change cost: free to cost: llm for copilot-sdk stimuli
- Add pending notes for workspace-dependent stimuli
- Fix template placeholder and README docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The @github/copilot-sdk checks COPILOT_GITHUB_TOKEN first (priority 1), before GH_TOKEN and GITHUB_TOKEN. Using GITHUB_TOKEN can conflict with the built-in Actions token. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
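The token lookup order described in this commit can be sketched as a first-match search over environment variables. This is not the SDK's code: `resolve_token` is a hypothetical helper, and the relative order of `GH_TOKEN` and `GITHUB_TOKEN` is assumed from the order in which the commit message lists them.

```python
from typing import Mapping, Optional, Tuple

# COPILOT_GITHUB_TOKEN is checked first per the commit message; the order of
# the remaining two names is an assumption.
TOKEN_PRIORITY = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def resolve_token(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (variable name, value) for the first non-empty token variable."""
    for name in TOKEN_PRIORITY:
        value = env.get(name)
        if value:
            return name, value
    return None
```

In practice `os.environ` would be passed as `env`. Setting `COPILOT_GITHUB_TOKEN` explicitly means the built-in Actions `GITHUB_TOKEN` is never consulted, which avoids the conflict the commit mentions.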

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-05-11T20:43:40Z

# 🔍 Token Analysis Report

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts compare --base origin/main --head HEAD --markdown

📊 Token Change Report

Comparing origin/main → HEAD

Summary

Metric	Value
📈 Total Change	+19 tokens (+1%)
Before	1,880 tokens
After	1,899 tokens
Files Changed	2

Changed Files

File	Before	After	Change
`evals/README.md`	415	432	+17 (+4%)
`tests/azure-prepare/eval/README.md`	1,465	1,467	+2 (0%)
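The change columns in the report above follow from the before and after counts. The sketch below reproduces that arithmetic; `token_change` is a hypothetical helper, and rounding the percentage to the nearest whole number is an assumption inferred from the reported values, not taken from the script itself.

```python
from typing import Tuple


def token_change(before: int, after: int) -> Tuple[int, int]:
    """Return (absolute change, percentage change rounded to a whole number)."""
    delta = after - before
    percent = round(100 * delta / before) if before else 0
    return delta, percent
```

With the reported counts, `token_change(1880, 1899)` yields `(19, 1)` and `token_change(415, 432)` yields `(17, 4)`, matching the table.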

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts check --markdown

📊 Token Limit Check Report

Checked: 645 files
Exceeded: 94 files

⚠️ Files Exceeding Token Limits

File	Tokens	Limit	Over By
`.github/skills/analyze-skill-issues/SKILL.md`	2109	500	+1609
`.github/skills/analyze-test-run/SKILL.md`	2471	500	+1971
`.github/skills/file-test-bug/SKILL.md`	628	500	+128
`.github/skills/sensei/README.md`	3531	2000	+1531
`.github/skills/sensei/SKILL.md`	3026	500	+2526
`.github/skills/sensei/references/EXAMPLES.md`	3701	2000	+1701
`.github/skills/sensei/references/LOOP.md`	4181	2000	+2181
`.github/skills/sensei/references/SCORING.md`	4299	2000	+2299
`.github/skills/skill-authoring/SKILL.md`	839	500	+339
`plugin/skills/airunway-aks-setup/SKILL.md`	1025	500	+525
`plugin/skills/appinsights-instrumentation/SKILL.md`	937	500	+437
`plugin/skills/azure-ai/SKILL.md`	820	500	+320
`plugin/skills/azure-aigateway/SKILL.md`	1261	500	+761
`plugin/skills/azure-aigateway/references/policies.md`	2342	2000	+342
`plugin/skills/azure-cloud-migrate/SKILL.md`	1085	500	+585
`plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md`	2029	2000	+29
`plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md`	2458	2000	+458
`plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md`	2587	2000	+587
`plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md`	3871	2000	+1871
`plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md`	2600	2000	+600
`plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md`	2181	2000	+181
`plugin/skills/azure-compliance/SKILL.md`	1188	500	+688
`plugin/skills/azure-compute/SKILL.md`	1370	500	+870
`plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md`	2344	2000	+344
`plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md`	2631	2000	+631
`plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md`	2509	2000	+509
`plugin/skills/azure-cost/SKILL.md`	1980	500	+1480
`plugin/skills/azure-deploy/SKILL.md`	1645	500	+1145
`plugin/skills/azure-deploy/references/pre-deploy-checklist.md`	4692	2000	+2692
`plugin/skills/azure-deploy/references/recipes/azd/errors.md`	4004	2000	+2004
`plugin/skills/azure-deploy/references/troubleshooting.md`	2038	2000	+38
`plugin/skills/azure-diagnostics/SKILL.md`	1423	500	+923
`plugin/skills/azure-enterprise-infra-planner/SKILL.md`	1002	500	+502
`plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md`	2022	2000	+22
`plugin/skills/azure-hosted-copilot-sdk/SKILL.md`	1332	500	+832
`plugin/skills/azure-kubernetes/SKILL.md`	2606	500	+2106
`plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md`	3609	500	+3109
`plugin/skills/azure-kusto/SKILL.md`	2152	500	+1652
`plugin/skills/azure-messaging/SKILL.md`	821	500	+321
`plugin/skills/azure-prepare/SKILL.md`	3375	500	+2875
`plugin/skills/azure-prepare/references/aspire.md`	4617	2000	+2617
`plugin/skills/azure-prepare/references/plan-template.md`	2617	2000	+617
`plugin/skills/azure-prepare/references/recipes/azd/aspire.md`	2275	2000	+275
`plugin/skills/azure-prepare/references/recipes/azd/terraform.md`	3555	2000	+1555
`plugin/skills/azure-prepare/references/research.md`	2274	2000	+274
`plugin/skills/azure-prepare/references/resources-limits-quotas.md`	3322	2000	+1322
`plugin/skills/azure-prepare/references/security.md`	2147	2000	+147
`plugin/skills/azure-prepare/references/services/functions/bicep.md`	3127	2000	+1127
`plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md`	2813	2000	+813
`plugin/skills/azure-prepare/references/services/functions/terraform.md`	3404	2000	+1404
`plugin/skills/azure-prepare/references/services/sql-database/bicep.md`	2037	2000	+37
`plugin/skills/azure-quotas/SKILL.md`	2821	500	+2321
`plugin/skills/azure-quotas/references/commands.md`	2644	2000	+644
`plugin/skills/azure-reliability/SKILL.md`	5659	500	+5159
`plugin/skills/azure-reliability/references/configure-multi-region.md`	4729	2000	+2729
`plugin/skills/azure-resource-lookup/SKILL.md`	1394	500	+894
`plugin/skills/azure-resource-visualizer/SKILL.md`	2122	500	+1622
`plugin/skills/azure-storage/SKILL.md`	1228	500	+728
`plugin/skills/azure-upgrade/SKILL.md`	1542	500	+1042
`plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md`	2724	2000	+724
`plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md`	2215	2000	+215
`plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md`	2411	2000	+411
`plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md`	2315	2000	+315
`plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md`	2190	2000	+190
`plugin/skills/azure-upgrade/references/services/functions/automation.md`	3463	2000	+1463
`plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md`	2773	2000	+773
`plugin/skills/azure-validate/SKILL.md`	950	500	+450
`plugin/skills/entra-agent-id/SKILL.md`	4001	500	+3501
`plugin/skills/entra-app-registration/SKILL.md`	2070	500	+1570
`plugin/skills/entra-app-registration/references/api-permissions.md`	2545	2000	+545
`plugin/skills/entra-app-registration/references/cli-commands.md`	2211	2000	+211
`plugin/skills/entra-app-registration/references/console-app-example.md`	2752	2000	+752
`plugin/skills/entra-app-registration/references/oauth-flows.md`	2375	2000	+375
`plugin/skills/microsoft-foundry/SKILL.md`	3955	500	+3455
`plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md`	4824	2000	+2824
`plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md`	6203	2000	+4203
`plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md`	2494	2000	+494
`plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md`	2088	2000	+88
`plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md`	4325	2000	+2325
`plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md`	3436	2000	+1436
`plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md`	3190	2000	+1190
`plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md`	3860	2000	+1860
`plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md`	2072	2000	+72
`plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md`	2701	2000	+701
`plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md`	1640	500	+1140
`plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md`	1739	500	+1239
`plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md`	2235	500	+1735
`plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md`	3335	2000	+1335
`plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md`	1226	500	+726
`plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md`	5534	2000	+3534
`plugin/skills/microsoft-foundry/quota/quota.md`	2288	2000	+288
`plugin/skills/microsoft-foundry/quota/references/capacity-planning.md`	2080	2000	+80
`plugin/skills/microsoft-foundry/references/agent-metadata-contract.md`	2373	2000	+373
`plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md`	2162	2000	+162

Consider moving content to references/ subdirectories.

Automated token analysis. See skill authoring guidelines for best practices.
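The Limit and Over By columns in the report follow a pattern that can be sketched in a few lines. The rule below is inferred from the table only (500 tokens for files named SKILL.md, 2000 for every other listed file); it is not taken from the token script, and `limit_for` and `over_by` are hypothetical names.

```python
def limit_for(path: str) -> int:
    """Token limit inferred from the report: 500 for SKILL.md, 2000 otherwise."""
    file_name = path.rsplit("/", 1)[-1]
    return 500 if file_name == "SKILL.md" else 2000


def over_by(path: str, tokens: int) -> int:
    """Number of tokens above the limit, or 0 when the file is within it."""
    return max(0, tokens - limit_for(path))
```

For example, `over_by(".github/skills/file-test-bug/SKILL.md", 628)` gives 128, the value shown in the table.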

Copilot

Pull request overview

Migrates a batch of complex skill integration tests from Jest into Vally eval specs under evals/**, with accompanying updates to legacy per-skill eval configs and the eval CI workflow.

Changes:

Adds new Vally evals/<skill>/eval.yaml specs for multiple skills (routing/behavior/file graders, including template scaffolding).
Updates legacy test-side eval configs under tests/**/eval(s)/ to newer config/scoring/grader shapes.
Modifies .github/workflows/eval.yml to lint evals/skills and run Vally in CI.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

File	Description
tests/package.json	Removes Vally-related npm scripts from the tests package.
tests/azure-prepare/eval/README.md	Updates local-running instructions for the azure-prepare eval suite.
tests/azure-prepare/eval/eval.yaml	Updates legacy azure-prepare eval config (runs/timeout/scoring/graders).
tests/azure-enterprise-infra-planner/evals/eval.yaml	Updates legacy eval config (runs/timeout/scoring/graders).
tests/azure-deploy/eval/eval.yaml	Updates legacy eval config (runs/timeout/scoring/graders).
evals/_template/eval.yaml	Adds a reference/template Vally eval spec for future migrations.
evals/appinsights-instrumentation/eval.yaml	Adds a new Vally eval spec migrated from Jest for appinsights instrumentation.
evals/azure-cloud-migrate/eval.yaml	Adds a new Vally eval spec migrated from Jest for cloud-migrate scenarios.
evals/azure-deploy/eval.yaml	Replaces/expands azure-deploy eval spec with a large Jest-migrated stimulus set.
evals/azure-enterprise-infra-planner/eval.yaml	Replaces/rewrites infra-planner eval spec migrated from Jest.
evals/azure-hosted-copilot-sdk/eval.yaml	Replaces/rewrites hosted-copilot-sdk eval spec migrated from Jest.
evals/azure-prepare/eval.yaml	Replaces/rewrites azure-prepare eval spec migrated from Jest with many stimuli and file graders.
evals/azure-resource-lookup/eval.yaml	Adds a new Vally eval spec migrated from Jest for resource-lookup prompts.
evals/azure-resource-visualizer/eval.yaml	Adds a new Vally eval spec migrated from Jest for resource-visualizer routing and file output.
evals/azure-validate/eval.yaml	Adds a new Vally eval spec migrated from Jest for validate scenarios (routing + file/output checks).
.github/workflows/eval.yml	Updates eval workflow triggers/inputs and switches CI execution behavior for Vally.

jongio

One workflow item not covered by the existing review.

- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Fix auth docs for local development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jongio

My setup-node downgrade comment is addressed - restored to v6.4.0. The new commit also adds a proper build step, smoke eval PR gate, and switches all configs to the evals/ directory instead of a single hard-coded spec. Looks good.

- eval.yml: wire suite input through to eval invocation (was unused)
- README: align Copilot SDK auth env var to COPILOT_GITHUB_TOKEN
- azure-resource-lookup: switch web|app|site grader to output-matches regex
- appinsights-instrumentation: replace file-matches with output-matches against the original Jest regex patterns (workspace seeding no longer available)
- azure-validate: replace AppHost.cs file-matches with output-contains for AzureWebJobsSecretStorageType guidance (no Aspire sample seeding in eval)
- azure-hosted-copilot-sdk: tighten 2 prompts to mention "Copilot SDK" explicitly so skill-invocation routing doesn't depend on missing workspace seeding

Refs #2236

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jongio

The migration looks faithful and the workflow improvements are solid - build step, smoke gate, suite wiring all make sense. One gap worth checking: the PR gate's --tier smoke filter vs what's actually tagged in the eval configs.

Resolves conflicts in .vally.yaml and 4 eval YAMLs by preserving both sides' intent: keeping batch3 grader/prompt fixes from dffe4b4 while incorporating new stimuli and config from main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Re-tags one cheap routing/invocation stimulus in each of the 10 eval files migrated in this PR with tier: smoke so the --suite smoke PR gate exercises real routing coverage for these skills (matching the convention already used by azure-ai, azure-aigateway, azure-diagnostics, azure-kusto, azure-storage, azure-upgrade, entra-app-registration, and microsoft-foundry on main). Addresses jongio's review feedback that the smoke gate had no matching stimuli for newly-migrated skills.

Refs #2236

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
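The gap raised in review, a smoke gate with no matching stimuli, is easy to check mechanically. The sketch below is hypothetical: it models each eval file as a list of tag dictionaries rather than parsing real eval.yaml files, and `skills_missing_smoke` is not part of Vally.

```python
from typing import Dict, List


def skills_missing_smoke(evals: Dict[str, List[dict]]) -> List[str]:
    """Return skills whose stimuli include no entry tagged tier: smoke."""
    return sorted(
        skill
        for skill, stimuli in evals.items()
        if not any(tags.get("tier") == "smoke" for tags in stimuli)
    )


# Tag values other than "smoke" are invented for the example.
sample = {
    "azure-deploy": [{"tier": "smoke"}, {"tier": "integration"}],
    "azure-validate": [{"tier": "integration"}],
}
missing = skills_missing_smoke(sample)
```

An empty result means every skill contributes at least one stimulus to the smoke suite; in the sample, only `azure-validate` is reported.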

wbreza and others added 30 commits May 11, 2026 13:40

fix: correct model identifiers and eval spec issues for clean local runs

351f0ac

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: correct vally-cli package name in CI and sync package-lock.json

db0544b

- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: use @microsoft/vally-cli for npx commands in CI workflow

ffa3a49

The CLI executable is in the vally-cli package, not the vally library package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: use relative paths (../../) for environment.skills

80ff30d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: apply validated Vally eval patterns

f1052c8

- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: align eval specs with merged Vally patterns from #1912

5393867

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza and others added 3 commits May 11, 2026 13:41

fix: use COPILOT_GITHUB_TOKEN secret (not COPILOT_CLI_TOKEN)

895030f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza requested review from Copilot and jongio May 11, 2026 20:42

Copilot started reviewing on behalf of wbreza May 11, 2026 20:43

Copilot AI reviewed May 11, 2026

jongio reviewed May 11, 2026

Comment thread .github/workflows/eval.yml Outdated

fix: apply review feedback patterns from #2235

ecc141d

- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Fix auth docs for local development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza requested review from JasonYeMSFT and jongio May 13, 2026 18:51

jongio previously approved these changes May 13, 2026

This was referenced May 14, 2026

[repo-status] Weekly Repo Status — May 8–14, 2026 #2269

Closed

[repo-status] Weekly Repo Status — May 9–15, 2026 #2280

Closed

wbreza dismissed jongio’s stale review via 36fcc44 May 15, 2026 22:30

wbreza force-pushed the feature/jest-batch3-migration branch from ecc141d to 36fcc44 May 15, 2026 22:30

wbreza force-pushed the feature/jest-batch3-migration branch from 36fcc44 to dffe4b4 May 15, 2026 22:35

wbreza requested a review from jongio May 15, 2026 22:40

jongio reviewed May 16, 2026

Comment thread .github/workflows/eval.yml Outdated

wbreza and others added 2 commits May 18, 2026 10:09

wbreza requested a review from jongio May 18, 2026 17:18

wbreza wants to merge 37 commits into main from feature/jest-batch3-migration

3 participants
