feat: migrate Batch 2 Jest integration tests to Vally (6 medium skills) by wbreza · Pull Request #2237 · microsoft/GitHub-Copilot-for-Azure

wbreza · 2026-05-11T20:44:00Z

Summary

Migrates 6 Jest integration test suites to Vally eval configs (Batch 2), building on the foundation established in #1912.

Skills Migrated

Skill	Stimuli	Key Graders
`azure-compute`	14	skill-invocation, completed, output-not-matches, tool-calls
`azure-cost`	16	skill-invocation, completed, output-matches, output-not-matches
`azure-kubernetes`	15	skill-invocation, completed, output-matches, output-not-matches
`azure-messaging`	10	skill-invocation, completed, output-contains, output-matches, output-not-matches
`azure-quotas`	6	skill-invocation, completed, output-contains, output-not-matches
`azure-rbac`	6	skill-invocation, completed, output-contains, output-not-matches

Total: 67 stimuli across 6 skills

Changes

- Adds 6 new evals/*/eval.yaml configs
- Adds azure-hosted-copilot-sdk shared eval config
- Updates CI workflow (eval.yml) for new eval paths
- Cleans up legacy test configs (azure-deploy, azure-enterprise-infra-planner, azure-prepare)

Eval Patterns

All specs follow the validated patterns from #1912:

- Model: claude-sonnet-4.5
- Executor: copilot-sdk
- Duration format: human-friendly ("10m", "7m")
- Valid grader types only (per published @microsoft/vally-cli@0.4.0)

Dependencies

Requires #1912, "ci(eval): migrate to Vally eval framework with v0.4.0 features" (merged ✅)

github-actions · 2026-05-11T20:45:17Z

# 🔍 Token Analysis Report

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts compare --base origin/main --head HEAD --markdown

📊 Token Change Report

Comparing origin/main → HEAD

Summary

Metric	Value
📉 Total Change	-59 tokens (-3%)
Before	1,880 tokens
After	1,821 tokens
Files Changed	2

Changed Files

File	Before	After	Change
`tests/azure-prepare/eval/README.md`	1,465	1,416	-49 (-3%)
`evals/README.md`	415	405	-10 (-2%)
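
The percentages in this report look like the token delta divided by the "Before" count, rounded to a whole percent. A minimal sketch of that arithmetic follows. It assumes plain rounding; the actual logic in `src/tokens/cli.ts` is not shown in this thread, and `percent_change` is a hypothetical helper.

```python
def percent_change(before: int, after: int) -> int:
    """Return the change from before to after as a whole-number percentage."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return round((after - before) / before * 100)


# (before, after) token counts copied from the report above.
rows = {
    "total": (1880, 1821),
    "tests/azure-prepare/eval/README.md": (1465, 1416),
    "evals/README.md": (415, 405),
}

for name, (before, after) in rows.items():
    delta = after - before
    print(f"{name}: {delta:+d} tokens ({percent_change(before, after):+d}%)")
```

The two per-file rows also sum to the totals (1,465 + 415 = 1,880 before; 1,416 + 405 = 1,821 after).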

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts check --markdown

📊 Token Limit Check Report

Checked: 645 files
Exceeded: 94 files

⚠️ Files Exceeding Token Limits

File	Tokens	Limit	Over By
`.github/skills/analyze-skill-issues/SKILL.md`	2109	500	+1609
`.github/skills/analyze-test-run/SKILL.md`	2471	500	+1971
`.github/skills/file-test-bug/SKILL.md`	628	500	+128
`.github/skills/sensei/README.md`	3531	2000	+1531
`.github/skills/sensei/SKILL.md`	3026	500	+2526
`.github/skills/sensei/references/EXAMPLES.md`	3701	2000	+1701
`.github/skills/sensei/references/LOOP.md`	4181	2000	+2181
`.github/skills/sensei/references/SCORING.md`	4299	2000	+2299
`.github/skills/skill-authoring/SKILL.md`	839	500	+339
`plugin/skills/airunway-aks-setup/SKILL.md`	1025	500	+525
`plugin/skills/appinsights-instrumentation/SKILL.md`	937	500	+437
`plugin/skills/azure-ai/SKILL.md`	820	500	+320
`plugin/skills/azure-aigateway/SKILL.md`	1261	500	+761
`plugin/skills/azure-aigateway/references/policies.md`	2342	2000	+342
`plugin/skills/azure-cloud-migrate/SKILL.md`	1085	500	+585
`plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md`	2029	2000	+29
`plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md`	2458	2000	+458
`plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md`	2587	2000	+587
`plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md`	3871	2000	+1871
`plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md`	2600	2000	+600
`plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md`	2181	2000	+181
`plugin/skills/azure-compliance/SKILL.md`	1188	500	+688
`plugin/skills/azure-compute/SKILL.md`	1370	500	+870
`plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md`	2344	2000	+344
`plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md`	2631	2000	+631
`plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md`	2509	2000	+509
`plugin/skills/azure-cost/SKILL.md`	1980	500	+1480
`plugin/skills/azure-deploy/SKILL.md`	1645	500	+1145
`plugin/skills/azure-deploy/references/pre-deploy-checklist.md`	4692	2000	+2692
`plugin/skills/azure-deploy/references/recipes/azd/errors.md`	4004	2000	+2004
`plugin/skills/azure-deploy/references/troubleshooting.md`	2038	2000	+38
`plugin/skills/azure-diagnostics/SKILL.md`	1423	500	+923
`plugin/skills/azure-enterprise-infra-planner/SKILL.md`	1002	500	+502
`plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md`	2022	2000	+22
`plugin/skills/azure-hosted-copilot-sdk/SKILL.md`	1332	500	+832
`plugin/skills/azure-kubernetes/SKILL.md`	2606	500	+2106
`plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md`	3609	500	+3109
`plugin/skills/azure-kusto/SKILL.md`	2152	500	+1652
`plugin/skills/azure-messaging/SKILL.md`	821	500	+321
`plugin/skills/azure-prepare/SKILL.md`	3375	500	+2875
`plugin/skills/azure-prepare/references/aspire.md`	4617	2000	+2617
`plugin/skills/azure-prepare/references/plan-template.md`	2617	2000	+617
`plugin/skills/azure-prepare/references/recipes/azd/aspire.md`	2275	2000	+275
`plugin/skills/azure-prepare/references/recipes/azd/terraform.md`	3555	2000	+1555
`plugin/skills/azure-prepare/references/research.md`	2274	2000	+274
`plugin/skills/azure-prepare/references/resources-limits-quotas.md`	3322	2000	+1322
`plugin/skills/azure-prepare/references/security.md`	2147	2000	+147
`plugin/skills/azure-prepare/references/services/functions/bicep.md`	3127	2000	+1127
`plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md`	2813	2000	+813
`plugin/skills/azure-prepare/references/services/functions/terraform.md`	3404	2000	+1404
`plugin/skills/azure-prepare/references/services/sql-database/bicep.md`	2037	2000	+37
`plugin/skills/azure-quotas/SKILL.md`	2821	500	+2321
`plugin/skills/azure-quotas/references/commands.md`	2644	2000	+644
`plugin/skills/azure-reliability/SKILL.md`	5659	500	+5159
`plugin/skills/azure-reliability/references/configure-multi-region.md`	4729	2000	+2729
`plugin/skills/azure-resource-lookup/SKILL.md`	1394	500	+894
`plugin/skills/azure-resource-visualizer/SKILL.md`	2122	500	+1622
`plugin/skills/azure-storage/SKILL.md`	1228	500	+728
`plugin/skills/azure-upgrade/SKILL.md`	1542	500	+1042
`plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md`	2724	2000	+724
`plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md`	2215	2000	+215
`plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md`	2411	2000	+411
`plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md`	2315	2000	+315
`plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md`	2190	2000	+190
`plugin/skills/azure-upgrade/references/services/functions/automation.md`	3463	2000	+1463
`plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md`	2773	2000	+773
`plugin/skills/azure-validate/SKILL.md`	950	500	+450
`plugin/skills/entra-agent-id/SKILL.md`	4001	500	+3501
`plugin/skills/entra-app-registration/SKILL.md`	2070	500	+1570
`plugin/skills/entra-app-registration/references/api-permissions.md`	2545	2000	+545
`plugin/skills/entra-app-registration/references/cli-commands.md`	2211	2000	+211
`plugin/skills/entra-app-registration/references/console-app-example.md`	2752	2000	+752
`plugin/skills/entra-app-registration/references/oauth-flows.md`	2375	2000	+375
`plugin/skills/microsoft-foundry/SKILL.md`	3955	500	+3455
`plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md`	4824	2000	+2824
`plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md`	6203	2000	+4203
`plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md`	2494	2000	+494
`plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md`	2088	2000	+88
`plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md`	4325	2000	+2325
`plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md`	3436	2000	+1436
`plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md`	3190	2000	+1190
`plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md`	3860	2000	+1860
`plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md`	2072	2000	+72
`plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md`	2701	2000	+701
`plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md`	1640	500	+1140
`plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md`	1739	500	+1239
`plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md`	2235	500	+1735
`plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md`	3335	2000	+1335
`plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md`	1226	500	+726
`plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md`	5534	2000	+3534
`plugin/skills/microsoft-foundry/quota/quota.md`	2288	2000	+288
`plugin/skills/microsoft-foundry/quota/references/capacity-planning.md`	2080	2000	+80
`plugin/skills/microsoft-foundry/references/agent-metadata-contract.md`	2373	2000	+373
`plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md`	2162	2000	+162

Consider moving content to references/ subdirectories.

Automated token analysis. See skill authoring guidelines for best practices.
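
In the table above, every `SKILL.md` row has a limit of 500 tokens, every other file has 2000, and "Over By" is tokens minus limit. A rough sketch of that rule follows. It is inferred from the table rather than taken from the checker's source, and `token_limit` and `over_by` are hypothetical names.

```python
from pathlib import PurePosixPath

SKILL_LIMIT = 500      # Limit seen on every SKILL.md row in the report.
DEFAULT_LIMIT = 2000   # Limit seen on every other row (README.md, references, workflows).


def token_limit(path: str) -> int:
    """Pick the limit by file name; assumption based on the reported rows."""
    return SKILL_LIMIT if PurePosixPath(path).name == "SKILL.md" else DEFAULT_LIMIT


def over_by(path: str, tokens: int) -> int:
    """Return how many tokens a file is over its limit (0 if within the limit)."""
    return max(0, tokens - token_limit(path))


# Rows copied from the report.
print(over_by(".github/skills/file-test-bug/SKILL.md", 628))
print(over_by("plugin/skills/azure-deploy/references/troubleshooting.md", 2038))
```

With the two rows shown, this reproduces the reported overages of +128 and +38.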

Copilot

Pull request overview

This PR continues the repo’s Jest → Vally migration by adding Vally eval specs for 6 additional “medium” Azure skills and updating CI/auth wiring so those evals can run under the Vally framework.

Changes:

- Added 6 new Vally eval specs under evals/**/eval.yaml for Batch 2 skills.
- Updated the eval workflow to use COPILOT_GITHUB_TOKEN for Copilot SDK-backed execution.
- Cleaned up/modernized some legacy per-skill eval configs under tests/** and removed now-unused tests/package.json eval scripts.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`tests/package.json`	Removes Vally-related npm scripts from the tests package scripts list.
`tests/azure-prepare/eval/README.md`	Updates local-run instructions for running the azure-prepare eval via Vally CLI.
`tests/azure-prepare/eval/eval.yaml`	Updates legacy azure-prepare eval config fields (runs/timeout/model) and graders.
`tests/azure-enterprise-infra-planner/evals/eval.yaml`	Updates legacy eval config fields/model and graders to newer Vally-style keys.
`tests/azure-deploy/eval/eval.yaml`	Updates legacy eval config fields/model and graders to newer Vally-style keys.
`evals/azure-rbac/eval.yaml`	New Vally stimulus-based integration eval for azure-rbac.
`evals/azure-quotas/eval.yaml`	New Vally stimulus-based integration eval for azure-quotas.
`evals/azure-messaging/eval.yaml`	New Vally stimulus-based integration eval for azure-messaging.
`evals/azure-kubernetes/eval.yaml`	New Vally stimulus-based integration eval for azure-kubernetes.
`evals/azure-cost/eval.yaml`	New Vally stimulus-based integration eval for azure-cost.
`evals/azure-compute/eval.yaml`	New Vally stimulus-based integration eval for azure-compute.
`evals/_template/eval.yaml`	Adds a template file to standardize creation of future eval specs.
`.github/workflows/eval.yml`	Updates workflow auth env var to `COPILOT_GITHUB_TOKEN` for eval runs.

jongio

Clean migration - 67 stimuli across 6 skills following the Batch 1 pattern from #1912. The eval configs are structurally sound, CI passes, and the known coverage gaps (tool-call arg matching) are documented in the comments. One pattern deviation worth noting inline.

jongio

The new build steps look right, but the workflow now fails because nbgv (Nerdbank.GitVersioning) can't calculate version height from a shallow clone. See inline comment.

Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:

- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals. Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches, code->completed, behavior->constraints. Global graders duplicated per stimulus as workaround for evaluate#125.

All prompts, regex patterns, and expected outputs preserved verbatim. Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…executor

Apply UX-designed tag taxonomy to all 30 stimuli:

- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation.

Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4 (SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout. Failures are grader calibration issues (brittle output-contains substrings, file-exists for files agent doesn't write to disk), not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…to repo root)

Skill paths in environment.skills are resolved relative to the eval.yaml file location, not the repo root. Added ../../ prefix to climb from evals/<skill>/ to the repo root.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
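
The effect of the ../../ prefix can be illustrated with plain path arithmetic. This is only a sketch of the resolution rule described in the commit message; the example entries and the `resolve_skill_path` helper are hypothetical, and Vally's own resolver is not shown in this thread.

```python
import posixpath


def resolve_skill_path(eval_file: str, skill_entry: str) -> str:
    """Resolve an environment.skills entry relative to the eval.yaml that declares it."""
    eval_dir = posixpath.dirname(eval_file)
    return posixpath.normpath(posixpath.join(eval_dir, skill_entry))


# Hypothetical entries for illustration; the real values live in each eval.yaml.
with_prefix = resolve_skill_path(
    "evals/azure-quotas/eval.yaml", "../../plugin/skills/azure-quotas"
)
without_prefix = resolve_skill_path(
    "evals/azure-quotas/eval.yaml", "plugin/skills/azure-quotas"
)

print(with_prefix)     # plugin/skills/azure-quotas
print(without_prefix)  # evals/azure-quotas/plugin/skills/azure-quotas
```

Without the prefix, the entry resolves to a directory under evals/<skill>/ that does not exist, which is consistent with the skill-invocation failures described earlier.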

The eval workflow was invoking npx @microsoft/vally-cli without any npm auth setup, so npm fell back to the public registry and the package (published to GitHub Packages) could not be resolved.

- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Workflow hardening:

- Drop pull_request trigger (keep workflow_dispatch only) to eliminate token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for defense-in-depth

Package hygiene:

- Remove @microsoft/vally-cli from devDependencies (CI installs it explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:

- Remove 13 broad output-not-contains "error"/"failed" graders from azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:

- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Update ai-bench references in evals/README.md to microsoft/evaluate (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals locally; the @microsoft/vally-cli package from GitHub Packages is sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…/vally

- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Convert bare numeric timeout values to human-friendly duration strings required by Vally v0.4.0:

- azure-prepare: timeout: 600 -> "10m"
- azure-deploy: timeout: 420 -> "7m"
- azure-hosted-copilot-sdk: timeout: 600 -> "10m"
- azure-enterprise-infra-planner: timeout: 600 -> "10m"

No max_duration_ms fields found in current eval specs (P0-2 verified clean).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
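
The timeout conversion above is a seconds-to-minutes rewrite. A small sketch follows, assuming the old bare numbers were seconds (which matches 600 -> "10m" and 420 -> "7m"). `to_duration` is a hypothetical helper that only handles whole minutes, because no other duration unit appears in this thread.

```python
def to_duration(seconds: int) -> str:
    """Convert a whole-minute timeout in seconds to a duration string such as "10m"."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    minutes, remainder = divmod(seconds, 60)
    if remainder != 0:
        # Other units are not documented here, so refuse rather than guess a format.
        raise ValueError(f"{seconds}s is not a whole number of minutes")
    return f"{minutes}m"


# Old numeric values listed in the commit message above.
legacy_timeouts = {
    "azure-prepare": 600,
    "azure-deploy": 420,
    "azure-hosted-copilot-sdk": 600,
    "azure-enterprise-infra-planner": 600,
}

for skill, seconds in legacy_timeouts.items():
    print(f'{skill}: timeout: {seconds} -> "{to_duration(seconds)}"')
```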

…, P1-1, P1-3, P1-6, P1-7)

Add new graders to all eval specs:

- token-budget: caps token usage per eval session (P0-5)
- turn-count: enforces max conversational turns (P0-5)
- wall-time: enforces execution time limits (P0-5)
- error-count: zero-error tolerance policy (P1-6)
- skill-invocation: validates correct skill routing (P1-3)

Add scoring weights to all eval specs (P1-1):

- Security graders (output-not-matches): weight 2.0
- Skill-invocation: weight 2.0
- Metric-threshold graders: weight 0.5
- Standard graders: weight 0.5-1.0

Add consistent tags across all stimuli (P1-7):

- All stimuli now include skill tag for tag-based filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Replace invalid output-not-contains grader with output-not-matches in azure-hosted-copilot-sdk eval
- Add missing weights section to azure-enterprise-infra-planner eval
- Add set -e to suite determination step in CI workflow
- Standardize model identifier to claude-sonnet-4 in azure-enterprise-infra-planner
- Fix .gitignore comment to clarify .waza-cache/ is legacy pre-Vally
- Update .waza.yaml reference to .vally.yaml in copilot-instructions.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…names

- Remove token-budget, turn-count, wall-time, error-count graders (not in published v0.4.0)
- Revert the output-not-contains → output-not-matches replacement (output-not-contains is valid)
- Update scoring weights to match available grader types

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Convert YAML tag arrays to scalar strings in enterprise-infra-planner (7 instances)
- Restore output-not-contains graders for error/failed in hosted-copilot-sdk (5 stimuli)
- Switch npm install to --ignore-scripts in eval workflow (CI parity)
- Add pr_number input to workflow_dispatch for fork PR eval support
- .npmrc already deleted in prior commit (no action needed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Migrate 6 medium-complexity Jest integration test files to Vally eval configs:

- azure-messaging (10 stimuli, 3x runs)
- azure-kubernetes (15 stimuli, 5x runs)
- azure-rbac (6 stimuli, 5x runs)
- azure-compute (7 stimuli, 5x runs, tool-calls grader)
- azure-quotas (6 stimuli, 5x runs)
- azure-cost (16 stimuli, 3x runs)

Total: 60 stimuli across 6 eval configs. Extended grader types: output-contains, output-matches, tool-calls. All prompts preserved verbatim from Jest test files. Tags applied per UX design taxonomy.

Part of #1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The copilot-sdk executor requires environment.skills to load skill definitions into the session. Without this, no skills are available and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…template

- Change cost: free to cost: llm for copilot-sdk stimuli
- Remove overly broad output-not-contains for 'error'/'failed'
- Remove low-signal tool-calls graders from azure-compute
- Add limitation comments for azure-quotas output-contains
- Add _template/eval.yaml with <skill-name> placeholder
- Fix README package name (@microsoft/vally -> @microsoft/vally-cli)
- Add nodejs_entry_point scope limitation comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Remove pr_number workflow dispatch input
- Fix tier: smoke on slow azure-enterprise-infra-planner stimulus
- Fix auth docs for local development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add fetch-depth: 0 to eval.yml checkout so nbgv can compute version height (jongio)
- Add scoring.weights to all 6 Batch 2 eval.yaml files matching Batch 1 pattern (jongio)
- Update azure-compute eval header comments to accurately reflect coverage (copilot-reviewer)
- Move nodejs_entry_point from global to per-task graders (TS/Node only) in azure-prepare (copilot-reviewer)
- Fix azure-prepare README: distinguish mock vs copilot-sdk executor instructions, use COPILOT_GITHUB_TOKEN, align package name to @microsoft/vally-cli (copilot-reviewer)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jongio

Prior feedback addressed - the README, nbgv fetch-depth, and tool-calls gap notes all look good. One remaining consistency item: the Batch 1 eval (azure-hosted-copilot-sdk) defines scoring.weights to prioritize certain graders (e.g., skill-invocation: 2.0, output-not-matches: 2.0). All 6 Batch 2 evals omit this, so every grader carries equal weight. If that's intentional, no action needed - just flagging the deviation.

Adds the Batch 1 scoring.weights block (skill-invocation: 2.0, output-not-matches: 2.0, output-not-contains: 2.0, output-contains: 0.5) to all 6 Batch 2 eval.yaml files. Previously omitted, causing all graders to carry equal weight and diluting routing signal.

Addresses review feedback from jongio on PR #2237.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
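
To see why equal weights dilute the routing signal, here is a rough sketch of a weighted pass rate using the weights quoted above. It assumes the score is a weighted average of pass/fail results and that unlisted graders default to 1.0. Neither assumption is confirmed in this thread, and `weighted_score` is a hypothetical helper, not Vally's scoring code.

```python
from typing import Mapping, Optional

# Weights from the scoring.weights block described above.
WEIGHTS = {
    "skill-invocation": 2.0,
    "output-not-matches": 2.0,
    "output-not-contains": 2.0,
    "output-contains": 0.5,
}
DEFAULT_WEIGHT = 1.0  # Assumption: graders without an explicit weight count as 1.0.


def weighted_score(
    results: Mapping[str, bool], weights: Optional[Mapping[str, float]] = None
) -> float:
    """Return the weighted fraction of passing graders (assumed aggregation)."""
    table = WEIGHTS if weights is None else weights
    total = sum(table.get(name, DEFAULT_WEIGHT) for name in results)
    if total == 0:
        raise ValueError("no graders with non-zero weight")
    earned = sum(table.get(name, DEFAULT_WEIGHT) for name, ok in results.items() if ok)
    return earned / total


# A run where the wrong skill was routed but the other graders passed.
results = {"skill-invocation": False, "completed": True, "output-contains": True}

print(f"weighted: {weighted_score(results):.2f}")      # weighted: 0.43
print(f"equal:    {weighted_score(results, {}):.2f}")  # equal:    0.67
```

Under these assumptions, a routing failure pulls the score down to 0.43 with weights, versus 0.67 when every grader counts equally.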

wbreza requested review from Copilot and jongio May 11, 2026 20:44

Copilot started reviewing on behalf of wbreza May 11, 2026 20:45

Copilot AI reviewed May 11, 2026

Comment thread tests/azure-prepare/eval/README.md

Comment thread tests/azure-prepare/eval/eval.yaml

Comment thread evals/azure-compute/eval.yaml

jongio reviewed May 11, 2026

Comment thread evals/azure-compute/eval.yaml

wbreza requested review from JasonYeMSFT and jongio May 13, 2026 18:51

jongio reviewed May 13, 2026

Comment thread .github/workflows/eval.yml

This was referenced May 14, 2026

[repo-status] Weekly Repo Status — May 8–14, 2026 #2269

Closed

[repo-status] Weekly Repo Status — May 9–15, 2026 #2280

Closed

wbreza and others added 19 commits May 15, 2026 15:26

fix: correct model identifiers and eval spec issues for clean local runs

7b5ee43

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: correct vally-cli package name in CI and sync package-lock.json

a08eb29

- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: use relative paths (../../) for environment.skills

a2a9835

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza and others added 4 commits May 15, 2026 15:29

fix: apply validated Vally eval patterns

83a4c45

- Correct model identifiers, grader types, tag format, and scoring weights per verified patterns from PR #1912

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: align eval specs with merged Vally patterns from #1912

8017fb7

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

wbreza force-pushed the feature/jest-batch2-migration branch from 809964b to c09d2e2 Compare May 15, 2026 22:30

wbreza requested a review from jongio May 15, 2026 22:36

jongio reviewed May 16, 2026

Comment thread evals/azure-compute/eval.yaml

wbreza requested a review from jongio May 18, 2026 17:07

wbreza wants to merge 24 commits into main from feature/jest-batch2-migration

3 participants
