feat: migrate Batch 2 Jest integration tests to Vally (6 medium skills) #2237

Open
wbreza wants to merge 24 commits into main from feature/jest-batch2-migration

Conversation

@wbreza
Collaborator

@wbreza wbreza commented May 11, 2026

Summary

Migrates 6 Jest integration test suites to Vally eval configs (Batch 2), building on the foundation established in #1912.

Skills Migrated

| Skill | Stimuli | Key Graders |
|---|---|---|
| azure-compute | 14 | skill-invocation, completed, output-not-matches, tool-calls |
| azure-cost | 16 | skill-invocation, completed, output-matches, output-not-matches |
| azure-kubernetes | 15 | skill-invocation, completed, output-matches, output-not-matches |
| azure-messaging | 10 | skill-invocation, completed, output-contains, output-matches, output-not-matches |
| azure-quotas | 6 | skill-invocation, completed, output-contains, output-not-matches |
| azure-rbac | 6 | skill-invocation, completed, output-contains, output-not-matches |

Total: 67 stimuli across 6 skills

Changes

  • Adds 6 new evals/*/eval.yaml configs
  • Adds azure-hosted-copilot-sdk shared eval config
  • Updates CI workflow (eval.yml) for new eval paths
  • Cleans up legacy test configs (azure-deploy, azure-enterprise-infra-planner, azure-prepare)

Eval Patterns

All specs follow the validated patterns from #1912:

  • Model: claude-sonnet-4.5
  • Executor: copilot-sdk
  • Duration format: human-friendly ("10m", "7m")
  • Valid grader types only (per published @microsoft/vally-cli@0.4.0)

Dependencies

@github-actions
Contributor

github-actions Bot commented May 11, 2026

🔍 Token Analysis Report

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts compare --base origin/main --head HEAD --markdown

📊 Token Change Report

Comparing origin/main to HEAD

Summary

| Metric | Value |
|---|---|
| 📉 Total Change | -59 tokens (-3%) |
| Before | 1,880 tokens |
| After | 1,821 tokens |
| Files Changed | 2 |

Changed Files

| File | Before | After | Change |
|---|---|---|---|
| tests/azure-prepare/eval/README.md | 1,465 | 1,416 | -49 (-3%) |
| evals/README.md | 415 | 405 | -10 (-2%) |
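
The Change column can be reproduced from the Before and After counts. The sketch below is illustrative only: `tokenChange` is a hypothetical helper, and rounding to the nearest whole percent is an assumption that happens to match the figures in this report, not a statement about what `src/tokens/cli.ts` does.

```typescript
// Hypothetical helper; the real comparison logic in src/tokens/cli.ts may differ.
interface TokenChange {
  delta: number;   // after - before
  percent: number; // assumed: rounded to the nearest whole percent
}

function tokenChange(before: number, after: number): TokenChange {
  const delta = after - before;
  // Guard against division by zero for files that did not exist before.
  const percent = before === 0 ? 0 : Math.round((delta / before) * 100);
  return { delta, percent };
}

console.log(tokenChange(1880, 1821)); // totals row: delta -59, percent -3
console.log(tokenChange(415, 405));   // evals/README.md: delta -10, percent -2
```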

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts check --markdown

📊 Token Limit Check Report

Checked: 645 files
Exceeded: 94 files

⚠️ Files Exceeding Token Limits

| File | Tokens | Limit | Over By |
|---|---|---|---|
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/airunway-aks-setup/SKILL.md | 1025 | 500 | +525 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 937 | 500 | +437 |
| plugin/skills/azure-ai/SKILL.md | 820 | 500 | +320 |
| plugin/skills/azure-aigateway/SKILL.md | 1261 | 500 | +761 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 1085 | 500 | +585 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md | 2458 | 2000 | +458 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md | 2587 | 2000 | +587 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md | 3871 | 2000 | +1871 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1188 | 500 | +688 |
| plugin/skills/azure-compute/SKILL.md | 1370 | 500 | +870 |
| plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md | 2344 | 2000 | +344 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1980 | 500 | +1480 |
| plugin/skills/azure-deploy/SKILL.md | 1645 | 500 | +1145 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4692 | 2000 | +2692 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4004 | 2000 | +2004 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1423 | 500 | +923 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 1002 | 500 | +502 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1332 | 500 | +832 |
| plugin/skills/azure-kubernetes/SKILL.md | 2606 | 500 | +2106 |
| plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md | 3609 | 500 | +3109 |
| plugin/skills/azure-kusto/SKILL.md | 2152 | 500 | +1652 |
| plugin/skills/azure-messaging/SKILL.md | 821 | 500 | +321 |
| plugin/skills/azure-prepare/SKILL.md | 3375 | 500 | +2875 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2617 | 2000 | +617 |
| plugin/skills/azure-prepare/references/recipes/azd/aspire.md | 2275 | 2000 | +275 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3555 | 2000 | +1555 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3127 | 2000 | +1127 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 2813 | 2000 | +813 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3404 | 2000 | +1404 |
| plugin/skills/azure-prepare/references/services/sql-database/bicep.md | 2037 | 2000 | +37 |
| plugin/skills/azure-quotas/SKILL.md | 2821 | 500 | +2321 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-reliability/SKILL.md | 5659 | 500 | +5159 |
| plugin/skills/azure-reliability/references/configure-multi-region.md | 4729 | 2000 | +2729 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1394 | 500 | +894 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2122 | 500 | +1622 |
| plugin/skills/azure-storage/SKILL.md | 1228 | 500 | +728 |
| plugin/skills/azure-upgrade/SKILL.md | 1542 | 500 | +1042 |
| plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md | 2724 | 2000 | +724 |
| plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md | 2215 | 2000 | +215 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md | 2411 | 2000 | +411 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md | 2315 | 2000 | +315 |
| plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md | 2190 | 2000 | +190 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 950 | 500 | +450 |
| plugin/skills/entra-agent-id/SKILL.md | 4001 | 500 | +3501 |
| plugin/skills/entra-app-registration/SKILL.md | 2070 | 500 | +1570 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 3955 | 500 | +3455 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md | 4824 | 2000 | +2824 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 6203 | 2000 | +4203 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2494 | 2000 | +494 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | 2088 | 2000 | +88 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4325 | 2000 | +2325 |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | 3436 | 2000 | +1436 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 3190 | 2000 | +1190 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | 3860 | 2000 | +1860 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | 2072 | 2000 | +72 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | 2373 | 2000 | +373 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |

Consider moving content to references/ subdirectories.
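
The limits in the report above appear to follow a simple rule: 500 tokens for a file named SKILL.md and 2000 for any other Markdown file. The sketch below encodes that inferred rule; the constants and function names are hypothetical, and the actual check in the repo's token CLI may be configured differently.

```typescript
// Assumed limits, inferred from the report rows rather than from the CLI source.
const SKILL_LIMIT = 500;
const REFERENCE_LIMIT = 2000;

function tokenLimitFor(filePath: string): number {
  const fileName = filePath.split("/").pop() ?? "";
  return fileName === "SKILL.md" ? SKILL_LIMIT : REFERENCE_LIMIT;
}

// Returns how many tokens a file is over its limit, or 0 when it is within it.
function overBy(filePath: string, tokens: number): number {
  return Math.max(0, tokens - tokenLimitFor(filePath));
}

console.log(overBy("plugin/skills/azure-ai/SKILL.md", 820));                    // 320
console.log(overBy("plugin/skills/azure-quotas/references/commands.md", 2644)); // 644
```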


Automated token analysis. See skill authoring guidelines for best practices.

Contributor

Copilot AI left a comment

Pull request overview

This PR continues the repo’s Jest → Vally migration by adding Vally eval specs for 6 additional “medium” Azure skills and updating CI/auth wiring so those evals can run under the Vally framework.

Changes:

  • Added 6 new Vally eval specs under evals/**/eval.yaml for Batch 2 skills.
  • Updated the eval workflow to use COPILOT_GITHUB_TOKEN for Copilot SDK-backed execution.
  • Cleaned up/modernized some legacy per-skill eval configs under tests/** and removed now-unused tests/package.json eval scripts.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| tests/package.json | Removes Vally-related npm scripts from the tests package scripts list. |
| tests/azure-prepare/eval/README.md | Updates local-run instructions for running the azure-prepare eval via Vally CLI. |
| tests/azure-prepare/eval/eval.yaml | Updates legacy azure-prepare eval config fields (runs/timeout/model) and graders. |
| tests/azure-enterprise-infra-planner/evals/eval.yaml | Updates legacy eval config fields/model and graders to newer Vally-style keys. |
| tests/azure-deploy/eval/eval.yaml | Updates legacy eval config fields/model and graders to newer Vally-style keys. |
| evals/azure-rbac/eval.yaml | New Vally stimulus-based integration eval for azure-rbac. |
| evals/azure-quotas/eval.yaml | New Vally stimulus-based integration eval for azure-quotas. |
| evals/azure-messaging/eval.yaml | New Vally stimulus-based integration eval for azure-messaging. |
| evals/azure-kubernetes/eval.yaml | New Vally stimulus-based integration eval for azure-kubernetes. |
| evals/azure-cost/eval.yaml | New Vally stimulus-based integration eval for azure-cost. |
| evals/azure-compute/eval.yaml | New Vally stimulus-based integration eval for azure-compute. |
| evals/_template/eval.yaml | Adds a template file to standardize creation of future eval specs. |
| .github/workflows/eval.yml | Updates workflow auth env var to COPILOT_GITHUB_TOKEN for eval runs. |

Comment thread tests/azure-prepare/eval/README.md
Comment thread tests/azure-prepare/eval/eval.yaml
Comment thread evals/azure-compute/eval.yaml
Collaborator

@jongio jongio left a comment

Clean migration - 67 stimuli across 6 skills following the Batch 1 pattern from #1912. The eval configs are structurally sound, CI passes, and the known coverage gaps (tool-call arg matching) are documented in the comments. One pattern deviation worth noting inline.

Comment thread evals/azure-compute/eval.yaml
@wbreza wbreza requested review from JasonYeMSFT and jongio May 13, 2026 18:51
Collaborator

@jongio jongio left a comment

The new build steps look right, but the workflow now fails because nbgv (Nerdbank.GitVersioning) can't calculate version height from a shallow clone. See inline comment.

Comment thread .github/workflows/eval.yml
wbreza and others added 19 commits May 15, 2026 15:26
Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals.
Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches,
code->completed, behavior->constraints. Global graders duplicated per
stimulus as workaround for evaluate#125.
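
The grader mapping and the per-stimulus duplication can be pictured roughly as follows. The mapping values come from the commit message above; the type names, the `Stimulus` shape, and both helper functions are hypothetical and do not reflect the real Vally or Waza schemas.

```typescript
// Mapping from the commit message; all identifiers below are illustrative.
type WazaGrader = "regex" | "file" | "code" | "behavior";

const GRADER_MAP: Record<WazaGrader, string[]> = {
  regex: ["output-matches"],
  file: ["file-exists", "file-matches"],
  code: ["completed"],
  behavior: ["constraints"],
};

function toVallyGraders(wazaType: WazaGrader): string[] {
  return GRADER_MAP[wazaType];
}

// Hypothetical stimulus shape, reduced to what the example needs.
interface Stimulus {
  name: string;
  graders: string[];
}

// Copy the global graders into every stimulus, since they are duplicated per stimulus.
function withGlobalGraders(stimuli: Stimulus[], globals: string[]): Stimulus[] {
  return stimuli.map((s) => ({ ...s, graders: [...globals, ...s.graders] }));
}

const expanded = withGlobalGraders(
  [{ name: "example-stimulus", graders: toVallyGraders("regex") }],
  toVallyGraders("code"),
);
console.log(expanded[0].graders); // [ 'completed', 'output-matches' ]
```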

All prompts, regex patterns, and expected outputs preserved verbatim.
Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…executor

Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation.
Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4
(SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout.
Failures are grader calibration issues (brittle output-contains
substrings, file-exists for files agent doesn't write to disk),
not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill
definitions into the session. Without this, no skills are available
and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to repo root)

Skill paths in environment.skills are resolved relative to the
eval.yaml file location, not the repo root. Added ../../ prefix
to climb from evals/<skill>/ to the repo root.
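
A small sketch of why the ../../ prefix is needed, using Node's path module. The /repo root and the plugin/skills/azure-quotas target are assumptions chosen for illustration; the actual environment.skills entries are not shown in this thread.

```typescript
import * as path from "node:path";

// Resolve a skill path relative to the directory that contains eval.yaml,
// which is the behavior the commit message describes.
function resolveSkillPath(evalFile: string, skillPath: string): string {
  return path.posix.resolve(path.posix.dirname(evalFile), skillPath);
}

// Hypothetical checkout location.
const evalFile = "/repo/evals/azure-quotas/eval.yaml";

// Without the prefix, the path lands inside the evals directory.
console.log(resolveSkillPath(evalFile, "plugin/skills/azure-quotas"));
// /repo/evals/azure-quotas/plugin/skills/azure-quotas

// With ../../ the path climbs from evals/<skill>/ back to the repo root first.
console.log(resolveSkillPath(evalFile, "../../plugin/skills/azure-quotas"));
// /repo/plugin/skills/azure-quotas
```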

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The eval workflow was invoking npx @microsoft/vally-cli without any
npm auth setup, so npm fell back to the public registry and the package
(published to GitHub Packages) could not be resolved.

- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the
  @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev
  and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate
  token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for
  defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it
  explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from
  azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files
  (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics
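
To illustrate why a broad output-not-contains check on "error" is brittle, the sketch below compares a plain substring check with a narrower pattern. Both functions only imitate the grader semantics, and the FATAL_ERROR pattern is hypothetical; it is not the regex kept in the eval spec.

```typescript
// Substring check: fails whenever the forbidden word appears anywhere in the output.
function outputNotContains(output: string, forbidden: string): boolean {
  return !output.toLowerCase().includes(forbidden.toLowerCase());
}

// Pattern check: passes unless the output matches the given regex.
function outputNotMatches(output: string, pattern: RegExp): boolean {
  return !pattern.test(output);
}

// Hypothetical pattern, for illustration only.
const FATAL_ERROR = /\b(fatal error|unhandled exception)\b/i;

const healthy = "Deployment finished with no errors.";
console.log(outputNotContains(healthy, "error"));    // false: a healthy response is rejected
console.log(outputNotMatches(healthy, FATAL_ERROR)); // true: the narrower check passes
```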

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update ai-bench references in evals/README.md to microsoft/evaluate
  (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals
  locally — the @microsoft/vally-cli package from GitHub Packages is
  sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/vally

- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert bare numeric timeout values to human-friendly duration strings
required by Vally v0.4.0:
- azure-prepare: timeout: 600 -> "10m"
- azure-deploy: timeout: 420 -> "7m"
- azure-hosted-copilot-sdk: timeout: 600 -> "10m"
- azure-enterprise-infra-planner: timeout: 600 -> "10m"

No max_duration_ms fields found in current eval specs (P0-2 verified clean).
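
The conversion above is mechanical and could be scripted along these lines. `toDuration` is a hypothetical helper, not part of Vally; only the "m" form appears in this PR, so the "s" fallback for values that are not whole minutes is an assumption about the accepted format.

```typescript
// Convert a timeout in seconds to a duration string such as "10m".
// The "<n>s" fallback is an assumption; this PR only shows minute values.
function toDuration(seconds: number): string {
  if (!Number.isInteger(seconds) || seconds <= 0) {
    throw new RangeError(`expected a positive whole number of seconds, got ${seconds}`);
  }
  return seconds % 60 === 0 ? `${seconds / 60}m` : `${seconds}s`;
}

console.log(toDuration(600)); // 10m
console.log(toDuration(420)); // 7m
```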

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, P1-1, P1-3, P1-6, P1-7)

Add new graders to all eval specs:
- token-budget: caps token usage per eval session (P0-5)
- turn-count: enforces max conversational turns (P0-5)
- wall-time: enforces execution time limits (P0-5)
- error-count: zero-error tolerance policy (P1-6)
- skill-invocation: validates correct skill routing (P1-3)

Add scoring weights to all eval specs (P1-1):
- Security graders (output-not-matches): weight 2.0
- Skill-invocation: weight 2.0
- Metric-threshold graders: weight 0.5
- Standard graders: weight 0.5-1.0

Add consistent tags across all stimuli (P1-7):
- All stimuli now include skill tag for tag-based filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace invalid output-not-contains grader with output-not-matches in azure-hosted-copilot-sdk eval
- Add missing weights section to azure-enterprise-infra-planner eval
- Add set -e to suite determination step in CI workflow
- Standardize model identifier to claude-sonnet-4 in azure-enterprise-infra-planner
- Fix .gitignore comment to clarify .waza-cache/ is legacy pre-Vally
- Update .waza.yaml reference to .vally.yaml in copilot-instructions.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…names

- Remove token-budget, turn-count, wall-time, error-count graders (not in published v0.4.0)
- Revert output-not-contains → output-not-matches (output-not-contains is valid)
- Update scoring weights to match available grader types

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Convert YAML tag arrays to scalar strings in enterprise-infra-planner (7 instances)
- Restore output-not-contains graders for error/failed in hosted-copilot-sdk (5 stimuli)
- Switch npm install to --ignore-scripts in eval workflow (CI parity)
- Add pr_number input to workflow_dispatch for fork PR eval support
- .npmrc already deleted in prior commit (no action needed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate 6 medium-complexity Jest integration test files to Vally eval configs:
- azure-messaging (10 stimuli, 3x runs)
- azure-kubernetes (15 stimuli, 5x runs)
- azure-rbac (6 stimuli, 5x runs)
- azure-compute (7 stimuli, 5x runs, tool-calls grader)
- azure-quotas (6 stimuli, 5x runs)
- azure-cost (16 stimuli, 3x runs)

Total: 60 stimuli across 6 eval configs.
Extended grader types: output-contains, output-matches, tool-calls.
All prompts preserved verbatim from Jest test files.
Tags applied per UX design taxonomy.

Part of #1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill
definitions into the session. Without this, no skills are available
and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza and others added 4 commits May 15, 2026 15:29
- Correct model identifiers, grader types, tag format, and scoring weights
  per verified patterns from PR #1912

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…template

- Change cost: free to cost: llm for copilot-sdk stimuli
- Remove overly broad output-not-contains for 'error'/'failed'
- Remove low-signal tool-calls graders from azure-compute
- Add limitation comments for azure-quotas output-contains
- Add _template/eval.yaml with <skill-name> placeholder
- Fix README package name (@microsoft/vally -> @microsoft/vally-cli)
- Add nodejs_entry_point scope limitation comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Remove pr_number workflow dispatch input
- Fix tier: smoke on slow azure-enterprise-infra-planner stimulus
- Fix auth docs for local development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza force-pushed the feature/jest-batch2-migration branch from 809964b to c09d2e2 Compare May 15, 2026 22:30
wbreza added a commit to wbreza/GitHub-Copilot-for-Azure that referenced this pull request May 15, 2026
- Add fetch-depth: 0 to eval.yml checkout so nbgv can compute version height (jongio)
- Add scoring.weights to all 6 Batch 2 eval.yaml files matching Batch 1 pattern (jongio)
- Update azure-compute eval header comments to accurately reflect coverage (copilot-reviewer)
- Move nodejs_entry_point from global to per-task graders (TS/Node only) in azure-prepare (copilot-reviewer)
- Fix azure-prepare README: distinguish mock vs copilot-sdk executor instructions, use COPILOT_GITHUB_TOKEN, align package name to @microsoft/vally-cli (copilot-reviewer)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza requested a review from jongio May 15, 2026 22:36
Collaborator

@jongio jongio left a comment

Prior feedback addressed - the README, nbgv fetch-depth, and tool-calls gap notes all look good. One remaining consistency item: the Batch 1 eval (azure-hosted-copilot-sdk) defines scoring.weights to prioritize certain graders (e.g., skill-invocation: 2.0, output-not-matches: 2.0). All 6 Batch 2 evals omit this, so every grader carries equal weight. If that's intentional, no action needed - just flagging the deviation.

Comment thread evals/azure-compute/eval.yaml
Adds the Batch 1 scoring.weights block (skill-invocation: 2.0,
output-not-matches: 2.0, output-not-contains: 2.0, output-contains: 0.5)
to all 6 Batch 2 eval.yaml files. Previously omitted, causing all graders
to carry equal weight and diluting routing signal.
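
A rough sketch of how weighting changes the score when routing fails. The weights are the ones listed above; the weighted-mean aggregation, the default weight of 1.0 for unlisted graders, and all identifiers are assumptions, since this thread does not show how Vally combines grader results.

```typescript
// Weights from the commit message; unlisted graders default to 1.0 (an assumption).
const WEIGHTS: Record<string, number> = {
  "skill-invocation": 2.0,
  "output-not-matches": 2.0,
  "output-not-contains": 2.0,
  "output-contains": 0.5,
};

interface GraderResult {
  type: string;
  passed: boolean;
}

// Weighted pass ratio in [0, 1]. A weighted mean is assumed, not confirmed.
function weightedScore(
  results: GraderResult[],
  weights: Record<string, number> = WEIGHTS,
): number {
  let earned = 0;
  let total = 0;
  for (const r of results) {
    const w = weights[r.type] ?? 1.0;
    total += w;
    if (r.passed) earned += w;
  }
  return total === 0 ? 0 : earned / total;
}

// The skill was not invoked, but two lower-signal graders passed.
const run: GraderResult[] = [
  { type: "skill-invocation", passed: false },
  { type: "completed", passed: true },
  { type: "output-contains", passed: true },
];

console.log(weightedScore(run, {})); // equal weights: 2 / 3
console.log(weightedScore(run));     // listed weights: 1.5 / 3.5
```

Under these assumptions the routing failure pulls the score from about 0.67 down to about 0.43, which is the dilution effect the commit message refers to.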

Addresses review feedback from jongio on PR #2237.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza requested a review from jongio May 18, 2026 17:07