feat: migrate Batch 3 Jest integration tests to Vally (10 complex skills) #2236

Open: wbreza wants to merge 37 commits into main from feature/jest-batch3-migration

wbreza (Collaborator) commented May 11, 2026

Summary

Migrates 10 Jest integration test suites to Vally eval configs (Batch 3), building on the foundation established in #1912. The batch includes the _template reference spec, complex multi-section eval specs, and skills with file-based graders.

Skills Migrated

| Skill | Stimuli | Key Graders |
| --- | --- | --- |
| _template | 5 | skill-invocation, completed, output-contains, output-not-matches, tool-calls |
| appinsights-instrumentation | 4 | skill-invocation, completed, file-matches, output-contains, output-not-matches |
| azure-cloud-migrate | 2 | skill-invocation, completed, file-exists, output-not-matches |
| azure-deploy | 3 | skill-invocation, completed, file-exists, file-matches, output-matches, regex |
| azure-enterprise-infra-planner | 8 | skill-invocation, completed, file-exists, output-not-matches |
| azure-hosted-copilot-sdk | 8 | skill-invocation, completed, output-contains, output-not-matches |
| azure-prepare | 3 | skill-invocation, completed, file-exists, file-matches, output-matches, regex |
| azure-resource-lookup | 7 | skill-invocation, completed, output-contains, output-not-matches |
| azure-resource-visualizer | 4 | skill-invocation, completed, file-matches, output-not-matches |
| azure-validate | 12 | skill-invocation, completed, file-matches, output-matches, output-not-matches |

Total: 56 stimuli across 10 skills

Changes

  • Adds/updates 10 evals/*/eval.yaml configs (including _template reference spec)
  • Uses advanced grader types: file-exists, file-matches, regex
  • Updates CI workflow (eval.yml) for new eval paths
  • Cleans up legacy test configs (azure-deploy, azure-enterprise-infra-planner, azure-prepare)

Eval Patterns

All specs follow the validated patterns from #1912:

  • Model: claude-sonnet-4.5
  • Executor: copilot-sdk
  • Duration format: human-friendly ("10m", "7m")
  • Valid grader types only (per published @microsoft/vally-cli@0.4.0)

Dependencies

wbreza and others added 30 commits May 11, 2026 13:40
Migrate all 30 Waza eval tasks across 4 suites to Vally eval.yaml format:
- azure-hosted-copilot-sdk (6 stimuli)
- azure-deploy (2 stimuli)
- azure-enterprise-infra-planner (12 stimuli)
- azure-prepare (10 stimuli)

Add .vally.yaml project config with paths for skills and evals.
Add evals/_base/common-graders.yaml as shared grader reference.

Grader mappings: regex->output-matches, file->file-exists/file-matches,
code->completed, behavior->constraints. Global graders duplicated per
stimulus as workaround for evaluate#125.
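
The grader mapping above can be sketched as a small lookup table. This is an illustrative Python sketch, not the script used for the migration: the names `WAZA_TO_VALLY` and `map_grader` and the `has_pattern` flag are hypothetical, and the rule for choosing between file-exists and file-matches is an assumption.

```python
# Hypothetical lookup table for the Waza -> Vally mapping listed in the commit message.
WAZA_TO_VALLY = {
    "regex": ["output-matches"],
    "file": ["file-exists", "file-matches"],
    "code": ["completed"],
    "behavior": ["constraints"],
}


def map_grader(waza_type: str, has_pattern: bool = False) -> str:
    """Return the Vally target for a Waza grader type.

    Assumption: a Waza "file" grader maps to file-matches when it carries a
    content pattern and to file-exists otherwise.
    """
    candidates = WAZA_TO_VALLY.get(waza_type)
    if candidates is None:
        raise ValueError(f"no Vally equivalent known for {waza_type!r}")
    if waza_type == "file":
        return "file-matches" if has_pattern else "file-exists"
    return candidates[0]
```

Raising on an unknown type, rather than passing it through, is one way to keep a legacy grader name from slipping into a migrated spec unnoticed.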

All prompts, regex patterns, and expected outputs preserved verbatim.
Zero test case coverage loss.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace azd waza run with npx @microsoft/vally-cli eval.
Add setup-node with GitHub Packages registry for @microsoft/vally-cli.
Add packages:read permission for GitHub Packages auth.
Preserve trigger paths, artifact upload, and retention settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…executor

Apply UX-designed tag taxonomy to all 30 stimuli:
- eval-level tags: type + skill
- stimulus-level tags: type, tier, cost, area
- Fix cost values: low -> free (mock executor, no LLM cost)

Add 5 named suites to .vally.yaml: smoke, pr, triggers, integration, full

Switch executor from mock to copilot-sdk for real agent evaluation.
Fix model names: claude-sonnet-4-20250514 -> claude-sonnet-4
(SDK rejects version-pinned model names)

Live eval run results: 8/30 pass, 4 flaky, 17 fail, 1 timeout.
Failures are grader calibration issues (brittle output-contains
substrings, file-exists for files agent doesn't write to disk),
not migration bugs. Grader tuning tracked as follow-up work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill
definitions into the session. Without this, no skills are available
and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to repo root)

Skill paths in environment.skills are resolved relative to the
eval.yaml file location, not the repo root. Added ../../ prefix
to climb from evals/<skill>/ to the repo root.
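
The effect of the ../../ prefix can be illustrated with plain path arithmetic. The sketch below is a Python illustration of relative resolution in general, not Vally's resolver; the helper name `resolve_skill_path` and the `plugin/skills/azure-deploy` location are assumptions used only for the example.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_skill_path(eval_file: str, skill_ref: str) -> str:
    """Resolve a skill reference against the directory that holds eval.yaml."""
    base = PurePosixPath(eval_file).parent
    return posixpath.normpath(str(base / skill_ref))


# Without the prefix the reference stays under evals/<skill>/.
print(resolve_skill_path("evals/azure-deploy/eval.yaml", "plugin/skills/azure-deploy"))
# With ../../ it climbs two levels and lands at the repo root.
print(resolve_skill_path("evals/azure-deploy/eval.yaml", "../../plugin/skills/azure-deploy"))
```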

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The eval workflow was invoking npx @microsoft/vally-cli without any
npm auth setup, so npm fell back to the public registry and the package
(published to GitHub Packages) could not be resolved.

- Add .npmrc mapping @microsoft scope to npm.pkg.github.com
- Add scope: '@microsoft' to setup-node so NODE_AUTH_TOKEN is applied
- Add an npm install --no-save step (with NODE_AUTH_TOKEN) so the
  @microsoft/vally-cli devDependency is resolved via authenticated fetch
- Declare @microsoft/vally-cli in devDependencies (latest) so local dev
  and CI both resolve it through a single config path

This mirrors the working setup in wbreza/skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default GITHUB_TOKEN lacks read:packages access to the microsoft
org's private @microsoft/vally-cli package, yielding 403s. Switch to
a dedicated VALLY_NPM_TOKEN repo secret (PAT with read:packages,
SSO-authorized for the microsoft org).

Note: fork-originated pull_request runs do not receive secrets, so
fork PRs will still fail auth until the package is made public or the
trigger is reworked. Internal branches / workflow_dispatch / merges
will resolve correctly once the secret is provisioned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow maintainers to manually invoke the eval workflow from the Actions
UI. This is needed to bypass the fork-PR secrets restriction: pull_request
workflows triggered from a fork cannot access repository secrets, so the
@microsoft/vally-cli install fails. Manual workflow_dispatch runs execute
in the base repo context where secrets are available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor in @microsoft/vally-cli reads GITHUB_TOKEN
to create a Copilot session. The default Actions GITHUB_TOKEN doesn't
have Copilot API scope, causing "Session was not created with
authentication info or custom provider" at eval execution time.

Reuse the existing repo secret COPILOT_CLI_TOKEN (a Copilot-enabled
PAT) and expose it as GITHUB_TOKEN to the eval run step.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workflow hardening:
- Drop pull_request trigger (keep workflow_dispatch only) to eliminate
  token exfiltration vector from untrusted PR code
- Add top-level permissions block (contents/packages: read) for
  defense-in-depth

Package hygiene:
- Remove @microsoft/vally-cli from devDependencies (CI installs it
  explicitly via GitHub Packages); lockfile regenerated in sync
- Remove unused root yaml dependency

Eval spec cleanup:
- Remove 13 broad output-not-contains "error"/"failed" graders from
  azure-hosted-copilot-sdk/eval.yaml (kept specific fatal-error regex)
- Add azure-prepare, azure-validate, azure-deploy to environment.skills
- Remove cost:free tag from all LLM-backed stimuli across 4 eval files
  (reserved now for non-LLM static evals)
- Align .vally.yaml suite descriptions with accurate tag semantics

Cleanup:
- Delete stale Waza task files in azure-hosted-copilot-sdk/tasks/
- Add evals/README.md with local vally-cli run instructions
- Gitignore local results/ output directory

Follow-up issue #1920 tracks wiring CI to a curated medium suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update ai-bench references in evals/README.md to microsoft/evaluate
  (the actual upstream Vally repo name)
- Add https://aka.ms/vally as the canonical docs link
- Clarify that contributors don't need source-repo access to run evals
  locally — the @microsoft/vally-cli package from GitHub Packages is
  sufficient

Addresses JasonYeMSFT's review question on evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/vally

- Remove .npmrc private registry config (GitHub Packages no longer needed)
- Update eval workflow to use @microsoft/vally from public npm
- Remove VALLY_NPM_TOKEN secret requirement from CI
- Update evals/README.md with public npm installation instructions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert bare numeric timeout values to human-friendly duration strings
required by Vally v0.4.0:
- azure-prepare: timeout: 600 -> "10m"
- azure-deploy: timeout: 420 -> "7m"
- azure-hosted-copilot-sdk: timeout: 600 -> "10m"
- azure-enterprise-infra-planner: timeout: 600 -> "10m"

No max_duration_ms fields found in current eval specs (P0-2 verified clean).
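
The conversion itself is simple arithmetic. A minimal Python sketch follows, assuming whole-minute values are written with an `m` suffix as in the list above; the helper name `to_duration` is hypothetical, and the `s` fallback for values that are not whole minutes is an assumption, since only minute values appear in this PR.

```python
def to_duration(seconds: int) -> str:
    """Convert a bare numeric timeout in seconds to a duration string."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    if seconds % 60 == 0:
        return f"{seconds // 60}m"
    # Assumption: a seconds suffix is accepted for non-minute values.
    return f"{seconds}s"


for legacy in (600, 420):
    print(legacy, "->", to_duration(legacy))
```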

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…4, P1-5)

- Add vally lint job that runs before eval (P0-4)
- Add --junit flag for JUnit XML output (P0-3)
- Use --suite flag with dynamic suite selection (P1-2):
  - PR events run 'triggers' suite
  - Scheduled runs use 'full' suite
  - Manual dispatch allows suite selection
- Leverage .vally.yaml project config via --suite (P1-4)
- Add --threshold 0.8 to fail CI below score threshold (P1-5)
- Add PR trigger for skills/evals changes
- Add nightly schedule for full regression
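
The suite selection rules above can be sketched as one function. This Python sketch mirrors the described behavior only and is not the workflow's shell step; `select_suite` is a hypothetical name, the event names are the standard GitHub Actions ones, the suite names come from the five suites added to .vally.yaml, and the `full` default for a manual run without input is an assumption.

```python
from typing import Optional

# Suite names listed for .vally.yaml earlier in this PR.
KNOWN_SUITES = {"smoke", "pr", "triggers", "integration", "full"}


def select_suite(event_name: str, requested: Optional[str] = None) -> str:
    """Pick the eval suite for a workflow run based on its trigger."""
    if event_name == "pull_request":
        return "triggers"
    if event_name == "schedule":
        return "full"
    if event_name == "workflow_dispatch":
        suite = requested or "full"  # assumption: default when no input is given
        if suite not in KNOWN_SUITES:
            raise ValueError(f"unknown suite {suite!r}")
        return suite
    raise ValueError(f"unsupported event {event_name!r}")
```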

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, P1-1, P1-3, P1-6, P1-7)

Add new graders to all eval specs:
- token-budget: caps token usage per eval session (P0-5)
- turn-count: enforces max conversational turns (P0-5)
- wall-time: enforces execution time limits (P0-5)
- error-count: zero-error tolerance policy (P1-6)
- skill-invocation: validates correct skill routing (P1-3)

Add scoring weights to all eval specs (P1-1):
- Security graders (output-not-matches): weight 2.0
- Skill-invocation: weight 2.0
- Metric-threshold graders: weight 0.5
- Standard graders: weight 0.5-1.0
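
To see how these weights interact with the --threshold 0.8 gate, consider a weighted pass ratio. The Python sketch below assumes the score is the weight of passing graders divided by the total weight; how Vally actually aggregates weights is not stated in this PR, so treat the formula and the name `weighted_score` as assumptions.

```python
from typing import List, Tuple


def weighted_score(results: List[Tuple[bool, float]]) -> float:
    """Return passed weight divided by total weight for (passed, weight) pairs."""
    total = sum(weight for _, weight in results)
    if total == 0:
        raise ValueError("at least one grader needs a non-zero weight")
    return sum(weight for passed, weight in results if passed) / total


THRESHOLD = 0.8

# One failing metric grader (weight 0.5) still clears the threshold.
metric_fail = [(True, 2.0), (True, 2.0), (False, 0.5), (True, 1.0)]
# One failing security grader (weight 2.0) does not.
security_fail = [(False, 2.0), (True, 2.0), (True, 0.5), (True, 1.0)]

print(weighted_score(metric_fail) >= THRESHOLD)
print(weighted_score(security_fail) >= THRESHOLD)
```

Under this assumption, the 2.0 weight is what makes a single security or routing failure enough to fail the run, while a low-weight metric failure is tolerated.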

Add consistent tags across all stimuli (P1-7):
- All stimuli now include skill tag for tag-based filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace invalid output-not-contains grader with output-not-matches in azure-hosted-copilot-sdk eval
- Add missing weights section to azure-enterprise-infra-planner eval
- Add set -e to suite determination step in CI workflow
- Standardize model identifier to claude-sonnet-4 in azure-enterprise-infra-planner
- Fix .gitignore comment to clarify .waza-cache/ is legacy pre-Vally
- Update .waza.yaml reference to .vally.yaml in copilot-instructions.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Delete tests/scripts/run-waza.js (replaced by direct vally CLI usage)
- Delete .waza.yaml (replaced by .vally.yaml)
- Remove waza and waza:live npm scripts from tests/package.json
- Update tests/README.md: replace Waza Eval Mode section with Vally
- Update tests/azure-prepare/eval/README.md: replace waza references with vally
- Update eval.yaml comment headers in azure-enterprise-infra-planner and azure-prepare

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…names

- Remove token-budget, turn-count, wall-time, error-count graders (not in published v0.4.0)
- Revert the earlier output-not-contains → output-not-matches replacement (output-not-contains is valid)
- Update scoring weights to match available grader types

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Convert YAML tag arrays to scalar strings in enterprise-infra-planner (7 instances)
- Restore output-not-contains graders for error/failed in hosted-copilot-sdk (5 stimuli)
- Switch npm install to --ignore-scripts in eval workflow (CI parity)
- Add pr_number input to workflow_dispatch for fork PR eval support
- .npmrc already deleted in prior commit (no action needed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Install @microsoft/vally-cli (not @microsoft/vally) for CLI executable
- Run npm install to sync package-lock.json with package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CLI executable is in the vally-cli package, not the vally library package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lls)

Migrate 10 complex Jest integration test files to Vally eval configs:

Batch 3A (7 smaller complex skills):
- _template (4 stimuli, workspace setup)
- azure-hosted-copilot-sdk (8 stimuli, security regression checks, 20min timeout)
- azure-resource-visualizer (4 stimuli, file pattern assertions, 30min timeout)
- azure-enterprise-infra-planner (8 stimuli, file validation)
- appinsights-instrumentation (4 stimuli, resource copying)
- azure-cloud-migrate (2 stimuli, cloneRepo, 45min timeout)
- azure-resource-lookup (7 stimuli, keyword validation)

Batch 3B (3 large complex skills):
- azure-deploy (39 stimuli, custom deploy validators, 60min timeout)
- azure-validate (11 stimuli, sparse checkout, 45min timeout)
- azure-prepare (38 stimuli, multi-language scaffolding, 60min timeout)

Total: 125 stimuli across 10 eval configs.
Extended grader types: file-exists, file-matches, output-not-matches (security).
Custom helpers mapped: hasDeployLinks, expectFiles, matchesCommand,
countSecretsInCode, matchesFileEdit, doesWorkspaceFileIncludePattern.
All prompts preserved verbatim from Jest test files.

Part of #1818

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…re-validate evals

Fix lint errors: prompt was nested under input.prompt instead of
being a top-level stimulus field. Corrected in all 88 stimuli
across 3 files. No prompt text changed.
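
The shape of that fix can be sketched on a stimulus represented as a plain dictionary. This is a Python illustration of the restructuring only; the helper name `hoist_prompt` and the `name` field in the example are hypothetical, and the real change was made directly in the YAML files.

```python
def hoist_prompt(stimulus: dict) -> dict:
    """Move a prompt nested under input.prompt to a top-level prompt field."""
    fixed = dict(stimulus)
    nested = fixed.get("input")
    if isinstance(nested, dict) and "prompt" in nested and "prompt" not in fixed:
        fixed["prompt"] = nested["prompt"]  # prompt text is copied unchanged
        remaining = {key: value for key, value in nested.items() if key != "prompt"}
        if remaining:
            fixed["input"] = remaining
        else:
            del fixed["input"]
    return fixed


before = {"name": "example-stimulus", "input": {"prompt": "Deploy my app to Azure"}}
print(hoist_prompt(before))
```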

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The copilot-sdk executor requires environment.skills to load skill
definitions into the session. Without this, no skills are available
and skill-invocation graders always fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Correct model identifiers, grader types, tag format, and scoring weights
  per verified patterns from PR #1912

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix model identifier: claude-sonnet-4.5-20250514 -> claude-sonnet-4.5
- Fix legacy Waza grader type: regex -> output-not-matches
- Fix duration format: timeout_seconds: 2700 -> timeout: 45m
- Convert Waza config fields to Vally format (trials_per_task -> runs, etc.)
- Remove Waza-only top-level fields (skill, version, metrics)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Lint Eval Specs CI check failed because azure-deploy and azure-prepare
eval specs were missing the required 'scoring' section, triggering
scoring-defaults-applied warnings that are treated as errors in strict mode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza and others added 3 commits May 11, 2026 13:41
- Convert azure-deploy, azure-prepare from Waza to Vally schema
- Change cost: free to cost: llm for copilot-sdk stimuli
- Add pending notes for workspace-dependent stimuli
- Fix template placeholder and README docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The @github/copilot-sdk checks COPILOT_GITHUB_TOKEN first (priority 1),
before GH_TOKEN and GITHUB_TOKEN. Using GITHUB_TOKEN can conflict with
the built-in Actions token.
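
The lookup order described above can be sketched as a first-match scan over environment variables. This Python sketch is not the SDK's code: `resolve_token` is a hypothetical name, and the relative order of GH_TOKEN and GITHUB_TOKEN is assumed from the order in which the commit message lists them.

```python
import os
from typing import Mapping, Optional, Tuple

# COPILOT_GITHUB_TOKEN comes first per the commit message; the order of the
# remaining two is an assumption.
TOKEN_PRIORITY = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def resolve_token(env: Optional[Mapping[str, str]] = None) -> Optional[Tuple[str, str]]:
    """Return (variable name, value) for the first non-empty token variable."""
    source = os.environ if env is None else env
    for name in TOKEN_PRIORITY:
        value = source.get(name)
        if value:
            return name, value
    return None


# In Actions, GITHUB_TOKEN is the built-in token; a dedicated variable wins over it.
example_env = {"GITHUB_TOKEN": "built-in-token", "COPILOT_GITHUB_TOKEN": "copilot-pat"}
print(resolve_token(example_env))
```

Setting the dedicated variable avoids overriding GITHUB_TOKEN for the step, which is the conflict the commit message refers to.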

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions Bot (Contributor) commented May 11, 2026

🔍 Token Analysis Report

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts compare --base origin/main --head HEAD --markdown

📊 Token Change Report

Comparing origin/main with HEAD

Summary

| Metric | Value |
| --- | --- |
| 📈 Total Change | +19 tokens (+1%) |
| Before | 1,880 tokens |
| After | 1,899 tokens |
| Files Changed | 2 |

Changed Files

| File | Before | After | Change |
| --- | --- | --- | --- |
| evals/README.md | 415 | 432 | +17 (+4%) |
| tests/azure-prepare/eval/README.md | 1,465 | 1,467 | +2 (0%) |

@github-copilot-for-azure/scripts@1.0.0 tokens
node --import tsx src/tokens/cli.ts check --markdown

📊 Token Limit Check Report

Checked: 645 files
Exceeded: 94 files

⚠️ Files Exceeding Token Limits

| File | Tokens | Limit | Over By |
| --- | --- | --- | --- |
| .github/skills/analyze-skill-issues/SKILL.md | 2109 | 500 | +1609 |
| .github/skills/analyze-test-run/SKILL.md | 2471 | 500 | +1971 |
| .github/skills/file-test-bug/SKILL.md | 628 | 500 | +128 |
| .github/skills/sensei/README.md | 3531 | 2000 | +1531 |
| .github/skills/sensei/SKILL.md | 3026 | 500 | +2526 |
| .github/skills/sensei/references/EXAMPLES.md | 3701 | 2000 | +1701 |
| .github/skills/sensei/references/LOOP.md | 4181 | 2000 | +2181 |
| .github/skills/sensei/references/SCORING.md | 4299 | 2000 | +2299 |
| .github/skills/skill-authoring/SKILL.md | 839 | 500 | +339 |
| plugin/skills/airunway-aks-setup/SKILL.md | 1025 | 500 | +525 |
| plugin/skills/appinsights-instrumentation/SKILL.md | 937 | 500 | +437 |
| plugin/skills/azure-ai/SKILL.md | 820 | 500 | +320 |
| plugin/skills/azure-aigateway/SKILL.md | 1261 | 500 | +761 |
| plugin/skills/azure-aigateway/references/policies.md | 2342 | 2000 | +342 |
| plugin/skills/azure-cloud-migrate/SKILL.md | 1085 | 500 | +585 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/cloudrun-deployment-guide.md | 2029 | 2000 | +29 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/deployment-guide.md | 2458 | 2000 | +458 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/fargate-deployment-guide.md | 2587 | 2000 | +587 |
| plugin/skills/azure-cloud-migrate/references/services/container-apps/spring-deployment-guide.md | 3871 | 2000 | +1871 |
| plugin/skills/azure-cloud-migrate/references/services/functions/lambda-to-functions.md | 2600 | 2000 | +600 |
| plugin/skills/azure-cloud-migrate/references/services/functions/runtimes/javascript.md | 2181 | 2000 | +181 |
| plugin/skills/azure-compliance/SKILL.md | 1188 | 500 | +688 |
| plugin/skills/azure-compute/SKILL.md | 1370 | 500 | +870 |
| plugin/skills/azure-compute/workflows/essential-machine-management/references/emm-enable-flow.md | 2344 | 2000 | +344 |
| plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md | 2631 | 2000 | +631 |
| plugin/skills/azure-compute/workflows/vm-troubleshooter/vm-troubleshooter.md | 2509 | 2000 | +509 |
| plugin/skills/azure-cost/SKILL.md | 1980 | 500 | +1480 |
| plugin/skills/azure-deploy/SKILL.md | 1645 | 500 | +1145 |
| plugin/skills/azure-deploy/references/pre-deploy-checklist.md | 4692 | 2000 | +2692 |
| plugin/skills/azure-deploy/references/recipes/azd/errors.md | 4004 | 2000 | +2004 |
| plugin/skills/azure-deploy/references/troubleshooting.md | 2038 | 2000 | +38 |
| plugin/skills/azure-diagnostics/SKILL.md | 1423 | 500 | +923 |
| plugin/skills/azure-enterprise-infra-planner/SKILL.md | 1002 | 500 | +502 |
| plugin/skills/azure-enterprise-infra-planner/references/constraints/compute-apps.md | 2022 | 2000 | +22 |
| plugin/skills/azure-hosted-copilot-sdk/SKILL.md | 1332 | 500 | +832 |
| plugin/skills/azure-kubernetes/SKILL.md | 2606 | 500 | +2106 |
| plugin/skills/azure-kubernetes/azure-kubernetes-automatic-readiness/SKILL.md | 3609 | 500 | +3109 |
| plugin/skills/azure-kusto/SKILL.md | 2152 | 500 | +1652 |
| plugin/skills/azure-messaging/SKILL.md | 821 | 500 | +321 |
| plugin/skills/azure-prepare/SKILL.md | 3375 | 500 | +2875 |
| plugin/skills/azure-prepare/references/aspire.md | 4617 | 2000 | +2617 |
| plugin/skills/azure-prepare/references/plan-template.md | 2617 | 2000 | +617 |
| plugin/skills/azure-prepare/references/recipes/azd/aspire.md | 2275 | 2000 | +275 |
| plugin/skills/azure-prepare/references/recipes/azd/terraform.md | 3555 | 2000 | +1555 |
| plugin/skills/azure-prepare/references/research.md | 2274 | 2000 | +274 |
| plugin/skills/azure-prepare/references/resources-limits-quotas.md | 3322 | 2000 | +1322 |
| plugin/skills/azure-prepare/references/security.md | 2147 | 2000 | +147 |
| plugin/skills/azure-prepare/references/services/functions/bicep.md | 3127 | 2000 | +1127 |
| plugin/skills/azure-prepare/references/services/functions/templates/recipes/composition.md | 2813 | 2000 | +813 |
| plugin/skills/azure-prepare/references/services/functions/terraform.md | 3404 | 2000 | +1404 |
| plugin/skills/azure-prepare/references/services/sql-database/bicep.md | 2037 | 2000 | +37 |
| plugin/skills/azure-quotas/SKILL.md | 2821 | 500 | +2321 |
| plugin/skills/azure-quotas/references/commands.md | 2644 | 2000 | +644 |
| plugin/skills/azure-reliability/SKILL.md | 5659 | 500 | +5159 |
| plugin/skills/azure-reliability/references/configure-multi-region.md | 4729 | 2000 | +2729 |
| plugin/skills/azure-resource-lookup/SKILL.md | 1394 | 500 | +894 |
| plugin/skills/azure-resource-visualizer/SKILL.md | 2122 | 500 | +1622 |
| plugin/skills/azure-storage/SKILL.md | 1228 | 500 | +728 |
| plugin/skills/azure-upgrade/SKILL.md | 1542 | 500 | +1042 |
| plugin/skills/azure-upgrade/references/languages/java/INSTRUCTION.md | 2724 | 2000 | +724 |
| plugin/skills/azure-upgrade/references/languages/java/package-specific/com.microsoft.azure.management.md | 2215 | 2000 | +215 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PLAN_TEMPLATE.md | 2411 | 2000 | +411 |
| plugin/skills/azure-upgrade/references/languages/java/templates/PROGRESS_TEMPLATE.md | 2315 | 2000 | +315 |
| plugin/skills/azure-upgrade/references/languages/java/templates/SUMMARY_TEMPLATE.md | 2190 | 2000 | +190 |
| plugin/skills/azure-upgrade/references/services/functions/automation.md | 3463 | 2000 | +1463 |
| plugin/skills/azure-upgrade/references/services/functions/consumption-to-flex.md | 2773 | 2000 | +773 |
| plugin/skills/azure-validate/SKILL.md | 950 | 500 | +450 |
| plugin/skills/entra-agent-id/SKILL.md | 4001 | 500 | +3501 |
| plugin/skills/entra-app-registration/SKILL.md | 2070 | 500 | +1570 |
| plugin/skills/entra-app-registration/references/api-permissions.md | 2545 | 2000 | +545 |
| plugin/skills/entra-app-registration/references/cli-commands.md | 2211 | 2000 | +211 |
| plugin/skills/entra-app-registration/references/console-app-example.md | 2752 | 2000 | +752 |
| plugin/skills/entra-app-registration/references/oauth-flows.md | 2375 | 2000 | +375 |
| plugin/skills/microsoft-foundry/SKILL.md | 3955 | 500 | +3455 |
| plugin/skills/microsoft-foundry/foundry-agent/create/create-hosted.md | 4824 | 2000 | +2824 |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | 6203 | 2000 | +4203 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | 2494 | 2000 | +494 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | 2088 | 2000 | +88 |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md | 4325 | 2000 | +2325 |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | 3436 | 2000 | +1436 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | 3190 | 2000 | +1190 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md | 3860 | 2000 | +1860 |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | 2072 | 2000 | +72 |
| plugin/skills/microsoft-foundry/foundry-agent/trace/references/kql-templates.md | 2701 | 2000 | +701 |
| plugin/skills/microsoft-foundry/models/deploy-model/SKILL.md | 1640 | 500 | +1140 |
| plugin/skills/microsoft-foundry/models/deploy-model/capacity/SKILL.md | 1739 | 500 | +1239 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/SKILL.md | 2235 | 500 | +1735 |
| plugin/skills/microsoft-foundry/models/deploy-model/customize/references/customize-workflow.md | 3335 | 2000 | +1335 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/SKILL.md | 1226 | 500 | +726 |
| plugin/skills/microsoft-foundry/models/deploy-model/preset/references/preset-workflow.md | 5534 | 2000 | +3534 |
| plugin/skills/microsoft-foundry/quota/quota.md | 2288 | 2000 | +288 |
| plugin/skills/microsoft-foundry/quota/references/capacity-planning.md | 2080 | 2000 | +80 |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | 2373 | 2000 | +373 |
| plugin/skills/microsoft-foundry/references/sdk/foundry-sdk-py.md | 2162 | 2000 | +162 |

Consider moving content to references/ subdirectories.
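
The "Over By" column in the report above is the token count minus the limit. A minimal Python sketch of that check follows, assuming the limits are 500 tokens for files named SKILL.md and 2000 for every other file; that rule is inferred from the rows in this report, not taken from the script's source, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath


def token_limit(path: str) -> int:
    """Return the limit for a file (assumed: 500 for SKILL.md, 2000 otherwise)."""
    return 500 if PurePosixPath(path).name == "SKILL.md" else 2000


def over_by(path: str, tokens: int) -> int:
    """Return how many tokens a file exceeds its limit by, or 0 if within it."""
    return max(0, tokens - token_limit(path))


# Two rows from the report above.
print(over_by(".github/skills/sensei/SKILL.md", 3026))
print(over_by(".github/skills/sensei/README.md", 3531))
```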


Automated token analysis. See skill authoring guidelines for best practices.

Copilot AI (Contributor) left a comment

Pull request overview

Migrates a batch of complex skill integration tests from Jest into Vally eval specs under evals/**, with accompanying updates to legacy per-skill eval configs and the eval CI workflow.

Changes:

  • Adds new Vally evals/<skill>/eval.yaml specs for multiple skills (routing/behavior/file graders, including template scaffolding).
  • Updates legacy test-side eval configs under tests/**/eval(s)/ to newer config/scoring/grader shapes.
  • Modifies .github/workflows/eval.yml to lint evals/skills and run Vally in CI.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/package.json | Removes Vally-related npm scripts from the tests package. |
| tests/azure-prepare/eval/README.md | Updates local-running instructions for the azure-prepare eval suite. |
| tests/azure-prepare/eval/eval.yaml | Updates legacy azure-prepare eval config (runs/timeout/scoring/graders). |
| tests/azure-enterprise-infra-planner/evals/eval.yaml | Updates legacy eval config (runs/timeout/scoring/graders). |
| tests/azure-deploy/eval/eval.yaml | Updates legacy eval config (runs/timeout/scoring/graders). |
| evals/_template/eval.yaml | Adds a reference/template Vally eval spec for future migrations. |
| evals/appinsights-instrumentation/eval.yaml | Adds a new Vally eval spec migrated from Jest for appinsights instrumentation. |
| evals/azure-cloud-migrate/eval.yaml | Adds a new Vally eval spec migrated from Jest for cloud-migrate scenarios. |
| evals/azure-deploy/eval.yaml | Replaces/expands azure-deploy eval spec with a large Jest-migrated stimulus set. |
| evals/azure-enterprise-infra-planner/eval.yaml | Replaces/rewrites infra-planner eval spec migrated from Jest. |
| evals/azure-hosted-copilot-sdk/eval.yaml | Replaces/rewrites hosted-copilot-sdk eval spec migrated from Jest. |
| evals/azure-prepare/eval.yaml | Replaces/rewrites azure-prepare eval spec migrated from Jest with many stimuli and file graders. |
| evals/azure-resource-lookup/eval.yaml | Adds a new Vally eval spec migrated from Jest for resource-lookup prompts. |
| evals/azure-resource-visualizer/eval.yaml | Adds a new Vally eval spec migrated from Jest for resource-visualizer routing and file output. |
| evals/azure-validate/eval.yaml | Adds a new Vally eval spec migrated from Jest for validate scenarios (routing + file/output checks). |
| .github/workflows/eval.yml | Updates eval workflow triggers/inputs and switches CI execution behavior for Vally. |

Comment thread .github/workflows/eval.yml Outdated
Comment thread .github/workflows/eval.yml Outdated
Comment thread tests/azure-prepare/eval/README.md Outdated
Comment thread evals/azure-resource-lookup/eval.yaml Outdated
Comment thread evals/appinsights-instrumentation/eval.yaml
Comment thread evals/azure-validate/eval.yaml Outdated
Comment thread evals/azure-hosted-copilot-sdk/eval.yaml
jongio (Collaborator) left a comment

One workflow item not covered by the existing review.

Comment thread .github/workflows/eval.yml Outdated
- Update all eval specs to claude-sonnet-4.6
- Align eval.yml with reviewed patterns (build step, --skill-dir, --junit, artifacts)
- Fix auth docs for local development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza requested review from JasonYeMSFT and jongio May 13, 2026 18:51
jongio previously approved these changes May 13, 2026

jongio (Collaborator) left a comment

My setup-node downgrade comment is addressed - restored to v6.4.0. The new commit also adds a proper build step, smoke eval PR gate, and switches all configs to the evals/ directory instead of a single hard-coded spec. Looks good.

- eval.yml: wire suite input through to eval invocation (was unused)
- README: align Copilot SDK auth env var to COPILOT_GITHUB_TOKEN
- azure-resource-lookup: switch web|app|site grader to output-matches regex
- appinsights-instrumentation: replace file-matches with output-matches against
  the original Jest regex patterns (workspace seeding no longer available)
- azure-validate: replace AppHost.cs file-matches with output-contains for
  AzureWebJobsSecretStorageType guidance (no Aspire sample seeding in eval)
- azure-hosted-copilot-sdk: tighten 2 prompts to mention "Copilot SDK"
  explicitly so skill-invocation routing doesn't depend on missing workspace
  seeding

Refs #2236

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza force-pushed the feature/jest-batch3-migration branch from 36fcc44 to dffe4b4 Compare May 15, 2026 22:35
@wbreza wbreza requested a review from jongio May 15, 2026 22:40
jongio (Collaborator) left a comment

The migration looks faithful and the workflow improvements are solid - build step, smoke gate, suite wiring all make sense. One gap worth checking: the PR gate's --tier smoke filter vs what's actually tagged in the eval configs.

Comment thread .github/workflows/eval.yml Outdated
wbreza and others added 2 commits May 18, 2026 10:09
Resolves conflicts in .vally.yaml and 4 eval YAMLs by preserving both sides' intent: keeping batch3 grader/prompt fixes from dffe4b4 while incorporating new stimuli and config from main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re-tags one cheap routing/invocation stimulus in each of the 10 eval
files migrated in this PR with tier: smoke so the --suite smoke
PR gate exercises real routing coverage for these skills (matching the
convention already used by azure-ai, azure-aigateway, azure-diagnostics,
azure-kusto, azure-storage, azure-upgrade, entra-app-registration, and
microsoft-foundry on main).

Addresses jongio's review feedback that the smoke gate had no
matching stimuli for newly-migrated skills.
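
The gap raised in review (a smoke gate with no matching stimuli) can be checked mechanically. The Python sketch below works on already-parsed specs represented as plain dictionaries; the `tags` mapping with a `tier` key is an assumption about the stimulus shape, and `missing_smoke` and the stimulus names are hypothetical.

```python
from typing import Dict, List


def missing_smoke(evals: Dict[str, List[dict]]) -> List[str]:
    """Return the eval names that have no stimulus tagged tier: smoke."""
    return sorted(
        name
        for name, stimuli in evals.items()
        if not any(s.get("tags", {}).get("tier") == "smoke" for s in stimuli)
    )


parsed = {
    "azure-deploy": [
        {"name": "routing", "tags": {"tier": "smoke"}},
        {"name": "full-deploy", "tags": {"tier": "integration"}},
    ],
    "azure-validate": [
        {"name": "validate-bicep", "tags": {"tier": "integration"}},
    ],
}
print(missing_smoke(parsed))
```

A check like this could run alongside the lint step so that a newly added eval file without a smoke-tier stimulus is reported before the PR gate silently skips it.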

Refs #2236

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wbreza wbreza requested a review from jongio May 18, 2026 17:18
3 participants