Clarify FAOS optimization guidance and improve evaluator contracts by XOEEst · Pull Request #2286 · microsoft/GitHub-Copilot-for-Azure

XOEEst · 2026-05-18T08:05:25Z

This pull request significantly updates the Microsoft Foundry agent deployment and evaluation documentation to enforce stricter deployment completion criteria, clarify the evaluation suite auto-generation process, and improve guidance for dataset management. The changes ensure that deployments are not considered complete until both an invocation test and an auto-generated evaluation suite are finished, and they provide detailed instructions for handling evaluation suite creation, manual fallbacks, and artifact persistence.

Deployment Workflow Enforcement and Evaluation Suite Generation:

Deployment workflows for both hosted and prompt agents now require an invocation smoke test and auto-generation of an evaluation suite before reporting success. The "Definition of Done" checklists have been added and expanded to reflect these requirements. [1] [2]
The documentation now mandates polling for evaluation suite generation completion before summarizing deployment, and provides explicit fallback instructions if automated suite generation fails. Users are prompted to select the suite generation source, and all steps for artifact persistence and metadata updates are detailed. [1] [2] [3]

Evaluation Dataset and Tooling Updates:

The eval-datasets documentation now includes support for data generation jobs and evaluation suite creation tools, and covers new use cases such as regenerating datasets and refreshing synthetic data. [1] [2] [3]

Terminology and Clarity Improvements:

Terminology has been standardized throughout the documentation, such as "Auto-Generate Evaluation Suite" replacing "Auto-Create Evaluators & Dataset," and steps have been clarified for both success and fallback scenarios. [1] [2] [3]

Summary of Most Important Changes:

Deployment Workflow and Completion Criteria

Enforced that both invocation testing and auto-generation of an evaluation suite are mandatory for a deployment to be considered complete, with new "Definition of Done" checklists for both hosted and prompt agent workflows. [1] [2]
Updated workflow tables and instructions to include the new "observe (auto-setup eval suite)" step after deployment and invocation for both new and updated agents.
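The completion gate described above can be sketched as a small predicate. This is only an illustration: the `DeploymentStatus` record and its field names are hypothetical and do not come from the skill files. Counting a documented fallback as satisfying the suite gate follows the review summary's wording ("completed (or explicitly documented fallback)").

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    """Hypothetical record of the post-deploy gates (names are illustrative)."""
    deployed: bool = False
    invocation_passed: bool = False    # invocation smoke test succeeded
    suite_generated: bool = False      # auto-generated evaluation suite finished
    fallback_documented: bool = False  # manual fallback recorded after a failed generation


def is_done(status: DeploymentStatus) -> bool:
    """Return True only when every Definition of Done gate is satisfied."""
    suite_ok = status.suite_generated or status.fallback_documented
    return status.deployed and status.invocation_passed and suite_ok
```

Under this sketch, a deployment that was invoked successfully but has neither a generated suite nor a documented fallback is still reported as not done.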

Evaluation Suite Generation and Fallbacks

Added detailed, step-by-step instructions for auto-generating evaluation suites, including polling for job completion, handling user prompts for generation source, and specifying required parameters and service constraints.
Provided explicit fallback procedures for manual evaluation suite creation if automated generation fails, ensuring users are informed and artifacts are properly recorded.
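A minimal sketch of the poll-then-fallback flow, assuming the generation job exposes a status that eventually becomes terminal. The status strings, the `get_status` callable, and the default interval and timeout are all assumptions made for illustration; the actual tool names and service constraints are defined in the skill references, not here.

```python
import time
from typing import Callable


def wait_for_suite(get_status: Callable[[], str],
                   interval_s: float = 5.0,
                   timeout_s: float = 600.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal state or the timeout elapses.

    Returns "succeeded", "failed", or "timed_out" (hypothetical status names).
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        if waited >= timeout_s:
            return "timed_out"
        sleep(interval_s)
        waited += interval_s


def next_action(outcome: str) -> str:
    """Persist artifacts on success; otherwise take the manual fallback path."""
    return "persist_artifacts" if outcome == "succeeded" else "manual_fallback"
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Treating a timeout the same as a failure is a design choice of this sketch, not something the PR states.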

Artifact and Metadata Management

Clarified and expanded instructions for persisting deployment context and evaluation artifacts, including updates to .foundry/agent-metadata.yaml and local cache directories, and how to handle legacy metadata formats.

Evaluation Dataset Documentation

Expanded eval-datasets documentation to support new data generation and evaluation suite tools, and added guidance for dataset regeneration and refreshing synthetic data. [1] [2] [3]

Terminology and Clarity

Standardized terminology throughout the documentation (e.g., "Auto-Generate Evaluation Suite"), and improved clarity in workflow steps and user prompts. [1] [2] [3]

These changes collectively improve the reliability, transparency, and reproducibility of agent deployments and evaluations in Microsoft Foundry.

Checklist

Tests pass locally (cd tests && npm test)
If modifying skill descriptions: verified routing correctness with integration tests (npm run test:skills:integration -- <skill>)
If modifying skill USE FOR / DO NOT USE FOR / PREFER OVER clauses: confirmed no routing regressions for competing skills

Related Issues

Address PR microsoft#2174 review comments by expanding FAOS as Foundry Agent Optimization Service on first use and making the Step 8 Python config snippet copy-paste safe. The minimum contract example now imports os and preserves the app's existing model-selection fallback instead of hard-coding MODEL_DEPLOYMENT_NAME unless that is already what the app uses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
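The intent of that commit can be illustrated as follows. This is not the snippet from the PR: the variable name `OPTIMIZATION_MODEL` is a hypothetical member of the `OPTIMIZATION_*` family, and `resolve_model` is an illustrative helper. The point is only that an optimization override is layered on top of whatever model selection the app already performs, rather than replacing it with a hard-coded lookup.

```python
import os
from typing import Mapping, Optional

# Hypothetical name: the PR refers only to the OPTIMIZATION_* family of variables.
OPTIMIZATION_MODEL_VAR = "OPTIMIZATION_MODEL"


def resolve_model(existing_choice: str,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer a non-empty optimization override; otherwise keep the app's own choice."""
    env = os.environ if env is None else env
    override = env.get(OPTIMIZATION_MODEL_VAR)
    return override if override else existing_choice
```

An app that already reads `MODEL_DEPLOYMENT_NAME` would pass that value in as `existing_choice`; an app with a different fallback chain passes the result of its own chain instead.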


When creating custom evaluator prompts, treat the runtime-enforced JSON schema as authoritative: result plus reason. Preserve user-provided rubric text, but remove or normalize conflicting output schemas such as score/reasoning or duplicate OUTPUT FORMAT blocks before calling evaluator_catalog_create. Add observe skill test coverage for the promptText guardrail so future edits keep the result/reason contract visible.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the 0-to-1 scoringType/minScore/maxScore details from the sample custom evaluator prompt. The important guardrail is the output contract: preserve the rubric, but avoid conflicting output schemas because the runtime enforces result/reason.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
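One way to picture the promptText guardrail is a normalization pass that runs before evaluator_catalog_create is called. The sketch below assumes that prompt blocks are separated by blank lines and that a conflicting schema sits in a block beginning with the heading "OUTPUT FORMAT". The wording of the canonical block is also an assumption; only the result/reason key names come from the commit messages.

```python
import re

# Assumed wording; only the "result" and "reason" keys are taken from the PR.
CANONICAL_OUTPUT = (
    "OUTPUT FORMAT\n"
    'Return JSON with exactly two keys: "result" and "reason".'
)


def normalize_prompt(prompt_text: str) -> str:
    """Keep the rubric, drop every existing OUTPUT FORMAT block, append one canonical block."""
    blocks = re.split(r"\n\s*\n", prompt_text.strip())
    rubric = [
        b for b in blocks
        if b.strip() and not b.lstrip().upper().startswith("OUTPUT FORMAT")
    ]
    return "\n\n".join(rubric + [CANONICAL_OUTPUT])
```

A rubric followed by two competing OUTPUT FORMAT blocks (for example one asking for score/reasoning) comes out with the rubric text intact and a single result/reason block at the end. Conflicting schemas embedded inside rubric paragraphs are not detected by this simple heuristic.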


Copilot

Pull request overview

Updates the microsoft-foundry skill documentation to enforce stricter post-deploy completion criteria (invocation + evaluation suite generation), expand guidance for auto-generating and persisting evaluation-suite artifacts, and refine FAOS optimization/runtime configuration guidance.

Changes:

Makes deployment “done” contingent on both a successful invocation smoke test and a completed (or explicitly documented fallback) evaluation suite generation step.
Adds/expands documentation for evaluation suite generation, artifact caching, metadata schema updates, and evaluation run completion criteria.
Refines FAOS optimization guidance (skills loading, tool description patching, and OPTIMIZATION_* env var conventions) and broadens dataset lifecycle docs to include generation jobs.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| plugin/skills/microsoft-foundry/SKILL.md | Updates lifecycle routing table to include eval-suite setup after deploy/invoke. |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | Extends metadata contract/examples for suite references and generation provenance fields. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluation-suite-generation.md | Adds a new end-to-end reference for suite generation, polling, and caching artifacts. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md | Tightens “evaluation run” completion criteria and clarifies agent-target batch eval inputs. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | Reworks Step 1 to focus on auto-generating and persisting evaluation suites. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md | Aligns re-evaluation guidance with agent-target batch eval and suite metadata usage. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md | Updates observability rule reference for knowledge-cutoff mitigation guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | Updates observe workflow framing to prioritize suite generation and agent-target runs. |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/references/python-patterns.md | Adds concrete Python patterns for skills loading and safe tool description patching; updates env var guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | Updates FAOS contract guidance for `OPTIMIZATION_*` variables and local candidate directory support. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | Positions manual seed datasets as fallback to suite/data generation APIs; marks generationSource in metadata example. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | Expands dataset lifecycle docs to include generation jobs and suite tooling integration. |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | Enforces an explicit post-deploy Step 8/5 gate for suite generation and artifact persistence before reporting success. |
| package.json | Reformats JSON (no functional change). |

Comments suppressed due to low confidence (1)

plugin/skills/microsoft-foundry/references/agent-metadata-contract.md:67

In the example generated suite entry, datasetFile still points to a local .jsonl. The suite-generation caching instructions added in this PR require writing a dataset reference stub (*.ref.json) plus downloading the actual dataset blobs into a versioned folder. Updating the example to reflect the new caching contract would avoid agents persisting the wrong artifact path.

          purpose: baseline
          stage: seed
        dataset: support-agent-dev-eval-seed
        datasetVersion: v1
        datasetFile: .foundry/datasets/support-agent-dev-eval-seed-v1.jsonl
        datasetUri: <foundry-dataset-uri>
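To make the suggested caching contract concrete, here is a rough sketch of writing a reference stub next to a versioned blob folder. The directory layout, the `localDir` key, and the `write_dataset_ref` helper are hypothetical; the real stub schema is defined in evaluation-suite-generation.md and may differ. The `dataset`, `datasetVersion`, and `datasetUri` keys are borrowed from the metadata example above.

```python
import json
from pathlib import Path


def write_dataset_ref(root: Path, name: str, version: str, uri: str) -> Path:
    """Create <root>/<name>/<version>/ for blobs and write <name>-<version>.ref.json.

    Layout and stub fields are illustrative assumptions, not the PR's schema.
    """
    blob_dir = root / name / version
    blob_dir.mkdir(parents=True, exist_ok=True)  # downloaded dataset blobs would go here
    ref_path = root / f"{name}-{version}.ref.json"
    ref = {
        "dataset": name,
        "datasetVersion": version,
        "datasetUri": uri,
        "localDir": blob_dir.as_posix(),  # hypothetical field
    }
    ref_path.write_text(json.dumps(ref, indent=2), encoding="utf-8")
    return ref_path
```

Under this assumption, the metadata entry would point at the `.ref.json` stub rather than at a single local `.jsonl` file, which is the mismatch the suppressed comment calls out.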

XOEEst and others added 14 commits May 5, 2026 22:38

FAOS optimize preparation skill

d031442

Merge remote-tracking branch 'upstream/main'

aa35edc

# Conflicts: # plugin/skills/microsoft-foundry/SKILL.md

fix

e00c8f4

Revert microsoft-foundry SKILL.md to upstream

1b0033f

Restore plugin/skills/microsoft-foundry/SKILL.md from upstream/main so this PR no longer carries a line-ending-only change for the top-level skill file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix faos skill

1e33e11

Merge upstream/main

19810e4

Eval suite support

06a3d9b

Fix skills

cf476b3

Merge upstream/main

ba07533

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add faos update

beeed1c

Merge upstream/main

84773f5

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

XOEEst marked this pull request as ready for review May 20, 2026 09:40

Copilot AI review requested due to automatic review settings May 20, 2026 09:40

XOEEst requested review from XiaofuHuang, anchenyi, ankitbko, jongio and tendau as code owners May 20, 2026 09:40

Copilot started reviewing on behalf of XOEEst May 20, 2026 09:42

Copilot AI reviewed May 20, 2026


Fix comments

399af6e

This was referenced May 21, 2026

[aw] Issue Triage failed #2356

Closed

[repo-status] Weekly Repo Status — May 15–21, 2026 #2358

Open

XOEEst wants to merge 15 commits into microsoft:main from XOEEst:main


2 participants
