Clarify FAOS optimization guidance and improve evaluator contracts#2286

Open
XOEEst wants to merge 15 commits into microsoft:main from XOEEst:main

@XOEEst
Collaborator

@XOEEst XOEEst commented May 18, 2026

This pull request significantly updates the Microsoft Foundry agent deployment and evaluation documentation to enforce stricter deployment completion criteria, clarify the evaluation suite auto-generation process, and improve guidance for dataset management. The changes ensure that deployments are not considered complete until both an invocation test and an auto-generated evaluation suite are finished, and they provide detailed instructions for handling evaluation suite creation, manual fallbacks, and artifact persistence.

Deployment Workflow Enforcement and Evaluation Suite Generation:

  • Deployment workflows for both hosted and prompt agents now require an invocation smoke test and auto-generation of an evaluation suite before reporting success. The "Definition of Done" checklists have been added and expanded to reflect these requirements. [1] [2]
  • The documentation now mandates polling for evaluation suite generation completion before summarizing deployment, and provides explicit fallback instructions if automated suite generation fails. Users are prompted to select the suite generation source, and all steps for artifact persistence and metadata updates are detailed. [1] [2] [3]

Evaluation Dataset and Tooling Updates:

  • The eval-datasets documentation now includes support for data generation jobs and evaluation suite creation tools, and covers new use cases such as regenerating datasets and refreshing synthetic data. [1] [2] [3]

Terminology and Clarity Improvements:

  • Terminology has been standardized throughout the documentation, such as "Auto-Generate Evaluation Suite" replacing "Auto-Create Evaluators & Dataset," and steps have been clarified for both success and fallback scenarios. [1] [2] [3]

Summary of Most Important Changes:

Deployment Workflow and Completion Criteria

  • Enforced that both invocation testing and auto-generation of an evaluation suite are mandatory for a deployment to be considered complete, with new "Definition of Done" checklists for both hosted and prompt agent workflows. [1] [2]
  • Updated workflow tables and instructions to include the new "observe (auto-setup eval suite)" step after deployment and invocation for both new and updated agents.
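
As a rough illustration of the completion gate described above (not code from this PR), the check can be sketched as a small predicate. The `DeploymentStatus` type and its field names are hypothetical, and treating a documented manual fallback as an acceptable suite outcome is an assumption based on the fallback wording in this description.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    """Hypothetical record of post-deploy checks; field names are illustrative."""

    invocation_ok: bool                # invocation smoke test succeeded
    suite_generated: bool              # auto-generated evaluation suite finished
    fallback_documented: bool = False  # manual fallback was recorded instead


def is_deployment_done(status: DeploymentStatus) -> bool:
    """Return True only when invocation passed AND the suite step is resolved."""
    suite_resolved = status.suite_generated or status.fallback_documented
    return status.invocation_ok and suite_resolved
```

With this shape, a deployment whose invocation succeeded but whose suite step is still unresolved is reported as not done.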

Evaluation Suite Generation and Fallbacks

  • Added detailed, step-by-step instructions for auto-generating evaluation suites, including polling for job completion, handling user prompts for generation source, and specifying required parameters and service constraints.
  • Provided explicit fallback procedures for manual evaluation suite creation if automated generation fails, ensuring users are informed and artifacts are properly recorded.
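
The poll-then-fall-back flow above might look roughly like the following sketch. It assumes a caller-supplied `get_status` callable and the status strings `"succeeded"` and `"failed"`; both are hypothetical stand-ins, since the actual generation tool and its status values are not shown here.

```python
import time
from typing import Callable


def wait_for_suite(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):  # hypothetical terminal statuses
            return status
        if time.monotonic() >= deadline:
            return "timed_out"
        sleep(interval_s)


def resolve_suite(get_status: Callable[[], str], **kwargs) -> str:
    """Map the polling outcome to the path the workflow should take next."""
    outcome = wait_for_suite(get_status, **kwargs)
    return "auto_generated" if outcome == "succeeded" else "manual_fallback"
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Anything other than success, including a timeout, routes to the manual fallback in this sketch.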

Artifact and Metadata Management

  • Clarified and expanded instructions for persisting deployment context and evaluation artifacts, including updates to .foundry/agent-metadata.yaml and local cache directories, and how to handle legacy metadata formats.

Evaluation Dataset Documentation

  • Expanded eval-datasets documentation to support new data generation and evaluation suite tools, and added guidance for dataset regeneration and refreshing synthetic data. [1] [2] [3]

Terminology and Clarity

  • Standardized terminology throughout the documentation (e.g., "Auto-Generate Evaluation Suite"), and improved clarity in workflow steps and user prompts. [1] [2] [3]

These changes collectively improve the reliability, transparency, and reproducibility of agent deployments and evaluations in Microsoft Foundry.

Description

Checklist

  • Tests pass locally (cd tests && npm test)
  • If modifying skill descriptions: verified routing correctness with integration tests (npm run test:skills:integration -- <skill>)
  • If modifying skill USE FOR / DO NOT USE FOR / PREFER OVER clauses: confirmed no routing regressions for competing skills

Related Issues

XOEEst and others added 14 commits May 5, 2026 22:38
Address PR microsoft#2174 review comments by expanding FAOS as Foundry Agent Optimization Service on first use and making the Step 8 Python config snippet copy-paste safe.

The minimum contract example now imports os and preserves the app's existing model-selection fallback instead of hard-coding MODEL_DEPLOYMENT_NAME unless that is already what the app uses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
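
For illustration only, a config snippet that preserves an app's existing model-selection fallback could take roughly this shape. The override variable `OPTIMIZATION_MODEL`, the helper name, and the precedence order are assumptions, not the snippet from the skill docs.

```python
import os


def resolve_model(
    app_env_var: str,
    app_default: str,
    override_env_var: str = "OPTIMIZATION_MODEL",  # hypothetical variable name
) -> str:
    """Prefer an optimization override, then the app's own setting, then its default."""
    return (
        os.environ.get(override_env_var)
        or os.environ.get(app_env_var)
        or app_default
    )
```

An app that already reads `MODEL_DEPLOYMENT_NAME` would pass that name as `app_env_var`; an app that uses a different variable keeps its own.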
# Conflicts:
#	plugin/skills/microsoft-foundry/SKILL.md
When creating custom evaluator prompts, treat the runtime-enforced JSON schema as authoritative: result plus reason. Preserve user-provided rubric text, but remove or normalize conflicting output schemas such as score/reasoning or duplicate OUTPUT FORMAT blocks before calling evaluator_catalog_create.

Add observe skill test coverage for the promptText guardrail so future edits keep the result/reason contract visible.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
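
A minimal sketch of the promptText guardrail described in the commit above: drop any OUTPUT FORMAT block from the user's rubric and append a single result/reason instruction. The block-detection heuristic (a line starting with "OUTPUT FORMAT" up to the next blank line) and the wording of `CANONICAL_OUTPUT` are assumptions. The call to evaluator_catalog_create is not shown.

```python
CANONICAL_OUTPUT = (
    'Respond with JSON containing exactly two keys: "result" and "reason".'
)


def normalize_prompt(prompt_text: str) -> str:
    """Keep the rubric, remove OUTPUT FORMAT blocks, add one output contract."""
    kept = []
    skipping = False
    for line in prompt_text.splitlines():
        if line.strip().upper().startswith("OUTPUT FORMAT"):
            skipping = True  # start of a conflicting output block
            continue
        if skipping:
            if line.strip() == "":
                skipping = False  # a blank line ends the block
            continue
        kept.append(line)
    body = "\n".join(kept).strip()
    return f"{body}\n\n{CANONICAL_OUTPUT}"
```

The rubric text outside the removed blocks is passed through unchanged, so only the conflicting schema is replaced.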
Remove the 0-to-1 scoringType/minScore/maxScore details from the sample custom evaluator prompt. The important guardrail is the output contract: preserve the rubric, but avoid conflicting output schemas because the runtime enforces result/reason.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore plugin/skills/microsoft-foundry/SKILL.md from upstream/main so this PR no longer carries a line-ending-only change for the top-level skill file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@XOEEst XOEEst marked this pull request as ready for review May 20, 2026 09:40
Copilot AI review requested due to automatic review settings May 20, 2026 09:40
Contributor

Copilot AI left a comment

Pull request overview

Updates the microsoft-foundry skill documentation to enforce stricter post-deploy completion criteria (invocation + evaluation suite generation), expand guidance for auto-generating and persisting evaluation-suite artifacts, and refine FAOS optimization/runtime configuration guidance.

Changes:

  • Makes deployment “done” contingent on both a successful invocation smoke test and a completed (or explicitly documented fallback) evaluation suite generation step.
  • Adds/expands documentation for evaluation suite generation, artifact caching, metadata schema updates, and evaluation run completion criteria.
  • Refines FAOS optimization guidance (skills loading, tool description patching, and OPTIMIZATION_* env var conventions) and broadens dataset lifecycle docs to include generation jobs.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| plugin/skills/microsoft-foundry/SKILL.md | Updates lifecycle routing table to include eval-suite setup after deploy/invoke. |
| plugin/skills/microsoft-foundry/references/agent-metadata-contract.md | Extends metadata contract/examples for suite references and generation provenance fields. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluation-suite-generation.md | Adds a new end-to-end reference for suite generation, polling, and caching artifacts. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md | Tightens “evaluation run” completion criteria and clarifies agent-target batch eval inputs. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md | Reworks Step 1 to focus on auto-generating and persisting evaluation suites. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md | Aligns re-evaluation guidance with agent-target batch eval and suite metadata usage. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md | Updates observability rule reference for knowledge-cutoff mitigation guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/observe/observe.md | Updates observe workflow framing to prioritize suite generation and agent-target runs. |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/references/python-patterns.md | Adds concrete Python patterns for skills loading and safe tool description patching; updates env var guidance. |
| plugin/skills/microsoft-foundry/foundry-agent/faos-optimize/faos-optimize.md | Updates FAOS contract guidance for OPTIMIZATION_* variables and local candidate directory support. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md | Positions manual seed datasets as fallback to suite/data generation APIs; marks generationSource in metadata example. |
| plugin/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md | Expands dataset lifecycle docs to include generation jobs and suite tooling integration. |
| plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md | Enforces an explicit post-deploy Step 8/5 gate for suite generation and artifact persistence before reporting success. |
| package.json | Reformats JSON (no functional change). |
Comments suppressed due to low confidence (1)

plugin/skills/microsoft-foundry/references/agent-metadata-contract.md:67

  • In the example generated suite entry, datasetFile still points to a local .jsonl. The suite-generation caching instructions added in this PR require writing a dataset reference stub (*.ref.json) plus downloading the actual dataset blobs into a versioned folder. Updating the example to reflect the new caching contract would avoid agents persisting the wrong artifact path.
          purpose: baseline
          stage: seed
        dataset: support-agent-dev-eval-seed
        datasetVersion: v1
        datasetFile: .foundry/datasets/support-agent-dev-eval-seed-v1.jsonl
        datasetUri: <foundry-dataset-uri>

Comment thread plugin/skills/microsoft-foundry/SKILL.md Outdated
Comment thread plugin/skills/microsoft-foundry/references/agent-metadata-contract.md Outdated
Comment thread plugin/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md Outdated
Comment thread plugin/skills/microsoft-foundry/foundry-agent/deploy/deploy.md Outdated
2 participants