
feat(designer): Agent evaluations tab#8932

Draft
andrew-eldridge wants to merge 8 commits into main from aeldridge/agentEval

Conversation


andrew-eldridge commented Mar 17, 2026

Commit Type

  • feature - New functionality
  • fix - Bug fix
  • refactor - Code restructuring without behavior change
  • perf - Performance improvement
  • docs - Documentation update
  • test - Test-related changes
  • chore - Maintenance/tooling

Risk Level

  • Low - Minor changes, limited scope
  • Medium - Moderate changes, some user impact
  • High - Major changes, significant user/system impact

What & Why

Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.

Impact of Change

  • Users: Introduces the new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. Only applies to A2A/agentic workflows.
  • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService
  • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls for eval fetching/management and running evals
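As a sketch of the data model implied by the description above (three evaluator kinds, each grounded either in a reference run or in a judge model), one might write something like the following. All type and field names here are illustrative assumptions, not the PR's actual models:

```typescript
// Illustrative model only: the real types live in libs/designer-v2 and
// libs/logic-apps-shared and may differ in naming and shape.
type EvaluatorKind = "toolCallTrajectory" | "semanticSimilarity" | "customPrompt";

interface AgentEvaluator {
  id: string;
  kind: EvaluatorKind;
  // Per the PR description, every evaluator grounds its score in one of two ways.
  groundTruth:
    | { mode: "referenceRun"; referenceRunId: string }
    | { mode: "judgeModel"; modelDeployment: string };
  // Only meaningful for customPrompt evaluators.
  prompt?: string;
}

// Helper a UI panel might use to decide whether to prompt for a reference run.
function requiresReferenceRun(e: AgentEvaluator): boolean {
  return e.groundTruth.mode === "referenceRun";
}
```

Modeling the two ground-truth modes as a discriminated union keeps "reference run required" decisions type-safe rather than scattered across flag checks.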

Test Plan

  • Unit tests added/updated
  • E2E tests added/updated
  • Manual testing completed
  • Tested in:

Contributors

@andrew-eldridge

Screenshots/Videos

andrew-eldridge added the risk:medium label (Medium risk change with potential impact) on Mar 17, 2026

github-actions bot commented Mar 17, 2026

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None — title follows conventional commit format and concisely describes the change.
  • Recommendation: Keep as-is. Optionally add a short scope or note if this requires a backend API rollout (e.g., feat(designer): Agent evaluations tab (requires evaluation API)).

Commit Type

  • Properly selected (feature).
  • Note: Only one commit type is selected, which is correct.

Risk Level

  • Assessment: The PR body marks Risk Level as Medium and the PR has a risk:medium label. However, based on the code diff, I advise a higher risk level: High.
  • Rationale: This PR adds a large surface area of functionality across UI, Redux state, query layers, new client services, new models, and multiple new API call sites. Files changed include new designer views, new network services (EvaluationService + StandardEvaluationService), new models, many UI components and panels, state slices, and changes to run history/chat behavior. The changes introduce new API contract dependencies and runtime behaviors across platforms (Standalone, VS Code extension, shared libs). Because of the scope and cross-cutting impact, treat this as high risk and coordinate rollout with backend/contract owners and QA.
  • Recommendation: Update the PR Risk Level to High and change the repo label to risk:high. Add an explicit rollout/migration note if backend endpoints are gated or require versioned APIs.

What & Why

  • Current: "Add agent evaluations functionality in a new designer tab..."
  • Issue: The section is present and concise.
  • Recommendation: Good. Consider adding a one-line note about required backend endpoints or feature flags (if any) and whether the change is behind a feature flag or needs a backend deployment first.

⚠️ Impact of Change

  • Issue: The PR lists Users/Developers/System impacts which is good. However, please explicitly call out any required backend/version support and compatibility expectations.
  • Recommendation:
    • Users: New Agent Evaluations tab + ability to run evaluations from designer. Add: "Requires evaluation endpoints to exist on runtime — users will see errors if backend is not deployed." if applicable.
    • Developers: New services, models, and queries added. Add: "Ensure other consumers of shared libs re-build and bump package versions (designer-v2, logic-apps-shared)." if relevant.
    • System: API contract additions and new network calls. Add: "Document new endpoints and API version used; coordinate with backend release."

Test Plan

  • Assessment: The Test Plan section in the PR body left Unit tests / E2E tests / Manual testing all unchecked and contains no justification.
  • Issue: This is a substantial feature touching many layers and UI components — the PR must include tests or a clear justification and a mitigation plan.
  • Recommendation (action required):
    • If you have added tests, update the PR body and include the test files in the diff. Specifically add:
      • Unit tests for the new queries (libs/designer-v2/src/lib/core/queries/evaluations.ts) and evaluation slice reducers/selectors.
      • Unit tests for any critical UI logic in the EvaluateView, EvaluatorFormPanel (form validation), EvaluatorManagementPanel (list behavior), and EvaluationResultPanel.
      • Integration/E2E tests to cover: opening the Evaluate tab, selecting a run/agent, creating/editing/deleting an evaluator, running an evaluation and showing results.
    • If you cannot add automated tests in this PR, provide a clear explanation in the Test Plan explaining why (e.g., blocked by infra, backend not available) and include a follow-up plan with timelines and ticket/PR references. Explain any manual test steps and acceptance criteria.
    • Do not leave all tests unchecked for a feature of this scope — this PR should not be merged until tests are added or a documented mitigation plan is accepted by maintainers.
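The slice-reducer tests requested above could follow the pattern below. This is a self-contained sketch with a hand-rolled stand-in reducer so the pattern is runnable here; the real evaluation slice would be built with @reduxjs/toolkit, and all action and field names are assumptions:

```typescript
// Stand-in for the evaluation slice; the real slice in
// libs/designer-v2/src/lib/core/state/evaluation may differ.
interface EvaluationState {
  selectedRunId: string | null;
  evaluatorIds: string[];
}

const initialState: EvaluationState = { selectedRunId: null, evaluatorIds: [] };

type EvaluationAction =
  | { type: "evaluation/selectRun"; runId: string }
  | { type: "evaluation/addEvaluator"; evaluatorId: string }
  | { type: "evaluation/reset" };

function evaluationReducer(
  state: EvaluationState = initialState,
  action: EvaluationAction
): EvaluationState {
  switch (action.type) {
    case "evaluation/selectRun":
      return { ...state, selectedRunId: action.runId };
    case "evaluation/addEvaluator":
      return { ...state, evaluatorIds: [...state.evaluatorIds, action.evaluatorId] };
    case "evaluation/reset":
      return initialState;
    default:
      return state;
  }
}
```

Tests would then drive the reducer with plain action objects and assert on the returned state, including that reset restores the initial state, as the report recommends.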

⚠️ Contributors

  • Assessment: A contributor (@andrew-eldridge) is listed.
  • Recommendation: If others (PMs, designers, reviewers) contributed, add them to the Contributors section. If not, consider a short acknowledgement of design/PM if they were involved.

⚠️ Screenshots/Videos

  • Assessment: No screenshots/videos provided.
  • Issue: This is a large UI addition (new tab + multiple panels). Reviewers will benefit from screenshots or a short video showing the main flows.
  • Recommendation: Add screenshots for the main flows (Evaluate tab overview, evaluator list, create-editor modal/form, evaluation result) or a short screen recording. This speeds UI review and helps catch visual/regression issues.

Summary Table

  • Title: good; optionally call out the backend dependency in the title if required.
  • Commit Type: correct (feature).
  • Risk Level: update to High and the risk:high label; include rollout coordination notes.
  • What & Why: good; add a note about backend/feature flags if needed.
  • ⚠️ Impact of Change: add explicit backend/API contract and package/version impact notes.
  • Test Plan: add unit + E2E tests or provide a justified mitigation plan with timeline.
  • ⚠️ Contributors: consider adding other contributors (PM/Design) or acknowledging them.
  • ⚠️ Screenshots/Videos: add screenshots or a short recording of the new UI flows.

Action items (required before merging):

  1. Update PR Risk Level to High and add/replace repo label with risk:high. Document why the risk was raised (broad UI + service + API changes).
  2. Add automated tests (preferred):
    • Unit tests for new queries and evaluation slice (reducers/selectors).
    • Unit tests for key UI behavior (form validation, list interactions, enabling/disabling actions).
    • E2E test(s) for the evaluate flow: open Evaluate tab, select run/agent, create evaluator, run evaluation, assert result displayed.
    • If tests are blocked, add a clear Test Plan that explains the blockers and a follow-up ticket/PR with an ETA.
  3. Add screenshots/videos of the new Evaluate tab and panels (management, form, details, results) or a short walkthrough video.
  4. Call out backend/API prerequisites (new endpoints) in the PR body and confirm whether the backend is already deployed or will be released simultaneously. If the backend is not ready, mark the PR as draft or gate behind a feature flag.
  5. Consider adding a short migration/compatibility note for other consumers of shared packages (designer-v2, logic-apps-shared).

Please update the PR title/body and attach tests/screenshots as recommended, then re-submit. Because of the scope and cross-cutting changes, I recommend coordinating a release plan with the backend and QA teams and bumping the risk label to risk:high prior to merging. Thank you for the thorough implementation — once the test coverage and rollout details are provided, this will be much easier to approve.


Last updated: Sat, 21 Mar 2026 06:00:28 GMT

@github-actions

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None major — title follows conventional commit style and concisely describes the change.
  • Recommendation: Keep as-is or, if you want more precision, feat(designer): add Agent Evaluations tab and evaluation services to highlight both UI and service changes.

Commit Type

  • Properly selected (feature).
  • Note: Only one option is selected, which is correct.

⚠️ Risk Level

  • The PR body and label indicate: Medium risk (risk:medium).
  • Assessment: Based on the code diff (large feature additions across UI, core state/store, shared services, new API client, and models — ~2357 additions, 25 files changed), I advise a higher risk level: High.
    • Comment: This PR touches core libs (libs/designer-v2, libs/logic-apps-shared, store initialization), registers a new service, and introduces new runtime API calls. These changes can affect app initialization, API contracts, and global state. Please consider using risk:high so reviewers and release managers treat this accordingly.

⚠️ What & Why

  • Current: Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.
  • Issue: Clear and concise; good.
  • Recommendation: Optionally add a one-line summary of the major implementation changes (UI components, new evaluation service, store/slice additions) to help reviewers map the description to files changed.

Impact of Change

  • Issue: The Impact of Change section is present but minimal (System marked as N/A). Given the scope of changes, the impact is broader than indicated.
  • Recommendation:
    • Users: Introduces new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. This is user-facing; consider noting that feature may show for certain workflow kinds (agentic/stateful) only.
    • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService that calls backend endpoints. Call out any public surface/API changes in libs/logic-apps-shared since other repos may depend on them.
    • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls (evaluation run), potential increased telemetry and cost (model runs). Add: "System: adds new backend API usage and model/runtime evaluation calls; monitor API errors and performance."

Test Plan

  • Test Plan Assessment: Missing — the PR has no Unit tests, no E2E tests, and no manual testing notes.
  • Issue: The diff adds new services, complex UI flows, and new redux state but no tests. Per the repository guidance, if no unit or E2E tests are added then the PR must include a clear manual testing plan and justification. This PR currently lists none of the test checkboxes.
  • Recommendation (required before merging):
    • Add unit tests for the new redux slice (evaluationSlice) and selectors.
    • Add unit tests for queries/behaviour in libs/designer-v2/src/lib/core/queries/evaluations.ts (mock EvaluationService and verify query/mutation behavior and cache invalidation).
    • Add unit/component tests for main UI components (EvaluateView, EvaluatorsPanel, EvaluatorFormPanel) including form submission and run flow (mock services). Prefer snapshot or DOM tests for rendering critical flows.
    • If adding automated tests is not possible now, provide a detailed manual test plan explaining how reviewers can exercise the feature (steps to create evaluator, run evaluation, verify result, behavior for stateful vs stateless workflows) and why no automated tests were added. Manual plan should include expected results and error scenarios.
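The query-layer test recommended above (mock the service, verify caching and invalidation on mutation) boils down to the pattern sketched here. A tiny hand-rolled cache stands in for @tanstack/react-query so the sketch is self-contained; the service shape and names are assumptions based on the PR description:

```typescript
// Minimal stand-in for a query cache to illustrate the invalidation pattern.
type QueryKey = string;

class TinyQueryCache {
  private store = new Map<QueryKey, unknown>();
  fetchCount = 0;

  async fetch<T>(key: QueryKey, fn: () => Promise<T>): Promise<T> {
    if (this.store.has(key)) return this.store.get(key) as T;
    this.fetchCount++;
    const value = await fn();
    this.store.set(key, value);
    return value;
  }

  invalidate(key: QueryKey): void {
    this.store.delete(key);
  }
}

// Hypothetical service surface; the real EvaluationService may differ.
interface EvaluationService {
  listEvaluators(): Promise<string[]>;
  createEvaluator(name: string): Promise<void>;
}

function makeMockService(): EvaluationService & { evaluators: string[] } {
  const evaluators = ["semantic-similarity"];
  return {
    evaluators,
    listEvaluators: async () => [...evaluators],
    createEvaluator: async (name) => { evaluators.push(name); },
  };
}

// A create mutation should invalidate the evaluators list so the next read
// refetches (mirrors an onSuccess -> queryClient.invalidateQueries handler).
async function createAndInvalidate(cache: TinyQueryCache, svc: EvaluationService, name: string) {
  await svc.createEvaluator(name);
  cache.invalidate("evaluators");
}
```

The real tests would assert the same two things against react-query: a repeated read is served from cache, and a successful mutation invalidates the list key so the next read refetches.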

Contributors

  • Contributors Assessment: @andrew-eldridge is listed. Good to credit; if others helped (PM/Design/QA) consider adding them.

⚠️ Screenshots/Videos

  • Screenshots Assessment: Not provided. This is a UI-heavy change — I recommend adding screenshots or a short demo GIF showing the new Evaluate tab, run list, evaluator creation form, and a sample evaluation result. This helps reviewers and designers validate UX quickly.

Summary Table

  • Title: keep as-is or slightly expand for clarity.
  • Commit Type: OK.
  • ⚠️ Risk Level: recommend bumping to risk:high and updating the label.
  • What & Why: good; optionally mention the high-level files changed.
  • Impact of Change: expand to list system-level impacts and API changes.
  • Test Plan: add unit/E2E tests or a detailed manual test plan.
  • Contributors: OK; add others if applicable.
  • ⚠️ Screenshots/Videos: add visual proof for UI changes.

Summary:
This PR introduces a large feature set (new evaluation UI, new redux slice, queries, models, and a new StandardEvaluationService). Because this touches core libraries, the store, service initialization, and adds network/API interactions, I recommend raising the risk to High (please update label) and adding tests or a detailed manual test plan. At present, the PR does NOT pass the PR body checklist because the Test Plan is empty — please add automated tests or a robust manual testing section and address the risk label.

Please update the PR title/body with the following specific items and then re-submit:

  • Risk label: change to risk:high (comment in PR explaining why: touches core libs/store/services/API).
  • Test Plan: either add test files (unit tests for evaluationSlice, queries, EvaluateView components; integration/E2E flow that covers create/run evaluation) OR add a detailed manual testing section with step-by-step instructions and expected results.
  • Impact of Change: expand to describe system/backend/API impacts (new endpoints, potential runtime/cost), and any migration steps (none seen — if none, explicitly state so).
  • Screenshots/Videos: include a screenshot of the Evaluate tab, the create evaluator form, and an evaluation result (or a short demo GIF).

Thank you for the thorough implementation. Once tests/manual test plan and the risk label are addressed, this will be in much better shape for merging.

Helpful file-specific test suggestions:

  • libs/designer-v2/src/lib/core/state/evaluation/evaluationSlice.ts -> unit tests for reducer actions and reset behavior.
  • libs/designer-v2/src/lib/core/queries/evaluations.ts -> mock EvaluationService and test query keys, enabled/disabled logic, and onSuccess invalidations for mutations.
  • libs/logic-apps-shared/src/designer-client-services/lib/standard/evaluation.ts -> unit tests for URL/HTTP calls using a mocked IHttpClient.
  • EvaluateView & panels -> component tests for rendering states (empty, loading, error, result) and form submission flows (EvaluatorFormPanel).
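For the third suggestion (testing URL/HTTP construction with a mocked IHttpClient), a recording mock like the one below is usually enough. The interface and service shown are illustrative stand-ins for the real IHttpClient and StandardEvaluationService, whose actual signatures may differ:

```typescript
// Illustrative shapes; the real types in logic-apps-shared may differ.
interface HttpRequest {
  method: "GET" | "POST";
  uri: string;
  content?: unknown;
}

interface IHttpClientLike {
  send(req: HttpRequest): Promise<unknown>;
}

class StandardEvaluationServiceLike {
  constructor(
    private http: IHttpClientLike,
    private baseUri: string,
    private apiVersion: string
  ) {}

  // Builds the evaluator-list endpoint URL and delegates to the HTTP client.
  listEvaluators(): Promise<unknown> {
    return this.http.send({
      method: "GET",
      uri: `${this.baseUri}/evaluators?api-version=${this.apiVersion}`,
    });
  }
}

// Mock client that records requests so tests can assert URL construction
// without any network access.
function makeRecordingClient() {
  const requests: HttpRequest[] = [];
  const client: IHttpClientLike = {
    send: async (req) => {
      requests.push(req);
      return [];
    },
  };
  return { client, requests };
}
```

A test then instantiates the service with the recording client, calls a method, and asserts on the recorded method and URI.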

Please update and ping reviewers when ready. Thank you!


Last updated: Tue, 17 Mar 2026 17:31:38 GMT


github-actions bot commented Mar 17, 2026

📊 Coverage check completed. See workflow run for details.


Labels

needs-pr-update risk:medium Medium risk change with potential impact
