
feat(designer): Agent evaluations tab#8932

Draft
andrew-eldridge wants to merge 8 commits into main from aeldridge/agentEval

Conversation


andrew-eldridge commented Mar 17, 2026

Commit Type

  • feature - New functionality
  • fix - Bug fix
  • refactor - Code restructuring without behavior change
  • perf - Performance improvement
  • docs - Documentation update
  • test - Test-related changes
  • chore - Maintenance/tooling

Risk Level

  • Low - Minor changes, limited scope
  • Medium - Moderate changes, some user impact
  • High - Major changes, significant user/system impact

What & Why

Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.

Impact of Change

  • Users: Introduces the new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. Only applies to A2A/agentic workflows.
  • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService
  • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls for eval fetching/management and running evals
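As a sketch of the data model implied by the description above (three evaluator kinds, each grounded either in a reference run or in a judge model), one might write something like the following. All type and field names here are illustrative assumptions, not the PR's actual models:

```typescript
// Illustrative model only: the real types live in libs/designer-v2 and
// libs/logic-apps-shared and may differ in naming and shape.
type EvaluatorKind = "toolCallTrajectory" | "semanticSimilarity" | "customPrompt";

interface AgentEvaluator {
  id: string;
  kind: EvaluatorKind;
  // Per the PR description, every evaluator grounds its score in one of two ways.
  groundTruth:
    | { mode: "referenceRun"; referenceRunId: string }
    | { mode: "judgeModel"; modelDeployment: string };
  // Only meaningful for customPrompt evaluators.
  prompt?: string;
}

// Helper a UI panel might use to decide whether to prompt for a reference run.
function requiresReferenceRun(e: AgentEvaluator): boolean {
  return e.groundTruth.mode === "referenceRun";
}
```

Modeling the two ground-truth modes as a discriminated union keeps "reference run required" decisions type-safe rather than scattered across flag checks.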

Test Plan

  • Unit tests added/updated
  • E2E tests added/updated
  • Manual testing completed
  • Tested in:

Contributors

@andrew-eldridge

Screenshots/Videos

andrew-eldridge added the risk:medium label (Medium risk change with potential impact) on Mar 17, 2026

github-actions bot commented Mar 17, 2026

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None — title follows conventional commit format and concisely describes the change.
  • Recommendation: Keep as-is. Optionally add a short scope or note if this requires a backend API rollout (e.g., feat(designer): Agent evaluations tab (requires evaluation API)).

Commit Type

  • Properly selected (feature).
  • Note: Only one commit type is selected, which is correct.

Risk Level

  • Assessment: The PR body marks Risk Level as Medium and the PR has a risk:medium label. However, based on the code diff, I advise a higher risk level: High.
  • Rationale: This PR adds a large surface area of functionality across UI, Redux state, query layers, new client services, new models, and multiple new API call sites. Files changed include new designer views, new network services (EvaluationService + StandardEvaluationService), new models, many UI components and panels, state slices, and changes to run history/chat behavior. The changes introduce new API contract dependencies and runtime behaviors across platforms (Standalone, VS Code extension, shared libs). Because of the scope and cross-cutting impact, treat this as high risk and coordinate rollout with backend/contract owners and QA.
  • Recommendation: Update the PR Risk Level to High and change the repo label to risk:high. Add an explicit rollout/migration note if backend endpoints are gated or require versioned APIs.

What & Why

  • Current: "Add agent evaluations functionality in a new designer tab..."
  • Issue: The section is present and concise.
  • Recommendation: Good. Consider adding a one-line note about required backend endpoints or feature flags (if any) and whether the change is behind a feature flag or needs a backend deployment first.

⚠️ Impact of Change

  • Issue: The PR lists Users/Developers/System impacts which is good. However, please explicitly call out any required backend/version support and compatibility expectations.
  • Recommendation:
    • Users: New Agent Evaluations tab + ability to run evaluations from designer. Add: "Requires evaluation endpoints to exist on runtime — users will see errors if backend is not deployed." if applicable.
    • Developers: New services, models, and queries added. Add: "Ensure other consumers of shared libs re-build and bump package versions (designer-v2, logic-apps-shared)." if relevant.
    • System: API contract additions and new network calls. Add: "Document new endpoints and API version used; coordinate with backend release."

Test Plan

  • Assessment: The Test Plan section in the PR body left Unit tests / E2E tests / Manual testing all unchecked and contains no justification.
  • Issue: This is a substantial feature touching many layers and UI components — the PR must include tests or a clear justification and a mitigation plan.
  • Recommendation (action required):
    • If you have added tests, update the PR body and include the test files in the diff. Specifically add:
      • Unit tests for the new queries (libs/designer-v2/src/lib/core/queries/evaluations.ts) and evaluation slice reducers/selectors.
      • Unit tests for any critical UI logic in the EvaluateView, EvaluatorFormPanel (form validation), EvaluatorManagementPanel (list behavior), and EvaluationResultPanel.
      • Integration/E2E tests to cover: opening the Evaluate tab, selecting a run/agent, creating/editing/deleting an evaluator, running an evaluation and showing results.
    • If you cannot add automated tests in this PR, provide a clear explanation in the Test Plan explaining why (e.g., blocked by infra, backend not available) and include a follow-up plan with timelines and ticket/PR references. Explain any manual test steps and acceptance criteria.
    • Do not leave all tests unchecked for a feature of this scope — this PR should not be merged until tests are added or a documented mitigation plan is accepted by maintainers.
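The slice-reducer tests requested above could follow the pattern below. This is a self-contained sketch with a hand-rolled stand-in reducer so the pattern is runnable here; the real evaluation slice would be built with @reduxjs/toolkit, and all action and field names are assumptions:

```typescript
// Stand-in for the evaluation slice; the real slice in
// libs/designer-v2/src/lib/core/state/evaluation may differ.
interface EvaluationState {
  selectedRunId: string | null;
  evaluatorIds: string[];
}

const initialState: EvaluationState = { selectedRunId: null, evaluatorIds: [] };

type EvaluationAction =
  | { type: "evaluation/selectRun"; runId: string }
  | { type: "evaluation/addEvaluator"; evaluatorId: string }
  | { type: "evaluation/reset" };

function evaluationReducer(
  state: EvaluationState = initialState,
  action: EvaluationAction
): EvaluationState {
  switch (action.type) {
    case "evaluation/selectRun":
      return { ...state, selectedRunId: action.runId };
    case "evaluation/addEvaluator":
      return { ...state, evaluatorIds: [...state.evaluatorIds, action.evaluatorId] };
    case "evaluation/reset":
      return initialState;
    default:
      return state;
  }
}
```

Tests would then drive the reducer with plain action objects and assert on the returned state, including that reset restores the initial state, as the report recommends.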

⚠️ Contributors

  • Assessment: A contributor (@andrew-eldridge) is listed.
  • Recommendation: If others (PMs, designers, reviewers) contributed, add them to the Contributors section. If not, consider a short acknowledgement of design/PM if they were involved.

⚠️ Screenshots/Videos

  • Assessment: No screenshots/videos provided.
  • Issue: This is a large UI addition (new tab + multiple panels). Reviewers will benefit from screenshots or a short video showing the main flows.
  • Recommendation: Add screenshots for the main flows (Evaluate tab overview, evaluator list, create-editor modal/form, evaluation result) or a short screen recording. This speeds UI review and helps catch visual/regression issues.

Summary Table

  • Title: good; optionally call out the backend dependency in the title if required.
  • Commit Type: correct (feature).
  • Risk Level: update to High and the risk:high label; include rollout coordination notes.
  • What & Why: good; add a note about backend/feature flags if needed.
  • ⚠️ Impact of Change: add explicit backend/API contract and package/version impact notes.
  • Test Plan: add unit + E2E tests or provide a justified mitigation plan with timeline.
  • ⚠️ Contributors: consider adding other contributors (PM/Design) or acknowledging them.
  • ⚠️ Screenshots/Videos: add screenshots or a short recording of the new UI flows.

Action items (required before merging):

  1. Update PR Risk Level to High and add/replace repo label with risk:high. Document why the risk was raised (broad UI + service + API changes).
  2. Add automated tests (preferred):
    • Unit tests for new queries and evaluation slice (reducers/selectors).
    • Unit tests for key UI behavior (form validation, list interactions, enabling/disabling actions).
    • E2E test(s) for the evaluate flow: open Evaluate tab, select run/agent, create evaluator, run evaluation, assert result displayed.
    • If tests are blocked, add a clear Test Plan that explains the blockers and a follow-up ticket/PR with an ETA.
  3. Add screenshots/videos of the new Evaluate tab and panels (management, form, details, results) or a short walkthrough video.
  4. Call out backend/API prerequisites (new endpoints) in the PR body and confirm whether the backend is already deployed or will be released simultaneously. If the backend is not ready, mark the PR as draft or gate behind a feature flag.
  5. Consider adding a short migration/compatibility note for other consumers of shared packages (designer-v2, logic-apps-shared).

Please update the PR title/body and attach tests/screenshots as recommended, then re-submit. Because of the scope and cross-cutting changes, I recommend coordinating a release plan with the backend and QA teams and bumping the risk label to risk:high prior to merging. Thank you for the thorough implementation — once the test coverage and rollout details are provided, this will be much easier to approve.


Last updated: Sat, 21 Mar 2026 06:00:28 GMT

@github-actions

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None major — title follows conventional commit style and concisely describes the change.
  • Recommendation: Keep as-is or, if you want more precision, feat(designer): add Agent Evaluations tab and evaluation services to highlight both UI and service changes.

Commit Type

  • Properly selected (feature).
  • Note: Only one option is selected, which is correct.

⚠️ Risk Level

  • The PR body and label indicate: Medium risk (risk:medium).
  • Assessment: Based on the code diff (large feature additions across UI, core state/store, shared services, new API client, and models — ~2357 additions, 25 files changed), I advise a higher risk level: High.
    • Comment: This PR touches core libs (libs/designer-v2, libs/logic-apps-shared, store initialization), registers a new service, and introduces new runtime API calls. These changes can affect app initialization, API contracts, and global state. Please consider using risk:high so reviewers and release managers treat this accordingly.

⚠️ What & Why

  • Current: Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.
  • Issue: Clear and concise; good.
  • Recommendation: Optionally add a one-line summary of the major implementation changes (UI components, new evaluation service, store/slice additions) to help reviewers map the description to files changed.

Impact of Change

  • Issue: The Impact of Change section is present but minimal (System marked as N/A). Given the scope of changes, the impact is broader than indicated.
  • Recommendation:
    • Users: Introduces new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. This is user-facing; consider noting that feature may show for certain workflow kinds (agentic/stateful) only.
    • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService that calls backend endpoints. Call out any public surface/API changes in libs/logic-apps-shared since other repos may depend on them.
    • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls (evaluation run), potential increased telemetry and cost (model runs). Add: "System: adds new backend API usage and model/runtime evaluation calls; monitor API errors and performance."

Test Plan

  • Test Plan Assessment: Missing — the PR has no Unit tests, no E2E tests, and no manual testing notes.
  • Issue: The diff adds new services, complex UI flows, and new redux state but no tests. Per the repository guidance, if no unit or E2E tests are added then the PR must include a clear manual testing plan and justification. This PR currently lists none of the test checkboxes.
  • Recommendation (required before merging):
    • Add unit tests for the new redux slice (evaluationSlice) and selectors.
    • Add unit tests for queries/behaviour in libs/designer-v2/src/lib/core/queries/evaluations.ts (mock EvaluationService and verify query/mutation behavior and cache invalidation).
    • Add unit/component tests for main UI components (EvaluateView, EvaluatorsPanel, EvaluatorFormPanel) including form submission and run flow (mock services). Prefer snapshot or DOM tests for rendering critical flows.
    • If adding automated tests is not possible now, provide a detailed manual test plan explaining how reviewers can exercise the feature (steps to create evaluator, run evaluation, verify result, behavior for stateful vs stateless workflows) and why no automated tests were added. Manual plan should include expected results and error scenarios.
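The query-layer test recommended above (mock the service, verify caching and invalidation on mutation) boils down to the pattern sketched here. A tiny hand-rolled cache stands in for @tanstack/react-query so the sketch is self-contained; the service shape and names are assumptions based on the PR description:

```typescript
// Minimal stand-in for a query cache to illustrate the invalidation pattern.
type QueryKey = string;

class TinyQueryCache {
  private store = new Map<QueryKey, unknown>();
  fetchCount = 0;

  async fetch<T>(key: QueryKey, fn: () => Promise<T>): Promise<T> {
    if (this.store.has(key)) return this.store.get(key) as T;
    this.fetchCount++;
    const value = await fn();
    this.store.set(key, value);
    return value;
  }

  invalidate(key: QueryKey): void {
    this.store.delete(key);
  }
}

// Hypothetical service surface; the real EvaluationService may differ.
interface EvaluationService {
  listEvaluators(): Promise<string[]>;
  createEvaluator(name: string): Promise<void>;
}

function makeMockService(): EvaluationService & { evaluators: string[] } {
  const evaluators = ["semantic-similarity"];
  return {
    evaluators,
    listEvaluators: async () => [...evaluators],
    createEvaluator: async (name) => { evaluators.push(name); },
  };
}

// A create mutation should invalidate the evaluators list so the next read
// refetches (mirrors an onSuccess -> queryClient.invalidateQueries handler).
async function createAndInvalidate(cache: TinyQueryCache, svc: EvaluationService, name: string) {
  await svc.createEvaluator(name);
  cache.invalidate("evaluators");
}
```

The real tests would assert the same two things against react-query: a repeated read is served from cache, and a successful mutation invalidates the list key so the next read refetches.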

Contributors

  • Contributors Assessment: @andrew-eldridge is listed. Good to credit; if others helped (PM/Design/QA) consider adding them.

⚠️ Screenshots/Videos

  • Screenshots Assessment: Not provided. This is a UI-heavy change — I recommend adding screenshots or a short demo GIF showing the new Evaluate tab, run list, evaluator creation form, and a sample evaluation result. This helps reviewers and designers validate UX quickly.

Summary Table

  • Title: keep as-is or slightly expand for clarity.
  • Commit Type: OK.
  • ⚠️ Risk Level: recommend bumping to risk:high and updating the label.
  • What & Why: good; optionally mention the high-level files changed.
  • Impact of Change: expand to list system-level impacts and API changes.
  • Test Plan: add unit/E2E tests or a detailed manual test plan.
  • Contributors: OK; add others if applicable.
  • ⚠️ Screenshots/Videos: add visual proof for UI changes.

Summary:
This PR introduces a large feature set (new evaluation UI, new redux slice, queries, models, and a new StandardEvaluationService). Because this touches core libraries, the store, service initialization, and adds network/API interactions, I recommend raising the risk to High (please update label) and adding tests or a detailed manual test plan. At present, the PR does NOT pass the PR body checklist because the Test Plan is empty — please add automated tests or a robust manual testing section and address the risk label.

Please update the PR title/body with the following specific items and then re-submit:

  • Risk label: change to risk:high (comment in PR explaining why: touches core libs/store/services/API).
  • Test Plan: either add test files (unit tests for evaluationSlice, queries, EvaluateView components; integration/E2E flow that covers create/run evaluation) OR add a detailed manual testing section with step-by-step instructions and expected results.
  • Impact of Change: expand to describe system/backend/API impacts (new endpoints, potential runtime/cost), and any migration steps (none seen — if none, explicitly state so).
  • Screenshots/Videos: include a screenshot of the Evaluate tab, the create evaluator form, and an evaluation result (or a short demo GIF).

Thank you for the thorough implementation. Once tests/manual test plan and the risk label are addressed, this will be in much better shape for merging.

Helpful file-specific test suggestions:

  • libs/designer-v2/src/lib/core/state/evaluation/evaluationSlice.ts -> unit tests for reducer actions and reset behavior.
  • libs/designer-v2/src/lib/core/queries/evaluations.ts -> mock EvaluationService and test query keys, enabled/disabled logic, and onSuccess invalidations for mutations.
  • libs/logic-apps-shared/src/designer-client-services/lib/standard/evaluation.ts -> unit tests for URL/HTTP calls using a mocked IHttpClient.
  • EvaluateView & panels -> component tests for rendering states (empty, loading, error, result) and form submission flows (EvaluatorFormPanel).
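For the third suggestion (testing URL/HTTP construction with a mocked IHttpClient), a recording mock like the one below is usually enough. The interface and service shown are illustrative stand-ins for the real IHttpClient and StandardEvaluationService, whose actual signatures may differ:

```typescript
// Illustrative shapes; the real types in logic-apps-shared may differ.
interface HttpRequest {
  method: "GET" | "POST";
  uri: string;
  content?: unknown;
}

interface IHttpClientLike {
  send(req: HttpRequest): Promise<unknown>;
}

class StandardEvaluationServiceLike {
  constructor(
    private http: IHttpClientLike,
    private baseUri: string,
    private apiVersion: string
  ) {}

  // Builds the evaluator-list endpoint URL and delegates to the HTTP client.
  listEvaluators(): Promise<unknown> {
    return this.http.send({
      method: "GET",
      uri: `${this.baseUri}/evaluators?api-version=${this.apiVersion}`,
    });
  }
}

// Mock client that records requests so tests can assert URL construction
// without any network access.
function makeRecordingClient() {
  const requests: HttpRequest[] = [];
  const client: IHttpClientLike = {
    send: async (req) => {
      requests.push(req);
      return [];
    },
  };
  return { client, requests };
}
```

A test then instantiates the service with the recording client, calls a method, and asserts on the recorded method and URI.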

Please update and ping reviewers when ready. Thank you!


Last updated: Tue, 17 Mar 2026 17:31:38 GMT


github-actions bot commented Mar 17, 2026

📊 Coverage check completed. See workflow run for details.


Labels

needs-pr-update risk:medium Medium risk change with potential impact
