Skip to content

Fix flaky SDK E2E tests#1418

Merged
stephentoub merged 3 commits into
mainfrom
stephentoub/fix-flaky-tests
May 25, 2026
Merged

Fix flaky SDK E2E tests#1418
stephentoub merged 3 commits into
mainfrom
stephentoub/fix-flaky-tests

Conversation

@stephentoub
Copy link
Copy Markdown
Collaborator

The Windows .NET E2E job was intermittently timing out in permission-handler coverage and reporting replay cache misses from a background-agent notification race. This makes the affected assertions deterministic and hardens the shared replay matching for semantically equivalent task-completion notification wording.

Summary

  • Assert .NET permission-handler error handling by observing the replayed denied tool result instead of waiting for final assistant prose.
  • Wait for background-agent completion notifications before cleanup in .NET and the matching Python E2E test.
  • Normalize read_agent task-completion notification wording in the shared replay proxy and add proxy coverage.

Validation

  • Repeated the affected .NET E2E tests 5 times across net8.0 and net472.
  • Ran npm test -- replayingCapiProxy in test/harness.
  • Ran python -m pytest e2e\test_rpc_tasks_and_handlers_e2e.py -k should_start_background_agent_and_report_task_details.
  • Ran git diff --check.

Make permission handler error coverage assert deterministic replayed tool results instead of waiting for final assistant text, and ensure background-agent tests wait for the completion notification before cleanup. Normalize equivalent replay proxy notification wording across SDK suites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 25, 2026 18:39
@stephentoub stephentoub requested a review from a team as a code owner May 25, 2026 18:39
@github-actions

This comment has been minimized.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the replaying CAPI proxy and E2E assertions to eliminate flakiness caused by background-agent completion notification timing and slightly varying notification wording, with a focus on stabilizing Windows .NET E2E runs.

Changes:

  • Normalize background-agent completion notification wording (read_agent “unread results” → “full results”) in the replaying proxy to avoid replay cache misses.
  • Make .NET and Python E2E tests deterministic by waiting for the background-agent completion notification event before teardown.
  • Fix .NET permission-handler error coverage to assert against the replayed denied tool result rather than final assistant prose.
Show a summary per file
File Description
test/harness/replayingCapiProxy.ts Normalizes semantically equivalent read_agent completion-notification wording for stable replay matching.
test/harness/replayingCapiProxy.test.ts Adds regression coverage to ensure the new notification normalization is applied.
python/e2e/test_rpc_tasks_and_handlers_e2e.py Waits for the background-agent completion notification event and unsubscribes the handler during cleanup.
dotnet/test/E2E/RpcTasksAndHandlersE2ETests.cs Waits for the background-agent completion notification event to avoid teardown races.
dotnet/test/E2E/PermissionE2ETests.cs Asserts permission-handler failure behavior via replayed tool result content for determinism.

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 0

Normalize task-completion notification wording in stored replay snapshots as well as incoming requests so older snapshots using the unread-results wording continue to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

This comment has been minimized.

Assert Python permission handler errors via the replayed denied tool result instead of final assistant prose, matching the deterministic .NET coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

Cross-SDK Consistency Review ✅

This PR fixes flaky E2E tests and hardens the shared replay proxy — no cross-SDK consistency concerns.

Summary of changes:

  • test/harness/replayingCapiProxy.ts — shared infrastructure fix that benefits all SDK implementations equally
  • dotnet/test/E2E/ and python/e2e/ — parallel, symmetric fixes applied to both SDKs: the same pattern (wait for a concrete observable event before cleanup, rather than relying on final assistant prose) was applied consistently in both .NET and Python

Parity check:
The equivalent background-agent task completion and permission-handler error tests in Go and Node.js were not modified, but since the flakiness was specific to Windows .NET CI and Python, and the shared proxy normalization fix is already universal, no further changes appear needed for consistency.

No cross-SDK inconsistencies introduced.

Generated by SDK Consistency Review Agent for issue #1418 · ● 980.6K ·

@stephentoub stephentoub merged commit 98ff3c2 into main May 25, 2026
39 of 40 checks passed
@stephentoub stephentoub deleted the stephentoub/fix-flaky-tests branch May 25, 2026 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants