Skip to content

fix(socket): recover from stale disconnected socket to prevent permanent "Connecting..."#2487

Draft
YellowSnnowmann wants to merge 2 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/socket-stuck-connecting-reconnect
Draft

fix(socket): recover from stale disconnected socket to prevent permanent "Connecting..."#2487
YellowSnnowmann wants to merge 2 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/socket-stuck-connecting-reconnect

Conversation

@YellowSnnowmann
Copy link
Copy Markdown
Contributor

Summary

  • Fixed a reconnect edge case where socketService.connect() could get stuck when a stale disconnected socket instance existed for the same auth token.
  • Prevented false-positive connecting UI state by clearing stale disconnected socket references before async reconnect guards run.
  • Preserved existing safety behavior for active sockets (connected) and in-flight sockets (!disconnected) to avoid duplicate connections.
  • Added a regression unit test to verify same-token reconnect creates a fresh socket instead of silently no-oping.

Problem

  • Users could be stuck on Connecting... and unable to chat after a disconnect/reconnect cycle.
  • Root cause: reconnect logic set state to connecting, but returned early because this.socket was still non-null (stale/disconnected), so no new socket was created and no connect/connect_error transition fired.
  • This left connection state stranded and blocked chat flows.

Solution

  • In socketService.connectAsync, when token is unchanged and this.socket.disconnected === true, explicitly clear stale runtime references (this.socket, this.mcpTransport) before continuing.
  • Keep existing early returns for:
    • same-token + already connected
    • same-token + currently connecting (!disconnected)
  • Added test coverage for the stale-socket scenario: second same-token connect() now creates a new socket instance (verifies io(...) called twice).

Submission Checklist

  • If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.
  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Run pnpm test:coverage and pnpm test:rust locally; PRs below 80% on changed lines will not merge.
  • Coverage matrix updated — added/removed/renamed feature rows in docs/TEST-COVERAGE-MATRIX.md reflect this change (or N/A: behaviour-only change)
  • All affected feature IDs from the matrix are listed in the PR description under ## Related
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md)
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime/platform impact: frontend connection management path (app/src/services/socketService.ts) affecting desktop/web UI behavior where this service is used.
  • Compatibility: no API/schema changes.
  • Performance: negligible; only additional stale-reference cleanup in reconnect edge case.
  • Security: no new credential surface or transport changes.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 54deb846-a3fb-47d8-a7d1-f8b56f564d3f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Comment @coderabbitai help to get the list of available commands and usage tips.

@M3gA-Mind M3gA-Mind self-assigned this May 22, 2026
Copy link
Copy Markdown
Contributor

@M3gA-Mind M3gA-Mind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walkthrough

This PR fixes a real, reproducible bug: calling connect(sameToken) after a socket disconnects was silently no-oping because this.socket was still non-null (stale/disconnected), causing the async stale-invocation guard || this.socket to bail early and leave the UI permanently stuck at "Connecting...". The root cause analysis is accurate, the fix is in exactly the right place, and the regression test correctly demonstrates the broken behavior before the fix.

The core logic change is correct. There's one issue worth addressing before merge.

Change summary

File Type Notes
app/src/services/socketService.ts Bug fix Adds else branch in connectAsync to clear stale disconnected socket before async path continues
app/src/services/__tests__/socketService.test.ts Test Regression test verifying io() called twice on same-token reconnect after disconnect

Findings

One issue worth addressing before this lands.

// Drop it so this connect attempt can create a fresh socket;
// otherwise the async stale-invocation guard below (`|| this.socket`)
// returns early and leaves connectivity stuck at "connecting".
this.socket = null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The stale socket is abandoned via null-assignment rather than shut down cleanly. Two problems:

  1. Stale reconnect timers: socket.io-client has internal reconnect logic (reconnectionAttempts: 5 is configured). Setting this.socket = null drops your reference but doesn't cancel socket.io's internal reconnect timers. If a queued attempt fires after the new socket is created, the abandoned socket's connect handler (still wired up from connectAsync) will dispatch setStatusForUser({ status: 'connected' }) and setSocketIdForUser({ socketId: oldId }) into Redux — overwriting state that belongs to the new socket.

  2. Orphaned listenerMap entries: listenerMap is not cleared here. Any socketService.on(event, cb) listeners registered during the previous connection are still in the map but were never attached to the new socket. Redux-connected components that registered handlers before the disconnect will silently stop receiving events. Meanwhile off(event, cb) will look up those entries, try to remove them from the new socket (where they don't exist), and remove them from listenerMap — so the caller gets no error but the handler is gone with no way to recover.

Suggested fix:

} else {
  // Stale disconnected socket, same token. Shut it down before clearing
  // to cancel any queued socket.io reconnect timers and avoid stale Redux
  // dispatches from the abandoned socket's event handlers.
  this.socket.disconnect();
  this.socket = null;
  this.mcpTransport = null;
  this.listenerMap.clear();
  this.pendingListeners = [];
}

This matches what disconnect() does for the token-changed path and puts listenerMap in a consistent empty state so callers re-register cleanly after reconnection.

// Same token, previous socket is disconnected=true in our mock.
// We should still create a fresh socket instead of returning early.
socketService.connect('mock-jwt-stale-socket');
await pollUntil(() => expect(ioMock).toHaveBeenCalledTimes(2));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] The test validates that io() is called twice (new socket created) which is the core regression being guarded. A small addition would increase confidence that the reconnect path actually executes past the stale-socket guard rather than taking some other code path:

// After the second connect(), verify the reconnect path dispatched 'connecting'
await pollUntil(() =>
  expect(storeMock.dispatch).toHaveBeenCalledWith(
    expect.objectContaining({ payload: expect.objectContaining({ status: 'connecting' }) })
  )
);

Not a blocker, but would catch a regression where io() gets called for an unrelated reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants