
fix: close and delete under re-checked lock to fix race conditions#81

Merged
cnlangzi merged 4 commits into main from fix/sweeper-mutex-close on Mar 30, 2026

Conversation

@cnlangzi
Owner

@cnlangzi cnlangzi commented Mar 30, 2026

Background

Follow-up to #79/#80 — two race conditions found in review:

  1. Delete race: after releasing lock between collection and deletion, a new server can be created for the same URL — sweeper would delete the new entry.
  2. Revive race: getServer() can revive a draining entry (reset DrainedAt to zero) while sweeper is outside the lock — sweeper would close a server that is actually active.

Changes

Before closing: re-verify servers[e.url] == e.srv AND DrainedAt still non-zero.
Before delete: re-verify servers[e.url] == e.srv — skip if a new server was created for that URL.

Both checks are under lock, making the operations effectively race-free.

Acceptance Criteria

  • No goroutine accumulation
  • Memory stays stable
  • Close() and CloseAll() work correctly
  • sweeper does not hold mu while calling slow Instance.Close()
  • No race between sweeper close/delete and concurrent getServer/setServer
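
The first criterion could be exercised in a test along these lines (a sketch only; `spawnAndStop` is a hypothetical stand-in for one start/stop cycle of the sweeper, not the project's actual test helper):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// spawnAndStop stands in for one start/stop cycle of the component
// under test (e.g. starting and then stopping the sweeper).
func spawnAndStop() {
	done := make(chan struct{})
	go func() {
		close(done) // goroutine exits right after signalling
	}()
	<-done
}

// goroutinesStable runs many cycles and checks that the goroutine
// count returns to roughly its starting value.
func goroutinesStable(iterations int) bool {
	before := runtime.NumGoroutine()
	for i := 0; i < iterations; i++ {
		spawnAndStop()
	}
	time.Sleep(50 * time.Millisecond)         // let exited goroutines be reaped
	return runtime.NumGoroutine() <= before+1 // small tolerance for runtime noise
}

func main() {
	fmt.Println(goroutinesStable(100))
}
```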

Summary by Sourcery

Prevent race conditions in the xray server sweeper by decoupling collection from closing and re-validating entries under lock before closing and deletion.

Bug Fixes:

  • Avoid deleting newly created servers for a URL by re-checking the map entry under lock before removal.
  • Prevent closing revived or active servers by verifying drained status and identity under lock before invoking Instance.Close().

@sourcery-ai

sourcery-ai Bot commented Mar 30, 2026

Reviewer's Guide

Refactors the sweeper to first collect expired servers under lock and then perform close and delete operations with additional under-lock rechecks to eliminate race conditions with concurrent server creation and revival.

Sequence diagram for sweeper close/delete with concurrent getServer race conditions

sequenceDiagram
    actor Client
    participant getServer
    participant sweeper
    participant mu
    participant servers
    participant Server

    rect rgb(235, 245, 255)
        sweeper->>mu: Lock()
        sweeper->>servers: Scan for expired entries
        servers-->>sweeper: Return expired srv for url
        sweeper->>mu: Unlock()
    end

    par Concurrent_getServer
        Client->>getServer: Request server for url
        getServer->>mu: Lock()
        alt Existing_draining_server
            getServer->>servers: Lookup url
            servers-->>getServer: Return srv
            getServer->>Server: Reset DrainedAt to zero
        else New_server_for_url
            getServer->>servers: Create and store new *Server for url
        end
        getServer->>mu: Unlock()
    and Sweeper_processing_expired
        sweeper->>mu: Lock()
        alt Entry_changed_or_revived
            sweeper->>servers: Check servers[url]
            sweeper-->>sweeper: servers[url] != srv OR DrainedAt is_zero
            sweeper->>mu: Unlock()
            sweeper-->>sweeper: Skip close and delete
        else Entry_still_same_and_draining
            sweeper->>servers: Confirm servers[url] == srv AND DrainedAt non_zero
            sweeper->>mu: Unlock()
            sweeper->>Server: Instance.Close()

            sweeper->>mu: Lock()
            alt No_new_server_created
                sweeper->>servers: Check servers[url] == srv
                sweeper->>servers: delete(servers, url)
            else New_server_replaced_entry
                sweeper-->>sweeper: servers[url] != srv
                sweeper-->>sweeper: Skip delete to avoid removing new server
            end
            sweeper->>mu: Unlock()
        end
    end

File-Level Changes

xray/xray.go: refactor the sweeper to batch expired servers under lock, then close and delete them with under-lock re-checks to avoid races with getServer/setServer.
  • Introduce an expired slice of {url, *Server} entries, collected while holding mu based on DrainedAt and drainTimeout.
  • After releasing mu, iterate over expired and, before closing each entry, briefly reacquire mu to verify that the map still holds the same server and that it is still drained, skipping the entry if either has changed.
  • Call Instance.Close() outside any lock so that other map operations are not blocked while the close runs.
  • After closing, reacquire mu and delete the entry only if servers[url] is still the same server, so that a newly created server for the same URL is not removed inadvertently.


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've left some high level feedback:

  • The re-check condition before closing appears inverted relative to the comment/intent: to skip closing when an entry has been revived you probably want to continue when servers[e.url] != e.srv || e.srv.DrainedAt.IsZero() rather than || !e.srv.DrainedAt.IsZero().
  • Within the per-entry loop you can reduce lock churn and make the control flow clearer by holding mu for the entire re-check block per phase (e.g., lock, validate conditions and capture a local flag, unlock, then close) instead of multiple short lock/unlock pairs around individual conditionals.

@cnlangzi cnlangzi force-pushed the fix/sweeper-mutex-close branch 3 times, most recently from 25870d9 to ae7bed1, on March 30, 2026 at 03:42
Extract tryCloseAndDelete(url, srv) to hold lock per phase with defer,
making the two-phase pattern easy to read and reuse.

Phase 1 (check): servers[url] == srv && !DrainedAt.IsZero()
  → unlock, then close
Phase 2 (delete): servers[url] == srv
  → delete under lock with defer

This closes the revive race (getServer resetting DrainedAt between
collection and close) and the delete race (setServer creating a new
entry for the same URL between close and delete).
@cnlangzi cnlangzi force-pushed the fix/sweeper-mutex-close branch from ae7bed1 to 1c940e0 on March 30, 2026 at 03:45
@codecov

codecov Bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 89.36170% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 12.71%. Comparing base (294a699) to head (88a0413).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
xray/xray.go 89.36% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##            main      #81      +/-   ##
=========================================
+ Coverage   8.93%   12.71%   +3.77%     
=========================================
  Files         27       27              
  Lines       1533     1565      +32     
=========================================
+ Hits         137      199      +62     
+ Misses      1381     1351      -30     
  Partials      15       15              
Flag Coverage Δ
Tests 12.71% <89.36%> (+3.77%) ⬆️

Flags with carried forward coverage won't be shown.


Xiage added 2 commits March 30, 2026 12:05
Add xray_test.go covering:
- setServer/getServer basic operation
- Close marks server as draining (DrainedAt non-zero)
- getServer revives a draining server (DrainedAt reset to zero)
- Close is idempotent
- CloseAll drains all servers
- Sweeper removes expired entries after DrainTimeout
- Sweeper skips revived entries (DrainedAt reset to zero)
- Sweeper skips active entries (DrainedAt is zero)
- tryCloseAndDelete: nil srv, wrong pointer, revived entry, replaced entry
- Concurrent getServer/setServer/Close with race detector

Production code changes:
- Make DrainTimeout/SweepInterval package vars (configurable per-test)
- startSweeper() lazily starts sweeper via sync.Once (not in init)
- StopSweeper() stops running sweeper and waits for goroutine exit
- ResetForTest() clears map and stops sweeper between tests
- Add nil guard on srv.Instance in tryCloseAndDelete for testability
- Add quit channel to sweeper so it can be stopped cleanly
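
A minimal sketch of the start/stop mechanics described in this commit message (the names startSweeper, StopSweeper, and SweepInterval follow that text; the channel plumbing is an assumption):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	SweepInterval = 10 * time.Millisecond // package var, configurable per-test

	sweepOnce sync.Once
	quit      chan struct{}
	done      chan struct{}
)

func sweep() { /* scan/close/delete elided */ }

// startSweeper starts the background goroutine at most once, lazily
// via sync.Once rather than in init.
func startSweeper() {
	sweepOnce.Do(func() {
		quit = make(chan struct{})
		done = make(chan struct{})
		go func() {
			defer close(done)
			t := time.NewTicker(SweepInterval)
			defer t.Stop()
			for {
				select {
				case <-t.C:
					sweep()
				case <-quit:
					return // quit channel lets the sweeper stop cleanly
				}
			}
		}()
	})
}

// StopSweeper signals the goroutine and waits for it to exit, then
// rearms the sync.Once so tests can restart it (ResetForTest-style).
func StopSweeper() {
	if quit == nil {
		return
	}
	close(quit)
	<-done // wait for goroutine exit
	sweepOnce = sync.Once{}
	quit = nil
}

func main() {
	startSweeper()
	startSweeper() // no-op: already running
	StopSweeper()  // returns only after the goroutine has exited
	fmt.Println("stopped")
}
```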
@cnlangzi cnlangzi merged commit 9a3c545 into main Mar 30, 2026
7 checks passed
@cnlangzi cnlangzi deleted the fix/sweeper-mutex-close branch March 30, 2026 05:59
@cnlangzi cnlangzi linked an issue Mar 30, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

fix: xray goroutine leak in servers map (sweeper solution)