
fix: prevent xray goroutine leak via background sweeper #79

Merged
cnlangzi merged 1 commit into main from fix/xray-goroutine-leak
Mar 30, 2026

Conversation

cnlangzi (Owner) commented Mar 30, 2026

Background

The xray.go servers map caches Xray instances keyed by proxy URL. When Close() is called, it immediately calls Instance.Close() and removes the entry. However, xray-core's goroutines may not fully terminate on short-lived proxy tests, causing goroutine accumulation over time.

After 18 days of sustained proxy testing, the process had grown to 1.5 GB RSS with all 4 GB of swap consumed.

Changes

  • Add DrainedAt field to Server struct (zero = active, non-zero = the time draining began)
  • Add background sweeper() goroutine that runs every 10s and closes instances that have been draining for more than 30s
  • Close() now marks the instance as draining instead of closing immediately
  • CloseAll() marks all instances as draining for the sweeper to clean up
  • getServer() revives a draining instance if accessed again within the grace period

This gives in-flight operations a grace period to finish while ensuring no goroutine leaks accumulate over time.
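
Concretely, the lifecycle looks roughly like the sketch below. This is a minimal reconstruction from the bullets above and the review diffs further down, not the merged patch; the 10s/30s values come from the bullets, while the exact field types and declarations in xray/xray.go are assumptions.

```go
package xray

import (
	"sync"
	"time"

	"github.com/xtls/xray-core/core"
)

const (
	drainTimeout  = 30 * time.Second // grace period before the sweeper closes a draining instance
	sweepInterval = 10 * time.Second // how often the sweeper scans the map
)

// Server caches one Xray instance keyed by proxy URL.
type Server struct {
	Instance  *core.Instance
	SocksPort int
	DrainedAt time.Time // zero = active; non-zero = draining since this time
}

var (
	mu      sync.Mutex
	servers = make(map[string]*Server)
)

// Close marks the instance as draining; the sweeper performs the real close.
func Close(proxyURL string) {
	mu.Lock()
	defer mu.Unlock()

	if srv, ok := servers[proxyURL]; ok && srv.DrainedAt.IsZero() {
		srv.DrainedAt = time.Now()
	}
}

// CloseAll marks every cached instance as draining (shape taken from the
// review diff below).
func CloseAll() {
	mu.Lock()
	defer mu.Unlock()

	now := time.Now()
	for _, srv := range servers {
		if srv.DrainedAt.IsZero() {
			srv.DrainedAt = now
		}
	}
}
```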

Acceptance Criteria

  • No goroutine accumulation on repeated proxy test cycles
  • Memory stays stable under sustained proxy testing
  • Close() and CloseAll() still work correctly
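
One way to check the first criterion in a test — a hypothetical sketch, not part of the PR; the inline closure stands in for a real proxy test cycle, and the tolerance of 5 is arbitrary:

```go
package xray

import (
	"runtime"
	"testing"
	"time"
)

// TestNoGoroutineLeak is a hypothetical regression test: repeated proxy
// test cycles must not grow the goroutine count once the sweeper has had
// a chance to run past the drain grace period.
func TestNoGoroutineLeak(t *testing.T) {
	before := runtime.NumGoroutine()

	// runProxyTest stands in for a full cycle: start an instance via
	// setServer, exercise it, then Close it.
	runProxyTest := func() {
		// ... start instance, perform request, Close(proxyURL) ...
	}
	for i := 0; i < 50; i++ {
		runProxyTest()
	}

	// Wait out the drain grace period plus one sweep interval.
	time.Sleep(drainTimeout + sweepInterval + time.Second)

	after := runtime.NumGoroutine()
	if after > before+5 { // small tolerance for runtime-internal goroutines
		t.Fatalf("goroutine count grew from %d to %d", before, after)
	}
}
```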

Summary by Sourcery

Introduce a delayed-drain mechanism for Xray server instances to prevent goroutine leaks and allow graceful shutdown of in-flight operations.

Bug Fixes:

  • Prevent goroutine and memory leaks by deferring Xray instance closure until after a drain timeout via a background sweeper.

Enhancements:

  • Track server drain state with a timestamp on each Xray server instance and periodically sweep and close expired draining instances.
  • Allow previously draining Xray server instances to be revived if reused within the configured grace period.
  • Ensure Close and CloseAll mark servers as draining instead of closing immediately, centralizing actual shutdown in the background sweeper.


sourcery-ai Bot commented Mar 30, 2026

Reviewer's Guide

Introduces a draining lifecycle for cached Xray servers with a background sweeper goroutine, so Close/CloseAll mark instances for deferred shutdown instead of closing them immediately, preventing goroutine leaks while allowing in-flight work to finish.

Sequence diagram for deferred Close with background sweeper

sequenceDiagram
    participant Caller
    participant Functions
    participant GlobalState
    participant Server
    participant Sweeper
    participant CoreInstance as core.Instance

    Caller->>Functions: Close(proxyURL)
    Functions->>GlobalState: Lock mu
    Functions->>GlobalState: lookup servers[proxyURL]
    alt server exists and DrainedAt is zero
        Functions->>Server: set DrainedAt = time.Now()
    else server missing or already draining
        Note over Functions,Server: No change
    end
    Functions->>GlobalState: Unlock mu

    loop every sweepInterval
        Sweeper->>GlobalState: Sleep(sweepInterval)
        Sweeper->>GlobalState: Lock mu
        Sweeper->>GlobalState: iterate servers
        alt server DrainedAt != zero and now - DrainedAt > drainTimeout
            Sweeper->>CoreInstance: Instance.Close()
            Sweeper->>GlobalState: delete servers[proxyURL]
        else not ready to close
            Note over Sweeper,Server: Keep server in draining or active state
        end
        Sweeper->>GlobalState: Unlock mu
    end

Sequence diagram for getServer reviving a draining instance

sequenceDiagram
    participant Caller
    participant Functions
    participant GlobalState
    participant Server

    Caller->>Functions: getServer(proxyURL)
    Functions->>GlobalState: Lock mu
    Functions->>GlobalState: lookup servers[proxyURL]
    alt server exists
        alt Server.DrainedAt is non_zero
            Functions->>Server: set DrainedAt = zero_time
        else Server.DrainedAt is zero
            Note over Functions,Server: Already active
        end
        Functions-->>Caller: *Server
    else server missing
        Functions-->>Caller: nil
    end
    Functions->>GlobalState: Unlock mu

Class diagram for updated Xray server lifecycle management

classDiagram
    class Server {
        core.Instance Instance
        int SocksPort
        time.Time DrainedAt
    }

    class GlobalState {
        sync.Mutex mu
        map~string,*Server~ servers
        bool sweeping
        const time.Duration drainTimeout
        const time.Duration sweepInterval
    }

    class Functions {
        +getServer(proxyURL string) *Server
        +setServer(proxyURL string, instance *core.Instance, port int)
        +Close(proxyURL string)
        +CloseAll()
        +sweeper()
        +init()
    }

    GlobalState --> Server : manages
    Functions --> Server : creates_and_updates
    Functions --> GlobalState : reads_writes

State diagram for Server draining lifecycle

stateDiagram-v2
    [*] --> Active

    Active: DrainedAt = zero
    Draining: DrainedAt = non_zero
    Closed: removed from servers map

    Active --> Draining: Close or CloseAll
    Draining --> Active: getServer called
    Draining --> Closed: sweeper after drainTimeout
    Closed --> [*]

File-Level Changes

Change Details Files
Add a draining lifecycle with background sweeper to defer Xray instance shutdown and cleanup safely.
  • Introduce drainTimeout and sweepInterval constants controlling how long servers stay in draining state and how often cleanup runs
  • Extend Server struct with a DrainedAt timestamp indicating when draining began
  • Add a global sweeper goroutine invoked from init that periodically scans servers and closes/removes those that have exceeded the drain timeout
  • Guard sweeper access to the servers map with the existing mutex to keep concurrent access safe
xray/xray.go
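
Taken together, those details suggest a sweeper of roughly this shape — a sketch using the names above, with the close performed under the lock as the sequence diagram shows (the review comments below suggest moving it outside the lock):

```go
// init starts the background sweeper once at package load.
func init() {
	go sweeper()
}

// sweeper periodically closes and removes servers whose drain grace
// period has expired.
func sweeper() {
	for {
		time.Sleep(sweepInterval)

		mu.Lock()
		now := time.Now()
		for url, srv := range servers {
			if !srv.DrainedAt.IsZero() && now.Sub(srv.DrainedAt) > drainTimeout {
				srv.Instance.Close() //nolint:errcheck
				delete(servers, url) // deleting during range is safe in Go
			}
		}
		mu.Unlock()
	}
}
```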
Change server cache operations to cooperate with the draining lifecycle instead of closing immediately.
  • Update getServer to clear DrainedAt and revive servers that are accessed while draining
  • Update setServer to initialize new servers with a zero DrainedAt (active) state
  • Modify Close to only mark a server as draining by setting DrainedAt instead of closing and deleting it immediately
  • Modify CloseAll to mark all servers as draining at the current time and let the sweeper perform actual close and removal
xray/xray.go
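
The revive path described in the first bullet might look like the following sketch, reconstructed from the description rather than the actual diff (signature taken from the class diagram above):

```go
// getServer returns the cached server for proxyURL, reviving it if it
// was draining: clearing DrainedAt puts it back in the active state.
func getServer(proxyURL string) *Server {
	mu.Lock()
	defer mu.Unlock()

	srv, ok := servers[proxyURL]
	if !ok {
		return nil
	}
	if !srv.DrainedAt.IsZero() {
		srv.DrainedAt = time.Time{} // zero time = active again
	}
	return srv
}
```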

Possibly linked issues

  • #fix: xray goroutine leak in servers map (sweeper solution): PR implements the issue’s proposed sweeper-based drain mechanism in xray.go to stop goroutine leaks.

sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • Holding mu while calling srv.Instance.Close() in sweeper risks blocking all map operations if close is slow; consider first collecting the instances to close under the lock, then releasing the lock and closing them outside the critical section (see the sketch after this list).
  • The sweeping package-level variable is declared but never used; it can be removed or repurposed (e.g., to guard multiple sweeper startups) to avoid confusion.
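
A sketch of the collect-then-close pattern from the first point, factored into a hypothetical sweepOnce helper (not code from this PR):

```go
// sweepOnce removes expired entries under the lock, then closes them
// after releasing it, so a slow Instance.Close() cannot block
// getServer/setServer callers.
func sweepOnce() {
	var expired []*Server

	mu.Lock()
	now := time.Now()
	for url, srv := range servers {
		if !srv.DrainedAt.IsZero() && now.Sub(srv.DrainedAt) > drainTimeout {
			expired = append(expired, srv)
			delete(servers, url)
		}
	}
	mu.Unlock()

	// Close outside the critical section.
	for _, srv := range expired {
		srv.Instance.Close() //nolint:errcheck
	}
}
```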

Comment thread xray/xray.go
Comment on lines 120 to +123
  servers[proxyURL] = &Server{
-     Instance:  instance,
+     Instance:  instance,
      SocksPort: port,
+     DrainedAt: time.Time{},
  }

issue (bug_risk): Overwriting an existing draining server entry can leak the old instance because the sweeper no longer sees it.

With the new draining design, setServer now unconditionally overwrites servers[proxyURL]. If a server has been marked draining by Close/CloseAll but the sweeper hasn’t run yet, a new setServer for the same URL will remove the draining *Server from the map, so the sweeper never sees or closes it. To avoid this leak, ensure any existing entry is explicitly Close()d before overwriting, or always close the previous instance when replacing it, regardless of draining state.
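
One possible shape of that fix — close whatever entry is being replaced so it can never be orphaned (a sketch of the suggestion, not the merged code):

```go
func setServer(proxyURL string, instance *core.Instance, port int) {
	mu.Lock()
	defer mu.Unlock()

	// Close any previous instance for this URL, draining or not, so
	// overwriting the map entry never hides a live instance from the sweeper.
	if old, ok := servers[proxyURL]; ok {
		old.Instance.Close() //nolint:errcheck
	}

	servers[proxyURL] = &Server{
		Instance:  instance,
		SocksPort: port,
		DrainedAt: time.Time{}, // zero = active
	}
}
```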

Comment thread xray/xray.go
Comment on lines 142 to +148
  func CloseAll() {
      mu.Lock()
      defer mu.Unlock()

-     for url, server := range servers {
-         server.Instance.Close() //nolint: errcheck
-         delete(servers, url)
+     now := time.Now()
+     for _, srv := range servers {
+         if srv.DrainedAt.IsZero() {

question (bug_risk): CloseAll no longer guarantees synchronous shutdown of instances, which may surprise callers depending on previous semantics.

Previously this called Instance.Close() and removed each server entry before returning, so callers could rely on all instances being fully shut down. Now it only sets DrainedAt and returns, leaving Close() and deletion to the sweeper after drainTimeout, which can introduce races or resource conflicts for callers expecting immediate shutdown (e.g., port reuse, goroutine termination). If that synchronous guarantee is required, consider either keeping CloseAll blocking for shutdown (with an internal helper for async draining) or introducing a separate async API and preserving the original semantics here.
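
If the synchronous guarantee matters, one option is a blocking variant alongside the draining Close — CloseAllNow is a hypothetical name for illustration:

```go
// CloseAllNow synchronously closes every cached instance before
// returning, restoring the old CloseAll semantics for callers that
// depend on port reuse or goroutine termination.
func CloseAllNow() {
	mu.Lock()
	drained := servers
	servers = make(map[string]*Server) // swap in an empty map under the lock
	mu.Unlock()

	for _, srv := range drained {
		srv.Instance.Close() //nolint:errcheck
	}
}
```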


codecov Bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 31.81818% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 8.79%. Comparing base (a27df08) to head (6fbbbe0).
⚠️ Report is 1 commits behind head on main.

Files with missing lines   Patch %   Lines
xray/xray.go               31.81%    14 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main     #79      +/-   ##
========================================
+ Coverage   8.42%   8.79%   +0.37%     
========================================
  Files         27      27              
  Lines       1508    1523      +15     
========================================
+ Hits         127     134       +7     
- Misses      1367    1374       +7     
- Partials      14      15       +1     
Flag    Coverage Δ
Tests   8.79% <31.81%> (+0.37%) ⬆️

Flags with carried forward coverage won't be shown.

- Add DrainedAt field to Server struct for graceful draining
- Add background sweeper goroutine that closes drained xray instances
  after 30s grace period (gives in-flight ops time to finish)
- sweeper runs every 10s to close expired draining instances
- Close() marks instance as draining instead of immediate close
- CloseAll() marks all instances as draining for sweeper to clean up
- This prevents goroutine accumulation when xray-core's Close() does not
  fully terminate all internal goroutines on short-lived proxy tests
cnlangzi force-pushed the fix/xray-goroutine-leak branch from fead215 to 6fbbbe0 on March 30, 2026 at 02:57
cnlangzi merged commit 0b5ab19 into main on Mar 30, 2026
7 checks passed
cnlangzi deleted the fix/xray-goroutine-leak branch on March 30, 2026 at 03:04
cnlangzi linked an issue on Mar 30, 2026 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

fix: xray goroutine leak in servers map (sweeper solution)

1 participant