Skip to content

fix(race): close remaining #3247 shared-state races#3269

Open
xxs588 wants to merge 7 commits intoapache:developfrom
xxs588:fix/data-races-registry-directory-3247
Open

fix(race): close remaining #3247 shared-state races#3269
xxs588 wants to merge 7 commits intoapache:developfrom
xxs588:fix/data-races-registry-directory-3247

Conversation

@xxs588
Copy link
Copy Markdown
Contributor

@xxs588 xxs588 commented Mar 19, 2026

Description

Fixes #3247

This PR follows the concurrency-safety direction from #3248 and #3249, and focuses on remaining race-prone shared states without introducing directory-level reorganization.

Changes included:

  • registry/directory: guard cacheInvokers, configurators, and SubscribedUrl access with locks/snapshots
  • registry/protocol: synchronize overrideSubscribeListener.configurator read/write
  • filter/hystrix: snapshot regex cache under lock before callback execution
  • metrics: protect application info globals via sync.RWMutex

Atomic commits:

  1. fix(directory): guard registry directory concurrent state access
  2. fix(registry/protocol): synchronize override listener configurator
  3. fix(filter/hystrix): avoid race when reading regex cache
  4. fix(metrics): guard app info global state with RWMutex

Key local verification:

  • make lint
  • go test -race ./registry/directory -run 'TestRegistryDirectory(IsAvailableConcurrentWithCacheUpdates|ConfiguratorConcurrentAccess|SubscribedURLConcurrentAccess)$'
  • go test -race ./registry/protocol -run 'TestOverrideSubscribeListenerConfiguratorConcurrentAccess$'
  • go test -race ./filter/hystrix -run 'TestHystrixFilterConcurrentRegexAccess$'
  • go test -race ./metrics -run 'TestAppInfoConcurrentAccess$'

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

xxs588 added a commit to xxs588/dubbo-go that referenced this pull request Mar 19, 2026
- guard per-key listener collection with RWMutex and snapshot callbacks
- remove direct concurrent map iteration/delete in watcher paths
- harden callback timing test against fsnotify duplicate-event jitter
- addresses flaky Unit Test failure observed on PR apache#3269 CI

Fixes: apache#3247
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 83.83838% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.60%. Comparing base (c356735) to head (12f8e8e).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
config_center/file/listener.go 66.66% 10 Missing and 1 partial ⚠️
registry/directory/directory.go 88.63% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3269      +/-   ##
===========================================
+ Coverage    49.48%   49.60%   +0.11%     
===========================================
  Files          469      469              
  Lines        34568    34627      +59     
===========================================
+ Hits         17107    17175      +68     
+ Misses       16080    16069      -11     
- Partials      1381     1383       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@AlexStocks AlexStocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

继续修复 #3247 的 race,方向没问题,涉及多个包的全局状态保护。有几个问题需要关注,详见行内评论。

configLoadMutex.Unlock()
}
configLoadMutex.RLock()
methodRegexps := append([]*regexp.Regexp(nil), f.res[invocation.MethodName()]...)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里在 RUnlock 之前做了 map copy,然后在 Do 的回调里用 copy 没问题(避免了迭代时的 race)。但 hystrix.Do 的回调是异步执行的,hystrix 内部维护的 circuit 状态在 Do 返回后仍可能变化,和 filter 层的 goroutine 安全是两回事。

这个是原有设计问题,不是这次引入的。hystrix filter 本身整体是 non-thread-safe 的,注释里应该加一句说明这个 filter 不支持并发调用,或者考虑在 Invoke 入口加锁。

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这条问题目前还没处理。当前分支已经不再修改 filter/hystrix/filter.go,所以这次 race 修复没有覆盖到这里;如果这个 PR 不准备把 hystrix 一并收进来,建议至少后续单独补一个修复,把并发前提写清楚,或者在入口做串行化保护。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter/hystrix/filter.go 已在 #3253 迁移并从当前仓库删除,为保持原子性,我会在 hystrix 迁移后的目标位置单独跟进并发约束

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

补充闭环:已在迁移后仓库创建跟进 issue:apache/dubbo-go-extensions#14。该问题后续在扩展仓推进,这个 PR 里不扩 scope。

@xxs588 xxs588 force-pushed the fix/data-races-registry-directory-3247 branch from ae0830a to 465767f Compare March 23, 2026 12:09
xxs588 added a commit to xxs588/dubbo-go that referenced this pull request Mar 23, 2026
- guard per-key listener collection with RWMutex and snapshot callbacks
- remove direct concurrent map iteration/delete in watcher paths
- harden callback timing test against fsnotify duplicate-event jitter
- addresses flaky Unit Test failure observed on PR apache#3269 CI

Fixes: apache#3247
@xxs588 xxs588 force-pushed the fix/data-races-registry-directory-3247 branch 2 times, most recently from ae0830a to d2a51db Compare March 23, 2026 13:23
xxs588 added a commit to xxs588/dubbo-go that referenced this pull request Mar 23, 2026
- guard per-key listener collection with RWMutex and snapshot callbacks
- remove direct concurrent map iteration/delete in watcher paths
- harden callback timing test against fsnotify duplicate-event jitter
- addresses flaky Unit Test failure observed on PR apache#3269 CI

Fixes: apache#3247
@xxs588
Copy link
Copy Markdown
Contributor Author

xxs588 commented Mar 23, 2026

修复了一下冲突,先 fetch了develop,然后对当前分支做了 rebase。
过程中遇到的冲突是预期内的 modify/delete(filter/hystrix/),因为 develop分支已经把这部分迁移并删除了。
然后我保持 develop分支 的删除结果,移除本分支对应的 hystrix 改动,然后继续变基,只保留本 PR 本身要交的逻辑

另外我排查到一个并发测试点 directory.go 里已经有 snapshotCacheInvokers() 这个并发安全读路径,但是测试里有直接读取 cacheInvokers 的写法 这样子会绕过同步语义并触发 race。
但是这个属于测试侧读取模型问题,不应该在 pr 3269 里面修复所以为了保持 PR 原子性,我切到 split/3247-registry-lifecycle 单独提交一个最小测试补丁,只修复了一下测试读取路径

})
}

func TestRegistryDirectoryIsAvailableConcurrentWithCacheUpdates(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里补了并发场景用例,但当前包在 go test -race ./registry/directory 下还是会报 race。触发点不是这几个新 helper,而是这个测试文件里原有用例还在直接读 registryDirectory.cacheInvokers(例如 TestSubscribeTest_DestroyTest_RefreshUrl)。既然这个 PR 的目标是把 directory 的共享状态 race 收干净,建议把这些断言也改成通过 snapshotCacheInvokers() / List() 读取,或者在测试里统一走 invokersLock,否则目录包的 race 检测结果仍然是红的。

Copy link
Copy Markdown
Contributor Author

@xxs588 xxs588 Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感谢提醒,我现在已经把把 TestSubscribe、Test_Destroy、Test_MergeProviderUrl、Test_MergeOverrideUrl、Test_RefreshUrl 里直接读 cacheInvokers 的断言,全部改成了snapshotCacheInvokers() 读取,修复在 #3270 分支里面,go test -race ./registry/directory ./registry/nacos 也是通过的

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

具体详情是
图片
图片
图片
图片
图片

@xxs588
Copy link
Copy Markdown
Contributor Author

xxs588 commented Mar 28, 2026

跟进同步:针对 #3269 (comment) 提到的 hystrix 并发约束问题,已在迁移后目标仓建立后续 issue:apache/dubbo-go-extensions#14
由于 hystrix filter 已在 #3253 从 dubbo-go 主仓迁移并删除,#3269/#3270/#3271 不再包含该文件改动。为保持 PR 原子性,这里不扩 scope 修改 hystrix 代码,后续将在扩展仓按 issue #14 单独推进(先明确并发约束,再评估是否需要入口串行化保护)。

Copy link
Copy Markdown
Contributor

@AlexStocks AlexStocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: fix(race): close remaining #3247 shared-state races

[P0] 阻断级问题

1. registry/directory/directory.go:586-620 - 并发修复不完整

新增的 snapshotCacheInvokers() 方法使用 RLock+copy 模式正确,但在 toGroupInvokers() 方法中仍直接访问 dir.cacheInvokers

func (dir *RegistryDirectory) toGroupInvokers() []protocolbase.Invoker {
    groupInvokersMap := make(map[string][]protocolbase.Invoker)
    // 问题:这里直接访问 dir.cacheInvokers,未使用 snapshot 方法
    for _, ivk := range dir.cacheInvokers {
        // ...
    }
}

需要统一修改为:

for _, ivk := range dir.snapshotCacheInvokers() {
    // ...
}

[P1] 规范级问题

1. config_center/file/listener.go:45-80 - listenerSet 设计

新增的 listenerSet 结构设计良好,使用 Snapshot() 返回切片避免迭代时修改。但文档说明不足:

// Snapshot returns a read-only listener snapshot for safe iteration.
// Callers must not mutate listeners through this snapshot.

建议添加注释说明调用方不应修改返回的切片。

2. 测试辅助函数验证

listener_test.go 新增的 drainEventsassertNoEvent 函数未单独测试,建议添加单元测试验证其正确性。


总体评价:竞态修复方向正确,但需要检查所有直接访问共享状态的地方是否已统一使用 snapshot 方法。

@xxs588
Copy link
Copy Markdown
Contributor Author

xxs588 commented Mar 29, 2026

Code Review: fix(race): close remaining #3247 shared-state races

[P0] 阻断级问题

1. registry/directory/directory.go:586-620 - 并发修复不完整

新增的 snapshotCacheInvokers() 方法使用 RLock+copy 模式正确,但在 toGroupInvokers() 方法中仍直接访问 dir.cacheInvokers

func (dir *RegistryDirectory) toGroupInvokers() []protocolbase.Invoker {
    groupInvokersMap := make(map[string][]protocolbase.Invoker)
    // 问题:这里直接访问 dir.cacheInvokers,未使用 snapshot 方法
    for _, ivk := range dir.cacheInvokers {
        // ...
    }
}

需要统一修改为:

for _, ivk := range dir.snapshotCacheInvokers() {
    // ...
}

[P1] 规范级问题

1. config_center/file/listener.go:45-80 - listenerSet 设计

新增的 listenerSet 结构设计良好,使用 Snapshot() 返回切片避免迭代时修改。但文档说明不足:

// Snapshot returns a read-only listener snapshot for safe iteration.
// Callers must not mutate listeners through this snapshot.

建议添加注释说明调用方不应修改返回的切片。

2. 测试辅助函数验证

listener_test.go 新增的 drainEventsassertNoEvent 函数未单独测试,建议添加单元测试验证其正确性。

总体评价:竞态修复方向正确,但需要检查所有直接访问共享状态的地方是否已统一使用 snapshot 方法。

关于 toGroupInvokers

我这边重新对了当前分支代码,这里现在是走 cacheInvokersMap.Range(...) 做分组,不是直接遍历 dir.cacheInvokers。cacheInvokers 目前只在加锁的 helper(snapshotCacheInvokers / swapCacheInvokers)里访问 然后剩下两个部分已补充注释和添加了相关单测

xxs588 added 7 commits March 31, 2026 13:25
- guard per-key listener collection with RWMutex and snapshot callbacks
- remove direct concurrent map iteration/delete in watcher paths
- harden callback timing test against fsnotify duplicate-event jitter
- addresses flaky Unit Test failure observed on PR apache#3269 CI

Fixes: apache#3247
add direct test coverage for drainEvents

add direct test coverage for assertNoEvent

keep review scope minimal without unrelated logic changes
clarify that both returned snapshot slice and listeners are immutable for callers
@xxs588 xxs588 force-pushed the fix/data-races-registry-directory-3247 branch from 057f02b to 12f8e8e Compare March 31, 2026 05:27
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.3.2 version 3.3.2 ☢️ Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Systematic data races: 20+ extension global maps unprotected, wrong lock in SetProviderService, RouterChain.Route lockless

4 participants