fix(race): close remaining #3247 shared-state races#3269
fix(race): close remaining #3247 shared-state races#3269xxs588 wants to merge 7 commits intoapache:developfrom
Conversation
- guard per-key listener collection with RWMutex and snapshot callbacks - remove direct concurrent map iteration/delete in watcher paths - harden callback timing test against fsnotify duplicate-event jitter - addresses flaky Unit Test failure observed on PR apache#3269 CI Fixes: apache#3247
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #3269 +/- ##
===========================================
+ Coverage 49.48% 49.60% +0.11%
===========================================
Files 469 469
Lines 34568 34627 +59
===========================================
+ Hits 17107 17175 +68
+ Misses 16080 16069 -11
- Partials 1381 1383 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
AlexStocks
left a comment
There was a problem hiding this comment.
继续修复 #3247 的 race,方向没问题,涉及多个包的全局状态保护。有几个问题需要关注,详见行内评论。
filter/hystrix/filter.go
Outdated
| configLoadMutex.Unlock() | ||
| } | ||
| configLoadMutex.RLock() | ||
| methodRegexps := append([]*regexp.Regexp(nil), f.res[invocation.MethodName()]...) |
There was a problem hiding this comment.
这里在 RUnlock 之前做了 map copy,然后在 Do 的回调里用 copy 没问题(避免了迭代时的 race)。但 hystrix.Do 的回调是异步执行的,hystrix 内部维护的 circuit 状态在 Do 返回后仍可能变化,和 filter 层的 goroutine 安全是两回事。
这个是原有设计问题,不是这次引入的。hystrix filter 本身整体是 non-thread-safe 的,注释里应该加一句说明这个 filter 不支持并发调用,或者考虑在 Invoke 入口加锁。
There was a problem hiding this comment.
这条问题目前还没处理。当前分支已经不再修改 filter/hystrix/filter.go,所以这次 race 修复没有覆盖到这里;如果这个 PR 不准备把 hystrix 一并收进来,建议至少后续单独补一个修复,把并发前提写清楚,或者在入口做串行化保护。
There was a problem hiding this comment.
filter/hystrix/filter.go 已在 #3253 迁移并从当前仓库删除,为保持原子性,我会在 hystrix 迁移后的目标位置单独跟进并发约束
There was a problem hiding this comment.
补充闭环:已在迁移后仓库创建跟进 issue:apache/dubbo-go-extensions#14。该问题后续在扩展仓推进,这个 PR 里不扩 scope。
ae0830a to
465767f
Compare
- guard per-key listener collection with RWMutex and snapshot callbacks - remove direct concurrent map iteration/delete in watcher paths - harden callback timing test against fsnotify duplicate-event jitter - addresses flaky Unit Test failure observed on PR apache#3269 CI Fixes: apache#3247
ae0830a to
d2a51db
Compare
- guard per-key listener collection with RWMutex and snapshot callbacks - remove direct concurrent map iteration/delete in watcher paths - harden callback timing test against fsnotify duplicate-event jitter - addresses flaky Unit Test failure observed on PR apache#3269 CI Fixes: apache#3247
|
修复了一下冲突,先 fetch了develop,然后对当前分支做了 rebase。 另外我排查到一个并发测试点 directory.go 里已经有 snapshotCacheInvokers() 这个并发安全读路径,但是测试里有直接读取 cacheInvokers 的写法 这样子会绕过同步语义并触发 race。 |
| }) | ||
| } | ||
|
|
||
| func TestRegistryDirectoryIsAvailableConcurrentWithCacheUpdates(t *testing.T) { |
There was a problem hiding this comment.
这里补了并发场景用例,但当前包在 go test -race ./registry/directory 下还是会报 race。触发点不是这几个新 helper,而是这个测试文件里原有用例还在直接读 registryDirectory.cacheInvokers(例如 TestSubscribe、Test_Destroy、Test_RefreshUrl)。既然这个 PR 的目标是把 directory 的共享状态 race 收干净,建议把这些断言也改成通过 snapshotCacheInvokers() / List() 读取,或者在测试里统一走 invokersLock,否则目录包的 race 检测结果仍然是红的。
There was a problem hiding this comment.
感谢提醒,我现在已经把把 TestSubscribe、Test_Destroy、Test_MergeProviderUrl、Test_MergeOverrideUrl、Test_RefreshUrl 里直接读 cacheInvokers 的断言,全部改成了snapshotCacheInvokers() 读取,修复在 #3270 分支里面,go test -race ./registry/directory ./registry/nacos 也是通过的
|
跟进同步:针对 #3269 (comment) 提到的 hystrix 并发约束问题,已在迁移后目标仓建立后续 issue:apache/dubbo-go-extensions#14 。 |
AlexStocks
left a comment
There was a problem hiding this comment.
Code Review: fix(race): close remaining #3247 shared-state races
[P0] 阻断级问题
1. registry/directory/directory.go:586-620 - 并发修复不完整
新增的 snapshotCacheInvokers() 方法使用 RLock+copy 模式正确,但在 toGroupInvokers() 方法中仍直接访问 dir.cacheInvokers:
func (dir *RegistryDirectory) toGroupInvokers() []protocolbase.Invoker {
groupInvokersMap := make(map[string][]protocolbase.Invoker)
// 问题:这里直接访问 dir.cacheInvokers,未使用 snapshot 方法
for _, ivk := range dir.cacheInvokers {
// ...
}
}需要统一修改为:
for _, ivk := range dir.snapshotCacheInvokers() {
// ...
}[P1] 规范级问题
1. config_center/file/listener.go:45-80 - listenerSet 设计
新增的 listenerSet 结构设计良好,使用 Snapshot() 返回切片避免迭代时修改。但文档说明不足:
// Snapshot returns a read-only listener snapshot for safe iteration.
// Callers must not mutate listeners through this snapshot.建议添加注释说明调用方不应修改返回的切片。
2. 测试辅助函数验证
listener_test.go 新增的 drainEvents 和 assertNoEvent 函数未单独测试,建议添加单元测试验证其正确性。
总体评价:竞态修复方向正确,但需要检查所有直接访问共享状态的地方是否已统一使用 snapshot 方法。
关于 toGroupInvokers 我这边重新对了当前分支代码,这里现在是走 cacheInvokersMap.Range(...) 做分组,不是直接遍历 dir.cacheInvokers。cacheInvokers 目前只在加锁的 helper(snapshotCacheInvokers / swapCacheInvokers)里访问 然后剩下两个部分已补充注释和添加了相关单测 |
- guard per-key listener collection with RWMutex and snapshot callbacks - remove direct concurrent map iteration/delete in watcher paths - harden callback timing test against fsnotify duplicate-event jitter - addresses flaky Unit Test failure observed on PR apache#3269 CI Fixes: apache#3247
add direct test coverage for drainEvents add direct test coverage for assertNoEvent keep review scope minimal without unrelated logic changes
clarify that both returned snapshot slice and listeners are immutable for callers
057f02b to
12f8e8e
Compare
|








Description
Fixes #3247
This PR follows the concurrency-safety direction from #3248 and #3249, and focuses on remaining race-prone shared states without introducing directory-level reorganization.
Changes included:
registry/directory: guardcacheInvokers,configurators, andSubscribedUrlaccess with locks/snapshotsregistry/protocol: synchronizeoverrideSubscribeListener.configuratorread/writefilter/hystrix: snapshot regex cache under lock before callback executionmetrics: protect application info globals viasync.RWMutexAtomic commits:
fix(directory): guard registry directory concurrent state accessfix(registry/protocol): synchronize override listener configuratorfix(filter/hystrix): avoid race when reading regex cachefix(metrics): guard app info global state with RWMutexKey local verification:
make lintgo test -race ./registry/directory -run 'TestRegistryDirectory(IsAvailableConcurrentWithCacheUpdates|ConfiguratorConcurrentAccess|SubscribedURLConcurrentAccess)$'go test -race ./registry/protocol -run 'TestOverrideSubscribeListenerConfiguratorConcurrentAccess$'go test -race ./filter/hystrix -run 'TestHystrixFilterConcurrentRegexAccess$'go test -race ./metrics -run 'TestAppInfoConcurrentAccess$'Checklist
develop