Add client evictor & investigastion docs#4138
Conversation
Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt> On-behalf-of: SAP <mangirdas.judeikis@sap.com>
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test all |
There was a problem hiding this comment.
Pull request overview
This PR introduces mechanisms and tooling to better attribute and reduce memory/goroutine retention across logical cluster lifecycles, and adds a new quickstart scenario for workspace-churn/performance testing.
Changes:
- Add a global per-cluster client-cache eviction fan-out (
EvictCluster) and wire it toLogicalClusterdeletions to release cached per-workspace client state. - Introduce
pkg/pproflabelsand apply labels to kubequota controller/admission per-cluster initialization paths for improved pprof attribution. - Add a new
quickstart“workspaces” scenario (plus flags/tests) and add investigation documentation for leak attribution.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/workspaces.go | New quickstart scenario that creates a random tree of workspaces under an org workspace. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/workspaces_test.go | Unit tests for workspaces scenario validation and deterministic tree generation. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/scenarios_test.go | Adds coverage that “workspaces” is a valid scenario in the registry. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/scenario.go | Extends Scenario with Validate and registers the new “workspaces” scenario. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/api_provider.go | Adds scenario-specific Validate for workspace-name DNS-1123 constraints. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/plugin/quickstart_test.go | Updates test mocks to satisfy the new Scenario.Validate interface method. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/plugin/options.go | Adds workspaces scenario tuning flags and defers prefix validation to Scenario.Validate. |
| staging/src/github.com/kcp-dev/cli/pkg/quickstart/cmd/cmd.go | Updates CLI help text to describe the new “workspaces” scenario. |
| staging/src/github.com/kcp-dev/apimachinery/pkg/client/constructor.go | Adds Evict support, global evictor registry, and EvictCluster fan-out tied to caches created via NewCache. |
| pkg/server/server.go | Wires LogicalCluster informer delete events to client-cache eviction. |
| pkg/server/clientcache_evictor.go | Implements the LogicalCluster delete handler that calls apiclient.EvictCluster. |
| pkg/reconciler/kubequota/kubequota_controller.go | Wraps per-cluster startup in pprof labels to keep leaked goroutines attributable. |
| pkg/pproflabels/labels.go | New helper package for attaching stable pprof labels (controller, logicalcluster). |
| pkg/admission/kubequota/kubequota_admission.go | Wraps per-cluster quota admission delegate creation in pprof labels for attribution. |
| docs/content/developers/investigations/memory-leak-attribution.md | Adds a leak attribution guide using pprof labels and workspace churn. |
Comments suppressed due to low confidence (1)
staging/src/github.com/kcp-dev/apimachinery/pkg/client/constructor.go:63
- The package-level
evictorsslice strongly references every cache ever created viaNewCache(becauseNewCacheauto-callsRegisterEvictor). If consumers create many short-lived clientsets/caches (common in tests or CLIs), this global registry will grow monotonically and prevent those caches from being garbage-collected. Consider adding anUnregisterEvictor/ResetEvictorsForTesthook, or avoiding auto-registration inNewCachein favor of explicit registration by long-lived processes (like the kcp server).
var (
evictorsMu sync.RWMutex
evictors []Evictor
)
// RegisterEvictor adds e to the set of caches notified by EvictCluster.
// NewCache calls this automatically.
func RegisterEvictor(e Evictor) {
evictorsMu.Lock()
defer evictorsMu.Unlock()
evictors = append(evictors, e)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Cache is a client factory that caches previous results. | ||
| type Cache[R any] interface { | ||
| ClusterOrDie(clusterPath logicalcluster.Path) R | ||
| Cluster(clusterPath logicalcluster.Path) (R, error) | ||
| // Evict drops the cached client for clusterPath, if any. Used to release | ||
| // per-cluster client state (REST clients, codec factories, parsed | ||
| // schemas) when a logical cluster is deleted. Safe to call concurrently | ||
| // with Cluster / ClusterOrDie. No-op if the path is not cached. | ||
| Evict(clusterPath logicalcluster.Path) | ||
| } |
There was a problem hiding this comment.
I would like to make this breaking so downstream clients MUST provide eviction, else it just yet-another-leak
There was a problem hiding this comment.
But is this something of interest for downstream? kcp-internal only has this problem because we have the shard-cluster relation and the internal view.
External clients should only see the clusters.
There was a problem hiding this comment.
And since they only see the clusters they won't have this problem.
There was a problem hiding this comment.
I mean, the idea was that one could make this external, but I opted out of this. So we get compile errors :)
|
/retest |
|
/test pull-kcp-test-e2e-multiple-runs |
| c.RUnlock() | ||
| if exists { | ||
| return cachedClient, nil | ||
| } | ||
|
|
||
| cfg := SetCluster(rest.CopyConfig(c.cfg), clusterPath) | ||
| instance, err := c.constructor.NewForConfigAndClient(cfg, c.client) | ||
| if err != nil { | ||
| var result R | ||
| return result, err | ||
| } | ||
|
|
||
| c.Lock() | ||
| defer c.Unlock() | ||
| cachedClient, exists = c.clientsByClusterPath[clusterPath] | ||
| if exists { | ||
| return cachedClient, nil | ||
| } |
There was a problem hiding this comment.
Can't this then lead to leaked clients?
If a client is being built for a cluster and the client does not exist, so it drops out of .RUnlock, build the client, then acquires the lock and writes the new client. If inbetween an Evict on the same cluster happened the client will be left over.
Summary
Whats in this PR:
pkg/pproflabels/labels.go— pprof label helper, wired into kubequota controllersstaging/.../apimachinery/pkg/client/constructor.go— Evict + global EvictCluster registrypkg/server/clientcache_evictor.go + pkg/server/server.go— LogicalCluster delete → EvictClusterMakefile— fix prometheus target so UGET_PRINT_PATH=absolute works when binary cached (needed for./hack/run-with-prometheus.sh ./bin/kcp start --debug-socket-path=/tmp/kcp-debug.sockto work locally on mac)docs/content/developers/investigations/memory-leak-attribution.md— full investigation guideImpact:
Targeted bug (
clientCacheretention): 213 MB → 3.7 MB cum (98% recovery)Total leak per workspace: 745 KB → 562 KB (-183 KB)
Goroutine leak class from #3350: already mostly fixed, wiring confirms attribution works
Next investigations identified (separate PRs):
structToUnstructured— likely OpenAPI v3 controller's per-cluster spec cacheson.objectInterface— JSON-decoded objects still pinned by somethingpkg/index/index.goDeleteLogicalClusterincomplete (5 of 8 per-cluster maps not cleaned)What Type of PR Is This?
/kind bug
Related Issue(s)
Fixes #4071 (one of few fixes, more will come)
Release Notes