Skip to content

Add client evictor & investigastion docs#4138

Open
mjudeikis wants to merge 7 commits into
kcp-dev:mainfrom
mjudeikis:leaking.pipes
Open

Add client evictor & investigastion docs#4138
mjudeikis wants to merge 7 commits into
kcp-dev:mainfrom
mjudeikis:leaking.pipes

Conversation

@mjudeikis
Copy link
Copy Markdown
Contributor

Summary

Whats in this PR:

  • pkg/pproflabels/labels.go — pprof label helper, wired into kubequota controllers
  • staging/.../apimachinery/pkg/client/constructor.go — Evict + global EvictCluster registry
  • pkg/server/clientcache_evictor.go + pkg/server/server.go — LogicalCluster delete → EvictCluster
  • Makefile — fix prometheus target so UGET_PRINT_PATH=absolute works when binary cached (needed for ./hack/run-with-prometheus.sh ./bin/kcp start --debug-socket-path=/tmp/kcp-debug.sock to work locally on mac)
  • docs/content/developers/investigations/memory-leak-attribution.md — full investigation guide

Impact:

Targeted bug (clientCache retention): 213 MB → 3.7 MB cum (98% recovery)
Total leak per workspace: 745 KB → 562 KB (-183 KB)
Goroutine leak class from #3350: already mostly fixed, wiring confirms attribution works

Next investigations identified (separate PRs):

  • 200 MB cum on structToUnstructured — likely OpenAPI v3 controller's per-cluster spec cache
  • 99 MB on json.objectInterface — JSON-decoded objects still pinned by something
  • pkg/index/index.go DeleteLogicalCluster incomplete (5 of 8 per-cluster maps not cleaned)

What Type of PR Is This?

/kind bug

Related Issue(s)

Fixes #4071 (one of few fixes, more will come)

Release Notes

Add client evictor to release memory on LogicalCluster deletes

mjudeikis added 3 commits May 21, 2026 08:33
Signed-off-by: Mangirdas Judeikis <mangirdas@judeikis.lt>
On-behalf-of: SAP <mangirdas.judeikis@sap.com>
@kcp-ci-bot kcp-ci-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels May 21, 2026
@kcp-ci-bot
Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kcp-ci-bot kcp-ci-bot added dco-signoff: yes Indicates the PR's author has signed the DCO. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels May 21, 2026
@kcp-ci-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mjudeikis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 21, 2026
@mjudeikis
Copy link
Copy Markdown
Contributor Author

/test all

@mjudeikis mjudeikis marked this pull request as ready for review May 21, 2026 08:40
@kcp-ci-bot kcp-ci-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces mechanisms and tooling to better attribute and reduce memory/goroutine retention across logical cluster lifecycles, and adds a new quickstart scenario for workspace-churn/performance testing.

Changes:

  • Add a global per-cluster client-cache eviction fan-out (EvictCluster) and wire it to LogicalCluster deletions to release cached per-workspace client state.
  • Introduce pkg/pproflabels and apply labels to kubequota controller/admission per-cluster initialization paths for improved pprof attribution.
  • Add a new quickstart “workspaces” scenario (plus flags/tests) and add investigation documentation for leak attribution.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/workspaces.go New quickstart scenario that creates a random tree of workspaces under an org workspace.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/workspaces_test.go Unit tests for workspaces scenario validation and deterministic tree generation.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/scenarios_test.go Adds coverage that “workspaces” is a valid scenario in the registry.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/scenario.go Extends Scenario with Validate and registers the new “workspaces” scenario.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/api_provider.go Adds scenario-specific Validate for workspace-name DNS-1123 constraints.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/plugin/quickstart_test.go Updates test mocks to satisfy the new Scenario.Validate interface method.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/plugin/options.go Adds workspaces scenario tuning flags and defers prefix validation to Scenario.Validate.
staging/src/github.com/kcp-dev/cli/pkg/quickstart/cmd/cmd.go Updates CLI help text to describe the new “workspaces” scenario.
staging/src/github.com/kcp-dev/apimachinery/pkg/client/constructor.go Adds Evict support, global evictor registry, and EvictCluster fan-out tied to caches created via NewCache.
pkg/server/server.go Wires LogicalCluster informer delete events to client-cache eviction.
pkg/server/clientcache_evictor.go Implements the LogicalCluster delete handler that calls apiclient.EvictCluster.
pkg/reconciler/kubequota/kubequota_controller.go Wraps per-cluster startup in pprof labels to keep leaked goroutines attributable.
pkg/pproflabels/labels.go New helper package for attaching stable pprof labels (controller, logicalcluster).
pkg/admission/kubequota/kubequota_admission.go Wraps per-cluster quota admission delegate creation in pprof labels for attribution.
docs/content/developers/investigations/memory-leak-attribution.md Adds a leak attribution guide using pprof labels and workspace churn.
Comments suppressed due to low confidence (1)

staging/src/github.com/kcp-dev/apimachinery/pkg/client/constructor.go:63

  • The package-level evictors slice strongly references every cache ever created via NewCache (because NewCache auto-calls RegisterEvictor). If consumers create many short-lived clientsets/caches (common in tests or CLIs), this global registry will grow monotonically and prevent those caches from being garbage-collected. Consider adding an UnregisterEvictor/ResetEvictorsForTest hook, or avoiding auto-registration in NewCache in favor of explicit registration by long-lived processes (like the kcp server).
var (
	evictorsMu sync.RWMutex
	evictors   []Evictor
)

// RegisterEvictor adds e to the set of caches notified by EvictCluster.
// NewCache calls this automatically.
func RegisterEvictor(e Evictor) {
	evictorsMu.Lock()
	defer evictorsMu.Unlock()
	evictors = append(evictors, e)
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread staging/src/github.com/kcp-dev/cli/pkg/quickstart/scenarios/workspaces_test.go Outdated
Comment on lines 33 to +42
// Cache is a client factory that caches previous results.
type Cache[R any] interface {
ClusterOrDie(clusterPath logicalcluster.Path) R
Cluster(clusterPath logicalcluster.Path) (R, error)
// Evict drops the cached client for clusterPath, if any. Used to release
// per-cluster client state (REST clients, codec factories, parsed
// schemas) when a logical cluster is deleted. Safe to call concurrently
// with Cluster / ClusterOrDie. No-op if the path is not cached.
Evict(clusterPath logicalcluster.Path)
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to make this breaking so downstream clients MUST provide eviction, else it just yet-another-leak

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is this something of interest for downstream? kcp-internal only has this problem because we have the shard-cluster relation and the internal view.
External clients should only see the clusters.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And since they only see the clusters they won't have this problem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the idea was that one could make this external, but I opted out of this. So we get compile errors :)

Comment thread docs/content/developers/investigations/memory-leak-attribution.md Outdated
Comment thread pkg/server/server.go Outdated
@mjudeikis
Copy link
Copy Markdown
Contributor Author

/retest

@ntnn
Copy link
Copy Markdown
Member

ntnn commented May 21, 2026

/test pull-kcp-test-e2e-multiple-runs

Comment on lines 125 to 142
c.RUnlock()
if exists {
return cachedClient, nil
}

cfg := SetCluster(rest.CopyConfig(c.cfg), clusterPath)
instance, err := c.constructor.NewForConfigAndClient(cfg, c.client)
if err != nil {
var result R
return result, err
}

c.Lock()
defer c.Unlock()
cachedClient, exists = c.clientsByClusterPath[clusterPath]
if exists {
return cachedClient, nil
}
Copy link
Copy Markdown
Member

@ntnn ntnn May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this then lead to leaked clients?
If a client is being built for a cluster and the client does not exist, so it drops out of .RUnlock, build the client, then acquires the lock and writes the new client. If inbetween an Evict on the same cluster happened the client will be left over.

@ntnn ntnn added this to tbd May 21, 2026
@ntnn ntnn moved this to Reviewing in tbd May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the DCO. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

deleting logical clusters does not free up memory

4 participants