[CONTP-1639] Refactor CRD metadata collection to use informer#2979

Open
zhuminyi wants to merge 3 commits into main from minyi/contp-1639-crd-metadata-informer

Conversation

@zhuminyi
Contributor

@zhuminyi zhuminyi commented May 6, 2026

What does this PR do?

Replaces the 1-minute polling loop in CRDMetadataForwarder with an event-driven informer + workqueue, and extracts the shared informer/RBAC/queue/heartbeat plumbing into a new InformerWorkQueue engine used by both the CRD and Helm metadata forwarders.

Three commits, intended to be reviewed in order:

1. feat(metadata): add shared InformerWorkQueue engine

  • New pkg/controller/utils/metadata/informer_runner.go defines InformerWorkQueue — owns the rate-limited workqueue, worker pool, RBAC pre-check (SelfSubjectAccessReview), informer registration via AddWatch(ctx, WatchTarget), and an optional heartbeat ticker.
  • WatchTarget carries an explicit API Group so the SAR pre-check matches the operator's granted RBAC (e.g. datadoghq.com for CRDs, "" for core resources).
  • Pure addition; no existing forwarder is touched yet.

2. refactor(metadata): migrate HelmMetadataForwarder to InformerWorkQueue

  • Pure refactor. Helm metadata send behavior is unchanged.
  • Removes inline workqueue, RBAC pre-check, worker pool, and ticker loop in favor of the shared engine (~200 lines deleted).
  • processKey now dispatches by kind (ConfigMap / Secret) encoded in the queue key, replacing the prior "try as ConfigMap then Secret" fallback.
  • All existing Helm tests continue to pass without changes beyond the queue-initialization assertion (now reads through runner.QueueLen).
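The kind-dispatch described above can be illustrated with the queue-key scheme the commit messages describe ("<Kind>/<namespace>/<name>", with a "delete:" prefix for deletions). The helper names `encodeKey`/`decodeKey`/`processKey` and the returned strings are hypothetical; only the key format comes from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeKey builds a queue key in the format the PR describes:
// "<Kind>/<namespace>/<name>", with a "delete:" prefix for deletions.
func encodeKey(kind, namespace, name string, deleted bool) string {
	key := kind + "/" + namespace + "/" + name
	if deleted {
		return "delete:" + key
	}
	return key
}

// decodeKey splits a key back into its parts.
func decodeKey(key string) (kind, namespace, name string, deleted bool) {
	deleted = strings.HasPrefix(key, "delete:")
	key = strings.TrimPrefix(key, "delete:")
	parts := strings.SplitN(key, "/", 3)
	if len(parts) != 3 {
		return "", "", "", deleted // malformed key
	}
	return parts[0], parts[1], parts[2], deleted
}

// processKey dispatches on the kind encoded in the key, replacing the old
// "try as ConfigMap then Secret" fallback.
func processKey(key string) string {
	kind, ns, name, deleted := decodeKey(key)
	if deleted {
		return fmt.Sprintf("clear snapshot for %s %s/%s", kind, ns, name)
	}
	switch kind {
	case "ConfigMap":
		return "handle ConfigMap " + ns + "/" + name
	case "Secret":
		return "handle Secret " + ns + "/" + name
	default:
		return "unknown kind " + kind
	}
}

func main() {
	fmt.Println(processKey("ConfigMap/datadog/release-cm"))
	fmt.Println(processKey("delete:Secret/datadog/release-secret"))
}
```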

3. feat(metadata): make CRDMetadataForwarder event-driven via InformerWorkQueue

  • Constructor takes manager.Manager; implements manager.Runnable + manager.LeaderElectionRunnable.
  • Start registers AddWatch for DatadogAgent / DatadogAgentInternal / DatadogAgentProfile against the manager cache, each with Group="datadoghq.com".
  • processKey fetches the typed object with Get, hashes it against a sync.Map of last-sent snapshots, and sends only if it is new or changed; it stores the new snapshot before sending so a concurrent heartbeat tick cannot replay a stale snapshot.
  • handleDelete clears the snapshot for a deleted CRD; 10-minute heartbeat re-sends every snapshot.
  • Polling loop and helpers (sendMetadata, getAllActiveCRDs, getCRDsToSend, cleanupDeletedCRDs, buildCacheKey) deleted.
  • cmd/main.go: setupAndStartCRDMetadataForwarder now takes manager.Manager and registers via mgr.Add(cmf), sitting next to the Helm wiring instead of inside the leader-elected goroutine.
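The store-before-send ordering in processKey is the subtle part. A minimal sketch of the idea, assuming an FNV hash over the serialized payload (the real code may hash differently; `maybeSend`/`send` are hypothetical names):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// snapshots maps "<Kind>/<namespace>/<name>" -> hash of the last-sent payload.
var snapshots sync.Map

func hashPayload(payload []byte) uint64 {
	h := fnv.New64a()
	h.Write(payload)
	return h.Sum64()
}

// maybeSend returns true if the payload is new or changed. It stores the new
// hash BEFORE sending, so a concurrent heartbeat that ranges over the map
// cannot replay the stale snapshot.
func maybeSend(key string, payload []byte) bool {
	h := hashPayload(payload)
	if prev, ok := snapshots.Load(key); ok && prev.(uint64) == h {
		return false // unchanged: skip the send
	}
	snapshots.Store(key, h) // store first ...
	send(key, payload)      // ... then send
	return true
}

func send(key string, payload []byte) {
	fmt.Printf("sending %s (%d bytes)\n", key, len(payload))
}

// handleDelete clears the snapshot for a deleted CRD.
func handleDelete(key string) { snapshots.Delete(key) }

func main() {
	maybeSend("DatadogAgent/default/example", []byte(`{"spec":1}`))
	maybeSend("DatadogAgent/default/example", []byte(`{"spec":1}`)) // unchanged: skipped
}
```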

Motivation

CRD metadata sends were taking up to 60 seconds to reflect a kubectl edit on a DatadogAgent, because the forwarder relied on a wall-clock poll. The Helm forwarder has used informers since it was written.

After this PR, a CRD spec change reaches the Datadog metadata pipeline within ~1 second.

Additional Notes

  • Behavioral change for Helm: the new shared engine adds an UpdateFunc handler (the old Helm code only handled Add/Delete). This means metadata-only changes to a Helm-managed ConfigMap/Secret can now enqueue events that previously were silent. handleHelmResource already deduplicates by Helm revision, so no extra metadata sends leak through — verified by the existing Helm test suite.
  • Leader election semantics: the CRD forwarder previously ran inside a <-mgr.Elected() goroutine, so was effectively leader-only. The new NeedLeaderElection() bool { return true } preserves that. Operators with --enable-leader-election=false still get sends because controller-runtime treats LeaderElectionRunnable returning true as "run unconditionally when leader election is disabled".
  • Heartbeat cadence: old CRD code polled every minute and re-sent unchanged objects after 10 minutes, so the heartbeat wave was staggered. New code is event-driven plus a single 10-minute ticker that re-sends every snapshot. Net frequency is similar; cadence shape is slightly different.

Minimum Agent Versions

  • Agent: N/A (no agent-side changes)
  • Cluster Agent: N/A (no cluster-agent-side changes)

Describe your test plan

Unit tests (added with this PR, all passing on workspace):

  • Test_EncodeDecodeKey, Test_InformerWorkQueue_DispatchesAdd, Test_InformerWorkQueue_DispatchesDelete, Test_InformerWorkQueue_HeartbeatFires, Test_InformerWorkQueue_CanListWatch_Allowed — engine mechanics
  • Test_CRDProcessKey_NewCRDSends, Test_CRDProcessKey_UnchangedSkipsSend, Test_CRDProcessKey_NotFoundIsNoOp, Test_CRDHandleDelete, Test_CRDHeartbeatResendsAllSnapshots — CRD informer flow (httptest server + seeded kube-system Namespace + DD_API_KEY/DD_URL env vars for credentials)
  • All 7 existing Helm tests (Test_workqueueInitialization, Test_parseHelmResource, Test_releaseSnapshots, Test_revisionLogic, Test_buildSnapshot, Test_snapshotToReleaseData, Test_mergeValues, Test_buildPayload) — unchanged, still pass

e2e test

  • Deploy operator with this branch to a cluster with a DatadogAgent CR
  • kubectl edit dda on a label/annotation/spec field, watch operator logs for Sending metadata within seconds (V(1) info)
  • Confirm an unchanged DDA still produces a heartbeat send after 10 minutes idle
  • kubectl delete dda and confirm the snapshot is cleared (V(1) Removed deleted CRD from snapshot store)

Checklist

  • PR has at least one valid label: refactoring
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed

@zhuminyi zhuminyi requested a review from a team May 6, 2026 03:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f966929809


Comment thread pkg/controller/utils/metadata/crd_metadata.go
@zhuminyi zhuminyi added this to the v1.28.0 milestone May 6, 2026
@zhuminyi zhuminyi added the enhancement New feature or request label May 6, 2026
@codecov-commenter

codecov-commenter commented May 6, 2026

Codecov Report

❌ Patch coverage is 40.57971% with 205 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.76%. Comparing base (aad76e3) to head (45b81c9).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/utils/metadata/crd_metadata.go 37.50% 85 Missing and 5 partials ⚠️
pkg/controller/utils/metadata/informer_runner.go 50.00% 69 Missing and 6 partials ⚠️
pkg/controller/utils/metadata/helm_metadata.go 24.44% 34 Missing ⚠️
cmd/main.go 0.00% 6 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2979      +/-   ##
==========================================
+ Coverage   41.40%   41.76%   +0.36%     
==========================================
  Files         331      332       +1     
  Lines       28908    29008     +100     
==========================================
+ Hits        11969    12116     +147     
+ Misses      16084    16025      -59     
- Partials      855      867      +12     
Flag Coverage Δ
unittests 41.76% <40.57%> (+0.36%) ⬆️


Files with missing lines Coverage Δ
cmd/main.go 6.66% <0.00%> (-0.02%) ⬇️
pkg/controller/utils/metadata/helm_metadata.go 40.67% <24.44%> (+14.78%) ⬆️
pkg/controller/utils/metadata/informer_runner.go 50.00% <50.00%> (ø)
pkg/controller/utils/metadata/crd_metadata.go 47.31% <37.50%> (+4.79%) ⬆️

... and 3 files with indirect coverage changes


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aad76e3...45b81c9.

zhuminyi added 2 commits May 5, 2026 23:14
Adds pkg/controller/utils/metadata/informer_runner.go: a shared engine
for event-driven metadata forwarders. Owns a rate-limited workqueue,
worker pool, optional heartbeat ticker, RBAC pre-check via
SelfSubjectAccessReview, and an AddWatch helper that registers an
informer with filter and Add/Update/Delete handlers.

Forwarders register watches via AddWatch and supply ProcessFunc /
DeleteFunc / HeartbeatFunc callbacks. Queue keys encode kind +
namespace + name as "<Kind>/<namespace>/<name>", with a "delete:"
prefix for deletion events.

WatchTarget carries an explicit API Group for the RBAC pre-check;
callers pass "datadoghq.com" for Datadog CRDs and "" for core
resources like ConfigMap/Secret. Without this, SelfSubjectAccessReview
defaults to the core group and would deny list/watch on any CRD even
when the operator's RBAC permits it.

This commit only adds the engine and its unit tests; existing
forwarders are migrated in follow-up commits.
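The group-default pitfall the commit message describes can be shown with a small simulation. The real pre-check issues a SelfSubjectAccessReview (authorization.k8s.io/v1) against the API server; here a static permission table stands in for the operator's granted RBAC, and `canListWatch` is a hypothetical helper name.

```go
package main

import "fmt"

// resourceAttributes mirrors the fields a SelfSubjectAccessReview fills in
// (group, resource, verb); simulated against a static table instead of a
// live API server.
type resourceAttributes struct {
	Group    string // "" means the core API group
	Resource string
	Verb     string
}

// grantedRBAC stands in for the operator's ClusterRole rules.
var grantedRBAC = map[resourceAttributes]bool{
	{Group: "datadoghq.com", Resource: "datadogagents", Verb: "list"}:  true,
	{Group: "datadoghq.com", Resource: "datadogagents", Verb: "watch"}: true,
	{Group: "", Resource: "configmaps", Verb: "list"}:                  true,
	{Group: "", Resource: "configmaps", Verb: "watch"}:                 true,
}

// canListWatch is the pre-check: both verbs must be allowed for the resource
// in the *stated* group. Leaving Group empty defaults to the core group and
// would wrongly deny a CRD even when RBAC on its group permits it.
func canListWatch(group, resource string) bool {
	return grantedRBAC[resourceAttributes{group, resource, "list"}] &&
		grantedRBAC[resourceAttributes{group, resource, "watch"}]
}

func main() {
	fmt.Println(canListWatch("datadoghq.com", "datadogagents")) // explicit group: allowed
	fmt.Println(canListWatch("", "datadogagents"))              // core-group default: denied
}
```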
Pure refactor. Helm metadata send behavior is unchanged; only the
implementation moves onto the shared engine introduced in the previous
commit.

- Removes inline workqueue, RBAC pre-check, worker pool, and ticker
  loop in favor of InformerWorkQueue.
- processKey now dispatches by kind ("ConfigMap" / "Secret") encoded in
  the queue key, replacing the prior "try as ConfigMap then Secret"
  fallback.
- handleDelete and the periodic heartbeat are now plain callbacks the
  engine invokes.
- Watches register with Group="" since ConfigMap and Secret live in the
  core API group.
- Note: shared handlers include UpdateFunc; the previous Helm code
  registered Add/Delete only. Helm release dedup in handleHelmResource
  (revision check) absorbs any extra events, so no behavior leak.

All existing Helm tests continue to pass without modification beyond
the queue-initialization assertion (now reads through runner.QueueLen).
@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented May 6, 2026

Code Coverage


🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 38.49% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 38.49%
Overall Coverage: 41.92% (+0.37%)

Commit SHA: 45b81c9

@zhuminyi zhuminyi force-pushed the minyi/contp-1639-crd-metadata-informer branch from f966929 to 7e86a2a Compare May 6, 2026 03:14
feat(metadata): make CRDMetadataForwarder event-driven via InformerWorkQueue

Replaces the 1-minute polling loop with informer-driven sends. CRD spec
changes now propagate to the metadata pipeline within ~1 second instead
of up to 60 seconds.

CRDMetadataForwarder changes:
- Constructor takes manager.Manager; implements manager.Runnable and
  manager.LeaderElectionRunnable so the manager starts it after cache
  sync and dedupes sends across operator replicas.
- Start registers AddWatch for DatadogAgent, DatadogAgentInternal, and
  DatadogAgentProfile against the manager cache, each with
  Group="datadoghq.com" so the RBAC pre-check matches the operator's
  granted permissions on that group.
- processKey Gets the typed object from the cache, builds a CRDInstance,
  hashes against a sync.Map of last-sent snapshots keyed by
  "<Kind>/<namespace>/<name>", and sends only if new or changed.
- handleDelete clears the snapshot for a deleted CRD.
- A 10-minute heartbeat re-sends every snapshot for safety.
- Removes the polling loop and its helpers (sendMetadata,
  getAllActiveCRDs, getCRDsToSend, cleanupDeletedCRDs, buildCacheKey).

cmd/main.go: setupAndStartCRDMetadataForwarder now takes manager.Manager
and registers the forwarder via mgr.Add, alongside the Helm forwarder.
The leader-elected goroutine still runs the operator metadata forwarder
which remains a poll loop.

Tests cover the new processKey paths (new send, unchanged skip,
NotFound no-op), handleDelete, and heartbeat re-send.
@zhuminyi zhuminyi force-pushed the minyi/contp-1639-crd-metadata-informer branch from 7e86a2a to 45b81c9 Compare May 6, 2026 05:41
