[CASCL-1304] kubectl-datadog: enrich dd-cluster-info ConfigMap#2980

Draft
L3n41c wants to merge 1 commit into main from lenaic/CASCL-1304-enrich-dd-cluster-info

Conversation

@L3n41c (Member) commented May 6, 2026

What does this PR do?

Enriches the dd-cluster-info ConfigMap (introduced by #2945) so a future migration tool can:

  • Distinguish Datadog-managed node-management entities (Fargate profiles, Karpenter NodePools) from legacy ones to drain — via a per-entity managedByDatadog flag.
  • See the running Karpenter installation (version, namespace, ownership) under a new autoscaling parent that also groups the existing clusterAutoscaler entry and a new eksAutoMode entry.

As a side effect, two detection helpers (FindKarpenterInstallation, IsEKSAutoModeEnabled) move from install/guess/ to the new common/karpenter/ and common/eksautomode/ packages so the clusterinfo classifier can reuse them. A generic commonk8s.FindFirstDeployment factors out the shared pager+predicate scan, and commonk8s.ExtractDeploymentVersion factors out the controller-image-tag → label fallback used by both the Karpenter and Cluster Autoscaler detectors.
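To make the shape of the two new helpers concrete, here is a minimal, self-contained sketch. It is an illustration only: the real commonk8s.FindFirstDeployment and commonk8s.ExtractDeploymentVersion operate on client-go objects via a pager, and their exact signatures are not shown in this PR, so the Deployment struct and function signatures below are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Deployment is a minimal stand-in for appsv1.Deployment; the real
// helpers work on client-go objects fetched through a list pager.
type Deployment struct {
	Name   string
	Image  string // controller container image
	Labels map[string]string
}

// findFirstDeployment scans lazily fetched pages of deployments and
// returns the first one matching the predicate. nextPage returns nil
// when there are no more pages. Hypothetical simplification of the
// shared pager+predicate scan described above.
func findFirstDeployment(nextPage func() []Deployment, pred func(Deployment) bool) (Deployment, bool) {
	for page := nextPage(); page != nil; page = nextPage() {
		for _, d := range page {
			if pred(d) {
				return d, true
			}
		}
	}
	return Deployment{}, false
}

// extractDeploymentVersion mirrors the controller-image-tag → label
// fallback: prefer the image tag, and fall back to the standard
// app.kubernetes.io/version label when the image has no tag.
func extractDeploymentVersion(d Deployment) string {
	// A colon introduces a tag only if no "/" follows it (otherwise it
	// is a registry port, e.g. localhost:5000/img).
	if i := strings.LastIndex(d.Image, ":"); i >= 0 && !strings.Contains(d.Image[i:], "/") {
		return d.Image[i+1:]
	}
	return d.Labels["app.kubernetes.io/version"]
}

func main() {
	pages := [][]Deployment{
		{{Name: "coredns", Image: "registry.k8s.io/coredns:v1.11.1"}},
		{{Name: "karpenter", Image: "public.ecr.aws/karpenter/controller:1.3.2"}},
	}
	i := 0
	next := func() []Deployment {
		if i >= len(pages) {
			return nil
		}
		p := pages[i]
		i++
		return p
	}
	d, ok := findFirstDeployment(next, func(d Deployment) bool { return d.Name == "karpenter" })
	fmt.Println(ok, extractDeploymentVersion(d)) // true 1.3.2
}
```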

Motivation

Follow-up to #2945 (CASCL-1304). The original snapshot only captured nodes grouped by their owning manager. The future migration tool also needs to distinguish Datadog-managed managers (to keep) from legacy ones (to drain), to know whether Karpenter is already running, and to know whether EKS auto-mode is active so it can short-circuit when there is no migration to drive.

Additional Notes

  • Schema change is breaking, on purpose: no consumer of dd-cluster-info exists yet (grep -r "dd-cluster-info\|ConfigMapDataKey" --include='*.go' returns only the writer). APIVersion stays at v1.
  • NodePool ownership detection uses the broader autoscaling.datadoghq.com/created label alone — not uninstall.go's AND-pair with app.kubernetes.io/managed-by: kubectl-datadog. The cluster agent creates NodePools with only the created label, and the migration tool must preserve them too. This divergence is deliberate.
  • Fargate profile ownership detection reads tags via EKS.DescribeFargateProfile. The expected managed-by: kubectl-datadog tag is propagated automatically from the CloudFormation stack tags written by common/aws/cloudformation.go, so no infrastructure change is needed.
  • Datadog-managed NodePools with no nodes yet (typical right after install) are seeded into the snapshot with an empty nodes list so the migration tool sees the destination NodePools exist.
  • Best-effort detection: every new external API call (EKS DescribeFargateProfile, NodePool list, Discovery for auto-mode) is tolerated — transient errors / missing CRDs log a warning and leave entries unflagged rather than failing the snapshot. The call site (recordClusterInfo) was already best-effort before this PR.
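The deliberately broader ownership rule from the second note can be sketched as follows. This is a hypothetical illustration of the described rule, not the PR's actual code; the label key is taken from the note above, and the function name is invented.

```go
package main

import "fmt"

// createdLabel is the single label the PR uses for NodePool ownership,
// deliberately broader than uninstall.go's AND-pair with
// app.kubernetes.io/managed-by: kubectl-datadog.
const createdLabel = "autoscaling.datadoghq.com/created"

// managedByDatadog reports whether a Karpenter NodePool's labels mark it
// as Datadog-managed. Checking the created label alone means NodePools
// created by the cluster agent (which lack the managed-by label) are
// preserved by the migration tool too.
func managedByDatadog(labels map[string]string) bool {
	_, ok := labels[createdLabel]
	return ok
}

func main() {
	kubectlCreated := map[string]string{
		createdLabel:                   "true",
		"app.kubernetes.io/managed-by": "kubectl-datadog",
	}
	agentCreated := map[string]string{createdLabel: "true"}
	foreign := map[string]string{"team": "platform"}
	// kubectl-created and agent-created are both managed; foreign is not.
	fmt.Println(managedByDatadog(kubectlCreated), managedByDatadog(agentCreated), managedByDatadog(foreign))
}
```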

Minimum Agent Versions

  • Agent: N/A (kubectl-datadog plugin only)
  • Cluster Agent: N/A (kubectl-datadog plugin only)

Describe your test plan

Automated coverage added in this PR:

  • TestClassify_KarpenterNodePoolOwnership: kubectl-datadog (both labels) + cluster agent (created label only) + Datadog NodePool with no nodes yet + foreign NodePool.
  • TestClassify_KarpenterNodePoolOwnership_NoCRD: tolerant of meta.IsNoMatchError when the Karpenter CRD is not installed.
  • TestClassify_FargateProfileOwnership / TestClassify_FargateProfileOwnership_DescribeError: tag-based detection + AWS API error fallback.
  • TestClassify_KarpenterDetection: version extraction from controller image tag, ManagedByDatadog/InstallerVersion from sentinel labels.
  • TestClassify_EKSAutoMode: discovery API exposes nodeclassesEnabled: true.
  • TestPersist_YAMLShape: pins lowerCamelCase wire keys against the gopkg.in/yaml.v3 lower-case-by-default footgun.
  • TestFindFirstDeployment_*: covers the new generic helper.

Manual test plan on a sandbox EKS cluster:

  • kubectl datadog autoscaling cluster install succeeds.
  • kubectl get cm -n dd-karpenter dd-cluster-info -o yaml contains:
    • autoscaling.karpenter.{present, version, managedByDatadog, installerVersion}
    • autoscaling.eksAutoMode.enabled
    • nodeManagement.fargate."dd-karpenter-<cluster>".managedByDatadog: true
  • At least one NodePool created by kubectl-datadog appears under nodeManagement.karpenter with managedByDatadog: true, even before any node has landed on it.
  • A hand-created foreign NodePool (no Datadog labels) does NOT carry managedByDatadog: true.
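Assembled from the keys in the checks above, the enriched ConfigMap data might look roughly like this. This is an illustrative sketch only: the data key name, nesting of fields not listed above, and all values are assumptions, and `<cluster>` stays a placeholder for the EKS cluster name.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dd-cluster-info
  namespace: dd-karpenter
data:
  cluster-info: |            # data key name is illustrative, not confirmed by the PR
    apiVersion: v1
    autoscaling:
      karpenter:
        present: true
        version: "1.3.2"          # example value
        managedByDatadog: true
        installerVersion: "0.9.0" # example value
      clusterAutoscaler:
        present: false
      eksAutoMode:
        enabled: false
    nodeManagement:
      fargate:
        dd-karpenter-<cluster>:   # <cluster> is the EKS cluster name
          managedByDatadog: true
      karpenter:
        dd-nodepool-example:      # illustrative NodePool name
          managedByDatadog: true
          nodes: []               # seeded empty when no node has landed yet
```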

Checklist

  • PR has at least one valid label: enhancement, refactoring
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed

The dd-cluster-info ConfigMap (introduced by #2945) now records:

- the running Karpenter installation (version, namespace, ownership)
  under a new `autoscaling` parent that also groups the existing
  clusterAutoscaler entry and a new eksAutoMode entry,
- a `managedByDatadog` flag per node-management entity (Fargate
  profile, Karpenter NodePool), so a future migration tool can
  distinguish Datadog-managed entities to keep from legacy ones to
  drain.

Detection helpers `FindKarpenterInstallation` and `IsEKSAutoModeEnabled`
move from `install/guess/` to new `common/karpenter/` and
`common/eksautomode/` packages so the clusterinfo classifier can reuse
them. A generic `commonk8s.FindFirstDeployment` factors out the shared
pager+predicate scan, and `commonk8s.ExtractDeploymentVersion` factors
out the controller-image-tag → label fallback used by both detectors.

Karpenter NodePool ownership uses the broader
`autoscaling.datadoghq.com/created` label only (vs. uninstall's AND-pair
with `app.kubernetes.io/managed-by: kubectl-datadog`) so NodePools
managed by the Datadog cluster agent are also preserved by the
migration tool. Datadog-managed NodePools with no nodes yet (typical
right after install) are seeded into the snapshot with an empty Nodes
list so the migration tool sees the destination NodePools exist.

Fargate profile ownership reads tags via EKS DescribeFargateProfile;
the `managed-by: kubectl-datadog` tag is propagated automatically from
the CloudFormation stack tags, so no infrastructure change is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter commented May 6, 2026

Codecov Report

❌ Patch coverage is 72.15190% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.42%. Comparing base (d1d2b65) to head (992bd28).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...tadog/autoscaling/cluster/common/k8s/deployment.go 45.45% 18 Missing ⚠️
...autoscaling/cluster/common/clusterinfo/classify.go 84.94% 10 Missing and 4 partials ⚠️
...bectl-datadog/autoscaling/cluster/install/steps.go 20.00% 12 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2980      +/-   ##
==========================================
+ Coverage   41.39%   41.42%   +0.02%     
==========================================
  Files         331      332       +1     
  Lines       28911    28984      +73     
==========================================
+ Hits        11969    12007      +38     
- Misses      16086    16118      +32     
- Partials      856      859       +3     
Flag Coverage Δ
unittests 41.42% <72.15%> (+0.02%) ⬆️


Files with missing lines Coverage Δ
...oscaling/cluster/common/eksautomode/eksautomode.go 88.88% <100.00%> (ø)
.../autoscaling/cluster/common/karpenter/karpenter.go 92.85% <100.00%> (ø)
...bectl-datadog/autoscaling/cluster/install/steps.go 19.51% <20.00%> (-0.49%) ⬇️
...autoscaling/cluster/common/clusterinfo/classify.go 86.95% <84.94%> (-3.96%) ⬇️
...tadog/autoscaling/cluster/common/k8s/deployment.go 45.45% <45.45%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
@L3n41c L3n41c changed the title [CASCL-1304] kubectl-datadog: enrich dd-cluster-info ConfigMap [CASCL-1304] kubectl-datadog: enrich dd-cluster-info ConfigMap May 6, 2026