
[CASCL-1323] Add update subcommand to kubectl-datadog autoscaling cluster#2976

Merged
L3n41c merged 8 commits into main from lenaic/CASCL-1323-kubectl-datadog-cluster-update
May 7, 2026

Conversation

L3n41c (Member) commented May 5, 2026

What does this PR do?

Adds a new kubectl datadog autoscaling cluster update subcommand that refreshes a Karpenter installation previously deployed by kubectl datadog autoscaling cluster install. The command:

  • Refuses to run if no kubectl-datadog Karpenter is present, or if a Karpenter installed by another tool occupies the cluster — we never modify what we did not install.
  • Auto-detects the immutable parameters (KarpenterNamespace, InstallMode, FargateSubnets) from the dd-karpenter CloudFormation stack so the user does not have to repeat them on every refresh.
  • Exposes flags only for the mutable knobs (--karpenter-version, --create-karpenter-resources, --inference-method, --debug). Trying to pass an immutable flag (--install-mode, --karpenter-namespace, --fargate-subnets) produces a clear unknown flag error from cobra/pflag — to change those, the documented path is uninstall + install.

Motivation

Today, "upgrading" a kubectl-datadog Karpenter install means re-running install with exactly the same flags (mode, namespace, fargate subnets). If the user forgets one, the install command's immutability check raises a technical error — or, worse, the user accidentally tweaks an immutable parameter. There was also no way to require that an existing kubectl-datadog Karpenter be present: install happily treats a foreign Karpenter as a no-op success, which is the wrong UX for an explicit upgrade.

This is tracked under Jira CASCL-1323.

Additional Notes

The PR is split into three commits to ease review:

  1. Refactor (no behavior change):
     • cluster/install/{guess,k8s}/ are elevated to cluster/{guess,k8s}/ (now used by install, update and uninstall).
     • cluster/install/steps.go becomes cluster/apply/run.go, exposing a package-level Run(ctx, streams, configFlags, clientset, RunOptions) error plus stack-name helpers (KarpenterStackName, DDKarpenterStackName, InstallModeTagKey, DetectedInstallMode).
     • The pflag types CreateKarpenterResources and InferenceMethod move to cluster/apply/.
     • InstallMode data moves to cluster/apply/install_mode.go while the pflag wrapper stays in cluster/install/install_mode.go (only install binds --install-mode).
     • cluster/uninstall/ and cluster/install/ migrate to the new helpers.
     • clients.ResolveClusterName extracts the explicit-or-kubeconfig fallback used by all three commands.
  2. Feature: the cluster/update/ cobra command + unit tests covering resolveOptions (auto-detection, legacy stack without install-mode tag, corrupt tag, missing namespace, fargate without subnets) and validate.
  3. E2E: TestAutoscalingUpdate walks install (fargate) → update (auto-detect) → idempotent re-update → --create-karpenter-resources=all → unknown-flag rejection → uninstall, asserting Karpenter remains installed and pods schedule throughout. TestAutoscalingUpdateRefusesWhenNoInstall asserts the "no Karpenter installation found" error path.

Minimum Agent Versions

No agent dependency.

  • Agent: n/a
  • Cluster Agent: n/a

Describe your test plan

Unit tests on cmd/kubectl-datadog/autoscaling/cluster/...:

make kubectl-datadog
make lint
go test ./cmd/kubectl-datadog/autoscaling/cluster/...

E2E (the existing autoscaling e2e suite on EKS):

AWS_PROFILE=… AWS_REGION=eu-west-3 \
go test -v ./test/e2e/tests/autoscaling_suite/ -run TestAutoscalingUpdate

Manual smoke (against a dev EKS cluster — prefix with the AWS env vars):

./bin/kubectl-datadog autoscaling cluster install --cluster-name <c> --install-mode=fargate
./bin/kubectl-datadog autoscaling cluster update --cluster-name <c>                          # auto-detect
./bin/kubectl-datadog autoscaling cluster update --cluster-name <c> --install-mode=fargate   # rejected: unknown flag
./bin/kubectl-datadog autoscaling cluster update --cluster-name <c> --create-karpenter-resources=all
./bin/kubectl-datadog autoscaling cluster update --cluster-name <virgin>                     # rejected: no Karpenter
./bin/kubectl-datadog autoscaling cluster uninstall --cluster-name <c> --yes

Checklist

  • PR has at least one valid label: enhancement
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

🤖 Generated with Claude Code

L3n41c and others added 3 commits May 5, 2026 18:44
… + shared subpackages

Prepare ground for the upcoming `kubectl datadog autoscaling cluster
update` subcommand by extracting the convergence logic that install
will share with update into a new `apply` package, and by elevating
`guess` and `k8s` to siblings of the cobra commands now that several
callers depend on them.

* `cluster/install/{guess,k8s}/` → `cluster/{guess,k8s}/`. install,
  uninstall and the new apply package all import them; keeping them
  under install/ would create awkward `…/install/guess` import paths
  in the unrelated callers.
* `cluster/install/steps.go` (and its test) → `cluster/apply/run.go`,
  reshaped into a package-level `Run(ctx, streams, configFlags,
  clientset, RunOptions) error` plus exported helpers
  (`KarpenterStackName`, `DDKarpenterStackName`, `InstallModeTagKey`,
  `DetectedInstallMode`). `RunOptions` replaces what used to be eight
  package-level globals mutated by cobra flag binding. Display
  helpers now take `genericclioptions.IOStreams` instead of
  `*cobra.Command`.
* `cluster/install/{create_karpenter_resources,inference_method}.go`
  (with their tests) → `cluster/apply/`. Both flags are bound by
  install today and by update tomorrow, so their pflag.Value methods
  belong where both callers can reach them.
* `cluster/install/install_mode.go` is split: the bare type plus
  `DetectedInstallMode` go to `cluster/apply/install_mode.go`; the
  pflag.Value wrapper stays in `cluster/install/install_mode.go` as
  a package-private type around `apply.InstallMode`. Only install
  binds `--install-mode`, so the wrapper has a single consumer.
* `cluster/install/install.go` becomes a thin cobra wrapper: flags
  bind to fields on the local `options` struct (no more globals),
  `RunE` builds an `apply.RunOptions{ActionLabel: "Installing"}` and
  delegates to `apply.Run`.
* `cluster/uninstall/uninstall.go` migrates inline stack-name
  concatenations to `apply.KarpenterStackName` and
  `apply.DDKarpenterStackName`, eliminating the duplication.
* `cluster/common/clients/clients.go` gains `ResolveClusterName`,
  collapsing the explicit-flag-or-kubeconfig fallback that install,
  update and uninstall would otherwise repeat.

Behaviour preserved end to end; tests cover the moved code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cluster`

Refresh a previously-installed kubectl-datadog Karpenter deployment
without forcing the user to repeat the immutable flags they passed to
`install`.

`update` adds two values on top of an idempotent `install` re-run:

1. **Safety guard**: refuses if no kubectl-datadog Karpenter is
   present, or if a Karpenter installed by another tool is found. We
   never mutate an installation we did not create. The check runs
   before any AWS API call so a flaky AWS does not mask the
   "run install first" signal.
2. **Auto-detection of immutable parameters**: reads the
   dd-karpenter CFN stack and pulls KarpenterNamespace (parameter),
   InstallMode (`install-mode` tag) and FargateSubnets (parameter)
   from it. None of those are exposed as flags on `update` — they
   are CFN CreateOnly properties, so accepting them and then
   rejecting on mismatch would be UX noise. Changing one is now a
   documented `uninstall` + `install` operation.

Mutable knobs only on the cobra command:
* `--cluster-name`
* `--karpenter-version`
* `--create-karpenter-resources` (default `none` — preserves any
  hand-edits to EC2NodeClass / NodePool; pass `=all` to regenerate)
* `--inference-method`
* `--debug`

A drift check between the running Deployment's namespace and the
namespace recorded in the CFN stack errors out before reaching
`apply.Run`, so `update` surfaces the inconsistency loudly instead of
falling into apply's "foreign install" no-op success path.

The convergence work itself is delegated to `apply.Run` with
`ActionLabel: "Updating"`, so install and update share the CFN /
Helm / aws-auth / NodePool logic — no behavior duplication.

Unit tests cover `resolveOptions` (auto-detection happy paths,
legacy stack without install-mode tag, corrupt tag, missing
KarpenterNamespace, fargate without FargateSubnets) and `validate`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new tests join the existing autoscaling suite:

* `TestAutoscalingUpdate` walks through install (fargate)
  → update (auto-detect) → idempotent re-update → update with
  `--create-karpenter-resources=all` → unknown-flag rejection
  → uninstall, asserting Karpenter remains installed and pods
  schedule throughout. The unknown-flag sub-test feeds
  `--install-mode=existing-nodes` to verify pflag rejects the
  immutable flag with "unknown flag" rather than silently
  re-applying the install.
* `TestAutoscalingUpdateRefusesWhenNoInstall` runs `update`
  against a cluster with no Karpenter and asserts the command
  fails with "no Karpenter installation found" — the error path
  pointing the user at `install` first.

A small `testUpdate` helper mirrors `testInstall` so subtests
stay terse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
datadog-official (Bot) commented May 5, 2026

Code Coverage


🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 19.35% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 19.35%
Overall Coverage: 41.49% (-0.06%)

Commit SHA: 2a48e4f

codecov-commenter commented May 5, 2026

Codecov Report

❌ Patch coverage is 19.49153% with 190 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.33%. Comparing base (4e9c1ab) to head (2a48e4f).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...bectl-datadog/autoscaling/cluster/update/update.go 34.61% 68 Missing ⚠️
...d/kubectl-datadog/autoscaling/cluster/apply/run.go 8.45% 65 Missing ⚠️
...ctl-datadog/autoscaling/cluster/install/install.go 11.76% 30 Missing ⚠️
...adog/autoscaling/cluster/common/clients/clients.go 0.00% 9 Missing ⚠️
...l-datadog/autoscaling/cluster/common/k8s/object.go 0.00% 9 Missing ⚠️
...datadog/autoscaling/cluster/uninstall/uninstall.go 0.00% 8 Missing ⚠️
cmd/kubectl-datadog/autoscaling/cluster/cluster.go 0.00% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (19.49%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2976      +/-   ##
==========================================
- Coverage   41.39%   41.33%   -0.07%     
==========================================
  Files         331      332       +1     
  Lines       28911    29042     +131     
==========================================
+ Hits        11969    12005      +36     
- Misses      16086    16181      +95     
  Partials      856      856              
Flag Coverage Δ
unittests 41.33% <19.49%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...caling/cluster/apply/create_karpenter_resources.go 100.00% <ø> (ø)
...adog/autoscaling/cluster/apply/inference_method.go 100.00% <ø> (ø)
...-datadog/autoscaling/cluster/apply/install_mode.go 100.00% <ø> (ø)
...ectl-datadog/autoscaling/cluster/guess/aws-auth.go 0.00% <ø> (ø)
...tadog/autoscaling/cluster/guess/clusterauthmode.go 0.00% <ø> (ø)
...l-datadog/autoscaling/cluster/guess/clustername.go 65.00% <ø> (ø)
...l-datadog/autoscaling/cluster/guess/eksautomode.go 88.88% <ø> (ø)
...g/autoscaling/cluster/guess/ekspodidentityagent.go 0.00% <ø> (ø)
...ctl-datadog/autoscaling/cluster/guess/karpenter.go 94.28% <ø> (ø)
...g/autoscaling/cluster/guess/nodegroupproperties.go 0.00% <ø> (ø)
... and 13 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4e9c1ab...2a48e4f.


L3n41c and others added 2 commits May 5, 2026 20:54
…pflag types

The earlier split of `install_mode.go` into a bare type in `apply/`
and a pflag wrapper in `install/` added boilerplate (a wrapper type,
a conversion at every call site) without buying clarity. Move the
whole file to `apply/` like `create_karpenter_resources.go` and
`inference_method.go` already are: the pflag methods live where the
type lives, and the `install` cobra command binds `apply.InstallMode`
directly.

`update` does not bind `--install-mode`, so the pflag methods just go
unused there — same situation as `apply.InstallMode` itself before
this commit. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`InstallModeTagKey` and `DetectedInstallMode` are CloudFormation-stack
concerns, not pflag.Value plumbing. Putting them in `install_mode.go`
made that file diverge from `inference_method.go` and
`create_karpenter_resources.go`, which both moved over from `install`
with only a package rename.

Move them to `run.go` next to `KarpenterStackName` /
`DDKarpenterStackName`, where the rest of the CFN logic lives. The
three pflag-type files in `apply/` are now structurally symmetric:
data + pflag methods, nothing else.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c changed the title [CASCL-1323] Add 'update' subcommand to kubectl-datadog autoscaling cluster [CASCL-1323] Add update subcommand to kubectl-datadog autoscaling cluster May 5, 2026
`commonk8s.CreateOrUpdate` does a Get-then-Update, snapshotting
resourceVersion in between. That races against any controller that
reconciles the same resource — Karpenter mutates a NodePool's
.status (and an EC2NodeClass's status conditions) every time a
NodeClaim transitions, which bumps resourceVersion under us. The
window is harmless during `install` (the resource was just created
and Karpenter is not running yet) but it is the normal state during
`update --create-karpenter-resources=all`, and it surfaced as a real
e2e failure on the new `TestAutoscalingUpdate` sub-test:

  failed to update NodePool dd-karpenter-e2mz4: ... the object has
  been modified; please apply your changes to the latest version
  and try again

Wrapping the Get/Update in `retry.RetryOnConflict(retry.DefaultRetry,
…)` is the standard k8s pattern: re-fetch, re-apply, retry. The
default backoff caps at five attempts spread over ~2s, which is
plenty for the small status-only writes Karpenter performs and
keeps the failure mode loud if something else is genuinely wrong.

The fix also benefits install's NodePool / EC2NodeClass writes and
the `dd-cluster-info` ConfigMap snapshot — same call path, same
race once a controller starts touching the namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c force-pushed the lenaic/CASCL-1323-kubectl-datadog-cluster-update branch from c3f919d to 067505c on May 6, 2026 10:05
`browser.Stdout` and `browser.Stderr` accept any `io.Writer`. The
shim only mattered if `streams.Out` / `streams.ErrOut` could be nil,
but every other call site in this file (`display.PrintBox`,
`fmt.Fprintf`) already assumes they are not — `genericclioptions.IOStreams`
is constructed with valid writers by the cobra runtime, so the shim
was dead defensive code. Drop it and pass the streams through
directly, matching the rest of the package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c marked this pull request as ready for review May 6, 2026 12:27
@L3n41c L3n41c requested a review from a team May 6, 2026 12:27
@L3n41c L3n41c requested a review from a team as a code owner May 6, 2026 12:27
clamoriniere (Collaborator) left a comment:

/lgtm

@L3n41c L3n41c added this to the v1.28.0 milestone May 7, 2026
@L3n41c L3n41c merged commit d3652e6 into main May 7, 2026
42 of 43 checks passed
@L3n41c L3n41c deleted the lenaic/CASCL-1323-kubectl-datadog-cluster-update branch May 7, 2026 07:38
L3n41c added a commit that referenced this pull request May 7, 2026
The merge of main into the feature branch (commit 87c6e46) brought in
PR #2976's `update.go`, which referenced `guess.FindKarpenterInstallation`
— the symbol our PR moved to `karpenter.FindInstallation`. The
resolution updated all other call sites but missed this one, so the
build broke on the merge commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>