Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ For the namespaced variant, see [Proactive Policies](./Proactive-Policies.md). F
| `spec.scope.labelSelector` | none | Kubernetes label selector for matching workloads. |
| `spec.scope.workloadTypes` | `[Deployment, StatefulSet, CronJob, Rollout, Job, AnalysisRun, DaemonSet]` | Workload kinds this policy applies to. |
| `spec.scope.namespaceSelector.operator` | none | Namespace selector operator: `In` or `NotIn`. |
| `spec.scope.namespaceSelector.values` | none | Namespace patterns to include or exclude. |
| `spec.scope.namespaceSelector.values` | none | Namespace patterns to include or exclude (supports `*` wildcards, e.g. `prod-*`). |
| `spec.automationStrategyRef.name` | none | Required cluster strategy name. |
| `spec.weight` | `0` | Higher weight wins when multiple proactive policies match. |
| `spec.safetyChecks.maxAnalysisAgeDays` | `5` | Rejects old recommendations. |
Expand Down
12 changes: 12 additions & 0 deletions charts/kubex-automation-engine/docs/Policy-Configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,18 @@ Those resources remain fully supported by the controller and can be managed outs
| `policy.policies.<name>.safetyChecks.maxAnalysisAgeDays` | `ClusterProactivePolicy.spec.safetyChecks.maxAnalysisAgeDays` | Per-policy value wins over top-level `policy.safetyChecks.maxAnalysisAgeDays`. |
| `policy.safetyChecks.maxAnalysisAgeDays` | `ClusterProactivePolicy.spec.safetyChecks.maxAnalysisAgeDays` | Backward-compatible fallback when not set per policy. |

### Namespace wildcards in `scope[].namespaces.values`
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-engine/docs/Policy-Configuration.md:75

Issue: The new section heading references scope[].namespaces.values but all other documentation and the ClusterProactivePolicy CRD field path use scope.namespaceSelector.values. The Cluster-Proactive-Policies.md doc updated in this same PR correctly refers to spec.scope.namespaceSelector.values.

Why it matters: Users searching the docs for the correct field path will find two different names for the same concept (namespaces.values vs namespaceSelector.values), leading to confusion and likely misconfiguration.

Suggested fix: Align the heading and prose to match the canonical field name used in the CRD and other docs:

### Namespace wildcards in `scope[].namespaceSelector.values`

`scope[].namespaceSelector.values` supports shell-style `*` wildcards ...


`scope[].namespaces.values` supports shell-style `*` wildcards when matching namespace names (for example: `prod-*`).

```yaml
scope:
- name: platform
namespaces:
operator: In
values: ["prod-*", "staging"]
```

Important:

- `maxAnalysisAgeDays` is written to generated `ClusterProactivePolicy` resources, not to generated strategies.
Expand Down
28 changes: 28 additions & 0 deletions charts/kubex-automation-engine/docs/Troubleshooting.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,34 @@ Use this sequence when rightsizing does not happen as expected.

For a consolidated map of the controller's safety gates, see [Safety Controls](./Safety-Controls.md).

## 0. Temporarily Enable Debug Logging (and Revert)

Most of the time you only want debug logs briefly. The quickest way is to update the live Deployment args (this triggers a rollout and will be overwritten by the next `helm upgrade`).

Enable debug (temporary):

```bash
kubectl -n kubex patch deploy/$(kubectl -n kubex get deploy -l app.kubernetes.io/name=kubex-automation-engine -o jsonpath='{.items[0].metadata.name}') --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args/3","value":"--zap-log-level=debug"}]'
```

Revert back to info:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-engine/docs/Troubleshooting.md:18

Issue: The kubectl patch command hardcodes /spec/template/spec/containers/0/args/3 as the JSON Patch path for --zap-log-level, which will silently break if the args list order ever changes.

Why it matters: args in the Deployment are rendered via {{- include "kubex-automation-engine.managerArgs" . | nindent 8 }}, whose output order is determined by the Helm template. If a future Helm release inserts or removes an arg before index 3, the patch will overwrite the wrong argument—potentially disabling a critical flag—without producing any visible error. This is an operational hazard for users who follow the documentation.

Suggested fix: Replace the index-based approach with a value filter to make the command index-agnostic. For example:

# Find the current index of the --zap-log-level arg
INDEX=$(kubectl -n kubex get deploy/$(kubectl -n kubex get deploy -l app.kubernetes.io/name=kubex-automation-engine -o jsonpath='{.items[0].metadata.name}') \
  -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' \
  | grep -n 'zap-log-level' | cut -d: -f1)
INDEX=$((INDEX - 1))  # zero-indexed

kubectl -n kubex patch deploy/... --type='json' \
  -p="[{\"op\":\"replace\",\"path\":\"/spec/template/spec/containers/0/args/${INDEX}\",\"value\":\"--zap-log-level=debug\"}]"

Alternatively, document the use of --zap-log-level exclusively via helm upgrade --set controllerManager.extraArgs[0]=... (the persistent path already documented below) to avoid the fragile index reference altogether.

```bash
kubectl -n kubex patch deploy/$(kubectl -n kubex get deploy -l app.kubernetes.io/name=kubex-automation-engine -o jsonpath='{.items[0].metadata.name}') --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args/3","value":"--zap-log-level=info"}]'
```

If you want the setting to persist across upgrades, use Helm instead:

```bash
helm upgrade kubex-automation kubex/kubex-automation-engine -n kubex --reuse-values --set 'controllerManager.extraArgs[0]=--zap-log-level=debug'
```

Revert with Helm:

```bash
helm upgrade kubex-automation kubex/kubex-automation-engine -n kubex --reuse-values --set 'controllerManager.extraArgs[0]=--zap-log-level=info'
```

## 1. Interpret `rightsizing summary` Logs

```bash
Expand Down
3 changes: 3 additions & 0 deletions charts/kubex-automation-engine/templates/deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ kind: Deployment
metadata:
name: {{ include "kubex-automation-engine.fullname" . }}
namespace: {{ include "kubex-automation-engine.namespace" . }}
annotations:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: Medium]

Location: charts/kubex-automation-engine/templates/deployment.yaml:13

Issue: The rightsizing.kubex.ai/pause-until: infinite annotation is added to the Deployment metadata but the same annotation already existed on the Pod template metadata (.spec.template.metadata.annotations). The Deployment-level annotation is not the mechanism the controller uses to pause rightsizing; the controller reads the annotation on the Pod template spec, not on the enclosing Deployment object itself.

Why it matters: If the controller evaluates this annotation on the Deployment object it would work, but most admission-time/recommendation-time controllers inspect the Pod or Pod template annotations, not the Deployment-level metadata. A mismatch could result in the engine still attempting to automate its own Deployment despite the intent expressed by the comment, or it may be dead/redundant metadata that misleads operators. Given the Pod template already carries the annotation, the Deployment-level copy may be harmless—but if it is relied upon for correctness it should be validated.

Suggested fix: Confirm whether the rightsizing.kubex.ai/pause-until annotation is evaluated at Deployment-level or Pod-level by the controller. If Pod-level is sufficient (as is typical for admission controllers), remove the newly added Deployment-level annotation to avoid confusion. If both levels are intentionally checked, add a comment explaining why both are needed.

# This annotation is set by default so that the automation doesn't attempt to automate itself
rightsizing.kubex.ai/pause-until: infinite
labels:
{{- include "kubex-automation-engine.labels" . | nindent 4 }}
spec:
Expand Down
3 changes: 3 additions & 0 deletions charts/kubex-automation-engine/templates/role.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,7 @@ rules:
- globalconfigurations
- gpuconsolidationpolicies
- gpurebalancingpolicies
- podaffinitypolicies
- policyevaluations
- proactivepolicies
- staticpolicies
Expand All @@ -150,6 +151,7 @@ rules:
- globalconfigurations/finalizers
- gpuconsolidationpolicies/finalizers
- gpurebalancingpolicies/finalizers
- podaffinitypolicies/finalizers
- policyevaluations/finalizers
- proactivepolicies/finalizers
- staticpolicies/finalizers
Expand All @@ -164,6 +166,7 @@ rules:
- globalconfigurations/status
- gpuconsolidationpolicies/status
- gpurebalancingpolicies/status
- podaffinitypolicies/status
- policyevaluations/status
- proactivepolicies/status
- staticpolicies/status
Expand Down
2 changes: 1 addition & 1 deletion charts/kubex-automation-engine/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ crdCheck:
skip: false
# Minimum version of the kubex-crds chart required before this chart can be installed/upgraded.
# Bump this whenever a new CRD field or schema change is required.
minCRDsHelmVersion: "1.0.0"
minCRDsHelmVersion: "1.0.1"
# -- Whether to create the kubex-gateway-config Secret automatically
createSecrets: true

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,247 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.19.0
name: podaffinitypolicies.rightsizing.kubex.ai
spec:
group: rightsizing.kubex.ai
names:
kind: PodAffinityPolicy
listKind: PodAffinityPolicyList
plural: podaffinitypolicies
singular: podaffinitypolicy
scope: Cluster
versions:
- name: v1alpha1
schema:
openAPIV3Schema:
description: PodAffinityPolicy is the Schema for the podaffinitypolicies API.
properties:
apiVersion:
description: |-
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
type: string
kind:
description: |-
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
type: string
metadata:
type: object
spec:
description: spec defines the desired state of PodAffinityPolicy
properties:
affinity:
description: affinity describes the preferred node affinity to inject
at pod admission time.
properties:
nodes:
description: nodes lists hostname label values to prefer on replacement
pods.
items:
type: string
minItems: 1
type: array
required:
- nodes
type: object
scope:
description: scope narrows the workloads and namespaces this policy
applies to.
properties:
labelSelector:
description: labelSelector limits the workload objects (e.g.,
Deployments, CronJobs) this policy applies to.
properties:
matchExpressions:
description: matchExpressions is a list of label selector
requirements. The requirements are ANDed.
items:
description: |-
A label selector requirement is a selector that contains values, a key, and an operator that
relates the key and values.
properties:
key:
description: key is the label key that the selector
applies to.
type: string
operator:
description: |-
operator represents a key's relationship to a set of values.
Valid operators are In, NotIn, Exists and DoesNotExist.
type: string
values:
description: |-
values is an array of string values. If the operator is In or NotIn,
the values array must be non-empty. If the operator is Exists or DoesNotExist,
the values array must be empty. This array is replaced during a strategic
merge patch.
items:
type: string
type: array
x-kubernetes-list-type: atomic
required:
- key
- operator
type: object
type: array
x-kubernetes-list-type: atomic
matchLabels:
additionalProperties:
type: string
description: |-
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
map is equivalent to an element of matchExpressions, whose key field is "key", the
operator is "In", and the values array contains only "value". The requirements are ANDed.
type: object
type: object
x-kubernetes-map-type: atomic
namespaceSelector:
description: namespaceSelector restricts the namespaces this policy
applies to.
properties:
operator:
description: operator determines how the listed values are
evaluated.
enum:
- In
- NotIn
type: string
values:
description: values contains the namespace name patterns to
match.
items:
type: string
minItems: 1
type: array
required:
- operator
- values
type: object
workloadTypes:
default:
- Deployment
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-crds/templates/rightsizing.kubex.ai_podaffinitypolicies.yaml:131

Issue: The workloadTypes default list order differs between the new PodAffinityPolicy CRD and other policies in the chart. The CRD lists [Deployment, StatefulSet, CronJob, Rollout, Job, AnalysisRun, DaemonSet] but the workloadTypes items enum lists [Deployment, StatefulSet, DaemonSet, CronJob, Rollout, Job, AnalysisRun].

Why it matters: While functionally equivalent (defaults and enums are sets), inconsistency across CRDs in the same chart makes maintenance harder and can confuse users comparing schemas. It also signals a copy-paste from two different sources without reconciliation.

Suggested fix: Align the default list order with the items enum order, or align all CRDs in this chart to use the same canonical ordering (e.g., [Deployment, StatefulSet, DaemonSet, CronJob, Rollout, Job, AnalysisRun]).

- StatefulSet
- CronJob
- Rollout
- Job
- AnalysisRun
- DaemonSet
description: workloadTypes limits the workload kinds this policy
applies to. When omitted, all supported workload types are targeted.
items:
description: WorkloadType enumerates the workload kinds a policy
can target.
enum:
- Deployment
- StatefulSet
- DaemonSet
- CronJob
- Rollout
- Job
- AnalysisRun
type: string
type: array
required:
- namespaceSelector
type: object
weight:
default: 0
description: |-
weight determines which policy wins when multiple PodAffinityPolicy policies match.
Higher weights take precedence. When weights are equal, older policies win.
format: int32
minimum: 0
type: integer
required:
- affinity
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-crds/templates/rightsizing.kubex.ai_podaffinitypolicies.yaml:165

Issue: The status.conditions description copy-pastes "represent the current state of the StaticPolicy resource" instead of "PodAffinityPolicy resource".

Why it matters: Stale copy-paste in the CRD schema description is surfaced in kubectl describe output and any OpenAPI-aware tooling (e.g., kubectl explain podaffinitypolicies.status.conditions). It will mislead users debugging PodAffinityPolicy resources and indicates the schema block was templated from StaticPolicy without a full find-replace pass.

Suggested fix: Change the description to reference PodAffinityPolicy:

description: |-
  conditions represent the current state of the PodAffinityPolicy resource.
  ...

- scope
type: object
status:
description: status defines the observed state of PodAffinityPolicy
properties:
conditions:
description: |-
conditions represent the current state of the StaticPolicy resource.
Each condition has a unique type and reflects the status of a specific aspect of the resource.

Standard condition types include:
- "Available": the resource is fully functional
- "Progressing": the resource is being created or updated
- "Degraded": the resource failed to reach or maintain its desired state

The status of each condition is one of True, False, or Unknown.
items:
description: Condition contains details for one aspect of the current
state of this API Resource.
properties:
lastTransitionTime:
description: |-
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format: date-time
type: string
message:
description: |-
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength: 32768
type: string
observedGeneration:
description: |-
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format: int64
minimum: 0
type: integer
reason:
description: |-
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
maxLength: 1024
minLength: 1
pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
type: string
status:
description: status of the condition, one of True, False, Unknown.
enum:
- "True"
- "False"
- Unknown
type: string
type:
description: type of condition in CamelCase or in foo.example.com/CamelCase.
maxLength: 316
pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
type: string
required:
- lastTransitionTime
- message
- reason
- status
- type
type: object
type: array
x-kubernetes-list-map-keys:
- type
x-kubernetes-list-type: map
type: object
required:
- spec
type: object
served: true
storage: true
subresources:
status: {}