4 changes: 2 additions & 2 deletions charts/kubex-automation-engine/Chart.yaml
@@ -2,8 +2,8 @@ apiVersion: v2
name: kubex-automation-engine
description: A Helm chart for deploying kubex-automation-engine for automated workload rightsizing and resource optimization
type: application
CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Issue: Chart version is being downgraded from 1.0.1 to 1.0.0.

Why it matters: Helm uses semver to determine whether an upgrade or a downgrade is happening. Reverting the chart version to a lower number than what is already published or deployed can cause helm upgrade to no-op or behave unexpectedly for users who already have 1.0.1 installed. It also breaks the immutability guarantees of chart repositories: if 1.0.0 was already released, pushing a different chart under the same version violates that convention.

Suggested fix: If this is intentional (e.g., a reset before a new release cycle), document it explicitly in the PR body and ensure the chart repository has no prior 1.0.0 artifact. Otherwise, bump the version to 1.0.2 (or higher) to reflect the new content being added in this PR.
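
To make the downgrade concern concrete, here is a minimal sketch of a pre-merge check that compares the old and new chart versions. The function name `is_downgrade` is hypothetical, and the sketch assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffix and GNU `sort -V`. It is not how Helm itself compares versions.

```bash
#!/usr/bin/env bash
# is_downgrade OLD NEW
# Succeeds (exit 0) when NEW sorts strictly below OLD.
# Assumes plain MAJOR.MINOR.PATCH strings and GNU sort (-V = version sort).
is_downgrade() {
  local old="$1" new="$2" lowest
  # Equal versions are not a downgrade.
  if [ "$old" = "$new" ]; then
    return 1
  fi
  # After a version sort, the lowest version is on the first line.
  lowest="$(printf '%s\n%s\n' "$old" "$new" | sort -V | head -n 1)"
  [ "$lowest" = "$new" ]
}
```

A CI step could then run something like `is_downgrade 1.0.1 1.0.0 && echo "chart version went backwards"` against the versions read from the base and head `Chart.yaml` files.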

-version: 1.0.1
-icon: https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg
+version: 1.0.0
+icon: https://www.kubex.ai/wp-content/uploads/kubex-by-densify-logo.png
keywords:
- kubex
- automation
6 changes: 0 additions & 6 deletions charts/kubex-automation-engine/README.md
@@ -1,11 +1,5 @@
# Kubex Automation Engine

<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://kubex.ai/wp-content/uploads/kubex-logo-reverse-landscape.svg">
<source media="(prefers-color-scheme: light)" srcset="https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg">
<img src="https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg" width="300">
</picture>

Kubernetes resource optimization with policy-driven rightsizing, admission-time mutation, and proactive resize execution.

# Quick Links
@@ -19,7 +19,7 @@ For the namespaced variant, see [Proactive Policies](./Proactive-Policies.md). F
| `spec.scope.labelSelector` | none | Kubernetes label selector for matching workloads. |
| `spec.scope.workloadTypes` | `[Deployment, StatefulSet, CronJob, Rollout, Job, AnalysisRun, DaemonSet]` | Workload kinds this policy applies to. |
| `spec.scope.namespaceSelector.operator` | none | Namespace selector operator: `In` or `NotIn`. |
-| `spec.scope.namespaceSelector.values` | none | Namespace patterns to include or exclude. |
+| `spec.scope.namespaceSelector.values` | none | Namespace patterns to include or exclude (supports `*` wildcards, e.g. `prod-*`). |
| `spec.automationStrategyRef.name` | none | Required cluster strategy name. |
| `spec.weight` | `0` | Higher weight wins when multiple proactive policies match. |
| `spec.safetyChecks.maxAnalysisAgeDays` | `5` | Rejects old recommendations. |
12 changes: 12 additions & 0 deletions charts/kubex-automation-engine/docs/Policy-Configuration.md
@@ -72,6 +72,18 @@ Those resources remain fully supported by the controller and can be managed outs
| `policy.policies.<name>.safetyChecks.maxAnalysisAgeDays` | `ClusterProactivePolicy.spec.safetyChecks.maxAnalysisAgeDays` | Per-policy value wins over top-level `policy.safetyChecks.maxAnalysisAgeDays`. |
| `policy.safetyChecks.maxAnalysisAgeDays` | `ClusterProactivePolicy.spec.safetyChecks.maxAnalysisAgeDays` | Backward-compatible fallback when not set per policy. |

### Namespace wildcards in `scope[].namespaces.values`

`scope[].namespaces.values` supports shell-style `*` wildcards when matching namespace names (for example: `prod-*`).

```yaml
scope:
  - name: platform
    namespaces:
      operator: In
      values: ["prod-*", "staging"]
```
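
As a rough illustration of how shell-style `*` patterns could combine with the `In` and `NotIn` operators, the sketch below uses the shell's own glob matching. The function names `ns_matches_any` and `ns_selected` are hypothetical, and this is not the controller's matching code, which this diff does not show. Shell globbing also honours `?` and `[...]`, which the chart may or may not support.

```bash
#!/usr/bin/env bash
# ns_matches_any NAMESPACE PATTERN...
# Succeeds when NAMESPACE matches at least one shell-style pattern.
ns_matches_any() {
  local ns="$1" pattern
  shift
  for pattern in "$@"; do
    # An unquoted variable in a case label is interpreted as a glob pattern.
    case "$ns" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}

# ns_selected OPERATOR NAMESPACE PATTERN...
# In: selected when some pattern matches. NotIn: selected when none does.
ns_selected() {
  local operator="$1"
  shift
  case "$operator" in
    In)    ns_matches_any "$@" ;;
    NotIn) ! ns_matches_any "$@" ;;
    *)     echo "unknown operator: $operator" >&2; return 2 ;;
  esac
}
```

Under these assumptions, `ns_selected In prod-eu 'prod-*' staging` succeeds, while `ns_selected In staging-2 'prod-*' staging` fails because `staging` carries no wildcard and must match exactly.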

Important:

- `maxAnalysisAgeDays` is written to generated `ClusterProactivePolicy` resources, not to generated strategies.
28 changes: 28 additions & 0 deletions charts/kubex-automation-engine/docs/Troubleshooting.md
@@ -4,6 +4,34 @@ Use this sequence when rightsizing does not happen as expected.

For a consolidated map of the controller's safety gates, see [Safety Controls](./Safety-Controls.md).

## 0. Temporarily Enable Debug Logging (and Revert)

Most of the time you only want debug logs briefly. The quickest way is to update the live Deployment args (this triggers a rollout and will be overwritten by the next `helm upgrade`).

Enable debug (temporary):

```bash
kubectl -n kubex patch deploy/$(kubectl -n kubex get deploy -l app.kubernetes.io/name=kubex-automation-engine -o jsonpath='{.items[0].metadata.name}') --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args/3","value":"--zap-log-level=debug"}]'
```
CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Issue: The kubectl patch commands hardcode the JSON Patch path /spec/template/spec/containers/0/args/3, which assumes --zap-log-level is always the 4th argument (index 3) in the container's args array.

Why it matters: If the chart's deployment.yaml ever reorders arguments, or if a user adds/removes entries via controllerManager.extraArgs, the index 3 will point to the wrong arg, silently patching an unintended field (e.g., overwriting a different flag). This can be hard to diagnose in production incidents.

Suggested fix: Add a warning note such as: "Verify the argument index before running: kubectl -n kubex get deploy <name> -o jsonpath='{.spec.template.spec.containers[0].args}'" Or replace the JSON Patch replace with a jq-based approach that targets the arg by its prefix rather than by positional index.
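
One way to act on this suggestion without extra tooling is to look the index up by flag prefix before patching. The sketch below is only an illustration under stated assumptions: the helper names `find_arg_index` and `log_level_patch` are hypothetical, and `find_arg_index` expects the container args on stdin, one per line.

```bash
#!/usr/bin/env bash
# find_arg_index PREFIX
# Reads container args from stdin, one per line, and prints the zero-based
# index of the first arg that starts with PREFIX. Fails if none is found.
find_arg_index() {
  local prefix="$1" index=0 arg
  while IFS= read -r arg; do
    case "$arg" in
      "$prefix"*) echo "$index"; return 0 ;;
    esac
    index=$((index + 1))
  done
  return 1
}

# log_level_patch INDEX LEVEL
# Prints a JSON Patch document that replaces the arg at INDEX.
log_level_patch() {
  printf '[{"op":"replace","path":"/spec/template/spec/containers/0/args/%s","value":"--zap-log-level=%s"}]\n' "$1" "$2"
}
```

Against a live cluster, the args could be fed in with something like `kubectl -n kubex get deploy <name> -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' | find_arg_index --zap-log-level=`, and the output of `log_level_patch` passed to `kubectl patch --type=json -p`. If the flag is absent, `find_arg_index` fails, which is a signal not to patch at all.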


Revert back to info:

```bash
kubectl -n kubex patch deploy/$(kubectl -n kubex get deploy -l app.kubernetes.io/name=kubex-automation-engine -o jsonpath='{.items[0].metadata.name}') --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args/3","value":"--zap-log-level=info"}]'
```

If you want the setting to persist across upgrades, use Helm instead:

```bash
helm upgrade kubex-automation kubex/kubex-automation-engine -n kubex --reuse-values --set 'controllerManager.extraArgs[0]=--zap-log-level=debug'
```

Revert with Helm:

```bash
helm upgrade kubex-automation kubex/kubex-automation-engine -n kubex --reuse-values --set 'controllerManager.extraArgs[0]=--zap-log-level=info'
```

## 1. Interpret `rightsizing summary` Logs

```bash
3 changes: 3 additions & 0 deletions charts/kubex-automation-engine/templates/deployment.yaml
@@ -10,6 +10,9 @@ kind: Deployment
metadata:
  name: {{ include "kubex-automation-engine.fullname" . }}
  namespace: {{ include "kubex-automation-engine.namespace" . }}
  annotations:
CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Issue: The rightsizing.kubex.ai/pause-until: infinite annotation is hardcoded directly in the Deployment template with no way for users to override or disable it via values.yaml.

Why it matters: While the self-automation-prevention intent is sound, hardcoding this annotation means:

  1. Users cannot opt out if their environment handles this differently.
  2. Operators who want to allow rightsizing on the controller itself (e.g., in a test environment) have no mechanism to do so.
  3. The annotation value "infinite" is a non-standard string — if the controller's annotation semantics ever change (e.g., to a timestamp format), this baked-in value will silently misbehave.

Suggested fix: Either expose this via a values.yaml toggle (e.g., controllerManager.pauseRightsizing: true) that conditionally renders the annotation, or at minimum add a {{- with .Values.controllerManager.annotations }} merge point so users can override deployment-level annotations.

    # This annotation is set by default so that the automation doesn't attempt to automate itself
    rightsizing.kubex.ai/pause-until: infinite
  labels:
    {{- include "kubex-automation-engine.labels" . | nindent 4 }}
spec:
3 changes: 3 additions & 0 deletions charts/kubex-automation-engine/templates/role.yaml
@@ -130,6 +130,7 @@ rules:
- globalconfigurations
- gpuconsolidationpolicies
- gpurebalancingpolicies
- podaffinities
- policyevaluations
- proactivepolicies
- staticpolicies
@@ -150,6 +151,7 @@ rules:
- globalconfigurations/finalizers
- gpuconsolidationpolicies/finalizers
- gpurebalancingpolicies/finalizers
- podaffinities/finalizers
- policyevaluations/finalizers
- proactivepolicies/finalizers
- staticpolicies/finalizers
@@ -164,6 +166,7 @@ rules:
- globalconfigurations/status
- gpuconsolidationpolicies/status
- gpurebalancingpolicies/status
- podaffinities/status
- policyevaluations/status
- proactivepolicies/status
- staticpolicies/status
4 changes: 2 additions & 2 deletions charts/kubex-crds/Chart.yaml
@@ -1,8 +1,8 @@
apiVersion: v2
name: kubex-crds
description: CRDs for Kubex Automation Engine
-icon: https://kubex.ai/wp-content/uploads/kubex-logo-landscape.svg
-version: 1.0.1
+version: 1.0.0
CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Issue: Chart version is being downgraded from 1.0.1 to 1.0.0, mirroring the same problem in kubex-automation-engine/Chart.yaml. Additionally, the icon field is removed entirely without replacement.

Why it matters: The same semver immutability concern applies here. A chart registry (e.g., OCI or a classic ChartMuseum) that already has 1.0.0 cannot accept a different artifact under the same version. Removing icon also degrades the chart's presentation in Helm UIs (Artifact Hub, Rancher, etc.) without explanation.

Suggested fix: Bump the version to 1.0.2 (or higher). If the icon URL was changed (as in kubex-automation-engine), apply the same replacement icon URL here rather than omitting the field entirely.

appVersion: v0.1
keywords:
- crd
- kubex
247 changes: 247 additions & 0 deletions charts/kubex-crds/templates/rightsizing.kubex.ai_podaffinities.yaml
@@ -0,0 +1,247 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.19.0
name: podaffinities.rightsizing.kubex.ai
spec:
group: rightsizing.kubex.ai
names:
kind: PodAffinity
listKind: PodAffinityList
plural: podaffinities
singular: podaffinity
scope: Cluster
versions:
- name: v1alpha1
schema:
openAPIV3Schema:
description: PodAffinity is the Schema for the podaffinities API.
properties:
apiVersion:
description: |-
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
type: string
kind:
description: |-
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
type: string
metadata:
type: object
spec:
description: spec defines the desired state of PodAffinity
properties:
affinity:
description: affinity describes the preferred node affinity to inject
at pod admission time.
properties:
nodes:
description: nodes lists hostname label values to prefer on replacement
pods.
items:
type: string
minItems: 1
type: array
required:
- nodes
type: object
scope:
description: scope narrows the workloads and namespaces this policy
applies to.
properties:
labelSelector:
description: labelSelector limits the workload objects (e.g.,
Deployments, CronJobs) this policy applies to.
properties:
matchExpressions:
description: matchExpressions is a list of label selector
requirements. The requirements are ANDed.
items:
description: |-
A label selector requirement is a selector that contains values, a key, and an operator that
relates the key and values.
properties:
key:
description: key is the label key that the selector
applies to.
type: string
operator:
description: |-
operator represents a key's relationship to a set of values.
Valid operators are In, NotIn, Exists and DoesNotExist.
type: string
values:
description: |-
values is an array of string values. If the operator is In or NotIn,
the values array must be non-empty. If the operator is Exists or DoesNotExist,
the values array must be empty. This array is replaced during a strategic
merge patch.
items:
type: string
type: array
x-kubernetes-list-type: atomic
required:
- key
- operator
type: object
type: array
x-kubernetes-list-type: atomic
matchLabels:
additionalProperties:
type: string
description: |-
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
map is equivalent to an element of matchExpressions, whose key field is "key", the
operator is "In", and the values array contains only "value". The requirements are ANDed.
type: object
type: object
x-kubernetes-map-type: atomic
namespaceSelector:
CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: Medium]

Issue: The CRD marks namespaceSelector as required under scope, and Policy-Configuration.md documents scope[].namespaces.values as supporting wildcards. The CRD schema, however, does not document the wildcard behaviour in the values field description: it only says "values contains the namespace name patterns to match", with no mention of * support.

Why it matters: Users relying solely on kubectl explain or generated API docs will not discover the wildcard feature. It also creates a discoverability gap when comparing the ClusterProactivePolicy reference (which does document wildcards) with the new PodAffinity CRD (which does not).

Suggested fix: Augment the values field description in the CRD schema to explicitly mention wildcard support:

description: |-
  values contains the namespace name patterns to match.
  Supports shell-style '*' wildcards (e.g. 'prod-*').

description: namespaceSelector restricts the namespaces this policy
applies to.
properties:
operator:
description: operator determines how the listed values are
evaluated.
enum:
- In
- NotIn
type: string
values:
description: values contains the namespace name patterns to
match.
items:
type: string
minItems: 1
type: array
required:
- operator
- values
type: object
workloadTypes:
default:
- Deployment
- StatefulSet
- CronJob
- Rollout
- Job
- AnalysisRun
- DaemonSet
description: workloadTypes limits the workload kinds this policy
applies to. When omitted, all supported workload types are targeted.
items:
description: WorkloadType enumerates the workload kinds a policy
can target.
enum:
- Deployment
- StatefulSet
- DaemonSet
- CronJob
- Rollout
- Job
- AnalysisRun
type: string
type: array
required:
- namespaceSelector
type: object
weight:
default: 0
description: |-
weight determines which policy wins when multiple PodAffinity policies match.
Higher weights take precedence. When weights are equal, older policies win.
format: int32
minimum: 0
CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: Medium]

Issue: The status.conditions description copy-pastes the text "conditions represent the current state of the StaticPolicy resource" — but this is the PodAffinity CRD, not a StaticPolicy.

Why it matters: Stale/incorrect documentation in a CRD's OpenAPI schema is surfaced directly to kubectl explain, API docs generators, and operator SDKs. Users querying kubectl explain podaffinity.status.conditions will receive misleading information.

Suggested fix: Update the description to reference PodAffinity:

description: |-
  conditions represent the current state of the PodAffinity resource.
  ...

type: integer
required:
- affinity
- scope
type: object
status:
description: status defines the observed state of PodAffinity
properties:
conditions:
description: |-
conditions represent the current state of the StaticPolicy resource.
Each condition has a unique type and reflects the status of a specific aspect of the resource.

Standard condition types include:
- "Available": the resource is fully functional
- "Progressing": the resource is being created or updated
- "Degraded": the resource failed to reach or maintain its desired state

The status of each condition is one of True, False, or Unknown.
items:
description: Condition contains details for one aspect of the current
state of this API Resource.
properties:
lastTransitionTime:
description: |-
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format: date-time
type: string
message:
description: |-
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength: 32768
type: string
observedGeneration:
description: |-
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format: int64
minimum: 0
type: integer
reason:
description: |-
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
maxLength: 1024
minLength: 1
pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
type: string
status:
description: status of the condition, one of True, False, Unknown.
enum:
- "True"
- "False"
- Unknown
type: string
type:
description: type of condition in CamelCase or in foo.example.com/CamelCase.
maxLength: 316
pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
type: string
required:
- lastTransitionTime
- message
- reason
- status
- type
type: object
type: array
x-kubernetes-list-map-keys:
- type
x-kubernetes-list-type: map
type: object
required:
- spec
type: object
served: true
storage: true
subresources:
status: {}