20 commits
ca8a8b9
feat(stack): allow controller metrics scraping
gasarekubex May 6, 2026
f18919b
fix(stack): drop webhook from metrics allowlist
gasarekubex May 6, 2026
48d7b49
fix(stack): tighten controller metrics allowlist
gasarekubex May 6, 2026
7ec4ec8
fix(stack): keep controller metrics samples
gasarekubex May 6, 2026
faea326
fix(stack): dedicate controller metrics scrape
gasarekubex May 6, 2026
d3da635
fix(stack): make controller metrics scrape explicit http
gasarekubex May 6, 2026
d5d2a7a
fix(stack): pin controller metrics scrape to http
gasarekubex May 6, 2026
4b9909b
chore(stack): set chart version to 1.0.9
gasarekubex May 6, 2026
74a2275
fix(stack): deduplicate shared endpointslice relabels
gasarekubex May 6, 2026
f5d77f1
fix(stack): restore shared scrape labels
gasarekubex May 6, 2026
c602a89
docs(stack): clarify controller metrics scrape intent
gasarekubex May 6, 2026
d94a1d7
docs(stack): clarify controller metrics scrape behavior
gasarekubex May 6, 2026
6e4c87c
fix(stack): scope controller metrics scrape by service
gasarekubex May 7, 2026
e72a399
fix(stack): tighten controller metrics scraping
gasarekubex May 7, 2026
39fafc4
fix(stack): enforce controller metrics port explicitly
gasarekubex May 7, 2026
c4e9be1
fix(stack): make controller scrape path explicit
gasarekubex May 7, 2026
288f3c0
fix(stack): align controller scrape with validated defaults
gasarekubex May 7, 2026
a21e70b
docs(stack): clarify controller scrape contract
gasarekubex May 7, 2026
dbf848d
fix(stack): simplify controller metrics relabeling
gasarekubex May 11, 2026
8f3c861
fix(stack): close remaining metrics review items
gasarekubex May 12, 2026
2 changes: 1 addition & 1 deletion charts/kubex-automation-stack/Chart.yaml
@@ -1,7 +1,7 @@
apiVersion: v2
description: Kubex Collection Stack
name: kubex-automation-stack
-version: 1.0.8
+version: 1.0.9
type: application
icon: https://www.kubex.ai/wp-content/uploads/kubex-by-densify-logo.png
dependencies:
48 changes: 44 additions & 4 deletions charts/kubex-automation-stack/values.yaml
@@ -193,6 +193,7 @@ prometheus:
kubernetes_sd_configs:
- role: endpointslice
relabel_configs:
# Scheme annotation overrides the job's default scrape protocol when a target serves HTTPS.
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
@@ -209,25 +210,64 @@ prometheus:
- action: labelmap
regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)
replacement: __param_$1
- action: labelmap
- &kubexEndpointsliceServiceLabels
action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
- &kubexEndpointsliceNamespace

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Minor] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:219

Issue: The action: keep regex on __meta_kubernetes_endpointslice_name in the shared kubernetes-service-endpointslice job was not updated to exclude the new controller metrics service, even though a dedicated job now exists for it. If the controller's EndpointSlice name ever happens to match the shared job's regex (e.g., a future release where the name contains kubex), the shared job and the dedicated job will both scrape it — and the shared job's metric_relabel_configs allowlist will then silently discard all controller-runtime/go/workqueue metrics.

Why it matters: Double-scraping the same endpoint wastes resources; the shared job's allowlist then drops all controller metrics that don't match the kube_*/node_*/DCGM_*/openshift_* pattern, so users would see gaps if they ever routed the controller through the shared job.

Suggested fix: Add a separate drop rule to the shared job so it explicitly excludes the controller metrics service name suffix. (A negative lookahead is not an option here: Prometheus relabel regexes use RE2 syntax, which does not support lookaheads.)

- source_labels: [__meta_kubernetes_endpointslice_name]
  action: keep
  regex: '((kubex|densify)-(kube-state-metrics|prometheus-node-exporter|ephemeral-storage-collector)|.*dcgm|k8s-ephemeral-storage-metrics).*'
# Optionally, add a drop to prevent accidental overlap with the dedicated job:
- source_labels: [__meta_kubernetes_service_name]
  action: drop
  regex: '.+-kubex-automation-engine-metrics-service$'

source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_service_name]
action: drop
regex: '.+-kubex-automation-engine-metrics-service$'
- source_labels: [__meta_kubernetes_endpointslice_name]
action: keep
regex: '((kubex|densify)-(kube-state-metrics|prometheus-node-exporter|ephemeral-storage-collector)|.*dcgm|k8s-ephemeral-storage-metrics).*'
- source_labels: [__meta_kubernetes_service_name]
- &kubexEndpointsliceServiceName
source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: service
- source_labels: [__meta_kubernetes_pod_node_name]
- &kubexEndpointsliceNodeName
source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: node
metric_relabel_configs:
- source_labels: [__name__]
regex: '^(DCGM_FI_(DEV_(FB_(FREE|USED)|GPU_UTIL|POWER_USAGE)|PROF_(DRAM_ACTIVE|GR_ENGINE_ACTIVE|PIPE_TENSOR_ACTIVE))|ephemeral_storage_.*|kube_(cronjob_(created|info|labels|next_schedule_time|status_(active|last_schedule_time))|daemonset_(created|labels|status_number_available)|deployment_(created|labels|metadata_generation|spec_strategy_rollingupdate_max_(surge|unavailable))|horizontalpodautoscaler_(info|labels|spec_(max_replicas|min_replicas|target_metric)|status_(condition|current_replicas|target_metric))|job_(created|info|labels|owner|spec_(completions|parallelism)|status_(active|completion_time|start_time))|namespace_(annotations|labels)|node_(info|labels|role|spec_taint|status_(allocatable|capacity))|pod_(container_(info|resource_(limits|requests)|status_(last_terminated_(exitcode|timestamp)|restarts_total|terminated(?:_reason)?))|created|info|labels|owner|status_(phase|qos_class))|replicaset_(created|labels|owner|spec_replicas)|replicationcontroller_(created|spec_replicas)|resourcequota(?:_created)?|statefulset_(created|labels|replicas))|node_(cpu_(core_throttles_total|seconds_total)|disk_(read_bytes_total|reads_completed_total|writes_completed_total|written_bytes_total)|memory_(Buffers_bytes|Cached_bytes|MemFree_bytes|MemTotal_bytes|SReclaimable_bytes)|network_(receive_(bytes_total|packets_total)|speed_bytes|transmit_(bytes_total|packets_total))|vmstat_oom_kill)|openshift_clusterresourcequota_(created|labels|namespace_usage|selector|usage))$'
action: keep

- job_name: 'kubex-automation-engine-metrics-endpointslice'
# The controller chart exposes unauthenticated metrics on port 8080 over plain HTTP.
# Unlike the shared endpointslice job above, this scrape does not honor scheme annotations or bearer tokens.
# The job is intentionally fixed to the validated controller metrics defaults.
# The shared annotation-driven path/port/param rules stay with the shared job; this job keeps its own explicit endpoint.
# The target is scraped by service-name suffix so multiple controller releases can be collected when needed.
# No bearer token is required for this /metrics endpoint.
scheme: http
kubernetes_sd_configs:

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:247

Issue: The &kubexEndpointsliceMetricsPath, &kubexEndpointsliceAddress, and &kubexEndpointsliceParamLabels anchors are defined in the shared job but are never aliased (*) by the new dedicated job. That is consistent with the job's intent: it sets __metrics_path__ with an inline rule and replaces __address__ with a hard-coded port. The risk lies in future edits. The &kubexEndpointsliceAddress rule rewrites __address__ by combining the host with the prometheus.io/port annotation, so if it were accidentally aliased into the new job it would override the explicit $1:8080 port. This should be commented explicitly so that a future author does not "complete the pattern" by adding *kubexEndpointsliceAddress. Consider adding a guard comment such as "# Note: do NOT alias *kubexEndpointsliceAddress here".

Why it matters: The asymmetric use of anchors (some aliased, some re-inlined) creates a maintenance trap. A future author completing the pattern by adding the missing aliases will silently break the port override.

Suggested fix: Add a guard comment immediately before the first alias in the new job:

relabel_configs:
  # Fixed path/port — do NOT add *kubexEndpointsliceAddress or *kubexEndpointsliceMetricsPath here;
  # this job intentionally hard-codes port 8080 and path /metrics.
  - target_label: __metrics_path__
    replacement: /metrics
  - *kubexEndpointsliceServiceLabels
  ...
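
The trap described above can be illustrated with a small sketch. The shared job's address rule is not visible in this diff, so `SHARED_ADDRESS_REGEX` below is an assumption: it is the common annotation-driven form that joins `__address__` and the `prometheus.io/port` annotation with `;`. The function names and sample addresses are hypothetical, and Python's `re.fullmatch` stands in for Prometheus's anchored RE2 matching.

```python
import re

# From the dedicated job in this PR.
FIXED_PORT_REGEX = r"(.+?)(?::\d+)?"
# ASSUMPTION: typical annotation-driven address rule; the real shared rule
# is not shown in this diff.
SHARED_ADDRESS_REGEX = r"(.+?)(?::\d+)?;(\d+)"


def fixed_port(address: str) -> str:
    """Dedicated job: force the port to 8080."""
    m = re.fullmatch(FIXED_PORT_REGEX, address)
    return f"{m.group(1)}:8080" if m else address


def annotation_port(address: str, port_annotation: str) -> str:
    """Assumed shared rule: take the port from the service annotation."""
    # Prometheus joins source_labels with ';' before matching.
    m = re.fullmatch(SHARED_ADDRESS_REGEX, f"{address};{port_annotation}")
    # A replace rule leaves the label untouched when the regex does not match.
    return f"{m.group(1)}:{m.group(2)}" if m else address


pinned = fixed_port("10.244.0.3:9443")
print(pinned)                           # 10.244.0.3:8080
# Aliasing the shared rule afterwards undoes the pin when an annotation exists.
print(annotation_port(pinned, "9090"))  # 10.244.0.3:9090
# Without the annotation the assumed rule is a no-op.
print(annotation_port(pinned, ""))      # 10.244.0.3:8080
```

Under this assumption the alias would only change behaviour for services that carry a port annotation, which is what makes the breakage easy to miss.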

- role: endpointslice
relabel_configs:
# This job uses a fixed controller metrics path/port instead of the shared annotation-driven overrides.
- target_label: __metrics_path__
replacement: /metrics
- *kubexEndpointsliceServiceLabels
- *kubexEndpointsliceNamespace

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:254

Issue: The address-rewrite regex '(.+?)(?::\d+)?' is missing the end anchor $. Without it, (.+?) is lazy and (?::\d+)? is optional, so the overall match will succeed with (.+?) capturing just the first character of the IP, leaving the rest (including a real port) in the unmatched tail. For an input like 10.244.0.3:6443, the lazy (.+?) will match 1, satisfying the full regex, and the replacement $1:8080 produces 1:8080 rather than 10.244.0.3:8080.

Why it matters: The rewritten __address__ values will be garbage, causing every scrape by this job to fail with a connection error.

Suggested fix: Anchor the regex to the full string:

- source_labels: [__address__]
  action: replace
  target_label: __address__
  regex: '(.+?)(?::\d+)?$'
  replacement: '$1:8080'

Or use a non-lazy quantifier with the anchor:

regex: '([^:]+)(?::\d+)?$'
replacement: '$1:8080'
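
The claim above is worth checking before acting on it. The Prometheus configuration documentation states that relabel regexes are anchored on both ends, so the pattern is matched against the whole `__address__` value whether or not a trailing `$` is written. The sketch below uses Python's `re` module as a stand-in for Prometheus's RE2 engine (an assumption; the two behave the same for this pattern). `re.match` models the unanchored behaviour the review describes and `re.fullmatch` models the anchored behaviour. The function names are illustrative.

```python
import re

# The regex exactly as it appears in the PR, without a trailing '$'.
ADDRESS_REGEX = r"(.+?)(?::\d+)?"


def rewrite_unanchored(address: str) -> str:
    """What the review describes: the match may stop before the end."""
    m = re.match(ADDRESS_REGEX, address)
    return f"{m.group(1)}:8080" if m else address


def rewrite_anchored(address: str) -> str:
    """Whole-string matching, as Prometheus applies relabel regexes."""
    m = re.fullmatch(ADDRESS_REGEX, address)
    return f"{m.group(1)}:8080" if m else address


print(rewrite_unanchored("10.244.0.3:6443"))  # 1:8080
print(rewrite_anchored("10.244.0.3:6443"))    # 10.244.0.3:8080
print(rewrite_anchored("10.244.0.3"))         # 10.244.0.3:8080
```

With whole-string matching the lazy group has to expand until the optional port consumes the rest of the value, so the explicit `$` is harmless but does not change the result.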

# Matches any Helm release name prefix (for example 'controller-' or 'prod-').
# This intentionally supports scraping multiple controller releases when their service names share this suffix.

CONTENT OF THIS REVIEW IS AI GENERATED

[Severity: Major] [Confidence: High]

Location: charts/kubex-automation-stack/values.yaml:261

Issue: Prometheus evaluates relabel_configs in list order, and the __address__ rewrite at line 254 ($1:8080) runs before the action: keep on __meta_kubernetes_service_name at line 261. Every endpointslice target discovered in the cluster therefore has its port forced to 8080 before the keep filter narrows the set. Separately, if an endpoint also matches the sibling kubernetes-service-endpointslice job's regex, it will be scraped by both jobs. The main concern is the ordering: move the keep filter ahead of the __address__ rewrite so that the port rewrite applies only to targets that have already been selected.

Why it matters: The address-rewrite $1:8080 runs on every endpointslice target discovered cluster-wide before the keep filter narrows the set. For large clusters this inflates the set of active scrape connections (even temporarily) and can produce spurious scrape errors for unrelated services whose address was rewritten but the keep then drops. It also makes the configuration harder to reason about: readers expect the filter to come before the mutation.

Suggested fix: Re-order relabel_configs so that the service-name keep filter runs first, then the __address__ rewrite:

relabel_configs:
  - target_label: __metrics_path__
    replacement: /metrics
  - *kubexEndpointsliceServiceLabels
  - *kubexEndpointsliceNamespace
  # Keep filter FIRST — before any address mutation
  - source_labels: [__meta_kubernetes_service_name]
    action: keep
    regex: '.+-kubex-automation-engine-metrics-service$'
  # Port rewrite only runs for already-selected targets
  - source_labels: [__address__]
    action: replace
    target_label: __address__
    regex: '(.+?)(?::\d+)?'
    replacement: '$1:8080'
  - *kubexEndpointsliceServiceName
  - *kubexEndpointsliceNodeName
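
The two orderings can be compared with a toy model of the relabel pipeline. Prometheus applies relabel_configs to discovered targets before it scrapes them, so a target removed by keep is never contacted in either ordering; what changes is how many targets the rewrite touches. This is a minimal sketch: the rule dictionaries, function names, and sample targets are hypothetical, only keep and replace are modelled, and Python's `re.fullmatch` stands in for Prometheus's anchored RE2 matching.

```python
import re

KEEP = {
    "action": "keep",
    "source": "__meta_kubernetes_service_name",
    "regex": r".+-kubex-automation-engine-metrics-service$",
}
REWRITE = {
    "action": "replace",
    "source": "__address__",
    "target": "__address__",
    "regex": r"(.+?)(?::\d+)?",
    "replacement": r"\g<1>:8080",  # Python spelling of Prometheus's $1:8080
}


def apply_rules(labels, rules):
    """Apply rules in list order; return (labels or None if dropped, rewrites)."""
    labels = dict(labels)
    rewrites = 0
    for rule in rules:
        match = re.fullmatch(rule["regex"], labels.get(rule["source"], ""))
        if rule["action"] == "keep" and match is None:
            return None, rewrites
        if rule["action"] == "replace" and match is not None:
            labels[rule["target"]] = match.expand(rule["replacement"])
            rewrites += 1
    return labels, rewrites


def summarize(targets, rules):
    """Return the surviving addresses and the total number of rewrites."""
    kept, rewrites = [], 0
    for target in targets:
        result, count = apply_rules(target, rules)
        rewrites += count
        if result is not None:
            kept.append(result["__address__"])
    return kept, rewrites


TARGETS = [
    {"__meta_kubernetes_service_name": "prod-kubex-automation-engine-metrics-service",
     "__address__": "10.244.0.3:9443"},
    {"__meta_kubernetes_service_name": "kube-dns", "__address__": "10.244.0.9:53"},
    {"__meta_kubernetes_service_name": "kubex-kube-state-metrics",
     "__address__": "10.244.1.4:8080"},
]

print(summarize(TARGETS, [KEEP, REWRITE]))  # (['10.244.0.3:8080'], 1)
print(summarize(TARGETS, [REWRITE, KEEP]))  # (['10.244.0.3:8080'], 3)
```

Both orderings keep the same single target with the same address. Putting the filter first avoids rewriting targets that are about to be dropped and makes the rule list easier to read.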

- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: '.+-kubex-automation-engine-metrics-service$'
- source_labels: [__address__]
action: replace
target_label: __address__
regex: '(.+?)(?::\d+)?'
replacement: '$1:8080'
- *kubexEndpointsliceServiceName
- *kubexEndpointsliceNodeName
# Stores all scraped metrics as-is (no metric_relabel_configs filter).
# Expected families: controller_runtime_*, go_*, process_*, workqueue_*, rest_client_*, automation_controller_*
# Add a metric_relabel_configs allowlist here if cardinality becomes a concern.
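
If an allowlist is added later, a candidate regex can be checked offline before it goes into the chart. The sketch below tests a hypothetical regex built from the metric families listed in the comment above. The regex, the function name, and the sample metric names are assumptions for illustration and are not part of this PR. Python's `re.fullmatch` stands in for Prometheus's anchored RE2 matching.

```python
import re

# HYPOTHETICAL allowlist derived from the expected families; not in the chart.
CONTROLLER_ALLOWLIST = (
    r"(controller_runtime|go|process|workqueue|rest_client|automation_controller)_.*"
)


def is_allowed(metric_name: str) -> bool:
    """Return True if a keep rule on __name__ with this regex would retain the metric."""
    return re.fullmatch(CONTROLLER_ALLOWLIST, metric_name) is not None


for name in ("controller_runtime_reconcile_total", "workqueue_depth",
             "go_goroutines", "kube_pod_info"):
    print(name, is_allowed(name))
```

A prefix-based pattern like this keeps whole families, so it limits unexpected metric names more than it limits per-family cardinality; a tighter list of individual metric names would follow the style of the shared job's allowlist.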

#################################################################
# Ephemeral Storage Metrics Exporter
# Collects and exposes ephemeral storage metrics via a DaemonSet