1 change: 0 additions & 1 deletion cmd/readiness-condition-reporter/main_test.go
@@ -178,7 +178,6 @@ func TestUpdateNodeCondition(t *testing.T) {
			if foundCondition == nil {
				t.Fatal("Condition not found")
			}

			if foundCondition.Status != tt.wantStatus {
				t.Errorf("Condition status = %v, want %v", foundCondition.Status, tt.wantStatus)
			}
177 changes: 150 additions & 27 deletions docs/book/src/operations/monitoring.md
@@ -1,16 +1,48 @@
# Monitoring

Node Readiness Controller exposes Prometheus-compatible metrics. This page describes the Prometheus metrics exposed by Node Readiness Controller for monitoring rule evaluation, taint operations, failures, and bootstrap progress.
The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.

## Metrics Endpoint

The controller serves metrics on `/metrics` only when metrics are explicitly enabled. Depending on the installation, the endpoint is served either over HTTP or over HTTPS. See [Installation](../user-guide/installation.md) for deployment details.
The controller serves metrics on `/metrics` only when metrics are explicitly enabled.

## Supported Metrics
Depending on the installation, the endpoint is exposed as:

- HTTP on port `8080` when the standard Prometheus component is enabled.
- HTTPS on port `8443` when the Prometheus TLS component is enabled.

See [Installation](../user-guide/installation.md) for deployment details.

## Metric Lifecycle Management

When a `NodeReadinessRule` is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.

**Metrics cleaned up on rule deletion:**

- `node_readiness_taint_operations_total{rule="..."}`
- `node_readiness_evaluation_duration_seconds{rule="..."}`
- `node_readiness_failures_total{rule="..."}`
- `node_readiness_bootstrap_completed_total{rule="..."}`
- `node_readiness_reconciliation_latency_seconds{rule="..."}`
- `node_readiness_bootstrap_duration_seconds{rule="..."}`
- `node_readiness_nodes_by_state{rule="..."}`
- `node_readiness_rule_last_reconciliation_timestamp_seconds{rule="..."}`

This ensures that:

- Deleted rules do not continue to appear in dashboards with stale values.
- Memory usage does not grow unbounded from removed rules.
- Metric cardinality tracks the set of rules that currently exist.

**Note:** The global `node_readiness_rules_total` gauge is updated separately. Rule-labeled metrics are explicitly deleted during rule cleanup.
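
The cleanup described above can be pictured with a small sketch. It does not use the Prometheus client and is not the controller's implementation: the `seriesKey` type, the `deleteRuleSeries` helper, and the rule names are hypothetical, and a plain map stands in for the metrics registry.

```go
package main

import "fmt"

// seriesKey identifies one time series by metric name and rule label.
// Hypothetical type for illustration; the real registry is not a map.
type seriesKey struct {
	metric string
	rule   string
}

// deleteRuleSeries removes every series labeled with the given rule
// and returns how many series were removed.
func deleteRuleSeries(series map[seriesKey]float64, rule string) int {
	removed := 0
	for k := range series {
		if k.rule == rule {
			delete(series, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	series := map[seriesKey]float64{
		{"node_readiness_failures_total", "gpu-rule"}: 3,
		{"node_readiness_nodes_by_state", "gpu-rule"}: 12,
		{"node_readiness_failures_total", "cni-rule"}: 1,
	}
	removed := deleteRuleSeries(series, "gpu-rule")
	fmt.Println("removed:", removed, "remaining:", len(series))
}
```

Series for other rules are left untouched, which is why dashboards for the remaining rules are unaffected by a deletion.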

-----

## Core Metrics

### `node_readiness_rules_total`

Number of `NodeReadinessRule` objects tracked by the controller.
Number of `NodeReadinessRule` objects currently tracked by the controller.

| Property | Value |
| --- | --- |
@@ -25,24 +57,17 @@ Total number of taint operations performed by the controller.
| Property | Value |
| --- | --- |
| Type | `counter` |
| Labels | `rule`, `operation` |
| Labels | `rule`, `operation` (`add`, `remove`) |
| Recorded when | The controller successfully adds or removes a taint |

#### Labels

| Label | Description | Values |
| --- | --- | --- |
| `rule` | `NodeReadinessRule` name | Any rule name |
| `operation` | Taint operation performed by the controller | `add`, `remove` |

### `node_readiness_evaluation_duration_seconds`

Duration of rule evaluations.
Duration of the controller's internal rule evaluations.

| Property | Value |
| --- | --- |
| Type | `histogram` |
| Labels | none |
| Labels | `rule` |
| Buckets | Prometheus default histogram buckets |
| Recorded when | The controller evaluates a rule against a node |

@@ -53,15 +78,8 @@ Total number of failure events recorded by the controller.
| Property | Value |
| --- | --- |
| Type | `counter` |
| Labels | `rule`, `reason` |
| Recorded when | The controller records an evaluation failure or taint add/remove failure |

#### Labels

| Label | Description | Values |
| --- | --- | --- |
| `rule` | `NodeReadinessRule` name | Any rule name |
| `reason` | Failure label recorded by the controller | `EvaluationError`, `AddTaintError`, `RemoveTaintError` |
| Labels | `rule`, `reason` (`EvaluationError`, `AddTaintError`, `RemoveTaintError`) |
| Recorded when | The controller encounters an error evaluating or patching a node |

### `node_readiness_bootstrap_completed_total`

@@ -73,8 +91,113 @@ Total number of nodes that have completed bootstrap.
| Labels | `rule` |
| Recorded when | The controller marks bootstrap as completed for a node under a bootstrap-only rule |

#### Labels
-----

## Extended Health and SLI Metrics

### `node_readiness_reconciliation_latency_seconds`

End-to-end latency from node condition change to taint operation completion.

| Property | Value |
| --- | --- |
| Type | `histogram` |
| Labels | `rule`, `operation` (`add_taint`, `remove_taint`) |
| Buckets | `0.01`, `0.05`, `0.1`, `0.5`, `1`, `2`, `5`, `10`, `30`, `60`, `120`, `300` seconds |
| Recorded when | A taint operation completes |

**Use case:** Measure how quickly the controller responds to node condition changes in the cluster.
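
To make the bucket layout concrete, here is a minimal sketch of how observations land in cumulative histogram buckets. It uses the bucket bounds listed in the table; the `cumulativeCounts` helper and the sample latencies are hypothetical and only illustrate the `le` semantics of a Prometheus histogram.

```go
package main

import "fmt"

// Upper bounds (seconds) listed for
// node_readiness_reconciliation_latency_seconds.
var latencyBuckets = []float64{0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300}

// cumulativeCounts returns, for each upper bound, the number of
// observations less than or equal to it. The final extra entry is the
// +Inf bucket, which counts every observation.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, v := range observations {
		for i, b := range bounds {
			if v <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // +Inf bucket
	}
	return counts
}

func main() {
	// Hypothetical latencies in seconds.
	obs := []float64{0.03, 0.4, 0.4, 7, 400}
	counts := cumulativeCounts(latencyBuckets, obs)
	for i, b := range latencyBuckets {
		fmt.Printf("le=%g: %d\n", b, counts[i])
	}
	fmt.Printf("le=+Inf: %d\n", counts[len(latencyBuckets)])
}
```

An observation above the largest bound (400s here) is only visible in the `+Inf` bucket, so latencies beyond 300s cannot be resolved further with these buckets.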

### `node_readiness_bootstrap_duration_seconds`

Time from node creation to bootstrap completion for bootstrap-only rules.

| Property | Value |
| --- | --- |
| Type | `histogram` |
| Labels | `rule` |
| Buckets | `1`, `5`, `10`, `30`, `60`, `120`, `300`, `600`, `1200` seconds |
| Recorded when | Bootstrap completion is observed for a node under a bootstrap-only rule |

**Use case:** Measure the actual time nodes take to become fully provisioned and bootstrap-complete.

### `node_readiness_nodes_by_state`

Number of nodes in each readiness state per rule.

| Property | Value |
| --- | --- |
| Type | `gauge` |
| Labels | `rule`, `state` (`ready`, `not_ready`, `bootstrapping`) |
| Recorded when | A rule reconciliation completes |

**Use case:** Track aggregate node health without introducing per-node metric cardinality, keeping controller memory footprint lean.

### `node_readiness_rule_last_reconciliation_timestamp_seconds`

Unix timestamp of the last reconciliation for a rule.

| Property | Value |
| --- | --- |
| Type | `gauge` |
| Labels | `rule` |
| Recorded when | A rule reconciliation loop successfully completes |

**Use case:** Detect rules that may be stuck or not reconciling frequently enough.

-----

## Example Queries & SLOs

### Latency Monitoring & SLOs

**Objective:** 95% of internal evaluations complete within 50 milliseconds (0.05s).

```promql
# Percentage of evaluations completing within 50ms
sum(rate(node_readiness_evaluation_duration_seconds_bucket{le="0.05"}[5m])) /
sum(rate(node_readiness_evaluation_duration_seconds_count[5m])) * 100
```

```promql
# P95 End-to-End Reconciliation Latency across all rules
histogram_quantile(0.95,
  sum by (le) (
    rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
  )
)
```
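
For intuition about what the P95 query returns, the following is a simplified sketch of quantile estimation from cumulative buckets using linear interpolation inside the matching bucket. It is not the Prometheus implementation: the `bucket` type, `quantile` function, and sample counts are hypothetical, and the sketch assumes positive bucket bounds, a final `+Inf` bucket, and `0 < q <= 1`.

```go
package main

import (
	"fmt"
	"math"
)

// posInf is the upper bound of the +Inf bucket.
var posInf = math.Inf(1)

// bucket is one cumulative bucket: count of observations <= upper.
type bucket struct {
	upper float64
	count float64
}

// quantile estimates the q-quantile from cumulative buckets sorted by
// upper bound. It returns NaN for inputs outside the stated assumptions.
func quantile(q float64, buckets []bucket) float64 {
	n := len(buckets)
	if q <= 0 || q > 1 || n < 2 || !math.IsInf(buckets[n-1].upper, 1) {
		return math.NaN()
	}
	total := buckets[n-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < n-1 && buckets[i].count < rank {
		i++
	}
	if i == n-1 {
		// Rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[n-2].upper
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upper, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	return lower + (buckets[i].upper-lower)*(rank-below)/inBucket
}

func main() {
	// Hypothetical cumulative counts for four buckets.
	b := []bucket{{0.5, 80}, {1, 90}, {2, 98}, {posInf, 100}}
	fmt.Printf("estimated p95: %.3fs\n", quantile(0.95, b))
}
```

Because the estimate is interpolated within a bucket, its precision is limited by the bucket widths; a P95 that falls between the `10` and `30` second bounds, for example, is only known to lie somewhere in that range.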

### Freshness Monitoring & SLOs

**Objective:** Every rule has reconciled within the last 2 minutes.

```promql
# Alert if any rule has not reconciled in the last 120 seconds
(time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 120
```
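
The same staleness check can be sketched outside PromQL, for example when reasoning about thresholds. The `staleRules` helper and the rule names are hypothetical; the map stands in for the per-rule values of `node_readiness_rule_last_reconciliation_timestamp_seconds`.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// staleRules returns, in sorted order, the rules whose last
// reconciliation timestamp (Unix seconds) is more than maxAgeSec
// seconds before nowSec.
func staleRules(lastReconcile map[string]float64, nowSec, maxAgeSec float64) []string {
	var stale []string
	for rule, ts := range lastReconcile {
		if nowSec-ts > maxAgeSec {
			stale = append(stale, rule)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	now := float64(time.Now().Unix())
	// Hypothetical gauge values keyed by rule name.
	last := map[string]float64{
		"cni-rule": now - 30,
		"gpu-rule": now - 400,
	}
	fmt.Println("stale rules:", staleRules(last, now, 120))
}
```

A rule that has never reconciled has no timestamp series at all, so a comparison like this (and the PromQL expression above) will not flag it; that case needs a separate check.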

### Availability Monitoring & SLOs

**Objective:** 99.9% of targeted nodes are ready.

```promql
# Percentage of ready nodes globally
100 * sum(node_readiness_nodes_by_state{state="ready"}) / sum(node_readiness_nodes_by_state)

# Percentage of ready nodes per rule
# (both sides are aggregated by rule so their label sets match)
100 * sum by (rule) (node_readiness_nodes_by_state{state="ready"}) / sum by (rule) (node_readiness_nodes_by_state)
```
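
As a worked example of the per-rule availability calculation, the sketch below aggregates state counts by rule and divides ready nodes by the total. The `sample` type, `readyPercentByRule` helper, rule names, and counts are all hypothetical.

```go
package main

import "fmt"

// sample mirrors one node_readiness_nodes_by_state series.
type sample struct {
	rule  string
	state string
	value float64
}

// readyPercentByRule returns the percentage of ready nodes for each
// rule. Rules with zero nodes in total are omitted.
func readyPercentByRule(samples []sample) map[string]float64 {
	ready := map[string]float64{}
	total := map[string]float64{}
	for _, s := range samples {
		total[s.rule] += s.value
		if s.state == "ready" {
			ready[s.rule] += s.value
		}
	}
	out := map[string]float64{}
	for rule, t := range total {
		if t == 0 {
			continue // avoid dividing by zero
		}
		out[rule] = 100 * ready[rule] / t
	}
	return out
}

func main() {
	samples := []sample{
		{"cni-rule", "ready", 9},
		{"cni-rule", "not_ready", 1},
		{"gpu-rule", "ready", 3},
		{"gpu-rule", "bootstrapping", 1},
	}
	for rule, pct := range readyPercentByRule(samples) {
		fmt.Printf("%s: %.1f%% ready\n", rule, pct)
	}
}
```

Note that bootstrapping nodes count toward the denominator here, so a burst of newly created nodes temporarily lowers the ready percentage.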

## Monitoring and Scale Testing

For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the [scale testing guide](../../../../hack/test-workloads/scale/README.md).

## Alerting Recommendations

Typical alerts to consider:

| Label | Description | Values |
| --- | --- | --- |
| `rule` | `NodeReadinessRule` name | Any rule name |
- **High latency:** P95 reconciliation latency above 10s for 5 minutes.
- **Stale reconciliations:** Any rule with no reconciliation for more than 5 minutes.
- **High failure rate:** Sustained increase in `node_readiness_failures_total`.
- **Low availability:** Ready-node percentage below your target threshold for a sustained period.