Skip to content

feat: v1beta1 fields and validating admission webhook#25

Closed
yairfalse wants to merge 6 commits into
mainfrom
feat/v1beta1-and-validating-webhook
Closed

feat: v1beta1 fields and validating admission webhook#25
yairfalse wants to merge 6 commits into
mainfrom
feat/v1beta1-and-validating-webhook

Conversation

@yairfalse
Copy link
Copy Markdown
Collaborator

Summary

  • v1beta1 field support: Implement maxSurge, maxUnavailable, and progressDeadlineSeconds fields
  • Progress deadline detection: Automatically fail stuck rollouts
  • Validating webhook: Reject invalid Rollout specs at admission time

Changes

v1beta1 Fields (TDD)

  • parse_surge_value() - Parse percentage ("25%") or absolute ("5") values
  • calculate_replica_split_with_surge() - Calculate stable/canary replicas with surge
  • is_progress_deadline_exceeded() - Detect stuck rollouts
  • Wire surge calculation into canary strategy
  • Set progress_started_at when entering Progressing phase
  • Check deadline in reconcile loop, mark as Failed when exceeded

Validating Webhook

  • Add /validate endpoint for AdmissionReview
  • Make validate_rollout() public for webhook use
  • Create ValidatingWebhookConfiguration manifest
  • Patch CA bundle into webhook config on startup
  • Update RBAC for secrets, CRDs, and webhook configs
  • Enable TLS (port 8443) by default

Validation Rules

  • spec.replicas >= 0
  • canary.canaryService and stableService not empty
  • canary.steps must have at least one step
  • step.setWeight must be 0-100
  • pause.duration must be valid format

Test plan

  • 214 tests passing (19 new tests added)
  • cargo clippy -- -D warnings clean
  • cargo fmt applied

🤖 Generated with Claude Code

yairfalse and others added 4 commits January 5, 2026 11:52
…adlineSeconds)

TDD implementation of v1beta1 field parsing and calculation:
- parse_surge_value(): Parse "25%" or "5" surge values
- calculate_replica_split_with_surge(): Replica calculation with surge support
- is_progress_deadline_exceeded(): Detect stuck rollouts
- Add progress_started_at field to RolloutStatus

14 new tests covering parsing, replica math, and deadline detection.
Functions ready for integration into reconciliation loop.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add max_surge, max_unavailable, progress_deadline_seconds to v1alpha1
  RolloutSpec (optional, for controller access regardless of version)
- Add progress_started_at to RolloutStatus for deadline tracking
- Update canary strategy to use calculate_replica_split_with_surge
- Update conversion functions to handle new optional fields
- Update all test files with new RolloutSpec fields

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Set progress_started_at timestamp when initializing canary rollout
- Add progress deadline check in reconcile loop
- Mark rollout as Failed when deadline exceeded during Progressing phase
- Emit CDEvent on deadline failure
- Tests: test_initialize_sets_progress_started_at,
  test_advance_preserves_progress_started_at

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /validate endpoint to webhook server
- Make validate_rollout public for webhook use
- Create ValidatingWebhookConfiguration manifest
- Patch ValidatingWebhookConfiguration CA bundle on startup
- Update RBAC for secrets, CRDs, and webhook configs
- Update webhook port to 8443 (TLS)
- Enable TLS by default in controller deployment

Validates:
- spec.replicas >= 0
- canary.canaryService and stableService not empty
- canary.steps has at least one step
- step.setWeight in range 0-100
- pause.duration valid format

Tests: 5 new validation webhook tests (214 total)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds v1beta1 field support and implements a validating admission webhook for the Kulta rollout controller. It introduces three new optional fields (maxSurge, maxUnavailable, progressDeadlineSeconds) to control rollout behavior and automatically fail stuck rollouts. A validating webhook rejects invalid Rollout specs at admission time.

Changes:

  • Implemented v1beta1 fields for surge control and progress deadline detection
  • Added validating admission webhook with TLS support and automatic CA bundle injection
  • Updated RBAC permissions and deployment manifests for webhook functionality

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/controller/rollout.rs Core logic for parsing surge values, calculating replica splits, and detecting progress deadline timeouts
src/server/webhook.rs Validating webhook implementation with AdmissionReview handling
src/server/tls.rs CA bundle patching for ValidatingWebhookConfiguration
src/server/health.rs Added /validate endpoint route
src/crd/rollout.rs Added v1beta1 field definitions (maxSurge, maxUnavailable, progressDeadlineSeconds, progress_started_at)
src/crd/conversion.rs Updated conversion logic between v1alpha1 and v1beta1
src/controller/strategies/canary.rs Updated to use calculate_replica_split_with_surge
src/controller/strategies/blue_green.rs Added v1beta1 fields to test fixtures
src/controller/strategies/simple.rs Added v1beta1 fields to test fixtures
src/bin/gen-crd.rs Updated webhook port from 8080 to 8443
deploy/validating-webhook.yaml New ValidatingWebhookConfiguration manifest
deploy/rbac.yaml Added permissions for secrets, CRDs, and webhook configurations
deploy/controller.yaml Added webhook port 8443 and TLS environment variables
src/server/webhook_test.rs Comprehensive tests for validating webhook
src/controller/rollout_test.rs Tests for surge parsing, replica calculation, and progress deadline
src/crd/rollout_test.rs Updated CRD generation test for multi-version support
src/crd/conversion_test.rs Tests for v1beta1 field conversion
tests/stress_test.rs Updated test fixtures with v1beta1 fields
tests/seppo_integration_test.rs Updated test fixtures with v1beta1 fields
src/controller/cdevents_test.rs Updated test fixtures with v1beta1 fields

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/server/tls.rs
Comment on lines +335 to +342
let patch = serde_json::json!({
"webhooks": [{
"name": "rollout.kulta.io",
"clientConfig": {
"caBundle": ca_bundle_base64
}
}]
});
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch strategy for updating the ValidatingWebhookConfiguration's CA bundle may not work correctly. The patch uses a JSON array with a single entry that only matches by name, but strategic merge patches on arrays typically use positional matching or require all array elements to be specified. If there are multiple webhooks in the configuration or if the webhook is not at index 0, this patch could fail or produce unexpected results.

Consider using a more precise patch strategy, such as a JSON Patch that targets the specific webhook by index, or ensure the webhook configuration only has one entry.

Copilot uses AI. Check for mistakes.
Comment thread src/controller/rollout.rs
Comment on lines +271 to +288
// With surge, we can keep more stable replicas during transition
// stable = total_replicas (keep all stable) while scaling up canary
// Total max = total_replicas + surge
let max_total = total_replicas + surge;
let min_total = (total_replicas - unavailable).max(0);

// During rollout: keep stable at full count, add canary pods up to surge limit
let canary_replicas = ideal_canary.min(max_total - total_replicas + ideal_canary);
let stable_replicas = (total_replicas - ideal_canary).max(min_total - canary_replicas);

// Ensure we don't exceed max or go below min
let total = stable_replicas + canary_replicas;
if total > max_total {
// Scale down stable if we exceed max
let excess = total - max_total;
return ((stable_replicas - excess).max(0), canary_replicas);
}

Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replica split calculation logic on lines 278-279 appears complex and potentially incorrect. Specifically:

Line 278: canary_replicas = ideal_canary.min(max_total - total_replicas + ideal_canary) simplifies to ideal_canary.min(surge + ideal_canary), which will always equal ideal_canary (since ideal_canary cannot be greater than itself plus a positive surge value).

Line 279: stable_replicas = (total_replicas - ideal_canary).max(min_total - canary_replicas) has interdependence where canary_replicas is used but stable_replicas affects the total count.

This logic should be reviewed and potentially simplified. The current implementation may not correctly implement the intended surge behavior for all edge cases.

Suggested change
// With surge, we can keep more stable replicas during transition
// stable = total_replicas (keep all stable) while scaling up canary
// Total max = total_replicas + surge
let max_total = total_replicas + surge;
let min_total = (total_replicas - unavailable).max(0);
// During rollout: keep stable at full count, add canary pods up to surge limit
let canary_replicas = ideal_canary.min(max_total - total_replicas + ideal_canary);
let stable_replicas = (total_replicas - ideal_canary).max(min_total - canary_replicas);
// Ensure we don't exceed max or go below min
let total = stable_replicas + canary_replicas;
if total > max_total {
// Scale down stable if we exceed max
let excess = total - max_total;
return ((stable_replicas - excess).max(0), canary_replicas);
}
// With surge, we can keep more replicas during transition
// Total max = total_replicas + surge
let max_total = total_replicas + surge;
let min_total = (total_replicas - unavailable).max(0);
// Start from the ideal split and then enforce surge/unavailable constraints.
let mut canary_replicas = ideal_canary;
let mut stable_replicas = (total_replicas - ideal_canary).max(0);
// Ensure we don't exceed the maximum allowed replicas.
let mut total = stable_replicas + canary_replicas;
if total > max_total {
let mut excess = total - max_total;
// Prefer scaling down canary replicas first.
let canary_reduction = canary_replicas.min(excess);
canary_replicas -= canary_reduction;
excess -= canary_reduction;
// If still exceeding, scale down stable replicas as needed.
if excess > 0 {
stable_replicas = (stable_replicas - excess).max(0);
}
}
// Ensure we don't go below the minimum required replicas.
total = stable_replicas + canary_replicas;
if total < min_total {
let deficit = min_total - total;
// Prefer adding stable replicas to satisfy availability constraints.
stable_replicas += deficit;
}

Copilot uses AI. Check for mistakes.
Comment thread src/crd/conversion.rs
Comment on lines +50 to +53
// v1beta1 fields are dropped for pure v1alpha1 representation
max_surge: None,
max_unavailable: None,
progress_deadline_seconds: None,
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When converting from v1beta1 to v1alpha1, the v1beta1 fields (max_surge, max_unavailable, progress_deadline_seconds) are set to None. However, since v1alpha1's RolloutSpec now includes these fields as optional (to support internal use), this means data loss occurs during round-trip conversion (v1alpha1 -> v1beta1 -> v1alpha1). If a v1alpha1 resource has these fields set, they will be lost after conversion to v1beta1 and back.

Consider preserving the v1beta1 field values when converting to v1alpha1, since v1alpha1 now supports them internally.

Copilot uses AI. Check for mistakes.
Comment thread src/controller/rollout.rs
Comment on lines +1341 to +1344
// Check progress deadline (only for Progressing phase with deadline configured)
if let Some(deadline_seconds) = rollout.spec.progress_deadline_seconds {
if let Some(current_status) = &rollout.status {
if current_status.phase == Some(Phase::Progressing)
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The progress deadline check in the reconcile loop only triggers when current_status.phase == Some(Phase::Progressing), but the is_progress_deadline_exceeded function also checks for Phase::Preview. This creates an inconsistency where Blue-Green rollouts in Preview phase will never trigger the deadline check, even though the function supports it.

Either remove the Preview phase check from is_progress_deadline_exceeded or add Preview phase to the reconcile loop check to make the behavior consistent.

Suggested change
// Check progress deadline (only for Progressing phase with deadline configured)
if let Some(deadline_seconds) = rollout.spec.progress_deadline_seconds {
if let Some(current_status) = &rollout.status {
if current_status.phase == Some(Phase::Progressing)
// Check progress deadline (for Progressing or Preview phases with deadline configured)
if let Some(deadline_seconds) = rollout.spec.progress_deadline_seconds {
if let Some(current_status) = &rollout.status {
if (current_status.phase == Some(Phase::Progressing)
|| current_status.phase == Some(Phase::Preview))

Copilot uses AI. Check for mistakes.
Comment thread src/controller/rollout.rs
}
} else {
// Absolute value
value.parse::<i32>().unwrap_or(0).max(0)
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parse_surge_value function allows negative percentages through the range check (0..=100).contains(&percent), but the absolute value parsing uses .max(0) to prevent negative values. This creates an inconsistency. A negative percentage like "-10%" would return 0, but this should be validated and rejected as invalid rather than silently converted to 0.

Consider validating negative percentages explicitly and potentially logging a warning or returning an error for invalid input values.

Suggested change
value.parse::<i32>().unwrap_or(0).max(0)
match value.parse::<i32>() {
Ok(abs) if abs >= 0 => abs,
_ => 0,
}

Copilot uses AI. Check for mistakes.
Comment thread src/controller/rollout.rs
Comment on lines +1160 to 1163
pub fn validate_rollout(rollout: &Rollout) -> Result<(), String> {
// Validate replicas >= 0
if rollout.spec.replicas < 0 {
return Err(format!(
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validate_rollout function does not validate the v1beta1 fields (maxSurge, maxUnavailable, progressDeadlineSeconds). Invalid values for these fields (e.g., "invalid%", negative progressDeadlineSeconds) would be silently accepted by the webhook and only cause issues at runtime.

Consider adding validation for:

  • maxSurge and maxUnavailable: should match the expected format (percentage with % suffix or absolute number)
  • progressDeadlineSeconds: should be positive or zero

Copilot uses AI. Check for mistakes.
yairfalse and others added 2 commits January 11, 2026 00:49
Test code idiomatically uses expect() and unwrap() since panics are
the expected failure mechanism. Add #![allow(...)] to test modules:
- src/server/health_test.rs
- src/server/metrics_test.rs
- src/server/tls_test.rs
- src/crd/conversion_test.rs

Also fix unnecessary cast in stress_test.rs (i as i32 -> 0i32..5 range)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@yairfalse
Copy link
Copy Markdown
Collaborator Author

Merged directly to main with fixes for Copilot review comments

@yairfalse yairfalse closed this Jan 11, 2026
@yairfalse yairfalse deleted the feat/v1beta1-and-validating-webhook branch January 11, 2026 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants