feat: v1beta1 fields and validating admission webhook#25
Conversation
…adlineSeconds) TDD implementation of v1beta1 field parsing and calculation: - parse_surge_value(): Parse "25%" or "5" surge values - calculate_replica_split_with_surge(): Replica calculation with surge support - is_progress_deadline_exceeded(): Detect stuck rollouts - Add progress_started_at field to RolloutStatus 14 new tests covering parsing, replica math, and deadline detection. Functions ready for integration into reconciliation loop. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add max_surge, max_unavailable, progress_deadline_seconds to v1alpha1 RolloutSpec (optional, for controller access regardless of version) - Add progress_started_at to RolloutStatus for deadline tracking - Update canary strategy to use calculate_replica_split_with_surge - Update conversion functions to handle new optional fields - Update all test files with new RolloutSpec fields 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Set progress_started_at timestamp when initializing canary rollout - Add progress deadline check in reconcile loop - Mark rollout as Failed when deadline exceeded during Progressing phase - Emit CDEvent on deadline failure - Tests: test_initialize_sets_progress_started_at, test_advance_preserves_progress_started_at 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /validate endpoint to webhook server - Make validate_rollout public for webhook use - Create ValidatingWebhookConfiguration manifest - Patch ValidatingWebhookConfiguration CA bundle on startup - Update RBAC for secrets, CRDs, and webhook configs - Update webhook port to 8443 (TLS) - Enable TLS by default in controller deployment Validates: - spec.replicas >= 0 - canary.canaryService and stableService not empty - canary.steps has at least one step - step.setWeight in range 0-100 - pause.duration valid format Tests: 5 new validation webhook tests (214 total) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds v1beta1 field support and implements a validating admission webhook for the Kulta rollout controller. It introduces three new optional fields (maxSurge, maxUnavailable, progressDeadlineSeconds) to control rollout behavior and automatically fail stuck rollouts. A validating webhook rejects invalid Rollout specs at admission time.
Changes:
- Implemented v1beta1 fields for surge control and progress deadline detection
- Added validating admission webhook with TLS support and automatic CA bundle injection
- Updated RBAC permissions and deployment manifests for webhook functionality
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/controller/rollout.rs | Core logic for parsing surge values, calculating replica splits, and detecting progress deadline timeouts |
| src/server/webhook.rs | Validating webhook implementation with AdmissionReview handling |
| src/server/tls.rs | CA bundle patching for ValidatingWebhookConfiguration |
| src/server/health.rs | Added /validate endpoint route |
| src/crd/rollout.rs | Added v1beta1 field definitions (maxSurge, maxUnavailable, progressDeadlineSeconds, progress_started_at) |
| src/crd/conversion.rs | Updated conversion logic between v1alpha1 and v1beta1 |
| src/controller/strategies/canary.rs | Updated to use calculate_replica_split_with_surge |
| src/controller/strategies/blue_green.rs | Added v1beta1 fields to test fixtures |
| src/controller/strategies/simple.rs | Added v1beta1 fields to test fixtures |
| src/bin/gen-crd.rs | Updated webhook port from 8080 to 8443 |
| deploy/validating-webhook.yaml | New ValidatingWebhookConfiguration manifest |
| deploy/rbac.yaml | Added permissions for secrets, CRDs, and webhook configurations |
| deploy/controller.yaml | Added webhook port 8443 and TLS environment variables |
| src/server/webhook_test.rs | Comprehensive tests for validating webhook |
| src/controller/rollout_test.rs | Tests for surge parsing, replica calculation, and progress deadline |
| src/crd/rollout_test.rs | Updated CRD generation test for multi-version support |
| src/crd/conversion_test.rs | Tests for v1beta1 field conversion |
| tests/stress_test.rs | Updated test fixtures with v1beta1 fields |
| tests/seppo_integration_test.rs | Updated test fixtures with v1beta1 fields |
| src/controller/cdevents_test.rs | Updated test fixtures with v1beta1 fields |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let patch = serde_json::json!({ | ||
| "webhooks": [{ | ||
| "name": "rollout.kulta.io", | ||
| "clientConfig": { | ||
| "caBundle": ca_bundle_base64 | ||
| } | ||
| }] | ||
| }); |
There was a problem hiding this comment.
The patch strategy for updating the ValidatingWebhookConfiguration's CA bundle may not work correctly. The patch uses a JSON array with a single entry that only matches by name, but strategic merge patches on arrays typically use positional matching or require all array elements to be specified. If there are multiple webhooks in the configuration or if the webhook is not at index 0, this patch could fail or produce unexpected results.
Consider using a more precise patch strategy, such as a JSON Patch that targets the specific webhook by index, or ensure the webhook configuration only has one entry.
| // With surge, we can keep more stable replicas during transition | ||
| // stable = total_replicas (keep all stable) while scaling up canary | ||
| // Total max = total_replicas + surge | ||
| let max_total = total_replicas + surge; | ||
| let min_total = (total_replicas - unavailable).max(0); | ||
|
|
||
| // During rollout: keep stable at full count, add canary pods up to surge limit | ||
| let canary_replicas = ideal_canary.min(max_total - total_replicas + ideal_canary); | ||
| let stable_replicas = (total_replicas - ideal_canary).max(min_total - canary_replicas); | ||
|
|
||
| // Ensure we don't exceed max or go below min | ||
| let total = stable_replicas + canary_replicas; | ||
| if total > max_total { | ||
| // Scale down stable if we exceed max | ||
| let excess = total - max_total; | ||
| return ((stable_replicas - excess).max(0), canary_replicas); | ||
| } | ||
|
|
There was a problem hiding this comment.
The replica split calculation logic on lines 278-279 appears complex and potentially incorrect. Specifically:
Line 278: canary_replicas = ideal_canary.min(max_total - total_replicas + ideal_canary) simplifies to ideal_canary.min(surge + ideal_canary), which will always equal ideal_canary (since ideal_canary cannot be greater than itself plus a positive surge value).
Line 279: stable_replicas = (total_replicas - ideal_canary).max(min_total - canary_replicas) has interdependence where canary_replicas is used but stable_replicas affects the total count.
This logic should be reviewed and potentially simplified. The current implementation may not correctly implement the intended surge behavior for all edge cases.
| // With surge, we can keep more stable replicas during transition | |
| // stable = total_replicas (keep all stable) while scaling up canary | |
| // Total max = total_replicas + surge | |
| let max_total = total_replicas + surge; | |
| let min_total = (total_replicas - unavailable).max(0); | |
| // During rollout: keep stable at full count, add canary pods up to surge limit | |
| let canary_replicas = ideal_canary.min(max_total - total_replicas + ideal_canary); | |
| let stable_replicas = (total_replicas - ideal_canary).max(min_total - canary_replicas); | |
| // Ensure we don't exceed max or go below min | |
| let total = stable_replicas + canary_replicas; | |
| if total > max_total { | |
| // Scale down stable if we exceed max | |
| let excess = total - max_total; | |
| return ((stable_replicas - excess).max(0), canary_replicas); | |
| } | |
| // With surge, we can keep more replicas during transition | |
| // Total max = total_replicas + surge | |
| let max_total = total_replicas + surge; | |
| let min_total = (total_replicas - unavailable).max(0); | |
| // Start from the ideal split and then enforce surge/unavailable constraints. | |
| let mut canary_replicas = ideal_canary; | |
| let mut stable_replicas = (total_replicas - ideal_canary).max(0); | |
| // Ensure we don't exceed the maximum allowed replicas. | |
| let mut total = stable_replicas + canary_replicas; | |
| if total > max_total { | |
| let mut excess = total - max_total; | |
| // Prefer scaling down canary replicas first. | |
| let canary_reduction = canary_replicas.min(excess); | |
| canary_replicas -= canary_reduction; | |
| excess -= canary_reduction; | |
| // If still exceeding, scale down stable replicas as needed. | |
| if excess > 0 { | |
| stable_replicas = (stable_replicas - excess).max(0); | |
| } | |
| } | |
| // Ensure we don't go below the minimum required replicas. | |
| total = stable_replicas + canary_replicas; | |
| if total < min_total { | |
| let deficit = min_total - total; | |
| // Prefer adding stable replicas to satisfy availability constraints. | |
| stable_replicas += deficit; | |
| } |
| // v1beta1 fields are dropped for pure v1alpha1 representation | ||
| max_surge: None, | ||
| max_unavailable: None, | ||
| progress_deadline_seconds: None, |
There was a problem hiding this comment.
When converting from v1beta1 to v1alpha1, the v1beta1 fields (max_surge, max_unavailable, progress_deadline_seconds) are set to None. However, since v1alpha1's RolloutSpec now includes these fields as optional (to support internal use), this means data loss occurs during round-trip conversion (v1alpha1 -> v1beta1 -> v1alpha1). If a v1alpha1 resource has these fields set, they will be lost after conversion to v1beta1 and back.
Consider preserving the v1beta1 field values when converting to v1alpha1, since v1alpha1 now supports them internally.
| // Check progress deadline (only for Progressing phase with deadline configured) | ||
| if let Some(deadline_seconds) = rollout.spec.progress_deadline_seconds { | ||
| if let Some(current_status) = &rollout.status { | ||
| if current_status.phase == Some(Phase::Progressing) |
There was a problem hiding this comment.
The progress deadline check in the reconcile loop only triggers when current_status.phase == Some(Phase::Progressing), but the is_progress_deadline_exceeded function also checks for Phase::Preview. This creates an inconsistency where Blue-Green rollouts in Preview phase will never trigger the deadline check, even though the function supports it.
Either remove the Preview phase check from is_progress_deadline_exceeded or add Preview phase to the reconcile loop check to make the behavior consistent.
| // Check progress deadline (only for Progressing phase with deadline configured) | |
| if let Some(deadline_seconds) = rollout.spec.progress_deadline_seconds { | |
| if let Some(current_status) = &rollout.status { | |
| if current_status.phase == Some(Phase::Progressing) | |
| // Check progress deadline (for Progressing or Preview phases with deadline configured) | |
| if let Some(deadline_seconds) = rollout.spec.progress_deadline_seconds { | |
| if let Some(current_status) = &rollout.status { | |
| if (current_status.phase == Some(Phase::Progressing) | |
| || current_status.phase == Some(Phase::Preview)) |
| } | ||
| } else { | ||
| // Absolute value | ||
| value.parse::<i32>().unwrap_or(0).max(0) |
There was a problem hiding this comment.
The parse_surge_value function allows negative percentages through the range check (0..=100).contains(&percent), but the absolute value parsing uses .max(0) to prevent negative values. This creates an inconsistency. A negative percentage like "-10%" would return 0, but this should be validated and rejected as invalid rather than silently converted to 0.
Consider validating negative percentages explicitly and potentially logging a warning or returning an error for invalid input values.
| value.parse::<i32>().unwrap_or(0).max(0) | |
| match value.parse::<i32>() { | |
| Ok(abs) if abs >= 0 => abs, | |
| _ => 0, | |
| } |
| pub fn validate_rollout(rollout: &Rollout) -> Result<(), String> { | ||
| // Validate replicas >= 0 | ||
| if rollout.spec.replicas < 0 { | ||
| return Err(format!( |
There was a problem hiding this comment.
The validate_rollout function does not validate the v1beta1 fields (maxSurge, maxUnavailable, progressDeadlineSeconds). Invalid values for these fields (e.g., "invalid%", negative progressDeadlineSeconds) would be silently accepted by the webhook and only cause issues at runtime.
Consider adding validation for:
- maxSurge and maxUnavailable: should match the expected format (percentage with % suffix or absolute number)
- progressDeadlineSeconds: should be positive or zero
Test code idiomatically uses expect() and unwrap() since panics are the expected failure mechanism. Add #![allow(...)] to test modules: - src/server/health_test.rs - src/server/metrics_test.rs - src/server/tls_test.rs - src/crd/conversion_test.rs Also fix unnecessary cast in stress_test.rs (i as i32 -> 0i32..5 range) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Merged directly to main with fixes for Copilot review comments |
Summary
maxSurge,maxUnavailable, andprogressDeadlineSecondsfieldsChanges
v1beta1 Fields (TDD)
parse_surge_value()- Parse percentage ("25%") or absolute ("5") valuescalculate_replica_split_with_surge()- Calculate stable/canary replicas with surgeis_progress_deadline_exceeded()- Detect stuck rolloutsprogress_started_atwhen entering Progressing phaseValidating Webhook
/validateendpoint for AdmissionReviewvalidate_rollout()public for webhook useValidation Rules
spec.replicas>= 0canary.canaryServiceandstableServicenot emptycanary.stepsmust have at least one stepstep.setWeightmust be 0-100pause.durationmust be valid formatTest plan
cargo clippy -- -D warningscleancargo fmtapplied🤖 Generated with Claude Code