feat: resolve advisor per-Rollout from CRD config#31
Conversation
- Remove stale code examples that duplicated source - Fix RolloutStrategy trait signature (now takes DateTime<Utc>) - Update Context struct with trait-based DI fields - Document mock constructors and test helpers - Cut from 360 to ~105 lines, keeping all actionable info Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace ~140 lines of hand-rolled occurrence structs with the canonical false-protocol crate, gaining spec compliance (entities under context, DateTime<Utc> timestamps, protocol_version field) and a single source of truth. Adapt build_occurrence() to use the crate's builder API and return Option<Occurrence> for graceful fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review comment about path dependency by documenting the required repository layout in the getting started instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce a 4-level AI adoption ladder (Off → Context → Advised → Planned → Driven) that lets operators progressively opt into AI-assisted rollout decisions. Level 0 preserves current behavior. Level 1 enriches FALSE Protocol events with step/weight/traffic context. Level 2 consults an external advisor service but lets thresholds decide. - Add AdvisorLevel, AdvisorConfig, Recommendation, DecisionSource types - Create AnalysisAdvisor trait (NoOpAdvisor, HttpAdvisor, MockAdvisor) - Wire advisor into reconcile Context and analysis step - Enrich failure occurrences with rich what_failed/possible_causes - Add advisor.recommendation occurrence type for Level 2+ - Regenerate CRD schema with advisor fields Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of using a single shared advisor on Context, resolve the appropriate advisor at reconcile time based on each Rollout's AdvisorConfig. HttpAdvisor is constructed when level >= Advised and an endpoint is configured. Test mock overrides always win. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds per-Rollout AI advisor configuration to the Rollout CRD and consults an advisor during reconciliation (advisory-only), while also switching FALSE Protocol occurrence construction to use the false-protocol crate.
Changes:
- Add
spec.advisorconfig (withAdvisorLevelladder) andstatus.lastDecisionSourcefields to the Rollout CRD/types. - Add
controller::advisormodule and integrateresolve_advisor()+ advisor consultation/occurrence emission during reconcile. - Refactor FALSE Protocol occurrence emission to use types from the new
false-protocoldependency and update tests/fixtures accordingly.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/stress_test.rs | Update Rollout test builders to include new spec.advisor field. |
| tests/seppo_integration_test.rs | Update integration Rollout fixtures for new spec.advisor field. |
| src/crd/rollout_test.rs | Update CRD serialization test fixtures for new status field. |
| src/crd/rollout.rs | Add advisor config + recommendation types; add last_decision_source to status; update RolloutSpec schema. |
| src/crd/conversion_test.rs | Update conversion tests to include new spec.advisor field. |
| src/crd/conversion.rs | Populate new spec.advisor field during v1beta1 → v1alpha1 conversion. |
| src/controller/strategies/simple.rs | Initialize new last_decision_source field; update test Rollout spec literals. |
| src/controller/strategies/mod.rs | Update test Rollout spec literals for new spec.advisor field. |
| src/controller/strategies/canary.rs | Update test Rollout spec/status literals for new fields. |
| src/controller/strategies/blue_green.rs | Update test Rollout spec literals for new spec.advisor field. |
| src/controller/strategies/ab_testing.rs | Initialize new last_decision_source field; update test Rollout literals. |
| src/controller/rollout_test.rs | Update controller test Rollout fixtures for new spec.advisor and status fields; minor formatting changes. |
| src/controller/rollout/reconcile.rs | Add advisor resolution/consultation during metrics evaluation; add Context.advisor. |
| src/controller/prometheus_ab.rs | Minor test formatting changes. |
| src/controller/occurrence.rs | Switch to false-protocol crate types; add advisor recommendation occurrence emission; expand error context. |
| src/controller/mod.rs | Export new advisor module. |
| src/controller/cdevents_test.rs | Update Rollout fixtures for new spec.advisor and status field. |
| src/controller/advisor.rs | New advisor abstraction (NoOp/HTTP/mock) + per-Rollout resolution + tests. |
| deploy/crd.yaml | Extend CRD schema with spec.advisor and status.lastDecisionSource. |
| README.md | Update quick-start to clone sibling false-protocol repo. |
| Cargo.toml | Replace direct ulid usage with false-protocol path dependency. |
| Cargo.lock | Add false-protocol package and remove direct ulid dependency from root crate. |
| CLAUDE.md | Update dependency documentation to reflect false-protocol. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| strategy: strategy.name().to_string(), | ||
| current_step: current_status.current_step_index, | ||
| current_weight: current_status.current_weight, | ||
| metrics_healthy: is_healthy, | ||
| phase: format!("{:?}", current_status.phase), | ||
| history: current_status | ||
| .decisions |
There was a problem hiding this comment.
phase: format!("{:?}", current_status.phase) will stringify the Option (e.g., "Some(Progressing)") rather than the phase value. If the advisor expects a stable phase string, this will be misleading. Consider formatting the inner Phase (or using a dedicated enum/string mapping) and falling back to something like "Unknown" when phase is None.
There was a problem hiding this comment.
Fixed in d160f18 — now unwraps the Option<Phase> before formatting: current_status.phase.as_ref().map_or("Unknown".into(), |p| format!("{p:?}")).
| let advisor = resolve_advisor(&rollout.spec.advisor, &ctx.advisor); | ||
| match advisor.advise(&analysis_ctx).await { | ||
| Ok(recommendation) => { | ||
| info!( | ||
| rollout = ?name, | ||
| advisor_action = ?recommendation.action, | ||
| confidence = recommendation.confidence, | ||
| reasoning = %recommendation.reasoning, | ||
| threshold_healthy = is_healthy, | ||
| "Advisor recommendation received (threshold decision prevails)" | ||
| ); | ||
| // Emit advisor recommendation occurrence | ||
| crate::controller::occurrence::emit_advisor_occurrence( | ||
| &rollout, | ||
| strategy.name(), | ||
| &recommendation, | ||
| is_healthy, | ||
| &ctx.clock, | ||
| ); | ||
| } | ||
| Err(e) => { | ||
| warn!( | ||
| rollout = ?name, | ||
| error = %e, | ||
| "Advisor consultation failed, falling back to threshold decision" | ||
| ); | ||
| } |
There was a problem hiding this comment.
When level is Advised+ but endpoint is missing, resolve_advisor returns a NoOpAdvisor (after warning). The reconcile loop will still call .advise() and log/emit an "Advisor recommendation received" event (with Continue / 0 confidence), which looks like a real consultation. Consider skipping the call (and occurrence emission) when the resolved advisor is NoOpAdvisor due to misconfiguration, or check spec.advisor.endpoint before calling.
| let advisor = resolve_advisor(&rollout.spec.advisor, &ctx.advisor); | |
| match advisor.advise(&analysis_ctx).await { | |
| Ok(recommendation) => { | |
| info!( | |
| rollout = ?name, | |
| advisor_action = ?recommendation.action, | |
| confidence = recommendation.confidence, | |
| reasoning = %recommendation.reasoning, | |
| threshold_healthy = is_healthy, | |
| "Advisor recommendation received (threshold decision prevails)" | |
| ); | |
| // Emit advisor recommendation occurrence | |
| crate::controller::occurrence::emit_advisor_occurrence( | |
| &rollout, | |
| strategy.name(), | |
| &recommendation, | |
| is_healthy, | |
| &ctx.clock, | |
| ); | |
| } | |
| Err(e) => { | |
| warn!( | |
| rollout = ?name, | |
| error = %e, | |
| "Advisor consultation failed, falling back to threshold decision" | |
| ); | |
| } | |
| // Only consult the advisor when it is properly configured. | |
| // If the level is Advised+ but the endpoint is missing, skip the | |
| // consultation to avoid emitting misleading "Advisor recommendation" | |
| // events for a no-op advisor. | |
| let advisor_cfg = &rollout.spec.advisor; | |
| let advisor_level_is_advised_plus = matches!( | |
| advisor_cfg.level, | |
| AdvisorLevel::Advised | |
| | AdvisorLevel::Required | |
| | AdvisorLevel::Automated | |
| ); | |
| if advisor_level_is_advised_plus && advisor_cfg.endpoint.is_none() { | |
| warn!( | |
| rollout = ?name, | |
| "Advisor level is set to Advised+ but no advisor endpoint is configured; \ | |
| skipping advisor consultation and falling back to threshold decision" | |
| ); | |
| } else { | |
| let advisor = resolve_advisor(advisor_cfg, &ctx.advisor); | |
| match advisor.advise(&analysis_ctx).await { | |
| Ok(recommendation) => { | |
| info!( | |
| rollout = ?name, | |
| advisor_action = ?recommendation.action, | |
| confidence = recommendation.confidence, | |
| reasoning = %recommendation.reasoning, | |
| threshold_healthy = is_healthy, | |
| "Advisor recommendation received (threshold decision prevails)" | |
| ); | |
| // Emit advisor recommendation occurrence | |
| crate::controller::occurrence::emit_advisor_occurrence( | |
| &rollout, | |
| strategy.name(), | |
| &recommendation, | |
| is_healthy, | |
| &ctx.clock, | |
| ); | |
| } | |
| Err(e) => { | |
| warn!( | |
| rollout = ?name, | |
| error = %e, | |
| "Advisor consultation failed, falling back to threshold decision" | |
| ); | |
| } | |
| } |
There was a problem hiding this comment.
Fixed in d160f18 — added && rollout.spec.advisor.endpoint.is_some() guard in the reconcile loop so the advisor is only called when an endpoint is actually configured.
| pub fn new(endpoint: String, timeout: Duration) -> Self { | ||
| let client = reqwest::Client::builder() | ||
| .timeout(timeout) | ||
| .build() | ||
| .unwrap_or_default(); | ||
| Self { | ||
| client, | ||
| endpoint, | ||
| timeout, | ||
| } |
There was a problem hiding this comment.
HttpAdvisor::new uses build().unwrap_or_default(), which silently discards builder errors and may produce a client without the intended configuration (including the timeout). It would be safer to either return a Result<HttpAdvisor, AdvisorError> from new (and let resolve_advisor decide how to fall back) or at least log and fall back explicitly so failures aren’t hidden.
There was a problem hiding this comment.
Fixed in d160f18 — now logs the build error with tracing::warn! and falls back to the default client explicitly, so the failure is visible in logs.
| let response = self | ||
| .client | ||
| .post(&self.endpoint) | ||
| .json(context) | ||
| .send() | ||
| .await | ||
| .map_err(|e| { | ||
| if e.is_timeout() { | ||
| AdvisorError::Timeout(self.timeout) | ||
| } else { | ||
| AdvisorError::Unreachable(e.to_string()) | ||
| } | ||
| })?; | ||
|
|
||
| let recommendation: Recommendation = response | ||
| .json() | ||
| .await | ||
| .map_err(|e| AdvisorError::InvalidResponse(e.to_string()))?; | ||
|
|
||
| Ok(recommendation) |
There was a problem hiding this comment.
HttpAdvisor::advise parses JSON from the response without checking HTTP status codes. Non-2xx responses (e.g., 500 with HTML body) will currently surface as an InvalidResponse JSON parse error, which loses the actual status/reason. Consider calling response.error_for_status() (or checking status() and mapping it into AdvisorError::Unreachable/InvalidResponse) before attempting to deserialize.
There was a problem hiding this comment.
Fixed in d160f18 — now checks HTTP status before JSON parsing and returns AdvisorError::InvalidResponse with the status code and truncated body for non-2xx responses.
| match config.level { | ||
| AdvisorLevel::Off | AdvisorLevel::Context => std::sync::Arc::new(NoOpAdvisor), | ||
| AdvisorLevel::Advised | AdvisorLevel::Planned | AdvisorLevel::Driven => { | ||
| match &config.endpoint { | ||
| Some(endpoint) => { | ||
| let timeout = Duration::from_secs(config.timeout_seconds); | ||
| std::sync::Arc::new(HttpAdvisor::new(endpoint.clone(), timeout)) | ||
| } | ||
| None => { | ||
| tracing::warn!( | ||
| level = ?config.level, | ||
| "Advisor level requires endpoint but none configured, falling back to no-op" | ||
| ); | ||
| std::sync::Arc::new(NoOpAdvisor) | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
resolve_advisor constructs a brand new HttpAdvisor (and reqwest Client) every time it’s called. In reconciliation, this can happen frequently and prevents connection pooling/reuse. Consider caching HttpAdvisor instances (e.g., in Context keyed by endpoint+timeout, or by storing an Option<Arc<HttpAdvisor>> on the rollout/strategy context) so repeated reconciles don’t rebuild the client.
There was a problem hiding this comment.
Fixed in d160f18 — added AdvisorCache (Mutex<HashMap<(endpoint, timeout), Arc<dyn AnalysisAdvisor>>>) to the Context struct. resolve_advisor() now looks up existing instances before constructing new ones, so reqwest::Client connections are reused across reconcile calls.
| } | ||
|
|
||
| fn is_default_advisor_config(c: &AdvisorConfig) -> bool { | ||
| c.level == AdvisorLevel::Off && c.endpoint.is_none() && c.timeout_seconds == 10 |
There was a problem hiding this comment.
AdvisorConfig currently derives Default, which makes timeout_seconds default to 0 (u64 default). That disagrees with is_default_advisor_config expecting timeout_seconds == 10, so rollouts with an omitted spec.advisor will deserialize to a non-“default” config and won’t be skipped on serialization. Consider implementing Default for AdvisorConfig (or using a shared const) so the Rust default matches the intended 10s timeout and is_default_advisor_config.
| c.level == AdvisorLevel::Off && c.endpoint.is_none() && c.timeout_seconds == 10 | |
| c.level == AdvisorLevel::Off && c.endpoint.is_none() && c.timeout_seconds == 0 |
There was a problem hiding this comment.
Fixed in d160f18 — removed derive(Default) and added manual impl Default that uses the same const DEFAULT_ADVISOR_TIMEOUT_SECONDS: u64 = 10 as the serde default function, so both paths agree.
| advisor: | ||
| default: | ||
| level: 'Off' | ||
| timeoutSeconds: 0 | ||
| description: AI advisor configuration for progressive AI adoption | ||
| properties: | ||
| endpoint: | ||
| description: URL of AI advisory service (e.g., MCP endpoint, HTTP | ||
| API) | ||
| nullable: true | ||
| type: string | ||
| level: | ||
| default: 'Off' | ||
| description: Advisor integration level | ||
| enum: | ||
| - 'Off' | ||
| - Context | ||
| - Advised | ||
| - Planned | ||
| - Driven | ||
| type: string | ||
| timeoutSeconds: | ||
| description: Timeout for advisory calls in seconds | ||
| format: uint64 | ||
| minimum: 0.0 | ||
| type: integer | ||
| type: object |
There was a problem hiding this comment.
The CRD schema sets spec.advisor.default.timeoutSeconds: 0, which conflicts with the Rust-side intended default of 10s (default_advisor_timeout / is_default_advisor_config). With the current schema, users enabling level: Advised and omitting timeoutSeconds may end up with an unexpected 0-second timeout depending on how defaults are applied. Align the CRD default with the controller default (likely 10) to avoid surprising behavior and noisy diffs.
There was a problem hiding this comment.
Fixed in d160f18 — CRD regenerated after fixing the impl Default for AdvisorConfig. The schema now reflects the correct default of 10s.
| # FALSE Protocol (ULID generation for occurrence IDs) | ||
| ulid = "1" | ||
| # FALSE Protocol occurrence types | ||
| false-protocol = { path = "../false-protocol/rust" } |
There was a problem hiding this comment.
false-protocol = { path = "../false-protocol/rust" } points outside this repository. That makes builds non-reproducible in typical CI and for downstream consumers (a fresh clone of this repo won’t compile unless a sibling directory exists). Consider switching to a git dependency (pinned revision/tag), vendoring it as a workspace member/submodule within the repo, or publishing it to crates.io so cargo build works from a standalone checkout.
| false-protocol = { path = "../false-protocol/rust" } | |
| false-protocol = "0.1" |
There was a problem hiding this comment.
Intentional — this is a monorepo-adjacent setup where both repos are checked out as siblings. CI clones both. We'll publish the crate when the protocol stabilizes, but the path dep is correct for now.
| /// Configuration for the AI advisor | ||
| #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)] | ||
| pub struct AdvisorConfig { | ||
| /// Advisor integration level | ||
| #[serde(default)] | ||
| pub level: AdvisorLevel, | ||
|
|
||
| /// URL of AI advisory service (e.g., MCP endpoint, HTTP API) | ||
| #[serde(skip_serializing_if = "Option::is_none")] | ||
| pub endpoint: Option<String>, | ||
|
|
||
| /// Timeout for advisory calls in seconds | ||
| #[serde( | ||
| rename = "timeoutSeconds", | ||
| default = "default_advisor_timeout", | ||
| skip_serializing_if = "is_default_advisor_timeout" | ||
| )] | ||
| pub timeout_seconds: u64, | ||
| } |
There was a problem hiding this comment.
AdvisorConfig derives Default, but the intended default timeout is encoded only in Serde (default_advisor_timeout) and in is_default_advisor_timeout (== 10). This means AdvisorConfig::default() yields timeout_seconds = 0, and any struct literals using Default::default() (many tests do) will create a 0-second timeout. Recommend removing #[derive(Default)] and adding a manual impl Default that uses default_advisor_timeout() (10), or introducing a const DEFAULT_ADVISOR_TIMEOUT_SECONDS: u64 used by both serde defaults and Rust Default.
There was a problem hiding this comment.
Fixed in d160f18 — same fix as above: manual impl Default with shared const. Both serde and Default::default() now produce timeout_seconds: 10.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cache HttpAdvisor instances in AdvisorCache to reuse reqwest::Client connections across reconcile calls (keyed by endpoint + timeout) - Fix AdvisorConfig Default: manual impl with const timeout matching serde - Fix phase stringification: unwrap Option<Phase> before formatting - Skip advisor call when endpoint missing (guard in reconcile loop) - Fix HttpAdvisor: log client build errors, check HTTP status before JSON Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
resolve_advisor()function that constructs the appropriate advisor at reconcile time based on each Rollout'sAdvisorConfigHttpAdvisoris created whenlevel >= Advisedand an endpoint URL is configuredNoOpAdvisorwhen endpoint is missing (with warning log)MockAdvisoron Context bypasses CRD config resolutionTest plan
cargo fmtcleanresolve_advisortested for all 5 paths: Off, Context, Advised+endpoint, Advised-no-endpoint, mock override🤖 Generated with Claude Code