Skip to content

feat: resolve advisor per-Rollout from CRD config#31

Merged
yairfalse merged 8 commits into
mainfrom
feature/update-sykli
Mar 2, 2026
Merged

feat: resolve advisor per-Rollout from CRD config#31
yairfalse merged 8 commits into
mainfrom
feature/update-sykli

Conversation

@yairfalse
Copy link
Copy Markdown
Collaborator

Summary

  • Adds resolve_advisor() function that constructs the appropriate advisor at reconcile time based on each Rollout's AdvisorConfig
  • HttpAdvisor is created when level >= Advised and an endpoint URL is configured
  • Falls back to NoOpAdvisor when endpoint is missing (with warning log)
  • Test mock overrides always win — MockAdvisor on Context bypasses CRD config resolution

Test plan

  • 306 tests pass
  • Zero warnings, cargo fmt clean
  • resolve_advisor tested for all 5 paths: Off, Context, Advised+endpoint, Advised-no-endpoint, mock override

🤖 Generated with Claude Code

yairfalse and others added 6 commits February 21, 2026 01:39
- Remove stale code examples that duplicated source
- Fix RolloutStrategy trait signature (now takes DateTime<Utc>)
- Update Context struct with trait-based DI fields
- Document mock constructors and test helpers
- Cut from 360 to ~105 lines, keeping all actionable info

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace ~140 lines of hand-rolled occurrence structs with the canonical
false-protocol crate, gaining spec compliance (entities under context,
DateTime<Utc> timestamps, protocol_version field) and a single source
of truth. Adapt build_occurrence() to use the crate's builder API and
return Option<Occurrence> for graceful fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review comment about path dependency by documenting the
required repository layout in the getting started instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce a 4-level AI adoption ladder (Off → Context → Advised →
Planned → Driven) that lets operators progressively opt into AI-assisted
rollout decisions. Level 0 preserves current behavior. Level 1 enriches
FALSE Protocol events with step/weight/traffic context. Level 2 consults
an external advisor service but lets thresholds decide.

- Add AdvisorLevel, AdvisorConfig, Recommendation, DecisionSource types
- Create AnalysisAdvisor trait (NoOpAdvisor, HttpAdvisor, MockAdvisor)
- Wire advisor into reconcile Context and analysis step
- Enrich failure occurrences with rich what_failed/possible_causes
- Add advisor.recommendation occurrence type for Level 2+
- Regenerate CRD schema with advisor fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of using a single shared advisor on Context, resolve the
appropriate advisor at reconcile time based on each Rollout's
AdvisorConfig. HttpAdvisor is constructed when level >= Advised and
an endpoint is configured. Test mock overrides always win.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds per-Rollout AI advisor configuration to the Rollout CRD and consults an advisor during reconciliation (advisory-only), while also switching FALSE Protocol occurrence construction to use the false-protocol crate.

Changes:

  • Add spec.advisor config (with AdvisorLevel ladder) and status.lastDecisionSource fields to the Rollout CRD/types.
  • Add controller::advisor module and integrate resolve_advisor() + advisor consultation/occurrence emission during reconcile.
  • Refactor FALSE Protocol occurrence emission to use types from the new false-protocol dependency and update tests/fixtures accordingly.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/stress_test.rs Update Rollout test builders to include new spec.advisor field.
tests/seppo_integration_test.rs Update integration Rollout fixtures for new spec.advisor field.
src/crd/rollout_test.rs Update CRD serialization test fixtures for new status field.
src/crd/rollout.rs Add advisor config + recommendation types; add last_decision_source to status; update RolloutSpec schema.
src/crd/conversion_test.rs Update conversion tests to include new spec.advisor field.
src/crd/conversion.rs Populate new spec.advisor field during v1beta1 → v1alpha1 conversion.
src/controller/strategies/simple.rs Initialize new last_decision_source field; update test Rollout spec literals.
src/controller/strategies/mod.rs Update test Rollout spec literals for new spec.advisor field.
src/controller/strategies/canary.rs Update test Rollout spec/status literals for new fields.
src/controller/strategies/blue_green.rs Update test Rollout spec literals for new spec.advisor field.
src/controller/strategies/ab_testing.rs Initialize new last_decision_source field; update test Rollout literals.
src/controller/rollout_test.rs Update controller test Rollout fixtures for new spec.advisor and status fields; minor formatting changes.
src/controller/rollout/reconcile.rs Add advisor resolution/consultation during metrics evaluation; add Context.advisor.
src/controller/prometheus_ab.rs Minor test formatting changes.
src/controller/occurrence.rs Switch to false-protocol crate types; add advisor recommendation occurrence emission; expand error context.
src/controller/mod.rs Export new advisor module.
src/controller/cdevents_test.rs Update Rollout fixtures for new spec.advisor and status field.
src/controller/advisor.rs New advisor abstraction (NoOp/HTTP/mock) + per-Rollout resolution + tests.
deploy/crd.yaml Extend CRD schema with spec.advisor and status.lastDecisionSource.
README.md Update quick-start to clone sibling false-protocol repo.
Cargo.toml Replace direct ulid usage with false-protocol path dependency.
Cargo.lock Add false-protocol package and remove direct ulid dependency from root crate.
CLAUDE.md Update dependency documentation to reflect false-protocol.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +244 to +250
strategy: strategy.name().to_string(),
current_step: current_status.current_step_index,
current_weight: current_status.current_weight,
metrics_healthy: is_healthy,
phase: format!("{:?}", current_status.phase),
history: current_status
.decisions
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phase: format!("{:?}", current_status.phase) will stringify the Option (e.g., "Some(Progressing)") rather than the phase value. If the advisor expects a stable phase string, this will be misleading. Consider formatting the inner Phase (or using a dedicated enum/string mapping) and falling back to something like "Unknown" when phase is None.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — now unwraps the Option<Phase> before formatting: current_status.phase.as_ref().map_or("Unknown".into(), |p| format!("{p:?}")).

Comment thread src/controller/rollout/reconcile.rs Outdated
Comment on lines +256 to +282
let advisor = resolve_advisor(&rollout.spec.advisor, &ctx.advisor);
match advisor.advise(&analysis_ctx).await {
Ok(recommendation) => {
info!(
rollout = ?name,
advisor_action = ?recommendation.action,
confidence = recommendation.confidence,
reasoning = %recommendation.reasoning,
threshold_healthy = is_healthy,
"Advisor recommendation received (threshold decision prevails)"
);
// Emit advisor recommendation occurrence
crate::controller::occurrence::emit_advisor_occurrence(
&rollout,
strategy.name(),
&recommendation,
is_healthy,
&ctx.clock,
);
}
Err(e) => {
warn!(
rollout = ?name,
error = %e,
"Advisor consultation failed, falling back to threshold decision"
);
}
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When level is Advised+ but endpoint is missing, resolve_advisor returns a NoOpAdvisor (after warning). The reconcile loop will still call .advise() and log/emit an "Advisor recommendation received" event (with Continue / 0 confidence), which looks like a real consultation. Consider skipping the call (and occurrence emission) when the resolved advisor is NoOpAdvisor due to misconfiguration, or check spec.advisor.endpoint before calling.

Suggested change
let advisor = resolve_advisor(&rollout.spec.advisor, &ctx.advisor);
match advisor.advise(&analysis_ctx).await {
Ok(recommendation) => {
info!(
rollout = ?name,
advisor_action = ?recommendation.action,
confidence = recommendation.confidence,
reasoning = %recommendation.reasoning,
threshold_healthy = is_healthy,
"Advisor recommendation received (threshold decision prevails)"
);
// Emit advisor recommendation occurrence
crate::controller::occurrence::emit_advisor_occurrence(
&rollout,
strategy.name(),
&recommendation,
is_healthy,
&ctx.clock,
);
}
Err(e) => {
warn!(
rollout = ?name,
error = %e,
"Advisor consultation failed, falling back to threshold decision"
);
}
// Only consult the advisor when it is properly configured.
// If the level is Advised+ but the endpoint is missing, skip the
// consultation to avoid emitting misleading "Advisor recommendation"
// events for a no-op advisor.
let advisor_cfg = &rollout.spec.advisor;
let advisor_level_is_advised_plus = matches!(
advisor_cfg.level,
AdvisorLevel::Advised
| AdvisorLevel::Required
| AdvisorLevel::Automated
);
if advisor_level_is_advised_plus && advisor_cfg.endpoint.is_none() {
warn!(
rollout = ?name,
"Advisor level is set to Advised+ but no advisor endpoint is configured; \
skipping advisor consultation and falling back to threshold decision"
);
} else {
let advisor = resolve_advisor(advisor_cfg, &ctx.advisor);
match advisor.advise(&analysis_ctx).await {
Ok(recommendation) => {
info!(
rollout = ?name,
advisor_action = ?recommendation.action,
confidence = recommendation.confidence,
reasoning = %recommendation.reasoning,
threshold_healthy = is_healthy,
"Advisor recommendation received (threshold decision prevails)"
);
// Emit advisor recommendation occurrence
crate::controller::occurrence::emit_advisor_occurrence(
&rollout,
strategy.name(),
&recommendation,
is_healthy,
&ctx.clock,
);
}
Err(e) => {
warn!(
rollout = ?name,
error = %e,
"Advisor consultation failed, falling back to threshold decision"
);
}
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — added && rollout.spec.advisor.endpoint.is_some() guard in the reconcile loop so the advisor is only called when an endpoint is actually configured.

Comment thread src/controller/advisor.rs
Comment on lines +88 to +97
pub fn new(endpoint: String, timeout: Duration) -> Self {
let client = reqwest::Client::builder()
.timeout(timeout)
.build()
.unwrap_or_default();
Self {
client,
endpoint,
timeout,
}
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HttpAdvisor::new uses build().unwrap_or_default(), which silently discards builder errors and may produce a client without the intended configuration (including the timeout). It would be safer to either return a Result<HttpAdvisor, AdvisorError> from new (and let resolve_advisor decide how to fall back) or at least log and fall back explicitly so failures aren’t hidden.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — now logs the build error with tracing::warn! and falls back to the default client explicitly, so the failure is visible in logs.

Comment thread src/controller/advisor.rs
Comment on lines +104 to +123
let response = self
.client
.post(&self.endpoint)
.json(context)
.send()
.await
.map_err(|e| {
if e.is_timeout() {
AdvisorError::Timeout(self.timeout)
} else {
AdvisorError::Unreachable(e.to_string())
}
})?;

let recommendation: Recommendation = response
.json()
.await
.map_err(|e| AdvisorError::InvalidResponse(e.to_string()))?;

Ok(recommendation)
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HttpAdvisor::advise parses JSON from the response without checking HTTP status codes. Non-2xx responses (e.g., 500 with HTML body) will currently surface as an InvalidResponse JSON parse error, which loses the actual status/reason. Consider calling response.error_for_status() (or checking status() and mapping it into AdvisorError::Unreachable/InvalidResponse) before attempting to deserialize.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — now checks HTTP status before JSON parsing and returns AdvisorError::InvalidResponse with the status code and truncated body for non-2xx responses.

Comment thread src/controller/advisor.rs
Comment on lines +150 to +167
match config.level {
AdvisorLevel::Off | AdvisorLevel::Context => std::sync::Arc::new(NoOpAdvisor),
AdvisorLevel::Advised | AdvisorLevel::Planned | AdvisorLevel::Driven => {
match &config.endpoint {
Some(endpoint) => {
let timeout = Duration::from_secs(config.timeout_seconds);
std::sync::Arc::new(HttpAdvisor::new(endpoint.clone(), timeout))
}
None => {
tracing::warn!(
level = ?config.level,
"Advisor level requires endpoint but none configured, falling back to no-op"
);
std::sync::Arc::new(NoOpAdvisor)
}
}
}
}
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve_advisor constructs a brand new HttpAdvisor (and reqwest Client) every time it’s called. In reconciliation, this can happen frequently and prevents connection pooling/reuse. Consider caching HttpAdvisor instances (e.g., in Context keyed by endpoint+timeout, or by storing an Option<Arc<HttpAdvisor>> on the rollout/strategy context) so repeated reconciles don’t rebuild the client.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — added AdvisorCache (Mutex<HashMap<(endpoint, timeout), Arc<dyn AnalysisAdvisor>>>) to the Context struct. resolve_advisor() now looks up existing instances before constructing new ones, so reqwest::Client connections are reused across reconcile calls.

Comment thread src/crd/rollout.rs Outdated
}

fn is_default_advisor_config(c: &AdvisorConfig) -> bool {
c.level == AdvisorLevel::Off && c.endpoint.is_none() && c.timeout_seconds == 10
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AdvisorConfig currently derives Default, which makes timeout_seconds default to 0 (u64 default). That disagrees with is_default_advisor_config expecting timeout_seconds == 10, so rollouts with an omitted spec.advisor will deserialize to a non-“default” config and won’t be skipped on serialization. Consider implementing Default for AdvisorConfig (or using a shared const) so the Rust default matches the intended 10s timeout and is_default_advisor_config.

Suggested change
c.level == AdvisorLevel::Off && c.endpoint.is_none() && c.timeout_seconds == 10
c.level == AdvisorLevel::Off && c.endpoint.is_none() && c.timeout_seconds == 0

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — removed derive(Default) and added manual impl Default that uses the same const DEFAULT_ADVISOR_TIMEOUT_SECONDS: u64 = 10 as the serde default function, so both paths agree.

Comment thread deploy/crd.yaml
Comment on lines +56 to +82
advisor:
default:
level: 'Off'
timeoutSeconds: 0
description: AI advisor configuration for progressive AI adoption
properties:
endpoint:
description: URL of AI advisory service (e.g., MCP endpoint, HTTP
API)
nullable: true
type: string
level:
default: 'Off'
description: Advisor integration level
enum:
- 'Off'
- Context
- Advised
- Planned
- Driven
type: string
timeoutSeconds:
description: Timeout for advisory calls in seconds
format: uint64
minimum: 0.0
type: integer
type: object
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CRD schema sets spec.advisor.default.timeoutSeconds: 0, which conflicts with the Rust-side intended default of 10s (default_advisor_timeout / is_default_advisor_config). With the current schema, users enabling level: Advised and omitting timeoutSeconds may end up with an unexpected 0-second timeout depending on how defaults are applied. Align the CRD default with the controller default (likely 10) to avoid surprising behavior and noisy diffs.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — CRD regenerated after fixing the impl Default for AdvisorConfig. The schema now reflects the correct default of 10s.

Comment thread Cargo.toml
# FALSE Protocol (ULID generation for occurrence IDs)
ulid = "1"
# FALSE Protocol occurrence types
false-protocol = { path = "../false-protocol/rust" }
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false-protocol = { path = "../false-protocol/rust" } points outside this repository. That makes builds non-reproducible in typical CI and for downstream consumers (a fresh clone of this repo won’t compile unless a sibling directory exists). Consider switching to a git dependency (pinned revision/tag), vendoring it as a workspace member/submodule within the repo, or publishing it to crates.io so cargo build works from a standalone checkout.

Suggested change
false-protocol = { path = "../false-protocol/rust" }
false-protocol = "0.1"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — this is a monorepo-adjacent setup where both repos are checked out as siblings. CI clones both. We'll publish the crate when the protocol stabilizes, but the path dep is correct for now.

Comment thread src/crd/rollout.rs
Comment on lines +648 to +666
/// Configuration for the AI advisor
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema, Default)]
pub struct AdvisorConfig {
/// Advisor integration level
#[serde(default)]
pub level: AdvisorLevel,

/// URL of AI advisory service (e.g., MCP endpoint, HTTP API)
#[serde(skip_serializing_if = "Option::is_none")]
pub endpoint: Option<String>,

/// Timeout for advisory calls in seconds
#[serde(
rename = "timeoutSeconds",
default = "default_advisor_timeout",
skip_serializing_if = "is_default_advisor_timeout"
)]
pub timeout_seconds: u64,
}
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AdvisorConfig derives Default, but the intended default timeout is encoded only in Serde (default_advisor_timeout) and in is_default_advisor_timeout (== 10). This means AdvisorConfig::default() yields timeout_seconds = 0, and any struct literals using Default::default() (many tests do) will create a 0-second timeout. Recommend removing #[derive(Default)] and adding a manual impl Default that uses default_advisor_timeout() (10), or introducing a const DEFAULT_ADVISOR_TIMEOUT_SECONDS: u64 used by both serde defaults and Rust Default.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d160f18 — same fix as above: manual impl Default with shared const. Both serde and Default::default() now produce timeout_seconds: 10.

yairfalse and others added 2 commits March 2, 2026 00:40
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cache HttpAdvisor instances in AdvisorCache to reuse reqwest::Client
  connections across reconcile calls (keyed by endpoint + timeout)
- Fix AdvisorConfig Default: manual impl with const timeout matching serde
- Fix phase stringification: unwrap Option<Phase> before formatting
- Skip advisor call when endpoint missing (guard in reconcile loop)
- Fix HttpAdvisor: log client build errors, check HTTP status before JSON

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yairfalse yairfalse merged commit 01a54de into main Mar 2, 2026
0 of 2 checks passed
@yairfalse yairfalse deleted the feature/update-sykli branch March 2, 2026 01:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants