Background
The compute service was originally designed around the MVP architecture defined in the federation enhancement: operators watch project API servers, write resources to a single Datum Platform API Server, and that platform API server communicates directly with edge cluster cells at each POP.
The platform is now targeting a launch architecture that replaces this with a Karmada-based federated control plane:
- Virtual Control Planes (VCPs) replace per-project API servers (lightweight, ~0.8MB each)
- Operators watch project VCPs and write resources to a Karmada Federation API Server instead of directly to a platform API server
- Karmada PropagationPolicies drive resource propagation from the federation API server to the correct Datum POP cell(s)
- The federation layer sits between the compute operator and the edge cluster cells where workloads actually run
For the purposes of this issue, we assume a single federation control plane (as targeted for launch).
The current compute service has no defined integration path for this federated model.
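To make the propagation step in the launch architecture concrete, here is a minimal sketch of what a per-deployment Karmada PropagationPolicy might look like when built programmatically. It is an illustration under stated assumptions, not a proposed design: it assumes each POP cell is registered as a Karmada member cluster labelled with topology.datum.net/city-code, and the function name, the "-placement" naming suffix, and the compute.example.com/v1alpha API version are hypothetical placeholders.

```python
import json

# Assumption: member clusters (POP cells) carry this label. The issue only
# states that Location resources are keyed by it.
CITY_CODE_LABEL = "topology.datum.net/city-code"


def build_propagation_policy(deployment_name, namespace, city_codes, api_version):
    """Return a PropagationPolicy manifest (as a dict) for one WorkloadDeployment.

    api_version is the apiVersion of the WorkloadDeployment resource; the
    caller supplies it because this sketch does not assume a specific group.
    """
    if not city_codes:
        raise ValueError("at least one city code is required")
    return {
        "apiVersion": "policy.karmada.io/v1alpha1",
        "kind": "PropagationPolicy",
        "metadata": {
            # Hypothetical naming convention: one policy per deployment.
            "name": f"{deployment_name}-placement",
            "namespace": namespace,
        },
        "spec": {
            # Select exactly one WorkloadDeployment by name.
            "resourceSelectors": [
                {
                    "apiVersion": api_version,
                    "kind": "WorkloadDeployment",
                    "name": deployment_name,
                }
            ],
            "placement": {
                "clusterAffinity": {
                    # Standard Kubernetes label selector: match any cell whose
                    # city-code label is in the requested set.
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": CITY_CODE_LABEL,
                                "operator": "In",
                                "values": sorted(set(city_codes)),
                            }
                        ]
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    policy = build_propagation_policy(
        "web", "default", ["DFW", "LHR"], "compute.example.com/v1alpha"
    )
    print(json.dumps(policy, indent=2))
```

Selecting cells by label rather than by explicit cluster name would keep the policy independent of how many cells exist per city, but whether cells are labelled this way is itself an open question in this issue.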
Problem
The compute service's WorkloadDeploymentScheduler and WorkloadReconciler are built assuming a direct connection to a single control plane that has visibility into all Location resources across all POPs. In the launch federation architecture:
- Placement scheduling changes: WorkloadDeployment resources need to be routed to the correct POP cell(s) via Karmada PropagationPolicies. It is unclear whether our current WorkloadDeploymentScheduler should be replaced by, or supplemented with, Karmada propagation rules.
- Location resource availability: Location resources (from network-services-operator, keyed by topology.datum.net/city-code) are currently queried from the project control plane. In the federated model, it is unclear which API server hosts these resources and how the compute operator discovers available compute locations across all POP cells.
- Resource ownership in the federation layer: It is unclear which resources (WorkloadDeployment, Instance, or new intermediary types) belong in the Federation API Server vs. the edge control plane API server, and how status flows back up through the federation layer to the consumer-facing Workload resource.
- Operator topology: In the launch architecture, operators run in a dedicated Control Plane Cell and connect to both the Federation API Server (to write propagation targets) and multiple project VCPs (to read consumer intent). The compute operator's multicluster-runtime setup needs to be adapted to this topology.
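As a way to frame the status-flow question, the sketch below rolls per-cell Instance counts up into a single summary that a consumer-facing Workload status could carry. Everything here is hypothetical: the CellStatus type, its fields, and the summary keys are invented for illustration, and the sketch takes no position on whether this aggregation would run inside Karmada or inside the compute operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellStatus:
    """Hypothetical per-cell view of one WorkloadDeployment's Instances."""
    cell: str      # name of the POP cell reporting the status
    desired: int   # Instances the cell was asked to run
    ready: int     # Instances the cell reports as ready


def summarize_workload(cell_statuses):
    """Aggregate per-cell counts into one Workload-level summary dict."""
    desired = sum(s.desired for s in cell_statuses)
    ready = sum(s.ready for s in cell_statuses)
    # Cells that have not yet reached their desired Instance count.
    lagging = sorted(s.cell for s in cell_statuses if s.ready < s.desired)
    return {
        "desiredInstances": desired,
        "readyInstances": ready,
        # Available only if something was requested and every cell caught up.
        "available": desired > 0 and not lagging,
        "laggingCells": lagging,
    }


if __name__ == "__main__":
    summary = summarize_workload(
        [CellStatus("dfw-1", 2, 2), CellStatus("lhr-1", 3, 1)]
    )
    print(summary)
```

Reporting which cells are lagging, rather than only totals, keeps enough detail in the aggregated status for a consumer to see where a rollout is stuck without querying each edge cell directly.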
Goals
Define a clear integration strategy that answers:
- How do PropagationPolicy resources get created to route WorkloadDeployments to the correct POP cell(s) based on consumer-specified city codes?
- Does WorkloadDeploymentScheduler get replaced by Karmada propagation, supplemented by it, or remain as-is with a translation layer?
- Where do Location resources live in the federated topology, and how does the compute operator discover them?
- How does Instance status flow from edge cluster cells back through the federation layer to the consumer-facing Workload status?
Relevant Context
Datum enhancements:
Karmada documentation:
Out of Scope
- Changes to the consumer-facing Workload or WorkloadDeployment API surface