Replace MongoDB storage for scheduling strategies/intents with K8s CRDs #18
Conversation
- Define CRD manifests for SchedulingStrategy and SchedulingIntent
- Add RBAC manifests to protect CRs from unauthorized access
- Rewrite strategy_repo.go to use K8s dynamic client (CRDs) instead of MongoDB
- Add CRDNamespace to K8SConfig for configurable CR namespace
- Update UpdateStrategy interface to take full ScheduleStrategy object
- Add dynamic client builder to k8s_adapter package
- Wire fake dynamic client for test infrastructure
- Regenerate mocks for updated Repository interface

Co-authored-by: ianchen0119 <42661015+ianchen0119@users.noreply.github.com>
Test InsertStrategyAndIntents, QueryStrategies by ID and creator, DeleteStrategy, DeleteIntentsByStrategyID, BatchUpdateIntentsState, UpdateStrategy, InsertIntents, DeleteIntents, and QueryIntents.

Co-authored-by: ianchen0119 <42661015+ianchen0119@users.noreply.github.com>
Pull request overview
Migrates scheduling strategy/intent persistence in the manager from MongoDB collections to namespaced Kubernetes CRDs (gthulhu.io/v1alpha1) stored in etcd, while keeping MongoDB for RBAC entities (users/roles/permissions). This aligns the scheduling data model with the Kubernetes control plane and enables label-selector-based querying.
Changes:
- Replaced MongoDB strategy/intent repository implementation with a dynamic-client CRD-backed repository, and updated the domain repository interface accordingly.
- Added dynamic Kubernetes client provisioning (real + fake) and CRD namespace configuration.
- Introduced CRD + RBAC manifests and added unit tests for the CRD repository.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| manager/service/strategy_svc.go | Updates strategy update flow to pass a full *ScheduleStrategy to the repository instead of a Mongo bson.M update. |
| manager/repository/strategy_repo.go | Removes the legacy MongoDB strategy/intent repository implementation. |
| manager/repository/repo.go | Extends repository wiring to include dynamic.Interface and a configurable CRD namespace. |
| manager/repository/cr_strategy_repo.go | Implements CRD-backed CRUD/query/state-update logic for strategies and intents via the K8s dynamic client. |
| manager/repository/cr_strategy_repo_test.go | Adds unit tests for the CRD-backed repository behaviors. |
| manager/k8s_adapter/dynamic_client.go | Adds a helper to construct a Kubernetes dynamic client using existing config resolution. |
| manager/domain/interface.go | Updates repository interface: UpdateStrategy(ctx, *ScheduleStrategy) replaces Mongo-specific update signature. |
| manager/domain/mock_domain.go | Regenerates mocks to match the updated repository/service interfaces. |
| manager/app/module.go | Wires dynamic client into Fx DI and provides a fake dynamic client for tests. |
| config/manager_config.go | Adds K8SConfig.CRDNamespace configuration for where CRDs are stored. |
| deployment/rbac/manager-crd-rbac.yaml | Adds ClusterRole/Binding for manager CRUD access to scheduling CRDs plus a read-only viewer role. |
| deployment/crds/schedulingstrategy-crd.yaml | Adds SchedulingStrategy CRD definition. |
| deployment/crds/schedulingintent-crd.yaml | Adds SchedulingIntent CRD definition. |
```go
func hexToObjectID(hex string) bson.ObjectID {
	id, _ := bson.ObjectIDFromHex(hex)
```
hexToObjectID discards parsing errors and returns a zero ObjectID on invalid input. This can silently turn malformed CR data into "zero" IDs (e.g., CreatorID/UpdaterID/StrategyID) and lead to confusing behavior in ownership checks and filtering. Prefer returning an error from unstructuredToDomain{Strategy,Intent} when these fields are missing/invalid instead of silently coercing to zero.
Suggested change:

```go
	if hex == "" {
		return bson.ObjectID{}
	}
	id, err := bson.ObjectIDFromHex(hex)
	if err != nil {
		return bson.ObjectID{}
	}
```
```go
func TestCRBatchUpdateIntentsState(t *testing.T) {
	r := newTestCRRepo()
	ctx := context.Background()

	creatorID := bson.NewObjectID()
	strategy := &domain.ScheduleStrategy{
		BaseEntity: domain.BaseEntity{CreatorID: creatorID, UpdaterID: creatorID},
	}
	intents := []*domain.ScheduleIntent{
		{BaseEntity: domain.BaseEntity{CreatorID: creatorID, UpdaterID: creatorID}, PodID: "p1", NodeID: "n1", State: domain.IntentStateInitialized},
	}
	require.NoError(t, r.InsertStrategyAndIntents(ctx, strategy, intents))

	err := r.BatchUpdateIntentsState(ctx, []bson.ObjectID{intents[0].ID}, domain.IntentStateSent)
	require.NoError(t, err)

	opt := &domain.QueryIntentOptions{IDs: []bson.ObjectID{intents[0].ID}}
	err = r.QueryIntents(ctx, opt)
	require.NoError(t, err)
	require.Len(t, opt.Result, 1)
	assert.Equal(t, domain.IntentStateSent, opt.Result[0].State)
}
```
The CR repository tests exercise querying by IDs/creator/strategy, but there isn't a test that QueryIntents correctly filters by opt.States. Since state filtering is part of the repository contract (and state labels are maintained), add a unit test that inserts intents with different states and verifies only the requested states are returned.
```yaml
    name: gthulhu-manager
    namespace: gthulhu-system
```
The ClusterRoleBinding targets ServiceAccount gthulhu-manager in namespace gthulhu-system, but the existing manager deployment manifest uses ServiceAccount manager in namespace gthulhu-api-local (deployment/kind/manager/deployment.yaml). As-is, applying this RBAC in the kind environment won't grant the manager pod access to the CRDs. Consider aligning the subject to the actual ServiceAccount/namespace used by your deployment manifests (or providing environment-specific overlays/templates).
Suggested change:

```yaml
    name: manager
    namespace: gthulhu-api-local
```
```go
// NewDynamicClient creates a Kubernetes dynamic client using the same
// configuration strategy as the pod informer (in-cluster or kubeconfig file).
func NewDynamicClient(opt Options) (dynamic.Interface, error) {
	cfg, err := buildConfig(opt)
	if err != nil {
		return nil, err
	}
	return dynamic.NewForConfig(cfg)
}
```
NewDynamicClient reuses buildConfig but does not apply the client-go tuning (Timeout/QPS/Burst) that NewAdapter sets on the same *rest.Config. This can lead to different behavior under load (rate limiting) or hung calls (no timeout). Consider setting cfg.Timeout/QPS/Burst similarly to adapter.go so both clients behave consistently.
```go
obj := domainStrategyToUnstructured(strategy, r.crNamespace)
created, err := r.k8sDynamic.Resource(strategyGVR).Namespace(r.crNamespace).Create(ctx, obj, metav1.CreateOptions{})
if err != nil {
	return fmt.Errorf("create strategy CR: %w", err)
}
// Assign the ID back from the created object name (may differ on retry).
if id, e := bson.ObjectIDFromHex(created.GetName()); e == nil {
	strategy.ID = id
}

for _, intent := range intents {
	if intent.ID.IsZero() {
		intent.ID = bson.NewObjectID()
	}
	intent.StrategyID = strategy.ID
	if intent.CreatedTime == 0 {
		intent.CreatedTime = now
	}
	if intent.UpdatedTime == 0 {
		intent.UpdatedTime = now
	}
	intentObj := domainIntentToUnstructured(intent, r.crNamespace)
	if _, err := r.k8sDynamic.Resource(intentGVR).Namespace(r.crNamespace).Create(ctx, intentObj, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("create intent CR: %w", err)
	}
}
```
InsertStrategyAndIntents creates the SchedulingStrategy CR first, then creates intent CRs one-by-one; if an intent creation fails partway through, the already-created strategy (and any earlier intents) are left behind, resulting in orphaned/partial data. Consider adding best-effort rollback (delete created intents + strategy) on failure, or creating intents with an ownerReference so Kubernetes GC can clean them up when the strategy is deleted.
```go
spec, _, _ := unstructured.NestedMap(obj.Object, "spec")
if spec == nil {
	spec = map[string]interface{}{}
}
```
BatchUpdateIntentsState ignores the error/"found" return values from unstructured.NestedMap when reading the existing spec. If the object is malformed or spec is missing, this will silently proceed with an empty spec and overwrite fields, potentially losing data. Handle the error and return a clear failure when spec cannot be read.
Suggested change:

```go
	spec, found, err := unstructured.NestedMap(obj.Object, "spec")
	if err != nil {
		return fmt.Errorf("read spec for intent CR %s: %w", name, err)
	}
	if !found {
		return fmt.Errorf("spec not found for intent CR %s", name)
	}
```
```go
// Build label selector for common filters.
selParts := []string{}
if s := buildLabelSelector(opt.CreatorIDs, labelCreatorID); s != "" {
	selParts = append(selParts, s)
}
if s := buildLabelSelector(opt.StrategyIDs, labelStrategyID); s != "" {
	selParts = append(selParts, s)
}
sel := strings.Join(selParts, ",")

list, err := r.k8sDynamic.Resource(intentGVR).Namespace(r.crNamespace).List(ctx, metav1.ListOptions{LabelSelector: sel})
if err != nil {
```
QueryIntents builds a label selector for creatorID and strategyID, but not for state even though a state label is maintained. When opt.States is set, this lists all matching intents and filters in-memory, which can be expensive at scale and undermines the purpose of the state label. Consider incorporating labelState into the LabelSelector when States is provided (handling the multi-state case with an "in (...)" selector).
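Kubernetes label selectors support both equality (`k=v`) and set-based (`k in (v1,v2)`) forms, so the multi-state case the comment raises maps directly onto `in (...)`. A stdlib-only sketch of the clause builder, assuming the `gthulhu.io/state` label key listed in the PR description (`stateSelector` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"strings"
)

// stateSelector builds a label-selector clause for the gthulhu.io/state
// label: empty for no filter, equality for one state, "in (...)" for
// several. The result would be joined into selParts alongside the
// creator-id and strategy-id clauses.
func stateSelector(states []string) string {
	switch len(states) {
	case 0:
		return ""
	case 1:
		return "gthulhu.io/state=" + states[0]
	default:
		return fmt.Sprintf("gthulhu.io/state in (%s)", strings.Join(states, ","))
	}
}

func main() {
	fmt.Println(stateSelector([]string{"Sent"}))
	// → gthulhu.io/state=Sent
	fmt.Println(stateSelector([]string{"Initialized", "Sent"}))
	// → gthulhu.io/state in (Initialized,Sent)
}
```

With this clause in `ListOptions.LabelSelector`, the filtering happens server-side and the in-memory state check becomes a no-op for already-filtered results.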
|
link: Gthulhu/Gthulhu#78 |
…nfig
- validate ObjectID fields when converting CRs to domain models; return explicit errors instead of zero IDs
- add best-effort rollback in InsertStrategyAndIntents to clean partial CRs on intent create failure
- handle NestedMap spec read errors/missing spec in BatchUpdateIntentsState
- include state label selector in QueryIntents for server-side filtering
- add unit test coverage for QueryIntents state filtering
- apply Timeout/QPS/Burst tuning in dynamic client to match adapter behavior
- align CRD ClusterRoleBinding ServiceAccount/namespace with current kind manager deployment

Test:
- ok github.com/Gthulhu/api/manager/repository (cached)
- ? github.com/Gthulhu/api/manager/k8s_adapter [no test files] ✅
Migrates scheduling strategies and intents from MongoDB to Kubernetes etcd via CRDs (`gthulhu.io/v1alpha1`). MongoDB remains for RBAC entities (users, roles, permissions). DM notification behavior is unchanged.

CRD definitions
- `SchedulingStrategy` and `SchedulingIntent` CRDs in `deployment/crds/`
- CR names use `bson.ObjectID.Hex()` (24-char hex, a valid k8s name)
- Labels (`gthulhu.io/creator-id`, `gthulhu.io/strategy-id`, `gthulhu.io/state`) for efficient label-selector queries

RBAC
- `gthulhu-manager-crd` ClusterRole with full CRUD, bound to the `gthulhu-manager` ServiceAccount
- `gthulhu-crd-viewer` read-only ClusterRole for monitoring

Repository layer
- Replaces `strategy_repo.go` (MongoDB) with `cr_strategy_repo.go` (k8s dynamic client)
- The `repo` struct now holds both `*mongo.Database` and `dynamic.Interface`: RBAC methods use Mongo, strategy/intent methods use CRDs
- `K8SConfig` gains a `CRDNamespace` field (defaults to `gthulhu-system`)

Interface cleanup
- `UpdateStrategy(ctx, strategyID, bson.M)` → `UpdateStrategy(ctx, *ScheduleStrategy)`, removing the MongoDB-specific `bson.M` from the domain interface

DI wiring
- `AdapterModule` now also provides `dynamic.Interface` via `k8sadapter.NewDynamicClient`
- `TestRepoModule` injects `dynamicfake.NewSimpleDynamicClientWithCustomListKinds` with registered list kinds for both GVRs

Tests