Skip to content
This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Replace MongoDB storage for scheduling strategies/intents with K8s CRDs#18

Merged
ianchen0119 merged 5 commits intomainfrom
copilot/migrate-scheduling-strategies-to-etcd
Mar 10, 2026
Merged

Replace MongoDB storage for scheduling strategies/intents with K8s CRDs#18
ianchen0119 merged 5 commits intomainfrom
copilot/migrate-scheduling-strategies-to-etcd

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 2, 2026

Migrates scheduling strategies and intents from MongoDB to Kubernetes etcd via CRDs (gthulhu.io/v1alpha1). MongoDB remains for RBAC entities (users, roles, permissions). DM notification behavior is unchanged.

CRD definitions

  • SchedulingStrategy and SchedulingIntent CRDs in deployment/crds/
  • CR names use bson.ObjectID.Hex() (24-char hex, valid k8s name)
  • Labels (gthulhu.io/creator-id, gthulhu.io/strategy-id, gthulhu.io/state) for efficient label-selector queries

RBAC

  • gthulhu-manager-crd ClusterRole with full CRUD, bound to gthulhu-manager ServiceAccount
  • gthulhu-crd-viewer read-only ClusterRole for monitoring

Repository layer

  • Replaced strategy_repo.go (MongoDB) with cr_strategy_repo.go (k8s dynamic client)
  • repo struct now holds both *mongo.Database and dynamic.Interface — RBAC methods use Mongo, strategy/intent methods use CRDs
  • K8SConfig gains CRDNamespace field (defaults to gthulhu-system)

Interface cleanup

  • UpdateStrategy(ctx, strategyID, bson.M)UpdateStrategy(ctx, *ScheduleStrategy) — removes MongoDB-specific bson.M from domain interface

DI wiring

  • AdapterModule now also provides dynamic.Interface via k8sadapter.NewDynamicClient
  • TestRepoModule injects dynamicfake.NewSimpleDynamicClientWithCustomListKinds with registered list kinds for both GVRs

Tests

  • 8 new unit tests for CR repository (insert, query, update, delete, batch state update, label-based filtering)
  • All existing integration tests pass unchanged against the fake dynamic client

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • compass.mongodb.com
    • Triggering command: /usr/bin/mongosh mongosh --host 127.0.0.1 --port 27017 --quiet admin ux_amd64/vet -pthread ions/batch/v1 -fmessage-length-bool ux_amd64/vet 1592�� y@v0.0.0-2012060-errorsas y@v0.0.0-2012060-ifaceassert ux_amd64/vet -I very/v1beta1 -I ux_amd64/vet (dns block)
    • Triggering command: /usr/bin/mongosh mongosh --host 127.0.0.1 --port 27017 --quiet admin x64/pkg/tool/lin-nilfunc -p a/mockery/v3/int-atomic -lang=go1.22 x64/pkg/tool/lin-buildtags -o jvUU/JApZFtMAyDm-errorsas ache/go/1.24.13/-ifaceassert ux_amd64/vet ce.go gwebhookconfigur-atomic -lang=go1.22 ux_amd64/vet (dns block)
    • Triggering command: /usr/bin/mongosh mongosh --host 127.0.0.1 --port 27017 --quiet admin ux_amd64/vet -p k8s.io/client-go-proto -lang=go1.22 I3IngKL_gARQ (dns block)

If you need me to access, download, or install something from one of these locations, you can either:


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 3 commits March 2, 2026 06:00
- Define CRD manifests for SchedulingStrategy and SchedulingIntent
- Add RBAC manifests to protect CRs from unauthorized access
- Rewrite strategy_repo.go to use K8s dynamic client (CRDs) instead of MongoDB
- Add CRDNamespace to K8SConfig for configurable CR namespace
- Update UpdateStrategy interface to take full ScheduleStrategy object
- Add dynamic client builder to k8s_adapter package
- Wire fake dynamic client for test infrastructure
- Regenerate mocks for updated Repository interface

Co-authored-by: ianchen0119 <42661015+ianchen0119@users.noreply.github.com>
Test InsertStrategyAndIntents, QueryStrategies by ID and creator,
DeleteStrategy, DeleteIntentsByStrategyID, BatchUpdateIntentsState,
UpdateStrategy, InsertIntents, DeleteIntents, and QueryIntents.

Co-authored-by: ianchen0119 <42661015+ianchen0119@users.noreply.github.com>
Co-authored-by: ianchen0119 <42661015+ianchen0119@users.noreply.github.com>
Copilot AI changed the title [WIP] Migrate scheduling strategies and intents to k8s etcd storage Replace MongoDB storage for scheduling strategies/intents with K8s CRDs Mar 2, 2026
@ianchen0119 ianchen0119 requested a review from gthulhu-work March 2, 2026 06:13
@ianchen0119 ianchen0119 added this to the R2 milestone Mar 2, 2026
@ianchen0119 ianchen0119 requested review from vx416 and yanun0323 March 2, 2026 16:42
@ianchen0119 ianchen0119 marked this pull request as ready for review March 2, 2026 16:42
Copilot AI review requested due to automatic review settings March 2, 2026 16:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Migrates scheduling strategy/intent persistence in the manager from MongoDB collections to namespaced Kubernetes CRDs (gthulhu.io/v1alpha1) stored in etcd, while keeping MongoDB for RBAC entities (users/roles/permissions). This aligns the scheduling data model with the Kubernetes control plane and enables label-selector-based querying.

Changes:

  • Replaced MongoDB strategy/intent repository implementation with a dynamic-client CRD-backed repository, and updated the domain repository interface accordingly.
  • Added dynamic Kubernetes client provisioning (real + fake) and CRD namespace configuration.
  • Introduced CRD + RBAC manifests and added unit tests for the CRD repository.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
manager/service/strategy_svc.go Updates strategy update flow to pass a full *ScheduleStrategy to the repository instead of a Mongo bson.M update.
manager/repository/strategy_repo.go Removes the legacy MongoDB strategy/intent repository implementation.
manager/repository/repo.go Extends repository wiring to include dynamic.Interface and a configurable CRD namespace.
manager/repository/cr_strategy_repo.go Implements CRD-backed CRUD/query/state-update logic for strategies and intents via the K8s dynamic client.
manager/repository/cr_strategy_repo_test.go Adds unit tests for the CRD-backed repository behaviors.
manager/k8s_adapter/dynamic_client.go Adds a helper to construct a Kubernetes dynamic client using existing config resolution.
manager/domain/interface.go Updates repository interface: UpdateStrategy(ctx, *ScheduleStrategy) replaces Mongo-specific update signature.
manager/domain/mock_domain.go Regenerates mocks to match the updated repository/service interfaces.
manager/app/module.go Wires dynamic client into Fx DI and provides a fake dynamic client for tests.
config/manager_config.go Adds K8SConfig.CRDNamespace configuration for where CRDs are stored.
deployment/rbac/manager-crd-rbac.yaml Adds ClusterRole/Binding for manager CRUD access to scheduling CRDs plus a read-only viewer role.
deployment/crds/schedulingstrategy-crd.yaml Adds SchedulingStrategy CRD definition.
deployment/crds/schedulingintent-crd.yaml Adds SchedulingIntent CRD definition.

Comment thread manager/repository/cr_strategy_repo.go Outdated
}

func hexToObjectID(hex string) bson.ObjectID {
id, _ := bson.ObjectIDFromHex(hex)
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hexToObjectID discards parsing errors and returns a zero ObjectID on invalid input. This can silently turn malformed CR data into "zero" IDs (e.g., CreatorID/UpdaterID/StrategyID) and lead to confusing behavior in ownership checks and filtering. Prefer returning an error from unstructuredToDomain{Strategy,Intent} when these fields are missing/invalid instead of silently coercing to zero.

Suggested change
id, _ := bson.ObjectIDFromHex(hex)
if hex == "" {
return bson.ObjectID{}
}
id, err := bson.ObjectIDFromHex(hex)
if err != nil {
return bson.ObjectID{}
}

Copilot uses AI. Check for mistakes.
Comment on lines +166 to +187
func TestCRBatchUpdateIntentsState(t *testing.T) {
r := newTestCRRepo()
ctx := context.Background()

creatorID := bson.NewObjectID()
strategy := &domain.ScheduleStrategy{
BaseEntity: domain.BaseEntity{CreatorID: creatorID, UpdaterID: creatorID},
}
intents := []*domain.ScheduleIntent{
{BaseEntity: domain.BaseEntity{CreatorID: creatorID, UpdaterID: creatorID}, PodID: "p1", NodeID: "n1", State: domain.IntentStateInitialized},
}
require.NoError(t, r.InsertStrategyAndIntents(ctx, strategy, intents))

err := r.BatchUpdateIntentsState(ctx, []bson.ObjectID{intents[0].ID}, domain.IntentStateSent)
require.NoError(t, err)

opt := &domain.QueryIntentOptions{IDs: []bson.ObjectID{intents[0].ID}}
err = r.QueryIntents(ctx, opt)
require.NoError(t, err)
require.Len(t, opt.Result, 1)
assert.Equal(t, domain.IntentStateSent, opt.Result[0].State)
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CR repository tests exercise querying by IDs/creator/strategy, but there isn't a test that QueryIntents correctly filters by opt.States. Since state filtering is part of the repository contract (and state labels are maintained), add a unit test that inserts intents with different states and verifies only the requested states are returned.

Copilot generated this review using guidance from repository custom instructions.
Comment thread deployment/rbac/manager-crd-rbac.yaml Outdated
Comment on lines +20 to +21
name: gthulhu-manager
namespace: gthulhu-system
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ClusterRoleBinding targets ServiceAccount gthulhu-manager in namespace gthulhu-system, but the existing manager deployment manifest uses ServiceAccount manager in namespace gthulhu-api-local (deployment/kind/manager/deployment.yaml). As-is, applying this RBAC in the kind environment won't grant the manager pod access to the CRDs. Consider aligning the subject to the actual ServiceAccount/namespace used by your deployment manifests (or providing environment-specific overlays/templates).

Suggested change
name: gthulhu-manager
namespace: gthulhu-system
name: manager
namespace: gthulhu-api-local

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +13
// NewDynamicClient creates a Kubernetes dynamic client using the same
// configuration strategy as the pod informer (in-cluster or kubeconfig file).
func NewDynamicClient(opt Options) (dynamic.Interface, error) {
cfg, err := buildConfig(opt)
if err != nil {
return nil, err
}
return dynamic.NewForConfig(cfg)
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewDynamicClient reuses buildConfig but does not apply the client-go tuning (Timeout/QPS/Burst) that NewAdapter sets on the same *rest.Config. This can lead to different behavior under load (rate limiting) or hung calls (no timeout). Consider setting cfg.Timeout/QPS/Burst similarly to adapter.go so both clients behave consistently.

Copilot uses AI. Check for mistakes.
Comment on lines +58 to +83
obj := domainStrategyToUnstructured(strategy, r.crNamespace)
created, err := r.k8sDynamic.Resource(strategyGVR).Namespace(r.crNamespace).Create(ctx, obj, metav1.CreateOptions{})
if err != nil {
return fmt.Errorf("create strategy CR: %w", err)
}
// Assign the ID back from the created object name (may differ on retry).
if id, e := bson.ObjectIDFromHex(created.GetName()); e == nil {
strategy.ID = id
}

for _, intent := range intents {
if intent.ID.IsZero() {
intent.ID = bson.NewObjectID()
}
intent.StrategyID = strategy.ID
if intent.CreatedTime == 0 {
intent.CreatedTime = now
}
if intent.UpdatedTime == 0 {
intent.UpdatedTime = now
}
intentObj := domainIntentToUnstructured(intent, r.crNamespace)
if _, err := r.k8sDynamic.Resource(intentGVR).Namespace(r.crNamespace).Create(ctx, intentObj, metav1.CreateOptions{}); err != nil {
return fmt.Errorf("create intent CR: %w", err)
}
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InsertStrategyAndIntents creates the SchedulingStrategy CR first, then creates intent CRs one-by-one; if an intent creation fails partway through, the already-created strategy (and any earlier intents) are left behind, resulting in orphaned/partial data. Consider adding best-effort rollback (delete created intents + strategy) on failure, or creating intents with an ownerReference so Kubernetes GC can clean them up when the strategy is deleted.

Copilot uses AI. Check for mistakes.
Comment thread manager/repository/cr_strategy_repo.go Outdated
Comment on lines +121 to +124
spec, _, _ := unstructured.NestedMap(obj.Object, "spec")
if spec == nil {
spec = map[string]interface{}{}
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BatchUpdateIntentsState ignores the error/"found" return values from unstructured.NestedMap when reading the existing spec. If the object is malformed or spec is missing, this will silently proceed with an empty spec and overwrite fields, potentially losing data. Handle the error and return a clear failure when spec cannot be read.

Suggested change
spec, _, _ := unstructured.NestedMap(obj.Object, "spec")
if spec == nil {
spec = map[string]interface{}{}
}
spec, found, err := unstructured.NestedMap(obj.Object, "spec")
if err != nil {
return fmt.Errorf("read spec for intent CR %s: %w", name, err)
}
if !found {
return fmt.Errorf("spec not found for intent CR %s", name)
}

Copilot uses AI. Check for mistakes.
Comment on lines +215 to +226
// Build label selector for common filters.
selParts := []string{}
if s := buildLabelSelector(opt.CreatorIDs, labelCreatorID); s != "" {
selParts = append(selParts, s)
}
if s := buildLabelSelector(opt.StrategyIDs, labelStrategyID); s != "" {
selParts = append(selParts, s)
}
sel := strings.Join(selParts, ",")

list, err := r.k8sDynamic.Resource(intentGVR).Namespace(r.crNamespace).List(ctx, metav1.ListOptions{LabelSelector: sel})
if err != nil {
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QueryIntents builds a label selector for creatorID and strategyID, but not for state even though a state label is maintained. When opt.States is set, this lists all matching intents and filters in-memory, which can be expensive at scale and undermines the purpose of the state label. Consider incorporating labelState into the LabelSelector when States is provided (handling the multi-state case with an "in (...)" selector).

Copilot uses AI. Check for mistakes.
@ianchen0119
Copy link
Copy Markdown
Contributor

link: Gthulhu/Gthulhu#78

…nfig

- validate ObjectID fields when converting CRs to domain models; return explicit errors instead of zero IDs
- add best-effort rollback in InsertStrategyAndIntents to clean partial CRs on intent create failure
- handle NestedMap spec read errors/missing spec in BatchUpdateIntentsState
- include state label selector in QueryIntents for server-side filtering
- add unit test coverage for QueryIntents state filtering
- apply Timeout/QPS/Burst tuning in dynamic client to match adapter behavior
- align CRD ClusterRoleBinding ServiceAccount/namespace with current kind manager deployment

Test:
- ok  	github.com/Gthulhu/api/manager/repository	(cached)
?   	github.com/Gthulhu/api/manager/k8s_adapter	[no test files] ✅
@ianchen0119 ianchen0119 merged commit 062feef into main Mar 10, 2026
2 checks passed
@ianchen0119 ianchen0119 deleted the copilot/migrate-scheduling-strategies-to-etcd branch March 10, 2026 08:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants