Skip to content

Split base/ into individual deployable modules#426

Open
yangw-dev wants to merge 12 commits into
mainfrom
elainewy/remove-base-module
Open

Split base/ into individual deployable modules#426
yangw-dev wants to merge 12 commits into
mainfrom
elainewy/remove-base-module

Conversation

@yangw-dev
Copy link
Copy Markdown
Contributor

@yangw-dev yangw-dev commented Apr 7, 2026

Summary

Every component previously bundled in base/ is now its own deployable module, deployed through the unified deploy-module flow. eks is just the first module in the list — no special provider field, no separate deploy-eks or deploy-base recipe.

Harbor stays inside the eks module (tightly coupled to EKS terraform for OIDC/S3/IAM). gp3 StorageClass is also in eks kubernetes/ (Harbor PVCs need it before Harbor installs).

Before:

just deploy-base <cluster>          # special recipe for terraform + harbor + base k8s
just deploy-module <cluster> arc    # modules deployed separately

After:

modules:
  - eks                      # terraform + kubeconfig + mirror + harbor + gp3 StorageClass (seperate later)
  - nvidia-device-plugin     # GPU device plugin DaemonSet
  - node-performance-tuning  # sysctl tuning DaemonSet
  - registry-mirror          # containerd mirror ConfigMap
  - image-cache-janitor      # containerd image cache cleanup DaemonSet
  - git-cache                # git clone cache (central + DaemonSet)
  - node-compactor           # underutilized node drainer
  - karpenter                # ... rest of modules
# One loop, no special cases
for module in $(modules); do just deploy-module "$CLUSTER" "$module"; done

Key changes

  • modules/eks/terraform/modules/eks/infra/ — prevents deploy-module Phase 1 from auto-running terraform with wrong state key. eks deploy.sh handles terraform internally.
  • modules/eks/deploy.sh — 4 phases: terraform, kubeconfig, mirror-images, harbor (helm + secrets + proxy cache)
  • modules/eks/kubernetes/ — gp3 StorageClass + harbor-system namespace (applied in Phase 2, before deploy.sh installs Harbor)
  • 5 new modules created — nvidia-device-plugin, node-performance-tuning, registry-mirror, image-cache-janitor, git-cache, node-compactor
  • manifests/ convention — modules with placeholder substitution (image-cache-janitor, node-compactor) use manifests/ instead of kubernetes/ to prevent deploy-module Phase 2 from applying unsubstituted manifests
  • Kubeconfig soft-fail — only for eks module (cluster doesn't exist yet on fresh deploy)
  • deploy-eks and deploy-base recipes removed — everything goes through deploy-module
  • Smoke tests splittest_base_kubernetes.py split across per-module tests/smoke/ dirs
  • clusters.yaml — eks as first module, new modules in dependency order

Follow-up work

  • Extract Harbor into its own module — currently Harbor deploy (~200 lines) lives inside eks deploy.sh. Long-term, extract to modules/harbor/ with its own deploy.sh that reads terraform outputs from eks infra/ (same pattern as karpenter). Blocked on: no urgency, eks module works.
  • Extract mirror-images into its own module — similar to Harbor, the bootstrap image mirroring could be a standalone module.

ciforge migration required

After merging this and updating the submodule, ciforge needs:

1. clusters.yaml — add base modules to EKS clusters

# arc-cbr-production and re-prod need eks + base modules before karpenter
modules:
  - eks
  - nvidia-device-plugin
  - node-performance-tuning
  - registry-mirror
  - image-cache-janitor
  - git-cache
  - node-compactor
  - karpenter        # was previously the first module
  - arc
  - ...

The GCP cluster (arc-gcp-experiment) already has gke first — no change needed.

2. clusters.yaml — update header comments

Replace the two-phase deploy description:

# Deploy flow per cluster:
#   for module in modules:
#     just deploy-module <cluster> <module>

3. justfiledeploy-gcp is now redundant (optional cleanup)

The upstream deploy recipe loops all modules uniformly. gke as the first module handles GCP infra the same way eks handles AWS. The kubeconfig override for GCP is still needed.

4. pyproject.toml — update testpaths (if overridden)

  • base/node-compactor/scripts/pythonmodules/node-compactor/scripts/python
  • base/kubernetes/git-cache/scripts/pythonmodules/git-cache/scripts/python
  • base/kubernetes/image-cache-janitor/scripts/pythonmodules/image-cache-janitor/scripts/python

Test plan

Every component previously bundled in the eks module is now its own
deployable module, discovered and deployed through the unified
deploy-module flow. eks itself is just the first module in the list.

New modules: storageclass-eks, nvidia-device-plugin,
node-performance-tuning, registry-mirror, harbor, git-cache,
node-compactor — each with kubernetes manifests, deploy scripts
(where needed), and smoke tests.

Key changes:
- Rename modules/eks/terraform/ → infra/ (skip deploy-module Phase 1)
- Simplify modules/eks/deploy.sh to terraform + kubeconfig + mirror
- Extract harbor deploy.sh (~240 lines) with EKS terraform output reads
- Kubeconfig soft-fail in deploy-module for first-deploy bootstrap
- Remove deploy-eks recipe; deploy loops all modules uniformly
- Split test_base_kubernetes.py across per-module smoke tests
- Update clusters.yaml with full module ordering
@yangw-dev yangw-dev requested a review from jeanschmidt April 7, 2026 23:08
Comment thread osdc/justfile
fi
fi

# Deploy base infrastructure (tofu + harbor + base k8s)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this moved to eks module

…module

# Conflicts:
#	osdc/base/kubernetes/kustomization.yaml
#	osdc/justfile
#	osdc/pyproject.toml
@yangw-dev yangw-dev changed the title move eks into its own module Split base/ into individual deployable modules Apr 7, 2026
Harbor is tightly coupled to EKS terraform (OIDC, S3 bucket, IAM) —
keeping it as a separate module added indirection without benefit.

- Move harbor deploy logic back into modules/eks/deploy.sh (Phase 4)
- Move helm values, namespace, smoke tests back under modules/eks/
- Remove modules/harbor/ directory
- Remove harbor from clusters.yaml module lists
- Fix _deploy-harbor justfile recipe paths (terraform → infra)
mirror-images.sh and karpenter/deploy.sh still referenced the old
modules/eks/terraform/ path after the rename to infra/. Updated all
remaining references in scripts and documentation.
@yangw-dev yangw-dev marked this pull request as ready for review April 8, 2026 02:10
@yangw-dev yangw-dev requested a review from huydhn April 8, 2026 02:10
@malfet
Copy link
Copy Markdown
Contributor

malfet commented Apr 8, 2026

@claude review this and focus what actually constitutes a functional changes and what just moves code around

@jeanschmidt
Copy link
Copy Markdown
Contributor

before proceeding with manual review, claude flagged some big stuff:

  PR 426 Review: Split base/ into individual deployable modules

  CRITICAL

  1. Broken base/kubernetes/kustomization.yaml
  - osdc/base/kubernetes/kustomization.yaml still references storageclass-gp3.yaml, node-performance-tuning.yaml, nvidia-device-plugin.yaml, harbor-namespace.yaml, git-cache/, and registry-mirror-config.yaml — all moved to modules/. Running kubectl apply -k base/kubernetes/ will
  fail. The file is dead code since deploy-base was removed, but it's a trap for anyone who finds it.
  - Fix: Delete the file, or strip it to only reference image-cache-janitor/.
  - Flagged by: 5 of 8 agents

  HIGH

  2. image-cache-janitor orphaned from deploy flow
  - Still lives at base/kubernetes/image-cache-janitor/. Not migrated to modules/, not listed in clusters.yaml, and the old deploy-base recipe that called its deploy.sh is deleted. The janitor will stop being deployed.
  - Fix: Move to modules/image-cache-janitor/ and add to both clusters' module lists, or explicitly document it as deferred with a tracking task.
  - Flagged by: 6 of 8 agents

  3. deploy-module runs kubectl apply -k before cluster exists (fresh deploy)
  - The deploy-module recipe applies kubernetes/kustomization.yaml (Phase 2) BEFORE calling deploy.sh (Phase 3). For the eks module on a fresh cluster, the kubeconfig doesn't exist yet, so kubectl apply -k will fail. The kubeconfig soft-fail (2>/dev/null || true) masks this.
  - Fix: Skip the kubernetes phase for modules that have a deploy.sh (let deploy.sh handle ordering), or have deploy.sh apply the namespace itself.
  - Flagged by: deploy.sh reviewer

  4. Docs still reference deleted deploy-base recipe
  - osdc/docs/operations.md lines 34, 143: just deploy-base arc-staging — recipe no longer exists.
  - osdc/docs/modules.md line 132: says EKS is "Always deployed as part of deploy-base" — now wrong.
  - Fix: Update to just deploy / just deploy-module commands.
  - Flagged by: 4 of 8 agents

  5. docs/modules.md still says terraform/ not infra/
  - Line 134: - **terraform/**: VPC... but directory was renamed to infra/.
  - Fix: Change to - **infra/**: VPC...
  - Flagged by: integration reviewer

  MEDIUM

  6. Kubeconfig error suppression too broad
  - just kubeconfig "$CLUSTER" 2>/dev/null || true in deploy-module silences ALL kubeconfig errors. Intentional for eks (cluster doesn't exist yet), but for other modules (karpenter, arc, etc.) a kubeconfig failure due to expired creds or network issues gets silently swallowed,
  leading to confusing downstream kubectl failures.
  - Fix: Only soft-fail when $MODULE == "eks".
  - Flagged by: 3 of 8 agents

  7. Port-forward process leak in deploy.sh
  - PW_PF_PID (port 8090, line 228) has no trap for cleanup. If the script exits unexpectedly between lines 228-248, the port-forward process is orphaned. The trap on line 265 only covers PF_PID (port 8080).
  - Fix: Set a single trap at the top of Phase 4 that kills both PIDs.
  - Flagged by: 2 of 8 agents

  8. Harbor logic duplicated (~200 lines)
  - modules/eks/deploy.sh Phase 4 (lines 102-304) and justfile recipes _deploy-harbor / _configure-harbor-projects (lines 413-667, marked "LEGACY") contain the same Harbor secrets, Helm install, password migration, and proxy cache logic. The legacy copy is dead code that will
  drift.
  - Fix: Delete _deploy-harbor and _configure-harbor-projects from the justfile.
  - Flagged by: structure reviewer

  LOW

  9. eval on $TFVARS (deploy.sh:56) — inherited from justfile but worth converting to array.
  10. Helm passwords via --set visible in ps output (deploy.sh:204-205) — inherited, consider --set-file.
  11. CRED_ARGS unquoted expansion (deploy.sh:285-299) — use arrays instead of string concat.
  12. pyproject.toml still references base/kubernetes/image-cache-janitor — correct for now but inconsistent with migration direction.
  13. Justfile lint recipes still scan base/ via _src_dirs base — harmless but confusing.

Copy link
Copy Markdown
Contributor

@jeanschmidt jeanschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are issues from claude

- Delete legacy Harbor recipes (_deploy-harbor, _configure-harbor-projects) from justfile
- Add harbor and image-cache-janitor to clusters.yaml module lists
- Kubeconfig soft-fail only for eks module (not all modules)
- Fix port-forward process leak in eks deploy.sh (track both PIDs)
- Fix drift test path after image-cache-janitor move to kubernetes/ subdir
- Update docs/operations.md and docs/modules.md references
- pyproject.toml: base/kubernetes/image-cache-janitor → modules/image-cache-janitor
- justfile test-janitor: base/kubernetes/image-cache-janitor → modules/image-cache-janitor
- image-cache-janitor deploy.sh: fix kustomize path to kubernetes/ subdir
- image-cache-janitor deploy.sh: update comment (deploy-module, not deploy-base)
- Remove harbor from clusters.yaml modules (deployed by eks deploy.sh, no standalone module dir)
- Add storageclass-gp3.yaml to eks kubernetes/ so gp3 exists before Harbor PVCs
- Remove kustomization.yaml from image-cache-janitor and node-compactor to prevent
  Phase 2 applying unsubstituted placeholder manifests before deploy.sh
Replace kubectl kustomize with cat for image-cache-janitor and
node-compactor since kustomization.yaml was removed to prevent
Phase 2/3 placeholder conflicts.
image-cache-janitor and node-compactor use placeholder substitution
in deploy.sh. Rename kubernetes/ to manifests/ so deploy-module
Phase 2 doesn't auto-apply unsubstituted manifests. Restore
kustomization.yaml for kubectl kustomize in deploy.sh.
storageclass-gp3.yaml already in eks/kubernetes/ (applied in Phase 2
before Harbor needs it). Separate module was redundant.
# --- Apply Kubernetes manifests with image substitution ---
echo "Applying image-cache-janitor manifests..."
kubectl kustomize "$JANITOR_DIR/" \
kubectl kustomize "$JANITOR_DIR/manifests/" \
Copy link
Copy Markdown
Contributor Author

@yangw-dev yangw-dev Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for janitor, we do not want it direclty apply kustomize
change kubernetes/ folder as manifest so it does not trigger phase2 for this module

we let the deploy.sh handle the changes since it has some placeholder

@yangw-dev yangw-dev requested a review from jeanschmidt April 8, 2026 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants