fix(k8s): isolate custom-stack namespaces and retain shared ns on destroy #230
fix(k8s): isolate custom-stack namespaces and retain shared ns on destroy
Sub-env client stacks (parentEnv: production, stackEnv: gl-pay/payhey/...)
were deriving their k8s namespace name from stackName via deployment.go:67:
namespace := lo.If(args.Namespace == "", stackName).Else(args.Namespace)
Every sibling under the same stackName ended up pointing at the same
physical namespace (e.g. `pay-space-wallet`). Each Pulumi stack independently
created a Namespace resource for that metadata.Name, with a unique URN
suffix `<deployment>-ns`. When *any* sibling stack was destroyed, Pulumi ran
the delete operation for its tracked Namespace resource — which calls k8s
to delete the namespace by metadata.Name. Kubernetes obliged and
cascade-deleted *every* resource in that namespace, including everything
owned by the other live sibling stacks.
Real outage: a destroy of a throwaway `caddy-test` sub-env stack wiped the
production wallet/gl-pay/payhey/rulex/smart-gate Deployments and Services.
Recovery required redeploying all five plus rolling Caddy.
Two-layer fix in this PR:
1. Proper isolation — each custom stack gets its own physical namespace.
`generateNamespaceName(baseNS, stackEnv, parentEnv)` in naming.go suffixes
the namespace with `-stackEnv` for custom stacks (parentEnv != stackEnv),
mirroring the per-stackEnv suffix every other resource type
(Deployment/Service/Secret/ConfigMap/HPA/VPA/ImagePullSecret) already gets
via generateResourceName. Standard stacks (parentEnv unset, or
parentEnv == stackEnv) keep their existing stackName-based namespace, so
the parent stack itself is untouched. After this change, sibling sub-envs
no longer share a namespace and `pulumi destroy` cleanly removes only
that stack's resources.
2. RetainOnDelete safety net — both `corev1.NewNamespace` call sites
(the client-stack one in simple_container.go and the helm-operator one
in helpers.go) now pass `sdk.RetainOnDelete(true)`. Pulumi keeps the
namespace resource in state but skips the k8s delete API call on
destroy. This is critical during the per-stack migration: when a custom
stack first runs `pulumi up` with this version, Pulumi sees the namespace
metadata.Name change (`pay-space-wallet` → `pay-space-wallet-gl-pay`),
schedules a Replace, creates the new namespace, and *would* delete the
shared parent namespace if not for RetainOnDelete. After migration,
RetainOnDelete continues to defend against accidental destroy of any
namespace that ends up holding more than one stack's resources (e.g.
shared helm-operator namespaces).
Migration semantics: any deploy that already uses parentEnv != stackEnv
will Replace its namespace-scoped resources on the next `pulumi up` —
Pulumi creates them in the new namespace and deletes the old ones. The
parent stack is unaffected because its resources sit in a different
Pulumi stack with different URNs. Caddy auto-discovers services across
all namespaces (kubectl get services --all-namespaces) and the Caddyfile
upstream URL encodes namespace via the existing `${namespace}` placeholder
in simple_container.go, so routing follows the new namespace automatically.
The empty parent namespace lingers only if the *last* sibling under one
stackName is destroyed; it must be cleaned up manually. That's the right
default — silent destructive cascade across stacks is far worse than a
leaked empty namespace.
Tests:
- TestGenerateNamespaceName covers all parentEnv/stackEnv combinations
including the regression cases.
- TestGenerateNamespaceName_SiblingsAreUnique enumerates the
pay_space_wallet outage scenario (production parent + 5 sub-envs +
caddy-test) and asserts each resolves to a distinct namespace.
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Problem
Sub-env client stacks (`parentEnv: production`, `stackEnv: tenant-a`/`tenant-b`/...) were deriving their k8s namespace name from `stackName` via pkg/clouds/pulumi/kubernetes/deployment.go:67:
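```go
// pkg/clouds/pulumi/kubernetes/deployment.go:67
namespace := lo.If(args.Namespace == "", stackName).Else(args.Namespace)
```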
So every sibling under the same `stackName` ended up pointing at the same physical namespace. Each Pulumi stack independently tracked its own `Namespace` resource for that `metadata.Name` with a unique URN suffix, but they all referenced the same physical namespace. Every other resource type (Deployment, Service, Secret, ConfigMap, HPA, VPA, ImagePullSecret) was already correctly env-suffixed via generateResourceName — the namespace was the one piece left non-isolated.

When any sibling stack was destroyed, Pulumi ran the delete for its tracked Namespace resource — which calls k8s to delete the namespace by `metadata.Name`. K8s obliged and cascade-deleted every resource in that namespace, including everything owned by the other live sibling stacks.

Real outage
Destroying a throwaway sub-env stack on a production cluster wiped every live sibling's Deployments, Services, and namespace-scoped Secrets in one shot. Recovery required redeploying all of them plus rolling Caddy.
Fix — three layers
1. Per-stackEnv namespace for custom stacks
New helper `GenerateNamespaceName(baseNS, stackEnv, parentEnv)` exported from naming.go. For custom stacks (`parentEnv != stackEnv`) the namespace is suffixed with `-{stackEnv}`, mirroring the suffix every other resource already gets. Standard stacks (`parentEnv` unset, or `parentEnv == stackEnv`) keep the existing `stackName`-based namespace, so the parent stack is untouched.

The result is RFC 1123-sanitized inside the helper (lowercase, `_` → `-`, ≤63 chars with FNV-1a truncation), so direct callers can pass it straight into `metadata.namespace` without their own sanitization step. Sanitization is idempotent — pre-sanitized inputs see no change.
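A minimal sketch of the helper's shape under those rules (the real naming.go implementation may differ; `sanitizeNamespace` and the exact hash encoding here are assumptions):

```go
package kubernetes

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// GenerateNamespaceName keeps standard stacks on the shared base namespace and
// gives custom stacks (parentEnv != stackEnv) a per-stackEnv suffix.
func GenerateNamespaceName(baseNS, stackEnv, parentEnv string) string {
	if parentEnv == "" || parentEnv == stackEnv {
		return sanitizeNamespace(baseNS)
	}
	return sanitizeNamespace(fmt.Sprintf("%s-%s", baseNS, stackEnv))
}

// sanitizeNamespace (hypothetical name) applies the RFC 1123 rules described
// above: lowercase, "_" -> "-", and a 63-char cap with an FNV-1a disambiguator.
// It is idempotent: already-sane inputs pass through unchanged.
func sanitizeNamespace(name string) string {
	name = strings.ToLower(strings.ReplaceAll(name, "_", "-"))
	if len(name) <= 63 {
		return name
	}
	h := fnv.New32a()
	h.Write([]byte(name))
	suffix := fmt.Sprintf("-%x", h.Sum32())
	return name[:63-len(suffix)] + suffix
}
```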
Behavior matrix:

| stackEnv | resulting namespace |
| --- | --- |
| `production` (self-reference, == parentEnv) | `<stackName>` |
| `tenant-a` | `<stackName>-tenant-a` |
| `tenant-b` | `<stackName>-tenant-b` |
| `preview-test` | `<stackName>-preview-test` |
| `staging` (standard stack, parentEnv unset) | `<stackName>` |

2. Dependency-resource processors aligned with the new namespace
The init-job and CloudSQL-proxy code paths previously hardcoded `params.input.StackParams.StackName` as the namespace — fine when the pod also lived in `<stackName>`, but stranded after the change above. Updated three call sites to derive the namespace via `kubernetes.GenerateNamespaceName(stackName, stackEnv, parentEnv)`; a sketch of the change is shown below. Without these updates, custom-stack pods would fail to mount the CloudSQL proxy credential Secret (which would have been created in the now-different parent namespace).
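A minimal sketch of the call-site change (the wrapper name is hypothetical; `kubernetes` refers to the repo's pkg/clouds/pulumi/kubernetes package, and in the real code the three values come from the stack params):

```go
// dependencyNamespace illustrates the updated call sites: init-job and
// CloudSQL-proxy pods now resolve their namespace the same way deployment.go
// does, instead of hardcoding params.input.StackParams.StackName.
func dependencyNamespace(stackName, stackEnv, parentEnv string) string {
	return kubernetes.GenerateNamespaceName(stackName, stackEnv, parentEnv)
}
```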
3. `RetainOnDelete(true)` on namespace resources

Both `corev1.NewNamespace` call sites pass `sdk.RetainOnDelete(true)`:
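A sketch of the call (import paths, helper and argument names are assumptions; the point is the `RetainOnDelete(true)` option):

```go
import (
	"fmt"

	corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/core/v1"
	metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/meta/v1"
	sdk "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// createStackNamespace sketches the shared shape of both call sites: the URN
// keeps its "<deployment>-ns" suffix, and RetainOnDelete keeps the resource in
// Pulumi state while skipping the k8s delete call on destroy.
func createStackNamespace(ctx *sdk.Context, deploymentName, namespaceName string, provider sdk.ProviderResource) (*corev1.Namespace, error) {
	return corev1.NewNamespace(ctx, fmt.Sprintf("%s-ns", deploymentName), &corev1.NamespaceArgs{
		Metadata: &metav1.ObjectMetaArgs{
			Name: sdk.String(namespaceName),
		},
	},
		sdk.Provider(provider),
		sdk.RetainOnDelete(true),
	)
}
```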
Pulumi keeps the resource in state but skips the k8s delete API call on destroy. This is critical during migration: when an existing custom stack first runs `pulumi up` with this version, Pulumi sees the namespace `metadata.Name` change, schedules a Replace, creates the new namespace, and would delete the old shared namespace (wiping the parent stack and any siblings still on the old NS) — except that `RetainOnDelete` skips the delete. The parent's resources keep running through the migration.

`RetainOnDelete` also continues to defend against accidental destroy of any namespace that legitimately ends up holding multiple stacks' resources (helm operators, anyone who explicitly sets the same `Namespace` on multiple stacks). The same pattern is already used elsewhere in the codebase for shared resources (see cloudflare/registrar.go:143,320).

Migration semantics
Any deploy that uses `parentEnv != stackEnv` will Replace its namespace-scoped resources on the next `pulumi up` — Pulumi creates them in the new namespace and deletes the old ones. The parent stack is unaffected because its resources sit in a different Pulumi stack with different URNs.

Caddy routing follows automatically:
- Service discovery runs `kubectl get services --all-namespaces` (caddy.go:189) and picks up services carrying the `simple-container.com/caddyfile-entry` annotation
- The Caddyfile upstream URL encodes the namespace via the existing `${namespace}` placeholder, so upstreams follow services into their new namespaces
pulumi upruns against any retained namespace patch the existing object via SSA rather than throwingAlreadyExists. Refresh, import, and replace flows are unaffected —RetainOnDeleteonly changes the destroy path.The empty parent namespace lingers only if the last stack referencing it is destroyed; manual cleanup. Right trade vs. silent cascade.
If a custom stack uses `persistentVolumes` (simple_container.go:397 creates PVCs), the namespace move triggers a Pulumi Replace on each PVC. Because PVCs are namespace-scoped and not movable, Pulumi creates the new PVC and deletes the old one. If the StorageClass's `reclaimPolicy` is `Delete` (the default for dynamic volumes on GCP/AWS), the underlying PV and its data are destroyed.

Mitigations for any consumer with stateful custom stacks before merging this:

- Set `persistentVolumeReclaimPolicy: Retain` first (`kubectl patch pv ... --patch ...`)
- `kubectl edit pv <name>` to clear `claimRef` and reattach to the new PVC after the migration
Stacks that don't define `persistentVolumes` (the typical case, where state lives in managed services like Cloud SQL / RDS / Redis) are unaffected.

Tests

- `TestGenerateNamespaceName` — table-driven coverage of standard / self-reference / custom-stack derivation, including underscore normalization and case folding
- `TestGenerateNamespaceName_SiblingsAreUnique` — direct regression for the shared-namespace outage scenario (parent + 4 tenant sub-envs + preview-test all resolve to distinct namespaces)
- `go test ./pkg/clouds/pulumi/kubernetes/...` and `./pkg/clouds/pulumi/gcp/...` pass
- `go build ./...` clean
- `pulumi preview` against a real custom-stack consumer to validate the migration plan
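A rough sketch of the sibling-uniqueness check (it lives alongside naming.go; the real test's structure and fixture names may differ):

```go
func TestGenerateNamespaceName_SiblingsAreUnique_Sketch(t *testing.T) {
	const parentEnv = "production"
	stackEnvs := []string{parentEnv, "tenant-a", "tenant-b", "tenant-c", "tenant-d", "preview-test"}

	seen := map[string]string{}
	for _, stackEnv := range stackEnvs {
		ns := GenerateNamespaceName("my-stack", stackEnv, parentEnv)
		if prev, ok := seen[ns]; ok {
			t.Fatalf("stackEnvs %q and %q resolve to the same namespace %q", prev, stackEnv, ns)
		}
		seen[ns] = stackEnv
	}
}
```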
Breaking change scope

Any SC consumer with a deploy where `parentEnv != stackEnv` will see their custom stacks recreate-in-new-ns on the next `pulumi up`. There is a brief gap during the namespace cutover; `RetainOnDelete` keeps the old namespace alive, so the parent stack continues to serve regardless. Stateful custom stacks should follow the PVC caveat above.