feat(mlflow): add enterprise patterns for complex kots deployments
The MLflow example is the strongest reference for vendors running multi-chart KOTS on existing clusters with stateful services. Several advisory engagements have surfaced gaps that, if closed, would make this the canonical enterprise reference implementation.
This issue tracks enhancing the MLflow example with patterns that complex deployments need but the example currently lacks.
Current State
MLflow already demonstrates:
- Multi-chart orchestration (infra operators + app chart via HelmChart weight)
- Embedded vs external PostgreSQL via KOTS Config
- CRD validation hooks
- Helm upgrade flags (--wait, --timeout)
- Basic support bundles and preflights
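For orientation, the weight ordering and upgrade flags listed above might look roughly like the following. This is a minimal sketch, not the example's actual manifests: the chart names, the placeholder versions, the weights, and the timeout value are all assumptions chosen for illustration.

```yaml
# Hypothetical sketch: the operator chart gets the lower weight so it is
# installed before the application chart.
apiVersion: kots.io/v1beta2
kind: HelmChart
metadata:
  name: cloudnative-pg
spec:
  chart:
    name: cloudnative-pg   # assumed chart name
    chartVersion: 0.0.0    # placeholder, not a real version
  weight: 0                # lower weights are installed first
  helmUpgradeFlags:
    - --wait
    - --timeout
    - 600s                 # assumed timeout
---
apiVersion: kots.io/v1beta2
kind: HelmChart
metadata:
  name: mlflow
spec:
  chart:
    name: mlflow
    chartVersion: 0.0.0    # placeholder, not a real version
  weight: 10               # installed after the operator chart
  helmUpgradeFlags:
    - --wait
    - --timeout
    - 600s
```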
Gaps Identified
1. KOTS auto-update behavior is undocumented
Vendors who choose KOTS specifically for auto-deploy need explicit documentation of how auto-deploy interacts with HelmChart weight ordering, what happens during operator upgrades, and which edge cases are known (config field changes, semver rollback, required versions).
Files to modify:
README.md or new docs/auto-update.md
kots/manifests/replicated-app.yaml (add autoDeploy field if not present)
2. License field wiring pattern is missing
The example does not show how to wire KOTS license entitlements (e.g., LicenseFieldValue "tier", LicenseFieldValue "max_users") into Helm values so the application can consume them. This is a common blocker for vendors with seat-based or feature-tier licensing.
Files to modify:
kots/manifests/kots-config.yaml — add Config items using {{repl LicenseFieldValue "..."}}
kots/manifests/mlflow-chart.yaml — map Config items into HelmChart CR values.license.*
chart/mlflow/templates/license-configmap.yaml — new file: reads .Values.license.* and creates ConfigMap for app consumption
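The three files above could fit together as sketched below. This is one possible wiring under stated assumptions, not the example's implementation: the Config item names (license_tier, license_max_users), the values key maxUsers, and the ConfigMap name and keys are hypothetical. The license fields "tier" and "max_users" are the ones named in this issue.

```yaml
# kots/manifests/kots-config.yaml (sketch)
apiVersion: kots.io/v1beta1
kind: Config
metadata:
  name: mlflow-config
spec:
  groups:
    - name: license
      title: License
      items:
        - name: license_tier            # hypothetical item name
          title: License tier
          type: text
          readonly: true
          value: '{{repl LicenseFieldValue "tier"}}'
        - name: license_max_users       # hypothetical item name
          title: Maximum users
          type: text
          readonly: true
          value: '{{repl LicenseFieldValue "max_users"}}'
---
# kots/manifests/mlflow-chart.yaml (fragment of the HelmChart CR)
spec:
  values:
    license:
      tier: 'repl{{ ConfigOption "license_tier" }}'
      maxUsers: 'repl{{ ConfigOption "license_max_users" }}'
---
# chart/mlflow/templates/license-configmap.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-license
data:
  # quote keeps numeric entitlements valid as ConfigMap string values
  LICENSE_TIER: {{ .Values.license.tier | quote }}
  LICENSE_MAX_USERS: {{ .Values.license.maxUsers | quote }}
```

Routing the entitlements through read-only Config items keeps them visible in the admin console. Whether the application reads the ConfigMap as environment variables or as a mounted file is left open here.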
3. Security hardening defaults are absent
The example has no podSecurityContext, containerSecurityContext, or NetworkPolicy examples. Vendors in regulated industries (SOC 2, HIPAA) require these.
Files to modify:
chart/mlflow/values.yaml — add default podSecurityContext and containerSecurityContext blocks
chart/mlflow/templates/networkpolicy.yaml — new file: default deny with same-namespace and ingress-nginx exceptions
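A sketch of what these defaults might look like follows. The UID/GID values, the policy name, and the assumption that the ingress controller runs in a namespace named ingress-nginx are illustrative and would need to match the actual image and cluster.

```yaml
# chart/mlflow/values.yaml (sketch)
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000            # assumed UID; must exist in the image
  fsGroup: 1000              # assumed GID
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
---
# chart/mlflow/templates/networkpolicy.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: {{ .Release.Name }}-default-deny   # hypothetical name
spec:
  podSelector: {}            # applies to every pod in the release namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # exception: any pod in the same namespace
        - namespaceSelector: # exception: the ingress controller namespace
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```

With readOnlyRootFilesystem enabled, any path the application writes to (temporary files, caches) would need an emptyDir or other writable volume mount.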
4. GPU node scheduling pattern is missing
Vendors with ML/AI workloads need examples of GPU node selectors, tolerations, and nvidia.com/gpu resource limits.
Files to modify:
chart/mlflow/values.yaml — add gpu.enabled, gpu.nodeSelector, gpu.tolerations, gpu.resources
chart/mlflow/templates/deployment.yaml — conditionally inject GPU scheduling blocks
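The values and the conditional injection could be sketched as follows. The node label, the taint key, and the container name are assumptions: they depend on how GPU nodes are labeled and tainted in a given cluster.

```yaml
# chart/mlflow/values.yaml (sketch)
gpu:
  enabled: false
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed label on GPU nodes
  tolerations:
    - key: nvidia.com/gpu            # assumed taint key on GPU nodes
      operator: Exists
      effect: NoSchedule
  resources:
    limits:
      nvidia.com/gpu: 1
---
# chart/mlflow/templates/deployment.yaml (fragment of the pod template)
    spec:
      {{- if .Values.gpu.enabled }}
      nodeSelector:
        {{- toYaml .Values.gpu.nodeSelector | nindent 8 }}
      tolerations:
        {{- toYaml .Values.gpu.tolerations | nindent 8 }}
      {{- end }}
      containers:
        - name: mlflow               # assumed container name
          {{- if .Values.gpu.enabled }}
          resources:
            {{- toYaml .Values.gpu.resources | nindent 12 }}
          {{- end }}
```

Keeping gpu.enabled false by default leaves the rendered Deployment unchanged for installs without GPU nodes.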
5. KOTS snapshot / backup integration is not configured
For stateful services (PostgreSQL, MinIO), vendors need examples of KOTS snapshot configuration and restore procedures.
Files to modify:
kots/manifests/replicated-app.yaml — add backup.volumes for PostgreSQL and MinIO PVCs
kots/manifests/kots-preflight.yaml — add snapshot collector check
README.md — document backup/restore procedure
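KOTS snapshots are built on Velero. One common shape, offered here as a sketch rather than as the example's configuration, is a Velero Backup resource in the release plus a pod annotation naming the volumes to include. The volume name data is hypothetical. For operator-managed PostgreSQL and MinIO pods the annotation would have to be applied through whatever pod-metadata mechanism each operator exposes, which this sketch does not cover, and how it maps onto the backup.volumes change proposed above is left open.

```yaml
# kots/manifests/backup.yaml (hypothetical file name)
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: backup
spec: {}
---
# Fragment of a pod template for a stateful workload
metadata:
  annotations:
    # comma-separated names of the pod volumes Velero should back up
    backup.velero.io/backup-volumes: data   # hypothetical volume name
```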
6. Support bundle wrapper charts for upstream dependencies
MLflow depends on upstream charts (CloudNativePG, MinIO Operator) that do not include support bundles. The patterns/support-bundles doc shows how to create wrapper charts, but MLflow itself does not implement this.
Files to add:
charts/postgres-wrapper/ — wrapper chart with _supportbundle.tpl and secret-supportbundle.yaml
charts/minio-operator-wrapper/ — same pattern
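A wrapper chart following this pattern might contain something like the two templates below. This is a sketch under assumptions: the template name, the Secret name, and the pod label used in the log selector are hypothetical and should be checked against the upstream chart's labels. The Secret label and the support-bundle-spec key follow the Troubleshoot convention for discovering specs in a cluster.

```yaml
# charts/postgres-wrapper/templates/_supportbundle.tpl (sketch)
{{- define "postgres-wrapper.supportbundle" -}}
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: postgres-wrapper
spec:
  collectors:
    - logs:
        name: postgres-operator/logs
        namespace: {{ .Release.Namespace }}
        selector:
          - app.kubernetes.io/name=cloudnative-pg   # assumed operator pod label
{{- end -}}

# charts/postgres-wrapper/templates/secret-supportbundle.yaml (sketch)
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-supportbundle           # hypothetical name
  labels:
    troubleshoot.sh/kind: support-bundle
stringData:
  support-bundle-spec: |
    {{- include "postgres-wrapper.supportbundle" . | nindent 4 }}
```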
7. Air-gap preflight checks are missing
The existing preflights do not validate registry reachability, image bundle completeness, or local mirror configuration.
Files to modify:
kots/manifests/kots-preflight.yaml — add runPod or imagePull collectors for registry validation
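As one possible starting point, the sketch below uses Troubleshoot's registryImages collector and analyzer instead of the runPod or imagePull options named above. It checks that a list of images can be found in the registry. The image references are placeholders, and in an air-gap install they would need to point at the local mirror.

```yaml
# kots/manifests/kots-preflight.yaml (sketch)
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: mlflow-airgap-preflight          # hypothetical name
spec:
  collectors:
    - registryImages:
        images:
          - registry.example.com/mlflow/mlflow:0.0.0   # placeholder image
  analyzers:
    - registryImages:
        checkName: Application images available
        outcomes:
          - fail:
              when: "missing > 0"
              message: One or more application images are missing from the registry.
          - warn:
              when: "errors > 0"
              message: Image availability could not be verified.
          - pass:
              message: All application images are available.
```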
Acceptance Criteria
- […] (LicenseFieldValue) -> HelmChart CR values -> application ConfigMap
- […] (runAsNonRoot: true, readOnlyRootFilesystem: true) to values.yaml
- […] README.md with enterprise deployment guide covering all new patterns
Related
- patterns/support-bundles — wrapper chart pattern
- patterns/multi-chart-orchestration — weight ordering guide
- patterns/embedded-vs-external-database — database mode switching
Notes
These enhancements would make MLflow the canonical reference for vendors with:
- Multi-chart KOTS on existing clusters
- Stateful service operators (PostgreSQL, object storage)
- License-based feature gating
- Security/compliance requirements
- ML/AI workloads with GPU scheduling needs
- Air-gap deployment requirements
The goal is one well-maintained, comprehensive example rather than many thin ones.