feat(mlflow): add enterprise patterns for complex kots deployments #166

@kriscoleman

The MLflow example is the strongest reference for vendors running multi-chart KOTS on existing clusters with stateful services. Several advisory engagements have surfaced gaps that, if closed, would make this the canonical enterprise reference implementation.

This issue tracks enhancing the MLflow example with patterns that complex deployments need but currently lack.

Current State

MLflow already demonstrates:

  • Multi-chart orchestration (infra operators + app chart via HelmChart weight)
  • Embedded vs external PostgreSQL via KOTS Config
  • CRD validation hooks
  • Helm upgrade flags (--wait, --timeout)
  • Basic support bundles and preflights

Gaps Identified

1. KOTS auto-update behavior is undocumented

Vendors choosing KOTS specifically for auto-deploy need explicit documentation on how it behaves with HelmChart weight ordering, what happens during operator upgrades, and known edge cases (config field changes, semver rollback, required versions).

Files to modify:

  • README.md or new docs/auto-update.md
  • kots/manifests/replicated-app.yaml (add autoDeploy field if not present)
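
To make the ordering question concrete, here is a minimal sketch of two HelmChart custom resources with different weights. The chart names and versions are placeholders, not the values in this repo. As far as I understand it, KOTS applies charts in ascending weight order, so the operator chart would be applied before the app chart on every deploy, including automatic ones.

```yaml
# Sketch only: chart names and versions are placeholders.
apiVersion: kots.io/v1beta2
kind: HelmChart
metadata:
  name: infra-operators
spec:
  chart:
    name: infra-operators      # hypothetical chart name
    chartVersion: 0.1.0        # hypothetical version
  weight: 0                    # lower weight is applied first
  helmUpgradeFlags:
    - --wait
    - --timeout
    - 10m
---
apiVersion: kots.io/v1beta2
kind: HelmChart
metadata:
  name: mlflow
spec:
  chart:
    name: mlflow
    chartVersion: 0.1.0        # hypothetical version
  weight: 10                   # applied after the operators
  helmUpgradeFlags:
    - --wait
    - --timeout
    - 10m
```

The documentation should state what happens when the weight 0 chart fails or times out during an automatic deploy, since that is the case vendors ask about most.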

2. License field wiring pattern is missing

The example does not show how to wire KOTS license entitlements (e.g., LicenseFieldValue "tier", LicenseFieldValue "max_users") into Helm values so the application can consume them. This is a common blocker for vendors with seat-based or feature-tier licensing.

Files to modify:

  • kots/manifests/kots-config.yaml — add Config items using {{repl LicenseFieldValue "..."}}
  • kots/manifests/mlflow-chart.yaml — map Config items into HelmChart CR values.license.*
  • chart/mlflow/templates/license-configmap.yaml — new file: reads .Values.license.* and creates ConfigMap for app consumption
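
The three files above could be wired together roughly as follows. This is a sketch under assumptions: the config item names (license_tier, license_max_users), the values keys under license, and the ConfigMap name are all hypothetical, and whether default or value is the right Config property for picking up license changes on sync needs to be verified.

```yaml
# kots/manifests/kots-config.yaml (excerpt)
apiVersion: kots.io/v1beta1
kind: Config
metadata:
  name: mlflow-config
spec:
  groups:
    - name: license
      title: License
      items:
        - name: license_tier            # hypothetical item name
          type: text
          readonly: true
          default: '{{repl LicenseFieldValue "tier"}}'
        - name: license_max_users       # hypothetical item name
          type: text
          readonly: true
          default: '{{repl LicenseFieldValue "max_users"}}'
---
# kots/manifests/mlflow-chart.yaml (excerpt)
apiVersion: kots.io/v1beta2
kind: HelmChart
metadata:
  name: mlflow
spec:
  chart:
    name: mlflow
    chartVersion: 0.1.0                 # hypothetical version
  values:
    license:
      tier: '{{repl ConfigOption "license_tier"}}'
      maxUsers: '{{repl ConfigOption "license_max_users"}}'
---
# chart/mlflow/templates/license-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-license
data:
  tier: {{ .Values.license.tier | quote }}
  maxUsers: {{ .Values.license.maxUsers | quote }}
```

The application would then read the ConfigMap as environment variables or a mounted file. Both values arrive as strings, so max_users needs parsing on the application side.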

3. Security hardening defaults are absent

The chart ships no podSecurityContext, containerSecurityContext, or NetworkPolicy examples. Vendors selling into regulated industries (SOC 2, HIPAA) are routinely asked for these controls during security review.

Files to modify:

  • chart/mlflow/values.yaml — add default podSecurityContext and containerSecurityContext blocks
  • New file: chart/mlflow/templates/networkpolicy.yaml — default deny with same-namespace and ingress-nginx exceptions
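
A possible starting point for both files is sketched below. The UID/GID values, the policy name, and the assumption that the ingress controller runs in a namespace named ingress-nginx are illustrative and would need to match the actual image and cluster.

```yaml
# chart/mlflow/values.yaml (excerpt)
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000          # assumed UID; must match the image
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
---
# chart/mlflow/templates/networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: {{ .Release.Name }}-default-deny
spec:
  podSelector: {}          # applies to every pod in the release namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # any pod in the same namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```

With readOnlyRootFilesystem enabled, any path the application writes to (for example a temp directory) would need an emptyDir mount. The policy above restricts ingress only; egress rules are left out of this sketch.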

4. GPU node scheduling pattern is missing

Vendors with ML/AI workloads need examples of GPU node selectors, tolerations, and nvidia.com/gpu resource limits.

Files to modify:

  • chart/mlflow/values.yaml — add gpu.enabled, gpu.nodeSelector, gpu.tolerations, gpu.resources
  • chart/mlflow/templates/deployment.yaml — conditionally inject GPU scheduling blocks
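
One way the values and the conditional block could look is sketched here. The node label and taint key are assumptions (they depend on how the cluster's GPU nodes are labeled and tainted), and the container name is hypothetical.

```yaml
# chart/mlflow/values.yaml (excerpt)
gpu:
  enabled: false
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed label; varies by cluster
  tolerations:
    - key: nvidia.com/gpu            # assumed taint key
      operator: Exists
      effect: NoSchedule
  resources:
    limits:
      nvidia.com/gpu: 1
---
# chart/mlflow/templates/deployment.yaml (fragment of spec.template.spec)
      {{- if .Values.gpu.enabled }}
      {{- with .Values.gpu.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.gpu.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- end }}
      containers:
        - name: mlflow               # hypothetical container name
          {{- if .Values.gpu.enabled }}
          resources:
            {{- toYaml .Values.gpu.resources | nindent 12 }}
          {{- end }}
```

If the chart already exposes a general resources value, the GPU limit would have to be merged with it rather than replacing it as this fragment does.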

5. KOTS snapshot / backup integration is not configured

For stateful services (PostgreSQL, MinIO), vendors need examples of KOTS snapshot configuration and restore procedures.

Files to modify:

  • kots/manifests/replicated-app.yaml — add backup.volumes for PostgreSQL and MinIO PVCs
  • kots/manifests/kots-preflight.yaml — add snapshot collector check
  • README.md — document backup/restore procedure
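
For orientation, KOTS snapshots are built on Velero, and one commonly used wiring is a Backup resource in the release plus a pod annotation naming the volumes to include. The sketch below shows that shape. The workload name and the volume name are hypothetical, and where this configuration belongs in this repo (and how it interacts with operator-managed pods) is still to be decided.

```yaml
# Backup resource shipped with the release
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: backup
spec: {}
---
# Fragment of a workload pod template that opts a volume into the backup
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-stateful-service     # hypothetical workload
spec:
  template:
    metadata:
      annotations:
        backup.velero.io/backup-volumes: data   # hypothetical volume name
```

For PostgreSQL, a file-level volume copy taken while the database is running may not be consistent, so the restore documentation should say whether a dump hook or an operator-native backup is expected instead.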

6. Support bundle wrapper charts for upstream dependencies are not implemented

MLflow depends on upstream charts (CloudNativePG, MinIO Operator) that do not include support bundles. The patterns/support-bundles doc shows how to create wrapper charts, but MLflow itself does not implement this.

Files to add:

  • charts/postgres-wrapper/ — wrapper chart with _supportbundle.tpl and secret-supportbundle.yaml
  • charts/minio-operator-wrapper/ — same pattern
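
A wrapper chart along these lines might look like the sketch below. The dependency version is a placeholder, the repository URL should be checked against the upstream project, and the pod label selector is an assumption. The spec is inlined here for brevity, where the actual pattern would render it from _supportbundle.tpl.

```yaml
# charts/postgres-wrapper/Chart.yaml
apiVersion: v2
name: postgres-wrapper
version: 0.1.0
dependencies:
  - name: cloudnative-pg
    version: "0.0.0"                 # placeholder; pin to the tested version
    repository: https://cloudnative-pg.github.io/charts   # verify upstream
---
# charts/postgres-wrapper/templates/secret-supportbundle.yaml
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-supportbundle
  labels:
    troubleshoot.sh/kind: support-bundle
stringData:
  support-bundle-spec: |
    apiVersion: troubleshoot.sh/v1beta2
    kind: SupportBundle
    metadata:
      name: {{ .Release.Name }}-supportbundle
    spec:
      collectors:
        - logs:
            namespace: {{ .Release.Namespace }}
            selector:
              - app.kubernetes.io/name=cloudnative-pg   # assumed pod label
            limits:
              maxLines: 1000
```

The troubleshoot.sh/kind label is what lets the support-bundle tooling discover the spec in the cluster, so the wrapper adds collectors without forking the upstream chart.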

7. Air-gap preflight checks are missing

The existing preflights do not validate registry reachability, image bundle completeness, or local mirror configuration.

Files to modify:

  • kots/manifests/kots-preflight.yaml — add runPod or imagePull collectors for registry validation
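
As one possible shape for the image check, Troubleshoot has a registryImages collector and analyzer pair that verifies listed images are present in the registry. The sketch below assumes that pair is suitable here. The image references are hypothetical and would need to be the images the release actually ships.

```yaml
# kots/manifests/kots-preflight.yaml (excerpt)
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: mlflow-airgap-preflight
spec:
  collectors:
    - registryImages:
        images:
          - registry.example.com/mlflow/mlflow:1.0.0     # hypothetical image
          - registry.example.com/mlflow/postgres:16      # hypothetical image
  analyzers:
    - registryImages:
        checkName: Application images available in registry
        outcomes:
          - fail:
              when: "missing > 0"
              message: One or more images are missing from the registry.
          - warn:
              when: "errors > 0"
              message: Could not verify whether all images are present.
          - pass:
              message: All listed images are present in the registry.
```

A runPod collector that probes the registry endpoint from inside the cluster would complement this by separating "registry unreachable" from "image missing".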

Acceptance Criteria

  • Document KOTS auto-update behavior with HelmChart weight ordering and known edge cases
  • Add license field wiring: KOTS Config (LicenseFieldValue) -> HelmChart CR values -> application ConfigMap
  • Add security context defaults (runAsNonRoot: true, readOnlyRootFilesystem: true) to values.yaml
  • Add NetworkPolicy template with default-deny + explicit allow rules
  • Add GPU scheduling pattern (nodeSelector, tolerations, resource limits) with conditional enablement
  • Add KOTS snapshot configuration for PostgreSQL and MinIO PVCs with restore documentation
  • Add support bundle wrapper charts for at least one upstream dependency (PostgreSQL or MinIO)
  • Add air-gap preflight checks (registry reachability, image bundle validation)
  • Update README.md with enterprise deployment guide covering all new patterns
  • Verify all patterns work in a CMX or test cluster install

Notes

These enhancements would make MLflow the canonical reference for vendors with:

  • Multi-chart KOTS on existing clusters
  • Stateful service operators (PostgreSQL, object storage)
  • License-based feature gating
  • Security/compliance requirements
  • ML/AI workloads with GPU scheduling needs
  • Air-gap deployment requirements

The goal is one well-maintained, comprehensive example rather than many thin ones.
