refactor(snapshot): add manifest-based snapshotctl flow and shared workload builders #7671

Merged
galletas1712 merged 59 commits into main from schwinns/snapshotctl-manifest-flow-clean
Apr 3, 2026

@galletas1712 commented Mar 29, 2026

Overview

This PR turns snapshot checkpoint/restore into a snapshot-owned protocol layer that is shared by all three entrypoints:

  • explicit DynamoCheckpoint CRs
  • DynamoGraphDeployment checkpoint restore / mode: Auto
  • standalone snapshotctl

The operator side is now much thinner. It keeps CR-specific reconciliation, identity resolution, and status projection, while deploy/snapshot owns the shared checkpoint protocol: labels/annotations, artifact version handling, job naming, lease-aware job observation, restore pod shaping, and snapshot-agent-backed storage discovery.

This also removes operator-owned checkpoint storage configuration. Snapshot storage is now owned by the snapshot chart and discovered from the namespace-local snapshot-agent DaemonSet, so the operator and snapshotctl consume the same source of truth.
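The discovery step described above can be sketched roughly as follows. This is a minimal illustration, not the code in deploy/snapshot: the `DaemonSet` and `Volume` types are simplified, hypothetical stand-ins for the Kubernetes API types, `DiscoverCheckpointPVC` is an invented name, and matching the `app.kubernetes.io/component=snapshot-agent` label on the DaemonSet itself is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Volume and DaemonSet are simplified, hypothetical stand-ins for the
// Kubernetes API types; real code would use k8s.io/api/apps/v1 and core/v1.
type Volume struct {
	Name      string
	ClaimName string // empty when the volume is not backed by a PVC
}

type DaemonSet struct {
	Name    string
	Labels  map[string]string
	Volumes []Volume
}

// Label named in the PR description for the snapshot agent.
const (
	agentLabelKey   = "app.kubernetes.io/component"
	agentLabelValue = "snapshot-agent"
)

// ErrNoSnapshotStorage is returned when no labeled DaemonSet exposes a PVC.
var ErrNoSnapshotStorage = errors.New("no snapshot-agent DaemonSet with a PVC-backed volume found")

// DiscoverCheckpointPVC (hypothetical name) scans the DaemonSets of one
// namespace and returns the claim name of the first PVC-backed volume on the
// first DaemonSet carrying the snapshot-agent label.
func DiscoverCheckpointPVC(daemonSets []DaemonSet) (string, error) {
	for _, ds := range daemonSets {
		if ds.Labels[agentLabelKey] != agentLabelValue {
			continue // not the snapshot agent
		}
		for _, v := range ds.Volumes {
			if v.ClaimName != "" {
				return v.ClaimName, nil
			}
		}
	}
	return "", ErrNoSnapshotStorage
}

func main() {
	// Example data only; the names below are made up for illustration.
	namespace := []DaemonSet{
		{Name: "node-exporter", Labels: map[string]string{agentLabelKey: "metrics"}},
		{
			Name:   "snapshot-agent",
			Labels: map[string]string{agentLabelKey: agentLabelValue},
			Volumes: []Volume{
				{Name: "host-run"},
				{Name: "checkpoints", ClaimName: "snapshot-pvc"},
			},
		},
	}
	claim, err := DiscoverCheckpointPVC(namespace)
	fmt.Println(claim, err)
}
```

Because both callers would go through one lookup like this, the operator and snapshotctl cannot disagree about where checkpoints live; a missing agent surfaces as an explicit error rather than a silent default.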

High-Level Changes

  • add snapshotctl checkpoint and snapshotctl restore for manifest-based checkpoint and restore without the operator
  • extract the shared snapshot logic into deploy/snapshot/protocol and reuse it from both snapshotctl and the operator
  • refactor DynamoCheckpoint reconciliation so the controller is mostly a thin CR wrapper around shared job/state logic
  • move checkpoint storage ownership fully into the snapshot module/chart and remove the redundant operator storage config surface
  • rewrite the snapshot docs so the primary flow is explicit DynamoCheckpoint, then DGD auto mode, then lower-level snapshotctl

Reviewer Start Points

  • deploy/snapshot/protocol/common.go
  • deploy/snapshot/protocol/checkpoint.go
  • deploy/snapshot/protocol/restore.go
  • deploy/snapshot/cmd/snapshotctl/
  • deploy/operator/internal/controller/dynamocheckpoint_controller.go
  • deploy/operator/internal/checkpointjob/job.go
  • docs/kubernetes/snapshot.md

Smaller Changes

  • move the snapshot implementation packages from deploy/snapshot/pkg/... to deploy/snapshot/internal/...
  • standardize the snapshot agent selector on app.kubernetes.io/component=snapshot-agent
  • add snapshot-owned PVC storage discovery from the snapshot-agent DaemonSet and use it from both operator restore prep and snapshotctl
  • add apps/daemonsets read RBAC where the operator now performs snapshot-owned storage discovery
  • keep storage.type in the snapshot chart for future expansion, but fail fast for non-pvc values today
  • remove checkpoint storage fields, defaults, validation, and Helm values from the operator config
  • update DynamoCheckpoint status/docs/generated CRDs to the current phase/hash/job/message model
  • simplify the operator Docker/Make/CI wiring by using a named BuildKit context for deploy/snapshot
  • delete the duplicate snapshot chart README and consolidate user docs into docs/kubernetes/snapshot.md
  • clean up the sample DynamoCheckpoint manifest and snapshot troubleshooting guidance
  • use the first container consistently in the shared snapshot checkpoint/restore paths, and document snapshotctl as a single-container flow
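Two of the smaller changes above, failing fast on non-pvc `storage.type` values and always targeting the first container, can be sketched as below. This is an illustration under stated assumptions rather than the PR's code: `ValidateStorageType`, `TargetContainer`, and the `Container` type are hypothetical names, and rejecting an empty `storage.type` is an assumption about how defaults are handled.

```go
package main

import (
	"errors"
	"fmt"
)

// The only storage type the PR description says is accepted today.
const storageTypePVC = "pvc"

// ValidateStorageType (hypothetical name) rejects every value other than
// "pvc" up front, so an unsupported backend fails at configuration time
// instead of partway through a checkpoint. An empty value is also rejected
// here; that choice is an assumption.
func ValidateStorageType(storageType string) error {
	if storageType != storageTypePVC {
		return fmt.Errorf("unsupported snapshot storage.type %q: only %q is supported",
			storageType, storageTypePVC)
	}
	return nil
}

// Container is a simplified, hypothetical stand-in for a pod container spec.
type Container struct {
	Name string
}

// TargetContainer (hypothetical name) returns the first container of a pod
// spec, mirroring the "first container" rule for checkpoint/restore paths.
func TargetContainer(containers []Container) (Container, error) {
	if len(containers) == 0 {
		return Container{}, errors.New("pod spec has no containers")
	}
	return containers[0], nil
}

func main() {
	fmt.Println(ValidateStorageType("pvc"))
	fmt.Println(ValidateStorageType("s3"))

	// Container names are made up for illustration.
	c, err := TargetContainer([]Container{{Name: "worker"}, {Name: "sidecar"}})
	fmt.Println(c.Name, err)
}
```

Keeping the `storage.type` field while rejecting other values leaves room to add backends later without changing the chart's schema.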

Summary by CodeRabbit

  • New Features

    • Added snapshotctl command-line tool for manual checkpoint and restore operations.
  • Refactor

    • Removed checkpoint storage backend configuration from operator settings; checkpoint storage is now snapshot-agent managed.
    • Restructured snapshot protocol to centralize checkpoint and restore workflows.
    • Reorganized snapshot package structure for better separation of concerns.
  • Documentation

    • Updated Helm chart documentation and Kubernetes API reference to reflect storage configuration removal.
    • Revised snapshot workflow guide with new quick-start using DynamoCheckpoint resources and auto-mode checkpointing.

@github-actions bot added the deployment::k8s label (Relates to dynamo deployment in kubernetes) Mar 29, 2026
Comment thread deploy/operator/config/rbac/role.yaml
Comment thread deploy/operator/internal/checkpoint/podspec.go Outdated
Comment thread deploy/snapshot/protocol/restore.go
Comment thread deploy/snapshot/protocol/common.go Outdated
Comment thread deploy/operator/internal/controller/dynamocheckpoint_controller.go
Comment thread deploy/snapshot/protocol/checkpoint.go Outdated
Comment thread deploy/snapshot/protocol/restore.go Outdated
Comment thread deploy/snapshot/cmd/snapshotctl/checkpoint.go
Comment thread deploy/helm/charts/snapshot/README.md

@dillon-cullinan left a comment

Approving changes under .github/workflows

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Labels

actions, container, deployment::k8s, documentation, refactor, size/XXL

4 participants