refactor(snapshot): add manifest-based snapshotctl flow and shared workload builders #7671
Merged
galletas1712 merged 59 commits into main on Apr 3, 2026
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
tmonty12 reviewed Apr 2, 2026
This was referenced Apr 2, 2026
julienmancuso approved these changes Apr 2, 2026
dillon-cullinan left a comment
Approving changes under .github/workflows
dillon-cullinan approved these changes Apr 3, 2026
…manifest-flow-clean
Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
Overview
This PR turns snapshot checkpoint/restore into a snapshot-owned protocol layer that is shared by all three entrypoints:
- DynamoCheckpoint CRs
- DynamoGraphDeployment checkpoint restore / mode: Auto
- snapshotctl
The operator side is now much thinner. It keeps CR-specific reconciliation, identity resolution, and status projection, while deploy/snapshot owns the shared checkpoint protocol: labels/annotations, artifact version handling, job naming, lease-aware job observation, restore pod shaping, and snapshot-agent-backed storage discovery.
This also removes operator-owned checkpoint storage configuration. Snapshot storage is now owned by the snapshot chart and discovered from the namespace-local snapshot-agent DaemonSet, so the operator and snapshotctl consume the same source of truth.
High-Level Changes
- snapshotctl checkpoint and snapshotctl restore for manifest-based checkpoint and restore without the operator
- deploy/snapshot/protocol and reuse it from both snapshotctl and the operator
- DynamoCheckpoint reconciliation so the controller is mostly a thin CR wrapper around shared job/state logic
- DynamoCheckpoint, then DGD auto mode, then lower-level snapshotctl
Reviewer Start Points
- deploy/snapshot/protocol/common.go
- deploy/snapshot/protocol/checkpoint.go
- deploy/snapshot/protocol/restore.go
- deploy/snapshot/cmd/snapshotctl/
- deploy/operator/internal/controller/dynamocheckpoint_controller.go
- deploy/operator/internal/checkpointjob/job.go
- docs/kubernetes/snapshot.md
Smaller Changes
- deploy/snapshot/pkg/... to deploy/snapshot/internal/...
- app.kubernetes.io/component=snapshot-agent
- snapshot-agent DaemonSet and use it from both operator restore prep and snapshotctl
- apps/daemonsets read RBAC where the operator now performs snapshot-owned storage discovery
- storage.type in the snapshot chart for future expansion, but fail fast for non-pvc values today
- DynamoCheckpoint status/docs/generated CRDs to the current phase/hash/job/message model
- deploy/snapshot
- docs/kubernetes/snapshot.md
- DynamoCheckpoint manifest and snapshot troubleshooting guidance
- snapshotctl as a single-container flow
Summary by CodeRabbit
New Features
- snapshotctl command-line tool for manual checkpoint and restore operations.
Refactor
Documentation
- DynamoCheckpoint resources and auto-mode checkpointing.