Fsdp2 stormscope [WIP] by negin513 · Pull Request #1671 · NVIDIA/physicsnemo

negin513 · 2026-05-26T17:09:08Z

PhysicsNeMo Pull Request

Migrates StormScope off FSDP1 onto FSDP2 (fully_shard / FSDPModule)....

Description

FSDP1's flat-param machinery doesn't compose with ShardTensor / DTensor.

This is the immediate motivator: the refactored ShardTensor in #1556 breaks FSDP1's backward pass in StormScope. Until StormScope/StormCast move to FSDP2, domain-parallel implementations either do not work with FSDP or have to fall back to DDP entirely. The current implementation is DDP only.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.
If I am implementing a new model or modifying any existing model, I have followed the Models Implementation Coding Standards.

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is it an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

copy-pr-bot · 2026-05-26T17:09:12Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

negin513 · 2026-05-26T17:16:23Z

-            forward_prefetch=True,  # Optimization for faster training
-            backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # Backward prefetching for overlap
-        )
+        # FSDP2 rejects non-contiguous parameters (PyTorch <= 2.10):


NOTE: this block exists solely for backward compatibility with PyTorch <= 2.10. Do we care about backward compatibility?
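
If older releases still matter, one option is to gate the workaround on the PyTorch version instead of running it unconditionally. The following is a minimal sketch, not code from this PR: the helper names and the `LAST_AFFECTED` constant are hypothetical, and the 2.10 cutoff is taken only from the diff comment above.

```python
import re
from typing import Tuple

# Assumption from the diff comment: FSDP2 rejects non-contiguous
# parameters in PyTorch releases up to and including 2.10.
LAST_AFFECTED: Tuple[int, int] = (2, 10)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.10.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_contiguity_workaround(version: str) -> bool:
    """Return True when the pre-shard contiguity loop should run."""
    # Compare as integer tuples; comparing the raw strings would
    # order '2.9' after '2.10'.
    return parse_major_minor(version) <= LAST_AFFECTED
```

In the training script this would presumably be called as `needs_contiguity_workaround(torch.__version__)`. Comparing integer tuples rather than strings matters here, because a plain string comparison places "2.9" after "2.10".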

greptile-apps · 2026-05-26T17:17:29Z

Greptile Summary

This PR migrates StormScope/StormCast from FSDP1 (FullyShardedDataParallel with NO_SHARD) to FSDP2 (fully_shard / FSDPModule), motivated by FSDP1's flat-param machinery being incompatible with the refactored ShardTensor/DTensor in #1556.

parallel.py: replaces the FSDP(NO_SHARD) wrapper with fully_shard, adds a pre-shard loop to force standard contiguity on channels-last parameters (FSDP2 rejects non-contiguous parameters in PyTorch ≤ 2.10), and updates the docstring/return type to FSDPModule.
checkpoint.py: adds _unwrapped_class_name to recover the original class name from FSDP2's dynamically-generated subclass, extends _has_non_fsdp_dtensors with a degenerate-mesh guard for FSDP2 (size-1 mesh axes break DCP's broadcast), and threads both helpers through _unique_model_names.

Important Files Changed

Filename	Overview
physicsnemo/utils/checkpoint.py	Adds FSDPModule (FSDP2) support: new _unwrapped_class_name helper, extended _has_non_fsdp_dtensors logic for degenerate meshes, and updated _unique_model_names to use the new helper; _is_distributed_model does not add an explicit FSDPModule check.
examples/weather/stormcast/utils/parallel.py	Replaces FSDP1 (NO_SHARD) with FSDP2 fully_shard; adds a pre-shard contiguity normalization loop that also iterates over DTensor parameters when use_shard_tensor=True.
examples/weather/stormcast/utils/trainer.py	Minor comment and log-message updates to reflect FSDP2 terminology; no logic changes.
examples/weather/stormcast/test_training.py	Removes now-unused FSDP1 imports (StateDictType, ShardedStateDictConfig, ShardedOptimStateDictConfig); no test logic changes.

Comments Outside Diff (1)

physicsnemo/utils/checkpoint.py, lines 70-74

_is_distributed_model relies solely on the DTensor parameter check to detect FSDP2 models. In practice fully_shard always converts parameters to DTensors, so this works, but a model with no learnable parameters would return False even after FSDP2 wrapping, silently routing it through the non-distributed checkpoint path. Adding an explicit FSDPModule branch makes the intent clear and is defensive against edge cases.

Reviews (1): Last reviewed commit: "improving checkpoint"

greptile-apps · 2026-05-26T17:17:45Z

+    if isinstance(inner, FSDPModule):
+        bases = type(inner).__bases__
+        if len(bases) >= 2 and bases[0] is FSDPModule:
+            return bases[1].__name__
+    return type(inner).__name__


Fragile MRO assumption for FSDP2 class name

_unwrapped_class_name returns bases[1].__name__ only when bases[0] is FSDPModule. If a future PyTorch version changes the order of bases or introduces an intermediate mixin in the dynamically-generated class (e.g. (FSDPModule, SomeMixin, OriginalCls)), the condition bases[0] is FSDPModule still holds but bases[1].__name__ would return SomeMixin instead of the real user class, silently generating the wrong checkpoint filename. Using type(inner).__mro__ to find the first non-FSDPModule/torch.nn.Module base would be more resilient.
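
An MRO-based lookup along these lines could be sketched as follows. The example uses stand-in classes so that it runs without PyTorch: `FSDPModuleStandIn`, `BaseModule`, `SomeMixin`, and `StormCastNet` are all hypothetical, and the wrapped classes are built by hand. It is an illustration of the idea, not the PR's `_unwrapped_class_name`.

```python
class FSDPModuleStandIn:
    """Hypothetical stand-in for torch.distributed.fsdp.FSDPModule."""


class BaseModule:
    """Hypothetical stand-in for torch.nn.Module."""


class SomeMixin:
    """Hypothetical mixin that a future release might insert."""


class StormCastNet(BaseModule):
    """Hypothetical user model."""


def unwrapped_class_name(obj: object) -> str:
    """Return the user class name, ignoring any FSDP2-style wrapper class."""
    cls = type(obj)
    if not issubclass(cls, FSDPModuleStandIn):
        return cls.__name__
    # Walk the MRO and take the first real module class that is not
    # itself part of the wrapper hierarchy.
    for klass in cls.__mro__:
        if issubclass(klass, BaseModule) and not issubclass(klass, FSDPModuleStandIn):
            return klass.__name__
    return cls.__name__


# Wrapper layout assumed by the current code: (FSDPModule, OriginalCls).
two_bases = type("FSDPStormCastNet", (FSDPModuleStandIn, StormCastNet), {})
# Layout from the review comment: an extra mixin between the two.
with_mixin = type("FSDPStormCastNet", (FSDPModuleStandIn, SomeMixin, StormCastNet), {})
```

Filtering on "is a module subclass and not a wrapper subclass" skips both the dynamically generated class and any intermediate mixin, so both layouts resolve to `StormCastNet`. A check that only excludes the wrapper base would still pick up `SomeMixin`.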

greptile-apps · 2026-05-26T17:17:47Z

+        with torch.no_grad():
+            for p in model.parameters():
+                if p.is_contiguous():
+                    continue
+                p.data = p.data.contiguous()


When use_shard_tensor=True, distribute_module has already been called and model.parameters() yields DTensor-backed nn.Parameters. Assigning p.data = p.data.contiguous() on a DTensor parameter is not documented PyTorch API; Tensor.set_() (which backs the .data setter) with a DTensor argument may silently strip the DTensor's mesh/placements metadata, breaking the subsequent fully_shard call. In practice distribute_tensor normalises contiguity internally so the guard p.is_contiguous() is usually True for DTensor params and the assignment is skipped — but making the skip explicit prevents a silent breakage if that behaviour changes.

Suggested change

-        with torch.no_grad():
-            for p in model.parameters():
-                if p.is_contiguous():
-                    continue
-                p.data = p.data.contiguous()
+        with torch.no_grad():
+            for p in model.parameters():
+                if isinstance(p.data, DTensor):
+                    continue  # distribute_module already normalises DTensor local shards
+                if p.is_contiguous():
+                    continue
+                p.data = p.data.contiguous()
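
To see why the pre-shard loop is needed at all, the plain-tensor case can be reproduced in a single process. The sketch below is an illustration rather than the PR's code: `normalize_contiguity` is a hypothetical helper name, the `Conv2d` model is a toy, and the DTensor branch is left out because no device mesh is set up here.

```python
import torch
import torch.nn as nn


def normalize_contiguity(model: nn.Module) -> int:
    """Force standard contiguity on every non-contiguous parameter.

    Returns the number of parameters that were rewritten.
    """
    changed = 0
    with torch.no_grad():
        for p in model.parameters():
            if p.is_contiguous():
                continue
            # Same values, standard (row-major) strides.
            p.data = p.data.contiguous()
            changed += 1
    return changed


# Channels-last conversion leaves the 4D conv weight non-contiguous
# in the default memory format; the 1D bias is unaffected.
model = nn.Conv2d(4, 8, kernel_size=3).to(memory_format=torch.channels_last)
weight_before = model.weight.detach().clone()
was_contiguous = model.weight.is_contiguous()

changed = normalize_contiguity(model)
```

Only the strides change; the parameter values are preserved, so the loop should not affect numerics beyond whatever kernel selection depends on memory format.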

negin513 added 3 commits May 20, 2026 10:33

1950c41  minimal implementation for stormcast
5fd5fa4  test update
9abe917  improving checkpoint

negin513 requested a review from CharlelieLrt as a code owner May 26, 2026 17:09

negin513 changed the title from "Fsdp2 stormscope" to "Fsdp2 stormscope [WIP]" on May 26, 2026

eb894cf  Merge branch 'main' into fsdp2-stormcast

negin513 wants to merge 4 commits into NVIDIA:main from negin513:fsdp2-stormcast
