🐛[BUG]: Incompatibility between current PNM and PyTorch 2.11.0 #1570

@jialusui1102

Version

container 26.03

On which installation method(s) does this occur?

No response

Describe the issue

With PyTorch 2.11.0+cu130 and the PNM 26.03 container, I keep running into this import error when importing ShardTensor:

Traceback (most recent call last):
  File "/code/ensemble_parallelism_trainer_one_logger_fsdp2.py", line 32, in <module>
    from losses.weighted_metrics import per_variable_metrics
  File "/code/losses/weighted_metrics.py", line 10, in <module>
    from physicsnemo.domain_parallel.shard_tensor import ShardTensor
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 66, in <module>
    register_custom_ops()
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 55, in register_custom_ops
    from .custom_ops import (
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/__init__.py", line 24, in <module>
    from ._tensor_ops import unbind_rules
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/_tensor_ops.py", line 43, in <module>
    from torch.distributed.tensor._ops.registration import (
ModuleNotFoundError: No module named 'torch.distributed.tensor._ops.registration'

Minimum reproducible example

from physicsnemo.domain_parallel.shard_tensor import ShardTensor
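
To confirm that the failure comes from the installed PyTorch module layout rather than from anything in the training script, it may help to probe for the module named in the traceback directly. The following is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical and not part of PhysicsNeMo or PyTorch.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without importing the leaf module itself."""
    try:
        # For dotted names, find_spec imports the parent packages first.
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return False


if __name__ == "__main__":
    # Module path taken verbatim from the ModuleNotFoundError in this report.
    target = "torch.distributed.tensor._ops.registration"
    status = "found" if module_available(target) else "missing"
    print(f"{target}: {status}")

    try:
        import torch

        print("torch version:", torch.__version__)
    except ImportError:
        print("torch is not installed in this environment")
```

If the target is reported as missing while `torch` itself imports, the private module that `_tensor_ops.py` depends on is not present in this PyTorch build, which would match the error above.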

Relevant log output

Traceback (most recent call last):
  File "/code/ensemble_parallelism_trainer_one_logger_fsdp2.py", line 32, in <module>
    from losses.weighted_metrics import per_variable_metrics
  File "/code/losses/weighted_metrics.py", line 10, in <module>
    from physicsnemo.domain_parallel.shard_tensor import ShardTensor
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 66, in <module>
    register_custom_ops()
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 55, in register_custom_ops
    from .custom_ops import (
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/__init__.py", line 24, in <module>
    from ._tensor_ops import unbind_rules
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/_tensor_ops.py", line 43, in <module>
    from torch.distributed.tensor._ops.registration import (
ModuleNotFoundError: No module named 'torch.distributed.tensor._ops.registration'

Environment details

Container: nvcr.io/nvidia/physicsnemo/physicsnemo:26.03
PyTorch version: 2.11.0+cu130
CUDA Version: 13.1
Driver Version: 535.129.03

Labels

? - Needs Triage (Need team to review and classify)
bug (Something isn't working)
external (Issues/PR filed by people outside the team)
