Version
container 26.03
On which installation method(s) does this occur?
No response
Describe the issue
With PyTorch version 2.11.0+cu130 and the PhysicsNeMo (PNM) container 26.03, importing ShardTensor consistently fails with this import error:
Traceback (most recent call last):
File "/code/ensemble_parallelism_trainer_one_logger_fsdp2.py", line 32, in <module>
from losses.weighted_metrics import per_variable_metrics
File "/code/losses/weighted_metrics.py", line 10, in <module>
from physicsnemo.domain_parallel.shard_tensor import ShardTensor
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 66, in <module>
register_custom_ops()
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 55, in register_custom_ops
from .custom_ops import (
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/__init__.py", line 24, in <module>
from ._tensor_ops import unbind_rules
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/_tensor_ops.py", line 43, in <module>
from torch.distributed.tensor._ops.registration import (
ModuleNotFoundError: No module named 'torch.distributed.tensor._ops.registration'
Minimum reproducible example
from physicsnemo.domain_parallel.shard_tensor import ShardTensor
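To narrow this down without importing physicsnemo at all, a small check like the following can confirm whether the installed PyTorch build ships the private module named in the traceback. This is only a diagnostic sketch: the helper name `module_available` is hypothetical, and the only assumption is that the module path is exactly the one shown in the error message.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without importing the target module itself.

    Note: for dotted names, find_spec imports the parent packages, so a missing
    or broken parent (e.g. torch not installed) is reported as False here.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or fails to import.
        return False


if __name__ == "__main__":
    # Module path copied from the ModuleNotFoundError in the traceback.
    target = "torch.distributed.tensor._ops.registration"
    print(f"{target}: {'found' if module_available(target) else 'missing'}")
```

If this prints `missing` inside the container, the failure comes from the PyTorch build itself rather than from the training script.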
Relevant log output
Traceback (most recent call last):
File "/code/ensemble_parallelism_trainer_one_logger_fsdp2.py", line 32, in <module>
from losses.weighted_metrics import per_variable_metrics
File "/code/losses/weighted_metrics.py", line 10, in <module>
from physicsnemo.domain_parallel.shard_tensor import ShardTensor
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 66, in <module>
register_custom_ops()
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/__init__.py", line 55, in register_custom_ops
from .custom_ops import (
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/__init__.py", line 24, in <module>
from ._tensor_ops import unbind_rules
File "/usr/local/lib/python3.12/dist-packages/physicsnemo/domain_parallel/custom_ops/_tensor_ops.py", line 43, in <module>
from torch.distributed.tensor._ops.registration import (
ModuleNotFoundError: No module named 'torch.distributed.tensor._ops.registration'
Environment details
Container: nvcr.io/nvidia/physicsnemo/physicsnemo:26.03
PyTorch version: 2.11.0+cu130
CUDA Version: 13.1
Driver Version: 535.129.03