Hi, I am running into the following problem:
python -m torch.distributed.launch --nproc_per_node=2 run.py --data_root data --batch_size 12 --dataset ade --name LWF --task 100-50 --step 0 --lr 0.01 --epochs 60 --method LWF
/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
INFO:rank1: Device: cuda:1
Traceback (most recent call last):
File "run.py", line 390, in <module>
main(opts)
File "run.py", line 116, in main
logger = Logger(logdir_full, rank=rank, debug=opts.debug, summary=opts.visualize, step=opts.step)
File "/home/cuong69/Desktop/MiB-master/utils/logger.py", line 15, in __init__
import tensorboardX
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
from .torchvis import TorchVis
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
from .writer import SummaryWriter
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/writer.py", line 15, in <module>
from .event_file_writer import EventFileWriter
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
from .proto import event_pb2
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
Filtering images...
0/2000 ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1651457 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1651456) of binary: /home/cuong69/anaconda3/envs/plop/bin/python
Traceback (most recent call last):
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-06-15_09:57:39
host : aaa-Z490-AORUS-MASTER
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1651456)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I think it is related to a version conflict. My GPU is an RTX 3090, so I must use CUDA 11.3.
Please help me solve the problem. Thank you!
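In case it helps to narrow down the suspected version conflict, here is a minimal sketch for reporting the installed versions from inside the same environment. The traceback fails while importing tensorboardX's generated `*_pb2.py` files, so I am assuming the relevant packages are `torch`, `tensorboardX`, and `protobuf`; that list and the helper name `installed_versions` are my own guesses, not something taken from the repository.

```python
def installed_versions(packages=("torch", "tensorboardX", "protobuf")):
    """Return a {package name: version string} dict for the given packages.

    The default package list is an assumption based on the traceback above.
    """
    try:
        # Available in the standard library from Python 3.8 onwards.
        from importlib import metadata
    except ImportError:
        # Older interpreters (such as the Python 3.6 env in the log) lack it.
        metadata = None

    versions = {}
    for name in packages:
        try:
            if metadata is not None:
                versions[name] = metadata.version(name)
            else:
                # Fallback shipped with setuptools.
                import pkg_resources
                versions[name] = pkg_resources.get_distribution(name).version
        except Exception:
            # The package could not be found in this environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Running this with the same interpreter that launches `run.py` (here `/home/cuong69/anaconda3/envs/plop/bin/python`) should show which versions are actually being imported.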