How can I resolve this error?
$ python run_plm.py --test --plm-type llama --plm-size base --state-feature-dim 256 --device cuda:0
2025-03-06 12:27:33.292295: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-03-06 12:27:33.296379: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-06 12:27:33.305468: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1741235253.320360 1321670 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741235253.324758 1321670 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-06 12:27:33.340503: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Arguments:
Namespace(exp_pool_path='artifacts/exp_pool/exp_pool.pkl', sample_step=None, num_executors=50, job_arrival_cap=200, job_arrival_rate=4e-05, moving_delay=2000.0, warmup_delay=1000.0, render_mode=None, dataset='tpch', plm_type='llama', plm_size='base', rank=128, pt_encoder_config=None, state_feature_dim=256, K=20, gamma=1.0, max_exec_num=50, max_stage_num=100, lr=0.0001, weight_decay=0.0001, warmup_steps=2000, num_iters=10, num_steps_per_iter=10000, eval_max_ep_len=6000, eval_per_iter=1, save_checkpoint_per_iter=1, target_return_scale=1.0, which_layer=-1, train=False, test=True, grad_accum_steps=32, seed=1, env_seed=1, scale=1000, resume_dir=None, resume_iter=None, model_dir=None, use_head=3, device='cuda:0', device_out='cuda:0', device_mid=None)
Experience dataset info:
Munch({'max_reward': 0, 'min_reward': -504129.1369907103, 'max_return': 5.793701710089156, 'min_return': 0.0009962767476381065, 'min_timestep': 0, 'max_timestep': 5813, 'min_num_nodes': 1, 'max_num_nodes': 134, 'min_num_dags': 1, 'max_num_dags': 16, 'min_stage_idx': 0, 'max_stage_idx': 47, 'min_job_idx': 0, 'max_job_idx': 15, 'min_num_exec': 0, 'max_num_exec': 49})
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.09it/s]
Traceback (most recent call last):
File "/home/weixingzu/project/NetLLM/cluster_job_scheduling/run_plm.py", line 366, in <module>
run(args)
File "/home/weixingzu/project/NetLLM/cluster_job_scheduling/run_plm.py", line 222, in run
plm, *_ = load_plm(args.plm_type, os.path.join(PLM_DIR, args.plm_type, args.plm_size),
File "/home/weixingzu/project/NetLLM/cluster_job_scheduling/plm_special/utils/plm_utils.py", line 149, in load_plm
model = model_class.model.from_pretrained(model_path, config=model_config, device_map=device_map)
File "/home/weixingzu/project/huggingface/transformers/src/transformers/modeling_utils.py", line 271, in _wrapper
return func(*args, **kwargs)
File "/home/weixingzu/project/huggingface/transformers/src/transformers/modeling_utils.py", line 4534, in from_pretrained
dispatch_model(model, **device_map_kwargs)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/accelerate/big_modeling.py", line 496, in dispatch_model
model.to(device)
File "/home/weixingzu/project/huggingface/transformers/src/transformers/modeling_utils.py", line 3262, in to
return super().to(*args, **kwargs)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
return self._apply(convert)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
param_applied = fn(param)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1167, in convert
raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
(cjs_netllm)
weixingzu @ ubuntu-4090 in ~/project/NetLLM/cluster_job_scheduling on git:master x [12:28:15] C:1
$ python run_plm.py --test --plm-type llama --plm-size base --state-feature-dim 256 --device cpu
2025-03-06 12:29:38.344107: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-03-06 12:29:38.348603: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-06 12:29:38.359148: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1741235378.375602 1321939 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741235378.380137 1321939 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-06 12:29:38.396860: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Arguments:
Namespace(exp_pool_path='artifacts/exp_pool/exp_pool.pkl', sample_step=None, num_executors=50, job_arrival_cap=200, job_arrival_rate=4e-05, moving_delay=2000.0, warmup_delay=1000.0, render_mode=None, dataset='tpch', plm_type='llama', plm_size='base', rank=128, pt_encoder_config=None, state_feature_dim=256, K=20, gamma=1.0, max_exec_num=50, max_stage_num=100, lr=0.0001, weight_decay=0.0001, warmup_steps=2000, num_iters=10, num_steps_per_iter=10000, eval_max_ep_len=6000, eval_per_iter=1, save_checkpoint_per_iter=1, target_return_scale=1.0, which_layer=-1, train=False, test=True, grad_accum_steps=32, seed=1, env_seed=1, scale=1000, resume_dir=None, resume_iter=None, model_dir=None, use_head=3, device='cpu', device_out='cpu', device_mid=None)
Experience dataset info:
Munch({'max_reward': 0, 'min_reward': -504129.1369907103, 'max_return': 5.793701710089156, 'min_return': 0.0009962767476381065, 'min_timestep': 0, 'max_timestep': 5813, 'min_num_nodes': 1, 'max_num_nodes': 134, 'min_num_dags': 1, 'max_num_dags': 16, 'min_stage_idx': 0, 'max_stage_idx': 47, 'min_job_idx': 0, 'max_job_idx': 15, 'min_num_exec': 0, 'max_num_exec': 49})
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.07it/s]
Traceback (most recent call last):
File "/home/weixingzu/project/NetLLM/cluster_job_scheduling/run_plm.py", line 366, in <module>
run(args)
File "/home/weixingzu/project/NetLLM/cluster_job_scheduling/run_plm.py", line 222, in run
plm, *_ = load_plm(args.plm_type, os.path.join(PLM_DIR, args.plm_type, args.plm_size),
File "/home/weixingzu/project/NetLLM/cluster_job_scheduling/plm_special/utils/plm_utils.py", line 149, in load_plm
model = model_class.model.from_pretrained(model_path, config=model_config, device_map=device_map)
File "/home/weixingzu/project/huggingface/transformers/src/transformers/modeling_utils.py", line 271, in _wrapper
return func(*args, **kwargs)
File "/home/weixingzu/project/huggingface/transformers/src/transformers/modeling_utils.py", line 4534, in from_pretrained
dispatch_model(model, **device_map_kwargs)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/accelerate/big_modeling.py", line 496, in dispatch_model
model.to(device)
File "/home/weixingzu/project/huggingface/transformers/src/transformers/modeling_utils.py", line 3262, in to
return super().to(*args, **kwargs)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
return self._apply(convert)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
param_applied = fn(param)
File "/home/weixingzu/miniconda3/envs/cjs_netllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1167, in convert
raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
(cjs_netllm)