I trained CryoDRGN-AI on a public dataset, but the estimated standard deviation of the images is inf, which causes a segmentation fault. How can I fix this problem? Below is the log file:
2025-07-29 01:39:14.015955: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(WARNING) (reconstruct.py) (29-Jul-25 01:40:51) Output directory out/ already exists here! Renaming the old one to old-out_000_abinit-het8.
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Using 4 GPUs!
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for known poses to 64
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for HPS to 32
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for SGD to 128
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Use cuda True
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Will write tensorboard summaries in /lustre/grp/gyqlab/zhangcw/drgnai/10177/out/summaries
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Creating dataset
(INFO) (dataset.py) (29-Jul-25 01:41:21) Loaded 61929 384x384 images
(INFO) (dataset.py) (29-Jul-25 01:41:21) Windowing images with radius 0.85
(INFO) (dataset.py) (29-Jul-25 01:41:23) Computing FFT
(INFO) (dataset.py) (29-Jul-25 01:41:23) Spawning 16 processes
(INFO) (dataset.py) (29-Jul-25 01:48:52) Symmetrizing image data
/lustre/grp/gyqlab/zhangcw/miniconda3/envs/drgnai/lib/python3.9/site-packages/numpy/core/_methods.py:176: RuntimeWarning: overflow encountered in multiply
x = um.multiply(x, x, out=x)
(INFO) (dataset.py) (29-Jul-25 01:49:33) Normalized HT by 0 +/- inf
(INFO) (dataset.py) (29-Jul-25 01:50:15) Normalized real space images by 1.0225237765928034e+25 +/- inf
(INFO) (reconstruct.py) (29-Jul-25 01:50:24) Loading ctf params from /lustre/grp/gyqlab/share/cryoem_particles/10177/data/dataset_trimer_61929_ptcls/ctf.pkl
(INFO) (ctf.py) (29-Jul-25 01:50:24) Image size (pix) : 384
(INFO) (ctf.py) (29-Jul-25 01:50:24) A/pix : 1.0670000314712524
(INFO) (ctf.py) (29-Jul-25 01:50:24) DefocusU (A) : 21713.8671875
(INFO) (ctf.py) (29-Jul-25 01:50:24) DefocusV (A) : 21426.921875
(INFO) (ctf.py) (29-Jul-25 01:50:24) Dfang (deg) : 34.44029998779297
(INFO) (ctf.py) (29-Jul-25 01:50:24) voltage (kV) : 300.0
(INFO) (ctf.py) (29-Jul-25 01:50:24) cs (mm) : 2.5999999046325684
(INFO) (ctf.py) (29-Jul-25 01:50:24) w : 0.10000000149011612
(INFO) (ctf.py) (29-Jul-25 01:50:24) Phase shift (deg) : 0.0
(INFO) (reconstruct.py) (29-Jul-25 01:50:26) Building lattice
(INFO) (reconstruct.py) (29-Jul-25 01:50:28) Heterogeneous reconstruction with z_dim = 8
(INFO) (reconstruct.py) (29-Jul-25 01:50:28) Initializing model...
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) DrgnAI(
(pose_table): PoseTable()
(conf_table): ConfTable()
(hypervolume): HyperVolume(
(mlp): ResidualLinearMLP(
(main): Sequential(
(0): Linear(in_features=392, out_features=1024, bias=True)
(1): ReLU()
(2): ResidualLinear(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(3): ReLU()
(4): ResidualLinear(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(5): ReLU()
(6): ResidualLinear(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(7): ReLU()
(8): MyLinear(in_features=1024, out_features=1, bias=True)
)
)
)
)
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) 4543121 parameters in model
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) Model initialized. Moving to GPU...
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) --- Training Starts Now ---
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) Will pretrain on 10000 particles
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) Will make a full summary at the end of this epoch
(INFO) (reconstruct.py) (29-Jul-25 01:57:41) # [Train Epoch: -1/108] [10048/61929 particles]
(INFO) (reconstruct.py) (29-Jul-25 01:57:41) # =====> SGD Epoch: -1 finished in 0:06:44.266807; total loss = 0.007430
(INFO) (analysis.py) (29-Jul-25 01:57:44) Explained variance ratio:
(INFO) (analysis.py) (29-Jul-25 01:57:44) [0.13470442 0.13069272 0.12847659 0.12622799 0.12421749 0.12214833
0.11920623 0.11432623]
/lustre/grp/gyqlab/zhangcw/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/models.py:458: RuntimeWarning: invalid value encountered in multiply
volume = volume * norm[1] + norm[0]
(INFO) (reconstruct.py) (29-Jul-25 01:58:14) Will use pose search on 61929 particles
(INFO) (reconstruct.py) (29-Jul-25 01:58:14) Will make a full summary at the end of this epoch
/var/spool/slurmd/job23553767/slurm_script: line 10: 2340797 Segmentation fault drgnai train /lustre/grp/gyqlab/zhangcw/drgnai/10177 --multigpu
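
For reference, the `overflow encountered in multiply` warning at `x = um.multiply(x, x, out=x)` is what NumPy emits when `np.std` squares float32 values whose magnitude exceeds roughly 1.8e19 (the square root of the float32 maximum). A minimal diagnostic sketch, assuming the particle stack can be loaded into a NumPy array of shape `(n_images, D, D)`; the helper names `summarize_stack` and `safe_std` and the `abs_limit` threshold are hypothetical and not part of CryoDRGN-AI:

```python
import numpy as np


def summarize_stack(images, abs_limit=1e18):
    """Return indices of suspicious images and the largest finite |pixel| per image.

    abs_limit is an assumed threshold, chosen just below sqrt(float32 max),
    above which squaring a float32 value overflows to inf.
    """
    images = np.asarray(images)
    flat = images.reshape(images.shape[0], -1)
    finite = np.isfinite(flat)
    # Number of NaN/inf pixels in each image.
    nonfinite_per_image = (~finite).sum(axis=1)
    # Largest finite magnitude in each image (non-finite pixels are masked to 0).
    max_abs = np.where(finite, np.abs(flat), 0).max(axis=1)
    bad = np.flatnonzero((nonfinite_per_image > 0) | (max_abs > abs_limit))
    return bad, max_abs


def safe_std(images):
    """Standard deviation over finite pixels only, accumulated in float64."""
    images = np.asarray(images)
    return float(np.std(images[np.isfinite(images)], dtype=np.float64))


if __name__ == "__main__":
    # Synthetic stand-in for a particle stack: one image carries a corrupted pixel.
    rng = np.random.default_rng(0)
    stack = rng.normal(size=(4, 8, 8)).astype(np.float32)
    stack[2, 0, 0] = 1e30

    with np.errstate(over="ignore"):
        print("float32 std:", np.std(stack))
    print("float64 std over finite pixels:", safe_std(stack))
    bad, max_abs = summarize_stack(stack)
    print("suspicious image indices:", bad.tolist())
```

If a check like this flags only a handful of particles in the real stack, that would point to corrupted pixels in the input images rather than to the training code, but that is a guess from the log and not something the log confirms.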