Estimated image std is inf, causing training to crash with a segmentation fault #20

@Qmi3

I trained CryoDRGN-AI on a public dataset, but the estimated standard deviation of the images is inf, which leads to a segmentation fault. How can I fix this problem? The log file is below:
2025-07-29 01:39:14.015955: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(WARNING) (reconstruct.py) (29-Jul-25 01:40:51) Output directory out/ already exists here! Renaming the old one to old-out_000_abinit-het8.
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Using 4 GPUs!
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for known poses to 64
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for HPS to 32
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for SGD to 128
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Use cuda True
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Will write tensorboard summaries in /lustre/grp/gyqlab/zhangcw/drgnai/10177/out/summaries
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Creating dataset
(INFO) (dataset.py) (29-Jul-25 01:41:21) Loaded 61929 384x384 images
(INFO) (dataset.py) (29-Jul-25 01:41:21) Windowing images with radius 0.85
(INFO) (dataset.py) (29-Jul-25 01:41:23) Computing FFT
(INFO) (dataset.py) (29-Jul-25 01:41:23) Spawning 16 processes
(INFO) (dataset.py) (29-Jul-25 01:48:52) Symmetrizing image data
/lustre/grp/gyqlab/zhangcw/miniconda3/envs/drgnai/lib/python3.9/site-packages/numpy/core/_methods.py:176: RuntimeWarning: overflow encountered in multiply
x = um.multiply(x, x, out=x)
(INFO) (dataset.py) (29-Jul-25 01:49:33) Normalized HT by 0 +/- inf
(INFO) (dataset.py) (29-Jul-25 01:50:15) Normalized real space images by 1.0225237765928034e+25 +/- inf
(INFO) (reconstruct.py) (29-Jul-25 01:50:24) Loading ctf params from /lustre/grp/gyqlab/share/cryoem_particles/10177/data/dataset_trimer_61929_ptcls/ctf.pkl
(INFO) (ctf.py) (29-Jul-25 01:50:24) Image size (pix) : 384
(INFO) (ctf.py) (29-Jul-25 01:50:24) A/pix : 1.0670000314712524
(INFO) (ctf.py) (29-Jul-25 01:50:24) DefocusU (A) : 21713.8671875
(INFO) (ctf.py) (29-Jul-25 01:50:24) DefocusV (A) : 21426.921875
(INFO) (ctf.py) (29-Jul-25 01:50:24) Dfang (deg) : 34.44029998779297
(INFO) (ctf.py) (29-Jul-25 01:50:24) voltage (kV) : 300.0
(INFO) (ctf.py) (29-Jul-25 01:50:24) cs (mm) : 2.5999999046325684
(INFO) (ctf.py) (29-Jul-25 01:50:24) w : 0.10000000149011612
(INFO) (ctf.py) (29-Jul-25 01:50:24) Phase shift (deg) : 0.0
(INFO) (reconstruct.py) (29-Jul-25 01:50:26) Building lattice
(INFO) (reconstruct.py) (29-Jul-25 01:50:28) Heterogeneous reconstruction with z_dim = 8
(INFO) (reconstruct.py) (29-Jul-25 01:50:28) Initializing model...
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) DrgnAI(
  (pose_table): PoseTable()
  (conf_table): ConfTable()
  (hypervolume): HyperVolume(
    (mlp): ResidualLinearMLP(
      (main): Sequential(
        (0): Linear(in_features=392, out_features=1024, bias=True)
        (1): ReLU()
        (2): ResidualLinear(
          (linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (3): ReLU()
        (4): ResidualLinear(
          (linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (5): ReLU()
        (6): ResidualLinear(
          (linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (7): ReLU()
        (8): MyLinear(in_features=1024, out_features=1, bias=True)
      )
    )
  )
)
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) 4543121 parameters in model
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) Model initialized. Moving to GPU...
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) --- Training Starts Now ---
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) Will pretrain on 10000 particles
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) Will make a full summary at the end of this epoch
(INFO) (reconstruct.py) (29-Jul-25 01:57:41) # [Train Epoch: -1/108] [10048/61929 particles]
(INFO) (reconstruct.py) (29-Jul-25 01:57:41) # =====> SGD Epoch: -1 finished in 0:06:44.266807; total loss = 0.007430
(INFO) (analysis.py) (29-Jul-25 01:57:44) Explained variance ratio:
(INFO) (analysis.py) (29-Jul-25 01:57:44) [0.13470442 0.13069272 0.12847659 0.12622799 0.12421749 0.12214833
0.11920623 0.11432623]
/lustre/grp/gyqlab/zhangcw/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/models.py:458: RuntimeWarning: invalid value encountered in multiply
volume = volume * norm[1] + norm[0]
(INFO) (reconstruct.py) (29-Jul-25 01:58:14) Will use pose search on 61929 particles
(INFO) (reconstruct.py) (29-Jul-25 01:58:14) Will make a full summary at the end of this epoch
/var/spool/slurmd/job23553767/slurm_script: line 10: 2340797 Segmentation fault drgnai train /lustre/grp/gyqlab/zhangcw/drgnai/10177 --multigpu
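
For anyone trying to reproduce this: the log reports an overflow on an elementwise square (`um.multiply(x, x, out=x)`) and a real-space mean of about 1e25. One possible explanation, which is an assumption and not something the log confirms, is that a few particle images contain extreme or non-finite pixel values, so the float32 variance overflows to inf. Below is a minimal sketch for checking that. It is not part of CryoDRGN-AI; the helper name `find_suspect_images` and the toy stack are hypothetical, and the real particle stack would need to be loaded into a NumPy array by whatever means you normally use.

```python
import numpy as np


def find_suspect_images(images, limit=None):
    """Return indices of images with non-finite or overflow-prone pixels.

    Hypothetical helper for illustration; `images` has shape (N, H, W).
    """
    images = np.asarray(images)
    if limit is None:
        # Squaring a value above sqrt(float32 max) overflows to inf in float32.
        limit = np.sqrt(np.finfo(np.float32).max)
    flat = images.reshape(len(images), -1)
    nonfinite = ~np.isfinite(flat).all(axis=1)
    with np.errstate(invalid="ignore"):
        too_large = np.abs(flat).max(axis=1) > limit
    return np.flatnonzero(nonfinite | too_large)


# Toy stack standing in for the real particles (assumption: float32 data).
rng = np.random.default_rng(0)
stack = rng.normal(size=(4, 8, 8)).astype(np.float32)
stack[2, 0, 0] = 1e25  # one corrupt pixel, similar in scale to the logged mean

bad = find_suspect_images(stack)

# The float32 std overflows because the squared deviations exceed float32 range.
with np.errstate(over="ignore"):
    std32 = stack.std()

# Accumulating in float64 stays finite, which helps separate "overflow"
# from "data actually contains inf/NaN".
std64 = stack.std(dtype=np.float64)

print("suspect image indices:", bad)
print("float32 std:", std32, "| float64 std:", std64)
```

If the check flags a handful of images in the real stack, removing or re-extracting those particles before training would be the first thing to try. If nothing is flagged, the cause is likely elsewhere and this sketch does not apply.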
