Estimated std of images is inf, causing training to crash with a segmentation fault (#20)

I trained CryoDRGN-AI on a public dataset, but the estimated standard deviation of the images is inf, and training then crashes with a segmentation fault. How can I fix this problem? Below is the log file:
2025-07-29 01:39:14.015955: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(WARNING) (reconstruct.py) (29-Jul-25 01:40:51) Output directory `out/` already exists here! Renaming the old one to `old-out_000_abinit-het8`.
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Using 4 GPUs!
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for known poses to 64
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for HPS to 32
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Increasing batch size for SGD to 128
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Use cuda True
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Will write tensorboard summaries in /lustre/grp/gyqlab/zhangcw/drgnai/10177/out/summaries
(INFO) (reconstruct.py) (29-Jul-25 01:40:51) Creating dataset
(INFO) (dataset.py) (29-Jul-25 01:41:21) Loaded 61929 384x384 images
(INFO) (dataset.py) (29-Jul-25 01:41:21) Windowing images with radius 0.85
(INFO) (dataset.py) (29-Jul-25 01:41:23) Computing FFT
(INFO) (dataset.py) (29-Jul-25 01:41:23) Spawning 16 processes
(INFO) (dataset.py) (29-Jul-25 01:48:52) Symmetrizing image data
/lustre/grp/gyqlab/zhangcw/miniconda3/envs/drgnai/lib/python3.9/site-packages/numpy/core/_methods.py:176: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)
(INFO) (dataset.py) (29-Jul-25 01:49:33) Normalized HT by 0 +/- inf
(INFO) (dataset.py) (29-Jul-25 01:50:15) Normalized real space images by 1.0225237765928034e+25 +/- inf
(INFO) (reconstruct.py) (29-Jul-25 01:50:24) Loading ctf params from /lustre/grp/gyqlab/share/cryoem_particles/10177/data/dataset_trimer_61929_ptcls/ctf.pkl
(INFO) (ctf.py) (29-Jul-25 01:50:24) Image size (pix)  : 384
(INFO) (ctf.py) (29-Jul-25 01:50:24) A/pix             : 1.0670000314712524
(INFO) (ctf.py) (29-Jul-25 01:50:24) DefocusU (A)      : 21713.8671875
(INFO) (ctf.py) (29-Jul-25 01:50:24) DefocusV (A)      : 21426.921875
(INFO) (ctf.py) (29-Jul-25 01:50:24) Dfang (deg)       : 34.44029998779297
(INFO) (ctf.py) (29-Jul-25 01:50:24) voltage (kV)      : 300.0
(INFO) (ctf.py) (29-Jul-25 01:50:24) cs (mm)           : 2.5999999046325684
(INFO) (ctf.py) (29-Jul-25 01:50:24) w                 : 0.10000000149011612
(INFO) (ctf.py) (29-Jul-25 01:50:24) Phase shift (deg) : 0.0
(INFO) (reconstruct.py) (29-Jul-25 01:50:26) Building lattice
(INFO) (reconstruct.py) (29-Jul-25 01:50:28) Heterogeneous reconstruction with z_dim = 8
(INFO) (reconstruct.py) (29-Jul-25 01:50:28) Initializing model...
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) DrgnAI(
  (pose_table): PoseTable()
  (conf_table): ConfTable()
  (hypervolume): HyperVolume(
    (mlp): ResidualLinearMLP(
      (main): Sequential(
        (0): Linear(in_features=392, out_features=1024, bias=True)
        (1): ReLU()
        (2): ResidualLinear(
          (linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (3): ReLU()
        (4): ResidualLinear(
          (linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (5): ReLU()
        (6): ResidualLinear(
          (linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (7): ReLU()
        (8): MyLinear(in_features=1024, out_features=1, bias=True)
      )
    )
  )
)
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) 4543121 parameters in model
(INFO) (reconstruct.py) (29-Jul-25 01:50:29) Model initialized. Moving to GPU...
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) --- Training Starts Now ---
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) Will pretrain on 10000 particles
(INFO) (reconstruct.py) (29-Jul-25 01:50:57) Will make a full summary at the end of this epoch
(INFO) (reconstruct.py) (29-Jul-25 01:57:41) # [Train Epoch: -1/108] [10048/61929 particles]
(INFO) (reconstruct.py) (29-Jul-25 01:57:41) # =====> SGD Epoch: -1 finished in 0:06:44.266807; total loss = 0.007430
(INFO) (analysis.py) (29-Jul-25 01:57:44) Explained variance ratio:
(INFO) (analysis.py) (29-Jul-25 01:57:44) [0.13470442 0.13069272 0.12847659 0.12622799 0.12421749 0.12214833
 0.11920623 0.11432623]
/lustre/grp/gyqlab/zhangcw/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/models.py:458: RuntimeWarning: invalid value encountered in multiply
  volume = volume * norm[1] + norm[0]
(INFO) (reconstruct.py) (29-Jul-25 01:58:14) Will use pose search on 61929 particles
(INFO) (reconstruct.py) (29-Jul-25 01:58:14) Will make a full summary at the end of this epoch
/var/spool/slurmd/job23553767/slurm_script: line 10: 2340797 Segmentation fault      drgnai train /lustre/grp/gyqlab/zhangcw/drgnai/10177 --multigpu
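
The overflow warning from NumPy's `multiply` and the real-space mean of about 1e+25 in the log above suggest that a few particles may contain extremely large or non-finite pixel values, so squaring them while computing the std overflows float32. Below is a minimal diagnostic sketch, assuming the particle stack has already been loaded as a NumPy array of shape (N, D, D). The function name `find_suspect_images` and the threshold `abs_limit` are hypothetical and not part of CryoDRGN-AI; the demo uses synthetic data, not the real dataset.

```python
import numpy as np


def find_suspect_images(images, abs_limit=1e6):
    """Return indices of images containing NaN/inf or |pixel| > abs_limit.

    `abs_limit` is an arbitrary, hypothetical threshold; pick one that is
    far above the expected pixel range of your own data.
    """
    images = np.asarray(images)
    flat = images.reshape(images.shape[0], -1)
    # Images with at least one NaN or +/-inf pixel.
    nonfinite = ~np.isfinite(flat).all(axis=1)
    # Images with at least one extreme pixel (NaN compares as False here).
    with np.errstate(invalid="ignore"):
        extreme = (np.abs(flat) > abs_limit).any(axis=1)
    return np.flatnonzero(nonfinite | extreme)


# Synthetic stand-in for a particle stack: 5 images of 8x8 pixels.
rng = np.random.default_rng(0)
stack = rng.normal(size=(5, 8, 8)).astype(np.float32)
stack[1, 0, 0] = 1e30    # one extreme pixel
stack[3, 2, 2] = np.nan  # one non-finite pixel

bad = find_suspect_images(stack)
print(bad)  # [1 3]

# Squaring 1e30 exceeds the float32 maximum (~3.4e38), so the float32 std
# of the corrupted image is inf, while a float64 accumulator stays finite.
with np.errstate(over="ignore"):
    std32 = np.std(stack[1])
std64 = np.std(stack[1], dtype=np.float64)
print(np.isinf(std32), np.isfinite(std64))  # True True
```

For a stack of this size (61929 images of 384x384), it may be necessary to run the check in chunks to limit memory use. If any indices are reported, those particles are the first place to look before rerunning training.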
