drgnai train requesting too much GPU memory #16

@sean-workman

Description

Hi there,

I am attempting to use drgnai train for an ab initio reconstruction of a small-ish flexible protein complex. When the program leaves the HPS phase and transitions to the SGD phase, too much memory is requested as indicated below:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.14 GiB. GPU 0 has a total capacity of 31.73 GiB of which 3.11 GiB is free.
Including non-PyTorch memory, this process has 28.62 GiB memory in use. Of the allocated memory 23.19 GiB is allocated by PyTorch, and
4.89 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
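As a side note, the error message itself suggests one mitigation for fragmentation: switching the PyTorch CUDA caching allocator to expandable segments. A minimal sketch of how that could be added to the batch script, assuming it is exported before `drgnai train` runs (this addresses fragmentation of the ~4.89 GiB "reserved but unallocated" memory, not the underlying allocation size):

```shell
# Workaround suggested by the OOM message: let the PyTorch CUDA caching
# allocator grow existing segments instead of fragmenting fixed-size ones.
# Export this in the Slurm script before launching training.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF"
```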

This occurred on an HPC cluster using Slurm with the batch script below:

#!/bin/bash
#SBATCH --account=def-caveney-ab
#SBATCH --job-name=j933_128_homo
#SBATCH --cpus-per-task=24
#SBATCH --mem=80G
#SBATCH --gres=gpu:v100l:2
#SBATCH --time=1-00:00
#SBATCH --output=j933_128_homo.log
#SBATCH --error=j933_128_homo.err

module load python/3.10.13
source ~/software/drgnai_env/bin/activate

drgnai setup homo --particles inputs/particles.128.txt --ctf inputs/ctf.pkl \
--capture-setup spa --reconstruction-type homo \
--conf-estimation autodecoder --pose-estimation abinit \
--cfgs "batch_size_sgd=128" "z_dim=8"

drgnai train homo

This has happened to me on this HPC cluster using 2x 32 GB GPUs (above) as well as on a local server using 4x 24 GB GPUs (without Slurm). The amount of extra memory the program attempts to allocate is the same in both situations, so the failure doesn't appear to depend on the memory actually available: regardless of the total, the program attempts to allocate an additional ~3 GiB.

On the local server with 4x 24 GB GPUs I installed the program exactly according to the Gitbook documentation (i.e. create a conda environment with Python 3.9, install with pip). On the HPC cluster conda is not an option, so I created a Python virtualenv (Python 3.10.13; using 3.9 would have forced me into a different "StdEnv" that I didn't want to use) and then installed as otherwise indicated in the Gitbook documentation.

I can work around the issue by reducing the SGD batch size to 128, but I'm raising it because it seems odd that the default of 256 leads to the same OOM error on GPUs with differing amounts of memory.
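For what it's worth, the fact that halving the batch size fixes it is consistent with the failed allocation being a per-batch intermediate tensor whose size scales linearly with batch size (an assumption about drgnai internals, not something I've verified; the per-image figure below is just derived from the reported 3.14 GiB):

```python
# Back-of-envelope sketch (assumptions, not drgnai internals): if the failed
# allocation is a per-batch intermediate, it should scale linearly with
# batch size, independent of total GPU memory.
def batch_alloc_gib(batch_size, per_image_mib):
    """Estimated size of one per-batch allocation, in GiB."""
    return batch_size * per_image_mib / 1024

# A 3.14 GiB request at batch size 256 implies ~12.56 MiB of intermediate
# tensors per image at that step.
per_image = 3.14 * 1024 / 256
print(round(batch_alloc_gib(256, per_image), 2))  # 3.14
print(round(batch_alloc_gib(128, per_image), 2))  # 1.57
```

That would explain why the same ~3 GiB request appears on both 24 GB and 32 GB cards: the request size depends only on the batch-size default of 256, while whether it fits depends on how much headroom the rest of training has already consumed.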
