drgnai train requesting too much GPU memory #16

@sean-workman

Description

Hi there,

I am attempting to use drgnai train for an ab initio reconstruction of a small-ish flexible protein complex. When the program leaves the HPS phase and transitions to the SGD phase, too much memory is requested as indicated below:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.14 GiB. GPU 0 has a total capacity of 31.73 GiB of which 3.11 GiB is free.
Including non-PyTorch memory, this process has 28.62 GiB memory in use. Of the allocated memory 23.19 GiB is allocated by PyTorch, and
4.89 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
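As a side note, the error message itself suggests one mitigation for fragmentation: switching the PyTorch CUDA caching allocator to expandable segments. A minimal sketch of how that could be added to the batch script, assuming it is exported before `drgnai train` runs (this addresses fragmentation of the ~4.89 GiB "reserved but unallocated" memory, not the underlying allocation size):

```shell
# Workaround suggested by the OOM message: let the PyTorch CUDA caching
# allocator grow existing segments instead of fragmenting fixed-size ones.
# Export this in the Slurm script before launching training.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF"
```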

This occurred on an HPC cluster using Slurm with the batch script below:

#!/bin/bash
#SBATCH --account=def-caveney-ab
#SBATCH --job-name=j933_128_homo
#SBATCH --cpus-per-task=24
#SBATCH --mem=80G
#SBATCH --gres=gpu:v100l:2
#SBATCH --time=1-00:00
#SBATCH --output=j933_128_homo.log
#SBATCH --error=j933_128_homo.err

module load python/3.10.13
source ~/software/drgnai_env/bin/activate

drgnai setup homo --particles inputs/particles.128.txt --ctf inputs/ctf.pkl \
--capture-setup spa --reconstruction-type homo \
--conf-estimation autodecoder --pose-estimation abinit \
--cfgs "batch_size_sgd=128" "z_dim=8"

drgnai train homo

This has happened to me on this HPC cluster using 2x 32 GB GPUs (above) as well as on a local server using 4x 24 GB GPUs (without Slurm). The amount of extra memory the program attempts to allocate is the same in both situations, so the failure doesn't appear to depend on the memory actually available: regardless of the total, the program attempts to allocate an additional ~3 GiB.

On the local server with 4x 24 GB GPUs I installed the program exactly according to the Gitbook documentation (i.e. create a conda environment with Python 3.9, install with pip). On the HPC cluster conda is not an option, so I created a Python virtualenv (Python 3.10.13; using 3.9 would have forced me into a different "StdEnv" that I didn't want to use) and then installed as otherwise indicated in the Gitbook documentation.

I can work around the issue by reducing the SGD batch size to 128, but I'm raising it because it seems odd that the default of 256 leads to the same OOM error on GPUs with differing amounts of memory.
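For what it's worth, the fact that halving the batch size fixes it is consistent with the failed allocation being a per-batch intermediate tensor whose size scales linearly with batch size (an assumption about drgnai internals, not something I've verified; the per-image figure below is just derived from the reported 3.14 GiB):

```python
# Back-of-envelope sketch (assumptions, not drgnai internals): if the failed
# allocation is a per-batch intermediate, it should scale linearly with
# batch size, independent of total GPU memory.
def batch_alloc_gib(batch_size, per_image_mib):
    """Estimated size of one per-batch allocation, in GiB."""
    return batch_size * per_image_mib / 1024

# A 3.14 GiB request at batch size 256 implies ~12.56 MiB of intermediate
# tensors per image at that step.
per_image = 3.14 * 1024 / 256
print(round(batch_alloc_gib(256, per_image), 2))  # 3.14
print(round(batch_alloc_gib(128, per_image), 2))  # 1.57
```

That would explain why the same ~3 GiB request appears on both 24 GB and 32 GB cards: the request size depends only on the batch-size default of 256, while whether it fits depends on how much headroom the rest of training has already consumed.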
