mlnpapez/MolGPTiny
MolGPTiny

MolGPTiny is a lightweight GPT-style molecular modeling repository for pretraining and evaluation on common molecular datasets.

Structure

  • Training entrypoint: train.py (pretraining)
  • Evaluation entrypoint: eval.py
  • Models: models/ (GPT architecture for causal language modeling)
  • Configs: configs/ (data, model, hydra launcher)
  • Data: data/ (pretraining datasets)
  • Outputs: outputs/ (checkpoints, logs, generated artifacts)

Requirements

  • conda (recommended)
  • Python 3.12
  • GPU with a recent CUDA version (optional, but strongly recommended for training)

Installation

  1. Create and activate a conda environment:

     conda create -n molgptiny python=3.12 -y
     conda activate molgptiny

  2. Install runtime dependencies via pip:

     pip install -U pip
     pip install torch --index-url https://download.pytorch.org/whl/cu126
     pip install lightning numpy scikit-learn pandas hydra-core rdkit flatten-dict

Optional: if you prefer a single requirements file, run pip install -r requirements.txt.
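As a quick post-install sanity check, the following sketch (not part of the repository) reports which of the listed dependencies are importable in the active environment. Note that the import names differ from the pip names for scikit-learn (`sklearn`), hydra-core (`hydra`), and flatten-dict (`flatten_dict`):

```python
# Hypothetical post-install check: report which required packages
# cannot be imported in the current environment.
from importlib.util import find_spec

REQUIRED = ["torch", "lightning", "numpy", "sklearn",
            "pandas", "hydra", "rdkit", "flatten_dict"]

def missing_packages(names):
    """Return the subset of `names` for which no module spec is found."""
    return [n for n in names if find_spec(n) is None]

if __name__ == "__main__":
    gaps = missing_packages(REQUIRED)
    print("missing:", gaps or "none")
```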

Datasets

Pretraining Datasets (unsupervised SMILES language modeling)

Pretraining uses large unlabeled molecular datasets where the model learns to predict tokens in SMILES sequences. This learned representation can be further evaluated or fine-tuned on downstream tasks.
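As an illustrative sketch (not the repository's actual tokenizer), character-level tokenization of SMILES and the shifted input/target pairs used for next-token prediction look like this:

```python
# Simplified character-level SMILES tokenization for causal language
# modeling. Real tokenizers often treat multi-character atoms (e.g. "Cl",
# "Br") as single tokens; this sketch splits on individual characters.

def build_vocab(smiles_list):
    """Map every character seen in the corpus to an integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}

def encode(smiles, vocab):
    return [vocab[ch] for ch in smiles]

def causal_pair(token_ids):
    """Inputs are tokens 0..n-2, targets are tokens 1..n-1 (next token)."""
    return token_ids[:-1], token_ids[1:]

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
vocab = build_vocab(corpus)
x, y = causal_pair(encode("CCO", vocab))
```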

  • chemblv31: A subset of the ChEMBL database (version 31), a large-scale bioactivity database for drug discovery. ChEMBL contains activity data for millions of compounds and is a standard resource in medicinal chemistry. The pretraining dataset is at data/chemblv31/.

    • Citation: Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., Hersey, A., ... & Overington, J. P. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1), D1100–D1107.
    • Link: https://academic.oup.com/nar/article/40/D1/D1100/2903401
  • guacamol: The GuacaMol benchmark dataset for molecular generation, containing a diverse set of drug-like molecules and molecular generation objectives. Useful for evaluating learned representations on generation and property prediction tasks. Data at data/guacamol/.

    • Citation: Brown, N., Fiscato, M., Segler, M. H. S., & Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. Journal of Chemical Information and Modeling, 59(3), 1096–1108. (preprint on arXiv:1811.09621).
    • Link: https://arxiv.org/abs/1811.09621

Running the Pipeline

The typical workflow is pretraining on a large molecular dataset, followed by evaluation. Configuration overrides use Hydra's dot-notation syntax.

Pretraining (causal language modeling on SMILES):

# On a Slurm cluster, first request an interactive GPU session:
srun -p amdgpufast --cpus-per-task=16 --mem=128000 --gres=gpu:1 --mpi=none --pty bash

python train.py \
  data.dataset=chemblv31 \
  trainer.max_epochs=25 \
  hydra.run.dir=outputs/singlerun/train

Evaluation (generate molecules, compute metrics):

python eval.py \
  data.dataset=chemblv31 \
  checkpoint_path=outputs/checkpoints/last.ckpt \
  hydra.run.dir=outputs/singlerun/eval
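Two metrics commonly reported for generated molecules can be sketched as follows (hypothetical helper names, not eval.py's API): uniqueness among the generated samples and novelty relative to the training set. Validity checking would additionally require RDKit (e.g. Chem.MolFromSmiles) and is omitted here:

```python
# Illustrative generation metrics over lists/sets of SMILES strings.

def uniqueness(generated):
    """Fraction of generated SMILES that are distinct."""
    return len(set(generated)) / len(generated) if generated else 0.0

def novelty(generated, training_set):
    """Fraction of distinct generated SMILES absent from the training data."""
    unique = set(generated)
    return len(unique - set(training_set)) / len(unique) if unique else 0.0

samples = ["CCO", "CCO", "CCN", "c1ccccc1"]
print(uniqueness(samples), novelty(samples, {"CCO"}))
```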

Use hydra.run.dir to control where outputs are saved. Omit it to use the default Hydra working directory structure. Pass data.max_samples=N to limit dataset size for quick experiments.
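Conceptually, a dot-notation override such as data.dataset=chemblv31 sets a nested key in the run configuration. The simplified sketch below is not Hydra's implementation (which additionally handles type parsing, interpolation, and struct checking); it only illustrates the idea:

```python
# Simplified model of applying "a.b.c=value" overrides to a nested dict.

def apply_override(config, override):
    """Set a nested key from an 'a.b.c=value' style string, in place."""
    dotted, value = override.split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value  # values stay strings in this toy version
    return config

cfg = {"data": {"dataset": "guacamol"}, "trainer": {}}
apply_override(cfg, "data.dataset=chemblv31")
apply_override(cfg, "trainer.max_epochs=25")
```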

Outputs

  • Checkpoints: outputs/checkpoints/ (training checkpoints)
  • Logs: outputs/logs/ (Lightning logs per run)
  • Generated artifacts: outputs/singlerun/ (evaluation outputs)

Notes

  • On a Slurm cluster, configs/hydra/launcher/slurm_gpu.yaml can be used to configure distributed runs.
  • Adjust package installation (CUDA versions, torch index URL) to your platform as needed.
