MolGPTiny is a lightweight GPT-style molecular modeling repository for pretraining and evaluation on common molecular datasets.
## Structure

- `train.py`: training entrypoint (pretraining)
- `eval.py`: evaluation entrypoint
- `models/`: GPT architecture for causal language modeling
- `configs/`: data, model, and Hydra launcher configs
- `data/`: pretraining datasets
- `outputs/`: checkpoints, logs, generated artifacts
## Requirements

- conda (recommended)
- Python 3.12
- GPU with a recent CUDA toolkit (optional, but strongly recommended for training)
## Installation

Create and activate a conda environment:

```shell
conda create -n molgptiny python=3.12 -y
conda activate molgptiny
```

Install runtime dependencies via pip:

```shell
pip install -U pip
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install lightning numpy scikit-learn pandas hydra-core rdkit flatten-dict
```

Optional: if you prefer a single requirements file, run `pip install -r requirements.txt`.
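After installing, a quick sanity check is to confirm that the runtime dependencies resolve as importable modules. The snippet below is a small sketch (not part of the repository); note that some pip package names differ from their module names, e.g. `scikit-learn` imports as `sklearn` and `flatten-dict` as `flatten_dict`.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names corresponding to the pip packages installed above.
required = ["torch", "lightning", "numpy", "sklearn",
            "pandas", "hydra", "rdkit", "flatten_dict"]

if __name__ == "__main__":
    gaps = missing_packages(required)
    print("missing:", gaps or "none")
```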
## Pretraining Datasets (unsupervised SMILES language modeling)
Pretraining uses large unlabeled molecular datasets where the model learns to predict tokens in SMILES sequences. This learned representation can be further evaluated or fine-tuned on downstream tasks.
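To illustrate the objective, the sketch below builds a character-level vocabulary from SMILES strings and forms the shifted input/target pairs used in next-token prediction. This is a simplification: the repository's actual tokenizer may differ (for example, by treating multi-character atoms such as `Cl` or `Br` as single tokens).

```python
def build_vocab(smiles_list):
    """Map each character appearing in the corpus to an integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}

def encode(smiles, vocab):
    """Convert a SMILES string to a list of token ids."""
    return [vocab[ch] for ch in smiles]

def lm_example(smiles, vocab):
    """Causal LM pair: the model sees tokens [0..n-2] and predicts [1..n-1]."""
    ids = encode(smiles, vocab)
    return ids[:-1], ids[1:]

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
vocab = build_vocab(corpus)
x, y = lm_example("CCO", vocab)
```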
- `chemblv31`: A subset of the ChEMBL database (version 31), a large-scale bioactivity database for drug discovery. ChEMBL contains activity data for millions of compounds and is a standard resource in medicinal chemistry. The pretraining data is at `data/chemblv31/`.
  - Citation: Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., Hersey, A., ... & Overington, J. P. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1), D1100–D1107.
  - Link: https://academic.oup.com/nar/article/40/D1/D1100/2903401
- `guacamol`: The GuacaMol benchmark dataset for molecular generation, containing a diverse set of drug-like molecules and molecular generation objectives. Useful for evaluating learned representations on generation and property prediction tasks. Data at `data/guacamol/`.
  - Citation: Brown, N., Fiscato, M., Segler, M. H. S., & Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. Journal of Chemical Information and Modeling, 59(3), 1096–1108. (Preprint on arXiv:1811.09621.)
  - Link: https://arxiv.org/abs/1811.09621
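As a sketch of how such a dataset might be read, the helper below loads one SMILES string per line, skipping blanks and comment lines. The filename and the one-SMILES-per-line format are assumptions for illustration; the actual files under `data/chemblv31/` and `data/guacamol/` may use a different layout (e.g. CSV).

```python
from pathlib import Path

def load_smiles(path, max_samples=None):
    """Read one SMILES per line, skipping blank lines and '#' comments."""
    lines = Path(path).read_text().splitlines()
    smiles = [s.strip() for s in lines if s.strip() and not s.startswith("#")]
    # Optionally truncate, mirroring the data.max_samples override.
    return smiles[:max_samples] if max_samples else smiles
```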
## Usage

The typical workflow is pretraining on large molecular datasets. Configuration overrides use Hydra's dot-notation syntax.
Pretraining (causal language modeling on SMILES):

```shell
# Optional: request an interactive GPU allocation on a Slurm cluster
srun -p amdgpufast --cpus-per-task=16 --mem=128000 --gres=gpu:1 --mpi=none --pty bash

python train.py \
    data.dataset=chemblv31 \
    trainer.max_epochs=25 \
    hydra.run.dir=outputs/singlerun/train
```

Evaluation (generate molecules, compute metrics):

```shell
python eval.py \
    data.dataset=chemblv31 \
    checkpoint_path=outputs/checkpoints/last.ckpt \
    hydra.run.dir=outputs/singlerun/eval
```

Use `hydra.run.dir` to control where outputs are saved; omit it to use the default Hydra working directory structure. Pass `data.max_samples=N` to limit dataset size for quick experiments.
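Conceptually, a dot-notation override such as `trainer.max_epochs=25` walks the nested config and sets the leaf value. The sketch below shows that resolution in plain Python; it is not Hydra's actual implementation (Hydra builds on OmegaConf and has a much richer value parser), just an illustration of the semantics.

```python
def apply_override(cfg, override):
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    key, _, raw = override.partition("=")
    *parents, leaf = key.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {})
    # Hydra infers types when parsing; here we only try int, else keep str.
    try:
        node[leaf] = int(raw)
    except ValueError:
        node[leaf] = raw
    return cfg

cfg = {"trainer": {"max_epochs": 10}, "data": {}}
apply_override(cfg, "trainer.max_epochs=25")
apply_override(cfg, "data.dataset=chemblv31")
```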
## Outputs

- Checkpoints: `outputs/checkpoints/` (training checkpoints)
- Logs: `outputs/logs/` (Lightning logs per run)
- Generated artifacts: `outputs/singlerun/` (evaluation outputs)
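When several checkpoints accumulate under `outputs/checkpoints/`, a small helper can pick the most recent one to pass as `checkpoint_path`. This is a convenience sketch, not part of the repository; it assumes checkpoints use the `.ckpt` extension, as in the `last.ckpt` example above.

```python
from pathlib import Path

def latest_checkpoint(ckpt_dir):
    """Return the most recently modified .ckpt file, or None if none exist."""
    ckpts = list(Path(ckpt_dir).glob("*.ckpt"))
    return max(ckpts, key=lambda p: p.stat().st_mtime) if ckpts else None
```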
## Notes

- On a Slurm cluster, `configs/hydra/launcher/slurm_gpu.yaml` can be used to configure distributed runs.
- Adjust package installation (CUDA version, torch index URL) to your platform as needed.