MolGPTiny is a lightweight GPT-style molecular modeling repository for pretraining and evaluation on common molecular datasets.
## Structure

- `train.py`: training entrypoint (pretraining)
- `eval.py`: evaluation entrypoint
- `models/`: GPT architecture for causal language modeling
- `configs/`: data, model, and Hydra launcher configs
- `data/`: pretraining datasets
- `outputs/`: checkpoints, logs, generated artifacts
## Requirements

- conda (recommended)
- Python 3.12
- GPU with a recent CUDA toolkit (optional, but strongly recommended for training)
## Installation

Create and activate a conda environment:

```shell
conda create -n molgptiny python=3.12 -y
conda activate molgptiny
```

Install runtime dependencies via pip:

```shell
pip install -U pip
pip install torch --index-url https://download.pytorch.org/whl/cu126
pip install lightning numpy scikit-learn pandas hydra-core rdkit flatten-dict
```

Optional: if you prefer a single requirements file, run `pip install -r requirements.txt`.
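After installing, a quick sanity check is to confirm that the runtime dependencies resolve as importable modules. The snippet below is a small sketch (not part of the repository); note that some pip package names differ from their module names, e.g. `scikit-learn` imports as `sklearn` and `flatten-dict` as `flatten_dict`.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names corresponding to the pip packages installed above.
required = ["torch", "lightning", "numpy", "sklearn",
            "pandas", "hydra", "rdkit", "flatten_dict"]

if __name__ == "__main__":
    gaps = missing_packages(required)
    print("missing:", gaps or "none")
```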
## Pretraining Datasets (unsupervised SMILES language modeling)
Pretraining uses large unlabeled molecular datasets where the model learns to predict tokens in SMILES sequences. This learned representation can be further evaluated or fine-tuned on downstream tasks.
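To illustrate the objective, the sketch below builds a character-level vocabulary from SMILES strings and forms the shifted input/target pairs used in next-token prediction. This is a simplification: the repository's actual tokenizer may differ (for example, by treating multi-character atoms such as `Cl` or `Br` as single tokens).

```python
def build_vocab(smiles_list):
    """Map each character appearing in the corpus to an integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}

def encode(smiles, vocab):
    """Convert a SMILES string to a list of token ids."""
    return [vocab[ch] for ch in smiles]

def lm_example(smiles, vocab):
    """Causal LM pair: the model sees tokens [0..n-2] and predicts [1..n-1]."""
    ids = encode(smiles, vocab)
    return ids[:-1], ids[1:]

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
vocab = build_vocab(corpus)
x, y = lm_example("CCO", vocab)
```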
- `chemblv31`: A subset of the ChEMBL database (version 31), a large-scale bioactivity database for drug discovery. ChEMBL contains activity data for millions of compounds and is a standard resource in medicinal chemistry. The pretraining data is at `data/chemblv31/`.
  - Citation: Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., Hersey, A., ... & Overington, J. P. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1), D1100–D1107.
  - Link: https://academic.oup.com/nar/article/40/D1/D1100/2903401
- `guacamol`: The GuacaMol benchmark dataset for molecular generation, containing a diverse set of drug-like molecules and molecular generation objectives. Useful for evaluating learned representations on generation and property prediction tasks. Data at `data/guacamol/`.
  - Citation: Brown, N., Fiscato, M., Segler, M. H. S., & Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. Journal of Chemical Information and Modeling, 59(3), 1096–1108. (Preprint on arXiv:1811.09621.)
  - Link: https://arxiv.org/abs/1811.09621
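As a sketch of how such a dataset might be read, the helper below loads one SMILES string per line, skipping blanks and comment lines. The filename and the one-SMILES-per-line format are assumptions for illustration; the actual files under `data/chemblv31/` and `data/guacamol/` may use a different layout (e.g. CSV).

```python
from pathlib import Path

def load_smiles(path, max_samples=None):
    """Read one SMILES per line, skipping blank lines and '#' comments."""
    lines = Path(path).read_text().splitlines()
    smiles = [s.strip() for s in lines if s.strip() and not s.startswith("#")]
    # Optionally truncate, mirroring the data.max_samples override.
    return smiles[:max_samples] if max_samples else smiles
```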
## Usage

The typical workflow is pretraining on large molecular datasets. Configuration overrides use Hydra's dot-notation syntax.
Pretraining (causal language modeling on SMILES):

```shell
# Optional: request an interactive GPU allocation on a Slurm cluster
srun -p amdgpufast --cpus-per-task=16 --mem=128000 --gres=gpu:1 --mpi=none --pty bash

python train.py \
    data.dataset=chemblv31 \
    trainer.max_epochs=25 \
    hydra.run.dir=outputs/singlerun/train
```

Evaluation (generate molecules, compute metrics):

```shell
python eval.py \
    data.dataset=chemblv31 \
    checkpoint_path=outputs/checkpoints/last.ckpt \
    hydra.run.dir=outputs/singlerun/eval
```

Use `hydra.run.dir` to control where outputs are saved; omit it to use the default Hydra working directory structure. Pass `data.max_samples=N` to limit dataset size for quick experiments.
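Conceptually, a dot-notation override such as `trainer.max_epochs=25` walks the nested config and sets the leaf value. The sketch below shows that resolution in plain Python; it is not Hydra's actual implementation (Hydra builds on OmegaConf and has a much richer value parser), just an illustration of the semantics.

```python
def apply_override(cfg, override):
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    key, _, raw = override.partition("=")
    *parents, leaf = key.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {})
    # Hydra infers types when parsing; here we only try int, else keep str.
    try:
        node[leaf] = int(raw)
    except ValueError:
        node[leaf] = raw
    return cfg

cfg = {"trainer": {"max_epochs": 10}, "data": {}}
apply_override(cfg, "trainer.max_epochs=25")
apply_override(cfg, "data.dataset=chemblv31")
```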
## Outputs

- Checkpoints: `outputs/checkpoints/` (training checkpoints)
- Logs: `outputs/logs/` (Lightning logs per run)
- Generated artifacts: `outputs/singlerun/` (evaluation outputs)
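When several checkpoints accumulate under `outputs/checkpoints/`, a small helper can pick the most recent one to pass as `checkpoint_path`. This is a convenience sketch, not part of the repository; it assumes checkpoints use the `.ckpt` extension, as in the `last.ckpt` example above.

```python
from pathlib import Path

def latest_checkpoint(ckpt_dir):
    """Return the most recently modified .ckpt file, or None if none exist."""
    ckpts = list(Path(ckpt_dir).glob("*.ckpt"))
    return max(ckpts, key=lambda p: p.stat().st_mtime) if ckpts else None
```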
## Notes

- On a Slurm cluster, `configs/hydra/launcher/slurm_gpu.yaml` can be used to configure distributed runs.
- Adjust package installation (CUDA version, torch index URL) to your platform as needed.