Learning chemical reaction condition embeddings.
- Setup — install the package and configure environment variables
- Use — quick-start: generate condition embeddings with a trained model
- Reproducing Publication Results — step-by-step guide to reproduce every figure and table
- Quick Start (Precomputed Embeddings) — reproduce results without model weights
- Local Reactivity — kNN yield prediction across Amide Coupling HTE datasets
- HTE Maps — t-SNE / UMAP visualisations of condition spaces
- Component Property Correlations — solvent embedding ↔ physical-property analysis
- Diversity Analysis — HTE plate diversity vs. reaction outcomes
- Training a Model — train a CondCL model from scratch
- Generating Embeddings — create the embedding files required for analysis
- Evaluation Embeddings — for local reactivity & HTE maps
- Diversity Embeddings — for diversity analysis
- Property Embeddings — for component property correlations
- Data — building training and validation datasets from scratch
Create a uv environment and install requirements:

```shell
uv sync
uv pip install -e .
```

Create a `.env` file with the required environment variables:
```
CONDITION_LEARNING_DATA_DIR=/path/to/external/data/directory
PATH_TO_SMARTSRX=/path/to/hardcoded/smartsrx.csv
PATH_TO_ENV_MANAGER=/path/to/chemformer/env
```

Navigate to the aizynthmodels repository and set up the aizynthmodels environment. After creating the environment, install h5py:

```shell
pip install h5py
```

For generating embeddings for new components, see the create.ipynb notebook.
See create.ipynb for a walkthrough of generating condition and component embeddings with a trained model, and using those embeddings to recreate the evaluation embedding files.
All analyses can be reproduced without model weights by using precomputed embedding files distributed alongside this repository.
The following precomputed files are included:
- `data/embeddings/evaluation/<reaction_type>_<dataset>/` — condition-to-embedding maps for all HTE datasets (used in Local Reactivity and HTE Maps)
- `data/embeddings/evaluation/diversity.h5` — embeddings for the diversity analysis
- `data/embeddings/physical-properties/solvents.h5` — solvent embeddings for the property correlation analysis (including out-of-domain solvents)
If you have a trained model and need to regenerate these files yourself, see Generating Embeddings.
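A quick way to check what any of these embedding files contains before running an analysis is to walk its HDF5 tree with h5py. This sketch assumes nothing about the group or dataset names inside the files:

```python
import h5py

def print_h5_tree(path):
    """Print every group and dataset in an HDF5 file, with shape and dtype."""
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
            else:
                print(f"{name}/")  # group
        f.visititems(visit)
```

For example, `print_h5_tree("data/embeddings/evaluation/diversity.h5")` lists the datasets bundled for the diversity analysis.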
Evaluate whether the learnt condition representation is more predictive of reaction yield than baseline methods (OHE, Morgan, Mordred) using leave-one-out kNN regression.
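The evaluation idea can be sketched in a few lines of numpy: for each reaction, predict its yield as the mean yield of its k nearest neighbours in condition-embedding space (excluding itself), then score with R². This is an illustrative sketch, not the repository's exact implementation, and the value of k is arbitrary here:

```python
import numpy as np

def loo_knn_r2(embeddings, yields, k=5):
    """Leave-one-out kNN regression: predict each yield from its k nearest
    neighbours (Euclidean distance) among the *other* points, return R^2."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(yields, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)        # exclude the held-out point itself
    nn = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    preds = y[nn].mean(axis=1)
    ss_res = ((y - preds) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))               # stand-in condition embeddings
y = X[:, 0] + 0.1 * rng.normal(size=40)    # yield correlated with dimension 0
print(round(loo_knn_r2(X, y, k=3), 3))
```

A more predictive representation places conditions with similar yields close together, which pushes this R² up; the baselines (OHE, Morgan, Mordred) are scored the same way on their own feature spaces.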
Amide Coupling HTE datasets:
| Dataset | Flag |
|---|---|
| Doyle | doyle |
| Pfizer 1 (no additive) | pfizer1_no_additive |
| Pfizer 2 | pfizer2 |
| Pfizer 1 (additive) | pfizer1_additive |
| Pfizer 1 (inc. OOD) | pfizer1_no_additive_inc_out_of_domain |
| Pfizer 2 (inc. OOD) | pfizer2_inc_out_of_domain |
With precomputed embeddings (no model needed): skip this step — embeddings are already at data/embeddings/evaluation/.
With a trained model, generate embeddings first:
```shell
# Learnt embeddings:
evaluation-scripts/create_eval_embeddings.sh \
    -e <experiment_name> \
    -m <checkpoint> \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default  # or 'ood'
```
```shell
# Baseline embeddings (OHE, Morgan, Mordred, ChemFormer — no model needed):
evaluation-scripts/create_eval_embeddings.sh \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default \
    -b
```

```shell
# Learnt Representation + Baselines:
evaluation-scripts/run_local_reactivity_experiments.sh -r amide_coupling -s default -b

# Learnt representation (baseline only needs to be run once):
evaluation-scripts/run_local_reactivity_experiments.sh -e <experiment_name> -r amide_coupling -s default
```

For 'out-of-domain' datasets, use `-s ood` instead of `-s default`.
Open notebooks/condCL/reactivity.ipynb, set the EXPERIMENT_NAME and SUFFIX variables, and run all cells. The notebook reads the CSV result files generated in Step 2.
Visualise the condition embedding space of HTE datasets using dimensionality reduction.
Condition-to-embedding maps must already exist in data/embeddings/evaluation/ (precomputed files are included, or regenerate via Step 1 of Local Reactivity above).
Open notebooks/vis/hte.ipynb, set REACTION_TYPE, EXPERIMENT_NAME, and DATASET, then run all cells.
Available datasets match those listed in the Local Reactivity table.
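The projection step in the notebook amounts to something like the following minimal t-SNE sketch. The random array stands in for a loaded condition-to-embedding map; the real notebook additionally colours points by reaction outcome:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 32))  # stand-in for 60 condition embeddings

coords = TSNE(
    n_components=2,
    perplexity=10,    # must be smaller than the number of points
    init="pca",
    random_state=0,
).fit_transform(emb)
print(coords.shape)  # (60, 2): one 2-D point per condition
```

UMAP works the same way with `umap.UMAP(n_components=2).fit_transform(emb)` if the umap-learn package is installed.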
Measure correlation between learnt component (solvent) embedding dimensions and physical properties.
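The per-dimension analysis can be sketched as follows: rank-correlate each embedding dimension against a property column (Spearman here; the script's exact metric may differ, and all variable names are illustrative):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation via Pearson on ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 8))                      # stand-in solvent embeddings
polarity = emb[:, 2] + 0.1 * rng.normal(size=30)    # property tracking dim 2

corrs = [spearman(emb[:, d], polarity) for d in range(emb.shape[1])]
print(int(np.argmax(np.abs(corrs))))                # dimension most correlated
```

A strongly property-aligned dimension shows up as a single large |correlation| spike across the dimension index.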
With precomputed embeddings (no model needed):

```shell
# test-2048 for the paper results
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    --precomputed_embeddings_path data/embeddings/physical-properties/solvents.h5 \
    --run_baseline
```

With a trained model (optionally saving the embeddings for future use):

```shell
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    -c <checkpoint> \
    --run_baseline \
    --save_embeddings_path data/embeddings/physical-properties/solvents.h5
```

To include out-of-domain solvents, add `--include_ood` and either `--precomputed_ood_embeddings_path <path>` or `--save_ood_embeddings_path <path>`.
Open notebooks/condCL/properties.ipynb, set EXPERIMENT_NAME_AC, and run all cells. The notebook reads the CSV results generated in Step 1 — no model weights are needed.
Study the relationship between HTE plate condition diversity and reaction outcomes.
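One way to turn a Gaussian kernel with a length scale into a plate-level diversity score is via mean pairwise similarity, as in this sketch. This is a plausible definition suggested by the evaluation script's `--kernel_type gaussian` and `--kernel_length_scale` flags, not necessarily the script's exact measure:

```python
import numpy as np

def plate_diversity(embeddings, length_scale=32.0):
    """Diversity of one HTE plate's condition set: 1 minus the mean pairwise
    Gaussian (RBF) kernel similarity of its condition embeddings."""
    X = np.asarray(embeddings, dtype=float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-sq / (2.0 * length_scale ** 2))           # similarity in [0, 1]
    return 1.0 - K.mean()

rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(24, 16))    # near-identical conditions
spread = rng.normal(scale=50.0, size=(24, 16))  # widely varied conditions
print(plate_diversity(tight) < plate_diversity(spread))  # → True
```

A plate of near-duplicate conditions scores close to 0, a plate of mutually dissimilar conditions close to 1, which is the quantity the analysis relates to reaction outcomes.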
With precomputed embeddings (no model needed):
The repository includes embeddings at `data/embeddings/evaluation/diversity.h5`, so running the diversity experiment requires only:

```shell
./evaluation-scripts/run_diversity_experiments.sh
```

Baselines (Morgan and Mordred, no model needed) are included in run_diversity_experiments.sh.
With a trained model (optionally saving the embeddings for future use), replace the learnt section with:

```shell
uv run src/condition_learning/evaluation/diversity.py \
    --hte_dataset_path <dataset>.h5 \
    --reaction_type amide_coupling \
    --method learnt \
    --model_name <experiment_name> \
    --model_checkpoint <checkpoint> \
    --save_embeddings_path data/embeddings/evaluation/diversity/<name>.h5 \
    --max_batch_size 32 \
    --num_repeats 2048 \
    --kernel_type gaussian \
    --kernel_length_scale 32 \
    --output_path path/to/output.csv
```

Open notebooks/condCL/diversity.ipynb and run all cells.
The notebook supports two modes:
- Precomputed embeddings — set `precomputed_embeddings_path` (e.g. `"data/embeddings/evaluation/diversity.h5"`) to skip model loading entirely.
- Trained model — leave `precomputed_embeddings_path = None` and set `experiment_name` / `model_checkpoint`.
Training is configured with Hydra. The main config file is config/condCL/train.yaml. Sub-configs for the data, model architecture, and trainer are in config/condCL/data/, config/condCL/model/, and config/condCL/trainer/ respectively.
The data configs (config/condCL/data/) specify paths to the training datasets and ChemFormer embeddings. You will need to update these paths to point to your own data. See the Data section for instructions on generating the required datasets, and data/README.md for a description of the required file formats.
```shell
uv run src/condition_learning/condCL/train.py experiment_name=<name> [+ OVERRIDES]
```

Any parameter can be overridden from the command line. For example:

```shell
uv run src/condition_learning/condCL/train.py \
    experiment_name=my_experiment \
    data=amide_coupling_smartsrx \
    trainer.num_epochs=500
```

Training produces a models/<experiment_name>/ folder:

```
models/<experiment_name>/
├── model_config.yaml           # Config snapshot used for training
├── last.ckpt                   # Model weights (last epoch)
└── epoch=N-best-{metric}.ckpt  # Best model (by validation metric)
```
After training a model, generate the embedding files required for the analyses above. See create.ipynb for a notebook walkthrough of loading a trained model and generating embeddings. Each analysis has its own embedding format; the sections below describe them individually.
Creates per-condition embedding maps in data/embeddings/evaluation/<reaction_type>_<dataset>/<experiment_name>/.
```shell
# Learnt embeddings (requires trained model):
evaluation-scripts/create_eval_embeddings.sh \
    -e <experiment_name> \
    -m <checkpoint> \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default  # or 'ood', 'bhc'

# Baseline embeddings (OHE, Morgan, Mordred, ChemFormer — no model needed):
evaluation-scripts/create_eval_embeddings.sh \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default \
    -b
```

Saves an .h5 file with component columns and an embedding column. The script will also run the diversity analysis.
```shell
uv run src/condition_learning/evaluation/diversity.py \
    --hte_dataset_path <dataset>.h5 \
    --reaction_type amide_coupling \
    --method learnt \
    --model_name <experiment_name> \
    --model_checkpoint <checkpoint> \
    --save_embeddings_path data/embeddings/evaluation/diversity.h5 \
    --max_batch_size 32 --num_repeats 2048 \
    --kernel_type gaussian --kernel_length_scale 32 \
    --output_path data/condition_diversity/learnt/pfizer1_no_additive_k_32.csv
```

Saves an .h5 file with SMILES, component embeddings, and physical-property columns.
```shell
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    -c <checkpoint> \
    --run_baseline \
    --save_embeddings_path data/embeddings/physical-properties/solvents.h5
```

For a detailed description of the required dataset formats see data/README.md.
Validation datasets are included in this repository under data/datasets/. If you need to re-create the validation datasets, follow create_validation_data.ipynb.
Follow the procedure used to generate the model training data. Note that extract_components.py will need to be replaced for other data sources — the required input to add_heuristic_labels.py is a raw_component_data.csv file with the columns specified in data/README.md.
Run dataset_component.sh for the full extraction pipeline:
- Extract components — `extract_components.py`. Parse a reaction database; see `data/README.md` for the required output format.
- Process roles and counts — `process_component_roles_and_counts.py`
- Label via SMARTS substructure matching — `add_heuristic_labels.py`
- Clean (split ligand/precatalysts, clean solvents/bases) — `clean_components.py`
- Merge into final wide-form dataset — `create_final_dataset.py`
- Create ChemFormer embeddings — `chemformer.py`. Note: this requires a pre-trained ChemFormer model.
```shell
python condition_learning/chemformer.py \
    --input_csv /path/to/processed/components_merged_heuristics.csv \
    --output_path /path/to/processed/chemformer_embeddings.pt \
    --config_path config/chemformer.yaml
```

Run dataset_rxn.sh:
- Extract reactions — `extract_rxns.py`. See `data/README.md` for the required output format.
- Clean conditions — `clean_rxn_conditions.py`
- Filter and compute embeddings — `filter_rxns_and_compute_embs.py`
Final datasets are in .h5 format (`{reaction_type}_data.h5`). To convert back to CSV:

```python
from condition_learning.utils import convert_h5_to_dataframe

df = convert_h5_to_dataframe(h5_path, component_cols)
df.write_csv(csv_path)
```