Learning chemical reaction condition embeddings.
- Setup — install the package and configure environment variables
- Use — quick-start: generate condition embeddings with a trained model
- Reproducing Publication Results — step-by-step guide to reproduce every figure and table
- Quick Start (Precomputed Embeddings) — reproduce results without model weights
- Local Reactivity — kNN yield prediction across Amide Coupling HTE datasets
- HTE Maps — t-SNE / UMAP visualisations of condition spaces
- Component Property Correlations — solvent embedding ↔ physical-property analysis
- Diversity Analysis — HTE plate diversity vs. reaction outcomes
- Training a Model — train a CondCL model from scratch
- Generating Embeddings — create the embedding files required for analysis
- Evaluation Embeddings — for local reactivity & HTE maps
- Diversity Embeddings — for diversity analysis
- Property Embeddings — for component property correlations
- Data — building training and validation datasets from scratch
Create a uv environment and install requirements:

```shell
uv sync
uv pip install -e .
```

Create a `.env` file with the required environment variables:
```
CONDITION_LEARNING_DATA_DIR=/path/to/external/data/directory
PATH_TO_SMARTSRX=/path/to/hardcoded/smartsrx.csv
PATH_TO_ENV_MANAGER=/path/to/chemformer/env
```

Navigate to the aizynthmodels repository and set up the aizynthmodels environment. After creating the environment, install h5py:

```shell
pip install h5py
```

For generating embeddings for new components, see the create.ipynb notebook.
See create.ipynb for a walkthrough of generating condition and component embeddings with a trained model, and using those embeddings to recreate the evaluation embedding files.
All analyses can be reproduced without model weights by using precomputed embedding files distributed alongside this repository.
The following precomputed files are included:
- `data/embeddings/evaluation/<reaction_type>_<dataset>/` — condition-to-embedding maps for all HTE datasets (used in Local Reactivity and HTE Maps)
- `data/embeddings/evaluation/diversity.h5` — embeddings for the diversity analysis
- `data/embeddings/physical-properties/solvents.h5` — solvent embeddings for the property correlation analysis (including out-of-domain solvents)
If you have a trained model and need to regenerate these files yourself, see Generating Embeddings.
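A quick way to check what any of these embedding files contains before running an analysis is to walk its HDF5 tree with h5py. This sketch assumes nothing about the group or dataset names inside the files:

```python
import h5py

def print_h5_tree(path):
    """Print every group and dataset in an HDF5 file, with shape and dtype."""
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
            else:
                print(f"{name}/")  # group
        f.visititems(visit)
```

For example, `print_h5_tree("data/embeddings/evaluation/diversity.h5")` lists the datasets bundled for the diversity analysis.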
Evaluate whether the learnt condition representation is more predictive of reaction yield than baseline methods (OHE, Morgan, Mordred) using leave-one-out kNN regression.
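The evaluation idea can be sketched in a few lines of numpy: for each reaction, predict its yield as the mean yield of its k nearest neighbours in condition-embedding space (excluding itself), then score with R². This is an illustrative sketch, not the repository's exact implementation, and the value of k is arbitrary here:

```python
import numpy as np

def loo_knn_r2(embeddings, yields, k=5):
    """Leave-one-out kNN regression: predict each yield from its k nearest
    neighbours (Euclidean distance) among the *other* points, return R^2."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(yields, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)        # exclude the held-out point itself
    nn = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    preds = y[nn].mean(axis=1)
    ss_res = ((y - preds) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))               # stand-in condition embeddings
y = X[:, 0] + 0.1 * rng.normal(size=40)    # yield correlated with dimension 0
print(round(loo_knn_r2(X, y, k=3), 3))
```

A more predictive representation places conditions with similar yields close together, which pushes this R² up; the baselines (OHE, Morgan, Mordred) are scored the same way on their own feature spaces.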
Amide Coupling HTE datasets:
| Dataset | Flag |
|---|---|
| Doyle | doyle |
| Pfizer 1 (no additive) | pfizer1_no_additive |
| Pfizer 2 | pfizer2 |
| Pfizer 1 (additive) | pfizer1_additive |
| Pfizer 1 (inc. OOD) | pfizer1_no_additive_inc_out_of_domain |
| Pfizer 2 (inc. OOD) | pfizer2_inc_out_of_domain |
With precomputed embeddings (no model needed): skip this step — embeddings are already at data/embeddings/evaluation/.
With a trained model, generate embeddings first:
```shell
# Learnt embeddings:
evaluation-scripts/create_eval_embeddings.sh \
    -e <experiment_name> \
    -m <checkpoint> \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default  # or 'ood'
```
```shell
# Baseline embeddings (OHE, Morgan, Mordred, ChemFormer — no model needed):
evaluation-scripts/create_eval_embeddings.sh \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default \
    -b
```

```shell
# Learnt Representation + Baselines:
evaluation-scripts/run_local_reactivity_experiments.sh -r amide_coupling -s default -b

# Learnt representation (baseline only needs to be run once):
evaluation-scripts/run_local_reactivity_experiments.sh -e <experiment_name> -r amide_coupling -s default
```

For 'out-of-domain' datasets, use `-s ood` instead of `-s default`.
Open notebooks/condCL/reactivity.ipynb, set the EXPERIMENT_NAME and SUFFIX variables, and run all cells. The notebook reads the CSV result files generated in Step 2.
Visualise the condition embedding space of HTE datasets using dimensionality reduction.
Condition-to-embedding maps must already exist in data/embeddings/evaluation/ (precomputed files are included, or regenerate via Step 1 of Local Reactivity above).
Open notebooks/vis/hte.ipynb, set REACTION_TYPE, EXPERIMENT_NAME, and DATASET, then run all cells.
Available datasets match those listed in the Local Reactivity table.
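The projection step in the notebook amounts to something like the following minimal t-SNE sketch. The random array stands in for a loaded condition-to-embedding map; the real notebook additionally colours points by reaction outcome:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 32))  # stand-in for 60 condition embeddings

coords = TSNE(
    n_components=2,
    perplexity=10,    # must be smaller than the number of points
    init="pca",
    random_state=0,
).fit_transform(emb)
print(coords.shape)  # (60, 2): one 2-D point per condition
```

UMAP works the same way with `umap.UMAP(n_components=2).fit_transform(emb)` if the umap-learn package is installed.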
Measure correlation between learnt component (solvent) embedding dimensions and physical properties.
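The per-dimension analysis can be sketched as follows: rank-correlate each embedding dimension against a property column (Spearman here; the script's exact metric may differ, and all variable names are illustrative):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation via Pearson on ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 8))                      # stand-in solvent embeddings
polarity = emb[:, 2] + 0.1 * rng.normal(size=30)    # property tracking dim 2

corrs = [spearman(emb[:, d], polarity) for d in range(emb.shape[1])]
print(int(np.argmax(np.abs(corrs))))                # dimension most correlated
```

A strongly property-aligned dimension shows up as a single large |correlation| spike across the dimension index.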
With precomputed embeddings (no model needed):

```shell
# test-2048 for the paper results
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    --precomputed_embeddings_path data/embeddings/physical-properties/solvents.h5 \
    --run_baseline
```

With a trained model (optionally saving the embeddings for future use):

```shell
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    -c <checkpoint> \
    --run_baseline \
    --save_embeddings_path data/embeddings/physical-properties/solvents.h5
```

To include out-of-domain solvents, add `--include_ood` and either `--precomputed_ood_embeddings_path <path>` or `--save_ood_embeddings_path <path>`.
Open notebooks/condCL/properties.ipynb, set EXPERIMENT_NAME_AC, and run all cells. The notebook reads the CSV results generated in Step 1 — no model weights are needed.
Study the relationship between HTE plate condition diversity and reaction outcomes.
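One way to turn a Gaussian kernel with a length scale into a plate-level diversity score is via mean pairwise similarity, as in this sketch. This is a plausible definition suggested by the evaluation script's `--kernel_type gaussian` and `--kernel_length_scale` flags, not necessarily the script's exact measure:

```python
import numpy as np

def plate_diversity(embeddings, length_scale=32.0):
    """Diversity of one HTE plate's condition set: 1 minus the mean pairwise
    Gaussian (RBF) kernel similarity of its condition embeddings."""
    X = np.asarray(embeddings, dtype=float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-sq / (2.0 * length_scale ** 2))           # similarity in [0, 1]
    return 1.0 - K.mean()

rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(24, 16))    # near-identical conditions
spread = rng.normal(scale=50.0, size=(24, 16))  # widely varied conditions
print(plate_diversity(tight) < plate_diversity(spread))  # → True
```

A plate of near-duplicate conditions scores close to 0, a plate of mutually dissimilar conditions close to 1, which is the quantity the analysis relates to reaction outcomes.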
With precomputed embeddings (no model needed):
The repository includes embeddings at `data/embeddings/evaluation/diversity.h5`, so running the diversity experiment requires only:

```shell
./evaluation-scripts/run_diversity_experiments.sh
```

Baselines (Morgan and Mordred, no model needed) are included in run_diversity_experiments.sh.
With a trained model (optionally saving the embeddings for future use), replace the learnt section with:

```shell
uv run src/condition_learning/evaluation/diversity.py \
    --hte_dataset_path <dataset>.h5 \
    --reaction_type amide_coupling \
    --method learnt \
    --model_name <experiment_name> \
    --model_checkpoint <checkpoint> \
    --save_embeddings_path data/embeddings/evaluation/diversity/<name>.h5 \
    --max_batch_size 32 \
    --num_repeats 2048 \
    --kernel_type gaussian \
    --kernel_length_scale 32 \
    --output_path path/to/output.csv
```

Open notebooks/condCL/diversity.ipynb and run all cells.
The notebook supports two modes:
- Precomputed embeddings — set `precomputed_embeddings_path` (e.g. `"data/embeddings/evaluation/diversity.h5"`) to skip model loading entirely.
- Trained model — leave `precomputed_embeddings_path = None` and set `experiment_name` / `model_checkpoint`.
Training is configured with Hydra. The main config file is config/condCL/train.yaml. Sub-configs for the data, model architecture, and trainer are in config/condCL/data/, config/condCL/model/, and config/condCL/trainer/ respectively.
The data configs (config/condCL/data/) specify paths to the training datasets and ChemFormer embeddings. You will need to update these paths to point to your own data. See the Data section for instructions on generating the required datasets, and data/README.md for a description of the required file formats.
```shell
uv run src/condition_learning/condCL/train.py experiment_name=<name> [+ OVERRIDES]
```

Any parameter can be overridden from the command line. For example:

```shell
uv run src/condition_learning/condCL/train.py \
    experiment_name=my_experiment \
    data=amide_coupling_smartsrx \
    trainer.num_epochs=500
```

Training produces a models/<experiment_name>/ folder:

```
models/<experiment_name>/
├── model_config.yaml           # Config snapshot used for training
├── last.ckpt                   # Model weights (last epoch)
└── epoch=N-best-{metric}.ckpt  # Best model (by validation metric)
```
After training a model, generate the embedding files required for the analyses above. See create.ipynb for a notebook walkthrough of loading a trained model and generating embeddings. Each analysis has its own embedding format; the sections below describe them individually.
Creates per-condition embedding maps in data/embeddings/evaluation/<reaction_type>_<dataset>/<experiment_name>/.
```shell
# Learnt embeddings (requires trained model):
evaluation-scripts/create_eval_embeddings.sh \
    -e <experiment_name> \
    -m <checkpoint> \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default  # or 'ood', 'bhc'

# Baseline embeddings (OHE, Morgan, Mordred, ChemFormer — no model needed):
evaluation-scripts/create_eval_embeddings.sh \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default \
    -b
```

Saves an .h5 file with component columns and an embedding column. The script will also run the diversity analysis.
```shell
uv run src/condition_learning/evaluation/diversity.py \
    --hte_dataset_path <dataset>.h5 \
    --reaction_type amide_coupling \
    --method learnt \
    --model_name <experiment_name> \
    --model_checkpoint <checkpoint> \
    --save_embeddings_path data/embeddings/evaluation/diversity.h5 \
    --max_batch_size 32 --num_repeats 2048 \
    --kernel_type gaussian --kernel_length_scale 32 \
    --output_path data/condition_diversity/learnt/pfizer1_no_additive_k_32.csv
```

Saves an .h5 file with SMILES, component embeddings, and physical-property columns.
```shell
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    -c <checkpoint> \
    --run_baseline \
    --save_embeddings_path data/embeddings/physical-properties/solvents.h5
```

For a detailed description of the required dataset formats see data/README.md.
Validation datasets are included in this repository under data/datasets/. If you need to re-create the validation datasets, follow create_validation_data.ipynb.
Follow the procedure used to generate the model training data. Note that extract_components.py will need to be replaced for other data sources — the required input to add_heuristic_labels.py is a raw_component_data.csv file with the columns specified in data/README.md.
Run dataset_component.sh for the full extraction pipeline:
- Extract components — `extract_components.py`. Parse a reaction database; see `data/README.md` for the required output format.
- Process roles and counts — `process_component_roles_and_counts.py`
- Label via SMARTS substructure matching — `add_heuristic_labels.py`
- Clean (split ligand/precatalysts, clean solvents/bases) — `clean_components.py`
- Merge into final wide-form dataset — `create_final_dataset.py`
- Create ChemFormer embeddings — `chemformer.py`. Note: this requires a pre-trained ChemFormer model.
```shell
python condition_learning/chemformer.py \
    --input_csv /path/to/processed/components_merged_heuristics.csv \
    --output_path /path/to/processed/chemformer_embeddings.pt \
    --config_path config/chemformer.yaml
```

Run dataset_rxn.sh:
- Extract reactions — `extract_rxns.py`. See `data/README.md` for the required output format.
- Clean conditions — `clean_rxn_conditions.py`
- Filter and compute embeddings — `filter_rxns_and_compute_embs.py`
Final datasets are in .h5 format (`{reaction_type}_data.h5`). To convert back to CSV:

```python
from condition_learning.utils import convert_h5_to_dataframe

df = convert_h5_to_dataframe(h5_path, component_cols)
df.write_csv(csv_path)
```