MolecularAI/Condition_learning

Condition Learning

Learning chemical reaction condition embeddings.

Table of Contents

  • Setup
  • Use
  • Reproducing Publication Results
  • Training a Model
  • Generating Embeddings
  • Data

Setup

CondCL

Create a uv environment and install requirements:

uv sync
uv pip install -e .

Create a .env file with the required environment variables:

CONDITION_LEARNING_DATA_DIR=/path/to/external/data/directory
PATH_TO_SMARTSRX=/path/to/hardcoded/smartsrx.csv
PATH_TO_ENV_MANAGER=/path/to/chemformer/env

ChemFormer (needed to create input embeddings)

Navigate to the aizynthmodels repository and set up the aizynthmodels environment. After creating the environment, install h5py:

pip install h5py

For generating embeddings for new components, see the create.ipynb notebook.


Use

See create.ipynb for a walkthrough of generating condition and component embeddings with a trained model, and using those embeddings to recreate the evaluation embedding files.


Reproducing Publication Results

Quick Start (Precomputed Embeddings)

All analyses can be reproduced without model weights by using the precomputed embedding files included in this repository.

The following precomputed files are included:

  • data/embeddings/evaluation/<reaction_type>_<dataset>/ — condition-to-embedding maps for all HTE datasets (used in Local Reactivity and HTE Maps)
  • data/embeddings/evaluation/diversity.h5 — embeddings for the diversity analysis
  • data/embeddings/physical-properties/solvents.h5 — solvent embeddings for the property correlation analysis, including out-of-domain solvents

If you have a trained model and need to regenerate these files yourself, see Generating Embeddings.
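For orientation, a per-condition embedding map in an .h5 file can be read with h5py along these lines. This is a minimal sketch: the dataset names below ("cond_0", "cond_1") are invented, and the actual file layout is described in data/README.md.

```python
import h5py
import numpy as np

# Write a tiny illustrative file first so the read pattern below is runnable.
# Real files live under data/embeddings/evaluation/ — see data/README.md.
path = "example_condition_map.h5"
with h5py.File(path, "w") as f:
    f.create_dataset("cond_0", data=np.arange(8, dtype=np.float32))
    f.create_dataset("cond_1", data=np.ones(8, dtype=np.float32))

# Load every condition's embedding into a plain dict.
with h5py.File(path, "r") as f:
    emb_map = {key: f[key][()] for key in f}

print(sorted(emb_map), emb_map["cond_0"].shape)
```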

1. Local Reactivity

Evaluate whether the learnt condition representation is more predictive of reaction yield than baseline methods (OHE, Morgan, Mordred) using leave-one-out kNN regression.
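The evaluation protocol can be sketched as follows — a toy leave-one-out kNN yield regression in plain NumPy. This illustrates the idea only; the repository's actual implementation lives under evaluation-scripts/, and the embeddings and yields here are synthetic.

```python
import numpy as np

def loo_knn_r2(X, y, k=3):
    """Leave-one-out kNN regression: predict each point's yield as the
    mean yield of its k nearest neighbours (excluding itself); report R^2."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the held-out point
        nn = np.argsort(d)[:k]
        preds[i] = y[nn].mean()
    ss_res = ((y - preds) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Toy condition embeddings whose "yield" varies smoothly with one dimension:
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = X[:, 0] * 10 + 50
print(round(loo_knn_r2(X, y, k=3), 2))
```

A representation under which similar conditions give similar yields will score a higher LOO R² than one that scatters them, which is what the comparison against the OHE, Morgan, and Mordred baselines measures.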

Datasets

Amide Coupling HTE datasets:

Dataset                  Flag
Doyle                    doyle
Pfizer 1 (no additive)   pfizer1_no_additive
Pfizer 2                 pfizer2
Pfizer 1 (additive)      pfizer1_additive
Pfizer 1 (inc. OOD)      pfizer1_no_additive_inc_out_of_domain
Pfizer 2 (inc. OOD)      pfizer2_inc_out_of_domain

Step 1 — Evaluation embeddings

With precomputed embeddings (no model needed): skip this step — embeddings are already at data/embeddings/evaluation/.

With a trained model, generate embeddings first:

# Learnt embeddings:
evaluation-scripts/create_eval_embeddings.sh \
    -e <experiment_name> \
    -m <checkpoint> \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default          # or 'ood'

# Baseline embeddings (OHE, Morgan, Mordred, ChemFormer — no model needed):
evaluation-scripts/create_eval_embeddings.sh \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default \
    -b

Step 2 — Run the reactivity evaluation

# Learnt representation + baselines:
evaluation-scripts/run_local_reactivity_experiments.sh -r amide_coupling -s default -b

# Learnt representation only (baselines need to be run only once):
evaluation-scripts/run_local_reactivity_experiments.sh -e <experiment_name> -r amide_coupling -s default

For 'out-of-domain' datasets, use -s ood instead of -s default.

Step 3 — Visualise results

Open notebooks/condCL/reactivity.ipynb, set the EXPERIMENT_NAME and SUFFIX variables, and run all cells. The notebook reads the CSV result files generated in Step 2.

2. HTE Maps

Visualise the condition embedding space of HTE datasets using dimensionality reduction.

Prerequisites

Condition-to-embedding maps must already exist in data/embeddings/evaluation/ (precomputed files are included, or regenerate via Step 1 of Local Reactivity above).

Run

Open notebooks/vis/hte.ipynb, set REACTION_TYPE, EXPERIMENT_NAME, and DATASET, then run all cells.

Available datasets match those listed in the Local Reactivity table.

3. Component Property Correlations

Measure correlation between learnt component (solvent) embedding dimensions and physical properties.

Step 1 — Run the property-correlation analysis

With precomputed embeddings (no model needed):

# test-2048 for the paper results
uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    --precomputed_embeddings_path data/embeddings/physical-properties/solvents.h5 \
    --run_baseline

With a trained model (optionally saving embeddings for future use):

uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    -c <checkpoint> \
    --run_baseline \
    --save_embeddings_path data/embeddings/physical-properties/solvents.h5

To include out-of-domain solvents, add --include_ood and either --precomputed_ood_embeddings_path <path> or --save_ood_embeddings_path <path>.

Step 2 — Visualise results

Open notebooks/condCL/properties.ipynb, set EXPERIMENT_NAME_AC, and run all cells. The notebook reads the CSV results generated in Step 1 — no model weights are needed.

4. Diversity Analysis

Study the relationship between HTE plate condition diversity and reaction outcomes.

Step 1 — Run diversity experiments

With precomputed embeddings (no model needed): The repository includes embeddings at data/embeddings/evaluation/diversity.h5, so running the diversity experiment requires only:

./evaluation-scripts/run_diversity_experiments.sh

Baselines (Morgan and Mordred, no model needed) are included in run_diversity_experiments.sh.

With a trained model (optionally saving embeddings for future use), replace the learnt section with:

uv run src/condition_learning/evaluation/diversity.py \
    --hte_dataset_path <dataset>.h5 \
    --reaction_type amide_coupling \
    --method learnt \
    --model_name <experiment_name> \
    --model_checkpoint <checkpoint> \
    --save_embeddings_path data/embeddings/evaluation/diversity/<name>.h5 \
    --max_batch_size 32 \
    --num_repeats 2048 \
    --kernel_type gaussian \
    --kernel_length_scale 32 \
    --output_path path/to/output.csv
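As a rough illustration of the kernel options above (--kernel_type gaussian, --kernel_length_scale), here is one common kernel-based diversity proxy: the mean pairwise Gaussian-kernel similarity of a plate's condition embeddings. The repository's actual metric is defined in src/condition_learning/evaluation/diversity.py and may differ from this sketch; the embeddings below are synthetic.

```python
import numpy as np

def mean_kernel_similarity(E, length_scale=32.0):
    """Average off-diagonal Gaussian-kernel similarity of a set of
    condition embeddings; lower mean similarity => a more diverse plate."""
    E = np.asarray(E, float)
    sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-sq / (2 * length_scale ** 2))             # Gaussian (RBF) kernel
    n = len(E)
    return (K.sum() - n) / (n * (n - 1))                  # exclude self-similarity

rng = np.random.default_rng(1)
tight = rng.normal(0, 1, size=(16, 8))     # near-duplicate conditions
spread = rng.normal(0, 50, size=(16, 8))   # widely spread conditions
print(mean_kernel_similarity(tight), mean_kernel_similarity(spread))
```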

Step 2 — Visualise results

Open notebooks/condCL/diversity.ipynb and run all cells.

The notebook supports two modes:

  • Precomputed embeddings — set precomputed_embeddings_path (e.g. "data/embeddings/evaluation/diversity.h5") to skip model loading entirely.
  • Trained model — leave precomputed_embeddings_path = None and set experiment_name / model_checkpoint.

Training a Model

Config

Training is configured with Hydra. The main config file is config/condCL/train.yaml. Sub-configs for the data, model architecture, and trainer are in config/condCL/data/, config/condCL/model/, and config/condCL/trainer/ respectively.

The data configs (config/condCL/data/) specify paths to the training datasets and ChemFormer embeddings. You will need to update these paths to point to your own data. See the Data section for instructions on generating the required datasets, and data/README.md for a description of the required file formats.

Train

uv run src/condition_learning/condCL/train.py experiment_name=<name> [+ OVERRIDES]

Any parameter can be overridden from the command line. For example:

uv run src/condition_learning/condCL/train.py \
    experiment_name=my_experiment \
    data=amide_coupling_smartsrx \
    trainer.num_epochs=500

Training produces a models/<experiment_name>/ folder:

models/<experiment_name>/
├── model_config.yaml           # Config snapshot used for training
├── last.ckpt                   # Model weights (last epoch)
└── epoch=N-best-{metric}.ckpt  # Best model (by validation metric)

Generating Embeddings

After training a model, generate the embedding files required for the analyses above. See create.ipynb for a notebook walkthrough of loading a trained model and generating embeddings. Each analysis has its own embedding format; the sections below describe them individually.

Evaluation Embeddings (Local Reactivity & HTE Maps)

Creates per-condition embedding maps in data/embeddings/evaluation/<reaction_type>_<dataset>/<experiment_name>/.

# Learnt embeddings (requires trained model):
evaluation-scripts/create_eval_embeddings.sh \
    -e <experiment_name> \
    -m <checkpoint> \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default          # or 'ood', 'bhc'

# Baseline embeddings (OHE, Morgan, Mordred, ChemFormer — no model needed):
evaluation-scripts/create_eval_embeddings.sh \
    -r amide_coupling \
    -p <chemformer_emb_path> \
    -s default \
    -b

Diversity Embeddings

Saves an .h5 file with component columns and an embedding column. The script will also run the diversity analysis.

uv run src/condition_learning/evaluation/diversity.py \
    --hte_dataset_path <dataset>.h5 \
    --reaction_type amide_coupling \
    --method learnt \
    --model_name <experiment_name> \
    --model_checkpoint <checkpoint> \
    --save_embeddings_path data/embeddings/evaluation/diversity.h5 \
    --max_batch_size 32 --num_repeats 2048 \
    --kernel_type gaussian --kernel_length_scale 32 \
    --output_path data/condition_diversity/learnt/pfizer1_no_additive_k_32.csv

Property Embeddings

Saves an .h5 file with SMILES, component embeddings, and physical-property columns.

uv run src/condition_learning/evaluation/properties.py \
    -e <experiment_name> \
    -c <checkpoint> \
    --run_baseline \
    --save_embeddings_path data/embeddings/physical-properties/solvents.h5
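The correlation analysis itself can be sketched as follows: score each embedding dimension against a physical property with Spearman rank correlation. Everything here is a toy illustration — the embeddings, the "boiling point" values, and the choice of dimension 2 as the informative one are all invented.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

# Toy solvent embeddings (n_solvents x n_dims) and a property vector
# constructed so that dimension 2 tracks the property:
rng = np.random.default_rng(2)
emb = rng.normal(size=(30, 8))
bp = 3 * emb[:, 2] + rng.normal(scale=0.1, size=30)   # fake "boiling point"

corrs = [abs(spearman(emb[:, d], bp)) for d in range(emb.shape[1])]
print(int(np.argmax(corrs)))   # index of the most property-correlated dimension
```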

Data

For a detailed description of the required dataset formats see data/README.md.

Validation Data

Validation datasets are included in this repository under data/datasets/. If you need to re-create the validation datasets, follow create_validation_data.ipynb.

Training Data

The steps below reproduce the procedure used to generate the model training data. Note that extract_components.py will need to be replaced for other data sources — the required input to add_heuristic_labels.py is a raw_component_data.csv file with the columns specified in data/README.md.

Components

Run dataset_component.sh for the full extraction pipeline:

  1. Extract components — extract_components.py. Parse a reaction database; see data/README.md for the required output format.
  2. Process roles and counts — process_component_roles_and_counts.py
  3. Label via SMARTS substructure matching — add_heuristic_labels.py
  4. Clean (split ligand/precatalysts, clean solvents/bases) — clean_components.py
  5. Merge into final wide-form dataset — create_final_dataset.py
  6. Create ChemFormer embeddings — chemformer.py. Note: this requires a pre-trained ChemFormer model.

ChemFormer

python condition_learning/chemformer.py \
    --input_csv /path/to/processed/components_merged_heuristics.csv \
    --output_path /path/to/processed/chemformer_embeddings.pt \
    --config_path config/chemformer.yaml

Reactions

Run dataset_rxn.sh:

  1. Extract reactions — extract_rxns.py. See data/README.md for the required output format.
  2. Clean conditions — clean_rxn_conditions.py
  3. Filter and compute embeddings — filter_rxns_and_compute_embs.py

Final datasets are in .h5 format ({reaction_type}_data.h5). To convert back to CSV:

from condition_learning.utils import convert_h5_to_dataframe

df = convert_h5_to_dataframe(h5_path, component_cols)
df.write_csv(csv_path)
