cdr_bench: Benchmarking Dimensionality Reduction Techniques on Chemical Datasets

Based on the publication:

Orlov, A. A., Akhmetshin, T. N., Horvath, D., Marcou, G., & Varnek, A. "From High Dimensions to Human Insight: Exploring Dimensionality Reduction for Chemical Space Visualization." Molecular Informatics, 2024, 44(1). DOI: 10.1002/minf.202400265

Installation

Requires Python 3.11 and uv.

git clone https://github.com/AxelRolov/cdr_bench.git
cd cdr_bench
uv sync

Quick Usage

# 1. Generate molecular descriptors from SMILES
python scripts/generate_descriptors.py bench_configs/features.toml

# 2. Run benchmarking (grid search optimization)
python scripts/run_benchmarking.py --config bench_configs/run_benchmarking.toml

# 3. Analyze and aggregate results
python scripts/analyze_results.py --input_dir results/ --output_dir results/ --k_hit 20
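
The three steps above can also be chained from Python so that a failure in one step stops the run. The following is a minimal sketch, not part of the repository: the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the script paths and flags are copied from the commands above. It assumes it is run from the repository root.

```python
import subprocess
import sys
from typing import Callable, Sequence


def build_pipeline_commands(
    features_config: str = "bench_configs/features.toml",
    benchmark_config: str = "bench_configs/run_benchmarking.toml",
    results_dir: str = "results/",
    k_hit: int = 20,
) -> list[list[str]]:
    """Return the three Quick Usage commands as argument lists, in order."""
    python = sys.executable  # reuse the interpreter that runs this helper
    return [
        # 1. Generate molecular descriptors from SMILES
        [python, "scripts/generate_descriptors.py", features_config],
        # 2. Run benchmarking (grid search optimization)
        [python, "scripts/run_benchmarking.py", "--config", benchmark_config],
        # 3. Analyze and aggregate results
        [
            python, "scripts/analyze_results.py",
            "--input_dir", results_dir,
            "--output_dir", results_dir,
            "--k_hit", str(k_hit),
        ],
    ]


def run_pipeline(
    commands: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each command in order; check=True raises on a non-zero exit code."""
    for command in commands:
        print("Running:", " ".join(command))
        runner(list(command), check=True)
```

Calling `run_pipeline(build_pipeline_commands())` executes the steps sequentially. The `runner` parameter exists only so the sequence can be exercised without launching the scripts.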

Documentation

Full documentation is available at axelrolov.github.io/cdr_bench.

Project Structure

cdr_bench/
├── src/cdr_bench/          # Core library
│   ├── dr_methods/         # DimReducer wrapper (PCA, UMAP, t-SNE, GTM)
│   ├── optimization/       # Grid search optimizer and parameter definitions
│   ├── scoring/            # Quality metrics (NN overlap, co-ranking, trustworthiness)
│   ├── io_utils/           # HDF5 I/O, config loading, data preprocessing
│   ├── features/           # Descriptor generation (Morgan FP, MACCS, ChemDist)
│   └── visualization/      # Plotting utilities
├── scripts/                # Pipeline scripts
│   ├── run_benchmarking.py
│   ├── generate_descriptors.py
│   ├── analyze_results.py
│   ├── prepare_lolo.py
│   └── analyze_lib_distance_preservation.py
├── bench_configs/          # TOML configuration files
│   ├── run_benchmarking.toml
│   ├── features.toml
│   └── method_configs/     # Per-method hyperparameter grids
├── datasets/               # Sample ChEMBL datasets (HDF5)
├── results/                # Benchmark results and metrics
├── notebooks/              # Jupyter notebooks for analysis
└── tests/                  # Test suite
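
To illustrate the kind of quality metric listed under scoring/, here is a minimal sketch of a nearest-neighbor overlap score: the average fraction of each point's k nearest neighbors in the original space that remain among its k nearest neighbors in the embedding. This is an assumption about the general idea, not the repository's implementation. The function names are hypothetical, the distance is plain Euclidean, and the library's own definition (distance metric, normalization, tie handling) may differ.

```python
import math
from typing import Sequence

Point = Sequence[float]


def nearest_neighbors(points: Sequence[Point], k: int) -> list[set[int]]:
    """For each point, return the indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point; ties are broken by index.
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in ranked[:k]})
    return neighbors


def nn_overlap(high_dim: Sequence[Point], low_dim: Sequence[Point], k: int) -> float:
    """Mean fraction of k nearest neighbors shared between the two spaces."""
    if len(high_dim) != len(low_dim):
        raise ValueError("both spaces must contain the same number of points")
    if not 0 < k < len(high_dim):
        raise ValueError("k must be between 1 and the number of points minus 1")
    high = nearest_neighbors(high_dim, k)
    low = nearest_neighbors(low_dim, k)
    shared = sum(len(h & l) for h, l in zip(high, low))
    return shared / (k * len(high_dim))
```

A score of 1.0 means every neighborhood is preserved exactly. The brute-force search is quadratic in the number of points, so this sketch is only suitable for small examples.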

Datasets

The datasets/ directory contains the ChEMBL subsets used in the study. The full datasets and all embeddings are available on Zenodo.

Citation

@article{orlov2024high,
  title={From High Dimensions to Human Insight: Exploring Dimensionality Reduction for Chemical Space Visualization},
  author={Orlov, Alexey A. and Akhmetshin, Tagir N. and Horvath, Dragos and Marcou, Gilles and Varnek, Alexandre},
  journal={Molecular Informatics},
  volume={44},
  number={1},
  pages={e202400265},
  year={2024},
  doi={10.1002/minf.202400265}
}

Generative Topographic Mapping

The GTM results in the original publication were obtained with an in-house implementation. This repository instead uses the open-source ChemographyKit for GTM. If you use the GTM functionality, please also cite the ChemographyKit publication.

License

MIT
