This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Patho-Bench-cli is a unified downloader for Patho-Bench pathology datasets. It downloads whole slide images (WSIs) from various medical imaging archives, verifies them with OpenSlide, and generates embeddings using TRIDENT.
# Install
uv pip install -e .
uv pip install -e ".[dev]" # with dev dependencies
# Run tests
pytest
# CLI usage
patho-bench-cli tasks # Download task definitions from HuggingFace
patho-bench-cli list # List all providers
patho-bench-cli list <provider> # List datasets for a provider
patho-bench-cli download <provider> -o ./slides # Dry-run (creates manifest)
patho-bench-cli download <provider> --download # Actually download
patho-bench-cli verify ./slides # Verify WSI files can be opened
patho-bench-cli embed <provider> --slides-dir ./slides --embeddings-dir ./embeddings
The core abstraction is DatasetProvider (in patho_bench_cli/providers/base.py), an abstract base class that each data source implements:
- Required properties: name, description, datasets
- Required methods: list_tasks(), get_slide_ids_for_tasks(), download_slides(), download_full()
- Optional override: get_storage_directories() for providers with subdirectory organization
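As a rough illustration of this interface, a provider skeleton might look like the sketch below. The method signatures, return types, and the ToyProvider class are assumptions made for illustration; the real definitions live in patho_bench_cli/providers/base.py and may differ.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class DatasetProvider(ABC):
    """Sketch of the provider interface; signatures are assumed, not copied."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def description(self) -> str: ...

    @property
    @abstractmethod
    def datasets(self) -> list[str]: ...

    @abstractmethod
    def list_tasks(self) -> list[str]: ...

    @abstractmethod
    def get_slide_ids_for_tasks(self, tasks: list[str]) -> set[str]: ...

    @abstractmethod
    def download_slides(self, slide_ids: set[str], output_dir: Path) -> None: ...

    @abstractmethod
    def download_full(self, output_dir: Path) -> None: ...

    def get_storage_directories(self, output_dir: Path) -> list[Path]:
        # Assumed default: slides live directly in output_dir.
        return [output_dir]


class ToyProvider(DatasetProvider):
    """Hypothetical in-memory provider used only to exercise the interface."""

    _TASKS = {"toy/grade": {"slide_001", "slide_002"}, "toy/stage": {"slide_002"}}

    @property
    def name(self) -> str:
        return "toy"

    @property
    def description(self) -> str:
        return "In-memory example provider"

    @property
    def datasets(self) -> list[str]:
        return ["toy"]

    def list_tasks(self) -> list[str]:
        return sorted(self._TASKS)

    def get_slide_ids_for_tasks(self, tasks: list[str]) -> set[str]:
        # Union of slide IDs, so a slide shared by two tasks appears once.
        return set().union(*(self._TASKS[t] for t in tasks))

    def download_slides(self, slide_ids: set[str], output_dir: Path) -> None:
        raise NotImplementedError("a real provider would fetch files here")

    def download_full(self, output_dir: Path) -> None:
        raise NotImplementedError("a real provider would fetch files here")
```

Declaring the required members as abstract means a provider that forgets one fails at instantiation time rather than partway through a download.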
Providers are registered in patho_bench_cli/providers/registry.py via _auto_register(). The registry provides get_provider(name) and list_providers().
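A name-keyed registry of this kind can be sketched in a few lines. The internals below (a module-level dict, the duplicate-name check, the error message) are assumptions for illustration and do not reproduce registry.py or _auto_register().

```python
from types import SimpleNamespace

# Assumed storage: provider instances keyed by their name property.
_PROVIDERS: dict[str, object] = {}


def register_provider(provider) -> None:
    """Add a provider instance, refusing duplicate names."""
    if provider.name in _PROVIDERS:
        raise ValueError(f"Provider {provider.name!r} is already registered")
    _PROVIDERS[provider.name] = provider


def list_providers() -> list[str]:
    """Return registered provider names in alphabetical order."""
    return sorted(_PROVIDERS)


def get_provider(name: str):
    """Look up a provider by name, listing the valid names on failure."""
    try:
        return _PROVIDERS[name]
    except KeyError:
        raise KeyError(
            f"Unknown provider {name!r}; available: {list_providers()}"
        ) from None


# Stand-in objects; real entries would be DatasetProvider instances.
register_provider(SimpleNamespace(name="panda"))
register_provider(SimpleNamespace(name="cptac"))
```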
Current providers:
- cptac - TCIA PathDB API (multiple cancer types)
- panda - Kaggle API
- idr - BioImage Archive HTTP
- bioimage - EBI BioImage Archive
- ovarian_bevacizumab, post_nat_brca - TCIA collections
- imp, bcnb, hancock, boehmk - Various sources
patho_bench_cli/cli.py contains all subcommand handlers:
- cmd_list() - List providers/datasets
- cmd_download() - Download slides with optional verification and symlink creation
- cmd_tasks() - Download Patho-Bench task definitions from HuggingFace
- cmd_verify() - Validate WSI files using OpenSlide
- cmd_embed() - Generate embeddings using TRIDENT subprocess
Task files: Patho-Bench tasks are stored as TSV files at tasks/{dataset}/{task}/k=all.tsv with columns including slide_id and case_id.
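Reading the slide IDs for one task needs only the standard library. This is a minimal sketch: read_slide_ids is a hypothetical helper, not a function from the repository, and it assumes the TSV has a header row containing slide_id.

```python
import csv
from pathlib import Path


def read_slide_ids(tasks_root: Path, dataset: str, task: str) -> list[str]:
    """Return the unique slide_id values of one task, in file order."""
    tsv_path = tasks_root / dataset / task / "k=all.tsv"
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    with tsv_path.open(newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            seen.setdefault(row["slide_id"], None)
    return list(seen)
```

For example, read_slide_ids(Path("tasks"), dataset, task) would read tasks/{dataset}/{task}/k=all.tsv.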
Symlink organization: The --create-symlinks flag creates by_task/{dataset}/{task}/ directories with symlinks to actual slide files, avoiding duplication when slides appear in multiple tasks.
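The layout can be sketched as follows. link_slides_for_task and its parameters are hypothetical names for illustration; the actual logic in cmd_download() may resolve slide paths differently.

```python
from pathlib import Path


def link_slides_for_task(
    slides_dir: Path,
    by_task_root: Path,
    dataset: str,
    task: str,
    slide_files: list[str],
) -> list[Path]:
    """Create by_task/{dataset}/{task}/<file> symlinks pointing at real slides."""
    task_dir = by_task_root / dataset / task
    task_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for filename in slide_files:
        target = (slides_dir / filename).resolve()
        link = task_dir / filename
        # Skip links that already exist so repeated runs are harmless.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(target)
        links.append(link)
    return links
```

A slide that belongs to two tasks ends up with two links but only one copy on disk.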
Slide verification: verify_slides_in_parallel() uses ThreadPoolExecutor to check WSIs can be opened with OpenSlide, validates MPP/magnification metadata, and performs trial reads at multiple random locations to detect corruption.
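The general shape of such a check is sketched below. This is not the repository's verify_slides_in_parallel(); the function names, the number of trial reads, the tile size, and the rule that a slide must carry either MPP or objective-power metadata are all assumptions. OpenSlide is imported lazily so the module loads without it installed.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Optional


def check_slide_with_openslide(
    path: Path, n_reads: int = 3, tile: int = 256
) -> Optional[str]:
    """Return None if the slide looks healthy, else a short error message."""
    import openslide  # requires openslide-python and the OpenSlide library

    try:
        with openslide.OpenSlide(str(path)) as slide:
            props = slide.properties
            has_mpp = openslide.PROPERTY_NAME_MPP_X in props
            has_mag = openslide.PROPERTY_NAME_OBJECTIVE_POWER in props
            if not (has_mpp or has_mag):
                return "missing MPP and magnification metadata"
            width, height = slide.dimensions
            for _ in range(n_reads):
                # Read a small level-0 tile at a random location.
                x = random.randint(0, max(0, width - tile))
                y = random.randint(0, max(0, height - tile))
                slide.read_region((x, y), 0, (tile, tile))
    except Exception as exc:  # corrupt files surface as several error types
        return f"{type(exc).__name__}: {exc}"
    return None


def verify_in_parallel(
    paths: list[Path],
    check: Callable[[Path], Optional[str]] = check_slide_with_openslide,
    max_workers: int = 4,
) -> dict[Path, Optional[str]]:
    """Run check on every path in a thread pool; map path -> error or None."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(check, paths)))
```

Passing the checker as a parameter keeps the threading logic testable with a stub, without real slide files.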
Retry logic: patho_bench_cli/utils.py provides download_file_with_retry() and retry_on_timeout() decorator using tenacity for transient network errors.
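The retry pattern itself can be illustrated without dependencies. The sketch below is a hand-rolled exponential-backoff decorator, not the tenacity-based code in utils.py; retry_with_backoff and its parameters are hypothetical names chosen for this example.

```python
import functools
import time


def retry_with_backoff(max_attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Retry the wrapped call on TimeoutError, doubling the wait each time."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TimeoutError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Only TimeoutError triggers a retry here; any other exception propagates immediately, so permanent failures such as a missing file are not retried.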
To add a new provider:
- Create patho_bench_cli/providers/<name>.py implementing DatasetProvider
- Register it in patho_bench_cli/providers/registry.py by importing it and calling register_provider()
- The provider must handle the mapping between Patho-Bench slide_ids and the source's file identifiers
- TRIDENT: Git submodule at ./TRIDENT/ used for embedding generation via run_batch_of_slides.py
- Patho-Bench: Git submodule at ./Patho-Bench/ for a future bench command that will run benchmarking with an embeddings directory argument