
Refactor applications/DynaCell into a reusable benchmark package#404

Draft
alxndrkalinin wants to merge 320 commits into modular-viscy-staging from dynacell-models

Conversation


@alxndrkalinin alxndrkalinin commented Apr 7, 2026

Summary

Makes applications/dynacell into the reusable dynacell package in VisCy.

Splits config responsibilities more cleanly:

  • VisCy keeps reusable dynacell code plus generic model recipes/examples.
  • Paper-specific benchmark state such as SEC61B run configs is removed from VisCy and is expected to live in dynacell-paper.

This PR also adds opt-in validation loss support for CellDiff flow-matching so benchmark reruns can log a scalar validation curve without forcing extra validation work on every flow-matching job.

What changed

Config and package reorg

  • Move reusable configs from applications/dynacell/examples/configs/ into applications/dynacell/configs/.
  • Keep configs/recipes/ for reusable fragments and configs/examples/ for generic runnable examples.
  • Refresh generic fit/predict examples for CellDiff, FNet3D, and UNetViT3D, and add generic UNeXt2 fit/predict examples.
  • Remove committed lightning_logs/, SEC61B paper configs, and the hard-coded hcs_sec61b_3d.yml recipe.
  • Update the dynacell README, config-discovery tests, and local ignore rules to match the new layout.

Absorb reusable benchmark modules from dynacell-paper

  • Add dynacell.data with path-based manifest/collection/spec loaders and Pydantic schemas.
  • Add dynacell.reporting with Hydra config, tables, figures, and tests.
  • Add dynacell.evaluation with configs, spectral-PCC tooling, I/O helpers, segmentation, metrics, and tests.
  • Add dynacell.preprocess with the reusable config.py and zarr_utils.py utilities.

Package and CLI shape

  • Extend the dynacell CLI so evaluate and report route to Hydra entry points while fit/predict/test/validate still go through viscy_utils.
  • Ship Hydra _configs inside the Python package so installed dynacell evaluate/report commands resolve config from package data instead of the source tree.
  • Make dynacell.__init__ lazy so lightweight imports do not pull in the full training stack.
  • Add optional dependency extras for eval, report, and preprocess, plus clearer missing-extra install hints from the CLI router.

Runtime, training, and transform fixes

  • Add overwrite= support to HCSPredictionWriter, including explicit handling/logging when prediction channels already exist.
  • Adjust VisCyCLI checkpoint hparam precedence so predict/test/validate honor user config over stale checkpoint values, while fit resume still keeps checkpoint training hparams.
  • Add direct ckpt_path loading to DynacellFlowMatching so predict-time settings come from config rather than checkpoint metadata.
  • Add opt-in compute_validation_loss to DynacellFlowMatching. When enabled it logs loss/val/<idx> and aggregate loss/validate while preserving epoch-end ODE sample generation; when disabled it keeps the previous cheaper validation path.
  • Extract a shared _aggregate_validation_losses() helper so UNet and flow-matching use the same weighted aggregation logic.
  • Skip the strict HCS spatial-shape check when GPU augmentations intentionally handle cropping.
  • Extend BatchedRandAffined with configurable padding_mode, safe_crop_size, and safe_crop_coverage, plus warnings/tests around unsupported X/Y-rotation guarantees.
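Of the items above, the shared `_aggregate_validation_losses()` helper most plausibly reduces to a sample-count-weighted mean across dataloaders; a hedged sketch (the helper name comes from the PR, but this signature and body are illustrative, not the actual implementation):

```python
def aggregate_validation_losses(losses: list[float], counts: list[int]) -> float:
    """Sample-count-weighted mean of per-dataloader validation losses.

    Illustrative stand-in for the shared _aggregate_validation_losses()
    helper described above; the real signature may differ.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no validation samples seen")
    # Weight each dataloader's mean loss by how many samples produced it,
    # so small dataloaders do not dominate the aggregate.
    return sum(loss * n for loss, n in zip(losses, counts)) / total
```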

Evaluation and reporting follow-up fixes

  • Fix corr_coef/PCC computation and add regression tests.
  • Make cached evaluate_model() returns consistent with fresh evaluation output types.
  • Fix report barplots for mixed-result inputs where some models are missing requested metrics.
  • Respect use_gpu in evaluation instead of always taking the CUDA path whenever a GPU is visible.
  • Raise FileNotFoundError for missing preprocess configs instead of failing with a less informative error.

Benchmark schema + launcher (landed earlier on this branch)

  • Commit the configs/benchmarks/virtual_staining/ layout with shared axes (train_sets/, targets/, model_overlays/, launcher_profiles/, predict_sets/) and per-leaf train/predict files composed via viscy_utils.compose.load_composed_config.
  • Migrate the CellDiff train + predict configs for ER/mito/nucleus/membrane onto schema leaves; add FNet3D paper, UNetViT3D, and UNeXt2 train leaves for ER.
  • Add applications/dynacell/tools/submit_benchmark_job.py with --dry-run, --print-script, --print-resolved-config, and --override. It composes a leaf, strips the launcher: + benchmark: reserved keys, freezes the resolved config to {run_root}/resolved/, and renders an sbatch script to {run_root}/slurm/.
  • Extend _maybe_compose_config in viscy_utils/cli.py so direct dynacell fit -c <leaf> also strips those reserved keys.
  • Add a ckpt_sha256_12 sidecar to the evaluation cache so repeated lookups skip full-file hashing.

Topology / trainer-recipe ownership cleanup

Four commits (9734d07 → f9b8f1e → 5b7eaae → 3fdb7cf) untangle three layers that were previously mingled in recipes/trainer/fit_*.yml:

  • Runtime profile renamed: runtime_single_gpu.yml → runtime_shared.yml. The file's content (srun plus the PYTHONUNBUFFERED/NCCL_DEBUG/PYTHONFAULTHANDLER env vars) has nothing single-GPU-specific, and every leaf, including the 4-GPU UNeXt2 leaf, composed it. Rename + 14 leaf-base updates + schema-doc references.
  • New topology layer (recipes/topology/single_gpu.yml + ddp_4gpu.yml) owns trainer.accelerator/devices/strategy/num_nodes. Duplicated independently under applications/dynacell/configs/recipes/topology/ and applications/cytoland/examples/configs/recipes/topology/ because CLAUDE.md forbids cross-application imports.
  • Unified trainer recipes: recipes/trainer/fit.yml + predict.yml per app, replacing fit_1gpu/fit_4gpu/fit_fm_4gpu/predict_gpu. Mode invariants only (seed, log cadence, callbacks, WandbLogger class pinned with a per-app project). Topology and precision are owned by the topology recipes and model overlays respectively. Every benchmark model overlay and example leaf now composes [fit.yml + topology/*.yml] and sets precision + max_epochs explicitly.
  • Hardware profiles (hardware_h200_single, hardware_gpu_any_long, hardware_4gpu) drop their trainer: block; they now own only launcher.sbatch.*.
  • Intentional behavior deltas (spelled out, not masked): strategy: ddp → auto on every single-GPU leaf (Lightning-equivalent at devices=1, just intent-honest); dynacell examples fnet3d/unetvit3d/unext2/fit.yml and benchmark unext2.yml now pin WandbLogger project=dynacell instead of falling through to Lightning's default TensorBoard; fit_1gpu-derived paths move from save_top_k: 4 to save_top_k: 5.
  • Preserved via leaf-body overrides: FNet3D paper keeps precision: 32-true; examples/celldiff/fit.yml keeps FM-style epoch-based checkpointing (save_top_k: -1, every_n_epochs: 10, no monitor); cytoland FCMAE pretraining keeps strategy: ddp_find_unused_parameters_true; cytoland→dynacell bridge configs keep project: dynacell override over cytoland's default.
  • LEGACY deletion: applications/dynacell/tools/LEGACY/examples_configs/ (pre-schema CellDiff configs) is removed outright per CLAUDE.md's "avoid backwards-compatibility hacks" rule. The equivalence tests that composed LEGACY (test_*_leaf_matches_legacy, test_fnet3d_paper_leaf_matches_ran_config, test_byte_equivalence_sec61b_train_leaf) are replaced with forward-looking composition sanity tests.
  • Plan + review trail: plan is in ~/.claude/plans/vectorized-sleeping-clock.md; commits ran through /review-plan and /verify-plan subagent rounds before landing.

Dependency and packaging updates

  • Pin microssim to a compatible Git revision and refresh uv.lock so the new dynacell[eval] extra resolves with the current workspace.
  • Allow direct references in hatch metadata so the workspace can build/package those dependency pins cleanly.

Why

This is the VisCy side of the dynacell-paper split:

  • dynacell in VisCy becomes the reusable runtime.
  • dynacell-paper owns benchmark-instance configs, orchestration, and manuscript-specific assets.

That gives us a cleaner boundary for future work like multi-experiment collections and benchmark-specific runners without keeping paper state in the shared application package.

The flow-matching validation-loss change is intentionally opt-in because it adds extra validation compute and the resulting metric is still stochastic; default flow-matching checkpointing remains epoch-based.

The topology / trainer-recipe cleanup at the end removes a silent-drop-DDP trap: before this change, pairing a mismatched trainer recipe and benchmark hardware profile let the hardware profile's trainer.devices win (both layers redundantly set it), which could silently run a DDP training on 1 GPU. Topology is now the single source of truth.

Verification

  • uvx ruff check applications/dynacell/
  • uvx ruff format --check applications/dynacell/
  • .venv/bin/python -m pytest applications/dynacell/tests -q
  • .venv/bin/python -m pytest applications/dynacell/tests/test_data_manifests.py applications/dynacell/tests/test_reporting_tables.py applications/dynacell/tests/test_reporting_tables_extended.py applications/dynacell/tests/test_reporting_figures.py applications/dynacell/tests/test_evaluation_metrics.py applications/dynacell/tests/test_evaluation_pipeline.py applications/dynacell/tests/test_evaluation_io.py applications/dynacell/tests/test_preprocess_config.py applications/dynacell/tests/test_preprocess_zarr_utils.py applications/dynacell/tests/test_cli_routing.py -q
  • .venv/bin/python -m pytest applications/dynacell/tests/test_engine.py -q
  • .venv/bin/python -m pytest applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_validation_loss_keeps_generation applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_warmup_cosine_fast_dev_run applications/dynacell/tests/test_training_integration.py::test_celldiff_fm_constant_schedule_fast_dev_run -q
  • .venv/bin/dynacell fit --print_config -c applications/dynacell/configs/examples/unext2/fit.yml
  • uv run pytest applications/dynacell/tests/test_benchmark_config_composition.py applications/dynacell/tests/test_submit_benchmark_job.py -v (benchmark schema + submit-tool tests)
  • Topology/trainer cleanup end-to-end:
    • Composition snapshot of 35 touched leaves before vs after the cleanup; only expected intentional deltas (strategy ddp → auto at devices=1, WandbLogger added where previously default TB).
    • uv run dynacell fit -c <benchmark-unext2-leaf> --trainer.fast_dev_run=true --trainer.devices=1 --trainer.strategy=auto ran one training step cleanly on an A40 interactive node (loss 1.048).
    • submit_benchmark_job.py <benchmark-unext2-leaf> queued a real 4-GPU DDP job (31122607) via the new schema path.

alxndrkalinin changed the title from "Dynacell models" to "Refactor applications/DynaCell into a reusable benchmark package" on Apr 14, 2026.
alxndrkalinin requested a review from Copilot on April 14, 2026 at 22:35.

Copilot AI left a comment


Pull request overview

This PR refactors applications/dynacell from a thin training app into a reusable dynacell benchmark package, reorganizing configs and bringing evaluation/reporting/preprocess/data modules (with tests) into VisCy while updating shared runtime utilities to support the new package shape.

Changes:

  • Reorganize Dynacell configs into configs/recipes/ + configs/examples/, remove paper-/HPC-specific SEC61B artifacts, and update docs/tests for config discovery.
  • Add reusable dynacell.data, dynacell.evaluation, dynacell.reporting, and dynacell.preprocess modules plus substantial test coverage.
  • Update shared VisCy components (CLI checkpoint merging, prediction writer overwrite behavior, affine augmentation safety options, HCS training shape checks) to support the new workflows.

Reviewed changes

Copilot reviewed 74 out of 85 changed files in this pull request and generated 7 comments.

Summary per file:
packages/viscy-utils/src/viscy_utils/cli.py Adjust checkpoint-hparam vs user-config precedence during LightningCLI parsing.
packages/viscy-utils/src/viscy_utils/callbacks/prediction_writer.py Add overwrite option to allow reusing existing output stores.
packages/viscy-transforms/src/viscy_transforms/_affine.py Add configurable padding_mode and safe-crop scale clamping for GPU affine aug.
packages/viscy-transforms/tests/test_affine.py Extend affine tests to cover scale-floor computation and safe-crop behavior.
packages/viscy-data/src/viscy_data/hcs.py Skip strict training shape check when GPU augmentations handle cropping.
applications/dynacell/src/dynacell/main.py Route evaluate/report to Hydra entry points; keep fit/predict via viscy_utils.
applications/dynacell/src/dynacell/engine.py Add ckpt_path init-time loading to decouple predict-time settings from ckpt hparams.
applications/dynacell/src/dynacell/data/_yaml.py Add shared OmegaConf→Pydantic YAML loader.
applications/dynacell/src/dynacell/data/manifests.py Add dataset manifest + split schemas and path-based loaders.
applications/dynacell/src/dynacell/data/collections.py Add frozen benchmark collection schemas and loader.
applications/dynacell/src/dynacell/data/specs.py Add benchmark-spec schema/loader for tying pipeline stages together.
applications/dynacell/src/dynacell/data/__init__.py Re-export dynacell.data public API.
applications/dynacell/src/dynacell/evaluation/__init__.py Initialize evaluation package.
applications/dynacell/src/dynacell/evaluation/pipeline.py Add Hydra-based evaluation pipeline with caching + output writing.
applications/dynacell/src/dynacell/evaluation/metrics.py Add pixel/mask/feature metric computations (optional deps guarded).
applications/dynacell/src/dynacell/evaluation/utils.py Add feature extractors + plotting and distribution-distance helpers.
applications/dynacell/src/dynacell/evaluation/io.py Add eval I/O dispatch helpers (Zarr vs TIFF-like) and preprocessing utilities.
applications/dynacell/src/dynacell/evaluation/segmentation.py Add segmentation workflows with optional dependency guards.
applications/dynacell/src/dynacell/evaluation/formatting.py Add DataFrame formatting utilities for evaluation outputs.
applications/dynacell/src/dynacell/evaluation/torch_ssim.py Add torch-native SSIM implementation used by metrics.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/__init__.py Initialize spectral_pcc subpackage.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/plot_shading_analysis.py Add analysis plotting script for shading artifact comparison.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/plot_combined.py Add combined metrics plotting + summary script.
applications/dynacell/src/dynacell/evaluation/spectral_pcc/diagnostic_real.py Add Hydra diagnostic workflow for real-data spectra/metrics analysis.
applications/dynacell/src/dynacell/reporting/__init__.py Re-export reporting public API.
applications/dynacell/src/dynacell/reporting/tables.py Add table aggregation + LaTeX rendering utilities.
applications/dynacell/src/dynacell/reporting/figures.py Add matplotlib figure generation for aggregated metrics.
applications/dynacell/src/dynacell/reporting/cli.py Add Hydra reporting CLI entry point and config path wiring.
applications/dynacell/src/dynacell/preprocess/__init__.py Re-export preprocess utilities.
applications/dynacell/src/dynacell/preprocess/config.py Add OmegaConf-based preprocess config loader (with fallback).
applications/dynacell/src/dynacell/preprocess/zarr_utils.py Add OME-Zarr rewrite utility (rechunk/reshard).
applications/dynacell/configs/examples/unext2/fit.yml Add generic UNeXt2 fit example in new config layout.
applications/dynacell/configs/examples/unext2/predict.yml Add generic UNeXt2 predict example with init-time ckpt_path.
applications/dynacell/configs/examples/fnet3d/fit.yml Update FNet3D fit example paths to new layout.
applications/dynacell/configs/examples/fnet3d/predict.yml Add generic FNet3D predict example with init-time ckpt_path.
applications/dynacell/configs/examples/unetvit3d/fit.yml Update UNetViT3D fit example paths to new layout.
applications/dynacell/configs/examples/unetvit3d/predict.yml Update UNetViT3D predict example paths and use init-time ckpt_path.
applications/dynacell/configs/examples/celldiff/fit.yml Update CellDiff fit example paths to new layout.
applications/dynacell/configs/examples/celldiff/predict.yml Add CellDiff predict example with init-time ckpt_path and overlap config.
applications/dynacell/configs/recipes/trainer/fit_1gpu.yml Add reusable 1-GPU trainer recipe.
applications/dynacell/configs/recipes/trainer/fit_4gpu.yml Add reusable 4-GPU trainer recipe.
applications/dynacell/configs/recipes/trainer/fit_fm_4gpu.yml Add reusable 4-GPU flow-matching trainer recipe.
applications/dynacell/configs/recipes/trainer/predict_gpu.yml Remove ckpt_path from trainer recipe (moved to model init args).
applications/dynacell/configs/recipes/models/unext2_3d.yml Add reusable UNeXt2 model recipe (z=15).
applications/dynacell/configs/recipes/models/unext2_3d_z8.yml Add reusable UNeXt2 model recipe (z=8).
applications/dynacell/configs/recipes/models/fnet3d.yml Add reusable FNet3D model recipe.
applications/dynacell/configs/recipes/models/fnet3d_z8.yml Add reusable FNet3D model recipe (z=8).
applications/dynacell/configs/recipes/models/unetvit3d.yml Add reusable UNetViT3D model recipe.
applications/dynacell/configs/recipes/models/celldiff_fm.yml Add reusable CellDiff flow-matching model recipe.
applications/dynacell/configs/recipes/modes/spotlight.yml Add Spotlight mode recipe for foreground-aware loss.
applications/dynacell/configs/recipes/data/hcs_phase_fluor_3d.yml Adjust data recipe defaults (e.g., preload false).
applications/dynacell/configs/evaluate/eval.yaml Add Hydra eval base config.
applications/dynacell/configs/evaluate/spectral_pcc/base.yaml Add Hydra config for spectral PCC workflows.
applications/dynacell/configs/evaluate/spectral_pcc/simulate.yaml Add simulation config for metric validation.
applications/dynacell/configs/evaluate/spectral_pcc/diagnostic_real.yaml Add real-data diagnostic config.
applications/dynacell/configs/report/base.yaml Add Hydra report config.
applications/dynacell/tests/test_training_integration.py Update config discovery test to new configs/examples location.
applications/dynacell/tests/test_cli_routing.py Add tests for CLI routing between Lightning and Hydra subcommands.
applications/dynacell/tests/test_data_manifests.py Add tests for dataset manifest/collection/spec schemas and loaders.
applications/dynacell/tests/test_evaluation_io.py Add tests for eval I/O dispatching based on path type.
applications/dynacell/tests/test_evaluation_metrics.py Add regression tests for shared-scale pixel metrics behavior.
applications/dynacell/tests/test_evaluation_pipeline.py Add caching regression test for evaluation pipeline.
applications/dynacell/tests/test_reporting_tables.py Add tests for reporting tables loading/aggregation/LaTeX output.
applications/dynacell/tests/test_reporting_tables_extended.py Add extended reporting table tests (caption/label, lower-is-better).
applications/dynacell/tests/test_reporting_figures.py Add tests for figure creation/saving and empty-data behavior.
applications/dynacell/tests/test_preprocess_config.py Add tests for preprocess config loader fallback behavior.
applications/dynacell/tests/test_preprocess_zarr_utils.py Add tests for zarr rewrite correctness (data, chunks, metadata, shards).
applications/dynacell/README.md Update usage + config layout documentation; remove paper-specific SEC61B instructions.
applications/dynacell/.gitignore Ignore runtime artifacts (lightning_logs, outputs, pycache).
applications/dynacell/pyproject.toml Add optional extras for eval/report/preprocess and allow direct references.
applications/dynacell/examples/configs/sec61b/run_unext2_continue.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_unext2.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_fnet3d_paper.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/run_fnet3d.slurm Remove paper/HPC-specific SEC61B SLURM script.
applications/dynacell/examples/configs/sec61b/fit_unext2.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_fnet3d_paper.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_fnet3d.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/sec61b/fit_celldiff.yml Remove paper-specific SEC61B config.
applications/dynacell/examples/configs/recipes/data/hcs_sec61b_3d.yml Remove hard-coded SEC61B data recipe.
applications/dynacell/examples/configs/fnet3d/predict.yml Remove old predict example path (superseded by new configs layout).
applications/dynacell/examples/configs/celldiff/predict.yml Remove old predict example path (superseded by new configs layout).
CLAUDE.md Update repo/dev workflow documentation and conventions.



Copilot AI left a comment


Pull request overview

Copilot reviewed 78 out of 91 changed files in this pull request and generated 5 comments.



alxndrkalinin added a commit that referenced this pull request Apr 16, 2026
Addresses two PR #404 review findings:

1. `_CacheContext._manifest_dirty` was mutated directly from helper
   call sites, leaking implementation detail. Adds `mark_manifest_dirty`
   and `consume_manifest_dirty` methods and routes every external touch
   through them. Only the dataclass itself now references the private
   field.

2. `resolve_dynaclr_encoder_cfg` used `except Exception` to detect a
   missing nested config key — wider than needed and against CLAUDE.md
   guidance. Replaced with `OmegaConf.select(..., default=None)`, which
   handles missing keys natively without a try/except.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alxndrkalinin added a commit that referenced this pull request Apr 16, 2026
Addresses a PR #404 review finding: the split GT/pred feature API had
structural tests (empty inputs, column-drop, shape mismatch) but no
pinned-value regression guard. Adds two tests that seed deterministic
synthetic inputs and assert exact output values for CP_FID / CP_KID /
CP_Median_Cosine_Similarity and the DINOv3 equivalents.

If anyone later changes the column-drop, per-side z-score, or
FID/KID/cosine pairing logic — or a dependency shifts numerics — these
tests will fail rather than silently drifting metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dihan.zheng and others added 13 commits April 16, 2026 17:22
Repeated ckpt_sha256_12 calls on multi-GB checkpoints dominate parallel
sweep cache-key resolution. Write a sibling .sha256 sidecar after the
first hash; on later calls, reuse the sidecar when its mtime >= the
ckpt's. Falls back to recompute on any OSError (read-only dir, NFS
flake) and on corrupt non-hex sidecars.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
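The sidecar scheme this commit describes can be sketched as follows (the function body is an illustration of the stated behavior, not the actual implementation):

```python
import hashlib
from pathlib import Path


def ckpt_sha256_12(ckpt: Path) -> str:
    """First 12 hex chars of the checkpoint's SHA-256, cached in a sidecar.

    Illustrative sketch: reuse <ckpt>.sha256 when its mtime is at least as
    new as the checkpoint's; recompute on OSError or a corrupt (non-hex /
    wrong-length) sidecar, as the commit message describes.
    """
    sidecar = ckpt.with_suffix(ckpt.suffix + ".sha256")
    try:
        if sidecar.stat().st_mtime >= ckpt.stat().st_mtime:
            cached = sidecar.read_text().strip()
            if len(cached) == 12 and all(c in "0123456789abcdef" for c in cached):
                return cached
    except OSError:
        pass  # missing or unreadable sidecar: fall through to a full hash

    digest = hashlib.sha256()
    with ckpt.open("rb") as f:
        # Stream in 1 MiB chunks so multi-GB checkpoints don't load whole.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    short = digest.hexdigest()[:12]
    try:
        sidecar.write_text(short)
    except OSError:
        pass  # read-only dir / NFS flake: caching stays best-effort
    return short
```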
dynacell's benchmark leaf YAMLs carry two reserved top-level keys:
`launcher:` (sbatch/runtime metadata) and `benchmark:` (experiment
identifiers). LightningCLI rejects unknown top-level keys, so these must
be removed before the composed config reaches the CLI. Widen
_maybe_compose_config to:

- strip both reserved keys whether or not `base:` is present
- extract _find_config_arg and _replace_config_path_in_argv helpers

This unblocks `uv run dynacell fit -c <benchmark-leaf.yml>` without
requiring the dedicated benchmark submit tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
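The reserved-key stripping described here amounts to filtering two top-level keys out of the composed mapping before LightningCLI parsing; a minimal sketch (helper name hypothetical — the real `_maybe_compose_config` also handles `base:` composition and argv rewriting, which this omits):

```python
# Reserved top-level keys per the commit message: launcher metadata and
# benchmark identifiers, which LightningCLI would reject as unknown.
RESERVED_TOP_LEVEL_KEYS = ("launcher", "benchmark")


def strip_reserved_keys(config: dict) -> dict:
    """Return a copy of the composed config without the reserved keys."""
    return {k: v for k, v in config.items() if k not in RESERVED_TOP_LEVEL_KEYS}
```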
Lands the benchmark config layout without any runnable leaves yet:

- BENCHMARK_CONFIG_SCHEMA.md — reference doc (previously untracked)
- virtual_staining/README.md — reserved-keys contract, compose+submit
  docs
- shared/train_sets/ipsc_confocal.yml — imaging modality defaults
- shared/targets/{er_sec61b, mito_tomm20, nucleus, membrane}.yml — four
  targets with channel names, train-side data paths, normalizations,
  and RandWeightedCropd
- shared/model_overlays/celldiff_{fit,predict}.yml — model + trainer
  recipe binding + mode-specific data hparams and GPU aug stack
- shared/launcher_profiles/{mode_fit, mode_predict, hardware_h200_single,
  runtime_single_gpu}.yml — launcher metadata split across axes
- shared/predict_sets/ipsc_confocal.yml — predict-set metadata +
  source_channel (duplicated from train_sets because predict leaves
  don't compose train_sets)

Train/predict leaves land in the next two commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four benchmark leaves at configs/benchmarks/virtual_staining/train/<org>/
ipsc_confocal/celldiff.yml — one per organelle. Each composes the shared
axes (train_set, target, celldiff_fit overlay, launcher profiles) and
carries organelle-specific WandB run name, checkpoint dirpath, and
launcher.{job_name, run_root} in the leaf body.

test_benchmark_config_composition.py composes both the pre-schema
fit_celldiff.yml and the new leaf through load_composed_config, strips
reserved keys, and asserts the full intersection of model/data/trainer
fields matches. All four organelles pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four predict leaves at configs/benchmarks/virtual_staining/predict/<org>/
ipsc_confocal/celldiff/ipsc_confocal.yml. Each overrides:

- data.init_args.data_path to the test_cropped store for the organelle
- data.init_args.normalizations to Phase3D-only (predict doesn't use
  target normalization — target isn't loaded)
- data.init_args.augmentations to [] (clears target-inherited
  RandWeightedCropd; predict has no CPU augs)
- trainer.callbacks to a single HCSPredictionWriter with the organelle's
  output zarr

Extends test_benchmark_config_composition.py with a predict-side
equivalence test that asserts model.init_args.{num_generate_steps,
predict_method, predict_overlap, ckpt_path, net_config}, the predict
data.init_args key intersection, HCSPredictionWriter output_store
equality, and a 'test_cropped/' guard on data_path. All four predict
leaves pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…late (truncated commit subject)

Drives one benchmark leaf end-to-end: compose via load_composed_config,
apply --override (stdlib dotlist, interpolation forbidden), validate
launcher block, consistency-check trainer.devices vs sbatch.gpus, render
sbatch from tools/sbatch_template.sbatch using a string.Template
subclass with @@ delimiter (so shell $VARs pass through verbatim), and
submit.

The SBATCH directive render order (job-name, time, nodes, ntasks,
partition, cpus-per-task, gpus, mem, constraint, output, error) is
pinned explicitly to match Dihan's run_celldiff.slurm. Byte-equivalence
test against the SEC61B train leaf confirms the rendered sbatch
differs only on the final srun --config path.

Flags: --dry-run, --print-script, --print-resolved-config,
--override key.path=value (repeatable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
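The @@-delimiter trick mentioned above relies on `string.Template`'s overridable class-level delimiter; a small sketch showing why shell `$VAR`s pass through verbatim (placeholder names and template text are illustrative, not the actual sbatch template):

```python
import string


class SbatchTemplate(string.Template):
    # '@@' as delimiter means only @@name placeholders are substituted;
    # ordinary shell variables like $SLURM_JOB_ID are left untouched.
    delimiter = "@@"


template = SbatchTemplate(
    "#SBATCH --job-name=@@job_name\n"
    'echo "job $SLURM_JOB_ID on $HOSTNAME"\n'
    "srun python -m train --config @@config_path\n"
)
script = template.substitute(
    job_name="unext2_fit", config_path="/runs/resolved/leaf.yml"
)
```

`substitute()` (rather than `safe_substitute()`) makes a missing placeholder a hard error, which is usually what you want when rendering a job script.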
git-renamed the four pre-schema CellDiff trees (memb/nucl/sec61b/tomm20,
fit+predict YAMLs and run_celldiff.slurm) from
applications/dynacell/examples/configs/ to
applications/dynacell/tools/LEGACY/examples_configs/. Empty examples/
parent removed.

Post-move, the eight YAMLs' base: paths needed one additional '..' to
still resolve to configs/recipes/ — the only content change. This keeps
the equivalence test in test_benchmark_config_composition.py able to
compose the LEGACY files as the source-of-truth reference. Both test
files' EXAMPLES paths updated to the new location.

tools/LEGACY/README.md documents the contract: reference-only, not for
direct launch; delete after one successful end-to-end submit run and
2026-06-30 at the earliest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the configs/benchmarks/virtual_staining/ layer to the config
structure section, points at its own README for composition order, and
documents the submit_benchmark_job.py tool with --dry-run examples.
Also notes that launcher:/benchmark: reserved keys are stripped
automatically by _maybe_compose_config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
viscy_utils.compose._deep_merge was private, forcing dynacell's
submit_benchmark_job.py to keep a byte-identical copy with a docstring
explaining the duplication. Drop the underscore and export it. Prevents
silent drift between the two copies if one is updated (e.g. changing
list-replace to list-append semantics).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
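A merge with the semantics this commit alludes to (dict-recursive, list-replace) can be sketched as follows; this is an illustration of the described contract, not the actual `viscy_utils.compose.deep_merge`:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`.

    Nested dicts merge key-by-key; lists and scalars in `override`
    replace the base value wholesale (list-replace, not list-append).
    """
    merged = dict(base)
    for key, value in override.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```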
Three substantive fixes plus cleanup:

- Drop the @@OVERRIDES tail from sbatch_template.sbatch. Previously
  --override tokens were both merged into the resolved YAML AND appended
  to the srun command line, applying the overrides twice. For scalar
  overrides this happened to be idempotent; for list overrides it would
  have silently duplicated entries.
- Make --print-script and --print-resolved-config imply skip-submission.
  Previously running submit_benchmark_job.py with --print-resolved-config
  alone (no --dry-run) would still sbatch the job — a surprising
  foot-gun.
- Use the newly-public deep_merge from viscy_utils.compose; drop the
  duplicated copy from submit_benchmark_job.py.
- Change _apply_override to return the merged dict instead of mutating
  in place via clear()+update(). Simpler contract matching _deep_merge.
- Deduplicate the stat() call in ckpt_sha256_12 (Path.exists() followed
  by Path.stat() was two syscalls for one result).
- Strip stale "# Equivalent to examples/configs/..." comments from the 8
  leaf YAMLs — the referenced path was moved to tools/LEGACY/ in an
  earlier commit.
- Clean up author-referencing narration comments ("matches Dihan's
  ...") — the code is the contract now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three confirmed review findings:

- Remove sys.path.insert from test_submit_benchmark_job.py (CLAUDE.md
  bans sys.path mutation). Replace with pytest pythonpath config in the
  workspace pyproject.toml pointing at applications/dynacell/tools so
  the test can import submit_benchmark_job.
- Make --dry-run the mode that writes the resolved YAML and sbatch to
  disk (previously nothing wrote files outside the real-submit path,
  which meant --dry-run rendered a path it never populated). --print-*
  flags are now documented as preview-only: stdout inspection, no disk
  writes, no submission.
- Drop redundant FileNotFoundError from the except tuple in
  ckpt_sha256_12 — it's an OSError subclass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
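
The redundancy is easy to verify from Python's builtin exception hierarchy: since Python 3.3, FileNotFoundError is a subclass of OSError, so an `except (OSError, FileNotFoundError)` tuple never catches anything a bare OSError clause would miss (`read_or_none` below is an illustrative helper, not code from the PR):

```python
# FileNotFoundError sits under OSError in the builtin exception tree,
# so an OSError handler already covers missing-file errors.
print(issubclass(FileNotFoundError, OSError))  # True


def read_or_none(path: str):
    try:
        with open(path) as f:
            return f.read()
    except OSError:  # also catches FileNotFoundError, PermissionError, ...
        return None


print(read_or_none("/no/such/checkpoint.ckpt"))  # None
```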
Copilot AI left a comment

Pull request overview

Copilot reviewed 131 out of 143 changed files in this pull request and generated 3 comments.



alxndrkalinin and others added 5 commits April 16, 2026 21:05
Prior semantics had --dry-run write resolved YAML + sbatch to
launcher.run_root, which fails with PermissionError on production
run_roots the caller can't write to — making --dry-run unsafe as a
preview mechanism despite its name.

New contract:
- --print-script / --print-resolved-config: pure preview, stdout only,
  no disk writes, no submission. Safe on any leaf regardless of
  run_root write permission.
- --dry-run alone: write resolved YAML + sbatch to run_root without
  submitting. Requires write permission.
- --dry-run combined with --print-*: --print-* wins (still a pure
  preview).
- Bare invocation: write + submit (unchanged).

README updated to use --print-script for the safe-preview example and
to document --dry-run's write-to-run_root semantics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
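
The precedence above condenses into a small resolution function. Flag names follow the commit; the `resolve_mode` helper itself is illustrative, not the tool's actual code:

```python
import argparse


def resolve_mode(argv: list[str]) -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--print-script", action="store_true")
    parser.add_argument("--print-resolved-config", action="store_true")
    args = parser.parse_args(argv)
    # --print-* wins: pure preview on stdout, no writes, no submission.
    if args.print_script or args.print_resolved_config:
        return "preview"
    # --dry-run alone: write resolved YAML + sbatch to run_root, skip sbatch.
    if args.dry_run:
        return "write_only"
    # bare invocation: write + submit.
    return "write_and_submit"


print(resolve_mode(["--dry-run", "--print-script"]))  # preview
print(resolve_mode(["--dry-run"]))                    # write_only
print(resolve_mode([]))                               # write_and_submit
```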
_render_env_block previously wrote `export KEY=VALUE` verbatim. If a
value contains a space or shell metacharacter, the export breaks or
opens an injection path via --override launcher.env.FOO=... Quote
values with shlex.quote() and validate keys match a shell identifier
regex. No-op for the current YAML values (INFO, 1, etc. quote to
themselves), preserving byte-equivalence with Dihan's reference
run_celldiff.slurm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
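
A minimal standalone sketch of that quoting behavior (the real `_render_env_block` lives in submit_benchmark_job.py; this version just shows why safe values stay byte-identical while metacharacters get quoted):

```python
import re
import shlex

# Shell identifier rule: letters, digits, underscore; no leading digit.
_KEY_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def render_env_block(env: dict[str, str]) -> str:
    lines = []
    for key, value in env.items():
        if not _KEY_RE.match(key):
            raise ValueError(f"invalid shell identifier: {key!r}")
        # shlex.quote is a no-op for safe values ("INFO" stays "INFO"),
        # preserving byte-equivalence for benign YAML entries.
        lines.append(f"export {key}={shlex.quote(str(value))}")
    return "\n".join(lines)


print(render_env_block({"LOG_LEVEL": "INFO", "MSG": "a b; rm -rf /"}))
# export LOG_LEVEL=INFO
# export MSG='a b; rm -rf /'
```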
A prior sed rename of _deep_merge → deep_merge accidentally stripped
the underscore from the four test function names
(test_deep_merge_flat etc. became testdeep_merge_flat). Pytest still
collected them, but the naming broke -k filtering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous docs only showed --print-script/--dry-run/submit, which
hid --print-resolved-config and --override entirely and left the
preview-vs-write-vs-submit semantics implicit. After the 7706ae8
preview-contract fix, the distinction between "safe preview" and
"writes to run_root" is user-visible, so it needs to be spelled out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit 1690b7f added a <ckpt>.sha256 sidecar to skip re-hashing large
DynaCLR checkpoints on every eval run, but the README only showed
ckpt_sha12 as a cache key. Surface the sidecar + mtime-guard so users
know that touching/replacing the ckpt invalidates it automatically
and don't have to grep cache.py to understand the behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
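
The sidecar + mtime-guard pattern can be sketched as below. This is a simplified stand-in (it reads the whole file into memory, and the function and file names mirror the commit but are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path


def ckpt_sha256_12(ckpt: Path) -> str:
    """First 12 hex chars of the checkpoint's SHA-256, cached in a
    <ckpt>.sha256 sidecar. The sidecar stores the ckpt mtime next to the
    digest, so touching or replacing the ckpt invalidates the cache."""
    sidecar = ckpt.with_suffix(ckpt.suffix + ".sha256")
    stat = ckpt.stat()  # one stat() serves both existence check and mtime guard
    if sidecar.exists():
        mtime_s, digest = sidecar.read_text().split()
        if float(mtime_s) == stat.st_mtime:
            return digest  # cache hit: skip re-hashing the large file
    digest = hashlib.sha256(ckpt.read_bytes()).hexdigest()[:12]
    sidecar.write_text(f"{stat.st_mtime} {digest}")
    return digest


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "model.ckpt"
    ckpt.write_bytes(b"fake weights")
    first = ckpt_sha256_12(ckpt)   # hashes, writes model.ckpt.sha256
    second = ckpt_sha256_12(ckpt)  # served from the sidecar
    assert first == second and len(first) == 12
```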
alxndrkalinin and others added 27 commits May 3, 2026 23:32
Best val checkpoint from job 31822574 (epoch 119, loss/validate=0.2722,
146 epochs total — plateaued from ep 119 onward). Mirrors the existing
nucleus/fnet3d_paper/a549_mantis predict pattern: ipsc-test leaf has no
target override, a549 leaves override dataset_ref.target to gene-keyed
`caax`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Best val checkpoint from job 31858488 (epoch 281, loss/validate=0.3143,
289 epochs total — 7-epoch plateau). Mirrors the existing
nucleus/fnet3d_paper/a549_mantis predict pattern: ipsc-test leaf has no
target override, a549 leaves override dataset_ref.target to gene-keyed
`caax`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Best val checkpoint from job 31822562 (epoch 110, loss/validate=0.8345,
135 epochs total). Mirrors the existing membrane/fcmae_vscyto3d_scratch/
a549_mantis predict pattern: ipsc-test leaf has no target override, a549
leaves override dataset_ref.target to gene-keyed `h2b`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bring predict_batch.sh to parity with predict_local.sh: accept
<train_set> and <test_set> args (with ipsc/a549/joint shorthand)
instead of hardcoding ipsc-trained → A549-mantis prediction.
Renamed from predict_all_a549.sh — the old name lies once test_set
is variable.

Verified with --dry-run on the original use case (er fnet3d_paper
ipsc a549) and on cases that weren't possible before (nucleus
fnet3d_paper a549 a549, ... a549 ipsc). submit_benchmark_batch.py
itself was already test-set-agnostic; this is a wrapper-only change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Track FOV name and timepoint per cell across CellProfiler, DINOv3, and
DynaCLR embeddings, then save as .npz to embeddings/ under the eval output
dir. Also redirects OUT_ROOT to evaluations_with_embeddings and enables ER
organelle evals (disabling membrane/nucleus) across all 5 model eval scripts.
Add viscy-data[mmap] extra to dynacell pyproject.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h size to 2

Add predict YAMLs for membrane and nucleus CellDiff across mock/denv/zikv
infection variants. Reduce batch_size from 4 to 2 in the joint
ipsc+confocal+a549+mantis train configs for both membrane and nucleus.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Best val checkpoint from job 31822558 (epoch 134, loss/validate=0.8142,
178 epochs total). Fills the last gap in the (Nucleus|Cell Membrane) ×
(F-net|UNeXt2 scratch|UNeXt2 FCMAE) a549-trained predict matrix; only
the FCMAE Cell Membrane variant remains. ipsc-test leaf has no target
override, a549 leaves override dataset_ref.target to gene-keyed `h2b`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Prediction zarr naming convention" section to dynacell/CLAUDE.md
covering the iPSC / A549 / Joint training-set infix scheme and the
historical SEC61B/TOMM20 double-underscore form. Settles the naming for
forthcoming joint-trained predicts as `_jointtrained` (decided 2026-05-04).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Predict configs for J31910356 (val 0.5716 @ ep132): VSCyto3D-pretrained
FCMAE on a549_mantis sec61b. Covers iPSC test set + 3 a549 plates
(mock/denv/zikv). Both manifests use sec61b natively — no dataset_ref
override needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31822521 (TIMEOUT @ ep102, val 0.6448 @ ep92): joint
iPSC+A549 UNeXt2 scratch on nucleus. Covers iPSC test set + 3 a549 plates
(mock/denv/zikv); a549 leaves override target to h2b for the gene-keyed
a549 manifest. Outputs land at nucl_fcmae_vscyto3d_scratch_jointtrained{,_<cond>}.zarr
per the _jointtrained convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31822529 (finished @ 119 ep, val 0.3754 @ ep111): joint
iPSC+A549 VSCyto3D-pretrained on cell membrane. Covers iPSC test set + 3
a549 plates (mock/denv/zikv); a549 leaves override target to caax for the
gene-keyed a549 manifest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31822536 (finished @ 118 ep, val 0.3859 @ ep112): joint
iPSC+A549 UNeXt2 scratch on cell membrane. Covers iPSC test set + 3 a549
plates (mock/denv/zikv); a549 leaves override target to caax.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31910360 (cancelled @ ≥113 ep): A549-only UNeXt2 scratch
on mito (TOMM20). Pinned to ep113 ckpt with val 0.7329 per the resume's
re-evaluation in last.ckpt's best_model_score block.

Carries a wandb-collision caveat: J31910360 (TOMM20) and J31910346 (SEC61B)
landed on the same node gpu-f-5 within 9 seconds and the launcher's
timestamp-derived wandb run id `20260502-204536` collided. The wandb
dashboard for that run shows the TOMM20 display name but logged metrics
align step-wise with SEC61B's training, not TOMM20's. The original-run
wandb history is therefore unrecoverable for TOMM20; the on-disk Lightning
trainer state is the source of truth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31910346 (completed cleanly @ ep199, val 0.6219 @ ep137):
A549-only UNeXt2 scratch on ER (SEC61B). Carries the same wandb-collision
caveat as the sibling TOMM20 configs: this SEC61B run shared wandb run id
`20260502-204536` with J31910360 (TOMM20) on gpu-f-5 — the wandb history
metrics actually belong to SEC61B (step-wise alignment confirms), but the
dashboard display name was overwritten by TOMM20.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31962520 (completed @ ep129 hitting max_steps, val
0.9709 @ ep126): joint iPSC+A549 F-net on nucleus. Covers iPSC test set + 3
a549 plates; a549 leaves override target to h2b. Note val late-stage drift
(ep127→1.05, ep128→1.28) — best-val ckpt at ep126 used for predictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31962519 (completed @ ep129 hitting max_steps, val
0.5759 @ ep116): joint iPSC+A549 F-net on cell membrane. Covers iPSC test
set + 3 a549 plates; a549 leaves override target to caax. Note val drifted
to 0.6751 at ep128 — best-val ckpt at ep116 used for predictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for J31965119 (completed @ ep392 hitting max_steps, val
0.8291 @ ep248): A549-only F-net on mito (TOMM20). Final val drifted to
0.9493 — best-val ckpt at ep248 used for predictions. Both iPSC and a549
manifests use tomm20; no dataset_ref override needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r all models

Add mix-trained evaluation scripts for celldiff, vscyto3d, unext2, and fnet3d
across iPSC and A549 (mock/denv/zikv) test sets, covering membrane and nucleus
targets. Also add new CellDiff predict configs for movie/iPSC sets and fix
nucleus CellDiff A549 output paths to use joint_predictions/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fills the two missing slots in the 6 train-set × 2 organelle FNet3D
membrane/nucleus eval matrix: joint→iPSC nucleus and joint→A549
nucleus (×3 infections). Predictions already existed under
{ipsc,a549}/joint_predictions/, but no eval scripts had been wired up.

Distinct from upstream's run_eval_mix_trained_a549_pred_nucleus_*.sh,
which run pixel-only (compute_feature_metrics=false, no seg path).
These compute all three metric tiers (pixel + mask + DynaCLR feature
metrics) for the supplementary table comparing FNet3D across training
pools.

Slurm runner submits the two scripts as a 2-task array so the iPSC
eval (1 condition) and the A549 eval (3 infections) run in parallel
on h100/h200/a100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics)

Adds the UNeXt2 (fcmae_vscyto3d_scratch) counterpart to the FNet3D
joint-trained membrane evals, completing the model coverage needed for
the 3-pool × 2-test comparison table in the paper appendix.

UNeXt2 joint membrane predictions already existed; only the eval
scripts were missing. Same shape as the joint nucleus runner: a 2-task
slurm array — task 0 = iPSC test, task 1 = A549 test × {mock, denv,
zikv} — with full metric tiers (pixel + mask + DynaCLR features).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rsion output dirs as r2

- Add *_celldiff.yml target files using MinMaxSampled at timepoint_statistics
  instead of NormalizeSampled, keeping shared targets unchanged for other models
- Update all celldiff train configs to reference new celldiff-specific targets
- Bump output directory version to celldiff_r2 across all celldiff train configs
  (save_dir, dirpath, run_root) to avoid overwriting previous runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The inherited batch_size=4 from celldiff_fit.yml combined with the
inline num_samples=2 yields 8 GPU samples/step under
BatchedConcatDataModule (which does NOT divide by num_samples per
CLAUDE.md). That OOMs the H200 at ~138/140 GiB during
unet/blocks.py:187 (h + res_conv(x)). Verified via J32771895
(CELLDiff_JOINT_SEC61B, failed 2m12s) and J32771903
(CELLDiff_JOINT_TOMM20, failed 3m30s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for the joint iPSC+A549 FCMAE-scratch ER model
(J31910330, ep75 val 0.5234). Covers iPSC test_cropped + a549 mantis
sec61b mock/denv/zikv. No dataset_ref override needed — both iPSC and
a549 manifests use the `sec61b` gene key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for the joint iPSC+A549 FCMAE-pretrained_ws8500 ER
model (J31910331, ep111 val 0.5164). Covers iPSC test_cropped + a549
mantis sec61b mock/denv/zikv. ckpt_path points to the _ws8500 training
subdir while the config namespace uses the plain fcmae_vscyto3d_pretrained
name for consistency with iPSC + a549 single-set predict configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for the joint iPSC+A549 FCMAE-scratch Mito model
(J31910343, ep63 val 0.6063). Covers iPSC test_cropped + a549 mantis
tomm20 mock/denv/zikv. No dataset_ref override needed — both iPSC and
a549 manifests use the `tomm20` gene key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predict configs for the joint iPSC+A549 FCMAE-pretrained_ws8500 Mito
model (J31910345, ep43 val 0.6016). Covers iPSC test_cropped + a549
mantis tomm20 mock/denv/zikv. ckpt_path points to the _ws8500 training
subdir while the config namespace uses the plain fcmae_vscyto3d_pretrained
name for consistency with iPSC + a549 single-set predict configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dihan.zheng and others added 2 commits May 15, 2026 12:45
…+ 3 methods

Replace the single predict__ipsc_confocal.yml per organelle with three
method-specific configs (denoise/sliding_window/iterative). All use
celldiff_r2 checkpoints and MinMaxSampled normalization (timepoint_statistics)
to match the r2 training setup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
)

* chore(dynacell): add torch-fidelity + scikit-learn to eval extra

torch-fidelity pinned to a commit on toshas/torch-fidelity master that
includes MIND on top of v0.4.0 — MIND is not in any tagged release yet
and the upcoming eval pipeline depends on it. scikit-learn is already a
transitive dep but pinning it explicitly under the eval extra protects
the new linear-probe code path against an upstream prune.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(dynacell): add save_paths helper for canonical eval-dir naming

Produces the same on-disk save_dir layout the paper aggregation scripts
in /hpc/mydata/alex.kalinin/dynacell expect (paper key + train-set
infix + organelle long-form + infection suffix). The library copy here
and the helper in paper/scripts/compute_all_organelle_precision_recall.py
must stay in sync; both files carry a comment naming the other, since VisCy
and the paper repo are independent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(dynacell): add feature_metrics module backed by torch-fidelity

Wraps torch-fidelity's *_features_to_metric helpers so the eval pipeline
can keep feeding pre-extracted per-cell embeddings directly (the library's
top-level calculate_metrics() can't accept precomputed features). Three
gaps the prior in-tree implementations did not close are now closed:
KID gets a subset-resampled mean+std, P/R/F1 gain a non-parametric
bootstrap, and MIND (sliced 2-Wasserstein) joins as a kernel-free
sensitivity-to-distribution-shape diagnostic motivated by the paper's
Appendix I mode-collapse analysis.

Small-cohort KID auto-shrinks the subset size and NaNs out below 16
cells; PRC bootstrap resamples both sides with replacement before each
manifold rebuild; tensor conversion happens once outside the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
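
For intuition, the metric behind MIND is the sliced 2-Wasserstein distance. The numpy sketch below shows the underlying shape of that computation (random 1-D projections, quantile matching by sorting); it is not the torch-fidelity implementation and assumes equal sample counts on both sides:

```python
import numpy as np


def sliced_wasserstein2(x: np.ndarray, y: np.ndarray,
                        n_proj: int = 128, seed: int = 0) -> float:
    """Monte-Carlo sliced 2-Wasserstein distance between two feature sets.

    Projects both sets onto random unit directions and averages the 1-D
    W2 distance (sorted projections) over projections. Kernel-free, so it
    stays sensitive to distribution shape (e.g. mode collapse) without a
    bandwidth choice. Assumes x and y have the same number of rows.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    total = 0.0
    for u in dirs:
        px, py = np.sort(x @ u), np.sort(y @ u)  # quantile-match by sorting
        total += np.mean((px - py) ** 2)
    return float(np.sqrt(total / n_proj))


rng = np.random.default_rng(1)
a = rng.normal(0, 1, (512, 16))
b = rng.normal(0, 1, (512, 16))   # same distribution as a
c = rng.normal(3, 1, (512, 16))   # shifted distribution
assert sliced_wasserstein2(a, b) < sliced_wasserstein2(a, c)
```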

* feat(dynacell): add feature_select for CP regionprops pruning

Vendors pycytominer's variance + correlation feature-selection logic
(BSD-3 attribution in module header) so the CP regionprops track in
the eval pipeline can drop near-constant or redundantly-correlated
columns before they reach FID/KID/PRC. Operates on raw numpy without
a pandas dep, statistics computed on pooled (gt, pred) so the same
keep-mask is applied to both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
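
A simplified pure-numpy sketch of the variance + correlation pruning idea (pycytominer-inspired; thresholds and the O(k^2) correlation pass are fine for a ~30-column regionprops matrix, and the function here is illustrative rather than the vendored code):

```python
import numpy as np


def select_features(x: np.ndarray, freq_cut: float = 0.05,
                    corr_threshold: float = 0.9) -> np.ndarray:
    """Boolean keep-mask over columns: drop near-constant columns, then
    drop one column of each highly correlated surviving pair."""
    keep = np.ones(x.shape[1], dtype=bool)
    # variance filter: drop columns whose non-modal fraction is tiny
    for j in range(x.shape[1]):
        _, counts = np.unique(x[:, j], return_counts=True)
        if 1 - counts.max() / len(x) < freq_cut:
            keep[j] = False
    idx = np.flatnonzero(keep)
    corr = np.abs(np.corrcoef(x[:, idx], rowvar=False))
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if keep[idx[a]] and keep[idx[b]] and corr[a, b] > corr_threshold:
                keep[idx[b]] = False  # drop the later of the pair
    return keep


rng = np.random.default_rng(0)
f = rng.normal(size=(100, 1))
x = np.hstack([
    f,                                           # kept
    f * 2 + 0.001 * rng.normal(size=(100, 1)),   # ~duplicate of col 0: dropped
    np.ones((100, 1)),                           # constant: dropped
    rng.normal(size=(100, 1)),                   # kept
])
mask = select_features(x)
```

Computing the mask on pooled (gt, pred) data, as the commit describes, then applying it to both sides keeps the two feature matrices column-aligned.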

* feat(dynacell): add linear_probe with FOV-stratified AUROC + MADScaler

Probe-based diagnostics that FID/KID/MIND/PRC do not provide: how
linearly distinguishable are real vs predicted embeddings, and how well
do features preserve biological treatment information across plates.

GroupKFold by FOV holds out whole fields of view so per-FOV illumination,
segmentation artifacts, and biological clustering can't leak across the
split. MADScaler (median + median-absolute-deviation, not the IQR-based
sklearn RobustScaler) lives inside the CV Pipeline so normalization
statistics are computed only on the training fold. Logistic regression
defaults to class_weight='balanced' to handle unequal mock/infected
cohort sizes.

`paired_auroc` is the shared helper for the two probe call sites
(real-vs-predicted in pipeline.py and between-condition in the
post-hoc cross_condition_probe CLI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
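
The two guards above (robust per-column scaling fit on the training fold only, and whole-FOV holdout) can be sketched without sklearn. `MADScaler` mirrors the commit's description; `group_folds` is a hand-rolled GroupKFold-style split for illustration:

```python
import numpy as np


class MADScaler:
    """Center by median, scale by median absolute deviation (per column):
    the robust alternative to IQR-based scaling. fit() on the training
    fold only, so normalization statistics never leak from held-out FOVs."""

    def fit(self, X):
        self.median_ = np.median(X, axis=0)
        mad = np.median(np.abs(X - self.median_), axis=0)
        self.scale_ = np.where(mad == 0, 1.0, mad)  # guard constant columns
        return self

    def transform(self, X):
        return (X - self.median_) / self.scale_


def group_folds(groups, n_splits):
    """GroupKFold-style splits: every sample of one FOV lands on one side."""
    uniq = np.unique(groups)
    for i in range(n_splits):
        held_out = uniq[i::n_splits]
        is_test = np.isin(groups, held_out)
        yield np.flatnonzero(~is_test), np.flatnonzero(is_test)


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3)) * [1, 10, 100]  # wildly different column scales
fov = np.repeat(np.arange(8), 5)             # 8 FOVs, 5 cells each

for train, held in group_folds(fov, n_splits=4):
    scaler = MADScaler().fit(X[train])       # stats from the training fold only
    Z = scaler.transform(X[held])
    assert set(fov[train]).isdisjoint(fov[held])  # no FOV straddles the split
```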

* refactor(dynacell): rewire eval pipeline through torch-fidelity helpers

Replaces the in-tree _frechet_distance, _polynomial_mmd, and
_pairwise_feature_metrics in utils.py, plus cp_pairwise / deep_pairwise /
cp_dataset_fid_kid / deep_dataset_fid_kid / _nan_pairwise in metrics.py,
with calls into the new feature_metrics module. Dataset-level CP track
also runs feature_select pruning before similarity, plus the
real-vs-predicted linear probe per feature type. The three feature-type
similarity+probe calls run via ThreadPoolExecutor(max_workers=3) since
torch-fidelity, sklearn LBFGS, and numpy BLAS all release the GIL on
their inner loops.

CP-specific per-(FOV, timepoint) preprocessing (target-zero-column drop
and per-side z-score) moves to a small helper in pipeline.py since the
deep tracks need neither step.

Tests in test_evaluation_metrics.py drop the seven cp_pairwise /
deep_pairwise / pinned-value cases (covered by feature_metrics_test.py);
test_evaluation_pipeline.py grows stubs for the new feature_metrics,
feature_select, and linear_probe modules so the cache-reuse path can
still import the pipeline without pulling in torch-fidelity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(dynacell): add cross-condition linear-probe CLI

Post-hoc diagnostic that runs FOV-stratified probes between two
infection conditions (mock vs denv, mock vs zikv) on the per-cell
embeddings saved by the eval pipeline. Per-plate MAD scaling cancels
the illumination/exposure/gain offsets that would otherwise let any
classifier separate the plates trivially (CP at AUROC=1.0 before this
guard); CP-only feature_select then drops near-constant columns before
the linear probe sees them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(dynacell): add eval-runner pair + Hydra-override staging tools

Mirrors the predict_{local,batch}.sh / submit_benchmark_{job,batch}.py
pair so an entire (organelle, model, train_set, test_set) eval set
launches with one command. Unlike Lightning's predict CLI, dynacell
evaluate is Hydra-decorated and takes override key=value tokens
directly — the staging tools read the predict leaf, extract
output_store / dataset_ref / benchmark.*, build the override list,
and emit `uv run dynacell evaluate …` invocations.

evaluate_local.sh supports --parallel N and --cross-condition-probe;
the latter triggers an automatic post-step that runs
`python -m dynacell.evaluation.cross_condition_probe` over the three
A549 per-plate save_dirs once all per-plate evals complete.
evaluate_batch.sh wraps a single sbatch job that runs each leaf's
eval in series via srun.

Save-dir naming routes through dynacell.evaluation.save_paths so the
paths stay bit-identical to the paper aggregation scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(dynacell): cap inner BLAS to 1 thread under the eval parallel pool

The dataset-level metrics + linear-probe block runs three feature types
concurrently via ThreadPoolExecutor. Without a thread-pool guard, each
worker's sklearn LBFGS + numpy ops can grab cpu_count() BLAS threads on
top of the three outer threads, oversubscribing cores on dense nodes.
threadpoolctl.threadpool_limits(1) wraps the executor block so each
outer thread gets exactly one BLAS thread.

Rename `_cp_pairwise_preprocess` to `_cp_dropzero_zscore` so the name
no longer collides visually with the deleted `cp_pairwise` helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
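
The guard pattern looks roughly like this (a sketch, not the pipeline code; a no-op fallback stands in when threadpoolctl is absent so the example stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from threadpoolctl import threadpool_limits
except ImportError:
    # no-op fallback so the sketch runs without threadpoolctl installed
    from contextlib import contextmanager

    @contextmanager
    def threadpool_limits(limits=None):
        yield


def run_feature_types(tasks):
    """Run per-feature-type work on 3 outer threads while capping each
    worker's inner BLAS/OpenMP pool to one thread: 3 x 1 threads total
    instead of 3 x cpu_count()."""
    with threadpool_limits(limits=1):
        with ThreadPoolExecutor(max_workers=3) as pool:
            return list(pool.map(lambda fn: fn(), tasks))


results = run_feature_types([lambda: "cp", lambda: "dinov3", lambda: "dynaclr"])
print(results)  # ['cp', 'dinov3', 'dynaclr']
```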

* refactor(dynacell): make save_paths name conventions uniform

ORGANELLE_EVAL_TARGET and ORGANELLE_A549_PLATE were already public
because the eval-runner tools import them; promote PAPER_KEY and
ORGANELLE_PAPER from underscore-prefixed to public for the same
reason. The four lookup tables are sibling concepts and the leading-
underscore split implied a private/public boundary that doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(dynacell): avoid redundant predict-leaf compose in eval-batch tool

_resolve_one_leaf already loads + composes every leaf and reads
benchmark.trained_on as part of the call; return that string in the
tuple instead of re-running load_composed_config on the head leaf
just to fetch the same field. Saves one ~1-2 s compose per submit
invocation. Also adds a cross-leaf trained_on consistency check
mirroring the existing organelle / model_name checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add CellDinoModel foundation wrapper

CELL-DINO is a DINOv2-architecture ViT pretrained on
fluorescence microscopy (Human Protein Atlas). The
channel_adaptive_dino_vitl16 checkpoint processes one channel
at a time through a single-channel ViT-L/16 stem; the wrapper
reshapes (B, C, H, W) -> (B*C, 1, H, W), runs the backbone, and
mean-pools the cls token across channels for a fixed-dim
embedding regardless of input channel count. Weights load from
a local .pth state_dict — nothing fetched from the network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
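
The reshape-and-pool contract of the wrapper can be illustrated in numpy with a toy stand-in for the single-channel ViT stem (the real backbone returns cls-token embeddings; `fake_backbone` here is purely for shape demonstration):

```python
import numpy as np


def channel_adaptive_embed(x: np.ndarray, backbone) -> np.ndarray:
    """(B, C, H, W) -> (B, D): fold channels into the batch, run a
    single-channel backbone, mean-pool the embedding across channels so
    the output dim is fixed regardless of input channel count."""
    b, c, h, w = x.shape
    per_channel = x.reshape(b * c, 1, h, w)    # (B*C, 1, H, W)
    emb = backbone(per_channel)                # (B*C, D) per-channel embeddings
    return emb.reshape(b, c, -1).mean(axis=1)  # average over channels


# toy "backbone": global mean per image, repeated into a 4-dim embedding
fake_backbone = lambda imgs: np.repeat(imgs.mean(axis=(1, 2, 3))[:, None], 4, axis=1)

x = np.zeros((2, 3, 8, 8))
x[0, 0] = 1.0  # one bright channel in the first sample
out = channel_adaptive_embed(x, fake_backbone)
print(out.shape)  # (2, 4)
```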

* feat(dynacell): wire CELL-DINO as a 4th opt-in eval backbone

Adds the CellDinoFeatureExtractor adapter (wraps
viscy_models.foundation.CellDinoModel; same extract_features(image_2d)
contract as DinoV3FeatureExtractor / DynaCLRFeatureExtractor), threads
celldino through the GT cache (FeatureKind literal, _CacheContext field,
manifest validation, fov_gt_deep_features dispatch), accumulates
pred/gt embeddings per (FOV, t), dispatches per-timepoint pairwise +
dataset-level FID/KID/PRC/MIND/probe via the existing parallel pool
(max_workers bumped to 4), saves the per-cell NPZs, and surfaces the
backbone in cross_condition_probe.

CELL-DINO is opt-in: it only runs when `feature_extractor.celldino.
weights_path` is set in the eval config. Existing eval dirs leave the
field null and see zero behavior change — the cross-condition probe
emits NaN rows with a "missing embeddings file" reason for any pair
where one side did not run the backbone.

Why: CELL-DINO is DINOv2-architecture ViT pretrained on fluorescence
microscopy (HPA), so its embeddings carry cell-biology priors that
neither DINOv3 (LVD-1689M natural images) nor DynaCLR (in-house) does.
Useful as a third reference for the cross-condition probe and as a
fourth dataset-level FID/KID/PRC/MIND track.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(dynacell): apply simplify findings on CELL-DINO wiring

- `cross_condition_probe._FEATURE_TYPES` derives from `cache.FeatureKind`
  via `typing.get_args` instead of hand-maintaining a parallel tuple, so
  the 5th backbone touches one line instead of two.
- `pipeline_cache._resolve_force` reads `force.gt_celldino` directly
  (matching the peers `gt_cp`/`gt_dinov3`/`gt_dynaclr`); the defensive
  `getattr(force, "gt_celldino", False)` was dead since eval.yaml now
  always defines the field.
- `pipeline_cache.fov_gt_deep_features` docstring updated: it accepts
  `celldino` in addition to `dinov3`/`dynaclr`.
- `eval.yaml` celldino block comment shrunk from 3 lines to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(dynacell): batch per-cell deep-feature crops across backbones

Factors the per-cell crop construction out of _extract_per_cell_features
into a public build_pred_crops / build_target_crops / features_from_crops
trio so the 2-D max-z projection, cell iteration, and crop construction
run once per (FOV, timepoint) instead of once per backbone. The
prediction-side loop in pipeline.py now builds crops one time and
dispatches to DINOv3 / DynaCLR / CellDINO over the same list — eliminates
3× redundant pure-CPU work per timepoint.

Each deep extractor (DinoV3, DynaCLR, CellDino) gains an
extract_features_batch(images, batch_size=...) method that stacks all
cells of one (FOV, t) into a single forward, chunked to bound VRAM
(64 for DynaCLR, 32 for the ViT backbones). features_from_crops uses
the batch path when available and falls back to per-cell when an
extractor only exposes extract_features — keeps the test-stub
_IdentityExtractor working and gives the same backward-compat to any
future extractor that does not need batching.

Net effect on the deep-feature stage: 1 max-z + 1 cell iteration per
timepoint instead of 3-4; one forward per backbone per (FOV, t)
instead of N forwards per cell. Wallclock impact depends on cell
count per FOV but is the highest-leverage item the simplify reviewer
flagged on the CELL-DINO PR — should yield 2-5× speedup on plates with
hundreds of cells per timepoint.

deep_target_features / deep_pred_features stay as backward-compat
wrappers so the GT cache compute_fn path and the existing test suite
keep working unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
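
The batch-path-with-fallback dispatch can be sketched as follows (stub extractors included for illustration; the real ones wrap DINOv3/DynaCLR/CellDINO forwards):

```python
import numpy as np


def features_from_crops(extractor, crops, batch_size=32):
    """Use the batched path when the extractor provides it; otherwise fall
    back to one forward per crop (keeps identity/test stubs working)."""
    if hasattr(extractor, "extract_features_batch"):
        chunks = [
            extractor.extract_features_batch(crops[i:i + batch_size])
            for i in range(0, len(crops), batch_size)  # chunk to bound VRAM
        ]
        return np.concatenate(chunks, axis=0)
    return np.stack([extractor.extract_features(c) for c in crops])


class IdentityExtractor:  # stub exposing only the per-crop method
    def extract_features(self, image):
        return image.ravel()


class BatchedExtractor:   # stub exposing the batched method
    def extract_features_batch(self, images):
        return np.stack([im.ravel() for im in images])


crops = [np.full((2, 2), i, dtype=float) for i in range(5)]
a = features_from_crops(IdentityExtractor(), crops)
b = features_from_crops(BatchedExtractor(), crops, batch_size=2)
assert a.shape == b.shape == (5, 4)
assert np.allclose(a, b)
```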

* fix(dynacell): --print-cmd suppresses execution

`--print-cmd` already prints the resolved command tokens, but without
`--dry-run` it then fell through to `subprocess.run(cmd_tokens)`. That
contradicted the help text ("print the literal `dynacell evaluate`
command to stdout (no exec)") and meant a caller probing the staged
invocation would also actually execute it.

Gate the subprocess.run call so `--print-cmd` alone short-circuits at
exit code 0. `--dry-run` still wins when both are passed; combining
them retains the existing provenance behavior used by evaluate_local.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): drop bash JOB_NAME override; defer to python tool

evaluate_batch.sh built `EVAL_${ORG}_${MODEL^^}_${TRAIN}_ON_${TEST}` and
passed it via `--job-name`, but `${MODEL^^}` keeps the raw code-side
config key (e.g. `FCMAE_VSCYTO3D_SCRATCH`) while
submit_evaluation_batch.py's `_composite_job_name` runs the model
through `paper_key()` first (e.g. `UNEXT2`). Result: the SLURM job name
disagreed with the staged provenance files for the same inputs.

Remove the bash JOB_NAME construction entirely. The python tool's
default already produces the canonical `EVAL_<ORG>_<PAPER>_<TRAIN>_ON_<TEST>`
shape via paper_key translation. A user who needs a custom job name can
still pass `--job-name FOO` through the extra-args tail of the bash
wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viscy-models): validate ViT depth divisibility by block_chunks

`range(0, depth, depth // block_chunks)` produces more than
`block_chunks` BlockChunks when depth doesn't divide cleanly (e.g.
depth=25, block_chunks=4 -> 5 chunks of [6, 6, 6, 6, 1]). The current
ViT-L/16 configuration is fine (depth=24, block_chunks=4), but a future
ViT size with a non-divisible depth would silently produce a state_dict
layout the loader cannot match.

Raise ValueError up front rather than constructing a malformed module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
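
The off-by-chunk behavior and the fail-fast guard can be demonstrated directly (`chunk_block_indices` is an illustrative reduction of the chunking logic, not the viscy-models source):

```python
def chunk_block_indices(depth: int, block_chunks: int) -> list[range]:
    if depth % block_chunks != 0:
        # e.g. depth=25, block_chunks=4: range(0, 25, 6) yields 5 starts,
        # i.e. chunks of [6, 6, 6, 6, 1], a state_dict layout the loader
        # cannot match. Fail fast instead of building a malformed module.
        raise ValueError(
            f"depth={depth} must be divisible by block_chunks={block_chunks}"
        )
    step = depth // block_chunks
    return [range(i, i + step) for i in range(0, depth, step)]


chunks = chunk_block_indices(24, 4)  # the shipped ViT-L/16 configuration
assert len(chunks) == 4 and all(len(c) == 6 for c in chunks)

# the buggy pre-fix behavior: 5 chunk starts for block_chunks=4
assert len(list(range(0, 25, 25 // 4))) == 5
```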

* fix(dynacell): correct CellDINO install hint message

`_require_cell_dino` instructed users to `pip install viscy-models`,
but no PyPI distribution by that name provides the in-tree workspace
package. A user hitting the ImportError would not be able to recover
by following the suggestion.

Rewrite the message to point at the actual install paths:
`uv sync --all-packages --all-extras` from the repo root, or
`pip install -e packages/viscy-models` for an editable single-package
install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dynacell): bind feature-select sidecar to defaults + complexity note

Two Copilot review findings on feature_select.py:

1. The cp_selected_feature_mask.json sidecar emitted by pipeline.py
   hardcoded `{freq_cut: 0.05, unique_cut: 0.01, corr_threshold: 0.9}`,
   matching the in-tree `select_features` defaults but drifting silently
   if either side ever changed. Promote the three thresholds to
   module-level constants (DEFAULT_FREQ_CUT / DEFAULT_UNIQUE_CUT /
   DEFAULT_CORR_THRESHOLD), use them as kwarg defaults across
   variance_threshold / correlation_threshold / select_features, and
   reference the same constants from pipeline.py so the sidecar tracks
   the actual call.
2. Note variance_threshold's per-column np.unique loop in its docstring;
   it's designed for the ~30-col CP regionprops matrix and should not
   be applied to deep-feature matrices (1024+ columns) without
   vectorizing first.

The test_evaluation_pipeline.py stub gains the three new constants so
the pipeline-under-stub still imports cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(dynacell): add gt_celldino to pipeline_cache test fixtures

A prior simplify commit (680337e) replaced
`getattr(force, "gt_celldino", False)` with `force.gt_celldino` in
`_resolve_force`, but the test fixtures in test_pipeline_cache.py were
not updated to include the new key. Result: every test that goes
through `init_cache_context` or `_resolve_force` directly raised
`omegaconf.errors.ConfigAttributeError: Missing key gt_celldino`.

Add the key to the three force_recompute dict literals (the shared
fixture builder plus the two inline `_resolve_force_*` test bodies).
This realigns the test config shape with eval.yaml, which always emits
the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(dynacell): pin canonical eval save_dir paths against paper-script drift

save_paths.eval_save_dir is deliberately duplicated with
paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for in
a separate git repo. Any silent divergence between the two definitions
breaks downstream paper aggregation scripts (they can no longer find
metrics in the expected location), but the failure surfaces only when a
paper job runs against a fresh eval.

Add a parametrized pin test covering one representative case from each
branch of eval_save_dir:
- iPSC test x {ipsc, a549, joint} train
- A549 test x {ipsc, a549, joint} train

The pins encode the actual expected paths (relative to data_root) and
fail on any drift from save_paths.py, forcing a future change here to
also touch the paper-side mirror.
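The pin idea reduces to hard-coding the expected relative paths next to the function under test, so either copy drifting fails loudly. A hypothetical sketch (the path layout and function signature here are assumed, not taken from save_paths.py):

```python
from pathlib import Path

# Hypothetical stand-in for save_paths.eval_save_dir: branches only on
# the test / train cell-line pair.
def eval_save_dir(data_root: Path, test_line: str, train_line: str) -> Path:
    return data_root / "eval" / test_line / f"train_{train_line}"

# Pinned expectations, relative to data_root: one representative case
# per branch. Any change to eval_save_dir must also touch these literals.
PINS = {
    ("ipsc", "joint"): "eval/ipsc/train_joint",
    ("a549", "ipsc"): "eval/a549/train_ipsc",
}

root = Path("/data")
for (test_line, train_line), expected in PINS.items():
    got = eval_save_dir(root, test_line, train_line).relative_to(root)
    assert got == Path(expected), f"path drift: {got} != {expected}"
```

In the real test suite the same idea is expressed as a parametrized pytest case, one parameter tuple per branch.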

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(dynacell): clarify torch-fidelity pin + PRC bootstrap + extractor contract

Three docstring-only clarifications from Copilot review:

1. `pyproject.toml`: note the torch-fidelity git-SHA pin's upgrade path
   (drop the `@5e211a9` once upstream cuts a `>=0.5.0` release with
   MIND), so the pin doesn't become permanent technical debt.
2. `feature_metrics.compute_feature_similarity`: spell out that
   Precision / Recall / F1 are bootstrap *means* over resamples drawn
   with replacement, so the absolute numbers are systematically lower
   than single-shot PRC and not directly comparable to non-bootstrap
   tables. Within-scheme comparisons (across models / plates /
   conditions evaluated with the same setting) are still valid.
3. `metrics.features_from_crops`: document the implicit return-type
   contract for `extract_features` / `extract_features_batch`. Both
   paths call `.detach().cpu()` and assume a torch.Tensor; the batched
   path additionally assumes the leading dim equals `len(crops)`.
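Point 2 (bootstrap means running below single-shot values) can be seen with a toy coverage-style recall, which is assumed here for illustration and is not the real torch-fidelity PRC: a resample drawn with replacement contains only about 63% unique items on average, so any coverage-based metric averaged over resamples sits below the full-sample value even though the data is identical.

```python
import random

# Toy "recall": fraction of the reference set covered by the sample.
def coverage_recall(sample, reference):
    return len(set(sample) & set(reference)) / len(reference)

def bootstrap_mean(reference, n_resamples=500, seed=0):
    rng = random.Random(seed)
    ref = list(reference)
    total = 0.0
    for _ in range(n_resamples):
        # Draw with replacement: duplicates shrink the unique support.
        resample = [rng.choice(ref) for _ in ref]
        total += coverage_recall(resample, ref)
    return total / n_resamples

reference = range(100)
single_shot = coverage_recall(reference, reference)  # exactly 1.0
boot = bootstrap_mean(reference)  # ~0.63: 1 - (1 - 1/n)**n for n = 100
```

This is why the docstring restricts comparisons to runs evaluated under the same bootstrap setting.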

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(viscy-models): clarify CellDinoModel weights_path is caller-provided

The module docstring and the `weights_path` parameter docstring
previously called the Biohub HPC checkpoint path the model's "default",
but `weights_path` is a required init arg with no default. Rephrase
both to make clear that the caller passes the path explicitly and that
the HPC location is just one example.
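A minimal sketch of the contract the docstring now states, with an assumed simplified signature (not the real `CellDinoModel` init): the path is a required argument, so omitting it is a `TypeError` rather than a silent fallback to any checkpoint location.

```python
# Assumed simplified stand-in for the real model class.
class CellDinoModelSketch:
    def __init__(self, weights_path: str):  # required: no default value
        self.weights_path = weights_path

# Caller supplies the checkpoint path explicitly; the HPC location is
# just one possible value, not a built-in default.
model = CellDinoModelSketch(weights_path="/example/path/to/checkpoint.pth")

missing_raises = False
try:
    CellDinoModelSketch()  # omitting the arg fails loudly
except TypeError:
    missing_raises = True
```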

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Eduardo Hirata-Miyasaki <edhiratam@gmail.com>