feat(dynacell): migrate eval metrics to torch-fidelity, add probes #423

Merged
alxndrkalinin merged 24 commits into dynacell-models from feature-metrics-torch-fidelity on May 15, 2026
Conversation

@alxndrkalinin
Collaborator

Summary

  • Migrate dynacell's eval pipeline to torch-fidelity for FID/KID/PRC/MIND, adding KID std + non-parametric P/R/F1 bootstrap + MIND (kernel-free, motivated by the paper's Appendix I mode-collapse analysis).
  • Add CP feature selection (vendored pycytominer variance + correlation pruning, BSD-3), FOV-stratified linear probes (real-vs-pred + cross-condition), and the dynacell.evaluation.cross_condition_probe CLI with per-plate MAD scaling to cancel illumination/exposure offsets.
  • Add high-level eval-runner scripts (evaluate_local.sh, evaluate_batch.sh) plus submit_evaluation_{job,batch}.py Hydra-override staging tools; save-dir naming routes through a new save_paths.py that stays bit-identical to the paper aggregation scripts.

Stacks on dynacell-models (PR #404).

Test plan

  • uv run pytest applications/dynacell/src/dynacell/evaluation/ applications/dynacell/tests/test_evaluation_metrics.py applications/dynacell/tests/test_evaluation_pipeline.py --deselect applications/dynacell/tests/test_evaluation_metrics.py::test_gain_and_offset_errors_are_not_scale_invariant → 41/41 pass (10 feature_metrics, 8 feature_select, 9 linear_probe, 13 evaluation_metrics, 1 evaluation_pipeline). The one deselected test is a pre-existing failure on dynacell-models (commit bfc61891 made nrmse scale-invariant).
  • uvx ruff check clean on all touched files.
  • submit_evaluation_job.py --dry-run on membrane/fnet3d_paper/a549_mantis/predict__a549_mantis_mock.yml produces canonical eval_fnet3d_a549trained_membrane_mock save_dir with correct .cmd + .yml provenance.
  • eval_save_dir matches paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for on 5 spot-checked (organelle, model, train_set, plate) combos including iPSC-trained, A549-trained, and joint flavors.
  • Real-data cross_condition_probe on three celldiff joint membrane A549 plates: CP AUROCs in [0.46, 0.71] (no longer 1.0 after per-plate MAD + select_features); DINOv3/DynaCLR at ceiling, consistent with cross-condition morphology being learnable in deep embedding space.
  • End-to-end uv run dynacell evaluate ... on a real predict zarr (requires DINOv3 + DynaCLR weights cached + segmentation; left for reviewer to drive once stacked PR lands).

Notes

  • torch-fidelity is pinned via git URL to commit 5e211a9 because MIND is not in any tagged release yet.
  • save_paths.py is deliberately duplicated with the paper repo's compute_all_organelle_precision_recall.py:eval_dir_for; both files carry a comment pointing at the other.
  • Per-(FOV, timepoint) KID columns on ~50-cell A549 cohorts are noisy and flagged as exploratory.

🤖 Generated with Claude Code

alxndrkalinin and others added 15 commits May 15, 2026 09:30
torch-fidelity pinned to a commit on toshas/torch-fidelity master that
includes MIND on top of v0.4.0 — MIND is not in any tagged release yet
and the upcoming eval pipeline depends on it. scikit-learn is already a
transitive dep but pinning it explicitly under the eval extra protects
the new linear-probe code path against an upstream prune.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Produces the same on-disk save_dir layout the paper aggregation scripts
in /hpc/mydata/alex.kalinin/dynacell expect (paper key + train-set
infix + organelle long-form + infection suffix). The library copy here
and the helper in paper/scripts/compute_all_organelle_precision_recall.py
must stay in sync; both files carry a comment naming the other, since
VisCy and the paper repo are independent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wraps torch-fidelity's *_features_to_metric helpers so the eval pipeline
can keep feeding pre-extracted per-cell embeddings directly (the library's
top-level calculate_metrics() can't accept precomputed features). Three
gaps the prior in-tree implementations did not close are now closed:
KID gets a subset-resampled mean+std, P/R/F1 gain a non-parametric
bootstrap, and MIND (sliced 2-Wasserstein) joins as a kernel-free
sensitivity-to-distribution-shape diagnostic motivated by the paper's
Appendix I mode-collapse analysis.

Small-cohort KID auto-shrinks the subset size and NaNs out below 16
cells; PRC bootstrap resamples both sides with replacement before each
manifold rebuild; tensor conversion happens once outside the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
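The subset-resampled KID behavior described above (auto-shrinking subset size, NaN below 16 cells, mean + std over resamples) can be sketched as a standalone numpy routine. This is a hypothetical illustration of the technique, not the torch-fidelity internals; the degree-3 polynomial kernel matches the conventional KID definition.

```python
import numpy as np

def kid_mean_std(real, pred, subset_size=100, n_subsets=10, min_cells=16, seed=0):
    """Subset-resampled KID: polynomial-kernel unbiased MMD^2 over random
    subsets, reported as (mean, std). Sketch only: the subset size
    auto-shrinks on small cohorts, and cohorts below `min_cells` NaN out,
    mirroring the behavior described in the commit message."""
    n = min(len(real), len(pred))
    if n < min_cells:
        return float("nan"), float("nan")
    m = min(subset_size, n)  # auto-shrink on small cohorts
    rng = np.random.default_rng(seed)
    d = real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3  # degree-3 polynomial kernel
    vals = []
    for _ in range(n_subsets):
        x = real[rng.choice(len(real), m, replace=False)]
        y = pred[rng.choice(len(pred), m, replace=False)]
        kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
        # unbiased MMD^2: drop the diagonal of the within-set kernel sums
        mmd2 = ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
                + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
                - 2.0 * kxy.mean())
        vals.append(mmd2)
    return float(np.mean(vals)), float(np.std(vals))
```

The std returned here is what the PR surfaces as "KID std" alongside the mean.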
Vendors pycytominer's variance + correlation feature-selection logic
(BSD-3 attribution in module header) so the CP regionprops track in
the eval pipeline can drop near-constant or redundantly-correlated
columns before they reach FID/KID/PRC. Operates on raw numpy without
a pandas dep, statistics computed on pooled (gt, pred) so the same
keep-mask is applied to both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
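A minimal numpy sketch of this variance + correlation pruning, in the spirit of pycytominer's feature selection (function name and exact drop conditions are illustrative, not the vendored code): a near-zero-variance pass per column, then greedy pruning of highly correlated survivors, returning one boolean keep-mask to apply to both gt and pred.

```python
import numpy as np

def select_features(x, freq_cut=0.05, unique_cut=0.01, corr_threshold=0.9):
    """Return a boolean keep-mask over the columns of x (pooled gt+pred).
    Sketch of pycytominer-style selection: drop near-constant columns,
    then greedily drop one of each pair correlated above corr_threshold."""
    n = x.shape[0]
    keep = np.ones(x.shape[1], dtype=bool)
    for j in range(x.shape[1]):  # per-column near-zero-variance test
        vals, counts = np.unique(x[:, j], return_counts=True)
        if len(vals) == 1:  # constant column
            keep[j] = False
            continue
        top2 = np.sort(counts)[::-1][:2]
        # dominated by one value, or too few distinct values
        if top2[1] / top2[0] < freq_cut or len(vals) / n < unique_cut:
            keep[j] = False
    idx = np.flatnonzero(keep)
    if len(idx) < 2:
        return keep
    corr = np.abs(np.corrcoef(x[:, idx], rowvar=False))
    for a in range(len(idx)):  # greedy correlation pruning
        if not keep[idx[a]]:
            continue
        for b in range(a + 1, len(idx)):
            if keep[idx[b]] and corr[a, b] > corr_threshold:
                keep[idx[b]] = False
    return keep
```

The per-column np.unique loop is fine for a ~30-column CP regionprops matrix but, as a later commit notes, should not be pointed at 1024-dim deep features without vectorizing.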
Probe-based diagnostics that FID/KID/MIND/PRC do not provide: how
linearly distinguishable are real vs predicted embeddings, and how well
do features preserve biological treatment information across plates.

GroupKFold by FOV holds out whole fields of view so per-FOV illumination,
segmentation artifacts, and biological clustering can't leak across the
split. MADScaler (median + median-absolute-deviation, not the IQR-based
sklearn RobustScaler) lives inside the CV Pipeline so normalization
statistics are computed only on the training fold. Logistic regression
defaults to class_weight='balanced' to handle unequal mock/infected
cohort sizes.

`paired_auroc` is the shared helper for the two probe call sites
(real-vs-predicted in pipeline.py and between-condition in the
post-hoc cross_condition_probe CLI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the in-tree _frechet_distance, _polynomial_mmd, and
_pairwise_feature_metrics in utils.py, plus cp_pairwise / deep_pairwise /
cp_dataset_fid_kid / deep_dataset_fid_kid / _nan_pairwise in metrics.py,
with calls into the new feature_metrics module. Dataset-level CP track
also runs feature_select pruning before similarity, plus the
real-vs-predicted linear probe per feature type. The three feature-type
similarity+probe calls run via ThreadPoolExecutor(max_workers=3) since
torch-fidelity, sklearn LBFGS, and numpy BLAS all release the GIL on
their inner loops.

CP-specific per-(FOV, timepoint) preprocessing (target-zero-column drop
and per-side z-score) moves to a small helper in pipeline.py since the
deep tracks need neither step.

Tests in test_evaluation_metrics.py drop the seven cp_pairwise /
deep_pairwise / pinned-value cases (covered by feature_metrics_test.py);
test_evaluation_pipeline.py grows stubs for the new feature_metrics,
feature_select, and linear_probe modules so the cache-reuse path can
still import the pipeline without pulling in torch-fidelity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-hoc diagnostic that runs FOV-stratified probes between two
infection conditions (mock vs denv, mock vs zikv) on the per-cell
embeddings saved by the eval pipeline. Per-plate MAD scaling cancels
the illumination/exposure/gain offsets that would otherwise let any
classifier separate the plates trivially (CP at AUROC=1.0 before this
guard); CP-only feature_select then drops near-constant columns before
the linear probe sees them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the predict_{local,batch}.sh / submit_benchmark_{job,batch}.py
pair so an entire (organelle, model, train_set, test_set) eval set
launches with one command. Unlike Lightning's predict CLI, dynacell
evaluate is Hydra-decorated and takes override key=value tokens
directly — the staging tools read the predict leaf, extract
output_store / dataset_ref / benchmark.*, build the override list,
and emit `uv run dynacell evaluate …` invocations.

evaluate_local.sh supports --parallel N and --cross-condition-probe;
the latter triggers an automatic post-step that runs
`python -m dynacell.evaluation.cross_condition_probe` over the three
A549 per-plate save_dirs once all per-plate evals complete.
evaluate_batch.sh wraps a single sbatch job that runs each leaf's
eval in series via srun.

Save-dir naming routes through dynacell.evaluation.save_paths so the
paths stay bit-identical to the paper aggregation scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dataset-level metrics + linear-probe block runs three feature types
concurrently via ThreadPoolExecutor. Without a thread-pool guard, each
worker's sklearn LBFGS + numpy ops can grab cpu_count() BLAS threads on
top of the three outer threads, oversubscribing cores on dense nodes.
threadpoolctl.threadpool_limits(1) wraps the executor block so each
outer thread gets exactly one BLAS thread.

Rename `_cp_pairwise_preprocess` to `_cp_dropzero_zscore` so the name
no longer collides visually with the deleted `cp_pairwise` helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
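The oversubscription guard can be sketched as follows, assuming threadpoolctl is available (it ships as a scikit-learn dependency). The helper name is illustrative; the pattern — `threadpool_limits(1)` wrapping the executor block — is what the commit describes.

```python
from concurrent.futures import ThreadPoolExecutor

from threadpoolctl import threadpool_limits

def run_feature_tracks(track_fns, max_workers=3):
    """Run the per-feature-type metric+probe callables concurrently while
    capping BLAS at one thread per outer worker. Without the guard, each
    worker's sklearn/numpy calls could spawn cpu_count() BLAS threads on
    top of the outer pool, oversubscribing cores on dense nodes."""
    with threadpool_limits(limits=1):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fn) for fn in track_fns]
            return [f.result() for f in futures]
```

Results come back in submission order, so callers can zip them against the feature-type names they submitted.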
ORGANELLE_EVAL_TARGET and ORGANELLE_A549_PLATE were already public
because the eval-runner tools import them; promote PAPER_KEY and
ORGANELLE_PAPER from underscore-prefixed to public for the same
reason. The four lookup tables are sibling concepts and the leading-
underscore split implied a private/public boundary that doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_resolve_one_leaf already loads + composes every leaf and reads
benchmark.trained_on as part of the call; return that string in the
tuple instead of re-running load_composed_config on the head leaf
just to fetch the same field. Saves one ~1-2 s compose per submit
invocation. Also adds a cross-leaf trained_on consistency check
mirroring the existing organelle / model_name checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CELL-DINO is a DINOv2-architecture ViT pretrained on
fluorescence microscopy (Human Protein Atlas). The
channel_adaptive_dino_vitl16 checkpoint processes one channel
at a time through a single-channel ViT-L/16 stem; the wrapper
reshapes (B, C, H, W) -> (B*C, 1, H, W), runs the backbone, and
mean-pools the cls token across channels for a fixed-dim
embedding regardless of input channel count. Weights load from
a local .pth state_dict — nothing fetched from the network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the CellDinoFeatureExtractor adapter (wraps
viscy_models.foundation.CellDinoModel; same extract_features(image_2d)
contract as DinoV3FeatureExtractor / DynaCLRFeatureExtractor), threads
celldino through the GT cache (FeatureKind literal, _CacheContext field,
manifest validation, fov_gt_deep_features dispatch), accumulates
pred/gt embeddings per (FOV, t), dispatches per-timepoint pairwise +
dataset-level FID/KID/PRC/MIND/probe via the existing parallel pool
(max_workers bumped to 4), saves the per-cell NPZs, and surfaces the
backbone in cross_condition_probe.

CELL-DINO is opt-in: it only runs when `feature_extractor.celldino.
weights_path` is set in the eval config. Existing eval dirs leave the
field null and see zero behavior change — the cross-condition probe
emits NaN rows with a "missing embeddings file" reason for any pair
where one side did not run the backbone.

Why: CELL-DINO is a DINOv2-architecture ViT pretrained on fluorescence
microscopy (HPA), so its embeddings carry cell-biology priors that
neither DINOv3 (LVD-1689M natural images) nor DynaCLR (in-house) does.
Useful as a third reference for the cross-condition probe and as a
fourth dataset-level FID/KID/PRC/MIND track.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- `cross_condition_probe._FEATURE_TYPES` derives from `cache.FeatureKind`
  via `typing.get_args` instead of hand-maintaining a parallel tuple, so
  the 5th backbone touches one line instead of two.
- `pipeline_cache._resolve_force` reads `force.gt_celldino` directly
  (matching the peers `gt_cp`/`gt_dinov3`/`gt_dynaclr`); the defensive
  `getattr(force, "gt_celldino", False)` was dead since eval.yaml now
  always defines the field.
- `pipeline_cache.fov_gt_deep_features` docstring updated: it accepts
  `celldino` in addition to `dinov3`/`dynaclr`.
- `eval.yaml` celldino block comment shrunk from 3 lines to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Factors the per-cell crop construction out of _extract_per_cell_features
into a public build_pred_crops / build_target_crops / features_from_crops
trio so the 2-D max-z projection, cell iteration, and crop construction
run once per (FOV, timepoint) instead of once per backbone. The
prediction-side loop in pipeline.py now builds crops one time and
dispatches to DINOv3 / DynaCLR / CellDINO over the same list — eliminates
3× redundant pure-CPU work per timepoint.

Each deep extractor (DinoV3, DynaCLR, CellDino) gains an
extract_features_batch(images, batch_size=...) method that stacks all
cells of one (FOV, t) into a single forward, chunked to bound VRAM
(64 for DynaCLR, 32 for the ViT backbones). features_from_crops uses
the batch path when available and falls back to per-cell when an
extractor only exposes extract_features — keeps the test-stub
_IdentityExtractor working and gives the same backward-compat to any
future extractor that does not need batching.

Net effect on the deep-feature stage: 1 max-z + 1 cell iteration per
timepoint instead of 3-4; one forward per backbone per (FOV, t)
instead of N forwards per cell. Wallclock impact depends on cell
count per FOV but is the highest-leverage item the simplify reviewer
flagged on the CELL-DINO PR — should yield 2-5× speedup on plates with
hundreds of cells per timepoint.

deep_target_features / deep_pred_features stay as backward-compat
wrappers so the GT cache compute_fn path and the existing test suite
keep working unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
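The batch-path-with-fallback dispatch can be sketched as duck typing on the extractor: use `extract_features_batch` when the extractor exposes it, otherwise fall back to per-cell `extract_features`, which is what keeps the test-stub `_IdentityExtractor` working. Method names follow the commit text; the array handling is an assumption.

```python
import numpy as np

def features_from_crops(extractor, crops, batch_size=32):
    """Compute one feature row per crop. Prefers the batched path
    (one chunked forward per (FOV, t)) and falls back to per-cell calls
    for extractors that only expose extract_features."""
    if hasattr(extractor, "extract_features_batch"):
        feats = np.asarray(
            extractor.extract_features_batch(crops, batch_size=batch_size)
        )
        # batched contract: leading dim must equal the number of crops
        assert len(feats) == len(crops)
        return feats
    return np.stack([extractor.extract_features(c) for c in crops])
```

Because the crops are built once and passed to every backbone, the max-z projection and cell iteration no longer repeat per extractor.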
Contributor

Copilot AI left a comment


Pull request overview

This PR migrates dynacell's evaluation pipeline to torch-fidelity for FID/KID/PRC/MIND metrics, adds a CELL-DINO foundation model wrapper, vendors pycytominer-style CP feature selection, and introduces FOV-stratified linear probes plus a cross-condition probe CLI. It also ships high-level eval-runner shell scripts, Hydra-override staging tools, and a new save_paths.py that mirrors paper-aggregation conventions.

Changes:

  • New feature-similarity stack: feature_metrics.py (torch-fidelity FID/KID/PRC/MIND + bootstrap stds), feature_select.py (variance + correlation pruning), linear_probe.py (MAD-scaled FOV-stratified AUROC), and cross_condition_probe.py CLI; pipeline.py rewired to compose deep-feature crops once and threadpool the per-prefix metrics.
  • New CELL-DINO foundation backbone (viscy_models.foundation.CellDinoModel + vendored DINOv2 ViT), wired through utils.CellDinoFeatureExtractor, the eval cache, and the eval Hydra config.
  • Eval-runner ergonomics: submit_evaluation_{job,batch}.py, evaluate_local.sh, evaluate_batch.sh, plus save_paths.py for canonical save-dir naming bit-identical to the paper repo.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
applications/dynacell/pyproject.toml Adds scikit-learn, pinned torch-fidelity git dep
uv.lock Lock entries for new eval deps
applications/dynacell/src/dynacell/evaluation/feature_metrics.py Torch-fidelity backed FID/KID/PRC/MIND + cosine, w/ pairwise variant
applications/dynacell/src/dynacell/evaluation/feature_metrics_test.py End-to-end metric tests
applications/dynacell/src/dynacell/evaluation/feature_select.py Vendored variance + correlation pruning
applications/dynacell/src/dynacell/evaluation/feature_select_test.py Pruning tests
applications/dynacell/src/dynacell/evaluation/linear_probe.py MADScaler + FOV-stratified AUROC + paired_auroc + indistinguishability
applications/dynacell/src/dynacell/evaluation/linear_probe_test.py Linear probe tests
applications/dynacell/src/dynacell/evaluation/cross_condition_probe.py Cross-condition diagnostic CLI on saved embeddings
applications/dynacell/src/dynacell/evaluation/metrics.py Splits per-cell crop construction from extractor invocation; drops legacy cp_pairwise/deep_pairwise/dataset helpers
applications/dynacell/src/dynacell/evaluation/utils.py Removes scipy-based FID/KID; adds *_batch extractor methods + CellDinoFeatureExtractor
applications/dynacell/src/dynacell/evaluation/pipeline.py Re-wired to use new feature stack, threadpool, CELL-DINO branch, mask sidecar
applications/dynacell/src/dynacell/evaluation/pipeline_cache.py CELL-DINO cache plumbing + sha hash hoist
applications/dynacell/src/dynacell/evaluation/cache.py Adds celldino FeatureKind + group helpers
applications/dynacell/src/dynacell/evaluation/_configs/eval.yaml Adds feature_extractor.celldino block
applications/dynacell/src/dynacell/evaluation/save_paths.py Canonical eval save-dir naming (paper-aligned, intentionally duplicated)
applications/dynacell/tools/submit_evaluation_job.py Stage one dynacell evaluate invocation from a predict leaf
applications/dynacell/tools/submit_evaluation_batch.py Chain N per-leaf invocations into one sbatch
applications/dynacell/tools/evaluate_local.sh Local-host wrapper around submit_evaluation_job.py
applications/dynacell/tools/evaluate_batch.sh sbatch wrapper around submit_evaluation_batch.py
applications/dynacell/tests/test_evaluation_pipeline.py Stub modules for new evaluation submodules
applications/dynacell/tests/test_evaluation_metrics.py Drops legacy pairwise tests, adds batched-extractor tests
packages/viscy-models/src/viscy_models/foundation/cell_dino.py CELL-DINO wrapper (preprocess + channel-mean cls pooling)
packages/viscy-models/src/viscy_models/foundation/_dinov2_vit.py Vendored eval-only DINOv2 ViT
packages/viscy-models/src/viscy_models/foundation/__init__.py Re-export CellDinoModel
Comments suppressed due to low confidence (5)

packages/viscy-models/src/viscy_models/foundation/cell_dino.py:83

  • nn.Module.load_state_dict(state_dict, strict=True) raises RuntimeError itself when there are any missing or unexpected keys, so the subsequent if missing or unexpected: block is unreachable in normal operation. Either pass strict=False so the manual check actually runs (and provides the more descriptive error message) or drop the check entirely.
        missing, unexpected = self.model.load_state_dict(state_dict, strict=True)
        if missing or unexpected:
            raise RuntimeError(f"CELL-DINO state_dict mismatch — missing={missing}, unexpected={unexpected}")

applications/dynacell/src/dynacell/evaluation/cross_condition_probe.py:124

  • MAD normalization is applied twice on the same feature matrix: first per-plate here via MADScaler().fit_transform(...), then again inside paired_auroc/fov_stratified_auroc whose Pipeline includes MADScaler() as the first step (linear_probe.py line 131-143). The second pass operates on already MAD-normalized data, which is unlikely to harm correctness but does undo part of the per-plate centering and produces non-obvious feature scales for the logistic regression. If the per-plate MAD is the intended normalization (per the PR description's "per-plate MAD scaling to cancel illumination/exposure offsets"), consider exposing a way for fov_stratified_auroc to skip the inner scaler when the caller has already normalized — or document that double normalization is intentional.
    # Per-plate MAD normalization: cancels per-plate intensity offsets
    # (illumination, exposure, gain) that would otherwise dominate the
    # classifier — especially on CP regionprops, where raw intensity
    # columns make different plates trivially separable (AUROC = 1.0).
    x0_scaled = MADScaler().fit_transform(x0.astype(np.float64))
    x1_scaled = MADScaler().fit_transform(x1.astype(np.float64))
    # Tag FOV ids by condition so the two sides cannot collide.
    fov0_tagged = np.asarray([f"{c0}::{f}" for f in fov0])
    fov1_tagged = np.asarray([f"{c1}::{f}" for f in fov1])
    result = paired_auroc(x0_scaled, x1_scaled, fov0_tagged, fov1_tagged, n_splits=n_splits, rng_seed=rng_seed)

applications/dynacell/tools/evaluate_local.sh:171

  • This printf … | xargs … > /dev/null line is dead code: xargs -I {} echo {} echoes each token to stdout which is then sent to /dev/null, and any parse errors from xargs are not checked (set -e would catch a non-zero exit, but xargs -I will not error on simple newline-separated inputs). The actual command parse happens in the subshell below via the while IFS= read -r loop. Consider removing this line.
  printf '%s\n' "$cmd_text" | xargs -d '\n' -I {} echo {} > /dev/null  # validate parse

packages/viscy-models/src/viscy_models/foundation/_dinov2_vit.py:302

  • x.shape for a 4D image tensor is (B, C, H, W), so this destructuring binds w to the height and h to the width. The downstream interpolate_pos_encoding(x, w, h) is then called with positional args that are consistently swapped throughout this module (e.g. w0 = w // self.patch_size, h0 = h // self.patch_size). For square inputs (CELL-DINO is trained at 224×224) the output is identical, but on any non-square input the positional encoding grid would be transposed. This mirrors the original DINOv2 source which has the same naming bug; consider adding a clarifying comment so future readers don't mistake w/h for actual width/height.
    def prepare_tokens_with_masks(self, x: Tensor, masks: Optional[Tensor] = None) -> Tensor:
        _, _, w, h = x.shape
        x = self.patch_embed(x)
        if masks is not None:
            x = torch.where(masks.unsqueeze(-1), self.mask_token.to(x.dtype).unsqueeze(0), x)
        x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + self.interpolate_pos_encoding(x, w, h)

applications/dynacell/tools/submit_evaluation_job.py:156

  • The plate label inference relies on hardcoded suffix matching against _mock / _denv / _zikv and the literal stem predict__ipsc_confocal. This is duplicated verbatim in submit_evaluation_batch.py (lines 144-154). Any new test plate added later will require both files to be updated in lockstep. Consider extracting the plate-from-leaf-stem logic into dynacell.evaluation.save_paths (alongside the related path naming) so there is a single source of truth.
    # Plate label for save_dir: ipsc | mock | denv | zikv.
    # Predict leaves are named predict__<test_set>[_<cond>].yml. Extract the
    # trailing condition for a549; the bare ipsc_confocal predict leaf is iPSC.
    leaf_stem = leaf_path.stem  # predict__a549_mantis_mock
    if leaf_stem == "predict__ipsc_confocal":
        test_plate = "ipsc"
    elif leaf_stem.endswith("_mock"):
        test_plate = "mock"
    elif leaf_stem.endswith("_denv"):
        test_plate = "denv"
    elif leaf_stem.endswith("_zikv"):
        test_plate = "zikv"
    else:
        raise SystemExit(
            f"cannot infer test plate from leaf filename: {leaf_path.name} "
            f"(expected predict__ipsc_confocal.yml or predict__*_{{mock,denv,zikv}}.yml)"
        )


alxndrkalinin and others added 9 commits May 15, 2026 13:38
`--print-cmd` already prints the resolved command tokens, but without
`--dry-run` it then fell through to `subprocess.run(cmd_tokens)`. That
contradicted the help text ("print the literal `dynacell evaluate`
command to stdout (no exec)") and meant a caller probing the staged
invocation would also actually execute it.

Gate the subprocess.run call so `--print-cmd` alone short-circuits at
exit code 0. `--dry-run` still wins when both are passed; combining
them retains the existing provenance behavior used by evaluate_local.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
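The gating fix described above amounts to not falling through to `subprocess.run` when only `--print-cmd` is set. A minimal sketch (function and flag names are illustrative, not the tool's actual API):

```python
import shlex
import subprocess

def maybe_run(cmd_tokens, print_cmd=False, dry_run=False):
    """--print-cmd alone prints the staged command and short-circuits at
    exit code 0; --dry-run still wins when both flags are passed (the
    provenance-only path). Only with neither flag does the staged
    `dynacell evaluate` invocation actually execute."""
    if print_cmd:
        print(shlex.join(cmd_tokens))
    if dry_run or print_cmd:
        return 0  # no exec: matches the help text's "(no exec)" promise
    return subprocess.run(cmd_tokens).returncode
```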
evaluate_batch.sh built `EVAL_${ORG}_${MODEL^^}_${TRAIN}_ON_${TEST}` and
passed it via `--job-name`, but `${MODEL^^}` keeps the raw code-side
config key (e.g. `FCMAE_VSCYTO3D_SCRATCH`) while
submit_evaluation_batch.py's `_composite_job_name` runs the model
through `paper_key()` first (e.g. `UNEXT2`). Result: the SLURM job name
disagreed with the staged provenance files for the same inputs.

Remove the bash JOB_NAME construction entirely. The python tool's
default already produces the canonical `EVAL_<ORG>_<PAPER>_<TRAIN>_ON_<TEST>`
shape via paper_key translation. A user who needs a custom job name can
still pass `--job-name FOO` through the extra-args tail of the bash
wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`range(0, depth, depth // block_chunks)` produces more than
`block_chunks` BlockChunks when depth doesn't divide cleanly (e.g.
depth=25, block_chunks=4 -> 5 chunks of [6, 6, 6, 6, 1]). The current
ViT-L/16 configuration is fine (depth=24, block_chunks=4), but a future
ViT size with a non-divisible depth would silently produce a state_dict
layout the loader cannot match.

Raise ValueError up front rather than constructing a malformed module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
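The divisibility hazard and its guard can be shown in isolation (helper name is illustrative; the vendored module builds BlockChunks from these start offsets):

```python
def chunk_starts(depth: int, block_chunks: int) -> list[int]:
    """Start offsets for splitting `depth` transformer blocks into
    `block_chunks` chunks. The naive range(0, depth, depth // block_chunks)
    silently yields an extra chunk when depth doesn't divide cleanly
    (depth=25, block_chunks=4 -> starts 0,6,12,18,24 -> chunk sizes
    6,6,6,6,1), so fail fast instead of building a malformed layout."""
    if block_chunks <= 0 or depth % block_chunks != 0:
        raise ValueError(
            f"depth={depth} is not divisible by block_chunks={block_chunks}; "
            "this would produce a state_dict layout the loader cannot match"
        )
    return list(range(0, depth, depth // block_chunks))
```

The current ViT-L/16 config (depth=24, block_chunks=4) passes the guard; a future non-divisible ViT size now fails at construction time.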
`_require_cell_dino` instructed users to `pip install viscy-models`,
but no PyPI distribution by that name provides the in-tree workspace
package. A user hitting the ImportError would not be able to recover
by following the suggestion.

Rewrite the message to point at the actual install paths:
`uv sync --all-packages --all-extras` from the repo root, or
`pip install -e packages/viscy-models` for an editable single-package
install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two Copilot review findings on feature_select.py:

1. The cp_selected_feature_mask.json sidecar emitted by pipeline.py
   hardcoded `{freq_cut: 0.05, unique_cut: 0.01, corr_threshold: 0.9}`,
   matching the in-tree `select_features` defaults but drifting silently
   if either side ever changed. Promote the three thresholds to
   module-level constants (DEFAULT_FREQ_CUT / DEFAULT_UNIQUE_CUT /
   DEFAULT_CORR_THRESHOLD), use them as kwarg defaults across
   variance_threshold / correlation_threshold / select_features, and
   reference the same constants from pipeline.py so the sidecar tracks
   the actual call.
2. Note variance_threshold's per-column np.unique loop in its docstring;
   it's designed for the ~30-col CP regionprops matrix and should not
   be applied to deep-feature matrices (1024+ columns) without
   vectorizing first.

The test_evaluation_pipeline.py stub gains the three new constants so
the pipeline-under-stub still imports cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A prior simplify commit (680337e) replaced
`getattr(force, "gt_celldino", False)` with `force.gt_celldino` in
`_resolve_force`, but the test fixtures in test_pipeline_cache.py were
not updated to include the new key. Result: every test that goes
through `init_cache_context` or `_resolve_force` directly raised
`omegaconf.errors.ConfigAttributeError: Missing key gt_celldino`.

Add the key to the three force_recompute dict literals (the shared
fixture builder plus the two inline `_resolve_force_*` test bodies).
This realigns the test config shape with eval.yaml, which always emits
the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t drift

save_paths.eval_save_dir is deliberately duplicated with
paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for in
a separate git repo. Any silent divergence between the two definitions
breaks downstream paper aggregation scripts (they can no longer find
metrics in the expected location), but the failure surfaces only when a
paper job runs against a fresh eval.

Add a parametrized pin test covering one representative case from each
branch of eval_save_dir:
- iPSC test x {ipsc, a549, joint} train
- A549 test x {ipsc, a549, joint} train

The pins encode the actual expected paths (relative to data_root) and
fail on any drift from save_paths.py, forcing a future change here to
also touch the paper-side mirror.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r contract

Three docstring-only clarifications from Copilot review:

1. `pyproject.toml`: note the torch-fidelity git-SHA pin's upgrade path
   (drop the `@5e211a9` once upstream cuts a `>=0.5.0` release with
   MIND), so the pin doesn't become permanent technical debt.
2. `feature_metrics.compute_feature_similarity`: spell out that
   Precision / Recall / F1 are bootstrap *means* over resamples drawn
   with replacement, so the absolute numbers are systematically lower
   than single-shot PRC and not directly comparable to non-bootstrap
   tables. Within-scheme comparisons (across models / plates /
   conditions evaluated with the same setting) are still valid.
3. `metrics.features_from_crops`: document the implicit return-type
   contract for `extract_features` / `extract_features_batch`. Both
   paths call `.detach().cpu()` and assume a torch.Tensor; the batched
   path additionally assumes the leading dim equals `len(crops)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ided

The module docstring and `weights_path` Parameter docstring previously
called the Biohub HPC checkpoint path the model's "default", but
`weights_path` is a required init arg with no default. Rephrase both
to make clear the caller passes the path explicitly and the HPC
location is one example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alxndrkalinin alxndrkalinin merged commit ef3dcd7 into dynacell-models May 15, 2026
@alxndrkalinin alxndrkalinin deleted the feature-metrics-torch-fidelity branch May 15, 2026 22:25
