feat(dynacell): migrate eval metrics to torch-fidelity, add probes #423

Merged
alxndrkalinin merged 24 commits into dynacell-models from feature-metrics-torch-fidelity on May 15, 2026
Conversation

@alxndrkalinin
Collaborator

Summary

  • Migrate dynacell's eval pipeline to torch-fidelity for FID/KID/PRC/MIND, adding KID std + non-parametric P/R/F1 bootstrap + MIND (kernel-free, motivated by the paper's Appendix I mode-collapse analysis).
  • Add CP feature selection (vendored pycytominer variance + correlation pruning, BSD-3), FOV-stratified linear probes (real-vs-pred + cross-condition), and the dynacell.evaluation.cross_condition_probe CLI with per-plate MAD scaling to cancel illumination/exposure offsets.
  • Add high-level eval-runner scripts (evaluate_local.sh, evaluate_batch.sh) plus submit_evaluation_{job,batch}.py Hydra-override staging tools; save-dir naming routes through a new save_paths.py that stays bit-identical to the paper aggregation scripts.

Stacks on dynacell-models (PR #404).

Test plan

  • uv run pytest applications/dynacell/src/dynacell/evaluation/ applications/dynacell/tests/test_evaluation_metrics.py applications/dynacell/tests/test_evaluation_pipeline.py --deselect applications/dynacell/tests/test_evaluation_metrics.py::test_gain_and_offset_errors_are_not_scale_invariant → 41/41 pass (10 feature_metrics, 8 feature_select, 9 linear_probe, 13 evaluation_metrics, 1 evaluation_pipeline). The one deselected test is a pre-existing failure on dynacell-models (commit bfc61891 made nrmse scale-invariant).
  • uvx ruff check clean on all touched files.
  • submit_evaluation_job.py --dry-run on membrane/fnet3d_paper/a549_mantis/predict__a549_mantis_mock.yml produces canonical eval_fnet3d_a549trained_membrane_mock save_dir with correct .cmd + .yml provenance.
  • eval_save_dir matches paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for on 5 spot-checked (organelle, model, train_set, plate) combos including iPSC-trained, A549-trained, and joint flavors.
  • Real-data cross_condition_probe on three celldiff joint membrane A549 plates: CP AUROCs in [0.46, 0.71] (no longer 1.0 after per-plate MAD + select_features); DINOv3/DynaCLR at ceiling, consistent with cross-condition morphology being learnable in deep embedding space.
  • End-to-end uv run dynacell evaluate ... on a real predict zarr (requires DINOv3 + DynaCLR weights cached + segmentation; left for reviewer to drive once stacked PR lands).

Notes

  • torch-fidelity is pinned via git URL to commit 5e211a9 because MIND is not in any tagged release yet.
  • save_paths.py is deliberately duplicated with the paper repo's compute_all_organelle_precision_recall.py:eval_dir_for; both files carry a comment pointing at the other.
  • Per-(FOV, timepoint) KID columns on ~50-cell A549 cohorts are noisy and flagged as exploratory.

🤖 Generated with Claude Code

alxndrkalinin and others added 15 commits May 15, 2026 09:30
torch-fidelity pinned to a commit on toshas/torch-fidelity master that
includes MIND on top of v0.4.0 — MIND is not in any tagged release yet
and the upcoming eval pipeline depends on it. scikit-learn is already a
transitive dep but pinning it explicitly under the eval extra protects
the new linear-probe code path against an upstream prune.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Produces the same on-disk save_dir layout the paper aggregation scripts
in /hpc/mydata/alex.kalinin/dynacell expect (paper key + train-set
infix + organelle long-form + infection suffix). The library copy here
and the helper in paper/scripts/compute_all_organelle_precision_recall.py
must stay in sync; both files carry a comment naming the other, since
VisCy and the paper repo are independent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wraps torch-fidelity's *_features_to_metric helpers so the eval pipeline
can keep feeding pre-extracted per-cell embeddings directly (the library's
top-level calculate_metrics() can't accept precomputed features). Three
gaps the prior in-tree implementations did not close are now closed:
KID gets a subset-resampled mean+std, P/R/F1 gain a non-parametric
bootstrap, and MIND (sliced 2-Wasserstein) joins as a kernel-free
sensitivity-to-distribution-shape diagnostic motivated by the paper's
Appendix I mode-collapse analysis.

Small-cohort KID auto-shrinks the subset size and NaNs out below 16
cells; PRC bootstrap resamples both sides with replacement before each
manifold rebuild; tensor conversion happens once outside the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
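The subset-resampled KID behavior described above (auto-shrinking subset size, NaN below 16 cells, mean + std over resamples) can be sketched as a standalone numpy routine. This is a hypothetical illustration of the technique, not the torch-fidelity internals; the degree-3 polynomial kernel matches the conventional KID definition.

```python
import numpy as np

def kid_mean_std(real, pred, subset_size=100, n_subsets=10, min_cells=16, seed=0):
    """Subset-resampled KID: polynomial-kernel unbiased MMD^2 over random
    subsets, reported as (mean, std). Sketch only: the subset size
    auto-shrinks on small cohorts, and cohorts below `min_cells` NaN out,
    mirroring the behavior described in the commit message."""
    n = min(len(real), len(pred))
    if n < min_cells:
        return float("nan"), float("nan")
    m = min(subset_size, n)  # auto-shrink on small cohorts
    rng = np.random.default_rng(seed)
    d = real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3  # degree-3 polynomial kernel
    vals = []
    for _ in range(n_subsets):
        x = real[rng.choice(len(real), m, replace=False)]
        y = pred[rng.choice(len(pred), m, replace=False)]
        kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
        # unbiased MMD^2: drop the diagonal of the within-set kernel sums
        mmd2 = ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
                + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
                - 2.0 * kxy.mean())
        vals.append(mmd2)
    return float(np.mean(vals)), float(np.std(vals))
```

The std returned here is what the PR surfaces as "KID std" alongside the mean.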
Vendors pycytominer's variance + correlation feature-selection logic
(BSD-3 attribution in module header) so the CP regionprops track in
the eval pipeline can drop near-constant or redundantly-correlated
columns before they reach FID/KID/PRC. Operates on raw numpy without
a pandas dep, statistics computed on pooled (gt, pred) so the same
keep-mask is applied to both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
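A minimal numpy sketch of this variance + correlation pruning, in the spirit of pycytominer's feature selection (function name and exact drop conditions are illustrative, not the vendored code): a near-zero-variance pass per column, then greedy pruning of highly correlated survivors, returning one boolean keep-mask to apply to both gt and pred.

```python
import numpy as np

def select_features(x, freq_cut=0.05, unique_cut=0.01, corr_threshold=0.9):
    """Return a boolean keep-mask over the columns of x (pooled gt+pred).
    Sketch of pycytominer-style selection: drop near-constant columns,
    then greedily drop one of each pair correlated above corr_threshold."""
    n = x.shape[0]
    keep = np.ones(x.shape[1], dtype=bool)
    for j in range(x.shape[1]):  # per-column near-zero-variance test
        vals, counts = np.unique(x[:, j], return_counts=True)
        if len(vals) == 1:  # constant column
            keep[j] = False
            continue
        top2 = np.sort(counts)[::-1][:2]
        # dominated by one value, or too few distinct values
        if top2[1] / top2[0] < freq_cut or len(vals) / n < unique_cut:
            keep[j] = False
    idx = np.flatnonzero(keep)
    if len(idx) < 2:
        return keep
    corr = np.abs(np.corrcoef(x[:, idx], rowvar=False))
    for a in range(len(idx)):  # greedy correlation pruning
        if not keep[idx[a]]:
            continue
        for b in range(a + 1, len(idx)):
            if keep[idx[b]] and corr[a, b] > corr_threshold:
                keep[idx[b]] = False
    return keep
```

The per-column np.unique loop is fine for a ~30-column CP regionprops matrix but, as a later commit notes, should not be pointed at 1024-dim deep features without vectorizing.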
Probe-based diagnostics that FID/KID/MIND/PRC do not provide: how
linearly distinguishable are real vs predicted embeddings, and how well
do features preserve biological treatment information across plates.

GroupKFold by FOV holds out whole fields of view so per-FOV illumination,
segmentation artifacts, and biological clustering can't leak across the
split. MADScaler (median + median-absolute-deviation, not the IQR-based
sklearn RobustScaler) lives inside the CV Pipeline so normalization
statistics are computed only on the training fold. Logistic regression
defaults to class_weight='balanced' to handle unequal mock/infected
cohort sizes.

`paired_auroc` is the shared helper for the two probe call sites
(real-vs-predicted in pipeline.py and between-condition in the
post-hoc cross_condition_probe CLI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the in-tree _frechet_distance, _polynomial_mmd, and
_pairwise_feature_metrics in utils.py, plus cp_pairwise / deep_pairwise /
cp_dataset_fid_kid / deep_dataset_fid_kid / _nan_pairwise in metrics.py,
with calls into the new feature_metrics module. Dataset-level CP track
also runs feature_select pruning before similarity, plus the
real-vs-predicted linear probe per feature type. The three feature-type
similarity+probe calls run via ThreadPoolExecutor(max_workers=3) since
torch-fidelity, sklearn LBFGS, and numpy BLAS all release the GIL on
their inner loops.

CP-specific per-(FOV, timepoint) preprocessing (target-zero-column drop
and per-side z-score) moves to a small helper in pipeline.py since the
deep tracks need neither step.

Tests in test_evaluation_metrics.py drop the seven cp_pairwise /
deep_pairwise / pinned-value cases (covered by feature_metrics_test.py);
test_evaluation_pipeline.py grows stubs for the new feature_metrics,
feature_select, and linear_probe modules so the cache-reuse path can
still import the pipeline without pulling in torch-fidelity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-hoc diagnostic that runs FOV-stratified probes between two
infection conditions (mock vs denv, mock vs zikv) on the per-cell
embeddings saved by the eval pipeline. Per-plate MAD scaling cancels
the illumination/exposure/gain offsets that would otherwise let any
classifier separate the plates trivially (CP at AUROC=1.0 before this
guard); CP-only feature_select then drops near-constant columns before
the linear probe sees them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the predict_{local,batch}.sh / submit_benchmark_{job,batch}.py
pair so an entire (organelle, model, train_set, test_set) eval set
launches with one command. Unlike Lightning's predict CLI, dynacell
evaluate is Hydra-decorated and takes override key=value tokens
directly — the staging tools read the predict leaf, extract
output_store / dataset_ref / benchmark.*, build the override list,
and emit `uv run dynacell evaluate …` invocations.

evaluate_local.sh supports --parallel N and --cross-condition-probe;
the latter triggers an automatic post-step that runs
`python -m dynacell.evaluation.cross_condition_probe` over the three
A549 per-plate save_dirs once all per-plate evals complete.
evaluate_batch.sh wraps a single sbatch job that runs each leaf's
eval in series via srun.

Save-dir naming routes through dynacell.evaluation.save_paths so the
paths stay bit-identical to the paper aggregation scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dataset-level metrics + linear-probe block runs three feature types
concurrently via ThreadPoolExecutor. Without a thread-pool guard, each
worker's sklearn LBFGS + numpy ops can grab cpu_count() BLAS threads on
top of the three outer threads, oversubscribing cores on dense nodes.
threadpoolctl.threadpool_limits(1) wraps the executor block so each
outer thread gets exactly one BLAS thread.

Rename `_cp_pairwise_preprocess` to `_cp_dropzero_zscore` so the name
no longer collides visually with the deleted `cp_pairwise` helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
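The oversubscription guard can be sketched as follows, assuming threadpoolctl is available (it ships as a scikit-learn dependency). The helper name is illustrative; the pattern — `threadpool_limits(1)` wrapping the executor block — is what the commit describes.

```python
from concurrent.futures import ThreadPoolExecutor

from threadpoolctl import threadpool_limits

def run_feature_tracks(track_fns, max_workers=3):
    """Run the per-feature-type metric+probe callables concurrently while
    capping BLAS at one thread per outer worker. Without the guard, each
    worker's sklearn/numpy calls could spawn cpu_count() BLAS threads on
    top of the outer pool, oversubscribing cores on dense nodes."""
    with threadpool_limits(limits=1):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fn) for fn in track_fns]
            return [f.result() for f in futures]
```

Results come back in submission order, so callers can zip them against the feature-type names they submitted.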
ORGANELLE_EVAL_TARGET and ORGANELLE_A549_PLATE were already public
because the eval-runner tools import them; promote PAPER_KEY and
ORGANELLE_PAPER from underscore-prefixed to public for the same
reason. The four lookup tables are sibling concepts and the leading-
underscore split implied a private/public boundary that doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_resolve_one_leaf already loads + composes every leaf and reads
benchmark.trained_on as part of the call; return that string in the
tuple instead of re-running load_composed_config on the head leaf
just to fetch the same field. Saves one ~1-2 s compose per submit
invocation. Also adds a cross-leaf trained_on consistency check
mirroring the existing organelle / model_name checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CELL-DINO is a DINOv2-architecture ViT pretrained on
fluorescence microscopy (Human Protein Atlas). The
channel_adaptive_dino_vitl16 checkpoint processes one channel
at a time through a single-channel ViT-L/16 stem; the wrapper
reshapes (B, C, H, W) -> (B*C, 1, H, W), runs the backbone, and
mean-pools the cls token across channels for a fixed-dim
embedding regardless of input channel count. Weights load from
a local .pth state_dict — nothing fetched from the network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the CellDinoFeatureExtractor adapter (wraps
viscy_models.foundation.CellDinoModel; same extract_features(image_2d)
contract as DinoV3FeatureExtractor / DynaCLRFeatureExtractor), threads
celldino through the GT cache (FeatureKind literal, _CacheContext field,
manifest validation, fov_gt_deep_features dispatch), accumulates
pred/gt embeddings per (FOV, t), dispatches per-timepoint pairwise +
dataset-level FID/KID/PRC/MIND/probe via the existing parallel pool
(max_workers bumped to 4), saves the per-cell NPZs, and surfaces the
backbone in cross_condition_probe.

CELL-DINO is opt-in: it only runs when `feature_extractor.celldino.
weights_path` is set in the eval config. Existing eval dirs leave the
field null and see zero behavior change — the cross-condition probe
emits NaN rows with a "missing embeddings file" reason for any pair
where one side did not run the backbone.

Why: CELL-DINO is a DINOv2-architecture ViT pretrained on fluorescence
microscopy (HPA), so its embeddings carry cell-biology priors that
neither DINOv3 (LVD-1689M natural images) nor DynaCLR (in-house) does.
Useful as a third reference for the cross-condition probe and as a
fourth dataset-level FID/KID/PRC/MIND track.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- `cross_condition_probe._FEATURE_TYPES` derives from `cache.FeatureKind`
  via `typing.get_args` instead of hand-maintaining a parallel tuple, so
  the 5th backbone touches one line instead of two.
- `pipeline_cache._resolve_force` reads `force.gt_celldino` directly
  (matching the peers `gt_cp`/`gt_dinov3`/`gt_dynaclr`); the defensive
  `getattr(force, "gt_celldino", False)` was dead since eval.yaml now
  always defines the field.
- `pipeline_cache.fov_gt_deep_features` docstring updated: it accepts
  `celldino` in addition to `dinov3`/`dynaclr`.
- `eval.yaml` celldino block comment shrunk from 3 lines to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Factors the per-cell crop construction out of _extract_per_cell_features
into a public build_pred_crops / build_target_crops / features_from_crops
trio so the 2-D max-z projection, cell iteration, and crop construction
run once per (FOV, timepoint) instead of once per backbone. The
prediction-side loop in pipeline.py now builds crops one time and
dispatches to DINOv3 / DynaCLR / CellDINO over the same list — eliminates
3× redundant pure-CPU work per timepoint.

Each deep extractor (DinoV3, DynaCLR, CellDino) gains an
extract_features_batch(images, batch_size=...) method that stacks all
cells of one (FOV, t) into a single forward, chunked to bound VRAM
(64 for DynaCLR, 32 for the ViT backbones). features_from_crops uses
the batch path when available and falls back to per-cell when an
extractor only exposes extract_features — keeps the test-stub
_IdentityExtractor working and gives the same backward-compat to any
future extractor that does not need batching.

Net effect on the deep-feature stage: 1 max-z + 1 cell iteration per
timepoint instead of 3-4; one forward per backbone per (FOV, t)
instead of N forwards per cell. Wallclock impact depends on cell
count per FOV but is the highest-leverage item the simplify reviewer
flagged on the CELL-DINO PR — should yield 2-5× speedup on plates with
hundreds of cells per timepoint.

deep_target_features / deep_pred_features stay as backward-compat
wrappers so the GT cache compute_fn path and the existing test suite
keep working unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
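The batch-path-with-fallback dispatch can be sketched as duck typing on the extractor: use `extract_features_batch` when the extractor exposes it, otherwise fall back to per-cell `extract_features`, which is what keeps the test-stub `_IdentityExtractor` working. Method names follow the commit text; the array handling is an assumption.

```python
import numpy as np

def features_from_crops(extractor, crops, batch_size=32):
    """Compute one feature row per crop. Prefers the batched path
    (one chunked forward per (FOV, t)) and falls back to per-cell calls
    for extractors that only expose extract_features."""
    if hasattr(extractor, "extract_features_batch"):
        feats = np.asarray(
            extractor.extract_features_batch(crops, batch_size=batch_size)
        )
        # batched contract: leading dim must equal the number of crops
        assert len(feats) == len(crops)
        return feats
    return np.stack([extractor.extract_features(c) for c in crops])
```

Because the crops are built once and passed to every backbone, the max-z projection and cell iteration no longer repeat per extractor.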
Contributor

Copilot AI left a comment


Pull request overview

This PR migrates dynacell's evaluation pipeline to torch-fidelity for FID/KID/PRC/MIND metrics, adds a CELL-DINO foundation model wrapper, vendors pycytominer-style CP feature selection, and introduces FOV-stratified linear probes plus a cross-condition probe CLI. It also ships high-level eval-runner shell scripts, Hydra-override staging tools, and a new save_paths.py that mirrors paper-aggregation conventions.

Changes:

  • New feature-similarity stack: feature_metrics.py (torch-fidelity FID/KID/PRC/MIND + bootstrap stds), feature_select.py (variance + correlation pruning), linear_probe.py (MAD-scaled FOV-stratified AUROC), and cross_condition_probe.py CLI; pipeline.py rewired to compose deep-feature crops once and threadpool the per-prefix metrics.
  • New CELL-DINO foundation backbone (viscy_models.foundation.CellDinoModel + vendored DINOv2 ViT), wired through utils.CellDinoFeatureExtractor, the eval cache, and the eval Hydra config.
  • Eval-runner ergonomics: submit_evaluation_{job,batch}.py, evaluate_local.sh, evaluate_batch.sh, plus save_paths.py for canonical save-dir naming bit-identical to the paper repo.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
applications/dynacell/pyproject.toml Adds scikit-learn, pinned torch-fidelity git dep
uv.lock Lock entries for new eval deps
applications/dynacell/src/dynacell/evaluation/feature_metrics.py Torch-fidelity backed FID/KID/PRC/MIND + cosine, w/ pairwise variant
applications/dynacell/src/dynacell/evaluation/feature_metrics_test.py End-to-end metric tests
applications/dynacell/src/dynacell/evaluation/feature_select.py Vendored variance + correlation pruning
applications/dynacell/src/dynacell/evaluation/feature_select_test.py Pruning tests
applications/dynacell/src/dynacell/evaluation/linear_probe.py MADScaler + FOV-stratified AUROC + paired_auroc + indistinguishability
applications/dynacell/src/dynacell/evaluation/linear_probe_test.py Linear probe tests
applications/dynacell/src/dynacell/evaluation/cross_condition_probe.py Cross-condition diagnostic CLI on saved embeddings
applications/dynacell/src/dynacell/evaluation/metrics.py Splits per-cell crop construction from extractor invocation; drops legacy cp_pairwise/deep_pairwise/dataset helpers
applications/dynacell/src/dynacell/evaluation/utils.py Removes scipy-based FID/KID; adds *_batch extractor methods + CellDinoFeatureExtractor
applications/dynacell/src/dynacell/evaluation/pipeline.py Re-wired to use new feature stack, threadpool, CELL-DINO branch, mask sidecar
applications/dynacell/src/dynacell/evaluation/pipeline_cache.py CELL-DINO cache plumbing + sha hash hoist
applications/dynacell/src/dynacell/evaluation/cache.py Adds celldino FeatureKind + group helpers
applications/dynacell/src/dynacell/evaluation/_configs/eval.yaml Adds feature_extractor.celldino block
applications/dynacell/src/dynacell/evaluation/save_paths.py Canonical eval save-dir naming (paper-aligned, intentionally duplicated)
applications/dynacell/tools/submit_evaluation_job.py Stage one dynacell evaluate invocation from a predict leaf
applications/dynacell/tools/submit_evaluation_batch.py Chain N per-leaf invocations into one sbatch
applications/dynacell/tools/evaluate_local.sh Local-host wrapper around submit_evaluation_job.py
applications/dynacell/tools/evaluate_batch.sh sbatch wrapper around submit_evaluation_batch.py
applications/dynacell/tests/test_evaluation_pipeline.py Stub modules for new evaluation submodules
applications/dynacell/tests/test_evaluation_metrics.py Drops legacy pairwise tests, adds batched-extractor tests
packages/viscy-models/src/viscy_models/foundation/cell_dino.py CELL-DINO wrapper (preprocess + channel-mean cls pooling)
packages/viscy-models/src/viscy_models/foundation/_dinov2_vit.py Vendored eval-only DINOv2 ViT
packages/viscy-models/src/viscy_models/foundation/__init__.py Re-export CellDinoModel
Comments suppressed due to low confidence (5)

packages/viscy-models/src/viscy_models/foundation/cell_dino.py:83

  • nn.Module.load_state_dict(state_dict, strict=True) raises RuntimeError itself when there are any missing or unexpected keys, so the subsequent if missing or unexpected: block is unreachable in normal operation. Either pass strict=False so the manual check actually runs (and provides the more descriptive error message) or drop the check entirely.
        missing, unexpected = self.model.load_state_dict(state_dict, strict=True)
        if missing or unexpected:
            raise RuntimeError(f"CELL-DINO state_dict mismatch — missing={missing}, unexpected={unexpected}")

applications/dynacell/src/dynacell/evaluation/cross_condition_probe.py:124

  • MAD normalization is applied twice on the same feature matrix: first per-plate here via MADScaler().fit_transform(...), then again inside paired_auroc/fov_stratified_auroc whose Pipeline includes MADScaler() as the first step (linear_probe.py line 131-143). The second pass operates on already MAD-normalized data, which is unlikely to harm correctness but does undo part of the per-plate centering and produces non-obvious feature scales for the logistic regression. If the per-plate MAD is the intended normalization (per the PR description's "per-plate MAD scaling to cancel illumination/exposure offsets"), consider exposing a way for fov_stratified_auroc to skip the inner scaler when the caller has already normalized — or document that double normalization is intentional.
    # Per-plate MAD normalization: cancels per-plate intensity offsets
    # (illumination, exposure, gain) that would otherwise dominate the
    # classifier — especially on CP regionprops, where raw intensity
    # columns make different plates trivially separable (AUROC = 1.0).
    x0_scaled = MADScaler().fit_transform(x0.astype(np.float64))
    x1_scaled = MADScaler().fit_transform(x1.astype(np.float64))
    # Tag FOV ids by condition so the two sides cannot collide.
    fov0_tagged = np.asarray([f"{c0}::{f}" for f in fov0])
    fov1_tagged = np.asarray([f"{c1}::{f}" for f in fov1])
    result = paired_auroc(x0_scaled, x1_scaled, fov0_tagged, fov1_tagged, n_splits=n_splits, rng_seed=rng_seed)

applications/dynacell/tools/evaluate_local.sh:171

  • This printf … | xargs … > /dev/null line is dead code: xargs -I {} echo {} echoes each token to stdout which is then sent to /dev/null, and any parse errors from xargs are not checked (set -e would catch a non-zero exit, but xargs -I will not error on simple newline-separated inputs). The actual command parse happens in the subshell below via the while IFS= read -r loop. Consider removing this line.
  printf '%s\n' "$cmd_text" | xargs -d '\n' -I {} echo {} > /dev/null  # validate parse

packages/viscy-models/src/viscy_models/foundation/_dinov2_vit.py:302

  • x.shape for a 4D image tensor is (B, C, H, W), so this destructuring binds w to the height and h to the width. The downstream interpolate_pos_encoding(x, w, h) is then called with positional args that are consistently swapped throughout this module (e.g. w0 = w // self.patch_size, h0 = h // self.patch_size). For square inputs (CELL-DINO is trained at 224×224) the output is identical, but on any non-square input the positional encoding grid would be transposed. This mirrors the original DINOv2 source which has the same naming bug; consider adding a clarifying comment so future readers don't mistake w/h for actual width/height.
    def prepare_tokens_with_masks(self, x: Tensor, masks: Optional[Tensor] = None) -> Tensor:
        _, _, w, h = x.shape
        x = self.patch_embed(x)
        if masks is not None:
            x = torch.where(masks.unsqueeze(-1), self.mask_token.to(x.dtype).unsqueeze(0), x)
        x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + self.interpolate_pos_encoding(x, w, h)

applications/dynacell/tools/submit_evaluation_job.py:156

  • The plate label inference relies on hardcoded suffix matching against _mock / _denv / _zikv and the literal stem predict__ipsc_confocal. This is duplicated verbatim in submit_evaluation_batch.py (lines 144-154). Any new test plate added later will require both files to be updated in lockstep. Consider extracting the plate-from-leaf-stem logic into dynacell.evaluation.save_paths (alongside the related path naming) so there is a single source of truth.
    # Plate label for save_dir: ipsc | mock | denv | zikv.
    # Predict leaves are named predict__<test_set>[_<cond>].yml. Extract the
    # trailing condition for a549; the bare ipsc_confocal predict leaf is iPSC.
    leaf_stem = leaf_path.stem  # predict__a549_mantis_mock
    if leaf_stem == "predict__ipsc_confocal":
        test_plate = "ipsc"
    elif leaf_stem.endswith("_mock"):
        test_plate = "mock"
    elif leaf_stem.endswith("_denv"):
        test_plate = "denv"
    elif leaf_stem.endswith("_zikv"):
        test_plate = "zikv"
    else:
        raise SystemExit(
            f"cannot infer test plate from leaf filename: {leaf_path.name} "
            f"(expected predict__ipsc_confocal.yml or predict__*_{{mock,denv,zikv}}.yml)"
        )


alxndrkalinin and others added 9 commits May 15, 2026 13:38
`--print-cmd` already prints the resolved command tokens, but without
`--dry-run` it then fell through to `subprocess.run(cmd_tokens)`. That
contradicted the help text ("print the literal `dynacell evaluate`
command to stdout (no exec)") and meant a caller probing the staged
invocation would also actually execute it.

Gate the subprocess.run call so `--print-cmd` alone short-circuits at
exit code 0. `--dry-run` still wins when both are passed; combining
them retains the existing provenance behavior used by evaluate_local.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
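The gating fix described above amounts to not falling through to `subprocess.run` when only `--print-cmd` is set. A minimal sketch (function and flag names are illustrative, not the tool's actual API):

```python
import shlex
import subprocess

def maybe_run(cmd_tokens, print_cmd=False, dry_run=False):
    """--print-cmd alone prints the staged command and short-circuits at
    exit code 0; --dry-run still wins when both flags are passed (the
    provenance-only path). Only with neither flag does the staged
    `dynacell evaluate` invocation actually execute."""
    if print_cmd:
        print(shlex.join(cmd_tokens))
    if dry_run or print_cmd:
        return 0  # no exec: matches the help text's "(no exec)" promise
    return subprocess.run(cmd_tokens).returncode
```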
evaluate_batch.sh built `EVAL_${ORG}_${MODEL^^}_${TRAIN}_ON_${TEST}` and
passed it via `--job-name`, but `${MODEL^^}` keeps the raw code-side
config key (e.g. `FCMAE_VSCYTO3D_SCRATCH`) while
submit_evaluation_batch.py's `_composite_job_name` runs the model
through `paper_key()` first (e.g. `UNEXT2`). Result: the SLURM job name
disagreed with the staged provenance files for the same inputs.

Remove the bash JOB_NAME construction entirely. The python tool's
default already produces the canonical `EVAL_<ORG>_<PAPER>_<TRAIN>_ON_<TEST>`
shape via paper_key translation. A user who needs a custom job name can
still pass `--job-name FOO` through the extra-args tail of the bash
wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`range(0, depth, depth // block_chunks)` produces more than
`block_chunks` BlockChunks when depth doesn't divide cleanly (e.g.
depth=25, block_chunks=4 -> 5 chunks of [6, 6, 6, 6, 1]). The current
ViT-L/16 configuration is fine (depth=24, block_chunks=4), but a future
ViT size with a non-divisible depth would silently produce a state_dict
layout the loader cannot match.

Raise ValueError up front rather than constructing a malformed module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
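The divisibility hazard and its guard can be shown in isolation (helper name is illustrative; the vendored module builds BlockChunks from these start offsets):

```python
def chunk_starts(depth: int, block_chunks: int) -> list[int]:
    """Start offsets for splitting `depth` transformer blocks into
    `block_chunks` chunks. The naive range(0, depth, depth // block_chunks)
    silently yields an extra chunk when depth doesn't divide cleanly
    (depth=25, block_chunks=4 -> starts 0,6,12,18,24 -> chunk sizes
    6,6,6,6,1), so fail fast instead of building a malformed layout."""
    if block_chunks <= 0 or depth % block_chunks != 0:
        raise ValueError(
            f"depth={depth} is not divisible by block_chunks={block_chunks}; "
            "this would produce a state_dict layout the loader cannot match"
        )
    return list(range(0, depth, depth // block_chunks))
```

The current ViT-L/16 config (depth=24, block_chunks=4) passes the guard; a future non-divisible ViT size now fails at construction time.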
`_require_cell_dino` instructed users to `pip install viscy-models`,
but no PyPI distribution by that name provides the in-tree workspace
package. A user hitting the ImportError would not be able to recover
by following the suggestion.

Rewrite the message to point at the actual install paths:
`uv sync --all-packages --all-extras` from the repo root, or
`pip install -e packages/viscy-models` for an editable single-package
install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two Copilot review findings on feature_select.py:

1. The cp_selected_feature_mask.json sidecar emitted by pipeline.py
   hardcoded `{freq_cut: 0.05, unique_cut: 0.01, corr_threshold: 0.9}`,
   matching the in-tree `select_features` defaults but drifting silently
   if either side ever changed. Promote the three thresholds to
   module-level constants (DEFAULT_FREQ_CUT / DEFAULT_UNIQUE_CUT /
   DEFAULT_CORR_THRESHOLD), use them as kwarg defaults across
   variance_threshold / correlation_threshold / select_features, and
   reference the same constants from pipeline.py so the sidecar tracks
   the actual call.
2. Note variance_threshold's per-column np.unique loop in its docstring;
   it's designed for the ~30-col CP regionprops matrix and should not
   be applied to deep-feature matrices (1024+ columns) without
   vectorizing first.

The test_evaluation_pipeline.py stub gains the three new constants so
the pipeline-under-stub still imports cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A prior simplify commit (680337e) replaced
`getattr(force, "gt_celldino", False)` with `force.gt_celldino` in
`_resolve_force`, but the test fixtures in test_pipeline_cache.py were
not updated to include the new key. Result: every test that goes
through `init_cache_context` or `_resolve_force` directly raised
`omegaconf.errors.ConfigAttributeError: Missing key gt_celldino`.

Add the key to the three force_recompute dict literals (the shared
fixture builder plus the two inline `_resolve_force_*` test bodies).
This realigns the test config shape with eval.yaml, which always emits
the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t drift

save_paths.eval_save_dir is deliberately duplicated with
paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for in
a separate git repo. Any silent divergence between the two definitions
breaks downstream paper aggregation scripts (they can no longer find
metrics in the expected location), but the failure surfaces only when a
paper job runs against a fresh eval.

Add a parametrized pin test covering one representative case from each
branch of eval_save_dir:
- iPSC test x {ipsc, a549, joint} train
- A549 test x {ipsc, a549, joint} train

The pins encode the actual expected paths (relative to data_root) and
fail on any drift from save_paths.py, forcing a future change here to
also touch the paper-side mirror.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r contract

Three docstring-only clarifications from Copilot review:

1. `pyproject.toml`: note the torch-fidelity git-SHA pin's upgrade path
   (drop the `@5e211a9` once upstream cuts a `>=0.5.0` release with
   MIND), so the pin doesn't become permanent technical debt.
2. `feature_metrics.compute_feature_similarity`: spell out that
   Precision / Recall / F1 are bootstrap *means* over resamples drawn
   with replacement, so the absolute numbers are systematically lower
   than single-shot PRC and not directly comparable to non-bootstrap
   tables. Within-scheme comparisons (across models / plates /
   conditions evaluated with the same setting) are still valid.
3. `metrics.features_from_crops`: document the implicit return-type
   contract for `extract_features` / `extract_features_batch`. Both
   paths call `.detach().cpu()` and assume a torch.Tensor; the batched
   path additionally assumes the leading dim equals `len(crops)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ided

The module docstring and `weights_path` Parameter docstring previously
called the Biohub HPC checkpoint path the model's "default", but
`weights_path` is a required init arg with no default. Rephrase both
to make clear the caller passes the path explicitly and the HPC
location is one example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alxndrkalinin alxndrkalinin merged commit ef3dcd7 into dynacell-models May 15, 2026
@alxndrkalinin alxndrkalinin deleted the feature-metrics-torch-fidelity branch May 15, 2026 22:25
