feat(dynacell): migrate eval metrics to torch-fidelity, add probes #423
Conversation
torch-fidelity pinned to a commit on toshas/torch-fidelity master that includes MIND on top of v0.4.0 — MIND is not in any tagged release yet and the upcoming eval pipeline depends on it. scikit-learn is already a transitive dep but pinning it explicitly under the eval extra protects the new linear-probe code path against an upstream prune. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Produces the same on-disk save_dir layout the paper aggregation scripts in /hpc/mydata/alex.kalinin/dynacell expect (paper key + train-set infix + organelle long-form + infection suffix). The library copy here and the helper in paper/scripts/compute_all_organelle_precision_recall.py must stay in sync; both files carry a comment naming the other, since VisCy and the paper repo are independent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wraps torch-fidelity's *_features_to_metric helpers so the eval pipeline can keep feeding pre-extracted per-cell embeddings directly (the library's top-level calculate_metrics() can't accept precomputed features). Closes three gaps left by the prior in-tree implementations: KID gets a subset-resampled mean+std, P/R/F1 gain a non-parametric bootstrap, and MIND (sliced 2-Wasserstein) joins as a kernel-free sensitivity-to-distribution-shape diagnostic motivated by the paper's Appendix I mode-collapse analysis. Small-cohort KID auto-shrinks the subset size and NaNs out below 16 cells; the PRC bootstrap resamples both sides with replacement before each manifold rebuild; tensor conversion happens once outside the loop. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
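The subset-resampled KID mean±std is the part most easily misread from prose alone; below is a minimal numpy sketch of the idea (a plain polynomial-kernel MMD² over random subsets — illustrative names, not the torch-fidelity call path the module actually wraps):

```python
import numpy as np

def kid_subset_mean_std(x, y, n_subsets=100, subset_size=1000, degree=3, seed=0):
    """Subset-resampled KID: mean and std of the unbiased polynomial-kernel MMD^2.

    x, y: (N, D) / (M, D) feature matrices. Illustrative only -- the in-tree
    module delegates to torch-fidelity rather than computing the kernel here.
    """
    rng = np.random.default_rng(seed)
    # Small cohorts: shrink the subset rather than failing outright.
    m = min(subset_size, len(x), len(y))
    if m < 16:  # mirrors the "NaN out below 16 cells" guard described above
        return float("nan"), float("nan")
    gamma = 1.0 / x.shape[1]
    vals = []
    for _ in range(n_subsets):
        xs = x[rng.choice(len(x), m, replace=False)]
        ys = y[rng.choice(len(y), m, replace=False)]
        k_xx = (gamma * xs @ xs.T + 1.0) ** degree
        k_yy = (gamma * ys @ ys.T + 1.0) ** degree
        k_xy = (gamma * xs @ ys.T + 1.0) ** degree
        # Unbiased MMD^2: drop the diagonal of the within-set kernel matrices.
        mmd2 = (
            (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
            + (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
            - 2.0 * k_xy.mean()
        )
        vals.append(mmd2)
    return float(np.mean(vals)), float(np.std(vals))
```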
Vendors pycytominer's variance + correlation feature-selection logic (BSD-3 attribution in the module header) so the CP regionprops track in the eval pipeline can drop near-constant or redundantly-correlated columns before they reach FID/KID/PRC. Operates on raw numpy without a pandas dep; statistics are computed on the pooled (gt, pred) matrix so the same keep-mask is applied to both sides. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
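A simplified sketch of the pooled keep-mask idea (this is a stand-in, not the vendored pycytominer logic, which differs in detail — e.g. the frequency-ratio test is omitted here):

```python
import numpy as np

def select_feature_mask(gt, pred, unique_cut=0.01, corr_threshold=0.9):
    """Boolean keep-mask computed on the pooled (gt, pred) matrix.

    Simplified stand-in for the vendored pycytominer-style logic: drop
    near-constant columns, then greedily drop one column from each highly
    correlated pair. Applying a single mask keeps gt and pred aligned.
    """
    pooled = np.vstack([gt, pred]).astype(np.float64)
    n, d = pooled.shape
    keep = np.ones(d, dtype=bool)
    # Near-constant columns: too few distinct values relative to sample count.
    for j in range(d):
        if len(np.unique(pooled[:, j])) / n < unique_cut:
            keep[j] = False
    # Redundantly correlated columns: drop the later member of each pair.
    idx = np.flatnonzero(keep)
    corr = np.corrcoef(pooled[:, idx], rowvar=False)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if keep[idx[a]] and keep[idx[b]] and abs(corr[a, b]) > corr_threshold:
                keep[idx[b]] = False
    return keep

# Usage: the same mask is applied to both sides.
# mask = select_feature_mask(gt_feats, pred_feats)
# gt_sel, pred_sel = gt_feats[:, mask], pred_feats[:, mask]
```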
Probe-based diagnostics that FID/KID/MIND/PRC do not provide: how linearly distinguishable are real vs predicted embeddings, and how well do features preserve biological treatment information across plates. GroupKFold by FOV holds out whole fields of view so per-FOV illumination, segmentation artifacts, and biological clustering can't leak across the split. MADScaler (median + median-absolute-deviation, not the IQR-based sklearn RobustScaler) lives inside the CV Pipeline so normalization statistics are computed only on the training fold. Logistic regression defaults to class_weight='balanced' to handle unequal mock/infected cohort sizes. `paired_auroc` is the shared helper for the two probe call sites (real-vs-predicted in pipeline.py and between-condition in the post-hoc cross_condition_probe CLI). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
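A compact sketch of that construction — MAD scaler inside the CV pipeline, FOV-grouped splits, balanced logistic regression — assuming numpy-array inputs; class and function names are illustrative rather than the in-tree API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline

class MADScaler(BaseEstimator, TransformerMixin):
    """Center by median, scale by median absolute deviation (per feature)."""

    def fit(self, X, y=None):
        self.median_ = np.median(X, axis=0)
        mad = np.median(np.abs(X - self.median_), axis=0)
        self.scale_ = np.where(mad == 0, 1.0, mad)  # guard constant columns
        return self

    def transform(self, X):
        return (X - self.median_) / self.scale_

def fov_stratified_auroc(X, y, fov_ids, n_splits=5):
    """Mean held-out AUROC with whole FOVs held out per fold."""
    pipe = Pipeline([
        ("scale", MADScaler()),  # fit on the training fold only
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    aurocs = []
    for train, test in GroupKFold(n_splits=n_splits).split(X, y, groups=fov_ids):
        pipe.fit(X[train], y[train])
        aurocs.append(roc_auc_score(y[test], pipe.predict_proba(X[test])[:, 1]))
    return float(np.mean(aurocs))
```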
Replaces the in-tree _frechet_distance, _polynomial_mmd, and _pairwise_feature_metrics in utils.py, plus cp_pairwise / deep_pairwise / cp_dataset_fid_kid / deep_dataset_fid_kid / _nan_pairwise in metrics.py, with calls into the new feature_metrics module. Dataset-level CP track also runs feature_select pruning before similarity, plus the real-vs-predicted linear probe per feature type. The three feature-type similarity+probe calls run via ThreadPoolExecutor(max_workers=3) since torch-fidelity, sklearn LBFGS, and numpy BLAS all release the GIL on their inner loops. CP-specific per-(FOV, timepoint) preprocessing (target-zero-column drop and per-side z-score) moves to a small helper in pipeline.py since the deep tracks need neither step. Tests in test_evaluation_metrics.py drop the seven cp_pairwise / deep_pairwise / pinned-value cases (covered by feature_metrics_test.py); test_evaluation_pipeline.py grows stubs for the new feature_metrics, feature_select, and linear_probe modules so the cache-reuse path can still import the pipeline without pulling in torch-fidelity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-hoc diagnostic that runs FOV-stratified probes between two infection conditions (mock vs denv, mock vs zikv) on the per-cell embeddings saved by the eval pipeline. Per-plate MAD scaling cancels the illumination/exposure/gain offsets that would otherwise let any classifier separate the plates trivially (CP at AUROC=1.0 before this guard); CP-only feature_select then drops near-constant columns before the linear probe sees them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the predict_{local,batch}.sh / submit_benchmark_{job,batch}.py
pair so an entire (organelle, model, train_set, test_set) eval set
launches with one command. Unlike Lightning's predict CLI, dynacell
evaluate is Hydra-decorated and takes override key=value tokens
directly — the staging tools read the predict leaf, extract
output_store / dataset_ref / benchmark.*, build the override list,
and emit `uv run dynacell evaluate …` invocations.
evaluate_local.sh supports --parallel N and --cross-condition-probe;
the latter triggers an automatic post-step that runs
`python -m dynacell.evaluation.cross_condition_probe` over the three
A549 per-plate save_dirs once all per-plate evals complete.
evaluate_batch.sh wraps a single sbatch job that runs each leaf's
eval in series via srun.
Save-dir naming routes through dynacell.evaluation.save_paths so the
paths stay bit-identical to the paper aggregation scripts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
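A schematic of the staging step described above — read the predict leaf, build override tokens, emit the `uv run dynacell evaluate` invocation. The config field access beyond the output_store / dataset_ref / benchmark.* keys quoted above and the helper itself are assumptions, not the actual tool:

```python
# Hypothetical sketch of the staging step; key names beyond those quoted in
# the commit message are illustrative.
from omegaconf import OmegaConf

def build_evaluate_command(leaf_path: str) -> list[str]:
    leaf = OmegaConf.load(leaf_path)
    overrides = [
        f"output_store={leaf.output_store}",
        f"dataset_ref={leaf.dataset_ref}",
    ]
    # benchmark.* keys are forwarded verbatim as key=value overrides.
    for key, value in leaf.benchmark.items():
        overrides.append(f"benchmark.{key}={value}")
    return ["uv", "run", "dynacell", "evaluate", *overrides]

# print(" ".join(build_evaluate_command("predict__a549_mantis_mock.yml")))
```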
The dataset-level metrics + linear-probe block runs three feature types concurrently via ThreadPoolExecutor. Without a thread-pool guard, each worker's sklearn LBFGS + numpy ops can grab cpu_count() BLAS threads on top of the three outer threads, oversubscribing cores on dense nodes. threadpoolctl.threadpool_limits(1) wraps the executor block so each outer thread gets exactly one BLAS thread. Rename `_cp_pairwise_preprocess` to `_cp_dropzero_zscore` so the name no longer collides visually with the deleted `cp_pairwise` helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
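A minimal sketch of the oversubscription guard (assuming threadpoolctl is available; the worker function stands in for the real similarity + probe block):

```python
from concurrent.futures import ThreadPoolExecutor
from threadpoolctl import threadpool_limits

def run_feature_type(prefix):
    # Stand-in for one feature type's similarity + linear-probe block.
    return {"feature_type": prefix}

feature_types = ["cp", "dinov3", "dynaclr"]
# Cap BLAS/OpenMP pools at one thread each so three concurrent outer workers
# use ~three cores total instead of 3 x cpu_count().
with threadpool_limits(limits=1):
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = dict(zip(feature_types, pool.map(run_feature_type, feature_types)))
```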
ORGANELLE_EVAL_TARGET and ORGANELLE_A549_PLATE were already public because the eval-runner tools import them; promote PAPER_KEY and ORGANELLE_PAPER from underscore-prefixed to public for the same reason. The four lookup tables are sibling concepts and the leading-underscore split implied a private/public boundary that doesn't exist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_resolve_one_leaf already loads + composes every leaf and reads benchmark.trained_on as part of the call; return that string in the tuple instead of re-running load_composed_config on the head leaf just to fetch the same field. Saves one ~1-2 s compose per submit invocation. Also adds a cross-leaf trained_on consistency check mirroring the existing organelle / model_name checks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CELL-DINO is a DINOv2-architecture ViT pretrained on fluorescence microscopy (Human Protein Atlas). The channel_adaptive_dino_vitl16 checkpoint processes one channel at a time through a single-channel ViT-L/16 stem; the wrapper reshapes (B, C, H, W) -> (B*C, 1, H, W), runs the backbone, and mean-pools the cls token across channels for a fixed-dim embedding regardless of input channel count. Weights load from a local .pth state_dict — nothing fetched from the network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
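The channel-adaptive pooling is easiest to see in tensor shapes; a minimal PyTorch sketch of the wrapper's forward pass (class and attribute names are illustrative, and the backbone is assumed to return one cls embedding per sample):

```python
import torch
from torch import nn

class ChannelAdaptiveDino(nn.Module):
    """Run a single-channel ViT once per channel, then mean-pool cls tokens."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 1024):
        super().__init__()
        self.backbone = backbone  # single-channel ViT-L/16 returning cls embeddings
        self.embed_dim = embed_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = x.reshape(b * c, 1, h, w)       # each channel becomes its own sample
        cls = self.backbone(x)              # (B*C, embed_dim)
        cls = cls.reshape(b, c, self.embed_dim)
        return cls.mean(dim=1)              # channel mean -> fixed dim for any C
```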
Adds the CellDinoFeatureExtractor adapter (wraps viscy_models.foundation.CellDinoModel; same extract_features(image_2d) contract as DinoV3FeatureExtractor / DynaCLRFeatureExtractor), threads celldino through the GT cache (FeatureKind literal, _CacheContext field, manifest validation, fov_gt_deep_features dispatch), accumulates pred/gt embeddings per (FOV, t), dispatches per-timepoint pairwise + dataset-level FID/KID/PRC/MIND/probe via the existing parallel pool (max_workers bumped to 4), saves the per-cell NPZs, and surfaces the backbone in cross_condition_probe. CELL-DINO is opt-in: it only runs when `feature_extractor.celldino.weights_path` is set in the eval config. Existing eval dirs leave the field null and see zero behavior change — the cross-condition probe emits NaN rows with a "missing embeddings file" reason for any pair where one side did not run the backbone. Why: CELL-DINO is a DINOv2-architecture ViT pretrained on fluorescence microscopy (HPA), so its embeddings carry cell-biology priors that neither DINOv3 (LVD-1689M natural images) nor DynaCLR (in-house) provides. Useful as a third reference for the cross-condition probe and as a fourth dataset-level FID/KID/PRC/MIND track. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- `cross_condition_probe._FEATURE_TYPES` derives from `cache.FeatureKind` via `typing.get_args` instead of hand-maintaining a parallel tuple, so the 5th backbone touches one line instead of two (sketched below).
- `pipeline_cache._resolve_force` reads `force.gt_celldino` directly (matching the peers `gt_cp`/`gt_dinov3`/`gt_dynaclr`); the defensive `getattr(force, "gt_celldino", False)` was dead since eval.yaml now always defines the field.
- `pipeline_cache.fov_gt_deep_features` docstring updated: it accepts `celldino` in addition to `dinov3`/`dynaclr`.
- `eval.yaml` celldino block comment shrunk from 3 lines to 1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
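Roughly what the first bullet's single-source-of-truth pattern looks like (literal values match the backbones named in this PR; the real definitions live in cache.py and cross_condition_probe.py):

```python
from typing import Literal, get_args

# cache.py
FeatureKind = Literal["cp", "dinov3", "dynaclr", "celldino"]

# cross_condition_probe.py -- derived, so a 5th backbone only touches the Literal
_FEATURE_TYPES: tuple[str, ...] = get_args(FeatureKind)
```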
Factors the per-cell crop construction out of _extract_per_cell_features into a public build_pred_crops / build_target_crops / features_from_crops trio so the 2-D max-z projection, cell iteration, and crop construction run once per (FOV, timepoint) instead of once per backbone. The prediction-side loop in pipeline.py now builds crops one time and dispatches to DINOv3 / DynaCLR / CellDINO over the same list — eliminates 3× redundant pure-CPU work per timepoint. Each deep extractor (DinoV3, DynaCLR, CellDino) gains an extract_features_batch(images, batch_size=...) method that stacks all cells of one (FOV, t) into a single forward, chunked to bound VRAM (64 for DynaCLR, 32 for the ViT backbones). features_from_crops uses the batch path when available and falls back to per-cell when an extractor only exposes extract_features — keeps the test-stub _IdentityExtractor working and gives the same backward-compat to any future extractor that does not need batching. Net effect on the deep-feature stage: 1 max-z + 1 cell iteration per timepoint instead of 3-4; one forward per backbone per (FOV, t) instead of N forwards per cell. Wallclock impact depends on cell count per FOV but is the highest-leverage item the simplify reviewer flagged on the CELL-DINO PR — should yield 2-5× speedup on plates with hundreds of cells per timepoint. deep_target_features / deep_pred_features stay as backward-compat wrappers so the GT cache compute_fn path and the existing test suite keep working unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
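A condensed sketch of the batch-with-fallback dispatch (function and method names follow the commit message; the body is illustrative, not the in-tree implementation):

```python
import torch

def features_from_crops(extractor, crops, batch_size=32):
    """Extract one feature row per crop for a single (FOV, t)."""
    if hasattr(extractor, "extract_features_batch"):
        # Batched path: one (chunked) forward; the extractor bounds VRAM internally.
        feats = extractor.extract_features_batch(crops, batch_size=batch_size)
        return feats.detach().cpu()
    # Backward-compat path for extractors that only expose extract_features,
    # e.g. the test-stub identity extractor.
    return torch.stack(
        [extractor.extract_features(crop).detach().cpu() for crop in crops]
    )
```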
Pull request overview
This PR migrates dynacell's evaluation pipeline to torch-fidelity for FID/KID/PRC/MIND metrics, adds a CELL-DINO foundation model wrapper, vendors pycytominer-style CP feature selection, and introduces FOV-stratified linear probes plus a cross-condition probe CLI. It also ships high-level eval-runner shell scripts, Hydra-override staging tools, and a new save_paths.py that mirrors paper-aggregation conventions.
Changes:
- New feature-similarity stack: `feature_metrics.py` (torch-fidelity FID/KID/PRC/MIND + bootstrap stds), `feature_select.py` (variance + correlation pruning), `linear_probe.py` (MAD-scaled FOV-stratified AUROC), and the `cross_condition_probe.py` CLI; pipeline.py rewired to compose deep-feature crops once and threadpool the per-prefix metrics.
- New CELL-DINO foundation backbone (`viscy_models.foundation.CellDinoModel` + vendored DINOv2 ViT), wired through `utils.CellDinoFeatureExtractor`, the eval cache, and the eval Hydra config.
- Eval-runner ergonomics: `submit_evaluation_{job,batch}.py`, `evaluate_local.sh`, `evaluate_batch.sh`, plus `save_paths.py` for canonical save-dir naming bit-identical to the paper repo.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| applications/dynacell/pyproject.toml | Adds scikit-learn, pinned torch-fidelity git dep |
| uv.lock | Lock entries for new eval deps |
| applications/dynacell/src/dynacell/evaluation/feature_metrics.py | Torch-fidelity backed FID/KID/PRC/MIND + cosine, w/ pairwise variant |
| applications/dynacell/src/dynacell/evaluation/feature_metrics_test.py | End-to-end metric tests |
| applications/dynacell/src/dynacell/evaluation/feature_select.py | Vendored variance + correlation pruning |
| applications/dynacell/src/dynacell/evaluation/feature_select_test.py | Pruning tests |
| applications/dynacell/src/dynacell/evaluation/linear_probe.py | MADScaler + FOV-stratified AUROC + paired_auroc + indistinguishability |
| applications/dynacell/src/dynacell/evaluation/linear_probe_test.py | Linear probe tests |
| applications/dynacell/src/dynacell/evaluation/cross_condition_probe.py | Cross-condition diagnostic CLI on saved embeddings |
| applications/dynacell/src/dynacell/evaluation/metrics.py | Splits per-cell crop construction from extractor invocation; drops legacy cp_pairwise/deep_pairwise/dataset helpers |
| applications/dynacell/src/dynacell/evaluation/utils.py | Removes scipy-based FID/KID; adds *_batch extractor methods + CellDinoFeatureExtractor |
| applications/dynacell/src/dynacell/evaluation/pipeline.py | Re-wired to use new feature stack, threadpool, CELL-DINO branch, mask sidecar |
| applications/dynacell/src/dynacell/evaluation/pipeline_cache.py | CELL-DINO cache plumbing + sha hash hoist |
| applications/dynacell/src/dynacell/evaluation/cache.py | Adds celldino FeatureKind + group helpers |
| applications/dynacell/src/dynacell/evaluation/_configs/eval.yaml | Adds feature_extractor.celldino block |
| applications/dynacell/src/dynacell/evaluation/save_paths.py | Canonical eval save-dir naming (paper-aligned, intentionally duplicated) |
| applications/dynacell/tools/submit_evaluation_job.py | Stage one dynacell evaluate invocation from a predict leaf |
| applications/dynacell/tools/submit_evaluation_batch.py | Chain N per-leaf invocations into one sbatch |
| applications/dynacell/tools/evaluate_local.sh | Local-host wrapper around submit_evaluation_job.py |
| applications/dynacell/tools/evaluate_batch.sh | sbatch wrapper around submit_evaluation_batch.py |
| applications/dynacell/tests/test_evaluation_pipeline.py | Stub modules for new evaluation submodules |
| applications/dynacell/tests/test_evaluation_metrics.py | Drops legacy pairwise tests, adds batched-extractor tests |
| packages/viscy-models/src/viscy_models/foundation/cell_dino.py | CELL-DINO wrapper (preprocess + channel-mean cls pooling) |
| packages/viscy-models/src/viscy_models/foundation/_dinov2_vit.py | Vendored eval-only DINOv2 ViT |
| packages/viscy-models/src/viscy_models/foundation/__init__.py | Re-export CellDinoModel |
Comments suppressed due to low confidence (5)
packages/viscy-models/src/viscy_models/foundation/cell_dino.py:83
`nn.Module.load_state_dict(state_dict, strict=True)` raises `RuntimeError` itself when there are any missing or unexpected keys, so the subsequent `if missing or unexpected:` block is unreachable in normal operation. Either pass `strict=False` so the manual check actually runs (and provides the more descriptive error message) or drop the check entirely.
missing, unexpected = self.model.load_state_dict(state_dict, strict=True)
if missing or unexpected:
raise RuntimeError(f"CELL-DINO state_dict mismatch — missing={missing}, unexpected={unexpected}")
applications/dynacell/src/dynacell/evaluation/cross_condition_probe.py:124
- MAD normalization is applied twice on the same feature matrix: first per-plate here via `MADScaler().fit_transform(...)`, then again inside `paired_auroc`/`fov_stratified_auroc`, whose `Pipeline` includes `MADScaler()` as the first step (linear_probe.py lines 131-143). The second pass operates on already MAD-normalized data, which is unlikely to harm correctness but does undo part of the per-plate centering and produces non-obvious feature scales for the logistic regression. If the per-plate MAD is the intended normalization (per the PR description's "per-plate MAD scaling to cancel illumination/exposure offsets"), consider exposing a way for `fov_stratified_auroc` to skip the inner scaler when the caller has already normalized — or document that the double normalization is intentional.
# Per-plate MAD normalization: cancels per-plate intensity offsets
# (illumination, exposure, gain) that would otherwise dominate the
# classifier — especially on CP regionprops, where raw intensity
# columns make different plates trivially separable (AUROC = 1.0).
x0_scaled = MADScaler().fit_transform(x0.astype(np.float64))
x1_scaled = MADScaler().fit_transform(x1.astype(np.float64))
# Tag FOV ids by condition so the two sides cannot collide.
fov0_tagged = np.asarray([f"{c0}::{f}" for f in fov0])
fov1_tagged = np.asarray([f"{c1}::{f}" for f in fov1])
result = paired_auroc(x0_scaled, x1_scaled, fov0_tagged, fov1_tagged, n_splits=n_splits, rng_seed=rng_seed)
applications/dynacell/tools/evaluate_local.sh:171
- This `printf … | xargs … > /dev/null` line is dead code: `xargs -I {} echo {}` echoes each token to stdout, which is then sent to `/dev/null`, and any parse errors from `xargs` are not checked (`set -e` would catch a non-zero exit, but `xargs -I` will not error on simple newline-separated input). The actual command parse happens in the subshell below via the `while IFS= read -r` loop. Consider removing this line.
printf '%s\n' "$cmd_text" | xargs -d '\n' -I {} echo {} > /dev/null # validate parse
packages/viscy-models/src/viscy_models/foundation/_dinov2_vit.py:302
`x.shape` for a 4D image tensor is `(B, C, H, W)`, so this destructuring binds `w` to the height and `h` to the width. The downstream `interpolate_pos_encoding(x, w, h)` is then called with positional args that are consistently swapped throughout this module (e.g. `w0 = w // self.patch_size`, `h0 = h // self.patch_size`). For square inputs (CELL-DINO is trained at 224×224) the output is identical, but on any non-square input the positional-encoding grid would be transposed. This mirrors the original DINOv2 source, which has the same naming bug; consider adding a clarifying comment so future readers don't mistake `w`/`h` for actual width/height.
def prepare_tokens_with_masks(self, x: Tensor, masks: Optional[Tensor] = None) -> Tensor:
_, _, w, h = x.shape
x = self.patch_embed(x)
if masks is not None:
x = torch.where(masks.unsqueeze(-1), self.mask_token.to(x.dtype).unsqueeze(0), x)
x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
x = x + self.interpolate_pos_encoding(x, w, h)
applications/dynacell/tools/submit_evaluation_job.py:156
- The plate label inference relies on hardcoded suffix matching against `_mock`/`_denv`/`_zikv` and the literal stem `predict__ipsc_confocal`. This is duplicated verbatim in `submit_evaluation_batch.py` (lines 144-154). Any new test plate added later will require both files to be updated in lockstep. Consider extracting the plate-from-leaf-stem logic into `dynacell.evaluation.save_paths` (alongside the related path naming) so there is a single source of truth.
# Plate label for save_dir: ipsc | mock | denv | zikv.
# Predict leaves are named predict__<test_set>[_<cond>].yml. Extract the
# trailing condition for a549; the bare ipsc_confocal predict leaf is iPSC.
leaf_stem = leaf_path.stem # predict__a549_mantis_mock
if leaf_stem == "predict__ipsc_confocal":
test_plate = "ipsc"
elif leaf_stem.endswith("_mock"):
test_plate = "mock"
elif leaf_stem.endswith("_denv"):
test_plate = "denv"
elif leaf_stem.endswith("_zikv"):
test_plate = "zikv"
else:
raise SystemExit(
f"cannot infer test plate from leaf filename: {leaf_path.name} "
f"(expected predict__ipsc_confocal.yml or predict__*_{{mock,denv,zikv}}.yml)"
)
`--print-cmd` already prints the resolved command tokens, but without
`--dry-run` it then fell through to `subprocess.run(cmd_tokens)`. That
contradicted the help text ("print the literal `dynacell evaluate`
command to stdout (no exec)") and meant a caller probing the staged
invocation would also actually execute it.
Gate the subprocess.run call so `--print-cmd` alone short-circuits at
exit code 0. `--dry-run` still wins when both are passed; combining
them retains the existing provenance behavior used by evaluate_local.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
evaluate_batch.sh built `EVAL_${ORG}_${MODEL^^}_${TRAIN}_ON_${TEST}` and
passed it via `--job-name`, but `${MODEL^^}` keeps the raw code-side
config key (e.g. `FCMAE_VSCYTO3D_SCRATCH`) while
submit_evaluation_batch.py's `_composite_job_name` runs the model
through `paper_key()` first (e.g. `UNEXT2`). Result: the SLURM job name
disagreed with the staged provenance files for the same inputs.
Remove the bash JOB_NAME construction entirely. The python tool's
default already produces the canonical `EVAL_<ORG>_<PAPER>_<TRAIN>_ON_<TEST>`
shape via paper_key translation. A user who needs a custom job name can
still pass `--job-name FOO` through the extra-args tail of the bash
wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`range(0, depth, depth // block_chunks)` produces more than `block_chunks` BlockChunks when depth doesn't divide cleanly (e.g. depth=25, block_chunks=4 -> 5 chunks of [6, 6, 6, 6, 1]). The current ViT-L/16 configuration is fine (depth=24, block_chunks=4), but a future ViT size with a non-divisible depth would silently produce a state_dict layout the loader cannot match. Raise ValueError up front rather than constructing a malformed module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
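The guard amounts to a divisibility precondition before building the chunks; a minimal sketch (helper name illustrative):

```python
def chunk_blocks(depth: int, block_chunks: int) -> list[range]:
    """Split `depth` transformer blocks into `block_chunks` equal groups."""
    if block_chunks > 0 and depth % block_chunks != 0:
        raise ValueError(
            f"depth={depth} is not divisible by block_chunks={block_chunks}; "
            "unequal chunks would silently change the state_dict layout"
        )
    size = depth // block_chunks
    return [range(i, i + size) for i in range(0, depth, size)]
```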
`_require_cell_dino` instructed users to `pip install viscy-models`, but no PyPI distribution by that name provides the in-tree workspace package. A user hitting the ImportError would not be able to recover by following the suggestion. Rewrite the message to point at the actual install paths: `uv sync --all-packages --all-extras` from the repo root, or `pip install -e packages/viscy-models` for an editable single-package install. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two Copilot review findings on feature_select.py:
1. The cp_selected_feature_mask.json sidecar emitted by pipeline.py
hardcoded `{freq_cut: 0.05, unique_cut: 0.01, corr_threshold: 0.9}`,
matching the in-tree `select_features` defaults but drifting silently
if either side ever changed. Promote the three thresholds to
module-level constants (DEFAULT_FREQ_CUT / DEFAULT_UNIQUE_CUT /
DEFAULT_CORR_THRESHOLD), use them as kwarg defaults across
variance_threshold / correlation_threshold / select_features, and
reference the same constants from pipeline.py so the sidecar tracks
the actual call.
2. Note variance_threshold's per-column np.unique loop in its docstring;
it's designed for the ~30-col CP regionprops matrix and should not
be applied to deep-feature matrices (1024+ columns) without
vectorizing first.
The test_evaluation_pipeline.py stub gains the three new constants so
the pipeline-under-stub still imports cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
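In shape, the shared-constant change looks roughly like this (threshold values are the ones quoted above; the sidecar key names are illustrative):

```python
# feature_select.py -- module-level constants double as the kwarg defaults
DEFAULT_FREQ_CUT = 0.05
DEFAULT_UNIQUE_CUT = 0.01
DEFAULT_CORR_THRESHOLD = 0.9

def select_features(x, freq_cut=DEFAULT_FREQ_CUT, unique_cut=DEFAULT_UNIQUE_CUT,
                    corr_threshold=DEFAULT_CORR_THRESHOLD):
    ...

# pipeline.py -- the sidecar references the same constants instead of literals,
# so it tracks whatever select_features actually ran with.
sidecar = {
    "freq_cut": DEFAULT_FREQ_CUT,
    "unique_cut": DEFAULT_UNIQUE_CUT,
    "corr_threshold": DEFAULT_CORR_THRESHOLD,
}
```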
A prior simplify commit (680337e) replaced `getattr(force, "gt_celldino", False)` with `force.gt_celldino` in `_resolve_force`, but the test fixtures in test_pipeline_cache.py were not updated to include the new key. Result: every test that goes through `init_cache_context` or `_resolve_force` directly raised `omegaconf.errors.ConfigAttributeError: Missing key gt_celldino`. Add the key to the three force_recompute dict literals (the shared fixture builder plus the two inline `_resolve_force_*` test bodies). This realigns the test config shape with eval.yaml, which always emits the field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t drift
save_paths.eval_save_dir is deliberately duplicated with
paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for in
a separate git repo. Any silent divergence between the two definitions
breaks downstream paper aggregation scripts (they can no longer find
metrics in the expected location), but the failure surfaces only when a
paper job runs against a fresh eval.
Add a parametrized pin test covering one representative case from each
branch of eval_save_dir:
- iPSC test x {ipsc, a549, joint} train
- A549 test x {ipsc, a549, joint} train
The pins encode the actual expected paths (relative to data_root) and
fail on any drift from save_paths.py, forcing a future change here to
also touch the paper-side mirror.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
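A sketch of what such a parametrized pin test can look like; the `eval_save_dir` signature and the expected strings below are placeholders, not the actual pinned layout:

```python
import pytest

# Placeholder signature -- the real helper lives in dynacell.evaluation.save_paths.
from dynacell.evaluation.save_paths import eval_save_dir

@pytest.mark.parametrize(
    ("test_plate", "train_set", "expected"),
    [
        # One representative case per eval_save_dir branch; expected values
        # are placeholders standing in for the real pinned paths.
        ("ipsc", "ipsc", "eval_<model>_ipsctrained_<organelle>_ipsc"),
        ("mock", "joint", "eval_<model>_jointtrained_<organelle>_mock"),
    ],
)
def test_eval_save_dir_pins(tmp_path, test_plate, train_set, expected):
    out = eval_save_dir(
        tmp_path, model="<model>", organelle="<organelle>",
        train_set=train_set, test_plate=test_plate,
    )
    assert str(out.relative_to(tmp_path)) == expected
```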
…r contract
Three docstring-only clarifications from Copilot review:
1. `pyproject.toml`: note the torch-fidelity git-SHA pin's upgrade path (drop the `@5e211a9` once upstream cuts a `>=0.5.0` release with MIND), so the pin doesn't become permanent technical debt.
2. `feature_metrics.compute_feature_similarity`: spell out that Precision / Recall / F1 are bootstrap *means* over resamples drawn with replacement, so the absolute numbers are systematically lower than single-shot PRC and not directly comparable to non-bootstrap tables. Within-scheme comparisons (across models / plates / conditions evaluated with the same setting) are still valid.
3. `metrics.features_from_crops`: document the implicit return-type contract for `extract_features` / `extract_features_batch`. Both paths call `.detach().cpu()` and assume a torch.Tensor; the batched path additionally assumes the leading dim equals `len(crops)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ided
The module docstring and `weights_path` parameter docstring previously called the Biohub HPC checkpoint path the model's "default", but `weights_path` is a required init arg with no default. Rephrase both to make clear that the caller passes the path explicitly and the HPC location is one example. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `dynacell.evaluation.cross_condition_probe` CLI with per-plate MAD scaling to cancel illumination/exposure offsets.
- Eval-runner shell scripts (`evaluate_local.sh`, `evaluate_batch.sh`) plus `submit_evaluation_{job,batch}.py` Hydra-override staging tools; save-dir naming routes through a new `save_paths.py` that stays bit-identical to the paper aggregation scripts.

Stacks on `dynacell-models` (PR #404).

Test plan
- `uv run pytest applications/dynacell/src/dynacell/evaluation/ applications/dynacell/tests/test_evaluation_metrics.py applications/dynacell/tests/test_evaluation_pipeline.py --deselect applications/dynacell/tests/test_evaluation_metrics.py::test_gain_and_offset_errors_are_not_scale_invariant` → 41/41 pass (10 feature_metrics, 8 feature_select, 9 linear_probe, 13 evaluation_metrics, 1 evaluation_pipeline). The one deselected test is a pre-existing failure on `dynacell-models` (commit `bfc61891` made `nrmse` scale-invariant).
- `uvx ruff check` clean on all touched files.
- `submit_evaluation_job.py --dry-run` on `membrane/fnet3d_paper/a549_mantis/predict__a549_mantis_mock.yml` produces the canonical `eval_fnet3d_a549trained_membrane_mock` save_dir with correct `.cmd` + `.yml` provenance.
- `eval_save_dir` matches `paper/scripts/compute_all_organelle_precision_recall.py:eval_dir_for` on 5 spot-checked (organelle, model, train_set, plate) combos including iPSC-trained, A549-trained, and joint flavors.
- `cross_condition_probe` on three celldiff joint membrane A549 plates: CP AUROCs in [0.46, 0.71] (no longer 1.0 after per-plate MAD + select_features), DINOv3/DynaCLR at ceiling, consistent with cross-condition morphology being learnable in deep embedding space.
- `uv run dynacell evaluate ...` on a real predict zarr (requires DINOv3 + DynaCLR weights cached + segmentation; left for the reviewer to drive once the stacked PR lands).

Notes
- `torch-fidelity` is pinned via git URL to commit `5e211a9` because MIND is not in any tagged release yet.
- `save_paths.py` is deliberately duplicated with the paper repo's `compute_all_organelle_precision_recall.py:eval_dir_for`; both files carry a comment pointing at the other.

🤖 Generated with Claude Code