feat(panda): annotation-based nuclei labeling #10
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Adds an annotation-label generation pipeline (configs, Ray preprocessing, MLflow logging, k8s runner), updates PANDA dataset config (mask tile sizes, MLflow URIs, new supervision.annotation), refactors polygons→raster visualization to use metadata URI lists and item-based Ray tasks, and adjusts related configs and job scripts.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Entrypoint as Entrypoint (Hydra / MLflow)
    participant Metadata as Metadata CSV/Parquet
    participant Ray as Ray Scheduler
    participant Nuclei as Nuclei Parquet
    participant TIFF as Annotation Mask TIFF
    participant MLflow as MLflow Artifacts
    Entrypoint->>Metadata: download/concat metadata (config.metadata_uri)
    Entrypoint->>Entrypoint: filter slides with has_annotation & has_segmentation
    Entrypoint->>Ray: submit label_slide tasks for each slide
    loop per slide (parallel)
        Ray->>Nuclei: load nuclei polygons for slide
        Ray->>TIFF: read slide annotation mask
        Ray->>Ray: transform polygon coords -> mask pixels
        Ray->>Ray: sample mask vertices, apply provider thresholds
        Ray->>Ray: compute per-nucleus coverage -> annot_label
        Ray->>Nuclei: write slide-level output Parquet
    end
    Entrypoint->>MLflow: log output Parquets under config.mlflow_artifact_path
```
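The per-slide labeling loop in the first diagram — sample the annotation mask at each nucleus vertex, apply the provider threshold, and derive a label from the carcinoma coverage — can be sketched roughly as follows. This is a simplified illustration with hypothetical names and a placeholder coverage threshold; the actual pipeline reads Parquet/TIFF inputs and thresholds from its Hydra config.

```python
import numpy as np

# Hypothetical threshold table: in PANDA, Radboud masks encode Gleason
# patterns as values >= 3, while Karolinska marks cancerous tissue as 2.
CARCINOMA_THRESHOLD = {"radboud": 3, "karolinska": 2}

def label_nucleus(vertices: np.ndarray, annotation_mask: np.ndarray,
                  provider: str, coverage_thr: float = 0.5) -> int:
    """Assign a binary annot_label to one nucleus from its polygon vertices.

    vertices: (N, 2) array of (x, y) pixel coordinates in mask space.
    annotation_mask: 2D array of per-pixel annotation classes.
    """
    if provider not in CARCINOMA_THRESHOLD:
        raise ValueError(f"Unsupported data_provider: {provider!r}")
    xs = vertices[:, 0].astype(int).clip(0, annotation_mask.shape[1] - 1)
    ys = vertices[:, 1].astype(int).clip(0, annotation_mask.shape[0] - 1)
    annot_values = annotation_mask[ys, xs]          # sample mask at each vertex
    is_carcinoma = annot_values >= CARCINOMA_THRESHOLD[provider]
    coverage = is_carcinoma.mean()                  # fraction of carcinoma vertices
    return int(coverage >= coverage_thr)

mask = np.zeros((100, 100), dtype=np.uint8)
mask[40:60, 40:60] = 4                              # a Gleason-4 region
nucleus = np.array([[45, 45], [50, 45], [50, 50], [45, 50]])
print(label_nucleus(nucleus, mask, "radboud"))      # all vertices inside -> 1
```

Note that validating the provider up front (rather than defaulting unknown values to one threshold) is exactly what the review below asks for.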
```mermaid
sequenceDiagram
    participant Entrypoint as Entrypoint
    participant Metadata as Metadata URIs
    participant Ray as Ray Scheduler
    participant Slide as WSI Slide
    participant Nuclei as Nuclei Parquet
    participant Raster as Rasterizer
    participant MLflow as MLflow Artifacts
    Entrypoint->>Metadata: load/concat metadata via uris2df(config.metadata_uris)
    Entrypoint->>Entrypoint: deduplicate and build per-slide item dicts
    Entrypoint->>Ray: submit process_slide tasks with item dicts
    loop per slide (parallel)
        Ray->>Slide: open slide at level 0
        Ray->>Nuclei: load nuclei polygons from slide_nuclei_path
        Ray->>Raster: rasterize polygons into tiles (mask_tile_width / mask_tile_height)
        Raster->>Raster: apply labels/CAMs/predictions if present
        Raster->>MLflow: write/upload TIFF tiles
    end
    Entrypoint->>MLflow: log/aggregate raster outputs
```
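The rasterization step in the second diagram — turning nuclei polygons into single-channel mask tiles — can be sketched with PIL. This is a minimal sketch; the function name, tile size, and fill values are illustrative placeholders, not the project's actual config.

```python
from PIL import Image, ImageDraw

def rasterize_polygons_tile(polygons, labels, tile_origin, tile_size):
    """Rasterize labelled polygons into a single-channel ("L") mask tile.

    polygons: list of [(x, y), ...] vertex lists in slide coordinates.
    labels: per-polygon integer fill values (0 is background).
    tile_origin: (x, y) of the tile's top-left corner in slide coordinates.
    """
    tile = Image.new("L", tile_size, color=0)
    draw = ImageDraw.Draw(tile)
    ox, oy = tile_origin
    for polygon, label in zip(polygons, labels):
        # Shift slide coordinates into tile-local coordinates before drawing.
        local = [(x - ox, y - oy) for x, y in polygon]
        draw.polygon(local, fill=int(label))
    return tile

polys = [[(10, 10), (30, 10), (30, 30), (10, 30)]]
tile = rasterize_polygons_tile(polys, labels=[255], tile_origin=(0, 0), tile_size=(64, 64))
print(tile.getpixel((20, 20)))  # inside the polygon -> 255
```

A per-slide task would call this once per (mask_tile_width, mask_tile_height) window and write each tile out as TIFF, which keeps memory bounded regardless of slide size.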
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new feature for annotation-based nuclei labeling specifically for the PANDA dataset. It sets up the necessary data configurations, implements the core logic for assigning labels to individual nuclei based on their spatial relationship with expert annotations, and provides a mechanism to run this preprocessing step as a scalable job. This enhancement is crucial for enabling downstream analysis and model training that relies on accurately labeled nuclei within whole slide images.

Highlights
Code Review
This pull request introduces a new preprocessing pipeline for annotation-based nuclei labeling on the PANDA dataset. The changes include new Hydra configuration files to manage the experiment, the core Python script for the labeling logic using Ray for parallelization, and a job submission script.
My review focuses on correctness and potential runtime errors, in line with the repository's style guide. I've identified two high-severity issues that are likely to cause the pipeline to crash:
- The mask handling logic in `preprocessing/annotation_labels.py` is not robust to all possible image dimension layouts (e.g., channels-first).
- The job submission script in `scripts/preprocessing/run_annotation_labels.py` uses incorrect Hydra syntax, which will likely prevent the job from running.
I have provided specific code suggestions to fix these issues. The overall structure and use of Hydra for configuration are well-implemented.
🧹 Nitpick comments (3)
preprocessing/annotation_labels.py (1)
58-61: Validate `data_provider` explicitly.

Every non-`"radboud"` row currently falls into the Karolinska threshold. If the metadata ever contains an unexpected value or casing variant, this will silently generate wrong labels instead of failing fast.

🧭 Suggested change

```diff
-    if provider == "radboud":
-        is_carcinoma_vertex = annot_labels >= 3
-    else:  # karolinska
-        is_carcinoma_vertex = annot_labels >= 2
+    provider = str(provider).strip().lower()
+    if provider == "radboud":
+        is_carcinoma_vertex = annot_labels >= 3
+    elif provider == "karolinska":
+        is_carcinoma_vertex = annot_labels >= 2
+    else:
+        raise ValueError(f"Unsupported data_provider={provider!r} for slide {slide_id}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/annotation_labels.py` around lines 58 - 61, The current branch uses any non-"radboud" value as Karolinska which can silently mislabel data; update the logic around provider/`data_provider` (the variable currently named `provider`) to explicitly validate allowed values (e.g., normalize casing with .lower()) and then set `is_carcinoma_vertex = annot_labels >= 3` for "radboud" and `annot_labels >= 2` for "karolinska"; if the provider is not one of the expected values raise a ValueError (or similar) so unexpected or missing providers fail fast instead of defaulting to Karolinska.

configs/data/sources/panda.yaml (1)
11-11: Consider rooting `slides_properties` in config.

`slides_properties` is the only PANDA path here that can't be moved with `data_path` or `project_path`, which makes this source harder to reuse outside the current filesystem layout.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@configs/data/sources/panda.yaml` at line 11, The slides_properties entry is hardcoded to an absolute path; make it configurable and relative by rooting it under the existing data_path or project_path so the PANDA source is portable: change the config value for slides_properties to a relative path (e.g. PANDA/slides.parquet) and update the code that reads slides_properties to join it with data_path or project_path (use the same path variable used for other PANDA files) before opening the file so behavior is unchanged for current deployments but works on other filesystems.

scripts/preprocessing/run_annotation_labels.py (1)
11-15: Pin the checkout to an explicit ref.

Right now the job runs whatever commit happens to be at the remote default branch when the pod starts, so the same launcher can generate different label artifacts over time.

📌 Suggested change

```diff
 script=[
     "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir",
     "cd workdir",
+    "git checkout <commit-sha-or-tag>",
     "uv sync --frozen",
     "uv run -m preprocessing.annotation_labels +experiment=preprocessing/annotation_labels/...",
 ],
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/preprocessing/run_annotation_labels.py` around lines 11 - 15, The git clone in the script array currently checks out the repository's default branch; change it to pin to an explicit commit/ref by updating the clone step in run_annotation_labels.py's script (the "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir" entry) to fetch and checkout a specific tag/commit/branch (e.g., use git clone --depth 1 --branch <ref> ... or clone then git -C workdir checkout <commit>) so the job always runs a deterministic ref; ensure the chosen ref is stored/parametrized so future runs can update it intentionally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@configs/data/sources/panda.yaml`:
- Line 11: The slides_properties entry is hardcoded to an absolute path; make it
configurable and relative by rooting it under the existing data_path or
project_path so the PANDA source is portable: change the config value for
slides_properties to a relative path (e.g. PANDA/slides.parquet) and update the
code that reads slides_properties to join it with data_path or project_path (use
the same path variable used for other PANDA files) before opening the file so
behavior is unchanged for current deployments but works on other filesystems.
In `@preprocessing/annotation_labels.py`:
- Around line 58-61: The current branch uses any non-"radboud" value as
Karolinska which can silently mislabel data; update the logic around
provider/`data_provider` (the variable currently named `provider`) to explicitly
validate allowed values (e.g., normalize casing with .lower()) and then set
`is_carcinoma_vertex = annot_labels >= 3` for "radboud" and `annot_labels >= 2`
for "karolinska"; if the provider is not one of the expected values raise a
ValueError (or similar) so unexpected or missing providers fail fast instead of
defaulting to Karolinska.
In `@scripts/preprocessing/run_annotation_labels.py`:
- Around line 11-15: The git clone in the script array currently checks out the
repository's default branch; change it to pin to an explicit commit/ref by
updating the clone step in run_annotation_labels.py's script (the "git clone
https://github.com/RationAI/nuclei-graph-transformer.git workdir" entry) to
fetch and checkout a specific tag/commit/branch (e.g., use git clone --depth 1
--branch <ref> ... or clone then git -C workdir checkout <commit>) so the job
always runs a deterministic ref; ensure the chosen ref is stored/parametrized so
future runs can update it intentionally.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 64217716-d3da-4393-8815-8befe4b937d0
📒 Files selected for processing (7)
- configs/base.yaml
- configs/data/sources/panda.yaml
- configs/data/sources/prostate_cancer.yaml
- configs/experiment/preprocessing/annotation_labels/panda.yaml
- configs/preprocessing/annotation_labels.yaml
- preprocessing/annotation_labels.py
- scripts/preprocessing/run_annotation_labels.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/visualization/run_polygons_rasterization.py`:
- Around line 4-18: The module currently calls submit_job(...) at import time
causing side effects; wrap that call in a main function (e.g., def main():
submit_job(...)) and invoke main() only under the guard if __name__ ==
"__main__": so importing the module doesn't submit the Kubernetes job; ensure
the existing submit_job invocation and its arguments remain unchanged and move
them into the new main function (referencing submit_job and the new main) so
tests and tooling can import the module safely.
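The main-guard pattern this prompt asks for looks like the sketch below. Here `submit_job` is a stand-in for the project's real Kubernetes submission helper, and the script list is abbreviated; only the structure matters.

```python
submitted = []

def submit_job(script: list[str]) -> None:
    """Stand-in for the real Kubernetes job submission helper."""
    submitted.append(script)

def main() -> None:
    # All side effects live here, not at module import time.
    submit_job([
        "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir",
        "cd workdir",
        "uv sync --frozen",
    ])

if __name__ == "__main__":
    # Importing this module no longer submits the job; running it does.
    main()
```

With this layout, `import run_polygons_rasterization` in a test or tool is free of side effects, while `python -m ...` still launches the job.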
In `@visualization/polygons2raster.py`:
- Around line 119-123: The code currently hardcodes level = 0 and then does mask
= Image.new("L", size=mask_size), which forces a full-resolution in-memory
raster for very large WSIs; change this by selecting an appropriate pyramid
level instead of always 0 (use slide.level_count or compute desired level from
mask_mpp_x/mask_mpp_y or an external target_mpp parameter via
slide.slide_resolution/slide.level_dimensions) and avoid creating a single giant
Image for level 0 — either create the mask at a downsampled level or implement a
tiled/streamed rasterization path that iterates over tiles (fetching regions via
OpenSlide.read_region) and writes tiles directly to disk or to the output TIFF
writer rather than materializing mask = Image.new(...) for the entire slide;
update references around OpenSlide(item["slide_path"]), level,
slide.slide_resolution, slide.level_dimensions, and the mask creation to use the
selected level or tile streaming.
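The level-selection idea in this prompt — rasterize at a downsampled pyramid level instead of always level 0 — reduces to picking the highest-resolution level whose downsample factor does not exceed a target. A pure-function sketch (real code would take these values from the slide's pyramid metadata, e.g. OpenSlide's `level_downsamples`):

```python
def pick_level(level_downsamples: list[float], target_downsample: float) -> int:
    """Return the highest-resolution level whose downsample factor does not
    exceed the target, similar in spirit to OpenSlide's
    get_best_level_for_downsample."""
    best = 0
    for level, downsample in enumerate(level_downsamples):
        if downsample <= target_downsample:
            best = level
    return best

# A typical WSI pyramid: level 0 is full resolution, each level ~4x smaller.
downsamples = [1.0, 4.0, 16.0, 64.0]
# Targeting a ~2 um/px mask from a 0.25 um/px slide means an 8x downsample.
print(pick_level(downsamples, target_downsample=8.0))  # -> 1
```

Rasterizing at the chosen level (or streaming tiles at level 0) avoids materializing one full-resolution `Image` for the whole slide.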
- Around line 148-153: uris2df currently returns an empty DataFrame with only
"slide_path" when uris is empty, but callers expect "slide_nuclei_path" too;
update uris2df so that the empty branch returns a schema-consistent DataFrame
including both "slide_path" and "slide_nuclei_path" (and any other columns that
callers index later) to avoid KeyError, or alternatively validate/raise a clear
error if those columns are required; locate and modify the uris2df function to
return the consistent column set.
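The schema-consistent empty return this prompt describes can be sketched as below. Column names come from the review; the function name and the choice to raise on missing columns are illustrative.

```python
import pandas as pd

REQUIRED_COLUMNS = ["slide_path", "slide_nuclei_path"]

def uris2df_sketch(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate metadata frames, returning an empty-but-typed frame
    when no inputs are given so downstream column access never KeyErrors."""
    if not frames:
        return pd.DataFrame(columns=REQUIRED_COLUMNS)
    metadata = pd.concat(frames, ignore_index=True).drop_duplicates(subset=["slide_path"])
    missing = set(REQUIRED_COLUMNS) - set(metadata.columns)
    if missing:
        raise ValueError(f"Missing required metadata columns: {sorted(missing)}")
    return metadata

empty = uris2df_sketch([])
print(list(empty.columns))  # -> ['slide_path', 'slide_nuclei_path']
```

Either branch of the prompt works — return a typed empty frame, or fail fast — but both keep the column set identical for every caller.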
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 210a64e5-693f-46ba-bfbc-7349b85dc2d6
📒 Files selected for processing (8)
- configs/data/sources/panda.yaml
- configs/experiment/preprocessing/annotation_labels/panda.yaml
- configs/preprocessing/annotation_labels.yaml
- configs/visualization/polygons2raster/prostate_cancer.yaml
- configs/visualization/polygons2raster/radboud.yaml
- preprocessing/annotation_labels.py
- scripts/visualization/run_polygons_rasterization.py
- visualization/polygons2raster.py
✅ Files skipped from review due to trivial changes (2)
- configs/visualization/polygons2raster/radboud.yaml
- configs/experiment/preprocessing/annotation_labels/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- configs/preprocessing/annotation_labels.yaml
- preprocessing/annotation_labels.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
visualization/polygons2raster.py (1)
170-177: ⚠️ Potential issue | 🔴 Critical

Read the prediction threshold from the renamed config field.

The configs added in this PR use `predictions_thr`, but Line 177 still reads `pred_thr`. In mode 2 that keeps `pred_thr=None`, so Line 66 asserts and prediction rasterization never starts.

💡 Proposed fix

```diff
-    "pred_thr": config.get("pred_thr", None),
+    "pred_thr": config.get(
+        "predictions_thr",
+        config.get("pred_thr", None),
+    ),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@visualization/polygons2raster.py` around lines 170 - 177, The kwargs dict builds fn_kwargs with the prediction threshold using the old key "pred_thr"; update the dict to read the renamed config field by replacing config.get("pred_thr", None) with config.get("predictions_thr", None) (so fn_kwargs["pred_thr"] or the threshold entry gets the value from config["predictions_thr"]) to ensure the prediction threshold is set when running visualization_mode 2 and allows the rasterization/assertion in the prediction path to proceed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@visualization/polygons2raster.py`:
- Around line 170-177: The kwargs dict builds fn_kwargs with the prediction
threshold using the old key "pred_thr"; update the dict to read the renamed
config field by replacing config.get("pred_thr", None) with
config.get("predictions_thr", None) (so fn_kwargs["pred_thr"] or the threshold
entry gets the value from config["predictions_thr"]) to ensure the prediction
threshold is set when running visualization_mode 2 and allows the
rasterization/assertion in the prediction path to proceed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 959f2119-0ba4-4263-bde9-806f8bcd032a
📒 Files selected for processing (1)
visualization/polygons2raster.py
♻️ Duplicate comments (1)
visualization/polygons2raster.py (1)
148-152: ⚠️ Potential issue | 🟡 Minor

Handle empty `uris` list and validate required columns.

If `uris` is an empty list, `pd.concat([])` raises `ValueError: No objects to concatenate`. Additionally, Line 168 expects columns `["slide_path", "slide_nuclei_path"]` to exist, but there's no validation that the loaded Parquet files contain them.

🛡️ Proposed fix to add validation

```diff
 def uris2df(uris: list[str]) -> pd.DataFrame:
     """Loads and merges multiple metadata Parquet files into a single DataFrame."""
+    if not uris:
+        raise ValueError("metadata_uris cannot be empty")
     batches = [pd.read_parquet(download_artifacts(uri)) for uri in uris]
-    return pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    metadata = pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    required = {"slide_path", "slide_nuclei_path"}
+    missing = required - set(metadata.columns)
+    if missing:
+        raise KeyError(f"Missing required metadata columns: {sorted(missing)}")
+    return metadata
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@visualization/polygons2raster.py` around lines 148 - 152, In uris2df, handle an empty uris list and validate required columns: if uris is empty return an empty pandas DataFrame with at least the expected columns ["slide_path", "slide_nuclei_path"] (and any other columns you need downstream) so callers don't hit pd.concat([]); after reading and concatenating batches in uris2df, verify that the resulting DataFrame contains the required columns and if any are missing raise a clear ValueError listing the missing column names so callers know which Parquet inputs are malformed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@visualization/polygons2raster.py`:
- Around line 148-152: In uris2df, handle an empty uris list and validate
required columns: if uris is empty return an empty pandas DataFrame with at
least the expected columns ["slide_path", "slide_nuclei_path"] (and any other
columns you need downstream) so callers don't hit pd.concat([]); after reading
and concatenating batches in uris2df, verify that the resulting DataFrame
contains the required columns and if any are missing raise a clear ValueError
listing the missing column names so callers know which Parquet inputs are
malformed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a1533ef1-410e-424f-9d59-4c5f8e0c3d29
📒 Files selected for processing (5)
- configs/data/sources/panda.yaml
- configs/visualization/polygons2raster/prostate_cancer_mmci_tl.yaml
- preprocessing/annotation_labels.py
- scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
- visualization/polygons2raster.py
✅ Files skipped from review due to trivial changes (1)
- scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
- configs/data/sources/panda.yaml
- preprocessing/annotation_labels.py
🧹 Nitpick comments (1)
preprocessing/metadata_mapping/panda.py (1)
40-52: Verify NaN handling for `gleason_score` in downstream consumers.

The `gleason_score` column is passed through from the input metadata CSV without explicit validation. Since the input may contain NaN values for incomplete or failed slides, verify that downstream code (particularly any DataModule or labeling logic) either filters these rows or handles NaN values gracefully.

Note: Other similar columns (`mpp_x`, `mpp_y`) follow the same pattern without explicit validation here, suggesting this design may be intentional and validated elsewhere in the pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/metadata_mapping/panda.py` around lines 40 - 52, map_df is currently built from slides["gleason_score"] without validation, so add explicit NaN handling before or during DataFrame construction: check slides["gleason_score"] for NaN and either drop those rows (e.g., filter slides = slides[~slides["gleason_score"].isna()]) and log a warning about dropped slides, or add a boolean column (e.g., "has_valid_gleason") to map_df to mark missing labels so downstream DataModule/labeling logic can skip or handle them; also apply the same validation pattern to mpp_x/mpp_y if needed. Ensure you reference the map_df creation and slides["gleason_score"] (and optionally mpp_x/mpp_y) when making the change.
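The drop-and-warn option the prompt mentions can be sketched as below. The column names come from the review; the helper name and logging choice are illustrative.

```python
import logging
import pandas as pd

def filter_valid_gleason(slides: pd.DataFrame) -> pd.DataFrame:
    """Drop slides without a Gleason score and log how many were removed."""
    valid = slides[slides["gleason_score"].notna()].copy()
    dropped = len(slides) - len(valid)
    if dropped:
        logging.warning("Dropped %d slides with missing gleason_score", dropped)
    return valid

slides = pd.DataFrame({
    "slide_id": ["a", "b", "c"],
    "gleason_score": ["3+4", None, "4+5"],
})
print(len(filter_valid_gleason(slides)))  # -> 2
```

The alternative the prompt suggests — adding a boolean `has_valid_gleason` column instead of dropping rows — keeps the full slide inventory while letting the DataModule decide what to skip.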
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@preprocessing/metadata_mapping/panda.py`:
- Around line 40-52: map_df is currently built from slides["gleason_score"]
without validation, so add explicit NaN handling before or during DataFrame
construction: check slides["gleason_score"] for NaN and either drop those rows
(e.g., filter slides = slides[~slides["gleason_score"].isna()]) and log a
warning about dropped slides, or add a boolean column (e.g.,
"has_valid_gleason") to map_df to mark missing labels so downstream
DataModule/labeling logic can skip or handle them; also apply the same
validation pattern to mpp_x/mpp_y if needed. Ensure you reference the map_df
creation and slides["gleason_score"] (and optionally mpp_x/mpp_y) when making
the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 57647393-618f-45dc-94a7-e1549d31dd42
📒 Files selected for processing (2)
- configs/data/sources/panda.yaml
- preprocessing/metadata_mapping/panda.py
🚧 Files skipped from review as they are similar to previous changes (1)
- configs/data/sources/panda.yaml
Depends on #8 #9 #12 #11
Summary by CodeRabbit
New Features
Chores