
feat(panda): annotation-based nuclei labeling#10

Merged
xrusnack merged 35 commits into master from panda/annotation-labeling on Apr 9, 2026

Conversation

@xrusnack
Member

@xrusnack xrusnack commented Mar 24, 2026

Depends on #8 #9 #12 #11

Summary by CodeRabbit

  • New Features

    • Per-nucleus annotation label generation for PANDA preprocessing.
    • Polygons-to-raster visualization extended for Radboud and prostate cancer workflows.
  • Chores

    • Added/updated preprocessing and visualization configs and presets, including default mask tile size (512×512).
    • Added job submission scripts for annotation-labeling and rasterization runs.
    • Switched dataset artifact URIs to project-template paths and added annotation/mask MLflow artifact locations.
    • Included Gleason score in exported dataset metadata.

@xrusnack xrusnack requested review from matejpekar and vejtek March 24, 2026 11:11
@xrusnack xrusnack self-assigned this Mar 24, 2026
@xrusnack xrusnack requested a review from a team March 24, 2026 11:11
@coderabbitai

coderabbitai Bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4eeadeb4-ee6f-4ae4-9b27-53edae5de2da

📥 Commits

Reviewing files that changed from the base of the PR and between bae197b and 9742857.

📒 Files selected for processing (1)
  • configs/data/sources/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • configs/data/sources/panda.yaml

📝 Walkthrough

Adds an annotation-label generation pipeline (configs, Ray preprocessing, MLflow logging, k8s runner), updates PANDA dataset config (mask tile sizes, MLflow URIs, new supervision.annotation), refactors polygons→raster visualization to use metadata URI lists and item-based Ray tasks, and adjusts related configs and job scripts.

Changes

  • PANDA Dataset Configuration (configs/data/sources/panda.yaml)
    Added dataset.mask_tile_width/dataset.mask_tile_height (512), replaced hardcoded mlflow-artifacts:/97/... URIs with ${mlflow_project_pc}/... for exploration/split/metadata, and added a dataset.providers.radboud.supervision.annotation MLflow artifact URI.
  • Annotation Labels Preprocessing (configs/preprocessing/annotation_labels.yaml, configs/experiment/preprocessing/annotation_labels/panda.yaml, preprocessing/annotation_labels.py, scripts/preprocessing/run_annotation_labels.py)
    New global and experiment configs; a new Ray remote label_slide and main entrypoint convert nucleus polygons plus annotation TIFFs into a per-nucleus binary annot_label, emit per-slide Parquet outputs, and log them to MLflow; added a k8s job submission script.
  • Polygons→Raster Visualization (visualization/polygons2raster.py, configs/visualization/polygons2raster/radboud.yaml, configs/visualization/polygons2raster/prostate_cancer_mmci_tl.yaml, scripts/visualization/run_polygons_rasterization.py)
    Refactored to build per-slide item dicts from uris2df(config.metadata_uris), changed the Ray task to accept an item dict, removed mpp-based level selection (fixed level=0), replaced single train/test metadata URIs with a metadata_uris list, and added a Radboud visualization config and a k8s runner script.
  • Metadata Mapping (preprocessing/metadata_mapping/panda.py)
    Added a gleason_score column to the slides-to-mapping DataFrame (persisted/logged as before).
  • Experiment / Visualization Configs (configs/experiment/visualization/annotation_labels.yaml)
    Removed the heatmap_labels_uri entry while keeping visualization_mode and label_column.
  • Job Name Tweak (scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py)
    Updated the Kubernetes job name to include an _mmci_tl suffix.

Sequence Diagrams

sequenceDiagram
    participant Entrypoint as Entrypoint (Hydra / MLflow)
    participant Metadata as Metadata CSV/Parquet
    participant Ray as Ray Scheduler
    participant Nuclei as Nuclei Parquet
    participant TIFF as Annotation Mask TIFF
    participant MLflow as MLflow Artifacts

    Entrypoint->>Metadata: download/concat metadata (config.metadata_uri)
    Entrypoint->>Entrypoint: filter slides with has_annotation & has_segmentation
    Entrypoint->>Ray: submit label_slide tasks for each slide

    loop per slide (parallel)
        Ray->>Nuclei: load nuclei polygons for slide
        Ray->>TIFF: read slide annotation mask
        Ray->>Ray: transform polygon coords -> mask pixels
        Ray->>Ray: sample mask vertices, apply provider thresholds
        Ray->>Ray: compute per-nucleus coverage -> annot_label
        Ray->>Nuclei: write slide-level output Parquet
    end

    Entrypoint->>MLflow: log output Parquets under config.mlflow_artifact_path
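The per-slide loop above can be sketched in miniature. This is a hypothetical, simplified stand-in: the real label_slide is a Ray remote task reading nuclei Parquet files and annotation TIFFs, and the coverage_thr parameter and toy list-of-lists mask below are illustrative assumptions, not the repository's actual API.

```python
# Mask values at or above the provider threshold count as carcinoma,
# mirroring the thresholds discussed in the review (radboud >= 3,
# karolinska >= 2).
CARCINOMA_THRESHOLD = {"radboud": 3, "karolinska": 2}

def label_nucleus(polygon, mask, provider, coverage_thr=0.5):
    """Return a binary annot_label: 1 if enough vertices hit carcinoma pixels."""
    thr = CARCINOMA_THRESHOLD[provider]
    hits = 0
    for x, y in polygon:
        # Transform polygon coords to mask pixel indices (identity here;
        # the real code rescales between slide and mask resolution).
        px, py = int(x), int(y)
        if 0 <= py < len(mask) and 0 <= px < len(mask[0]) and mask[py][px] >= thr:
            hits += 1
    return int(hits / len(polygon) >= coverage_thr)

# Toy 4x4 annotation mask: the right half carries Gleason-graded tissue (3).
mask = [[0, 0, 3, 3] for _ in range(4)]
tumour_nucleus = [(2, 0), (3, 1), (2, 2), (3, 3)]  # vertices in the right half
benign_nucleus = [(0, 0), (1, 1), (0, 2), (2, 3)]  # mostly in the left half
```

With this toy mask, the tumour nucleus gets annot_label 1 and the benign one gets 0; the real task then writes these labels to a per-slide Parquet file.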
sequenceDiagram
    participant Entrypoint as Entrypoint
    participant Metadata as Metadata URIs
    participant Ray as Ray Scheduler
    participant Slide as WSI Slide
    participant Nuclei as Nuclei Parquet
    participant Raster as Rasterizer
    participant MLflow as MLflow Artifacts

    Entrypoint->>Metadata: load/concat metadata via uris2df(config.metadata_uris)
    Entrypoint->>Entrypoint: deduplicate and build per-slide item dicts
    Entrypoint->>Ray: submit process_slide tasks with item dicts

    loop per slide (parallel)
        Ray->>Slide: open slide at level 0
        Ray->>Nuclei: load nuclei polygons from slide_nuclei_path
        Ray->>Raster: rasterize polygons into tiles (mask_tile_width / mask_tile_height)
        Raster->>Raster: apply labels/CAMs/predictions if present
        Raster->>MLflow: write/upload TIFF tiles
    end

    Entrypoint->>MLflow: log/aggregate raster outputs
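The "deduplicate and build per-slide item dicts" step above can be sketched with plain dicts standing in for the pandas DataFrame that uris2df(config.metadata_uris) returns; the helper name build_items is hypothetical, not code from this PR.

```python
def build_items(metadata_rows):
    """Deduplicate rows by slide_path and build one task item per slide."""
    items, seen = [], set()
    for row in metadata_rows:
        if row["slide_path"] in seen:
            continue  # skip duplicate slides, mirroring drop_duplicates
        seen.add(row["slide_path"])
        items.append({
            "slide_path": row["slide_path"],
            "slide_nuclei_path": row["slide_nuclei_path"],
        })
    return items

rows = [
    {"slide_path": "a.tiff", "slide_nuclei_path": "a.parquet"},
    {"slide_path": "a.tiff", "slide_nuclei_path": "a.parquet"},  # duplicate
    {"slide_path": "b.tiff", "slide_nuclei_path": "b.parquet"},
]
```

Each resulting item dict is then handed to a process_slide Ray task, which keeps the task signature flat instead of passing the whole DataFrame around.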

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • matejpekar
  • vejtek

Poem

🐰 I hopped through masks and Parquets bright,

Polygons traced and labeled by night,
Ray spun tasks while MLflow kept score,
Slides to labels, one file more,
Carrots, code, and pipelines—what delight! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 14.29%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title 'feat(panda): annotation-based nuclei labeling' accurately reflects the main change: annotation-based labeling for nuclei in the PANDA dataset via new preprocessing pipeline components.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new feature for annotation-based nuclei labeling specifically for the PANDA dataset. It sets up the necessary data configurations, implements the core logic for assigning labels to individual nuclei based on their spatial relationship with expert annotations, and provides a mechanism to run this preprocessing step as a scalable job. This enhancement is crucial for enabling downstream analysis and model training that relies on accurately labeled nuclei within whole slide images.

Highlights

  • PANDA Dataset Integration: Integrated the PANDA dataset by adding new configuration files (configs/data/sources/panda.yaml) to define its paths and MLflow URIs.
  • Annotation-based Nuclei Labeling: Implemented a new preprocessing script (preprocessing/annotation_labels.py) that assigns binary labels to nuclei based on their overlap with carcinoma annotations, distinguishing between Radboud and Karolinska data providers.
  • Configuration for Labeling: Introduced new configuration files (configs/preprocessing/annotation_labels.yaml and configs/experiment/preprocessing/annotation_labels/panda.yaml) to manage parameters for the annotation-based nuclei labeling process.
  • Kubernetes Job for Preprocessing: Added a Kubernetes job submission script (scripts/preprocessing/run_annotation_labels.py) to automate the execution of the annotation-based nuclei labeling pipeline.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new preprocessing pipeline for annotation-based nuclei labeling on the PANDA dataset. The changes include new Hydra configuration files to manage the experiment, the core Python script for the labeling logic using Ray for parallelization, and a job submission script.

My review focuses on correctness and potential runtime errors, in line with the repository's style guide. I've identified two high-severity issues that are likely to cause the pipeline to crash:

  • The mask handling logic in preprocessing/annotation_labels.py is not robust to all possible image dimension layouts (e.g., channels-first).
  • The job submission script in scripts/preprocessing/run_annotation_labels.py uses incorrect Hydra syntax, which will likely prevent the job from running.

I have provided specific code suggestions to fix these issues. The overall structure and use of Hydra for configuration are well-implemented.
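The channels-first issue flagged above can be illustrated with a small shape-normalization helper. This is a hedged sketch, not the PR's actual fix: the helper name normalize_mask_shape and the small-axis heuristic are assumptions, and pure-Python tuples stand in for the real numpy array handling.

```python
def normalize_mask_shape(shape):
    """Return (height, width) from (H, W), (H, W, C), or (C, H, W) shapes."""
    if len(shape) == 2:
        return shape
    if len(shape) == 3:
        # Heuristic: treat the small axis (<= 4 entries) as channels.
        if shape[0] <= 4:
            return (shape[1], shape[2])  # channels-first (C, H, W)
        return (shape[0], shape[1])      # channels-last (H, W, C)
    raise ValueError(f"Unsupported mask shape: {shape}")
```

Handling all three layouts up front is what keeps the downstream polygon-to-pixel indexing from crashing on TIFFs whose readers return channels-first arrays.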

Comment thread preprocessing/annotation_labels.py Outdated
Comment thread scripts/preprocessing/run_annotation_labels.py

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
preprocessing/annotation_labels.py (1)

58-61: Validate data_provider explicitly.

Every non-"radboud" row currently falls into the Karolinska threshold. If the metadata ever contains an unexpected value or casing variant, this will silently generate wrong labels instead of failing fast.

🧭 Suggested change
-    if provider == "radboud":
-        is_carcinoma_vertex = annot_labels >= 3
-    else:  # karolinska
-        is_carcinoma_vertex = annot_labels >= 2
+    provider = str(provider).strip().lower()
+    if provider == "radboud":
+        is_carcinoma_vertex = annot_labels >= 3
+    elif provider == "karolinska":
+        is_carcinoma_vertex = annot_labels >= 2
+    else:
+        raise ValueError(f"Unsupported data_provider={provider!r} for slide {slide_id}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/annotation_labels.py` around lines 58 - 61, The current branch
uses any non-"radboud" value as Karolinska which can silently mislabel data;
update the logic around provider/`data_provider` (the variable currently named
`provider`) to explicitly validate allowed values (e.g., normalize casing with
.lower()) and then set `is_carcinoma_vertex = annot_labels >= 3` for "radboud"
and `annot_labels >= 2` for "karolinska"; if the provider is not one of the
expected values raise a ValueError (or similar) so unexpected or missing
providers fail fast instead of defaulting to Karolinska.
configs/data/sources/panda.yaml (1)

11-11: Consider rooting slides_properties in config.

slides_properties is the only PANDA path here that can't be moved with data_path or project_path, which makes this source harder to reuse outside the current filesystem layout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/data/sources/panda.yaml` at line 11, The slides_properties entry is
hardcoded to an absolute path; make it configurable and relative by rooting it
under the existing data_path or project_path so the PANDA source is portable:
change the config value for slides_properties to a relative path (e.g.
PANDA/slides.parquet) and update the code that reads slides_properties to join
it with data_path or project_path (use the same path variable used for other
PANDA files) before opening the file so behavior is unchanged for current
deployments but works on other filesystems.
scripts/preprocessing/run_annotation_labels.py (1)

11-15: Pin the checkout to an explicit ref.

Right now the job runs whatever commit happens to be at the remote default branch when the pod starts, so the same launcher can generate different label artifacts over time.

📌 Suggested change
     script=[
         "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir",
         "cd workdir",
+        "git checkout <commit-sha-or-tag>",
         "uv sync --frozen",
         "uv run -m preprocessing.annotation_labels +experiment=preprocessing/annotation_labels/...",
     ],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/preprocessing/run_annotation_labels.py` around lines 11 - 15, The git
clone in the script array currently checks out the repository's default branch;
change it to pin to an explicit commit/ref by updating the clone step in
run_annotation_labels.py's script (the "git clone
https://github.com/RationAI/nuclei-graph-transformer.git workdir" entry) to
fetch and checkout a specific tag/commit/branch (e.g., use git clone --depth 1
--branch <ref> ... or clone then git -C workdir checkout <commit>) so the job
always runs a deterministic ref; ensure the chosen ref is stored/parametrized so
future runs can update it intentionally.

ℹ️ Review info

Run ID: 64217716-d3da-4393-8815-8befe4b937d0

📥 Commits

Reviewing files that changed from the base of the PR and between a377343 and 4877977.

📒 Files selected for processing (7)
  • configs/base.yaml
  • configs/data/sources/panda.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/experiment/preprocessing/annotation_labels/panda.yaml
  • configs/preprocessing/annotation_labels.yaml
  • preprocessing/annotation_labels.py
  • scripts/preprocessing/run_annotation_labels.py

Comment thread preprocessing/annotation_labels.py Outdated
@xrusnack xrusnack marked this pull request as draft March 27, 2026 13:12
@xrusnack xrusnack marked this pull request as ready for review March 30, 2026 20:51

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/visualization/run_polygons_rasterization.py`:
- Around line 4-18: The module currently calls submit_job(...) at import time
causing side effects; wrap that call in a main function (e.g., def main():
submit_job(...)) and invoke main() only under the guard if __name__ ==
"__main__": so importing the module doesn't submit the Kubernetes job; ensure
the existing submit_job invocation and its arguments remain unchanged and move
them into the new main function (referencing submit_job and the new main) so
tests and tooling can import the module safely.

In `@visualization/polygons2raster.py`:
- Around line 119-123: The code currently hardcodes level = 0 and then does mask
= Image.new("L", size=mask_size), which forces a full-resolution in-memory
raster for very large WSIs; change this by selecting an appropriate pyramid
level instead of always 0 (use slide.level_count or compute desired level from
mask_mpp_x/mask_mpp_y or an external target_mpp parameter via
slide.slide_resolution/slide.level_dimensions) and avoid creating a single giant
Image for level 0 — either create the mask at a downsampled level or implement a
tiled/streamed rasterization path that iterates over tiles (fetching regions via
OpenSlide.read_region) and writes tiles directly to disk or to the output TIFF
writer rather than materializing mask = Image.new(...) for the entire slide;
update references around OpenSlide(item["slide_path"]), level,
slide.slide_resolution, slide.level_dimensions, and the mask creation to use the
selected level or tile streaming.
- Around line 148-153: uris2df currently returns an empty DataFrame with only
"slide_path" when uris is empty, but callers expect "slide_nuclei_path" too;
update uris2df so that the empty branch returns a schema-consistent DataFrame
including both "slide_path" and "slide_nuclei_path" (and any other columns that
callers index later) to avoid KeyError, or alternatively validate/raise a clear
error if those columns are required; locate and modify the uris2df function to
return the consistent column set.
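The mpp-based level selection suggested for polygons2raster.py can be sketched as follows. This is a hedged stand-in assuming OpenSlide-style level_downsamples; pick_level is a hypothetical helper name, not code from this PR.

```python
def pick_level(level_downsamples, base_mpp, target_mpp):
    """Pick the coarsest pyramid level whose resolution still meets target_mpp."""
    best = 0
    for level, downsample in enumerate(level_downsamples):
        # Effective microns-per-pixel at this level; coarser levels have
        # larger mpp, so keep the highest level that is still fine enough.
        if base_mpp * downsample <= target_mpp:
            best = level
    return best

# Typical WSI pyramid: each level downsamples by 4x from a 0.25 mpp base.
levels = [1.0, 4.0, 16.0, 64.0]
```

Rasterizing at the selected level (or streaming tile by tile) avoids materializing a full-resolution level-0 mask in memory for gigapixel slides.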

ℹ️ Review info

Run ID: 210a64e5-693f-46ba-bfbc-7349b85dc2d6

📥 Commits

Reviewing files that changed from the base of the PR and between 4877977 and f251fed.

📒 Files selected for processing (8)
  • configs/data/sources/panda.yaml
  • configs/experiment/preprocessing/annotation_labels/panda.yaml
  • configs/preprocessing/annotation_labels.yaml
  • configs/visualization/polygons2raster/prostate_cancer.yaml
  • configs/visualization/polygons2raster/radboud.yaml
  • preprocessing/annotation_labels.py
  • scripts/visualization/run_polygons_rasterization.py
  • visualization/polygons2raster.py
✅ Files skipped from review due to trivial changes (2)
  • configs/visualization/polygons2raster/radboud.yaml
  • configs/experiment/preprocessing/annotation_labels/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • configs/preprocessing/annotation_labels.yaml
  • preprocessing/annotation_labels.py

Comment thread scripts/visualization/run_polygons_rasterization.py
Comment thread visualization/polygons2raster.py
Comment thread visualization/polygons2raster.py Outdated

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
visualization/polygons2raster.py (1)

170-177: ⚠️ Potential issue | 🔴 Critical

Read the prediction threshold from the renamed config field.

The configs added in this PR use predictions_thr, but Line 177 still reads pred_thr. In mode 2 that keeps pred_thr=None, so Line 66 asserts and prediction rasterization never starts.

💡 Proposed fix
-                "pred_thr": config.get("pred_thr", None),
+                "pred_thr": config.get(
+                    "predictions_thr",
+                    config.get("pred_thr", None),
+                ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@visualization/polygons2raster.py` around lines 170 - 177, The kwargs dict
builds fn_kwargs with the prediction threshold using the old key "pred_thr";
update the dict to read the renamed config field by replacing
config.get("pred_thr", None) with config.get("predictions_thr", None) (so
fn_kwargs["pred_thr"] or the threshold entry gets the value from
config["predictions_thr"]) to ensure the prediction threshold is set when
running visualization_mode 2 and allows the rasterization/assertion in the
prediction path to proceed.

ℹ️ Review info

Run ID: 959f2119-0ba4-4263-bde9-806f8bcd032a

📥 Commits

Reviewing files that changed from the base of the PR and between f251fed and d5df20c.

📒 Files selected for processing (1)
  • visualization/polygons2raster.py


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
visualization/polygons2raster.py (1)

148-152: ⚠️ Potential issue | 🟡 Minor

Handle empty uris list and validate required columns.

If uris is an empty list, pd.concat([]) raises ValueError: No objects to concatenate. Additionally, Line 168 expects columns ["slide_path", "slide_nuclei_path"] to exist, but there's no validation that the loaded Parquet files contain them.

🛡️ Proposed fix to add validation
 def uris2df(uris: list[str]) -> pd.DataFrame:
     """Loads and merges multiple metadata Parquet files into a single DataFrame."""
+    if not uris:
+        raise ValueError("metadata_uris cannot be empty")
     batches = [pd.read_parquet(download_artifacts(uri)) for uri in uris]
-    return pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    metadata = pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    required = {"slide_path", "slide_nuclei_path"}
+    missing = required - set(metadata.columns)
+    if missing:
+        raise KeyError(f"Missing required metadata columns: {sorted(missing)}")
+    return metadata
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@visualization/polygons2raster.py` around lines 148 - 152, In uris2df, handle
an empty uris list and validate required columns: if uris is empty return an
empty pandas DataFrame with at least the expected columns ["slide_path",
"slide_nuclei_path"] (and any other columns you need downstream) so callers
don't hit pd.concat([]); after reading and concatenating batches in uris2df,
verify that the resulting DataFrame contains the required columns and if any are
missing raise a clear ValueError listing the missing column names so callers
know which Parquet inputs are malformed.

ℹ️ Review info

Run ID: a1533ef1-410e-424f-9d59-4c5f8e0c3d29

📥 Commits

Reviewing files that changed from the base of the PR and between 84ff377 and 34c075c.

📒 Files selected for processing (5)
  • configs/data/sources/panda.yaml
  • configs/visualization/polygons2raster/prostate_cancer_mmci_tl.yaml
  • preprocessing/annotation_labels.py
  • scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
  • visualization/polygons2raster.py
✅ Files skipped from review due to trivial changes (1)
  • scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • configs/data/sources/panda.yaml
  • preprocessing/annotation_labels.py

@xrusnack xrusnack requested a review from matejpekar April 2, 2026 12:29

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
preprocessing/metadata_mapping/panda.py (1)

40-52: Verify NaN handling for gleason_score in downstream consumers.

The gleason_score column is passed through from the input metadata CSV without explicit validation. Since the input may contain NaN values for incomplete or failed slides, verify that downstream code (particularly any DataModule or labeling logic) either filters these rows or handles NaN values gracefully.

Note: Other similar columns (mpp_x, mpp_y) follow the same pattern without explicit validation here, suggesting this design may be intentional and validated elsewhere in the pipeline.
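The NaN filtering suggested above can be sketched with a small helper; filter_valid_gleason is a hypothetical name, and plain dicts stand in for the pandas DataFrame the real metadata-mapping code uses.

```python
import math

def filter_valid_gleason(rows):
    """Drop rows whose gleason_score is missing (None or NaN)."""
    def valid(value):
        # NaN only occurs as a float; None covers absent values.
        return value is not None and not (
            isinstance(value, float) and math.isnan(value)
        )
    return [row for row in rows if valid(row["gleason_score"])]
```

Whether to drop such rows or flag them with a boolean column depends on whether downstream labeling logic can tolerate missing Gleason scores.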

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/metadata_mapping/panda.py` around lines 40 - 52, map_df is
currently built from slides["gleason_score"] without validation, so add explicit
NaN handling before or during DataFrame construction: check
slides["gleason_score"] for NaN and either drop those rows (e.g., filter slides
= slides[~slides["gleason_score"].isna()]) and log a warning about dropped
slides, or add a boolean column (e.g., "has_valid_gleason") to map_df to mark
missing labels so downstream DataModule/labeling logic can skip or handle them;
also apply the same validation pattern to mpp_x/mpp_y if needed. Ensure you
reference the map_df creation and slides["gleason_score"] (and optionally
mpp_x/mpp_y) when making the change.

ℹ️ Review info

Run ID: 57647393-618f-45dc-94a7-e1549d31dd42

📥 Commits

Reviewing files that changed from the base of the PR and between 34c075c and 727af55.

📒 Files selected for processing (2)
  • configs/data/sources/panda.yaml
  • preprocessing/metadata_mapping/panda.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • configs/data/sources/panda.yaml

@xrusnack xrusnack merged commit fcce74f into master Apr 9, 2026
3 of 4 checks passed
@xrusnack xrusnack deleted the panda/annotation-labeling branch April 9, 2026 10:38