feat(panda): annotation-based nuclei labeling #10
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Adds an annotation-label generation pipeline (configs, Ray preprocessing, MLflow logging, k8s runner), updates PANDA dataset config (mask tile sizes, MLflow URIs, new supervision.annotation), refactors polygons→raster visualization to use metadata URI lists and item-based Ray tasks, and adjusts related configs and job scripts.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Entrypoint as Entrypoint (Hydra / MLflow)
    participant Metadata as Metadata CSV/Parquet
    participant Ray as Ray Scheduler
    participant Nuclei as Nuclei Parquet
    participant TIFF as Annotation Mask TIFF
    participant MLflow as MLflow Artifacts
    Entrypoint->>Metadata: download/concat metadata (config.metadata_uri)
    Entrypoint->>Entrypoint: filter slides with has_annotation & has_segmentation
    Entrypoint->>Ray: submit label_slide tasks for each slide
    loop per slide (parallel)
        Ray->>Nuclei: load nuclei polygons for slide
        Ray->>TIFF: read slide annotation mask
        Ray->>Ray: transform polygon coords -> mask pixels
        Ray->>Ray: sample mask vertices, apply provider thresholds
        Ray->>Ray: compute per-nucleus coverage -> annot_label
        Ray->>Nuclei: write slide-level output Parquet
    end
    Entrypoint->>MLflow: log output Parquets under config.mlflow_artifact_path
```
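The per-slide labeling loop in the first diagram — sample the annotation mask at each nucleus vertex, apply the provider threshold, and derive a label from the carcinoma coverage — can be sketched roughly as follows. This is a simplified illustration with hypothetical names and a placeholder coverage threshold; the actual pipeline reads Parquet/TIFF inputs and thresholds from its Hydra config.

```python
import numpy as np

# Hypothetical threshold table: in PANDA, Radboud masks encode Gleason
# patterns as values >= 3, while Karolinska marks cancerous tissue as 2.
CARCINOMA_THRESHOLD = {"radboud": 3, "karolinska": 2}

def label_nucleus(vertices: np.ndarray, annotation_mask: np.ndarray,
                  provider: str, coverage_thr: float = 0.5) -> int:
    """Assign a binary annot_label to one nucleus from its polygon vertices.

    vertices: (N, 2) array of (x, y) pixel coordinates in mask space.
    annotation_mask: 2D array of per-pixel annotation classes.
    """
    if provider not in CARCINOMA_THRESHOLD:
        raise ValueError(f"Unsupported data_provider: {provider!r}")
    xs = vertices[:, 0].astype(int).clip(0, annotation_mask.shape[1] - 1)
    ys = vertices[:, 1].astype(int).clip(0, annotation_mask.shape[0] - 1)
    annot_values = annotation_mask[ys, xs]          # sample mask at each vertex
    is_carcinoma = annot_values >= CARCINOMA_THRESHOLD[provider]
    coverage = is_carcinoma.mean()                  # fraction of carcinoma vertices
    return int(coverage >= coverage_thr)

mask = np.zeros((100, 100), dtype=np.uint8)
mask[40:60, 40:60] = 4                              # a Gleason-4 region
nucleus = np.array([[45, 45], [50, 45], [50, 50], [45, 50]])
print(label_nucleus(nucleus, mask, "radboud"))      # all vertices inside -> 1
```

Note that validating the provider up front (rather than defaulting unknown values to one threshold) is exactly what the review below asks for.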
```mermaid
sequenceDiagram
    participant Entrypoint as Entrypoint
    participant Metadata as Metadata URIs
    participant Ray as Ray Scheduler
    participant Slide as WSI Slide
    participant Nuclei as Nuclei Parquet
    participant Raster as Rasterizer
    participant MLflow as MLflow Artifacts
    Entrypoint->>Metadata: load/concat metadata via uris2df(config.metadata_uris)
    Entrypoint->>Entrypoint: deduplicate and build per-slide item dicts
    Entrypoint->>Ray: submit process_slide tasks with item dicts
    loop per slide (parallel)
        Ray->>Slide: open slide at level 0
        Ray->>Nuclei: load nuclei polygons from slide_nuclei_path
        Ray->>Raster: rasterize polygons into tiles (mask_tile_width / mask_tile_height)
        Raster->>Raster: apply labels/CAMs/predictions if present
        Raster->>MLflow: write/upload TIFF tiles
    end
    Entrypoint->>MLflow: log/aggregate raster outputs
```
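The rasterization step in the second diagram — turning nuclei polygons into single-channel mask tiles — can be sketched with PIL. This is a minimal sketch; the function name, tile size, and fill values are illustrative placeholders, not the project's actual config.

```python
from PIL import Image, ImageDraw

def rasterize_polygons_tile(polygons, labels, tile_origin, tile_size):
    """Rasterize labelled polygons into a single-channel ("L") mask tile.

    polygons: list of [(x, y), ...] vertex lists in slide coordinates.
    labels: per-polygon integer fill values (0 is background).
    tile_origin: (x, y) of the tile's top-left corner in slide coordinates.
    """
    tile = Image.new("L", tile_size, color=0)
    draw = ImageDraw.Draw(tile)
    ox, oy = tile_origin
    for polygon, label in zip(polygons, labels):
        # Shift slide coordinates into tile-local coordinates before drawing.
        local = [(x - ox, y - oy) for x, y in polygon]
        draw.polygon(local, fill=int(label))
    return tile

polys = [[(10, 10), (30, 10), (30, 30), (10, 30)]]
tile = rasterize_polygons_tile(polys, labels=[255], tile_origin=(0, 0), tile_size=(64, 64))
print(tile.getpixel((20, 20)))  # inside the polygon -> 255
```

A per-slide task would call this once per (mask_tile_width, mask_tile_height) window and write each tile out as TIFF, which keeps memory bounded regardless of slide size.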
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new feature for annotation-based nuclei labeling specifically for the PANDA dataset. It sets up the necessary data configurations, implements the core logic for assigning labels to individual nuclei based on their spatial relationship with expert annotations, and provides a mechanism to run this preprocessing step as a scalable job. This enhancement is crucial for enabling downstream analysis and model training that relies on accurately labeled nuclei within whole slide images.

Highlights
Code Review
This pull request introduces a new preprocessing pipeline for annotation-based nuclei labeling on the PANDA dataset. The changes include new Hydra configuration files to manage the experiment, the core Python script for the labeling logic using Ray for parallelization, and a job submission script.
My review focuses on correctness and potential runtime errors, in line with the repository's style guide. I've identified two high-severity issues that are likely to cause the pipeline to crash:
- The mask handling logic in `preprocessing/annotation_labels.py` is not robust to all possible image dimension layouts (e.g., channels-first).
- The job submission script in `scripts/preprocessing/run_annotation_labels.py` uses incorrect Hydra syntax, which will likely prevent the job from running.
I have provided specific code suggestions to fix these issues. The overall structure and use of Hydra for configuration are well-implemented.
🧹 Nitpick comments (3)
preprocessing/annotation_labels.py (1)
58-61: Validate `data_provider` explicitly.

Every non-`"radboud"` row currently falls into the Karolinska threshold. If the metadata ever contains an unexpected value or casing variant, this will silently generate wrong labels instead of failing fast.

🧭 Suggested change

```diff
-    if provider == "radboud":
-        is_carcinoma_vertex = annot_labels >= 3
-    else:  # karolinska
-        is_carcinoma_vertex = annot_labels >= 2
+    provider = str(provider).strip().lower()
+    if provider == "radboud":
+        is_carcinoma_vertex = annot_labels >= 3
+    elif provider == "karolinska":
+        is_carcinoma_vertex = annot_labels >= 2
+    else:
+        raise ValueError(f"Unsupported data_provider={provider!r} for slide {slide_id}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/annotation_labels.py` around lines 58 - 61, The current branch uses any non-"radboud" value as Karolinska which can silently mislabel data; update the logic around provider/`data_provider` (the variable currently named `provider`) to explicitly validate allowed values (e.g., normalize casing with .lower()) and then set `is_carcinoma_vertex = annot_labels >= 3` for "radboud" and `annot_labels >= 2` for "karolinska"; if the provider is not one of the expected values raise a ValueError (or similar) so unexpected or missing providers fail fast instead of defaulting to Karolinska.

configs/data/sources/panda.yaml (1)
11-11: Consider rooting `slides_properties` in config.

`slides_properties` is the only PANDA path here that can't be moved with `data_path` or `project_path`, which makes this source harder to reuse outside the current filesystem layout.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@configs/data/sources/panda.yaml` at line 11, The slides_properties entry is hardcoded to an absolute path; make it configurable and relative by rooting it under the existing data_path or project_path so the PANDA source is portable: change the config value for slides_properties to a relative path (e.g. PANDA/slides.parquet) and update the code that reads slides_properties to join it with data_path or project_path (use the same path variable used for other PANDA files) before opening the file so behavior is unchanged for current deployments but works on other filesystems.

scripts/preprocessing/run_annotation_labels.py (1)
11-15: Pin the checkout to an explicit ref.

Right now the job runs whatever commit happens to be at the remote default branch when the pod starts, so the same launcher can generate different label artifacts over time.

📌 Suggested change

```diff
 script=[
     "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir",
     "cd workdir",
+    "git checkout <commit-sha-or-tag>",
     "uv sync --frozen",
     "uv run -m preprocessing.annotation_labels +experiment=preprocessing/annotation_labels/...",
 ],
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/preprocessing/run_annotation_labels.py` around lines 11 - 15, The git clone in the script array currently checks out the repository's default branch; change it to pin to an explicit commit/ref by updating the clone step in run_annotation_labels.py's script (the "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir" entry) to fetch and checkout a specific tag/commit/branch (e.g., use git clone --depth 1 --branch <ref> ... or clone then git -C workdir checkout <commit>) so the job always runs a deterministic ref; ensure the chosen ref is stored/parametrized so future runs can update it intentionally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@configs/data/sources/panda.yaml`:
- Line 11: The slides_properties entry is hardcoded to an absolute path; make it
configurable and relative by rooting it under the existing data_path or
project_path so the PANDA source is portable: change the config value for
slides_properties to a relative path (e.g. PANDA/slides.parquet) and update the
code that reads slides_properties to join it with data_path or project_path (use
the same path variable used for other PANDA files) before opening the file so
behavior is unchanged for current deployments but works on other filesystems.
In `@preprocessing/annotation_labels.py`:
- Around line 58-61: The current branch uses any non-"radboud" value as
Karolinska which can silently mislabel data; update the logic around
provider/`data_provider` (the variable currently named `provider`) to explicitly
validate allowed values (e.g., normalize casing with .lower()) and then set
`is_carcinoma_vertex = annot_labels >= 3` for "radboud" and `annot_labels >= 2`
for "karolinska"; if the provider is not one of the expected values raise a
ValueError (or similar) so unexpected or missing providers fail fast instead of
defaulting to Karolinska.
In `@scripts/preprocessing/run_annotation_labels.py`:
- Around line 11-15: The git clone in the script array currently checks out the
repository's default branch; change it to pin to an explicit commit/ref by
updating the clone step in run_annotation_labels.py's script (the "git clone
https://github.com/RationAI/nuclei-graph-transformer.git workdir" entry) to
fetch and checkout a specific tag/commit/branch (e.g., use git clone --depth 1
--branch <ref> ... or clone then git -C workdir checkout <commit>) so the job
always runs a deterministic ref; ensure the chosen ref is stored/parametrized so
future runs can update it intentionally.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 64217716-d3da-4393-8815-8befe4b937d0
📒 Files selected for processing (7)
- configs/base.yaml
- configs/data/sources/panda.yaml
- configs/data/sources/prostate_cancer.yaml
- configs/experiment/preprocessing/annotation_labels/panda.yaml
- configs/preprocessing/annotation_labels.yaml
- preprocessing/annotation_labels.py
- scripts/preprocessing/run_annotation_labels.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/visualization/run_polygons_rasterization.py`:
- Around line 4-18: The module currently calls submit_job(...) at import time
causing side effects; wrap that call in a main function (e.g., def main():
submit_job(...)) and invoke main() only under the guard if __name__ ==
"__main__": so importing the module doesn't submit the Kubernetes job; ensure
the existing submit_job invocation and its arguments remain unchanged and move
them into the new main function (referencing submit_job and the new main) so
tests and tooling can import the module safely.
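The main-guard pattern this prompt asks for looks like the sketch below. Here `submit_job` is a stand-in for the project's real Kubernetes submission helper, and the script list is abbreviated; only the structure matters.

```python
submitted = []

def submit_job(script: list[str]) -> None:
    """Stand-in for the real Kubernetes job submission helper."""
    submitted.append(script)

def main() -> None:
    # All side effects live here, not at module import time.
    submit_job([
        "git clone https://github.com/RationAI/nuclei-graph-transformer.git workdir",
        "cd workdir",
        "uv sync --frozen",
    ])

if __name__ == "__main__":
    # Importing this module no longer submits the job; running it does.
    main()
```

With this layout, `import run_polygons_rasterization` in a test or tool is free of side effects, while `python -m ...` still launches the job.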
In `@visualization/polygons2raster.py`:
- Around line 119-123: The code currently hardcodes level = 0 and then does mask
= Image.new("L", size=mask_size), which forces a full-resolution in-memory
raster for very large WSIs; change this by selecting an appropriate pyramid
level instead of always 0 (use slide.level_count or compute desired level from
mask_mpp_x/mask_mpp_y or an external target_mpp parameter via
slide.slide_resolution/slide.level_dimensions) and avoid creating a single giant
Image for level 0 — either create the mask at a downsampled level or implement a
tiled/streamed rasterization path that iterates over tiles (fetching regions via
OpenSlide.read_region) and writes tiles directly to disk or to the output TIFF
writer rather than materializing mask = Image.new(...) for the entire slide;
update references around OpenSlide(item["slide_path"]), level,
slide.slide_resolution, slide.level_dimensions, and the mask creation to use the
selected level or tile streaming.
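The level-selection idea in this prompt — rasterize at a downsampled pyramid level instead of always level 0 — reduces to picking the highest-resolution level whose downsample factor does not exceed a target. A pure-function sketch (real code would take these values from the slide's pyramid metadata, e.g. OpenSlide's `level_downsamples`):

```python
def pick_level(level_downsamples: list[float], target_downsample: float) -> int:
    """Return the highest-resolution level whose downsample factor does not
    exceed the target, similar in spirit to OpenSlide's
    get_best_level_for_downsample."""
    best = 0
    for level, downsample in enumerate(level_downsamples):
        if downsample <= target_downsample:
            best = level
    return best

# A typical WSI pyramid: level 0 is full resolution, each level ~4x smaller.
downsamples = [1.0, 4.0, 16.0, 64.0]
# Targeting a ~2 um/px mask from a 0.25 um/px slide means an 8x downsample.
print(pick_level(downsamples, target_downsample=8.0))  # -> 1
```

Rasterizing at the chosen level (or streaming tiles at level 0) avoids materializing one full-resolution `Image` for the whole slide.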
- Around line 148-153: uris2df currently returns an empty DataFrame with only
"slide_path" when uris is empty, but callers expect "slide_nuclei_path" too;
update uris2df so that the empty branch returns a schema-consistent DataFrame
including both "slide_path" and "slide_nuclei_path" (and any other columns that
callers index later) to avoid KeyError, or alternatively validate/raise a clear
error if those columns are required; locate and modify the uris2df function to
return the consistent column set.
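The schema-consistent empty return this prompt describes can be sketched as below. Column names come from the review; the function name and the choice to raise on missing columns are illustrative.

```python
import pandas as pd

REQUIRED_COLUMNS = ["slide_path", "slide_nuclei_path"]

def uris2df_sketch(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate metadata frames, returning an empty-but-typed frame
    when no inputs are given so downstream column access never KeyErrors."""
    if not frames:
        return pd.DataFrame(columns=REQUIRED_COLUMNS)
    metadata = pd.concat(frames, ignore_index=True).drop_duplicates(subset=["slide_path"])
    missing = set(REQUIRED_COLUMNS) - set(metadata.columns)
    if missing:
        raise ValueError(f"Missing required metadata columns: {sorted(missing)}")
    return metadata

empty = uris2df_sketch([])
print(list(empty.columns))  # -> ['slide_path', 'slide_nuclei_path']
```

Either branch of the prompt works — return a typed empty frame, or fail fast — but both keep the column set identical for every caller.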
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 210a64e5-693f-46ba-bfbc-7349b85dc2d6
📒 Files selected for processing (8)
- configs/data/sources/panda.yaml
- configs/experiment/preprocessing/annotation_labels/panda.yaml
- configs/preprocessing/annotation_labels.yaml
- configs/visualization/polygons2raster/prostate_cancer.yaml
- configs/visualization/polygons2raster/radboud.yaml
- preprocessing/annotation_labels.py
- scripts/visualization/run_polygons_rasterization.py
- visualization/polygons2raster.py
✅ Files skipped from review due to trivial changes (2)
- configs/visualization/polygons2raster/radboud.yaml
- configs/experiment/preprocessing/annotation_labels/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- configs/preprocessing/annotation_labels.yaml
- preprocessing/annotation_labels.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
visualization/polygons2raster.py (1)
170-177: ⚠️ Potential issue | 🔴 Critical

Read the prediction threshold from the renamed config field.

The configs added in this PR use `predictions_thr`, but Line 177 still reads `pred_thr`. In mode 2 that keeps `pred_thr=None`, so Line 66 asserts and prediction rasterization never starts.

💡 Proposed fix

```diff
-    "pred_thr": config.get("pred_thr", None),
+    "pred_thr": config.get(
+        "predictions_thr",
+        config.get("pred_thr", None),
+    ),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@visualization/polygons2raster.py` around lines 170 - 177, The kwargs dict builds fn_kwargs with the prediction threshold using the old key "pred_thr"; update the dict to read the renamed config field by replacing config.get("pred_thr", None) with config.get("predictions_thr", None) (so fn_kwargs["pred_thr"] or the threshold entry gets the value from config["predictions_thr"]) to ensure the prediction threshold is set when running visualization_mode 2 and allows the rasterization/assertion in the prediction path to proceed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@visualization/polygons2raster.py`:
- Around line 170-177: The kwargs dict builds fn_kwargs with the prediction
threshold using the old key "pred_thr"; update the dict to read the renamed
config field by replacing config.get("pred_thr", None) with
config.get("predictions_thr", None) (so fn_kwargs["pred_thr"] or the threshold
entry gets the value from config["predictions_thr"]) to ensure the prediction
threshold is set when running visualization_mode 2 and allows the
rasterization/assertion in the prediction path to proceed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 959f2119-0ba4-4263-bde9-806f8bcd032a
📒 Files selected for processing (1)
visualization/polygons2raster.py
♻️ Duplicate comments (1)
visualization/polygons2raster.py (1)
148-152: ⚠️ Potential issue | 🟡 Minor

Handle empty `uris` list and validate required columns.

If `uris` is an empty list, `pd.concat([])` raises `ValueError: No objects to concatenate`. Additionally, Line 168 expects columns `["slide_path", "slide_nuclei_path"]` to exist, but there's no validation that the loaded Parquet files contain them.

🛡️ Proposed fix to add validation

```diff
 def uris2df(uris: list[str]) -> pd.DataFrame:
     """Loads and merges multiple metadata Parquet files into a single DataFrame."""
+    if not uris:
+        raise ValueError("metadata_uris cannot be empty")
     batches = [pd.read_parquet(download_artifacts(uri)) for uri in uris]
-    return pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    metadata = pd.concat(batches, ignore_index=True).drop_duplicates(subset=["slide_path"])
+    required = {"slide_path", "slide_nuclei_path"}
+    missing = required - set(metadata.columns)
+    if missing:
+        raise KeyError(f"Missing required metadata columns: {sorted(missing)}")
+    return metadata
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@visualization/polygons2raster.py` around lines 148 - 152, In uris2df, handle an empty uris list and validate required columns: if uris is empty return an empty pandas DataFrame with at least the expected columns ["slide_path", "slide_nuclei_path"] (and any other columns you need downstream) so callers don't hit pd.concat([]); after reading and concatenating batches in uris2df, verify that the resulting DataFrame contains the required columns and if any are missing raise a clear ValueError listing the missing column names so callers know which Parquet inputs are malformed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@visualization/polygons2raster.py`:
- Around line 148-152: In uris2df, handle an empty uris list and validate
required columns: if uris is empty return an empty pandas DataFrame with at
least the expected columns ["slide_path", "slide_nuclei_path"] (and any other
columns you need downstream) so callers don't hit pd.concat([]); after reading
and concatenating batches in uris2df, verify that the resulting DataFrame
contains the required columns and if any are missing raise a clear ValueError
listing the missing column names so callers know which Parquet inputs are
malformed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a1533ef1-410e-424f-9d59-4c5f8e0c3d29
📒 Files selected for processing (5)
- configs/data/sources/panda.yaml
- configs/visualization/polygons2raster/prostate_cancer_mmci_tl.yaml
- preprocessing/annotation_labels.py
- scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
- visualization/polygons2raster.py
✅ Files skipped from review due to trivial changes (1)
- scripts/exploration/prostate_cancer_mmci_tl/run_save_metadataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
- configs/data/sources/panda.yaml
- preprocessing/annotation_labels.py
🧹 Nitpick comments (1)
preprocessing/metadata_mapping/panda.py (1)
40-52: Verify NaN handling for `gleason_score` in downstream consumers.

The `gleason_score` column is passed through from the input metadata CSV without explicit validation. Since the input may contain NaN values for incomplete or failed slides, verify that downstream code (particularly any DataModule or labeling logic) either filters these rows or handles NaN values gracefully.

Note: Other similar columns (`mpp_x`, `mpp_y`) follow the same pattern without explicit validation here, suggesting this design may be intentional and validated elsewhere in the pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@preprocessing/metadata_mapping/panda.py` around lines 40 - 52, map_df is currently built from slides["gleason_score"] without validation, so add explicit NaN handling before or during DataFrame construction: check slides["gleason_score"] for NaN and either drop those rows (e.g., filter slides = slides[~slides["gleason_score"].isna()]) and log a warning about dropped slides, or add a boolean column (e.g., "has_valid_gleason") to map_df to mark missing labels so downstream DataModule/labeling logic can skip or handle them; also apply the same validation pattern to mpp_x/mpp_y if needed. Ensure you reference the map_df creation and slides["gleason_score"] (and optionally mpp_x/mpp_y) when making the change.
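The drop-and-warn option the prompt mentions can be sketched as below. The column names come from the review; the helper name and logging choice are illustrative.

```python
import logging
import pandas as pd

def filter_valid_gleason(slides: pd.DataFrame) -> pd.DataFrame:
    """Drop slides without a Gleason score and log how many were removed."""
    valid = slides[slides["gleason_score"].notna()].copy()
    dropped = len(slides) - len(valid)
    if dropped:
        logging.warning("Dropped %d slides with missing gleason_score", dropped)
    return valid

slides = pd.DataFrame({
    "slide_id": ["a", "b", "c"],
    "gleason_score": ["3+4", None, "4+5"],
})
print(len(filter_valid_gleason(slides)))  # -> 2
```

The alternative the prompt suggests — adding a boolean `has_valid_gleason` column instead of dropping rows — keeps the full slide inventory while letting the DataModule decide what to skip.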
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@preprocessing/metadata_mapping/panda.py`:
- Around line 40-52: map_df is currently built from slides["gleason_score"]
without validation, so add explicit NaN handling before or during DataFrame
construction: check slides["gleason_score"] for NaN and either drop those rows
(e.g., filter slides = slides[~slides["gleason_score"].isna()]) and log a
warning about dropped slides, or add a boolean column (e.g.,
"has_valid_gleason") to map_df to mark missing labels so downstream
DataModule/labeling logic can skip or handle them; also apply the same
validation pattern to mpp_x/mpp_y if needed. Ensure you reference the map_df
creation and slides["gleason_score"] (and optionally mpp_x/mpp_y) when making
the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 57647393-618f-45dc-94a7-e1549d31dd42
📒 Files selected for processing (2)
- configs/data/sources/panda.yaml
- preprocessing/metadata_mapping/panda.py
🚧 Files skipped from review as they are similar to previous changes (1)
- configs/data/sources/panda.yaml
Depends on #8 #9 #12 #11
Summary by CodeRabbit
New Features
Chores