feat(panda): radboud train-test split by xrusnack · Pull Request #11 · RationAI/nuclei-graph-transformer

xrusnack · 2026-03-28T23:45:05Z

Depends on #8

Summary by CodeRabbit

New Features
- Added data split preprocessing functionality with stratified train/test splitting capabilities
- Enabled provider-based data filtering for dataset-specific splitting workflows
Chores
- Updated dataset artifact references for metadata
- Added configuration files and execution scripts for data preprocessing workflows

gemini-code-assist · 2026-03-28T23:45:14Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a structured pipeline for performing train-test splits on medical imaging datasets. It provides the necessary logic to filter, stratify, and export data splits, while ensuring these results are tracked and versioned via MLflow. The changes also include the configuration management required to apply these splits to specific data providers and a mechanism to execute these tasks in a containerized environment.

Highlights

New Data Splitting Pipeline: Introduced a new preprocessing script preprocessing/data_split.py to perform stratified train-test splits on datasets, with integration for MLflow artifact logging.
Configuration Updates: Added new Hydra configurations for PANDA dataset sources and specific experiment setups for the Radboud provider.
Infrastructure Support: Added a submission script scripts/preprocessing/run_data_split.py to facilitate running the splitting process as a Kubernetes job.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-03-28T23:45:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b20323db-3778-494a-840e-4e528fcbacef

📥 Commits

Reviewing files that changed from the base of the PR and between 0fadee7 and 6d06ff3.

📒 Files selected for processing (1)

configs/data/sources/panda.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

configs/data/sources/panda.yaml

📝 Walkthrough

Walkthrough

This PR introduces data split preprocessing capabilities for the PANDA dataset with Radboud provider support. Changes include updates to dataset source configurations, new preprocessing configuration files with stratified train/test split parameters, a data split script implementation with MLflow artifact management, and a remote job submission script.

Changes

Cohort / File(s)	Summary
Data Source Configuration `configs/data/sources/panda.yaml`	Updated MLflow URI for PANDA exploration metadata artifact and added new `radboud` provider with split CSV reference.
Preprocessing Configuration `configs/preprocessing/data_split.yaml`, `configs/experiment/preprocessing/data_split/radboud.yaml`	Introduced global data split configuration defining `test_size`, `stratify_column`, and optional provider-based restrictions; added Radboud experiment-specific preprocessing config with stratification by `gleason_score` and provider filtering.
Data Split Implementation `preprocessing/data_split.py`, `scripts/preprocessing/run_data_split.py`	Implemented data split script with MLflow artifact download, provider-based filtering, stratified train/test split via sklearn, and summary statistics computation; added remote job submission script for Kubernetes execution.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

feat(panda): data exploration #8: Modifies the same configs/data/sources/panda.yaml file, updating dataset MLflow URIs and introducing the Radboud provider—this PR extends that configuration foundation.

Suggested reviewers

matejpekar
vejtek

Poem

🐰 A split in the data, so fair and so true,
With Radboud's provider, the stratification's due,
From MLflow artifacts, the metadata flows,
Train and test sets—where the learning grows! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: adding a train-test split feature specifically for the Radboud dataset within the PANDA project.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch panda/train-test-split

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a data splitting pipeline for the PANDA dataset, including new Hydra configurations, a Python script for stratified train-test splits with MLflow logging, and a Kubernetes job submission script. Key feedback includes replacing the ... placeholder in the submission script to prevent runtime errors, parameterizing hardcoded absolute paths in the dataset configuration for better portability, and aligning the Hydra configuration with the Python logic to make data restrictions properly optional.

coderabbitai

🧹 Nitpick comments (2)

preprocessing/data_split.py (2)
20-23: OmegaConf None check may not behave as expected.

With OmegaConf, checking config.restriction is not None may not work as intended since the restriction key is always defined in the config template (with required ??? fields). If you want to support optional restrictions in the future, consider checking for a sentinel value or adjusting the config structure.

For the current implementation where restriction is always provided by experiment configs, this works correctly—just noting for future flexibility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/data_split.py` around lines 20 - 23, Replace the direct None
check on config.restriction with an OmegaConf-safe check: test that
config.restriction is neither missing nor explicitly set to null using
OmegaConf.is_missing and OmegaConf.is_none before reading
provider_column/provider_value and filtering slides (i.e., change the condition
around config.restriction, which guards access to provider_column/provider_value
and the slides = slides[...] line, to use not
(OmegaConf.is_missing(config.restriction) or
OmegaConf.is_none(config.restriction))).
31-32: Consider using .copy() to avoid SettingWithCopyWarning.

train_test_split may return views of the original DataFrame. Assigning to the set column on these views can trigger pandas SettingWithCopyWarning and may not modify the data as intended in some pandas versions.
Proposed fix
     train, test = train_test_split(
         slides,
         test_size=config.test_size,
         random_state=42,
         stratify=slides[config.stratify_column],
     )
-    train["set"] = "train"
-    test["set"] = "test"
+    train = train.copy()
+    test = test.copy()
+    train["set"] = "train"
+    test["set"] = "test"
     split = pd.concat([train, test], ignore_index=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@preprocessing/data_split.py` around lines 31 - 32, The assignment
train["set"] = "train" and test["set"] = "test" can operate on views returned by
train_test_split and trigger pandas SettingWithCopyWarning; to fix, ensure you
work on explicit copies by calling train = train.copy() and test = test.copy()
(or otherwise use .loc to assign into an ensured copy) before setting the "set"
column so the modifications to the DataFrame variables train and test are
applied reliably.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@preprocessing/data_split.py`:
- Around line 20-23: Replace the direct None check on config.restriction with an
OmegaConf-safe check: test that config.restriction is neither missing nor
explicitly set to null using OmegaConf.is_missing and OmegaConf.is_none before
reading provider_column/provider_value and filtering slides (i.e., change the
condition around config.restriction, which guards access to
provider_column/provider_value and the slides = slides[...] line, to use not
(OmegaConf.is_missing(config.restriction) or
OmegaConf.is_none(config.restriction))).
- Around line 31-32: The assignment train["set"] = "train" and test["set"] =
"test" can operate on views returned by train_test_split and trigger pandas
SettingWithCopyWarning; to fix, ensure you work on explicit copies by calling
train = train.copy() and test = test.copy() (or otherwise use .loc to assign
into an ensured copy) before setting the "set" column so the modifications to
the DataFrame variables train and test are applied reliably.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 562c68da-a276-4d98-99ad-e66ddef87711

📥 Commits

Reviewing files that changed from the base of the PR and between a377343 and 6614e8f.

📒 Files selected for processing (7)

configs/base.yaml
configs/data/sources/panda.yaml
configs/data/sources/prostate_cancer.yaml
configs/experiment/preprocessing/data_split/radboud.yaml
configs/preprocessing/data_split.yaml
preprocessing/data_split.py
scripts/preprocessing/run_data_split.py

xrusnack added 15 commits March 27, 2026 21:51

feat: train-test split

f4f9041

fix: ruff

7d600f7

fix: config key

3c7682c

fix: run name

5a7fc91

fix: conf structure

4c65671

chore: delete unnecessary keys in cfg

0a0ebc4

fix: filter slides

a8d2513

fix: uri

4af1d0b

fix: naming

e8c0e79

chore: simplify

6643a25

feat: total summary

b9ca91e

chore: simplify

d21e276

fix: data path

8782c13

chore: uri

fbecfd8

fix: mlflow artifact path

6614e8f

xrusnack requested review from matejpekar and vejtek March 28, 2026 23:45

xrusnack self-assigned this Mar 28, 2026

xrusnack requested a review from a team March 28, 2026 23:45

gemini-code-assist Bot reviewed Mar 28, 2026

View reviewed changes

Comment thread scripts/preprocessing/run_data_split.py

Comment thread configs/data/sources/panda.yaml

Comment thread configs/preprocessing/data_split.yaml Outdated

coderabbitai Bot reviewed Mar 28, 2026

View reviewed changes

chore: apply suggestion

0fadee7

This was referenced Mar 30, 2026

feat(panda): metadata mapping #12

Merged

chore: documentation cleanup #13

Open

matejpekar previously approved these changes Mar 30, 2026

View reviewed changes

vejtek previously approved these changes Mar 30, 2026

View reviewed changes

chore: merge master

6d06ff3

xrusnack dismissed vejtek’s stale review via 6d06ff3 March 30, 2026 18:43

xrusnack dismissed matejpekar’s stale review via 6d06ff3 March 30, 2026 18:43

xrusnack requested review from matejpekar and vejtek March 30, 2026 18:46

matejpekar approved these changes Mar 30, 2026

View reviewed changes

xrusnack mentioned this pull request Mar 30, 2026

feat(panda): annotation-based nuclei labeling #10

Merged

vejtek approved these changes Apr 1, 2026

View reviewed changes

xrusnack merged commit d8f5fd5 into master Apr 1, 2026
3 of 4 checks passed

xrusnack deleted the panda/train-test-split branch April 1, 2026 07:50

coderabbitai Bot mentioned this pull request Apr 1, 2026

feat(panda): standardize nuclei seg data #9

Merged

coderabbitai Bot mentioned this pull request Apr 9, 2026

feat(panda): annotation masks #15

Closed

xrusnack added a commit that referenced this pull request Apr 14, 2026

feat(panda): radboud train-test split (#11)

655548c

xrusnack added a commit that referenced this pull request Apr 14, 2026

feat(panda): radboud train-test split (#11)

81a2594

coderabbitai Bot mentioned this pull request Apr 14, 2026

feat: datamodule #2

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(panda): radboud train-test split#11

feat(panda): radboud train-test split#11
xrusnack merged 17 commits intomasterfrom
panda/train-test-split

xrusnack commented Mar 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

gemini-code-assist Bot commented Mar 28, 2026

Uh oh!

coderabbitai Bot commented Mar 28, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xrusnack commented Mar 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

gemini-code-assist Bot commented Mar 28, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai Bot commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xrusnack commented Mar 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 28, 2026 •

edited

Loading