
feat(panda): data exploration #8

Merged
xrusnack merged 47 commits into master from panda/data-exploration
Mar 30, 2026

Conversation

@xrusnack
Member

@xrusnack xrusnack commented Mar 24, 2026

Summary by CodeRabbit

  • New Features

    • Added PANDA dataset support with automated slide validation, per-slide metadata export and grouped summary generation.
    • Added job submission tooling for running exploration workflows for PANDA and prostate cancer datasets.
  • Chores

    • Updated base data path and split MLflow artifact targets into separate sensitive/public project URIs.
    • Added global dataset and exploration configurations and updated prostate cancer exploration wiring.

@xrusnack xrusnack requested review from matejpekar and vejtek March 24, 2026 10:51
Comment thread exploration/panda/save_metadataset.py Outdated
Comment thread exploration/panda/save_metadataset.py
@xrusnack xrusnack marked this pull request as ready for review March 28, 2026 22:49
@xrusnack xrusnack requested a review from matejpekar March 28, 2026 22:49

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
exploration/panda/save_metadataset.py (3)

43-44: ⚠️ Potential issue | 🟠 Major

Per-image percentile makes the blank-slide filter effectively non-discriminative.

Line 43 computes “below this image’s own 95th percentile,” which is typically ~95% regardless of content, so Line 44 won’t reliably reject blank/noisy slides.

🛠️ Suggested fix
-            tissue_ratio = np.mean(thumb_array < np.percentile(thumb_array, 95))
+            # Fixed background cutoff in grayscale (near-white background ~255)
+            tissue_ratio = float(np.mean(thumb_array < 220))
             if tissue_ratio < tissue_threshold:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 43-44, tissue_ratio is
computed against each image's own 95th percentile which makes the filter
ineffective; change the logic in the tissue_ratio calculation (currently using
thumb_array and np.percentile) to compare pixels against a consistent threshold
instead — either compute a global 95th-percentile across all thumbnails once
(e.g., global_95 = np.percentile(all_thumbs, 95) and use np.mean(thumb_array <
global_95)) or use an absolute intensity cutoff or Otsu thresholding, and then
compare that to tissue_threshold; update the code paths that reference
tissue_ratio, thumb_array and tissue_threshold to use this global/absolute
thresholding approach.
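The reviewer's point is easy to check in isolation. The sketch below uses synthetic arrays rather than the real thumbnails, and the 220 cutoff is the assumption suggested in the fix (grayscale with near-white background): the per-image percentile ratio lands near 0.95 for any noisy image, so it can never fall below a 0.001 threshold, while a fixed cutoff correctly scores a blank slide as 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# A noisy grayscale "slide": the fraction of pixels below the image's
# own 95th percentile is ~0.95 regardless of content.
noisy = rng.integers(0, 256, size=(128, 128))
per_image_ratio = float(np.mean(noisy < np.percentile(noisy, 95)))

# A blank, near-white slide (~250) under the fixed-cutoff scheme.
blank = np.full((128, 128), 250)
blank_fixed = float(np.mean(blank < 220))  # 0.0 -> correctly filtered out
noisy_fixed = float(np.mean(noisy < 220))  # large fraction -> kept

print(round(per_image_ratio, 2), blank_fixed, round(noisy_fixed, 2))
```

With the per-image percentile, even the blank-vs-noisy distinction collapses to "everything passes", which is why the fix moves to an absolute intensity cutoff.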

138-159: ⚠️ Potential issue | 🟠 Major

Ensure Ray shutdown runs even when dataframe/artifact steps fail.

If any exception occurs after ray.init(...), Line 159 is skipped and Ray resources may leak in the job container.

🛠️ Suggested fix
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
     ray.init(num_cpus=config.max_concurrent)
-
-    with TemporaryDirectory() as output_dir:
-        log_file = Path(output_dir) / "errors.log"
-
-        df, summary_df = get_dataframes(
-            metadata_csv_path=Path(config.metadata_csv),
-            slides_dir=Path(config.slides_dir),
-            annots_dir=Path(config.label_masks_dir),
-            properties_pq_path=Path(config.slides_properties_parquet),
-            tissue_threshold=config.tissue_threshold,
-            log_file=log_file,
-        )
-
-        df.to_csv(Path(output_dir) / "slides_metadata.csv", index=False)
-        summary_df.to_csv(Path(output_dir) / "summary.csv", index=False)
-
-        logger.log_artifacts(local_dir=output_dir, artifact_path="panda")
-        slide_dataset = mlflow.data.pandas_dataset.from_pandas(df, name="panda")
-        mlflow.log_input(slide_dataset, context="slides_metadata")
-
-    ray.shutdown()
+    try:
+        with TemporaryDirectory() as output_dir:
+            log_file = Path(output_dir) / "errors.log"
+
+            df, summary_df = get_dataframes(
+                metadata_csv_path=Path(config.metadata_csv),
+                slides_dir=Path(config.slides_dir),
+                annots_dir=Path(config.label_masks_dir),
+                properties_pq_path=Path(config.slides_properties_parquet),
+                tissue_threshold=config.tissue_threshold,
+                log_file=log_file,
+            )
+
+            df.to_csv(Path(output_dir) / "slides_metadata.csv", index=False)
+            summary_df.to_csv(Path(output_dir) / "summary.csv", index=False)
+
+            logger.log_artifacts(local_dir=output_dir, artifact_path="panda")
+            slide_dataset = mlflow.data.pandas_dataset.from_pandas(df, name="panda")
+            mlflow.log_input(slide_dataset, context="slides_metadata")
+    finally:
+        ray.shutdown()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 138-159, after calling
ray.init(...), wrap the work that uses Ray (the with TemporaryDirectory() block
that calls get_dataframes, logger.log_artifacts, mlflow.* etc.) in a try/finally
so that ray.shutdown() is always executed; specifically, call
ray.init(num_cpus=config.max_concurrent) then try: (TemporaryDirectory() as
output_dir: ... df/summary_df work, logger.log_artifacts,
mlflow.data.pandas_dataset.from_pandas, mlflow.log_input) finally:
ray.shutdown(), and re-raise any exception instead of swallowing it so failures
still propagate.
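The guarantee the suggested fix relies on can be shown with a minimal stand-in, no Ray required (the "init"/"shutdown" entries below are hypothetical placeholders for `ray.init`/`ray.shutdown`): the `finally` block runs on both the success and failure paths, and the exception still propagates to the caller.

```python
cleanup_log = []

def run_job(fail: bool) -> None:
    cleanup_log.append("init")          # stands in for ray.init(...)
    try:
        if fail:
            raise RuntimeError("artifact step failed")
    finally:
        cleanup_log.append("shutdown")  # stands in for ray.shutdown()

run_job(fail=False)
try:
    run_job(fail=True)
except RuntimeError:
    pass  # the failure still propagated; cleanup had already run

print(cleanup_log)  # ['init', 'shutdown', 'init', 'shutdown']
```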

54-55: ⚠️ Potential issue | 🟠 Major

Avoid full TIFF materialization in Ray workers for mask validation.

Line 54 reads the whole mask into memory just to check emptiness/corruption. With max_concurrent workers, this can cause avoidable RAM/I/O spikes.

🛠️ Suggested fix
-            mask = tifffile.imread(str(mask_path))
-            if mask is None or mask.size == 0:
+            with tifffile.TiffFile(str(mask_path)) as tif:
+                if not tif.pages:
+                    log(f"MASK_EMPTY: {slide_id}")
+                    return False
+                # Decode one page to still catch codec/decompression issues.
+                first_page = tif.pages[0].asarray()
+                if first_page is None or first_page.size == 0:
+                    log(f"MASK_EMPTY: {slide_id}")
+                    return False
-                log(f"MASK_EMPTY: {slide_id}")
-                return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 54-55, the code
currently calls tifffile.imread(str(mask_path)) into variable mask to validate
emptiness, which materializes the whole TIFF and can spike memory in Ray
workers; replace that read with a lightweight check using tifffile.memmap or
tifffile.TiffFile to avoid full materialization: open the file with
tifffile.memmap(str(mask_path)) or with tifffile.TiffFile(str(mask_path)) and
inspect pages/shape/bytecount (or check os.path.getsize(mask_path) as a cheap
pre-check) and only then, if needed, load/convert to an array; update the code
references to mask_path and mask accordingly so emptiness/corruption is
determined without reading the entire image into RAM.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `exploration/panda/save_metadataset.py`:
- Around line 76-90: The results are being misaligned because ray.wait returns
completed object refs out-of-order, but the code zips slide_ids with accumulated
results; change the task submission to keep the mapping between each future and
its slide_id (e.g., build futures as object refs and a dict/map from each future
ref to its slide_id when creating them in validate_sample), then when ray.wait
returns a batch (done) call ray.get(done) and iterate over the parallel pairs of
done refs and their returned values to assign validity to the correct slide_id
(update the valid set using the mapping from future ref -> slide_id rather than
zipping slide_ids with results).
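The future-to-id mapping pattern the prompt describes can be sketched with the standard library's `concurrent.futures`, which, like `ray.wait`, yields futures in completion order rather than submission order (`validate` here is a hypothetical stand-in for the remote validation task):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def validate(slide_id: str) -> bool:
    # Hypothetical stand-in: pretend even-numbered slides are valid.
    return int(slide_id.split("_")[1]) % 2 == 0

slide_ids = [f"slide_{i}" for i in range(6)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Record the future -> slide_id mapping at submission time...
    future_to_slide = {pool.submit(validate, sid): sid for sid in slide_ids}
    valid = set()
    # ...so the arbitrary completion order cannot misalign results.
    for future in as_completed(future_to_slide):
        if future.result():
            valid.add(future_to_slide[future])

print(sorted(valid))  # ['slide_0', 'slide_2', 'slide_4']
```

The same dict-from-future trick applies to Ray object refs: build the mapping when submitting, then look each completed ref up in it instead of zipping positionally.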

---

Duplicate comments:
In `exploration/panda/save_metadataset.py`:
- Around line 43-44: tissue_ratio is computed against each image's own 95th
percentile which makes the filter ineffective; change the logic in the
tissue_ratio calculation (currently using thumb_array and np.percentile) to
compare pixels against a consistent threshold instead — either compute a global
95th-percentile across all thumbnails once (e.g., global_95 =
np.percentile(all_thumbs, 95) and use np.mean(thumb_array < global_95)) or use
an absolute intensity cutoff or Otsu thresholding, and then compare that to
tissue_threshold; update the code paths that reference tissue_ratio, thumb_array
and tissue_threshold to use this global/absolute thresholding approach.
- Around line 138-159: After calling ray.init(...), wrap the work that uses Ray
(the with TemporaryDirectory() block that calls get_dataframes,
logger.log_artifacts, mlflow.* etc.) in a try/finally so that ray.shutdown() is
always executed; specifically, call ray.init(num_cpus=config.max_concurrent)
then try: (TemporaryDirectory() as output_dir: ... df/summary_df work,
logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas, mlflow.log_input)
finally: ray.shutdown(), and re-raise any exception instead of swallowing it so
failures still propagate.
- Around line 54-55: The code currently calls tifffile.imread(str(mask_path))
into variable mask to validate emptiness, which materializes the whole TIFF and
can spike memory in Ray workers; replace that read with a lightweight check
using tifffile.memmap or tifffile.TiffFile to avoid full materialization: open
the file with tifffile.memmap(str(mask_path)) or with
tifffile.TiffFile(str(mask_path)) and inspect pages/shape/bytecount (or check
os.path.getsize(mask_path) as a cheap pre-check) and only then, if needed,
load/convert to an array; update the code references to mask_path and mask
accordingly so emptiness/corruption is determined without reading the entire
image into RAM.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76dd9c9b-6ca0-4753-8cb3-81f9b638dc59

📥 Commits

Reviewing files that changed from the base of the PR and between b713d8d and 74fa8bc.

📒 Files selected for processing (5)
  • configs/base.yaml
  • configs/data/sources/panda.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/exploration/panda/save_metadataset.yaml
  • exploration/panda/save_metadataset.py
✅ Files skipped from review due to trivial changes (1)
  • configs/exploration/panda/save_metadataset.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • configs/base.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/data/sources/panda.yaml

Comment thread exploration/panda/save_metadataset.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
exploration/panda/save_metadataset.py (1)

43-46: ⚠️ Potential issue | 🟠 Major

Tissue-ratio calculation is effectively a no-op.

As discussed in prior comments: np.mean(thumb_array < np.percentile(thumb_array, 95)) computes the fraction of pixels below the image's own 95th percentile, which is always ≈0.95 for any non-constant image. This means the < tissue_threshold check (default 0.001) will never trigger, even for completely blank/white slides.

For filtering blank slides, use a fixed intensity threshold instead:

🛠️ Proposed fix
-            tissue_ratio = np.mean(thumb_array < np.percentile(thumb_array, 95))
-            if tissue_ratio < tissue_threshold:
+            # Pixels darker than 220 (out of 255) are considered tissue
+            # Background/blank slides are near-white (~255)
+            tissue_ratio = np.mean(thumb_array < 220)
+            if tissue_ratio < tissue_threshold:  # e.g., < 0.01 means < 1% tissue

Note: With a fixed threshold of 220, you may want to adjust tissue_threshold in config from 0.001 to something like 0.01 (1% tissue minimum).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 43-46, the tissue-ratio
check in the block that computes tissue_ratio using np.percentile is a no-op;
replace the percentile-based test with a fixed intensity threshold so
tissue_ratio = np.mean(thumb_array < FIXED_INTENSITY) (use a constant like 220
or a configurable value) and then compare that fraction against
tissue_threshold; update the code around the variables thumb_array,
tissue_ratio, tissue_threshold, and the log/return for slide_id accordingly so
blank/white slides are properly filtered (also consider changing the default
tissue_threshold in config to ~0.01 if needed).
🧹 Nitpick comments (2)
exploration/panda/save_metadataset.py (2)

54-55: mask is None check is misleading.

tifffile.imread() raises an exception on read errors rather than returning None. The mask is None branch is dead code.

♻️ Suggested simplification
             mask = tifffile.imread(str(mask_path))
-            if mask is None or mask.size == 0:
+            if mask.size == 0:
                 log(f"MASK_EMPTY: {slide_id}")
                 return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 54-55, the check "mask
is None" is dead because tifffile.imread(mask_path) raises on errors; update the
code around tifffile.imread(str(mask_path)) to remove the redundant None check
and instead wrap the read in a try/except to catch IO/read errors (e.g.,
OSError, tifffile.TiffFileError) and handle them by logging/continuing/raising
as appropriate; keep the mask.size == 0 check to detect empty arrays and handle
that case (using the mask variable) after a successful read.

138-159: Ray lifecycle not wrapped in try/finally — orphaned processes on failure.

If an exception occurs between ray.init() (line 138) and ray.shutdown() (line 159), Ray worker processes won't be cleaned up. In a Kubernetes pod, this may leave orphaned processes.

I note matejpekar's prior comment "Do not use try" — if the intent is to let the job fail loudly, consider using atexit to ensure cleanup without catching exceptions:

♻️ Alternative: use atexit for cleanup without try/except
+import atexit
+
 @with_cli_args(["+exploration=panda/save_metadataset"])
 @hydra.main(config_path="../../configs", config_name="exploration", version_base=None)
 @autolog
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
     ray.init(num_cpus=config.max_concurrent)
+    atexit.register(ray.shutdown)
 
     with TemporaryDirectory() as output_dir:
         # ... rest of code ...
-
-    ray.shutdown()

This ensures cleanup on both normal exit and unhandled exceptions without suppressing errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 138-159, after calling
ray.init(num_cpus=config.max_concurrent), register ray.shutdown with atexit
(e.g., atexit.register(ray.shutdown)) so Ray is cleaned up on normal exit and
unhandled exceptions without adding try/except; place the atexit registration
immediately after ray.init and you can keep or remove the explicit
ray.shutdown() at the end (double-registration is safe but if you keep it ensure
it won’t raise on a second call). This change touches the ray.init(...) /
ray.shutdown() usage in the block that builds dataframes and logs artifacts
(TemporaryDirectory context, get_dataframes, logger.log_artifacts,
mlflow.data.pandas_dataset.from_pandas).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `exploration/panda/save_metadataset.py`:
- Around line 83-90: The validation results are being paired with slide_ids in
submission order but ray.wait returns futures in completion order; fix by
mapping futures to slide_ids before the wait loop (e.g., future_to_slide = {f:
sid for f, sid in zip(futures, slide_ids)}), then inside the loop when you get
done and call ray.get(done), iterate over zip(done, ray.get(done)) and use the
mapping to add the correct slide_id to the valid set (instead of relying on
results list order); also remove/replace uses of the accumulated results list
with direct mapping so pairing stays correct.

---

Duplicate comments:
In `exploration/panda/save_metadataset.py`:
- Around line 43-46: The tissue-ratio check in the block that computes
tissue_ratio using np.percentile is a no-op; replace the percentile-based test
with a fixed intensity threshold so tissue_ratio = np.mean(thumb_array <
FIXED_INTENSITY) (use a constant like 220 or a configurable value) and then
compare that fraction against tissue_threshold; update the code around the
variables thumb_array, tissue_ratio, tissue_threshold, and the log/return for
slide_id accordingly so blank/white slides are properly filtered (also consider
changing the default tissue_threshold in config to ~0.01 if needed).

---

Nitpick comments:
In `exploration/panda/save_metadataset.py`:
- Around line 54-55: The check "mask is None" is dead because
tifffile.imread(mask_path) raises on errors; update the code around
tifffile.imread(str(mask_path)) to remove the redundant None check and instead
wrap the read in a try/except to catch IO/read errors (e.g., OSError,
tifffile.TiffFileError) and handle them by logging/continuing/raising as
appropriate; keep the mask.size == 0 check to detect empty arrays and handle
that case (using the mask variable) after a successful read.
- Around line 138-159: After calling ray.init(num_cpus=config.max_concurrent),
register ray.shutdown with atexit (e.g., atexit.register(ray.shutdown)) so Ray
is cleaned up on normal exit and unhandled exceptions without adding try/except;
place the atexit registration immediately after ray.init and you can keep or
remove the explicit ray.shutdown() at the end (double-registration is safe but
if you keep it ensure it won’t raise on a second call). This change touches the
ray.init(...) / ray.shutdown() usage in the block that builds dataframes and
logs artifacts (TemporaryDirectory context, get_dataframes,
logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f059cd7-7d2e-49ba-b8d0-45aea2689c9b

📥 Commits

Reviewing files that changed from the base of the PR and between b713d8d and 0b7ac5c.

📒 Files selected for processing (5)
  • configs/base.yaml
  • configs/data/sources/panda.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/exploration/panda/save_metadataset.yaml
  • exploration/panda/save_metadataset.py
✅ Files skipped from review due to trivial changes (2)
  • configs/exploration/panda/save_metadataset.yaml
  • configs/data/sources/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • configs/data/sources/prostate_cancer.yaml
  • configs/base.yaml

Comment thread exploration/panda/save_metadataset.py Outdated
@xrusnack xrusnack merged commit 9aefc0b into master Mar 30, 2026
3 of 4 checks passed
@xrusnack xrusnack deleted the panda/data-exploration branch March 30, 2026 16:22
