
feat(panda): data exploration #8

Merged
xrusnack merged 47 commits into master from panda/data-exploration
Mar 30, 2026

Conversation

@xrusnack
Member

@xrusnack xrusnack commented Mar 24, 2026

Summary by CodeRabbit

  • New Features

    • Added PANDA dataset support with automated slide validation, per-slide metadata export and grouped summary generation.
    • Added job submission tooling for running exploration workflows for PANDA and prostate cancer datasets.
  • Chores

    • Updated base data path and split MLflow artifact targets into separate sensitive/public project URIs.
    • Added global dataset and exploration configurations and updated prostate cancer exploration wiring.

@xrusnack xrusnack requested review from matejpekar and vejtek March 24, 2026 10:51
Comment thread exploration/panda/save_metadataset.py Outdated
Comment thread exploration/panda/save_metadataset.py
@xrusnack xrusnack marked this pull request as ready for review March 28, 2026 22:49
@xrusnack xrusnack requested a review from matejpekar March 28, 2026 22:49

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
exploration/panda/save_metadataset.py (3)

43-44: ⚠️ Potential issue | 🟠 Major

Per-image percentile makes the blank-slide filter effectively non-discriminative.

Line 43 computes “below this image’s own 95th percentile,” which is typically ~95% regardless of content, so Line 44 won’t reliably reject blank/noisy slides.

🛠️ Suggested fix
-            tissue_ratio = np.mean(thumb_array < np.percentile(thumb_array, 95))
+            # Fixed background cutoff in grayscale (near-white background ~255)
+            tissue_ratio = float(np.mean(thumb_array < 220))
             if tissue_ratio < tissue_threshold:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 43-44, tissue_ratio is
computed against each image's own 95th percentile which makes the filter
ineffective; change the logic in the tissue_ratio calculation (currently using
thumb_array and np.percentile) to compare pixels against a consistent threshold
instead — either compute a global 95th-percentile across all thumbnails once
(e.g., global_95 = np.percentile(all_thumbs, 95) and use np.mean(thumb_array <
global_95)) or use an absolute intensity cutoff or Otsu thresholding, and then
compare that to tissue_threshold; update the code paths that reference
tissue_ratio, thumb_array and tissue_threshold to use this global/absolute
thresholding approach.
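The reviewer's point is easy to check in isolation. The sketch below uses synthetic arrays rather than the real thumbnails, and the 220 cutoff is the assumption suggested in the fix (grayscale with near-white background): the per-image percentile ratio lands near 0.95 for any noisy image, so it can never fall below a 0.001 threshold, while a fixed cutoff correctly scores a blank slide as 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# A noisy grayscale "slide": the fraction of pixels below the image's
# own 95th percentile is ~0.95 regardless of content.
noisy = rng.integers(0, 256, size=(128, 128))
per_image_ratio = float(np.mean(noisy < np.percentile(noisy, 95)))

# A blank, near-white slide (~250) under the fixed-cutoff scheme.
blank = np.full((128, 128), 250)
blank_fixed = float(np.mean(blank < 220))  # 0.0 -> correctly filtered out
noisy_fixed = float(np.mean(noisy < 220))  # large fraction -> kept

print(round(per_image_ratio, 2), blank_fixed, round(noisy_fixed, 2))
```

With the per-image percentile, even the blank-vs-noisy distinction collapses to "everything passes", which is why the fix moves to an absolute intensity cutoff.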

138-159: ⚠️ Potential issue | 🟠 Major

Ensure Ray shutdown runs even when dataframe/artifact steps fail.

If any exception occurs after ray.init(...), Line 159 is skipped and Ray resources may leak in the job container.

🛠️ Suggested fix
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
     ray.init(num_cpus=config.max_concurrent)
-
-    with TemporaryDirectory() as output_dir:
-        log_file = Path(output_dir) / "errors.log"
-
-        df, summary_df = get_dataframes(
-            metadata_csv_path=Path(config.metadata_csv),
-            slides_dir=Path(config.slides_dir),
-            annots_dir=Path(config.label_masks_dir),
-            properties_pq_path=Path(config.slides_properties_parquet),
-            tissue_threshold=config.tissue_threshold,
-            log_file=log_file,
-        )
-
-        df.to_csv(Path(output_dir) / "slides_metadata.csv", index=False)
-        summary_df.to_csv(Path(output_dir) / "summary.csv", index=False)
-
-        logger.log_artifacts(local_dir=output_dir, artifact_path="panda")
-        slide_dataset = mlflow.data.pandas_dataset.from_pandas(df, name="panda")
-        mlflow.log_input(slide_dataset, context="slides_metadata")
-
-    ray.shutdown()
+    try:
+        with TemporaryDirectory() as output_dir:
+            log_file = Path(output_dir) / "errors.log"
+
+            df, summary_df = get_dataframes(
+                metadata_csv_path=Path(config.metadata_csv),
+                slides_dir=Path(config.slides_dir),
+                annots_dir=Path(config.label_masks_dir),
+                properties_pq_path=Path(config.slides_properties_parquet),
+                tissue_threshold=config.tissue_threshold,
+                log_file=log_file,
+            )
+
+            df.to_csv(Path(output_dir) / "slides_metadata.csv", index=False)
+            summary_df.to_csv(Path(output_dir) / "summary.csv", index=False)
+
+            logger.log_artifacts(local_dir=output_dir, artifact_path="panda")
+            slide_dataset = mlflow.data.pandas_dataset.from_pandas(df, name="panda")
+            mlflow.log_input(slide_dataset, context="slides_metadata")
+    finally:
+        ray.shutdown()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 138-159, after calling
ray.init(...), wrap the work that uses Ray (the with TemporaryDirectory() block
that calls get_dataframes, logger.log_artifacts, mlflow.* etc.) in a try/finally
so that ray.shutdown() is always executed; specifically, call
ray.init(num_cpus=config.max_concurrent) then try: (TemporaryDirectory() as
output_dir: ... df/summary_df work, logger.log_artifacts,
mlflow.data.pandas_dataset.from_pandas, mlflow.log_input) finally:
ray.shutdown(), and re-raise any exception instead of swallowing it so failures
still propagate.
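The guarantee the suggested fix relies on can be shown with a minimal stand-in, no Ray required (the "init"/"shutdown" entries below are hypothetical placeholders for `ray.init`/`ray.shutdown`): the `finally` block runs on both the success and failure paths, and the exception still propagates to the caller.

```python
cleanup_log = []

def run_job(fail: bool) -> None:
    cleanup_log.append("init")          # stands in for ray.init(...)
    try:
        if fail:
            raise RuntimeError("artifact step failed")
    finally:
        cleanup_log.append("shutdown")  # stands in for ray.shutdown()

run_job(fail=False)
try:
    run_job(fail=True)
except RuntimeError:
    pass  # the failure still propagated; cleanup had already run

print(cleanup_log)  # ['init', 'shutdown', 'init', 'shutdown']
```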

54-55: ⚠️ Potential issue | 🟠 Major

Avoid full TIFF materialization in Ray workers for mask validation.

Line 54 reads the whole mask into memory just to check emptiness/corruption. With max_concurrent workers, this can cause avoidable RAM/I/O spikes.

🛠️ Suggested fix
-            mask = tifffile.imread(str(mask_path))
-            if mask is None or mask.size == 0:
+            with tifffile.TiffFile(str(mask_path)) as tif:
+                if not tif.pages:
+                    log(f"MASK_EMPTY: {slide_id}")
+                    return False
+                # Decode one page to still catch codec/decompression issues.
+                first_page = tif.pages[0].asarray()
+                if first_page is None or first_page.size == 0:
+                    log(f"MASK_EMPTY: {slide_id}")
+                    return False
-                log(f"MASK_EMPTY: {slide_id}")
-                return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 54-55, the code
currently calls tifffile.imread(str(mask_path)) into variable mask to validate
emptiness, which materializes the whole TIFF and can spike memory in Ray
workers; replace that read with a lightweight check using tifffile.memmap or
tifffile.TiffFile to avoid full materialization: open the file with
tifffile.memmap(str(mask_path)) or with tifffile.TiffFile(str(mask_path)) and
inspect pages/shape/bytecount (or check os.path.getsize(mask_path) as a cheap
pre-check) and only then, if needed, load/convert to an array; update the code
references to mask_path and mask accordingly so emptiness/corruption is
determined without reading the entire image into RAM.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `exploration/panda/save_metadataset.py`:
- Around line 76-90: The results are being misaligned because ray.wait returns
completed object refs out-of-order, but the code zips slide_ids with accumulated
results; change the task submission to keep the mapping between each future and
its slide_id (e.g., build futures as object refs and a dict/map from each future
ref to its slide_id when creating them in validate_sample), then when ray.wait
returns a batch (done) call ray.get(done) and iterate over the parallel pairs of
done refs and their returned values to assign validity to the correct slide_id
(update the valid set using the mapping from future ref -> slide_id rather than
zipping slide_ids with results).
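The future-to-id mapping pattern the prompt describes can be sketched with the standard library's `concurrent.futures`, which, like `ray.wait`, yields futures in completion order rather than submission order (`validate` here is a hypothetical stand-in for the remote validation task):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def validate(slide_id: str) -> bool:
    # Hypothetical stand-in: pretend even-numbered slides are valid.
    return int(slide_id.split("_")[1]) % 2 == 0

slide_ids = [f"slide_{i}" for i in range(6)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Record the future -> slide_id mapping at submission time...
    future_to_slide = {pool.submit(validate, sid): sid for sid in slide_ids}
    valid = set()
    # ...so the arbitrary completion order cannot misalign results.
    for future in as_completed(future_to_slide):
        if future.result():
            valid.add(future_to_slide[future])

print(sorted(valid))  # ['slide_0', 'slide_2', 'slide_4']
```

The same dict-from-future trick applies to Ray object refs: build the mapping when submitting, then look each completed ref up in it instead of zipping positionally.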

---

Duplicate comments:
In `exploration/panda/save_metadataset.py`:
- Around line 43-44: tissue_ratio is computed against each image's own 95th
percentile which makes the filter ineffective; change the logic in the
tissue_ratio calculation (currently using thumb_array and np.percentile) to
compare pixels against a consistent threshold instead — either compute a global
95th-percentile across all thumbnails once (e.g., global_95 =
np.percentile(all_thumbs, 95) and use np.mean(thumb_array < global_95)) or use
an absolute intensity cutoff or Otsu thresholding, and then compare that to
tissue_threshold; update the code paths that reference tissue_ratio, thumb_array
and tissue_threshold to use this global/absolute thresholding approach.
- Around line 138-159: After calling ray.init(...), wrap the work that uses Ray
(the with TemporaryDirectory() block that calls get_dataframes,
logger.log_artifacts, mlflow.* etc.) in a try/finally so that ray.shutdown() is
always executed; specifically, call ray.init(num_cpus=config.max_concurrent)
then try: (TemporaryDirectory() as output_dir: ... df/summary_df work,
logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas, mlflow.log_input)
finally: ray.shutdown(), and re-raise any exception instead of swallowing it so
failures still propagate.
- Around line 54-55: The code currently calls tifffile.imread(str(mask_path))
into variable mask to validate emptiness, which materializes the whole TIFF and
can spike memory in Ray workers; replace that read with a lightweight check
using tifffile.memmap or tifffile.TiffFile to avoid full materialization: open
the file with tifffile.memmap(str(mask_path)) or with
tifffile.TiffFile(str(mask_path)) and inspect pages/shape/bytecount (or check
os.path.getsize(mask_path) as a cheap pre-check) and only then, if needed,
load/convert to an array; update the code references to mask_path and mask
accordingly so emptiness/corruption is determined without reading the entire
image into RAM.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76dd9c9b-6ca0-4753-8cb3-81f9b638dc59

📥 Commits

Reviewing files that changed from the base of the PR and between b713d8d and 74fa8bc.

📒 Files selected for processing (5)
  • configs/base.yaml
  • configs/data/sources/panda.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/exploration/panda/save_metadataset.yaml
  • exploration/panda/save_metadataset.py
✅ Files skipped from review due to trivial changes (1)
  • configs/exploration/panda/save_metadataset.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • configs/base.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/data/sources/panda.yaml

Comment thread exploration/panda/save_metadataset.py Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
exploration/panda/save_metadataset.py (1)

43-46: ⚠️ Potential issue | 🟠 Major

Tissue-ratio calculation is effectively a no-op.

As discussed in prior comments: np.mean(thumb_array < np.percentile(thumb_array, 95)) computes the fraction of pixels below the image's own 95th percentile, which is always ≈0.95 for any non-constant image. This means the < tissue_threshold check (default 0.001) will never trigger, even for completely blank/white slides.

For filtering blank slides, use a fixed intensity threshold instead:

🛠️ Proposed fix
-            tissue_ratio = np.mean(thumb_array < np.percentile(thumb_array, 95))
-            if tissue_ratio < tissue_threshold:
+            # Pixels darker than 220 (out of 255) are considered tissue
+            # Background/blank slides are near-white (~255)
+            tissue_ratio = np.mean(thumb_array < 220)
+            if tissue_ratio < tissue_threshold:  # e.g., < 0.01 means < 1% tissue

Note: With a fixed threshold of 220, you may want to adjust tissue_threshold in config from 0.001 to something like 0.01 (1% tissue minimum).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 43-46, the tissue-ratio
check in the block that computes tissue_ratio using np.percentile is a no-op;
replace the percentile-based test with a fixed intensity threshold so
tissue_ratio = np.mean(thumb_array < FIXED_INTENSITY) (use a constant like 220
or a configurable value) and then compare that fraction against
tissue_threshold; update the code around the variables thumb_array,
tissue_ratio, tissue_threshold, and the log/return for slide_id accordingly so
blank/white slides are properly filtered (also consider changing the default
tissue_threshold in config to ~0.01 if needed).
🧹 Nitpick comments (2)
exploration/panda/save_metadataset.py (2)

54-55: mask is None check is misleading.

tifffile.imread() raises an exception on read errors rather than returning None. The mask is None branch is dead code.

♻️ Suggested simplification
             mask = tifffile.imread(str(mask_path))
-            if mask is None or mask.size == 0:
+            if mask.size == 0:
                 log(f"MASK_EMPTY: {slide_id}")
                 return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 54-55, the check "mask
is None" is dead because tifffile.imread(mask_path) raises on errors; update the
code around tifffile.imread(str(mask_path)) to remove the redundant None check
and instead wrap the read in a try/except to catch IO/read errors (e.g.,
OSError, tifffile.TiffFileError) and handle them by logging/continuing/raising
as appropriate; keep the mask.size == 0 check to detect empty arrays and handle
that case (using the mask variable) after a successful read.

138-159: Ray lifecycle not wrapped in try/finally — orphaned processes on failure.

If an exception occurs between ray.init() (line 138) and ray.shutdown() (line 159), Ray worker processes won't be cleaned up. In a Kubernetes pod, this may leave orphaned processes.

I note matejpekar's prior comment "Do not use try" — if the intent is to let the job fail loudly, consider using atexit to ensure cleanup without catching exceptions:

♻️ Alternative: use atexit for cleanup without try/except
+import atexit
+
 @with_cli_args(["+exploration=panda/save_metadataset"])
 @hydra.main(config_path="../../configs", config_name="exploration", version_base=None)
 @autolog
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
     ray.init(num_cpus=config.max_concurrent)
+    atexit.register(ray.shutdown)
 
     with TemporaryDirectory() as output_dir:
         # ... rest of code ...
-
-    ray.shutdown()

This ensures cleanup on both normal exit and unhandled exceptions without suppressing errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `exploration/panda/save_metadataset.py` around lines 138-159, after calling
ray.init(num_cpus=config.max_concurrent), register ray.shutdown with atexit
(e.g., atexit.register(ray.shutdown)) so Ray is cleaned up on normal exit and
unhandled exceptions without adding try/except; place the atexit registration
immediately after ray.init and you can keep or remove the explicit
ray.shutdown() at the end (double-registration is safe but if you keep it ensure
it won’t raise on a second call). This change touches the ray.init(...) /
ray.shutdown() usage in the block that builds dataframes and logs artifacts
(TemporaryDirectory context, get_dataframes, logger.log_artifacts,
mlflow.data.pandas_dataset.from_pandas).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `exploration/panda/save_metadataset.py`:
- Around line 83-90: The validation results are being paired with slide_ids in
submission order but ray.wait returns futures in completion order; fix by
mapping futures to slide_ids before the wait loop (e.g., future_to_slide = {f:
sid for f, sid in zip(futures, slide_ids)}), then inside the loop when you get
done and call ray.get(done), iterate over zip(done, ray.get(done)) and use the
mapping to add the correct slide_id to the valid set (instead of relying on
results list order); also remove/replace uses of the accumulated results list
with direct mapping so pairing stays correct.

---

Duplicate comments:
In `exploration/panda/save_metadataset.py`:
- Around line 43-46: The tissue-ratio check in the block that computes
tissue_ratio using np.percentile is a no-op; replace the percentile-based test
with a fixed intensity threshold so tissue_ratio = np.mean(thumb_array <
FIXED_INTENSITY) (use a constant like 220 or a configurable value) and then
compare that fraction against tissue_threshold; update the code around the
variables thumb_array, tissue_ratio, tissue_threshold, and the log/return for
slide_id accordingly so blank/white slides are properly filtered (also consider
changing the default tissue_threshold in config to ~0.01 if needed).

---

Nitpick comments:
In `exploration/panda/save_metadataset.py`:
- Around line 54-55: The check "mask is None" is dead because
tifffile.imread(mask_path) raises on errors; update the code around
tifffile.imread(str(mask_path)) to remove the redundant None check and instead
wrap the read in a try/except to catch IO/read errors (e.g., OSError,
tifffile.TiffFileError) and handle them by logging/continuing/raising as
appropriate; keep the mask.size == 0 check to detect empty arrays and handle
that case (using the mask variable) after a successful read.
- Around line 138-159: After calling ray.init(num_cpus=config.max_concurrent),
register ray.shutdown with atexit (e.g., atexit.register(ray.shutdown)) so Ray
is cleaned up on normal exit and unhandled exceptions without adding try/except;
place the atexit registration immediately after ray.init and you can keep or
remove the explicit ray.shutdown() at the end (double-registration is safe but
if you keep it ensure it won’t raise on a second call). This change touches the
ray.init(...) / ray.shutdown() usage in the block that builds dataframes and
logs artifacts (TemporaryDirectory context, get_dataframes,
logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f059cd7-7d2e-49ba-b8d0-45aea2689c9b

📥 Commits

Reviewing files that changed from the base of the PR and between b713d8d and 0b7ac5c.

📒 Files selected for processing (5)
  • configs/base.yaml
  • configs/data/sources/panda.yaml
  • configs/data/sources/prostate_cancer.yaml
  • configs/exploration/panda/save_metadataset.yaml
  • exploration/panda/save_metadataset.py
✅ Files skipped from review due to trivial changes (2)
  • configs/exploration/panda/save_metadataset.yaml
  • configs/data/sources/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • configs/data/sources/prostate_cancer.yaml
  • configs/base.yaml

Comment thread exploration/panda/save_metadataset.py Outdated
@xrusnack xrusnack merged commit 9aefc0b into master Mar 30, 2026
3 of 4 checks passed
@xrusnack xrusnack deleted the panda/data-exploration branch March 30, 2026 16:22
