Conversation
Removed mask tile dimensions from PANDA dataset config.
Actionable comments posted: 1
♻️ Duplicate comments (3)
exploration/panda/save_metadataset.py (3)
43-44: ⚠️ Potential issue | 🟠 Major

Per-image percentile makes the blank-slide filter effectively non-discriminative.
Line 43 computes “below this image’s own 95th percentile,” which is typically ~95% regardless of content, so Line 44 won’t reliably reject blank/noisy slides.
🛠️ Suggested fix
```diff
-    tissue_ratio = np.mean(thumb_array < np.percentile(thumb_array, 95))
+    # Fixed background cutoff in grayscale (near-white background ~255)
+    tissue_ratio = float(np.mean(thumb_array < 220))
     if tissue_ratio < tissue_threshold:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@exploration/panda/save_metadataset.py` around lines 43 - 44, tissue_ratio is computed against each image's own 95th percentile which makes the filter ineffective; change the logic in the tissue_ratio calculation (currently using thumb_array and np.percentile) to compare pixels against a consistent threshold instead — either compute a global 95th-percentile across all thumbnails once (e.g., global_95 = np.percentile(all_thumbs, 95) and use np.mean(thumb_array < global_95)) or use an absolute intensity cutoff or Otsu thresholding, and then compare that to tissue_threshold; update the code paths that reference tissue_ratio, thumb_array and tissue_threshold to use this global/absolute thresholding approach.
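The no-op can be checked directly. A minimal pure-Python sketch (the thumbnail values are simulated; the real code uses NumPy arrays, and the order-statistic below approximates what `np.percentile` computes):

```python
import random

random.seed(0)
# Simulated grayscale thumbnail of a *blank* slide: near-white pixels only.
blank_thumb = [random.uniform(240, 255) for _ in range(10_000)]

# Per-image 95th percentile (order-statistic approximation).
p95 = sorted(blank_thumb)[int(0.95 * (len(blank_thumb) - 1))]

# Fraction below the image's own percentile: ~0.95 regardless of content.
percentile_ratio = sum(v < p95 for v in blank_thumb) / len(blank_thumb)

# Fraction below a fixed near-white cutoff: 0.0 for a blank slide.
fixed_ratio = sum(v < 220 for v in blank_thumb) / len(blank_thumb)
```

With `tissue_threshold = 0.001`, the percentile-based ratio (≈0.95) never drops below the threshold, while the fixed cutoff correctly yields 0.0 for the blank slide.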
138-159: ⚠️ Potential issue | 🟠 Major

Ensure Ray shutdown runs even when dataframe/artifact steps fail.
If any exception occurs after `ray.init(...)`, Line 159 is skipped and Ray resources may leak in the job container.

🛠️ Suggested fix
```diff
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
     ray.init(num_cpus=config.max_concurrent)
-
-    with TemporaryDirectory() as output_dir:
-        log_file = Path(output_dir) / "errors.log"
-
-        df, summary_df = get_dataframes(
-            metadata_csv_path=Path(config.metadata_csv),
-            slides_dir=Path(config.slides_dir),
-            annots_dir=Path(config.label_masks_dir),
-            properties_pq_path=Path(config.slides_properties_parquet),
-            tissue_threshold=config.tissue_threshold,
-            log_file=log_file,
-        )
-
-        df.to_csv(Path(output_dir) / "slides_metadata.csv", index=False)
-        summary_df.to_csv(Path(output_dir) / "summary.csv", index=False)
-
-        logger.log_artifacts(local_dir=output_dir, artifact_path="panda")
-        slide_dataset = mlflow.data.pandas_dataset.from_pandas(df, name="panda")
-        mlflow.log_input(slide_dataset, context="slides_metadata")
-
-    ray.shutdown()
+    try:
+        with TemporaryDirectory() as output_dir:
+            log_file = Path(output_dir) / "errors.log"
+
+            df, summary_df = get_dataframes(
+                metadata_csv_path=Path(config.metadata_csv),
+                slides_dir=Path(config.slides_dir),
+                annots_dir=Path(config.label_masks_dir),
+                properties_pq_path=Path(config.slides_properties_parquet),
+                tissue_threshold=config.tissue_threshold,
+                log_file=log_file,
+            )
+
+            df.to_csv(Path(output_dir) / "slides_metadata.csv", index=False)
+            summary_df.to_csv(Path(output_dir) / "summary.csv", index=False)
+
+            logger.log_artifacts(local_dir=output_dir, artifact_path="panda")
+            slide_dataset = mlflow.data.pandas_dataset.from_pandas(df, name="panda")
+            mlflow.log_input(slide_dataset, context="slides_metadata")
+    finally:
+        ray.shutdown()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@exploration/panda/save_metadataset.py` around lines 138 - 159, After calling ray.init(...), wrap the work that uses Ray (the with TemporaryDirectory() block that calls get_dataframes, logger.log_artifacts, mlflow.* etc.) in a try/finally so that ray.shutdown() is always executed; specifically, call ray.init(num_cpus=config.max_concurrent) then try: (TemporaryDirectory() as output_dir: ... df/summary_df work, logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas, mlflow.log_input) finally: ray.shutdown(), and re-raise any exception instead of swallowing it so failures still propagate.
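The guarantee the fix relies on can be demonstrated without Ray. A minimal stand-in, where `init_cluster`/`shutdown_cluster` are placeholders for `ray.init`/`ray.shutdown`:

```python
calls = []

def init_cluster():      # stand-in for ray.init
    calls.append("init")

def shutdown_cluster():  # stand-in for ray.shutdown
    calls.append("shutdown")

def main(fail: bool) -> None:
    init_cluster()
    try:
        if fail:
            raise RuntimeError("artifact logging failed")
        calls.append("work")
    finally:
        shutdown_cluster()  # runs on success and on failure alike

try:
    main(fail=True)
except RuntimeError:
    pass  # the error still propagates; cleanup has already run

assert calls == ["init", "shutdown"]
```

The exception is re-raised past the `finally` block, so the job still fails loudly while the cluster handle is released.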
54-55: ⚠️ Potential issue | 🟠 Major

Avoid full TIFF materialization in Ray workers for mask validation.
Line 54 reads the whole mask into memory just to check emptiness/corruption. With `max_concurrent` workers, this can cause avoidable RAM/I/O spikes.

🛠️ Suggested fix
```diff
-    mask = tifffile.imread(str(mask_path))
-    if mask is None or mask.size == 0:
-        log(f"MASK_EMPTY: {slide_id}")
-        return False
+    with tifffile.TiffFile(str(mask_path)) as tif:
+        if not tif.pages:
+            log(f"MASK_EMPTY: {slide_id}")
+            return False
+        # Decode one page to still catch codec/decompression issues.
+        first_page = tif.pages[0].asarray()
+        if first_page is None or first_page.size == 0:
+            log(f"MASK_EMPTY: {slide_id}")
+            return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@exploration/panda/save_metadataset.py` around lines 54 - 55, The code currently calls tifffile.imread(str(mask_path)) into variable mask to validate emptiness, which materializes the whole TIFF and can spike memory in Ray workers; replace that read with a lightweight check using tifffile.memmap or tifffile.TiffFile to avoid full materialization: open the file with tifffile.memmap(str(mask_path)) or with tifffile.TiffFile(str(mask_path)) and inspect pages/shape/bytecount (or check os.path.getsize(mask_path) as a cheap pre-check) and only then, if needed, load/convert to an array; update the code references to mask_path and mask accordingly so emptiness/corruption is determined without reading the entire image into RAM.
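The cheap `os.path.getsize` pre-check mentioned in the prompt can be sketched in isolation. The `mask_size_precheck` helper and the 8-byte minimum are illustrative only; a real check would go on to inspect pages with `tifffile.TiffFile`:

```python
import os
import tempfile

def mask_size_precheck(path: str, min_bytes: int = 8) -> bool:
    # A file smaller than a TIFF header cannot be a valid mask;
    # reject it without decoding anything.
    return os.path.getsize(path) >= min_bytes

with tempfile.NamedTemporaryFile(suffix=".tiff", delete=False) as f:
    empty_path = f.name  # zero-byte file

with tempfile.NamedTemporaryFile(suffix=".tiff", delete=False) as f:
    f.write(b"\x00" * 64)
    data_path = f.name

passes_empty = mask_size_precheck(empty_path)  # False: reject immediately
passes_data = mask_size_precheck(data_path)    # True: proceed to page-level checks
```

Only files that pass this size gate need any TIFF parsing at all, which keeps per-worker memory flat.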
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 76-90: The results are being misaligned because ray.wait returns
completed object refs out-of-order, but the code zips slide_ids with accumulated
results; change the task submission to keep the mapping between each future and
its slide_id (e.g., build futures as object refs and a dict/map from each future
ref to its slide_id when creating them in validate_sample), then when ray.wait
returns a batch (done) call ray.get(done) and iterate over the parallel pairs of
done refs and their returned values to assign validity to the correct slide_id
(update the valid set using the mapping from future ref -> slide_id rather than
zipping slide_ids with results).
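Since Ray isn't assumed available here, the keep-the-mapping pattern can be illustrated with the standard library's `concurrent.futures`; with Ray, the dict keys would instead be the object refs returned by `validate_sample.remote(...)`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def validate_sample(slide_id):
    # Stand-in for the remote validation task.
    return slide_id != "blank_slide"

slide_ids = ["slide_a", "blank_slide", "slide_c"]

valid = set()
with ThreadPoolExecutor(max_workers=3) as pool:
    # Record the future -> slide_id mapping at submission time, so each
    # result is attributed correctly regardless of completion order.
    future_to_slide = {pool.submit(validate_sample, sid): sid for sid in slide_ids}
    for future in as_completed(future_to_slide):
        if future.result():
            valid.add(future_to_slide[future])

assert valid == {"slide_a", "slide_c"}
```

Zipping `slide_ids` with results accumulated in completion order would break exactly when tasks finish out of submission order; the mapping makes ordering irrelevant.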
---
Duplicate comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 43-44: tissue_ratio is computed against each image's own 95th
percentile which makes the filter ineffective; change the logic in the
tissue_ratio calculation (currently using thumb_array and np.percentile) to
compare pixels against a consistent threshold instead — either compute a global
95th-percentile across all thumbnails once (e.g., global_95 =
np.percentile(all_thumbs, 95) and use np.mean(thumb_array < global_95)) or use
an absolute intensity cutoff or Otsu thresholding, and then compare that to
tissue_threshold; update the code paths that reference tissue_ratio, thumb_array
and tissue_threshold to use this global/absolute thresholding approach.
- Around line 138-159: After calling ray.init(...), wrap the work that uses Ray
(the with TemporaryDirectory() block that calls get_dataframes,
logger.log_artifacts, mlflow.* etc.) in a try/finally so that ray.shutdown() is
always executed; specifically, call ray.init(num_cpus=config.max_concurrent)
then try: (TemporaryDirectory() as output_dir: ... df/summary_df work,
logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas, mlflow.log_input)
finally: ray.shutdown(), and re-raise any exception instead of swallowing it so
failures still propagate.
- Around line 54-55: The code currently calls tifffile.imread(str(mask_path))
into variable mask to validate emptiness, which materializes the whole TIFF and
can spike memory in Ray workers; replace that read with a lightweight check
using tifffile.memmap or tifffile.TiffFile to avoid full materialization: open
the file with tifffile.memmap(str(mask_path)) or with
tifffile.TiffFile(str(mask_path)) and inspect pages/shape/bytecount (or check
os.path.getsize(mask_path) as a cheap pre-check) and only then, if needed,
load/convert to an array; update the code references to mask_path and mask
accordingly so emptiness/corruption is determined without reading the entire
image into RAM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 76dd9c9b-6ca0-4753-8cb3-81f9b638dc59
📒 Files selected for processing (5)
- configs/base.yaml
- configs/data/sources/panda.yaml
- configs/data/sources/prostate_cancer.yaml
- configs/exploration/panda/save_metadataset.yaml
- exploration/panda/save_metadataset.py
✅ Files skipped from review due to trivial changes (1)
- configs/exploration/panda/save_metadataset.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- configs/base.yaml
- configs/data/sources/prostate_cancer.yaml
- configs/data/sources/panda.yaml
Actionable comments posted: 1
♻️ Duplicate comments (1)
exploration/panda/save_metadataset.py (1)
43-46: ⚠️ Potential issue | 🟠 Major

Tissue-ratio calculation is effectively a no-op.
As discussed in prior comments: `np.mean(thumb_array < np.percentile(thumb_array, 95))` computes the fraction of pixels below the image's own 95th percentile, which is typically ≈0.95 for any non-constant image. This means the `< tissue_threshold` check (default 0.001) will never trigger, even for completely blank/white slides. For filtering blank slides, use a fixed intensity threshold instead:
🛠️ Proposed fix
```diff
-    tissue_ratio = np.mean(thumb_array < np.percentile(thumb_array, 95))
-    if tissue_ratio < tissue_threshold:
+    # Pixels darker than 220 (out of 255) are considered tissue
+    # Background/blank slides are near-white (~255)
+    tissue_ratio = np.mean(thumb_array < 220)
+    if tissue_ratio < tissue_threshold:  # e.g., < 0.01 means < 1% tissue
```

Note: With a fixed threshold of 220, you may want to adjust `tissue_threshold` in config from 0.001 to something like 0.01 (1% tissue minimum).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@exploration/panda/save_metadataset.py` around lines 43 - 46, The tissue-ratio check in the block that computes tissue_ratio using np.percentile is a no-op; replace the percentile-based test with a fixed intensity threshold so tissue_ratio = np.mean(thumb_array < FIXED_INTENSITY) (use a constant like 220 or a configurable value) and then compare that fraction against tissue_threshold; update the code around the variables thumb_array, tissue_ratio, tissue_threshold, and the log/return for slide_id accordingly so blank/white slides are properly filtered (also consider changing the default tissue_threshold in config to ~0.01 if needed).
🧹 Nitpick comments (2)
exploration/panda/save_metadataset.py (2)
54-55: `mask is None` check is misleading.

`tifffile.imread()` raises an exception on read errors rather than returning `None`. The `mask is None` branch is dead code.

♻️ Suggested simplification
```diff
     mask = tifffile.imread(str(mask_path))
-    if mask is None or mask.size == 0:
+    if mask.size == 0:
         log(f"MASK_EMPTY: {slide_id}")
         return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@exploration/panda/save_metadataset.py` around lines 54 - 55, The check "mask is None" is dead because tifffile.imread(mask_path) raises on errors; update the code around tifffile.imread(str(mask_path)) to remove the redundant None check and instead wrap the read in a try/except to catch IO/read errors (e.g., OSError, tifffile.TiffFileError) and handle them by logging/continuing/raising as appropriate; keep the mask.size == 0 check to detect empty arrays and handle that case (using the mask variable) after a successful read.
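The raise-instead-of-`None` behavior and the resulting try/except shape can be sketched without tifffile installed. `read_mask_bytes` is a hypothetical helper using a plain byte read in place of `tifffile.imread`, which raises in the same way:

```python
from pathlib import Path
import tempfile

def read_mask_bytes(path):
    # Read errors surface as exceptions, never as a None return,
    # so a try/except replaces the dead "is None" branch.
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        return None, f"MASK_READ_ERROR: {exc}"
    if len(data) == 0:
        return None, "MASK_EMPTY"
    return data, None

_, err1 = read_mask_bytes("/nonexistent/mask.tiff")  # read failure -> error string

with tempfile.NamedTemporaryFile(delete=False) as f:
    empty_path = f.name
_, err2 = read_mask_bytes(empty_path)  # zero-byte file -> "MASK_EMPTY"
```

The empty-content check stays, while unreadable files are distinguished from empty ones by the exception path.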
138-159: Ray lifecycle not wrapped in try/finally: orphaned processes on failure.

If an exception occurs between `ray.init()` (line 138) and `ray.shutdown()` (line 159), Ray worker processes won't be cleaned up. In a Kubernetes pod, this may leave orphaned processes.

I note matejpekar's prior comment "Do not use try". If the intent is to let the job fail loudly, consider using `atexit` to ensure cleanup without catching exceptions:

♻️ Alternative: use atexit for cleanup without try/except

```diff
+import atexit
+
 @with_cli_args(["+exploration=panda/save_metadataset"])
 @hydra.main(config_path="../../configs", config_name="exploration", version_base=None)
 @autolog
 def main(config: DictConfig, logger: MLFlowLogger) -> None:
     ray.init(num_cpus=config.max_concurrent)
+    atexit.register(ray.shutdown)
     with TemporaryDirectory() as output_dir:
         # ... rest of code ...
-
-    ray.shutdown()
```

This ensures cleanup on both normal exit and unhandled exceptions without suppressing errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@exploration/panda/save_metadataset.py` around lines 138 - 159, After calling ray.init(num_cpus=config.max_concurrent), register ray.shutdown with atexit (e.g., atexit.register(ray.shutdown)) so Ray is cleaned up on normal exit and unhandled exceptions without adding try/except; place the atexit registration immediately after ray.init and you can keep or remove the explicit ray.shutdown() at the end (double-registration is safe but if you keep it ensure it won’t raise on a second call). This change touches the ray.init(...) / ray.shutdown() usage in the block that builds dataframes and logs artifacts (TemporaryDirectory context, get_dataframes, logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas).
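That `atexit` handlers also fire after an unhandled exception can be verified in a throwaway subprocess (a sketch; the `CLEANUP_RAN` marker stands in for `ray.shutdown`):

```python
import subprocess
import sys

# Child script: registers a cleanup handler, then fails with an
# unhandled exception (like a failing artifact-logging step would).
child = """
import atexit, sys

def shutdown():
    sys.stderr.write("CLEANUP_RAN\\n")

atexit.register(shutdown)
raise RuntimeError("job failed")
"""

result = subprocess.run(
    [sys.executable, "-c", child], capture_output=True, text=True
)
assert result.returncode != 0          # the job still fails loudly
assert "CLEANUP_RAN" in result.stderr  # but the atexit handler ran
```

The traceback still reaches stderr and the exit code is nonzero, so nothing is suppressed; cleanup simply happens at interpreter termination.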
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 83-90: The validation results are being paired with slide_ids in
submission order but ray.wait returns futures in completion order; fix by
mapping futures to slide_ids before the wait loop (e.g., future_to_slide = {f:
sid for f, sid in zip(futures, slide_ids)}), then inside the loop when you get
done and call ray.get(done), iterate over zip(done, ray.get(done)) and use the
mapping to add the correct slide_id to the valid set (instead of relying on
results list order); also remove/replace uses of the accumulated results list
with direct mapping so pairing stays correct.
---
Duplicate comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 43-46: The tissue-ratio check in the block that computes
tissue_ratio using np.percentile is a no-op; replace the percentile-based test
with a fixed intensity threshold so tissue_ratio = np.mean(thumb_array <
FIXED_INTENSITY) (use a constant like 220 or a configurable value) and then
compare that fraction against tissue_threshold; update the code around the
variables thumb_array, tissue_ratio, tissue_threshold, and the log/return for
slide_id accordingly so blank/white slides are properly filtered (also consider
changing the default tissue_threshold in config to ~0.01 if needed).
---
Nitpick comments:
In `@exploration/panda/save_metadataset.py`:
- Around line 54-55: The check "mask is None" is dead because
tifffile.imread(mask_path) raises on errors; update the code around
tifffile.imread(str(mask_path)) to remove the redundant None check and instead
wrap the read in a try/except to catch IO/read errors (e.g., OSError,
tifffile.TiffFileError) and handle them by logging/continuing/raising as
appropriate; keep the mask.size == 0 check to detect empty arrays and handle
that case (using the mask variable) after a successful read.
- Around line 138-159: After calling ray.init(num_cpus=config.max_concurrent),
register ray.shutdown with atexit (e.g., atexit.register(ray.shutdown)) so Ray
is cleaned up on normal exit and unhandled exceptions without adding try/except;
place the atexit registration immediately after ray.init and you can keep or
remove the explicit ray.shutdown() at the end (double-registration is safe but
if you keep it ensure it won’t raise on a second call). This change touches the
ray.init(...) / ray.shutdown() usage in the block that builds dataframes and
logs artifacts (TemporaryDirectory context, get_dataframes,
logger.log_artifacts, mlflow.data.pandas_dataset.from_pandas).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f059cd7-7d2e-49ba-b8d0-45aea2689c9b
📒 Files selected for processing (5)
- configs/base.yaml
- configs/data/sources/panda.yaml
- configs/data/sources/prostate_cancer.yaml
- configs/exploration/panda/save_metadataset.yaml
- exploration/panda/save_metadataset.py
✅ Files skipped from review due to trivial changes (2)
- configs/exploration/panda/save_metadataset.yaml
- configs/data/sources/panda.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- configs/data/sources/prostate_cancer.yaml
- configs/base.yaml