
Models optimalizations #5

Open
Jurgee wants to merge 9 commits into main from feature/prostate-opt

Conversation

Collaborator

@Jurgee Jurgee commented Mar 23, 2026

Summary by CodeRabbit

  • Chores

    • Increased compute/memory for model workers, larger batching and longer batch waits to improve throughput and latency.
    • Adjusted autoscaling, request and queue limits for more stable handling under load.
    • Updated GPU worker configuration (MIG GPU type, explicit GPU count) and upgraded container image.
    • Switched model runtime source to a GitHub archive.
  • New Features

    • Exposed new model configuration options for TensorRT and threading and added sensible defaults for model runtime behavior.

@Jurgee Jurgee requested review from a team, JakubPekar, Copilot and ejdam87 March 23, 2026 20:33

coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Updated deployment and cluster settings in ray-service.yaml for three apps (BinaryClassifier, SemanticSegmentation, HeatmapBuilder): switched working_dir ZIP sources to GitHub, raised request/queue/autoscaling limits, adjusted actor resources, batching, and GPU worker specs. Updated model code to add TensorRT config options, avoid mutating config["model"], adjust ONNX Runtime threading, and set default tile sizes.

Changes

  • Ray service deployments (ray-service.yaml): Updated three Serve deployments (BinaryClassifier, SemanticSegmentation, HeatmapBuilder): changed runtime_env.working_dir ZIP URLs from GitLab master to GitHub main; increased max_ongoing_requests / max_queued_requests and autoscaling_config.target_ongoing_requests per app; adjusted batching and user_config values per app (including a new trt_builder_optimization_level for BinaryClassifier and SemanticSegmentation, and an intra-op thread setting for SemanticSegmentation).
  • BinaryClassifier deployment specifics (ray-service.yaml): Increased actor memory (4Gi → 8Gi), raised max_batch_size and batch_wait_timeout_s, added user_config.trt_builder_optimization_level: 3, scoped trt_cache_path to an app-specific subdir, doubled autoscale/request limits.
  • SemanticSegmentation deployment specifics (ray-service.yaml): Added user_config.intra_op_num_threads: 4 and trt_builder_optimization_level: 3, scoped trt_cache_path to an app-specific subdir, raised autoscale/request limits and queued capacity.
  • HeatmapBuilder deployment specifics (ray-service.yaml): Increased request/queue/autoscaling limits only.
  • GPU worker group / cluster config (ray-service.yaml): In rayClusterConfig.workerGroupSpecs (gpu-workers): added rayStartParams.num-gpus: "1", removed the GPU nodeSelector constraint, bumped the image tag cerit.io/rationai/model-service:2.53.0-gpu → 2.54.0-gpu, and changed the GPU resource key nvidia.com/gpu: 1 → nvidia.com/mig-2g.20gb: 1.
  • Model runtime changes (models/binary_classifier.py, models/semantic_segmentation.py): Extended the TypedDict Config with TensorRT options (trt_builder_optimization_level, and in semantic_segmentation also trt_max_workspace_size); set a default self.tile_size in constructors; changed ONNX Runtime session options to stop setting inter_op_num_threads (retaining intra_op_num_threads from config); avoided mutating config["model"] by copying it to model_config, popping _target_ from the copy, and passing **model_config to providers.
  • BinaryClassifier code specifics (models/binary_classifier.py): Added trt_builder_optimization_level to Config, set default tile_size = 512, removed the typed-dict ignore on TensorRT options, and updated provider instantiation to use a copy of config["model"].
  • SemanticSegmentation code specifics (models/semantic_segmentation.py): Added trt_max_workspace_size and trt_builder_optimization_level to Config, set default tile_size = 1024, removed the setting of inter_op_num_threads, ensured provider instantiation uses a copied model_config, and kept async decompression semantics unchanged.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped from repo zip to GitHub's gate,
Memory doubled, batches grow elate,
GPUs re-tagged and neatly aligned,
Configs copied clean — no mutating mind,
A nibble, a hop, the pipeline's up to date!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check: ❓ Inconclusive. The title 'Models optimalizations' is vague and generic: it uses a misspelled term ('optimalizations' instead of 'optimizations') and fails to describe the specific changes: updating model configurations, increasing request limits, adding TensorRT optimizations, and updating GPU resource specifications. Resolution: revise the title to be more specific and accurate, such as 'Optimize model configurations and increase request handling limits' or 'Update models with TensorRT optimization and increased autoscaling limits'.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.71%, which meets the required threshold of 80.00%.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the deployment configuration for the prostate classifier service within the Ray cluster. The changes aim to improve resource utilization, throughput, and flexibility by updating the model service source, fine-tuning Ray deployment parameters, increasing allocated resources, and adapting to NVIDIA MIG for GPU management. These adjustments are intended to enhance the classifier's performance and scalability.

Highlights

  • Model Service Source Update: Updated the working_dir for the model service runtime environment to point to a GitHub feature branch, ensuring the deployment uses a more recent development version.
  • Deployment Parameter Tuning: Adjusted key deployment parameters for the BinaryClassifier, including reducing max_ongoing_requests and target_ongoing_requests while significantly increasing max_queued_requests and max_batch_size to optimize throughput and queue management.
  • Resource Allocation Increase: Increased resource allocation for Ray actors, doubling num_cpus from 4 to 8 and memory from 4 GiB to 8 GiB, to support more intensive processing demands.
  • NVIDIA MIG Integration: Configured Ray worker groups to explicitly request 1 GPU via rayStartParams and updated the GPU resource limit to utilize NVIDIA MIG (Multi-Instance GPU) partitioning, specifically mig-2g.20gb, for more efficient GPU resource management.
  • Node Selector Removal: Removed the specific nodeSelector for NVIDIA-A40 GPUs, allowing for greater flexibility in node scheduling within the Kubernetes cluster.
  • Container Image Update: Updated the ray-worker container image from a specific version (2.53.0-gpu) to latest-gpu, ensuring the deployment always uses the most recent GPU-enabled build of the model service.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the configuration for the Ray service, primarily to optimize the prostate classifier deployment. The changes include adjusting resource allocations (CPU, memory), tuning request handling and batching parameters, and updating the worker configuration to use MIG-enabled GPUs. While these changes seem aimed at improving performance, I've found a few issues that need to be addressed: a critical bug that will cause the service to fail due to a missing configuration key, the use of a :latest image tag which is bad practice for production, and a working_dir URL pointing to a potentially temporary feature branch. Please see my detailed comments for suggestions on how to fix these issues.

Copilot AI left a comment

Pull request overview

This PR updates the RayService configuration for the prostate-classifier-1 Ray Serve application, aiming to improve throughput/latency characteristics and adjust GPU worker scheduling (including MIG usage) while also switching the runtime working_dir source.

Changes:

  • Tune BinaryClassifier deployment parameters (request limits, autoscaling target, CPU/memory, batching settings).
  • Switch prostate-classifier-1 runtime_env.working_dir to a GitHub archive URL.
  • Update gpu-workers to use a MIG GPU resource and add rayStartParams.num-gpus, and change the GPU worker image tag.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ray-service.yaml`:
- Line 12: The working_dir currently points to a feature branch zip URL (the
value of working_dir) which will break after the branch is deleted; update
working_dir to reference a stable source such as the main branch URL, a release
tag, or a commit SHA (e.g., replace the feature/main URL with the main branch
archive URL or a tagged-release/commit SHA URL) so deployments remain
reproducible and won't break after the PR merge.
- Line 222: The deployment is using an unpinned image tag ("image:
cerit.io/rationai/model-service:latest-gpu") which makes releases
non-reproducible; change this to a specific versioned tag (for example match the
cpu-workers pinned tag "2.53.0" or another explicit semver) so the ray-service
group uses a fixed image; update the image line to
"cerit.io/rationai/model-service:<version>-gpu" and ensure the chosen version is
consistent with the cpu-workers versioning strategy.
- Line 228: The current resource line nvidia.com/mig-2g.20gb: 1 assigns a 20GB
MIG partition but the container memory request/limit is set to 24Gi which
exceeds that partition; update the memory request and limit in the same pod spec
(where memory: 24Gi is declared) to a value ≤20Gi (e.g., 20Gi) or revert the GPU
resource to the full device by changing nvidia.com/mig-2g.20gb: 1 back to
nvidia.com/gpu: 1; additionally confirm cluster nodes use A100/H100 with MIG
mode enabled and the matching mig-2g.20gb partition configured before deploying.
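The mismatch flagged above can be checked with simple arithmetic. A small sketch, under the assumption that `24Gi` is a binary (GiB) Kubernetes quantity and the `20gb` in the MIG profile name is decimal gigabytes; the `to_bytes` helper is hypothetical and only handles these two suffixes:

```python
def to_bytes(quantity: str) -> int:
    """Parse a tiny subset of Kubernetes-style quantities ('24Gi') and
    MIG profile sizes ('20gb') into bytes."""
    q = quantity.strip().lower()
    if q.endswith("gi"):
        return int(q[:-2]) * 1024**3  # binary gibibytes
    if q.endswith("gb"):
        return int(q[:-2]) * 1000**3  # decimal gigabytes, as in mig-2g.20gb
    raise ValueError(f"unsupported quantity: {quantity}")


pod_memory = to_bytes("24Gi")      # 25_769_803_776 bytes
mig_partition = to_bytes("20gb")   # 20_000_000_000 bytes
print(pod_memory > mig_partition)  # True: the request exceeds the slice size
```

Whether the container memory request should actually be bounded by the GPU partition size depends on how much of the workload is resident on the GPU; the comparison above only reproduces the arithmetic in the comment.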

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dbd9ab21-362c-4343-b736-71232f8bf8f1

📥 Commits

Reviewing files that changed from the base of the PR and between 6beb06f and 3da8983.

📒 Files selected for processing (1)
  • ray-service.yaml

@Jurgee Jurgee self-assigned this Mar 23, 2026
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
models/semantic_segmentation.py (1)

122-134: Consider consistency with binary_classifier.py for LZ4 decompression.

While not part of this PR's changes, binary_classifier.py wraps lz4.decompress in asyncio.to_thread() (line 128) to avoid blocking the event loop, but this file calls it synchronously. For larger payloads, this could block the async event loop.

Consider aligning with the other model for consistency:

-        data = self.lz4.decompress(await request.body())
+        data = await asyncio.to_thread(self.lz4.decompress, await request.body())

This would require adding import asyncio at the top of the file.
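The suggested pattern can be sketched outside the service. In this sketch, `zlib` stands in for `lz4.frame` (a third-party dependency) since the offloading mechanics are identical, and `StubRequest`/`handler` are hypothetical names:

```python
import asyncio
import zlib


class StubRequest:
    """Minimal stand-in for a Starlette request carrying a compressed body."""

    async def body(self) -> bytes:
        return zlib.compress(b"tile payload" * 1000)


async def handler(request: StubRequest) -> bytes:
    raw = await request.body()
    # Run the synchronous decompressor in a worker thread so large payloads
    # do not block the event loop while other requests are in flight.
    return await asyncio.to_thread(zlib.decompress, raw)


print(len(asyncio.run(handler(StubRequest()))))  # 12000
```

For small payloads the thread hop adds overhead, so the trade-off is only worthwhile when decompression takes long enough to stall other requests.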

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/semantic_segmentation.py` around lines 122 - 134, The endpoint root
currently calls self.lz4.decompress synchronously which can block the event loop
for large payloads; change the call to run in a thread (use asyncio.to_thread)
e.g. await asyncio.to_thread(self.lz4.decompress, await request.body()) and
ensure asyncio is imported at the top; keep the rest of the flow (creating image
from bytes, calling self.predict, and returning the compressed prediction)
unchanged and reference the methods/attributes root, self.lz4.decompress,
self.predict, and self.tile_size when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@models/semantic_segmentation.py`:
- Around line 122-134: The endpoint root currently calls self.lz4.decompress
synchronously which can block the event loop for large payloads; change the call
to run in a thread (use asyncio.to_thread) e.g. await
asyncio.to_thread(self.lz4.decompress, await request.body()) and ensure asyncio
is imported at the top; keep the rest of the flow (creating image from bytes,
calling self.predict, and returning the compressed prediction) unchanged and
reference the methods/attributes root, self.lz4.decompress, self.predict, and
self.tile_size when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f76a249-a486-4ffa-b655-a4c25645b8f4

📥 Commits

Reviewing files that changed from the base of the PR and between c4f21a3 and 7bbd50c.

📒 Files selected for processing (2)
  • models/binary_classifier.py
  • models/semantic_segmentation.py

@matejpekar matejpekar requested review from Adames4 and matejpekar and removed request for JakubPekar and ejdam87 March 24, 2026 20:02
@Jurgee Jurgee marked this pull request as draft March 24, 2026 20:04
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@models/semantic_segmentation.py`:
- Line 124: The code wrongly awaits the synchronous lz4.frame.decompress call;
change the line to first await request.body() into a variable (e.g., raw = await
request.body()) and then call self.lz4.decompress(raw) without await, assigning
the result to data so data is the decompressed bytes; update references to data
accordingly (look for self.lz4.decompress and request.body() in the method).
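The fix amounts to awaiting only the coroutine. A minimal sketch, again with stdlib `zlib` standing in for `lz4.frame` and a hypothetical stub request:

```python
import asyncio
import zlib


class StubRequest:
    async def body(self) -> bytes:
        return zlib.compress(b"mask bytes")


async def root(request: StubRequest) -> bytes:
    raw = await request.body()   # request.body() is a coroutine: await it
    data = zlib.decompress(raw)  # decompress() is synchronous: no await
    return data


print(asyncio.run(root(StubRequest())))  # b'mask bytes'
```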

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74295de0-9ed3-46ec-a859-13a0a0fc3b64

📥 Commits

Reviewing files that changed from the base of the PR and between 7bbd50c and 2134a4a.

📒 Files selected for processing (1)
  • models/semantic_segmentation.py

@Jurgee Jurgee force-pushed the feature/prostate-opt branch from 2134a4a to 7bbd50c Compare March 24, 2026 20:13
@Jurgee Jurgee marked this pull request as ready for review March 28, 2026 17:01
@Jurgee Jurgee requested a review from matejpekar March 28, 2026 17:01
@Jurgee Jurgee changed the title Prostate classifier optimalization Models optimalizations Mar 28, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
ray-service.yaml (1)

12-12: ⚠️ Potential issue | 🟠 Major

Pin working_dir to the revision that actually contains this change set.

These URLs now fetch refs/heads/main.zip, while this PR also changes models/binary_classifier.py and models/semantic_segmentation.py. A rollout from this branch would therefore boot replicas from current main, not the code under review, and later restarts/autoscaling will keep pulling a moving target. Please switch these to an immutable commit or release artifact that includes this PR.

Also applies to: 44-44, 78-78
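A pinned `working_dir` along the lines suggested above could look like this. This is a sketch only: the commit SHA is a placeholder, not a real revision of the repo.

```python
# Build an immutable working_dir URL from a specific commit instead of a
# branch head, so replicas always fetch the exact code under review.
commit_sha = "0123456789abcdef0123456789abcdef01234567"  # placeholder SHA
working_dir = f"https://github.com/RationAI/model-service/archive/{commit_sha}.zip"

runtime_env = {"working_dir": working_dir}
print(runtime_env["working_dir"])
```

The same URL shape works for tags (`refs/tags/<tag>.zip`), which is often easier to audit than a bare SHA.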

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ray-service.yaml` at line 12, The working_dir in the Ray service manifest is
pointing to a moving ref
(https://github.com/RationAI/model-service/archive/refs/heads/main.zip) which
will not include the PR's changes; update the working_dir entries (the
working_dir YAML key) to point to an immutable artifact that contains this
branch—e.g., a commit tar/zip URL or a release/tag that includes the changes to
models/binary_classifier.py and models/semantic_segmentation.py—so replicas and
future restarts always fetch the exact code for this PR.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@models/binary_classifier.py`:
- Line 20: The TypedDict Config is missing the TensorRT key
trt_max_workspace_size used by reconfigure(), causing the type contract to be
out of sync; update the Config TypedDict to either add trt_max_workspace_size:
int (and keep trt_builder_optimization_level: int) or mark both TensorRT entries
(trt_builder_optimization_level and trt_max_workspace_size) as NotRequired[int]
to reflect that reconfigure() uses .get() fallbacks—modify the Config definition
accordingly and keep reconfigure()’s use of config.get(...) unchanged.
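The `NotRequired` variant suggested above can be sketched like this (assuming Python 3.11+ for `typing.NotRequired`, with a `typing_extensions` fallback for older interpreters; the field names follow the comment, the rest of the Config shape is assumed):

```python
try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:
    from typing_extensions import NotRequired, TypedDict


class Config(TypedDict):
    model: dict
    intra_op_num_threads: int
    # Optional TensorRT tuning knobs; reconfigure() reads them via .get()
    # with fallbacks, so the type contract marks them as not required.
    trt_builder_optimization_level: NotRequired[int]
    trt_max_workspace_size: NotRequired[int]


config: Config = {"model": {}, "intra_op_num_threads": 4}
level = config.get("trt_builder_optimization_level", 3)
print(level)  # 3: the .get() fallback applies when the key is absent
```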

---

Duplicate comments:
In `@ray-service.yaml`:
- Line 12: The working_dir in the Ray service manifest is pointing to a moving
ref (https://github.com/RationAI/model-service/archive/refs/heads/main.zip)
which will not include the PR's changes; update the working_dir entries (the
working_dir YAML key) to point to an immutable artifact that contains this
branch—e.g., a commit tar/zip URL or a release/tag that includes the changes to
models/binary_classifier.py and models/semantic_segmentation.py—so replicas and
future restarts always fetch the exact code for this PR.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1cf0e22f-2b83-4ea5-ab92-e196a1400001

📥 Commits

Reviewing files that changed from the base of the PR and between 2134a4a and 92ac910.

📒 Files selected for processing (3)
  • models/binary_classifier.py
  • models/semantic_segmentation.py
  • ray-service.yaml

@matejpekar matejpekar mentioned this pull request Mar 28, 2026
@@ -61,8 +62,10 @@ spec:
mpp: 0.468
max_batch_size: 8
Member

This seems wrong w.r.t. the other settings.

- name: HeatmapBuilder
max_ongoing_requests: 16
max_queued_requests: 64
max_ongoing_requests: 24
Member

I doubt that 12 GB of RAM is enough for 24 ongoing requests. In each request you are asynchronously loading multiple 1024x1024 tiles.

import lz4.frame

self.lz4 = lz4.frame
self.tile_size = 1024 # default, will be overridden by reconfigure
Member

There shouldn't be a default for this.

Suggested change
self.tile_size = 1024 # default, will be overridden by reconfigure

@serve.deployment(num_replicas="auto")
@serve.ingress(fastapi)
class SemanticSegmentation:
"""Semantic segmentation for tissue tiles using ONNX Runtime with GPU and TensorRT support."""
Member

AI slop

async def predict(
self, images: list[NDArray[np.uint8]]
) -> list[NDArray[np.float16]]:
"""Run inference on a batch of images."""
Member

AI slop

Suggested change
"""Run inference on a batch of images."""

import lz4.frame

self.lz4 = lz4.frame
self.tile_size = 512 # default, will be overridden by reconfigure
Member

Suggested change
self.tile_size = 512 # default, will be overridden by reconfigure

@@ -69,8 +72,10 @@ def reconfigure(self, config: Config) -> None:
"trt_engine_cache_path": cache_path,
"trt_max_workspace_size": config.get(
"trt_max_workspace_size", 8 * 1024 * 1024 * 1024
Member

Suggested change
"trt_max_workspace_size", 8 * 1024 * 1024 * 1024
"trt_max_workspace_size", 8 * 1024**3
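The suggestion above is purely cosmetic; both spellings denote the same 8 GiB byte count:

```python
# 8 * 1024 * 1024 * 1024 and 8 * 1024**3 are the same value: 8 GiB in bytes.
workspace = 8 * 1024**3
print(workspace)  # 8589934592
```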

# Configure ONNX Runtime session
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = config["intra_op_num_threads"]
sess_options.inter_op_num_threads = 1
Member

Why are you removing this?
