feat: Allow fine tuning sqs pooling #737

Merged
zeljkoX merged 3 commits into main from sqs-pooling-tuning on Apr 2, 2026

Conversation

zeljkoX (Collaborator) commented Apr 1, 2026

Summary

  • Make SQS long-poll wait times and poller counts configurable via env vars (SQS_*_WAIT_TIME_SECONDS, SQS_*_POLLER_COUNT)
  • Add multi-poller support: multiple concurrent ReceiveMessage loops per queue sharing one concurrency semaphore, improving message pickup
    smoothness on bursty queues
  • Add segment-level dwell-time metrics to transaction_processing_seconds histogram (request_queue_dwell, prepare_duration,
    submission_queue_dwell, submit_duration) to isolate queue wait vs handler processing in P90 latency
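
The multi-poller arrangement in the second bullet can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the PR's code: the real worker is async and calls SQS, whereas here an in-memory `VecDeque` stands in for `ReceiveMessage`, and `Semaphore` and `run_pollers` are hypothetical names introduced only for this sketch.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (hypothetical stand-in for an async semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting poller.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Spawns `poller_count` poll loops over one in-memory queue. All loops share
/// a single semaphore that caps in-flight handlers at `concurrency`.
/// Returns (messages handled, peak number of concurrently running handlers).
fn run_pollers(messages: Vec<u32>, poller_count: usize, concurrency: usize) -> (usize, usize) {
    let queue = Arc::new(Mutex::new(VecDeque::from(messages)));
    let semaphore = Arc::new(Semaphore::new(concurrency));
    // Tuple fields: (handled, in_flight, peak_in_flight).
    let stats = Arc::new(Mutex::new((0usize, 0usize, 0usize)));

    let pollers: Vec<_> = (0..poller_count)
        .map(|_poller_id| {
            let (queue, semaphore, stats) =
                (Arc::clone(&queue), Arc::clone(&semaphore), Arc::clone(&stats));
            thread::spawn(move || loop {
                // Take a permit before pulling work, so the cap covers all pollers.
                semaphore.acquire();
                let next = queue.lock().unwrap().pop_front();
                let Some(_message) = next else {
                    semaphore.release();
                    break; // Queue drained: this poller exits.
                };
                {
                    let mut s = stats.lock().unwrap();
                    s.1 += 1;
                    s.2 = s.2.max(s.1);
                }
                thread::yield_now(); // Placeholder for the message handler.
                {
                    let mut s = stats.lock().unwrap();
                    s.0 += 1;
                    s.1 -= 1;
                }
                semaphore.release();
            })
        })
        .collect();

    for poller in pollers {
        poller.join().unwrap();
    }
    let final_stats = stats.lock().unwrap();
    (final_stats.0, final_stats.2)
}
```

Calling `run_pollers((0..50).collect(), 4, 2)` drains all 50 messages with four loops while never running more than two handlers at once. Note that a `poller_count` of 0 spawns no loops and handles nothing, which is the failure mode raised later in the review.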

Testing Process

Checklist

  • Add a reference to related issues in the PR description.
  • Add unit tests if applicable.

Summary by CodeRabbit

Release Notes

  • New Features

    • SQS deployments can now be tuned per queue with configurable wait times and polling concurrency
    • Enhanced transaction monitoring with stage-level timing metrics for queue dwell and processing durations
  • Documentation

    • Added SQS performance tuning guide with recommended high-throughput configuration examples
    • Documented transaction processing metrics for visibility into queue and processing performance

@zeljkoX zeljkoX requested a review from a team as a code owner April 1, 2026 10:37
coderabbitai Bot commented Apr 1, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2b75bfb-2333-4c64-984f-4cd4d365324f

Walkthrough

This PR adds SQS queue performance tuning capabilities through environment variables for wait time and poller count configuration, introduces comprehensive transaction lifecycle timing metrics with stage-specific labels, refactors queue polling to support multiple concurrent pollers per queue with permit-based concurrency control, and updates queue type accessors and documentation to support these features.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation & Metrics Definition: docs/configuration/index.mdx, src/metrics/README.md, src/metrics/mod.rs | Added documentation for SQS environment variables (WaitTimeSeconds, POLLER_COUNT) and performance tuning guidance. Introduced public histogram stage label constants and observe_processing_time() helper for recording transaction lifecycle metrics with relayer_id, network_type, and stage labels. |
| Queue Configuration & Type Updates: src/config/server_config.rs, src/queues/queue_type.rs | Added get_sqs_wait_time() and get_sqs_poller_count() config helpers with bounds checking and env var parsing. Renamed polling_interval_secs() to default_wait_time_secs(), added sqs_env_key() and default_poller_count() accessors to QueueType. |
| Handler Instrumentation: src/jobs/handlers/transaction_request_handler.rs, src/jobs/handlers/transaction_submission_handler.rs | Added timing metrics to handlers: queue dwell time (from transaction creation/job timestamp to processing), preparation duration, and submission duration. Metrics are recorded via observe_processing_time() with appropriate stage labels; errors in timestamp parsing are silently skipped. |
| Queue Worker Refactoring: src/queues/sqs/worker.rs | Introduced PollLoopConfig struct and refactored spawn_worker_for_queue to spawn multiple concurrent pollers (via poller_count) sharing a semaphore. New run_poll_loop encapsulates per-poller logic with permit-based message batch distribution. Added get_wait_time_for_queue() and get_poller_count_for_queue() helpers; log messages updated to include poller_id. |
| Test Updates: src/queues/mod.rs | Updated test_queue_type_polling_intervals_appropriate to validate using default_wait_time_secs() instead of the renamed polling_interval_secs(). |
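
The handler instrumentation row can be illustrated with a small sketch of how a dwell time and a processing duration might be split into separate stages. It is not the PR's implementation: `StageTimings` is a hypothetical in-memory stand-in for the histogram, `record_request_stage` and the constant name `STAGE_PREPARE_DURATION` are assumed for illustration, and `SystemTime` replaces the RFC 3339 timestamp parsing the handlers perform. The label strings themselves come from the PR summary. The sketch also clamps a negative dwell (a creation time in the future) to zero, which a review comment on this PR recommends.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

// Stage label values listed in the PR summary.
const STAGE_REQUEST_QUEUE_DWELL: &str = "request_queue_dwell";
// Constant name assumed for this sketch; only the label value is from the PR.
const STAGE_PREPARE_DURATION: &str = "prepare_duration";

/// Seconds elapsed from `start` to `end`, clamped to 0.0 when `start` is later
/// than `end` (for example because of clock skew between hosts).
fn elapsed_secs(start: SystemTime, end: SystemTime) -> f64 {
    end.duration_since(start)
        .map(|d| d.as_secs_f64())
        .unwrap_or(0.0)
}

/// Hypothetical stand-in for the histogram: stores raw observations per stage.
#[derive(Default)]
struct StageTimings {
    observations: HashMap<&'static str, Vec<f64>>,
}

impl StageTimings {
    fn observe(&mut self, stage: &'static str, secs: f64) {
        self.observations.entry(stage).or_default().push(secs);
    }
}

/// Records the two request-side segments: time spent waiting in the queue,
/// then time spent inside the handler preparing the transaction.
fn record_request_stage(
    timings: &mut StageTimings,
    created_at: SystemTime,
    picked_up_at: SystemTime,
    prepared_at: SystemTime,
) {
    timings.observe(STAGE_REQUEST_QUEUE_DWELL, elapsed_secs(created_at, picked_up_at));
    timings.observe(STAGE_PREPARE_DURATION, elapsed_secs(picked_up_at, prepared_at));
}
```

Keeping queue wait and handler work under different stage labels is what lets a slow P90 be attributed to one or the other.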

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

cla: allowlist

Suggested reviewers

  • tirumerla
  • collins-w

Poem

🐰 Pollers multiply like carrots in the spring,
Each semaphore doles out its permit share,
Queue dwell time measured, metrics ring—
Transactions flow faster through the air!
Performance tuned with env vars fair. 🚀

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: Allow fine tuning sqs pooling' is partially related to the changeset; it covers poller count configurability but omits the equally important wait-time configurability and monitoring metrics additions. |
| Description check | ✅ Passed | The description covers the main changes (env vars, multi-poller support, metrics) but the Testing Process section is empty and related issues are not referenced, leaving two checklist items incomplete. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 54.05904% with 249 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.22%. Comparing base (aede8aa) to head (8a04f40).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/queues/sqs/worker.rs | 33.66% | 199 Missing ⚠️ |
| src/jobs/handlers/transaction_request_handler.rs | 0.00% | 22 Missing ⚠️ |
| ...rc/jobs/handlers/transaction_submission_handler.rs | 0.00% | 21 Missing ⚠️ |
| src/queues/queue_type.rs | 94.50% | 5 Missing ⚠️ |
| src/config/server_config.rs | 98.43% | 1 Missing ⚠️ |
| src/metrics/mod.rs | 97.50% | 1 Missing ⚠️ |

Additional details and impacted files

| Flag | Coverage Δ |
|---|---|
| ai | 0.00% <0.00%> (ø) |
| dev | 90.22% <54.05%> (-0.06%) ⬇️ |
| properties | 0.01% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

@@            Coverage Diff             @@
##             main     #737      +/-   ##
==========================================
- Coverage   90.27%   90.22%   -0.06%     
==========================================
  Files         290      290              
  Lines      121698   122082     +384     
==========================================
+ Hits       109868   110151     +283     
- Misses      11830    11931     +101     
| Files with missing lines | Coverage Δ |
|---|---|
| src/queues/mod.rs | 52.45% <100.00%> (ø) |
| src/config/server_config.rs | 94.56% <98.43%> (+0.21%) ⬆️ |
| src/metrics/mod.rs | 92.06% <97.50%> (+2.52%) ⬆️ |
| src/queues/queue_type.rs | 96.73% <94.50%> (-0.66%) ⬇️ |
| ...rc/jobs/handlers/transaction_submission_handler.rs | 47.00% <0.00%> (-10.29%) ⬇️ |
| src/jobs/handlers/transaction_request_handler.rs | 16.32% <0.00%> (-13.31%) ⬇️ |
| src/queues/sqs/worker.rs | 53.62% <33.66%> (+1.96%) ⬆️ |

... and 2 files with indirect coverage changes

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/config/server_config.rs (1)

677-707: Add focused unit tests for the new SQS env parsers.

Please add tests for unset/invalid/zero/upper-bound cases (WAIT_TIME_SECONDS clamped at 20, POLLER_COUNT clamped to minimum 1). This logic is easy to regress silently.
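
The parsing rules this comment asks to test can be sketched as pure functions, which makes the unset, invalid, zero, and upper-bound cases easy to cover without mutating process environment variables. This is an assumed shape, not the actual `get_sqs_wait_time` or `get_sqs_poller_count` code: the names `parse_wait_time`, `parse_poller_count`, and `wait_time_from_env` are hypothetical, the integer types are guesses, and the sketch assumes that a wait time of 0 passes through unchanged and that the default is clamped too. The ceiling of 20 seconds is the SQS long-poll maximum.

```rust
use std::env;

/// SQS rejects WaitTimeSeconds values above 20.
const MAX_SQS_WAIT_TIME_SECS: u64 = 20;

/// Unset or non-numeric input falls back to `default`; the result is capped at 20.
/// Assumption: 0 (short polling) is accepted as-is.
fn parse_wait_time(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(default)
        .min(MAX_SQS_WAIT_TIME_SECS)
}

/// Unset or non-numeric input falls back to `default`; the result is at least 1,
/// so a misconfigured 0 can never leave a queue without pollers.
fn parse_poller_count(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(default)
        .max(1)
}

/// Thin wrapper that reads SQS_{QUEUE_KEY}_WAIT_TIME_SECONDS from the environment.
fn wait_time_from_env(queue_key: &str, default: u64) -> u64 {
    let name = format!("SQS_{queue_key}_WAIT_TIME_SECONDS");
    parse_wait_time(env::var(&name).ok().as_deref(), default)
}
```

Separating the parse step from the `env::var` lookup is a design choice for testability: tests that set and unset real environment variables can interfere with each other when run in parallel.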

As per coding guidelines, "Test coverage/quality for changed or critical paths".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/server_config.rs` around lines 677 - 707, Add focused unit tests
for get_sqs_wait_time and get_sqs_poller_count: cover unset (env var absent ->
returns default), invalid (non-numeric -> returns default), zero and below-min
cases (e.g., WAIT_TIME_SECONDS=0 should clamp to 0? — ensure behavior matches
intended; POLLER_COUNT=0 must clamp to 1), and upper-bound for wait time
(WAIT_TIME_SECONDS > 20 must return 20). Use the functions
get_sqs_wait_time(queue_key, default) and get_sqs_poller_count(queue_key,
default), set and unset the relevant environment variables
(SQS_{QUEUE_KEY}_WAIT_TIME_SECONDS and SQS_{QUEUE_KEY}_POLLER_COUNT) in the test
harness, and assert the returned values match expected clamped/default outcomes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/jobs/handlers/transaction_request_handler.rs`:
- Around line 70-78: The computed dwell_secs derived from parsing
transaction.created_at (via chrono::DateTime::parse_from_rfc3339 and
created_time.with_timezone(&Utc)) can be negative; before calling
observe_processing_time with STAGE_REQUEST_QUEUE_DWELL, clamp dwell_secs to a
non‑negative value (e.g., dwell_secs = max(0.0, computed_value)) so negative
durations from clock skew/bad data are recorded as zero; update the logic around
Utc::now() - created_time and pass the clamped dwell_secs to
observe_processing_time.

In `@src/queues/sqs/worker.rs`:
- Around line 89-90: After calling get_poller_count_for_queue(queue_type) assign
its result to poller_count and validate it is > 0; if it is 0, either return an
Err or panic (fail fast) with a clear message (e.g., "invalid poller_count 0 for
<queue_type>") or fallback to a safe default like 1 before spawning poll loops.
Update the code around the poller_count variable (the spot that reads
get_poller_count_for_queue and the places that use poller_count to spawn poll
loops) to perform this check so no zero value can silently cause no pollers to
be spawned.
- Around line 133-135: The drain loop currently ignores possible JoinError from
pollers; change the loop to handle the Result from
poller_handles.join_next().await: use while let Some(res) =
poller_handles.join_next().await { match res { Ok(_) => {} , Err(err) => {
error!(queue_type = ?queue_type, "poller task panicked: {:?}", err);
panic!("poller task panicked: {:?}", err); } } } so poller panics are logged
with context and not silently swallowed; reference poller_handles, join_next(),
and the JoinError result in your change.
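
A standard-library analogue of the drain loop described above, offered only as a sketch: `std::thread::JoinHandle::join` is used in place of the async `join_next()` call, `drain_pollers` is a hypothetical name, and the function reports panicked pollers to its caller rather than re-panicking as the comment suggests.

```rust
use std::thread::JoinHandle;

/// Joins every poller and returns the ids of those that panicked,
/// logging each failure instead of silently dropping it.
fn drain_pollers(handles: Vec<(usize, JoinHandle<()>)>) -> Vec<usize> {
    let mut panicked = Vec::new();
    for (poller_id, handle) in handles {
        // `join` returns Err with the panic payload if the thread panicked.
        if let Err(payload) = handle.join() {
            let reason = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic payload".to_string());
            eprintln!("poller {poller_id} panicked: {reason}");
            panicked.push(poller_id);
        }
    }
    panicked
}
```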
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bace1086-9246-4e82-b84a-580291ce405c

📥 Commits

Reviewing files that changed from the base of the PR and between 58de92a and ee6e9ef.

📒 Files selected for processing (9)
  • docs/configuration/index.mdx
  • src/config/server_config.rs
  • src/jobs/handlers/transaction_request_handler.rs
  • src/jobs/handlers/transaction_submission_handler.rs
  • src/metrics/README.md
  • src/metrics/mod.rs
  • src/queues/mod.rs
  • src/queues/queue_type.rs
  • src/queues/sqs/worker.rs

Comment thread src/jobs/handlers/transaction_request_handler.rs
Comment thread src/queues/sqs/worker.rs
Comment on lines +89 to 90
let poller_count = get_poller_count_for_queue(queue_type);
let visibility_timeout = queue_type.visibility_timeout_secs();
⚠️ Potential issue | 🟠 Major

Validate poller_count to prevent silent queue stoppage.

At Line 89 and Line 843, poller_count is used without a non-zero guard. A misconfigured value of 0 will spawn no poll loops and the queue will stop being processed without failing fast.

Suggested fix
 fn get_poller_count_for_queue(queue_type: QueueType) -> usize {
-    ServerConfig::get_sqs_poller_count(queue_type.sqs_env_key(), queue_type.default_poller_count())
+    let configured =
+        ServerConfig::get_sqs_poller_count(queue_type.sqs_env_key(), queue_type.default_poller_count());
+    if configured == 0 {
+        warn!(
+            queue_type = ?queue_type,
+            "Configured poller count is 0; clamping to 1"
+        );
+        1
+    } else {
+        configured
+    }
 }

Also applies to: 842-844

Comment thread src/queues/sqs/worker.rs
@zeljkoX zeljkoX merged commit 05c3eee into main Apr 2, 2026
25 of 26 checks passed
@zeljkoX zeljkoX deleted the sqs-pooling-tuning branch April 2, 2026 22:48
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 2, 2026