feat: Allow fine tuning sqs pooling by zeljkoX · Pull Request #737 · OpenZeppelin/openzeppelin-relayer

zeljkoX · 2026-04-01T10:37:08Z

Summary

- Make SQS long-poll wait times and poller counts configurable via env vars (SQS_*_WAIT_TIME_SECONDS, SQS_*_POLLER_COUNT)
- Add multi-poller support: multiple concurrent ReceiveMessage loops per queue sharing one concurrency semaphore, improving message pickup smoothness on bursty queues
- Add segment-level dwell-time metrics to the transaction_processing_seconds histogram (request_queue_dwell, prepare_duration, submission_queue_dwell, submit_duration) to isolate queue wait vs handler processing in P90 latency
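As a rough illustration of the stage-labelled timing described in the summary, here is a minimal std-only sketch. The stage strings come from the summary and `STAGE_REQUEST_QUEUE_DWELL` is named in the review; the other constant names and the `StageTimings` type are hypothetical, an in-memory stand-in for the real histogram, and the PR's `observe_processing_time()` helper may have a different signature.

```rust
use std::collections::HashMap;

// Stage label values as listed in the PR summary.
// Only STAGE_REQUEST_QUEUE_DWELL is a name confirmed by the review; the rest are assumed.
const STAGE_REQUEST_QUEUE_DWELL: &str = "request_queue_dwell";
const STAGE_PREPARE_DURATION: &str = "prepare_duration";
const STAGE_SUBMISSION_QUEUE_DWELL: &str = "submission_queue_dwell";
const STAGE_SUBMIT_DURATION: &str = "submit_duration";

/// Hypothetical in-memory stand-in for a histogram labelled by
/// (relayer_id, network_type, stage).
#[derive(Default)]
struct StageTimings {
    samples: HashMap<(String, String, String), Vec<f64>>,
}

impl StageTimings {
    /// Records one duration, in seconds, under the given labels.
    fn observe(&mut self, relayer_id: &str, network_type: &str, stage: &str, secs: f64) {
        let key = (relayer_id.to_string(), network_type.to_string(), stage.to_string());
        self.samples.entry(key).or_default().push(secs);
    }

    /// Mean of the recorded samples for one label set, if any exist.
    fn mean(&self, relayer_id: &str, network_type: &str, stage: &str) -> Option<f64> {
        let key = (relayer_id.to_string(), network_type.to_string(), stage.to_string());
        self.samples
            .get(&key)
            .map(|v| v.iter().sum::<f64>() / v.len() as f64)
    }
}

fn main() {
    let mut timings = StageTimings::default();
    // One transaction's lifecycle, split into the four segments.
    timings.observe("relayer-1", "evm", STAGE_REQUEST_QUEUE_DWELL, 0.5);
    timings.observe("relayer-1", "evm", STAGE_PREPARE_DURATION, 0.25);
    timings.observe("relayer-1", "evm", STAGE_SUBMISSION_QUEUE_DWELL, 1.5);
    timings.observe("relayer-1", "evm", STAGE_SUBMIT_DURATION, 0.75);

    for stage in [
        STAGE_REQUEST_QUEUE_DWELL,
        STAGE_PREPARE_DURATION,
        STAGE_SUBMISSION_QUEUE_DWELL,
        STAGE_SUBMIT_DURATION,
    ] {
        println!("{stage}: {:?}", timings.mean("relayer-1", "evm", stage));
    }
}
```

Keeping the stage as a label rather than a separate metric lets one query compare queue wait against handler time for the same relayer.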

Testing Process

Checklist

Add a reference to related issues in the PR description.
Add unit tests if applicable.

Note

If you are using Relayer in your stack, consider adding your team or organization to our list of Relayer Users in the Wild!

Summary by CodeRabbit

Release Notes

New Features
- SQS deployments can now be tuned per queue with configurable wait times and polling concurrency
- Enhanced transaction monitoring with stage-level timing metrics for queue dwell and processing durations

Documentation
- Added SQS performance tuning guide with recommended high-throughput configuration examples
- Documented transaction processing metrics for visibility into queue and processing performance

coderabbitai · 2026-04-01T10:37:23Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2b75bfb-2333-4c64-984f-4cd4d365324f


Walkthrough

This PR adds SQS queue performance tuning capabilities through environment variables for wait time and poller count configuration, introduces comprehensive transaction lifecycle timing metrics with stage-specific labels, refactors queue polling to support multiple concurrent pollers per queue with permit-based concurrency control, and updates queue type accessors and documentation to support these features.
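The multi-poller arrangement in the walkthrough can be sketched roughly as follows. This is not the PR's implementation: it swaps the async runtime for plain std threads, replaces SQS with an in-memory queue, and builds a small counting semaphore from `Mutex` and `Condvar`. The names `Semaphore` and `run_pollers` are hypothetical. The point it illustrates is that several poll loops can share one permit pool, so total in-flight work stays bounded no matter how many pollers run.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore; a stand-in for an async semaphore.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `poller_count` poll loops over one queue, all sharing `max_concurrency`
/// permits. Returns (messages processed, peak number handled at once).
fn run_pollers(messages: Vec<u32>, poller_count: usize, max_concurrency: usize) -> (usize, usize) {
    let queue = Arc::new(Mutex::new(VecDeque::from(messages)));
    let sem = Arc::new(Semaphore::new(max_concurrency));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let processed = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..poller_count.max(1))
        .map(|_poller_id| {
            let (queue, sem) = (Arc::clone(&queue), Arc::clone(&sem));
            let (in_flight, peak) = (Arc::clone(&in_flight), Arc::clone(&peak));
            let processed = Arc::clone(&processed);
            thread::spawn(move || loop {
                // Take a permit before receiving, so a poller never pulls
                // a message it has no capacity to handle.
                sem.acquire();
                let msg = queue.lock().unwrap().pop_front();
                match msg {
                    Some(_msg) => {
                        let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                        peak.fetch_max(now, Ordering::SeqCst);
                        thread::yield_now(); // placeholder for the message handler
                        in_flight.fetch_sub(1, Ordering::SeqCst);
                        processed.fetch_add(1, Ordering::SeqCst);
                        sem.release();
                    }
                    None => {
                        sem.release();
                        break; // queue drained; a real poller would long-poll again
                    }
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    (processed.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}

fn main() {
    let (processed, peak) = run_pollers((1..=20).collect(), 4, 2);
    println!("processed {processed} messages, peak in-flight {peak}");
}
```

With four pollers and two permits, the peak in-flight count never exceeds two: adding pollers smooths pickup without raising the concurrency ceiling.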

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Metrics Definition: `docs/configuration/index.mdx`, `src/metrics/README.md`, `src/metrics/mod.rs` | Added documentation for SQS environment variables (`WaitTimeSeconds`, `POLLER_COUNT`) and performance tuning guidance. Introduced public histogram stage label constants and `observe_processing_time()` helper for recording transaction lifecycle metrics with `relayer_id`, `network_type`, and `stage` labels. |
| Queue Configuration & Type Updates: `src/config/server_config.rs`, `src/queues/queue_type.rs` | Added `get_sqs_wait_time()` and `get_sqs_poller_count()` config helpers with bounds checking and env var parsing. Renamed `polling_interval_secs()` to `default_wait_time_secs()`, added `sqs_env_key()` and `default_poller_count()` accessors to `QueueType`. |
| Handler Instrumentation: `src/jobs/handlers/transaction_request_handler.rs`, `src/jobs/handlers/transaction_submission_handler.rs` | Added timing metrics to handlers: queue dwell time (from transaction creation/job timestamp to processing), preparation duration, and submission duration. Metrics are recorded via `observe_processing_time()` with appropriate stage labels; errors in timestamp parsing are silently skipped. |
| Queue Worker Refactoring: `src/queues/sqs/worker.rs` | Introduced `PollLoopConfig` struct and refactored `spawn_worker_for_queue` to spawn multiple concurrent pollers (via `poller_count`) sharing a semaphore. New `run_poll_loop` encapsulates per-poller logic with permit-based message batch distribution. Added `get_wait_time_for_queue()` and `get_poller_count_for_queue()` helpers; log messages updated to include `poller_id`. |
| Test Updates: `src/queues/mod.rs` | Updated `test_queue_type_polling_intervals_appropriate` to validate using `default_wait_time_secs()` instead of the renamed `polling_interval_secs()`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

cla: allowlist

Suggested reviewers

tirumerla
collins-w

Poem

🐰 Pollers multiply like carrots in the spring,
Each semaphore doles out its permit share,
Queue dwell time measured, metrics ring—
Transactions flow faster through the air!
Performance tuned with env vars fair. 🚀

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: Allow fine tuning sqs pooling' is partially related to the changeset; it covers poller count configurability but omits the equally important wait-time configurability and monitoring metrics additions. |
| Description check | ✅ Passed | The description covers the main changes (env vars, multi-poller support, metrics) but the Testing Process section is empty and related issues are not referenced, leaving two checklist items incomplete. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


codecov · 2026-04-01T10:41:42Z

Codecov Report

❌ Patch coverage is 54.05904% with 249 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.22%. Comparing base (aede8aa) to head (8a04f40).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/queues/sqs/worker.rs | 33.66% | 199 Missing ⚠️ |
| src/jobs/handlers/transaction_request_handler.rs | 0.00% | 22 Missing ⚠️ |
| ...rc/jobs/handlers/transaction_submission_handler.rs | 0.00% | 21 Missing ⚠️ |
| src/queues/queue_type.rs | 94.50% | 5 Missing ⚠️ |
| src/config/server_config.rs | 98.43% | 1 Missing ⚠️ |
| src/metrics/mod.rs | 97.50% | 1 Missing ⚠️ |

Additional details and impacted files

| Flag | Coverage Δ |
| --- | --- |
| ai | `0.00% <0.00%> (ø)` |
| dev | `90.22% <54.05%> (-0.06%)` ⬇️ |
| properties | `0.01% <0.00%> (-0.01%)` ⬇️ |


@@            Coverage Diff             @@
##             main     #737      +/-   ##
==========================================
- Coverage   90.27%   90.22%   -0.06%     
==========================================
  Files         290      290              
  Lines      121698   122082     +384     
==========================================
+ Hits       109868   110151     +283     
- Misses      11830    11931     +101

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/queues/mod.rs | `52.45% <100.00%> (ø)` |
| src/config/server_config.rs | `94.56% <98.43%> (+0.21%)` ⬆️ |
| src/metrics/mod.rs | `92.06% <97.50%> (+2.52%)` ⬆️ |
| src/queues/queue_type.rs | `96.73% <94.50%> (-0.66%)` ⬇️ |
| ...rc/jobs/handlers/transaction_submission_handler.rs | `47.00% <0.00%> (-10.29%)` ⬇️ |
| src/jobs/handlers/transaction_request_handler.rs | `16.32% <0.00%> (-13.31%)` ⬇️ |
| src/queues/sqs/worker.rs | `53.62% <33.66%> (+1.96%)` ⬆️ |

... and 2 files with indirect coverage changes


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/config/server_config.rs (1)
677-707: Add focused unit tests for the new SQS env parsers.

Please add tests for unset/invalid/zero/upper-bound cases (WAIT_TIME_SECONDS clamped at 20, POLLER_COUNT clamped to minimum 1). This logic is easy to regress silently.

As per coding guidelines, "Test coverage/quality for changed or critical paths".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/server_config.rs` around lines 677 - 707, Add focused unit tests
for get_sqs_wait_time and get_sqs_poller_count: cover unset (env var absent ->
returns default), invalid (non-numeric -> returns default), zero and below-min
cases (e.g., WAIT_TIME_SECONDS=0 should clamp to 0? — ensure behavior matches
intended; POLLER_COUNT=0 must clamp to 1), and upper-bound for wait time
(WAIT_TIME_SECONDS > 20 must return 20). Use the functions
get_sqs_wait_time(queue_key, default) and get_sqs_poller_count(queue_key,
default), set and unset the relevant environment variables
(SQS_{QUEUE_KEY}_WAIT_TIME_SECONDS and SQS_{QUEUE_KEY}_POLLER_COUNT) in the test
harness, and assert the returned values match expected clamped/default outcomes.
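One way to make the requested cases easy to test is to split the parsing from the environment lookup. The sketch below is an assumption about shape, not the code in `src/config/server_config.rs`: `parse_wait_time`, `parse_poller_count`, and `wait_time_from_env` are hypothetical names, the `u64`/`usize` types are guesses, and treating `WAIT_TIME_SECONDS=0` as a valid value (short polling) is one possible answer to the open question in the prompt above. The upper bound of 20 matches the SQS limit on `WaitTimeSeconds`.

```rust
use std::env;

/// SQS caps long-poll WaitTimeSeconds at 20.
const MAX_WAIT_TIME_SECS: u64 = 20;

/// Pure parsing step: unset or invalid input falls back to `default`,
/// and the result is clamped to the SQS maximum.
/// Assumption: 0 is accepted as-is (short polling).
fn parse_wait_time(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(default)
        .min(MAX_WAIT_TIME_SECS)
}

/// Pure parsing step: unset or invalid input falls back to `default`,
/// and the result is clamped to a minimum of 1 so a queue always has a poller.
fn parse_poller_count(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(default)
        .max(1)
}

/// Thin wrapper that reads SQS_{QUEUE_KEY}_WAIT_TIME_SECONDS from the environment.
fn wait_time_from_env(queue_key: &str, default: u64) -> u64 {
    let key = format!("SQS_{queue_key}_WAIT_TIME_SECONDS");
    parse_wait_time(env::var(&key).ok().as_deref(), default)
}

fn main() {
    // "EXAMPLE" is a placeholder queue key, not one defined by the relayer.
    println!("wait time: {}", wait_time_from_env("EXAMPLE", 5));
    println!("poller count for \"0\": {}", parse_poller_count(Some("0"), 2));
}
```

Because the parsing functions take `Option<&str>`, the unset, invalid, zero, and upper-bound cases can each be asserted without mutating process-wide environment state, which avoids interference between tests running in parallel.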

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/jobs/handlers/transaction_request_handler.rs`:
- Around line 70-78: The computed dwell_secs derived from parsing
transaction.created_at (via chrono::DateTime::parse_from_rfc3339 and
created_time.with_timezone(&Utc)) can be negative; before calling
observe_processing_time with STAGE_REQUEST_QUEUE_DWELL, clamp dwell_secs to a
non‑negative value (e.g., dwell_secs = max(0.0, computed_value)) so negative
durations from clock skew/bad data are recorded as zero; update the logic around
Utc::now() - created_time and pass the clamped dwell_secs to
observe_processing_time.
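The clamping described above might look like the following. This sketch uses `std::time::SystemTime` in place of the `chrono` RFC 3339 parsing the handler uses, and `dwell_secs` is a hypothetical helper name; the idea carried over is only that a creation time ahead of the current time is recorded as zero dwell.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Queue dwell in seconds. If `created_at` is later than `now`
/// (clock skew or bad data), the dwell is reported as 0.0.
fn dwell_secs(created_at: SystemTime, now: SystemTime) -> f64 {
    now.duration_since(created_at)
        .map(|elapsed| elapsed.as_secs_f64())
        .unwrap_or(0.0)
}

fn main() {
    let created = UNIX_EPOCH + Duration::from_secs(1_000);
    let later = created + Duration::from_secs(2);

    println!("normal dwell: {}", dwell_secs(created, later));
    println!("skewed dwell: {}", dwell_secs(later, created));
}
```

`duration_since` returns an error when the earlier argument is actually later, so the negative case is handled without any signed arithmetic.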

In `@src/queues/sqs/worker.rs`:
- Around line 89-90: After calling get_poller_count_for_queue(queue_type) assign
its result to poller_count and validate it is > 0; if it is 0, either return an
Err or panic (fail fast) with a clear message (e.g., "invalid poller_count 0 for
<queue_type>") or fallback to a safe default like 1 before spawning poll loops.
Update the code around the poller_count variable (the spot that reads
get_poller_count_for_queue and the places that use poller_count to spawn poll
loops) to perform this check so no zero value can silently cause no pollers to
be spawned.
- Around line 133-135: The drain loop currently ignores possible JoinError from
pollers; change the loop to handle the Result from
poller_handles.join_next().await: use while let Some(res) =
poller_handles.join_next().await { match res { Ok(_) => {} , Err(err) => {
error!(queue_type = ?queue_type, "poller task panicked: {:?}", err);
panic!("poller task panicked: {:?}", err); } } } so poller panics are logged
with context and not silently swallowed; reference poller_handles, join_next(),
and the JoinError result in your change.
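The drain-loop concern can be illustrated with std threads, where `JoinHandle::join` returns `Err` if the thread panicked, much as an async join set yields a join error. This is a sketch under that substitution: `drain_pollers` is a hypothetical name, and unlike the suggestion above it collects the failing poller ids instead of re-panicking.

```rust
use std::thread::{self, JoinHandle};

/// Joins every poller and returns the ids of those that panicked,
/// logging each panic instead of discarding it.
fn drain_pollers(handles: Vec<(usize, JoinHandle<()>)>) -> Vec<usize> {
    let mut panicked = Vec::new();
    for (poller_id, handle) in handles {
        if let Err(payload) = handle.join() {
            // Panic payloads are usually a &str or a String.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            eprintln!("poller {poller_id} panicked: {msg}");
            panicked.push(poller_id);
        }
    }
    panicked
}

fn main() {
    // Three pollers; the one with id 1 panics to simulate a failing poll loop.
    let handles = (0..3)
        .map(|id| {
            let handle = thread::spawn(move || {
                if id == 1 {
                    panic!("receive failed");
                }
            });
            (id, handle)
        })
        .collect();

    let failed = drain_pollers(handles);
    println!("panicked pollers: {failed:?}");
}
```

Whether to re-panic, restart the poller, or only log is a policy choice; the sketch only shows that the join result has to be inspected for the failure to be visible at all.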



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bace1086-9246-4e82-b84a-580291ce405c

📥 Commits

Reviewing files that changed from the base of the PR and between 58de92a and ee6e9ef.

📒 Files selected for processing (9)

docs/configuration/index.mdx
src/config/server_config.rs
src/jobs/handlers/transaction_request_handler.rs
src/jobs/handlers/transaction_submission_handler.rs
src/metrics/README.md
src/metrics/mod.rs
src/queues/mod.rs
src/queues/queue_type.rs
src/queues/sqs/worker.rs

coderabbitai · 2026-04-01T10:43:48Z

+    let poller_count = get_poller_count_for_queue(queue_type);
    let visibility_timeout = queue_type.visibility_timeout_secs();


⚠️ Potential issue | 🟠 Major

Validate poller_count to prevent silent queue stoppage.

At Line 89 and Line 843, poller_count is used without a non-zero guard. A misconfigured value of 0 will spawn no poll loops and the queue will stop being processed without failing fast.

Suggested fix

 fn get_poller_count_for_queue(queue_type: QueueType) -> usize {
-    ServerConfig::get_sqs_poller_count(queue_type.sqs_env_key(), queue_type.default_poller_count())
+    let configured =
+        ServerConfig::get_sqs_poller_count(queue_type.sqs_env_key(), queue_type.default_poller_count());
+    if configured == 0 {
+        warn!(
+            queue_type = ?queue_type,
+            "Configured poller count is 0; clamping to 1"
+        );
+        1
+    } else {
+        configured
+    }
 }

Also applies to: 842-844


feat: Allow fine tuning sqs pooling

ee6e9ef

zeljkoX requested a review from a team as a code owner April 1, 2026 10:37

github-actions Bot added the cla: allowlist label Apr 1, 2026

coderabbitai Bot reviewed Apr 1, 2026

chore: PR suggestions

6066203

dylankilkenny approved these changes Apr 2, 2026

chore: PR suggestions and unit tests

8a04f40

zeljkoX merged commit 05c3eee into main Apr 2, 2026
25 of 26 checks passed

zeljkoX deleted the sqs-pooling-tuning branch April 2, 2026 22:48

github-actions Bot locked and limited conversation to collaborators Apr 2, 2026

zeljkoX merged 3 commits into main from sqs-pooling-tuning
