Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR adds SQS queue performance tuning capabilities through environment variables for wait time and poller count configuration, introduces comprehensive transaction lifecycle timing metrics with stage-specific labels, refactors queue polling to support multiple concurrent pollers per queue with permit-based concurrency control, and updates queue type accessors and documentation to support these features. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is Additional details and impacted files
Flags with carried forward coverage won't be shown. Click here to find out more. @@ Coverage Diff @@
## main #737 +/- ##
==========================================
- Coverage 90.27% 90.22% -0.06%
==========================================
Files 290 290
Lines 121698 122082 +384
==========================================
+ Hits 109868 110151 +283
- Misses 11830 11931 +101
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/config/server_config.rs (1)
677-707: Add focused unit tests for the new SQS env parsers.Please add tests for unset/invalid/zero/upper-bound cases (
WAIT_TIME_SECONDSclamped at 20,POLLER_COUNTclamped to minimum 1). This logic is easy to regress silently.As per coding guidelines, "Test coverage/quality for changed or critical paths".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config/server_config.rs` around lines 677 - 707, Add focused unit tests for get_sqs_wait_time and get_sqs_poller_count: cover unset (env var absent -> returns default), invalid (non-numeric -> returns default), zero and below-min cases (e.g., WAIT_TIME_SECONDS=0 should clamp to 0? — ensure behavior matches intended; POLLER_COUNT=0 must clamp to 1), and upper-bound for wait time (WAIT_TIME_SECONDS > 20 must return 20). Use the functions get_sqs_wait_time(queue_key, default) and get_sqs_poller_count(queue_key, default), set and unset the relevant environment variables (SQS_{QUEUE_KEY}_WAIT_TIME_SECONDS and SQS_{QUEUE_KEY}_POLLER_COUNT) in the test harness, and assert the returned values match expected clamped/default outcomes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/jobs/handlers/transaction_request_handler.rs`:
- Around line 70-78: The computed dwell_secs derived from parsing
transaction.created_at (via chrono::DateTime::parse_from_rfc3339 and
created_time.with_timezone(&Utc)) can be negative; before calling
observe_processing_time with STAGE_REQUEST_QUEUE_DWELL, clamp dwell_secs to a
non‑negative value (e.g., dwell_secs = max(0.0, computed_value)) so negative
durations from clock skew/bad data are recorded as zero; update the logic around
Utc::now() - created_time and pass the clamped dwell_secs to
observe_processing_time.
In `@src/queues/sqs/worker.rs`:
- Around line 89-90: After calling get_poller_count_for_queue(queue_type) assign
its result to poller_count and validate it is > 0; if it is 0, either return an
Err or panic (fail fast) with a clear message (e.g., "invalid poller_count 0 for
<queue_type>") or fallback to a safe default like 1 before spawning poll loops.
Update the code around the poller_count variable (the spot that reads
get_poller_count_for_queue and the places that use poller_count to spawn poll
loops) to perform this check so no zero value can silently cause no pollers to
be spawned.
- Around line 133-135: The drain loop currently ignores possible JoinError from
pollers; change the loop to handle the Result from
poller_handles.join_next().await: use while let Some(res) =
poller_handles.join_next().await { match res { Ok(_) => {} , Err(err) => {
error!(queue_type = ?queue_type, "poller task panicked: {:?}", err);
panic!("poller task panicked: {:?}", err); } } } so poller panics are logged
with context and not silently swallowed; reference poller_handles, join_next(),
and the JoinError result in your change.
---
Nitpick comments:
In `@src/config/server_config.rs`:
- Around line 677-707: Add focused unit tests for get_sqs_wait_time and
get_sqs_poller_count: cover unset (env var absent -> returns default), invalid
(non-numeric -> returns default), zero and below-min cases (e.g.,
WAIT_TIME_SECONDS=0 should clamp to 0? — ensure behavior matches intended;
POLLER_COUNT=0 must clamp to 1), and upper-bound for wait time
(WAIT_TIME_SECONDS > 20 must return 20). Use the functions
get_sqs_wait_time(queue_key, default) and get_sqs_poller_count(queue_key,
default), set and unset the relevant environment variables
(SQS_{QUEUE_KEY}_WAIT_TIME_SECONDS and SQS_{QUEUE_KEY}_POLLER_COUNT) in the test
harness, and assert the returned values match expected clamped/default outcomes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bace1086-9246-4e82-b84a-580291ce405c
📒 Files selected for processing (9)
docs/configuration/index.mdxsrc/config/server_config.rssrc/jobs/handlers/transaction_request_handler.rssrc/jobs/handlers/transaction_submission_handler.rssrc/metrics/README.mdsrc/metrics/mod.rssrc/queues/mod.rssrc/queues/queue_type.rssrc/queues/sqs/worker.rs
| let poller_count = get_poller_count_for_queue(queue_type); | ||
| let visibility_timeout = queue_type.visibility_timeout_secs(); |
There was a problem hiding this comment.
Validate poller_count to prevent silent queue stoppage.
At Line 89 and Line 843, poller_count is used without a non-zero guard. A misconfigured value of 0 will spawn no poll loops and the queue will stop being processed without failing fast.
Suggested fix
fn get_poller_count_for_queue(queue_type: QueueType) -> usize {
- ServerConfig::get_sqs_poller_count(queue_type.sqs_env_key(), queue_type.default_poller_count())
+ let configured =
+ ServerConfig::get_sqs_poller_count(queue_type.sqs_env_key(), queue_type.default_poller_count());
+ if configured == 0 {
+ warn!(
+ queue_type = ?queue_type,
+ "Configured poller count is 0; clamping to 1"
+ );
+ 1
+ } else {
+ configured
+ }
}Also applies to: 842-844
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/queues/sqs/worker.rs` around lines 89 - 90, After calling
get_poller_count_for_queue(queue_type) assign its result to poller_count and
validate it is > 0; if it is 0, either return an Err or panic (fail fast) with a
clear message (e.g., "invalid poller_count 0 for <queue_type>") or fallback to a
safe default like 1 before spawning poll loops. Update the code around the
poller_count variable (the spot that reads get_poller_count_for_queue and the
places that use poller_count to spawn poll loops) to perform this check so no
zero value can silently cause no pollers to be spawned.
Summary
SQS_*_WAIT_TIME_SECONDS,SQS_*_POLLER_COUNT)ReceiveMessageloops per queue sharing one concurrency semaphore, improving message pickupsmoothness on bursty queues
transaction_processing_secondshistogram (request_queue_dwell,prepare_duration,submission_queue_dwell,submit_duration) to isolate queue wait vs handler processing in P90 latencyTesting Process
Checklist
Note
If you are using Relayer in your stack, consider adding your team or organization to our list of Relayer Users in the Wild!
Summary by CodeRabbit
Release Notes
New Features
Documentation