Skip to content

[scd] Time-bound subscription locks and add lock diagnostics#1384

Merged
barroco merged 3 commits intointeruss:masterfrom
ashishjsharda:fix/scd-lock-latency-1311
Mar 16, 2026
Merged

[scd] Time-bound subscription locks and add lock diagnostics#1384
barroco merged 3 commits intointeruss:masterfrom
ashishjsharda:fix/scd-lock-latency-1311

Conversation

@ashishjsharda
Copy link
Contributor

Problem

Issue #1311 reports high latency and timeouts for SCD operations under overlapping traffic patterns. Lock acquisition in LockSubscriptionsOnCells becomes expensive when many subscriptions share S2 cells, regardless of temporal relevance.

Solution

Narrow lock scope to subscriptions that overlap both the relevant S2 cells and the operation time window, while preserving correctness by always locking explicitly referenced subscription IDs. Add low-noise diagnostics to improve production observability of costly lock queries.

Changes

1. Time-bounded lock scope

Updated LockSubscriptionsOnCells to accept startTime and endTime, locking rows matching:

  • cells && $1
  • COALESCE(starts_at <= endTime, true)
  • COALESCE(ends_at >= startTime, true)
  • OR id = ANY(explicitSubscriptionIDs)

2. Call-site updates

Updated OIR mutation paths to pass the relevant time window:

  • Delete OIR: uses old.StartTime / old.EndTime
  • Upsert OIR: uses validParams.uExtent.StartTime / validParams.uExtent.EndTime

3. Lock diagnostics (thresholded, low-noise)

Warnings emitted when:

  • duration >= 500ms, or
  • matched_rows >= 1000

Fields logged:

  • global_lock
  • duration
  • matched_rows
  • cell_count
  • explicit_subscription_id_count

Failure warnings:

  • SCD global lock query failed
  • SCD subscription lock query failed

4. Documentation

Extended docs/operations/performances.md with diagnostics behavior, fields, and thresholds.

Validation

  • go test ./pkg/scd/store/datastore — passed
  • go test ./pkg/scd/... — passed locally

Risk / Compatibility

  • LockSubscriptionsOnCells interface signature updated; all in-repo call sites patched.
  • Cell-based lock scope intentionally narrowed by time overlap; explicit subscription ID locking preserved.
  • Diagnostics are passive and thresholded, minimizing steady-state log noise.

Follow-up

Closes #1311

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 7, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link
Contributor

@barroco barroco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @ashishjsharda for this contribution.

Please see comments inline.

@barroco
Copy link
Contributor

barroco commented Mar 10, 2026

Thanks for the clarification @ashishjsharda, while looking at it again, I am not sure about the usage of RowsAffected. Can you please confirm the behaviour is correct ?

@ashishjsharda
Copy link
Contributor Author

Thanks for the clarification @ashishjsharda, while looking at it again, I am not sure about the usage of RowsAffected. Can you please confirm the behaviour is correct ?

Thanks for the follow-up. You’re right to question it.

In the latest update (590c4fe), we no longer rely on RowsAffected() for diagnostics behavior:

  • warning trigger is now duration-only (>= 4s),
  • row-count-based criteria were removed.

So the behavior of this PR no longer depends on RowsAffected() semantics for SELECT ... FOR UPDATE.

@barroco
Copy link
Contributor

barroco commented Mar 15, 2026

@ashishjsharda many thanks for the changes. I will review them tomorrow first thing. Please note that we are still missing your signature of the Contributor License Agreement (CLA) (See this automatic check) before being able to merge this PR.

Copy link
Contributor

@barroco barroco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ashishjsharda for the changes.
Knowing that:

  1. in terms of performance, the change is not making the situation worse and best case improves it as far as we have tested
  2. and that, the notifyVolume differing from validParams is out of scope of this PR

I am good to merge it. We will release it as patch update upon request so it can be tested in a production environment allowing users to easily rollback if there is any critical issue. Ideally, we will wait for this confirmation before releasing a new minor version including it.

Waiting for all automated checks to pass including the CLA to be signed to merge.

@ashishjsharda
Copy link
Contributor Author

Thank you @ashishjsharda for the changes. Knowing that:

  1. in terms of performance, the change is not making the situation worse and best case improves it as far as we have tested
  2. and that, the notifyVolume differing from validParams is out of scope of this PR

I am good to merge it. We will release it as patch update upon request so it can be tested in a production environment allowing users to easily rollback if there is any critical issue. Ideally, we will wait for this confirmation before releasing a new minor version including it.

Waiting for all automated checks to pass including the CLA to be signed to merge.

Thanks for the review and approval, @barroco.
I’ve signed the CLA, so the automated checks should pass now.
Appreciate the feedback on performance and the clarification around notifyVolume vs validParams.

@barroco barroco merged commit aca1f2f into interuss:master Mar 16, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High Latency for SCD DSS Requests

3 participants