Skip to content

live-data fixups: UDS sockets, run-status, and front-end reset sequence#674

Merged
ekapadi merged 3 commits into
neutrons:nextfrom
ekapadi:EWM13769_live_data_sockets
May 13, 2026
Merged

live-data fixups: UDS sockets, run-status, and front-end reset sequence#674
ekapadi merged 3 commits into
neutrons:nextfrom
ekapadi:EWM13769_live_data_sockets

Conversation

@ekapadi
Copy link
Copy Markdown
Contributor

@ekapadi ekapadi commented Apr 15, 2026

Description of work

These changes fix several defects in SNAPRed's live-data system. Primarily the changes have to do with how dead-time intervals (i.e. intervals with a zero event count, and no timing information) are treated. In addition, several changes have been made in order to address open issues with the back-end (e.g. GroceryService, during live-data fallback) and front-end (e.g. live-data pane reset at run-number change) implementation. At present, some of these changes are implemented as temporary workarounds, as the root-cause of the associated issue requires a change in the SNSLiveEventDataListener implementation (upcoming).

Explanation of work

This commit includes the following changes:

  • 'LocalDataService.hasLiveDataConnection': allow both INET and UDS (Unix domain socket) addresses;
    these are follow-on changes from the ADARA-packet replay work.

  • 'MantidSnapper': cleanup live-data algorithms immediately following execution;
    this change is necessary to quickly release the 'SNSLiveEventDataListener' instances.

  • 'LoadLiveDataInterval': identify and respond to dead-time gaps:
    A dead-time gap is a period of time where the detected events count is zero; this is associated with the listener
    returning a chunk-workspace that usually has no time-interval information -- this is obviously a defect,
    but it is not fixed as part of this PR! When no time-interval information is present,
    getPulseTimeMin() and getPulseTimeMax() return <max DataAndTime> and <min DateAndTime> respectively.
    Previously, if these values were actually used as the chunk-interval, it resulted in a set of chunk-intervals that
    could never be completed, because a fully-overlapping sequence would never be obtained.

  • 'LoadLiveDataInterval': add fallback mechanism for empty chunks: search for alternate time-series PVs.

  • 'LoadLiveDataInterval': 'RunStatus' output property. 'RunStatus' is now derived from the PV logs, rather than
    strictly from the run-number. WIP: which logs are most reliable for this is still an open issue! It is possible
    that the best log to use for this is not actually being streamed (by ADARA) 'BL3:CS:RunControl:StateEnum'.

  • 'WorkflowPresenter': in response to a 'LiveDataState' exception, 'reset' first.
    This fixes the issue that previously, the warning QMessageBox blocked the live-data pane from resetting,
    resulting in defects such as continuing the reduce cycle even when the run-number had changed (but using the previous
    run-number's workspace).

  • 'GroceryService': Temporary work-arounds prior to SNSLiveEventDataListener fixes.
    Sometimes the run_number log is not actually attached:

    • use the new RunStatus if it is available (the live-data case);
      use the run-number to determine a state change only if it is non-zero.

    • in the live-data case: if the run-number is not set, set the run number on the workspace
      using the value from the metadata;

    • in the fallback case: if the run-number is not set, raise an exception
      (same as current behavior: we must have a run-number in this case to determine if the fallback applies).

To test

Unit test coverage should be fairly complete. Several of these changes can be best assessed by simply examining the code, and verifying that there's unit-test coverage. (You could run the Adara-packet player examples -- but that's almost certainly "above and beyond" the requirements of testing this PR!)

Dev testing

The way I've been testing this system is to simultaneously open SNAPRed's live-data pane on "next", and on this branch. (Remember to enable live-data in your dev.yml; and make sure your calibration directory is writable in case you need to create the state.) Once the metadata for an active run is displayed, set the time-interval to 5 minutes (this ensures that the per-reduction chunk-assembly sequence will not be overly long) -- then hit the "continue" key in both panes. Once it gets to that point, allow the reduction to continue without normalization. If the work of this PR has been successful, you'll see several defects occur in the "next" pane, that are bypassed in the new branch. Particularly, at run-state-change, the panel should reset, and go back to the "ready" state (, but "next" will either crash, or will continue to reduce, but with a mis-named workspace). If you do see dead-time intervals, next will perform the weird time-interval thing, timeout at chunk-assembly (, and possibly crash); this branch should simply ignore the dead-time (except for the warning) and continue without much issue.

CIS testing

Same as for dev testing.

Links to EWM items

EWM#13905

EWM#13769

Verification

  • the author has read the EWM story and acceptance critera
  • the reviewer has read the EWM story and acceptance criteria
  • the reviewer certifies the acceptance criteria below reflect the criteria in EWM

Acceptance Criteria

This list is for ease of reference, and does not replace reading the EWM story as part of the review. Verify this list matches the EWM story before reviewing.

  • acceptance criterion 1
  • acceptance criterion 2

@ekapadi ekapadi force-pushed the EWM13769_live_data_sockets branch from 34c369c to 9b96926 Compare April 15, 2026 13:28
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 98.24561% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.47%. Comparing base (2f47a17) to head (e015b38).

Files with missing lines Patch % Lines
src/snapred/backend/data/GroceryService.py 93.54% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             next     #674      +/-   ##
==========================================
+ Coverage   96.42%   96.47%   +0.04%     
==========================================
  Files          78       79       +1     
  Lines        7160     7258      +98     
==========================================
+ Hits         6904     7002      +98     
  Misses        256      256              
Flag Coverage Δ
integration 48.84% <14.91%> (-0.45%) ⬇️
unittests 96.19% <98.24%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ekapadi ekapadi force-pushed the EWM13769_live_data_sockets branch 4 times, most recently from db359bd to 52d6028 Compare April 20, 2026 14:38
  This commit includes the following changes:

  * 'LocalDataService.hasLiveDataConnection': allow both INET and UDS (Unix domain socket) addresses;
    these are follow-on changes from the ADARA-packet replay work.

  * 'MantidSnapper': cleanup live-data algorithms immediately following execution;
    this change is necessary to quickly release the 'SNSLiveEventDataListener' instances.

  * 'MantidSnapper': fix deadlock on timeout error in '_cleanupNonConcurrent' -- previously, mutex was never released.

  * 'LoadLiveDataInterval': identify and respond to dead-time gaps:
    A dead-time gap is a period of time where the detected events count is zero; this is associated with the listener
    returning a chunk-workspace that usually has no time-interval information -- this is obviously a defect,
    but it is not fixed as part of this PR!  When no time-interval information is present,
    `getPulseTimeMin()` and `getPulseTimeMax()` return \<max DataAndTime\> and \<min DateAndTime\> respectively.
    Previously, if these values were actually used as the chunk-interval, it resulted in a set of chunk-intervals that
    could never be completed, because a fully-overlapping sequence would never be obtained.

  * 'LoadLiveDataInterval': add fallback mechanism for empty chunks: search for alternate time-series PVs.

  * 'LoadLiveDataInterval': 'RunStatus' output property.  'RunStatus' is now derived from the PV logs, rather than
    strictly from the run-number.  WIP: which logs are most reliable for this is still an open issue!  It is possible
    that the best log to use for this is not actually being streamed (by ADARA) 'BL3:CS:RunControl:StateEnum'.

  * 'WorkflowPresenter': in response to a 'LiveDataState' exception, 'reset' *first*.
    This fixes the issue that previously, the warning `QMessageBox` blocked the live-data pane from resetting,
    resulting in defects such as continuing the reduce cycle even when the run-number had changed (but using the previous
    run-number's workspace).

  * 'GroceryService': Temporary work-arounds prior to `SNSLiveEventDataListener` fixes.
    Sometimes the `run_number` log is not actually attached:

      - use the new `RunStatus` if it is available (the live-data case);
        use the run-number to determine a state change *only* if it is non-zero.

      - in the live-data case: if the run-number is not set, *set* the run number on the workspace
        using the value from the metadata;

      - in the fallback case: if the run-number is not set, raise an exception
        (same as current behavior: we must have a run-number in this case to determine if the fallback applies).
@ekapadi ekapadi force-pushed the EWM13769_live_data_sockets branch from 9b5aa4c to dfb86df Compare May 13, 2026 14:37
)

self.mantidSnapper.executeQueue()
# collapse the `Callback`: *why* is this necessary?
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boost does something weird with how it fetches strings from python I think.
Though Im not sure why it isnt already unwrapped as its an output of mantidsnapper.

@ekapadi ekapadi force-pushed the EWM13769_live_data_sockets branch from 0f9875b to 390a699 Compare May 13, 2026 19:58
Copy link
Copy Markdown
Collaborator

@walshmm walshmm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have review these changes and am tentatively approving this PR once the CI passes.
I will be out for a few weeks, so when this happens I would appreciate if someone approved it on my behalf if code changes invalidate it.

I ran both this branch and next, but was unfortunately unable to recreate the issues this pr fixes. That being said, the code in the PR still functions properly, and the code indicates an improvement to the reliability of livedata if such issues did occur. It also adds code to help along "ADARA-packet replay" which would be useful for capturing future issue data for debugging and approving future changes.

I will note though that I am seeing
MANTID WARNING - 2026-05-13 16:22:52 - Python-[Warning] no explicit representation of timezones available for np.datetime64
warns again which were fixed as part of some work Reece had done. This maybe should be fixed? Either way, I dont want to hold up this pr much longer.

@ekapadi ekapadi merged commit ddd85d0 into neutrons:next May 13, 2026
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants