live-data fixups: UDS sockets, run-status, and front-end reset sequence#674
Conversation
34c369c to
9b96926
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## next #674 +/- ##
==========================================
+ Coverage 96.42% 96.47% +0.04%
==========================================
Files 78 79 +1
Lines 7160 7258 +98
==========================================
+ Hits 6904 7002 +98
Misses 256 256
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
db359bd to
52d6028
Compare
This commit includes the following changes:
* 'LocalDataService.hasLiveDataConnection': allow both INET and UDS (Unix domain socket) addresses;
these are follow-on changes from the ADARA-packet replay work.
* 'MantidSnapper': cleanup live-data algorithms immediately following execution;
this change is necessary to quickly release the 'SNSLiveEventDataListener' instances.
* 'MantidSnapper': fix deadlock on timeout error in '_cleanupNonConcurrent' -- previously, mutex was never released.
* 'LoadLiveDataInterval': identify and respond to dead-time gaps:
A dead-time gap is a period of time where the detected events count is zero; this is associated with the listener
returning a chunk-workspace that usually has no time-interval information -- this is obviously a defect,
but it is not fixed as part of this PR! When no time-interval information is present,
`getPulseTimeMin()` and `getPulseTimeMax()` return \<max DataAndTime\> and \<min DateAndTime\> respectively.
Previously, if these values were actually used as the chunk-interval, it resulted in a set of chunk-intervals that
could never be completed, because a fully-overlapping sequence would never be obtained.
* 'LoadLiveDataInterval': add fallback mechanism for empty chunks: search for alternate time-series PVs.
* 'LoadLiveDataInterval': 'RunStatus' output property. 'RunStatus' is now derived from the PV logs, rather than
strictly from the run-number. WIP: which logs are most reliable for this is still an open issue! It is possible
that the best log to use for this is not actually being streamed (by ADARA) 'BL3:CS:RunControl:StateEnum'.
* 'WorkflowPresenter': in response to a 'LiveDataState' exception, 'reset' *first*.
This fixes the issue that previously, the warning `QMessageBox` blocked the live-data pane from resetting,
resulting in defects such as continuing the reduce cycle even when the run-number had changed (but using the previous
run-number's workspace).
* 'GroceryService': Temporary work-arounds prior to `SNSLiveEventDataListener` fixes.
Sometimes the `run_number` log is not actually attached:
- use the new `RunStatus` if it is available (the live-data case);
use the run-number to determine a state change *only* if it is non-zero.
- in the live-data case: if the run-number is not set, *set* the run number on the workspace
using the value from the metadata;
- in the fallback case: if the run-number is not set, raise an exception
(same as current behavior: we must have a run-number in this case to determine if the fallback applies).
9b5aa4c to
dfb86df
Compare
| ) | ||
|
|
||
| self.mantidSnapper.executeQueue() | ||
| # collapse the `Callback`: *why* is this necessary? |
There was a problem hiding this comment.
Boost does something weird with how it fetches strings from python I think.
Though Im not sure why it isnt already unwrapped as its an output of mantidsnapper.
0f9875b to
390a699
Compare
for more information, see https://pre-commit.ci
walshmm
left a comment
There was a problem hiding this comment.
I have review these changes and am tentatively approving this PR once the CI passes.
I will be out for a few weeks, so when this happens I would appreciate if someone approved it on my behalf if code changes invalidate it.
I ran both this branch and next, but was unfortunately unable to recreate the issues this pr fixes. That being said, the code in the PR still functions properly, and the code indicates an improvement to the reliability of livedata if such issues did occur. It also adds code to help along "ADARA-packet replay" which would be useful for capturing future issue data for debugging and approving future changes.
I will note though that I am seeing
MANTID WARNING - 2026-05-13 16:22:52 - Python-[Warning] no explicit representation of timezones available for np.datetime64
warns again which were fixed as part of some work Reece had done. This maybe should be fixed? Either way, I dont want to hold up this pr much longer.
Description of work
These changes fix several defects in SNAPRed's live-data system. Primarily the changes have to do with how dead-time intervals (i.e. intervals with a zero event count, and no timing information) are treated. In addition, several changes have been made in order to address open issues with the back-end (e.g.
GroceryService, during live-data fallback) and front-end (e.g. live-data paneresetat run-number change) implementation. At present, some of these changes are implemented as temporary workarounds, as the root-cause of the associated issue requires a change in theSNSLiveEventDataListenerimplementation (upcoming).Explanation of work
This commit includes the following changes:
'LocalDataService.hasLiveDataConnection': allow both INET and UDS (Unix domain socket) addresses;
these are follow-on changes from the ADARA-packet replay work.
'MantidSnapper': cleanup live-data algorithms immediately following execution;
this change is necessary to quickly release the 'SNSLiveEventDataListener' instances.
'LoadLiveDataInterval': identify and respond to dead-time gaps:
A dead-time gap is a period of time where the detected events count is zero; this is associated with the listener
returning a chunk-workspace that usually has no time-interval information -- this is obviously a defect,
but it is not fixed as part of this PR! When no time-interval information is present,
getPulseTimeMin()andgetPulseTimeMax()return <max DataAndTime> and <min DateAndTime> respectively.Previously, if these values were actually used as the chunk-interval, it resulted in a set of chunk-intervals that
could never be completed, because a fully-overlapping sequence would never be obtained.
'LoadLiveDataInterval': add fallback mechanism for empty chunks: search for alternate time-series PVs.
'LoadLiveDataInterval': 'RunStatus' output property. 'RunStatus' is now derived from the PV logs, rather than
strictly from the run-number. WIP: which logs are most reliable for this is still an open issue! It is possible
that the best log to use for this is not actually being streamed (by ADARA) 'BL3:CS:RunControl:StateEnum'.
'WorkflowPresenter': in response to a 'LiveDataState' exception, 'reset' first.
This fixes the issue that previously, the warning
QMessageBoxblocked the live-data pane from resetting,resulting in defects such as continuing the reduce cycle even when the run-number had changed (but using the previous
run-number's workspace).
'GroceryService': Temporary work-arounds prior to
SNSLiveEventDataListenerfixes.Sometimes the
run_numberlog is not actually attached:use the new
RunStatusif it is available (the live-data case);use the run-number to determine a state change only if it is non-zero.
in the live-data case: if the run-number is not set, set the run number on the workspace
using the value from the metadata;
in the fallback case: if the run-number is not set, raise an exception
(same as current behavior: we must have a run-number in this case to determine if the fallback applies).
To test
Unit test coverage should be fairly complete. Several of these changes can be best assessed by simply examining the code, and verifying that there's unit-test coverage. (You could run the Adara-packet player examples -- but that's almost certainly "above and beyond" the requirements of testing this PR!)
Dev testing
The way I've been testing this system is to simultaneously open SNAPRed's live-data pane on "next", and on this branch. (Remember to enable live-data in your
dev.yml; and make sure your calibration directory is writable in case you need to create the state.) Once the metadata for an active run is displayed, set the time-interval to 5 minutes (this ensures that the per-reduction chunk-assembly sequence will not be overly long) -- then hit the "continue" key in both panes. Once it gets to that point, allow the reduction to continue without normalization. If the work of this PR has been successful, you'll see several defects occur in the "next" pane, that are bypassed in the new branch. Particularly, at run-state-change, the panel should reset, and go back to the "ready" state (, but "next" will either crash, or will continue to reduce, but with a mis-named workspace). If you do see dead-time intervals, next will perform the weird time-interval thing, timeout at chunk-assembly (, and possibly crash); this branch should simply ignore the dead-time (except for the warning) and continue without much issue.CIS testing
Same as for dev testing.
Links to EWM items
EWM#13905
EWM#13769
Verification
Acceptance Criteria
This list is for ease of reference, and does not replace reading the EWM story as part of the review. Verify this list matches the EWM story before reviewing.