Reset jobs on run start/stop from filewriter topic by SimonHeybrock · Pull Request #795 · scipp/esslivedata

SimonHeybrock · 2026-03-13T08:37:26Z

Summary

- Subscribe services to the {instrument}_filewriter Kafka topic for run start (pl72) and run stop (6s4t) messages
- Defer job resets until the data stream reaches the transition timestamp, rather than resetting immediately when the control message is processed
- Propagate stop_time from pl72 run-start messages, scheduling a second reset at stop time when set
- Convert run control timestamps from ms (FlatBuffer wire format) to ns (domain convention) at the adapter boundary
- Per-workflow reset_on_run_transition flag (default True) allows timeseries workflows to opt out. We may need to expose this to the user as a setting in the future, either globally or as a flag when starting a workflow. The mechanism in JobManager should make this simple to change later: only the source of the flag, currently set as reset_on_run_transition=workflow_spec.reset_on_run_transition, would need to change.
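To illustrate the ms to ns conversion at the adapter boundary, here is a minimal sketch. It is not the code from this PR: the helper names `ms_to_ns` and `run_start_from_wire` are hypothetical, the `RunStart` fields are inferred from their usage in the tests and review comments, and treating a wire stop time of `None` or `0` as "not set" is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

NS_PER_MS = 1_000_000


@dataclass(frozen=True)
class RunStart:
    """Domain type; all times are nanoseconds (fields inferred, not confirmed)."""

    run_name: str
    start_time: int
    stop_time: int | None = None


def ms_to_ns(value_ms: int) -> int:
    """Convert a millisecond wire timestamp to nanoseconds."""
    return value_ms * NS_PER_MS


def run_start_from_wire(
    run_name: str, start_time_ms: int, stop_time_ms: int | None
) -> RunStart:
    """Hypothetical adapter step: convert once so downstream code only sees ns."""
    # Assumption: None or 0 on the wire means no stop time was given.
    stop_ns = ms_to_ns(stop_time_ms) if stop_time_ms else None
    return RunStart(
        run_name=run_name, start_time=ms_to_ns(start_time_ms), stop_time=stop_ns
    )


info = run_start_from_wire('run_1', start_time_ms=1500, stop_time_ms=None)
```

Converting in one place means nothing downstream has to remember which unit a given timestamp is in.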

Fixes #791.

Motivation

When a new instrument run starts or stops, data reduction workflows should start fresh. Run control messages may carry future timestamps (e.g., "run starts in 1 minute"), so resetting immediately on receipt would discard data that still belongs to the current run. Resets must fire when the data stream actually reaches the transition point.

Design

- Deferred resets: on_run_start/on_run_stop schedule reset times via bisect.insort into a sorted pending list on JobManager. Resets fire in _advance_to_time when end_time of incoming data reaches the scheduled time. Multiple pending resets within the same data batch collapse into a single _reset_eligible_jobs call.
- Own domain types: RunStart/RunStop dataclasses in core/message.py, decoupled from streaming_data_types FlatBuffer types.
- ms→ns conversion: Done once in RunControlAdapter. All downstream code uses nanoseconds.
- Per-job opt-in: reset_on_run_transition flag on WorkflowSpec, threaded through to Job. Timeseries workflows set False.
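The deferred-reset and opt-in behaviour described above can be sketched with a toy model. `ToyJob` and `ToyJobManager` are hypothetical stand-ins, not the real `Job` and `JobManager`; only the use of `bisect.insort`, the sorted pending list, and the `reset_on_run_transition` flag are taken from the description.

```python
from __future__ import annotations

import bisect
from dataclasses import dataclass


@dataclass
class ToyJob:
    """Hypothetical stand-in for a job; counts how often it was reset."""

    name: str
    reset_on_run_transition: bool = True
    resets: int = 0


class ToyJobManager:
    """Hypothetical stand-in showing scheduling and data-driven firing."""

    def __init__(self, jobs: list[ToyJob]) -> None:
        self._jobs = jobs
        self._pending_reset_times: list[int] = []

    def schedule_reset(self, time_ns: int) -> None:
        # insort keeps the pending list sorted regardless of arrival order.
        bisect.insort(self._pending_reset_times, time_ns)

    def advance_to_time(self, end_time: int) -> None:
        fired = False
        # Drop every reset time the data stream has reached.
        while self._pending_reset_times and self._pending_reset_times[0] <= end_time:
            self._pending_reset_times.pop(0)
            fired = True
        # Several due resets in one batch collapse into a single reset pass.
        if fired:
            for job in self._jobs:
                if job.reset_on_run_transition:
                    job.resets += 1


histogram = ToyJob('histogram')
timeseries = ToyJob('timeseries', reset_on_run_transition=False)
manager = ToyJobManager([histogram, timeseries])
manager.schedule_reset(3000)
manager.schedule_reset(1000)
manager.schedule_reset(9000)
manager.advance_to_time(500)   # data has not reached any reset time yet
manager.advance_to_time(5000)  # 1000 and 3000 are both due: one reset pass
```

After the second call the histogram job has been reset once, the timeseries job has not been reset, and the reset at 9000 is still pending.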

Test plan

We do not have fakes that create run start/stop - and if we did we would need to ensure they match what NICOS is producing. It is probably more productive to deploy this and try it in the wild.

Deploy and verify with real NICOS run start/stop messages

🤖 Generated with Claude Code

Subscribe all services to the {instrument}_filewriter Kafka topic and reset eligible jobs when a run starts or stops. This ensures accumulators don't carry stale data across run boundaries.

Key changes:
- Add RunStart/RunStop domain types in core/message.py
- Add RunControlAdapter to deserialize pl72/6s4t FlatBuffer messages
- Add filewriter_topic to StreamMapping (derived from instrument name)
- Add on_run_start/on_run_stop to JobManager with per-job opt-in
- Add reset_on_run_transition flag to WorkflowSpec (default True)
- Timeseries workflows opt out (reset_on_run_transition=False)
- All four services (detector, monitor, reduction, timeseries) subscribe

Run start/stop messages carry future timestamps, but resets were firing immediately when the message was processed. Now reset times are scheduled and only fire when the data stream reaches the transition point.

Also propagates stop_time from pl72 (run start) messages and converts all run control timestamps from ms (wire format) to ns (domain convention) at the adapter boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The existing test_past_reset_time_fires_on_next_data only tested a reset time that hadn't been reached yet. The new test covers the case where data has already advanced past T=5000 and a RunStart arrives with T=3000, verifying the reset fires on the next data push.

YooSunYoung

I have some minor comments/questions but looks fine in general!

YooSunYoung · 2026-03-18T09:16:52Z

src/ess/livedata/core/job_manager.py

+        if info.stop_time is not None:
+            self._schedule_reset(info.stop_time)


Does NICOS not send a run stop message if stop_time is set?
If it does, I think this part is unnecessary.

SimonHeybrock · 2026-03-18

I don't know. Better to keep it than to make assumptions that might not be true. There is no harm in this, is there?

YooSunYoung · 2026-03-18T09:29:21Z

src/ess/livedata/core/job_manager.py

+    def _fire_pending_resets(self, end_time: int) -> None:
+        """Fire pending resets whose scheduled time has been reached by data."""
+        if not self._pending_reset_times:
+            return
+        triggered = 0
+        for t in self._pending_reset_times:
+            if t <= end_time:
+                triggered += 1
+            else:
+                break
+        if triggered:
+            self._pending_reset_times = self._pending_reset_times[triggered:]
+            self._reset_eligible_jobs()


Can we also write this with bisect_right?
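As a sketch of what this suggestion amounts to (not the code that was eventually committed): on a sorted list, `bisect.bisect_right` returns the number of entries less than or equal to the given value, which is exactly what the manual loop counts. The function names below are hypothetical.

```python
from bisect import bisect_right


def count_due_loop(pending: list[int], end_time: int) -> int:
    """Manual prefix count, mirroring the loop in the diff above."""
    triggered = 0
    for t in pending:
        if t <= end_time:
            triggered += 1
        else:
            break
    return triggered


def count_due_bisect(pending: list[int], end_time: int) -> int:
    """Same count via binary search; requires `pending` to be sorted."""
    return bisect_right(pending, end_time)


pending = [100, 200, 200, 300]
due = count_due_bisect(pending, 200)  # 3: entries equal to end_time are due
remaining = pending[due:]             # [300]
```

`bisect_right` rather than `bisect_left` matters here because of the `t <= end_time` condition: a reset scheduled exactly at `end_time` must count as reached.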

YooSunYoung · 2026-03-18T09:34:23Z

tests/core/run_transition_test.py

+        manager.on_run_start(RunStart(run_name='run_1', start_time=100))
+        # No jobs scheduled/active, but push data past the reset time


[MINOR]
Can we add another assertion to make sure it has the reset time before the data was pushed?

Suggested change

-        manager.on_run_start(RunStart(run_name='run_1', start_time=100))
-        # No jobs scheduled/active, but push data past the reset time
+        manager.on_run_start(RunStart(run_name='run_1', start_time=100))
+        assert manager._pending_reset_times == [100]
+        # No jobs scheduled/active, but push data past the reset time

Use bisect_right instead of manual loop in _fire_pending_resets. Add assertion for pending reset state before data push in test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SimonHeybrock force-pushed the worktree-run-transition-reset branch from 51745fd to 489b654 Compare March 16, 2026 13:25

SimonHeybrock changed the title ~~Reset jobs on run start/stop from filewriter topic~~ Reset jobs on run transitions using data-driven timestamps Mar 16, 2026

SimonHeybrock marked this pull request as ready for review March 17, 2026 05:37

SimonHeybrock changed the title ~~Reset jobs on run transitions using data-driven timestamps~~ Reset jobs on run transitions Mar 17, 2026

SimonHeybrock changed the title ~~Reset jobs on run transitions~~ Reset jobs on run start/stop from filewriter topic Mar 17, 2026

YooSunYoung approved these changes Mar 18, 2026

SimonHeybrock and others added 2 commits March 18, 2026 10:27

Address review comments on run-transition reset PR

5ecf3fe

Merge branch 'main' into worktree-run-transition-reset

da32bbf

SimonHeybrock enabled auto-merge March 18, 2026 12:21

SimonHeybrock merged commit b6e2674 into main Mar 18, 2026
4 checks passed

SimonHeybrock deleted the worktree-run-transition-reset branch March 18, 2026 12:26

SimonHeybrock merged 5 commits into main from worktree-run-transition-reset
