Improve process exit tracking with SSH PM #896

Open
Aurashk wants to merge 15 commits into develop from aurashk/improve-process-exit-tracking
@Aurashk Aurashk commented Apr 28, 2026

Description

Fixes issue #882

Improvements to process exit reporting and testing in the SSH PM.

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

Change log

  • Each time a process exits, we classify it into one of the situations defined in exit_status.py and report it accordingly. The resulting logs and exit codes from processes should be more accurate.
  • To make the implementation easier to work with, I added RunningShellProcess, defined in ssh_shell_process.py, to keep track of the exit status and all relevant PIDs of a running process. We could build on this further by tabulating the information too, if that is useful.
  • I added tests verifying that each process exit situation results in the expected logging.
  • I made the existing tests more robust by verifying that none of the processes started by the process manager is left as a zombie. This should help us catch zombie process issues earlier in development.
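
As an illustration of the zombie check described in the last bullet, here is a minimal sketch. It is not the code in this PR: it assumes a Linux host with /proc mounted, is_zombie is a hypothetical helper name, and the real tests may detect zombies differently.

```python
import os
import subprocess


def is_zombie(pid: int) -> bool:
    """Return True if `pid` is a zombie, based on Linux's /proc/<pid>/stat.

    Hypothetical helper for illustration only; assumes /proc is available.
    """
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            stat = stat_file.read()
    except (FileNotFoundError, ProcessLookupError):
        # No /proc entry: the process has already been reaped (or never existed).
        return False
    # The state letter is the first field after the parenthesised command name.
    state = stat.rsplit(")", 1)[1].split()[0]
    return state == "Z"


if __name__ == "__main__":
    child = subprocess.Popen(["true"])
    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so that it sits in the zombie state until wait() is called.
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)
    print("before wait:", is_zombie(child.pid))
    child.wait()  # reaps the child
    print("after wait:", is_zombie(child.pid))
```

A test fixture could call a helper like this on every PID the process manager started and fail if any of them is still a zombie after shutdown.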

Suggested manual testing checklist

Observe the logging behaviour and exit codes in the shell, which now depend on how the process dies. Note that a SIGKILL on the remote process correctly reports that no exit code was available. You can do this by killing a process in different ways and restarting it, e.g.:

drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
boot
kill --name df-01
restart --name df-01
In another terminal do: ssh aurash@localhost kill -9 <pid of df-01> (SIGKILL)
restart --name df-01
In another terminal do: ssh aurash@localhost kill -3 <pid of df-01> (SIGQUIT)

Note that you can easily get <pid of df-01> through ps --long-format in the unified shell.
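
To illustrate why the reported status depends on how the process dies, here is a minimal local sketch. It is not the code from exit_status.py: ExitSituation, classify and run_and_signal are hypothetical names, the categories are assumed, and a local sleep child stands in for the remote process. It relies only on the documented subprocess behaviour that, on POSIX, a child terminated by signal N has return code -N.

```python
import signal
import subprocess
from enum import Enum
from typing import Optional


class ExitSituation(Enum):
    """Assumed categories for illustration; the real ones live in exit_status.py."""

    CLEAN_EXIT = "exited with code 0"
    ERROR_EXIT = "exited with a non-zero code"
    KILLED_BY_SIGNAL = "terminated by a signal, no exit code available"
    UNKNOWN = "exit status not available"


def classify(returncode: Optional[int]) -> ExitSituation:
    """Map a subprocess-style return code onto an assumed exit situation."""
    if returncode is None:
        return ExitSituation.UNKNOWN
    if returncode < 0:
        # subprocess reports death by signal N as return code -N on POSIX.
        return ExitSituation.KILLED_BY_SIGNAL
    if returncode == 0:
        return ExitSituation.CLEAN_EXIT
    return ExitSituation.ERROR_EXIT


def run_and_signal(sig: signal.Signals) -> int:
    """Start a local child, send it `sig`, and return its return code."""
    child = subprocess.Popen(["sleep", "30"])
    child.send_signal(sig)
    return child.wait(timeout=10)


if __name__ == "__main__":
    # Mirrors the manual test: SIGKILL (kill -9) and SIGQUIT (kill -3).
    for sig in (signal.SIGKILL, signal.SIGQUIT):
        code = run_and_signal(sig)
        print(sig.name, code, classify(code).value)
```

How the SSH PM obtains the status of the remote process is not shown here; the sketch only demonstrates that a signal death carries no ordinary exit code and therefore needs its own reporting branch.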

Developer checklist

Prior to marking this as "Ready for Review"

Tests run on: HEP cluster, using the 12/05/2026 nightly release

Unit tests - some tests can't be run in CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or their absence is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps:
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if there are architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be sent to #dunedaq-integration for general workflow changes outside a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise I will merge it at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@Aurashk Aurashk changed the title from "Aurashk/improve process exit tracking" to "Improve process exit tracking with SSH PM" Apr 29, 2026
@Aurashk Aurashk requested a review from PawelPlesniak May 13, 2026 13:57
@Aurashk Aurashk marked this pull request as ready for review May 13, 2026 13:57
Aurashk and others added 15 commits May 13, 2026 14:58
… correct logging branch

Co-authored-by: Copilot <copilot@github.com>
cleanup testing code
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
…or a simpler tracking model

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
fix exit logic for unexpected process kill

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@Aurashk Aurashk force-pushed the aurashk/improve-process-exit-tracking branch from a395b31 to e7d5d12 Compare May 13, 2026 14:00

Aurashk commented May 13, 2026

Rebased onto develop
