
Raising errors with failed boots #865

Draft
PawelPlesniak wants to merge 8 commits into prep-release/fddaq-v5.6.0 from PawelPlesniak/UnsucessfulBoot

Conversation

@PawelPlesniak
Collaborator

@PawelPlesniak PawelPlesniak commented Mar 27, 2026

Description

Fixes issue #817

If not all applications are alive when boot has completed, raise an error, and put the session in error.
MAKE THE RICH TABLE NOT OVERRIDE BOOT LOGS.

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

N/A

Change log

New configurations

In the file config/tests/nestedConfig.data.xml, there are now three new session configurations with applications that die at different stages of a session's lifetime. These are templates only, and can serve as a starting point for developing failure mode tests. These sessions are

  • test-nested-config-failure-on-init - the fake_daq_applications that are spawned die before any processing happens
  • test-nested-config-failure-post-boot - the fake_daq_applications that are spawned die after they have been fully initialized.
  • test-nested-config-failure-failure-cmd - the fake_daq_applications that are spawned die on the first executed stateful command.

Booting safety

When booting, we now check that the expected number of applications is alive. If the number of booted applications is incorrect, this is logged. The initial plan was to put the session into an error state if applications die, but due to this bug in the k8s PM, doing so blocks k8s operation. The error-state transition has therefore been left as a comment, to be reintroduced later.
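As an illustration only, the check described above could look something like the following sketch. This is not the code in this PR: `ProcessInfo`, `find_dead_processes` and `check_boot` are hypothetical names, and the real process manager records will differ.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("boot_check_sketch")


@dataclass
class ProcessInfo:
    """Hypothetical stand-in for a process manager's per-process record."""

    name: str
    alive: bool


def find_dead_processes(expected_names, processes):
    """Return the sorted names of expected processes that are dead or missing."""
    alive = {p.name for p in processes if p.alive}
    return sorted(set(expected_names) - alive)


def check_boot(expected_names, processes):
    """Log an error if any expected process is not alive; return True on a clean boot."""
    dead = find_dead_processes(expected_names, processes)
    if dead:
        log.error(
            "Booted, but %d processes died after booting: %s",
            len(dead),
            ", ".join(dead),
        )
        # Moving the session into an error state would go here. It is left out
        # on purpose, mirroring the commented-out transition described above.
        return False
    return True
```

The sketch compares names rather than counts, so that the log can say which applications died. Whether the actual check does this is an assumption.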

Logging around status table

When rendering the live table, output that an application logs to the tty is now visible rather than being overwritten by the rich table. The underlying issue was multiple instances of rich.Console objects overwriting each other. This will be addressed formally in this issue, but the current fix is sufficient for the scope of this release.
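For context, the general way to stop a live display and a logger from overwriting each other in rich is to give both the same Console instance. The sketch below shows that pattern in isolation. It is not the change made in this PR, and `build_status_table` and `render_with_shared_console` are hypothetical names.

```python
import io
import logging

from rich.console import Console
from rich.live import Live
from rich.logging import RichHandler
from rich.table import Table


def build_status_table(rows):
    """Build a small status table from (name, state) pairs."""
    table = Table(title="status")
    table.add_column("Name")
    table.add_column("State")
    for name, state in rows:
        table.add_row(name, state)
    return table


def render_with_shared_console(console, rows, message):
    """Log a message while a live table is active, both using the same Console."""
    logger = logging.getLogger("shared_console_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # The handler writes through the same Console that Live renders to, so
    # rich can place log lines above the live table instead of under it.
    handler = RichHandler(console=console, show_time=False, show_path=False)
    logger.addHandler(handler)
    try:
        with Live(build_status_table(rows), console=console, refresh_per_second=4):
            logger.info(message)
    finally:
        logger.removeHandler(handler)


if __name__ == "__main__":
    # Capture output in a buffer so the result can be inspected.
    buffer = io.StringIO()
    shared = Console(file=buffer, width=100)
    render_with_shared_console(
        shared,
        [("top-segment-controller", "initial")],
        "process exited with exit code 1",
    )
    print(buffer.getvalue())
```

With one shared Console, the log message and the table both end up in the same output stream.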

Notes

Suggested manual testing checklist

Standard runs should behave as expected.

There are additional configurations defined that intentionally fail. To use these, run the following commands

drunc-unified-shell ssh-standalone config/tests/nestedConfig.data.xml test-nested-config-failure-on-init pawel
drunc-unified-shell ssh-standalone config/tests/nestedConfig.data.xml test-nested-config-failure-post-boot pawel

The first will fail to boot as follows:

$ drunc-unified-shell ssh-standalone config/tests/nestedConfig.data.xml test-nested-config-failure-on-init pawel
[2026/03/30 14:18:27 UTC] INFO       shell.py:180                             drunc.unified_shell                                Setting up to use the process manager with configuration ssh-standalone and configuration id "test-nested-config-failure-on-init" from oksconflibs:config/tests/nestedConfig.data.xml
[2026/03/30 14:18:27 UTC] INFO       shell.py:202                             drunc.unified_shell                                Starting process manager
[2026/03/30 14:18:27 UTC] INFO       process_manager.py:109                   drunc.process_manager                              process_manager communicating through address 10.73.136.71:39049
[2026/03/30 14:18:27 UTC] INFO       shell.py:533                             drunc.unified_shell                                unified_shell ready with process_manager and controller commands
drunc-unified-shell > boot
[2026/03/30 14:18:28 UTC] INFO       process_manager_driver.py:104            drunc.process_manager_driver                       Booting session pawel
[2026/03/30 14:18:28 UTC] INFO       process_manager_driver.py:483            drunc.process_manager_driver                       Configuration did not require modifications.
[2026/03/30 14:18:28 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'local-connection-server' from session 'pawel' with UUID a1138745-b085-4322-8232-5be0b0176e04
[2026/03/30 14:18:29 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'top-segment-controller' from session 'pawel' with UUID 772cc6ed-0c83-4e85-a6ec-732ac82a3ad0
[2026/03/30 14:18:29 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'nested-segment-controller' from session 'pawel' with UUID 0d78efbf-09c3-4963-8443-89598cbae07d
[2026/03/30 14:18:30 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'bottom-segment-1-controller' from session 'pawel' with UUID 80e32ef0-2e78-4d11-8927-70aae3ec911e
[2026/03/30 14:18:30 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'bottom-segment-1-application' from session 'pawel' with UUID 72feeca6-b6f8-49f5-b6c7-fb4caa45d8f4
[2026/03/30 14:18:30 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'bottom-segment-2-controller' from session 'pawel' with UUID d00874a1-d8cc-4b50-abc5-4d09d2c81351
[2026/03/30 14:18:30 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'bottom-segment-2-application' from session 'pawel' with UUID cd7283d0-e73d-4ce6-b4eb-12f3f62f86cd
[2026/03/30 14:18:30 UTC] INFO       ssh_process_manager.py:368               drunc.process_manager.SSH_SHELL_process_manager    Booted 'nested-segment-application' from session 'pawel' with UUID 0a8bf2cd-3540-478b-ae7e-ae62f233bff8
⠙ Looking for top-segment-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:00[2026/03/30 14:18:31 UTC] INFO       ssh_process_manager.py:305               drunc.process_manager.SSH_SHELL_process_manager    Process 'bottom-segment-1-application' (session: 'pawel', user: 'pplesnia') process exited with exit code 1
⠹ Looking for top-segment-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:01[2026/03/30 14:18:31 UTC] INFO       ssh_process_manager.py:305               drunc.process_manager.SSH_SHELL_process_manager    Process 'bottom-segment-2-application' (session: 'pawel', user: 'pplesnia') process exited with exit code 1
⠼ Looking for top-segment-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:01[2026/03/30 14:18:31 UTC] INFO       ssh_process_manager.py:305               drunc.process_manager.SSH_SHELL_process_manager    Process 'nested-segment-application' (session: 'pawel', user: 'pplesnia') process exited with exit code 1
  Looking for top-segment-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 0:00:02
⠋ Trying to talk to the root controller... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:00
                                                         pawel status                                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                        ┃ Info ┃ State        ┃ Substate     ┃ In error ┃ Included ┃ Endpoint                          ┃
                                                      pawel status                                                       
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                            ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ top-segment-controller          │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:30006 │
│   nested-segment-controller     │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:41293 │
│     bottom-segment-1-controller │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:46697 │
│     bottom-segment-2-controller │      │ initial │ initial  │ No       │ Yes      │ grpc://np04-srv-029.cern.ch:33943 │
└─────────────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────────────┘
Waiting on tree initialisation... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━  88% 0:00:09
[2026/03/30 14:19:36 UTC] ERROR      commands.py:111                          drunc.unified_shell.boot                           Booted, but 3 processes died after booting.

Note, there are additional logs that are suppressed by the rich table rendering, which can be seen in the logs as

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: WHAT HOSTNAME from release RELEASE_NAME

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@PawelPlesniak
Collaborator Author

The core of the issue for the log override when booting has been identified - the rich Console that the status table updater uses is not the same one that the logger uses. The status table updater overrides it there. This is going to be messy

@PawelPlesniak PawelPlesniak changed the title from "Introducing a counter for the number of booted processes" to "Raising errors with failed boots" on Mar 30, 2026