Skip to content

correct flush command behaviour#824

Merged
PawelPlesniak merged 10 commits intodevelopfrom
aurashk/correct-flush-method-impl
Mar 27, 2026
Merged

correct flush command behaviour#824
PawelPlesniak merged 10 commits intodevelopfrom
aurashk/correct-flush-method-impl

Conversation

@Aurashk
Copy link
Copy Markdown
Contributor

@Aurashk Aurashk commented Mar 13, 2026

Description

Fixes issue #821
Fixes issue #564

Adds the proper implementation of flush in ssh PM

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

NA

Change log

As described in #821 flush now reflects the intended behaviour

  • boot
  • kill any of the processes you booted directly by pid (not from the process manager)
  • ps still shows the process as alive == false
  • flush
  • ps does not show any trace of the unexpectedly killed process

Note: Doesn't specifically apply to the flush command, but the handling of unexpectedly killed processes is not totally robust, e.g. if you kill a controller then try to do status you get a gRPC exception coming up.

Here is an example of the new behaviour:

drunc-unified-shell > boot
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:30006 │
│   df-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:46527 │
│     df-01              │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:43153 │
│     dfo-01             │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:52087 │
│     tp-stream-writer   │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:33041 │
│   hsi-fake-controller  │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:45777 │
│     hsi-fake-01        │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:59237 │
│     hsi-fake-to-tc-app │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:48713 │
│   ru-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:33675 │
│     ru-01              │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:56503 │
│   trg-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:37487 │
│     mlt                │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:41339 │
│     tc-maker-1         │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:46697 │
└────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴─────────────────────────────────────────────┘
Waiting on tree initialisation... ━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   7% 0:01:06
[2026/03/17 17:20:24 UTC] INFO       commands.py:94                           drunc.unified_shell.boot                           Booted successfully
drunc-unified-shell > 

Then from another terminal kill one of the remote ssh processes:

ssh aurash@localhost kill -9 70210

You will then see this in the shell:

drunc-unified-shell > [2026/03/17 17:28:42 UTC] INFO       ssh_process_manager.py:301               drunc.process_manager.SSH_SHELL_process_manager    Process 'tc-maker-1' (session: 'MyTest', user: 
'aurash') process exited with exit code 137

ps and status:

drunc-unified-shell > ps
                                                  Processes running                                                  
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ session ┃ friendly name           ┃ user   ┃ host      ┃ uuid                                 ┃ alive ┃ exit-code ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ MyTest  │ local-connection-server │ aurash │ localhost │ a30bb2e4-05c8-4c72-b3c0-264d4517c83d │ True  │ 0         │
│ MyTest  │ root-controller         │ aurash │ localhost │ 37bcb56e-bd7e-4fc0-9ba0-306de514ad02 │ True  │ 0         │
│ MyTest  │   df-controller         │ aurash │ localhost │ aaae6f6d-18a1-40bc-977f-bd666e8ed4a3 │ True  │ 0         │
│ MyTest  │     df-01               │ aurash │ localhost │ 98a5c8c3-ae39-4bdd-8e31-9ea74f7d11e8 │ True  │ 0         │
│ MyTest  │     dfo-01              │ aurash │ localhost │ ea3164f2-849b-4c65-87f6-f7bfe51f2626 │ True  │ 0         │
│ MyTest  │     tp-stream-writer    │ aurash │ localhost │ 33003c7e-d457-4cf5-aaaf-d2c3a765b828 │ True  │ 0         │
│ MyTest  │   hsi-fake-controller   │ aurash │ localhost │ 3ea7909f-1c35-4313-b7d1-986777f7963c │ True  │ 0         │
│ MyTest  │     hsi-fake-01         │ aurash │ localhost │ 6db3b941-5993-4678-af5f-6d3ccd2758b5 │ True  │ 0         │
│ MyTest  │     hsi-fake-to-tc-app  │ aurash │ localhost │ 295de0b4-7e88-42b2-93be-8d6e3c2bc418 │ True  │ 0         │
│ MyTest  │   ru-controller         │ aurash │ localhost │ d430239f-f571-4eef-9a35-1a081650f890 │ True  │ 0         │
│ MyTest  │     ru-01               │ aurash │ localhost │ d6dbab3c-e825-43f4-aa19-81c348b0a937 │ True  │ 0         │
│ MyTest  │   trg-controller        │ aurash │ localhost │ e4a0ecb9-8440-41d7-bdac-2853646d9753 │ True  │ 0         │
│ MyTest  │     mlt                 │ aurash │ localhost │ 33a0fb25-ebba-4bcc-b11e-bdca6c49b806 │ True  │ 0         │
│ MyTest  │     tc-maker-1          │ aurash │ localhost │ 806216c2-474d-4b6a-b170-9dcddc24d0b0 │ False │ 137       │
└─────────┴─────────────────────────┴────────┴───────────┴──────────────────────────────────────┴───────┴───────────┘
drunc-unified-shell > status
                                                      MyTest status                                                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:30006 │
│   df-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:45377 │
│     df-01              │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:34059 │
│     dfo-01             │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:54075 │
│     tp-stream-writer   │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:54295 │
│   hsi-fake-controller  │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:37329 │
│     hsi-fake-01        │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:41813 │
│     hsi-fake-to-tc-app │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:46421 │
│   ru-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:36571 │
│     ru-01              │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:43021 │
│   trg-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:41893 │
│     mlt                │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:50789 │
│     tc-maker-1         │      │ initial │ idle     │ Yes      │ Yes      │ rest://aurash-VMware-Virtual-Platform:56405 │
└────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴─────────────────────────────────────────────┘
[2026/03/17 17:29:37 UTC] INFO       shell_utils.py:165                       drunc.utils.ShellContext                           Current FSM status is initial. Available transitions 
are conf. Available sequence commands are start-run.
drunc-unified-shell > 

Then run flush, ps and status

drunc-unified-shell > flush
[2026/03/17 17:30:42 UTC] INFO       ssh_process_manager.py:560               drunc.process_manager.SSH_SHELL_process_manager    process_manager flushing dead processes matching ['.*']
[2026/03/17 17:30:43 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.drunc.processes.ssh_process_lifetime_manager Skipping killing remote process 
806216c2-474d-4b6a-b170-9dcddc24d0b0 (PID 141535). It is already dead.
[2026/03/17 17:30:43 UTC] INFO       ssh_process_manager.py:609               drunc.process_manager.SSH_SHELL_process_manager    Flushed dead process 
806216c2-474d-4b6a-b170-9dcddc24d0b0 (name: tc-maker-1, exit code: 137).
                                              Flushed process                                               
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ session ┃ friendly name  ┃ user   ┃ host      ┃ uuid                                 ┃ alive ┃ exit-code ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ MyTest  │     tc-maker-1 │ aurash │ localhost │ 806216c2-474d-4b6a-b170-9dcddc24d0b0 │ False │ 137       │
└─────────┴────────────────┴────────┴───────────┴──────────────────────────────────────┴───────┴───────────┘
drunc-unified-shell > ps
                                                  Processes running                                                  
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
┃ session ┃ friendly name           ┃ user   ┃ host      ┃ uuid                                 ┃ alive ┃ exit-code ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
│ MyTest  │ local-connection-server │ aurash │ localhost │ a30bb2e4-05c8-4c72-b3c0-264d4517c83d │ True  │ 0         │
│ MyTest  │ root-controller         │ aurash │ localhost │ 37bcb56e-bd7e-4fc0-9ba0-306de514ad02 │ True  │ 0         │
│ MyTest  │   df-controller         │ aurash │ localhost │ aaae6f6d-18a1-40bc-977f-bd666e8ed4a3 │ True  │ 0         │
│ MyTest  │     df-01               │ aurash │ localhost │ 98a5c8c3-ae39-4bdd-8e31-9ea74f7d11e8 │ True  │ 0         │
│ MyTest  │     dfo-01              │ aurash │ localhost │ ea3164f2-849b-4c65-87f6-f7bfe51f2626 │ True  │ 0         │
│ MyTest  │     tp-stream-writer    │ aurash │ localhost │ 33003c7e-d457-4cf5-aaaf-d2c3a765b828 │ True  │ 0         │
│ MyTest  │   hsi-fake-controller   │ aurash │ localhost │ 3ea7909f-1c35-4313-b7d1-986777f7963c │ True  │ 0         │
│ MyTest  │     hsi-fake-01         │ aurash │ localhost │ 6db3b941-5993-4678-af5f-6d3ccd2758b5 │ True  │ 0         │
│ MyTest  │     hsi-fake-to-tc-app  │ aurash │ localhost │ 295de0b4-7e88-42b2-93be-8d6e3c2bc418 │ True  │ 0         │
│ MyTest  │   ru-controller         │ aurash │ localhost │ d430239f-f571-4eef-9a35-1a081650f890 │ True  │ 0         │
│ MyTest  │     ru-01               │ aurash │ localhost │ d6dbab3c-e825-43f4-aa19-81c348b0a937 │ True  │ 0         │
│ MyTest  │   trg-controller        │ aurash │ localhost │ e4a0ecb9-8440-41d7-bdac-2853646d9753 │ True  │ 0         │
│ MyTest  │     mlt                 │ aurash │ localhost │ 33a0fb25-ebba-4bcc-b11e-bdca6c49b806 │ True  │ 0         │
└─────────┴─────────────────────────┴────────┴───────────┴──────────────────────────────────────┴───────┴───────────┘
drunc-unified-shell > status
                                                      MyTest status                                                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:30006 │
│   df-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:45377 │
│     df-01              │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:34059 │
│     dfo-01             │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:54075 │
│     tp-stream-writer   │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:54295 │
│   hsi-fake-controller  │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:37329 │
│     hsi-fake-01        │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:41813 │
│     hsi-fake-to-tc-app │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:46421 │
│   ru-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:36571 │
│     ru-01              │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:43021 │
│   trg-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://aurash-VMware-Virtual-Platform:41893 │
│     mlt                │      │ initial │ idle     │ No       │ Yes      │ rest://aurash-VMware-Virtual-Platform:50789 │
│     tc-maker-1         │      │ initial │ idle     │ Yes      │ Yes      │ rest://aurash-VMware-Virtual-Platform:56405 │
└────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴─────────────────────────────────────────────┘

Note that the status output might not be correct, this should be fixed by #516 though.

Suggested manual testing checklist

- boot
- ssh {user}@localhost kill -9 {pid}
- ps still shows the process as alive == false
- flush
- ps does not show any trace of the unexpectedly killed process

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: hep cluster from release NFD_DEV_260317_A9

Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • [] A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflows fails documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PRs is ready to be merged.

Prior to merging

Choose one of the following an complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for #daq-sw-librarians Slack channel

For an single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@Aurashk Aurashk marked this pull request as ready for review March 18, 2026 10:15
@emmuhamm emmuhamm linked an issue Mar 19, 2026 that may be closed by this pull request
@emmuhamm
Copy link
Copy Markdown
Contributor

emmuhamm commented Mar 19, 2026

Hi @Aurashk, thanks for this, this is amazing to see!

In my local session I've rebased this and ran the manual test, and it performs as expected so I'm a big fan of it. I also tried the flush --uuid [uuid] command and that also works as expected. And thanks for adding pytests as well :D

I'm not too much of an expert on the controller code so I won't leave any code comments, but the changes generally seem good for me. I'll let @PawelPlesniak give the code comments. Just for my information, can you let me know what was wrong before and how this was fixed?

One suggestion though, based on the issue desc:

On a similar note, the kill command at the moment does both kill and flush. It might be useful if kill has another argument that only kills the process and not flush it. This will let us test the flush command properly.

Would it be possible to implement this? Perhaps something like kill --no-flush or something similar. This would be incredibly useful for me in sorting out #765

I'll go ahead and run the full integration test suite since this is mainly feature complete already

@PawelPlesniak
Copy link
Copy Markdown
Collaborator

I will take a look at this on Monday as I am on AL tomorrow.
This likely also closes #564

@emmuhamm emmuhamm linked an issue Mar 20, 2026 that may be closed by this pull request
@Aurashk
Copy link
Copy Markdown
Contributor Author

Aurashk commented Mar 20, 2026

Hi @Aurashk, thanks for this, this is amazing to see!

In my local session I've rebased this and ran the manual test, and it performs as expected so I'm a big fan of it. I also tried the flush --uuid [uuid] command and that also works as expected. And thanks for adding pytests as well :D

I'm not too much of an expert on the controller code so I won't leave any code comments, but the changes generally seem good for me. I'll let @PawelPlesniak give the code comments. Just for my information, can you let me know what was wrong before and how this was fixed?

One suggestion though, based on the issue desc:

On a similar note, the kill command at the moment does both kill and flush. It might be useful if kill has another argument that only kills the process and not flush it. This will let us test the flush command properly.

Would it be possible to implement this? Perhaps something like kill --no-flush or something similar. This would be incredibly useful for me in sorting out #765

I'll go ahead and run the full integration test suite since this is mainly feature complete already

Thanks @emmuhamm,
The main thing fixed is that I don't think the flush implementation was complete before. There was a flush method in our process manager servicer class, but like most of the other methods this should have been calling a specific flush_impl_ from the ssh_process_manager. Otherwise it can't access the ssh-specific methods/data structures to clean up the process info properly.

I've had a look at the kill behaviour you're referring to and it won't be a problem to tweak, but want to clarify about the intended behaviour.

Currently kill will keep an archived exit code and the boot request of the dead process, so you are able to restart it unless you flush.
i.e.
This will work:
boot
kill --name df-01
restart --name df-01
But this throws an exception:
boot
kill --name df-01
flush
restart --name df-01
Either way you won't see the process you killed when you use ps (even though we have an exit code stored), but it is straightforward to change that if we add an optional flag of show_killed or something. The only complication is we would need to update the k8s version of ps accordingly.

It feels like ps should be consistent. Currently, if we kill the process through the PM we won't see it anymore in ps, but if it is crashed or killed otherwise it does show up as dead with ps.

@emmuhamm
Copy link
Copy Markdown
Contributor

Hi @Aurashk, thanks for your explanation, this makes sense to me.

Regarding the kill behaviour, let me better explain my intention for it.

In #765, I am building a test suite that essentially runs a lite version of the integ test on drunc’s process manager. This is done by spawning a drunc unified shell, booting it, running a couple of commands, and then parsing the stdout of that (unfortunate yeah).

With these constraints, in general the best way to test the flush command is by showing the ps table, killing it from another shell with kill -9, running a ps to show that the process has died, flushing it, and then running ps again to show that the process has been removed.

Unfortunately, I don’t have a good handle on the bit about killing via kill -9 [pid]. I was hoping that an option like that exists in the drunc shell so I can test this out or something like that.

I do agree that ps should be consistent and that if we kill it through the PM we shouldn’t see it in the ps, but if we kill it elsehow then it should show up as dead with ps.

I guess my request is more of a ‘can we have a command in the drunc shell that emulates a kill signal sent from the outside’?

@Aurashk
Copy link
Copy Markdown
Contributor Author

Aurashk commented Mar 23, 2026

Hi @Aurashk, thanks for your explanation, this makes sense to me.

Regarding the kill behaviour, let me better explain my intention for it.

In #765, I am building a test suite that essentially runs a lite version of the integ test on drunc’s process manager. This is done by spawning a drunc unified shell, booting it, running a couple of commands, and then parsing the stdout of that (unfortunate yeah).

With these constraints, in general the best way to test the flush command is by showing the ps table, killing it from another shell with kill -9, running a ps to show that the process has died, flushing it, and then running ps again to show that the process has been removed.

Unfortunately, I don’t have a good handle on the bit about killing via kill -9 [pid]. I was hoping that an option like that exists in the drunc shell so I can test this out or something like that.

I do agree that ps should be consistent and that if we kill it through the PM we shouldn’t see it in the ps, but if we kill it elsehow then it should show up as dead with ps.

I guess my request is more of a ‘can we have a command in the drunc shell that emulates a kill signal sent from the outside’?

Thanks @emmuhamm, that makes sense. It wouldn't be a problem to add such a command to make testing easier. Another way to consider is that we could add the remote pids to the ps table. Then if you parse them it should be easy to run the kill command with ssh {user}@{hostname} kill -9 {pid}. I'm wondering if it's useful having them on display anyway for other debugging/testing purposes.

@emmuhamm
Copy link
Copy Markdown
Contributor

Thanks Aurash, looking forward to seeing this command implemented!

Another way to consider is that we could add the remote pids to the ps table.

I think it would be good to have that in. The ps command has a --long command that adds in extra information; I think it would be good if the remote pids should be there (might crowd the ps table otherwise? not sure how long the new info would be)

Then if you parse them it should be easy to run the kill command with ssh {user}@{hostname} kill -9 {pid}

This would be grand but I think due to the limitations of the current integ testing framework, it really is separated to a 'run the drunc session and do everything there' phase and an independent 'go through the stdout and output files' phase. The commands to run the drunc session itself is (afaik) run-time defined, and we probably can't spawn an extra terminal session from the test itself. So because of these constraints, in the context of the tests, unfortunately I don't think we can do the ssh {user}@{hostname} kill -9 {pid}.

I'm wondering if it's useful having them on display anyway for other debugging/testing purposes.

But yes this would be very useful to have anyway :)

@Aurashk
Copy link
Copy Markdown
Contributor Author

Aurashk commented Mar 23, 2026

Thanks Aurash, looking forward to seeing this command implemented!

Another way to consider is that we could add the remote pids to the ps table.

I think it would be good to have that in. The ps command has a --long command that adds in extra information; I think it would be good if the remote pids should be there (might crowd the ps table otherwise? not sure how long the new info would be)

Then if you parse them it should be easy to run the kill command with ssh {user}@{hostname} kill -9 {pid}

This would be grand but I think due to the limitations of the current integ testing framework, it really is separated to a 'run the drunc session and do everything there' phase and an independent 'go through the stdout and output files' phase. The commands to run the drunc session itself is (afaik) run-time defined, and we probably can't spawn an extra terminal session from the test itself. So because of these constraints, in the context of the tests, unfortunately I don't think we can do the ssh {user}@{hostname} kill -9 {pid}.

I'm wondering if it's useful having them on display anyway for other debugging/testing purposes.

But yes this would be very useful to have anyway :)

Ah ok thanks for explaining that - all makes sense now. I've added the new features:

  • kill with the --crash option will directly kill processes as described as if they were killed outside of the process manager
  • ps --long-info will display pid of each process running on the remote machines
drunc-unified-shell > ps --long-format
                                                                      Processes running                                                                      
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ session ┃ friendly name           ┃ user   ┃ host      ┃ uuid                                 ┃ alive ┃ exit-code ┃ remote-pid ┃ executable               ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ MyTest  │ local-connection-server │ aurash │ localhost │ c5aac111-02d6-44bc-85f4-2dece43641d0 │ True  │ 0         │ 2156006    │ source; echo             │
│ MyTest  │ root-controller         │ aurash │ localhost │ f3d2b7eb-7bfe-4c75-867f-2a44a727d075 │ True  │ 0         │ 2156218    │ source; drunc-controller │
│ MyTest  │   df-controller         │ aurash │ localhost │ 7d47d4d3-8651-425e-9a7e-8b4b35ce772b │ True  │ 0         │ 2156666    │ source; drunc-controller │
│ MyTest  │     df-01               │ aurash │ localhost │ 5dfe5fbd-a8d5-4f5c-bd29-1cbb4eb9e3a3 │ True  │ 0         │ 2157166    │ source; daq_application  │
│ MyTest  │     dfo-01              │ aurash │ localhost │ 828a2bd6-bbed-4cd0-a3d9-3d3bcf895f30 │ True  │ 0         │ 2157038    │ source; daq_application  │
│ MyTest  │     tp-stream-writer    │ aurash │ localhost │ 1ff90a43-dc1f-4320-bc6e-2d229cacd26a │ True  │ 0         │ 2156831    │ source; daq_application  │
│ MyTest  │   hsi-fake-controller   │ aurash │ localhost │ 582c3f8b-6ca8-4680-a742-5b59e49e385b │ True  │ 0         │ 2157880    │ source; drunc-controller │
│ MyTest  │     hsi-fake-01         │ aurash │ localhost │ 100334be-6348-43ea-9d39-afdf8eddb4cf │ True  │ 0         │ 2158296    │ source; daq_application  │
│ MyTest  │     hsi-fake-to-tc-app  │ aurash │ localhost │ 64eb7242-34ba-4a81-956c-9a8ca54936b0 │ True  │ 0         │ 2158231    │ source; daq_application  │
│ MyTest  │   ru-controller         │ aurash │ localhost │ c836b3ac-fded-4191-9853-0b5a74612ea8 │ True  │ 0         │ 2156360    │ source; drunc-controller │
│ MyTest  │     ru-01               │ aurash │ localhost │ 3de59373-bf09-4a20-8c0e-b9c99fdb8ce8 │ True  │ 0         │ 2156467    │ source; daq_application  │
│ MyTest  │   trg-controller        │ aurash │ localhost │ 05c7970a-c4d7-4334-8f10-c760bb937bd8 │ True  │ 0         │ 2157262    │ source; drunc-controller │
│ MyTest  │     mlt                 │ aurash │ localhost │ cf933e70-76ea-4328-8fd6-76b23e39b1f2 │ True  │ 0         │ 2157564    │ source; daq_application  │
│ MyTest  │     tc-maker-1          │ aurash │ localhost │ 23564da5-a05a-42d8-9204-5e2453ad4cb7 │ True  │ 0         │ 2157322    │ source; daq_application  │
└─────────┴─────────────────────────┴────────┴───────────┴──────────────────────────────────────┴───────┴───────────┴────────────┴──────────────────────────┘

Since I had to add to the schema for the new options this PR needs to go through for the new features to work.

@Aurashk
Copy link
Copy Markdown
Contributor Author

Aurashk commented Mar 23, 2026

Note that locally the pytest tests pass when using the aurashk/add-crash-option-to-kill-command branch for druncschema, but aren't passing here because the new schema is required

@emmuhamm
Copy link
Copy Markdown
Contributor

This is great, thanks Aurash! From the description, it looks like its doing exactly what we want. I'll go ahead and test this and see if I can implement it in the integ testing PR I have.

I'll still let Pawel give the in-depth code review

@emmuhamm emmuhamm added the bug Something isn't working label Mar 23, 2026
@emmuhamm
Copy link
Copy Markdown
Contributor

emmuhamm commented Mar 24, 2026

Doing some local tests, this is pretty much doing what I want to do, which is amazing ✨

One slight comment though, I noticed that when I run flush --help the crash option still pops up. I would have expected that flush shouldn't have the simulate a crash option?

drunc-unified-shell > flush --help
Usage: PROCESS_MANAGER CONFIGURATION_FILE CONFIGURATION_ID SESSION_NAME flush 
           [OPTIONS]

Options:
  --crash             Simulate a crash: send SIGKILL without any cleanup,
                      leaving the process manager in an unexpected-death
                      state.
  --uuid TEXT         Select the process of a particular UUIDs
  -u, --user TEXT     Select the process of a particular user
  -n, --name TEXT     Select the process of a particular names
  -s, --session TEXT  Select the processes on a particular session
  --help              Show this message and exit.

This isn't a merge blocker but I was wondering if this was expected? From what I know I guess not, so might be nice to remove.


Going to run the local tests later tonight. MSQT passes; will update this field for the full daqsystemtest

UPDATE:

⮕ Running daqsystemtest/3ru_1df_multirun_test.py ⬅
======================== 6 passed ✅ in 294.77s (0:04:54) =========================
⮕ Running daqsystemtest/3ru_3df_multirun_test.py ⬅
======================== 6 passed ✅ in 339.46s (0:05:39) =========================
⮕ Running daqsystemtest/example_system_test.py ⬅
======================== 12 passed ✅ in 362.47s (0:06:02) ========================
⮕ Running daqsystemtest/fake_data_producer_test.py ⬅
======================== 6 passed ✅ in 294.61s (0:04:54) =========================
⮕ Running daqsystemtest/long_window_readout_test.py ⬅
============================== 1 skipped 🟡 in 0.95s ==============================
⮕ Running daqsystemtest/minimal_system_quick_test.py ⬅
========================= 4 passed ✅ in 76.56s (0:01:16) =========================
⮕ Running daqsystemtest/readout_type_scan_test.py ⬅
======================== 33 passed ✅ in 873.79s (0:14:33) ========================
⮕ Running daqsystemtest/sample_ehn1_multihost_test.py ⬅
============================= 4 skipped 🟡 in 53.33s ==============================
⮕ Running daqsystemtest/small_footprint_quick_test.py ⬅
========================= 3 passed ✅ in 79.13s (0:01:19) =========================
⮕ Running daqsystemtest/tpg_state_collection_test.py ⬅
======================== 5 passed ✅ in 143.46s (0:02:23) =========================
⮕ Running daqsystemtest/tpreplay_test.py ⬅
======================== 6 passed ✅ in 175.68s (0:02:55) =========================
⮕ Running daqsystemtest/tpstream_writing_test.py ⬅
======================== 4 passed ✅ in 145.15s (0:02:25) =========================
⮕ Running daqsystemtest/trigger_bitwords_test.py ⬅
======================== 18 passed ✅ in 434.30s (0:07:14) ========================

Full integ test passes! I'll leave it to Pawel now to have a look

@PawelPlesniak
Copy link
Copy Markdown
Collaborator

Thank you both for the review, the extended functionality is great. The crash processes is especially good, and will help in defining integration tests for PM behaviours when applications crash unexpectedly - we can define checks to see that the applications are in fact handled correctly.

@PawelPlesniak PawelPlesniak merged commit 3d65ce3 into develop Mar 27, 2026
2 of 4 checks passed
@PawelPlesniak PawelPlesniak deleted the aurashk/correct-flush-method-impl branch March 27, 2026 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Process Manager

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: flush command broken implement test/ refactor flush method in process manager

4 participants