correct flush command behaviour#824
Conversation
|
Hi @Aurashk, thanks for this, this is amazing to see! In my local session I've rebased this and ran the manual test, and it performs as expected so I'm a big fan of it. I also tried the I'm not too much of an expert on the controller code so I won't leave any code comments, but the changes generally seem good for me. I'll let @PawelPlesniak give the code comments. Just for my information, can you let me know what was wrong before and how this was fixed? One suggestion though, based on the issue desc:
Would it be possible to implement this? Perhaps something like I'll go ahead and run the full integration test suite since this is mainly feature complete already |
|
I will take a look at this on Monday as I am on AL tomorrow. |
Thanks @emmuhamm, I've had a look at the Currently It feels like |
|
Hi @Aurashk, thanks for your explanation, this makes sense to me. Regarding the kill behaviour, let me better explain my intention for it. In #765, I am building a test suite that essentially runs a lite version of the integ test on drunc’s process manager. This is done by spawning a drunc unified shell, booting it, running a couple of commands, and then parsing the stdout of that (unfortunate yeah). With these constraints, in general the best way to test the flush command is by showing the ps table, killing it from another shell with kill -9, running a ps to show that the process has died, flushing it, and then running ps again to show that the process has been removed. Unfortunately, I don’t have a good handle on the bit about killing via I do agree that I guess my request is more of a ‘can we have a command in the drunc shell that emulates a kill signal sent from the outside’? |
Thanks @emmuhamm, that makes sense. It wouldn't be a problem to add such a command to make testing easier. Another way to consider is that we could add the remote pids to the |
|
Thanks Aurash, looking forward to seeing this command implemented!
I think it would be good to have that in. The
This would be grand but I think due to the limitations of the current integ testing framework, it really is separated to a 'run the drunc session and do everything there' phase and an independent 'go through the stdout and output files' phase. The commands to run the drunc session itself is (afaik) run-time defined, and we probably can't spawn an extra terminal session from the test itself. So because of these constraints, in the context of the tests, unfortunately I don't think we can do the
But yes this would be very useful to have anyway :) |
Ah ok thanks for explaining that - all makes sense now. I've added the new features:
Since I had to add to the schema for the new options this PR needs to go through for the new features to work. |
|
Note that locally the pytest tests pass when using the aurashk/add-crash-option-to-kill-command branch for druncschema, but aren't passing here because the new schema is required |
|
This is great, thanks Aurash! From the description, it looks like its doing exactly what we want. I'll go ahead and test this and see if I can implement it in the integ testing PR I have. I'll still let Pawel give the in-depth code review |
|
Doing some local tests, this is pretty much doing what I want to do, which is amazing ✨ One slight comment though, I noticed that when I run This isn't a merge blocker but I was wondering if this was expected? From what I know I guess not, so might be nice to remove. Going to run the local tests later tonight. MSQT passes; will update this field for the full daqsystemtest UPDATE: Full integ test passes! I'll leave it to Pawel now to have a look |
|
Thank you both for the review, the extended functionality is great. The crash processes is especially good, and will help in defining integration tests for PM behaviours when applications crash unexpectedly - we can define checks to see that the applications are in fact handled correctly. |
Description
Fixes issue #821
Fixes issue #564
Adds the proper implementation of flush in ssh PM
Type of change
List of required branches from other repositories
NA
Change log
As described in #821
flushnow reflects the intended behaviourNote: Doesn't specifically apply to the
flushcommand, but the handling of unexpectedly killed processes is not totally robust, e.g. if you kill a controller then try to dostatusyou get a gRPC exception coming up.Here is an example of the new behaviour:
Then from another terminal kill one of the remote ssh processes:
You will then see this in the shell:
ps and status:
Then run flush, ps and status
Note that the
statusoutput might not be correct, this should be fixed by #516 though.Suggested manual testing checklist
Developer checklist
Prior to marking this as "Ready for Review"
Tests ran on: hep cluster from release NFD_DEV_260317_A9
Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.
Integration tests - the
daqsystemtest_integtest_bundlerequires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .pytest --marker) passeddaqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.pydaqsystemtest_integtest_bundle.shFinal checklist prior to marking this as "Ready for Review"
Reviewer checklist
druncare in the log filesdruncfailure appears:Once the features are validated and both the unit and integration tests pass, the PRs is ready to be merged.
Choose one of the following an complete all substepsPrior to merging
Once completed, the reviewer can merge the PR.
Notification message for #daq-sw-librarians Slack channel
For an single merge that changes the user workflow
For co-ordinated merge