feat: service restart support for file config plugin #3931
albinsuresh wants to merge 15 commits into thin-edge:main
Conversation
Robot Results
A few missing pieces/open questions:
```toml
on_success = "apply"

[apply]
background_script = "sudo /usr/share/tedge/config-plugins/file apply ${.payload.type}"
```
Even when we convert this step into a builtin task, this plugin apply command should be executed as a background task, irrespective of the type of the config and whether that type restarts tedge-agent or not. That'll simplify this workflow.
Though the downside of doing the background task is that the "verify" state will be executed before the "apply" command is done.
Since we have a workdir available (introduced in 55c2d76), how about using that in the apply state to write a "completion marker file" to that directory, which the verify phase can check for afterwards with a timeout? We just need to formalize the "completion marker file" contract.
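As a rough sketch of that idea, assuming a hypothetical marker file name (`.apply-done`) and layout, since no contract has been formalized yet: the background apply drops the marker into the workdir, and verify polls for it with a timeout.

```shell
#!/bin/sh
set -eu

WORKDIR="$(mktemp -d)"
MARKER="$WORKDIR/.apply-done"

# apply phase: the plugin command runs in the background and drops the
# marker file into the workdir once it has finished
apply_in_background() {
    ( sleep 1; touch "$MARKER" ) &
}

# verify phase: poll for the marker with a timeout in seconds
wait_for_marker() {
    t="$1"
    while [ "$t" -gt 0 ]; do
        if [ -f "$MARKER" ]; then
            return 0
        fi
        sleep 1
        t=$((t - 1))
    done
    [ -f "$MARKER" ]
}

apply_in_background
if wait_for_marker 5; then
    echo "apply completed"
else
    echo "apply timed out" >&2
    exit 1
fi
```

The timeout keeps verify from hanging forever if the background apply crashes before writing the marker.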
Though can't we just tell people not to restart the tedge-agent from within a plugin? And if they want to restart a service then pass back some data via the workflow state exchange mechanism (e.g. :::begin-tedge:::)?
> And if they want to restart a service then pass back some data via the workflow state exchange mechanism (e.g. :::begin-tedge:::)?
Yeah, I had considered that option. But these are the reasons why I tried to avoid this "restart request" exchange from the plugin to the agent via the command state:
- We'll have to introduce yet another state after `apply` for the agent to receive the restart request from the plugin, and then process it by restarting the agent. It feels like an optional state in the workflow, only there for one specific case (the agent restart).
- We'll be exposing the workflow state update contract (`:::begin-tedge:::` and `:::end-tedge:::`) to the plugin, which was otherwise unaware of the workflow contracts. That's why I was looking for a simple file-system based contract via the `workdir`.
- Consistency, in the sense that plugins don't need to care whether they are restarting the agent or any other service, and don't need to do things differently for one vs the other. It would have been different if all service restarts were done by the agent. But since the agent running as `tedge` can't do that, it brings that inconsistency. Though we can get around this problem by making the agent use `tedgectl`.
But I agree that this workdir approach introduces yet another mechanism for metadata exchange between states, when the MQTT state is already available. I thought this would be less of a problem as the workdir would anyway be used for config backups and even other things like service restart detection using persisted metadata (PIDs). So, this would be just one additional thing to that list.
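The config-backup use of the workdir mentioned here could look like the following sketch, where `prepare` copies the current config into the workdir and `rollback` restores it. The `.bak` naming is hypothetical, not an agreed contract.

```shell
#!/bin/sh
set -eu

prepare() {   # prepare <config-path> <work-dir>: back up the current config
    cp "$1" "$2/$(basename "$1").bak"
}

rollback() {  # rollback <config-path> <work-dir>: restore the backed-up config
    cp "$2/$(basename "$1").bak" "$1"
}

workdir=$(mktemp -d)
conf=$(mktemp)
echo "port=1883" > "$conf"

prepare "$conf" "$workdir"
echo "port=8883" > "$conf"   # simulate a bad config update
rollback "$conf" "$workdir"
cat "$conf"                  # the original content is back
```

Keeping the backup keyed by the config file name lets a later `finalize` step simply delete the workdir to clean up.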
We decided to explore a more flexible design, using workflows.
didier-wenzek left a comment:
I really like the new design for the communications between the workflow engine and an operation actor, and this ability to extend the set of atomic steps supported for an operation without touching the workflow engine.
```sh
echo "    list                                               List all config types supported by this plugin"
echo "    get <type>                                         Print the config for the specified type to stdout"
echo "    prepare <type> <new-config-path> --work-dir <dir>  Prepare for configuration update (create backup)"
echo "    set <type> <new-config-path>                       Update the config for the specified type from the new config path"
echo "    apply <type> --work-dir <dir>                      Apply configuration (restart services)"
echo "    verify <type> --work-dir <dir>                     Verify configuration was applied successfully"
echo "    finalize <type> --work-dir <dir>                   Finalize configuration update (cleanup)"
echo "    rollback <type> --work-dir <dir>                   Rollback configuration to previous state"
```
I failed to find the documentation for all the sub-commands a configuration plugin has to implement: what is mandatory vs optional? Do we have such docs?
All are mandatory subcommands. I'll add the missing docs.
That being said, we can make subcommands like `prepare`, `apply`, `finalize`, `verify` and `rollback` optional, if we find those excessive and feel that most users wouldn't need them. Not for backwards-compatibility reasons, but purely for API simplicity. The backward-compatibility aspect is explained in this comment: #3931 (comment)
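A minimal plugin skeleton dispatching all the subcommands listed in the usage text could look like this. The bodies are illustrative stubs and the type names are made up; this is not the real `file` plugin.

```shell
#!/bin/sh
# Hypothetical config plugin skeleton: one case arm per subcommand.
plugin() {
    cmd="$1"; shift
    case "$cmd" in
        list)     printf 'mosquitto\ntedge-log-plugin\n' ;;       # config types this plugin owns
        get)      echo "would print config for type '$1'" ;;
        prepare)  echo "would back up '$1' into the --work-dir" ;;
        set)      echo "would replace config '$1' with '$2'" ;;
        apply)    echo "would restart the service mapped to '$1'" ;;
        verify)   echo "would check the service for '$1' is healthy" ;;
        finalize) echo "would clean up backups for '$1'" ;;
        rollback) echo "would restore the backup for '$1'" ;;
        *)        echo "unknown subcommand: $cmd" >&2; return 1 ;;
    esac
}

plugin list
```

A single dispatch function like this also makes it easy to treat unimplemented subcommands as no-ops if some of them end up optional.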
```rust
    _workdir: &Utf8Path,
) -> Result<(), PluginError> {
    info!("Configuration update finalized successfully for {config_type}");
    // No-op for backward compatibility
```
Wouldn't it be better for backward compatibility to make the `prepare`, `finalize` and `rollback` sub-commands optional, i.e. not mandating a plugin to provide a no-op implementation?
Well, the backward compatibility is not in terms of the plugin API but in terms of the previous config update behaviour. The plugin API (the subcommands) doesn't yet have a backward-compatibility requirement, because we haven't made a release with the existing config plugin APIs (list, get and set) yet. Only the next release will introduce these APIs to the users, so we can have the complete set (including prepare, apply and verify as well) included in the first release itself.
But in terms of behaviour, we're forced to stay consistent with the previous contract (without plugins), where there was no prepare, apply or finalize logic performed. Even the apply for the service restart is performed only for users that opt into that new feature with the `service = <service-name>` config. I can add another config setting for them to opt into the backup/restore functionality as well.
I think we're doing way too many steps here. We should aim at having a very minimal set of actions so that we don't overcomplicate things (even more).
Can we review whether all of the states are 100% required? For example the following states come to mind:
- verify (can't this be done in "set"?)
- apply (can't this be done in "finalize"?)
Maybe having an example of the sequence (e.g. a mermaid.js flow diagram) to show the states would be useful, to then try and cut out the extra stuff.
Here is the diagram:
```mermaid
stateDiagram-v2
    [*] --> executing
    executing --> prepare
    prepare --> set : successful
    prepare --> failed : failed
    set --> apply : successful
    set --> failed : failed
    apply --> verify : successful
    apply --> rollback : failed
    verify --> finalize : successful
    verify --> rollback : failed
    finalize --> successful : successful
    finalize --> failed : failed
    rollback --> failed
    successful --> [*]
    failed --> [*]
```
Here are the options that I had considered to merge some states:
- Merge `apply` into `set`:
  Decided against that as the failure state transition is different for both steps. If `set` fails (the existing config file is not even replaced with the new one), we can move to `failed` straight away, as we don't need to attempt a rollback. But if the failure happens during `apply`, after the config file is `set` (replaced), then we must try a rollback.
- Combine `apply` and `verify`:
  Verify always felt like a standalone step, as the logic would be independent from `apply`. Especially since this step needs to determine whether we `await-agent-restart` in the next state (if `tedge-agent` was restarted) or proceed directly to the `finalize` state (for other services).
- Merge `finalize` into `verify`:
  Though there isn't much relevance for `finalize` for the use-cases that I had considered, I left it as an independent state as a natural counterpart of `rollback`. Here we can short-circuit the transition from `verify -> successful`.
1. Maybe it is a naming thing, as both "set" and "apply" are the same thing for a user: "I'm setting the configuration, and I'm applying the configuration to the device"... where in your example it is more about doing some post actions afterwards... but still I'm not sure what value it brings to separate them.
2. In its current state, I'd say finalize is useless here. It seems that it is only executed on the happy path... I'd be interested whether the sm-plugins behave in a similar way, i.e. that the finalize command is never executed when a preceding error occurred.
With this proposal for handling agent self-restart, I am now convinced that `apply` can be merged into `set` and `finalize` can be dropped as well. `set`/`apply` and `verify` must stay separate, as there is the possibility of the agent restarting itself between them.
@albinsuresh I just realized something: since we're now creating the /etc/tedge/operations/config_update.toml file within some initialization (or on startup), we might have some subtle behaviour changes that we need to be aware of before merging, e.g.
- How is the config_update.toml updated over time when we have workflow changes?
- Do we allow users to have their own custom config_update.toml which overrides the default behavior (I don't think we should, but we should document this to prevent users from trying to do it)?
```toml
[apply]
action = "builtin:config_update:apply"
on_success = "verify"

[verify]
action = "builtin:config_update:verify"
on_success = "finalize"
on_error = "rollback"
```
To handle tedge-agent self-restart during a config update, we'll introduce a `restartAgent` flag in the payload that must be set by the plugin executed in the `apply` state to request an agent restart. The `apply` subcommand of the plugins can just print `:::begin-tedge:::\n{"restartAgent": true}\n:::end-tedge:::` to stdout.
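A plugin-side sketch of that stdout contract, assuming the proposed `restartAgent` key; the check against `tedge.toml` is just an illustrative condition for "this config type restarts the agent":

```shell
#!/bin/sh
# Hypothetical `apply` subcommand emitting a workflow state update on stdout.
apply() {
    config_type="$1"
    # ... apply the configuration for "$config_type" here (restart its service) ...
    if [ "$config_type" = "tedge.toml" ]; then
        # this config type restarts tedge-agent itself:
        # hand the restart back to the agent via the state-exchange fences
        printf ':::begin-tedge:::\n'
        printf '{"restartAgent": true}\n'
        printf ':::end-tedge:::\n'
    fi
}

apply tedge.toml
```

For any other config type the subcommand prints nothing, so the workflow proceeds without an agent restart.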
For handling this restart request we have the following alternatives:
Option A (explicit states for agent restart)

```toml
[apply]
action = "builtin:config_update:apply"
on_success = "evaluate-agent-restart"
on_error = "rollback"

[evaluate-agent-restart]
condition = "${.payload.restartAgent}"
on_success = "restart-agent"
on_error = "verify"

[restart-agent]
action = "restart-agent"
on_exec = "await-agent-restart"

[await-agent-restart]
action = "await-agent-restart"
timeout_second = 30
on_timeout = "rollback"
on_success = "finalize"

[verify]
action = "builtin:config_update:verify"
on_success = "finalize"
on_error = "rollback"
```

The major additions would be the new built-in tasks: `condition` and `restart-agent`.
We can even combine both `evaluate-agent-restart` and `restart-agent` into a single action as follows:

```toml
[conditional-agent-restart]
action = "restart-agent"
input.condition = "${.payload.restartAgent}"
on_exec = "await-agent-restart"
on_error = "verify"
```
Option B (handle agent restart internally)

Here the workflow remains the same and the restart is handled internally by the agent in the `apply` and `verify` states.

```toml
[apply]
action = "builtin:config_update:apply"
on_success = "verify"
on_error = "rollback"

[verify]
action = "builtin:config_update:verify"
on_success = "finalize"
on_error = "rollback"
```

In the `apply` phase, after the plugin's `apply` command is executed, if the agent receives the `restartAgent` flag, then it will move the workflow to the `verify` state and then shut itself down. On restart, the workflow will continue from `verify`, and the agent would check whether a restart was requested (the `restartAgent` flag value) and deem it successful before performing the plugin's `verify` command.
Unlike a `bg_script` that may or may not have triggered an agent restart, which forced the addition of the `await-agent-restart` action with a timeout, here we don't need that, as the agent is guaranteed to perform its self-shutdown in the `apply` phase if the `restartAgent` flag was set.
I prefer Option B, as my PoV is that the agent restart is just internal business, and hence those steps need not be exposed via workflow steps which are overridable; I don't see anyone realistically overriding the agent restart states at all.
The tricky part with an agent restart is not the action itself (action = "restart-agent") but detecting that the restart has been effective and that the workflow can proceed. For that purpose there is an "await-agent-restart" action.
> The tricky part with an agent restart is not the action itself (`action = "restart-agent"`) but detecting that the restart has been effective and that the workflow can proceed. For that purpose there is an "await-agent-restart" action.

Yes. And that's why Option A is using `await-agent-restart` combined with the `restart-agent` action.
> Unlike a bg_script that may or may not have triggered an agent restart, which forced the addition of the await-agent-restart action with timeout, here we don't need that, as the agent is guaranteed to perform its self-shutdown in the apply phase if restartAgent flag was set.
What I was trying to say here is that Option B doesn't need that awaiting logic, as the verify step is guaranteed to be executed only after the agent performs the shutdown in the apply state. Unless the agent fails to shut itself down and still proceeds to process the next command state update for verify, which I assume is not possible.
Further discussions revealed that Option B suffers from a race condition where the verify state transition from the apply state may not have reached the broker before the agent goes down. Hence, the assumption that the workflow would always resume from the verify state after the restart, is broken. So, for now we'll proceed with a simpler version of Option A as follows:
```toml
[apply]
action = "builtin:config_update:apply"
on_success = "evaluate-agent-restart"
on_error = "rollback"

[evaluate-agent-restart]
script = "test ${.payload.restartAgent} = \"true\""
on_exit.0 = "restart-agent"
on_exit.1 = "verify"

[restart-agent]
action = "restart-agent"
on_exec = "await-agent-restart"

[await-agent-restart]
action = "await-agent-restart"
timeout_second = 30
on_timeout = "rollback"
on_success = "finalize"

[verify]
action = "builtin:config_update:verify"
on_success = "successful"
on_error = "rollback"
```
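The evaluate-agent-restart step above relies on the exit code of `test`: 0 when the flag equals "true" (mapped to `on_exit.0`, i.e. restart-agent) and 1 otherwise (mapped to `on_exit.1`, i.e. verify). The `next_state` helper below is purely illustrative, not part of the workflow engine:

```shell
#!/bin/sh
# Demonstrates the exit-code mapping used by the evaluate-agent-restart step.
next_state() {
    if test "$1" = "true"; then
        echo "restart-agent"   # exit code 0 -> on_exit.0
    else
        echo "verify"          # exit code 1 -> on_exit.1
    fi
}

next_state true
```

Any value other than the literal string "true" (including an unset flag) falls through to the verify state.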
```rust
#[error("A shutdown has been requested")]
Shutdown,
```
This doesn't look like an error.
```diff
-#[error("A shutdown has been requested")]
-Shutdown,
+#[error("A restart is required: not running latest version and configuration")]
+RestartRequired,
```
```rust
// The agent will be restarted by systemd and resume from the persisted state
log_file.log_info("Shutting down agent for restart").await;

return Err(RuntimeError::Shutdown);
```
This might be working, but this is a hack.
The proper way to request a shutdown is to send a `RuntimeAction::Shutdown` to the runtime.
I had tried that approach using `RuntimeAction::Shutdown` first. But the problem was that the runtime was treating it as a "graceful shutdown" and exiting with exit code 0, preventing the service from getting restarted. That's why I switched to this approach of throwing an error, similar to the approach that we took for handling self-restart when tedge is self-updated. In the software management code, we are throwing a `RuntimeError::ActorError(Box::new(SoftwareManagerError::NotRunningLatestVersion))`. It felt like an indirect way of requesting a shutdown, so I thought of formalising it with a dedicated `RuntimeError::Shutdown` variant.
So the difference between both approaches in my mind is the following:
- Graceful shutdown without restart: `RuntimeAction::Shutdown`
- Shutdown to be restarted: `RuntimeError::Shutdown`
Okay, it makes sense. Somehow we are sending a `ConfigManagementError::NotUsingLatestConfiguration`. But we aim here to be less specific with an agent restart action.
As `RuntimeError::Shutdown` really doesn't sound like an error, what about `RuntimeError::RestartRequired`?
reubenmiller left a comment:
I ran into an issue when trying to trigger a tedge-agent restart after applying a new tedge.toml file, but the operation fails for some reason; this line was found in the operation log at least:

```
ERROR: builtin action 'builtin:config_update:verify' failed: No builtin operation step handler registered for config_update operation verify step
```
Below is the tedge-configuration-plugin.toml that I configured in my test:

file: /etc/tedge/plugins/tedge-configuration-plugin.toml

```toml
files = [
    { path = '/etc/tedge/system.toml', type = 'system.toml', service = 'tedge-agent' },
    { path = '/etc/tedge/tedge.toml', type = 'tedge.toml', service = 'tedge-agent' },
    { path = '/etc/lighttpd/lighttpd.conf', type = 'lighttpd-conf', service = 'lighttpd', service_action = 'restart' }
]
```

And the full operation log of the failed operation:
TST_actualize_shiny_bend_workflow-config_update-c8y-mapper-53005959.log
The issue was caused by using a … So, the config manager actor did the first part of the config update until the agent restart, but could not resume it post-restart, as the actor itself is not present. That single …

Resolved by 945642f
```robot
...    tedge.toml
...    tedge-log-plugin

Config update restarts service if configured in plugin configuration
```
To be removed, as I can't remember why I added this back then, and the Set Configuration Should Restart Service test above covers what was intended anyway.
Proposed changes
- Moved the `system_services` module from `tedge` into a standalone crate to use from the plugin
- `config_update` workflow with granular operation steps, including restart of `tedge-agent` itself
- How to detect a "reload" when the service action is `reload` instead of `restart`? To be handled using dedicated plugins, as a generic logic in the `file` plugin is not possible

Types of changes
Paste Link to the issue
Checklist
- (`just prepare-dev` once)
- `just format` as mentioned in CODING_GUIDELINES
- `just check` as mentioned in CODING_GUIDELINES

Further comments