
feat: service restart support for file config plugin#3931

Open
albinsuresh wants to merge 15 commits into thin-edge:main from albinsuresh:feat/file-config-plugin-service-restart

Conversation

@albinsuresh
Contributor

@albinsuresh albinsuresh commented Jan 15, 2026

Proposed changes

  • extract system_services module from tedge into a standalone crate to use from the plugin
  • service restart support for file config plugin
  • update config_update workflow with granular operation steps
  • handle plugins restarting the tedge-agent itself
  • how to detect a "reload" when the service action is reload instead of restart is to be handled using dedicated plugins, as generic reload-detection logic in the file plugin is not possible

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s. You can activate automatic signing by running just prepare-dev once)
  • I ran just format as mentioned in CODING_GUIDELINES
  • I used just check as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@github-actions
Contributor

github-actions bot commented Jan 15, 2026

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
|-----------|-----------|------------|-------|--------|-------------------|
| 844       | 0         | 3          | 844   | 100    | 2h23m15.018229999s |

@albinsuresh albinsuresh marked this pull request as ready for review January 15, 2026 20:08
Contributor

@didier-wenzek didier-wenzek left a comment


Approved

didier-wenzek previously approved these changes Jan 19, 2026
@reubenmiller reubenmiller added the theme:configuration Theme: Configuration management label Jan 19, 2026
@reubenmiller
Contributor

reubenmiller commented Jan 19, 2026

A few missing pieces/open questions:

  1. Missing log statements that the service is being restarted. No logs were present in the tedge-agent logs nor the workflow log (/var/log/tedge/agent/*.log)
  2. What happens if "tedge-agent" is used in the service property? Does this break the workflow? If so, then we might need to use the background action to perform the restart of the tedge-agent

@albinsuresh albinsuresh marked this pull request as draft January 21, 2026 09:13
```toml
on_success = "apply"

[apply]
background_script = "sudo /usr/share/tedge/config-plugins/file apply ${.payload.type}"
```
Contributor Author

Even when we convert this step into a builtin task, this plugin apply command should be executed as a background task, irrespective of the type of the config and whether that type restarts tedge-agent or not. That'll simplify this workflow.

Contributor

Though the downside of doing the background task is that the "verify" state will be executed before the "apply" command is done.

Contributor Author

Since we have a workdir available (introduced in 55c2d76), how about using that in the apply state to write a "completion marker file" to that directory, which the verify phase can check for afterwards with a timeout? We just need to formalize the "completion marker file" contract.
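For illustration, a minimal sketch of what such a file-system contract could look like. The marker filename (`apply.done`), the `WORK_DIR` variable, the polling interval, and the timeout are all assumptions made up for this sketch, not a formalized contract:

```shell
#!/bin/sh
# Hypothetical sketch of a "completion marker file" contract via the workdir.
# WORK_DIR, the marker name (apply.done) and the timeout are illustrative only.

WORK_DIR="${WORK_DIR:-/tmp/config-workdir}"

# apply state: perform the (possibly service-restarting) work, then drop the marker
apply() {
    mkdir -p "$WORK_DIR"
    # ... restart the configured service here ...
    touch "$WORK_DIR/apply.done"
}

# verify state: poll for the completion marker, give up after a timeout (seconds)
verify() {
    remaining=${1:-30}
    while [ "$remaining" -gt 0 ]; do
        if [ -f "$WORK_DIR/apply.done" ]; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "apply did not complete within timeout" >&2
    return 1
}
```

The marker write must be the last step of `apply`, so its presence implies every prior step completed.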

Contributor

Though can't we just tell people not to restart the tedge-agent from within a plugin? And if they want to restart a service then pass back some data via the workflow state exchange mechanism (e.g. :::begin-tedge:::)?

Contributor Author

@albinsuresh albinsuresh Jan 23, 2026

And if they want to restart a service then pass back some data via the workflow state exchange mechanism (e.g. :::begin-tedge:::)?

Yeah, I had considered that option. But, these are the reasons why I tried to avoid this "restart request" exchange from the plugin to the agent via the command state:

  1. We'll have to introduce yet another state after apply for the agent to receive the restart request from the plugin, and then process it by restarting the agent. It feels like an optional state in the workflow only there for one specific case (the agent restart).
  2. We'll be exposing the workflow state update contract (:::begin-tedge::: and :::end-tedge:::) with the plugin, which was otherwise unaware of the workflow contracts. That's why I was looking for a simple file-system based contract via the workdir.
  3. Consistency, in the sense that they don't need to care if they are restarting the agent or any other service and don't need to do things differently for one vs the other. It would have been different if all service restarts were done by the agent. But since the agent running as tedge can't do that, it brings that inconsistency. Though, we can get around this problem by making the agent use tedgectl.

But, I agree that this workdir approach introduces yet another mechanism for metadata exchange between states, when the MQTT state is already available. I thought this would be less of a problem as the workdir would anyway be used for config backups and even other things like service restart detection using persisted metadata (pids). So, this would be just one additional thing to that list.

Contributor Author

Resolved by 7b83421

@didier-wenzek didier-wenzek dismissed their stale review January 22, 2026 10:06

We decided to explore a more flexible design, using workflows.

@albinsuresh albinsuresh marked this pull request as ready for review February 20, 2026 10:24
Contributor

@didier-wenzek didier-wenzek left a comment


I really like the new design for the communications between the workflow engine and an operation actor, and this ability to extend the set of atomic steps supported for an operation without touching the workflow engine.

Comment on lines +7 to +14
```sh
echo " list                                              List all config types supported by this plugin"
echo " get <type>                                        Print the config for the specified type to stdout"
echo " prepare <type> <new-config-path> --work-dir <dir> Prepare for configuration update (create backup)"
echo " set <type> <new-config-path>                      Update the config for the specified type from the new config path"
echo " apply <type> --work-dir <dir>                     Apply configuration (restart services)"
echo " verify <type> --work-dir <dir>                    Verify configuration was applied successfully"
echo " finalize <type> --work-dir <dir>                  Finalize configuration update (cleanup)"
echo " rollback <type> --work-dir <dir>                  Rollback configuration to previous state"
```
Contributor

I failed to find the documentation for all the sub-commands a configuration plugin has to implement, what is mandatory vs optional? Do we have such docs?

Contributor Author

All are mandatory subcommands. I'll add the missing docs.

That being said, we can make the subcommands like `prepare`, `apply`, `finalize`, `verify` and `rollback` optional, if we find those excessive and feel that most users wouldn't need them. Not for backwards-compatibility reasons, but purely for API simplicity. The backward-compatibility aspect is explained in this comment: #3931 (comment)
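To make the subcommand surface concrete, here is a minimal skeleton covering the full set. This is a sketch only: the `CONFIG_DIR` variable, the single hard-coded config type returned by `list`, and the no-op stubs are assumptions for illustration, not the real plugin shipped with thin-edge:

```shell
#!/bin/sh
# Hypothetical skeleton of a file-config plugin exposing all subcommands.
# CONFIG_DIR and the single hard-coded config type are illustrative only.

CONFIG_DIR="${CONFIG_DIR:-/etc/tedge}"

plugin() {
    cmd="$1"
    if [ "$#" -ge 1 ]; then shift; fi
    case "$cmd" in
        list)
            # One config type per line
            echo "tedge.toml" ;;
        get)
            cat "$CONFIG_DIR/$1" ;;
        set)
            cp "$2" "$CONFIG_DIR/$1" ;;
        prepare|apply|verify|finalize|rollback)
            # No-op stubs; real implementations would use --work-dir for state
            return 0 ;;
        *)
            echo "unknown subcommand: $cmd" >&2
            return 1 ;;
    esac
}
```

A real plugin would implement `prepare` (backup), `apply` (service restart) and `rollback` (restore) per config type instead of the stubs.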

```rust
    _workdir: &Utf8Path,
) -> Result<(), PluginError> {
    info!("Configuration update finalized successfully for {config_type}");
    // No-op kept for backward compatibility
```
Contributor

Wouldn't it be better for backward compatibility to make the prepare, finalize and rollback sub-commands optional? I.e. not mandating a plugin to provide a no-op implementation.

Contributor Author

Well, the backward compatibility is not in terms of the plugin API but in terms of the previous config update behaviour. The plugin API (the subcommands) doesn't yet have a backward compatibility requirement because we haven't made a release yet with the existing config plugin APIs (list, get and set). Only the next release will introduce these APIs to the users, and we can have the complete set (including prepare, apply and verify as well) included in the first release itself.

But in terms of behaviour, we're forced to stay consistent with the previous contract (without plugins) where there was no prepare or apply or finalize logic performed. Even the apply for the service restart is performed only for the users that opt into that new feature with the service = <service-name> config. I can add another config setting for them to opt into the backup/restore functionality as well.

Contributor

I think we're doing way too many steps here. We should aim at having a very minimal set of actions so that we don't overcomplicate things (even more).

Can we review if all of the states are 100% required? For example the following states come to mind:

  • verify (can't this be done in "set")
  • apply (can't this be done in "finalize"?)

Maybe having an example of the sequence (e.g. a mermaid.js flow diagram) to show the states would be useful to then try and cut out the extra stuff.

Contributor Author

Here is the diagram:

```mermaid
stateDiagram-v2
    [*] --> executing

    executing --> prepare
    prepare --> set : successful
    prepare --> failed : failed

    set --> apply : successful
    set --> failed : failed

    apply --> verify : successful
    apply --> rollback : failed

    verify --> finalize : successful
    verify --> rollback : failed

    finalize --> successful: successful
    finalize --> failed: failed
    rollback --> failed

    successful --> [*]
    failed --> [*]
```

Contributor Author

@albinsuresh albinsuresh Feb 23, 2026

Here are the options that I had considered to merge some states:

  1. Merge apply into set:
    Decided against that as the failure state transition is different for both steps. If set fails (the existing config file is not even replaced with the new one), we can move to failed straightaway, as we don't need to attempt a rollback. But if the failure happens during apply after the config file is set (replaced), then we must try a rollback.
  2. Combine apply and verify:
    Verify always felt like a standalone step as the logic would be independent from apply. Especially since this step needs to determine whether we await-agent-restart in the next state (if tedge-agent was restarted) or proceed directly to the finalize state (for other services).
  3. finalize into verify:
    Though finalize isn't very relevant for the use-cases that I had considered, I left it as an independent state, as a natural counterpart of rollback. Here we can short-circuit the transition from verify -> successful.

Contributor

@reubenmiller reubenmiller Feb 24, 2026

1. Maybe it is a naming thing as both "set" and "apply" are the same thing for a user. "I'm setting the configuration, and I'm applying the configuration to the device"...where in your example it is more doing some post actions afterwards...but still I'm not sure what value separating them brings.

2. In its current state, I'd say finalize is useless here. It seems that it is only executed on the happy-path...I'd be interested if the sm-plugins behave in a similar way, i.e. that the finalize command is never executed when a preceding error occurred.

Contributor Author

@albinsuresh albinsuresh Feb 25, 2026

With this proposal for handling agent self restart, I am now convinced that apply can be merged into set and finalize can be dropped as well. set/apply and verify must stay separate as there is the possibility of the agent restarting itself between them.

Contributor Author

Resolved by 04e66f7

Contributor

@albinsuresh I just realized something: since we're now creating the /etc/tedge/operations/config_update.toml file within some initialization (or on startup), we might have some subtle behaviour changes that we need to be aware of before merging, e.g.

  • How is the config_update.toml updated over time when we have workflow changes?
  • Do we allow users to have their own custom config_update.toml which overrides the default behavior (I don't think we should, but we should document this to prevent users from trying to do it)?

Comment on lines +25 to +32

```toml
[apply]
action = "builtin:config_update:apply"
on_success = "verify"

[verify]
action = "builtin:config_update:verify"
on_success = "finalize"
on_error = "rollback"
```
Contributor Author

@albinsuresh albinsuresh Feb 25, 2026

To handle tedge-agent self-restart during a config update, we'll introduce a restartAgent flag in the payload that must be set by the plugin executed in the apply state to request an agent restart. The apply subcommand of the plugins can just print :::begin-tedge:::\n {"restartAgent": true} \n :::end-tedge::: to the stdout.
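Sketched as a shell fragment, a plugin's apply subcommand would print something like the following. The function name and the surrounding apply logic are hypothetical; the `:::begin-tedge:::`/`:::end-tedge:::` delimiters are the existing workflow state-exchange markers:

```shell
#!/bin/sh
# Hypothetical apply subcommand requesting an agent restart via stdout.
apply_config() {
    # ... replace the target config file here ...
    # Emit the restart request using the workflow state-exchange delimiters
    printf '%s\n' ':::begin-tedge:::'
    printf '%s\n' '{"restartAgent": true}'
    printf '%s\n' ':::end-tedge:::'
}
```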

For handling this restart request we have the following alternatives:

Option A (Explicit states for agent restart)

```toml
[apply]
action = "builtin:config_update:apply"
on_success = "evaluate-agent-restart"
on_error = "rollback"

[evaluate-agent-restart]
condition = "${.payload.restartAgent}"
on_success = "restart-agent"
on_error = "verify"

[restart-agent]
action = "restart-agent"
on_exec = "await-agent-restart"

[await-agent-restart]
action = "await-agent-restart"
timeout_second = 30
on_timeout = "rollback"
on_success = "finalize"

[verify]
action = "builtin:config_update:verify"
on_success = "finalize"
on_error = "rollback"
```

The major additions would be the new built-in tasks: condition and restart-agent.

We can even combine both evaluate-agent-restart and restart-agent into a single action as follows:

```toml
[conditional-agent-restart]
action = "restart-agent"
input.condition = "${.payload.restartAgent}"
on_exec = "await-agent-restart"
on_error = "verify"
```

Option B (handle agent restart internally)

Here the workflow remains the same and the restart is handled internally by the agent in the apply and verify states.

```toml
[apply]
action = "builtin:config_update:apply"
on_success = "verify"
on_error = "rollback"

[verify]
action = "builtin:config_update:verify"
on_success = "finalize"
on_error = "rollback"
```

In the apply phase, after the plugin's apply command is executed, if the agent receives the restartAgent flag, then it will move the workflow to the verify state and then shut down itself. On restart, the workflow will continue from verify and the agent would check if a restart was requested (the restartAgent flag value) and deem it successful before performing the plugin's verify command.

Unlike a bg_script that may or may not have triggered an agent restart, which forced the addition of the await-agent-restart action with timeout, here we don't need that, as the agent is guaranteed to perform its self-shutdown in the apply phase if restartAgent flag was set.

I prefer Option B, as my PoV is that the agent restart is just internal business and hence those steps need not be exposed via overridable workflow steps; I don't see anyone realistically overriding the agent restart states at all.

Contributor

The tricky part with an agent restart is not the action itself (action = "restart-agent") but detecting that the restart has been effective and that the workflow can proceed. For that purpose there is an "await-agent-restart" action.

Contributor Author

@albinsuresh albinsuresh Feb 25, 2026

The tricky part with an agent restart is not the action itself (action = "restart-agent") but detecting that the restart has been effective and that the workflow can proceed. For that purpose there is an "await-agent-restart" action.

Yes. And that's why Option A is using await-agent-restart combined with the restart-agent action.

Unlike a bg_script that may or may not have triggered an agent restart, which forced the addition of the await-agent-restart action with timeout, here we don't need that, as the agent is guaranteed to perform its self-shutdown in the apply phase if restartAgent flag was set.

What I was trying to say here is that Option B doesn't need that awaiting logic, as the verify step is guaranteed to be executed only after the agent performs the shutdown in the apply state. Unless the agent fails to shut down itself and still proceeds to process the next command state update for verify, which I assume is not possible.

Contributor Author

@albinsuresh albinsuresh Feb 25, 2026

Further discussions revealed that Option B suffers from a race condition: the verify state transition from the apply state may not have reached the broker before the agent goes down. Hence, the assumption that the workflow would always resume from the verify state after the restart is broken. So, for now we'll proceed with a simpler version of Option A as follows:

```toml
[apply]
action = "builtin:config_update:apply"
on_success = "evaluate-agent-restart"
on_error = "rollback"

[evaluate-agent-restart]
script = "test ${.payload.restartAgent} = \"true\""
on_exit.0 = "restart-agent"
on_exit.1 = "verify"

[restart-agent]
action = "restart-agent"
on_exec = "await-agent-restart"

[await-agent-restart]
action = "await-agent-restart"
timeout_second = 30
on_timeout = "rollback"
on_success = "finalize"

[verify]
action = "builtin:config_update:verify"
on_success = "successful"
on_error = "rollback"
```
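The branching in evaluate-agent-restart relies purely on the exit code of `test`: exit 0 routes to restart-agent (on_exit.0), exit 1 routes to verify (on_exit.1). A tiny sketch of that predicate (the wrapper function is illustrative; only the `test` invocation comes from the workflow above):

```shell
#!/bin/sh
# Exit code of `test` drives the on_exit.0 / on_exit.1 workflow transitions:
# 0 when the flag is the string "true", 1 otherwise (including unset/empty).
evaluate_agent_restart() {
    test "$1" = "true"
}
```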

Contributor Author

Resolved by commits 18af86c and 7b83421

Comment on lines +43 to +44

```rust
#[error("A shutdown has been requested")]
Shutdown,
```
Contributor

This doesn't look like an error.

Contributor

Suggested change

```diff
-#[error("A shutdown has been requested")]
-Shutdown,
+#[error("A restart is required: not running latest version and configuration")]
+RestartRequired,
```

```rust
// The agent will be restarted by systemd and resume from the persisted state
log_file.log_info("Shutting down agent for restart").await;

return Err(RuntimeError::Shutdown);
```
Contributor

This might be working but this is a hack.

The proper way to request a shutdown is to send a RuntimeAction::Shutdown to the runtime.

Contributor Author

@albinsuresh albinsuresh Mar 5, 2026

I had tried that approach using RuntimeAction::Shutdown first. But the problem was that the runtime was treating it as a "graceful shutdown" and exiting with exit code 0, preventing the service from getting restarted. That's why I switched to this approach of throwing an error, similar to the approach that we took for handling self-restart when tedge is self-updated. In the software management code, we are throwing a RuntimeError::ActorError(Box::new(SoftwareManagerError::NotRunningLatestVersion)). It felt like an indirect way of requesting a shutdown. So, I thought of formalising it with a dedicated RuntimeError::Shutdown variant.

So the difference with both approaches in my mind is the following:

  • Graceful shutdown without restart: RuntimeAction::Shutdown
  • Shutdown to be restarted: RuntimeError::Shutdown

Contributor

Okay, it makes sense. Somehow we are sending a ConfigManagementError::NotUsingLatestConfiguration. But we aim here to be less specific with an agent restart action.

As RuntimeError::Shutdown really doesn't sound as an error, what about RuntimeError::RestartRequired?

Contributor Author

Resolved by b1e3edb

Contributor

@reubenmiller reubenmiller left a comment

I ran into an issue when trying to trigger a tedge-agent restart after applying a new tedge.toml file. The operation fails for some reason; at least this line was found in the operation log:

ERROR: builtin action 'builtin:config_update:verify' failed: No builtin operation step handler registered for config_update operation verify step

Below is tedge-configuration-plugin.toml that I configured in my test:

file: /etc/tedge/plugins/tedge-configuration-plugin.toml

```toml
files = [
    { path = '/etc/tedge/system.toml', type = 'system.toml', service = 'tedge-agent' },
    { path = '/etc/tedge/tedge.toml', type = 'tedge.toml', service = 'tedge-agent' },
    { path = '/etc/lighttpd/lighttpd.conf', type = 'lighttpd-conf', service = 'lighttpd', service_action = 'restart' }
]
```

And the full operation log of the failed operation:
TST_actualize_shiny_bend_workflow-config_update-c8y-mapper-53005959.log

@albinsuresh
Contributor Author

tedge-configuration-plugin.toml

The issue was caused by using a tedge.toml that was intentionally or unintentionally disabling the config management feature of the tedge-agent with the following setting in it:

```toml
[agent.enable]
config_snapshot = false
```

So, the config manager actor did the first part of the config update until the agent restart, but could not resume it post-restart as the actor itself is not present.

That single config_snapshot flag determines whether the config manager actor is loaded or not. We should fix that logic to check both config_snapshot && config_update.

@albinsuresh albinsuresh temporarily deployed to Test Pull Request March 12, 2026 09:49 — with GitHub Actions Inactive
@albinsuresh
Contributor Author


Resolved by 945642f

... tedge.toml
... tedge-log-plugin

Config update restarts service if configured in plugin configuration
Contributor Author

To be removed, as I can't remember why I added this back then and the Set Configuration Should Restart Service test above covers what was intended anyway.


Labels

theme:configuration Theme: Configuration management


3 participants