[FEATURE] Valitail & Node Collector Restart Mechanism #83

@rrhubenov

Make sure that the systemd units of valitail & the opentelemetry collector have a restart mechanism in place.

  • Research how systemd units are configured for health checks
  • Research what health checks are available for valitail & the opentelemetry collector
  • Apply health checks to the systemd units of both components
  • Review all other systemd units that are created via the OSConfig and find those that don't have health checks & automatic restarts set up
  • Open an issue & tag the relevant component owners
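
For the review of the other units, a small script could flag unit files whose [Service] section does not set a restart policy. This is only a rough sketch: the unit names and contents in the usage example are made up, drop-in overrides are not considered, and it checks only `Restart=` (systemd's default when the directive is absent is `no`), not health checks.

```python
from typing import Dict, List


def restart_policy(unit_text: str) -> str:
    """Return the Restart= value from a unit's [Service] section.

    Falls back to "no", which is what systemd uses when Restart= is absent.
    An empty "Restart=" is also treated as unset here (an assumption of
    this sketch).
    """
    section = None
    policy = "no"
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "Restart":
                policy = value.strip() or "no"  # later assignments win
    return policy


def units_without_restart(units: Dict[str, str]) -> List[str]:
    """Return the names of units whose effective policy is "no", sorted."""
    return sorted(
        name for name, text in units.items() if restart_policy(text) == "no"
    )


# Hypothetical unit contents, for illustration only.
example_units = {
    "example-a.service": "[Service]\nExecStart=/bin/a\nRestart=always\n",
    "example-b.service": "[Service]\nExecStart=/bin/b\n",
}
flagged = units_without_restart(example_units)
```

In this example only `example-b.service` ends up in `flagged`. In practice the unit texts would have to be extracted from the rendered OSConfig, which this sketch does not cover.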

NOTE:
There's a possibility that valitail does not have any health checks that systemd would be able to call.
If the collector has a health check that can be configured, use that.

A timeout on an endpoint could also serve as a health check. For example, if the /metrics endpoint on the otel collector does not respond within 5 seconds, restart the container. This should be easily configurable via the systemd unit.
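
The timeout idea could be prototyped with a small probe like the one below. It is a sketch under assumptions: the default URL `http://127.0.0.1:8888/metrics` is a guess at where the collector's metrics endpoint listens and has to be checked against the actual configuration, and the names `endpoint_healthy` and `main` are made up for illustration. How the exit code gets wired into the systemd unit (a timer, a watchdog loop, or something else) is left open.

```python
import http.client
import urllib.request
from typing import List

# Assumed address of the collector's metrics endpoint; adjust to the real config.
DEFAULT_URL = "http://127.0.0.1:8888/metrics"

# The target is local, so bypass any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def endpoint_healthy(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if `url` answers with an HTTP 2xx status.

    `timeout_s` applies to each blocking socket operation, so it only
    approximates an overall deadline.
    """
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, HTTP error statuses
        # and malformed responses.
        return False


def main(argv: List[str]) -> int:
    """Return 0 if the endpoint is healthy, 1 otherwise (usable as exit code)."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if endpoint_healthy(url) else 1
```

A caller could run `sys.exit(main(sys.argv))` and trigger a restart of the unit on a non-zero exit code.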

Update (26.05.2026):

  • The PR "Add validation for auth-token file for valitail and otel-col systemd units" (gardener#14898) has been merged. It introduces a temporary solution for a missing auth token for both valitail and opentelemetry-collector.
  • The root cause of the issue is a synchronization problem: the auth token required by the logging agents is added after the agents are created, so they enter a failed state that requires a restart. The underlying problem is that the bearertokenauthextension does not implement a file-watcher mechanism, so it cannot detect when the token file becomes available and reload automatically.
  • More importantly, the Collector currently has no reliable way to transition into a failed/unhealthy state when one of its internal components fails; it only logs the error and continues running. The healthcheckv2extension is intended to address this limitation, but it is still under development and adoption across Collector components is limited. In our case, bearertokenauthextension does not integrate with it yet. An example of the expected integration pattern can be seen here.
  • We should consider writing custom code to cover these limitations in our opentelemetry-collector distribution.
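
To illustrate the kind of file-watching behaviour described above, here is a minimal polling sketch. It is not based on the bearertokenauthextension code or its API; the class name `TokenFileWatcher` and its `poll` method are hypothetical, and a real implementation in our distribution would presumably be written in Go. The point is only the pattern: tolerate a missing or empty token file, and pick the token up once it appears or changes.

```python
import os
from typing import Optional, Tuple


class TokenFileWatcher:
    """Polls a token file and reports the token when it appears or changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        # (mtime_ns, size) of the last file version that yielded a token.
        self._last_sig: Optional[Tuple[int, int]] = None

    def poll(self) -> Optional[str]:
        """Return the token if the file is new or changed, otherwise None."""
        try:
            st = os.stat(self.path)
            sig = (st.st_mtime_ns, st.st_size)
            if sig == self._last_sig:
                return None  # nothing changed since the last successful read
            with open(self.path, encoding="utf-8") as f:
                token = f.read().strip()
        except FileNotFoundError:
            # File not there (yet); forget the old signature so that a
            # re-created file is picked up again.
            self._last_sig = None
            return None
        if not token:
            # File exists but is empty; do not record the signature,
            # so the next poll retries.
            return None
        self._last_sig = sig
        return token
```

A caller would invoke `poll()` periodically and reconfigure its authenticator whenever a token is returned. An event-based watcher would avoid the polling interval, but the handling of the "file does not exist yet" case stays the same.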

Upstream:

Update (28.05.2026):

Labels: kind/enhancement (Enhancement, improvement, extension)
