Make sure that the systemd units of valitail & the opentelemetry collector have a restart mechanism in place.
Research how systemd units are configured for healthchecks
Research what health checks are available for valitail & the opentelemetry collector
Apply healthchecks to systemd units for both components
Review all other systemd units that are created via the OSConfig and find any that don't have health checks and automatic restarts set up
Open issue & tag relevant component owners
NOTE:
There's a possibility that valitail does not have any health checks that systemd would be able to call.
If the collector has a health check that can be configured, use that.
Also, a form of healthcheck might be a timeout on an endpoint, e.g. the /metrics endpoint on the otel collector: if a timeout of 5 seconds is reached, restart the container. This should be easily configurable via the systemd unit.
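The timeout-based check described in the note could be sketched roughly as follows. This is only an illustration in Python, not something taken from valitail, the collector, or the OSConfig: the function names `endpoint_is_healthy` and `probe_exit_code`, the default URL, and port 8888 are assumptions and would need to match the actual collector configuration.

```python
import http.client
import urllib.request

# Assumption: the collector exposes its own metrics on this address.
# Adjust to whatever the deployed configuration actually uses.
DEFAULT_METRICS_URL = "http://127.0.0.1:8888/metrics"


def endpoint_is_healthy(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status within the timeout.

    Note: the timeout applies to each blocking socket operation
    (connect, read), not to the request as a whole.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, HTTP error statuses
        # (urllib.error.HTTPError is an OSError subclass) and broken replies.
        return False


def probe_exit_code(url: str = DEFAULT_METRICS_URL) -> int:
    """Map the probe result to a process exit code: 0 healthy, 1 unhealthy."""
    return 0 if endpoint_is_healthy(url) else 1
```

A wrapper script could pass `probe_exit_code()` to `sys.exit` so that whatever invokes the probe (for example a timer-driven helper unit) can react to a non-zero exit code by restarting the affected unit. How this would be wired into the units generated by the OSConfig is left open here.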
The root cause of the issue is a synchronization problem: the auth token required by the logging agents is added after the agents are created, causing them to enter a failed state that requires a restart. The underlying problem is that the bearertokenauthextension does not implement a file-watcher mechanism, so it cannot detect when the token file becomes available and reload automatically. More importantly, the Collector currently has no reliable way to transition into a failed/unhealthy state when one of its internal components fails; it only logs the error and continues running. The healthcheckv2extension is intended to address this limitation, but it is still under development and adoption across Collector components is limited. In our case, bearertokenauthextension does not integrate with it yet. An example of the expected integration pattern can be seen here. We should consider writing custom code in our opentelemetry-collector distribution to work around these limitations.
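To make the missing file-watcher behaviour concrete, here is a minimal polling sketch. It is written in Python purely for illustration and does not reflect the bearertokenauthextension code or any Collector API; the class name `TokenFileWatcher` and its `poll` method are hypothetical.

```python
import os
from typing import Optional, Tuple


class TokenFileWatcher:
    """Hypothetical helper: reports a token whenever the file appears or changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        # (mtime_ns, size) of the last version that was loaded successfully.
        self._last_signature: Optional[Tuple[int, int]] = None

    def poll(self) -> Optional[str]:
        """Return the new token if the file is new or changed, else None."""
        try:
            stat = os.stat(self.path)
            signature = (stat.st_mtime_ns, stat.st_size)
            if signature == self._last_signature:
                return None  # nothing changed since the last successful load
            with open(self.path, encoding="utf-8") as handle:
                token = handle.read().strip()
        except FileNotFoundError:
            return None  # token not provisioned yet; try again on the next poll
        if not token:
            return None  # file exists but is still empty
        self._last_signature = signature
        return token
```

A caller would invoke `poll()` on an interval and swap in the returned token when it is not `None`, so an agent started before the token exists could pick it up later without a restart. A real implementation in the collector distribution would more likely use file-system notifications than polling.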
Update (26.05.2026):
Upstream:
Update: 28.05.2026
OpenTelemetry Collector if /metrics does not respond gardener#14928