What is the problem in detail?
In certain clusters where a higher load of logs is generated (e.g. some seeds), the OpenTelemetry Collector instance that acts as a log shipper on the Shoot nodes enters what looks like a deadlock state in its goroutines.
This causes it to stop responding to Prometheus scrapes, although that is only the most visible symptom because it triggers alerts.
A bigger issue is that the collector starts failing to send logs to the control plane.
The root cause is still unknown, but the following patterns have been identified:
The issue has only been observed on larger clusters, which produce a higher volume of logs
When the issue begins, the collector stops closing file descriptors (FDs) of files that have already been deleted (e.g. during log rotation):
# lsof -p 1510580 | awk 'NR < 2 || /deleted/'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
opentelem 1510580 root 7r REG 259,3 107330861 1180216 /var/log/pods/kube-system_calico-node-2c642_2ba9fe21-cbf8-477f-8aaf-6e615ec2f8e8/calico-node/0.log.20260305-174620 (deleted)
opentelem 1510580 root 16r REG 259,3 22246 1312585 /var/log/pods/kube-system_egress-filter-applier-54cww_03314a85-89d6-4393-89fc-20a4164b8d97/egress-filter-applier/0.log (deleted)
There is a large number of CLOSE_WAIT connections, one for every scrape that Prometheus attempts. This is easy to see with a tool that lists open connections (e.g. ss).
Manually calling curl -v -X GET http://127.0.0.1:18888/metrics while SSH-ed into a problematic node results in the request hanging and another CLOSE_WAIT connection being left open.
Closing the connections with ss --tcp state CLOSE-WAIT --kill does not resolve the issue
Restarting the collector fixes the issue (at least temporarily)
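The two observable symptoms above (deleted files still held open, and CLOSE_WAIT sockets piling up on the metrics port) could be collected programmatically on a node. The following is only a minimal sketch, assuming a Linux host with /proc mounted; the function names `deleted_open_files` and `count_close_wait` are hypothetical helpers, not part of the collector or of any existing tooling. Note that /proc/net/tcp only lists IPv4 sockets.

```python
import os

CLOSE_WAIT = "08"  # TCP state code for CLOSE_WAIT as printed in /proc/net/tcp


def deleted_open_files(pid, proc_root="/proc"):
    """Return (fd, target) pairs for descriptors of `pid` whose file was deleted."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    for name in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        # The kernel appends " (deleted)" to the link target of unlinked files.
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return found


def count_close_wait(proc_net_tcp_text, local_port):
    """Count CLOSE_WAIT sockets on `local_port` in the content of /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is "HEXADDR:HEXPORT" of the local end, fields[3] is the state.
        port_hex = fields[1].rsplit(":", 1)[1]
        if fields[3] == CLOSE_WAIT and int(port_hex, 16) == local_port:
            count += 1
    return count
```

On an affected node this might be used as `deleted_open_files(1510580)` together with `count_close_wait(open("/proc/net/tcp").read(), 18888)`, run as root; steadily growing numbers from either call would match the pattern described above.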
So far we have not managed to reproduce the issue manually, which makes debugging and fixing it harder.
One possible temporary workaround is to build a check into the collector's systemd unit that calls the /metrics endpoint and restarts the unit if the request times out.
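A rough sketch of what such a check could look like is shown here. It is not the actual implementation: the unit name `opentelemetry-collector.service`, the 5-second timeout, and the function names are assumptions made for illustration, and only the endpoint URL is taken from the report above. The timeout applies to each socket operation, which is enough to detect a collector that accepts the connection but never answers.

```python
import http.client
import subprocess
import urllib.request

METRICS_URL = "http://127.0.0.1:18888/metrics"  # endpoint from the report
UNIT_NAME = "opentelemetry-collector.service"   # hypothetical unit name
TIMEOUT_SECONDS = 5                              # assumed threshold


def metrics_endpoint_healthy(url=METRICS_URL, timeout=TIMEOUT_SECONDS):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1)  # make sure a body actually starts arriving
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, socket timeouts and refused connections derive from OSError.
        return False


def _systemctl_restart(unit):
    """Restart the given systemd unit (requires sufficient privileges)."""
    subprocess.run(["systemctl", "restart", unit], check=True)


def check_and_restart(url=METRICS_URL, unit=UNIT_NAME,
                      timeout=TIMEOUT_SECONDS, restart=None):
    """Probe the endpoint once and restart the unit if the probe fails."""
    healthy = metrics_endpoint_healthy(url, timeout)
    if not healthy:
        (restart or _systemctl_restart)(unit)
    return healthy
```

A script that calls `check_and_restart()` could then be run periodically, for example from a systemd timer. The `restart` parameter exists only so the restart action can be swapped out when trying the check without touching the real unit.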