Collector issue regarding FDs not being closed when under load #77

What is the problem in detail?
In certain clusters where a high volume of logs is generated (e.g. some seeds), the OpenTelemetry Collector instance that acts as a log shipper on the Shoot nodes enters what appears to be a deadlock in its goroutines.

This causes it to stop responding to Prometheus scrapes. That is only the most visible symptom, because it triggers alerts.
The more serious problem is that the collector starts failing to send logs to the control plane.

The root cause is still unknown, but the following patterns have been identified:

- The issue has only been observed on larger clusters, which produce a higher log volume.
- When the issue begins, the collector stops closing the FDs of files that have already been deleted (e.g. during log rotation):
```
# lsof -p 1510580   | awk 'NR < 2 || /deleted/'
 COMMAND       PID USER   FD      TYPE             DEVICE  SIZE/OFF      NODE NAME
 opentelem 1510580 root    7r      REG              259,3 107330861   1180216 /var/log/pods/kube-system_calico-node-2c642_2ba9fe21-cbf8-477f-8aaf-6e615ec2f8e8/calico-node/0.log.20260305-174620 (deleted)
 opentelem 1510580 root   16r      REG              259,3     22246   1312585 /var/log/pods/kube-system_egress-filter-applier-54cww_03314a85-89d6-4393-89fc-20a4164b8d97/egress-filter-applier/0.log (deleted)
```
- There is a large number of CLOSE_WAIT connections, one for every scrape that Prometheus attempts. They are easy to see with any tool that lists open connections (e.g. `ss`).
- Manually running `curl -v -X GET http://127.0.0.1:18888/metrics` while SSH-ed into an affected node causes the request to hang and leaves another connection in CLOSE_WAIT.
- Closing the connections with `ss --tcp state CLOSE-WAIT --kill` does not resolve the issue.
- Restarting the collector fixes the issue (at least temporarily).
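The two observable symptoms above (FDs held on deleted files and accumulating CLOSE_WAIT sockets) could be tracked over time with a small script. The following is only a sketch: the function names are made up for illustration, it assumes a Linux node with `/proc` and an `ss` from iproute2 that supports `--no-header`, and the "(deleted)" suffix match is the same heuristic as the `lsof | awk` filter, so a file literally named that way would be miscounted.

```shell
#!/usr/bin/env bash
# Diagnostic sketch (hypothetical helper names, not part of the collector).

# count_deleted_fds PID
# Prints how many file descriptors of PID point at an unlinked file.
# The kernel appends " (deleted)" to the symlink target in /proc/PID/fd.
count_deleted_fds() {
  local pid="$1" fd target count=0
  for fd in "/proc/${pid}/fd"/*; do
    # The FD may vanish between globbing and readlink; skip it in that case.
    target="$(readlink "$fd" 2>/dev/null)" || continue
    case "$target" in
      *" (deleted)") count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}

# count_close_wait PORT
# Prints the number of TCP sockets in CLOSE_WAIT whose local port is PORT.
count_close_wait() {
  local port="$1"
  ss --no-header --tcp --numeric state close-wait "( sport = :${port} )" | wc -l
}
```

For the process and port from this issue, `count_deleted_fds 1510580` and `count_close_wait 18888` would give the two numbers; if both keep growing between runs, the node is likely in the stuck state.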
Unfortunately, we have not yet managed to reproduce the issue manually, which makes debugging and fixing it harder.
One possible temporary workaround is to build a check into the collector's systemd unit that calls the /metrics endpoint and restarts the unit if the request times out.
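A rough sketch of what such a check might look like is shown below. It is not an agreed design: the function names, the unit name used in the usage note, and the timeout value are assumptions, and only the endpoint `http://127.0.0.1:18888/metrics` comes from this issue.

```shell
#!/usr/bin/env bash
# Watchdog sketch for the proposed workaround (hypothetical helper names).

# probe_metrics URL TIMEOUT_SECONDS
# Returns 0 if the endpoint answers successfully within the timeout,
# non-zero if the request hangs, is refused, or returns an HTTP error.
probe_metrics() {
  local url="$1" timeout="$2"
  curl --silent --fail --output /dev/null --max-time "$timeout" "$url"
}

# restart_if_unhealthy URL TIMEOUT_SECONDS UNIT
# Restarts UNIT via systemctl when the probe fails; does nothing otherwise.
restart_if_unhealthy() {
  local url="$1" timeout="$2" unit="$3"
  if ! probe_metrics "$url" "$timeout"; then
    echo "metrics probe failed for ${url}, restarting ${unit}" >&2
    systemctl restart "$unit"
  fi
}
```

It could be run periodically, for example from a systemd timer, as `restart_if_unhealthy http://127.0.0.1:18888/metrics 5 otel-collector.service`, where the unit name and the 5 second timeout are placeholders. Since a restart only helps temporarily, this would mask the deadlock rather than fix it, so logging each restart is useful for judging how often the issue occurs.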


