Bug Report
Describe the bug
When running Fluent Bit on machines that are tailing high volume logs and gathering Windows metrics to write to OpenTelemetry endpoints, I intermittently see socket failures in other processes. The machine will have 10,000+ socket pair connections from Fluent Bit in TIME_WAIT and new socket creation will fail. I've attempted to mitigate this by increasing the range of dynamic ports and reducing TIME_WAIT to the minimum allowed value, but while looking into the issue it seems like this might be solvable on modern Windows platforms by using AF_UNIX.
The version of libevent included with monkey is ten years old. flb_pipe_create calls evutil_socketpair, which calls evutil_ersatz_socketpair_ on Windows. Windows 10+ added AF_UNIX and libevent added evutil_win_socketpair_afunix to support it. From the Windows AF_UNIX documentation, it sounds like this does IPC over pathnames and would not use two dynamic ports per event pipe, which should eliminate socket exhaustion from self-socketpairs.
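To illustrate why each emulated socketpair costs two dynamic ports, here is a minimal Python sketch of the loopback-TCP technique. It is not libevent's code, only my understanding of the general approach behind evutil_ersatz_socketpair_, and the helper names `loopback_tcp_socketpair` and `ports_used` are hypothetical.

```python
import socket


def loopback_tcp_socketpair():
    """Emulate a socketpair with a loopback TCP connection.

    A temporary listener is bound to 127.0.0.1 on an OS-assigned port,
    a client connects to it, and the accepted socket becomes the other end.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            client.connect(listener.getsockname())
            server, _ = listener.accept()
        except OSError:
            client.close()
            raise
    finally:
        # The listener is only needed to establish the connection.
        listener.close()
    return client, server


def ports_used(pair):
    """Return the set of local TCP ports held by the two ends of the pair."""
    a, b = pair
    return {a.getsockname()[1], b.getsockname()[1]}
```

Calling `ports_used(loopback_tcp_socketpair())` returns two distinct port numbers: one for the connecting end and one for the accepted end. A pair created over AF_UNIX has no TCP ports at all, which is why it should not contribute to port exhaustion or TIME_WAIT buildup.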
To Reproduce
- Create a tail input on multiple files that receive high volume with at least one processor (record_modifier worked for me).
- Create a Windows exporter metrics input.
- Create an OpenTelemetry output.
- Run a program that creates TCP connections using dynamic ports (I used Python and urllib3 without connection pooling to put pressure on the TCP/IP stack to reproduce in a test environment).
- When a failure occurs, use netstat to grab all TCP connections on the machine. The majority of ports will be used for 127.0.0.1 socket pairs from Fluent Bit in a TIME_WAIT status.
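The netstat check in the last step can be summarized with a small helper. This is a sketch that assumes the usual column layout of Windows `netstat -ano -p tcp` output (Proto, Local Address, Foreign Address, State, PID); the function name `count_loopback_states` is hypothetical.

```python
from collections import Counter


def count_loopback_states(netstat_output):
    """Count TCP states for connections where both ends are 127.0.0.1.

    Assumes rows shaped like Windows `netstat -ano -p tcp` output:
    Proto, Local Address, Foreign Address, State, PID.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banners, column headers, and anything that is not a TCP row.
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if local.startswith("127.0.0.1:") and remote.startswith("127.0.0.1:"):
            counts[state] += 1
    return counts
```

Save the netstat output to a file, read it, and pass the text to `count_loopback_states`; the `TIME_WAIT` entry of the result is the number to watch. Note that netstat typically reports TIME_WAIT rows with PID 0, so they may need to be attributed to Fluent Bit by comparing counts with and without it running rather than by PID.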
Expected behavior
Fluent Bit's event processing should not have an outsized negative impact on the TCP/IP stack of the system that it runs on.
Your Environment
- Version used: 5.0.3
- Configuration: tail, windows_exporter_metrics, record_modifier, opentelemetry_envelope, opentelemetry
- Environment name and version (e.g. Kubernetes? What version?): N/A
- Server type and version: N/A
- Operating System and version: Windows 10 Pro
- Filters and plugins: N/A
Additional context
I use Fluent Bit for telemetry aggregation on machines that make many external TCP connections. Having to turn Fluent Bit off to avoid infrastructure failures makes monitoring these machines difficult.