
[Windows Only] Agent v5+ APPCRASH (Heap Corruption 0xc0000374) and TCP Socket Hang with Upstream Swarm #11798

@Kruzerson

1. Environment

OS: Windows Server (This issue is exclusively observed on Windows)

Fluent Bit Agent: v5.x (Tested on 5.0.3+)

Fluent Bit Server (Upstream): v5.0.5 running behind a Docker Swarm VIP

Use Case: Heavy multiline application logs buffering to filesystem and forwarding to a central aggregator.

2. Description

There is a critical Windows-specific stability issue in the v5+ branch. The agent exhibits two distinct failure modes under moderate-to-heavy load:

Issue A: Heap Corruption (APPCRASH)
The fluent-bit.exe service crashes silently. The Windows Event Viewer registers an APPCRASH event with exception code 0xc0000374 (Heap Corruption) in ntdll.dll. This is typically triggered when processing large multiline log blocks using the tail plugin combined with storage.type filesystem. When the crash occurs, in-flight memory chunks are lost.

Issue B: TCP Socket Hang (Deadlock)
When the upstream aggregator container (Docker Swarm) is restarted, the Windows agent stops transmitting logs entirely. The forward output plugin fails to detect the dropped connection (likely because no TCP RST arrives from Swarm), keeps the dead socket open, and waits indefinitely for an ACK. Neither net.io_timeout nor net.keepalive_idle_timeout takes effect. Log flow resumes only after the Windows service is restarted manually.
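
For anyone trying to reproduce Issue B without a Swarm cluster, a silent upstream can be approximated locally. The sketch below is a plain Python stand-in, not Fluent Bit code and not the forward protocol: the function names are hypothetical, the server accepts connections but never reads or replies (no ACK, no FIN, no RST), and the client shows what an enforced I/O timeout would be expected to look like from the sending side.

```python
import socket
import threading


def start_silent_upstream(host="127.0.0.1"):
    """Start a TCP server that accepts connections but never reads or replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(5)
    held = []  # keep accepted sockets referenced so they are never closed

    def serve():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # no ACK, no FIN, no RST: the peer just goes quiet

    threading.Thread(target=serve, daemon=True).start()
    return server, server.getsockname()[1]


def send_and_wait_for_ack(host, port, payload, io_timeout):
    """Send payload, then wait at most io_timeout seconds for any reply."""
    with socket.create_connection((host, port), timeout=io_timeout) as sock:
        sock.sendall(payload)
        try:
            data = sock.recv(1024)
        except socket.timeout:
            return "timeout"  # the outcome an enforced I/O timeout should produce
        return "ack" if data else "closed"
```

Pointing the agent's forward output at the port returned by start_silent_upstream() may help to check whether net.io_timeout fires against a peer that simply goes quiet. This is only an approximation of what the Swarm VIP does on a container restart.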

3. Agent Configuration to Reproduce

YAML

service:
  flush: 5
  storage.type: filesystem
  storage.path: "C:/fluent-bit/buffer/"
  storage.sync: normal
  storage.backlog.mem_limit: 50M
  storage.max_chunks_up: 256
  
pipeline:
  inputs:
    - name: tail
      path: "C:/AppLogs/*/*/*/*.log"
      tag: app.logs
      multiline.parser: custom_multiline
      read_from_head: false
      db: "C:/fluent-bit/buffer/tail.db"
      buffer_chunk_size: 5M
      buffer_max_size: 10M

  outputs:
    - name: forward
      match: "*"
      host: 10.0.0.100
      port: 24224
      require_ack_response: true
      retry_limit: no_retries
      net.connect_timeout: 5s
      net.io_timeout: 15s
      net.keepalive: on
      net.keepalive_idle_timeout: 10s

4. Expected Behavior

The tail plugin on Windows must handle large multiline logs without corrupting the heap memory (0xc0000374).

The forward plugin must strictly enforce net.io_timeout, forcefully close dead sockets when the upstream server goes away silently, and autonomously re-establish the connection without requiring a manual service restart.
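
To make the expected recovery behavior concrete, here is a minimal Python sketch of the pattern: enforce the I/O timeout, close the dead socket, and open a fresh connection. It is an illustration only and assumes nothing about Fluent Bit internals. The helper names are hypothetical, and the test upstream is silent on the first connection and answers on later ones. Re-sending the same payload is a simplification for the demo; with retry_limit: no_retries the agent would presumably drop the failed chunk and use the new connection for the next one.

```python
import socket
import threading


def start_flaky_upstream(host="127.0.0.1"):
    """Hypothetical upstream: silent on the first connection, ACKs on later ones."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(5)
    held = []  # keeps the silent connection open so no FIN/RST is sent

    def serve():
        count = 0
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            count += 1
            if count == 1:
                held.append(conn)  # simulate the dead socket: never answer
            else:
                conn.recv(1024)
                conn.sendall(b"ack")
                conn.close()

    threading.Thread(target=serve, daemon=True).start()
    return server, server.getsockname()[1]


def deliver(host, port, payload, io_timeout, max_attempts=3):
    """Return the attempt number that was ACKed, or None if every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            # The timeout applies to connect, send, and recv on this socket.
            with socket.create_connection((host, port), timeout=io_timeout) as sock:
                sock.sendall(payload)
                if sock.recv(1024):
                    return attempt
        except OSError:
            pass  # timeout, reset, or refused; `with` has already closed the socket
    return None
```

Against the flaky upstream above, the first attempt times out and the second succeeds on a new connection, with no restart of the sending process.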
