in_tail: SIGSEGV in flb_lib_worker (flb_lib.c:909) persists on 5.0.8 under HTTP/OSIS 408 + circuit-breaker backpressure (follow-up to #11954) #12009

Description

@cameronattard

Summary

Follow-up to #11954 (filing as a separate issue with our configuration, as requested by @cosmo0920).

We still hit the flb_lib_worker SIGSEGV on 5.0.8 — the release that contains the #11956 timer-lifecycle fix. Same stack trace as #11954. It reproduces on a Kubernetes DaemonSet under sustained HTTP-output backpressure (AWS OpenSearch Ingestion / OSIS returning 408 → circuit breaker opens), with tail inputs using threaded on + filesystem storage. (#11954 attributes the crash to the tail input; our stack trace is identical — though we have not ourselves tested with the tail input removed.)

Version / environment

Sequence observed

  1. The http output (OSIS) begins returning 408 and the circuit breaker opens:
    [error] [output:http:http.1] <pipeline>.us-west-2.osis.amazonaws.com:443, HTTP status=408
    Circuit breaker is open. Unable to write to buffer.
    [error] [output:http:http.1] <pipeline>.us-west-2.osis.amazonaws.com:443, HTTP status=408
    Circuit breaker is open. Unable to write to buffer.
    
  2. ~1 minute later, SIGSEGV (same frame as in #11954, "SIGSEGV in flb_lib_worker (flb_lib.c:909) after upgrade to 5.0.6/5.0.7 with tail input"; the binary is stripped, so only flb_lib_worker is symbolized):
    [engine] caught signal (SIGSEGV)
    #0  0x...  in  flb_lib_worker() at src/flb_lib.c:909
    #1  0x...  in  ???() at ???:0
    #2  0x...  in  ???() at ???:0
    
  3. After the automatic restart, the s3 filesystem buffer size counter underflows:
    [error] [output:s3:s3.2] Buffer is full: current_buffer_size=18446744073549369344, store_dir_limit_size=10000000000 bytes
    
    18446744073549369344 = 2^64 − 160182272, i.e. about 160 MB short of 2^64: an unsigned underflow. The S3 output then refuses all buffering (Could not buffer chunk), chunks back up, and the pod goes NotReady (health check returns 500). This secondary state clears on a manual pod delete (store_dir is re-scanned on startup). It appears to be a downstream consequence of the unclean shutdown from the crash rather than an independent issue.
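
For anyone checking their own logs for the same secondary state, here is a minimal sketch of the underflow arithmetic from step 3. It is plain Python that emulates unsigned 64-bit wraparound; it is not taken from the Fluent Bit source. The helper names (`wrap_u64`, `deficit_below_zero`, `looks_underflowed`) and the "greater than 2^63" heuristic are our own assumptions for illustration.

```python
import re

U64_MODULUS = 2**64  # unsigned 64-bit values wrap modulo 2^64

# Log line copied from step 3 of the sequence above.
LOG_LINE = (
    "[error] [output:s3:s3.2] Buffer is full: "
    "current_buffer_size=18446744073549369344, "
    "store_dir_limit_size=10000000000 bytes"
)


def wrap_u64(value: int) -> int:
    """Emulate storing a (possibly negative) integer in an unsigned 64-bit counter."""
    return value % U64_MODULUS


def deficit_below_zero(logged_value: int) -> int:
    """How far below zero the counter went, assuming it wrapped exactly once."""
    return U64_MODULUS - logged_value


def looks_underflowed(current: int) -> bool:
    """Hypothetical heuristic: no real buffer is larger than 2^63 bytes."""
    return current > 2**63


def parse_sizes(line: str) -> tuple[int, int]:
    """Extract (current_buffer_size, store_dir_limit_size) from the s3 error line."""
    match = re.search(
        r"current_buffer_size=(\d+), store_dir_limit_size=(\d+)", line
    )
    if match is None:
        raise ValueError("line does not contain the expected size fields")
    return int(match.group(1)), int(match.group(2))


current, limit = parse_sizes(LOG_LINE)
deficit = deficit_below_zero(current)

print(f"underflow suspected: {looks_underflowed(current)}")
print(f"counter is {deficit} bytes below zero (~{deficit / 1e6:.0f} MB)")
# Subtracting that many bytes from an empty (zero) counter reproduces the logged value.
print(f"wrap_u64(0 - deficit) == logged value: {wrap_u64(0 - deficit) == current}")
```

Because the wrapped value is far larger than store_dir_limit_size, any "is the buffer full" comparison against the limit would stay true until the counter is recomputed, which is consistent with the state clearing only after a pod delete.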

Configuration (sanitized)

Minimal subset exercising the crash path (filesystem storage + a threaded tail input + a backpressuring HTTP output):

[SERVICE]
    Flush 1
    storage.path /var/log/fluent-bit/
    storage.max_chunks_up 128
    storage.backlog.mem_limit 100M

# tail input — threaded + filesystem-backed (the crash is in this path; Inotify_Watcher at default On)
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    multiline.parser cri
    Read_From_Head True
    DB /var/log/fluent-bit/tail-containers.db
    storage.type filesystem
    threaded on

# HTTP output to AWS OSIS — backs up (408 -> circuit breaker), creating the backpressure that precedes the crash
[OUTPUT]
    Name http
    Match *
    Host <pipeline>.us-west-2.osis.amazonaws.com
    Port 443
    URI /log/ingest
    Format json
    aws_auth true
    aws_region us-west-2
    aws_service osis
    tls On
    storage.total_limit_size 5G

# S3 output — filesystem-buffered; its size counter underflows after the crash restart
[OUTPUT]
    Name s3
    Match *
    bucket <s3-logs-bucket>
    region us-west-2
    use_put_object on
    store_dir /var/log/fluent-bit/s3
    store_dir_limit_size 10G

Our full production config additionally has tail (kube-audit) and systemd inputs, kubernetes/lua/modify/rewrite_tag filters, and a second s3 output — all available on request.

Questions for maintainers

  1. #11956 ("in_tail: Plug timer lifecycle glitches on progress check", shipped in 5.0.8) was meant to fix the in_tail progress-check timer lifecycle, but the SIGSEGV still occurs for us on 5.0.8. Is there a remaining path (e.g. under threaded on + filesystem storage + heavy HTTP-output backpressure) not covered by that fix?
  2. Does Inotify_Watcher Off fully avoid the crashing code path on 5.0.8, or only reduce frequency? We can apply it as a workaround but want to confirm it's a complete mitigation rather than pinning to an older release (we've rolled back to 5.0.3 for now).

Happy to provide full debug-level logs or a core dump if that helps.
