Summary
Follow-up to #11954 (filing as a separate issue with our configuration, as requested by @cosmo0920).
We still hit the flb_lib_worker SIGSEGV on 5.0.8 — the release that contains the #11956 timer-lifecycle fix. Same stack trace as #11954. It reproduces on a Kubernetes DaemonSet under sustained HTTP-output backpressure (AWS OpenSearch Ingestion / OSIS returning 408 → circuit breaker opens), with tail inputs using threaded on + filesystem storage. (#11954 attributes the crash to the tail input; our stack trace is identical — though we have not ourselves tested with the tail input removed.)
Version / environment
Fluent Bit 5.0.8, official image fluent/fluent-bit:5.0.8
Kubernetes DaemonSet, ~100+ Linux nodes (amd64 + arm64)
Our deployment history: we ran 5.0.3 in production for a long time with no such crashes, then upgraded directly to 5.0.8 and began hitting this SIGSEGV. We did not run any intermediate release (5.0.4–5.0.7) ourselves.
Sequence observed
The http output (OSIS) begins returning 408 and the circuit breaker opens:
[error] [output:http:http.1] <pipeline>.us-west-2.osis.amazonaws.com:443, HTTP status=408
Circuit breaker is open. Unable to write to buffer.
[error] [output:http:http.1] <pipeline>.us-west-2.osis.amazonaws.com:443, HTTP status=408
Circuit breaker is open. Unable to write to buffer.
The engine then crashes with SIGSEGV (only the flb_lib_worker frame is symbolized):
[engine] caught signal (SIGSEGV)
#0 0x... in flb_lib_worker() at src/flb_lib.c:909
#1 0x... in ???() at ???:0
#2 0x... in ???() at ???:0
After the automatic restart, the s3 filesystem buffer size counter underflows:
[error] [output:s3:s3.2] Buffer is full: current_buffer_size=18446744073549369344, store_dir_limit_size=10000000000 bytes
18446744073549369344 = 2^64 − 160,182,272, i.e. roughly 160 MB below zero, which points to an unsigned underflow. The S3 output then refuses all buffering (Could not buffer chunk), chunks back up, and the pod goes NotReady (health check returns 500). This secondary state clears on a manual pod delete (store_dir is re-scanned on startup). It appears to be a downstream consequence of the unclean shutdown from the crash rather than an independent issue.
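The underflow reading can be checked with a short Python sketch. It assumes the counter is a 64-bit unsigned integer; the logged value sits just under 2^64, which suggests this, but the report does not confirm the type. The names `underflow_deficit`, `threshold`, and `BUFFER_FULL_RE` are hypothetical helpers written for this illustration and are not part of Fluent Bit.

```python
import re

UINT64_MODULUS = 2**64  # assumption: the counter is a 64-bit unsigned integer

# Matches the two numeric fields of the s3 "Buffer is full" line quoted above.
BUFFER_FULL_RE = re.compile(
    r"current_buffer_size=(?P<current>\d+), "
    r"store_dir_limit_size=(?P<limit>\d+)"
)


def underflow_deficit(log_line, threshold=2**63):
    """Return how many bytes below zero the counter wrapped, or None.

    `threshold` is a hypothetical cut-off (half the uint64 range): any
    value at or above it is treated as a wrapped negative number.
    """
    match = BUFFER_FULL_RE.search(log_line)
    if match is None:
        return None  # not a "Buffer is full" line
    current = int(match.group("current"))
    if current < threshold:
        return None  # plausible real size, no wraparound suspected
    return UINT64_MODULUS - current


line = (
    "[error] [output:s3:s3.2] Buffer is full: "
    "current_buffer_size=18446744073549369344, "
    "store_dir_limit_size=10000000000 bytes"
)

deficit = underflow_deficit(line)
print(deficit)  # 160182272

# Subtracting that many bytes from an empty counter, modulo 2^64,
# reproduces the logged value.
print((0 - deficit) % UINT64_MODULUS)  # 18446744073549369344
```

The same function could be run over collected pod logs to flag which restarts ended up in this secondary state. The counter's actual value before the subtraction is not known from the report, so starting from zero here is only the simplest case that reproduces the logged number.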
Configuration (sanitized)
Minimal subset exercising the crash path (filesystem storage + a threaded tail input + a backpressuring HTTP output):
[SERVICE]
    Flush 1
    storage.path /var/log/fluent-bit/
    storage.max_chunks_up 128
    storage.backlog.mem_limit 100M

# tail input: threaded + filesystem-backed (the crash is in this path; Inotify_Watcher at default On)
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    multiline.parser cri
    Read_From_Head True
    DB /var/log/fluent-bit/tail-containers.db
    storage.type filesystem
    threaded on

# HTTP output to AWS OSIS: backs up (408 -> circuit breaker), creating the backpressure that precedes the crash
[OUTPUT]
    Name http
    Match *
    Host <pipeline>.us-west-2.osis.amazonaws.com
    Port 443
    URI /log/ingest
    Format json
    aws_auth true
    aws_region us-west-2
    aws_service osis
    tls On
    storage.total_limit_size 5G

# S3 output: filesystem-buffered; its size counter underflows after the crash restart
[OUTPUT]
    Name s3
    Match *
    bucket <s3-logs-bucket>
    region us-west-2
    use_put_object on
    store_dir /var/log/fluent-bit/s3
    store_dir_limit_size 10G
Our full production config additionally has tail (kube-audit) and systemd inputs, kubernetes/lua/modify/rewrite_tag filters, and a second s3 output — all available on request.
Questions for maintainers
#11956 (in_tail: Plug timer lifecycle glitches on progress check, shipped in 5.0.8) was meant to fix the in_tail progress-check timer lifecycle, but the SIGSEGV still occurs for us on 5.0.8. Is there a remaining path (e.g. under threaded on + filesystem storage + heavy HTTP-output backpressure) not covered by that fix?
Does Inotify_Watcher Off fully avoid the crashing code path on 5.0.8, or only reduce its frequency? We can apply it as a workaround, but we want to confirm it is a complete mitigation before relying on it instead of pinning to an older release (we have rolled back to 5.0.3 for now).
Happy to provide full debug-level logs or a core dump if that helps.
The reconcile_file_state / progress-check timer code added by #11750 (in_tail: reconcile files after missed inotify events) is absent in v5.0.5 and present from v5.0.6 onward. #11956 (in_tail: Plug timer lifecycle glitches on progress check, in 5.0.8) was the fix attempt but does not resolve it for our path.