Memory leak with version 1.9.2 #89
Description
Hello, first time raising a defect like this, apologies if anything is incorrect.
We have fluentd running in k8s, taking in a bunch of data via the HTTP endpoint. We noticed recently that the pods were being killed due to a memory leak. The primary developer was busy, so I (not a Ruby dev) investigated with the aid of AI tooling.
Here’s some of the output from the AI:
Summary
Root cause: A bug in cool.io (the async I/O library used by fluentd) where disabled write watchers can never be properly detached.
The lifecycle for every HTTP connection:
- Client sends request → fluentd reads it
- Fluentd writes response → write watcher fires → buffer drains → write watcher is disabled (enabled=0, but stays in the event loop's @watchers hash)
- Connection closes → close() tries to detach the write watcher → cool.io's C extension sees enabled==0 and silently returns without removing the watcher from @watchers
- The write watcher in @watchers keeps the entire connection object graph alive: Watcher → EventHandler::TCPServer → TCPCallbackSocket → closures → Handler → Http::Parser (with request body strings) → TCPSocket → Coolio::Buffer
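That lifecycle can be reproduced with a toy model in plain Ruby. `ToyLoop` and `ToyWriteWatcher` below are illustrative stand-ins for cool.io's bookkeeping, not the real Coolio classes:

```ruby
# Toy stand-ins for cool.io's loop and watcher -- illustrative only.
class ToyLoop
  attr_reader :watchers

  def initialize
    @watchers = {}            # cool.io keeps attached watchers in a hash like this
  end

  def attach(watcher)
    @watchers[watcher] = true
  end

  def remove(watcher)
    @watchers.delete(watcher)
  end
end

class ToyWriteWatcher
  def initialize(event_loop)
    @loop = event_loop
    @enabled = false
  end

  def attach
    @enabled = true
    @loop.attach(self)
  end

  def disable
    @enabled = false          # disabled, but still sitting in @watchers
  end

  def detach
    return unless @enabled    # models the enabled == 0 guard
    @loop.remove(self)
  end
end

ev = ToyLoop.new
w  = ToyWriteWatcher.new(ev)
w.attach    # fluentd starts writing the response
w.disable   # buffer drains, watcher disabled
w.detach    # connection closes -- the guard returns early
puts ev.watchers.size   # => 1: the watcher, and the whole connection
                        #    graph it references, is never removed
```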
At ~100KB per request body, 5,318 leaked connections = ~530MB in strings alone, matching the 571MB we saw in the sigdump.
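Sanity-checking that arithmetic (decimal megabytes, numbers taken from the sigdump above):

```ruby
leaked_connections = 5_318
bytes_per_body     = 100_000   # ~100KB request body pinned per leaked connection
total_mb = leaked_connections * bytes_per_body / 1_000_000.0
puts total_mb   # => 531.8, i.e. ~530MB in request-body strings alone
```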
The fix: Re-enable the write watcher before detaching it, so the C-level guard doesn't fire and the watcher is properly removed from @watchers.
The changes are:
- src/patches/coolio_write_watcher_leak_fix.rb - monkey-patch for Coolio::IO#detach_write_watcher
- Dockerfile - loads src/patches/*.rb before plugins
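A sketch of the patch's core idea, again using toy stand-ins (the real monkey patch targets `Coolio::IO#detach_write_watcher`; the classes and helper below are illustrative, not the actual patch file):

```ruby
# Toy stand-ins mirroring cool.io's watcher bookkeeping -- not the real classes.
class ToyLoop
  attr_reader :watchers

  def initialize
    @watchers = {}
  end

  def attach(watcher)
    @watchers[watcher] = true
  end

  def remove(watcher)
    @watchers.delete(watcher)
  end
end

class ToyWriteWatcher
  def initialize(event_loop)
    @loop = event_loop
    @enabled = false
  end

  def attach
    @enabled = true
    @loop.attach(self)
  end

  def enable
    @enabled = true
  end

  def disable
    @enabled = false
  end

  def detach
    return unless @enabled     # models the C-level guard
    @loop.remove(self)
  end
end

# The patch's idea: re-enable a disabled watcher just before detaching,
# so the guard no longer short-circuits the removal.
def detach_write_watcher(watcher)
  watcher.enable
  watcher.detach
end

ev = ToyLoop.new
w  = ToyWriteWatcher.new(ev)
w.attach
w.disable
detach_write_watcher(w)
puts ev.watchers.size   # => 0: the watcher is actually removed this time
```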
After deploying the provided monkey patch, my pods seem to be much better:
2 hours in, zero restarts, memory completely flat:
┌───────┬────────┬────────┬────────┬────────┐
│ Pod │ 15:13 │ 15:32 │ 15:45 │ 16:10 │
├───────┼────────┼────────┼────────┼────────┤
│ d4mlm │ 960Mi │ 964Mi │ 960Mi │ 961Mi │
├───────┼────────┼────────┼────────┼────────┤
│ tlb5c │ 1004Mi │ 1001Mi │ 1006Mi │ 1004Mi │
├───────┼────────┼────────┼────────┼────────┤
│ zgldv │ 946Mi │ 949Mi │ 952Mi │ 947Mi │
└───────┴────────┴────────┴────────┴────────┘

At this point on the old image, these pods would have been ~200Mi higher and climbing. The leak is gone.
When I asked for a reference to the bug in cool.io, it gave me the following output:
It's a regression introduced just two weeks ago. The bug was added by PR #88 in cool.io, merged on 2026-02-09 and released in v1.9.2.
The backstory:
- Issue #87 ("TypeError: wrong argument type nil (expected Data) when processing a detached watcher") reported a TypeError crash when a watcher was detached from one thread while the event loop processed an event for it in another
- PR #88 ("Fix TypeError when processing a detached watcher") fixed that by adding an enabled == 0 guard in the Watcher_Detach macro to skip already-detached watchers
- But the guard conflates "disabled" (enabled=0, still in @watchers) with "detached" (loop=Qnil, removed from @watchers), causing our leak
The correct fix upstream would be to check watcher_data->loop == Qnil instead of watcher_data->enabled == 0. This hasn't been reported to cool.io yet - it's worth filing an issue at https://github.com/socketry/cool.io/issues referencing PR #88.
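The distinction between the two checks can be sketched in Ruby terms (the real guard lives in the C extension's Watcher_Detach macro; `Watcher` here is a hypothetical stand-in, not a cool.io class):

```ruby
# loop: the event loop the watcher belongs to (nil once truly detached)
# enabled: whether the watcher currently fires events
Watcher = Struct.new(:loop, :enabled)

disabled_watcher = Watcher.new(:the_loop, false) # disabled, still in @watchers
detached_watcher = Watcher.new(nil, false)       # already removed from @watchers

# Guard from PR #88: skips whenever the watcher is disabled,
# conflating the two states.
buggy_skip = ->(w) { !w.enabled }
# Proposed guard: skip only when the watcher no longer has a loop.
fixed_skip = ->(w) { w.loop.nil? }

puts buggy_skip.call(disabled_watcher)  # => true  (detach skipped: the leak)
puts fixed_skip.call(disabled_watcher)  # => false (detach proceeds normally)
puts fixed_skip.call(detached_watcher)  # => true  (double-detach still safe)
```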