Memory leak with version 1.9.2 #89
Description
Hello, first time raising a defect like this, apologies if anything is incorrect.
We have fluentd running in k8s, taking in a bunch of data via the HTTP endpoint. We noticed recently that the pods were being killed due to a memory leak. The primary developer was busy, so I (not a Ruby dev) investigated with the aid of AI tooling.
Here’s some of the output from the AI:
Summary
Root cause: A bug in cool.io (the async I/O library used by fluentd) where disabled write watchers can never be properly detached.
The lifecycle for every HTTP connection:
- Client sends request → fluentd reads it
- Fluentd writes response → write watcher fires → buffer drains → write watcher is disabled (enabled=0, but stays in the event loop's @watchers hash)
- Connection closes → close() tries to detach the write watcher → cool.io's C extension sees enabled==0 and silently returns without removing the watcher from @watchers
- The write watcher in @watchers keeps the entire connection object graph alive: Watcher → EventHandler::TCPServer → TCPCallbackSocket → closures → Handler → Http::Parser (with request body strings) → TCPSocket → Coolio::Buffer
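That lifecycle can be reproduced with a toy model in plain Ruby. `ToyLoop` and `ToyWriteWatcher` below are illustrative stand-ins for cool.io's bookkeeping, not the real Coolio classes:

```ruby
# Toy stand-ins for cool.io's loop and watcher -- illustrative only.
class ToyLoop
  attr_reader :watchers

  def initialize
    @watchers = {}            # cool.io keeps attached watchers in a hash like this
  end

  def attach(watcher)
    @watchers[watcher] = true
  end

  def remove(watcher)
    @watchers.delete(watcher)
  end
end

class ToyWriteWatcher
  def initialize(event_loop)
    @loop = event_loop
    @enabled = false
  end

  def attach
    @enabled = true
    @loop.attach(self)
  end

  def disable
    @enabled = false          # disabled, but still sitting in @watchers
  end

  def detach
    return unless @enabled    # models the enabled == 0 guard
    @loop.remove(self)
  end
end

ev = ToyLoop.new
w  = ToyWriteWatcher.new(ev)
w.attach    # fluentd starts writing the response
w.disable   # buffer drains, watcher disabled
w.detach    # connection closes -- the guard returns early
puts ev.watchers.size   # => 1: the watcher, and the whole connection
                        #    graph it references, is never removed
```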
At ~100KB per request body, 5,318 leaked connections = ~530MB in strings alone, matching the 571MB we saw in the sigdump.
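Sanity-checking that arithmetic (decimal megabytes, numbers taken from the sigdump above):

```ruby
leaked_connections = 5_318
bytes_per_body     = 100_000   # ~100KB request body pinned per leaked connection
total_mb = leaked_connections * bytes_per_body / 1_000_000.0
puts total_mb   # => 531.8, i.e. ~530MB in request-body strings alone
```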
The fix: Re-enable the write watcher before detaching it, so the C-level guard doesn't fire and the watcher is properly removed from @watchers.
The changes are:
- src/patches/coolio_write_watcher_leak_fix.rb - monkey-patch for Coolio::IO#detach_write_watcher
- Dockerfile - loads src/patches/*.rb before plugins
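A sketch of the patch's core idea, again using toy stand-ins (the real monkey patch targets `Coolio::IO#detach_write_watcher`; the classes and helper below are illustrative, not the actual patch file):

```ruby
# Toy stand-ins mirroring cool.io's watcher bookkeeping -- not the real classes.
class ToyLoop
  attr_reader :watchers

  def initialize
    @watchers = {}
  end

  def attach(watcher)
    @watchers[watcher] = true
  end

  def remove(watcher)
    @watchers.delete(watcher)
  end
end

class ToyWriteWatcher
  def initialize(event_loop)
    @loop = event_loop
    @enabled = false
  end

  def attach
    @enabled = true
    @loop.attach(self)
  end

  def enable
    @enabled = true
  end

  def disable
    @enabled = false
  end

  def detach
    return unless @enabled     # models the C-level guard
    @loop.remove(self)
  end
end

# The patch's idea: re-enable a disabled watcher just before detaching,
# so the guard no longer short-circuits the removal.
def detach_write_watcher(watcher)
  watcher.enable
  watcher.detach
end

ev = ToyLoop.new
w  = ToyWriteWatcher.new(ev)
w.attach
w.disable
detach_write_watcher(w)
puts ev.watchers.size   # => 0: the watcher is actually removed this time
```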
After deploying the provided monkey patch, my pods seem to be much better:
2 hours in, zero restarts, memory completely flat:
┌───────┬────────┬────────┬────────┬────────┐
│ Pod │ 15:13 │ 15:32 │ 15:45 │ 16:10 │
├───────┼────────┼────────┼────────┼────────┤
│ d4mlm │ 960Mi │ 964Mi │ 960Mi │ 961Mi │
├───────┼────────┼────────┼────────┼────────┤
│ tlb5c │ 1004Mi │ 1001Mi │ 1006Mi │ 1004Mi │
├───────┼────────┼────────┼────────┼────────┤
│ zgldv │ 946Mi │ 949Mi │ 952Mi │ 947Mi │
└───────┴────────┴────────┴────────┴────────┘

At this point on the old image, these pods would have been ~200Mi higher and climbing. The leak is gone.
When I asked for a reference to the bug in cool.io, it gave me the following output:
It's a regression introduced just two weeks ago. The bug was added by PR #88 in cool.io, merged on 2026-02-09 and released in v1.9.2.
The backstory:
- Issue #87 ("TypeError: wrong argument type nil (expected Data) when processing a detached watcher") reported a TypeError crash when a watcher was detached from one thread while the event loop processed an event for it in another
- PR #88 ("Fix TypeError when processing a detached watcher") fixed that by adding an enabled == 0 guard in the Watcher_Detach macro to skip already-detached watchers
- But the guard conflates "disabled" (enabled=0, still in @watchers) with "detached" (loop=Qnil, removed from @watchers), causing our leak
The correct fix upstream would be to check watcher_data->loop == Qnil instead of watcher_data->enabled == 0. This hasn't been reported to cool.io yet - it's worth filing an issue at https://github.com/socketry/cool.io/issues referencing PR #88.
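The distinction between the two checks can be sketched in Ruby terms (the real guard lives in the C extension's Watcher_Detach macro; `Watcher` here is a hypothetical stand-in, not a cool.io class):

```ruby
# loop: the event loop the watcher belongs to (nil once truly detached)
# enabled: whether the watcher currently fires events
Watcher = Struct.new(:loop, :enabled)

disabled_watcher = Watcher.new(:the_loop, false) # disabled, still in @watchers
detached_watcher = Watcher.new(nil, false)       # already removed from @watchers

# Guard from PR #88: skips whenever the watcher is disabled,
# conflating the two states.
buggy_skip = ->(w) { !w.enabled }
# Proposed guard: skip only when the watcher no longer has a loop.
fixed_skip = ->(w) { w.loop.nil? }

puts buggy_skip.call(disabled_watcher)  # => true  (detach skipped: the leak)
puts fixed_skip.call(disabled_watcher)  # => false (detach proceeds normally)
puts fixed_skip.call(detached_watcher)  # => true  (double-detach still safe)
```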