
Memory leak with version 1.9.2 #89

@leigh-capa


Hello, this is my first time raising a defect like this, so apologies if anything is incorrect.

We run fluentd in Kubernetes, ingesting data through its HTTP endpoint. We recently noticed the pods were being killed because of a memory leak. The primary developer was busy, so I (not a Ruby dev) investigated with the aid of AI tooling.

Here’s some of the output from the AI:

Summary

Root cause: A bug in cool.io (the async I/O library used by fluentd) where disabled write watchers can never be properly detached.

The lifecycle for every HTTP connection:

  1. Client sends request → fluentd reads it
  2. Fluentd writes response → write watcher fires → buffer drains → write watcher is disabled (enabled=0, but stays in the event loop's @watchers hash)
  3. Connection closes → close() tries to detach the write watcher → cool.io's C extension sees enabled==0 and silently returns without removing the watcher from @watchers
  4. The write watcher in @watchers keeps the entire connection object graph alive: Watcher → EventHandler::TCPServer → TCPCallbackSocket → closures → Handler → Http::Parser (with request body strings) → TCPSocket → Coolio::Buffer

At ~100KB per request body, 5,318 leaked connections works out to ~530MB in strings alone, roughly matching the 571MB we saw in the sigdump.

The fix: Re-enable the write watcher before detaching it, so the C-level guard doesn't fire and the watcher is properly removed from @watchers.
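To make the mechanism concrete, here is a minimal self-contained simulation of the guard behavior described above. This is not cool.io code, just an illustration of why a detach that silently no-ops on a disabled watcher leaks, and why re-enabling first fixes it:

```ruby
# Simulation of the described bug: detach silently returns when the watcher
# is disabled, so a disabled watcher is never removed from the loop's
# watchers hash and keeps its whole object graph alive.

class FakeLoop
  attr_reader :watchers

  def initialize
    @watchers = {}
  end

  def attach(watcher)
    @watchers[watcher] = true
  end

  def detach(watcher)
    @watchers.delete(watcher)
  end
end

class FakeWatcher
  def initialize(ev_loop)
    @loop = ev_loop
    @enabled = true
    ev_loop.attach(self)
  end

  def disable
    @enabled = false
  end

  def enable
    @enabled = true
  end

  # Mimics the C-level guard: a disabled watcher is silently left attached.
  def detach
    return unless @enabled
    @loop.detach(self)
  end
end

ev_loop = FakeLoop.new
w = FakeWatcher.new(ev_loop)

w.disable
w.detach
puts ev_loop.watchers.size  # => 1: the guard fired, the watcher leaked

w.enable                    # the workaround: re-enable before detaching
w.detach
puts ev_loop.watchers.size  # => 0: the watcher is actually removed
```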

The changes are:

  • src/patches/coolio_write_watcher_leak_fix.rb - monkey-patch for Coolio::IO#detach_write_watcher
  • Dockerfile - loads src/patches/*.rb before plugins
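For reference, a patch along these lines might look like the sketch below. The ivar name @_write_watcher and the method name detach_write_watcher are assumptions about cool.io's internals (I can't see the repo's actual patch file), so verify them against the cool.io version you actually run before using anything like this:

```ruby
# Hypothetical sketch of the monkey patch. @_write_watcher and
# detach_write_watcher are assumed names for cool.io internals,
# not verified against a specific cool.io release.
module CoolioWriteWatcherLeakFix
  def detach_write_watcher
    watcher = instance_variable_get(:@_write_watcher)
    return unless watcher && watcher.attached?

    # Re-enable first so the C-level `enabled == 0` guard cannot skip the
    # detach, then remove the watcher from the event loop for real.
    watcher.enable unless watcher.enabled?
    watcher.detach
  end
end

# Apply only when cool.io is actually loaded.
Coolio::IO.prepend(CoolioWriteWatcherLeakFix) if defined?(Coolio::IO)
```

Using prepend (rather than reopening the class) keeps the override ahead of the original method in the ancestor chain, so it wins even if cool.io defines detach_write_watcher directly on Coolio::IO.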

After deploying the provided monkey patch, the pods look much healthier.

Two hours in, zero restarts, memory completely flat:

┌───────┬────────┬────────┬────────┬────────┐
│ Pod   │ 15:13  │ 15:32  │ 15:45  │ 16:10  │
├───────┼────────┼────────┼────────┼────────┤
│ d4mlm │ 960Mi  │ 964Mi  │ 960Mi  │ 961Mi  │
│ tlb5c │ 1004Mi │ 1001Mi │ 1006Mi │ 1004Mi │
│ zgldv │ 946Mi  │ 949Mi  │ 952Mi  │ 947Mi  │
└───────┴────────┴────────┴────────┴────────┘

At this point on the old image, these pods would have been ~200Mi higher and climbing. The leak is gone.

When I asked for a reference to the bug in cool.io, it gave me the following output:

It's a regression introduced just two weeks ago. The bug was added by PR #88 in cool.io, merged on 2026-02-09 and released in v1.9.2.

The backstory:

The correct fix upstream would be to check watcher_data->loop == Qnil instead of watcher_data->enabled == 0. This hasn't been reported to cool.io yet - it's worth filing an issue at https://github.com/socketry/cool.io/issues referencing PR #88.
