[REDIS] Don't skip XTRIM when XINFO STREAM FULL fails#26
Conversation
XINFO STREAM ... FULL can fail (timeout, etc.) on very large streams — exactly when trimming matters most. With the current code, that failure causes `continue` and the XTRIM below is never executed, so a stream that has grown big can never recover and keeps growing until it OOMs the Redis instance. Move the consumer cleanup (which actually needs the XINFO output) into its own conditional: log and continue on failure, but always attempt the trim afterwards. Also add the missing .Err() call to XTrimMinID — without it, the *IntCmd returned by the call is compared to nil instead of the actual error, and trim failures are silently swallowed.
|
Quick clarification — the diff isn't just an 1.
|
Problem
In
provider/redis/janitor.go,cleanupConsumersdoes this for everystream-*:XINFO STREAM ... FULLis expensive and starts failing (timeout, memory pressure) on streams with very many entries — exactly the streams that most need trimming. Once it starts failing, thecontinueskips the trim, the stream keeps growing, the next call is even slower, snowball.We hit this in production with a stream that grew to 2.1M entries / 769 MB and a 1 GiB Redis sidecar that got OOM-killed every hour for 47 days. Investigation showed the janitor was running but the trim path was never reached.
There's also a secondary issue:
XTrimMinID(...)returns a*IntCmd, not anerror. The current code compares the returned pointer tonilinstead of calling.Err(), so any actual trim failure is silently swallowed.Fix
XInfoStreamFullresult. If it fails, log and skip just the consumer cleanup — the trim afterwards runs unconditionally..Err()toXTrimMinIDso trim errors actually get logged.Behavior change
XINFOfailure on a stream stops trim for that stream until the next janitor cycle; ifXINFOkeeps failing, trim never runs.No API or option changes.
Tests
go test ./provider/redis/...passes locally against the existing localstack-backed test suite. Happy to add a regression test if you point me at the right place — the existing tests don't seem to exercise the janitor directly.