Skip to content

[REDIS] Don't skip XTRIM when XINFO STREAM FULL fails#26

Open
ricardozd wants to merge 1 commit into
botchris:masterfrom
ricardozd:fix/redis-janitor-xtrim-on-info-failure
Open

[REDIS] Don't skip XTRIM when XINFO STREAM FULL fails#26
ricardozd wants to merge 1 commit into
botchris:masterfrom
ricardozd:fix/redis-janitor-xtrim-on-info-failure

Conversation

@ricardozd
Copy link
Copy Markdown

Problem

In provider/redis/janitor.go, cleanupConsumers does this for every stream-*:

s, err := r.redisClient.XInfoStreamFull(r.ctx, streamName, 1).Result()
if err != nil {
    continue   // ← skips the XTRIM below as well
}
... consumer cleanup ...
r.redisClient.XTrimMinID(...)

XINFO STREAM ... FULL is expensive and starts failing (timeout, memory pressure) on streams with very many entries — exactly the streams that most need trimming. Once it starts failing, the continue skips the trim, the stream keeps growing, the next call is even slower, snowball.

We hit this in production with a stream that grew to 2.1M entries / 769 MB and a 1 GiB Redis sidecar that got OOM-killed every hour for 47 days. Investigation showed the janitor was running but the trim path was never reached.

There's also a secondary issue: XTrimMinID(...) returns a *IntCmd, not an error. The current code compares the returned pointer to nil instead of calling .Err(), so any actual trim failure is silently swallowed.

Fix

  • Move the consumer cleanup into a conditional on the XInfoStreamFull result. If it fails, log and skip just the consumer cleanup — the trim afterwards runs unconditionally.
  • Add the missing .Err() to XTrimMinID so trim errors actually get logged.

Behavior change

  • Before: one transient XINFO failure on a stream stops trim for that stream until the next janitor cycle; if XINFO keeps failing, trim never runs.
  • After: trim always runs each cycle, even if consumer cleanup fails. Logs are clearer when trim itself fails.

No API or option changes.

Tests

go test ./provider/redis/... passes locally against the existing localstack-backed test suite. Happy to add a regression test if you point me at the right place — the existing tests don't seem to exercise the janitor directly.

XINFO STREAM ... FULL can fail (timeout, etc.) on very large streams —
exactly when trimming matters most. With the current code, that
failure causes `continue` and the XTRIM below is never executed, so a
stream that has grown big can never recover and keeps growing until
it OOMs the Redis instance.

Move the consumer cleanup (which actually needs the XINFO output)
into its own conditional: log and continue on failure, but always
attempt the trim afterwards. Also add the missing .Err() call to
XTrimMinID — without it, the *IntCmd returned by the call is
compared to nil instead of the actual error, and trim failures are
silently swallowed.
@ricardozd
Copy link
Copy Markdown
Author

Quick clarification — the diff isn't just an if → else, there are two distinct behavior changes. Let me unpack:

1. XTrimMinID(...) was never reporting errors

r.redisClient.XTrimMinID(...) returns *redis.IntCmd, not error:

// go-redis/v9 commands.go:289
XTrimMinID(ctx context.Context, key string, minID string) *IntCmd

The current code is:

if err := r.redisClient.XTrimMinID(r.ctx, streamName, ...); err != nil {
    r.options.logger.Errorf("Error trimming %s", err)
}

So err is of type *IntCmd and err != nil is comparing the pointer, not the Redis error. go-redis always returns a non-nil *IntCmd, so:

  • The condition is effectively true on every iteration.
  • The "Error trimming" log fires in the success case too — and the formatted %s is the command struct, not the actual error.
  • A real Redis error (e.g. OOM command not allowed when used memory > 'maxmemory') is silently swallowed because nobody ever calls .Err() on the result.

The new code adds the missing .Err():

if err := r.redisClient.XTrimMinID(...).Err(); err != nil {
    r.options.logger.Errorf("Error trimming %s: %s", streamName, err)
}

That alone is a real bug fix — anyone who was relying on the trim error log was never seeing the real error.

2. XInfoStreamFull failure used to skip the trim for that stream

Before:

s, err := r.redisClient.XInfoStreamFull(r.ctx, streamName, 1).Result()
if err != nil {
    continue   // ← jumps to next streamName, XTRIM below is NOT executed
}
... consumer cleanup uses `s` ...
... XTrimMinID(...)  ...   // never reached when XInfoStreamFull fails

After:

if s, infoErr := r.redisClient.XInfoStreamFull(r.ctx, streamName, 1).Result(); infoErr == nil {
    ... consumer cleanup uses `s` ...
} else {
    r.options.logger.Errorf("...")
}
... XTrimMinID(...) ...   // always runs, regardless of XInfoStreamFull outcome

The control-flow difference: the trim now runs even when XInfoStreamFull returned an error. That's the failure mode that hit us in production:

  • A stream grew to 2.1M entries / 769 MB.
  • XINFO STREAM ... FULL on that stream started timing out / failing under memory pressure on a 1 GiB Redis sidecar.
  • Once it started failing, the janitor's continue skipped the trim, so the stream couldn't shrink.
  • Snowball: the bigger the stream got, the more XINFO ... FULL failed, the less trim ran. Bigger streams need trim more, but get trim less.
  • The Redis sidecar OOM-killed every ~1 hour for 47 days straight before we caught it. After a manual XTRIM ... MAXLEN 1000 it was instantly fine.

Consumer cleanup genuinely needs s (it iterates s.Groups). The trim doesn't — it only needs streamName and trimDuration. Decoupling them lets the trim survive the inspection failing.

Net effect of the PR

  • Trim error logs now actually contain the Redis error.
  • Streams that grow large enough to make XINFO ... FULL fail can still recover via the trim path. Without this, they can't.

The if → else shape comes out of doing #2; happy to restructure the diff differently (early-return, helper func, whatever fits the project style) if the visual is misleading. The two semantic changes above are what matter.

Happy to add a regression test if you'd like — it would need a seeded large stream + a way to fail XINFO STREAM FULL, which I'd want to set up against the existing localstack harness. Let me know what shape you want it in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant