fix(consolidation): skip eager first tick at startup to avoid FalkorDB load race #165

Merged
jack-arturo merged 1 commit into verygoodplugins:main from flintfromthebasement:fix/skip-eager-consolidation-tick-on-startup
May 1, 2026

@flintfromthebasement

Why

When init_consolidation_scheduler() runs a tick immediately after spawning the worker thread, FalkorDB can still be loading its RDB snapshot from disk. Every Redis command during that window returns:

LOADING Redis is loading the dataset in memory

The eager tick catches the error, logs it, and bumps last_run timestamps — silently skipping the day's decay / creative / cluster work until tomorrow. The bigger the corpus, the longer the RDB load, the more reliably this fires. On any restart-on-deploy host (Railway, Docker, systemd) with a few thousand memories, it hits every deploy.
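The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: `LoadingError`, `eager_tick`, the task name, and the one-day interval are all hypothetical stand-ins chosen to show how a caught error plus an unconditional timestamp update suppresses the next run.

```python
DAY = 86400  # hypothetical task interval, in seconds


class LoadingError(Exception):
    """Stand-in for the error a client raises while the database is loading."""


def task_during_load():
    # Hypothetical task body: the database is still loading, so it fails.
    raise LoadingError("LOADING Redis is loading the dataset in memory")


def eager_tick(last_run, name, interval, now, run_task):
    """Run the task if it is due. Returns True only if the task completed."""
    if now - last_run.get(name, 0.0) < interval:
        return False  # not due yet
    succeeded = True
    try:
        run_task()
    except LoadingError as exc:
        succeeded = False
        print(f"task {name!r} failed: {exc}")
    # The timestamp advances even when the task failed. This is the bug
    # pattern: the task now looks "done" for the whole interval.
    last_run[name] = now
    return succeeded


last_run = {}
healthy_calls = []
startup = 100_000.0

# Tick at startup, while the database is still loading: fails, but bumps last_run.
first_ok = eager_tick(last_run, "decay", DAY, startup, task_during_load)

# Tick one hour later, database healthy: the task is no longer considered due.
second_ok = eager_tick(
    last_run, "decay", DAY, startup + 3600, lambda: healthy_calls.append("decay")
)
```

After both ticks the healthy task body has never run, which matches the "silently skipped until tomorrow" behavior the description reports.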

What changes

One line in automem/consolidation/runtime_scheduler.py:100 — drop the eager run_consolidation_tick_fn() call after starting the worker thread, and add a comment explaining why.

     state.consolidation_thread.start()
-    run_consolidation_tick_fn()
+    # Skip eager first tick: FalkorDB may still be loading its RDB snapshot at
+    # startup and the "Redis is loading the dataset in memory" error poisons
+    # the day's decay/creative run. The worker loop will fire its first tick
+    # after consolidation_tick_seconds, which is plenty of warm-up time.
     logger.info("Consolidation scheduler initialized")
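For readers who do not have the file open, the shape of the scheduler after this change can be sketched roughly as follows. This is an illustration under assumptions: `start_scheduler`, its parameters, and the use of `threading.Event` are hypothetical, and the real worker loop in `runtime_scheduler.py` may be structured differently. The point is only that the loop waits one full interval before its first tick and that nothing calls the tick function at startup.

```python
import threading


def start_scheduler(tick_fn, tick_seconds, stop_event):
    """Start a worker that waits tick_seconds before each tick, including the first."""

    def worker():
        # Event.wait returns False on timeout and True once stop_event is set,
        # so the loop ticks after every full interval until asked to stop.
        while not stop_event.wait(tick_seconds):
            tick_fn()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    # No eager tick_fn() call here: the first tick belongs to the worker loop.
    return thread


ticks = []
first_tick_seen = threading.Event()
stop = threading.Event()


def tick():
    ticks.append(1)
    first_tick_seen.set()


worker_thread = start_scheduler(tick, tick_seconds=0.2, stop_event=stop)
ticks_at_start = len(ticks)  # nothing has run yet right after startup

fired = first_tick_seen.wait(timeout=5)  # first tick arrives after the interval
stop.set()
worker_thread.join(timeout=5)
```

Waiting on an event rather than calling `time.sleep` also lets the worker shut down promptly, though whether the project does this is not stated in the PR.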

Why this is safe

  • The worker loop still fires within CONSOLIDATION_TICK_SECONDS (default 3600s = 1h). For decay/creative/cluster intervals measured in days, a one-tick startup delay is invisible.
  • The scheduler is timestamp-driven (last_run per task), not edge-triggered. Missed intervals get picked up by the next loop iteration — nothing is "lost" by deferring.
  • Failure mode flips from "silent broken run" to "no run yet, will run shortly" — strictly better.
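The "timestamp-driven, not edge-triggered" point can be made concrete with a small sketch. The function name `due_tasks`, the task names, and the interval values below are assumptions for illustration only. A task is due whenever enough time has passed since its `last_run`, so a tick that arrives an hour late still finds the same work waiting.

```python
HOUR = 3600
DAY = 86400

# Hypothetical per-task intervals, in seconds.
intervals = {"decay": DAY, "creative": 7 * DAY}


def due_tasks(last_run, intervals, now):
    """Return the tasks whose interval has elapsed since their last run."""
    return [
        name
        for name, interval in intervals.items()
        if now - last_run.get(name, 0.0) >= interval
    ]


last_run = {"decay": 0.0, "creative": 0.0}
restart_time = 90_000.0  # service restarts here; no tick runs yet

# The first tick happens one tick interval after the restart.
first_tick_time = restart_time + HOUR
due_at_first_tick = due_tasks(last_run, intervals, first_tick_time)

# Record the successful run, then check the following tick.
for name in due_at_first_tick:
    last_run[name] = first_tick_time
due_at_second_tick = due_tasks(last_run, intervals, first_tick_time + HOUR)
```

Here the decay task that became due before the restart is still picked up at the deferred first tick, and it is not repeated on the tick after that.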

Out of scope

Test plan

  • Service starts cleanly with no eager tick log entry
  • Worker loop fires its first tick after CONSOLIDATION_TICK_SECONDS
  • Forcing a tick via POST /consolidate still works immediately
  • On a restart with a large RDB, no "LOADING Redis is loading the dataset in memory" errors appear in consolidation logs
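The last item in the test plan could be checked mechanically with something like the sketch below. The helper name `loading_errors` and the sample log lines are hypothetical; only the error text itself comes from this PR, and the real log format may differ.

```python
LOADING_TEXT = "LOADING Redis is loading the dataset in memory"


def loading_errors(log_lines):
    """Return the log lines that contain the loading error text."""
    return [line for line in log_lines if LOADING_TEXT in line]


# Hypothetical log excerpts from before and after the change.
logs_before = [
    "Consolidation scheduler initialized",
    "consolidation tick failed: LOADING Redis is loading the dataset in memory",
]
logs_after = [
    "Consolidation scheduler initialized",
]

hits_before = loading_errors(logs_before)
hits_after = loading_errors(logs_after)
```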

…B load race

When init_consolidation_scheduler() ran a tick immediately after spawning
the worker thread, FalkorDB could still be loading its RDB snapshot from
disk. Every Redis command in that window returns "LOADING Redis is loading
the dataset in memory", so the eager tick fails — but the failure is
caught and last_run timestamps get bumped, silently skipping the day's
decay / creative / cluster work until tomorrow.

The bigger the corpus, the longer the RDB load, the more reliably this
fires. On any restart-on-deploy host (Railway, Docker, systemd) with a
few thousand memories, it hits every deploy.

Removing the eager tick is safe:

- The worker loop still fires within CONSOLIDATION_TICK_SECONDS (default
  1h). For decay/creative/cluster intervals measured in days, a one-tick
  delay at startup is invisible.
- The scheduler is timestamp-driven (last_run per task), not edge-triggered.
  Nothing is "lost" by deferring — the next loop iteration picks up any
  missed intervals.
- Failure mode flips from "silent broken run" to "no run yet, will run
  shortly" — strictly better.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jack-arturo jack-arturo merged commit 1b812cf into verygoodplugins:main May 1, 2026
7 checks passed