feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt (#164)
Open
flintfromthebasement wants to merge 1 commit into verygoodplugins:main
feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt
Three improvements to scripts/reclassify_with_llm.py to make it safe to
run on a real corpus and easy to retarget at different LLM providers.
1. CLI flags for safe partial runs:
   - --limit N: cap the number of memories processed
   - --sample N: randomly sample N memories (instead of taking the first N)
   - --seed N: make sampled runs reproducible
   - --dry-run: classify but don't write back to FalkorDB
   - --yes: skip the interactive confirmation prompt
   - --provider P: openai | openrouter (default: openai)
   - --model M: override CLASSIFICATION_MODEL per run
Lets you do a 100-memory sanity-check pass before committing to a
full reclassification across thousands of records. The prior version
was all-or-nothing.
2. OpenRouter / OpenAI-compatible support:
- Adds OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY
env vars so the same script can target any OpenAI-compatible endpoint
(OpenRouter, LiteLLM, vLLM, Azure, etc.) without code changes.
- Adds a tolerant JSON extractor for models that don't honor
response_format=json (e.g. Gemini families on OpenRouter), which
otherwise return prose-wrapped JSON and crash the strict parser.
3. Tightened SYSTEM_PROMPT:
- Replaces the loose 7-bullet type list with strict definitions,
keyword cues, and explicit priority rules ("Fact:" / descriptive
statements go to Context, not Insight; chat/DM fragments aren't
Decisions just because they contain the word "decided").
- Empirical impact on a 100-memory sample using Gemini 3.1 Flash-Lite:
- Insight share: 56% → 8% (was being used as a catch-all)
- False Decision calls on session/DM fragments eliminated
- Pattern, Context, Habit usage closer to the intended distribution
The script remains a one-shot maintenance tool — typically run after a
model swap, prompt change, or large bulk import — not a recurring task.
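The tolerant JSON extractor mentioned in item 2 could be sketched like this (a hypothetical reimplementation from the description — the function name and fallback order are assumptions, not the PR's actual code):

```python
import json
import re

def extract_json(text: str):
    """Best-effort JSON extraction for models that ignore response_format=json.

    Tries a strict parse first, then a ```json fenced block, then the
    outermost {...} span in prose-wrapped output.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Models like Gemini on OpenRouter often wrap JSON in a markdown fence.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # Last resort: grab everything between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])
    raise ValueError("no JSON object found in model output")
```

The point is that the strict `json.loads` path stays first, so well-behaved providers pay no extra cost; the fallbacks only run on prose-wrapped output.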
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jack-arturo pushed a commit that referenced this pull request on May 1, 2026:
…B load race (#165)

## Why

When `init_consolidation_scheduler()` runs a tick **immediately** after spawning the worker thread, FalkorDB can still be loading its RDB snapshot from disk. Every Redis command during that window returns:

> `LOADING Redis is loading the dataset in memory`

The eager tick catches the error, logs it, and bumps `last_run` timestamps — silently skipping the day's decay / creative / cluster work until tomorrow. The bigger the corpus, the longer the RDB load, the more reliably this fires. On any restart-on-deploy host (Railway, Docker, systemd) with a few thousand memories, it hits every deploy.

## What changes

One line in `automem/consolidation/runtime_scheduler.py:100` — drop the eager `run_consolidation_tick_fn()` call after starting the worker thread, and add a comment explaining why.

```diff
 state.consolidation_thread.start()
-run_consolidation_tick_fn()
+# Skip eager first tick: FalkorDB may still be loading its RDB snapshot at
+# startup and the "Redis is loading the dataset in memory" error poisons
+# the day's decay/creative run. The worker loop will fire its first tick
+# after consolidation_tick_seconds, which is plenty of warm-up time.
 logger.info("Consolidation scheduler initialized")
```

## Why this is safe

- The worker loop still fires within `CONSOLIDATION_TICK_SECONDS` (default 3600s = 1h). For decay/creative/cluster intervals measured in days, a one-tick startup delay is invisible.
- The scheduler is timestamp-driven (`last_run` per task), not edge-triggered. Missed intervals get picked up by the next loop iteration — nothing is "lost" by deferring.
- Failure mode flips from "silent broken run" to "no run yet, will run shortly" — strictly better.

## Out of scope

- A more involved fix would actively probe FalkorDB readiness with retries before the first tick. That's a bigger change and arguably belongs at the FalkorDB-client layer, not here. This PR is the minimal, low-risk fix.
- The `discover_creative_associations` / clustering improvements live in #163 and #164.

## Test plan

- [ ] Service starts cleanly with no eager tick log entry
- [ ] Worker loop fires its first tick after `CONSOLIDATION_TICK_SECONDS`
- [ ] Forcing a tick via `POST /consolidate` still works immediately
- [ ] On a restart with a large RDB, no `LOADING Redis is loading the dataset in memory` errors appear in consolidation logs

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
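The "timestamp-driven, not edge-triggered" claim above can be illustrated with a minimal sketch (a simplification for exposition, not the actual `runtime_scheduler.py` code — the function name and dict shapes are assumptions):

```python
import time

def due_tasks(last_run, intervals, now=None):
    """Timestamp-driven scheduling: a task is due whenever enough time has
    elapsed since its last_run. A skipped or failed tick is therefore not
    "lost" — the next loop iteration still sees the task as overdue.
    """
    now = time.time() if now is None else now
    return [name for name, interval in intervals.items()
            if now - last_run.get(name, 0.0) >= interval]
```

Under this model, deferring the first tick by one `CONSOLIDATION_TICK_SECONDS` interval only delays when the overdue tasks are noticed, never whether they run.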
Why
`scripts/reclassify_with_llm.py` is a one-shot maintenance tool, but in its current form it's all-or-nothing: there's no way to dry-run it, no way to sample a subset, and no way to point it at anything except OpenAI. That makes it scary to actually run on a real corpus, and impossible to benchmark alternative classification models without forking.

This PR makes it safe to use on a production corpus and provider-agnostic, plus tightens the classification prompt based on a 100-memory benchmark.
What changes
1. CLI flags for safe partial runs
- `--limit N`
- `--sample N`
- `--seed N`
- `--dry-run`
- `--yes`
- `--provider P`: `openai` or `openrouter` (default: `openai`)
- `--model M`: override `CLASSIFICATION_MODEL` per run

Typical workflow now: a small `--dry-run --sample` pass to sanity-check the classifications, then the full run.
2. OpenRouter / OpenAI-compatible provider support
Adds three env vars:
`OPENROUTER_API_KEY`, `CLASSIFICATION_BASE_URL`, and `CLASSIFICATION_API_KEY`. The same script can now target OpenRouter, LiteLLM, vLLM, Azure, or any OpenAI-compatible endpoint without code changes.

Includes a tolerant JSON extractor for models that don't honor `response_format=json` (Gemini families on OpenRouter return prose-wrapped JSON and otherwise crash the strict parser).

3. Tightened SYSTEM_PROMPT
The prior prompt was a loose 7-bullet type list. The new prompt has strict definitions, keyword cues, and explicit priority rules ("Fact:" and descriptive statements go to `Context`, not `Insight`; chat/DM fragments aren't `Decisions` just because they contain "decided").

Empirical impact on a 100-memory sample (Gemini 3.1 Flash-Lite via OpenRouter):

- Insight share: 56% → 8% (was being used as a catch-all)
- False Decision calls on session/DM fragments eliminated
- Pattern, Context, Habit usage closer to the intended distribution
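The env-var and flag wiring from sections 1–2 could resolve roughly like this (a hypothetical sketch: the function name and exact precedence order are assumptions, though the env-var names come from the PR and the OpenRouter base URL is its public OpenAI-compatible endpoint):

```python
import os

# OpenRouter's public OpenAI-compatible endpoint.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

def resolve_client_config(provider="openai", model=None):
    """Resolve base URL, API key, and model for an OpenAI-compatible client.

    Assumed precedence: explicit CLASSIFICATION_* env overrides beat
    provider defaults, and a --model CLI value beats CLASSIFICATION_MODEL.
    """
    base_url = os.environ.get("CLASSIFICATION_BASE_URL")
    api_key = os.environ.get("CLASSIFICATION_API_KEY")
    if provider == "openrouter":
        base_url = base_url or OPENROUTER_BASE_URL
        api_key = api_key or os.environ.get("OPENROUTER_API_KEY")
    else:  # openai
        api_key = api_key or os.environ.get("OPENAI_API_KEY")
    return {
        "base_url": base_url,  # None = let the client use its default endpoint
        "api_key": api_key,
        "model": model or os.environ.get("CLASSIFICATION_MODEL"),
    }
```

Because everything funnels through one `base_url`/`api_key` pair, LiteLLM, vLLM, or Azure targets only need the two `CLASSIFICATION_*` overrides set.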
Out of scope
- The consolidation-scheduler fix (`automem/consolidation/runtime_scheduler.py`) is not in this PR — it's a separate concern (FalkorDB RDB-loading race at init) and will land as its own PR.
- `discover_creative_associations` (the rule-based "dreaming" edge inference in `consolidation.py`) is unchanged here. There's an open thought to LLM-replace that with the same Gemini 3.1 Flash-Lite + tight-prompt pattern — happy to file a separate issue if it's interesting to benchmark.

Test plan
- [ ] `./scripts/reclassify_with_llm.py --help` shows all new flags
- [ ] `--dry-run --sample 10` runs against a dev FalkorDB without writing
- [ ] `--provider openrouter --model google/gemini-3.1-flash-lite-preview --sample 10 --dry-run` works end-to-end