feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt #164

Open
flintfromthebasement wants to merge 1 commit into verygoodplugins:main from flintfromthebasement:feat/reclassify-script-improvements

Conversation

@flintfromthebasement (Contributor)

## Why

scripts/reclassify_with_llm.py is a one-shot maintenance tool, but in its current form it's all-or-nothing: there's no way to dry-run it, no way to sample a subset, and no way to point it at anything except OpenAI. That makes it scary to actually run on a real corpus, and impossible to benchmark alternative classification models without forking.

This PR makes it safe to use on a production corpus and provider-agnostic, plus tightens the classification prompt based on a 100-memory benchmark.

## What changes

### 1. CLI flags for safe partial runs

| Flag | Purpose |
| --- | --- |
| `--limit N` | Cap memories processed |
| `--sample N` | Random sample of N memories (instead of "first N") |
| `--seed N` | Reproducibility for sampled runs |
| `--dry-run` | Classify but don't write back to FalkorDB |
| `--yes` | Skip the interactive confirmation prompt |
| `--provider P` | `openai` or `openrouter` (default: `openai`) |
| `--model M` | Override `CLASSIFICATION_MODEL` per-run |

Typical workflow now:

```bash
# 1. Sanity-check on 100 random memories, no writes
./scripts/reclassify_with_llm.py --sample 100 --seed 42 --dry-run

# 2. If the distribution looks right, commit to the full pass
./scripts/reclassify_with_llm.py --yes
```
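The flag semantics above could be wired up roughly as follows. This is a hedged sketch, not the script's actual internals: `parse_args` and `select_memories` are illustrative names, and the real script may structure this differently.

```python
import argparse
import random


def parse_args(argv=None):
    """Hypothetical CLI wiring for the flags described in the table above."""
    p = argparse.ArgumentParser(description="Reclassify memories with an LLM")
    p.add_argument("--limit", type=int, help="cap the number of memories processed")
    p.add_argument("--sample", type=int, help="random sample of N memories")
    p.add_argument("--seed", type=int, help="seed for reproducible sampling")
    p.add_argument("--dry-run", action="store_true", help="classify but don't write back")
    p.add_argument("--yes", action="store_true", help="skip the confirmation prompt")
    p.add_argument("--provider", choices=["openai", "openrouter"], default="openai")
    p.add_argument("--model", help="override CLASSIFICATION_MODEL for this run")
    return p.parse_args(argv)


def select_memories(memories, args):
    """Apply --sample/--seed or --limit to the fetched memory list."""
    if args.sample:
        rng = random.Random(args.seed)  # seeded RNG => reproducible sample
        return rng.sample(memories, min(args.sample, len(memories)))
    if args.limit:
        return memories[: args.limit]  # plain "first N" cap
    return memories
```

The key difference between `--sample` and `--limit` is that sampling draws uniformly across the corpus rather than taking whatever the query returns first, which matters when memory types cluster by insertion order.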

### 2. OpenRouter / OpenAI-compatible provider support

Adds three env vars: `OPENROUTER_API_KEY`, `CLASSIFICATION_BASE_URL`, and `CLASSIFICATION_API_KEY`. The same script can now target OpenRouter, LiteLLM, vLLM, Azure, or any OpenAI-compatible endpoint without code changes.
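A plausible resolution order for these variables, sketched as a pure function so it's easy to test. This is an assumption about the precedence (explicit `CLASSIFICATION_*` overrides win, then provider-specific defaults); the function name and the OpenRouter default URL are illustrative, not quoted from the script.

```python
def resolve_endpoint(provider: str, env: dict):
    """Pick (base_url, api_key) for the classification client.

    Assumed precedence: explicit CLASSIFICATION_BASE_URL / CLASSIFICATION_API_KEY
    beat everything; otherwise the --provider flag selects sensible defaults.
    """
    base_url = env.get("CLASSIFICATION_BASE_URL")
    api_key = env.get("CLASSIFICATION_API_KEY")
    if provider == "openrouter":
        base_url = base_url or "https://openrouter.ai/api/v1"
        api_key = api_key or env.get("OPENROUTER_API_KEY")
    else:
        # base_url stays None so the OpenAI SDK uses its built-in default.
        api_key = api_key or env.get("OPENAI_API_KEY")
    return base_url, api_key
```

Because `CLASSIFICATION_BASE_URL` wins unconditionally, pointing the script at LiteLLM or vLLM is just a matter of setting that one variable.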

Includes a tolerant JSON extractor for models that don't honor `response_format=json` (Gemini families on OpenRouter return prose-wrapped JSON, which otherwise crashes the strict parser).
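A minimal sketch of what such a tolerant extractor can look like: try strict parsing first, then markdown fences, then the first brace-delimited span. The function name and fallback order are assumptions, not the script's exact implementation.

```python
import json
import re


def extract_json(text: str) -> dict:
    """Best-effort JSON extraction for models that wrap JSON in prose
    or markdown fences instead of honoring response_format=json."""
    # Fast path: the whole response is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Next: a markdown code fence like ```json ... ```
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # Last resort: the outermost {...} span anywhere in the text.
    brace = re.search(r"\{.*\}", text, re.DOTALL)
    if brace:
        return json.loads(brace.group(0))
    raise ValueError("no JSON object found in model response")
```

The greedy `\{.*\}` grabs from the first `{` to the last `}`, which keeps nested objects intact at the cost of failing on responses containing multiple unrelated JSON blobs; for a single-classification response that trade-off is fine.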

### 3. Tightened SYSTEM_PROMPT

The prior prompt was a loose 7-bullet type list. New prompt has strict definitions, keyword cues, and explicit priority rules ("Fact:" and descriptive statements go to Context, not Insight; chat/DM fragments aren't Decisions just because they contain "decided").

Empirical impact on a 100-memory sample (Gemini 3.1 Flash-Lite via OpenRouter):

| Type | Before (loose) | After (strict) |
| --- | --- | --- |
| Insight | 56% (catch-all) | 8% |
| False Decisions on DM/session fragments | several | 0 |
| Context, Pattern, Habit | underused | distribution closer to intent |

## Out of scope

- The startup-tick guard from the same flint-branch commit (`automem/consolidation/runtime_scheduler.py`) is not in this PR — it's a separate concern (FalkorDB RDB-loading race at init) and will land as its own PR.
- `discover_creative_associations` (the rule-based "dreaming" edge inference in `consolidation.py`) is unchanged here. There's an open thought to LLM-replace that with the same Gemini 3.1 Flash-Lite + tight-prompt pattern — happy to file a separate issue if it's interesting to benchmark.

## Test plan

- `./scripts/reclassify_with_llm.py --help` shows all new flags
- `--dry-run --sample 10` runs against a dev FalkorDB without writing
- `--provider openrouter --model google/gemini-3.1-flash-lite-preview --sample 10 --dry-run` works end-to-end
- Default behavior (no flags, OpenAI) is unchanged from the prior script

feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt

Three improvements to scripts/reclassify_with_llm.py to make it safe to
run on a real corpus and easy to retarget at different LLM providers.

1. CLI flags for safe partial runs:
   - --limit N        cap the number of memories processed
   - --sample N       random sample N memories (instead of first N)
   - --seed N         reproducibility for sampled runs
   - --dry-run        classify but don't write back to FalkorDB
   - --yes            skip the interactive confirmation prompt
   - --provider P     openai | openrouter (default: openai)
   - --model M        override CLASSIFICATION_MODEL per-run

   Lets you do a 100-memory sanity-check pass before committing to a
   full reclassification across thousands of records. The prior version
   was all-or-nothing.

2. OpenRouter / OpenAI-compatible support:
   - Adds OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY
     env vars so the same script can target any OpenAI-compatible endpoint
     (OpenRouter, LiteLLM, vLLM, Azure, etc.) without code changes.
   - Adds a tolerant JSON extractor for models that don't honor
     response_format=json (e.g. Gemini families on OpenRouter), which
     otherwise return prose-wrapped JSON and crash the strict parser.

3. Tightened SYSTEM_PROMPT:
   - Replaces the loose 7-bullet type list with strict definitions,
     keyword cues, and explicit priority rules ("Fact:" / descriptive
     statements go to Context, not Insight; chat/DM fragments aren't
     Decisions just because they contain the word "decided").
   - Empirical impact on a 100-memory sample using Gemini 3.1 Flash-Lite:
     - Insight share: 56% → 8% (was being used as a catch-all)
     - False Decision calls on session/DM fragments eliminated
     - Pattern, Context, Habit usage closer to the intended distribution

The script remains a one-shot maintenance tool — typically run after a
model swap, prompt change, or large bulk import — not a recurring task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jack-arturo pushed a commit that referenced this pull request May 1, 2026
…B load race (#165)

## Why

When `init_consolidation_scheduler()` runs a tick **immediately** after
spawning the worker thread, FalkorDB can still be loading its RDB
snapshot from disk. Every Redis command during that window returns:

> `LOADING Redis is loading the dataset in memory`

The eager tick catches the error, logs it, and bumps `last_run`
timestamps — silently skipping the day's decay / creative / cluster work
until tomorrow. The bigger the corpus, the longer the RDB load, the more
reliably this fires. On any restart-on-deploy host (Railway, Docker,
systemd) with a few thousand memories, it hits every deploy.

## What changes

One line in `automem/consolidation/runtime_scheduler.py:100` — drop the
eager `run_consolidation_tick_fn()` call after starting the worker
thread, and add a comment explaining why.

```diff
     state.consolidation_thread.start()
-    run_consolidation_tick_fn()
+    # Skip eager first tick: FalkorDB may still be loading its RDB snapshot at
+    # startup and the "Redis is loading the dataset in memory" error poisons
+    # the day's decay/creative run. The worker loop will fire its first tick
+    # after consolidation_tick_seconds, which is plenty of warm-up time.
     logger.info("Consolidation scheduler initialized")
```

## Why this is safe

- The worker loop still fires within `CONSOLIDATION_TICK_SECONDS`
(default 3600s = 1h). For decay/creative/cluster intervals measured in
days, a one-tick startup delay is invisible.
- The scheduler is timestamp-driven (`last_run` per task), not
edge-triggered. Missed intervals get picked up by the next loop
iteration — nothing is "lost" by deferring.
- Failure mode flips from "silent broken run" to "no run yet, will run
shortly" — strictly better.
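The timestamp-driven property described above can be sketched in a few lines. This is an illustrative model of the mechanism, not the actual `runtime_scheduler.py` code: task names, the `(interval, fn)` task shape, and `run_tick` are all hypothetical.

```python
def run_tick(tasks, last_run, now):
    """Run every task whose interval has elapsed since its last_run.

    Because eligibility is computed from timestamps rather than edge-triggered
    events, a skipped or deferred tick is harmless: the elapsed-time check
    simply catches up on the next loop iteration.
    """
    ran = []
    for name, (interval, fn) in tasks.items():
        if now - last_run.get(name, 0.0) >= interval:
            fn()
            last_run[name] = now
            ran.append(name)
    return ran
```

Dropping the eager first call just means the first `run_tick` happens one `CONSOLIDATION_TICK_SECONDS` later; any task that was due still runs then, because its `last_run` never advanced.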

## Out of scope

- A more involved fix would actively probe FalkorDB readiness with
retries before the first tick. That's a bigger change and arguably
belongs at the FalkorDB-client layer, not here. This PR is the minimal,
low-risk fix.
- The `discover_creative_associations` / clustering improvements live in
#163 and #164.

## Test plan

- [ ] Service starts cleanly with no eager tick log entry
- [ ] Worker loop fires its first tick after
`CONSOLIDATION_TICK_SECONDS`
- [ ] Forcing a tick via `POST /consolidate` still works immediately
- [ ] On a restart with a large RDB, no `LOADING Redis is loading the
dataset in memory` errors appear in consolidation logs

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>