DEV-1416: refresh memory embeddings on slayer ingest / --ingest-on-startup#128
Conversation
…artup Per-datasource ingest pass now re-embeds every memory whose canonical entities are rooted at the datasource. A stale embeddings.db (created without an API key, or after a manual memories.yaml edit) is repaired by the next ingest without any extra step. Per-memory embed failures attribute as IngestionError(model_name="memory:<id>", ...) so a startup log inspection can distinguish them from datasource-doc / model failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…embeddings-not-refreshed-by-ingest-on-startup # Conflicts: # CLAUDE.md # pyproject.toml
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR makes idempotent per-datasource ingestion refresh embeddings for all memories canonically rooted at that datasource (including startup ingest), returns per-entity (model_name, error) warning tuples for precise error attribution, expands name-validation rules, adds tests, updates docs, and bumps the package version. ChangesMemory embedding refresh and startup repair
Sequence Diagram(s)sequenceDiagram
participant Ingest as ingest_datasource_idempotent
participant Refresh as _refresh_datasource_embeddings
participant Models as _refresh_models_for_datasource
participant Doc as _refresh_datasource_doc
participant MemList as storage.list_memories()
participant MemRefresh as EmbeddingService.refresh_memory()
participant Persist as storage.save_embeddings()
participant Errors as IdempotentIngestResult.errors
Ingest->>Refresh: trigger embeddings refresh for datasource
Refresh->>Models: refresh each visible model subtree
Models-->>Refresh: warnings tagged "<ds>.<model>"
Refresh->>Doc: refresh datasource doc embedding
Doc-->>Refresh: warnings tagged ""
Refresh->>MemList: list memories
MemList-->>Refresh: memory records (filtered by canonical root)
loop for each memory
Refresh->>MemRefresh: refresh_memory(memory)
MemRefresh-->>Persist: save embeddings
Persist-->>Refresh: persist success or exception (warnings include canonical_ids)
end
Refresh->>Errors: return (model_name, error) tuples for ingestion mapping
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/test_idempotent_ingestion.py (2)
559-563: 💤 Low valueConsider adding SONAR comment for consistency.
This async stub function must remain
async defto match theembed_batchsignature, even though it doesn't await. Adding# NOSONAR(S7503)would suppress the warning and maintain consistency with other test stubs in this file.📝 Suggested addition
- async def should_not_be_called( + async def should_not_be_called( # NOSONAR(S7503) — must be `async def` to match embed_batch signature for monkeypatch texts: List[str], *, model: Optional[str] = None, ) -> List[Optional[List[float]]]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_idempotent_ingestion.py` around lines 559 - 563, The async stub function should_not_be_called is intentionally declared async though it doesn't await; add the SONAR suppression comment "# NOSONAR(S7503)" on the async def line to suppress the S7503 warning and keep it consistent with other test stubs (leave the function body unchanged so it still appends to called and returns [None for _ in texts]).
393-397: 💤 Low valueConsider adding SONAR comment for consistency.
Similar to the stub in
test_startup_ingest.py, thisasync deffunction doesn't await anything but must remain async to match theembed_batchsignature for monkeypatching. Adding# NOSONAR(S7503)would suppress the static analysis warning and align with the pattern used intest_startup_ingest.py.📝 Suggested addition
- async def fake_embed_batch( + async def fake_embed_batch( # NOSONAR(S7503) — must be `async def` to match embed_batch signature for monkeypatch texts: List[str], *, model: Optional[str] = None, ) -> List[Optional[List[float]]]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_idempotent_ingestion.py` around lines 393 - 397, The async stub fake_embed_batch currently has no await but must remain async to match the embed_batch signature for monkeypatching; add a SONAR suppression comment (e.g., # NOSONAR(S7503)) on the fake_embed_batch definition to mirror the pattern used in test_startup_ingest.py and suppress the static-analysis warning while keeping the function async for tests.tests/test_startup_ingest.py (1)
861-864: 💤 Low valueConsider adding NOSONAR comment for consistency.
The static analysis tool flags this async function for not using await, but this is a false positive—the function must be
async defto match the signature ofembed_batchfor monkeypatching. Similar stubs elsewhere in this file (e.g., lines 77, 144, 383, 488, 527, 559, 588, 658, 683, 704, 728) include# NOSONAR(S7503)comments to suppress this warning. Adding the same comment here would maintain consistency.📝 Suggested addition
- async def fake_embed_batch( + async def fake_embed_batch( # NOSONAR(S7503) — must be `async def` to match embed_batch signature for monkeypatch texts: List[str], *, model: Optional[str] = None, ) -> List[Optional[List[float]]]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_startup_ingest.py` around lines 861 - 864, Add the same NOSONAR suppression comment to the async stub so the static analyzer stops flagging it: in the fake_embed_batch async function (definition named fake_embed_batch) append the comment "# NOSONAR(S7503)" to the function body or immediately after the def line to match the other stubs in this file and preserve the required async signature for monkeypatching.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_idempotent_ingestion.py`:
- Around line 559-563: The async stub function should_not_be_called is
intentionally declared async though it doesn't await; add the SONAR suppression
comment "# NOSONAR(S7503)" on the async def line to suppress the S7503 warning
and keep it consistent with other test stubs (leave the function body unchanged
so it still appends to called and returns [None for _ in texts]).
- Around line 393-397: The async stub fake_embed_batch currently has no await
but must remain async to match the embed_batch signature for monkeypatching; add
a SONAR suppression comment (e.g., # NOSONAR(S7503)) on the fake_embed_batch
definition to mirror the pattern used in test_startup_ingest.py and suppress the
static-analysis warning while keeping the function async for tests.
In `@tests/test_startup_ingest.py`:
- Around line 861-864: Add the same NOSONAR suppression comment to the async
stub so the static analyzer stops flagging it: in the fake_embed_batch async
function (definition named fake_embed_batch) append the comment "#
NOSONAR(S7503)" to the function body or immediately after the def line to match
the other stubs in this file and preserve the required async signature for
monkeypatching.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6d885198-0f10-4162-b1cd-b7eb0adf114f
📒 Files selected for processing (7)
CLAUDE.mddocs/concepts/ingestion.mddocs/concepts/search.mdpyproject.tomlslayer/engine/ingestion.pytests/test_idempotent_ingestion.pytests/test_startup_ingest.py
…s, NOSONAR) * Embedding persist-failure warning now lists the failing rows' canonical ids so per-memory persist failures attribute to memory:<id> via the ingest IngestionError shape. * Forbid ":" in all name validators (Column, ModelMeasure, SlayerModel name + data_source, SlayerQuery, DatasourceConfig). The aggregation separator is reserved across the canonical-id namespace; this enforces the invariant the memory resolver already assumes. * Factor shared name-validation rules into _SubstringRule constants (_NO_DUNDER / _NO_DOT / _NO_COLON / _NO_FWD_SLASH / _NO_BACK_SLASH / _NO_NUL) plus _require_non_empty_trimmed, so the four name validators share one source of truth per character. * Replace the regex-based memory:<id> extraction in ingest_datasource_idempotent with structured (model_name, error_text) tuples returned from three helpers: _refresh_models_for_datasource, _refresh_datasource_doc, _refresh_memories_for_datasource. Drops the fragile string sniffing and brings _refresh_datasource_embeddings' cognitive complexity back under the Sonar gate. * Add NOSONAR(S7503) on three async test stubs that must be `async def` to match the patched signatures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
slayer/core/models.py (1)
97-103: ⚡ Quick winUse keyword arguments in multi-parameter helper calls.
The new helper calls pass multiple arguments positionally. Please switch them to keyword arguments to match repository rules and keep call sites self-documenting.
Suggested patch
def _validate_model_name(name: str, context: str) -> str: @@ - _NO_DUNDER.check(name, label) - _NO_DOT.check(name, label) - _NO_COLON.check(name, label) + _NO_DUNDER.check(name=name, context=label) + _NO_DOT.check(name=name, context=label) + _NO_COLON.check(name=name, context=label) def _validate_column_name(name: str, context: str) -> str: @@ - _NO_DOT.check(name, label) - _NO_COLON.check(name, label) + _NO_DOT.check(name=name, context=label) + _NO_COLON.check(name=name, context=label) class SlayerModel(BaseModel): @@ - _NO_NUL.check(v, label) - _NO_FWD_SLASH.check(v, label) - _NO_BACK_SLASH.check(v, label) - _NO_DOT.check(v, label) - _NO_COLON.check(v, label) + _NO_NUL.check(name=v, context=label) + _NO_FWD_SLASH.check(name=v, context=label) + _NO_BACK_SLASH.check(name=v, context=label) + _NO_DOT.check(name=v, context=label) + _NO_COLON.check(name=v, context=label) class DatasourceConfig(BaseModel): @@ - _require_non_empty_trimmed(v, label) - _NO_NUL.check(v, label) - _NO_FWD_SLASH.check(v, label) - _NO_BACK_SLASH.check(v, label) - _NO_DOT.check(v, label) - _NO_COLON.check(v, label) + _require_non_empty_trimmed(v=v, context=label) + _NO_NUL.check(name=v, context=label) + _NO_FWD_SLASH.check(name=v, context=label) + _NO_BACK_SLASH.check(name=v, context=label) + _NO_DOT.check(name=v, context=label) + _NO_COLON.check(name=v, context=label)As per coding guidelines
**/*.py: Use keyword arguments for functions with more than 1 parameter.Also applies to: 112-114, 430-435, 706-711
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@slayer/core/models.py` around lines 97 - 103, Change positional multi-argument helper calls to use keyword arguments: in _validate_model_name, update the three helper invocations (_NO_DUNDER.check, _NO_DOT.check, _NO_COLON.check) to call check(name=..., label=...) instead of positional args; also apply the same change to the other _*.check call sites in this module (the other occurrences referenced in the review) so every check(...) call with more than one parameter uses explicit keyword argument names.slayer/core/query.py (1)
330-331: ⚡ Quick winUse keyword arguments when delegating name validation.
Please call
_validate_model_namewith keyword arguments for consistency with repo style rules.Suggested patch
- return _validate_model_name(v, "Query") + return _validate_model_name(name=v, context="Query")As per coding guidelines
**/*.py: Use keyword arguments for functions with more than 1 parameter.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@slayer/core/query.py` around lines 330 - 331, Replace the positional call _validate_model_name(v, "Query") with a keyword-argument call so it follows repo style; call _validate_model_name using the parameter names from its signature (for example: _validate_model_name(name=v, model_name="Query") or similar), ensuring you pass the current variable v as the name and "Query" as the model/context argument.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@slayer/engine/ingestion.py`:
- Around line 1226-1248: The helper currently builds models_in_ds by iterating
storage._list_all_model_identities and calling storage.get_model, but it still
proceeds to call _refresh_datasource_doc even when enumeration or per-model
get_model errors occurred or when hidden models were appended; this can
overwrite a correct datasource doc with a partial or unfiltered one. Fix by:
stop updating the datasource doc if storage._list_all_model_identities raised or
any get_model call failed (i.e., if the function caught exceptions and appended
to warnings), and additionally filter out hidden/ignored models from
models_in_ds before returning or refreshing (use the same visibility/filtering
logic as service.refresh_model_subtree); apply the same guard/filters in the
analogous block at the later range (lines referenced in the comment) so
_refresh_datasource_doc is only called with a complete, filtered models_in_ds.
---
Nitpick comments:
In `@slayer/core/models.py`:
- Around line 97-103: Change positional multi-argument helper calls to use
keyword arguments: in _validate_model_name, update the three helper invocations
(_NO_DUNDER.check, _NO_DOT.check, _NO_COLON.check) to call check(name=...,
label=...) instead of positional args; also apply the same change to the other
_*.check call sites in this module (the other occurrences referenced in the
review) so every check(...) call with more than one parameter uses explicit
keyword argument names.
In `@slayer/core/query.py`:
- Around line 330-331: Replace the positional call _validate_model_name(v,
"Query") with a keyword-argument call so it follows repo style; call
_validate_model_name using the parameter names from its signature (for example:
_validate_model_name(name=v, model_name="Query") or similar), ensuring you pass
the current variable v as the name and "Query" as the model/context argument.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9ceaf4b8-7e84-4dd2-a449-dbbd6860162c
📒 Files selected for processing (7)
slayer/core/models.pyslayer/core/query.pyslayer/embeddings/service.pyslayer/engine/ingestion.pytests/test_idempotent_ingestion.pytests/test_models.pytests/test_startup_ingest.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_idempotent_ingestion.py
- tests/test_startup_ingest.py
Switch the new _NO_*.check() and _require_non_empty_trimmed() call sites in slayer/core/models.py to keyword form, per the global rule about multi-parameter helper calls. Mechanical change; existing validator tests pin behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|



Summary
_refresh_datasource_embeddingsnow iterates memories whose canonical entities are rooted at the datasource (viacanonical_id_rooted_at) and re-embeds each one throughEmbeddingService.refresh_memory. Bothslayer ingest <ds>and--ingest-on-startupgo through this path, so a staleembeddings.db(created without an API key, or after a manualmemories.yamledit) is repaired by the next ingest with no extra step.IngestionError(model_name="memory:<id>", …)inIdempotentIngestResult.errors, so startup-log inspection distinguishes memory failures from datasource-doc / model failures at a glance. Regex extraction lives iningest_datasource_idempotent(_MEMORY_REF_RE).list_memoriesraising is caught and surfaced as a single warning — the datasource pass never propagates the failure.Linear: DEV-1416. Follow-up issue for the generic flat batcher across entity kinds: DEV-1417.
Test plan
tests/test_idempotent_ingestion.py::TestMemoryEmbeddingRefresh— six unit cases covering: rooted-only filter, hash-skip no-op on second pass, per-memory failure →model_name="memory:<id>", defensive raise tagged,embedding_searchextra missing → silent no-op,list_memoriesraise → warning + continue.tests/test_startup_ingest.py::TestMemoryEmbeddingsOnStartup— end-to-end throughingest_all_datasources_idempotenton a realYAMLStorage+ sqlite live DB, asserts the memory embedding row exists post-pass.poetry run pytest -m "not integration"— 2651 passed, 0 failed.poetry run ruff check slayer/ tests/— clean.Out of scope
embedding_statusonsave_memoryresponse.slayer embeddings refreshCLI (tracked in DEV-1415).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Validation
Documentation
Tests
Chores