Migrate embedder configuration from Django settings to PipelineSettings#896
Migrate embedder configuration from Django settings to PipelineSettings#896
Conversation
…singleton Several production code paths still read the default embedder path directly from Django settings (settings.DEFAULT_EMBEDDER) instead of the PipelineSettings database singleton, creating a split-brain bug where runtime changes to the default embedder via GraphQL mutations would not take full effect. Changes: - Add get_default_embedder_path() utility in pipeline/utils.py that returns the path string from the PipelineSettings singleton - Refactor get_default_embedder() to use get_default_embedder_path() - Simplify get_default_embedder_for_filetype() to delegate to get_preferred_embedder() (was reading from removed settings.DEFAULT_EMBEDDERS_BY_FILETYPE) - Update all production consumers: - embeddings_task.py (_apply_dual_embedding_strategy) - core_vector_stores.py (global_search) - corpuses/models.py (Corpus.save embedder freeze) - corpuses/checks.py (startup consistency check) - backfill_default_embeddings.py (management command) - queries.py (GraphQL semantic search resolver) - unified_agent_conversation.py (WebSocket embedder picker) - Update tests to mock get_default_embedder_path instead of settings.DEFAULT_EMBEDDER https://claude.ai/code/session_01XdTX294CLzLHxrnaC5sodd
PR Review: Migrate embedder configuration from Django settings to PipelineSettingsThis is a clean, well-scoped migration that correctly moves embedder configuration from static Django settings to the database-backed Correctness & LogicMinor issue: redundant # corpuses/models.py line ~377
default_embedder = get_default_embedder_path() or None
Behaviour change in The old implementation looked up by mimetype in Test Coverage Gaps
This test file was not updated in this PR and uses the old mechanism: @override_settings(DEFAULT_EMBEDDER="test.embedder.path")
def test_...(self):
call_command("backfill_default_embeddings", ...)But the command now calls Similarly, Missing test: The new utility function PerformanceDB hit on every call to
from opencontractserver.pipeline.utils import (
get_default_embedder,
get_default_embedder_path,
)Minor style note: since Architecture / DesignInline imports for circular import avoidance The inline
Now that this function is a one-liner delegating to Summary
The core migration logic is correct and well-structured. The main concern is that several test files that were not updated in this PR will now silently test the wrong thing after the settings source changes. Those tests should be updated to either mock |
Replace settings.DEFAULT_EMBEDDER with get_default_embedder_path() to complete the migration to the database-backed PipelineSettings singleton.
Replace decorative @override_settings(DEFAULT_EMBEDDER=...) with @patch targeting the actual code path through get_default_embedder_path().
Replace @override_settings(DEFAULT_EMBEDDER=...) with @patch targeting the actual code path through get_default_embedder_path() in corpus_tasks.
Replace settings.DEFAULT_EMBEDDER value lookups with get_default_embedder_path() to match what the vector store search code now reads from PipelineSettings.
get_default_embedder_path() returns "" when unconfigured, and the subsequent 'if not ... and default_embedder:' guard handles falsy values correctly without the conversion to None.
Since get_default_embedder_path is imported inside function bodies (not at module level), @patch must target the source module opencontractserver.pipeline.utils, not the consuming module.
PR Review: Migrate embedder configuration to PipelineSettingsOverall this is a clean, well-scoped migration with good test coverage. The approach is sound — removing Django settings as the runtime source of truth in favour of the database-backed singleton is the right direction. A few issues need addressing before merge. Bug (Critical) — Sync ORM call inside an async methodFile: async def _pick_document_embedder(self) -> str | None:
...
paths = await database_sync_to_async(get_embedder_paths)()
if paths:
return paths[0]
else:
from opencontractserver.pipeline.utils import get_default_embedder_path
return get_default_embedder_path() # ← called raw inside async def
Fix: wrap the call: return await database_sync_to_async(get_default_embedder_path)()Bug (Minor) — Stale log messageFile: logger.error("No default embedder specified in settings")This message still says "in settings" after the config has been moved to Dead Code —
|
| File | Lines |
|---|---|
test_embedder_management.py |
39, 44, 736 |
test_dual_embeddings.py |
90, 104, 128, 169, 305, 343, 440 |
test_pydantic_ai_agents.py |
233 |
test_corpus_embeddings.py |
73 |
test_semantic_search_graphql.py |
69 |
These pass today because PipelineSettings.get_instance() bootstraps its default_embedder from settings.DEFAULT_EMBEDDER on first access — so the two values happen to be equal. The tests therefore don't actually verify that the production path reads from PipelineSettings; they verify a transitive equality. This creates a fragile coupling: if someone nulls out settings.DEFAULT_EMBEDDER in the test config (which would be consistent with full deprecation), several tests would silently start testing against an empty string.
The tests in test_embedder_management.py lines 36–44 are the sharpest example — test_preferred_embedder_auto_populated_from_default and test_created_with_embedder_set_on_creation should be updated to assert against a known value seeded into PipelineSettings directly, similar to the pattern already used in the updated test_auto_population_uses_current_default.
Documentation nits
config/graphql/queries.pylines 2197 and 2219 still referenceDEFAULT_EMBEDDERby name in inline comments. Not blocking, but worth cleaning up for consistency.opencontractserver/tests/test_backfill_default_embeddings.pydocstring at line 4 still says "Tests the command that backfills DEFAULT_EMBEDDER embeddings" — same minor inconsistency.
What's good
- The new
get_default_embedder_path()helper is a clean single point of abstraction, andget_default_embedder()correctly delegates to it (DRY improvement). - The system check in
checks.pycorrectly guards against thedocuments_pipelinesettingstable not yet existing, preventing spurious errors during initial migrations. get_default_embedder_path() or Noneincorpus_tasks.pycorrectly handles the empty-string-vs-None contract difference between the old and new APIs.- Test patches target the correct import location (where the symbol is used, not where it's defined).
- Wrap get_default_embedder_path() in database_sync_to_async inside async _pick_document_embedder to prevent SynchronousOnlyOperation - Update log message to reference PipelineSettings instead of settings - Remove dead DEFAULT_EMBEDDERS_BY_FILETYPE from settings and docs
…_embedder_path() Replace fragile transitive coupling through Django settings with direct calls to the production API across 5 test files (13 occurrences).
…of DEFAULT_EMBEDDER
PR Review: Migrate embedder configuration to PipelineSettingsThe direction of this change is sound — centralizing embedder configuration in the database-backed Bug / Correctness1. Hardcoded database table name in In if "documents_pipelinesettings" not in tables:
return errorsThis string is derived from the app label + model name and will silently break (returning no warnings at all) if either ever changes. The function already uses introspection for the column check — the table guard should too: from opencontractserver.documents.models import PipelineSettings
if PipelineSettings._meta.db_table not in tables:
return errorsThe import should be placed inside the existing try/except block where other guarded imports live. 2. Semantic drift in The old implementation read from The updated test in Clarity / Potential Confusion3.
A short inline comment would prevent this: # Bootstrap-only: used by PipelineSettings.get_instance() to seed the singleton
# on first DB creation. NOT read at runtime — configure via admin UI or
# updatePipelineSettings mutation after initial setup.
DEFAULT_EMBEDDER = "..."4. Documentation still describes
Minor / Non-blocking5. The function correctly returns 6. Performance (already handled — FYI)
SummaryThe hardcoded table name (#1) and the undocumented behavioral change in |
QuerySet.delete() bypasses model.delete() and doesn't invalidate the PipelineSettings cache. During setUp(), user creation signals trigger personal corpus creation which calls get_default_embedder_path(), populating the cache with test.py defaults. The two @override_settings tests then get stale cached values instead of creating fresh instances with the overridden settings. Adding _invalidate_cache() after the queryset delete ensures get_instance() hits the DB and picks up the overridden Django settings.
PR Review: Migrate embedder configuration to PipelineSettingsThis is a clean, well-scoped migration. The overall approach is sound and the async fix in the WebSocket consumer is correct. A few issues worth addressing before merge: Bugs / Issues1. Incorrect patch target in
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Summary
This PR migrates embedder configuration from Django settings (
DEFAULT_EMBEDDER,DEFAULT_EMBEDDERS_BY_FILETYPE) to the database-backedPipelineSettingssingleton. This enables dynamic embedder configuration without requiring application restarts.Key Changes
django.conf.settingsfor embedder configuration across the codebaseget_default_embedder_path()to centralize retrieval of the default embedder path fromPipelineSettingsget_default_embedder()now usesget_default_embedder_path()internallyget_default_embedder_for_filetype()now delegates toget_preferred_embedder()which reads fromPipelineSettingsembeddings_task.py: Usesget_default_embedder_path()instead ofsettings.DEFAULT_EMBEDDERcore_vector_stores.py: Global search now reads fromPipelineSettingscorpuses/models.py: Corpus creation freezes embedder fromPipelineSettingsbackfill_default_embeddings.py: Management command uses new utilityunified_agent_conversation.py: WebSocket consumer uses new utilitycheck_default_embedder_consistency()now reads fromPipelineSettingsand provides updated guidance for configurationDEFAULT_EMBEDDERS_BY_FILETYPEfrombase.pyandtest.py(no longer referenced by any runtime code)get_default_embedder_path()instead of Django settings; migrated all test assertions fromsettings.DEFAULT_EMBEDDERtoget_default_embedder_path()to test the actual production APIReview Feedback Addressed
get_default_embedder_path()indatabase_sync_to_async()inside_pick_document_embedderto preventSynchronousOnlyOperationpipeline/utils.pyto referencePipelineSettingsinstead of "settings"DEFAULT_EMBEDDERS_BY_FILETYPE: Deleted from settings files and documentationsettings.DEFAULT_EMBEDDERacross 5 test files withget_default_embedder_path()DEFAULT_EMBEDDERas a settings constantImplementation Details
get_default_embedder_path()function importsPipelineSettingslocally to avoid circular imports