Skip to content

Migrate embedder configuration from Django settings to PipelineSettings#896

Open
JSv4 wants to merge 15 commits intomainfrom
claude/verify-pipeline-defaults-XDUWa
Open

Migrate embedder configuration from Django settings to PipelineSettings#896
JSv4 wants to merge 15 commits intomainfrom
claude/verify-pipeline-defaults-XDUWa

Conversation

@JSv4
Copy link
Collaborator

@JSv4 JSv4 commented Feb 17, 2026

Summary

This PR migrates embedder configuration from Django settings (DEFAULT_EMBEDDER, DEFAULT_EMBEDDERS_BY_FILETYPE) to the database-backed PipelineSettings singleton. This enables dynamic embedder configuration without requiring application restarts.

Key Changes

  • Removed Django settings dependency: Eliminated imports of django.conf.settings for embedder configuration across the codebase
  • New utility function: Added get_default_embedder_path() to centralize retrieval of the default embedder path from PipelineSettings
  • Simplified embedder resolution:
    • get_default_embedder() now uses get_default_embedder_path() internally
    • get_default_embedder_for_filetype() now delegates to get_preferred_embedder() which reads from PipelineSettings
  • Updated all consumers: Modified all code paths that previously read from Django settings to use the new utility functions:
    • embeddings_task.py: Uses get_default_embedder_path() instead of settings.DEFAULT_EMBEDDER
    • core_vector_stores.py: Global search now reads from PipelineSettings
    • corpuses/models.py: Corpus creation freezes embedder from PipelineSettings
    • backfill_default_embeddings.py: Management command uses new utility
    • unified_agent_conversation.py: WebSocket consumer uses new utility
    • GraphQL queries: Semantic search uses new utility
  • Updated system checks: check_default_embedder_consistency() now reads from PipelineSettings and provides updated guidance for configuration
  • Removed dead config: Deleted DEFAULT_EMBEDDERS_BY_FILETYPE from base.py and test.py (no longer referenced by any runtime code)
  • Updated tests: Modified test mocks to patch get_default_embedder_path() instead of Django settings; migrated all test assertions from settings.DEFAULT_EMBEDDER to get_default_embedder_path() to test the actual production API

Review Feedback Addressed

  • Fixed async ORM bug: Wrapped get_default_embedder_path() in database_sync_to_async() inside _pick_document_embedder to prevent SynchronousOnlyOperation
  • Fixed stale log message: Updated error message in pipeline/utils.py to reference PipelineSettings instead of "settings"
  • Removed dead DEFAULT_EMBEDDERS_BY_FILETYPE: Deleted from settings files and documentation
  • Migrated test assertions: Replaced 13 occurrences of settings.DEFAULT_EMBEDDER across 5 test files with get_default_embedder_path()
  • Updated documentation nits: Cleaned up comments and docstrings referencing DEFAULT_EMBEDDER as a settings constant

Implementation Details

  • The get_default_embedder_path() function imports PipelineSettings locally to avoid circular imports
  • All existing functionality is preserved; this is purely a configuration source migration
  • The change maintains backward compatibility in behavior while improving configurability

…singleton

Several production code paths still read the default embedder path directly
from Django settings (settings.DEFAULT_EMBEDDER) instead of the
PipelineSettings database singleton, creating a split-brain bug where
runtime changes to the default embedder via GraphQL mutations would not
take full effect.

Changes:
- Add get_default_embedder_path() utility in pipeline/utils.py that
  returns the path string from the PipelineSettings singleton
- Refactor get_default_embedder() to use get_default_embedder_path()
- Simplify get_default_embedder_for_filetype() to delegate to
  get_preferred_embedder() (was reading from removed
  settings.DEFAULT_EMBEDDERS_BY_FILETYPE)
- Update all production consumers:
  - embeddings_task.py (_apply_dual_embedding_strategy)
  - core_vector_stores.py (global_search)
  - corpuses/models.py (Corpus.save embedder freeze)
  - corpuses/checks.py (startup consistency check)
  - backfill_default_embeddings.py (management command)
  - queries.py (GraphQL semantic search resolver)
  - unified_agent_conversation.py (WebSocket embedder picker)
- Update tests to mock get_default_embedder_path instead of
  settings.DEFAULT_EMBEDDER

https://claude.ai/code/session_01XdTX294CLzLHxrnaC5sodd
@claude
Copy link

claude bot commented Feb 17, 2026

PR Review: Migrate embedder configuration from Django settings to PipelineSettings

This is a clean, well-scoped migration that correctly moves embedder configuration from static Django settings to the database-backed PipelineSettings singleton. The overall approach is sound. Here are my findings:


Correctness & Logic

Minor issue: redundant or None in corpuses/models.py

# corpuses/models.py line ~377
default_embedder = get_default_embedder_path() or None

get_default_embedder_path() already returns "" (empty string) when unconfigured, and the subsequent if not self.preferred_embedder and default_embedder: guard already handles falsy values correctly. The or None is a no-op that adds noise without changing behavior. This was likely added to mirror the previous getattr(settings, "DEFAULT_EMBEDDER", None) pattern, but it's now unnecessary.

Behaviour change in get_default_embedder_for_filetype()

The old implementation looked up by mimetype in settings.DEFAULT_EMBEDDERS_BY_FILETYPE, which was a separate dict. The new implementation delegates to get_preferred_embedder(mimetype), which reads PipelineSettings.preferred_embedders. These should be populated equivalently by the migration/sync command, but the docstring and test now assert this delegation without testing the actual PipelineSettings DB path. Existing tests for test_backfill_default_embeddings.py still use @override_settings(DEFAULT_EMBEDDER=...), which will no longer affect behaviour through the new code path — see the test concern below.


Test Coverage Gaps

test_backfill_default_embeddings.py still uses @override_settings(DEFAULT_EMBEDDER=...)

This test file was not updated in this PR and uses the old mechanism:

@override_settings(DEFAULT_EMBEDDER="test.embedder.path")
def test_...(self):
    call_command("backfill_default_embeddings", ...)

But the command now calls get_default_embedder_path(), which reads from PipelineSettings in the database — not from Django settings. The @override_settings decorator will have no effect on the new code path. These tests will either silently pass against whatever is in the test DB's PipelineSettings, or fail unexpectedly if the DB has no embedder configured.

Similarly, test_personal_corpus.py and test_django_annotation_vector_store.py still reference settings.DEFAULT_EMBEDDER heavily and use @override_settings — these tests will be broken or misleading after this change.

Missing test: get_default_embedder_path() itself

The new utility function get_default_embedder_path() has no direct unit test. A simple test verifying it reads from PipelineSettings (and returns "" when unconfigured) would improve confidence.


Performance

DB hit on every call to get_default_embedder_path()

PipelineSettings.get_instance() uses a 5-minute cache, so this is already mitigated. However, the checks.py system check now calls get_default_embedder_path() at startup — this is fine since checks are a one-time operation, but it's worth confirming the cache is warm before Celery workers start dispatching tasks.

global_search now imports two functions instead of one

from opencontractserver.pipeline.utils import (
    get_default_embedder,
    get_default_embedder_path,
)

Minor style note: since get_default_embedder internally calls get_default_embedder_path(), there's a redundant double-lookup in global_search (one for the path, one for the class). This was the same before, so it's not a regression, but it's something to note.


Architecture / Design

Inline imports for circular import avoidance

The inline from opencontractserver.documents.models import PipelineSettings inside get_default_embedder_path() is the right call to avoid circular imports, and the comment correctly documents the reason. No concerns here.

get_default_embedder_for_filetype() is now a trivial alias

Now that this function is a one-liner delegating to get_preferred_embedder(), consider whether the alias is still needed or if call sites should just use get_preferred_embedder() directly to avoid indirection. This is a minor DRY consideration.


Summary

Category Severity Item
Bug Medium test_backfill_default_embeddings.py @override_settings no longer controls the new code path
Bug Medium test_personal_corpus.py and test_django_annotation_vector_store.py still rely on settings.DEFAULT_EMBEDDER
Style Low Redundant or None in corpuses/models.py:377
Coverage Low No direct test for get_default_embedder_path()
Design Low get_default_embedder_for_filetype() is now a trivial pass-through; worth evaluating if the alias adds value

The core migration logic is correct and well-structured. The main concern is that several test files that were not updated in this PR will now silently test the wrong thing after the settings source changes. Those tests should be updated to either mock get_default_embedder_path (like the tests that were updated) or configure PipelineSettings directly in their test setup.

JSv4 added 8 commits February 18, 2026 21:15
Replace settings.DEFAULT_EMBEDDER with get_default_embedder_path() to
complete the migration to the database-backed PipelineSettings singleton.
Replace decorative @override_settings(DEFAULT_EMBEDDER=...) with @patch
targeting the actual code path through get_default_embedder_path().
Replace @override_settings(DEFAULT_EMBEDDER=...) with @patch targeting
the actual code path through get_default_embedder_path() in corpus_tasks.
Replace settings.DEFAULT_EMBEDDER value lookups with
get_default_embedder_path() to match what the vector store search code
now reads from PipelineSettings.
get_default_embedder_path() returns "" when unconfigured, and the
subsequent 'if not ... and default_embedder:' guard handles falsy
values correctly without the conversion to None.
Since get_default_embedder_path is imported inside function bodies (not
at module level), @patch must target the source module
opencontractserver.pipeline.utils, not the consuming module.
@claude
Copy link

claude bot commented Feb 19, 2026

PR Review: Migrate embedder configuration to PipelineSettings

Overall this is a clean, well-scoped migration with good test coverage. The approach is sound — removing Django settings as the runtime source of truth in favour of the database-backed singleton is the right direction. A few issues need addressing before merge.


Bug (Critical) — Sync ORM call inside an async method

File: config/websocket/consumers/unified_agent_conversation.py, line 488

async def _pick_document_embedder(self) -> str | None:
    ...
    paths = await database_sync_to_async(get_embedder_paths)()

    if paths:
        return paths[0]
    else:
        from opencontractserver.pipeline.utils import get_default_embedder_path
        return get_default_embedder_path()   # ← called raw inside async def

get_default_embedder_path() calls PipelineSettings.get_instance(), which executes transaction.atomic() and get_or_create() — synchronous ORM operations. Before this PR, the equivalent call was settings.DEFAULT_EMBEDDER, a plain dict lookup that was safely callable from any context. The new implementation will raise SynchronousOnlyOperation at runtime whenever the else branch is taken (i.e., a document has no existing structural-annotation embeddings).

Fix: wrap the call:

return await database_sync_to_async(get_default_embedder_path)()

Bug (Minor) — Stale log message

File: opencontractserver/pipeline/utils.py, line 367

logger.error("No default embedder specified in settings")

This message still says "in settings" after the config has been moved to PipelineSettings. Should be updated to something like "No default embedder configured in PipelineSettings".


Dead Code — DEFAULT_EMBEDDERS_BY_FILETYPE setting

Files: config/settings/base.py, config/settings/test.py

get_default_embedder_for_filetype() previously read from settings.DEFAULT_EMBEDDERS_BY_FILETYPE. It now delegates entirely to get_preferred_embedder(), which reads from PipelineSettings.preferred_embedders. That field is seeded from PREFERRED_EMBEDDERS (not DEFAULT_EMBEDDERS_BY_FILETYPE), so the DEFAULT_EMBEDDERS_BY_FILETYPE entries in both settings files are now unreachable dead configuration. Per the project's no-dead-code rule, these should be removed (or, if the intent is to migrate this data to PipelineSettings.preferred_embedders, the migrate_pipeline_settings command should handle it).


Incomplete test migration — settings.DEFAULT_EMBEDDER still used as an assertion value

Several test files that were not updated in this PR compare against settings.DEFAULT_EMBEDDER directly:

File Lines
test_embedder_management.py 39, 44, 736
test_dual_embeddings.py 90, 104, 128, 169, 305, 343, 440
test_pydantic_ai_agents.py 233
test_corpus_embeddings.py 73
test_semantic_search_graphql.py 69

These pass today because PipelineSettings.get_instance() bootstraps its default_embedder from settings.DEFAULT_EMBEDDER on first access — so the two values happen to be equal. The tests therefore don't actually verify that the production path reads from PipelineSettings; they verify a transitive equality. This creates a fragile coupling: if someone nulls out settings.DEFAULT_EMBEDDER in the test config (which would be consistent with full deprecation), several tests would silently start testing against an empty string.

The tests in test_embedder_management.py lines 36–44 are the sharpest example — test_preferred_embedder_auto_populated_from_default and test_created_with_embedder_set_on_creation should be updated to assert against a known value seeded into PipelineSettings directly, similar to the pattern already used in the updated test_auto_population_uses_current_default.


Documentation nits

  • config/graphql/queries.py lines 2197 and 2219 still reference DEFAULT_EMBEDDER by name in inline comments. Not blocking, but worth cleaning up for consistency.
  • opencontractserver/tests/test_backfill_default_embeddings.py docstring at line 4 still says "Tests the command that backfills DEFAULT_EMBEDDER embeddings" — same minor inconsistency.

What's good

  • The new get_default_embedder_path() helper is a clean single point of abstraction, and get_default_embedder() correctly delegates to it (DRY improvement).
  • The system check in checks.py correctly guards against the documents_pipelinesettings table not yet existing, preventing spurious errors during initial migrations.
  • get_default_embedder_path() or None in corpus_tasks.py correctly handles the empty-string-vs-None contract difference between the old and new APIs.
  • Test patches target the correct import location (where the symbol is used, not where it's defined).

JSv4 added 4 commits February 19, 2026 05:45
- Wrap get_default_embedder_path() in database_sync_to_async inside
  async _pick_document_embedder to prevent SynchronousOnlyOperation
- Update log message to reference PipelineSettings instead of settings
- Remove dead DEFAULT_EMBEDDERS_BY_FILETYPE from settings and docs
…_embedder_path()

Replace fragile transitive coupling through Django settings with direct
calls to the production API across 5 test files (13 occurrences).
@claude
Copy link

claude bot commented Feb 19, 2026

PR Review: Migrate embedder configuration to PipelineSettings

The direction of this change is sound — centralizing embedder configuration in the database-backed PipelineSettings singleton enables runtime reconfiguration, which is a clear improvement over requiring a server restart. The breadth of changes is well-coordinated and the test updates are systematic. A few issues worth addressing before merging:


Bug / Correctness

1. Hardcoded database table name in checks.py

In opencontractserver/corpuses/checks.py:

if "documents_pipelinesettings" not in tables:
    return errors

This string is derived from the app label + model name and will silently break (returning no warnings at all) if either ever changes. The function already uses introspection for the column check — the table guard should too:

from opencontractserver.documents.models import PipelineSettings
if PipelineSettings._meta.db_table not in tables:
    return errors

The import should be placed inside the existing try/except block where other guarded imports live.


2. Semantic drift in get_default_embedder_for_filetype()

The old implementation read from settings.DEFAULT_EMBEDDERS_BY_FILETYPE; the new one delegates to get_preferred_embedder() which reads PipelineSettings.preferred_embedders (initially populated from the different Django setting PREFERRED_EMBEDDERS). If any deployment configured DEFAULT_EMBEDDERS_BY_FILETYPE without also having PREFERRED_EMBEDDERS, callers of get_default_embedder_for_filetype() will now silently get None where they previously got a value.

The updated test in test_pipeline_utils.py mocks get_preferred_embedder so it verifies the delegation but not the actual resolution equivalence. A docstring note or CHANGELOG entry documenting this behavioral change would help operators upgrading.


Clarity / Potential Confusion

3. DEFAULT_EMBEDDER in Django settings still serves a purpose — it is not dead config

config/settings/test.py (and presumably base.py) still defines DEFAULT_EMBEDDER, which is correct — PipelineSettings.get_instance() uses it to bootstrap the singleton on first creation. However, after removing DEFAULT_EMBEDDERS_BY_FILETYPE without a comment, the remaining DEFAULT_EMBEDDER looks orphaned and a future cleanup pass might remove it incorrectly.

A short inline comment would prevent this:

# Bootstrap-only: used by PipelineSettings.get_instance() to seed the singleton
# on first DB creation. NOT read at runtime — configure via admin UI or
# updatePipelineSettings mutation after initial setup.
DEFAULT_EMBEDDER = "..."

4. Documentation still describes DEFAULT_EMBEDDER as a runtime setting

docs/architecture/embeddings_creation_and_retrieval.md lists DEFAULT_EMBEDDER under "Embedder settings are configured in config/settings/base.py" without noting that its role is now bootstrap-only. Worth a one-liner clarification to avoid operator confusion during upgrades.


Minor / Non-blocking

5. get_default_embedder_path() empty-string contract

The function correctly returns "" (not None) when unconfigured. Most callers handle this fine (or None, truthy checks). The queries.py path passes the result to get_default_embedder() which already handles empty strings via its own truthy check — that path is fine. Worth a quick scan to confirm no downstream code does a strict is not None check on the raw result.


6. Performance (already handled — FYI)

PipelineSettings.get_instance() already uses Django's cache framework with a 5-minute TTL and dual invalidation on save. The DB hit concern for high-throughput scenarios is mitigated. Noted only for completeness.


Summary

The hardcoded table name (#1) and the undocumented behavioral change in get_default_embedder_for_filetype() (#2) are the two items worth fixing before merge. Items 3-5 are lower priority. The test refactoring from @override_settings to @patch is the correct approach given that the configuration source is now the DB, and the lazy-import pattern in corpus_tasks.py and models.py ensures those patches work correctly at call time.

QuerySet.delete() bypasses model.delete() and doesn't invalidate the
PipelineSettings cache. During setUp(), user creation signals trigger
personal corpus creation which calls get_default_embedder_path(),
populating the cache with test.py defaults. The two @override_settings
tests then get stale cached values instead of creating fresh instances
with the overridden settings. Adding _invalidate_cache() after the
queryset delete ensures get_instance() hits the DB and picks up the
overridden Django settings.
@claude
Copy link

claude bot commented Feb 19, 2026

PR Review: Migrate embedder configuration to PipelineSettings

This is a clean, well-scoped migration. The overall approach is sound and the async fix in the WebSocket consumer is correct. A few issues worth addressing before merge:


Bugs / Issues

1. Incorrect patch target in test_personal_corpus.py for corpus_tasks tests

Several tests patching opencontractserver.pipeline.utils.get_default_embedder_path exercise ensure_embeddings_for_corpus (in corpus_tasks.py), where the import in the diff is inside the function body (like other local imports in that task):

# corpus_tasks.py – local import inside the task function
from opencontractserver.pipeline.utils import get_default_embedder_path

Because the import is local, re-lookup at call time means patching the source module attribute works here. This is fine — but it's worth verifying this is intentional, since the more idiomatic target would be opencontractserver.tasks.corpus_tasks.get_default_embedder_path if the import were ever hoisted to module level.

2. get_default_embedder_for_filetype() may silently return None on fresh installs

Before this PR, the function read from DEFAULT_EMBEDDERS_BY_FILETYPE in Django settings — always populated from base.py. After, it delegates entirely to get_preferred_embedder(), which reads from PipelineSettings.preferred_embedders. On a freshly installed system where the migration has run but preferred_embedders is empty, callers that previously always got a class back may now get None. Whether PipelineSettings.get_instance() seeds preferred_embedders from the old settings values on creation should be verified and documented. If it does, this is fine; if not, it's a regression.

3. Stale process-level cache in long-running Celery workers

PipelineSettings.get_instance() uses a class-level cache. A change made via the admin UI will be cached out in all running Celery workers until they restart. This is an inherent limitation of the pattern, but unlike the old settings-based approach it's easy to miss — an operator changing the default embedder via the admin will see it work for new web requests but embedding tasks will keep using the old value. This trade-off should be called out in the error/warning messages or documentation so operators know they need to restart workers after changing the embedder.


Minor Issues

4. queries.py: inline import is inside the if block

if document_pk or corpus_pk:
    from opencontractserver.pipeline.utils import get_default_embedder_path
    scoped_embedder_path = get_default_embedder_path()

The import is nested inside a conditional rather than at the top of the function. If the circular import concern applies, the import should at minimum be at function scope. Even better — check whether the circular import can be resolved by adjusting module organization so the import can live at the file top like the rest.

5. get_default_embedder_path() return type vs. actual behavior

The type annotation is -> str and the docstring says "or empty string if not configured", but the implementation returns whatever PipelineSettings.get_default_embedder() returns, which may be None. The immediate callsites handle both (if not x, x or None), but a None return violates the declared return type and could surprise future callers. Either guarantee an empty string via return ... or "" or change the annotation to -> Optional[str].

6. Documentation left partially stale

In docs/architecture/embeddings_creation_and_retrieval.md, DEFAULT_EMBEDDERS_BY_FILETYPE was removed but the DEFAULT_EMBEDDER bullet still reads as if it's a Django settings key:

- **`DEFAULT_EMBEDDER`**: Fallback embedder when no preferred embedder is found

This entry should be updated to reflect that the value is now managed via the PipelineSettings singleton rather than config/settings/base.py.


What's Done Well

  • The async correctness fix in the WebSocket consumer (database_sync_to_async(get_default_embedder_path)()) is exactly right per CLAUDE.md's guidance on async tool/ORM calls.
  • The _invalidate_cache() addition in test cleanup correctly handles the edge case where queryset.delete() bypasses Model.delete() — good defensive test hygiene.
  • The checks.py guard that returns early if documents_pipelinesettings table doesn't exist is correctly defensive for migrations-in-progress states.
  • Test patches are placed at the right module locations in all cases reviewed (backfill_default_embeddings, embeddings_task, pipeline/utils for inline-import callers).
  • Removing DEFAULT_EMBEDDERS_BY_FILETYPE as dead config is clean — no lingering references found.

Overall this is in good shape. The stale-cache / worker-restart concern (item 3) is the one I'd most want addressed before merge, even if just as a note in the warning message or docs.

@codecov
Copy link

codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 98.11321% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/corpuses/checks.py 75.00% 1 Missing ⚠️
opencontractserver/pipeline/utils.py 83.33% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments