Skip to content

fix: resolve review_code chroma init race#162

Closed
gdwrd wants to merge 2 commits intoarm:mainfrom
gdwrd:fix-review-code-chroma-init-race
Closed

fix: resolve review_code chroma init race#162
gdwrd wants to merge 2 commits intoarm:mainfrom
gdwrd:fix-review-code-chroma-init-race

Conversation

@gdwrd
Copy link

@gdwrd gdwrd commented Mar 5, 2026

Summary

This PR fixes a race condition during review_code query-engine initialization, especially with Chroma in non-interactive flows after indexing.
The issue could trigger initialization conflicts (including default_tenant connection errors) under parallel review workers.

Root Cause

MetisEngine._init_and_get_query_engines() and vector-store init/close paths were not serialized, so concurrent workers could race through
backend init and teardown.

Changes

  • Added engine-level synchronization:
    • MetisEngine now uses a lock (_qe_init_lock) around query-engine init and close().
    • Ensures backend init/query-engine creation happens once per lifecycle.
  • Hardened Chroma backend lifecycle:
    • Added lock-protected init()/close().
    • Added explicit state reset helper and client shutdown helper.
    • On init failure, stops partially created client, clears state, and raises VectorStoreInitError with exception chaining.
  • Hardened PGVector backend lifecycle:
    • Added lock-protected init()/close().
    • Centralized cleanup to close stores, dispose engines, and clear storage contexts.
    • On partial init failure, performs cleanup and raises VectorStoreInitError with exception chaining.
  • Expanded test coverage:
    • Thread-safe single-init behavior for engine/chroma/pgvector.
    • Retryability after init failures.
    • Correct cache clearing on close (including backend close failure path).
    • Repro/guard test for index -> non-interactive review_code race.

Validation

Executed:

UV_CACHE_DIR=$PWD/.uv-cache uv run pytest tests/test_engine_chroma.py tests/test_engine_core.py tests/test_engine_review.py tests/
test_pg_backend_mocked_unit.py

Result:

  • 23 passed, 1 skipped in 3.54s
  • Skipped test is Postgres-gated (--postgres not enabled).

with self._init_lock:
client = self._client
self._clear_state()
self._stop_client(client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move this outside of the lock as _clear_state() has already setself._client = None and self._initialized = False and it's now safe for another concurrent init()

raise VectorStoreInitError()
except Exception as e:
logger.error(f"Error initializing PGVectorStore: {e}")
self._cleanup_vector_components()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly if we split this function, we can clear state under lock and do the slower engine disposal outside of it

@gdwrd gdwrd force-pushed the fix-review-code-chroma-init-race branch from 4a0daeb to c76c8c5 Compare March 6, 2026 18:35
@mpekatsoula
Copy link
Contributor

Closing in favor of: #171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants