
fix: poll for search-index visibility in three flaky query tests#7

Closed
goharanwar wants to merge 1 commit into main from fix/index-search-visibility-race


@goharanwar

Contributor

Summary

Three tests have been failing intermittently on staging:

  • tests/services/corpus/test_filter_attributes_types.py::TestFilterAttributeTypes::test_text_integer_boolean_filters
  • tests/services/indexing/test_document_lifecycle.py::TestDocumentLifecycle::test_index_query_delete_query_cycle
  • tests/services/query/test_query_filters.py::TestQueryFiltersCore::test_query_with_valid_metadata_filter

All three failed the same way: AssertionError: assert 0 > 0 on a len(search_results) > 0 check immediately after indexing.

Root cause

Each test uses this pattern after indexing:

wait_for(lambda: client.get_document(corpus_key, doc_id).success, ...)
query_resp = client.post("/v2/query", ...)
assert len(query_resp.data["search_results"]) > 0  # flaky

get_document returning 200 only confirms the document is stored — not that it is searchable. There is an eventual-consistency window between document storage and search-index visibility. When that window is longer than usual on staging, the first /v2/query returns zero results and the test fails. The product is behaving correctly; the synchronization signal is wrong.

test_document_lifecycle.py already uses the right pattern on the delete side (_krakatoa_gone polled via wait_for), but the index-and-query side does not.

Fix

Replace each immediate post-index query + assertion with a wait_for(...) poll that retries the query until it returns results (timeout 30s, interval 2s). All other content assertions (correct doc, correct fields, correct filtering) are preserved.
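As a rough illustration of the polling approach described above, here is a minimal self-contained sketch. `poll_until` and `fake_query` are hypothetical names used only for this example; the suite's own `wait_for` helper and the real `/v2/query` call may differ in signature and behaviour.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    probe: Callable[[], T],
    timeout: float = 30.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call probe() until it returns a truthy value, or raise TimeoutError.

    Hypothetical stand-in for the suite's wait_for helper.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"probe still falsy after {timeout}s")
        sleep(interval)


# Stand-in for a search index that lags behind document storage:
# the first two queries return no results, the third returns one hit.
calls = {"count": 0}


def fake_query() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        return {"search_results": []}
    return {"search_results": [{"document_id": "doc-1"}]}


# Poll on search visibility itself instead of on document storage.
# The no-op sleep only keeps this example fast; a real test would sleep.
hits = poll_until(
    lambda: fake_query()["search_results"],
    timeout=30,
    interval=2,
    sleep=lambda _seconds: None,
)
```

Returning the probe's value means the content assertions can run on the same results that satisfied the poll, without issuing another query.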

Test plan

  • Reproduction: the original tests passed when run in isolation, so the flake did not reproduce on demand. The failure mode (get_document success ≠ searchable) was instead confirmed from the test code and by direct curl reproduction.
  • Ran the three patched tests against staging (https://api.vectara.dev): 3 consecutive runs × 3 tests = 9/9 PASS.
  • Ran all 5 tests in the three modified test modules (incl. error-path tests): 5/5 PASS.
  • No production code changes — fix is contained to the api_test_suite tests.

🤖 Generated with Claude Code

…ests

The post-index queries in three tests asserted len(search_results) > 0
right after wait_for(get_document().success), but document storage and
search index visibility are eventually consistent on staging — get_document
returning 200 only proves the document is stored, not that it is searchable.
When the index lagged, the first /v2/query returned 0 results and the test
failed.

Replace the immediate query + assertion with a wait_for(...) poll that
retries until the query returns results (timeout 30s, interval 2s), mirroring
the existing _krakatoa_gone pattern already used on the delete side of the
lifecycle test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@goharanwar

Contributor Author

Closing in favour of #8, which now contains both the search-index visibility fix and the retry-policy/observability fix. Combined branch tested end-to-end on staging: 12/12 originally-failing tests pass, plus 87/87 on the broader core profile across agents/corpus/indexing/query services.

@goharanwar goharanwar closed this May 12, 2026
goharanwar added a commit that referenced this pull request May 14, 2026
…ility (#8)

Combined fix for two intermittent staging failures.

Bug 1 — search-index visibility race (3 tests): post-index queries asserted on len(search_results) > 0 immediately after wait_for(get_document().success). get_document returning 200 confirms storage, not search visibility. Replaced each immediate query with a wait_for poll (30s/2s).

Bug 2 — non-idempotent POST retries (2 tests): urllib3 retried POST on 5xx, producing 409 'already exists' with fresh UUIDs when the first attempt had committed server-side. Restricted retries to GET/HEAD/OPTIONS; added per-request X-Request-Id and retry_history on APIResponse plus a WARNING log when retries fire, so future surprises arrive with the retry trail attached.
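The retry policy in Bug 2 can be sketched roughly as below. This is not the code from the commit: `send_with_retry`, `ApiResult`, and the `transport` callable are hypothetical names, and the sketch uses only the standard library rather than urllib3 (where the equivalent restriction would normally be expressed through the `allowed_methods` option of `Retry`).

```python
import logging
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List

logger = logging.getLogger("api_client_sketch")

# Only methods that are safe to repeat are retried on 5xx responses.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


@dataclass
class ApiResult:
    """Hypothetical response wrapper carrying the retry trail."""
    status: int
    request_id: str
    retry_history: List[int] = field(default_factory=list)


def send_with_retry(
    method: str,
    url: str,
    transport: Callable[[str, str, Dict[str, str]], int],
    max_retries: int = 2,
) -> ApiResult:
    """Send a request, retrying 5xx responses only for idempotent methods.

    transport(method, url, headers) is assumed to return an HTTP status code.
    """
    request_id = str(uuid.uuid4())
    headers = {"X-Request-Id": request_id}
    history: List[int] = []
    retries_allowed = max_retries if method.upper() in IDEMPOTENT_METHODS else 0

    for attempt in range(retries_allowed + 1):
        status = transport(method, url, headers)
        if status < 500 or attempt == retries_allowed:
            break
        # Record the failed attempt and log it so the trail is visible later.
        history.append(status)
        logger.warning(
            "retrying %s %s after %d (X-Request-Id=%s)",
            method, url, status, request_id,
        )
    return ApiResult(status=status, request_id=request_id, retry_history=history)
```

With this shape a POST that receives a 5xx is surfaced to the caller after a single attempt, so a request that may already have committed server-side is never resent automatically.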

Codex-reviewed at high effort. End-to-end verified on staging: 12/12 originally-failing tests pass, 87/87 on the broader core profile across agents/corpus/indexing/query services.

Closes #7.