fix: search-index visibility race + non-idempotent retries + observability by goharanwar · Pull Request #8 · vectara/api_test_suite

goharanwar · 2026-05-05T10:35:59Z

Summary

Combined fix for two intermittent staging failures on different bug classes. Originally split across #7 and #8; consolidated here per request.

Bug 1 — search-index visibility race (3 tests)

Failing tests:

tests/services/corpus/test_filter_attributes_types.py::TestFilterAttributeTypes::test_text_integer_boolean_filters
tests/services/indexing/test_document_lifecycle.py::TestDocumentLifecycle::test_index_query_delete_query_cycle
tests/services/query/test_query_filters.py::TestQueryFiltersCore::test_query_with_valid_metadata_filter

Each indexed a doc, waited for get_document(...).success, then immediately asserted len(search_results) > 0 on /v2/query. get_document returning 200 only confirms the doc is stored, not yet searchable — there's an eventual-consistency window between document storage and search-index visibility. When that lagged on staging, the first /v2/query returned 0 results and the test failed.

Fix: wrap each post-index query in wait_for(...) with a 30s/2s poll, mirroring the existing _krakatoa_gone pattern already used on the delete side of test_document_lifecycle.py.

Bug 2 — non-idempotent POST retries (2 tests)

Failing tests:

tests/services/agents/test_compaction.py::TestCompactionConfig::test_update_agent_compaction_config
tests/services/agents/test_compaction.py::TestManualCompaction::test_manual_compaction_on_session

Both failed with 409 \"Agent with key '<fresh-uuid>' already exists\" — impossible by random collision. Root cause: utils/client.py had allowed_methods=[\"GET\", \"POST\", \"PUT\", \"DELETE\", \"PATCH\"]. When a transient 5xx (or a dropped response on the way back from the gateway) followed a successful server-side commit, urllib3 silently retried the POST. The second attempt hit the now-existing resource and got 409.

Confirmed against Agents.java:211: 409 "already exists" is only returned when a non-soft-deleted row already exists for that key. With fresh UUIDs, only duplicate submission could produce this.

Fix:

Restrict urllib3 Retry to safe methods only: allowed_methods = frozenset({\"GET\", \"HEAD\", \"OPTIONS\"}). POST/PUT/PATCH/DELETE no longer auto-retry on 5xx/429.
Generate X-Request-Id per request, surface as APIResponse.request_id (visible in failure logs and correlates with staging's x-request-id echo header).
Surface urllib3's Retry.history as APIResponse.retry_history; log a WARNING when a request actually retried, including method/URL/request_id/final_status/per-attempt history. A future surprising response will arrive with the retry trail attached instead of looking like a mysterious 409.

Codex reviewed this approach with high effort and approved Option A (retry only safe methods + observability). It explicitly rejected the workaround of treating 409 as success in create_agent, which would mask real isolation bugs.

Why combined

Both fix flakes in the same suite. Reviewing as one diff is faster than two PRs across files that don't conflict, per request.

Test plan

Local stub HTTP server (http.server + custom handler), verifying retry behaviour without staging dependency:
- GET that returns 503/503/200 → 3 server hits, retry_history length 2, final status 200, WARNING fires.
- POST that returns 503 → 1 server hit, retry_history empty, status 503 returned to caller. The actual fix.
- X-Request-Id is generated and propagated; matches APIResponse.request_id.
Originally-failing tests against staging (https://api.vectara.dev): 12/12 PASS in 2m34s.
Core profile across agents/corpus/indexing/query services: 87 passed, 1 skipped, 0 failed in 15m30s. Skip is test_custom_dimensions.py (pre-existing skip-marker, unrelated).

Notes

Closes fix: poll for search-index visibility in three flaky query tests #7 (search-index visibility race fix is included here).
Connection-error retries are not explicitly disabled. urllib3 treats those as "request not sent" and they remain bounded by total=max_retries. The bug at hand was status-code retries on POST, which allowed_methods fully addresses.
Tests that previously self-healed on transient 5xx for POST/PUT/PATCH/DELETE will now surface the underlying transient as an error. That's the correct behaviour — those silent retries were creating duplicate rows which caused these flakes. If real transient 5xxs become a problem, callers can add an explicit retry at the test level with idempotency baked in.

🤖 Generated with Claude Code

… retry observability Two agent-creation tests on staging have been failing intermittently with 409 "Agent with key '<fresh-uuid>' already exists" — impossible by random collision. Root cause: the shared HTTP client retries POST/PUT/PATCH/DELETE on 5xx/429. When a transient 5xx (or a dropped response on the way back) followed a successful server-side commit, urllib3 silently retried the POST and the second attempt hit the now-existing resource and got 409. Restrict urllib3 Retry to safe, idempotent methods only: GET / HEAD / OPTIONS. POST/PUT/PATCH/DELETE no longer auto-retry on 5xx — tests that previously self-healed by accident will now surface the real issue, which is the correct behaviour. To make future incidents diagnosable rather than mysterious 409s: - Generate an X-Request-Id (uuid hex) on every request and surface it on APIResponse.request_id so failed-test logs can be correlated with API server / load-balancer logs. - Capture urllib3's Retry.history on each response and surface it on APIResponse.retry_history. When a request actually retried, log a WARNING with method, url, request_id, final status, and the per-attempt history. A future "impossible 409" will arrive with the retry trail attached instead of looking like a UUID collision. Verified locally with a stub HTTP server: GET 503 retries to 200 (3 server hits, retry_history populated), POST 503 returns 503 immediately (1 server hit). Then ran the core profile against staging across agents/corpus/ indexing/query — 87 passed, 0 failed, 1 unrelated skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ests The post-index queries in three tests asserted len(search_results) > 0 right after wait_for(get_document().success), but document storage and search index visibility are eventually consistent on staging — get_document returning 200 only proves the document is stored, not that it is searchable. When the index lagged, the first /v2/query returned 0 results and the test failed. Replace the immediate query + assertion with a wait_for(...) poll that retries until the query returns results (timeout 30s, interval 2s), mirroring the existing _krakatoa_gone pattern already used on the delete side of the lifecycle test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

goharanwar and others added 3 commits May 5, 2026 15:35

Apply code formatting (black + isort)

6db219c

goharanwar changed the title ~~fix(client): stop retrying non-idempotent methods + add retry observability~~ fix: search-index visibility race + non-idempotent retries + observability May 12, 2026

goharanwar mentioned this pull request May 12, 2026

fix: poll for search-index visibility in three flaky query tests #7

Closed

4 tasks

goharanwar merged commit 142ce1f into main May 14, 2026
3 checks passed

goharanwar deleted the fix/no-retry-mutating-methods branch May 14, 2026 02:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: search-index visibility race + non-idempotent retries + observability#8

fix: search-index visibility race + non-idempotent retries + observability#8
goharanwar merged 3 commits into
mainfrom
fix/no-retry-mutating-methods

goharanwar commented May 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

goharanwar commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug 1 — search-index visibility race (3 tests)

Bug 2 — non-idempotent POST retries (2 tests)

Why combined

Test plan

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

goharanwar commented May 5, 2026 •

edited

Loading