Skip to content

fix: search-index visibility race + non-idempotent retries + observability#8

Merged
goharanwar merged 3 commits into
mainfrom
fix/no-retry-mutating-methods
May 14, 2026
Merged

fix: search-index visibility race + non-idempotent retries + observability#8
goharanwar merged 3 commits into
mainfrom
fix/no-retry-mutating-methods

Conversation

@goharanwar

@goharanwar goharanwar commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Combined fix for two intermittent staging failures on different bug classes. Originally split across #7 and #8; consolidated here per request.

Bug 1 — search-index visibility race (3 tests)

Failing tests:

  • tests/services/corpus/test_filter_attributes_types.py::TestFilterAttributeTypes::test_text_integer_boolean_filters
  • tests/services/indexing/test_document_lifecycle.py::TestDocumentLifecycle::test_index_query_delete_query_cycle
  • tests/services/query/test_query_filters.py::TestQueryFiltersCore::test_query_with_valid_metadata_filter

Each indexed a doc, waited for get_document(...).success, then immediately asserted len(search_results) > 0 on /v2/query. get_document returning 200 only confirms the doc is stored, not yet searchable — there's an eventual-consistency window between document storage and search-index visibility. When that lagged on staging, the first /v2/query returned 0 results and the test failed.

Fix: wrap each post-index query in wait_for(...) with a 30s/2s poll, mirroring the existing _krakatoa_gone pattern already used on the delete side of test_document_lifecycle.py.

Bug 2 — non-idempotent POST retries (2 tests)

Failing tests:

  • tests/services/agents/test_compaction.py::TestCompactionConfig::test_update_agent_compaction_config
  • tests/services/agents/test_compaction.py::TestManualCompaction::test_manual_compaction_on_session

Both failed with 409 \"Agent with key '<fresh-uuid>' already exists\" — impossible by random collision. Root cause: utils/client.py had allowed_methods=[\"GET\", \"POST\", \"PUT\", \"DELETE\", \"PATCH\"]. When a transient 5xx (or a dropped response on the way back from the gateway) followed a successful server-side commit, urllib3 silently retried the POST. The second attempt hit the now-existing resource and got 409.

Confirmed against Agents.java:211: 409 "already exists" is only returned when a non-soft-deleted row already exists for that key. With fresh UUIDs, only duplicate submission could produce this.

Fix:

  • Restrict urllib3 Retry to safe methods only: allowed_methods = frozenset({\"GET\", \"HEAD\", \"OPTIONS\"}). POST/PUT/PATCH/DELETE no longer auto-retry on 5xx/429.
  • Generate X-Request-Id per request, surface as APIResponse.request_id (visible in failure logs and correlates with staging's x-request-id echo header).
  • Surface urllib3's Retry.history as APIResponse.retry_history; log a WARNING when a request actually retried, including method/URL/request_id/final_status/per-attempt history. A future surprising response will arrive with the retry trail attached instead of looking like a mysterious 409.

Codex reviewed this approach with high effort and approved Option A (retry only safe methods + observability). It explicitly rejected the workaround of treating 409 as success in create_agent, which would mask real isolation bugs.

Why combined

Both fix flakes in the same suite. Reviewing as one diff is faster than two PRs across files that don't conflict, per request.

Test plan

  • Local stub HTTP server (http.server + custom handler), verifying retry behaviour without staging dependency:
    • GET that returns 503/503/200 → 3 server hits, retry_history length 2, final status 200, WARNING fires.
    • POST that returns 503 → 1 server hit, retry_history empty, status 503 returned to caller. The actual fix.
    • X-Request-Id is generated and propagated; matches APIResponse.request_id.
  • Originally-failing tests against staging (https://api.vectara.dev): 12/12 PASS in 2m34s.
  • Core profile across agents/corpus/indexing/query services: 87 passed, 1 skipped, 0 failed in 15m30s. Skip is test_custom_dimensions.py (pre-existing skip-marker, unrelated).

Notes

  • Closes fix: poll for search-index visibility in three flaky query tests #7 (search-index visibility race fix is included here).
  • Connection-error retries are not explicitly disabled. urllib3 treats those as "request not sent" and they remain bounded by total=max_retries. The bug at hand was status-code retries on POST, which allowed_methods fully addresses.
  • Tests that previously self-healed on transient 5xx for POST/PUT/PATCH/DELETE will now surface the underlying transient as an error. That's the correct behaviour — those silent retries were creating duplicate rows which caused these flakes. If real transient 5xxs become a problem, callers can add an explicit retry at the test level with idempotency baked in.

🤖 Generated with Claude Code

goharanwar and others added 3 commits May 5, 2026 15:35
… retry observability

Two agent-creation tests on staging have been failing intermittently with
409 "Agent with key '<fresh-uuid>' already exists" — impossible by random
collision. Root cause: the shared HTTP client retries POST/PUT/PATCH/DELETE
on 5xx/429. When a transient 5xx (or a dropped response on the way back)
followed a successful server-side commit, urllib3 silently retried the POST
and the second attempt hit the now-existing resource and got 409.

Restrict urllib3 Retry to safe, idempotent methods only:
GET / HEAD / OPTIONS. POST/PUT/PATCH/DELETE no longer auto-retry on 5xx —
tests that previously self-healed by accident will now surface the real
issue, which is the correct behaviour.

To make future incidents diagnosable rather than mysterious 409s:
  - Generate an X-Request-Id (uuid hex) on every request and surface it on
    APIResponse.request_id so failed-test logs can be correlated with API
    server / load-balancer logs.
  - Capture urllib3's Retry.history on each response and surface it on
    APIResponse.retry_history. When a request actually retried, log a
    WARNING with method, url, request_id, final status, and the per-attempt
    history. A future "impossible 409" will arrive with the retry trail
    attached instead of looking like a UUID collision.

Verified locally with a stub HTTP server: GET 503 retries to 200 (3 server
hits, retry_history populated), POST 503 returns 503 immediately (1 server
hit). Then ran the core profile against staging across agents/corpus/
indexing/query — 87 passed, 0 failed, 1 unrelated skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests

The post-index queries in three tests asserted len(search_results) > 0
right after wait_for(get_document().success), but document storage and
search index visibility are eventually consistent on staging — get_document
returning 200 only proves the document is stored, not that it is searchable.
When the index lagged, the first /v2/query returned 0 results and the test
failed.

Replace the immediate query + assertion with a wait_for(...) poll that
retries until the query returns results (timeout 30s, interval 2s), mirroring
the existing _krakatoa_gone pattern already used on the delete side of the
lifecycle test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@goharanwar goharanwar changed the title fix(client): stop retrying non-idempotent methods + add retry observability fix: search-index visibility race + non-idempotent retries + observability May 12, 2026
@goharanwar goharanwar merged commit 142ce1f into main May 14, 2026
3 checks passed
@goharanwar goharanwar deleted the fix/no-retry-mutating-methods branch May 14, 2026 02:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant