fix: search-index visibility race + non-idempotent retries + observability#8
Merged
Merged
Conversation
… retry observability
Two agent-creation tests on staging have been failing intermittently with
409 "Agent with key '<fresh-uuid>' already exists" — impossible by random
collision. Root cause: the shared HTTP client retries POST/PUT/PATCH/DELETE
on 5xx/429. When a transient 5xx (or a dropped response on the way back)
followed a successful server-side commit, urllib3 silently retried the POST
and the second attempt hit the now-existing resource and got 409.
Restrict urllib3 Retry to safe, idempotent methods only:
GET / HEAD / OPTIONS. POST/PUT/PATCH/DELETE no longer auto-retry on 5xx —
tests that previously self-healed by accident will now surface the real
issue, which is the correct behaviour.
To make future incidents diagnosable rather than mysterious 409s:
- Generate an X-Request-Id (uuid hex) on every request and surface it on
APIResponse.request_id so failed-test logs can be correlated with API
server / load-balancer logs.
- Capture urllib3's Retry.history on each response and surface it on
APIResponse.retry_history. When a request actually retried, log a
WARNING with method, url, request_id, final status, and the per-attempt
history. A future "impossible 409" will arrive with the retry trail
attached instead of looking like a UUID collision.
Verified locally with a stub HTTP server: GET 503 retries to 200 (3 server
hits, retry_history populated), POST 503 returns 503 immediately (1 server
hit). Then ran the core profile against staging across agents/corpus/
indexing/query — 87 passed, 0 failed, 1 unrelated skip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests The post-index queries in three tests asserted len(search_results) > 0 right after wait_for(get_document().success), but document storage and search index visibility are eventually consistent on staging — get_document returning 200 only proves the document is stored, not that it is searchable. When the index lagged, the first /v2/query returned 0 results and the test failed. Replace the immediate query + assertion with a wait_for(...) poll that retries until the query returns results (timeout 30s, interval 2s), mirroring the existing _krakatoa_gone pattern already used on the delete side of the lifecycle test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Combined fix for two intermittent staging failures on different bug classes. Originally split across #7 and #8; consolidated here per request.
Bug 1 — search-index visibility race (3 tests)
Failing tests:
tests/services/corpus/test_filter_attributes_types.py::TestFilterAttributeTypes::test_text_integer_boolean_filterstests/services/indexing/test_document_lifecycle.py::TestDocumentLifecycle::test_index_query_delete_query_cycletests/services/query/test_query_filters.py::TestQueryFiltersCore::test_query_with_valid_metadata_filterEach indexed a doc, waited for
get_document(...).success, then immediately assertedlen(search_results) > 0on/v2/query.get_documentreturning 200 only confirms the doc is stored, not yet searchable — there's an eventual-consistency window between document storage and search-index visibility. When that lagged on staging, the first/v2/queryreturned 0 results and the test failed.Fix: wrap each post-index query in
wait_for(...)with a 30s/2s poll, mirroring the existing_krakatoa_gonepattern already used on the delete side oftest_document_lifecycle.py.Bug 2 — non-idempotent POST retries (2 tests)
Failing tests:
tests/services/agents/test_compaction.py::TestCompactionConfig::test_update_agent_compaction_configtests/services/agents/test_compaction.py::TestManualCompaction::test_manual_compaction_on_sessionBoth failed with
409 \"Agent with key '<fresh-uuid>' already exists\"— impossible by random collision. Root cause:utils/client.pyhadallowed_methods=[\"GET\", \"POST\", \"PUT\", \"DELETE\", \"PATCH\"]. When a transient 5xx (or a dropped response on the way back from the gateway) followed a successful server-side commit, urllib3 silently retried the POST. The second attempt hit the now-existing resource and got 409.Confirmed against
Agents.java:211: 409 "already exists" is only returned when a non-soft-deleted row already exists for that key. With fresh UUIDs, only duplicate submission could produce this.Fix:
allowed_methods = frozenset({\"GET\", \"HEAD\", \"OPTIONS\"}). POST/PUT/PATCH/DELETE no longer auto-retry on 5xx/429.X-Request-Idper request, surface asAPIResponse.request_id(visible in failure logs and correlates with staging'sx-request-idecho header).Retry.historyasAPIResponse.retry_history; log aWARNINGwhen a request actually retried, including method/URL/request_id/final_status/per-attempt history. A future surprising response will arrive with the retry trail attached instead of looking like a mysterious 409.Codex reviewed this approach with high effort and approved Option A (retry only safe methods + observability). It explicitly rejected the workaround of treating 409 as success in
create_agent, which would mask real isolation bugs.Why combined
Both fix flakes in the same suite. Reviewing as one diff is faster than two PRs across files that don't conflict, per request.
Test plan
http.server+ custom handler), verifying retry behaviour without staging dependency:retry_historylength 2, final status 200, WARNING fires.retry_historyempty, status 503 returned to caller. The actual fix.X-Request-Idis generated and propagated; matchesAPIResponse.request_id.https://api.vectara.dev): 12/12 PASS in 2m34s.agents/corpus/indexing/queryservices: 87 passed, 1 skipped, 0 failed in 15m30s. Skip istest_custom_dimensions.py(pre-existing skip-marker, unrelated).Notes
total=max_retries. The bug at hand was status-code retries on POST, whichallowed_methodsfully addresses.🤖 Generated with Claude Code