feat(semantic): add exponential backoff retry for LLM rate limiting#889

Closed
mvanhorn wants to merge 1 commit into volcengine:main from mvanhorn:osc/350-feat-semantic-processor-llm-backoff
Conversation

@mvanhorn
Contributor

Description

Add retry with exponential backoff for LLM calls in the semantic processor. When rate-limited (429/TooManyRequests/RequestBurstTooFast), calls now retry up to 3 times with increasing delays (0.5s, 1s, 2s + jitter) instead of failing permanently.

Related Issue

Relates to #350

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added _llm_with_retry() helper to SemanticProcessor class
  • Replaced 4 bare vlm.get_completion_async() calls with retrying wrapper
  • Added import random for jitter calculation

Why this matters

Users who batch-index large directories hit RequestBurstTooFast 429 errors from LLM providers during summarization. @sponge225 reported in #350 that 281 Markdown files (4,435 sections) consistently trigger rate limiting with Doubao 2.0, causing partial indexing failures.

The embedding path already has exponential_backoff_retry in volcengine_embedders.py, but the LLM summarization path in semantic_processor.py had no retry logic at all. A prior attempt to add it (PR #568) was closed because the contributor didn't sign the CLA, not because of technical rejection.

The retry helper detects rate limit errors by checking for "429", "TooManyRequests", "RateLimit", and "RequestBurstTooFast" in the exception string, matching the pattern used in volcengine_embedders.py:is_429_error(). On persistent failure, it returns an empty string for graceful degradation (the file gets indexed without a summary rather than crashing the pipeline).
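As a rough illustration of the behavior described above (not the actual diff), the helper might look like the sketch below. The function and constant names here are hypothetical; the real implementation lives in SemanticProcessor and wraps vlm.get_completion_async():

```python
import asyncio
import random

# Substrings matched against str(exc), mirroring volcengine_embedders.py:is_429_error()
RETRYABLE_MARKERS = ("429", "TooManyRequests", "RateLimit", "RequestBurstTooFast")


def _is_rate_limit_error(exc: Exception) -> bool:
    """String-based detection of provider rate-limit errors."""
    text = str(exc)
    return any(marker in text for marker in RETRYABLE_MARKERS)


async def llm_with_retry(call, *args, max_retries: int = 3, base_delay: float = 0.5):
    """Retry an async LLM call on rate-limit errors with exponential backoff + jitter.

    Delays are base_delay * 2**attempt (0.5s, 1s, 2s by default) plus a small
    random jitter. On persistent rate limiting, returns "" so the caller can
    index the file without a summary instead of crashing the pipeline.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call(*args)
        except Exception as exc:
            if not _is_rate_limit_error(exc):
                raise  # non-rate-limit errors propagate unchanged
            if attempt == max_retries:
                return ""  # graceful degradation after exhausting retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

Non-retryable exceptions still propagate, so only rate limiting is absorbed.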

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Post-build dogfooding skipped (score 6/10). Tested via code review and ruff linting only. Full OpenViking install requires system dependencies not available in this environment.

This contribution was developed with AI assistance (Claude Code).

Wrap LLM calls in semantic_processor.py with retry logic that handles
429/TooManyRequests/RequestBurstTooFast errors with exponential backoff
and jitter. Previously, a single rate limit error during batch ingestion
caused permanent failure. Now retries up to 3 times with delays of
0.5s, 1s, 2s before giving up gracefully.

Addresses the ingestion pain reported in volcengine#350 where 4435 sections
trigger RequestBurstTooFast with Doubao 2.0.

Relates to volcengine#350

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Failed to generate code suggestions for PR

@qin-ctx
Collaborator

qin-ctx commented Mar 24, 2026

Thanks for the PR. The problem it targets is real, and the retry behavior here is useful, especially for handling rate-limited semantic indexing workloads.

We’re going to close this PR for now, not because the approach is wrong, but because we want to address retry at a lower level instead of adding more business-layer wrappers like _llm_with_retry() in individual modules.

Our follow-up plan is to unify retry handling across both VLM and embedding paths:

  • implement retry policy in the provider/backend layer
  • allow callers to opt in with explicit retry parameters, while falling back to config defaults when not provided
  • add config-driven retry settings for things like max_retries, backoff, jitter, and retryable error classes
  • make sure all relevant call paths use the same mechanism consistently

That said, we do want to borrow from this PR’s design, especially:

  • detection of rate-limit errors such as 429, TooManyRequests, RateLimit, and RequestBurstTooFast
  • exponential backoff with jitter
  • graceful degradation on persistent failure instead of taking down the whole pipeline
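For concreteness, the config-driven direction sketched in the plan above could take a shape like this. Everything here is hypothetical (the names RetryPolicy and resolve_policy are not from the codebase); it only illustrates "explicit caller parameters win, config defaults fill the gaps":

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Retry settings shared by VLM and embedding backends."""
    max_retries: int = 3
    base_delay: float = 0.5
    jitter: float = 0.1
    retryable_markers: tuple = (
        "429", "TooManyRequests", "RateLimit", "RequestBurstTooFast",
    )


# In practice this would be loaded from the project config file.
DEFAULT_POLICY = RetryPolicy()


def resolve_policy(max_retries=None, base_delay=None,
                   config: RetryPolicy = DEFAULT_POLICY) -> RetryPolicy:
    """Merge explicit caller overrides with config defaults.

    A caller that passes max_retries=5 gets 5; one that passes nothing
    inherits whatever the config specifies.
    """
    return RetryPolicy(
        max_retries=max_retries if max_retries is not None else config.max_retries,
        base_delay=base_delay if base_delay is not None else config.base_delay,
        jitter=config.jitter,
        retryable_markers=config.retryable_markers,
    )
```

The provider/backend layer would then consult the resolved policy inside a single shared retry loop, so individual modules never need their own `_llm_with_retry()` wrappers.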

Thanks again for the contribution. We’ll use this as input for the broader retry refactor.
