[Feature]: Batch multiple file summaries per VLM call to reduce RPM pressure #907
Description
Problem
The semantic processor currently makes one VLM call per file in `_generate_single_file_summary()`. For a resource reindex of ~500 directories with ~7,700 nodes, this means ~7,700 separate LLM requests.
Many LLM providers are moving toward per-request quota models (e.g. Alibaba Coding Plan, OpenAdapter.dev) rather than pure token-based billing. Under these models, each API call consumes one unit of quota regardless of token count. This makes the current 1-file-per-request pattern extremely wasteful — a 100-token summary and a 10,000-token summary both cost the same quota unit.
Combined with per-minute rate limits (RPM), this creates a severe bottleneck. A reindex that could complete in minutes takes hours as requests queue behind RPM walls, generating thousands of 429 errors and cascading cooldowns.
Proposed Solution
Bundle multiple file summaries into a single VLM call using structured JSON output (response_format: { "type": "json_object" }).
Instead of:

```
Request 1: "Summarize file A" → "Summary A"
Request 2: "Summarize file B" → "Summary B"
Request 3: "Summarize file C" → "Summary C"
```

Batch them:

```
Request 1: "Summarize these 10 files, return JSON" → {
  "summaries": [
    {"file": "A", "summary": "..."},
    {"file": "B", "summary": "..."},
    {"file": "C", "summary": "..."}
  ]
}
```
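As a rough illustration of the request/response shape, here is a minimal sketch of building one batched prompt and mapping the JSON reply back to files. The function names (`build_batch_prompt`, `parse_batch_response`) and the prompt wording are hypothetical, not part of the existing codebase:

```python
import json


def build_batch_prompt(files: dict) -> str:
    """Build one prompt asking the model to summarize several files at once.

    `files` maps file path -> file content. Illustrative only; the real
    prompt template would live alongside the existing single-file one.
    """
    parts = [
        "Summarize each of the following files. Return JSON of the form "
        '{"summaries": [{"file": "<path>", "summary": "<text>"}]}.'
    ]
    for path, content in files.items():
        parts.append(f"--- FILE: {path} ---\n{content}")
    return "\n\n".join(parts)


def parse_batch_response(raw: str) -> dict:
    """Map the model's JSON response back to per-file summaries."""
    data = json.loads(raw)
    return {item["file"]: item["summary"] for item in data["summaries"]}
```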
Implementation sketch
- Add a `_generate_batch_file_summaries()` method alongside the existing single-file path
- Group pending file nodes into batches (configurable batch size, e.g. 5–20 files depending on average content size)
- Use `response_format: { "type": "json_object" }` for broad provider compatibility (alternatively `json_schema` where supported, for stricter enforcement)
- Parse the JSON response and map summaries back to individual files
- Fall back to single-file mode if the batch call fails or if a file exceeds a size threshold
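The batching-with-fallback flow could look roughly like this. Everything here is a sketch under assumptions: `call_vlm` stands in for the actual batched VLM request, `single_file_fallback` for the existing per-file path, and the JSON shape matches the example above:

```python
import json


def generate_batch_file_summaries(files, call_vlm, single_file_fallback, batch_size=10):
    """Sketch: summarize files in batches of `batch_size`, falling back to the
    single-file path when a batch response is malformed or a file is missing.

    files: dict of path -> content
    call_vlm: callable(batch_dict) -> raw JSON string (one request per batch)
    single_file_fallback: callable(path, content) -> summary string
    """
    summaries = {}
    items = list(files.items())
    for i in range(0, len(items), batch_size):
        batch = dict(items[i : i + batch_size])
        try:
            raw = call_vlm(batch)  # one request covers the whole batch
            data = json.loads(raw)
            got = {e["file"]: e["summary"] for e in data["summaries"]}
            # Any file the model skipped falls back to the single-file path.
            for path in batch:
                summaries[path] = got.get(path) or single_file_fallback(path, batch[path])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Whole batch failed to parse: retry each file individually.
            for path, content in batch.items():
                summaries[path] = single_file_fallback(path, content)
    return summaries
```

The fallback keeps the batch path strictly additive: a malformed batch response degrades to today's behavior rather than losing summaries.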
Batch size heuristics
- Modern context windows range from 200K to 1M+ tokens — even conservative batching of 10 files per request is well within limits
- Batch size could be dynamically calculated based on total input token estimate vs. model context limit (e.g. fill to 50% of context window)
- Files exceeding a configurable threshold (e.g. 50K tokens) should be processed individually
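The heuristics above could be combined into a simple greedy packer. The ~4-characters-per-token estimate is a stated assumption (a real tokenizer would be more accurate), and the defaults mirror the numbers in the bullets:

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def plan_batches(files, context_limit=200_000, fill_ratio=0.5, oversize_tokens=50_000):
    """Greedily pack file paths into batches, filling each batch up to
    fill_ratio of the model's context window. Files over oversize_tokens
    become single-file batches (processed individually)."""
    budget = int(context_limit * fill_ratio)
    batches, current, used = [], [], 0
    for path, content in files.items():
        tokens = estimate_tokens(content)
        if tokens > oversize_tokens:
            batches.append([path])  # oversized: keep the single-file path
            continue
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        batches.append(current)
    return batches
```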
Impact
- 5–20x reduction in total VLM requests during semantic processing
- Dramatically reduces 429 errors and cooldown cycles on RPM-limited providers
- Under per-request quota models, 10 files batched = 10x quota savings
- Reindex operations that currently take hours could complete in minutes
- Overview generation (`_batched_generate_overview`) already uses a batching pattern — this extends the same approach to file-level summarization
Note on structured output modes
- `json_object` — guarantees valid JSON, no schema enforcement. Widely supported across providers (OpenAI, Anthropic, Gemini, etc. via LiteLLM).
- `json_schema` — enforces a specific JSON schema with field-level validation. More reliable but narrower provider support. Could be used where available, with `json_object` as fallback.
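A small helper could pick the request kwargs per provider capability. The OpenAI-style `response_format` shapes below are real API shapes; the capability flag and schema name (`file_summaries`) are assumptions for illustration:

```python
def structured_output_kwargs(supports_json_schema: bool) -> dict:
    """Prefer strict json_schema where the provider supports it,
    otherwise fall back to plain json_object mode."""
    if supports_json_schema:
        return {
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "file_summaries",  # hypothetical schema name
                    "schema": {
                        "type": "object",
                        "properties": {
                            "summaries": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        "file": {"type": "string"},
                                        "summary": {"type": "string"},
                                    },
                                    "required": ["file", "summary"],
                                },
                            }
                        },
                        "required": ["summaries"],
                    },
                },
            }
        }
    return {"response_format": {"type": "json_object"}}
```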
Context
This is especially impactful for users routing through LLM proxies (e.g. LiteLLM) with multiple provider backends where RPM limits and quota models vary. The current 1-file-per-request pattern exhausts both RPM budgets and per-request quotas quickly, causing cascading cooldowns across all deployments in the model group.
Related: #840 (concurrent overview generation), #889 (exponential backoff for rate limiting)