
[Feature]: Batch multiple file summaries per VLM call to reduce RPM pressure #907

@lazmo88

Description


Problem

The semantic processor currently makes one VLM call per file in _generate_single_file_summary(). For a resource reindex of ~500 directories with ~7,700 nodes, this means ~7,700 separate LLM requests.

Many LLM providers are moving toward per-request quota models (e.g. Alibaba Coding Plan, OpenAdapter.dev) rather than pure token-based billing. Under these models, each API call consumes one unit of quota regardless of token count. This makes the current 1-file-per-request pattern extremely wasteful — a 100-token summary and a 10,000-token summary both cost the same quota unit.

Combined with per-minute rate limits (RPM), this creates a severe bottleneck. A reindex that could complete in minutes takes hours as requests queue behind RPM walls, generating thousands of 429 errors and cascading cooldowns.

Proposed Solution

Bundle multiple file summaries into a single VLM call using structured JSON output (response_format: { "type": "json_object" }).

Instead of:

Request 1: "Summarize file A" → "Summary A"
Request 2: "Summarize file B" → "Summary B"
Request 3: "Summarize file C" → "Summary C"

Batch them:

Request 1: "Summarize these 10 files, return JSON" → {
  "summaries": [
    {"file": "A", "summary": "..."},
    {"file": "B", "summary": "..."},
    {"file": "C", "summary": "..."}
  ]
}
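A minimal sketch of the batching shape above — building one combined prompt and mapping the JSON response back to file paths. The function names and prompt format here are hypothetical, not the processor's actual API:

```python
import json

def build_batch_prompt(files: dict) -> str:
    """Build a single prompt asking the model to summarize several files
    and return the results as one JSON object (hypothetical format)."""
    parts = [
        'Summarize each file below. Respond with JSON: '
        '{"summaries": [{"file": "<path>", "summary": "<text>"}, ...]}'
    ]
    for path, content in files.items():
        parts.append(f"--- {path} ---\n{content}")
    return "\n\n".join(parts)

def parse_batch_response(raw: str, expected: set) -> dict:
    """Map summaries back to file paths. A missing entry signals that the
    batch was incomplete and those files should take the single-file path."""
    data = json.loads(raw)
    out = {s["file"]: s["summary"] for s in data.get("summaries", [])}
    missing = expected - out.keys()
    if missing:
        raise ValueError(f"batch response missing summaries for: {missing}")
    return out
```

Keying the response on the file path (rather than array position) makes the mapping robust if the model reorders or drops entries.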

Implementation sketch

  • Add a _generate_batch_file_summaries() method alongside the existing single-file path
  • Group pending file nodes into batches (configurable batch size, e.g. 5–20 files depending on average content size)
  • Use response_format: { "type": "json_object" } for broad provider compatibility (alternatively json_schema where supported for stricter enforcement)
  • Parse the JSON response and map summaries back to individual files
  • Fall back to single-file mode if the batch call fails or if a file exceeds a size threshold
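The fallback step could look like the following sketch, where `batch_call` and `single_call` are hypothetical stand-ins for the real VLM client methods:

```python
def summarize_batch_with_fallback(batch: dict, batch_call, single_call) -> dict:
    """Attempt one batched summary call; if it raises or returns an
    incomplete mapping, fall back to the existing per-file path."""
    try:
        summaries = batch_call(batch)
        if set(summaries) == set(batch):
            return summaries
    except Exception:
        pass  # malformed JSON, provider error, etc.
    # Fallback: one request per file, exactly as today
    return {path: single_call(path, content) for path, content in batch.items()}
```

Treating an incomplete mapping the same as a hard failure keeps the batch path strictly opt-in: any degradation silently reverts to current behavior.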

Batch size heuristics

  • Modern context windows range from 200K to 1M+ tokens — even conservative batching of 10 files per request is well within limits
  • Batch size could be dynamically calculated based on total input token estimate vs. model context limit (e.g. fill to 50% of context window)
  • Files exceeding a configurable threshold (e.g. 50K tokens) should be processed individually
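The heuristics above could be combined into a greedy packer — fill each batch to a fraction of the context window, and emit oversized files as single-file batches. The names, the 4-chars-per-token estimate, and the defaults are assumptions for illustration:

```python
def plan_batches(files, context_limit=200_000, fill_ratio=0.5,
                 max_file_tokens=50_000, chars_per_token=4):
    """Greedily pack (path, content) pairs into batches whose estimated
    token total stays under fill_ratio * context_limit. Files exceeding
    max_file_tokens are emitted alone for the single-file path."""
    budget = int(context_limit * fill_ratio)
    batches, current, used = [], [], 0
    for path, content in files:
        tokens = len(content) // chars_per_token  # rough estimate
        if tokens > max_file_tokens:
            batches.append([path])  # process individually
            continue
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

A proper tokenizer could replace the character heuristic later without changing the packing logic.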

Impact

  • 5–20x reduction in total VLM requests during semantic processing
  • Dramatically reduces 429 errors and cooldown cycles on RPM-limited providers
  • Under per-request quota models, 10 files batched = 10x quota savings
  • Reindex operations that currently take hours could complete in minutes
  • Overview generation (_batched_generate_overview) already uses a batching pattern — this extends the same approach to file-level summarization

Note on structured output modes

  • json_object — constrains the model to emit syntactically valid JSON, but enforces no schema, so field names and structure can still drift. Widely supported across providers (OpenAI, Anthropic, Gemini, etc. via LiteLLM); note that OpenAI additionally requires the word "JSON" to appear in the prompt when this mode is used.
  • json_schema — enforces a specific JSON schema with field-level validation. More reliable but narrower provider support. Could be used where available with json_object as fallback.
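A sketch of the json_schema-with-json_object fallback, using OpenAI-style `response_format` parameters (the capability flag and schema name are hypothetical):

```python
def pick_response_format(supports_json_schema: bool) -> dict:
    """Prefer strict json_schema where the provider supports it,
    otherwise fall back to plain json_object."""
    if supports_json_schema:
        return {
            "type": "json_schema",
            "json_schema": {
                "name": "file_summaries",  # hypothetical schema name
                "schema": {
                    "type": "object",
                    "properties": {
                        "summaries": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "file": {"type": "string"},
                                    "summary": {"type": "string"},
                                },
                                "required": ["file", "summary"],
                            },
                        }
                    },
                    "required": ["summaries"],
                },
            },
        }
    return {"type": "json_object"}
```

Centralizing the choice in one helper keeps the batch path provider-agnostic: only this function needs to know which backends accept strict schemas.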

Context

This is especially impactful for users routing through LLM proxies (e.g. LiteLLM) with multiple provider backends where RPM limits and quota models vary. The current 1-file-per-request pattern exhausts both RPM budgets and per-request quotas quickly, causing cascading cooldowns across all deployments in the model group.

Related: #840 (concurrent overview generation), #889 (exponential backoff for rate limiting)
