
[Feature]: Batch multiple file summaries per VLM call to reduce RPM pressure #907

@lazmo88

Description


Problem

The semantic processor currently makes one VLM call per file in _generate_single_file_summary(). For a resource reindex of ~500 directories with ~7,700 nodes, this means ~7,700 separate LLM requests.

Many LLM providers are moving toward per-request quota models (e.g. Alibaba Coding Plan, OpenAdapter.dev) rather than pure token-based billing. Under these models, each API call consumes one unit of quota regardless of token count. This makes the current 1-file-per-request pattern extremely wasteful — a 100-token summary and a 10,000-token summary both cost the same quota unit.

Combined with per-minute rate limits (RPM), this creates a severe bottleneck. A reindex that could complete in minutes takes hours as requests queue behind RPM walls, generating thousands of 429 errors and cascading cooldowns.

Proposed Solution

Bundle multiple file summaries into a single VLM call using structured JSON output (response_format: { "type": "json_object" }).

Instead of:

Request 1: "Summarize file A" → "Summary A"
Request 2: "Summarize file B" → "Summary B"
Request 3: "Summarize file C" → "Summary C"

Batch them:

Request 1: "Summarize these 10 files, return JSON" → {
  "summaries": [
    {"file": "A", "summary": "..."},
    {"file": "B", "summary": "..."},
    {"file": "C", "summary": "..."}
  ]
}
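A minimal sketch of the batching shape above — building one combined prompt and mapping the JSON response back to file paths. The function names and prompt format here are hypothetical, not the processor's actual API:

```python
import json

def build_batch_prompt(files: dict) -> str:
    """Build a single prompt asking the model to summarize several files
    and return the results as one JSON object (hypothetical format)."""
    parts = [
        'Summarize each file below. Respond with JSON: '
        '{"summaries": [{"file": "<path>", "summary": "<text>"}, ...]}'
    ]
    for path, content in files.items():
        parts.append(f"--- {path} ---\n{content}")
    return "\n\n".join(parts)

def parse_batch_response(raw: str, expected: set) -> dict:
    """Map summaries back to file paths. A missing entry signals that the
    batch was incomplete and those files should take the single-file path."""
    data = json.loads(raw)
    out = {s["file"]: s["summary"] for s in data.get("summaries", [])}
    missing = expected - out.keys()
    if missing:
        raise ValueError(f"batch response missing summaries for: {missing}")
    return out
```

Keying the response on the file path (rather than array position) makes the mapping robust if the model reorders or drops entries.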

Implementation sketch

  • Add a _generate_batch_file_summaries() method alongside the existing single-file path
  • Group pending file nodes into batches (configurable batch size, e.g. 5–20 files depending on average content size)
  • Use response_format: { "type": "json_object" } for broad provider compatibility (alternatively json_schema where supported for stricter enforcement)
  • Parse the JSON response and map summaries back to individual files
  • Fall back to single-file mode if the batch call fails or if a file exceeds a size threshold
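The fallback step could look like the following sketch, where `batch_call` and `single_call` are hypothetical stand-ins for the real VLM client methods:

```python
def summarize_batch_with_fallback(batch: dict, batch_call, single_call) -> dict:
    """Attempt one batched summary call; if it raises or returns an
    incomplete mapping, fall back to the existing per-file path."""
    try:
        summaries = batch_call(batch)
        if set(summaries) == set(batch):
            return summaries
    except Exception:
        pass  # malformed JSON, provider error, etc.
    # Fallback: one request per file, exactly as today
    return {path: single_call(path, content) for path, content in batch.items()}
```

Treating an incomplete mapping the same as a hard failure keeps the batch path strictly opt-in: any degradation silently reverts to current behavior.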

Batch size heuristics

  • Modern context windows range from 200K to 1M+ tokens — even conservative batching of 10 files per request is well within limits
  • Batch size could be dynamically calculated based on total input token estimate vs. model context limit (e.g. fill to 50% of context window)
  • Files exceeding a configurable threshold (e.g. 50K tokens) should be processed individually
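The heuristics above could be combined into a greedy packer — fill each batch to a fraction of the context window, and emit oversized files as single-file batches. The names, the 4-chars-per-token estimate, and the defaults are assumptions for illustration:

```python
def plan_batches(files, context_limit=200_000, fill_ratio=0.5,
                 max_file_tokens=50_000, chars_per_token=4):
    """Greedily pack (path, content) pairs into batches whose estimated
    token total stays under fill_ratio * context_limit. Files exceeding
    max_file_tokens are emitted alone for the single-file path."""
    budget = int(context_limit * fill_ratio)
    batches, current, used = [], [], 0
    for path, content in files:
        tokens = len(content) // chars_per_token  # rough estimate
        if tokens > max_file_tokens:
            batches.append([path])  # process individually
            continue
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

A proper tokenizer could replace the character heuristic later without changing the packing logic.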

Impact

  • 5–20x reduction in total VLM requests during semantic processing
  • Dramatically reduces 429 errors and cooldown cycles on RPM-limited providers
  • Under per-request quota models, 10 files batched = 10x quota savings
  • Reindex operations that currently take hours could complete in minutes
  • Overview generation (_batched_generate_overview) already uses a batching pattern — this extends the same approach to file-level summarization

Note on structured output modes

  • json_object — constrains the model to emit syntactically valid JSON, but enforces no schema, so field names and structure can still drift. Widely supported across providers (OpenAI, Anthropic, Gemini, etc. via LiteLLM); note that OpenAI additionally requires the word "JSON" to appear in the prompt when this mode is used.
  • json_schema — enforces a specific JSON schema with field-level validation. More reliable but narrower provider support. Could be used where available with json_object as fallback.
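A sketch of the json_schema-with-json_object fallback, using OpenAI-style `response_format` parameters (the capability flag and schema name are hypothetical):

```python
def pick_response_format(supports_json_schema: bool) -> dict:
    """Prefer strict json_schema where the provider supports it,
    otherwise fall back to plain json_object."""
    if supports_json_schema:
        return {
            "type": "json_schema",
            "json_schema": {
                "name": "file_summaries",  # hypothetical schema name
                "schema": {
                    "type": "object",
                    "properties": {
                        "summaries": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "file": {"type": "string"},
                                    "summary": {"type": "string"},
                                },
                                "required": ["file", "summary"],
                            },
                        }
                    },
                    "required": ["summaries"],
                },
            },
        }
    return {"type": "json_object"}
```

Centralizing the choice in one helper keeps the batch path provider-agnostic: only this function needs to know which backends accept strict schemas.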

Context

This is especially impactful for users routing through LLM proxies (e.g. LiteLLM) with multiple provider backends where RPM limits and quota models vary. The current 1-file-per-request pattern exhausts both RPM budgets and per-request quotas quickly, causing cascading cooldowns across all deployments in the model group.

Related: #840 (concurrent overview generation), #889 (exponential backoff for rate limiting)
