feat: batch queue for Anthropic Message Batches API (50% worker cost savings) #134
Merged
Conversation
When the prompt cache goes cold after idle, reduce the cold-cache write cost by distilling aggressively and using a smaller context window:

- Force-distill ALL pending messages on idle (`force: true`), even below the normal `minMessages` threshold. The cache is expiring anyway, and distilling now means less raw content in the next context.
- Allow meta-distillation on idle unconditionally: the cache is cold, so the row-ID rewrites don't cause additional cache busts.
- Add a post-idle compact layer: when `onIdleResume()` fires, skip layer 0 (full-raw passthrough) and use a tighter raw budget for layer 1 (20% of usable instead of 40%). The distilled prefix covers the older history; the raw window only needs the current turn plus minimal recent context. This reduces the total cold-cache write cost by up to 20% of usable (~29K tokens on a 200K-context model).
- Add a `postIdleCompact` flag to `SessionState` (one-shot, consumed by `transformInner`; see the sketch after this list). Exposed in `inspectSessionState` for test visibility.
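A minimal sketch of how the one-shot flag could be consumed; `postIdlePlan` is an invented helper name, while `SessionState`, the layer-0 skip, and the 20%/40% budgets come from the commit message:

```ts
interface SessionState {
  postIdleCompact?: boolean; // set when onIdleResume() fires, cleared on first use
}

// Called once per context build by the transform (transformInner in the
// commit); reading and clearing the flag in one place is what makes it one-shot.
function postIdlePlan(state: SessionState, usable: number) {
  const compact = state.postIdleCompact === true;
  if (compact) state.postIdleCompact = false; // one-shot: consumed here
  return {
    skipLayer0: compact, // no full-raw passthrough right after idle
    layer1RawBudget: Math.floor(usable * (compact ? 0.2 : 0.4)),
  };
}
```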
feat: batch queue for Anthropic Message Batches API (50% worker cost savings)

Add `BatchLLMClient` wrapper in the gateway that accumulates non-urgent worker LLM calls (distillation, curation, consolidation, validation) and submits them via Anthropic's Message Batches API for 50% cost reduction.

Key changes:
- Add `urgent` flag to `LLMClient.prompt()` opts for batch/immediate routing
- Thread `urgent` through `distillation.run()`, `distillSegment()`, `metaDistill()`
- Mark compaction, overflow recovery, and query expansion as urgent
- Background incremental distillation, idle curation, and worker validation are batch-safe (`urgent` unset)
- Flush timer (30s) plus auto-flush at queue capacity (50 items)
- Poll timer (60s) checks batch status and streams JSONL results (see the sketch after this list)
- Fall back to synchronous calls on batch API errors or a missing API key
- Graceful shutdown drains the queue synchronously
- Disable via the `LORE_BATCH_DISABLED=1` env var
- 10 dedicated tests for batch queue behavior
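A hedged sketch of the flush/poll mechanics against the `@anthropic-ai/sdk` Message Batches endpoints; the queue shape, `waiters` map, and `openBatches` bookkeeping are assumptions, and cancellation/expiry handling is trimmed:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

type Waiter = { resolve: (text: string) => void; reject: (err: Error) => void };

const queue: { customId: string; params: Anthropic.MessageCreateParamsNonStreaming }[] = [];
const waiters = new Map<string, Waiter>();
const openBatches: string[] = [];

// Flush timer (30s) and the 50-item capacity check both funnel into this.
async function flush(): Promise<void> {
  if (queue.length === 0) return;
  const requests = queue.splice(0).map((q) => ({ custom_id: q.customId, params: q.params }));
  const batch = await client.messages.batches.create({ requests });
  openBatches.push(batch.id);
}

// Poll timer (60s): check each open batch, then stream its JSONL results
// back to the callers awaiting them.
async function poll(): Promise<void> {
  for (const id of openBatches.splice(0)) {
    const batch = await client.messages.batches.retrieve(id);
    if (batch.processing_status !== "ended") {
      openBatches.push(id); // still running; check again next tick
      continue;
    }
    for await (const entry of await client.messages.batches.results(id)) {
      const waiter = waiters.get(entry.custom_id);
      if (!waiter) continue;
      if (entry.result.type === "succeeded") {
        const block = entry.result.message.content[0];
        waiter.resolve(block?.type === "text" ? block.text : "");
      } else {
        waiter.reject(new Error(`batch result: ${entry.result.type}`));
      }
      waiters.delete(entry.custom_id);
    }
  }
}

setInterval(() => void flush(), 30_000);
setInterval(() => void poll(), 60_000);
```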
Summary
Gateway-only enhancement: accumulate non-urgent worker LLM calls and submit them via Anthropic's Message Batches API for 50% cost reduction on all background work.
- `BatchLLMClient` wrapper (`batch-queue.ts`) intercepts `LLMClient.prompt()` calls. Non-urgent calls queue up; urgent calls bypass to the inner (synchronous) client immediately.
- `urgent` flag on `LLMClient.prompt()` opts, threaded through `distillation.run()`, `distillSegment()`, `metaDistill()`, and all gateway pipeline callers.

How it works

Non-urgent calls accumulate until the 30s flush timer fires or the queue reaches 50 items, then go out as a single Message Batches request; a 60s poll timer checks batch status and streams the JSONL results back to the waiting callers.
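A minimal sketch of the routing decision; the `QueuedPrompt` shape, the `submitBatch` helper, and the `Message`/`PromptOpts` types are simplified stand-ins for illustration, not the real module:

```ts
type Message = { role: "user" | "assistant"; content: string };
interface PromptOpts { urgent?: boolean }
interface LLMClient {
  prompt(messages: Message[], opts?: PromptOpts): Promise<string>;
}

interface QueuedPrompt {
  messages: Message[];
  resolve: (text: string) => void;
  reject: (err: Error) => void;
}

// Stand-in for the Message Batches submission (see the flush/poll sketch in
// the commit message above); declared to keep this example self-contained.
declare function submitBatch(items: QueuedPrompt[]): Promise<void>;

class BatchLLMClient implements LLMClient {
  private queue: QueuedPrompt[] = [];
  constructor(private inner: LLMClient, private capacity = 50) {}

  prompt(messages: Message[], opts: PromptOpts = {}): Promise<string> {
    if (opts.urgent || process.env.LORE_BATCH_DISABLED === "1") {
      return this.inner.prompt(messages, opts); // bypass straight to the sync client
    }
    return new Promise((resolve, reject) => {
      this.queue.push({ messages, resolve, reject });
      if (this.queue.length >= this.capacity) void this.flush(); // auto-flush at capacity
    });
  }

  async flush(): Promise<void> {
    const items = this.queue.splice(0);
    if (items.length === 0) return;
    try {
      await submitBatch(items); // results resolve later via the poll loop
    } catch {
      // Fallback safety: on batch API errors, run each call synchronously.
      for (const it of items) this.inner.prompt(it.messages).then(it.resolve, it.reject);
    }
  }
}
```

The key design point is that callers never pick the batch or sync path themselves: they only declare urgency, and the wrapper owns routing, capacity, and fallback.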
Fallback safety
- Falls back to synchronous calls on batch API errors or a missing API key (see the `flush()` fallback in the sketch above)
- Graceful shutdown drains the queue synchronously, so no queued work is lost
- Disable entirely with the `LORE_BATCH_DISABLED=1` env var

Cost impact

Anthropic bills batched requests at 50% of the standard per-token price, so every batch-safe background call (distillation, curation, consolidation, validation) costs half as much; urgent, user-facing calls are unaffected.
Files changed
- `packages/core/src/types.ts`: add `urgent?: boolean` to `LLMClient.prompt()` opts (sketched after this list)
- `packages/core/src/distillation.ts`: thread `urgent` through `run()`, `distillSegment()`, `metaDistill()`
- `packages/core/src/search.ts`: mark `expandQuery()` as `urgent: true`
- `packages/gateway/src/batch-queue.ts`: new `BatchLLMClient` wrapper
- `packages/gateway/src/pipeline.ts`: wire batch queue into `getLLMClient()`, mark urgent callers
- `packages/gateway/src/index.ts`: graceful shutdown via `resetPipelineState()`
- `packages/gateway/test/batch-queue.test.ts`: 10 new dedicated tests
- `packages/gateway/test/helpers/harness.ts`: async teardown for batch queue cleanup

754 tests pass, 0 failures.
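A sketch of how the `urgent` flag threads through `distillSegment()` to its callers; the prompt builder and parameter shapes are assumptions based on the file list, not the real signatures:

```ts
type Message = { role: "user" | "assistant"; content: string };
interface LLMClient {
  prompt(messages: Message[], opts?: { urgent?: boolean }): Promise<string>;
}

// Hypothetical stand-in for the real distillation prompt construction.
declare function buildDistillPrompt(segment: Message[]): Message[];

// distillation.ts: pass urgent through unchanged; the BatchLLMClient, not
// this function, decides whether the call queues or bypasses.
export async function distillSegment(
  llm: LLMClient,
  segment: Message[],
  opts: { urgent?: boolean } = {},
): Promise<string> {
  return llm.prompt(buildDistillPrompt(segment), { urgent: opts.urgent });
}

// pipeline.ts callers: compaction is latency-sensitive, background work is not.
async function exampleCallers(llm: LLMClient, segment: Message[]) {
  await distillSegment(llm, segment, { urgent: true }); // compaction / overflow recovery
  await distillSegment(llm, segment);                   // background incremental distillation
}
```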