Context
I've been exploring the Prompt API for use cases that involve processing multiple independent prompts—things like classifying a batch of emails, summarizing multiple documents, or running the same analysis across a dataset. Currently, the API doesn't seem to offer a way to batch these requests efficiently.
Current Approach
Today, if I want to process multiple prompts, I have to do something like:
```js
const items = ["Classify this email: ...", "Classify this email: ...", /* 50 more */];
const results = await Promise.all(
  items.map(async (prompt) => {
    const session = await LanguageModel.create({
      initialPrompts: [{ role: "system", content: "You are a classifier." }]
    });
    const result = await session.prompt(prompt);
    session.destroy();
    return result;
  })
);
```

Or, using the clone pattern:
```js
const templateSession = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "You are a classifier." }]
});
const results = await Promise.all(
  items.map(async (prompt) => {
    const session = await templateSession.clone();
    const result = await session.prompt(prompt);
    session.destroy();
    return result;
  })
);
```

The Problem
Both approaches have significant overhead:
- No shared computation — Each session/clone is independent; there's no opportunity for the browser to batch inference operations at the model level
- Session creation overhead — Creating or cloning sessions for each prompt adds latency
- Missed optimization opportunities — Modern inference engines (like vLLM, TensorRT-LLM, etc.) use techniques like continuous batching and KV cache sharing to dramatically improve throughput when processing multiple requests
For on-device models especially, the GPU/NPU could be significantly underutilized when processing prompts one at a time.
Proposal
I'd like to suggest considering a batch inference API that allows developers to submit multiple independent prompts for efficient parallel processing.
Option A: Static batch method
```js
const results = await LanguageModel.batchPrompt(
  [
    "Classify: Is this spam? 'You won a million dollars!'",
    "Classify: Is this spam? 'Meeting tomorrow at 3pm'",
    "Classify: Is this spam? 'Click here for free iPhone'"
  ],
  {
    initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }],
    temperature: 0.2
  }
);
// results = ["spam", "not_spam", "spam"]
```

Option B: Session-based batch method
```js
const session = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }]
});
const results = await session.batchPrompt([
  "Classify: 'You won a million dollars!'",
  "Classify: 'Meeting tomorrow at 3pm'",
  "Classify: 'Click here for free iPhone'"
]);
```

Option C: Streaming batch with callbacks
For large batches where you want results as they complete:
```js
const batch = session.createBatch([prompt1, prompt2, prompt3, /* ... */]);
batch.onResult((index, result) => {
  console.log(`Prompt ${index} completed: ${result}`);
});
batch.onComplete((allResults) => {
  console.log("All done!", allResults);
});
await batch.start();
```

Implementation Considerations
The browser/runtime could implement this by:
- Continuous batching — Dynamically grouping prompts and interleaving their decoding steps
- KV cache optimization — Sharing prefix computation when prompts share the same system prompt
- Adaptive concurrency — Automatically tuning batch size based on device capabilities and memory constraints
- Graceful degradation — Falling back to sequential processing on devices that don't support batching (a user-land version of this fallback is sketched after this list)
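To make the degradation point concrete, here is a rough user-land sketch of how a developer might feature-detect a batching method and otherwise fall back to a bounded number of cloned sessions. The `batchPrompt` method and the `maxConcurrency` option are assumptions of this proposal, not existing API; the `create()`/`clone()`/`prompt()`/`destroy()` calls mirror the current approach shown above.

```js
// Sketch only: batchPrompt is the proposed (hypothetical) API; everything else
// uses the existing session methods from the examples above.
async function classifyAll(prompts, { maxConcurrency = 4 } = {}) {
  const template = await LanguageModel.create({
    initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }]
  });

  // Preferred path: let the runtime batch the work if the proposed API exists.
  if (typeof template.batchPrompt === "function") {
    return template.batchPrompt(prompts);
  }

  // Fallback: bounded concurrency over cloned sessions (no model-level batching).
  const results = new Array(prompts.length);
  let next = 0;
  async function worker() {
    while (next < prompts.length) {
      const i = next++;
      const session = await template.clone();
      results[i] = await session.prompt(prompts[i]);
      session.destroy();
    }
  }
  await Promise.all(Array.from({ length: maxConcurrency }, worker));
  return results;
}
```

The point of the sketch is that developers will end up writing this kind of scaffolding themselves, while only the browser can do true model-level batching.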
Use Cases
- Bulk classification — Spam detection, sentiment analysis, content moderation
- Data extraction — Parsing structured data from many documents
- Batch transformations — Summarizing, translating, or rewriting multiple items
- Testing/evaluation — Running a model against a test dataset (see the sketch after this list)
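As one concrete illustration of the testing/evaluation case, here is a small sketch assuming the static `LanguageModel.batchPrompt` shape from Option A; the test cases and the accuracy calculation are made up for the example.

```js
// Hypothetical: assumes the static batchPrompt method proposed in Option A.
const testSet = [
  { input: "Classify: Is this spam? 'You won a million dollars!'", expected: "spam" },
  { input: "Classify: Is this spam? 'Meeting tomorrow at 3pm'", expected: "not_spam" }
];

const outputs = await LanguageModel.batchPrompt(
  testSet.map((t) => t.input),
  { initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }] }
);

const correct = outputs.filter((out, i) => out.trim() === testSet[i].expected).length;
console.log(`Accuracy: ${correct}/${testSet.length}`);
```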
Prior Art
- vLLM — Continuous batching for high-throughput LLM serving
- TensorRT-LLM — In-flight batching for NVIDIA GPUs
- OpenAI Batch API — Async batch processing for bulk workloads (different context, but similar developer need)