
Batch inference API for efficient parallel prompt processing #187

@Nithanaroy

Description

Context

I've been exploring the Prompt API for use cases that involve processing multiple independent prompts—things like classifying a batch of emails, summarizing multiple documents, or running the same analysis across a dataset. Currently, the API doesn't seem to offer a way to batch these requests efficiently.

Current Approach

Today, if I want to process multiple prompts, I have to do something like:

const items = ["Classify this email: ...", "Classify this email: ...", /* 50 more */];

const results = await Promise.all(
  items.map(async (prompt) => {
    // A brand-new session (and system prompt) for every single item
    const session = await LanguageModel.create({
      initialPrompts: [{ role: "system", content: "You are a classifier." }]
    });
    try {
      return await session.prompt(prompt);
    } finally {
      session.destroy();
    }
  })
);

Or, using the clone pattern:

const templateSession = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "You are a classifier." }]
});

const results = await Promise.all(
  items.map(async (prompt) => {
    // One clone per item so each prompt starts from the shared system prompt
    const session = await templateSession.clone();
    try {
      return await session.prompt(prompt);
    } finally {
      session.destroy();
    }
  })
);
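
In practice this usually grows into a hand-rolled concurrency limiter so that only a few clones are alive at once. A rough sketch of that workaround (the chunk size of 4 is an arbitrary guess on my part; nothing in the API reports how much parallelism the device can actually handle):

// Sketch of the workaround many apps end up writing: process items in
// small chunks so only a few cloned sessions exist at a time.
async function promptInChunks(templateSession, prompts, chunkSize = 4) {
  const results = [];
  for (let i = 0; i < prompts.length; i += chunkSize) {
    const chunk = prompts.slice(i, i + chunkSize);
    const chunkResults = await Promise.all(
      chunk.map(async (prompt) => {
        const session = await templateSession.clone();
        try {
          return await session.prompt(prompt);
        } finally {
          session.destroy();
        }
      })
    );
    results.push(...chunkResults);
  }
  return results;
}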

The Problem

Both approaches have significant overhead:

  1. No shared computation — Each session/clone is independent; there's no opportunity for the browser to batch inference operations at the model level
  2. Session creation overhead — Creating or cloning sessions for each prompt adds latency
  3. Missed optimization opportunities — Modern inference engines (like vLLM, TensorRT-LLM, etc.) use techniques like continuous batching and KV cache sharing to dramatically improve throughput when processing multiple requests

For on-device models especially, the GPU/NPU could be significantly underutilized when processing prompts one at a time.

Proposal

I'd like to suggest considering a batch inference API that allows developers to submit multiple independent prompts for efficient parallel processing.

Option A: Static batch method

const results = await LanguageModel.batchPrompt(
  [
    "Classify: Is this spam? 'You won a million dollars!'",
    "Classify: Is this spam? 'Meeting tomorrow at 3pm'",
    "Classify: Is this spam? 'Click here for free iPhone'"
  ],
  {
    initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }],
    temperature: 0.2
  }
);

// results = ["spam", "not_spam", "spam"]

Option B: Session-based batch method

const session = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }]
});

const results = await session.batchPrompt([
  "Classify: 'You won a million dollars!'",
  "Classify: 'Meeting tomorrow at 3pm'",
  "Classify: 'Click here for free iPhone'"
]);

Option C: Streaming batch with callbacks

For large batches where you want results as they complete:

const batch = session.createBatch([prompt1, prompt2, prompt3, /* ... */]);

batch.onResult((index, result) => {
  console.log(`Prompt ${index} completed: ${result}`);
});

batch.onComplete((allResults) => {
  console.log("All done!", allResults);
});

await batch.start();

Implementation Considerations

The browser/runtime could implement this by:

  • Continuous batching — Dynamically grouping prompts and interleaving their decoding steps
  • KV cache optimization — Sharing prefix computation when prompts share the same system prompt
  • Adaptive concurrency — Automatically tuning batch size based on device capabilities and memory constraints
  • Graceful degradation — Falling back to sequential processing on devices that don't support batching
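
From the web-developer side, graceful degradation could also be handled with plain feature detection until batching is available everywhere. A hypothetical sketch assuming Option B's session.batchPrompt shape (the fallback path only uses API surface that exists today):

// Hypothetical: prefer the proposed batchPrompt, otherwise fall back
// to the existing clone-per-prompt pattern (no shared computation).
async function batchPromptWithFallback(session, prompts) {
  if (typeof session.batchPrompt === "function") {
    // Proposed path: let the browser batch at the model level.
    return session.batchPrompt(prompts);
  }
  return Promise.all(
    prompts.map(async (prompt) => {
      const clone = await session.clone();
      try {
        return await clone.prompt(prompt);
      } finally {
        clone.destroy();
      }
    })
  );
}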

Use Cases

  • Bulk classification — Spam detection, sentiment analysis, content moderation
  • Data extraction — Parsing structured data from many documents
  • Batch transformations — Summarizing, translating, or rewriting multiple items
  • Testing/evaluation — Running a model against a test dataset

Prior Art

  • vLLM — Continuous batching for high-throughput LLM serving
  • TensorRT-LLM — In-flight batching for NVIDIA GPUs
  • OpenAI Batch API — Async batch processing for bulk workloads (different context, but similar developer need)
