
Batch inference API for efficient parallel prompt processing #187

@Nithanaroy

Description

Context

I've been exploring the Prompt API for use cases that involve processing multiple independent prompts—things like classifying a batch of emails, summarizing multiple documents, or running the same analysis across a dataset. Currently, the API doesn't seem to offer a way to batch these requests efficiently.

Current Approach

Today, if I want to process multiple prompts, I have to do something like:

const items = ["Classify this email: ...", "Classify this email: ...", /* 50 more */];

const results = await Promise.all(
  items.map(async (prompt) => {
    // A brand-new session (and system prompt) for every single item
    const session = await LanguageModel.create({
      initialPrompts: [{ role: "system", content: "You are a classifier." }]
    });
    try {
      return await session.prompt(prompt);
    } finally {
      session.destroy();
    }
  })
);

Or, using the clone pattern:

const templateSession = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "You are a classifier." }]
});

const results = await Promise.all(
  items.map(async (prompt) => {
    // One clone per item so each prompt starts from the shared system prompt
    const session = await templateSession.clone();
    try {
      return await session.prompt(prompt);
    } finally {
      session.destroy();
    }
  })
);
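
In practice this usually grows into a hand-rolled concurrency limiter so that only a few clones are alive at once. A rough sketch of that workaround (the chunk size of 4 is an arbitrary guess on my part; nothing in the API reports how much parallelism the device can actually handle):

// Sketch of the workaround many apps end up writing: process items in
// small chunks so only a few cloned sessions exist at a time.
async function promptInChunks(templateSession, prompts, chunkSize = 4) {
  const results = [];
  for (let i = 0; i < prompts.length; i += chunkSize) {
    const chunk = prompts.slice(i, i + chunkSize);
    const chunkResults = await Promise.all(
      chunk.map(async (prompt) => {
        const session = await templateSession.clone();
        try {
          return await session.prompt(prompt);
        } finally {
          session.destroy();
        }
      })
    );
    results.push(...chunkResults);
  }
  return results;
}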

The Problem

Both approaches have significant overhead:

  1. No shared computation — Each session/clone is independent; there's no opportunity for the browser to batch inference operations at the model level
  2. Session creation overhead — Creating or cloning sessions for each prompt adds latency
  3. Missed optimization opportunities — Modern inference engines (like vLLM, TensorRT-LLM, etc.) use techniques like continuous batching and KV cache sharing to dramatically improve throughput when processing multiple requests

For on-device models especially, the GPU/NPU could be significantly underutilized when processing prompts one at a time.

Proposal

I'd like to suggest considering a batch inference API that allows developers to submit multiple independent prompts for efficient parallel processing.

Option A: Static batch method

const results = await LanguageModel.batchPrompt(
  [
    "Classify: Is this spam? 'You won a million dollars!'",
    "Classify: Is this spam? 'Meeting tomorrow at 3pm'",
    "Classify: Is this spam? 'Click here for free iPhone'"
  ],
  {
    initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }],
    temperature: 0.2
  }
);

// results = ["spam", "not_spam", "spam"]

Option B: Session-based batch method

const session = await LanguageModel.create({
  initialPrompts: [{ role: "system", content: "Respond with: spam or not_spam" }]
});

const results = await session.batchPrompt([
  "Classify: 'You won a million dollars!'",
  "Classify: 'Meeting tomorrow at 3pm'",
  "Classify: 'Click here for free iPhone'"
]);

Option C: Streaming batch with callbacks

For large batches where you want results as they complete:

const batch = session.createBatch([prompt1, prompt2, prompt3, /* ... */]);

batch.onResult((index, result) => {
  console.log(`Prompt ${index} completed: ${result}`);
});

batch.onComplete((allResults) => {
  console.log("All done!", allResults);
});

await batch.start();

Implementation Considerations

The browser/runtime could implement this by:

  • Continuous batching — Dynamically grouping prompts and interleaving their decoding steps
  • KV cache optimization — Sharing prefix computation when prompts share the same system prompt
  • Adaptive concurrency — Automatically tuning batch size based on device capabilities and memory constraints
  • Graceful degradation — Falling back to sequential processing on devices that don't support batching
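
From the web-developer side, graceful degradation could also be handled with plain feature detection until batching is available everywhere. A hypothetical sketch assuming Option B's session.batchPrompt shape (the fallback path only uses API surface that exists today):

// Hypothetical: prefer the proposed batchPrompt, otherwise fall back
// to the existing clone-per-prompt pattern (no shared computation).
async function batchPromptWithFallback(session, prompts) {
  if (typeof session.batchPrompt === "function") {
    // Proposed path: let the browser batch at the model level.
    return session.batchPrompt(prompts);
  }
  return Promise.all(
    prompts.map(async (prompt) => {
      const clone = await session.clone();
      try {
        return await clone.prompt(prompt);
      } finally {
        clone.destroy();
      }
    })
  );
}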

Use Cases

  • Bulk classification — Spam detection, sentiment analysis, content moderation
  • Data extraction — Parsing structured data from many documents
  • Batch transformations — Summarizing, translating, or rewriting multiple items
  • Testing/evaluation — Running a model against a test dataset

Prior Art

  • vLLM — Continuous batching for high-throughput LLM serving
  • TensorRT-LLM — In-flight batching for NVIDIA GPUs
  • OpenAI Batch API — Async batch processing for bulk workloads (different context, but similar developer need)
