Accelerating LLM Inference: Exploring Predicted-Outputs in Azure OpenAI

The transformative power of Large Language Models (LLMs) is rooted in their ability to generate coherent, contextually relevant text. However, this capability comes with a significant operational challenge: inference latency. The fundamental architecture of most LLMs is autoregressive, meaning they generate text one token at a time, where each new token depends on all previously generated ones. This inherently sequential process creates a performance bottleneck, particularly for real-time, interactive applications where low latency is critical for user experience. As models grow larger and context windows expand, the time required to generate a complete response can become prohibitively long, limiting the practical deployment of these powerful tools.

To mitigate inference latency, the LLM community has developed several optimization techniques across both model architecture and serving infrastructure. Broadly speaking, these techniques fall into two categories: model-level optimizations and inference-level strategies.

Model-level optimizations include:
• Model Distillation: Compressing large models into smaller, faster student models with minimal performance loss. This is widely used in production systems to serve lightweight variants of powerful base models.
• Quantization: Reducing the numerical precision of model weights (e.g., from FP32 to INT8), significantly improving memory efficiency and inference throughput.
• FlashAttention: An efficient attention mechanism that reduces memory overhead and speeds up computation, particularly effective on modern GPU hardware.
• kTransformers / TensorRT-LLM: Frameworks like kTransformers and NVIDIA’s TensorRT-LLM offer highly optimized kernel-level acceleration for transformer architectures, often yielding 2–4× speedups with minimal quality degradation.

Inference-level strategies focus on optimizing the decoding and serving process, and are especially relevant in hosted environments like Azure OpenAI:
• Prompt Caching: In Azure OpenAI, prompt caching is automatically enabled for eligible requests. When identical prompt segments are reused (e.g., system prompts or long context history), the service avoids recomputation, thereby reducing end-to-end latency and token cost.
• Predicted-Outputs (Speculative Decoding): Recently introduced in Azure OpenAI, this technique enables faster responses by treating text that the caller supplies through the prediction parameter as a batch of candidate tokens. The model then validates or rejects these candidates in bulk. If the predictions match, the tokens are accepted immediately, reducing token-by-token wait times. This is especially beneficial in chat and agent workflows where much of the response is predictable.

When strategically combined, these techniques can yield 3–10× latency reduction, depending on model size, input length, and deployment stack. Among them, prompt caching and Predicted-Outputs stand out as turnkey features available directly within the Azure OpenAI API, requiring no model retraining or infrastructure changes—making them particularly appealing for enterprise adoption.

Project Structure

1_introduction_to_prompt_cache.md: Explains prompt caching technology and its benefits.
2_introduction_to_predicted_outputs.md: Explains Azure OpenAI's predicted-outputs feature.
3_prompt_cache_vs_predicted_outputs.md: Compares prompt caching and predicted-outputs.
4_combining_optimizations.md: Discusses how to combine both techniques for maximum efficiency.
5_conclusion.md: Summarizes key points and future directions.
references.md: Lists all references used.
code_examples/: Contains Python scripts demonstrating both techniques.

1. Understanding Prompt Caching

What is Prompt Caching?

Prompt caching is an optimization technique that stores frequently used prompt segments in memory to avoid recomputing them in subsequent requests. When a cached prompt is reused, the model can skip the expensive computation of processing those tokens again, leading to significant latency and cost reductions.

Prompt caching is particularly effective for:

Conversational AI applications with repeated system prompts
Code generation with boilerplate code and project context
Document processing with template structures
Multi-turn conversations maintaining context
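
As a rough illustration of why these workloads cache well, the sketch below builds two requests that share a stable prefix (system prompt plus history) and differ only in the final user turn. The system prompt text and both helper names are hypothetical, and the number of identical leading messages is only a proxy for what the service can reuse.

```python
from typing import Dict, List

# Hypothetical stable prefix: kept identical across requests so it can be cached.
SYSTEM_PROMPT = "You are a support assistant. Answer concisely."


def build_messages(history: List[Dict[str, str]], user_input: str) -> List[Dict[str, str]]:
    """Place stable content first and the changing user input last."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + list(history)
        + [{"role": "user", "content": user_input}]
    )


def shared_prefix_len(a: List[Dict[str, str]], b: List[Dict[str, str]]) -> int:
    """Count the leading messages that are identical in two requests."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
first = build_messages(history, "Where is my order?")
second = build_messages(history, "Cancel my order.")
print(shared_prefix_len(first, second))  # number of reusable leading messages
```

Keeping the changing part at the end means every earlier message stays byte-for-byte identical between calls.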


2. Introduction to Predicted-Outputs in Azure OpenAI

What is Predicted-Outputs?

Azure OpenAI's predicted-outputs feature, introduced in API version 2025-01-01-preview, reduces latency in chat completions by leveraging pre-known text provided via the prediction parameter. This allows the model to focus on generating new or modified content, making it ideal for:

Regenerating or refining documents (e.g., legal contracts, technical documents). Reference: extract relevant passages use-case data.
Auto-completion in IDEs for boilerplate code.
Completing templates (e.g., personalized emails, reports).
Dialog turns in chatbots (e.g., customer service).
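
For the document-refinement case, a request might be assembled roughly as follows. This is a sketch under assumptions: `build_refine_request` is a hypothetical helper, the deployment name is a placeholder, and the `{"type": "content", ...}` shape of the prediction object follows the chat completions API as exposed by the openai Python package.

```python
from typing import Any, Dict


def build_refine_request(deployment: str, instruction: str, document: str) -> Dict[str, Any]:
    """Build chat-completions arguments that reuse the existing document as the prediction."""
    return {
        "model": deployment,  # Azure deployment name (placeholder)
        "messages": [
            {"role": "user", "content": f"{instruction}\n\n{document}"},
        ],
        # Most of the output is expected to match the original text,
        # so the original text itself serves as the prediction.
        "prediction": {"type": "content", "content": document},
    }


contract = "1. The Supplier shall deliver the goods within 30 days.\n2. Payment is due on receipt."
request = build_refine_request(
    "your_deployment_name",
    "Change the delivery window to 45 days and return the full contract.",
    contract,
)
print(request["prediction"]["type"])
```

The resulting dictionary can be unpacked into `client.chat.completions.create(**request)` on an `AzureOpenAI` client.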

Supported Models

Model	Version
gpt-4o-mini	2024-07-18
gpt-4o	2024-08-06
gpt-4o	2024-11-20
gpt-4.1	2025-04-14
gpt-4.1-nano	2025-04-14
gpt-4.1-mini	2025-04-14

Benefits

Reduced Latency: Focuses on new/modified sections, speeding up responses.
Context Preservation: Maintains tone, style, and content coherence.

Limitations

Cost: Rejected prediction tokens are billed at completion token rates.
Text-Only: Supports only text modalities.
Unsupported Parameters: Does not support n > 1, logprobs, presence_penalty > 0, frequency_penalty > 0, audio, max_completion_tokens, or tools/function calling.
Regional Availability: Unavailable in South East Asia.
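
Because an unsupported parameter only surfaces as an error at request time, it can help to check request arguments beforehand. The helper below is a hypothetical sketch that encodes the list above; treating the legacy `functions` argument as part of "tools/function calling" is an assumption.

```python
from typing import Any, Dict, List


def find_unsupported(params: Dict[str, Any]) -> List[str]:
    """Return the names of request parameters that conflict with predicted-outputs."""
    problems = []
    if (params.get("n") or 1) > 1:
        problems.append("n")
    if params.get("logprobs"):
        problems.append("logprobs")
    for penalty in ("presence_penalty", "frequency_penalty"):
        if (params.get(penalty) or 0) > 0:
            problems.append(penalty)
    # These are rejected whenever they are present at all.
    for name in ("audio", "max_completion_tokens", "tools", "functions"):
        if params.get(name) is not None:
            problems.append(name)
    return problems


print(find_unsupported({"n": 2, "presence_penalty": 0.5, "temperature": 0}))
```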

Cost Calculation Example

Understanding the billing mechanism for predicted-outputs is crucial for cost optimization. Here's a detailed example based on Azure OpenAI documentation:

Example API Response Usage Data:

"usage": {
  "completion_tokens": 77,
  "prompt_tokens": 124,
  "total_tokens": 201,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 6,
    "audio_tokens": 0,
    "reasoning_tokens": 0,
    "rejected_prediction_tokens": 4
  }
}

Billing Calculation:

Completion tokens: 77 (includes 6 accepted prediction tokens)
Rejected prediction tokens: 4 (billed separately at completion token rates)
Total billed tokens: 77 + 4 = 81 tokens

Key Points:

No cost deduction for accepted prediction tokens (they're included in completion_tokens)
Additional charges apply for rejected prediction tokens
Acceptance ratio is critical: Higher acceptance rates = better cost efficiency
Performance benefits: Accepted tokens do not reduce cost, but speculative processing lowers latency and can improve throughput by up to 30%

Cost-Benefit Analysis:

Latency: Substantial improvements due to speculative decoding
Throughput: ~30% improvement leading to better GPU efficiency
Cost: Evaluate acceptance rate (6 accepted vs 4 rejected = 60% acceptance) to determine ROI

Recommendation: Monitor your acceptance rates closely. High acceptance rates (>70%) typically justify the additional costs through performance gains.
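
The arithmetic above can be reproduced from the usage object with a short helper. This sketch follows the billing model described in this section (rejected prediction tokens added on top of completion_tokens); the function name is hypothetical, and the result is worth confirming against your actual invoice.

```python
from typing import Any, Dict


def summarize_prediction_usage(usage: Dict[str, Any]) -> Dict[str, float]:
    """Compute billed output tokens and the prediction acceptance rate."""
    details = usage.get("completion_tokens_details") or {}
    accepted = details.get("accepted_prediction_tokens", 0)
    rejected = details.get("rejected_prediction_tokens", 0)
    predicted = accepted + rejected
    return {
        # Billing model from the example above: rejected tokens are added on top.
        "billed_output_tokens": usage["completion_tokens"] + rejected,
        "acceptance_rate": accepted / predicted if predicted else 0.0,
    }


usage = {
    "completion_tokens": 77,
    "prompt_tokens": 124,
    "total_tokens": 201,
    "completion_tokens_details": {
        "accepted_prediction_tokens": 6,
        "audio_tokens": 0,
        "reasoning_tokens": 0,
        "rejected_prediction_tokens": 4,
    },
}
summary = summarize_prediction_usage(usage)
print(summary)  # 81 billed output tokens, 0.6 acceptance rate
```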

Advanced Configuration

Predictor Search Length Parameter

Azure OpenAI provides an additional parameter x-ms-oai-ev3-predictor_search_length to control the predictor search length for predicted-outputs optimization. This parameter is used in the underlying Speculative Decoding mechanism to define the number of tokens in the sampling space to search for reconvergence when transitioning between generative and speculation modes.

Supported Values:

1, 2, 4, 8, 16, and 32 (default)

[Figures: results of main with x-ms-oai-ev3-predictor_search_length set to 1, 2, 4, 8, 16, and 32, and one run without the x-ms-oai-ev3-predictor_search_length header parameter.]

Technical Details:

Default Value: 32 tokens
Purpose: Controls the search window for token reconvergence in speculative decoding
Performance Impact: Lower values (even as low as 1) can provide significant gains in decoding time per token
Trade-off: Balance between search thoroughness and decoding speed

Usage Guidelines:

Start with lower values (1-4) for maximum speed optimization
Use higher values (16-32) when a thorough reconvergence search matters more than per-token decoding speed
Test different values to find the optimal balance for your specific use case
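
One way to test different values is to time the same request once per setting. The sketch below assumes a caller-supplied `send` function (hypothetical) that issues the request with the given headers, for example by passing them as `extra_headers` to the openai Python client; a single timing per value is noisy, so repeated runs would be needed for a real comparison.

```python
import time
from typing import Callable, Dict, Iterable, Optional

HEADER = "x-ms-oai-ev3-predictor_search_length"


def time_search_lengths(
    send: Callable[[Dict[str, str]], object],
    lengths: Iterable[Optional[int]] = (1, 2, 4, 8, 16, 32, None),
) -> Dict[Optional[int], float]:
    """Time one request per search length; None means the header is omitted."""
    timings: Dict[Optional[int], float] = {}
    for length in lengths:
        headers = {} if length is None else {HEADER: str(length)}
        start = time.perf_counter()
        send(headers)  # the caller decides how the request is actually made
        timings[length] = time.perf_counter() - start
    return timings


# Stand-in for a real request so the sketch runs without credentials.
def fake_send(headers: Dict[str, str]) -> None:
    time.sleep(0.001)


for length, seconds in time_search_lengths(fake_send, lengths=(1, 8, None)).items():
    print(length, f"{seconds:.4f}s")
```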

How it Works: The parameter determines how many tokens the model examines when deciding whether to reconverge from generative mode back to speculation mode during the predicted-outputs process. A smaller search length means faster decisions but potentially less optimal reconvergence points.

Important Technical Clarification: It's crucial to understand that the x-ms-oai-ev3-predictor_search_length parameter does not affect the final output accuracy or quality. Since the same underlying large language model (e.g., gpt-4.1-mini) performs the final token validation and generation regardless of the search length setting, the content accuracy remains consistent across all configurations.

What actually varies is the risk-reward trade-off:

Lower Search Length (1-4):
- Higher probability of prediction acceptance (fewer tokens to get wrong)
- Lower efficiency gains per successful prediction
- More consistent but smaller performance improvements
Higher Search Length (16-32):
- Lower probability of prediction acceptance (more tokens that could be wrong)
- Higher efficiency gains when predictions are successful
- More volatile but potentially larger performance improvements

The optimal choice depends on:

Predictability of your content: More predictable content benefits from higher search lengths
Risk tolerance: Whether you prefer consistent small gains or potentially larger but less frequent gains
Cost considerations: Rejected prediction tokens are billed, so acceptance rate impacts cost efficiency

Example Usage

Below are examples of using predicted-outputs in Azure OpenAI with different predictor search length configurations:

from openai import AzureOpenAI

# Replace the placeholders with your own resource values.
client = AzureOpenAI(
    api_key="your_api_key",
    azure_endpoint="your_api_base",
    api_version="2025-01-01-preview",
)

messages = [{"role": "user", "content": "Complete this code: def hello_world(): print("}]
# The prediction parameter takes a content object rather than a bare string.
prediction = {"type": "content", "content": 'Hello, World!")'}

# Example 1: Maximum speed optimization (minimal search length)
response_fast = client.chat.completions.create(
    model="your_deployment_name",
    messages=messages,
    prediction=prediction,
    extra_headers={
        "x-ms-oai-ev3-predictor_search_length": "1"  # Fastest decoding
    },
)

# Example 2: Balanced approach
response_balanced = client.chat.completions.create(
    model="your_deployment_name",
    messages=messages,
    prediction=prediction,
    extra_headers={
        "x-ms-oai-ev3-predictor_search_length": "8"  # Balance between speed and search thoroughness
    },
)

# Example 3: Default behavior (the header can be omitted)
response_default = client.chat.completions.create(
    model="your_deployment_name",
    messages=messages,
    prediction=prediction,
    extra_headers={
        "x-ms-oai-ev3-predictor_search_length": "32"  # Default value
    },
)

Performance Comparison Table:

Search Length	Speed	Prediction Acceptance Rate	Tokens per Acceptance	Use Case
1	Fastest	Higher	Fewer	Real-time applications, chat
2-4	Very Fast	Good	Moderate	Interactive applications
8-16	Fast	Moderate	More	Balanced performance
32 (default)	Standard	Lower	Most	Maximum throughput when predictions are accurate

Important Note: The relationship between search length and efficiency is nuanced:

Higher Search Length (32): Predicts more tokens at once, leading to lower acceptance rates but higher throughput when predictions are correct
Lower Search Length (1-4): Predicts fewer tokens, leading to higher acceptance rates but smaller efficiency gains per successful prediction
Final Output Quality: Remains consistent across all configurations since the same underlying model performs final validation
Trade-off: Balance between prediction accuracy (higher with smaller search lengths) and potential efficiency gains (higher with larger search lengths when successful)

Best Practices

Structure prompts to maximize cache hits by placing stable content at the beginning
Use high-confidence predictions to avoid unnecessary costs from rejected predictions
Monitor performance to track cache hit rates and prediction acceptance rates
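
Monitoring both rates can be as simple as accumulating the usage object from each response. The tracker below is a hypothetical sketch; it assumes cached input tokens are reported under `prompt_tokens_details.cached_tokens`, a field not shown elsewhere in this document.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UsageTracker:
    """Accumulate cache and prediction statistics across responses."""
    prompt_tokens: int = 0
    cached_tokens: int = 0
    accepted: int = 0
    rejected: int = 0

    def record(self, usage: Dict[str, Any]) -> None:
        prompt_details = usage.get("prompt_tokens_details") or {}
        completion_details = usage.get("completion_tokens_details") or {}
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.cached_tokens += prompt_details.get("cached_tokens", 0)  # assumed field
        self.accepted += completion_details.get("accepted_prediction_tokens", 0)
        self.rejected += completion_details.get("rejected_prediction_tokens", 0)

    @property
    def cache_hit_rate(self) -> float:
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

    @property
    def acceptance_rate(self) -> float:
        total = self.accepted + self.rejected
        return self.accepted / total if total else 0.0


tracker = UsageTracker()
tracker.record({
    "prompt_tokens": 1000,
    "prompt_tokens_details": {"cached_tokens": 0},
    "completion_tokens_details": {"accepted_prediction_tokens": 6, "rejected_prediction_tokens": 4},
})
tracker.record({
    "prompt_tokens": 1000,
    "prompt_tokens_details": {"cached_tokens": 1000},
    "completion_tokens_details": {"accepted_prediction_tokens": 9, "rejected_prediction_tokens": 1},
})
print(tracker.cache_hit_rate, tracker.acceptance_rate)
```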

Performance Expectations

Prompt Caching: 20-50% latency reduction for repeated prompts
Predicted-Outputs: 30-70% latency reduction for good predictions
Combined Usage: Up to 80% total latency reduction in optimal scenarios

5. Alternative Approaches: Speculative Decoding

What is Speculative Decoding?

Speculative decoding is a technique to accelerate inference in autoregressive models (e.g., transformers) without altering output quality. Introduced in the paper "Fast Inference from Transformers via Speculative Decoding" (arXiv:2211.17192), it uses a smaller, faster assistant model to generate candidate tokens, which are verified by a larger main model in a single forward pass.
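
The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant, not the rejection-sampling scheme from the paper: candidates are accepted only while they equal the main model's own next token, which is why the output matches plain greedy decoding. The two "models" are hypothetical stand-ins, and the per-candidate calls to the target only mimic what a real implementation computes in one forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[str]], str]  # maps a token prefix to the next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[str],
                       max_new_tokens: int, k: int = 4) -> List[str]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The draft model proposes up to k candidate tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each candidate against its own next token.
        for candidate in proposal:
            expected = target(tokens)
            if candidate != expected:
                # First mismatch: keep the target's token and drop the rest.
                tokens.append(expected)
                break
            tokens.append(candidate)
    return tokens[len(prompt):]


# Toy stand-ins for real models: the target recites a fixed sentence,
# and the draft agrees except on one word.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_model(prefix: Sequence[str]) -> str:
    return SENTENCE[len(prefix) % len(SENTENCE)]


def draft_model(prefix: Sequence[str]) -> str:
    token = target_model(prefix)
    return "cat" if token == "fox" else token


output = speculative_decode(target_model, draft_model, [], max_new_tokens=9)
print(" ".join(output))  # the quick brown fox jumps over the lazy dog
```

The draft's wrong guess ("cat") is replaced by the target's token, so the speedup depends on how often the draft is right while the result does not.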

Relationship to Predicted-Outputs

Speculative decoding shares conceptual similarities with predicted-outputs:

Both techniques leverage "predictions" about likely output content
Both aim to reduce the computational cost of generation
However, speculative decoding uses a separate model for predictions, while predicted-outputs uses user-provided predictions

Key Requirements

Assistant Model: Must be at least 3x faster than the main model and predict 70–80% of "easy" tokens correctly.
Tokenizer Compatibility: Both models must share the same vocabulary and tokenizer.
Performance: Best at batch size 1, with diminishing returns above batch size 4.

6. Comprehensive Performance Optimization Strategy

Multiple Optimization Techniques Comparison

Feature	Prompt Caching	Predicted-Outputs	Speculative Decoding
Platform	Cloud providers (Azure, OpenAI)	Azure OpenAI API	Open-source (e.g., Transformers)
Target	Input processing	Output generation	Output generation
Automation	Automatic	Manual prediction	Automatic
Cost	Reduced input costs	Output token billing	Free (local compute)
Latency Reduction	20-50%	30-70%	2x-3x speedup

Decision Framework

Choose Prompt Caching: For repeated prompts and cost optimization
Choose Predicted-Outputs: When you can anticipate output content
Choose Speculative Decoding: For local deployments and open-source models
Combine Techniques: When maximum performance justifies complexity

7. Conclusion

This project highlights the evolution of LLM optimization techniques, from prompt caching to predicted-outputs and speculative decoding. These complementary approaches can be combined to achieve significant performance improvements:

Prompt Caching provides transparent optimization for repeated input patterns
Predicted-Outputs offers targeted acceleration when output content is predictable
Speculative Decoding enables open-source alternatives with consistent speedups

The combination of these techniques represents a comprehensive approach to LLM optimization, enabling developers to choose the right strategy based on their specific use cases, infrastructure, and performance requirements. Future work should explore automated prediction generation, hybrid caching strategies, and intelligent technique selection based on request patterns.
