The transformative power of Large Language Models (LLMs) is rooted in their ability to generate coherent, contextually relevant text. However, this capability comes with a significant operational challenge: inference latency. The fundamental architecture of most LLMs is autoregressive, meaning they generate text one token at a time, where each new token depends on all previously generated ones. This inherently sequential process creates a performance bottleneck, particularly for real-time, interactive applications where low latency is critical for user experience. As models grow larger and context windows expand, the time required to generate a complete response can become prohibitively long, limiting the practical deployment of these powerful tools.
To mitigate inference latency, the LLM community has developed several optimization techniques across both model architecture and serving infrastructure. Broadly speaking, these techniques fall into two categories: model-level optimizations and inference-level strategies.
Model-level optimizations include: • Model Distillation: Compressing large models into smaller, faster student models with minimal performance loss. This is widely used in production systems to serve lightweight variants of powerful base models. • Quantization: Reducing the numerical precision of model weights (e.g., from FP32 to INT8), significantly improving memory efficiency and inference throughput. • FlashAttention: An efficient attention mechanism that reduces memory overhead and speeds up computation, particularly effective on modern GPU hardware. • kTransformers / TensorRT-LLM: Frameworks like kTransformers and NVIDIA’s TensorRT-LLM offer highly optimized kernel-level acceleration for transformer architectures, often yielding 2–4× speedups with minimal quality degradation.
Inference-level strategies focus on optimizing the decoding and serving process, and are especially relevant in hosted environments like Azure OpenAI: • Prompt Caching: In Azure OpenAI, prompt caching is automatically enabled for eligible requests. When identical prompt segments are reused (e.g., system prompts or long context history), the service avoids recomputation, thereby reducing end-to-end latency and token cost. • Predicted-Outputs (Speculative Decoding): Recently introduced in Azure OpenAI, this technique enables faster responses by using a smaller model to precompute a batch of candidate tokens. The main model then validates or rejects these candidates in bulk. If the predictions match, the tokens are accepted immediately, reducing token-by-token wait times. This is especially beneficial in chat and agent workflows where early turns are often predictable.
When strategically combined, these techniques can yield 3–10× latency reduction, depending on model size, input length, and deployment stack. Among them, prompt caching and Predicted-Outputs stand out as turnkey features available directly within the Azure OpenAI API, requiring no model retraining or infrastructure changes—making them particularly appealing for enterprise adoption.
1_introduction_to_prompt_cache.md: Explains prompt caching technology and its benefits.2_introduction_to_predicted_outputs.md: Explains Azure OpenAI's predicted-outputs feature.3_prompt_cache_vs_predicted_outputs.md: Compares prompt caching and predicted-outputs.4_combining_optimizations.md: Discusses how to combine both techniques for maximum efficiency.5_conclusion.md: Summarizes key points and future directions.references.md: Lists all references used.code_examples/: Contains Python scripts demonstrating both techniques.
Prompt caching is an optimization technique that stores frequently used prompt segments in memory to avoid recomputing them in subsequent requests. When a cached prompt is reused, the model can skip the expensive computation of processing those tokens again, leading to significant latency and cost reductions.
Prompt caching is particularly effective for:
- Conversational AI applications with repeated system prompts
- Code generation with boilerplate code and project context
- Document processing with template structures
- Multi-turn conversations maintaining context
Azure OpenAI's predicted-outputs feature, introduced in API version 2025-01-01-preview, reduces latency in chat completions by leveraging pre-known text provided via the prediction parameter. This allows the model to focus on generating new or modified content, making it ideal for:
-
Regenerating or refining documents (e.g., legal contracts, technical documents).
reference extract relevant passages usecase data:
-
Auto-completion in IDEs for boilerplate code.
-
Completing templates (e.g., personalized emails, reports).
-
Dialog turns in chatbots (e.g., customer service).
| Model | Version |
|---|---|
| gpt-4o-mini | 2024-07-18 |
| gpt-4o | 2024-08-06 |
| gpt-4o | 2024-11-20 |
| gpt-4.1 | 2025-04-14 |
| gpt-4.1-nano | 2025-04-14 |
| gpt-4.1-mini | 2025-04-14 |
- Reduced Latency: Focuses on new/modified sections, speeding up responses.
- Context Preservation: Maintains tone, style, and content coherence.
- Cost: Rejected prediction tokens are billed at completion token rates.
- Text-Only: Supports only text modalities.
- Unsupported Parameters: Does not support
n> 1,logprobs,presence_penalty> 0,frequency_penalty> 0,audio,max_completion_tokens, or tools/function calling. - Regional Availability: Unavailable in South East Asia.
Understanding the billing mechanism for predicted-outputs is crucial for cost optimization. Here's a detailed example based on Azure OpenAI documentation:
Example API Response Usage Data:
"usage": {
"completion_tokens": 77,
"prompt_tokens": 124,
"total_tokens": 201,
"completion_tokens_details": {
"accepted_prediction_tokens": 6,
"audio_tokens": 0,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 4
}
}Billing Calculation:
- Completion tokens: 77 (includes 6 accepted prediction tokens)
- Rejected prediction tokens: 4 (billed separately at completion token rates)
- Total billed tokens: 77 + 4 = 81 tokens
Key Points:
- No cost deduction for accepted prediction tokens (they're included in completion_tokens)
- Additional charges apply for rejected prediction tokens
- Acceptance ratio is critical: Higher acceptance rates = better cost efficiency
- Performance benefits: Despite no cost reduction for accepted tokens, significant latency improvements (up to 30% throughput improvement) result from speculative processing
Cost-Benefit Analysis:
- Latency: Substantial improvements due to speculative decoding
- Throughput: ~30% improvement leading to better GPU efficiency
- Cost: Evaluate acceptance rate (6 accepted vs 4 rejected = 60% acceptance) to determine ROI
Recommendation: Monitor your acceptance rates closely. High acceptance rates (>70%) typically justify the additional costs through performance gains.
Azure OpenAI provides an additional parameter x-ms-oai-ev3-predictor_search_length to control the predictor search length for predicted-outputs optimization. This parameter is used in the underlying Speculative Decoding mechanism to define the number of tokens in the sampling space to search for reconvergence when transitioning between generative and speculation modes.
Supported Values:
1,2,4,8,16, and32(default)
main: {x-ms-oai-ev3-predictor_search_length:1}
main: {x-ms-oai-ev3-predictor_search_length:2}
main: {x-ms-oai-ev3-predictor_search_length:4}
main: {x-ms-oai-ev3-predictor_search_length:8}
main: {x-ms-oai-ev3-predictor_search_length:16}
main: {x-ms-oai-ev3-predictor_search_length:32}

main: without x-ms-oai-ev3-predictor_search_length header parameter

Technical Details:
- Default Value: 32 tokens
- Purpose: Controls the search window for token reconvergence in speculative decoding
- Performance Impact: Lower values (even as low as 1) can provide significant gains in decoding time per token
- Trade-off: Balance between search thoroughness and decoding speed
Usage Guidelines:
- Start with lower values (1-4) for maximum speed optimization
- Use higher values (16-32) when prediction accuracy is more important than speed
- Test different values to find the optimal balance for your specific use case
How it Works: The parameter determines how many tokens the model examines when deciding whether to reconverge from generative mode back to speculation mode during the predicted-outputs process. A smaller search length means faster decisions but potentially less optimal reconvergence points.
Important Technical Clarification:
It's crucial to understand that the x-ms-oai-ev3-predictor_search_length parameter does not affect the final output accuracy or quality. Since the same underlying large language model (e.g., gpt-4.1-mini) performs the final token validation and generation regardless of the search length setting, the content accuracy remains consistent across all configurations.
What actually varies is the risk-reward trade-off:
- Lower Search Length (1-4):
- Higher probability of prediction acceptance (fewer tokens to get wrong)
- Lower efficiency gains per successful prediction
- More consistent but smaller performance improvements
- Higher Search Length (16-32):
- Lower probability of prediction acceptance (more tokens that could be wrong)
- Higher efficiency gains when predictions are successful
- More volatile but potentially larger performance improvements
The optimal choice depends on:
- Predictability of your content: More predictable content benefits from higher search lengths
- Risk tolerance: Whether you prefer consistent small gains or potentially larger but less frequent gains
- Cost considerations: Rejected prediction tokens are billed, so acceptance rate impacts cost efficiency
Below are examples of using predicted-outputs in Azure OpenAI with different predictor search length configurations:
import openai
openai.api_key = "your_api_key"
openai.api_base = "your_api_base"
# Example 1: Maximum speed optimization (minimal search length)
response_fast = openai.ChatCompletion.create(
engine="your_deployment_name",
messages=[{"role": "user", "content": "Complete this code: def hello_world(): print("}],
prediction="Hello, World!")",
headers={
"x-ms-oai-ev3-predictor_search_length": "1" # Fastest decoding
}
)
# Example 2: Balanced approach
response_balanced = openai.ChatCompletion.create(
engine="your_deployment_name",
messages=[{"role": "user", "content": "Complete this code: def hello_world(): print("}],
prediction="Hello, World!")",
headers={
"x-ms-oai-ev3-predictor_search_length": "8" # Balance between speed and accuracy
}
)
# Example 3: Default behavior (can be omitted)
response_default = openai.ChatCompletion.create(
engine="your_deployment_name",
messages=[{"role": "user", "content": "Complete this code: def hello_world(): print("}],
prediction="Hello, World!")",
headers={
"x-ms-oai-ev3-predictor_search_length": "32" # Default value
}
)Performance Comparison Table:
| Search Length | Speed | Prediction Acceptance Rate | Tokens per Acceptance | Use Case |
|---|---|---|---|---|
| 1 | Fastest | Higher | Fewer | Real-time applications, chat |
| 2-4 | Very Fast | Good | Moderate | Interactive applications |
| 8-16 | Fast | Moderate | More | Balanced performance |
| 32 (default) | Standard | Lower | Most | Maximum throughput when predictions are accurate |
Important Note: The relationship between search length and efficiency is nuanced:
- Higher Search Length (32): Predicts more tokens at once, leading to lower acceptance rates but higher throughput when predictions are correct
- Lower Search Length (1-4): Predicts fewer tokens, leading to higher acceptance rates but smaller efficiency gains per successful prediction
- Final Output Quality: Remains consistent across all configurations since the same underlying model performs final validation
- Trade-off: Balance between prediction accuracy (higher with smaller search lengths) and potential efficiency gains (higher with larger search lengths when successful)
- Structure prompts to maximize cache hits by placing stable content at the beginning
- Use high-confidence predictions to avoid unnecessary costs from rejected predictions
- Monitor performance to track cache hit rates and prediction acceptance rates
- Prompt Caching: 20-50% latency reduction for repeated prompts
- Predicted-Outputs: 30-70% latency reduction for good predictions
- Combined Usage: Up to 80% total latency reduction in optimal scenarios
Speculative decoding is a technique to accelerate inference in autoregressive models (e.g., transformers) without altering output quality. Introduced in the paper "Fast Inference from Transformers via Speculative Decoding" (arXiv:2211.17192), it uses a smaller, faster assistant model to generate candidate tokens, which are verified by a larger main model in a single forward pass.
Speculative decoding shares conceptual similarities with predicted-outputs:
- Both techniques leverage "predictions" about likely output content
- Both aim to reduce the computational cost of generation
- However, speculative decoding uses a separate model for predictions, while predicted-outputs uses user-provided predictions
- Assistant Model: Must be at least 3x faster than the main model and predict 70–80% of "easy" tokens correctly.
- Tokenizer Compatibility: Both models must share the same vocabulary and tokenizer.
- Performance: Best at batch size 1, with diminishing returns above batch size 4.
| Feature | Prompt Caching | Predicted-Outputs | Speculative Decoding |
|---|---|---|---|
| Platform | Cloud providers (Azure, OpenAI) | Azure OpenAI API | Open-source (e.g., Transformers) |
| Target | Input processing | Output generation | Output generation |
| Automation | Automatic | Manual prediction | Automatic |
| Cost | Reduced input costs | Output token billing | Free (local compute) |
| Latency Reduction | 20-50% | 30-70% | 2x-3x speedup |
- Choose Prompt Caching: For repeated prompts and cost optimization
- Choose Predicted-Outputs: When you can anticipate output content
- Choose Speculative Decoding: For local deployments and open-source models
- Combine Techniques: When maximum performance justifies complexity
- Knowledge Distillation: Training smaller models to mimic larger ones for faster inference.
- Model Pruning: Reducing model size by removing less important weights.
- Quantization: Using lower precision (e.g., INT8) for faster computation.
This project highlights the evolution of LLM optimization techniques, from prompt caching to predicted-outputs and speculative decoding. These complementary approaches can be combined to achieve significant performance improvements:
- Prompt Caching provides transparent optimization for repeated input patterns
- Predicted-Outputs offers targeted acceleration when output content is predictable
- Speculative Decoding enables open-source alternatives with consistent speedups
The combination of these techniques represents a comprehensive approach to LLM optimization, enabling developers to choose the right strategy based on their specific use cases, infrastructure, and performance requirements. Future work should explore automated prediction generation, hybrid caching strategies, and intelligent technique selection based on request patterns.
- Azure OpenAI Predicted-Outputs Documentation
- Azure OpenAI Prompt Caching Documentation
- Medium: Speed Up OpenAI API Responses
- Hugging Face: Whisper Speculative Decoding
- Hugging Face: Assisted Generation
- ArXiv: Fast Inference from Transformers via Speculative Decoding
- Google Research: Speculative Decoding





