Blog post on KServe + llm-d + vLLM from Red Hat and Tesla #192
**Open:** terrytangyuan wants to merge 8 commits into `llm-d:main` from `terrytangyuan:blog-kserve-llm-d`
+137 −21
**8 commits:**

1. `cf00c3d` Blog post on KServe + llm-d + vLLM from Red Hat and Tesla (terrytangyuan)
2. `06ad929` Update blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-a… (terrytangyuan)
3. `ed0d1f0` Update blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-a… (terrytangyuan)
4. `0c799ab` Update blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-a… (terrytangyuan)
5. `c0d801a` Convert author images to WebP, download architecture diagram, and add… (petecheslock)
6. `d0b4d32` Add LinkedIn and GitHub socials to all authors in authors.yml (petecheslock)
7. `30c87b3` Clean up authors.yml: remove redundant url fields and add missing Lin… (petecheslock)
8. `4c2fed0` Update sai github link (terrytangyuan)
`...6-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md` (70 additions, 0 deletions)
---
title: "Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story"
description: "How Red Hat and Tesla collaborated to overcome significant scaling and operational challenges in LLM deployment, migrating from a simple vLLM deployment to a robust MLOps platform built on KServe, llm-d's intelligent routing, and vLLM, with deep customization and prefix-cache aware routing to maximize GPU utilization."
slug: production-grade-ai-inference-kserve-red-hat-and-tesla-success-story
date: 2026-03-06T09:00

authors:
  - terrytangyuan
  - cabrinha
  - robshaw
  - saikrishna

tags: [blog]
---
# Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story

## The Problem with "Simple" LLM Deployments
Everyone is racing to run Large Language Models (LLMs) in the cloud, on-prem, and even on edge devices. The real challenge, however, isn't the first deployment; it's scaling, managing, and maintaining hundreds of LLMs efficiently. We initially approached this challenge with a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet.

<!-- truncate -->
This approach quickly introduced severe operational bottlenecks:
* **Storage Drag:** Models like Llama 3 can easily reach hundreds of gigabytes in size. Relying on sluggish network storage (NFS) for these massive safetensors files was a non-starter.
* **Infrastructure Lock-in:** Switching to local LVM persistent volumes solved the speed problem but created a rigid node-to-pod affinity. A single hardware failure meant manual intervention to delete the Persistent Volume Claim (PVC) and reschedule the pod, an unacceptable burden for day-2 operations.
* **Naive Load Balancing:** Beyond the looming retirement of the NGINX Ingress Controller, a simple round-robin load-balancing strategy is fundamentally inefficient for LLMs. It fails to utilize the critical **KV-cache** on the GPU, a core feature of vLLM that significantly boosts throughput. In a world where GPU costs are paramount, squeezing efficiency out of every core is non-negotiable.
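To put the storage point in perspective, a rough back-of-envelope sketch helps show why streaming weights over NFS was a non-starter. All figures here are illustrative assumptions (a hypothetical 140 GB model, ~0.125 GB/s for a 1 Gb/s NFS mount, ~3 GB/s for local NVMe), not measurements from our environment:

```python
# Back-of-envelope estimate (illustrative numbers, not benchmarks):
# how long it takes to stream model weights from storage into a pod.

def load_time_minutes(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Minutes to read `model_size_gb` at a sustained `bandwidth_gb_per_s`."""
    return model_size_gb / bandwidth_gb_per_s / 60

MODEL_SIZE_GB = 140  # hypothetical large-model checkpoint size

nfs_minutes = load_time_minutes(MODEL_SIZE_GB, 0.125)  # ~1 Gb/s NFS mount
nvme_minutes = load_time_minutes(MODEL_SIZE_GB, 3.0)   # local NVMe

print(f"NFS: ~{nfs_minutes:.0f} min, NVMe: ~{nvme_minutes:.1f} min")
# → NFS: ~19 min, NVMe: ~0.8 min
```

At these assumed rates, every pod restart on NFS costs tens of minutes of cold-start time, which is exactly the drag that pushed us toward local volumes in the first place.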

## The Search for a Superior Operator

We recognized that running LLMs at scale demanded a purpose-built solution: a Kubernetes Operator designed for the intricacies of AI/ML. Some existing projects are clean and functional as a proof of concept, but they lacked the necessary extensibility; customizing the runtime specification beyond the exposed Custom Resources was a requirement we couldn't compromise on. Other tools offered robustness but were overly opinionated, catering heavily to a specific prefill/decode setup, and their strict API contracts didn't align with our need for flexible, customized deployment patterns.
## The Winning Combination: KServe + llm-d + vLLM

![KServe Architecture](./kserve-architecture.webp)

Our journey led us back to the most flexible and powerful solution: [**llm-d**](https://github.com/llm-d/llm-d), powered by [**KServe**](https://github.com/kserve/kserve) and its cutting-edge **Inference Gateway Extension**.
This combination solved every scaling and operational challenge we faced by delivering:
1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects expose the standard Kubernetes API, allowing us to override the spec precisely where needed. This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly rolling out flag changes.
2. **Intelligent Routing and Efficiency:** By leveraging [**Envoy**](https://www.envoyproxy.io/), [**Envoy AI Gateway**](https://aigateway.envoyproxy.io/), and the [**Gateway API Inference Extension**](https://github.com/kubernetes-sigs/gateway-api-inference-extension), we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency.
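The core idea behind prefix-cache aware routing can be sketched in a few lines. This is a minimal illustration with a hypothetical scorer, not the actual algorithm used by the Gateway API Inference Extension: each request is steered toward the replica whose cached prompts share the longest prefix with the incoming prompt, so that replica's KV-cache can be reused.

```python
# Minimal sketch of prefix-cache aware routing (hypothetical scorer,
# illustrative replica names): pick the replica whose cached prompts
# share the longest common prefix with the incoming prompt.
from os.path import commonprefix

def pick_replica(prompt: str, replica_caches: dict[str, list[str]]) -> str:
    """Return the replica name with the best prefix overlap for `prompt`."""
    def best_overlap(cached: list[str]) -> int:
        # Length of the longest shared prefix with any cached prompt.
        return max((len(commonprefix([prompt, c])) for c in cached), default=0)
    return max(replica_caches, key=lambda r: best_overlap(replica_caches[r]))

caches = {
    "vllm-0": ["You are a helpful assistant. Summarize:"],
    "vllm-1": ["Translate to French:"],
}
# A prompt sharing the system-prompt prefix lands on vllm-0,
# reusing the KV-cache already built there.
print(pick_replica("You are a helpful assistant. Summarize: quarterly report", caches))
# → vllm-0
```

A round-robin balancer would instead spread prompts with identical system-prompt prefixes across replicas, forcing each one to recompute the same KV-cache entries.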

TODO(saikrishna): charts on the before --> after with prefix-awareness (pending approval) along with some text/descriptions

## Collaboration for Successful Adoption

This migration from a fragile StatefulSet to a robust, scalable MLOps platform was not a solitary effort. It was a direct result of the powerful collaboration between **Red Hat** and **Tesla**. By combining Red Hat's deep expertise in enterprise-grade Kubernetes and open-source infrastructure with Tesla's demanding requirements for high-performance, large-scale AI serving, we successfully integrated and validated the KServe and llm-d solution. This partnership demonstrates how open standards and purpose-built operators are the key to unlocking the true potential of LLMs in production environments.
This collaboration has helped identify issues and spark ideas for new features in KServe ([#4901](https://github.com/kserve/kserve/issues/4901), [#4900](https://github.com/kserve/kserve/issues/4900), [#4898](https://github.com/kserve/kserve/issues/4898), [#4899](https://github.com/kserve/kserve/issues/4899)). In addition, LLMInferenceService's storageInitializer field has been [made optional](https://github.com/kserve/kserve/pull/4970) to enable the use of RunAI Model Streamer.
The combination of **KServe's** industry-leading standard for model serving, **llm-d's** intelligent routing capabilities, and **vLLM's** high-throughput inference engine provides the best foundation for managing the next generation of AI workloads at enterprise scale.

## Get Involved with llm-d

The work described here is just one example of what becomes possible when a community of engineers tackles hard problems together in the open. If you're running LLMs at scale and wrestling with the same challenges (storage, routing, efficiency, day-2 operations), we'd love to have you involved.
* **Explore the code** → Browse our [GitHub organization](https://github.com/llm-d) and dig into the projects powering this stack
* **Join our Slack** → [Get your invite](/slack) and connect directly with maintainers and contributors from Red Hat, Tesla, and beyond
* **Attend community calls** → All meetings are open! Add our [public calendar](https://red.ht/llm-d-public-calendar) (Wednesdays 12:30pm ET) and join the conversation
* **Follow project updates** → Stay current on [Twitter/X](https://twitter.com/_llm_d_), [Bluesky](https://bsky.app/profile/llm-d.ai), and [LinkedIn](https://www.linkedin.com/company/llm-d)
* **Watch demos and recordings** → Subscribe to the [llm-d YouTube channel](https://www.youtube.com/@llm-d-project) for community call recordings and feature walkthroughs
* **Read the docs** → Visit our [community page](/docs/community) to find SIGs, contribution guides, and upcoming events

## Acknowledgement

We'd like to thank everyone from the community who has contributed to the successful adoption of KServe, llm-d, and vLLM in Tesla's production environment. In particular, the following people from the Red Hat and Tesla teams helped throughout the process (in alphabetical order):

* **Red Hat team**: Andres Llausas, Bartosz Majsak, Greg Pereira, Pierangelo Di Pilato, Vivek Karunai Kiri Ragavan, Robert Shaw, and Yuan Tang
* **Tesla team**: Scott Cabrinha and Sai Krishna
Binary file added (+26.4 KB): `...de-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp`