Blog post on KServe + llm-d + vLLM from Red Hat and Tesla #192
**Open:** terrytangyuan wants to merge 8 commits into `llm-d:main` from `terrytangyuan:blog-kserve-llm-d`
+137 −21
**8 commits:**

1. `cf00c3d` Blog post on KServe + llm-d + vLLM from Red Hat and Tesla (terrytangyuan)
2. `06ad929` Update blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-a… (terrytangyuan)
3. `ed0d1f0` Update blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-a… (terrytangyuan)
4. `0c799ab` Update blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-a… (terrytangyuan)
5. `c0d801a` Convert author images to WebP, download architecture diagram, and add… (petecheslock)
6. `d0b4d32` Add LinkedIn and GitHub socials to all authors in authors.yml (petecheslock)
7. `30c87b3` Clean up authors.yml: remove redundant url fields and add missing Lin… (petecheslock)
8. `4c2fed0` Update sai github link (terrytangyuan)
`...6-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md` (70 additions, 0 deletions)
---
title: "Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story"
description: "How Red Hat and Tesla collaborated to overcome significant scaling and operational challenges in LLM deployment, migrating from a simple vLLM deployment to a robust MLOps platform built on KServe, llm-d's intelligent routing, and vLLM, with deep customization and prefix-cache aware routing to maximize GPU utilization."
slug: production-grade-ai-inference-kserve-red-hat-and-tesla-success-story
date: 2026-03-06T09:00

authors:
  - terrytangyuan
  - cabrinha
  - robshaw
  - saikrishna

tags: [blog]
---
# Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story

## The Problem with "Simple" LLM Deployments
Everyone is racing to run Large Language Models (LLMs) in the cloud, on-prem, and even on edge devices. The real challenge, however, isn't the first deployment; it's scaling, managing, and maintaining hundreds of LLMs efficiently. We initially approached this challenge with a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet.

<!-- truncate -->
This approach quickly introduced severe operational bottlenecks:
* **Storage Drag:** Models like Llama 3 can easily reach hundreds of gigabytes in size. Relying on sluggish network storage (NFS) for these massive safetensors files was a non-starter.
* **Infrastructure Lock-in:** Switching to local LVM persistent volumes solved the speed problem but created a rigid node-to-pod affinity. A single hardware failure meant manual intervention to delete the Persistent Volume Claim (PVC) and reschedule the pod, an unacceptable burden for day-2 operations.
* **Naive Load Balancing:** Beyond the looming retirement of the NGINX Ingress Controller, a simple round-robin load-balancing strategy is fundamentally inefficient for LLMs. It fails to utilize the critical **KV-cache** on the GPU, a core feature of vLLM that significantly boosts throughput. In a world where GPU costs are paramount, squeezing efficiency out of every core is non-negotiable.
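To put the storage point in perspective, a rough back-of-envelope sketch helps show why streaming weights over NFS was a non-starter. All figures here are illustrative assumptions (a hypothetical 140 GB model, ~0.125 GB/s for a 1 Gb/s NFS mount, ~3 GB/s for local NVMe), not measurements from our environment:

```python
# Back-of-envelope estimate (illustrative numbers, not benchmarks):
# how long it takes to stream model weights from storage into a pod.

def load_time_minutes(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Minutes to read `model_size_gb` at a sustained `bandwidth_gb_per_s`."""
    return model_size_gb / bandwidth_gb_per_s / 60

MODEL_SIZE_GB = 140  # hypothetical large-model checkpoint size

nfs_minutes = load_time_minutes(MODEL_SIZE_GB, 0.125)  # ~1 Gb/s NFS mount
nvme_minutes = load_time_minutes(MODEL_SIZE_GB, 3.0)   # local NVMe

print(f"NFS: ~{nfs_minutes:.0f} min, NVMe: ~{nvme_minutes:.1f} min")
# → NFS: ~19 min, NVMe: ~0.8 min
```

At these assumed rates, every pod restart on NFS costs tens of minutes of cold-start time, which is exactly the drag that pushed us toward local volumes in the first place.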

## The Search for a Superior Operator

We recognized that running LLMs at scale demanded a purpose-built solution: a Kubernetes Operator designed for the intricacies of AI/ML. Some existing projects are clean and functional as a proof of concept, but they lacked the necessary extensibility; customizing the runtime specification beyond the exposed Custom Resources was a requirement we couldn't compromise on. Other tools offered robustness but were overly opinionated, catering heavily to a specific prefill/decode setup, and their strict API contracts didn't align with our need for flexible, customized deployment patterns.
## The Winning Combination: KServe + llm-d + vLLM

![KServe Architecture](./kserve-architecture.webp)

Our journey led us back to the most flexible and powerful solution: [**llm-d**](https://github.com/llm-d/llm-d), powered by [**KServe**](https://github.com/kserve/kserve) and its cutting-edge **Inference Gateway Extension**.
This combination solved every scaling and operational challenge we faced by delivering:
1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects expose the standard Kubernetes API, allowing us to override the spec precisely where needed. This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly rolling out flag changes.
2. **Intelligent Routing and Efficiency:** By leveraging [**Envoy**](https://www.envoyproxy.io/), [**Envoy AI Gateway**](https://aigateway.envoyproxy.io/), and the [**Gateway API Inference Extension**](https://github.com/kubernetes-sigs/gateway-api-inference-extension), we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency.
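The core idea behind prefix-cache aware routing can be sketched in a few lines. This is a minimal illustration with a hypothetical scorer, not the actual algorithm used by the Gateway API Inference Extension: each request is steered toward the replica whose cached prompts share the longest prefix with the incoming prompt, so that replica's KV-cache can be reused.

```python
# Minimal sketch of prefix-cache aware routing (hypothetical scorer,
# illustrative replica names): pick the replica whose cached prompts
# share the longest common prefix with the incoming prompt.
from os.path import commonprefix

def pick_replica(prompt: str, replica_caches: dict[str, list[str]]) -> str:
    """Return the replica name with the best prefix overlap for `prompt`."""
    def best_overlap(cached: list[str]) -> int:
        # Length of the longest shared prefix with any cached prompt.
        return max((len(commonprefix([prompt, c])) for c in cached), default=0)
    return max(replica_caches, key=lambda r: best_overlap(replica_caches[r]))

caches = {
    "vllm-0": ["You are a helpful assistant. Summarize:"],
    "vllm-1": ["Translate to French:"],
}
# A prompt sharing the system-prompt prefix lands on vllm-0,
# reusing the KV-cache already built there.
print(pick_replica("You are a helpful assistant. Summarize: quarterly report", caches))
# → vllm-0
```

A round-robin balancer would instead spread prompts with identical system-prompt prefixes across replicas, forcing each one to recompute the same KV-cache entries.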

TODO(saikrishna): charts on the before --> after with prefix-awareness (pending approval) along with some text/descriptions

## Collaboration for Successful Adoption

This migration from a fragile StatefulSet to a robust, scalable MLOps platform was not a solitary effort. It was a direct result of the powerful collaboration between **Red Hat** and **Tesla**. By combining Red Hat's deep expertise in enterprise-grade Kubernetes and open-source infrastructure with Tesla's demanding requirements for high-performance, large-scale AI serving, we successfully integrated and validated the KServe and llm-d solution. This partnership demonstrates how open standards and purpose-built operators are the key to unlocking the true potential of LLMs in production environments.
This collaboration has helped identify issues and spark ideas for new features in KServe ([#4901](https://github.com/kserve/kserve/issues/4901), [#4900](https://github.com/kserve/kserve/issues/4900), [#4898](https://github.com/kserve/kserve/issues/4898), [#4899](https://github.com/kserve/kserve/issues/4899)). In addition, LLMInferenceService's storageInitializer field has been [made optional](https://github.com/kserve/kserve/pull/4970) to enable the use of RunAI Model Streamer.
The combination of **KServe's** industry-leading standard for model serving, **llm-d's** intelligent routing capabilities, and **vLLM's** high-throughput inference engine provides the best foundation for managing the next generation of AI workloads at enterprise scale.

## Get Involved with llm-d

The work described here is just one example of what becomes possible when a community of engineers tackles hard problems together in the open. If you're running LLMs at scale and wrestling with the same challenges (storage, routing, efficiency, day-2 operations), we'd love to have you involved.
* **Explore the code** → Browse our [GitHub organization](https://github.com/llm-d) and dig into the projects powering this stack
* **Join our Slack** → [Get your invite](/slack) and connect directly with maintainers and contributors from Red Hat, Tesla, and beyond
* **Attend community calls** → All meetings are open! Add our [public calendar](https://red.ht/llm-d-public-calendar) (Wednesdays 12:30pm ET) and join the conversation
* **Follow project updates** → Stay current on [Twitter/X](https://twitter.com/_llm_d_), [Bluesky](https://bsky.app/profile/llm-d.ai), and [LinkedIn](https://www.linkedin.com/company/llm-d)
* **Watch demos and recordings** → Subscribe to the [llm-d YouTube channel](https://www.youtube.com/@llm-d-project) for community call recordings and feature walkthroughs
* **Read the docs** → Visit our [community page](/docs/community) to find SIGs, contribution guides, and upcoming events

## Acknowledgement

We'd like to thank everyone from the community who has contributed to the successful adoption of KServe, llm-d, and vLLM in Tesla's production environment. In particular, the following people from the Red Hat and Tesla teams helped throughout the process (in alphabetical order):

* **Red Hat team**: Andres Llausas, Bartosz Majsak, Greg Pereira, Pierangelo Di Pilato, Vivek Karunai Kiri Ragavan, Robert Shaw, and Yuan Tang
* **Tesla team**: Scott Cabrinha and Sai Krishna
Binary file added (+26.4 KB): `...de-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp`