diff --git a/blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md b/blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md new file mode 100644 index 0000000..14b5d9c --- /dev/null +++ b/blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md @@ -0,0 +1,70 @@ +--- +title: "Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story" +description: "How Red Hat and Tesla collaborated to overcome significant scaling and operational challenges in LLM deployment, migrating from a simple vLLM deployment to a robust MLOps platform built on KServe, llm-d, and vLLM that delivers deep customization and maximizes GPU utilization through prefix-cache aware routing." +slug: production-grade-ai-inference-kserve-red-hat-and-tesla-success-story +date: 2026-03-06T09:00 + +authors: + - terrytangyuan + - cabrinha + - robshaw + - saikrishna + +tags: [blog] +--- + +# Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story + +## The Problem with "Simple" LLM Deployments + +Everyone is racing to run Large Language Models (LLMs) in the cloud, on-prem, and even on edge devices. The real challenge, however, isn't the first deployment; it's scaling, managing, and maintaining hundreds of LLMs efficiently. We initially approached this challenge with a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet. + + +This approach quickly introduced severe operational bottlenecks: + +* **Storage Drag:** Models like Llama 3 can easily reach hundreds of gigabytes in size. Relying on sluggish network storage (NFS) for these massive safetensors files was a non-starter. +* **Infrastructure Lock-in:** Switching to local LVM persistent volumes solved the speed problem but created a rigid node-to-pod affinity. 
A single hardware failure meant manual intervention to delete the Persistent Volume Claim (PVC) and reschedule the pod, an unacceptable burden for day-2 operations. +* **Naive Load Balancing:** Beyond the looming retirement of the NGINX Ingress Controller, a simple round-robin load-balancing strategy is fundamentally inefficient for LLMs. It fails to utilize the critical **KV-cache** on the GPU, a core feature of vLLM that significantly boosts throughput. In a world where GPU costs are paramount, squeezing efficiency out of every core is non-negotiable. + +## The Search for a Superior Operator + +We recognized that running LLMs at scale demanded a purpose-built solution: a Kubernetes Operator designed for the intricacies of AI/ML. While some existing projects were clean and functional as proofs of concept, they lacked the necessary extensibility. Customizing the runtime specification beyond the exposed Custom Resources was a requirement we couldn't compromise on. Other tools offered robustness at the cost of complexity and were overly opinionated, catering heavily toward a specific prefill/decode setup. In addition, their strict API contracts didn't align with our need for flexible, customized deployment patterns. + +## The Winning Combination: KServe + llm-d + vLLM + +![kserve-architecture](/img/blogs/production-grade-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp) + +Our journey led us back to the most flexible and powerful solution: [**llm-d**](https://github.com/llm-d/llm-d), powered by [**KServe**](https://github.com/kserve/kserve) and its cutting-edge **Inference Gateway Extension**. + +This combination solved every scaling and operational challenge we faced by delivering: + +1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects expose the standard Kubernetes API, allowing us to override the spec precisely where needed. 
This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly implementing flag changes. +2. **Intelligent Routing and Efficiency:** By leveraging [**Envoy**](https://www.envoyproxy.io/), [**Envoy AI Gateway**](https://aigateway.envoyproxy.io/), and the [**Gateway API Inference Extension**](https://github.com/kubernetes-sigs/gateway-api-inference-extension), we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency. + +TODO(saikrishna): charts on the before --> after with prefix-awareness (pending approval) along with some text/descriptions + +## Collaboration for Successful Adoption + +This migration from a fragile StatefulSet to a robust, scalable MLOps platform was not a solitary effort. It was a direct result of the powerful collaboration between **Red Hat** and **Tesla**. By combining Red Hat’s deep expertise in enterprise-grade Kubernetes and open-source infrastructure with Tesla’s demanding requirements for high-performance, large-scale AI serving, we successfully integrated and validated the KServe and llm-d solution. This partnership demonstrates how open standards and purpose-built operators are the key to unlocking the true potential of LLMs in production environments. + +This collaboration has helped identify issues and sparked ideas for new features in KServe ([\#4901](https://github.com/kserve/kserve/issues/4901), [\#4900](https://github.com/kserve/kserve/issues/4900), [\#4898](https://github.com/kserve/kserve/issues/4898), [\#4899](https://github.com/kserve/kserve/issues/4899)). In addition, LLMInferenceService’s storageInitializer field has been [changed to optional](https://github.com/kserve/kserve/pull/4970) to enable the use of the RunAI Model Streamer, and we [added support for the latest version of GIE](https://github.com/kserve/kserve/pull/4886). 
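+To build intuition for why prefix-cache aware routing beats round-robin, here is a toy Python sketch. It is purely illustrative: the class names and the scoring heuristic are our own, not the actual llm-d or Gateway API Inference Extension implementation. Requests that share a long prompt prefix, such as a common system prompt, are steered to the replica that already holds that prefix in its cache:

```python
# Toy prefix-cache aware router (illustrative only; not the real
# llm-d / Gateway API Inference Extension implementation).

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.cached_prompts: list[str] = []  # stand-in for KV-cache contents
        self.load = 0


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(replicas: list[Replica], prompt: str) -> Replica:
    # Score each replica by the longest cached prefix it shares with the
    # incoming prompt; fall back to the least-loaded replica on a miss.
    best, best_score = None, 0
    for r in replicas:
        score = max(
            (common_prefix_len(p, prompt) for p in r.cached_prompts),
            default=0,
        )
        if score > best_score:
            best, best_score = r, score
    if best is None:
        best = min(replicas, key=lambda r: r.load)
    best.cached_prompts.append(prompt)
    best.load += 1
    return best


replicas = [Replica("vllm-0"), Replica("vllm-1")]
shared = "System: You are a helpful assistant.\nUser: "
a = route(replicas, shared + "Summarize this document.")
b = route(replicas, shared + "Translate this sentence.")
# Both prompts share the system-prompt prefix, so the second request
# lands on the same replica and can reuse its warm KV-cache.
print(a.name, b.name)
```

+In the real stack, this decision is made by the routing layer using actual KV-cache state from the serving engines rather than a string comparison, but the core idea is the same: cache affinity, not request count, drives placement.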
+ +The combination of **KServe's** industry-leading standard for model serving, **llm-d's** intelligent routing capabilities, and **vLLM's** high-throughput inference engine provides the best foundation for managing the next generation of AI workloads at enterprise scale. + +## Get Involved with llm-d + +The work described here is just one example of what becomes possible when a community of engineers tackles hard problems together in the open. If you're running LLMs at scale and wrestling with the same challenges — storage, routing, efficiency, day-2 operations — we'd love to have you involved. + +* **Explore the code** → Browse our [GitHub organization](https://github.com/llm-d) and dig into the projects powering this stack +* **Join our Slack** → [Get your invite](/slack) and connect directly with maintainers and contributors from Red Hat, Tesla, and beyond +* **Attend community calls** → All meetings are open! Add our [public calendar](https://red.ht/llm-d-public-calendar) (Wednesdays 12:30pm ET) and join the conversation +* **Follow project updates** → Stay current on [Twitter/X](https://twitter.com/_llm_d_), [Bluesky](https://bsky.app/profile/llm-d.ai), and [LinkedIn](https://www.linkedin.com/company/llm-d) +* **Watch demos and recordings** → Subscribe to the [llm-d YouTube channel](https://www.youtube.com/@llm-d-project) for community call recordings and feature walkthroughs +* **Read the docs** → Visit our [community page](/docs/community) to find SIGs, contribution guides, and upcoming events + +## Acknowledgement + +We’d like to thank everyone from the community who has contributed to the successful adoption of KServe, llm-d, and vLLM in Tesla's production environment. In particular, below is the list of people from the Red Hat and Tesla teams who helped throughout the process (in alphabetical order). 
* **Red Hat team**: Sergey Bekkerman, Nati Fridman, Killian Golds, Andres Llausas, Bartosz Majsak, Greg Pereira, Pierangelo Di Pilato, Ran Pollak, Vivek Karunai Kiri Ragavan, Robert Shaw, and Yuan Tang +* **Tesla team**: Scott Cabrinha and Sai Krishna diff --git a/blog/authors.yml b/blog/authors.yml index 21cb1f3..e9ad569 100644 --- a/blog/authors.yml +++ b/blog/authors.yml @@ -7,96 +7,114 @@ redhat: robshaw: name: Robert Shaw title: Director of Engineering, Red Hat - url: https://github.com/robertgshaw2-redhat image_url: https://avatars.githubusercontent.com/u/114415538?v=4 email: robshaw@redhat.com + socials: + linkedin: robert-shaw-1a01399a + github: https://github.com/robertgshaw2-redhat smarterclayton: name: Clayton Coleman title: Distinguished Engineer, Google - url: https://github.com/smarterclayton image_url: https://avatars.githubusercontent.com/u/1163175?v=4 email: claytoncoleman@google.com + socials: + github: https://github.com/smarterclayton chcost: name: Carlos Costa title: Distinguished Engineer, IBM - url: https://github.com/chcost image_url: https://avatars.githubusercontent.com/u/26551701?v=4 email: chcost@us.ibm.com + socials: + github: https://github.com/chcost petecheslock: name: Pete Cheslock title: AI Community Architect, Red Hat - url: https://github.com/petecheslock image_url: https://avatars.githubusercontent.com/u/511733?v=4 email: pete.cheslock@redhat.com + socials: + linkedin: petecheslock + github: https://github.com/petecheslock cnuland: name: Christopher Nuland title: Principal Technical Marketing Manager for AI, Red Hat - url: https://github.com/cnuland image_url: /img/blogs/cnuland.webp + socials: + linkedin: cjnuland + github: https://github.com/cnuland niliguy: name: Nili Guy title: R&D Manager, AI Infrastructure, IBM - url: https://www.linkedin.com/in/nilig/ image_url: /img/blogs/niliguy.webp - + socials: + linkedin: nilig + etailevran: name: Etai Lev Ran title: Cloud Architect, IBM - url: https://www.linkedin.com/in/elevran/ 
image_url: /img/blogs/etailevran.webp + socials: + linkedin: elevran vitabortnikov: name: Vita Bortnikov title: IBM Fellow, IBM - url: https://www.linkedin.com/in/vita-bortnikov/ image_url: /img/blogs/vitabortnikov.webp + socials: + linkedin: vita-bortnikov maroonayoub: name: Maroon Ayoub title: Research Scientist & Architect, IBM - url: https://www.linkedin.com/in/v-maroon/ image_url: /img/blogs/maroonayoub.webp + socials: + linkedin: v-maroon dannyharnik: name: Danny Harnik title: Senior Technical Staff Member, IBM - url: https://www.linkedin.com/in/danny-harnik-19a95436/ image_url: /img/blogs/dannyharnik.webp + socials: + linkedin: danny-harnik-19a95436 kfirtoledo: name: Kfir Toledo title: Research Staff Member, IBM - url: https://www.linkedin.com/in/kfir-toledo-394a8811a/ image_url: /img/blogs/kfirtoledo.webp + socials: + linkedin: kfir-toledo-394a8811a effiofer: name: Effi Ofer title: Research Staff Member, IBM - url: https://www.linkedin.com/in/effi-ofer-91a261b0/ image_url: /img/blogs/effiofer.webp + socials: + linkedin: effi-ofer-91a261b0 orozeri: name: Or Ozeri title: Research Staff Member, IBM - url: https://www.linkedin.com/in/or-ozeri-a942859a/ image_url: /img/blogs/orozeri.webp + socials: + linkedin: or-ozeri-a942859a tylersmith: name: Tyler Smith title: Member of Technical Staff, Red Hat - url: https://www.linkedin.com/in/tyler-michael-smith-017b28102/ image_url: /img/blogs/tylersmith.webp + socials: + linkedin: tyler-michael-smith-017b28102 kellenswain: name: Kellen Swain title: Software Engineer, Google - url: https://www.linkedin.com/in/kellen-swain/ image_url: /img/blogs/kellenswain.webp + socials: + linkedin: kellen-swain xiningwang: name: Xining Wang @@ -111,23 +129,51 @@ hangyin: kayyan: name: Kay Yan title: Principal Software Engineer, DaoCloud - url: https://www.linkedin.com/in/yankay/ image_url: /img/blogs/kayyan.webp + socials: + linkedin: yankay kylebader: name: Kyle Bader title: Chief Architect, Data and AI, Ceph at IBM - url: 
https://www.linkedin.com/in/kyle-bader-5267a030/ image_url: /img/blogs/kyle-bader.webp + socials: + linkedin: kyle-bader-5267a030 tushargohad: name: Tushar Gohad title: Distinguished Engineer, Intel - url: https://www.linkedin.com/in/tushargohad/ image_url: /img/blogs/tushar-gohad.webp + socials: + linkedin: tushargohad guymargalit: name: Guy Margalit - title: Senior Technical Staff Member, IBM Storage CTO Ofiice - url: https://www.linkedin.com/in/guymargalit/ + title: Senior Technical Staff Member, IBM Storage CTO Office image_url: /img/blogs/guymargalit.webp + socials: + linkedin: guymargalit + +terrytangyuan: + name: Yuan Tang + title: Senior Principal Software Engineer, Red Hat + image_url: https://github.com/terrytangyuan.png + socials: + linkedin: terrytangyuan + github: https://github.com/terrytangyuan + +cabrinha: + name: Scott Cabrinha + title: Staff Site Reliability Engineer, Tesla + image_url: /img/blogs/scottcabrinha.webp + socials: + linkedin: scott-cabrinha + github: https://github.com/cabrinha + +saikrishna: + name: Sai Krishna + title: Staff Software Engineer, Tesla + image_url: /img/blogs/saikrishna.webp + socials: + linkedin: sai-krishna-45372444 + github: https://github.com/skpulipaka26 diff --git a/static/img/blogs/production-grade-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp b/static/img/blogs/production-grade-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp new file mode 100644 index 0000000..0dd8ef5 Binary files /dev/null and b/static/img/blogs/production-grade-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp differ diff --git a/static/img/blogs/saikrishna.webp b/static/img/blogs/saikrishna.webp new file mode 100644 index 0000000..934b08b Binary files /dev/null and b/static/img/blogs/saikrishna.webp differ diff --git a/static/img/blogs/scottcabrinha.webp b/static/img/blogs/scottcabrinha.webp new file mode 100644 index 0000000..cee1ffe Binary files 
/dev/null and b/static/img/blogs/scottcabrinha.webp differ