---
title: "Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story"
description: "The collaboration story between Red Hat and Tesla to overcome significant scaling and operational challenges in LLM deployment. It explains how migrating from a simple vLLM deployment to a robust MLOps platform utilizing KServe, llm-d's intelligent routing, and vLLM provides deep customization and improved efficiency through prefix-cache aware routing to maximize GPU utilization."
slug: production-grade-ai-inference-kserve-red-hat-and-tesla-success-story
date: 2026-03-06T09:00

authors:
- terrytangyuan
- cabrinha
- robshaw
- saikrishna

tags: [blog]
---

# Production-Grade AI Inference with KServe, llm-d, and vLLM: A Red Hat and Tesla Success Story

## The Problem with "Simple" LLM Deployments

Everyone is racing to run Large Language Models (LLMs) in the cloud, on-prem, and even on edge devices. The real challenge, however, isn't the first deployment; it's scaling, managing, and maintaining hundreds of LLMs efficiently. We initially approached this challenge with a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet.
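For context, the kind of deployment we started with looks roughly like the sketch below. This is a simplified illustration, not our production spec; the model name, image tag, storage size, and resource values are placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llama-3-70b
spec:
  serviceName: llama-3-70b
  replicas: 2
  selector:
    matchLabels:
      app: llama-3-70b
  template:
    metadata:
      labels:
        app: llama-3-70b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # illustrative tag
          args: ["--model", "meta-llama/Meta-Llama-3-70B-Instruct"]
          resources:
            limits:
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
  # One PVC per replica; this is where the storage pain starts.
  volumeClaimTemplates:
    - metadata:
        name: model-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi
```

It works for a handful of models, and that is exactly the trap: every operational property of this setup degrades as the model count grows.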

<!-- truncate -->
This approach quickly exposed severe operational bottlenecks:

* **Storage Drag:** Models like Llama 3 can easily reach hundreds of gigabytes in size. Relying on sluggish network storage (NFS) to serve these massive safetensors files was a non-starter.
* **Infrastructure Lock-in:** Switching to local LVM persistent volumes solved the speed problem but created a rigid node-to-pod affinity. A single hardware failure meant a manual intervention to delete the Persistent Volume Claim (PVC) and reschedule the pod, which is an unacceptable burden for day-2 operations.
* **Naive Load Balancing:** Beyond the looming retirement of the NGINX Ingress Controller, a simple round-robin load-balancing strategy is fundamentally inefficient for LLMs. It fails to utilize the critical **KV-cache** on the GPU, a core feature of vLLM that significantly boosts throughput. In a world where GPU costs are paramount, squeezing efficiency out of every core is non-negotiable.
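The node-to-pod affinity in the second point falls out of how local PersistentVolumes work: the PV itself is pinned to a single node. A simplified sketch (node name, paths, and storage class are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-cache-gpu-node-17
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-lvm
  local:
    path: /mnt/lvm/models
  # This is the lock-in: the volume only exists on one node.
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["gpu-node-17"]
```

Once a PVC binds to a PV like this, the consuming pod can only ever schedule onto that node. If the node fails, the pod sits in Pending until an operator deletes the PVC and lets the workload rebind elsewhere, which is exactly the day-2 burden described above.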

## The Search for a Superior Operator

We recognized that running LLMs at scale demanded a purpose-built solution: a Kubernetes Operator designed for the intricacies of AI/ML. Some existing projects were clean and functional as proofs of concept, but they lacked the necessary extensibility; customizing the runtime specification beyond the exposed Custom Resources was a requirement we couldn't compromise on. Other tools offered robustness at the cost of complexity and were overly opinionated, catering heavily toward a specific prefill/decode setup. Their strict API contracts also didn't align with our need for flexible, customized deployment patterns.

## The Winning Combination: KServe + llm-d + vLLM

![kserve-architecture](/img/blogs/production-grade-ai-inference-kserve-red-hat-and-tesla-success-story/kserve-architecture.webp)

Our journey led us back to the most flexible and powerful solution: [**llm-d**](https://github.com/llm-d/llm-d), powered by [**KServe**](https://github.com/kserve/kserve) and its cutting-edge **Inference Gateway Extension**.

This combination solved every scaling and operational challenge we faced by delivering:

1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects are standard Kubernetes API resources, allowing us to override the spec precisely where needed. This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly rolling out flag changes.
2. **Intelligent Routing and Efficiency:** By leveraging [**Envoy**](https://www.envoyproxy.io/), [**Envoy AI Gateway**](https://aigateway.envoyproxy.io/), and [**Gateway API Inference Extension**](https://github.com/kubernetes-sigs/gateway-api-inference-extension), we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency.
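As an illustration of the first point, an LLMInferenceService lets us declare the model once and override the pod template inline where we need vLLM-specific flags. The sketch below reflects our reading of the current alpha API; treat the apiVersion, field names, and values as indicative rather than authoritative:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-70b
spec:
  model:
    uri: hf://meta-llama/Meta-Llama-3-70B-Instruct
    name: meta-llama/Meta-Llama-3-70B-Instruct
  replicas: 2
  # Inline pod-template override: exactly the extensibility the
  # proof-of-concept operators we evaluated did not expose.
  template:
    containers:
      - name: main
        args:
          - --enable-prefix-caching
          - --max-model-len
          - "8192"
        resources:
          limits:
            nvidia.com/gpu: "4"
```

Because the override is part of the standard spec rather than a fork or a patch layer, flag changes like the ones above ship through the same GitOps flow as any other Kubernetes resource.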

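The routing in the second point is expressed through Gateway API Inference Extension resources: an InferencePool fronts the vLLM pods and delegates endpoint selection to an extension service (the "endpoint picker") that can score endpoints by prefix-cache locality instead of round-robin. A hedged sketch, with resource names of our own invention; check the extension's documentation for the exact apiVersion and fields:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-3-70b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: llama-3-70b
  # The endpoint picker service that implements prefix-cache
  # aware scoring for this pool.
  extensionRef:
    name: llama-3-70b-epp
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-70b-route
spec:
  parentRefs:
    - name: ai-gateway   # the Envoy AI Gateway instance
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-70b-pool
```

The key design point is that the gateway no longer treats vLLM replicas as interchangeable: requests sharing a prompt prefix land on the replica that already holds that prefix in its KV-cache.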

## Collaboration for Successful Adoption

This migration from a fragile StatefulSet to a robust, scalable MLOps platform was not a solitary effort. It was a direct result of the powerful collaboration between **Red Hat** and **Tesla**. By combining Red Hat’s deep expertise in enterprise-grade Kubernetes and open-source infrastructure with Tesla’s demanding requirements for high-performance, large-scale AI serving, we successfully integrated and validated the KServe and llm-d solution. This partnership demonstrates how open standards and purpose-built operators are the key to unlocking the true potential of LLMs in production environments.

This collaboration helps identify issues and sparks ideas for new features in KServe ([#4901](https://github.com/kserve/kserve/issues/4901), [#4900](https://github.com/kserve/kserve/issues/4900), [#4898](https://github.com/kserve/kserve/issues/4898), [#4899](https://github.com/kserve/kserve/issues/4899)). In addition, LLMInferenceService’s storageInitializer field has been [changed to optional](https://github.com/kserve/kserve/pull/4970) to enable the use of RunAI Model Streamer.

The combination of **KServe's** industry-leading standard for model serving, **llm-d's** intelligent routing capabilities, and **vLLM's** high-throughput inference engine provides the best foundation for managing the next generation of AI workloads at enterprise scale.

## Get Involved with llm-d

The work described here is just one example of what becomes possible when a community of engineers tackles hard problems together in the open. If you're running LLMs at scale and wrestling with the same challenges — storage, routing, efficiency, day-2 operations — we'd love to have you involved.

* **Explore the code** → Browse our [GitHub organization](https://github.com/llm-d) and dig into the projects powering this stack
* **Join our Slack** → [Get your invite](/slack) and connect directly with maintainers and contributors from Red Hat, Tesla, and beyond
* **Attend community calls** → All meetings are open! Add our [public calendar](https://red.ht/llm-d-public-calendar) (Wednesdays 12:30pm ET) and join the conversation
* **Follow project updates** → Stay current on [Twitter/X](https://twitter.com/_llm_d_), [Bluesky](https://bsky.app/profile/llm-d.ai), and [LinkedIn](https://www.linkedin.com/company/llm-d)
* **Watch demos and recordings** → Subscribe to the [llm-d YouTube channel](https://www.youtube.com/@llm-d-project) for community call recordings and feature walkthroughs
* **Read the docs** → Visit our [community page](/docs/community) to find SIGs, contribution guides, and upcoming events

## Acknowledgements

We’d like to thank everyone from the community who has contributed to the successful adoption of KServe, llm-d, and vLLM in Tesla's production environment. In particular, the following people from the Red Hat and Tesla teams helped throughout the process (in alphabetical order).

* **Red Hat team**: Andres Llausas, Bartosz Majsak, Greg Pereira, Pierangelo Di Pilato, Vivek Karunai Kiri Ragavan, Robert Shaw, and Yuan Tang
* **Tesla team**: Scott Cabrinha and Sai Krishna
`blog/authors.yml` (67 additions, 21 deletions):
```yaml
robshaw:
  name: Robert Shaw
  title: Director of Engineering, Red Hat
  url: https://github.com/robertgshaw2-redhat
  image_url: https://avatars.githubusercontent.com/u/114415538?v=4
  email: robshaw@redhat.com
  socials:
    linkedin: robert-shaw-1a01399a
    github: https://github.com/robertgshaw2-redhat

smarterclayton:
  name: Clayton Coleman
  title: Distinguished Engineer, Google
  url: https://github.com/smarterclayton
  image_url: https://avatars.githubusercontent.com/u/1163175?v=4
  email: claytoncoleman@google.com
  socials:
    github: https://github.com/smarterclayton

chcost:
  name: Carlos Costa
  title: Distinguished Engineer, IBM
  url: https://github.com/chcost
  image_url: https://avatars.githubusercontent.com/u/26551701?v=4
  email: chcost@us.ibm.com
  socials:
    github: https://github.com/chcost

petecheslock:
  name: Pete Cheslock
  title: AI Community Architect, Red Hat
  url: https://github.com/petecheslock
  image_url: https://avatars.githubusercontent.com/u/511733?v=4
  email: pete.cheslock@redhat.com
  socials:
    linkedin: petecheslock
    github: https://github.com/petecheslock

cnuland:
  name: Christopher Nuland
  title: Principal Technical Marketing Manager for AI, Red Hat
  url: https://github.com/cnuland
  image_url: /img/blogs/cnuland.webp
  socials:
    linkedin: cjnuland
    github: https://github.com/cnuland

niliguy:
  name: Nili Guy
  title: R&D Manager, AI Infrastructure, IBM
  url: https://www.linkedin.com/in/nilig/
  image_url: /img/blogs/niliguy.webp
  socials:
    linkedin: nilig

etailevran:
  name: Etai Lev Ran
  title: Cloud Architect, IBM
  url: https://www.linkedin.com/in/elevran/
  image_url: /img/blogs/etailevran.webp
  socials:
    linkedin: elevran

vitabortnikov:
  name: Vita Bortnikov
  title: IBM Fellow, IBM
  url: https://www.linkedin.com/in/vita-bortnikov/
  image_url: /img/blogs/vitabortnikov.webp
  socials:
    linkedin: vita-bortnikov

maroonayoub:
  name: Maroon Ayoub
  title: Research Scientist & Architect, IBM
  url: https://www.linkedin.com/in/v-maroon/
  image_url: /img/blogs/maroonayoub.webp
  socials:
    linkedin: v-maroon

dannyharnik:
  name: Danny Harnik
  title: Senior Technical Staff Member, IBM
  url: https://www.linkedin.com/in/danny-harnik-19a95436/
  image_url: /img/blogs/dannyharnik.webp
  socials:
    linkedin: danny-harnik-19a95436

kfirtoledo:
  name: Kfir Toledo
  title: Research Staff Member, IBM
  url: https://www.linkedin.com/in/kfir-toledo-394a8811a/
  image_url: /img/blogs/kfirtoledo.webp
  socials:
    linkedin: kfir-toledo-394a8811a

effiofer:
  name: Effi Ofer
  title: Research Staff Member, IBM
  url: https://www.linkedin.com/in/effi-ofer-91a261b0/
  image_url: /img/blogs/effiofer.webp
  socials:
    linkedin: effi-ofer-91a261b0

orozeri:
  name: Or Ozeri
  title: Research Staff Member, IBM
  url: https://www.linkedin.com/in/or-ozeri-a942859a/
  image_url: /img/blogs/orozeri.webp
  socials:
    linkedin: or-ozeri-a942859a

tylersmith:
  name: Tyler Smith
  title: Member of Technical Staff, Red Hat
  url: https://www.linkedin.com/in/tyler-michael-smith-017b28102/
  image_url: /img/blogs/tylersmith.webp
  socials:
    linkedin: tyler-michael-smith-017b28102

kellenswain:
  name: Kellen Swain
  title: Software Engineer, Google
  url: https://www.linkedin.com/in/kellen-swain/
  image_url: /img/blogs/kellenswain.webp
  socials:
    linkedin: kellen-swain

xiningwang:
  name: Xining Wang
  # … (remaining fields, and the hangyin entry, are collapsed in the diff view)

kayyan:
  name: Kay Yan
  title: Principal Software Engineer, DaoCloud
  url: https://www.linkedin.com/in/yankay/
  image_url: /img/blogs/kayyan.webp
  socials:
    linkedin: yankay

kylebader:
  name: Kyle Bader
  title: Chief Architect, Data and AI, Ceph at IBM
  url: https://www.linkedin.com/in/kyle-bader-5267a030/
  image_url: /img/blogs/kyle-bader.webp
  socials:
    linkedin: kyle-bader-5267a030

tushargohad:
  name: Tushar Gohad
  title: Distinguished Engineer, Intel
  url: https://www.linkedin.com/in/tushargohad/
  image_url: /img/blogs/tushar-gohad.webp
  socials:
    linkedin: tushargohad

guymargalit:
  name: Guy Margalit
  title: Senior Technical Staff Member, IBM Storage CTO Office
  url: https://www.linkedin.com/in/guymargalit/
  image_url: /img/blogs/guymargalit.webp
  socials:
    linkedin: guymargalit

terrytangyuan:
  name: Yuan Tang
  title: Senior Principal Software Engineer, Red Hat
  image_url: https://github.com/terrytangyuan.png
  socials:
    linkedin: terrytangyuan
    github: https://github.com/terrytangyuan

cabrinha:
  name: Scott Cabrinha
  title: Staff Site Reliability Engineer, Tesla
  image_url: /img/blogs/scottcabrinha.webp
  socials:
    linkedin: scott-cabrinha
    github: https://github.com/cabrinha

saikrishna:
  name: Sai Krishna
  title: Staff Software Engineer, Tesla
  image_url: /img/blogs/saikrishna.webp
  socials:
    linkedin: sai-krishna-45372444
    github: https://github.com/skpulipaka26
```
Binary file added static/img/blogs/saikrishna.webp
Binary file added static/img/blogs/scottcabrinha.webp