Blog post on KServe + llm-d + vLLM from Red Hat and Tesla #192

Open · terrytangyuan wants to merge 8 commits into llm-d:main
Conversation
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Contributor
Pull request overview
Adds a new blog post describing a Red Hat + Tesla collaboration story for production-grade LLM inference using KServe, llm-d routing, and vLLM, and registers new authors for attribution.
Changes:
- Add three author entries to blog/authors.yml for the new post.
- Add a new blog post markdown file with frontmatter, content, and an architecture diagram reference.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| blog/authors.yml | Adds new author profiles referenced by the new blog post. |
| blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md | Introduces the new Red Hat + Tesla success story blog post (frontmatter + content). |
1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects expose the standard Kubernetes API, allowing us to override the spec precisely where needed. This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly implementing flag changes.
2. **Intelligent Routing and Efficiency:** By leveraging [**Envoy**](https://www.envoyproxy.io/), [**Envoy AI Gateway**](https://aigateway.envoyproxy.io/), and [**Gateway API Inference Extension**](https://github.com/kubernetes-sigs/gateway-api-inference-extension), we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency.
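To make the customization point concrete, a minimal `LLMInferenceService` manifest might look roughly like the sketch below. This is an assumption-laden illustration, not the team's actual manifest: the `v1alpha1` API version, every field name, and the vLLM flag shown are hypothetical placeholders, so check the KServe documentation for the real schema before relying on any of it.

```shell
# Hypothetical sketch only: the API group/version and all field names below
# are assumptions for illustration, not taken from the post or verified
# against the KServe LLMInferenceService schema.
cat <<'EOF' | kubectl apply -f -
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: chat-model                       # hypothetical name
spec:
  model:
    uri: hf://example-org/example-model  # hypothetical model reference
  replicas: 2
  # The point from the post: the spec can be overridden precisely where
  # needed, e.g. passing vLLM engine flags for specialized hardware.
  template:
    containers:
      - name: main
        args:
          - --max-model-len=8192         # example vLLM flag override
EOF
```

The value of exposing a standard Kubernetes spec is exactly this: flag changes and hardware-specific tuning become a small declarative edit rather than a new image or a forked chart.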
> TODO(saikrishna): charts on the before --> after with prefix-awarness (pending approval) along with some text/descriptions
There's an inline TODO left in the published post content. Please remove it before merging, or replace it with the actual charts/text (or a non-TODO placeholder that won't ship to readers).
Suggested change:

```diff
- TODO(saikrishna): charts on the before --> after with prefix-awarness (pending approval) along with some text/descriptions
+ In practice, introducing prefix-cache aware routing significantly reduced tail latency and improved effective GPU utilization compared to our initial, naïve round-robin setup. Instead of repeatedly rebuilding KV-cache entries for similar prompts across many replicas, hot prefixes now stay "sticky" to the right vLLM instances, which translates directly into higher throughput, more predictable response times, and better cost efficiency at scale.
```
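To show why hot prefixes staying "sticky" helps, here is a toy Python sketch of the prefix-cache aware routing idea. It is illustrative only: the `Replica` class, the `route` function, and the prefix-matching helper are invented for this example and bear no relation to the actual llm-d / Gateway API Inference Extension implementation.

```python
# Toy model of prefix-cache aware routing (illustrative only).
# Each replica remembers prompt prefixes it has served; a new request goes
# to the replica with the longest matching cached prefix, falling back to
# the least-loaded replica on a cache miss.

from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    cached_prefixes: set[str] = field(default_factory=set)
    load: int = 0

def longest_cached_prefix(replica: Replica, prompt: str) -> int:
    """Length of the longest cached prefix that the prompt starts with."""
    return max((len(p) for p in replica.cached_prefixes
                if prompt.startswith(p)), default=0)

def route(replicas: list[Replica], prompt: str) -> Replica:
    # Prefer the replica whose cache covers the most of the prompt;
    # break ties by picking the least-loaded replica.
    chosen = max(replicas,
                 key=lambda r: (longest_cached_prefix(r, prompt), -r.load))
    chosen.load += 1
    chosen.cached_prefixes.add(prompt)  # serving the request warms its cache
    return chosen

replicas = [Replica("vllm-0"), Replica("vllm-1")]
a = route(replicas, "You are a helpful assistant. Summarize:")
b = route(replicas, "You are a helpful assistant. Summarize: doc 2")
print(a.name == b.name)  # → True: the shared system-prompt prefix keeps requests sticky
```

Requests that share a system-prompt prefix land on the same replica, so the KV cache for that prefix is reused instead of rebuilt, which is the mechanism behind the latency and utilization gains described above.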
…nd-tesla-success-story.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…nd-tesla-success-story.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…nd-tesla-success-story.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
… community section
- Convert saikrishna.jpg and scottcabrinha.jpg to WebP and update authors.yml image_url to local paths
- Download KServe architecture diagram, convert to WebP, store under static/img/blogs/<slug>/
- Update blog post image reference from remote GitHub blob URL to local WebP path
- Add "Get Involved with llm-d" community section with links to Slack, GitHub, community calls, social media, and YouTube

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
petecheslock reviewed Mar 4, 2026
- Move LinkedIn URLs from url field to socials.linkedin for all LinkedIn-based authors
- Add socials.github for authors with known GitHub profiles
- Add socials.linkedin for terrytangyuan, cabrinha, robshaw, and saikrishna

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>

…kedIn socials
- Remove url field from all authors who have socials defined
- Add linkedin socials for petecheslock, cnuland, terrytangyuan, cabrinha, robshaw
- Only redhat org entry retains url field (no socials)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
This blog post highlights the collaboration between Red Hat and Tesla to overcome significant scaling and operational challenges in LLM deployment. It explains how migrating from a simple vLLM deployment to a robust MLOps platform built on KServe, llm-d's intelligent routing, and vLLM enables deep customization and improves efficiency through prefix-cache aware routing that maximizes KV-cache reuse and GPU utilization.