
Blog post on KServe + llm-d + vLLM from Red Hat and Tesla#192

Open
terrytangyuan wants to merge 8 commits into llm-d:main from terrytangyuan:blog-kserve-llm-d

Conversation

@terrytangyuan
Member

This blog post tells the story of how Red Hat and Tesla collaborated to overcome significant scaling and operational challenges in LLM deployment. It explains how migrating from a simple vLLM deployment to a robust MLOps platform built on KServe, llm-d's intelligent routing, and vLLM delivers deep customization and improved efficiency, using prefix-cache aware routing to maximize GPU utilization.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Copilot AI review requested due to automatic review settings March 4, 2026 16:44
@netlify

netlify bot commented Mar 4, 2026

Deploy Preview for elaborate-kangaroo-25e1ee ready!

Name Link
🔨 Latest commit 4c2fed0
🔍 Latest deploy log https://app.netlify.com/projects/elaborate-kangaroo-25e1ee/deploys/69a9946bccfb3c000863cbd0
😎 Deploy Preview https://deploy-preview-192--elaborate-kangaroo-25e1ee.netlify.app

Contributor

Copilot AI left a comment


Pull request overview

Adds a new blog post describing a Red Hat + Tesla collaboration story for production-grade LLM inference using KServe, llm-d routing, and vLLM, and registers new authors for attribution.

Changes:

  • Add three author entries to blog/authors.yml for the new post.
  • Add a new blog post markdown file with frontmatter, content, and an architecture diagram reference.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
blog/authors.yml Adds new author profiles referenced by the new blog post.
blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md Introduces the new Red Hat + Tesla success story blog post (frontmatter + content).


1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects expose the standard Kubernetes API, allowing us to override the spec precisely where needed. This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly implementing flag changes.
2. **Intelligent Routing and Efficiency:** By leveraging **[Envoy](https://www.envoyproxy.io/)**, **[Envoy AI Gateway](https://aigateway.envoyproxy.io/)**, and **[Gateway API Inference Extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension)**, we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency.
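
To make the first point concrete, here is a minimal sketch of the kind of spec-level override the post describes. This is illustrative only: the `apiVersion`, resource names, and field layout below are assumptions based on the post's description, not the authoritative KServe `LLMInferenceService` schema, so consult the KServe documentation before using it.

```yaml
# Illustrative sketch -- field names are assumptions, not the real schema.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-chat
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct   # hypothetical model reference
  template:
    containers:
      - name: main
        args:
          - --max-model-len=8192    # example of a vLLM flag overridden in place
        resources:
          limits:
            nvidia.com/gpu: "1"
```

Because the object exposes a standard Kubernetes spec, a flag change like the one above is just a one-line edit and a re-apply, rather than a rebuild of the serving stack.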
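
The second point can be sketched with a toy router. The code below is a simplified consistent-hash illustration of the *idea* behind prefix-cache aware routing; llm-d's actual scheduler uses the Gateway API Inference Extension's endpoint-picking logic, and the class, backend names, and `prefix_chars` parameter here are invented for illustration.

```python
import hashlib

class PrefixAwareRouter:
    """Toy consistent-hash router keyed on a prompt prefix.

    Illustrative sketch only: llm-d's real scheduler scores endpoints via
    the Gateway API Inference Extension, not this code.
    """

    def __init__(self, backends, prefix_chars=256):
        self.backends = backends          # e.g. vLLM pod addresses
        self.prefix_chars = prefix_chars  # length of prompt prefix to key on

    def route(self, prompt: str) -> str:
        # Requests sharing a prompt prefix hash to the same backend, so that
        # backend's KV-cache already holds the shared prefix tokens.
        key = prompt[: self.prefix_chars].encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()
        return self.backends[int(digest, 16) % len(self.backends)]

router = PrefixAwareRouter(
    ["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"], prefix_chars=32
)
shared_system_prompt = "You are a concise, helpful assistant. "  # >32 chars
# Both requests share the first 32 characters, so they land on the same pod
# and reuse its cached KV entries for the system prompt.
target_a = router.route(shared_system_prompt + "User: summarize this doc")
target_b = router.route(shared_system_prompt + "User: translate this doc")
```

Round-robin would scatter these two requests across pods and rebuild the same KV-cache twice; keeping hot prefixes sticky to one instance is what drives the GPU-efficiency gains described above.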

TODO(saikrishna): charts on the before --> after with prefix-awarness (pending approval) along with some text/descriptions

Copilot AI Mar 4, 2026


There's an inline TODO left in the published post content. Please remove it before merging, or replace it with the actual charts/text (or a non-TODO placeholder that won't ship to readers).

Suggested change
TODO(saikrishna): charts on the before --> after with prefix-awarness (pending approval) along with some text/descriptions
In practice, introducing prefix-cache aware routing significantly reduced tail latency and improved effective GPU utilization compared to our initial, naïve round‑robin setup. Instead of repeatedly rebuilding KV-cache entries for similar prompts across many replicas, hot prefixes now stay “sticky” to the right vLLM instances, which translates directly into higher throughput, more predictable response times, and better cost efficiency at scale.

terrytangyuan and others added 4 commits March 4, 2026 11:50
…nd-tesla-success-story.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…nd-tesla-success-story.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…nd-tesla-success-story.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
… community section

- Convert saikrishna.jpg and scottcabrinha.jpg to WebP and update authors.yml image_url to local paths
- Download KServe architecture diagram, convert to WebP, store under static/img/blogs/<slug>/
- Update blog post image reference from remote GitHub blob URL to local WebP path
- Add "Get Involved with llm-d" community section with links to Slack, GitHub, community calls, social media, and YouTube

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
petecheslock and others added 2 commits March 4, 2026 15:57
- Move LinkedIn URLs from url field to socials.linkedin for all LinkedIn-based authors
- Add socials.github for authors with known GitHub profiles
- Add socials.linkedin for terrytangyuan, cabrinha, robshaw, and saikrishna

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
…kedIn socials

- Remove url field from all authors who have socials defined
- Add linkedin socials for petecheslock, cnuland, terrytangyuan, cabrinha, robshaw
- Only redhat org entry retains url field (no socials)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
@petecheslock
Member

Preview URL: https://deploy-preview-192--elaborate-kangaroo-25e1ee.netlify.app/blog/production-grade-ai-inference-kserve-red-hat-and-tesla-success-story

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
