Blog post on KServe + llm-d + vLLM from Red Hat and Tesla #192

Open · terrytangyuan wants to merge 8 commits into llm-d:main
Conversation
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Contributor
Pull request overview
Adds a new blog post describing a Red Hat + Tesla collaboration story for production-grade LLM inference using KServe, llm-d routing, and vLLM, and registers new authors for attribution.
Changes:
- Add three author entries to blog/authors.yml for the new post.
- Add a new blog post markdown file with frontmatter, content, and an architecture diagram reference.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| blog/authors.yml | Adds new author profiles referenced by the new blog post. |
| blog/2026-03-06_production-grade-ai-inference-kserve-red-hat-and-tesla-success-story.md | Introduces the new Red Hat + Tesla success story blog post (frontmatter + content). |
1. **Deep Customization:** The **LLMInferenceService** and **LLMInferenceConfig** objects expose the standard Kubernetes API, allowing us to override the spec precisely where needed. This level of granular control is crucial for tailoring vLLM to specialized hardware or quickly implementing flag changes.
2. **Intelligent Routing and Efficiency:** By leveraging [**Envoy**](https://www.envoyproxy.io/), [**Envoy AI Gateway**](https://aigateway.envoyproxy.io/), and [**Gateway API Inference Extension**](https://github.com/kubernetes-sigs/gateway-api-inference-extension), we moved far beyond round-robin. This technology enables **prefix-cache aware routing**, ensuring requests are intelligently routed to the correct vLLM instance to maximize KV-cache utilization and drive up GPU efficiency.
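To make the customization point concrete, a minimal `LLMInferenceService` manifest might look roughly like the sketch below. This is an assumption-laden illustration, not the team's actual manifest: the `v1alpha1` API version, every field name, and the vLLM flag shown are hypothetical placeholders, so check the KServe documentation for the real schema before relying on any of it.

```shell
# Hypothetical sketch only: the API group/version and all field names below
# are assumptions for illustration, not taken from the post or verified
# against the KServe LLMInferenceService schema.
cat <<'EOF' | kubectl apply -f -
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: chat-model                       # hypothetical name
spec:
  model:
    uri: hf://example-org/example-model  # hypothetical model reference
  replicas: 2
  # The point from the post: the spec can be overridden precisely where
  # needed, e.g. passing vLLM engine flags for specialized hardware.
  template:
    containers:
      - name: main
        args:
          - --max-model-len=8192         # example vLLM flag override
EOF
```

The value of exposing a standard Kubernetes spec is exactly this: flag changes and hardware-specific tuning become a small declarative edit rather than a new image or a forked chart.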
> TODO(saikrishna): charts on the before --> after with prefix-awarness (pending approval) along with some text/descriptions
There's an inline TODO left in the published post content. Please remove it before merging, or replace it with the actual charts/text (or a non-TODO placeholder that won't ship to readers).
Suggested change:

```diff
- TODO(saikrishna): charts on the before --> after with prefix-awarness (pending approval) along with some text/descriptions
+ In practice, introducing prefix-cache aware routing significantly reduced tail latency and improved effective GPU utilization compared to our initial, naïve round-robin setup. Instead of repeatedly rebuilding KV-cache entries for similar prompts across many replicas, hot prefixes now stay "sticky" to the right vLLM instances, which translates directly into higher throughput, more predictable response times, and better cost efficiency at scale.
```
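To show why hot prefixes staying "sticky" helps, here is a toy Python sketch of the prefix-cache aware routing idea. It is illustrative only: the `Replica` class, the `route` function, and the prefix-matching helper are invented for this example and bear no relation to the actual llm-d / Gateway API Inference Extension implementation.

```python
# Toy model of prefix-cache aware routing (illustrative only).
# Each replica remembers prompt prefixes it has served; a new request goes
# to the replica with the longest matching cached prefix, falling back to
# the least-loaded replica on a cache miss.

from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    cached_prefixes: set[str] = field(default_factory=set)
    load: int = 0

def longest_cached_prefix(replica: Replica, prompt: str) -> int:
    """Length of the longest cached prefix that the prompt starts with."""
    return max((len(p) for p in replica.cached_prefixes
                if prompt.startswith(p)), default=0)

def route(replicas: list[Replica], prompt: str) -> Replica:
    # Prefer the replica whose cache covers the most of the prompt;
    # break ties by picking the least-loaded replica.
    chosen = max(replicas,
                 key=lambda r: (longest_cached_prefix(r, prompt), -r.load))
    chosen.load += 1
    chosen.cached_prefixes.add(prompt)  # serving the request warms its cache
    return chosen

replicas = [Replica("vllm-0"), Replica("vllm-1")]
a = route(replicas, "You are a helpful assistant. Summarize:")
b = route(replicas, "You are a helpful assistant. Summarize: doc 2")
print(a.name == b.name)  # → True: the shared system-prompt prefix keeps requests sticky
```

Requests that share a system-prompt prefix land on the same replica, so the KV cache for that prefix is reused instead of rebuilt, which is the mechanism behind the latency and utilization gains described above.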
…nd-tesla-success-story.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…nd-tesla-success-story.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…nd-tesla-success-story.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
… community section
- Convert saikrishna.jpg and scottcabrinha.jpg to WebP and update authors.yml image_url to local paths
- Download KServe architecture diagram, convert to WebP, store under static/img/blogs/<slug>/
- Update blog post image reference from remote GitHub blob URL to local WebP path
- Add "Get Involved with llm-d" community section with links to Slack, GitHub, community calls, social media, and YouTube

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
petecheslock reviewed Mar 4, 2026
- Move LinkedIn URLs from url field to socials.linkedin for all LinkedIn-based authors
- Add socials.github for authors with known GitHub profiles
- Add socials.linkedin for terrytangyuan, cabrinha, robshaw, and saikrishna

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>

…kedIn socials
- Remove url field from all authors who have socials defined
- Add linkedin socials for petecheslock, cnuland, terrytangyuan, cabrinha, robshaw
- Only redhat org entry retains url field (no socials)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Pete Cheslock <pete.cheslock@redhat.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
This blog post highlights the collaboration between Red Hat and Tesla to overcome significant scaling and operational challenges in LLM deployment. It explains how migrating from a simple vLLM deployment to a robust MLOps platform built on KServe, llm-d's intelligent routing, and vLLM enables deep customization and improves efficiency through prefix-cache aware routing that maximizes KV-cache reuse and GPU utilization.