docs: add KEP-2839 Dynamic LLM Trainer Framework proposal by NarayanaSabari · Pull Request #3263 · kubeflow/trainer

NarayanaSabari · 2026-02-27T10:01:50Z

Summary

Add KEP-2839: Dynamic LLM Trainer Framework proposal.

This KEP proposes decoupling the BuiltinTrainer from TorchTune by introducing a pluggable
LLMBackend interface in the Python SDK and a corresponding LLMBackendStrategy in the Go
control plane. TorchTune becomes the first backend implementation (preserving backward
compatibility), and TRL is added as the first new backend with SFT/DPO support.

Builds on KEP-2401 and the community consensus on "Plan 3" from #2752.

Tracking issue: #2839

What This KEP Covers

SDK (Python)

LLMBackend abstract base class with to_command(), to_args(), framework(), validate()
Backend registry with @register_backend decorator supporting external/out-of-tree backends
BuiltinTrainer.config type widened from TorchTuneConfig to LLMBackend
TorchTuneConfig refactored to implement LLMBackend (zero breaking changes)
TRLConfig backend with SFT and DPO trainer types

Go Control Plane

LLMBackendStrategy interface replacing hardcoded TorchTune command-sniffing in the Torch plugin
TorchTuneStrategy (wraps existing torchtune.go logic unchanged)
TRLStrategy (minimal -- TRL config is fully constructed by the SDK)
Dispatch via trainer.kubeflow.org/framework label on ClusterTrainingRuntime
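
The label-based dispatch described above can be sketched as a simple lookup. The actual control plane is written in Go; this Python version only illustrates the selection logic. The label key comes from this description, while the label values ("torchtune", "trl"), the `name` attribute, and the behavior for a missing or unknown label are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Label key named in the PR description.
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


@dataclass(frozen=True)
class Strategy:
    """Hypothetical stand-in for an LLMBackendStrategy implementation."""

    name: str


# Assumed label values; the real runtimes may use different ones.
STRATEGIES: Dict[str, Strategy] = {
    "torchtune": Strategy(name="torchtune"),
    "trl": Strategy(name="trl"),
}


def select_strategy(runtime_labels: Mapping[str, str]) -> Optional[Strategy]:
    """Pick a strategy from the labels on a ClusterTrainingRuntime.

    Returns None when the label is absent (assumed to mean "no LLM
    backend handling") and raises ValueError for an unknown framework.
    """
    framework = runtime_labels.get(FRAMEWORK_LABEL)
    if framework is None:
        return None
    if framework not in STRATEGIES:
        raise ValueError(f"unknown framework label value: {framework!r}")
    return STRATEGIES[framework]
```

Keying on an explicit label, rather than inspecting the container command, means adding a backend is a matter of adding one entry to the lookup table.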

Infrastructure

TRL container image (cmd/trainers/trl/)
TRL ClusterTrainingRuntime manifests (manifests/base/runtimes/trl/)
Helm chart additions for TRL runtimes

Non-Goals

Unsloth or LlamaFactory backends (future work)
CRD schema changes
New Kubernetes resource topologies

Test Plan

Unit tests for the backend registry, TorchTuneConfig backward compatibility, TRLConfig, and Go strategy dispatch
Integration tests for TRL TrainJob reconciliation and runtime listing
E2E tests for TRL SFT/DPO on GPU and TorchTune regression

/cc @Electronic-Waste @andreyvelich @tariq-hasan

google-oss-prow · 2026-02-27T10:01:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-02-27T10:02:00Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

- Strip all Low-Level Design content (code interfaces, strategies, Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (trl sft, not python -m trl)
  - Multi-node env vars (standard + PET variants)
  - Correct enforceTorchTunePolicy inline location
  - dependsOn YAML format, volume handling pattern
  - TRLTrainerType enum values (SFT/DPO/KTO/GRPO)
  - Container name 'node' not 'trainer'
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State Analysis, High-Level Design, Test Plan, Risks, Phases

Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>

docs: add KEP-2839 Dynamic LLM Trainer Framework proposal

611adef

google-oss-prow bot added the do-not-merge/work-in-progress label Feb 27, 2026

google-oss-prow bot requested review from jinchihe and kuizhiqing February 27, 2026 10:01

google-oss-prow bot added the size/XL label Feb 27, 2026

NarayanaSabari added 3 commits March 2, 2026 13:08

updated KEP for TRL

37f64fe

Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>

updated kep for dpo example

f755136

NarayanaSabari wants to merge 4 commits into kubeflow:master from NarayanaSabari:kep-2839-dynamic-llm-trainer
