docs: add KEP-2839 Dynamic LLM Trainer Framework proposal #3263

Draft
NarayanaSabari wants to merge 4 commits into kubeflow:master from NarayanaSabari:kep-2839-dynamic-llm-trainer
NarayanaSabari commented Feb 27, 2026

Summary

Add KEP-2839: Dynamic LLM Trainer Framework proposal.

This KEP proposes decoupling the BuiltinTrainer from TorchTune by introducing a pluggable
LLMBackend interface in the Python SDK and a corresponding LLMBackendStrategy in the Go
control plane. TorchTune becomes the first backend implementation (preserving backward
compatibility), and TRL is added as the first new backend with SFT/DPO support.

Builds on KEP-2401
and the community consensus on "Plan 3" from #2752.

Tracking issue: #2839

What This KEP Covers

SDK (Python)

  • LLMBackend abstract base class with to_command(), to_args(), framework(), validate()
  • Backend registry with @register_backend decorator supporting external/out-of-tree backends
  • BuiltinTrainer.config type widened from TorchTuneConfig to LLMBackend
  • TorchTuneConfig refactored to implement LLMBackend (zero breaking changes)
  • TRLConfig backend with SFT and DPO trainer types
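
The SDK pieces listed above could fit together roughly as follows. This is a minimal sketch, not the KEP's actual code: the method names come from the list above and the `trl sft` entry point from the audit notes in this PR, while the registry dict, `get_backend`, the `TRLConfig` fields, and the CLI flag names are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class LLMBackend(ABC):
    """Illustrative base class; method names follow the PR description."""

    @abstractmethod
    def framework(self) -> str:
        """Name used to match the backend to a runtime."""

    @abstractmethod
    def to_command(self) -> tuple:
        """Container entry point, e.g. ("trl", "sft")."""

    @abstractmethod
    def to_args(self) -> list:
        """CLI arguments appended to the command."""

    @abstractmethod
    def validate(self) -> None:
        """Raise ValueError if the config is unusable."""


# Hypothetical registry: framework name -> backend class.
_BACKENDS = {}


def register_backend(name):
    """Class decorator that records a backend under `name`."""
    def decorator(cls):
        if name in _BACKENDS:
            raise ValueError(f"backend {name!r} is already registered")
        _BACKENDS[name] = cls
        return cls
    return decorator


def get_backend(name):
    """Hypothetical lookup helper for registered backends."""
    return _BACKENDS[name]


@register_backend("trl")
@dataclass
class TRLConfig(LLMBackend):
    # Field names are assumptions, not the KEP's schema.
    trainer_type: str = "sft"
    model_name_or_path: str = ""
    dataset_name: str = ""

    def framework(self):
        return "trl"

    def to_command(self):
        # Audit note in this PR: the entry point is `trl sft`, not `python -m trl`.
        return ("trl", self.trainer_type)

    def to_args(self):
        # Flag names are assumed to mirror the config field names.
        return [
            "--model_name_or_path", self.model_name_or_path,
            "--dataset_name", self.dataset_name,
        ]

    def validate(self):
        if self.trainer_type not in ("sft", "dpo"):
            raise ValueError(f"unsupported trainer type: {self.trainer_type!r}")
        if not self.model_name_or_path:
            raise ValueError("model_name_or_path is required")
```

An out-of-tree backend would follow the same pattern: subclass `LLMBackend`, decorate the class with `@register_backend("name")`, and pass an instance wherever `BuiltinTrainer.config` accepts an `LLMBackend`.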

Go Control Plane

  • LLMBackendStrategy interface replacing hardcoded TorchTune command-sniffing in the Torch plugin
  • TorchTuneStrategy (wraps existing torchtune.go logic unchanged)
  • TRLStrategy (minimal -- TRL config is fully constructed by the SDK)
  • Dispatch via trainer.kubeflow.org/framework label on ClusterTrainingRuntime
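
The label-based dispatch described above can be illustrated in a few lines. The real control plane is written in Go; the sketch below uses Python only to show the selection logic. The label key is taken from the list above, while the label values (`torchtune`, `trl`), the function names, and the fallback behaviour for unlabelled runtimes are assumptions.

```python
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


def torchtune_strategy(command):
    # Stand-in for the existing TorchTune-specific handling.
    return list(command)


def trl_strategy(command):
    # Minimal by design: the SDK already built the full TRL command.
    return list(command)


# Hypothetical mapping from label value to strategy.
STRATEGIES = {
    "torchtune": torchtune_strategy,
    "trl": trl_strategy,
}


def select_strategy(runtime_labels):
    """Pick a strategy from the runtime's labels.

    Returns None when the label is absent (assumed to mean "no LLM
    backend handling") and raises ValueError for an unknown value.
    """
    framework = runtime_labels.get(FRAMEWORK_LABEL)
    if framework is None:
        return None
    if framework not in STRATEGIES:
        raise ValueError(f"unknown framework label value: {framework!r}")
    return STRATEGIES[framework]
```

Keying the choice on a runtime label rather than on the contents of the command is what lets a new backend be added without touching the handling of existing ones.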

Infrastructure

  • TRL container image (cmd/trainers/trl/)
  • TRL ClusterTrainingRuntime manifests (manifests/base/runtimes/trl/)
  • Helm chart additions for TRL runtimes

Non-Goals

  • Unsloth or LlamaFactory backends (future work)
  • CRD schema changes
  • New Kubernetes resource topologies

Test Plan

  • Unit tests for the backend registry, TorchTuneConfig backward compatibility, TRLConfig, and Go strategy dispatch
  • Integration tests for TRL TrainJob reconciliation and runtime listing
  • E2E tests for TRL SFT/DPO on GPU and TorchTune regression

/cc @Electronic-Waste @andreyvelich @tariq-hasan

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

- Strip all Low-Level Design content (code interfaces, strategies,
  Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (trl sft, not python -m trl)
  - Multi-node env vars (standard + PET variants)
  - Correct enforceTorchTunePolicy inline location
  - dependsOn YAML format, volume handling pattern
  - TRLTrainerType enum values (SFT/DPO/KTO/GRPO)
  - Container name 'node' not 'trainer'
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State
  Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>