
chore(deps-dev): update trl requirement from >=0.12 to >=1.6.0 #41

Open
dependabot[bot] wants to merge 1 commit into main from dependabot/pip/trl-gte-1.5.1


dependabot Bot commented on behalf of github on Jun 8, 2026


Updates the requirements on trl to permit the latest version.

Release notes

Sourced from trl's releases.

v1.6.0

Features

AsyncGRPO rollout worker now runs in a separate process

AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.

Architectural changes:

  • AsyncRolloutWorker (parent) owns the child process + shared mp.Queue / mp.Value / mp.Event.
  • _AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.
  • A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.
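The parent/child split described above can be sketched with the standard library alone. This is a minimal illustration of the pattern, not TRL's implementation: `score`, `rollout_loop`, and `collect_rollouts` are hypothetical names, and the completion is a placeholder string rather than a real `/v1/completions` request.

```python
import multiprocessing as mp


def score(completion: str) -> float:
    """Hypothetical CPU-bound reward; stands in for parsing and scoring."""
    return float(len(completion.split()))


def rollout_loop(out_queue, stop_event, prompts):
    """Child-only loop: emits (prompt, completion, reward) until told to stop."""
    for prompt in prompts:
        if stop_event.is_set():
            break
        completion = prompt + " answer"  # placeholder for a generation request
        out_queue.put((prompt, completion, score(completion)))
    out_queue.put(None)  # sentinel: no more rollouts


def collect_rollouts(prompts):
    """Parent side: owns the child process plus the shared queue and event."""
    ctx = mp.get_context("spawn")  # fresh interpreter, so a separate GIL
    out_queue = ctx.Queue()
    stop_event = ctx.Event()
    child = ctx.Process(target=rollout_loop, args=(out_queue, stop_event, prompts))
    child.start()
    results = []
    # Drain the queue before joining so the child never blocks on a full pipe.
    while (item := out_queue.get()) is not None:
        results.append(item)
    child.join()
    return results
```

Because the `spawn` start method re-imports the main module in the child, `collect_rollouts([...])` should be called from under an `if __name__ == "__main__":` guard.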

Two correctness fixes shipped alongside (they would have conflicted otherwise): broader aiohttp retry (now catches ClientPayloadError) with bounded exponential backoff, and all-NaN reward columns are now preserved — np.nansum was silently returning 0, giving unscorable completions a real advantage signal and pushing the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
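The `np.nansum` pitfall is easy to reproduce in isolation. The sketch below assumes a layout of one row per completion and one column per reward function; it shows the general masking idea, not the exact code from the fix.

```python
import numpy as np

# One row per completion, one column per reward function (assumed layout).
# The second completion could not be scored by any reward function.
rewards = np.array([
    [1.0, np.nan],
    [np.nan, np.nan],
    [0.0, 1.0],
])

# Naive aggregation: the all-NaN row silently becomes 0.0.
naive = np.nansum(rewards, axis=1)

# Preserve "unscorable" as NaN so it can be excluded downstream.
all_nan = np.isnan(rewards).all(axis=1)
preserved = np.where(all_nan, np.nan, naive)
```

With the naive sum, the unscorable completion looks like a genuine zero reward and therefore contributes to the advantage computation; keeping it as NaN lets later code mask it out.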

Note: reward_funcs, tools, and environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
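A quick way to check the picklability requirement before launching a run is to try pickling the callable. This is a hedged sketch: `length_reward`, `inline_reward`, and `is_picklable` are hypothetical names, and the `(completions, **kwargs)` signature simply follows the usual TRL reward-function convention.

```python
import pickle


def length_reward(completions, **kwargs):
    """Module-level function: pickled by reference to its importable name."""
    return [float(len(c)) for c in completions]


# A lambda has no importable name, so pickle cannot reference it.
inline_reward = lambda completions, **kwargs: [0.0 for _ in completions]


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
```

In a normal script or module, `is_picklable(length_reward)` is expected to pass while `is_picklable(inline_reward)` fails, so lambdas and closures should be replaced with module-level functions.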

by @​AmineDiro in huggingface/trl#5749

New experimental A2PO trainer (Optimal Advantage Regression)

A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.

    from trl.experimental.a2po import A2POConfig, A2POTrainer

    trainer = A2POTrainer(
        model="Qwen/Qwen3-4B",
        args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
        train_dataset=dataset,
        reward_funcs=accuracy_reward,
    )
    trainer.train()

Designed for binary verifiable rewards (math/code), not open-ended problems.
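The two stages can be written down for a single prompt in a few lines. The release notes name only β₂ and the regression target r − V*; the log-mean-exp estimator for V* and its temperature `beta1` are assumptions here, following the cited paper's formulation, and both function names are hypothetical.

```python
import math


def estimate_v_star(ref_rewards, beta1):
    """Stage 1: log-mean-exp estimate of V*(x) from reference-policy rewards."""
    n = len(ref_rewards)
    m = max(ref_rewards)  # subtract the max for numerical stability
    mean_exp = sum(math.exp((r - m) / beta1) for r in ref_rewards) / n
    return m + beta1 * math.log(mean_exp)


def a2po_loss(logp, ref_logp, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi/pi_ref) and r - V*."""
    return (beta2 * (logp - ref_logp) - (reward - v_star)) ** 2
```

For example, `estimate_v_star([1.0, 0.0, 0.0, 1.0], beta1=0.5)` produces one scalar per prompt offline, and training then needs only a single sampled completion per prompt with its log-probabilities under the policy and the reference model.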

by @​raghulchandramouli in huggingface/trl#5940

KTO now supports VLMs + big alignment push

The biggest KTO ↔ DPO alignment cycle yet — KTOTrainer now supports vision-language models, plus a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO graduation is very close.

    from trl.experimental.kto import KTOConfig, KTOTrainer

    trainer = KTOTrainer(
        model="Qwen/Qwen2.5-VL-3B-Instruct",
        args=KTOConfig(...),
        train_dataset=vision_kto_dataset,

... (truncated)

Commits
  • 0dac440 Release: v1.6 (#6009)
  • 6842058 docs: clarify PPO entropy metrics in PPO trainer docs (#5289)
  • cb5ca23 fix(cli): drop duplicate "to" in trl skills install description (#6008)
  • 8226159 Hide DeepSpeed/FSDP distributed backend boilerplate (#6000)
  • fa286a8 Padding-free invariance test (#5842)
  • eab8bc8 Announce upcoming SFT loss_type default change from 'nll' to `'chunked_nl...
  • 4520e4b [CI] Check that training chat templates keep the stop token in the loss mask ...
  • e28c6d9 Document bnb_4bit_quant_storage and normalize docstring param headers (#5993)
  • 3f6f7d2 Align KTO with DPO: Inline kto_loss in _compute_loss (#5999)
  • b84b487 Align KTO with DPO: Rename kto_loss_fn to liger_loss_fn (#5998)
  • Additional commits viewable in compare view


dependabot Bot commented on behalf of github on Jun 8, 2026


Labels

The following labels could not be found: dependencies. Please create it before Dependabot can add it to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

Updates the requirements on [trl](https://github.com/huggingface/trl) to permit the latest version.
- [Release notes](https://github.com/huggingface/trl/releases)
- [Changelog](https://github.com/huggingface/trl/blob/main/RELEASE.md)
- [Commits](huggingface/trl@v0.12.0...v1.6.0)

---
updated-dependencies:
- dependency-name: trl
  dependency-version: 1.6.0
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot Bot changed the title from "build(deps-dev): update trl requirement from >=0.12 to >=1.5.1" to "chore(deps-dev): update trl requirement from >=0.12 to >=1.6.0" on Jun 15, 2026
dependabot Bot force-pushed the dependabot/pip/trl-gte-1.5.1 branch from 2190662 to eec18c0 on June 15, 2026 14:50