chore(deps-dev): update trl requirement from >=0.12 to >=1.6.0 by dependabot[bot] · Pull Request #41 · mara-werils/llmstack

dependabot · 2026-06-08T15:25:58Z

Updates the requirements on trl to permit the latest version.

Release notes

v1.6.0

Features

AsyncGRPO rollout worker now runs in a separate process

AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.

Architectural changes:

AsyncRolloutWorker (parent) owns the child process + shared mp.Queue / mp.Value / mp.Event.

_AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.

A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.

Two correctness fixes shipped alongside (they would have conflicted otherwise). First, the aiohttp retry is broader (it now catches ClientPayloadError) and uses bounded exponential backoff. Second, all-NaN reward columns are now preserved: np.nansum was silently returning 0 for them, which gave unscorable completions a real advantage signal and pushed the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
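
The all-NaN case is easy to reproduce. The following is a minimal pure-Python sketch, assuming rewards arrive as one row per completion with one entry per reward function. `nansum` is a stand-in that mirrors the NaN-as-zero behaviour of `np.nansum`, and `combine_rewards` is a hypothetical helper, not TRL's actual code.

```python
import math

def nansum(values):
    """Sum that skips NaN entries; like np.nansum, an all-NaN input gives 0.0."""
    return sum((v for v in values if not math.isnan(v)), 0.0)

def combine_rewards(row):
    """Hypothetical helper: total reward for one completion.

    Returns NaN when every reward function returned NaN, so the completion
    stays marked as unscorable instead of receiving a reward of 0.
    """
    if all(math.isnan(v) for v in row):
        return math.nan
    return nansum(row)

rows = [
    [1.0, math.nan],       # one reward function could score this completion
    [math.nan, math.nan],  # no reward function could score this completion
]
naive = [nansum(r) for r in rows]           # second row silently becomes 0.0
fixed = [combine_rewards(r) for r in rows]  # second row stays NaN
```

With the naive sum, the unscorable completion looks like a scored failure; keeping the NaN lets downstream code exclude it from the advantage computation.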

> [!NOTE]
> reward_funcs / tools / environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
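
Because the child is a spawned process, anything handed to it has to survive a `pickle` round trip. A small pre-flight check might look like the sketch below; `is_picklable` is a hypothetical helper (not part of TRL), and the two sample callables are illustrative only.

```python
import math
import pickle

def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True

# Importable, named callables are pickled by reference and work fine.
named_reward = math.isfinite
# Lambdas (and functions defined inside other functions) cannot be pickled.
lambda_reward = lambda completion: float(len(completion) > 0)

checks = {
    "named_reward": is_picklable(named_reward),
    "lambda_reward": is_picklable(lambda_reward),
}
```

In practice this suggests defining reward functions at module level rather than as lambdas or closures before upgrading.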

by @AmineDiro in huggingface/trl#5749

New experimental A2PO trainer (Optimal Advantage Regression)

A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.
```python
from trl.experimental.a2po import A2POConfig, A2POTrainer

trainer = A2POTrainer(
    model="Qwen/Qwen3-4B",
    args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
    train_dataset=dataset,
    reward_funcs=accuracy_reward,
)
trainer.train()
```
Designed for binary verifiable rewards (math/code), not open-ended problems.
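
The two stages can be sketched for a single prompt in plain Python. This is an illustration under assumptions, not TRL's implementation: the log-mean-exp form of the V* estimate and the `beta1` parameter are assumed from the A*-PO paper rather than stated in these release notes, and all function names are hypothetical.

```python
import math

def estimate_v_star(ref_rewards, beta1):
    """Stage 1 (assumed form): V* ~= beta1 * log(mean(exp(r / beta1))),
    computed over rewards of completions sampled from the reference policy."""
    scaled = [r / beta1 for r in ref_rewards]
    m = max(scaled)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta1 * (m + math.log(mean_exp))

def keep_prompt(ref_rewards):
    """Mimics the idea behind filter_all_incorrect: drop a prompt when
    every reference completion received a reward of 0."""
    return any(r > 0 for r in ref_rewards)

def a2po_loss(logp, logp_ref, reward, v_star, beta2):
    """Stage 2: squared error between beta2 * log(pi / pi_ref) and r - V*."""
    residual = beta2 * (logp - logp_ref) - (reward - v_star)
    return residual ** 2

# Binary rewards for four reference completions of one prompt.
ref_rewards = [1.0, 0.0, 1.0, 0.0]
v_star = estimate_v_star(ref_rewards, beta1=0.5)
# One on-policy completion with summed log-probs under pi and pi_ref.
loss = a2po_loss(logp=-12.0, logp_ref=-12.5, reward=1.0, v_star=v_star, beta2=0.1)
```

Note that the loss needs only one generation per prompt and no group statistics, which matches the "no group, no critic, no clipping" description above.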

by @raghulchandramouli in huggingface/trl#5940

KTO now supports VLMs + big alignment push

The biggest KTO ↔ DPO alignment cycle yet — KTOTrainer now supports vision-language models, plus a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO graduation is very close.
```python
from trl.experimental.kto import KTOConfig, KTOTrainer

trainer = KTOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=KTOConfig(...),
    train_dataset=vision_kto_dataset,
```

... (truncated)

Commits

0dac440 Release: v1.6 (#6009)
6842058 docs: clarify PPO entropy metrics in PPO trainer docs (#5289)
cb5ca23 fix(cli): drop duplicate "to" in trl skills install description (#6008)
8226159 Hide DeepSpeed/FSDP distributed backend boilerplate (#6000)
fa286a8 Padding-free invariance test (#5842)
eab8bc8 Announce upcoming SFT loss_type default change from 'nll' to `'chunked_nl...
4520e4b [CI] Check that training chat templates keep the stop token in the loss mask ...
e28c6d9 Document bnb_4bit_quant_storage and normalize docstring param headers (#5993)
3f6f7d2 Align KTO with DPO: Inline kto_loss in _compute_loss (#5999)
b84b487 Align KTO with DPO: Rename kto_loss_fn to liger_loss_fn (#5998)
Additional commits viewable in compare view

dependabot · 2026-06-08T15:25:59Z

Labels

The following labels could not be found: dependencies. Please create it before Dependabot can add it to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

Updates the requirements on [trl](https://github.com/huggingface/trl) to permit the latest version.
- [Release notes](https://github.com/huggingface/trl/releases)
- [Changelog](https://github.com/huggingface/trl/blob/main/RELEASE.md)
- [Commits](huggingface/trl@v0.12.0...v1.6.0)

---
updated-dependencies:
- dependency-name: trl
  dependency-version: 1.5.1
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

dependabot Bot changed the title ~~build(deps-dev): update trl requirement from >=0.12 to >=1.5.1~~ chore(deps-dev): update trl requirement from >=0.12 to >=1.6.0 Jun 15, 2026

dependabot Bot force-pushed the dependabot/pip/trl-gte-1.5.1 branch from 2190662 to eec18c0 Compare June 15, 2026 14:50

dependabot[bot] wants to merge 1 commit into main from dependabot/pip/trl-gte-1.5.1
