
[WIP]Feature/dpo trainer#1190

Draft
HT-Yuan wants to merge 2 commits into inclusionAI:main from HT-Yuan:feature/dpo-trainer

Conversation

@HT-Yuan (Contributor) commented Apr 16, 2026

Description

Related Issue

Fixes #1137

Type of Change

- [ ] 🐛 Bug fix
- [x] ✨ New feature
- [ ] 💥 Breaking change
- [ ] 📝 Documentation update
- [ ] ♻️ Refactoring
- [ ] ⚡ Performance improvement
- [ ] ✅ Test coverage improvement

Checklist

- [ ] I have read the Contributing Guide
- [x] Pre-commit hooks pass (`pre-commit run --all-files`)
- [ ] Relevant tests pass; new tests added for new functionality
- [ ] Documentation updated (if applicable; built with `./docs/build_all.sh`)
- [ ] Branch is up to date with main
- [ ] Self-reviewed via `/review-pr` command
- [ ] This PR was created by a coding agent via `/create-pr`
- [ ] This PR is a breaking change

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces Direct Preference Optimization (DPO) support to the framework, including the `DPOTrainer`, DPO-specific engine implementations for the FSDP, Megatron, and Archon backends, and dataset processing for the HH-RLHF dataset. The review identifies three issues: potential crashes in `DPOEngine` due to incorrect batch type handling in the training and evaluation paths; an incorrect dataset size calculation in `DPOTrainer` that would lead to improper learning-rate scheduling; and an inefficient Python loop in the DPO loss computation that should be vectorized to improve performance and numerical stability.
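For context, the objective such a trainer optimizes can be written out as a minimal scalar sketch of the standard DPO loss. This is illustrative only, not the PR's implementation; the function name and the `beta` default are assumptions:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # How much more the policy prefers the chosen response over the rejected
    # one, relative to the frozen reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # DPO objective: -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# margin = (-5 - -6) - (-9 - -8) = 2.0, so the loss falls below log(2)
loss = dpo_loss(-5.0, -9.0, -6.0, -8.0)
```

At zero margin the loss is exactly `log(2)`; a positive margin (policy already prefers the chosen response more than the reference does) pushes it lower.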


```python
def _train_dpo(self, data: dict[str, Any]) -> None:
    """Train on a batch (DPO)."""
    if _dpo_loss_weight(data) == 0:
```


high

The `_dpo_loss_weight` function expects a dictionary containing `cu_seqlens`, but `data` here is a `list[dict[str, Any]]` (the raw batch from the dataloader). This will cause a `TypeError` when trying to access `data["cu_seqlens"]`. Since the goal is to skip empty batches and log placeholder stats, you should check the list length instead.

Suggested change
```diff
-    if _dpo_loss_weight(data) == 0:
+    if not data:
+        _log_empty_dpo_stats(current_platform.current_device())
+        return
```

```python
    batched_call(self._evaluate_dpo, data, unpack=False)

def _evaluate_dpo(self, data: dict[str, Any]) -> None:
    if _dpo_loss_weight(data) == 0:
```


high

Similar to the training path, `_dpo_loss_weight` will crash here because `data` is a list of dictionaries, not a packed dictionary with `cu_seqlens`.

Suggested change
```diff
-    if _dpo_loss_weight(data) == 0:
+    if not data:
+        _log_empty_dpo_stats(current_platform.current_device())
+        return
```


```python
ft_spec = FinetuneSpec(
    total_train_epochs=config.total_train_epochs,
    dataset_size=len(self.train_dataloader) * config.train_dataset.batch_size,
```


high

The `dataset_size` is calculated using the length of the sharded dataloader, which represents the number of samples per rank. This will cause `FinetuneSpec` to compute an incorrect `total_train_steps` (underestimated by a factor of `world_size`), leading to incorrect learning-rate scheduling and premature training termination. You should use the total dataset size instead.

Suggested change
```diff
-    dataset_size=len(self.train_dataloader) * config.train_dataset.batch_size,
+    dataset_size=len(train_dataset),
```
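The effect of the fix can be illustrated with a small sketch (the function and argument names are hypothetical, not the `FinetuneSpec` internals): feeding in the per-rank dataloader length instead of the global dataset size shrinks the computed step count by a factor of `world_size`.

```python
import math

def total_train_steps(dataset_size, batch_size, world_size, epochs):
    # Each optimizer step consumes batch_size samples on each of world_size ranks,
    # so one epoch covers the dataset in dataset_size / (batch_size * world_size) steps.
    steps_per_epoch = math.ceil(dataset_size / (batch_size * world_size))
    return steps_per_epoch * epochs

# Global dataset: 12800 samples, batch 8, 4 ranks, 3 epochs -> 1200 steps.
correct = total_train_steps(12_800, 8, 4, 3)
# Passing the per-rank size (12800 / 4) underestimates by 4x -> 300 steps.
buggy = total_train_steps(12_800 // 4, 8, 4, 3)
```

With the buggy count, the LR scheduler would decay to its final value after only a quarter of the intended training.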

Comment on lines +174 to +185
```python
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).cpu()
n_seqs = seqlens.shape[0]

policy_logps = torch.zeros(n_seqs, dtype=torch.float64, device=device)
ref_logps = torch.zeros(n_seqs, dtype=torch.float64, device=device)

for i in range(n_seqs):
    start = cu_seqlens[i]
    end = cu_seqlens[i + 1]
    m = loss_mask[start:end]
    policy_logps[i] = torch.where(m, logprobs[start:end], 0.0).sum()
    ref_logps[i] = torch.where(m, ref_logprobs[start:end], 0.0).sum()
```


medium

This Python loop over the sequences in the packed batch is inefficient and causes multiple GPU-CPU synchronizations. Furthermore, summing log-probabilities in low precision (e.g., `bfloat16`) can lead to numerical instability for long sequences. It is highly recommended to vectorize this operation and perform the summation in `float64` to maintain precision.

```python
n_seqs = cu_seqlens.numel() - 1
seq_ids = torch.zeros(logprobs.shape[0], dtype=torch.long, device=device)
seq_ids.scatter_(0, cu_seqlens[1:-1].long(), 1)
seq_ids = seq_ids.cumsum(dim=0)

policy_logps = torch.zeros(n_seqs, dtype=torch.float64, device=device)
policy_logps.index_add_(0, seq_ids, torch.where(loss_mask, logprobs, 0.0).to(torch.float64))

ref_logps = torch.zeros(n_seqs, dtype=torch.float64, device=device)
ref_logps.index_add_(0, seq_ids, torch.where(loss_mask, ref_logprobs, 0.0).to(torch.float64))
```
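The boundary-marker-plus-cumsum trick in that suggestion can be checked against the original loop with a pure-Python sketch (illustrative names; plain lists stand in for tensors, and the per-element loops here model what `scatter_`, `cumsum`, and `index_add_` do in one shot on the GPU):

```python
def segment_sums_loop(values, cu_seqlens, mask):
    # Loop version: sum masked values within each [cu_seqlens[i], cu_seqlens[i+1]) span.
    return [
        sum(v for v, m in zip(values[s:e], mask[s:e]) if m)
        for s, e in zip(cu_seqlens, cu_seqlens[1:])
    ]

def segment_sums_vectorized(values, cu_seqlens, mask):
    # Mirror of the scatter_ + cumsum + index_add_ suggestion:
    # mark interior boundaries, cumsum to get each position's segment id,
    # then accumulate masked values per segment.
    n = len(values)
    seq_ids = [0] * n
    for b in cu_seqlens[1:-1]:
        seq_ids[b] += 1
    for i in range(1, n):
        seq_ids[i] += seq_ids[i - 1]
    out = [0.0] * (len(cu_seqlens) - 1)
    for v, m, sid in zip(values, mask, seq_ids):
        if m:
            out[sid] += v
    return out
```

Both versions produce identical per-sequence sums for any mask; the tensor formulation simply replaces the Python-level iteration with three device-side ops and accumulates in `float64`.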

@HT-Yuan force-pushed the feature/dpo-trainer branch from 1c85889 to 7c8f3af on Apr 16, 2026 06:58
@HT-Yuan marked this pull request as draft on Apr 16, 2026 07:03


Development

Successfully merging this pull request may close these issues.

DPO algo implementation

1 participant