feat: add DP memory controls and physical batch sizing#518

Draft
binaryaaron wants to merge 5 commits into main from binaryaaron/transformers-v5-oom

@binaryaaron
Collaborator

  • feat: add DP memory training controls
  • fix: estimate preflight VRAM from quantization state
  • feat: add physical batch size constraints
  • fix: cover DP batching and progress edge cases

Add ghost clipping support and diagnostic loss controls so DP training can reduce and inspect memory pressure on larger models.

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
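
For context on why ghost clipping can reduce memory: for a linear layer on non-sequence inputs, the per-example weight gradient is the outer product of the output gradient and the input activation, so its Frobenius norm can be obtained from two vector norms without building the per-example gradient. The pure-Python sketch below only illustrates that identity. It is not the code in this PR, and the function names are hypothetical.

```python
import math


def per_sample_grad_norm_naive(activation, grad_output):
    """Materialize the per-example weight gradient (outer product), then take its norm."""
    grad = [[g * a for a in activation] for g in grad_output]
    return math.sqrt(sum(v * v for row in grad for v in row))


def per_sample_grad_norm_ghost(activation, grad_output):
    """Same norm from two vector norms; no (out x in) matrix is ever built."""
    a_sq = sum(a * a for a in activation)
    g_sq = sum(g * g for g in grad_output)
    return math.sqrt(a_sq * g_sq)


# Illustrative values for a single example passing through one linear layer.
activation = [0.5, -1.0, 2.0]
grad_output = [0.1, 0.3]
naive = per_sample_grad_norm_naive(activation, grad_output)
ghost = per_sample_grad_norm_ghost(activation, grad_output)
assert math.isclose(naive, ghost)
```

The naive path stores one full weight-shaped gradient per example, while the second path stores only two scalars per example, which is where the memory saving comes from.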
Use the model quantization flag rather than PEFT mode to estimate base-weight memory so preflight warnings match actual loading behavior.

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
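
A rough sketch of the idea behind this commit, assuming the estimate is driven by a quantization setting alone: a PEFT run without quantization still loads full-width base weights, so PEFT mode is a poor proxy. The function name, the 16-bit default for unquantized weights, and the supported bit widths are assumptions for illustration, not the PR's actual estimator.

```python
from typing import Optional


def estimate_base_weight_bytes(num_params: int, quantization_bits: Optional[int]) -> int:
    """Rough base-weight footprint in bytes.

    Ignores quantization constants, activations, gradients, and optimizer state.
    """
    # Assumption: unquantized weights are loaded as 16-bit (bf16/fp16).
    bits = 16 if quantization_bits is None else quantization_bits
    if bits not in (4, 8, 16):
        raise ValueError(f"unsupported bit width: {bits}")
    return num_params * bits // 8


# A hypothetical 8B-parameter model: LoRA without quantization vs. 4-bit loading.
unquantized = estimate_base_weight_bytes(8_000_000_000, None)
four_bit = estimate_base_weight_bytes(8_000_000_000, 4)
assert unquantized == 4 * four_bit
```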
Resolve runtime Trainer batch arguments from a physical microbatch cap while preserving the configured effective batch for training and DP accounting.

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
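
One way such a resolution could work, sketched under assumptions: pick the largest per-device microbatch that fits under the physical cap and divides the per-device share of the effective batch, then make up the difference with gradient accumulation. The function name and its parameters are hypothetical. Only the returned keys, `per_device_train_batch_size` and `gradient_accumulation_steps`, are real Trainer argument names.

```python
def resolve_batch_args(effective_batch_size: int,
                       max_physical_batch_size: int,
                       world_size: int = 1) -> dict:
    """Split an effective batch into a capped microbatch and accumulation steps."""
    if effective_batch_size < 1 or max_physical_batch_size < 1 or world_size < 1:
        raise ValueError("all sizes must be positive")
    if effective_batch_size % world_size:
        raise ValueError("effective batch must divide evenly across devices")

    per_device_effective = effective_batch_size // world_size
    cap = min(max_physical_batch_size, per_device_effective)
    # Largest divisor of the per-device share that does not exceed the cap,
    # so microbatch * accumulation * world_size equals the effective batch exactly.
    micro = next(b for b in range(cap, 0, -1) if per_device_effective % b == 0)
    return {
        "per_device_train_batch_size": micro,
        "gradient_accumulation_steps": per_device_effective // micro,
    }


args = resolve_batch_args(effective_batch_size=64, max_physical_batch_size=8, world_size=2)
assert args["per_device_train_batch_size"] * args["gradient_accumulation_steps"] * 2 == 64
```

Requiring an exact divisor keeps the effective batch unchanged, which matters when DP accounting depends on the configured batch size. A cap that shares no large divisor with the effective batch degrades toward a microbatch of 1.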
Preserve validated privacy and batching config paths while fixing ghost clipping adapter saves and generation progress rate rendering.

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


for loss_name, loss_fn in loss_utils.LOSS_MAPPING.items():
    if loss_fn is original:
        loss_utils.LOSS_MAPPING[loss_name] = probed_for_causal_lm_loss
_CAUSAL_LM_LOSS_MEMORY_PROBE_INSTALLED = True
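
The snippet above swaps every registry entry that points at the original loss function for a probed wrapper. A self-contained sketch of that pattern follows, using a stand-in dictionary rather than the real `loss_utils.LOSS_MAPPING`. The helper name `install_probe`, the callback, and the stand-in loss functions are hypothetical, and what the actual probe records is not shown in this thread.

```python
import functools


def install_probe(mapping, original, on_call):
    """Replace every entry that is `original` with a wrapper that reports each call."""
    @functools.wraps(original)
    def probed(*args, **kwargs):
        on_call(original.__name__)  # a real probe might sample memory here instead
        return original(*args, **kwargs)

    replaced = []
    # Iterate over a snapshot so reassigning values cannot disturb iteration.
    for name, fn in list(mapping.items()):
        if fn is original:  # identity check: only the exact function object is swapped
            mapping[name] = probed
            replaced.append(name)
    return replaced


def causal_lm_loss(logits, labels):
    return 0.0  # stand-in loss


def masked_lm_loss(logits, labels):
    return 1.0  # stand-in loss that must stay untouched


LOSS_MAPPING = {"ForCausalLM": causal_lm_loss, "ForMaskedLM": masked_lm_loss}
calls = []
replaced = install_probe(LOSS_MAPPING, causal_lm_loss, calls.append)
LOSS_MAPPING["ForCausalLM"](None, None)
```

Matching by identity rather than by key means aliases of the same function are all covered, and unrelated losses keep their original entries.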
@codecov

codecov Bot commented May 26, 2026

Tighten the updated tests around resolver contracts, DP adapter saves, and SDK validation so they assert behavior rather than implementation details.

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
@binaryaaron changed the title from "Add DP memory controls and physical batch sizing" to "feat: add DP memory controls and physical batch sizing" on May 28, 2026