Device_map multi-GPU inference can silently corrupt tensors when local CUDA peer copies are unhealthy #4036

@Aitejiu

System Info

- `Accelerate` version: 1.6.0
- Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/zhmao/anaconda3/envs/minimind-v/bin/accelerate
- Python version: 3.10.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 503.79 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
      - compute_environment: LOCAL_MACHINE
      - distributed_type: MULTI_GPU
      - mixed_precision: fp16
      - use_cpu: False
      - debug: False
      - num_processes: 2
      - machine_rank: 0
      - num_machines: 1
      - gpu_ids: all
      - rdzv_backend: static
      - same_network: True
      - main_training_function: main
      - enable_cpu_affinity: False
      - downcast_bf16: no
      - tpu_use_cluster: False
      - tpu_use_sudo: False
      - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Summary

When using Transformers + accelerate multi-GPU inference through device_map, intermediate activations moved from one GPU to another via accelerate.utils.send_to_device() can be silently corrupted on systems where local CUDA peer copies are unhealthy.

On the affected host:

  • tensor.to("cuda:N") across GPUs can return all-zero tensors
  • tensor.copy_(...) across GPUs can return all-zero tensors
  • cudaMemcpyPeer can leave the destination buffer all zeros
  • torch.distributed / NCCL collectives and point-to-point send/recv are still correct

Because dispatch_model() moves intermediate activations with send_to_device(...), which in turn calls tensor.to(target_device), the corruption propagates directly into model outputs during device_map inference.

I checked the latest main branch of huggingface/accelerate (commit 29e03d185d6c4608472b3b866964c8942c2fa4a3, version 1.14.0.dev0). The relevant send_to_device / AlignDevicesHook / dispatch_model logic is still unchanged there.

I also found a similar report among the closed issues: #3995.

Environment

  • OS: Ubuntu 20.04.6 LTS
  • Python: 3.10.16
  • PyTorch: 2.2.2+cu121
  • CUDA runtime: 12.1
  • GPU: 8 x NVIDIA RTX A6000
  • Driver: 570.172.08
  • Accelerate: local repro confirmed against latest main logic

Accelerate config

~/.cache/huggingface/accelerate/default_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Minimal reproduction (<30s)

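import torch
from accelerate.utils.operations import send_to_device

# Known, non-zero pattern on GPU 0 (values up to 4194303 are exact in float32).
x = torch.arange(2048 * 2048, device="cuda:0", dtype=torch.float32).reshape(2048, 2048)
# Same helper that device_map inference uses to move activations between GPUs.
y = send_to_device(x, torch.device("cuda:1"))

# Compare on the host; on a healthy machine max_abs_diff is 0.0.
print("src_absmax", float(x.abs().max().item()))
print("dst_absmax", float(y.abs().max().item()))
print("max_abs_diff", float((x.cpu() - y.cpu()).abs().max().item()))
# Only element 0 of the source is zero, so a healthy copy prints 1 here.
print("dst_zero_count", int((y.cpu() == 0).sum().item()))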

Observed output on the affected machine:

src_absmax 4194303.0
dst_absmax 0.0
max_abs_diff 4194303.0
dst_zero_count 4194304
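
The reproduction above exercises a single pair (cuda:0 to cuda:1). On an 8-GPU host it may help to sweep every ordered pair and check each direct copy against the source values read back on the host. The following is only a sketch: find_bad_pairs and direct_copy_is_intact are hypothetical helper names, not part of Accelerate or PyTorch, and it assumes device-to-host copies are trustworthy, as the reproduction's own use of .cpu() does.

```python
from itertools import permutations
from typing import Callable, List, Tuple


def find_bad_pairs(
    num_devices: int, copy_ok: Callable[[int, int], bool]
) -> List[Tuple[int, int]]:
    """Return every ordered (src, dst) pair for which copy_ok reports a failure."""
    return [(s, d) for s, d in permutations(range(num_devices), 2) if not copy_ok(s, d)]


def direct_copy_is_intact(src: int, dst: int, numel: int = 1 << 20) -> bool:
    """Copy a known pattern from cuda:src to cuda:dst and verify it on the host."""
    import torch  # imported here so the pair sweep above stays usable without PyTorch

    # Values below 2**24 are represented exactly in float32.
    x = torch.arange(numel, device=f"cuda:{src}", dtype=torch.float32)
    y = x.to(f"cuda:{dst}")  # direct GPU-to-GPU move, the path that fails in this report
    # .cpu() synchronizes, so both tensors are fully materialized before comparing.
    return torch.equal(x.cpu(), y.cpu())
```

With at least two visible GPUs, find_bad_pairs(torch.cuda.device_count(), direct_copy_is_intact) would list the failing pairs; an empty list means every direct copy matched its source.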

Model-level manifestation

Using a local Qwen2 model split across GPU 0 and GPU 1:

  • direct dispatch_model(..., device_map=...) produces incorrect logits
  • the last-token top-k differs from the single-GPU baseline
  • the activation captured at the 0->1 boundary becomes all zero after transfer

At the same time:

  • torch.distributed all_reduce, all_gather, and send/recv are correct
  • setting NCCL_P2P_LEVEL=NVL or NCCL_P2P_LEVEL=LOC makes distributed training stable on the same host

So the failure appears isolated to the local CUDA peer-copy path used by tensor.to(...), not to NCCL itself.

Why this seems to be happening

The current device_map inference path is:

  1. Transformers calls dispatch_model() when device_map is provided
  2. dispatch_model() installs AlignDevicesHook
  3. AlignDevicesHook.pre_forward() moves inputs to the execution device
  4. accelerate.utils.send_to_device() eventually calls tensor.to(device)

On the affected machine, tensor.to(other_gpu) is already broken, so the model receives corrupted activations.
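
To make the path above concrete, here is a toy model of steps 2 to 4. It is not Accelerate's code: MoveInputsHook and attach are hypothetical stand-ins for the role AlignDevicesHook plays, written with duck typing so that anything exposing .to(device) is treated as a tensor. The point it illustrates is that the wrapped forward never sees the original activation, only whatever .to(device) returned.

```python
class MoveInputsHook:
    """Toy stand-in for a pre-forward hook that aligns inputs to one device."""

    def __init__(self, execution_device):
        self.execution_device = execution_device

    def _move(self, value):
        # Tensor-like values are moved with a plain .to(device); others pass through.
        return value.to(self.execution_device) if hasattr(value, "to") else value

    def pre_forward(self, *args, **kwargs):
        moved_args = tuple(self._move(a) for a in args)
        moved_kwargs = {k: self._move(v) for k, v in kwargs.items()}
        return moved_args, moved_kwargs


def attach(forward, hook):
    """Wrap a forward callable so the hook runs on its inputs first."""

    def wrapped(*args, **kwargs):
        args, kwargs = hook.pre_forward(*args, **kwargs)
        # If .to(device) returned zeros, forward computes on zeros without any error.
        return forward(*args, **kwargs)

    return wrapped
```

Nothing in this flow validates the moved tensor, which is why a broken peer copy surfaces only as wrong logits rather than as an exception.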

System-level diagnostics are consistent with that interpretation:

  • cudaMemcpyPeer is also broken
  • p2pBandwidthLatencyTest reports healthy non-P2P bandwidth but severely degraded P2P write bandwidth/latency
  • NCCL defaults to P2P/CUMEM unless restricted, but can be forced to fall back to SHM

Suggested direction

An explicit opt-in workaround would be useful:

  • add a dispatch_model(..., force_cuda_to_cuda_via_cpu=True)-style switch
  • route CUDA-to-CUDA activation moves through tensor.cpu().to(target_gpu) instead of direct peer copies
  • keep it opt-in because it has a performance cost, but it gives users a safe escape hatch on systems where peer copies are unreliable
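
As a rough illustration of the routing described above (not the local patch itself), the core decision could look like the sketch below. is_cross_gpu_move and move_tensor are hypothetical names, the via_cpu flag stands in for the proposed force_cuda_to_cuda_via_cpu switch, and the sketch handles a single tensor only, whereas send_to_device also walks nested containers.

```python
def is_cross_gpu_move(src_device, dst_device) -> bool:
    """True when both devices are CUDA devices and they are not the same one."""
    src, dst = str(src_device), str(dst_device)
    # A bare "cuda" target differs from "cuda:0" as a string, so it is staged
    # through the host as well; that is slower but still value-preserving.
    return src.startswith("cuda") and dst.startswith("cuda") and src != dst


def move_tensor(tensor, target_device, via_cpu: bool = False):
    """Move a tensor, optionally staging CUDA-to-CUDA moves through host memory."""
    if via_cpu and is_cross_gpu_move(tensor.device, target_device):
        # Two hops (device to host, host to device) avoid the peer-copy path.
        return tensor.cpu().to(target_device)
    # Default behavior is unchanged: a single direct .to(device).
    return tensor.to(target_device)
```

Staging through the host costs an extra copy and a synchronization per boundary, which is the performance trade-off that argues for keeping the switch opt-in.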

I have a local patch that implements exactly this shape of workaround and validates that:

  • direct split-model inference is wrong
  • force_cuda_to_cuda_via_cpu=True restores exact agreement with the single-GPU baseline
  • max_abs_diff_fixed_vs_single == 0.0

If this direction sounds acceptable, I can open a PR with the implementation and focused tests.

Expected behavior

dispatch_model(..., device_map=...) should preserve tensor values when moving intermediate activations across GPUs.
On systems where direct local CUDA peer copies are unhealthy, I would expect Accelerate to at least provide a safe opt-in workaround so that multi-GPU device_map inference remains numerically correct instead of silently producing corrupted activations and wrong outputs.
