System Info
- `Accelerate` version: 1.6.0
- Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/zhmao/anaconda3/envs/minimind-v/bin/accelerate
- Python version: 3.10.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 503.79 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
Information
Tasks
Reproduction
device_map multi-GPU inference can silently corrupt tensors when local CUDA peer copies are unhealthy
Summary
When using Transformers + accelerate multi-GPU inference through device_map, intermediate activations moved from one GPU to another via accelerate.utils.send_to_device() can be silently corrupted on systems where local CUDA peer copies are unhealthy.
On the affected host:
- tensor.to("cuda:N") across GPUs can return all-zero tensors
- tensor.copy_(...) across GPUs can return all-zero tensors
- cudaMemcpyPeer can return all-zero tensors
- torch.distributed / NCCL collectives and point-to-point send/recv are still correct
Because dispatch_model() uses send_to_device(...)->tensor.to(target_device) for intermediate activations, the corruption propagates directly into model outputs during device_map inference.
I checked the latest main branch of huggingface/accelerate (commit 29e03d185d6c4608472b3b866964c8942c2fa4a3, version 1.14.0.dev0). The relevant send_to_device / AlignDevicesHook / dispatch_model logic is still unchanged there.
I found a similar report among the closed issues: #3995.
Environment
- OS: Ubuntu 20.04.6 LTS
- Python: 3.10.16
- PyTorch: 2.2.2+cu121
- CUDA runtime: 12.1
- GPU: 8 x NVIDIA RTX A6000
- Driver: 570.172.08
- Accelerate: local repro confirmed against latest main logic
Accelerate config
~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Minimal reproduction (<30s)
import torch
from accelerate.utils.operations import send_to_device
x = torch.arange(2048 * 2048, device="cuda:0", dtype=torch.float32).reshape(2048, 2048)
y = send_to_device(x, torch.device("cuda:1"))
print("src_absmax", float(x.abs().max().item()))
print("dst_absmax", float(y.abs().max().item()))
print("max_abs_diff", float((x.cpu() - y.cpu()).abs().max().item()))
print("dst_zero_count", int((y.cpu() == 0).sum().item()))
Observed output on the affected machine:
src_absmax 4194303.0
dst_absmax 0.0
max_abs_diff 4194303.0
dst_zero_count 4194304
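To see whether the problem is limited to one GPU pair, the reproduction above can be extended to every ordered pair of visible devices. The following is a minimal sketch and not part of the original report: `check_peer_copies` is a hypothetical helper name, and it calls `tensor.to(...)` directly on the assumption that `send_to_device` ends up on the same copy path.

```python
import itertools

import torch


def check_peer_copies(numel: int = 1 << 20):
    """Return (src, dst, max_abs_diff) for every ordered pair of visible GPUs.

    Hypothetical diagnostic helper; returns an empty list with fewer than two GPUs.
    """
    results = []
    n = torch.cuda.device_count() if torch.cuda.is_available() else 0
    for src, dst in itertools.permutations(range(n), 2):
        x = torch.arange(numel, device=f"cuda:{src}", dtype=torch.float32)
        y = x.to(f"cuda:{dst}")  # direct GPU-to-GPU copy under test
        torch.cuda.synchronize(dst)
        # Compare on the CPU so the check itself does not depend on peer copies.
        diff = (x.cpu() - y.cpu()).abs().max().item()
        results.append((src, dst, diff))
    return results


if __name__ == "__main__":
    for src, dst, diff in check_peer_copies():
        status = "OK" if diff == 0.0 else "CORRUPT"
        print(f"cuda:{src} -> cuda:{dst}: max_abs_diff={diff} [{status}]")
```

A non-zero `max_abs_diff` for a pair would suggest that direct copies between those two devices are affected.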
Model-level manifestation
Using a local Qwen2 model split across GPU 0 and GPU 1:
- direct dispatch_model(..., device_map=...) produces incorrect logits
- the last-token top-k differs from the single-GPU baseline
- the activation captured at the 0->1 boundary becomes all zero after transfer
At the same time:
- torch.distributed all_reduce, all_gather, and send/recv are correct
- NCCL_P2P_LEVEL=NVL or LOC makes distributed training stable on the same host
So the failure appears isolated to the local CUDA peer-copy path used by tensor.to(...), not to NCCL itself.
Why this seems to be happening
The current device_map inference path is:
- Transformers calls dispatch_model() when device_map is provided
- dispatch_model() installs AlignDevicesHook
- AlignDevicesHook.pre_forward() moves inputs to the execution device
- accelerate.utils.send_to_device() eventually calls tensor.to(device)
On the affected machine, tensor.to(other_gpu) is already broken, so the model receives corrupted activations.
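The hook step in this path can be illustrated with plain PyTorch. This is a simplified sketch of the idea and not Accelerate's actual AlignDevicesHook implementation: `make_align_hook` is a hypothetical name, and the example uses the CPU as the execution device so that it runs anywhere.

```python
import torch
from torch import nn


def make_align_hook(execution_device):
    """Build a forward pre-hook that moves positional tensor inputs to one device.

    Hypothetical stand-in for the role AlignDevicesHook.pre_forward() plays.
    """

    def hook(module, args):
        # This .to(...) call is where a cross-GPU copy would happen when the
        # previous layer ran on a different device.
        return tuple(
            a.to(execution_device) if isinstance(a, torch.Tensor) else a
            for a in args
        )

    return hook


layer = nn.Linear(4, 2)
layer.register_forward_pre_hook(make_align_hook(torch.device("cpu")))
out = layer(torch.ones(1, 4))
print(out.shape)
```

If the copy inside the hook returns zeros, the layer computes on zeros without raising any error, which matches the silent corruption described above.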
System-level diagnostics are consistent with that interpretation:
- cudaMemcpyPeer is also broken
- p2pBandwidthLatencyTest reports healthy non-P2P bandwidth but severely degraded P2P write bandwidth/latency
- NCCL defaults to P2P/CUMEM unless restricted, but can be forced to fall back to SHM
Suggested direction
An explicit opt-in workaround would be useful:
- add a dispatch_model(..., force_cuda_to_cuda_via_cpu=True)-style switch
- route CUDA-to-CUDA activation moves through tensor.cpu().to(target_gpu) instead of direct peer copies
- keep it opt-in because it has a performance cost, but it gives users a safe escape hatch on systems where peer copies are unreliable
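A rough sketch of what such a CPU-staged move could look like, written independently of the local patch mentioned in this issue. `move_via_cpu` is a hypothetical helper, not an existing Accelerate function, and it is deliberately simplified: it handles only tensors, lists, plain tuples, and dicts (named tuples and other containers are not covered).

```python
import torch


def move_via_cpu(obj, device):
    """Move tensors in a nested structure to `device`.

    Copies between two different CUDA devices are staged through host memory
    instead of using a direct peer copy. Hypothetical helper for illustration.
    """
    device = torch.device(device)
    if isinstance(obj, torch.Tensor):
        # A target without an index (plain "cuda") never equals a tensor's
        # indexed device, so it is staged too; slower, but still correct.
        cross_gpu = (
            obj.device.type == "cuda"
            and device.type == "cuda"
            and obj.device != device
        )
        if cross_gpu:
            return obj.cpu().to(device)  # GPU -> host -> GPU
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_via_cpu(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: move_via_cpu(v, device) for k, v in obj.items()}
    return obj  # non-tensor leaves are passed through unchanged
```

For example, `move_via_cpu(hidden_states, "cuda:1")` would take the staged path for a tensor that currently lives on `cuda:0`, at the cost of an extra device-to-host and host-to-device transfer.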
I have a local patch that implements exactly this shape of workaround and validates that:
- direct split-model inference is wrong
- force_cuda_to_cuda_via_cpu=True restores exact agreement with the single-GPU baseline
- max_abs_diff_fixed_vs_single == 0.0
If this direction sounds acceptable, I can open a PR with the implementation and focused tests.
Expected behavior
dispatch_model(..., device_map=...) should preserve tensor values when moving intermediate activations across GPUs.
On systems where direct local CUDA peer copies are unhealthy, I would expect Accelerate to at least provide a safe opt-in workaround so that multi-GPU device_map inference remains numerically correct instead of silently producing corrupted activations and wrong outputs.