Device_map multi-GPU inference can silently corrupt tensors when local CUDA peer copies are unhealthy

### System Info

```Shell
- `Accelerate` version: 1.6.0
- Platform: Linux-5.4.0-216-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/zhmao/anaconda3/envs/minimind-v/bin/accelerate
- Python version: 3.10.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 503.79 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
      - compute_environment: LOCAL_MACHINE
      - distributed_type: MULTI_GPU
      - mixed_precision: fp16
      - use_cpu: False
      - debug: False
      - num_processes: 2
      - machine_rank: 0
      - num_machines: 1
      - gpu_ids: all
      - rdzv_backend: static
      - same_network: True
      - main_training_function: main
      - enable_cpu_affinity: False
      - downcast_bf16: no
      - tpu_use_cluster: False
      - tpu_use_sudo: False
      - tpu_env: []
```

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [x] My own task or dataset (give details below)

### Reproduction

# `device_map` multi-GPU inference can silently corrupt tensors when local CUDA peer copies are unhealthy

## Summary

When running multi-GPU inference with Transformers and Accelerate through `device_map`, intermediate activations moved from one GPU to another via `accelerate.utils.send_to_device()` can be silently corrupted on systems where local CUDA peer copies are unhealthy.

On the affected host:

- `tensor.to("cuda:N")` across GPUs can return all-zero tensors
- `tensor.copy_(...)` across GPUs can return all-zero tensors
- `cudaMemcpyPeer` can return all-zero tensors
- `torch.distributed` / NCCL collectives and point-to-point `send/recv` are still correct

Because `dispatch_model()` moves intermediate activations with `send_to_device(...)`, which ultimately calls `tensor.to(target_device)`, the corruption propagates directly into model outputs during `device_map` inference.

I checked the latest `main` branch of `huggingface/accelerate` (commit `29e03d185d6c4608472b3b866964c8942c2fa4a3`, version `1.14.0.dev0`). The relevant `send_to_device` / `AlignDevicesHook` / `dispatch_model` logic is still unchanged there.

A similar problem was reported in the closed issue [#3995](https://github.com/huggingface/accelerate/issues/3995).

## Environment

- OS: Ubuntu 20.04.6 LTS
- Python: 3.10.16
- PyTorch: 2.2.2+cu121
- CUDA runtime: 12.1
- GPU: 8 x NVIDIA RTX A6000
- Driver: 570.172.08
- Accelerate: 1.6.0 installed; the repro was also confirmed locally against the latest `main` logic

## Accelerate config

`~/.cache/huggingface/accelerate/default_config.yaml`

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

## Minimal reproduction (<30s)

```python
import torch
from accelerate.utils.operations import send_to_device

x = torch.arange(2048 * 2048, device="cuda:0", dtype=torch.float32).reshape(2048, 2048)
y = send_to_device(x, torch.device("cuda:1"))

print("src_absmax", float(x.abs().max().item()))
print("dst_absmax", float(y.abs().max().item()))
print("max_abs_diff", float((x.cpu() - y.cpu()).abs().max().item()))
print("dst_zero_count", int((y.cpu() == 0).sum().item()))
```

Observed output on the affected machine:

```text
src_absmax 4194303.0
dst_absmax 0.0
max_abs_diff 4194303.0
dst_zero_count 4194304
```
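
To see which GPU pairs are affected, the same check can be run over every ordered pair. This is a minimal sketch, not part of Accelerate: `find_bad_peer_pairs` is a hypothetical helper name, and it bypasses `send_to_device` to test `tensor.to(...)` directly. The comparison happens on the host, so the check itself does not rely on peer copies.

```python
import itertools

import torch


def find_bad_peer_pairs(numel: int = 1 << 20) -> list[tuple[int, int]]:
    """Return (src, dst) GPU pairs whose direct copy differs from the source data.

    Hypothetical diagnostic helper; returns an empty list on hosts with < 2 GPUs.
    """
    bad = []
    n = torch.cuda.device_count()  # 0 on CPU-only hosts, so the loop is skipped
    for src, dst in itertools.permutations(range(n), 2):
        x = torch.arange(numel, device=f"cuda:{src}", dtype=torch.float32)
        direct = x.to(f"cuda:{dst}")  # the peer-copy path under test
        # Bring both tensors to the host before comparing them.
        if not torch.equal(direct.cpu(), x.cpu()):
            bad.append((src, dst))
    return bad


if __name__ == "__main__":
    print("bad peer pairs:", find_bad_peer_pairs())
```

An empty list means every direct copy matched its source; any listed pair reproduces the corruption shown above.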

## Model-level manifestation

Using a local Qwen2 model split across GPU 0 and GPU 1:

- direct `dispatch_model(..., device_map=...)` produces incorrect logits
- the last-token top-k differs from the single-GPU baseline
- the activation captured at the 0->1 boundary becomes all zero after transfer

At the same time:

- `torch.distributed` `all_reduce`, `all_gather`, and `send/recv` are correct
- setting `NCCL_P2P_LEVEL=NVL` or `NCCL_P2P_LEVEL=LOC` keeps distributed training stable on the same host

So the failure appears isolated to the local CUDA peer-copy path used by `tensor.to(...)`, not to NCCL itself.

## Why this seems to be happening

The current `device_map` inference path is:

1. `Transformers` calls `dispatch_model()` when `device_map` is provided
2. `dispatch_model()` installs `AlignDevicesHook`
3. `AlignDevicesHook.pre_forward()` moves inputs to the execution device
4. `accelerate.utils.send_to_device()` eventually calls `tensor.to(device)`

On the affected machine, `tensor.to(other_gpu)` is already broken, so the model receives corrupted activations.
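
The mechanism in steps 3 and 4 can be illustrated with a plain PyTorch forward pre-hook. This is a simplified sketch, not the actual `AlignDevicesHook` code: `move_inputs_to_module_device` is a hypothetical name, it only handles positional tensor arguments, and it runs on CPU here purely to show the control flow.

```python
import torch
from torch import nn


def move_inputs_to_module_device(module: nn.Module, args: tuple) -> tuple:
    """Forward pre-hook: move positional tensor inputs to the module's device."""
    device = next(module.parameters()).device
    # When the input lives on another GPU, this .to() is the direct peer copy
    # that returns zeros on the affected machine.
    return tuple(a.to(device) if isinstance(a, torch.Tensor) else a for a in args)


layer = nn.Linear(4, 2)
handle = layer.register_forward_pre_hook(move_inputs_to_module_device)
out = layer(torch.ones(3, 4))  # inputs are aligned to the layer's device first
handle.remove()
```

Since the hook trusts whatever `.to()` returns, a broken peer copy is passed to the layer without any error.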

System-level diagnostics are consistent with that interpretation:

- `cudaMemcpyPeer` is also broken
- `p2pBandwidthLatencyTest` reports healthy non-P2P bandwidth but severely degraded P2P write bandwidth/latency
- NCCL defaults to `P2P/CUMEM` unless restricted, but can be forced to fall back to `SHM`

## Suggested direction

An explicit opt-in workaround would be useful:
- add a `dispatch_model(..., force_cuda_to_cuda_via_cpu=True)`-style switch
- route CUDA-to-CUDA activation moves through `tensor.cpu().to(target_gpu)` instead of direct peer copies
- keep it opt-in because it has a performance cost, but it gives users a safe escape hatch on systems where peer copies are unreliable
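
A minimal sketch of what the CPU-staged move could look like. `send_to_device_via_cpu` is a hypothetical helper, not an existing Accelerate function, and the `force_cuda_to_cuda_via_cpu` switch above is likewise only a proposal. The sketch handles tensors nested in plain lists, tuples, and dicts; other containers are returned unchanged.

```python
from typing import Any

import torch


def send_to_device_via_cpu(obj: Any, device: torch.device) -> Any:
    """Recursively move tensors to `device`, staging GPU-to-GPU moves through host memory."""
    if isinstance(obj, torch.Tensor):
        crosses_gpus = (
            obj.device.type == "cuda" and device.type == "cuda" and obj.device != device
        )
        # Two hops (GPU -> CPU -> GPU) avoid the direct peer copy entirely.
        return obj.cpu().to(device) if crosses_gpus else obj.to(device)
    if isinstance(obj, (list, tuple)):
        # Assumes plain lists/tuples; namedtuples would need separate handling.
        return type(obj)(send_to_device_via_cpu(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: send_to_device_via_cpu(v, device) for k, v in obj.items()}
    return obj  # non-tensor leaves pass through untouched
```

The extra host round trip is where the performance cost comes from, which is why the switch would stay opt-in.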

I have a local patch implementing this workaround. With it, I verified that:

- direct split-model inference is wrong
- `force_cuda_to_cuda_via_cpu=True` restores exact agreement with the single-GPU baseline
- `max_abs_diff_fixed_vs_single == 0.0`
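
For reference, the kind of comparison behind these checks can be sketched as follows. This is not the exact code from my patch: `compare_logits` is a hypothetical helper, and it assumes logits of shape `(batch, seq, vocab)`.

```python
import torch


def compare_logits(baseline: torch.Tensor, candidate: torch.Tensor, k: int = 5) -> dict:
    """Compare two logits tensors of shape (batch, seq, vocab) on the host."""
    a = baseline.detach().float().cpu()
    b = candidate.detach().float().cpu()
    # Top-k token ids at the last position, as in the model-level check.
    top_a = a[:, -1].topk(k).indices
    top_b = b[:, -1].topk(k).indices
    return {
        "max_abs_diff": float((a - b).abs().max()),
        "last_token_topk_match": bool(torch.equal(top_a, top_b)),
    }
```

Both tensors are moved to the CPU before comparison so the result does not depend on any GPU-to-GPU copy.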

If this direction sounds acceptable, I can open a PR with the implementation and focused tests.


### Expected behavior

`dispatch_model(..., device_map=...)` should preserve tensor values when moving intermediate activations across GPUs.
On systems where direct local CUDA peer copies are unhealthy, I would expect Accelerate to at least provide a safe opt-in workaround so that multi-GPU `device_map` inference remains numerically correct instead of silently producing corrupted activations and wrong outputs.
