
fix(vllm): drop flash-attn wheel to avoid torch 2.10 ABI mismatch #9557

Merged
mudler merged 1 commit into mudler:master from richiejp:fix/vllm-flash-attn on Apr 25, 2026

Conversation

@richiejp
Collaborator

The pinned flash-attn 2.8.3+cu12torch2.7 wheel breaks at import time
once vllm 0.19.1 upgrades torch to its hard-pinned 2.10.0:

    ImportError: .../flash_attn_2_cuda...so: undefined symbol:
    _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

That C10 CUDA symbol is libtorch-version-specific. Dao-AILab has not yet
published flash-attn wheels for torch 2.10 -- the latest release (2.8.3)
tops out at torch 2.8 -- so any wheel pinned here is silently ABI-broken
the moment vllm completes its install.
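For reference, that mangled name demangles to roughly c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool), a libtorch-internal helper, which is why the wheel only loads against the torch line it was built for. A minimal sketch of how to confirm the mismatch in the affected environment (nothing LocalAI-specific, just a sanity check):

```python
# Sanity-check sketch for the broken environment described above
# (assumes torch and the pinned flash-attn wheel are both installed).
import importlib

import torch

print("torch:", torch.__version__)  # reports 2.10.0 after the vllm upgrade

try:
    importlib.import_module("flash_attn")
    print("flash_attn imports cleanly")
except ImportError as exc:
    # With the 2.8.3+cu12torch2.7 wheel this raises the undefined-symbol
    # error quoted above, because the .so was linked against libtorch 2.7.
    print("flash_attn import failed:", exc)
```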

vllm 0.19.1 lists flashinfer-python==0.6.6 as a hard dep, which already
covers the attention path. The only other use of flash-attn in vllm is
the rotary apply_rotary import in
vllm/model_executor/layers/rotary_embedding/common.py, which is guarded
by find_spec("flash_attn") and falls back cleanly when absent.
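The guard in question is the usual optional-import idiom; the sketch below is a paraphrase of that pattern (module path and names approximated, not the verbatim vllm source):

```python
# Paraphrased sketch of vllm's optional flash-attn import guard; the exact
# module path differs between flash-attn versions, so treat this as the
# shape of the code, not a copy of it.
from importlib.util import find_spec

if find_spec("flash_attn") is not None:
    # Fast rotary kernel from flash-attn, used only when the wheel works.
    from flash_attn.ops.triton.rotary import apply_rotary
else:
    # flash-attn absent (or dropped, as in this PR): callers fall back to
    # a plain-torch rotary implementation.
    apply_rotary = None
```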

Also unpin torch in requirements-cublas12.txt: the 2.7.0 pin only
existed to give the flash-attn wheel a matching torch to link against.
With flash-attn gone, vllm's own torch==2.10.0 dep is the binding
constraint regardless of what we put here.
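A quick way to confirm the binding constraint after an install, assuming vllm and torch share the environment, is to compare the installed torch against whatever vllm's metadata declares:

```python
# Sketch: verify that the torch actually installed is the one vllm pins,
# now that requirements-cublas12.txt no longer forces a version.
from importlib.metadata import requires, version

print("installed torch:", version("torch"))

# Loose filter; vllm's metadata should contain something like "torch==2.10.0".
torch_pins = [r for r in (requires("vllm") or []) if r.startswith("torch")]
print("vllm declares:", torch_pins)
```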

Notes for Reviewers

I don't know what the performance implications are, but practically speaking we need Dao-AILab to update flash-attn.

Signed commits

  • Yes, I signed my commits.

@richiejp richiejp force-pushed the fix/vllm-flash-attn branch from bb9f22b to e85629f on April 25, 2026 09:22
@mudler
Owner

mudler commented Apr 25, 2026

mh interesting, we should check closer as not having flash attn is a huge slowdown

@richiejp
Collaborator Author

> mh interesting, we should check closer as not having flash attn is a huge slowdown

oh, it's still there, but implemented in flashinfer, which seems to be the default in vLLM.
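For anyone who wants to double-check that, a smoke test that forces the FlashInfer backend explicitly might look like the sketch below (the model name is only a placeholder; VLLM_ATTENTION_BACKEND is vLLM's backend override environment variable):

```python
# Smoke-test sketch: force the FlashInfer attention backend and make sure
# generation still works with no flash-attn package installed.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder: any small model works
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```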


Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp richiejp force-pushed the fix/vllm-flash-attn branch from e85629f to e011550 on April 25, 2026 12:39
@mudler mudler enabled auto-merge (squash) April 25, 2026 13:11
@mudler mudler merged commit 73aacad into mudler:master Apr 25, 2026
48 checks passed