Fix Eagle3 FA cached backward semantics and keep FA2 fallback backend#79

Merged
zhyncs merged 5 commits into lightseekorg:main from uygnef:fix/eagle3_fa
Apr 16, 2026

Conversation


@uygnef uygnef commented Apr 16, 2026

Summary

This PR fixes the Eagle3 fa cached-attention backward path.

The same gradient issue can also be avoided by using
LlamaFlashAttentionMasked, but that path depends on the masked / cute FA
stack.

This PR instead fixes the standard fa path, giving us a correct backend
that depends only on standard FA2 and can serve as a more compatible
fallback backend, especially on older devices or in environments where
the masked FA path is unavailable.

Background

For Eagle3 cached attention, the final output is a merge of:

  • block0 causal attention
  • suffix singleton cached blocks
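
Merging two attention pieces this way renormalizes their partial softmax
results using per-query log-sum-exp (LSE) statistics. A minimal sketch of
that combine step (illustrative only; `attn_block` / `merge_blocks` are
hypothetical names, not the repo's actual code):

```python
import torch

def attn_block(q, k, v):
    # Plain softmax attention over one key/value block, also returning
    # the per-query log-sum-exp of the scores for later merging.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    lse = torch.logsumexp(scores, dim=-1)        # [num_q]
    out = torch.softmax(scores, dim=-1) @ v      # [num_q, head_dim]
    return out, lse

def merge_blocks(out1, lse1, out2, lse2):
    # Renormalize two partial softmax results into one global softmax:
    # weight each block by exp(lse_block - lse_global).
    lse = torch.logaddexp(lse1, lse2)
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)
    return w1 * out1 + w2 * out2, lse
```

Splitting the keys into two blocks, attending to each, and merging with
`merge_blocks` reproduces full attention over all keys exactly.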

The old fa path handled block0 and the external merge as separate steps,
so block0's backward was effectively computed against its local softmax
normalization rather than the final merged normalization.

As a result, cached-path q/k gradients could diverge from
flex_attention / masked-attention behavior and contribute to abnormal
gradient spikes.

LlamaFlashAttentionMasked does not have this issue because it expresses
the attention pattern inside a masked FA formulation, but it has stricter
backend/runtime requirements.

What this PR changes

In torchspec/models/draft/llama3_eagle.py:

  • fix the cached-merge backward semantics for fa
  • keep block0 on standard flash-attention
  • reuse standard flash-attn backward for block0 with merged
    combined_out / combined_lse
  • keep suffix block gradients analytic and explicit
  • support padded batches via varlen flash-attn for block0
  • simplify the main fa path around the corrected implementation
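
The core of the fix is that block0's backward must be taken against the
global (merged) normalization, not block0's local one. A small autograd
sketch of why a correctly merged two-block path yields the same q/k
gradients as full attention (hypothetical helpers, not the PR's actual
implementation, which reuses the flash-attn backward kernel with
combined_out / combined_lse):

```python
import torch

def attn(q, k, v):
    # Plain softmax attention plus per-query log-sum-exp.
    s = q @ k.T / q.shape[-1] ** 0.5
    return torch.softmax(s, -1) @ v, torch.logsumexp(s, -1)

def merged_attn(q, k, v, split):
    # block0 and the suffix block attended separately, then renormalized
    # via their lse statistics into one global softmax.
    o1, l1 = attn(q, k[:split], v[:split])
    o2, l2 = attn(q, k[split:], v[split:])
    lse = torch.logaddexp(l1, l2)
    return torch.exp(l1 - lse)[:, None] * o1 + torch.exp(l2 - lse)[:, None] * o2

torch.manual_seed(0)
q = torch.randn(4, 8, requires_grad=True)
k = torch.randn(6, 8, requires_grad=True)
v = torch.randn(6, 8)

# Gradients through the merged two-block path.
merged_attn(q, k, v, split=3).sum().backward()
gq_merge, gk_merge = q.grad.clone(), k.grad.clone()
q.grad = None; k.grad = None

# Reference: one full-softmax attention over all keys.
attn(q, k, v)[0].sum().backward()
assert torch.allclose(gq_merge, q.grad, atol=1e-5)
assert torch.allclose(gk_merge, k.grad, atol=1e-5)
```

Differentiating through the merge (as above) makes the block gradients
consistent with the global normalization; the old bug corresponds to
backpropagating block0 as if its local softmax were the final output.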

Why this is useful

This gives us a corrected fa backend that:

  • matches the intended Eagle3 cached-merge backward semantics
  • aligns with flex_attention
  • does not require the masked / cute FA path
  • only depends on standard FA2
  • can be used as a more compatible backup backend on older devices or less
    specialized environments

Validation

Historical dump replay

On a historical spike dump:

  • dump:
    • grad_norm = 1422.276855
    • total_loss = 14.712839
  • fixed fa replay:
    • grad_norm = 22.8646
    • weighted_loss = 14.633358
  • flex_attention replay:
    • grad_norm = 22.7233
    • weighted_loss = 14.633938

So after the fix:

  • fa and flex_attention are essentially aligned
  • the previous large-gradient behavior is no longer reproduced

Padded batch benchmark

Right-padded batch, batch=4, mixed valid lengths.

max seq   valid lengths              fa time   flex time   fa peak     flex peak
4096      [4096, 3584, 2560, 1536]   0.155s    0.122s      6.33 GiB    5.22 GiB
8192      [8192, 7168, 5120, 3072]   0.248s    0.317s      14.02 GiB   13.35 GiB

Numerical alignment stayed good:

  • losses matched
  • outputs stayed close
  • parameter gradient relative L2 error stayed small
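
The "relative L2 error" here is presumably the usual norm ratio; a
one-function sketch of how such a per-parameter gradient comparison is
typically computed (not the repo's actual helper):

```python
import torch

def rel_l2_error(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-12) -> float:
    """Relative L2 error ||a - b|| / (||b|| + eps), e.g. between the same
    parameter's gradient under two attention backends."""
    return ((a - b).norm() / (b.norm() + eps)).item()
```

A value well below 1e-2 per parameter is what "stayed small" usually
means in this kind of backend-alignment check.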

Tests

Updated / added checks in tests/test_flex_attention.py:

  • cached-path gradients match flex_attention
  • forward behavior matches expected outputs
  • padded batch cases are numerically aligned with flex_attention
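
For the padded-batch cases, varlen flash-attn kernels consume a packed
(unpadded) token tensor plus cumulative sequence offsets (`cu_seqlens`)
rather than a right-padded batch. A hypothetical packing helper of the
kind such tests rely on (names are illustrative):

```python
import torch

def pack_right_padded(x, valid_lens):
    # x: [batch, max_seq, ...], right-padded. Returns the valid tokens
    # concatenated along dim 0, plus int32 cumulative offsets
    # [0, n0, n0+n1, ...] as expected by varlen flash-attn kernels.
    packed = torch.cat([x[i, :n] for i, n in enumerate(valid_lens)], dim=0)
    cu_seqlens = torch.zeros(len(valid_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(valid_lens, dtype=torch.int32), 0)
    return packed, cu_seqlens
```

With this shape convention, a right-padded batch like the benchmark's
[4096, 3584, 2560, 1536] case packs into one flat sequence with offsets
[0, 4096, 7680, 10240, 11776].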

@uygnef uygnef changed the title Fix/eagle3 fa Fix Eagle3 FA cached backward semantics and keep FA2 fallback backend Apr 16, 2026
uygnef added 5 commits April 16, 2026 16:31
Signed-off-by: Yu Feng <admin@fengyu.org>
@zhyncs zhyncs merged commit 6dda2bf into lightseekorg:main Apr 16, 2026
2 checks passed