Fix causality violation: use per-token weights instead of full-sequence mean pooling by sippycoder · Pull Request #3 · WithNucleusAI/mHC-triton

sippycoder · 2026-03-10T09:12:21Z

Dynamic weights (H_pre, H_post, H_res) were computed from H.mean(dim=1), which
averages over all sequence positions. For autoregressive LLMs this leaks future
token information into the mixing weights applied at position t, breaking causality.

Fix: replace H.mean(dim=1).reshape(batch, n*dim) with H.reshape(batch*seq, n*dim)
so each token's weights are derived solely from its own hidden state. Weight shapes
change from (batch, n) to (batch, seq, n) throughout, matching the paper's intent
and reference implementations (tokenbender, VatsaDev).
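The leak and the fix can be checked with a small perturbation test. The following is a minimal sketch rather than the repository's code: it uses NumPy in place of the PR's tensor library, assumes H has layout (batch, seq, n, dim), and uses a hypothetical projection matrix W to stand in for whatever maps a hidden state to n mixing weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, n, dim = 2, 5, 4, 8

# Assumed layout: (batch, seq, n, dim). W is a hypothetical projection.
H = rng.standard_normal((batch, seq, n, dim))
W = rng.standard_normal((n * dim, n))

def pooled_weights(H):
    # Old behaviour: average over the sequence axis, one weight vector per batch row.
    return H.mean(axis=1).reshape(batch, n * dim) @ W  # (batch, n)

def per_token_weights(H):
    # New behaviour: each (batch, position) row is projected independently.
    flat = H.reshape(batch * seq, n * dim) @ W          # (batch*seq, n)
    return flat.reshape(batch, seq, n)                  # (batch, seq, n)

# Perturb only the final position, then compare the weights.
H_future = H.copy()
H_future[:, -1] += 1.0

# Pooled weights change, and they are applied at every position, including t=0.
pooled_changed = not np.allclose(pooled_weights(H), pooled_weights(H_future))

# Per-token weights at earlier positions are unaffected by the future token.
early_unchanged = np.allclose(
    per_token_weights(H)[:, :-1], per_token_weights(H_future)[:, :-1]
)
```

With pooling, a change at the last position alters the weights used at every earlier position. With per-token weights, only the perturbed position's weights move.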

Changes:

- module.py / _torch_baseline.py: all three _compute_weights paths (static, fused,
  separate-projections) now produce per-position weights (batch, seq, n)
- _kernels.py: stream_mix and add_residual forward kernels index weights by pid_bs
  (b*seq+s) instead of b
- _backward.py: same index fix in backward kernels; gradient reductions now sum
  only over d_blocks (dim=2) to preserve the per-position (batch, seq, n) shape
- ops.py / _torch_baseline.py: updated einsum signatures and docstrings
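The indexing and reduction changes listed above can be illustrated in NumPy. This is a sketch under assumptions, not the repository's kernels: the einsum strings and the (batch, seq, n, dim) layout are hypothetical, chosen only to show how a flat index pid_bs = b*seq+s selects a per-position weight row and why the weight gradient keeps its (batch, seq, n) shape when only the feature axis is reduced.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, n, dim = 2, 3, 4, 8

H = rng.standard_normal((batch, seq, n, dim))  # assumed layout
w = rng.standard_normal((batch, seq, n))       # per-position weights

# A kernel launched per (b, s) pair sees the weights as a flat (batch*seq, n) array.
w_flat = w.reshape(batch * seq, n)
b, s = 1, 2
pid_bs = b * seq + s
same_row = np.array_equal(w_flat[pid_bs], w[b, s])

# Hypothetical stream mix: weighted sum over the n streams at each position.
mixed = np.einsum("bsn,bsnd->bsd", w, H)       # (batch, seq, dim)

# Gradient w.r.t. the weights for an upstream gradient g:
# reduce over the feature axis only, so the sequence axis survives.
g = rng.standard_normal((batch, seq, dim))
grad_w = np.einsum("bsd,bsnd->bsn", g, H)      # (batch, seq, n)
```

Under the old (batch, n) weight shape the corresponding gradient would also have to sum over the sequence axis. With per-position weights that extra reduction is dropped.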

https://claude.ai/code/session_016YVdHfTQm3GA8aqcj8ws25


sippycoder · 2026-03-10T09:14:55Z

- Verify tests and numerical stability
- Run sample training runs and track layer norms to check training stability

sippycoder wants to merge 1 commit into main from claude/fix-causality-tensor-reshape-itfva
