
Add Hybrid Depth-Recurrent Transformer submission #341

Open
tobiascanavesi wants to merge 1 commit into openai:main from tobiascanavesi:hybrid-depth-recurrent

Conversation

@tobiascanavesi

Hybrid Depth-Recurrent Transformer

This PR tests a new architecture that addresses the int8-quantization error-compounding problem in depth-recurrent transformers.

Key Insight

Standard depth-recurrence shares all weights across loop iterations, so int8 rounding errors compound on every loop (a 0.40 BPB gap). The hybrid keeps the precision-sensitive layers near the input/output as unique weights; only the bulk middle layers are shared and looped.
Result: quantization gap reduced from 0.40 to near-zero (-0.004 BPB).
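The compounding failure mode can be illustrated with a small standalone sketch (toy matrices, not the submission's code): looping the same int8-quantized weight matrix re-injects the identical rounding error on every iteration, so the deviation from the full-precision trajectory grows with loop count.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantize/dequantize round-trip."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8).astype(np.float32) * scale

rng = np.random.default_rng(0)
d = 256
# An orthogonal matrix keeps activation norms stable across iterations,
# isolating the rounding error from activation blow-up/decay.
w, _ = np.linalg.qr(rng.standard_normal((d, d)))
w = w.astype(np.float32)
wq = quantize_int8(w)

x_fp = x_q = rng.standard_normal(d).astype(np.float32)
errors = []
for _ in range(5):      # 5 loops over the same shared block
    x_fp = w @ x_fp
    x_q = wq @ x_q      # the same rounding error is applied every loop
    errors.append(float(np.linalg.norm(x_q - x_fp)))

# errors grows with loop count: each pass perturbs an already-perturbed state
print(errors)
```

Unique (non-looped) entry/exit layers see this rounding error exactly once, which is the motivation for keeping them outside the recurrence.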

Architecture

  • 1 unique entry layer + 4 shared blocks × 5 loops + 1 unique exit layer = 22 effective layers from 6 weight blocks
  • U-Net skip connections across full effective depth
  • Per-virtual-layer scalars (attn_scale, mlp_scale, resid_mix, q_gain)
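The layout above can be sketched as a minimal module (a sketch under assumptions: each block is a stand-in `Linear`, the skip scheme and scalar application are simplified, and only `resid_mix` of the listed scalars is shown — none of these names are guaranteed to match the submission's `train_gpt.py`):

```python
import torch
import torch.nn as nn

class HybridDepthRecurrent(nn.Module):
    """Unique entry/exit layers, a shared middle stack looped n_loops
    times, one learned scalar per *virtual* layer, and U-Net-style
    skips across the full effective depth."""

    def __init__(self, d_model=64, n_shared=4, n_loops=5):
        super().__init__()
        self.entry = nn.Linear(d_model, d_model)   # unique, precision-sensitive
        self.shared = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_shared)
        )
        self.exit = nn.Linear(d_model, d_model)    # unique, precision-sensitive
        self.n_loops = n_loops
        n_virtual = n_shared * n_loops
        # One scalar per virtual layer lets each loop iteration reuse the
        # shared weights with a different residual mix (cf. resid_mix).
        self.resid_mix = nn.Parameter(torch.full((n_virtual,), 0.5))

    def forward(self, x):
        x = self.entry(x)
        n_virtual = len(self.shared) * self.n_loops
        skips, v = [], 0
        for _ in range(self.n_loops):
            for block in self.shared:
                if v < n_virtual // 2:
                    skips.append(x)        # first half pushes skip states
                elif skips:
                    x = x + skips.pop()    # second half consumes them
                m = self.resid_mix[v]
                x = m * x + (1 - m) * torch.relu(block(x))
                v += 1
        return self.exit(x)

model = HybridDepthRecurrent()
y = model(torch.randn(2, 64))
# 1 entry + 4 shared + 1 exit = 6 weight blocks; 1 + 4*5 + 1 = 22 effective layers
n_weight_blocks = 2 + len(model.shared)
```

The point of the sketch is the parameter accounting: the loop multiplies effective depth (22 layers) without multiplying the weight blocks (6), while the per-virtual-layer scalars give each loop iteration a cheap degree of freedom.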

Techniques

  • FP16 tied embedding passthrough during int8 quantization
  • Sliding window evaluation (stride=64, seq_len=1024)
  • Decoupled Muon weight decay (0.02)
  • Overtone spectral embedding init (SVD power-law shaping)
  • Phase-transition residual mixing initialization
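The sliding-window evaluation can be sketched as a window plan (the function name and return format here are illustrative, not taken from the submission): each window sees up to `seq_len` tokens of context, but only tokens past the previous window's end are scored, so every token is scored exactly once with long left context.

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Return (start, end, score_from) triples: window covers tokens
    [start, end); only [score_from, end) contribute to the BPB sum.
    Successive windows start `stride` tokens apart, so after the first
    window each token is scored with ~seq_len - stride tokens of context."""
    windows = []
    scored_to = 0   # next token that still needs a score
    start = 0
    while scored_to < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_to))
        scored_to = end
        start += stride
    return windows

plan = sliding_windows(n_tokens=1200)
# flatten the scored ranges to check single coverage
covered = sorted(t for (_, end, sf) in plan for t in range(sf, end))
```

With stride=64 this is 16x more forward passes than disjoint 1024-token chunks, traded for a lower (and less chunk-boundary-sensitive) val_bpb.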

Preliminary Results (2×H100)

| Seed | val_bpb | Steps | Artifact |
|------|---------|-------|----------|
| 1337 | 1.3323  | 954   | 14.2 MB  |

8×H100 run pending, expecting significant improvement with full compute.

Reproduce

```
WARMDOWN_ITERS=2500 MATRIX_LR=0.03 SCALAR_LR=0.03 TIED_EMBED_LR=0.04 \
  torchrun --nproc_per_node=8 train_gpt.py
```
