# Reparameterized LLM Training via Orthogonal Equivalence Transformation
## POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
This repository contains the official implementation of POET and POET-X — a family of reparameterized LLM training algorithms that optimize weight matrices through Orthogonal Equivalence Transformation (OET), achieving superior generalization with provably bounded weight spectra.
POET's three learning phases: conical shell searching → stable learning → final adjusting.
```shell
git clone https://github.com/Sphere-AI-Lab/poet.git
cd poet
pip install -e .
```

Requirements:
- Python ≥ 3.10
- PyTorch ≥ 2.7
- CUDA ≥ 12.6
- Triton ≥ 3.4.0
The training scripts expect the C4 dataset at ./c4/en/ relative to the repo root. Run the following commands from the root of this repository:
```shell
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
cd ..
```

This will create a `c4/en/` folder directly inside the repo root, resulting in the following structure:
```
poet/
├── c4/
│   └── en/
│       ├── c4-train.00000-of-01024.json.gz
│       └── ...
├── torchrun_main.py
└── ...
```
Note: If no local data is found, the training script will automatically fall back to streaming the dataset directly from HuggingFace (`allenai/c4`), which requires an internet connection but no local storage.
```shell
# Pretrain LLaMA-3B with POET-X (block_size=512) on C4
bash scripts/benchmark_c4_poet/pretrain_poet_3b.sh

# Pretrain LLaMA-3B with POET-XQ (block_size=512) on C4
bash scripts/benchmark_c4_qpoet/pretrain_qpoet_3b.sh
```

POET reparameterizes each weight matrix as:

$$W = R \, W_0 \, P$$

where $W_0$ is the fixed, randomly initialized weight matrix and $R$, $P$ are learnable orthogonal matrices applied on the left and right.
Why orthogonal transformations? They preserve singular values exactly — giving POET direct, provable control over the weight spectrum throughout training.
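This spectrum-preservation property is easy to verify numerically; the following sketch (illustrative sizes, random orthogonal factors built via QR) checks that an orthogonal equivalence transformation leaves the singular values of the base weight unchanged:

```python
import torch

torch.manual_seed(0)

# Base weight and two random orthogonal matrices (via QR decomposition).
W0 = torch.randn(6, 4, dtype=torch.float64)
R, _ = torch.linalg.qr(torch.randn(6, 6, dtype=torch.float64))
P, _ = torch.linalg.qr(torch.randn(4, 4, dtype=torch.float64))

# The orthogonal equivalence transformation W = R @ W0 @ P
# leaves the singular values of W0 unchanged.
W = R @ W0 @ P
sv_before = torch.linalg.svdvals(W0)
sv_after = torch.linalg.svdvals(W)
print(torch.allclose(sv_before, sv_after, atol=1e-10))  # True
```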
Dynamics of singular values: POET (right) avoids the large singular value growth seen in standard AdamW training (left).
POET maintains consistently higher SVD entropy (singular value diversity) throughout training compared to AdamW and Muon.
Training the full orthogonal matrices directly is expensive at LLM scale, so POET uses two stochastic parameterizations:

- POET-FS (Fully Stochastic SPO): Randomly samples a small $b \times b$ submatrix at each step. Highly parameter-efficient; decouples parameter count from matrix size.
- POET-BS (Block-Stochastic SPO): Block-diagonal structure with random permutations; transforms all dimensions simultaneously. More expressive per parameter.
Weight update coverage: POET-BS achieves more even updates across all weight elements compared to POET-FS.
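A minimal sketch of the fully stochastic variant, under the assumption that the sampled $b \times b$ orthogonal block is embedded into an identity matrix and therefore only touches the corresponding rows of $R$ (variable names are illustrative, not the repo's API):

```python
import torch

torch.manual_seed(0)
d, b = 8, 3  # full dimension and sampled block size (illustrative)

# Start from an orthogonal R and a small b x b orthogonal update B.
R, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))
B, _ = torch.linalg.qr(torch.randn(b, b, dtype=torch.float64))

# Sample b indices; left-multiplying by B embedded in an identity at
# these indices only changes the corresponding rows of R.
idx = torch.randperm(d)[:b]
R_new = R.clone()
R_new[idx] = B @ R[idx]

# The updated matrix remains exactly orthogonal.
err = torch.linalg.norm(R_new.T @ R_new - torch.eye(d, dtype=torch.float64))
print(float(err) < 1e-10)  # True
```

Because only a $b \times b$ block is optimized per step, the number of trainable parameters is independent of the full matrix dimension $d$.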
Orthogonal matrices are parameterized via Cayley-Neumann Parameterization (CNP), which approximates the matrix inverse in the Cayley transform with a truncated Neumann series for numerical stability:

$$R = (I + Q)(I - Q)^{-1} \approx (I + Q)\Big(I + \sum_{k=1}^{K} Q^{k}\Big), \qquad Q = -Q^{\top}.$$
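A small numerical sketch of this construction, assuming a skew-symmetric $Q$ and truncation at $Q^2$ (the implementation's exact truncation order may differ):

```python
import torch

torch.manual_seed(0)
n = 16
# Skew-symmetric Q with small norm (merge-then-reinitialize keeps Q
# small in practice, so the truncated series stays accurate).
A = 0.001 * torch.randn(n, n, dtype=torch.float64)
Q = A - A.T

# Cayley transform with the inverse replaced by a truncated Neumann series:
# R = (I + Q)(I - Q)^{-1} ≈ (I + Q)(I + Q + Q^2)
I = torch.eye(n, dtype=torch.float64)
R = (I + Q) @ (I + Q + Q @ Q)

# R is orthogonal up to the truncation error, which is O(||Q||^3).
err = torch.linalg.norm(R.T @ R - I)
print(float(err) < 1e-4)  # True
```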
A merge-then-reinitialize trick periodically absorbs the learned orthogonal matrices into the weight matrix and resets them, so that $Q$ stays small and the truncated Neumann series remains accurate throughout training.
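The merge step might look like the following sketch (the function name and signature are hypothetical, for illustration only):

```python
import torch

torch.manual_seed(0)

@torch.no_grad()
def merge_and_reinitialize(W0, R, P):
    # Absorb the learned orthogonal factors into the base weight, then
    # reset them to identity. The effective weight R @ W0 @ P is
    # unchanged, but subsequent updates restart from small Q.
    W0_merged = R @ W0 @ P
    R_new = torch.eye(R.shape[0], dtype=W0.dtype)
    P_new = torch.eye(P.shape[0], dtype=W0.dtype)
    return W0_merged, R_new, P_new

W0 = torch.randn(5, 3, dtype=torch.float64)
R, _ = torch.linalg.qr(torch.randn(5, 5, dtype=torch.float64))
P, _ = torch.linalg.qr(torch.randn(3, 3, dtype=torch.float64))
W0m, Rn, Pn = merge_and_reinitialize(W0, R, P)
print(torch.allclose(Rn @ W0m @ Pn, R @ W0 @ P))  # True
```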
POET outperforms AdamW with significantly fewer trainable parameters across all LLaMA model sizes on C4.
| Method | Params | 60M PPL | 130M PPL | 350M PPL | 1.3B PPL |
|---|---|---|---|---|---|
| AdamW | Full | 26.68 | 20.82 | 16.78 | 14.73 |
| GaLore | Full | 29.81 | 22.35 | 17.99 | 18.33 |
| LoRA (r=64) | ~5% | 39.70 | 32.07 | 25.19 | 20.55 |
| POET-BS (b=128) | ~13% | 26.90 | 21.86 | 18.05 | 16.24 |
| POET-BS (b=256) | ~26% | 25.29 | 19.88 | 16.27 | 14.56 |
Quantitative comparison of validation perplexity
POET-FS (b=1/2) still outperforms AdamW even when AdamW is trained with ~3× more tokens.
POET-X is a scalable, memory-efficient variant of POET that makes orthogonal equivalence training practical at the billion-parameter scale.
The original POET must store the full transformed weight $R W_0 P$ for backpropagation, making it more memory-intensive than AdamW. POET-X resolves this through a suite of engineering innovations.
Latency breakdown: POET-X reduces forward+backward latency from 10.59ms (POET) to 1.38ms (POET-X_fast), approaching standard linear layers.
Memory breakdown for Llama-8B training on a single GPU. POET-X_mem achieves PEFT-level memory; POET runs OOM.
Llama-3B pretraining on 60B C4 tokens: POET-X achieves better PPL than AdamW and all memory-efficient baselines.
POET-XQ (quantized): Best PPL of 14.78 with minimal memory footprint, outperforming GaLore and APOLLO.
Training dynamics with different block sizes:
Validation PPL curves at block size b=256 (left) and b=1024 (right).
Peak GPU memory across model sizes (3B–13B) and sequence lengths: POET-X_mem outperforms all baselines including LoRA.
POET-X closely follows ideal linear scaling on 64× H100s, while AdamW (FSDP) plateaus due to communication overhead.
The core insight is an input-centric formulation: instead of materializing the full transformed weight $R W_0 P$, the orthogonal factors are applied directly to the layer inputs and outputs. This eliminates the extra full-size weight matrix per layer that the original POET had to store for backpropagation.
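The idea can be sketched as follows, assuming row-major activations so that a linear layer computes $y = x W^{\top}$ (naming is illustrative, not the repo's API):

```python
import torch

def poet_x_linear(x, W0, R, P):
    # Compute y = x @ (R @ W0 @ P).T without ever forming R @ W0 @ P:
    # (R @ W0 @ P).T = P.T @ W0.T @ R.T, applied to x one factor at a time.
    return ((x @ P.t()) @ W0.t()) @ R.t()

torch.manual_seed(0)
m, n, batch = 6, 4, 3
W0 = torch.randn(m, n, dtype=torch.float64)
R, _ = torch.linalg.qr(torch.randn(m, m, dtype=torch.float64))
P, _ = torch.linalg.qr(torch.randn(n, n, dtype=torch.float64))
x = torch.randn(batch, n, dtype=torch.float64)

y_ref = x @ (R @ W0 @ P).t()      # materializes the full weight
y = poet_x_linear(x, W0, R, P)    # input-centric, no full weight
print(torch.allclose(y, y_ref, atol=1e-10))  # True
```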
Four engineering innovations:

- Permutation Acceleration — Custom CUDA kernels for index-mapped permutations (up to 20× speedup).
- Permutation Reduction — Pre-computes permuted weights once per inner loop, eliminating redundant ops.
- Batch-Parallel Strategy — Treats each block of the block-diagonal $G_P$, $G_R$ as an independent batch element; avoids large sparse matrix construction.
- Fused Cayley-Neumann Kernels — A Triton kernel loads $Q$ and $Q^2$ into shared memory once for all terms; the backward pass is also fused.
Fused Cayley-Neumann parameterization: batch-wise implementation via Triton kernel fusion.
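The batch-parallel strategy can be illustrated with plain `einsum` as a simplified stand-in for the actual kernels (sizes are illustrative): the block-diagonal matrix is kept as a batch of small blocks, and each block acts on its own slice of the input, so the sparse $d \times d$ matrix is never constructed.

```python
import torch

torch.manual_seed(0)
d, b = 8, 4                 # full dimension, block size
nb = d // b                 # number of diagonal blocks

# Keep the block-diagonal orthogonal matrix as a (nb, b, b) batch of
# small blocks instead of a sparse d x d matrix.
blocks = torch.stack([
    torch.linalg.qr(torch.randn(b, b, dtype=torch.float64))[0]
    for _ in range(nb)
])

x = torch.randn(3, d, dtype=torch.float64)

# Batch-parallel application: each block transforms its own slice of x.
y_bp = torch.einsum('nij,knj->kni', blocks, x.view(3, nb, b)).reshape(3, d)

# Reference: materialize the dense block-diagonal matrix G.
G = torch.block_diag(*blocks)
y_ref = x @ G.t()
print(torch.allclose(y_bp, y_ref, atol=1e-10))  # True
```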
| Variant | Memory | Speed | Notes |
|---|---|---|---|
| POET-X_fast | Medium | Fast | Standard autograd, saves activations |
| POET-X_mem | Lowest | Moderate | Gradient checkpointing, recomputes |
| POET-XQ | Lowest | High throughput | INT8-quantized base weights, dequantized on the fly |
```bibtex
@article{qiu2025poet,
  title={Reparameterized LLM Training via Orthogonal Equivalence Transformation},
  author={Qiu, Zeju and Buchholz, Simon and Xiao, Tim Z. and Dax, Maximilian and Sch{\"o}lkopf, Bernhard and Liu, Weiyang},
  journal={arXiv preprint arXiv:2506.08001},
  year={2025}
}

@article{qiu2025poetx,
  title={POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation},
  author={Qiu, Zeju and Liu, Lixin and Weller, Adrian and Shi, Han and Liu, Weiyang},
  journal={arXiv preprint arXiv:2603.05500},
  year={2026}
}
```