
POET & POET-X for LLM Pretraining

Reparameterized LLM Training via Orthogonal Equivalence Transformation

Paper (NeurIPS 2025) · POET Project Page

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Paper · POET-X Project Page




Overview

This repository contains the official implementation of POET and POET-X — a family of reparameterized LLM training algorithms that optimize weight matrices through Orthogonal Equivalence Transformation (OET), achieving superior generalization with provably bounded weight spectra.

POET's three learning phases: conical shell searching → stable learning → final adjusting.


Installation

git clone https://github.com/Sphere-AI-Lab/poet.git
cd poet
pip install -e .

Requirements:

  • Python ≥ 3.10
  • PyTorch ≥ 2.7
  • CUDA ≥ 12.6
  • Triton ≥ 3.4.0

Data Preparation

The training scripts expect the C4 dataset at ./c4/en/ relative to the repo root. Run the following commands from the root of this repository:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
cd ..

This will create a c4/en/ folder directly inside the repo root, resulting in the following structure:

poet/
├── c4/
│   └── en/
│       ├── c4-train.00000-of-01024.json.gz
│       └── ...
├── torchrun_main.py
└── ...

Note: If no local data is found, the training script will automatically fall back to streaming the dataset directly from HuggingFace (allenai/c4), which requires an internet connection but no local storage.

Usage

Quick Start

# Pretrain LLaMA-3B with POET-X (block_size=512) on C4
bash scripts/benchmark_c4_poet/pretrain_poet_3b.sh

# Pretrain LLaMA-3B with POET-XQ (block_size=512) on C4
bash scripts/benchmark_c4_qpoet/pretrain_qpoet_3b.sh

POET

Method

POET reparameterizes each weight matrix as:

$$W_{RP} = R \, W_0 \, P$$

where $W_0 \in \mathbb{R}^{m \times n}$ is a fixed randomly initialized matrix, and $R \in \mathbb{R}^{m \times m}$, $P \in \mathbb{R}^{n \times n}$ are learnable orthogonal matrices. Training only updates $R$ and $P$, leaving $W_0$ unchanged.

Why orthogonal transformations? They preserve singular values exactly — giving POET direct, provable control over the weight spectrum throughout training.
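This property is easy to verify numerically. The sketch below (illustrative only, not part of the repository) builds random orthogonal factors via QR decompositions and checks that the reparameterized matrix $R W_0 P$ has exactly the same singular values as the fixed $W_0$:

```python
import torch

torch.manual_seed(0)
m, n = 8, 6
W0 = torch.randn(m, n)  # fixed, randomly initialized weight

# Random orthogonal factors from QR decompositions.
R = torch.linalg.qr(torch.randn(m, m))[0]
P = torch.linalg.qr(torch.randn(n, n))[0]

W_rp = R @ W0 @ P  # the POET reparameterization W_RP = R W0 P

s0 = torch.linalg.svdvals(W0)
s1 = torch.linalg.svdvals(W_rp)
print(torch.allclose(s0, s1, atol=1e-4))  # True: the spectrum is preserved
```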

Dynamics of singular values: POET (right) avoids the large singular value growth seen in standard AdamW training (left).

Spectral Diversity

POET maintains consistently higher SVD entropy (singular value diversity) throughout training compared to AdamW and Muon.

Efficient Approximation: Stochastic Primitive Optimization (SPO)

Large orthogonal matrices $R \in \mathbb{R}^{m \times m}$ are expensive to optimize naively. POET introduces two efficient variants:

  • POET-FS (Fully Stochastic SPO): Randomly samples a small $b \times b$ submatrix at each step. Highly parameter-efficient; decouples parameter count from matrix size.
  • POET-BS (Block-Stochastic SPO): Block-diagonal structure with random permutations; transforms all dimensions simultaneously. More expressive per parameter.
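A minimal sketch of the fully stochastic idea (illustrative names and shapes, not the repository's API): sample $b$ row indices, build a small $b \times b$ orthogonal block, and rotate only those rows of $W_0$. The implied full $m \times m$ rotation is identity elsewhere and hence still orthogonal, so the spectrum of $W_0$ is preserved:

```python
import torch

torch.manual_seed(0)
m, n, b = 16, 12, 4
W0 = torch.randn(m, n)

# Sample b of the m row indices uniformly at random.
idx = torch.randperm(m)[:b]

# Small orthogonal block acting only on the sampled rows; the implied
# full m x m rotation is identity everywhere else, hence orthogonal.
R_small = torch.linalg.qr(torch.randn(b, b))[0]

W1 = W0.clone()
W1[idx] = R_small @ W0[idx]

# Singular values are unchanged even though only b rows were touched.
print(torch.allclose(torch.linalg.svdvals(W0),
                     torch.linalg.svdvals(W1), atol=1e-4))  # True
```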

Weight update coverage: POET-BS achieves more even updates across all weight elements compared to POET-FS.

Orthogonal matrices are parameterized via Cayley-Neumann Parameterization (CNP), which approximates the matrix inverse using a truncated Neumann series for numerical stability:

$$R = (I + Q)(I - Q)^{-1} \approx (I + Q)\left(I + \sum_{i=1}^{k} Q^i\right)$$

A merge-then-reinitialize trick periodically absorbs $R, P$ into $W_0$, preventing error accumulation and keeping the Neumann series convergent.
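The CNP formula above can be sketched in a few lines, assuming $Q$ is skew-symmetric with small norm so the Neumann series converges (the helper below is illustrative, not the repository's implementation):

```python
import torch

def cayley_neumann(Q: torch.Tensor, k: int = 8) -> torch.Tensor:
    """R = (I + Q)(I - Q)^{-1}, with (I - Q)^{-1} ~ I + Q + ... + Q^k."""
    I = torch.eye(Q.shape[0], dtype=Q.dtype)
    series, Q_pow = I.clone(), I.clone()
    for _ in range(k):
        Q_pow = Q_pow @ Q        # accumulate Q^i terms of the Neumann series
        series = series + Q_pow
    return (I + Q) @ series

torch.manual_seed(0)
A = 0.02 * torch.randn(6, 6)
Q = A - A.T  # skew-symmetric; small norm keeps the Neumann series convergent
R = cayley_neumann(Q)
err = torch.linalg.norm(R.T @ R - torch.eye(6))
print(float(err))  # near zero: R is orthogonal up to truncation error
```

With exact inversion the Cayley transform of a skew-symmetric $Q$ is exactly orthogonal; truncating the series trades a small orthogonality error (shrunk further by the merge-then-reinitialize trick) for an inversion-free, GPU-friendly computation.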

Results

Validation perplexity vs. parameters: POET outperforms AdamW with significantly fewer trainable parameters across all LLaMA model sizes on C4.

| Method | Params | 60M PPL | 130M PPL | 350M PPL | 1.3B PPL |
|---|---|---|---|---|---|
| AdamW | Full | 26.68 | 20.82 | 16.78 | 14.73 |
| GaLore | Full | 29.81 | 22.35 | 17.99 | 18.33 |
| LoRA (r=64) | ~5% | 39.70 | 32.07 | 25.19 | 20.55 |
| POET-BS (b=128) | ~13% | 26.90 | 21.86 | 18.05 | 16.24 |
| POET-BS (b=256) | ~26% | 25.29 | 19.88 | 16.27 | 14.56 |

Quantitative comparison of validation perplexity

Training speedup: POET-FS (b=1/2) still outperforms AdamW even when AdamW is trained with ~3× more tokens.

POET-X

Overview

POET-X is a scalable, memory-efficient variant of POET that makes orthogonal equivalence training practical at the billion-parameter scale.

The original POET must store the full transformed weight $RW_0P$ for backpropagation, making it more memory-intensive than AdamW. POET-X resolves this through a suite of engineering innovations.

Key Results

Latency breakdown: POET-X_fast reduces forward+backward latency from 10.59 ms (POET) to 1.38 ms, approaching standard linear layers.

Memory breakdown for Llama-8B training on a single GPU. POET-X_mem achieves PEFT-level memory; POET runs OOM.

Pretraining Results

Llama-3B pretraining on 60B C4 tokens: POET-X achieves better PPL than AdamW and all memory-efficient baselines.

POET-XQ (quantized): Best PPL of 14.78 with minimal memory footprint, outperforming GaLore and APOLLO.

Training dynamics with different block sizes:

Validation PPL curves at block size b=256 (left) and b=1024 (right).

Memory Efficiency

Peak GPU memory across model sizes (3B–13B) and sequence lengths: POET-X_mem outperforms all baselines including LoRA.

Throughput & Distributed Scaling

POET-X closely follows ideal linear scaling on 64× H100s, while AdamW (FSDP) plateaus due to communication overhead.

Method: Key Optimizations

The core insight is an input-centric formulation that avoids materializing the full $m \times n$ transformed weight:

$$z = \underbrace{\Phi_n G_P^\top \Phi_n^\top}_{P^\top} W \underbrace{\Phi_m G_R^\top \Phi_m^\top}_{R^\top} x$$

This reduces complexity from $O(nm^2)$ to a sequence of matrix-vector products.
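One side of this factorization can be sketched in dense PyTorch (illustrative only; the repository uses fused kernels). Assuming the permutation $\Phi_m$ is stored as an index map and $G_R$ as a stack of orthogonal blocks, the chained evaluation of $\Phi_m G_R^\top \Phi_m^\top x$ matches the explicit dense product without ever forming the full rotation:

```python
import torch

def block_diag_matvec(blocks: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Multiply by a block-diagonal matrix stored as (num_blocks, b, b)."""
    nb, b, _ = blocks.shape
    return torch.bmm(blocks, v.view(nb, b, 1)).reshape(-1)

torch.manual_seed(0)
m, b = 8, 4
W = torch.randn(m, m)  # stands in for the fixed weight
x = torch.randn(m)

perm = torch.randperm(m)                             # Phi_m as an index map
G_R = torch.linalg.qr(torch.randn(m // b, b, b))[0]  # orthogonal blocks

# Input-centric chain: Phi_m^T x -> G_R^T (.) -> Phi_m (.) -> W (.)
h = torch.empty_like(x).index_copy_(0, perm, x)      # Phi_m^T x
h = block_diag_matvec(G_R.transpose(1, 2), h)        # G_R^T h
h = h[perm]                                          # Phi_m h
z = W @ h

# Dense reference: explicitly build Phi_m and the block-diagonal rotation.
Phi = torch.eye(m)[perm]
Rd = torch.block_diag(*G_R)
z_ref = W @ (Phi @ Rd.T @ Phi.T @ x)
print(torch.allclose(z, z_ref, atol=1e-4))  # True
```

Each step is a gather or a batched small matmul, which is what collapses the $O(nm^2)$ dense cost to cheap matrix-vector products.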

Four engineering innovations:

  1. Permutation Acceleration — Custom CUDA kernels for index-mapped permutations (up to 20× speedup).
  2. Permutation Reduction — Pre-computes permuted weights once per inner loop, eliminating redundant ops.
  3. Batch-Parallel Strategy — Treats each block of block-diagonal $G_P$, $G_R$ as an independent batch element; avoids large sparse matrix construction.
  4. Fused Cayley-Neumann Kernels — Triton kernel loads $Q$ and $Q^2$ into shared memory once for all terms; backward pass also fused.
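Innovation 1 in miniature (a hypothetical PyTorch sketch; the repository uses custom CUDA kernels): an index-mapped gather applies a permutation with $O(m)$ memory traffic, versus an $O(m^2)$ dense matrix multiply, while producing identical results:

```python
import torch

torch.manual_seed(0)
m = 1024
x = torch.randn(m)
perm = torch.randperm(m)

# Dense route: materialize the permutation matrix and multiply.
Phi = torch.eye(m)[perm]
y_dense = Phi @ x

# Index-mapped route: a single gather, no matrix ever built.
y_gather = x[perm]

print(torch.allclose(y_dense, y_gather))  # True
```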

Fused Cayley-Neumann parameterization: batch-wise implementation via Triton kernel fusion.

POET-X Variants

| Variant | Memory | Speed | Notes |
|---|---|---|---|
| POET-X_fast | Medium | Fast | Standard autograd, saves activation $b$ |
| POET-X_mem | Lowest | Moderate | Gradient checkpointing, recomputes $b$ on-the-fly |
| POET-XQ | Lowest | High throughput | INT8 quantized base weights, dequantized on-the-fly |

Citation

@article{qiu2025poet,
  title={Reparameterized LLM Training via Orthogonal Equivalence Transformation},
  author={Qiu, Zeju and Buchholz, Simon and Xiao, Tim Z. and Dax, Maximilian and Sch{\"o}lkopf, Bernhard and Liu, Weiyang},
  journal={arXiv preprint arXiv:2506.08001},
  year={2025}
}

@article{qiu2025poetx,
  title={POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation},
  author={Qiu, Zeju and Liu, Lixin and Weller, Adrian and Shi, Han and Liu, Weiyang},
  journal={arXiv preprint arXiv:2603.05500},
  year={2026}
}

Related Work

  • OFT — Orthogonal Finetuning for diffusion models
  • GaLore — Gradient low-rank projection
  • Muon — Gradient orthogonalization optimizer
