
POET & POET-X for LLM Pretraining

Reparameterized LLM Training via Orthogonal Equivalence Transformation

Paper (NeurIPS 2025) · POET Project Page

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Paper · POET-X Project Page




Overview

This repository contains the official implementation of POET and POET-X — a family of reparameterized LLM training algorithms that optimize weight matrices through Orthogonal Equivalence Transformation (OET), achieving superior generalization with provably bounded weight spectra.

POET's three learning phases: conical shell searching → stable learning → final adjusting.


Installation

git clone https://github.com/Sphere-AI-Lab/poet.git
cd poet
pip install -e .

Requirements:

  • Python ≥ 3.10
  • PyTorch ≥ 2.7
  • CUDA ≥ 12.6
  • Triton ≥ 3.4.0

Data Preparation

The training scripts expect the C4 dataset at ./c4/en/ relative to the repo root. Run the following commands from the root of this repository:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
cd ..

This will create a c4/en/ folder directly inside the repo root, resulting in the following structure:

poet/
├── c4/
│   └── en/
│       ├── c4-train.00000-of-01024.json.gz
│       └── ...
├── torchrun_main.py
└── ...

Note: If no local data is found, the training script will automatically fall back to streaming the dataset directly from HuggingFace (allenai/c4), which requires an internet connection but no local storage.

Usage

Quick Start

# Pretrain LLaMA-3B with POET-X (block_size=512) on C4
bash scripts/benchmark_c4_poet/pretrain_poet_3b.sh

# Pretrain LLaMA-3B with POET-XQ (block_size=512) on C4
bash scripts/benchmark_c4_qpoet/pretrain_qpoet_3b.sh

POET

Method

POET reparameterizes each weight matrix as:

$$W_{RP} = R \, W_0 \, P$$

where $W_0 \in \mathbb{R}^{m \times n}$ is a fixed randomly initialized matrix, and $R \in \mathbb{R}^{m \times m}$, $P \in \mathbb{R}^{n \times n}$ are learnable orthogonal matrices. Training only updates $R$ and $P$, leaving $W_0$ unchanged.

Why orthogonal transformations? They preserve singular values exactly — giving POET direct, provable control over the weight spectrum throughout training.
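This property is easy to verify numerically. The sketch below (illustrative only, not part of the repository) builds random orthogonal factors via QR decompositions and checks that the reparameterized matrix $R W_0 P$ has exactly the same singular values as the fixed $W_0$:

```python
import torch

torch.manual_seed(0)
m, n = 8, 6
W0 = torch.randn(m, n)  # fixed, randomly initialized weight

# Random orthogonal factors from QR decompositions.
R = torch.linalg.qr(torch.randn(m, m))[0]
P = torch.linalg.qr(torch.randn(n, n))[0]

W_rp = R @ W0 @ P  # the POET reparameterization W_RP = R W0 P

s0 = torch.linalg.svdvals(W0)
s1 = torch.linalg.svdvals(W_rp)
print(torch.allclose(s0, s1, atol=1e-4))  # True: the spectrum is preserved
```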

Dynamics of singular values: POET (right) avoids the large singular value growth seen in standard AdamW training (left).

Spectral Diversity

POET maintains consistently higher SVD entropy (singular value diversity) throughout training compared to AdamW and Muon.

Efficient Approximation: Stochastic Primitive Optimization (SPO)

Large orthogonal matrices $R \in \mathbb{R}^{m \times m}$ are expensive to optimize naively. POET introduces two efficient variants:

  • POET-FS (Fully Stochastic SPO): Randomly samples a small $b \times b$ submatrix at each step. Highly parameter-efficient; decouples parameter count from matrix size.
  • POET-BS (Block-Stochastic SPO): Block-diagonal structure with random permutations; transforms all dimensions simultaneously. More expressive per parameter.
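A minimal sketch of the fully stochastic idea (illustrative names and shapes, not the repository's API): sample $b$ row indices, build a small $b \times b$ orthogonal block, and rotate only those rows of $W_0$. The implied full $m \times m$ rotation is identity elsewhere and hence still orthogonal, so the spectrum of $W_0$ is preserved:

```python
import torch

torch.manual_seed(0)
m, n, b = 16, 12, 4
W0 = torch.randn(m, n)

# Sample b of the m row indices uniformly at random.
idx = torch.randperm(m)[:b]

# Small orthogonal block acting only on the sampled rows; the implied
# full m x m rotation is identity everywhere else, hence orthogonal.
R_small = torch.linalg.qr(torch.randn(b, b))[0]

W1 = W0.clone()
W1[idx] = R_small @ W0[idx]

# Singular values are unchanged even though only b rows were touched.
print(torch.allclose(torch.linalg.svdvals(W0),
                     torch.linalg.svdvals(W1), atol=1e-4))  # True
```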

Weight update coverage: POET-BS achieves more even updates across all weight elements compared to POET-FS.

Orthogonal matrices are parameterized via Cayley-Neumann Parameterization (CNP), which approximates the matrix inverse using a truncated Neumann series for numerical stability:

$$R = (I + Q)(I - Q)^{-1} \approx (I + Q)\left(I + \sum_{i=1}^{k} Q^i\right)$$

A merge-then-reinitialize trick periodically absorbs $R, P$ into $W_0$, preventing error accumulation and keeping the Neumann series convergent.
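The CNP formula above can be sketched in a few lines, assuming $Q$ is skew-symmetric with small norm so the Neumann series converges (the helper below is illustrative, not the repository's implementation):

```python
import torch

def cayley_neumann(Q: torch.Tensor, k: int = 8) -> torch.Tensor:
    """R = (I + Q)(I - Q)^{-1}, with (I - Q)^{-1} ~ I + Q + ... + Q^k."""
    I = torch.eye(Q.shape[0], dtype=Q.dtype)
    series, Q_pow = I.clone(), I.clone()
    for _ in range(k):
        Q_pow = Q_pow @ Q        # accumulate Q^i terms of the Neumann series
        series = series + Q_pow
    return (I + Q) @ series

torch.manual_seed(0)
A = 0.02 * torch.randn(6, 6)
Q = A - A.T  # skew-symmetric; small norm keeps the Neumann series convergent
R = cayley_neumann(Q)
err = torch.linalg.norm(R.T @ R - torch.eye(6))
print(float(err))  # near zero: R is orthogonal up to truncation error
```

With exact inversion the Cayley transform of a skew-symmetric $Q$ is exactly orthogonal; truncating the series trades a small orthogonality error (shrunk further by the merge-then-reinitialize trick) for an inversion-free, GPU-friendly computation.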

Results

Validation perplexity vs. parameters: POET outperforms AdamW with significantly fewer trainable parameters across all LLaMA model sizes on C4.

| Method | Params | 60M PPL | 130M PPL | 350M PPL | 1.3B PPL |
|---|---|---|---|---|---|
| AdamW | Full | 26.68 | 20.82 | 16.78 | 14.73 |
| GaLore | Full | 29.81 | 22.35 | 17.99 | 18.33 |
| LoRA (r=64) | ~5% | 39.70 | 32.07 | 25.19 | 20.55 |
| POET-BS (b=128) | ~13% | 26.90 | 21.86 | 18.05 | 16.24 |
| POET-BS (b=256) | ~26% | 25.29 | 19.88 | 16.27 | 14.56 |

Quantitative comparison of validation perplexity

Training speedup: POET-FS (b=1/2) still outperforms AdamW even when AdamW is trained with ~3× more tokens.

POET-X

Overview

POET-X is a scalable, memory-efficient variant of POET that makes orthogonal equivalence training practical at the billion-parameter scale.

The original POET must store the full transformed weight $RW_0P$ for backpropagation, making it more memory-intensive than AdamW. POET-X resolves this through a suite of engineering innovations.

Key Results

Latency breakdown: POET-X_fast reduces forward+backward latency from 10.59 ms (POET) to 1.38 ms, approaching standard linear layers.

Memory breakdown for Llama-8B training on a single GPU. POET-X_mem achieves PEFT-level memory; POET runs OOM.

Pretraining Results

Llama-3B pretraining on 60B C4 tokens: POET-X achieves better PPL than AdamW and all memory-efficient baselines.

POET-XQ (quantized): Best PPL of 14.78 with minimal memory footprint, outperforming GaLore and APOLLO.

Training dynamics with different block sizes:

Validation PPL curves at block size b=256 (left) and b=1024 (right).

Memory Efficiency

Peak GPU memory across model sizes (3B–13B) and sequence lengths: POET-X_mem outperforms all baselines including LoRA.

Throughput & Distributed Scaling

POET-X closely follows ideal linear scaling on 64× H100s, while AdamW (FSDP) plateaus due to communication overhead.

Method: Key Optimizations

The core insight is an input-centric formulation that avoids materializing the full $m \times n$ transformed weight:

$$z = \underbrace{\Phi_n G_P^\top \Phi_n^\top}_{P^\top} W \underbrace{\Phi_m G_R^\top \Phi_m^\top}_{R^\top} x$$

This reduces complexity from $O(nm^2)$ to a sequence of matrix-vector products.
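One side of this factorization can be sketched in dense PyTorch (illustrative only; the repository uses fused kernels). Assuming the permutation $\Phi_m$ is stored as an index map and $G_R$ as a stack of orthogonal blocks, the chained evaluation of $\Phi_m G_R^\top \Phi_m^\top x$ matches the explicit dense product without ever forming the full rotation:

```python
import torch

def block_diag_matvec(blocks: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Multiply by a block-diagonal matrix stored as (num_blocks, b, b)."""
    nb, b, _ = blocks.shape
    return torch.bmm(blocks, v.view(nb, b, 1)).reshape(-1)

torch.manual_seed(0)
m, b = 8, 4
W = torch.randn(m, m)  # stands in for the fixed weight
x = torch.randn(m)

perm = torch.randperm(m)                             # Phi_m as an index map
G_R = torch.linalg.qr(torch.randn(m // b, b, b))[0]  # orthogonal blocks

# Input-centric chain: Phi_m^T x -> G_R^T (.) -> Phi_m (.) -> W (.)
h = torch.empty_like(x).index_copy_(0, perm, x)      # Phi_m^T x
h = block_diag_matvec(G_R.transpose(1, 2), h)        # G_R^T h
h = h[perm]                                          # Phi_m h
z = W @ h

# Dense reference: explicitly build Phi_m and the block-diagonal rotation.
Phi = torch.eye(m)[perm]
Rd = torch.block_diag(*G_R)
z_ref = W @ (Phi @ Rd.T @ Phi.T @ x)
print(torch.allclose(z, z_ref, atol=1e-4))  # True
```

Each step is a gather or a batched small matmul, which is what collapses the $O(nm^2)$ dense cost to cheap matrix-vector products.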

Four engineering innovations:

  1. Permutation Acceleration — Custom CUDA kernels for index-mapped permutations (up to 20× speedup).
  2. Permutation Reduction — Pre-computes permuted weights once per inner loop, eliminating redundant ops.
  3. Batch-Parallel Strategy — Treats each block of block-diagonal $G_P$, $G_R$ as an independent batch element; avoids large sparse matrix construction.
  4. Fused Cayley-Neumann Kernels — Triton kernel loads $Q$ and $Q^2$ into shared memory once for all terms; backward pass also fused.
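Innovation 1 in miniature (a hypothetical PyTorch sketch; the repository uses custom CUDA kernels): an index-mapped gather applies a permutation with $O(m)$ memory traffic, versus an $O(m^2)$ dense matrix multiply, while producing identical results:

```python
import torch

torch.manual_seed(0)
m = 1024
x = torch.randn(m)
perm = torch.randperm(m)

# Dense route: materialize the permutation matrix and multiply.
Phi = torch.eye(m)[perm]
y_dense = Phi @ x

# Index-mapped route: a single gather, no matrix ever built.
y_gather = x[perm]

print(torch.allclose(y_dense, y_gather))  # True
```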

Fused Cayley-Neumann parameterization: batch-wise implementation via Triton kernel fusion.

POET-X Variants

| Variant | Memory | Speed | Notes |
|---|---|---|---|
| POET-X_fast | Medium | Fast | Standard autograd, saves activation $b$ |
| POET-X_mem | Lowest | Moderate | Gradient checkpointing, recomputes $b$ on-the-fly |
| POET-XQ | Lowest | High throughput | INT8 quantized base weights, dequantized on-the-fly |

Citation

@article{qiu2025poet,
  title={Reparameterized LLM Training via Orthogonal Equivalence Transformation},
  author={Qiu, Zeju and Buchholz, Simon and Xiao, Tim Z. and Dax, Maximilian and Sch{\"o}lkopf, Bernhard and Liu, Weiyang},
  journal={arXiv preprint arXiv:2506.08001},
  year={2025}
}

@article{qiu2025poetx,
  title={POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation},
  author={Qiu, Zeju and Liu, Lixin and Weller, Adrian and Shi, Han and Liu, Weiyang},
  journal={arXiv preprint arXiv:2603.05500},
  year={2026}
}

Related Work

  • OFT — Orthogonal Finetuning for diffusion models
  • GaLore — Gradient low-rank projection
  • Muon — Gradient orthogonalization optimizer
