Train on fewer model-visible tokens. Decode exactly back to text. Measure the real efficiency gain with bits per original byte, not token-count optimism.
Compress the training sequence. Keep the original text recoverable. Test whether the model truly wins.
CoordCompressCodecLM is a research framework for training and evaluating language models on lossless recursive coordinate tokens.
It asks one precise question:
Can a model process fewer target tokens for the same original text without paying a larger predictive cost?
The system learns a normal byte-level BPE tokenizer, folds recurring token pairs into atomic coordinates, trains matched baseline and coordinate models, and decodes generated coordinates back to ordinary text.
- 67.8% token reduction on a 10k Apache-log preparation run
- 49.0% token reduction in the TinyStories 100k L2 research gate
- Exact coordinate decode back to the original tokenizer IDs
text -> BPE IDs -> L1 coordinates -> L2 coordinates -> language model
language model -> inverse L2 -> inverse L1 -> BPE decode -> text
This is not prompt summarization and it is not a lossy latent code. Every coordinate has an exact expansion:
C4096 -> (token 81, token 19)
C8192 -> (C4096, C4096)
The decoder recursively applies those definitions until only ordinary BPE IDs remain. Exact roundtrip is tested before a corpus is accepted.
One coordinate can therefore carry several baseline tokens while remaining one atomic prediction step for the model.
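The recursive expansion can be sketched in a few lines. This is a minimal illustration, not the repository's decoder: the `expand` function and the plain dictionary of definitions are hypothetical, and the sketch assumes that the coordinates written as C4096 and C8192 are stored under the integer IDs 4096 and 8192.

```python
from typing import Dict, List, Tuple


def expand(ids: List[int], definitions: Dict[int, Tuple[int, int]]) -> List[int]:
    """Expand coordinates until only ordinary BPE IDs remain."""
    out: List[int] = []
    stack = list(reversed(ids))  # explicit stack instead of deep recursion
    while stack:
        current = stack.pop()
        if current in definitions:
            left, right = definitions[current]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(current)
    return out


# Assumed ID layout for illustration only.
definitions = {
    4096: (81, 19),      # L1 coordinate over two base tokens
    8192: (4096, 4096),  # L2 coordinate over two L1 coordinates
}

decoded = expand([8192, 42], definitions)
print(decoded)  # [81, 19, 81, 19, 42]
```

Here a single model-visible ID, 8192, stands for four baseline tokens, which is the sense in which one prediction step can carry several tokens.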
The repository separates representation wins from modeling wins.
| experiment | baseline tokens | coordinate tokens | ratio | saved | BPB result | exact |
|---|---|---|---|---|---|---|
| Apache logs, 10k rows, current implementation | 47,860 | 15,431 | 0.322 | 67.8% | preparation only | yes |
| TinyStories 100k, L2 16k research run | 404,056 | 206,215 | 0.510 | 49.0% | representation only | yes |
| TinyStories 10k, paired native LM | 209,377 | 132,145 | 0.631 | 36.9% | 1.7% worse | yes |
The paired 10k model generated 7.5% more decoded bytes per second in that run,
but used a larger vocabulary and trained more slowly per optimizer step. That is
promising evidence, not a claim that coordinate training already beats BPE in
every metric.
All published values and their status are recorded in
results/validated_results.json.
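The ratio and saved columns follow directly from the two token counts in each row. The short check below recomputes them from the published counts; the `reduction` helper is a hypothetical name used only for this illustration.

```python
def reduction(baseline_tokens: int, coordinate_tokens: int) -> tuple[float, float]:
    """Return (ratio, saved) for one row of the results table."""
    ratio = coordinate_tokens / baseline_tokens
    return ratio, 1.0 - ratio


# Token counts copied from the table above.
rows = {
    "Apache logs, 10k rows": (47_860, 15_431),
    "TinyStories 100k, L2 16k": (404_056, 206_215),
    "TinyStories 10k, paired native LM": (209_377, 132_145),
}

for name, (baseline, coordinate) in rows.items():
    ratio, saved = reduction(baseline, coordinate)
    print(f"{name}: ratio={ratio:.3f} saved={saved:.1%}")
```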
Token reduction alone is insufficient. A coordinate can shorten the sequence while making the next symbol harder to predict.
bits per original byte = tokens per original byte * nats per token / ln(2)
A coordinate model wins predictive efficiency only when its held-out bits per original byte is lower than the baseline on the same original bytes.
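The formula above can be checked with a small calculation. The function name and all numbers below are illustrative assumptions, not measured results; they are chosen to show a case where the sequence is 40% shorter yet the coordinate model still loses on bits per original byte.

```python
import math


def bits_per_original_byte(
    num_tokens: int, original_bytes: int, mean_nats_per_token: float
) -> float:
    """tokens per original byte * nats per token / ln(2)."""
    tokens_per_byte = num_tokens / original_bytes
    return tokens_per_byte * mean_nats_per_token / math.log(2)


def break_even_nats_factor(token_ratio: float) -> float:
    """How much nats per token may grow before the shorter sequence stops winning."""
    return 1.0 / token_ratio


# Made-up numbers: same original bytes, 40% fewer coordinate tokens.
original_bytes = 1_000
baseline = bits_per_original_byte(250, original_bytes, mean_nats_per_token=2.0)
coordinate = bits_per_original_byte(150, original_bytes, mean_nats_per_token=3.5)

print(f"baseline BPB:   {baseline:.4f}")
print(f"coordinate BPB: {coordinate:.4f}")
print("coordinate wins" if coordinate < baseline else "baseline wins or ties")
print(f"break-even nats factor at ratio 0.6: {break_even_nats_factor(0.6):.3f}")
```

In this made-up case the per-token loss grows by a factor of 1.75, which exceeds the break-even factor of about 1.667, so the shorter sequence does not pay off.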
The implementation keeps five artifacts independent:
| artifact | purpose |
|---|---|
| tokenizer.json | shared byte-level BPE tokenizer |
| universe.json | exact recursive coordinate definitions |
| *.ids | compact prepared baseline and coordinate streams |
| model.pt | model weights |
| report.json | configuration, losses, timing, and BPB |
Coordinate discovery is memory-bounded. Each level counts adjacent pairs, selects recurring candidates, folds the corpus, and repeats on the new stream. It does not materialize every possible n-gram.
Special document boundary IDs are never folded. The model can therefore learn and emit a normal explicit end-of-text token.
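The level-by-level discovery loop and the protection of boundary IDs can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: `fold_level`, the frequency-ranked selection, the greedy left-to-right folding, and the `EOT` ID of 0 are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]


def fold_level(
    documents: List[List[int]],
    next_id: int,
    max_nodes: int,
    min_count: int,
    protected: Set[int],
) -> Tuple[List[List[int]], Dict[int, Pair]]:
    """Build one coordinate level: count pairs, select, fold."""
    # Count adjacent pairs only; no longer n-grams are materialized.
    counts: Counter = Counter()
    for doc in documents:
        for left, right in zip(doc, doc[1:]):
            if left not in protected and right not in protected:
                counts[(left, right)] += 1

    # Keep the most frequent recurring pairs as new coordinates.
    pair_to_id: Dict[Pair, int] = {}
    for pair, count in counts.most_common(max_nodes):
        if count < min_count:
            break
        pair_to_id[pair] = next_id + len(pair_to_id)

    # Fold greedily from left to right without overlapping replacements.
    folded: List[List[int]] = []
    for doc in documents:
        out: List[int] = []
        i = 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if pair in pair_to_id:
                out.append(pair_to_id[pair])
                i += 2
            else:
                out.append(doc[i])
                i += 1
        folded.append(out)

    return folded, {cid: pair for pair, cid in pair_to_id.items()}


EOT = 0  # hypothetical end-of-text ID; it is never folded
docs = [[81, 19, 81, 19, 42, 7, EOT], [81, 19, 81, 19, 42, 7, EOT]]

level1, defs1 = fold_level(docs, 100, max_nodes=128, min_count=2, protected={EOT})
level2, defs2 = fold_level(
    level1, 100 + len(defs1), max_nodes=128, min_count=2, protected={EOT}
)
definitions = {**defs1, **defs2}


def expand(ids):
    """Recursively expand coordinates back to the original IDs."""
    out = []
    for token in ids:
        out.extend(expand(definitions[token]) if token in definitions else [token])
    return out


# Roundtrip check: the folded corpus must expand to the original exactly.
assert [expand(doc) for doc in level2] == docs
print(len(docs[0]), "->", len(level1[0]), "->", len(level2[0]), "tokens per document")
```

Each level only ever holds a counter of adjacent pairs from the current stream, which is what keeps memory bounded, and the boundary ID survives both levels as an ordinary token.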
Python 3.10+ is required.
python -m venv .venv
source .venv/bin/activate
pip install -e .

Hugging Face datasets are optional:

pip install -e ".[datasets]"

Prepare one paired corpus:

python prepare.py \
--input examples/data/mini_stories.jsonl \
--output artifacts/mini_stories \
--base-vocab-size 512 \
--coordinate-layers 2 \
--nodes-per-layer 256 \
--coordinate-min-count 2

Train matched models:

python train.py \
--data artifacts/mini_stories \
--config configs/smoke.json \
--representation baseline \
--output checkpoints/baseline
python train.py \
--data artifacts/mini_stories \
--config configs/smoke.json \
--representation coordinate \
--output checkpoints/coordinate

Compare them:

python compare.py \
--baseline checkpoints/baseline/report.json \
--coordinate checkpoints/coordinate/report.json

Generate model-visible IDs and decoded text:

python generate.py \
--checkpoint checkpoints/coordinate \
--prompt "Mina carried a red kite" \
--max-new-tokens 120

Evaluate BPB and decoded throughput:

python evaluate.py \
--checkpoint checkpoints/coordinate \
--data artifacts/mini_stories \
--prompt "The little robot found" \
--max-new-tokens 160

Prepare TinyStories without changing the training code:

python prepare.py \
--hf-dataset roneneldan/TinyStories \
--hf-split train \
--text-field text \
--limit 100000 \
--output artifacts/tinystories_100k \
--base-vocab-size 4096 \
--coordinate-layers 2 \
--nodes-per-layer 16000 \
--coordinate-min-count 4

The prepared tokenizer and universe are cached in the output directory. Subsequent training runs load them directly and never rebuild the coordinate vocabulary.

from coordcompresscodeclm import (
    CorpusPreparer,
    LanguageModelTrainer,
    PreparedCorpus,
    TrainConfig,
)

# train_build, train_selection, and validation are document collections
# supplied by the caller; they are not defined in this snippet.
preparer = CorpusPreparer(
    base_vocab_size=4096,
    coordinate_layers=2,
    nodes_per_layer=16000,
)
report = preparer.prepare(
    build_documents=train_build,
    selection_documents=train_selection,
    validation_documents=validation,
    output_dir="artifacts/corpus",
)

config = TrainConfig.from_file("configs/reference.json")
trainer = LanguageModelTrainer(
    PreparedCorpus("artifacts/corpus"),
    config,
    "checkpoints/coordinate",
)
training_report = trainer.train()

Lower-level coordinate usage:

from coordcompresscodeclm import UniverseBuilder

# Each document is a list of base tokenizer IDs.
documents = [
    [81, 19, 81, 19, 42, 7],
    [81, 19, 81, 19, 42, 7],
]

universe = UniverseBuilder(
    layers=2,
    nodes_per_layer=128,
    min_count=2,
    selection_min_count=0,
).fit(documents, base_vocab_size=100)

# Encoding followed by decoding must reproduce the input exactly.
encoded = universe.encode(documents[0])
restored = universe.decode(encoded)
assert restored == documents[0]

CoordCompressCodecLM/
├── src/ library code
├── tests/ exactness and training smoke tests
├── configs/ reproducible model configurations
├── examples/ small local corpus
├── docs/ method, protocol, findings, and limitations
├── results/ curated machine-readable results
├── prepare.py freeze a paired corpus
├── train.py train or resume one model
├── evaluate.py measure BPB and decoded throughput
├── generate.py inspect IDs and decoded output
└── compare.py compare matched reports
What is established:
- Coordinate encoding and decoding are lossless.
- Recursive layers can reduce model-visible token count substantially.
- A 49-52% TinyStories token reduction has been demonstrated.
- Semantic initialization makes new coordinate embeddings easier to learn.
- Decoded bytes per second is the correct generation-throughput metric.
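
The throughput metric in the last item can be measured along the following lines. This is a sketch under assumptions: `decoded_bytes_per_second` and the two stub callables are hypothetical stand-ins for a real generation loop and coordinate decoder, and including the decode step in the timed region is a choice made here rather than something the list above specifies.

```python
import time
from typing import Callable, List


def decoded_bytes_per_second(
    generate_ids: Callable[[], List[int]],
    decode_to_text: Callable[[List[int]], str],
) -> float:
    """Time generation plus decoding and report UTF-8 bytes per second."""
    start = time.perf_counter()
    ids = generate_ids()
    text = decode_to_text(ids)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(text.encode("utf-8")) / max(elapsed, 1e-9)


def fake_generate() -> List[int]:
    """Stub generator: pretends the model needs 10 ms to emit three IDs."""
    time.sleep(0.01)
    return [1, 2, 3]


def fake_decode(ids: List[int]) -> str:
    """Stub decoder: pretends every ID expands to four bytes of text."""
    return "abcd" * len(ids)


rate = decoded_bytes_per_second(fake_generate, fake_decode)
print(f"{rate:.0f} decoded bytes per second")
```

Counting bytes of decoded text rather than emitted IDs keeps a baseline model and a coordinate model comparable, since one coordinate ID may expand to several baseline tokens.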
What remains open:
- Matching or beating baseline BPB reliably across datasets and model sizes.
- Reducing the softmax and embedding cost of large coordinate vocabularies.
- Establishing quality parity with long, compute-matched training runs.
- Learning transferable coordinate universes across domains.
See docs/findings.md for the evidence and docs/limitations.md for the failure modes.
CoordCompressCodec provides the general-purpose lossless coordinate file codec. This repository studies the separate language-model question: using reversible coordinates as the model's training and generation vocabulary.
CoordCompressCodecLM is released under the CoordCompressCodec Non-Commercial License.
Non-commercial use is permitted. Commercial use requires prior written permission from Moustapha Oumar.