CoordCompressCodecLM

Lossless coordinate tokens for faster language-model training research.

Train on fewer model-visible tokens. Decode exactly back to text. Measure the real efficiency gain with bits per original byte, not token-count optimism.

Compress the training sequence. Keep the original text recoverable. Test whether the model truly wins.

CoordCompressCodecLM is a research framework for training and evaluating language models on lossless recursive coordinate tokens.

It asks one precise question:

Can a model process fewer target tokens for the same original text without paying a larger predictive cost?

The system learns a normal byte-level BPE tokenizer, folds recurring token pairs into atomic coordinates, trains matched baseline and coordinate models, and decodes generated coordinates back to ordinary text.

67.8% token reduction on a 10k-row Apache-log preparation run

49.0% token reduction in the TinyStories 100k L2 research gate

Exact coordinate decode back to the original tokenizer IDs

text -> BPE IDs -> L1 coordinates -> L2 coordinates -> language model
language model -> inverse L2 -> inverse L1 -> BPE decode -> text

What Is Different

This is not prompt summarization and it is not a lossy latent code. Every coordinate has an exact expansion:

C4096 -> (token 81, token 19)
C8192 -> (C4096, C4096)

The decoder recursively applies those definitions until only ordinary BPE IDs remain. Exact roundtrip is tested before a corpus is accepted.

One coordinate can therefore carry several baseline tokens while remaining one atomic prediction step for the model.
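
The recursive expansion described above can be sketched in a few lines. This is a minimal illustration, not the repository's decoder: the `DEFINITIONS` mapping and the `expand` function are hypothetical names, and the two entries simply mirror the C4096 and C8192 examples.

```python
from typing import Dict, List, Tuple

# Hypothetical universe: coordinate ID -> the exact pair it expands to.
# Any ID without an entry is treated as an ordinary BPE token ID.
DEFINITIONS: Dict[int, Tuple[int, int]] = {
    4096: (81, 19),
    8192: (4096, 4096),
}


def expand(ids: List[int], definitions: Dict[int, Tuple[int, int]]) -> List[int]:
    """Expand coordinates until only base token IDs remain."""
    out: List[int] = []
    stack = list(reversed(ids))  # pop() then yields IDs in reading order
    while stack:
        token = stack.pop()
        if token in definitions:
            left, right = definitions[token]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(token)
    return out


decoded = expand([8192, 42], DEFINITIONS)
print(decoded)  # [81, 19, 81, 19, 42]
```

Here a single coordinate, 8192, stands for four baseline tokens. An explicit stack is used instead of recursion so that deep universes cannot hit Python's recursion limit.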

Current Evidence

The repository separates representation wins from modeling wins.

experiment	baseline tokens	coordinate tokens	ratio (coordinate / baseline)	tokens saved	BPB result	exact roundtrip
Apache logs, 10k rows, current implementation	47,860	15,431	0.322	67.8%	preparation only	yes
TinyStories 100k, L2 16k research run	404,056	206,215	0.510	49.0%	representation only	yes
TinyStories 10k, paired native LM	209,377	132,145	0.631	36.9%	1.7% worse	yes
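
The ratio and saved columns follow directly from the two token counts. The short check below recomputes them from the table; `token_reduction` is a hypothetical helper written for this illustration, not a function from the package.

```python
def token_reduction(baseline_tokens: int, coordinate_tokens: int) -> tuple[float, float]:
    """Return (coordinate/baseline ratio, fraction of tokens saved)."""
    ratio = coordinate_tokens / baseline_tokens
    return ratio, 1.0 - ratio


# Token counts copied from the table above.
rows = {
    "Apache logs, 10k rows": (47_860, 15_431),
    "TinyStories 100k, L2 16k": (404_056, 206_215),
    "TinyStories 10k, paired native LM": (209_377, 132_145),
}

for name, (baseline, coordinate) in rows.items():
    ratio, saved = token_reduction(baseline, coordinate)
    print(f"{name}: ratio={ratio:.3f} saved={saved:.1%}")
```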

The paired 10k model generated 7.5% more decoded bytes per second in that run, but used a larger vocabulary and trained more slowly per optimizer step. That is promising evidence, not a claim that coordinate training already beats BPE in every metric.
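
Decoded bytes per second can rise even when tokens per second falls, because each coordinate decodes to more text. The sketch below shows the measurement under stated assumptions: `decoded_bytes_per_second` is a hypothetical helper, and the timings and sizes are invented for illustration, not measurements from this repository.

```python
def decoded_bytes_per_second(decoded_text: str, seconds: float) -> float:
    """Throughput in original UTF-8 bytes, independent of the token vocabulary."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return len(decoded_text.encode("utf-8")) / seconds


# Hypothetical runs: the coordinate model emits fewer tokens in the same
# wall-clock time, but each token decodes to more bytes.
baseline = decoded_bytes_per_second("a" * 400, seconds=1.0)    # say, 100 tokens
coordinate = decoded_bytes_per_second("a" * 500, seconds=1.0)  # say, 80 tokens
print(baseline, coordinate)
```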

All published values and their status are recorded in results/validated_results.json.

The Fair Metric

Token reduction alone is insufficient. A coordinate can shorten the sequence while making the next symbol harder to predict.

bits per original byte
  = tokens per original byte
    * nats per token
    / ln(2)

A coordinate model wins on predictive efficiency only when its held-out bits per original byte are lower than the baseline's on the same original bytes.
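
The formula is easy to evaluate by hand. The sketch below uses a hypothetical helper, `bits_per_original_byte`, with invented rates chosen only to show how a shorter sequence can still lose on the fair metric.

```python
import math


def bits_per_original_byte(tokens_per_byte: float, nats_per_token: float) -> float:
    """Convert a per-token loss in nats into bits per original byte."""
    return tokens_per_byte * nats_per_token / math.log(2)


# Hypothetical numbers, not results from this repository.
baseline_bpb = bits_per_original_byte(tokens_per_byte=0.25, nats_per_token=2.0)
coordinate_bpb = bits_per_original_byte(tokens_per_byte=0.16, nats_per_token=3.2)

print(f"baseline:   {baseline_bpb:.3f} bits/byte")
print(f"coordinate: {coordinate_bpb:.3f} bits/byte")
```

In this invented case the coordinate stream has 36% fewer tokens per byte, yet each token costs enough extra loss that its bits per original byte end up higher than the baseline's.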

Design

The implementation keeps five artifacts independent:

artifact	purpose
`tokenizer.json`	shared byte-level BPE tokenizer
`universe.json`	exact recursive coordinate definitions
`*.ids`	compact prepared baseline and coordinate streams
`model.pt`	model weights
`report.json`	configuration, losses, timing, and BPB

Coordinate discovery is memory-bounded. Each level counts adjacent pairs, selects recurring candidates, folds the corpus, and repeats on the new stream. It does not materialize every possible n-gram.

Special document boundary IDs are never folded. The model can therefore learn and emit a normal explicit end-of-text token.
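
One discovery level can be sketched as follows. This is a simplified illustration under assumptions, not the repository's implementation: `count_pairs`, `fold`, and `build_level` are hypothetical names, folding is greedy left to right, and the separate selection corpus is ignored. The counter is bounded by the number of distinct adjacent pairs only; no longer n-grams are ever stored.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]


def count_pairs(documents: List[List[int]], protected: Set[int]) -> Counter:
    """Count adjacent pairs, skipping any pair that touches a protected ID."""
    counts: Counter = Counter()
    for doc in documents:
        for left, right in zip(doc, doc[1:]):
            if left not in protected and right not in protected:
                counts[(left, right)] += 1
    return counts


def fold(doc: List[int], pair_to_coord: Dict[Pair, int]) -> List[int]:
    """Replace selected pairs with their coordinate, greedily left to right."""
    out: List[int] = []
    i = 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in pair_to_coord:
            out.append(pair_to_coord[(doc[i], doc[i + 1])])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out


def build_level(documents, next_id, max_nodes, min_count, protected):
    """Select recurring pairs, assign new IDs, and fold the corpus once."""
    counts = count_pairs(documents, protected)
    chosen = [p for p, n in counts.most_common(max_nodes) if n >= min_count]
    pair_to_coord = {pair: next_id + k for k, pair in enumerate(chosen)}
    return pair_to_coord, [fold(doc, pair_to_coord) for doc in documents]


EOT = 0  # hypothetical document-boundary ID; never folded
docs = [[81, 19, 81, 19, 42, 7, EOT]] * 2
level1, stream1 = build_level(docs, next_id=100, max_nodes=128, min_count=2, protected={EOT})
level2, stream2 = build_level(stream1, next_id=100 + len(level1), max_nodes=128, min_count=2, protected={EOT})
print(len(docs[0]), len(stream1[0]), len(stream2[0]))
```

Because pairs that touch a protected ID are never counted, they can never be selected, so the boundary token survives every level as an ordinary symbol at the end of each document.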

Install

Python 3.10+ is required.

python -m venv .venv
source .venv/bin/activate
pip install -e .

Hugging Face datasets are optional:

pip install -e ".[datasets]"

Quick Start

Prepare one paired corpus:

python prepare.py \
  --input examples/data/mini_stories.jsonl \
  --output artifacts/mini_stories \
  --base-vocab-size 512 \
  --coordinate-layers 2 \
  --nodes-per-layer 256 \
  --coordinate-min-count 2

Train matched models:

python train.py \
  --data artifacts/mini_stories \
  --config configs/smoke.json \
  --representation baseline \
  --output checkpoints/baseline

python train.py \
  --data artifacts/mini_stories \
  --config configs/smoke.json \
  --representation coordinate \
  --output checkpoints/coordinate

Compare them:

python compare.py \
  --baseline checkpoints/baseline/report.json \
  --coordinate checkpoints/coordinate/report.json

Generate model-visible IDs and decoded text:

python generate.py \
  --checkpoint checkpoints/coordinate \
  --prompt "Mina carried a red kite" \
  --max-new-tokens 120

Evaluate BPB and decoded throughput:

python evaluate.py \
  --checkpoint checkpoints/coordinate \
  --data artifacts/mini_stories \
  --prompt "The little robot found" \
  --max-new-tokens 160

Hugging Face Corpus

Prepare TinyStories without changing the training code:

python prepare.py \
  --hf-dataset roneneldan/TinyStories \
  --hf-split train \
  --text-field text \
  --limit 100000 \
  --output artifacts/tinystories_100k \
  --base-vocab-size 4096 \
  --coordinate-layers 2 \
  --nodes-per-layer 16000 \
  --coordinate-min-count 4

The prepared tokenizer and universe are cached in the output directory. Subsequent training runs load them directly and never rebuild the coordinate vocabulary.

Python API

from coordcompresscodeclm import (
    CorpusPreparer,
    LanguageModelTrainer,
    PreparedCorpus,
    TrainConfig,
)

preparer = CorpusPreparer(
    base_vocab_size=4096,
    coordinate_layers=2,
    nodes_per_layer=16000,
)

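# train_build, train_selection, and validation are document collections
# that you supply; they are not defined in this snippet.
report = preparer.prepare(
    build_documents=train_build,
    selection_documents=train_selection,
    validation_documents=validation,
    output_dir="artifacts/corpus",
)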

config = TrainConfig.from_file("configs/reference.json")
trainer = LanguageModelTrainer(
    PreparedCorpus("artifacts/corpus"),
    config,
    "checkpoints/coordinate",
)
training_report = trainer.train()

Lower-level coordinate usage:

from coordcompresscodeclm import UniverseBuilder

documents = [
    [81, 19, 81, 19, 42, 7],
    [81, 19, 81, 19, 42, 7],
]

universe = UniverseBuilder(
    layers=2,
    nodes_per_layer=128,
    min_count=2,
    selection_min_count=0,
).fit(documents, base_vocab_size=100)

encoded = universe.encode(documents[0])
restored = universe.decode(encoded)

assert restored == documents[0]

Repository Structure

CoordCompressCodecLM/
├── src/                  library code
├── tests/                exactness and training smoke tests
├── configs/              reproducible model configurations
├── examples/             small local corpus
├── docs/                 method, protocol, findings, and limitations
├── results/              curated machine-readable results
├── prepare.py            freeze a paired corpus
├── train.py              train or resume one model
├── evaluate.py           measure BPB and decoded throughput
├── generate.py           inspect IDs and decoded output
└── compare.py            compare matched reports

Research Status

What is established:

Coordinate encoding and decoding are lossless.
Recursive layers can reduce model-visible token count substantially.
A 49-52% TinyStories token reduction has been demonstrated; the 100k L2 run listed under Current Evidence reached 49.0%.
Semantic initialization makes new coordinate embeddings easier to learn.
Decoded bytes per second is the correct generation-throughput metric.

What remains open:

Matching or beating baseline BPB reliably across datasets and model sizes.
Reducing the softmax and embedding cost of large coordinate vocabularies.
Establishing quality parity with long, compute-matched training runs.
Learning transferable coordinate universes across domains.

See docs/findings.md for the evidence and docs/limitations.md for the failure modes.

Related Project

CoordCompressCodec provides the general-purpose lossless coordinate file codec. This repository studies the separate language-model question: using reversible coordinates as the model's training and generation vocabulary.

License

CoordCompressCodecLM is released under the CoordCompressCodec Non-Commercial License.

Non-commercial use is permitted. Commercial use requires prior written permission from Moustapha Oumar.
