mito0o852/CoordCompressCodecLM

CoordCompressCodecLM

Lossless coordinate tokens for faster language-model training research.

Train on fewer model-visible tokens. Decode exactly back to text. Measure the real efficiency gain with bits per original byte, not token-count optimism.

License: Non-Commercial | Python 3.10+ | PyTorch | Exact decode | Coordinate tokens

Compress the training sequence. Keep the original text recoverable. Test whether the model truly wins.


CoordCompressCodecLM is a research framework for training and evaluating language models on lossless recursive coordinate tokens.

It asks one precise question:

Can a model process fewer target tokens for the same original text without paying a larger predictive cost?

The system learns a normal byte-level BPE tokenizer, folds recurring token pairs into atomic coordinates, trains matched baseline and coordinate models, and decodes generated coordinates back to ordinary text.

  • 67.8% token reduction on a 10k Apache-log preparation run
  • 49.0% token reduction in the TinyStories 100k L2 research gate
  • Exact coordinate decode back to the original tokenizer IDs

text -> BPE IDs -> L1 coordinates -> L2 coordinates -> language model
language model -> inverse L2 -> inverse L1 -> BPE decode -> text

[Figure: Coordinate language-model pipeline]

What Is Different

This is not prompt summarization and it is not a lossy latent code. Every coordinate has an exact expansion:

C4096 -> (token 81, token 19)
C8192 -> (C4096, C4096)

The decoder recursively applies those definitions until only ordinary BPE IDs remain. Exact roundtrip is tested before a corpus is accepted.

One coordinate can therefore carry several baseline tokens while remaining one atomic prediction step for the model.
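
The recursive expansion can be illustrated with a minimal sketch. It assumes coordinates are integer IDs (4096 and 8192 standing in for C4096 and C8192) and that definitions live in a plain dictionary; both are illustrative choices, not the layout of universe.json or the library's decoder.

```python
def decode(coordinates, definitions):
    """Expand coordinate IDs until only ordinary BPE IDs remain.

    `definitions` maps a coordinate ID to its (left, right) pair.
    This dict layout is an assumption made for the sketch.
    """
    out = []
    # Explicit stack instead of recursion, so deep layers cannot
    # hit Python's recursion limit.
    stack = list(reversed(coordinates))
    while stack:
        symbol = stack.pop()
        if symbol in definitions:
            left, right = definitions[symbol]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(symbol)  # ordinary BPE ID
    return out


# The two definitions from the example above.
definitions = {4096: (81, 19), 8192: (4096, 4096)}

# One model-visible token (8192) carries four baseline tokens.
decoded = decode([8192, 42], definitions)
print(decoded)  # [81, 19, 81, 19, 42]
```

Here a two-token coordinate sequence expands to five baseline tokens, which is the sense in which one prediction step can cover several BPE IDs.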

[Figure: Recursive coordinate layers]

Current Evidence

The repository separates representation wins from modeling wins.

| experiment | baseline tokens | coordinate tokens | ratio (coordinate / baseline) | tokens saved | BPB result | exact |
| --- | --- | --- | --- | --- | --- | --- |
| Apache logs, 10k rows, current implementation | 47,860 | 15,431 | 0.322 | 67.8% | preparation only | yes |
| TinyStories 100k, L2 16k research run | 404,056 | 206,215 | 0.510 | 49.0% | representation only | yes |
| TinyStories 10k, paired native LM | 209,377 | 132,145 | 0.631 | 36.9% | 1.7% worse | yes |
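
The ratio and saved figures in the table follow directly from the two token counts. As a quick check, the arithmetic can be reproduced with a few lines of Python; the helper name `token_stats` is made up for this sketch and is not part of the library.

```python
# Token counts copied from the table: (baseline tokens, coordinate tokens).
runs = {
    "Apache logs, 10k rows": (47_860, 15_431),
    "TinyStories 100k, L2 16k": (404_056, 206_215),
    "TinyStories 10k, paired native LM": (209_377, 132_145),
}


def token_stats(baseline_tokens, coordinate_tokens):
    """Return (ratio, percent saved), rounded as in the table."""
    ratio = coordinate_tokens / baseline_tokens
    return round(ratio, 3), round(100 * (1 - ratio), 1)


stats = {name: token_stats(b, c) for name, (b, c) in runs.items()}
for name, (ratio, saved) in stats.items():
    print(f"{name}: ratio={ratio:.3f}, saved={saved:.1f}%")
```

These numbers describe sequence length only. They say nothing about how hard each coordinate is to predict.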

In the paired 10k run, the coordinate model generated 7.5% more decoded bytes per second, but it used a larger vocabulary, trained more slowly per optimizer step, and scored 1.7% worse on bits per original byte (BPB). That is promising evidence, not a claim that coordinate training already beats BPE on every metric.

All published values and their status are recorded in results/validated_results.json.

The Fair Metric

Token reduction alone is insufficient. A coordinate can shorten the sequence while making the next symbol harder to predict.

bits per original byte
  = tokens per original byte
    * nats per token
    / ln(2)

A coordinate model wins on predictive efficiency only when its held-out bits per original byte (BPB) is lower than the baseline's, measured on the same original bytes.
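
A small worked example shows why the formula matters. The token counts and per-token losses below are invented purely for illustration and are not results from this repository; the function name `bits_per_original_byte` is likewise an assumption of the sketch.

```python
import math


def bits_per_original_byte(n_tokens, n_original_bytes, nats_per_token):
    """BPB = tokens per original byte * nats per token / ln(2)."""
    tokens_per_byte = n_tokens / n_original_bytes
    return tokens_per_byte * nats_per_token / math.log(2)


# Hypothetical numbers: both models are scored on the same 4,000 bytes.
original_bytes = 4_000

# Baseline: more tokens, each one easier to predict.
baseline_bpb = bits_per_original_byte(1_000, original_bytes, nats_per_token=2.0)

# Coordinate model: 40% fewer tokens, each one harder to predict.
coordinate_bpb = bits_per_original_byte(600, original_bytes, nats_per_token=3.5)

print(f"baseline   BPB: {baseline_bpb:.3f}")    # about 0.721
print(f"coordinate BPB: {coordinate_bpb:.3f}")  # about 0.757
```

In this made-up case the coordinate model sees 40% fewer tokens yet loses on BPB, because its per-token loss rose by more than its token count fell.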

[Figure: Compression and learnability balance]

Design

The implementation keeps five artifacts independent:

| artifact | purpose |
| --- | --- |
| tokenizer.json | shared byte-level BPE tokenizer |
| universe.json | exact recursive coordinate definitions |
| *.ids | compact prepared baseline and coordinate streams |
| model.pt | model weights |
| report.json | configuration, losses, timing, and BPB |

Coordinate discovery is memory-bounded. Each level counts adjacent pairs, selects recurring candidates, folds the corpus, and repeats on the new stream. It does not materialize every possible n-gram.

Special document boundary IDs are never folded. The model can therefore learn and emit a normal explicit end-of-text token.
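
The count, select, fold, repeat loop described above can be sketched as follows. This is a simplified illustration, not the repository's UniverseBuilder: it selects candidates by raw frequency, folds greedily left to right, and uses hypothetical names (`fold_level`, `protected`, `EOT`). Memory stays bounded by the number of distinct adjacent pairs, since no longer n-grams are enumerated.

```python
from collections import Counter


def fold_level(documents, next_id, max_nodes, min_count, protected=frozenset()):
    """Run one discovery level: count pairs, select, fold the corpus."""
    counts = Counter()
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            # Pairs touching a protected ID (e.g. end-of-text) are never counted,
            # so they can never be selected or folded.
            if a not in protected and b not in protected:
                counts[(a, b)] += 1

    selected = {}
    for pair, count in counts.most_common(max_nodes):
        if count < min_count:
            break
        selected[pair] = next_id
        next_id += 1

    folded = []
    for doc in documents:
        out, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair in selected:
                out.append(selected[pair])  # two symbols become one coordinate
                i += 2
            else:
                out.append(doc[i])
                i += 1
        folded.append(out)

    definitions = {cid: pair for pair, cid in selected.items()}
    return folded, definitions, next_id


def expand(ids, definitions):
    """Inverse of folding: expand coordinates back to base IDs."""
    out, stack = [], list(reversed(ids))
    while stack:
        symbol = stack.pop()
        if symbol in definitions:
            left, right = definitions[symbol]
            stack.extend((right, left))
        else:
            out.append(symbol)
    return out


EOT = 0  # hypothetical end-of-text ID for this sketch
docs = [[81, 19, 81, 19, 42, 7, EOT], [81, 19, 81, 19, 42, 7, EOT]]

definitions, levels, next_id = {}, [docs], 100
for _ in range(2):  # two coordinate layers
    stream, new_defs, next_id = fold_level(
        levels[-1], next_id, max_nodes=128, min_count=2, protected={EOT}
    )
    definitions.update(new_defs)
    levels.append(stream)

print([len(level[0]) for level in levels])  # [7, 4, 3]
```

Each level operates on the stream produced by the previous one, so a second-layer coordinate can expand to four base tokens while the end-of-text ID stays a separate, ordinary symbol.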

Install

Python 3.10+ is required.

python -m venv .venv
source .venv/bin/activate
pip install -e .

Hugging Face datasets are optional:

pip install -e ".[datasets]"

Quick Start

Prepare one paired corpus:

python prepare.py \
  --input examples/data/mini_stories.jsonl \
  --output artifacts/mini_stories \
  --base-vocab-size 512 \
  --coordinate-layers 2 \
  --nodes-per-layer 256 \
  --coordinate-min-count 2

Train matched models:

python train.py \
  --data artifacts/mini_stories \
  --config configs/smoke.json \
  --representation baseline \
  --output checkpoints/baseline

python train.py \
  --data artifacts/mini_stories \
  --config configs/smoke.json \
  --representation coordinate \
  --output checkpoints/coordinate

Compare them:

python compare.py \
  --baseline checkpoints/baseline/report.json \
  --coordinate checkpoints/coordinate/report.json

Generate model-visible IDs and decoded text:

python generate.py \
  --checkpoint checkpoints/coordinate \
  --prompt "Mina carried a red kite" \
  --max-new-tokens 120

Evaluate BPB and decoded throughput:

python evaluate.py \
  --checkpoint checkpoints/coordinate \
  --data artifacts/mini_stories \
  --prompt "The little robot found" \
  --max-new-tokens 160

Hugging Face Corpus

Prepare TinyStories without changing the training code:

python prepare.py \
  --hf-dataset roneneldan/TinyStories \
  --hf-split train \
  --text-field text \
  --limit 100000 \
  --output artifacts/tinystories_100k \
  --base-vocab-size 4096 \
  --coordinate-layers 2 \
  --nodes-per-layer 16000 \
  --coordinate-min-count 4

The prepared tokenizer and universe are cached in the output directory. Subsequent training runs load them directly and never rebuild the coordinate vocabulary.

Python API

from coordcompresscodeclm import (
    CorpusPreparer,
    LanguageModelTrainer,
    PreparedCorpus,
    TrainConfig,
)

preparer = CorpusPreparer(
    base_vocab_size=4096,
    coordinate_layers=2,
    nodes_per_layer=16000,
)

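# train_build, train_selection, and validation are not defined in this
# snippet; supply your own document splits.
report = preparer.prepare(
    build_documents=train_build,
    selection_documents=train_selection,
    validation_documents=validation,
    output_dir="artifacts/corpus",
)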

config = TrainConfig.from_file("configs/reference.json")
trainer = LanguageModelTrainer(
    PreparedCorpus("artifacts/corpus"),
    config,
    "checkpoints/coordinate",
)
training_report = trainer.train()

Lower-level coordinate usage:

from coordcompresscodeclm import UniverseBuilder

documents = [
    [81, 19, 81, 19, 42, 7],
    [81, 19, 81, 19, 42, 7],
]

universe = UniverseBuilder(
    layers=2,
    nodes_per_layer=128,
    min_count=2,
    selection_min_count=0,
).fit(documents, base_vocab_size=100)

encoded = universe.encode(documents[0])
restored = universe.decode(encoded)

assert restored == documents[0]

Repository Structure

CoordCompressCodecLM/
├── src/                  library code
├── tests/                exactness and training smoke tests
├── configs/              reproducible model configurations
├── examples/             small local corpus
├── docs/                 method, protocol, findings, and limitations
├── results/              curated machine-readable results
├── prepare.py            freeze a paired corpus
├── train.py              train or resume one model
├── evaluate.py           measure BPB and decoded throughput
├── generate.py           inspect IDs and decoded output
└── compare.py            compare matched reports

Research Status

What is established:

  • Coordinate encoding and decoding are lossless.
  • Recursive layers can reduce model-visible token count substantially.
  • A 49-52% TinyStories token reduction has been demonstrated.
  • Semantic initialization makes new coordinate embeddings easier to learn.
  • Decoded bytes per second is the correct generation-throughput metric.

What remains open:

  • Matching or beating baseline BPB reliably across datasets and model sizes.
  • Reducing the softmax and embedding cost of large coordinate vocabularies.
  • Establishing quality parity with long, compute-matched training runs.
  • Learning transferable coordinate universes across domains.

See docs/findings.md for the evidence and docs/limitations.md for the failure modes.

Related Project

CoordCompressCodec provides the general-purpose lossless coordinate file codec. This repository studies the separate language-model question: using reversible coordinates as the model's training and generation vocabulary.

License

CoordCompressCodecLM is released under the CoordCompressCodec Non-Commercial License.

Non-commercial use is permitted. Commercial use requires prior written permission from Moustapha Oumar.
