Releases: rushikeshmore/DataCortex

DataCortex v0.1.0 — Format-Aware Lossless Text Compression

17 Mar 06:56

DataCortex v0.1.0

DataCortex is a lossless text compression engine. It detects file structure (JSON, NDJSON, Markdown, logs, code) and applies format-aware preprocessing before bit-level context mixing with 19 models.

Benchmarks

vs General-Purpose Compressors (enwik8, 100MB Wikipedia)

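| Compressor | Size | bpb (bits per byte) | DataCortex wins by |
|------------|---------|------|-----|
| gzip -9    | 36.4 MB | 3.28 | 43% |
| bzip2 -9   | 29.0 MB | 2.64 | 29% |
| zstd -22   | 24.0 MB | 2.16 | 13% |
| xz -9      | 24.7 MB | 1.98 | 6%  |
| DataCortex | 23.3 MB | 1.87 |     |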
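
As a rough guide to reading the table, the sketch below shows how a bpb figure and a "wins by" percentage can be derived. It assumes that MB means 10^6 bytes and that "wins by" is the relative reduction in bpb; the function names are illustrative and not part of DataCortex.

```rust
/// Bits per byte: compressed size in bits divided by original size in bytes.
fn bits_per_byte(compressed_bytes: u64, original_bytes: u64) -> f64 {
    (compressed_bytes as f64 * 8.0) / original_bytes as f64
}

/// Relative saving of `ours_bpb` over `theirs_bpb`, as a percentage of `theirs_bpb`.
/// (Assumed definition of the "wins by" column.)
fn wins_by_percent(ours_bpb: f64, theirs_bpb: f64) -> f64 {
    (theirs_bpb - ours_bpb) / theirs_bpb * 100.0
}
```

For the DataCortex row, `bits_per_byte(23_300_000, 100_000_000)` gives about 1.86, in line with the listed 1.87 given the rounded size. `wins_by_percent(1.87, 1.98)` gives about 5.6, which rounds to the 6% shown against xz.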

Format-Aware Compression (NDJSON, 200 rows analytics events)

| Compressor          | bpb  | DataCortex wins by |
|---------------------|------|-----|
| bzip2               | 2.32 | 82% |
| zstd -3             | 1.46 | 72% |
| zstd -22            | 0.76 | 46% |
| DataCortex Balanced | 0.41 |     |

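On this structured NDJSON sample, DataCortex output is about 3.5x smaller than zstd -3 (0.41 vs 1.46 bpb) and about 1.9x smaller than zstd -22 (0.76 bpb). The gain comes from format-aware preprocessing of the data's structure.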

Three Modes

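| Mode     | enwik8 bpb | Speed     | Memory  | Use Case |
|----------|------------|-----------|---------|----------|
| Fast     | ~3.0       | Fast      | ~10 MB  | Quick compression (zstd backend) |
| Balanced | 1.87       | ~170 KB/s | ~450 MB | Best ratio for general use |
| Max      | 1.87       | ~85 KB/s  | ~900 MB | Maximum compression |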

Format Detection

Automatically detects and optimizes for: JSON, NDJSON, Markdown, CSV, source code, log files, generic text.
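
To make the idea of content-based detection concrete, here is a minimal heuristic sketch. It is not DataCortex's actual detector: the `Format` enum, the `detect_format` function, and the rules themselves are assumptions for illustration, and only five of the listed formats are covered.

```rust
/// Hypothetical subset of the formats listed above.
#[derive(Debug, PartialEq)]
enum Format {
    Json,
    Ndjson,
    Markdown,
    Csv,
    Text,
}

/// Guess the format from the text alone, using simple line-based rules.
fn detect_format(input: &str) -> Format {
    // Work on trimmed, non-empty lines only.
    let lines: Vec<&str> = input
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    if lines.is_empty() {
        return Format::Text;
    }
    let first = lines[0];
    let last = lines[lines.len() - 1];

    // NDJSON: several lines, each one looking like a complete object.
    if lines.len() > 1 && lines.iter().all(|l| l.starts_with('{') && l.ends_with('}')) {
        return Format::Ndjson;
    }
    // JSON: the whole document is wrapped in one object or array.
    if (first.starts_with('{') && last.ends_with('}'))
        || (first.starts_with('[') && last.ends_with(']'))
    {
        return Format::Json;
    }
    // Markdown: the document opens with an ATX heading.
    if first.starts_with("# ") {
        return Format::Markdown;
    }
    // CSV: at least two lines, all with the same non-zero comma count.
    let commas = first.matches(',').count();
    if lines.len() > 1 && commas > 0 && lines.iter().all(|l| l.matches(',').count() == commas) {
        return Format::Csv;
    }
    Format::Text
}
```

A real detector would likely also look at file extensions and validate the content by parsing it; the ordering of the checks here (NDJSON before JSON) is a design choice of this sketch.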

NDJSON input gets a columnar reorganization that groups similar values together (timestamps with timestamps, user IDs with user IDs), which makes the data far more compressible: 0.41 bpb versus 0.76 bpb for zstd -22 in the NDJSON benchmark above.
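
The row-to-column idea can be sketched as a transpose over already-parsed field values. This is an illustration only, not DataCortex's implementation: it assumes every row has the same fields in the same order, skips JSON parsing entirely, and the function names are hypothetical.

```rust
/// Transpose rows of field values into columns, so that all values of one
/// field end up next to each other.
/// Assumes every row has the same number of fields, in the same order.
fn rows_to_columns(rows: &[Vec<String>]) -> Vec<Vec<String>> {
    let width = rows.first().map_or(0, |r| r.len());
    let mut columns = vec![Vec::new(); width];
    for row in rows {
        assert_eq!(row.len(), width, "rows must share one schema");
        for (col, value) in columns.iter_mut().zip(row) {
            col.push(value.clone());
        }
    }
    columns
}

/// Inverse transform: rebuild the original rows from the columns.
/// A lossless compressor needs this step on decompression.
fn columns_to_rows(columns: &[Vec<String>]) -> Vec<Vec<String>> {
    let height = columns.first().map_or(0, |c| c.len());
    (0..height)
        .map(|i| columns.iter().map(|c| c[i].clone()).collect())
        .collect()
}
```

Given two rows `[ts1, u1, click]` and `[ts2, u2, view]`, `rows_to_columns` yields `[ts1, ts2]`, `[u1, u2]`, `[click, view]`, and `columns_to_rows` restores the original rows. Placing near-identical values side by side is what gives the downstream models longer, more predictable contexts.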

Installation

# From source
git clone https://github.com/rushikeshmore/DataCortex.git
cd DataCortex
cargo build --release

Usage

# Compress
datacortex compress data.json -m balanced

# Decompress
datacortex decompress data.dcx output.json

# Benchmark a directory
datacortex bench corpus/ -m balanced --compare

# File info
datacortex info data.dcx

Architecture

The engine combines 19 prediction models (orders 0-9, match, word, sparse, run, JSON context, indirect, PPM, DMC, ISSE) through a triple logistic mixer and a 7-stage APM (adaptive probability map) cascade. The final prediction drives a binary arithmetic coder with 12-bit precision.
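
For readers unfamiliar with logistic mixing, the sketch below shows the general PAQ-style technique with a single mixer. It is not DataCortex's code: it uses `f64` instead of 12-bit fixed-point probabilities, one mixer instead of three, and the `Mixer` type and its learning rate are assumptions for illustration.

```rust
/// ln(p / (1 - p)): maps a probability into the logistic domain.
fn stretch(p: f64) -> f64 {
    (p / (1.0 - p)).ln()
}

/// Inverse of `stretch`: maps a logistic-domain value back to a probability.
fn squash(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// A single logistic mixer over `n` model predictions.
struct Mixer {
    weights: Vec<f64>,
    inputs: Vec<f64>, // stretched inputs from the last `predict` call
    output: f64,      // last mixed probability that the next bit is 1
    learning_rate: f64,
}

impl Mixer {
    fn new(n: usize, learning_rate: f64) -> Self {
        Mixer {
            weights: vec![0.0; n],
            inputs: vec![0.0; n],
            output: 0.5,
            learning_rate,
        }
    }

    /// Mix model probabilities (each strictly between 0 and 1) into one
    /// probability that the next bit is 1.
    fn predict(&mut self, probs: &[f64]) -> f64 {
        assert_eq!(probs.len(), self.weights.len());
        for (x, &p) in self.inputs.iter_mut().zip(probs) {
            *x = stretch(p);
        }
        let dot: f64 = self.weights.iter().zip(&self.inputs).map(|(w, x)| w * x).sum();
        self.output = squash(dot);
        self.output
    }

    /// After the actual bit is known, shift weight toward the models
    /// that predicted it (gradient step on coding cost).
    fn update(&mut self, bit: u8) {
        let err = f64::from(bit) - self.output;
        for (w, x) in self.weights.iter_mut().zip(&self.inputs) {
            *w += self.learning_rate * err * x;
        }
    }
}
```

Each coded bit follows the same cycle: call `predict` with the models' probabilities, code the bit with the mixed probability, then call `update` with the bit that occurred. Models that keep predicting correctly gain weight, so the mixed prediction sharpens over time.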

The key innovation is the NDJSON columnar reorg: it transforms row-oriented JSON into a column-oriented layout before compression, reaching 0.41 bpb on the structured benchmark above, ahead of every compressor tested there (zstd -22, the closest, reaches 0.76 bpb).

Stats

  • 242 tests, all passing
  • Clippy clean, zero warnings
  • Lossless roundtrip verified on 100MB enwik8
  • MIT licensed