DataCortex v0.1.0

DataCortex is a lossless text compression engine. It detects file structure (JSON, NDJSON, Markdown, logs, code) and applies format-aware preprocessing before bit-level context mixing across 19 models.

Benchmarks

vs General-Purpose Compressors (enwik8, 100MB Wikipedia)

Compressor	Size	bpb (bits per byte)	DataCortex wins by (bpb)
gzip -9	36.4 MB	3.28	43%
bzip2 -9	29.0 MB	2.64	29%
zstd -22	24.0 MB	2.16	13%
xz -9	24.7 MB	1.98	6%
DataCortex	23.3 MB	1.87	—
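
The "wins by" percentages in the table match the relative reduction in bpb (bits per byte), not in file size. The sketch below shows that arithmetic. The function names `bits_per_byte` and `reduction_percent` are illustrative and not part of DataCortex, and the size-based figure assumes decimal megabytes, so it only approximates the published bpb.

```rust
/// Bits per byte: compressed size in bits divided by original size in bytes.
fn bits_per_byte(compressed_bytes: u64, original_bytes: u64) -> f64 {
    (compressed_bytes as f64 * 8.0) / original_bytes as f64
}

/// Relative bpb reduction of `ours_bpb` against `baseline_bpb`, in percent.
fn reduction_percent(baseline_bpb: f64, ours_bpb: f64) -> f64 {
    (baseline_bpb - ours_bpb) / baseline_bpb * 100.0
}

fn main() {
    // Assumption: 23.3 MB output on a 100 MB input, both in decimal megabytes.
    // The rounded size only approximates the published 1.87 bpb.
    let approx = bits_per_byte(23_300_000, 100_000_000);
    println!("DataCortex (from rounded size): {approx:.3} bpb");

    // Baseline bpb values are copied from the table above.
    let baselines = [
        ("gzip -9", 3.28),
        ("bzip2 -9", 2.64),
        ("zstd -22", 2.16),
        ("xz -9", 1.98),
    ];
    for (name, baseline) in baselines {
        println!("{name}: {:.0}% fewer bits", reduction_percent(baseline, 1.87));
    }
}
```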

Format-Aware Compression (NDJSON, 200 rows of analytics events)

Compressor	bpb	DataCortex wins by
bzip2	2.32	82%
zstd -3	1.46	72%
zstd -22	0.76	46%
DataCortex Balanced	0.41	—

On this NDJSON sample, DataCortex output is about 3.5x smaller than zstd -3 and about 1.85x smaller than zstd -22, because it models the structure of the data.

Three Modes

Mode	enwik8 bpb	Speed	Memory	Use Case
Fast	~3.0	Fast	~10 MB	Quick compression (zstd backend)
Balanced	1.87	~170 KB/s	~450 MB	Best ratio for general use
Max	1.87	~85 KB/s	~900 MB	Maximum compression

Format Detection

Automatically detects and optimizes for: JSON, NDJSON, Markdown, CSV, source code, log files, generic text.

NDJSON input gets a columnar reorganization that groups similar values together (timestamps with timestamps, user IDs with user IDs). This is what yields 0.41 bpb against 0.76 for zstd -22 in the benchmark above.
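
The columnar idea can be sketched as a row-to-column transpose. This is a minimal illustration and not DataCortex's actual transform. It assumes each NDJSON line has already been parsed into flat `(key, value)` string pairs, and the name `to_columns` is hypothetical.

```rust
use std::collections::BTreeMap;

/// Transpose row-oriented records into one value list per key.
/// Each row holds the (key, value) pairs parsed from one NDJSON line.
fn to_columns<'a>(rows: &[Vec<(&'a str, &'a str)>]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut columns: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for row in rows {
        for &(key, value) in row {
            // Values for the same key end up adjacent to each other.
            columns.entry(key).or_default().push(value);
        }
    }
    columns
}

fn main() {
    let rows = vec![
        vec![("ts", "1700000001"), ("user", "u17"), ("event", "click")],
        vec![("ts", "1700000002"), ("user", "u17"), ("event", "view")],
        vec![("ts", "1700000005"), ("user", "u42"), ("event", "click")],
    ];
    for (key, values) in to_columns(&rows) {
        println!("{key}: {}", values.join(","));
    }
}
```

A lossless transform would also need to record key order, missing keys, and nesting so that the original rows can be rebuilt exactly. The sketch leaves that out.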

Installation

# From source
git clone https://github.com/rushikeshmore/DataCortex.git
cd DataCortex
cargo build --release

Usage

# Compress
datacortex compress data.json -m balanced

# Decompress
datacortex decompress data.dcx output.json

# Benchmark a directory
datacortex bench corpus/ -m balanced --compare

# File info
datacortex info data.dcx

Architecture

DataCortex combines 19 prediction models (Order 0-9, match, word, sparse, run, JSON context, indirect, PPM, DMC, ISSE) through a triple logistic mixer and a 7-stage APM (adaptive probability map) cascade. The final probability drives a binary arithmetic coder with 12-bit precision.
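
To show what logistic mixing means, here is a minimal sketch of a single mixer in the style of PAQ-family compressors. It is not DataCortex's code. It uses floating point and one mixer for readability, where the engine described above uses three mixers, and the `Mixer` type and the learning rate are assumptions.

```rust
/// Map a probability to the logistic domain: ln(p / (1 - p)).
fn stretch(p: f64) -> f64 {
    let p = p.clamp(1e-6, 1.0 - 1e-6); // avoid infinities at 0 and 1
    (p / (1.0 - p)).ln()
}

/// Inverse of `stretch`: map a logistic-domain value back to a probability.
fn squash(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// A single logistic mixer over `n` model predictions.
struct Mixer {
    weights: Vec<f64>,
    learning_rate: f64,
}

impl Mixer {
    fn new(n: usize, learning_rate: f64) -> Self {
        Mixer { weights: vec![1.0 / n as f64; n], learning_rate }
    }

    /// Combine per-model probabilities that the next bit is 1.
    fn predict(&self, probs: &[f64]) -> f64 {
        let dot: f64 = self
            .weights
            .iter()
            .zip(probs)
            .map(|(w, &p)| w * stretch(p))
            .sum();
        squash(dot)
    }

    /// Once the actual bit is known, move weights toward the models that were right.
    fn update(&mut self, probs: &[f64], bit: u8) {
        let error = f64::from(bit) - self.predict(probs);
        for (w, &p) in self.weights.iter_mut().zip(probs) {
            *w += self.learning_rate * error * stretch(p);
        }
    }
}

fn main() {
    let mut mixer = Mixer::new(2, 0.02);
    let probs = [0.9, 0.1]; // model 0 leans toward 1, model 1 toward 0
    for _ in 0..100 {
        mixer.update(&probs, 1); // the observed bit is always 1 here
    }
    println!("p(1) = {:.3}, weights = {:?}", mixer.predict(&probs), mixer.weights);
}
```

After repeated updates, the weight of the model that predicted correctly grows and the mixed probability moves toward the observed bit.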

The key innovation: NDJSON columnar reorg transforms row-oriented JSON into a column-oriented layout before compression. This enables 0.41 bpb on structured data, lower than every other compressor in the NDJSON benchmark above.

Stats

242 tests, all passing
Clippy clean, zero warnings
Lossless roundtrip verified on 100MB enwik8
MIT licensed
