Releases: rushikeshmore/DataCortex
DataCortex v0.1.0 — Format-Aware Lossless Text Compression
A lossless text compression engine that detects file structure (JSON, NDJSON, Markdown, logs, code) and applies format-aware preprocessing before bit-level context mixing across 19 models.
Benchmarks
vs General-Purpose Compressors (enwik8, 100MB Wikipedia)
| Compressor | Size | bpb | DataCortex wins by |
|---|---|---|---|
| gzip -9 | 36.4 MB | 3.28 | 43% |
| bzip2 -9 | 29.0 MB | 2.64 | 29% |
| zstd -22 | 24.0 MB | 2.16 | 13% |
| xz -9 | 24.7 MB | 1.98 | 6% |
| DataCortex | 23.3 MB | 1.87 | — |
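The bpb and "wins by" columns can be reproduced with a little arithmetic. The sketch below assumes that bpb means compressed bits per original byte, that sizes are decimal megabytes (enwik8 is 10^8 bytes), and that "wins by" is the relative reduction in bpb. The function names are illustrative and not part of DataCortex.

```rust
/// Bits per byte: compressed size in bits divided by original size in bytes.
fn bits_per_byte(compressed_bytes: f64, original_bytes: f64) -> f64 {
    compressed_bytes * 8.0 / original_bytes
}

/// Relative reduction of `ours_bpb` compared with `theirs_bpb`, in percent.
fn wins_by_percent(ours_bpb: f64, theirs_bpb: f64) -> f64 {
    (1.0 - ours_bpb / theirs_bpb) * 100.0
}
```

For example, `bits_per_byte(23.3e6, 100.0e6)` gives the DataCortex row's bpb before rounding, and `wins_by_percent(1.87, 3.28)` rounds to the 43% shown for gzip -9.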
Format-Aware Compression (NDJSON, 200 rows of analytics events)
| Compressor | bpb | DataCortex wins by |
|---|---|---|
| bzip2 | 2.32 | 82% |
| zstd -3 | 1.46 | 72% |
| zstd -22 | 0.76 | 46% |
| DataCortex Balanced | 0.41 | — |
About 3.5x smaller than zstd -3 and about 1.9x smaller than zstd -22 on this structured NDJSON sample. DataCortex gets there by using the structure of the data.
Three Modes
| Mode | enwik8 bpb | Speed | Memory | Use Case |
|---|---|---|---|---|
| Fast | ~3.0 | Fast | ~10 MB | Quick compression (zstd backend) |
| Balanced | 1.87 | ~170 KB/s | ~450 MB | Best ratio for general use |
| Max | 1.87 | ~85 KB/s | ~900 MB | Maximum compression |
Format Detection
Automatically detects and optimizes for: JSON, NDJSON, Markdown, CSV, source code, log files, generic text.
NDJSON gets a special columnar reorganization that groups similar values together (timestamps with timestamps, user IDs with user IDs), so that repetitive values sit next to each other and compress much better than in row order.
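The idea behind the columnar reorganization can be sketched as a transpose from rows to columns. This is a minimal illustration, not DataCortex's actual implementation: it assumes the rows are already parsed into flat key/value pairs and that every row has the same keys. The `Row` type and the function names are hypothetical.

```rust
/// One parsed NDJSON row: (key, raw value text) pairs in document order.
type Row = Vec<(String, String)>;

/// Group values by key so each column holds the same field from every row.
/// Columns keep the order in which keys first appear.
fn to_columns(rows: &[Row]) -> Vec<(String, Vec<String>)> {
    let mut columns: Vec<(String, Vec<String>)> = Vec::new();
    for row in rows {
        for (key, value) in row {
            if let Some(idx) = columns.iter().position(|(k, _)| k == key) {
                columns[idx].1.push(value.clone());
            } else {
                columns.push((key.clone(), vec![value.clone()]));
            }
        }
    }
    columns
}

/// Inverse transform: rebuild the rows from the columns.
/// Assumes all columns have the same length (same keys in every row).
fn to_rows(columns: &[(String, Vec<String>)]) -> Vec<Row> {
    let n = columns.first().map_or(0, |(_, values)| values.len());
    (0..n)
        .map(|i| {
            columns
                .iter()
                .map(|(k, values)| (k.clone(), values[i].clone()))
                .collect::<Row>()
        })
        .collect()
}
```

Because `to_rows` undoes `to_columns` under these assumptions, the reorganization loses no information. A real implementation would also have to handle rows with missing or extra keys and nested values.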
Installation
# From source
git clone https://github.com/rushikeshmore/DataCortex.git
cd DataCortex
cargo build --release
Usage
# Compress
datacortex compress data.json -m balanced
# Decompress
datacortex decompress data.dcx output.json
# Benchmark a directory
datacortex bench corpus/ -m balanced --compare
# File info
datacortex info data.dcx
Architecture
19 prediction models (orders 0-9, match, word, sparse, run, JSON context, indirect, PPM, DMC, ISSE) combined through a triple logistic mixer and a 7-stage APM (adaptive probability map) cascade. Binary arithmetic coder with 12-bit precision.
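Logistic mixing, the general technique used to combine the model predictions, can be sketched as follows. This is a simplified single-layer, floating-point version in the style of PAQ-family compressors. It does not reproduce DataCortex's triple mixer or its APM stages, and the `Mixer` type and its parameters are hypothetical.

```rust
/// stretch(p) = ln(p / (1 - p)), the logit of a probability.
fn stretch(p: f64) -> f64 {
    (p / (1.0 - p)).ln()
}

/// squash(x) = 1 / (1 + e^-x), the inverse of stretch.
fn squash(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// A single-layer logistic mixer over a fixed number of model predictions.
struct Mixer {
    weights: Vec<f64>,
    learning_rate: f64,
}

impl Mixer {
    fn new(n_models: usize, learning_rate: f64) -> Self {
        Mixer { weights: vec![0.0; n_models], learning_rate }
    }

    /// Combine per-model probabilities that the next bit is 1.
    fn predict(&self, probs: &[f64]) -> f64 {
        let dot: f64 = self
            .weights
            .iter()
            .zip(probs)
            .map(|(w, p)| w * stretch(*p))
            .sum();
        squash(dot)
    }

    /// After the actual bit is known, shift weight toward the models
    /// whose predictions reduced the coding error.
    fn update(&mut self, probs: &[f64], bit: u8) {
        let err = f64::from(bit) - self.predict(probs);
        for (w, p) in self.weights.iter_mut().zip(probs) {
            *w += self.learning_rate * err * stretch(*p);
        }
    }
}
```

With all weights at zero the mixer predicts 0.5. Repeated calls to `update` raise the weights of models that were confident in the right direction. Production compressors typically use fixed-point arithmetic and pick a weight set based on context, which this sketch leaves out.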
The key innovation is the NDJSON columnar reorganization, which transforms row-oriented JSON into a column-oriented layout before compression. This enables 0.41 bpb on the NDJSON benchmark above, where the next-best result (zstd -22) is 0.76 bpb.
Stats
- 242 tests, all passing
- Clippy clean, zero warnings
- Lossless roundtrip verified on 100MB enwik8
- MIT licensed