Zedda is a blazing-fast Exploratory Data Analysis (EDA) library for Python. It replaces dozens of lines of pandas boilerplate with a single function call and offloads all heavy computation to a custom C++17 streaming engine. In the benchmarks below, it profiles the Titanic dataset about 2,000× faster than ydata-profiling (0.019s vs. 42.0s).
import zedda as zd
zd.profile("titanic.csv") # Full EDA report in 19ms
zd.ml_ready("data.csv") # ML readiness score out of 100
zd.compare("train.csv", "test.csv") # Drift detection in one line
zd.fix("data.csv") # Copy-pasteable fix code, instantly
zd.ask("data.csv", "which columns have nulls?") # Natural language queries

| Feature | pandas | ydata-profiling | Zedda |
|---|---|---|---|
| Titanic (891 rows) | manual, 0.8s | ~45s | 19ms ⚡ |
| 6.3M row CSV | manual, 8.2s | OOM crash | 23s ⚡ |
| 1TB Parquet | OOM crash | OOM crash | < 2s ⚡ |
| RAM usage | |||
| pip install size | ~30 MB | 200 MB+ | < 1 MB ✅ |
| Pearson correlation | manual | slow | single-pass ✅ |
| ML readiness hints | ❌ | ❌ | ✅ |
| Auto-Fix Code Gen | ❌ | ❌ | ✅ |
| Data Drift Detection | ❌ | ❌ | ✅ |
pip install zedda

- ✅ No C++ compiler needed: pre-built wheels for Windows, macOS, and Linux
- ✅ Requires Python 3.9+
- ✅ Tiny install — less than 1 MB, no heavy dependencies
Instantly generate a beautiful, rich terminal report with data quality scores, outlier detection, distribution stats, and single-pass Pearson correlations — all in milliseconds.
import zedda as zd
zd.profile("data.csv") # CSV
zd.profile("data.parquet") # Parquet — uses footer cheat code
zd.profile("data.arrow") # Arrow IPC
zd.profile("big.csv", sample_size=500_000) # Force sampling
zd.profile() — Full dataset EDA in a single line. Data Quality Score, column stats, Smart Warnings, and Pearson correlations.
Computes an ML Readiness score out of 100 by flagging nulls, extreme outliers, high cardinality, multi-collinearity, and more.
zd.ml_ready("data.csv")

Detect data drift between Train/Test splits or Production vs. Baseline in one line. Uses Z-score distribution shift detection (threshold > 1.0) and flags new categories not seen in training.
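The drift check described above can be sketched in plain Python. This is only an illustration under stated assumptions: it takes the Z-score to be the absolute difference between the test and train means, divided by the train standard deviation, which may not match Zedda's internal formula. `numeric_drift` and `new_categories` are hypothetical helper names, not part of the Zedda API.

```python
from statistics import mean, pstdev


def numeric_drift(train, test, threshold=1.0):
    """Return (z, drifted) for one numeric column.

    Assumption: z is the shift of the test mean measured in
    train standard deviations. Zedda's real formula may differ.
    """
    mu = mean(train)
    sigma = pstdev(train)
    if sigma == 0:
        # Constant train column: any change in the mean counts as drift.
        z = 0.0 if mean(test) == mu else float("inf")
    else:
        z = abs(mean(test) - mu) / sigma
    return z, z > threshold


def new_categories(train, test):
    """Return categories that appear in test but never in train."""
    return sorted(set(test) - set(train))


z, drifted = numeric_drift([1, 2, 3, 4, 5], [4, 5, 6, 7, 8])
print(round(z, 2), drifted)
print(new_categories(["S", "C", "Q"], ["S", "Q", "X"]))
```

A threshold of 1.0 means a column is flagged once the test mean has moved more than one train standard deviation away from the train mean.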
zd.compare("train.csv", "test.csv")

Don't just find the issues: fix them. Zedda generates exact, copy-pasteable pandas or scikit-learn code snippets to resolve every detected problem.
zd.fix("data.csv") # Print fix code snippets
# Or apply them directly — returns a clean DataFrame!
clean_df = zd.fix("data.csv", apply=True)

View all data quality warnings for your dataset in a clean, structured list.
zd.warnings("data.csv")

Ask plain-English questions about your dataset and get instant answers. Features a fast offline rule engine for common questions (no API key needed) and Zedda AI for complex analytical queries.
# Instant offline answers (no API key needed)
zd.ask("titanic.csv", "which columns have more than 10% nulls?")
zd.ask("titanic.csv", "is this dataset good for fraud detection?")
zd.ask("titanic.csv", "what is the survival rate by class?")
zd.ask("titanic.csv", "how many rows are there?")
zd.ask("titanic.csv", "what should I drop?")
zd.ask("titanic.csv", "mean of Age")
# Zedda AI for complex questions (requires ZEDDA_AI_KEY)
zd.ask("data.csv", "which features should I use for a random forest?")
# Suppress output, capture the answer as a string
answer = zd.ask("data.csv", "mean of Fare", print_output=False)

Need raw stats for your own pipelines? scan() returns the full profile object silently, with no terminal output.
p = zd.scan("titanic.csv")
print(p.num_rows) # 891
print(p.num_cols) # 12
print(p.overall_null_pct) # 28.3
for col in p.columns:
    if col.null_pct > 20:
        print(f"High nulls: {col.name} ({col.null_pct:.1f}%)")

See full API reference:
docs/API.md
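To use the profile object as a gate in a pipeline, the loop shown earlier can be wrapped in a small function. The sketch below relies only on the `columns`, `name`, and `null_pct` attributes that appear in the scan() example; `high_null_columns` is a hypothetical helper, and the stand-in object with made-up values replaces a real `zd.scan()` result so the snippet runs without Zedda installed.

```python
from types import SimpleNamespace


def high_null_columns(profile, max_null_pct=20.0):
    """Return names of columns whose null percentage exceeds the limit.

    Works with any object exposing .columns, where each column has
    .name and .null_pct (the attributes used in the scan() example).
    """
    return [col.name for col in profile.columns if col.null_pct > max_null_pct]


# Stand-in for the object returned by zd.scan(); the values are illustrative.
fake_profile = SimpleNamespace(
    columns=[
        SimpleNamespace(name="a", null_pct=0.0),
        SimpleNamespace(name="b", null_pct=35.5),
        SimpleNamespace(name="c", null_pct=20.0),
    ]
)

print(high_null_columns(fake_profile))  # ['b']
```

With Zedda installed, the same function would presumably accept `zd.scan("titanic.csv")` directly.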
Zedda ships with a full command-line interface:
# Profile a file directly in your terminal
zedda run data.csv
# Compare two datasets
zedda compare train.csv test.csv
# Quick file info (fast, no full scan)
zedda info data.csv
# Show version
zedda version

Zedda is built on a custom C++17 streaming core connected to Python via nanobind, a fast, low-overhead Python/C++ binding library.
Python API (zd.profile, zd.scan, zd.compare ...)
│
│ nanobind (zero-copy)
▼
C++ Streaming Engine
┌─────────────────────────────────────────────────────┐
│ Welford's Algorithm → Mean / StdDev / Skew │
│ HyperLogLog → Cardinality (16KB/col) │
│ Pearson Engine → O(1) memory correlation │
│ Parquet Footer Reader → Exact min/max from meta │
│ Stratified Sampler → 99.9% accuracy, 100x I/O │
└─────────────────────────────────────────────────────┘
│
│ Arrow C Data Interface (zero-copy)
▼
PyArrow (Parquet / Arrow IPC file reading)
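For readers unfamiliar with Welford's algorithm, named in the diagram above, here is a minimal Python sketch of the single-pass mean and variance update. It illustrates the general technique only; Zedda's C++ implementation is not shown in this README, and `RunningStats` is a hypothetical name.

```python
import math


class RunningStats:
    """Single-pass mean and variance using Welford's online update."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old delta and the updated mean, which avoids
        # subtracting two large, nearly equal sums.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def stddev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.variance)
```

Each value is read once and then discarded, so memory use stays constant no matter how many rows stream through.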
| Algorithm | What It Does | Why |
|---|---|---|
| Welford's Online Algorithm | Stable mean/variance/stddev/skewness/kurtosis | Single-pass, no catastrophic cancellation |
| HyperLogLog | Cardinality estimation (~99% accuracy) | Uses only 16 KB per column, regardless of dataset size |
| Pearson Correlation Engine | Exact Pearson correlation | Single-pass, O(1) memory |
| Parquet Footer Cheat Code | Reads exact nulls/min/max from file footer | Milliseconds for any file size, no data scan needed |
| Stratified Row-Group Sampling | Picks start, middle, and end row groups | 99.9% statistical accuracy with 100× less I/O |
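The single-pass, constant-memory Pearson correlation listed in the table can be sketched by extending the Welford-style update to a pair of columns. This is a generic illustration of the technique, not Zedda's actual engine; `RunningPearson` is a hypothetical name.

```python
import math


class RunningPearson:
    """Single-pass Pearson correlation for one pair of columns.

    Keeps six numbers in total, so memory does not grow with row count.
    """

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # sum of squared deviations of x
        self.m2_y = 0.0   # sum of squared deviations of y
        self.c_xy = 0.0   # sum of co-deviations of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x   # deviation from the old mean of x
        dy = y - self.mean_y   # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        # Old x deviation times new y deviation gives the exact co-moment update.
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else float("nan")


rp = RunningPearson()
for x, y in zip([1, 2, 3, 4], [1, 3, 2, 4]):
    rp.update(x, y)
print(round(rp.correlation(), 6))  # 0.8
```

Correlating every pair among k numeric columns this way needs one accumulator per pair, so memory scales with the number of column pairs rather than the number of rows.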
Zedda uses a small fraction of the RAM that pandas needs:
| Dataset | pandas RAM | Zedda RAM |
|---|---|---|
| 1M rows, 10 cols | ~800 MB | ~2 MB |
| 10M rows, 30 cols | ~8 GB | ~6 MB |
| 1TB Parquet | OOM | ~50 MB |
Tested on MacBook Pro M2, 16 GB RAM.
| Dataset | pandas describe() | ydata-profiling | Zedda |
|---|---|---|---|
| Titanic (891 rows, 12 cols) | 0.8s | 42.0s | 0.019s ⚡ |
| Fraud (6.3M rows, 31 cols) | 8.2s (no insights) | OOM | 23.0s ⚡ |
| 1TB Parquet (footer mode) | OOM | OOM | 1.8s ⚡ |
The Zedda time on the Fraud dataset includes Smart Warnings and Pearson correlations.
| Status | Phase | Description |
|---|---|---|
| ✅ | Phase 1 | C++ streaming core (Welford, HyperLogLog) |
| ✅ | Phase 2 | Zero-copy Parquet + Arrow support |
| ✅ | Phase 3 | Intelligent Sampling Engine (1TB in 2s) |
| ✅ | Phase 3.1 | Smart Warnings, Data Quality Score, Pearson Correlation |
| ✅ | Phase 4 | zd.ml_ready() and zd.fix() — ML readiness + auto-fix code gen |
| ✅ | Phase 5 | zd.compare() — Data drift detection for production vs. baseline |
| ✅ | Phase 6 | zd.ask() — Natural language queries over your dataset |
Contributions are welcome and appreciated! Zedda is actively maintained and open to PRs of all sizes.
# 1. Fork and clone the repository
git clone https://github.com/Zedda-Labs/Zedda.git --recursive
cd Zedda
# 2. Install in editable/development mode
pip install -e ".[dev]"
# 3. Run the test suite
pytest tests/
# 4. Make your changes and open a PR!

See the full contribution guide:
CONTRIBUTING.md
If you discover a security vulnerability, please report it privately via GitHub's Security Advisories — do not open a public issue.
See:
SECURITY.md
Zedda is open source software licensed under the MIT License.
See LICENSE for details.
Built with passion and C++17
If Zedda saved you time, please give it a ⭐ on GitHub — it helps a lot!

