DataMorpher

A compact Python utility for seamless data conversion and cleaning, featuring both CLI and web interfaces. DataMorpher automatically detects data types, handles missing values, removes duplicates, and generates detailed conversion reports.

Features

Core Capabilities

Multi-format Support: Convert between CSV, Excel (.xlsx/.xls), and JSON formats
Automatic Type Detection: Intelligently identifies numeric, categorical, date, and boolean columns
Data Cleaning: Remove duplicates, detect anomalies, and impute missing values
Smart JSON Detection: Handles standard and newline-delimited JSON
Robust Validation: Cleans malformed numbers and dates
Detailed Reports: Generate Markdown reports with conversion statistics
Dual Interface: Choose between CLI for power users or Streamlit web app for non-technical users

Performance

Process 50,000+ row datasets in under 5 seconds
Minimal memory footprint with streaming processing
Type-safe operations with pandas backend

Installation

# Clone the repository
git clone https://github.com/NeurArk/DataMorpher.git
cd DataMorpher

# Install dependencies
pip install .

# For development (includes testing tools)
pip install .[dev]

Usage

Command Line Interface

Basic conversion:

python -m datamorpher --input data.csv --output data.xlsx

With cleaning and report:

python -m datamorpher --input sales.csv --output sales.json --clean --report report.md

Force overwrite existing files:

python -m datamorpher --input data.json --output data.csv --force

Web Interface (Streamlit)

Launch the web application:

streamlit run datamorpher/streamlit_app.py

The web interface provides:

Drag-and-drop file upload
Format selection via radio buttons
Cleaning options with checkboxes
One-click report generation
Instant file download

Project Structure

datamorpher/
├── __main__.py        # CLI entry point using Typer
├── converter.py       # Core conversion logic
├── cleaner.py         # Data cleaning operations
├── reporter.py        # Markdown report generation
└── streamlit_app.py   # Web interface
tests/
├── test_converter.py
├── test_cleaner.py
└── test_reporter.py
.github/
└── workflows/
    └── ci.yml         # Continuous integration
pyproject.toml         # Project configuration

Data Cleaning Options

When enabled, DataMorpher performs:

Duplicate Removal: Eliminates exact row duplicates
Anomaly Detection: Detects and reports issues like negative values in stock, infinity values, etc.
Missing Value Imputation:
- Numeric columns: Median value
- Categorical columns: Mode (most frequent value)
- Date columns: Left as missing
- Boolean columns: Mode
Smart Parsing:
- Extracts numbers from corrupted strings (e.g. "8000foo0" -> 8000)
- Converts textual expressions (e.g. "four hundred fifty" -> 450)
- Handles special patterns like "95ABC.50" -> 95.50
Date Formats: Supports multiple formats including %Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y/%m/%d and textual formats like "March 20 2023"

Conversion Report

Each conversion generates a Markdown report containing:

Number of rows read/written
Duplicates removed count
Values imputed per column
Detected data types
Total execution time
Detailed transformations of cleaned values
Detected anomalies and warnings

Example report snippet:

# DataMorpher Conversion Report

## Summary
- Input: sales.csv (50,000 rows)
- Output: sales.xlsx (49,850 rows)
- Duplicates removed: 150
- Execution time: 2.3s

## Column Types Detected
- OrderID: numeric
- CustomerName: categorical
- OrderDate: date
- Amount: numeric
- Status: categorical

The report also lists transformations applied and anomalies detected, for example:

## Applied Transformations
### Column 'price'
- $49.99 -> 49.99 (currency conversion)
- 200$ -> 200.0 (currency conversion)

## Notes and Warnings
- Column 'stock' contains 1 negative value(s)
- Column 'stock' contains 1 infinite value(s)

Sample Data

A messy CSV file is included in sample_data/test_messy_data_improved.csv to test the cleaning features, anomaly detection, and edge cases.

Development

Running Tests

pytest

Linting

ruff check .

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Requirements

Python 3.12+
pandas
openpyxl
streamlit
typer
tabulate

License

MIT License - see LICENSE file for details

Acknowledgments

Built with modern Python tooling including:

pandas for data manipulation
Typer for CLI development
Streamlit for web interfaces
Ruff for code quality
pytest for testing

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.github/workflows		.github/workflows
datamorpher		datamorpher
sample_data		sample_data
tests		tests
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
sample_report.md		sample_report.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataMorpher

Features

Core Capabilities

Performance

Installation

Usage

Command Line Interface

Web Interface (Streamlit)

Project Structure

Data Cleaning Options

Conversion Report

Sample Data

Development

Running Tests

Linting

Contributing

Requirements

License

Acknowledgments

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DataMorpher

Features

Core Capabilities

Performance

Installation

Usage

Command Line Interface

Web Interface (Streamlit)

Project Structure

Data Cleaning Options

Conversion Report

Sample Data

Development

Running Tests

Linting

Contributing

Requirements

License

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages