Prajwal18py/Auto-eda-cleaner

📊 Auto EDA & Data Cleaning Tool

A powerful, interactive web application built with Streamlit that streamlines the Data Science workflow. This tool allows users to upload datasets, perform comprehensive Exploratory Data Analysis (EDA), apply advanced data cleaning techniques (including ML-based outlier detection), and download the processed data.

🚀 Features

1. 📋 Data Overview

  • Instant Preview: View the head of your dataset immediately after upload.
  • Metadata: Automatically generate column data types, non-null counts, and unique value counts.
  • Missing Value Analysis: Get a summarized report of missing data percentages per column.
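
The overview items above can be approximated with a few pandas calls. This is a minimal sketch, not the app's actual code: the `overview` function name and the sample data are assumptions made for illustration.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-null count, unique count, and missing percentage.

    Hypothetical helper; the real app may compute these differently.
    """
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a value
        "missing_pct": (df.isna().mean() * 100).round(2),
    })


# Made-up sample data standing in for an uploaded CSV.
df = pd.DataFrame({
    "age": [25, None, 31, 31],
    "city": ["Pune", "Delhi", None, None],
})

print(df.head())      # instant preview
summary = overview(df)
print(summary)        # metadata and missing-value report
```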

2. 🧹 Advanced Data Cleaning

  • ⚡ Auto-Clean Mode: A one-click solution that removes duplicates and intelligently fills missing values (median for numbers, mode for categories).

  • Handling Missing Values:

      • Drop rows.

      • Impute with Mean, Median, or Mode.

      • Fill with a specific custom value.

  • Duplicate Removal: Detect and remove duplicate rows instantly.

  • Outlier Detection & Removal:

      • IQR Method: Standard statistical method for outlier removal.

      • Isolation Forest: Machine Learning algorithm (Unsupervised) to detect anomalies in complex distributions.
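
For readers curious how the two outlier methods differ, here is a rough sketch using pandas and scikit-learn. The function names, the 1.5 IQR multiplier, and the `contamination` value are assumptions chosen for illustration; the app's own defaults may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the common convention (assumed here). Rows with a missing
    value in `column` are kept, since missing data is handled separately.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr) | df[column].isna()
    return df[keep]


def remove_outliers_iforest(df: pd.DataFrame, columns: list,
                            contamination: float = 0.05,
                            random_state: int = 42) -> pd.DataFrame:
    """Drop rows that an Isolation Forest labels as anomalies (-1).

    The model cannot take NaN, so it is fitted on complete rows only;
    rows with missing values in `columns` are left untouched.
    """
    features = df[columns].dropna()
    model = IsolationForest(contamination=contamination, random_state=random_state)
    labels = model.fit_predict(features)  # 1 = inlier, -1 = outlier
    return df.drop(index=features.index[labels == -1])


# Synthetic data: 200 ordinary values plus one obvious anomaly.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": np.append(rng.normal(50, 5, 200), 500.0)})

iqr_clean = remove_outliers_iqr(df, "value")
iso_clean = remove_outliers_iforest(df, ["value"])
print(len(df), len(iqr_clean), len(iso_clean))
```

The IQR rule works on one column at a time, while the Isolation Forest can be fitted on several columns together, which is why it suits distributions where no single column looks unusual on its own.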

3. 📊 Exploratory Data Analysis (EDA)

  • Statistical Summary: Detailed descriptive statistics (mean, std, min, max, percentiles).
  • Interactive Visualizations:
      • Distributions: Histograms with interactive tooltips.
      • Box Plots: For spotting outliers visually.
      • Correlation Heatmap: Visualize relationships between numeric variables.
      • Categorical Counts: Bar charts of the most frequent categories.
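
The numbers behind these views can be computed with plain pandas before any chart is drawn. The sketch below shows only that data preparation; the function names and sample data are hypothetical, and the plotting step itself (Plotly is listed in requirements.txt) is left out.

```python
import pandas as pd


def numeric_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, std, min, max, and quartiles for numeric columns."""
    return df.describe()


def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between numeric columns (heatmap input)."""
    return df.select_dtypes(include="number").corr()


def top_categories(df: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """The n most frequent values of a categorical column (bar chart input)."""
    return df[column].value_counts().head(n)


# Made-up sample data.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "label": ["a", "a", "b", "a"],
})

stats = numeric_summary(df)
corr = correlation_matrix(df)
top = top_categories(df, "label", n=1)
print(stats)
print(corr)
print(top)
```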

4. 💾 Export

  • Comparison Metrics: See how many rows were removed during cleaning.
  • Download: Export the final cleaned dataset as a .csv file.
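
A possible shape for the export step, assuming the original and cleaned DataFrames are both still available. The helper names are illustrative and not taken from the project.

```python
import pandas as pd


def cleaning_report(original: pd.DataFrame, cleaned: pd.DataFrame) -> dict:
    """Row counts before and after cleaning, plus the difference."""
    return {
        "rows_before": len(original),
        "rows_after": len(cleaned),
        "rows_removed": len(original) - len(cleaned),
    }


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a DataFrame to UTF-8 CSV bytes without the index column."""
    return df.to_csv(index=False).encode("utf-8")


# Made-up example: one duplicate row gets removed.
original = pd.DataFrame({"id": [1, 2, 2, 3]})
cleaned = original.drop_duplicates()

report = cleaning_report(original, cleaned)
payload = to_csv_bytes(cleaned)
print(report)
```

In a Streamlit app, bytes like `payload` could be handed to `st.download_button` as its `data` argument to offer the file for download.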

📂 Project Structure

To ensure the imports work correctly, organize your files as follows:

auto-eda-tool/
│
├── app.py                 # The main Streamlit application
├── requirements.txt       # List of dependencies
└── utils/
    ├── __init__.py        # Empty file to make utils a Python package
    ├── data_cleaner.py    # Contains the cleaning logic functions
    └── eda_functions.py   # Contains the plotting and stats functions


🛠️ Installation & Setup

  1. Clone the repository (or create the folder structure above):

mkdir auto-eda-tool
cd auto-eda-tool

  2. Create a virtual environment (recommended):

# Windows
python -m venv venv
venv\Scripts\activate

# Mac/Linux
python3 -m venv venv
source venv/bin/activate

  3. Install dependencies: Create a requirements.txt file with the contents below, then run:

pip install -r requirements.txt

requirements.txt content:

streamlit
pandas
numpy
scikit-learn
plotly

  4. Run the application:

streamlit run app.py

📖 Usage Guide

  1. Upload: Use the sidebar to upload a CSV file.
  2. Overview Tab: Check the "Missing Values Summary" to see what needs fixing.
  3. Cleaning Tab:
      • Use "Auto Clean" for a quick fix.
      • Or, go step-by-step: Pick a strategy for missing values -> Remove duplicates -> Select a column to strip outliers.
      • Note: The app uses Session State, so you can perform multiple cleaning actions in sequence.
  4. EDA Tab: Select specific columns to visualize their distribution or check the heatmap for correlations.
  5. Download Tab: Review the final row count and download your clean dataset.
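
To illustrate why keeping the working DataFrame in session state allows cleaning actions to be chained, here is a small sketch. It uses a plain dict in place of `st.session_state` (which supports the same key-based access) so it runs without Streamlit. The function names, the `"df"` and `"history"` keys, and the sample data are all assumptions; `auto_clean` follows the behaviour described under Features (drop duplicates, median for numbers, mode for categories).

```python
import pandas as pd


def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows."""
    return df.drop_duplicates()


def auto_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, then fill numeric gaps with the median
    and all other gaps with the column's most frequent value."""
    df = df.drop_duplicates().copy()
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            modes = df[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                df[col] = df[col].fillna(modes.iloc[0])
    return df


def apply_step(state, step) -> None:
    """Run one cleaning step on the stored DataFrame and save the result,
    so the next action starts from the already-cleaned data."""
    state["df"] = step(state["df"])
    if "history" not in state:
        state["history"] = []
    state["history"].append(step.__name__)


# A plain dict stands in for st.session_state in this sketch.
state = {"df": pd.DataFrame({
    "score": [1.0, None, 3.0, 3.0],
    "grade": ["a", "a", None, None],
})}

apply_step(state, remove_duplicates)
apply_step(state, auto_clean)
print(state["history"])
print(state["df"])
```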

🧰 Tech Stack

  • Streamlit
  • pandas
  • NumPy
  • scikit-learn
  • Plotly

🤝 Contributing

Contributions are welcome!

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/NewFeature).
  3. Commit your changes.
  4. Push to the branch.
  5. Open a Pull Request.

📄 License

This project is open-source and available under the MIT License.
