An interactive web application, built with Streamlit, that streamlines the data science workflow. It lets users upload datasets, perform Exploratory Data Analysis (EDA), apply data cleaning techniques (including ML-based outlier detection), and download the processed data.
- Instant Preview: View the head of your dataset immediately after upload.
- Metadata: Automatically generate column data types, non-null counts, and unique value counts.
- Missing Value Analysis: Get a summarized report of missing data percentages per column.
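
As a rough illustration of the metadata and missing-value report described above, the following sketch builds a per-column summary with pandas. The function name `summarize_columns` and the output column names are assumptions made for this example, not the app's actual code.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, non-null count, unique count, and missing % per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a unique value
        "missing_pct": (df.isna().mean() * 100).round(2),
    })


# Small made-up dataset used only to exercise the function
sample = pd.DataFrame({
    "age": [25, None, 40, 25],
    "city": ["Oslo", "Rome", None, None],
})
summary = summarize_columns(sample)
print(summary)
```

Each row of `summary` corresponds to one column of the input, so the result can be displayed directly as a table.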
- ⚡ Auto-Clean Mode: A one-click solution that removes duplicates and intelligently fills missing values (median for numbers, mode for categories).
- Handling Missing Values:
  - Drop rows.
  - Impute with Mean, Median, or Mode.
  - Fill with a specific custom value.
- Duplicate Removal: Detect and remove duplicate rows instantly.
- Outlier Detection & Removal:
  - IQR Method: Standard statistical method for outlier removal.
  - Isolation Forest: Machine Learning algorithm (Unsupervised) to detect anomalies in complex distributions.
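
To make the IQR method concrete, here is a minimal sketch of the usual rule: keep values within `k` times the interquartile range of the first and third quartiles. The function name `remove_outliers_iqr` and the default `k=1.5` are assumptions for this example; the app may use different names or thresholds.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive; rows with NaN in `column` are dropped as well
    return df[df[column].between(lower, upper)]


# Made-up data with one obvious outlier
data = pd.DataFrame({"salary": [40, 42, 45, 47, 50, 400]})
trimmed = remove_outliers_iqr(data, "salary")
print(trimmed)
```

Because the filter is applied to a single column, repeated calls on different columns can be chained to clean several features in turn.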
- Statistical Summary: Detailed descriptive statistics (mean, std, min, max, percentiles).
- Interactive Visualizations:
  - Distributions: Histograms with interactive tooltips.
  - Box Plots: For spotting outliers visually.
  - Correlation Heatmap: Visualize relationships between numeric variables.
  - Categorical Counts: Bar charts for top appearing categories.
- Comparison Metrics: See how many rows were removed during cleaning.
- Download: Export the final cleaned dataset as a .csv file.
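
The Auto-Clean behaviour listed among the features (remove duplicates, then fill numeric gaps with the median and categorical gaps with the mode) could look roughly like this in pandas. `auto_clean` is a hypothetical name chosen for this sketch, and the app's own implementation may differ.

```python
import pandas as pd


def auto_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then fill missing values column by column."""
    cleaned = df.drop_duplicates().copy()
    for col in cleaned.columns:
        if not cleaned[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
        else:
            modes = cleaned[col].mode()
            if not modes.empty:  # a column that is entirely missing has no mode
                cleaned[col] = cleaned[col].fillna(modes.iloc[0])
    return cleaned.reset_index(drop=True)


# Made-up data: one duplicate row, one missing price, one missing color
raw = pd.DataFrame({
    "price": [10.0, 10.0, None, 30.0, 50.0],
    "color": ["red", "red", "blue", None, "red"],
})
tidy = auto_clean(raw)
print(tidy)
```

Duplicates are removed before imputation here so that repeated rows do not skew the median or mode; that ordering is a design choice of this sketch.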
To ensure the imports work correctly, organize your files as follows:
auto-eda-tool/
│
├── app.py # The main Streamlit application
├── requirements.txt # List of dependencies
└── utils/
    ├── __init__.py # Empty file to make utils a Python package
    ├── data_cleaner.py # Contains the cleaning logic functions
    └── eda_functions.py # Contains the plotting and stats functions
- Clone the repository (or create the folder structure above):
mkdir auto-eda-tool
cd auto-eda-tool
- Create a virtual environment (Recommended):
# Windows
python -m venv venv
venv\Scripts\activate
# Mac/Linux
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
Create a requirements.txt file with the contents below, then run:
pip install -r requirements.txt
requirements.txt content:
streamlit
pandas
numpy
scikit-learn
plotly
- Run the application:
streamlit run app.py
- Upload: Use the sidebar to upload a CSV file.
- Overview Tab: Check the "Missing Values Summary" to see what needs fixing.
- Cleaning Tab:
- Use "Auto Clean" for a quick fix.
- Or, go step-by-step: Pick a strategy for missing values -> Remove duplicates -> Select a column to strip outliers.
- Note: The app uses Session State, so you can perform multiple cleaning actions in sequence.
- EDA Tab: Select specific columns to visualize their distribution or check the heatmap for correlations.
- Download Tab: Review the final row count and download your clean dataset.
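
The note about Session State is what makes sequential cleaning possible: each action reads the current DataFrame from a persistent store and writes the result back. The sketch below shows that pattern, assuming the DataFrame is kept under a key named `"df"` (a hypothetical key name). A plain dict stands in for `st.session_state` so the example runs without Streamlit; `apply_step` is also a hypothetical helper.

```python
import pandas as pd


def apply_step(state, step):
    """Apply one cleaning step to state['df'] and track how many rows it removed."""
    before = len(state["df"])
    state["df"] = step(state["df"])
    removed = before - len(state["df"])
    state["rows_removed"] = state.get("rows_removed", 0) + removed
    return state["df"]


# In the app, `state` would be st.session_state (an assumption of this sketch)
state = {"df": pd.DataFrame({"x": [1, 1, None, 4]})}

apply_step(state, lambda df: df.drop_duplicates())  # removes the repeated 1
apply_step(state, lambda df: df.dropna())           # removes the missing value

print(state["rows_removed"], len(state["df"]))
```

The running `rows_removed` total is the kind of figure the Download tab could report alongside the final row count.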
- Frontend: Streamlit
- Data Manipulation: Pandas, NumPy
- Machine Learning: Scikit-learn (SimpleImputer, LabelEncoder, StandardScaler, IsolationForest)
- Visualization: Plotly Express
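
Of the scikit-learn components listed above, `IsolationForest` is the one behind the ML-based outlier removal. The following is a minimal sketch of how it can be used to filter rows. The helper name `remove_outliers_iforest` and the `contamination=0.1` setting are assumptions for illustration, not values taken from the app.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest


def remove_outliers_iforest(df, columns, contamination=0.1, random_state=42):
    """Keep rows that IsolationForest labels as inliers on the given columns."""
    model = IsolationForest(contamination=contamination, random_state=random_state)
    # fit_predict returns 1 for inliers and -1 for outliers;
    # the selected columns must be numeric and free of missing values
    labels = model.fit_predict(df[columns])
    return df[labels == 1]


# Made-up readings clustered around 10-13 with one extreme value
readings = pd.DataFrame({"value": [10, 11, 12, 11, 10, 13, 12, 11, 10, 12,
                                   11, 13, 12, 10, 11, 12, 13, 11, 10, 1000]})
kept = remove_outliers_iforest(readings, ["value"])
print(len(readings) - len(kept), "rows flagged as outliers")
```

Note that `contamination` fixes the approximate share of rows treated as anomalies, so some borderline but legitimate rows may be removed along with the extreme one.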
Contributions are welcome!
- Fork the repository.
- Create a feature branch (git checkout -b feature/NewFeature).
- Commit your changes.
- Push to the branch.
- Open a Pull Request.
This project is open-source and available under the MIT License.