🧠 Agentic Data Scientist

An autonomous, multi-agent data science pipeline that takes a raw CSV and delivers cleaned data, trained models, visualisations, and a PDF report — with zero manual intervention.

Demo

Watch the demo on YouTube: https://youtu.be/RiIRHVYMaQU?si=y3EDQNewk33L4Hkz

✨ What It Does

Upload a CSV, pick a target column and problem type, and the pipeline runs itself end-to-end through six stages:

Stage	Agent	What Happens
1	Profiler	Statistical analysis — missing values, skew, cardinality, correlations
2	Cleaner + Manager	Imputation, outlier capping, encoding, scaling — Manager validates and may request rework
3	Pipeline	Feature/target split, class distribution check
4	Visualiser	Correlation heatmap, distributions, box plots, pairplot, feature importances
5	ML Engineer + Manager	RandomForest, XGBoost, LightGBM tuned with Optuna — Manager reviews performance
6	Report Generator	Full PDF report with profile, cleaning log, plots, and metrics

The Manager Agent acts as a quality gate at stages 2 and 5. It evaluates each stage's results using an LLM (with a rule-based fallback) and can trigger rework loops with specific feedback before the pipeline moves on.
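
The rework loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `Verdict`, `run_with_gate`, the `max_retries` limit, and the toy cleaner and manager are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Verdict:
    """Hypothetical container for the Manager's decision."""
    approved: bool
    feedback: str = ""


def run_with_gate(
    stage: Callable[[Optional[str]], Any],
    review: Callable[[Any], Verdict],
    max_retries: int = 2,  # assumed limit, not taken from the project
) -> Any:
    """Run a stage, let the reviewer inspect it, and rerun with feedback if rejected."""
    feedback: Optional[str] = None
    result = None
    for _ in range(max_retries + 1):
        result = stage(feedback)      # the first attempt receives no feedback
        verdict = review(result)
        if verdict.approved:
            return result
        feedback = verdict.feedback   # hand the specific feedback to the next attempt
    return result                     # retries exhausted: continue with the last result


# Toy usage: a "cleaner" that only scales the data once it is asked to.
def toy_cleaner(feedback: Optional[str]) -> dict:
    return {"scaled": feedback is not None and "scale" in feedback}


def toy_manager(result: dict) -> Verdict:
    if result["scaled"]:
        return Verdict(approved=True)
    return Verdict(approved=False, feedback="please scale numeric features")


final = run_with_gate(toy_cleaner, toy_manager)
```

Bounding the number of retries keeps the pipeline moving even when the reviewer never approves.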

🏗️ Architecture

Agentic_data_scientist/
│
├── app.py                  # Streamlit frontend
├── api.py                  # FastAPI backend (async, background tasks)
│
├── agents/
│   ├── profiler.py         # Stage 1 — statistical profiling
│   ├── cleaner.py          # Stage 2 — data cleaning
│   ├── visualizer.py       # Stage 4 — Plotly EDA charts
│   ├── ml_engineer.py      # Stage 5 — Optuna hyperparameter tuning
│   ├── report_generator.py # Stage 6 — PDF report
│   └── tools.py            # Placeholder for future agent tools
│
└── core/
    ├── pipeline.py         # Orchestrates all six stages
    ├── manager_validator.py # LLM-powered quality gate (with rule fallback)
    ├── llm.py              # HuggingFace LLM wrapper
    └── models.py           # Pydantic request/response models

The frontend and backend are decoupled — Streamlit talks to FastAPI over HTTP, so the UI stays responsive while the pipeline runs in the background.

⚙️ Setup

1. Clone and install

git clone <your-repo>
cd Agentic_data_scientist
pip install -r requirements.txt

2. Configure environment

Copy .env and add your HuggingFace API key (optional — free models work without one):

# .env
HF_API_KEY=hf_your_key_here          # optional, needed for gated models
HF_MODEL_NAME=Qwen/Qwen2.5-7B-Instruct  # recommended

Recommended model: Qwen/Qwen2.5-7B-Instruct — best JSON reliability for the Manager Agent. See the Model Selection section below.
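
As an illustration, the two variables could be read like this once they are present in the process environment (for example, after a `.env` loader has run). The function name `load_llm_settings` and the choice of fallback model are assumptions made for this sketch; the project's `core/llm.py` may handle configuration differently.

```python
import os
from typing import Mapping, Optional

# Assumed fallback for this sketch only; the project's real default may differ.
FALLBACK_MODEL = "Qwen/Qwen2.5-7B-Instruct"


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the API key (or None) and model name from environment-style variables."""
    env = os.environ if env is None else env
    api_key = env.get("HF_API_KEY") or None            # optional; empty string counts as unset
    model_name = env.get("HF_MODEL_NAME") or FALLBACK_MODEL
    return {"api_key": api_key, "model_name": model_name}


settings = load_llm_settings({"HF_MODEL_NAME": "microsoft/Phi-3.5-mini-instruct"})
```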

3. Start the API

python api.py
# or
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

4. Start the UI (in a second terminal)

streamlit run app.py

Open http://localhost:8501 in your browser.

🚀 Usage

1. Upload a CSV file via the sidebar
2. Select the target column from the dropdown
3. Choose problem type: classification or regression
4. Click 🚀 Run Pipeline
5. Watch real-time logs as each stage completes
6. Download the trained model (.pkl), cleaned data (.csv), or full report (.pdf) from the Downloads tab

🤖 Model Selection

The Manager Agent uses an LLM to evaluate cleaning and model quality. The model is configured via HF_MODEL_NAME in .env.

Model	Size	Notes
`Qwen/Qwen2.5-7B-Instruct`	7B	Recommended. Best JSON output, ungated, free
`meta-llama/Meta-Llama-3.1-8B-Instruct`	8B	Excellent reasoning, requires HF access approval
`microsoft/Phi-3.5-mini-instruct`	3.8B	Fastest, good for rate-limited free tier usage
`mistralai/Mistral-7B-Instruct-v0.3`	7B	Original default, works but more verbose

If the LLM call fails for any reason (rate limit, network, parse error), the Manager automatically falls back to rule-based evaluation — so the pipeline always completes.

🔬 How the Manager Agent Works

The Manager uses a two-tier system:

Tier 1 — LLM evaluation (when API is reachable)

Sends the data profile + cleaning log (or model metrics) to the LLM
Expects a JSON response: {"approved": true/false, "feedback": "..."}
If approved: false, the feedback is passed back to the relevant agent for a targeted retry
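
LLM replies often wrap the JSON in extra prose, so a tolerant parser helps. The sketch below shows one way to extract the verdict and signal failure with `None`, at which point a caller could switch to the rule-based tier. `parse_verdict` is a hypothetical helper, not a function from this repository.

```python
import json
import re
from typing import Optional


def parse_verdict(raw: str) -> Optional[dict]:
    """Extract {"approved": bool, "feedback": str} from an LLM reply, or None if unusable."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # tolerate prose around the JSON
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Require a real boolean so that values like "yes" are treated as a parse failure.
    if not isinstance(data, dict) or not isinstance(data.get("approved"), bool):
        return None
    return {"approved": data["approved"], "feedback": str(data.get("feedback", ""))}


verdict = parse_verdict('Sure! {"approved": false, "feedback": "impute age"} Hope that helps.')
```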

Tier 2 — Rule-based fallback (always available)

Cleaning: checks that missing values, outliers, and scaling were all addressed
ML: checks that the best model's held-out test score clears a minimum threshold (0.65 accuracy for classification, 0.40 R² for regression)
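
The fallback rules can be sketched in a few lines. The thresholds come from the description above; everything else is an assumption made for illustration, including the function names, the step labels in `REQUIRED_CLEANING_STEPS`, and the choice to treat a score exactly equal to the threshold as passing.

```python
# Minimum held-out test scores quoted above (accuracy and R² respectively).
MIN_SCORE = {"classification": 0.65, "regression": 0.40}

# Hypothetical labels for the three things the cleaning check looks for.
REQUIRED_CLEANING_STEPS = ("missing_values", "outliers", "scaling")


def rule_based_cleaning_check(steps_done: set) -> dict:
    """Approve only if every required cleaning step was recorded as done."""
    missing = [step for step in REQUIRED_CLEANING_STEPS if step not in steps_done]
    feedback = "" if not missing else "not addressed: " + ", ".join(missing)
    return {"approved": not missing, "feedback": feedback}


def rule_based_ml_check(problem_type: str, test_score: float) -> dict:
    """Approve if the best model's test score reaches the minimum for its problem type."""
    threshold = MIN_SCORE[problem_type]
    approved = test_score >= threshold  # assumption: equality counts as a pass
    feedback = "" if approved else (
        f"test score {test_score:.2f} is below the minimum of {threshold:.2f}"
    )
    return {"approved": approved, "feedback": feedback}


cleaning = rule_based_cleaning_check({"missing_values", "scaling"})
ml = rule_based_ml_check("regression", 0.35)
```

Both checks return the same shape as the LLM verdict, so downstream code does not need to know which tier produced the decision.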

📊 What the ML Engineer Does

Train/test split: 80/20 stratified split is made before any tuning — the test set is never seen during hyperparameter search
Optuna tuning: each model gets 10–20 trials on a ≤5000-row subsample of train data for speed
Final fit: best params are used to fit on the full train set
Metrics reported: CV scores on train and final scores on the held-out test set
Retry logic: if the Manager rejects performance, feedback keywords (e.g. "class imbalance", "overfit", "learning rate") adjust the hyperparameter search strategy before rerunning
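
To make the retry logic concrete, here is a hedged sketch of how feedback keywords might reshape a search space. The keywords are the ones listed above, but the parameter names, ranges, and specific adjustments are invented for illustration and are not taken from `ml_engineer.py`.

```python
from copy import deepcopy

# Hypothetical starting search space: (low, high) ranges per hyperparameter.
BASE_SPACE = {
    "n_estimators": (100, 500),
    "max_depth": (3, 12),
    "learning_rate": (0.01, 0.3),
    "class_weight": None,
}


def adjust_search_space(space: dict, feedback: str) -> dict:
    """Return a copy of the search space, tweaked according to keywords in the feedback."""
    adjusted = deepcopy(space)  # never mutate the caller's space
    text = feedback.lower()
    if "class imbalance" in text:
        adjusted["class_weight"] = "balanced"               # weight minority classes more
    if "overfit" in text:
        low, high = adjusted["max_depth"]
        adjusted["max_depth"] = (low, max(low, high // 2))  # shallower trees
    if "learning rate" in text:
        low, _ = adjusted["learning_rate"]
        adjusted["learning_rate"] = (low, 0.1)              # narrow toward smaller steps
    return adjusted


retry_space = adjust_search_space(BASE_SPACE, "Model seems to overfit; check class imbalance")
```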

🧹 Cleaning Strategy

Situation	Action
Column >70% missing	Drop
Duplicate rows	Remove
Numeric missing values	Median imputation (default) or mean (if Manager requests)
Categorical missing values	Mode imputation
Numeric outliers	IQR capping (or Yeo-Johnson / Box-Cox if Manager requests)
Low-cardinality categoricals (≤10)	One-hot encoding
Medium-cardinality (11–50)	Frequency encoding
High-cardinality (>50)	Target encoding (via `category_encoders`)
All numeric features	RobustScaler — target column is always excluded
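
A few of the rules in this table can be expressed as small pure functions. The sketch below uses only the standard library and is not the code in `agents/cleaner.py`. The function names are hypothetical, and the 1.5 × IQR fence and the "inclusive" quantile method are conventional choices assumed here, since the table does not specify them.

```python
import statistics


def should_drop_column(missing_fraction: float) -> bool:
    """Drop a column when more than 70% of its values are missing."""
    return missing_fraction > 0.70


def choose_encoding(n_unique: int) -> str:
    """Pick an encoding from the number of distinct categories."""
    if n_unique <= 10:
        return "one-hot"
    if n_unique <= 50:
        return "frequency"
    return "target"


def iqr_cap(values: list, k: float = 1.5) -> list:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


capped = iqr_cap([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

In this example Q1 is 3 and Q3 is 7, so the upper fence is 13 and the value 100 is capped to it while the other values are unchanged.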

📡 API Reference

The FastAPI backend exposes these endpoints:

Method	Endpoint	Description
`GET`	`/health`	Health check
`POST`	`/upload`	Upload a CSV file, returns `file_id`
`POST`	`/pipeline/run`	Start a pipeline run, returns `run_id`
`GET`	`/pipeline/status/{run_id}`	Poll status, progress, and logs
`GET`	`/pipeline/result/{run_id}`	Get final result once complete
`GET`	`/download/model/{run_id}`	Download best model `.pkl`
`GET`	`/download/data/{run_id}`	Download cleaned data `.csv`
`GET`	`/download/report/{run_id}`	Download PDF report

Interactive docs at http://localhost:8000/docs.
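
Because runs execute in the background, a client typically polls the status endpoint until the run finishes. The sketch below shows that loop with the HTTP call injected as a function, so it can be exercised without a running server. The `status` field and the values `"completed"` and `"failed"` are assumptions about the response shape; check the interactive docs for the real schema.

```python
import time
from typing import Callable

BASE_URL = "http://localhost:8000"


def status_url(run_id: str, base_url: str = BASE_URL) -> str:
    """Build the status endpoint URL for a run."""
    return f"{base_url}/pipeline/status/{run_id}"


def wait_for_run(
    fetch_status: Callable[[str], dict],
    run_id: str,
    interval_s: float = 2.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the run reports a terminal status, then return the last payload."""
    for _ in range(max_polls):
        payload = fetch_status(status_url(run_id))
        if payload.get("status") in {"completed", "failed"}:  # assumed status values
            return payload
        sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


# Toy usage with a fake fetcher standing in for a real HTTP GET.
replies = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
seen_urls = []


def fake_fetch(url: str) -> dict:
    seen_urls.append(url)
    return next(replies)


final_status = wait_for_run(fake_fetch, "abc123", sleep=lambda _seconds: None)
```

In real use, `fetch_status` would wrap an HTTP GET that returns the decoded JSON body, using whichever HTTP client you prefer.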

🧰 Tech Stack

Category	Libraries
Frontend	Streamlit, Plotly
Backend	FastAPI, Uvicorn
ML	scikit-learn, XGBoost, LightGBM, Optuna
Data	pandas, NumPy, SciPy
Encoding	category-encoders
LLM	LangChain, HuggingFace Hub
Report	fpdf2, Plotly (kaleido for image export)
Serving	joblib, python-multipart

📋 Requirements

Python 3.10+
See requirements.txt for all dependencies

⚠️ Limitations

Data Constraints

Accepts only CSV files — no Excel, JSON, Parquet, or database connections supported yet
Performance degrades on datasets exceeding ~500K rows due to in-memory Pandas processing
Multimodal data (images, text columns, time-series) is not handled by the current pipeline

Cleaning Agent

Imputation choices are limited to median or mean for numeric columns and mode for categorical columns; there is no adaptive selection based on data distribution
Medium-cardinality categorical columns (11 to 50 unique values) are frequency-encoded, so categories that occur equally often become indistinguishable to the model
Feature engineering (log transforms, polynomial features, interaction terms) is not performed

EDA Agent

Charts are static PNGs — no interactive drill-down or zoom capability
Anomaly detection is limited to IQR clipping; isolation-forest or DBSCAN-based outlier detection is absent

ML Engineer Agent

Hyperparameter search space is predefined — unusual datasets may benefit from wider grids
No support for neural networks, SVMs, or ensemble stacking
Class imbalance handling (SMOTE, class weights) is not automated when imbalance is detected
Models are exported only as .pkl files; there is no model registry integration

Agent Orchestration

LLM API costs accumulate on large datasets since every agent reasoning step consumes tokens
Pipeline is sequential — no parallel agent execution, meaning runtime scales linearly with complexity
Agent retries on tool failure are limited; a bad LLM response can stall the pipeline

Infrastructure

No authentication layer — the API endpoint is publicly accessible as deployed
Session state is in-memory; server restart loses all uploaded files and run history
Not containerized yet — environment setup requires manual dependency management

📄 License

MIT
