santosh374maker/Agentic_Data_Scientist

🧠 Agentic Data Scientist

An autonomous, multi-agent data science pipeline that takes a raw CSV and delivers cleaned data, trained models, visualisations, and a PDF report — with zero manual intervention.


Demo

YouTube: https://youtu.be/RiIRHVYMaQU?si=y3EDQNewk33L4Hkz

✨ What It Does

Upload a CSV, pick a target column and problem type, and the pipeline runs itself end-to-end through six stages:

| Stage | Agent | What Happens |
|-------|-------|--------------|
| 1 | Profiler | Statistical analysis: missing values, skew, cardinality, correlations |
| 2 | Cleaner + Manager | Imputation, outlier capping, encoding, scaling; the Manager validates and may request rework |
| 3 | Pipeline | Feature/target split, class distribution check |
| 4 | Visualiser | Correlation heatmap, distributions, box plots, pairplot, feature importances |
| 5 | ML Engineer + Manager | RandomForest, XGBoost, LightGBM tuned with Optuna; the Manager reviews performance |
| 6 | Report Generator | Full PDF report with profile, cleaning log, plots, and metrics |

The Manager Agent acts as a quality gate at stages 2 and 5. It evaluates each stage's output using an LLM (with a rule-based fallback) and can trigger rework loops with specific feedback before the pipeline moves on.
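
The rework loop described above can be sketched in a few lines. This is an illustration only, not the code in core/pipeline.py: the function name `run_with_gate`, the `stage` and `review` callables, and the `max_rounds` limit are all assumptions made for the example.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical signatures: a stage takes optional feedback and returns a result;
# a review takes that result and returns (approved, feedback).
Stage = Callable[[Optional[str]], Any]
Review = Callable[[Any], Tuple[bool, str]]


def run_with_gate(stage: Stage, review: Review, max_rounds: int = 3):
    """Run a stage, let the reviewer approve or reject it, and retry with feedback.

    Stops at the first approval or after max_rounds attempts (an assumed cap),
    returning the last result together with the review history.
    """
    feedback: Optional[str] = None
    history: List[Tuple[int, bool, str]] = []
    result = None
    for round_no in range(1, max_rounds + 1):
        result = stage(feedback)              # first call receives no feedback
        approved, feedback = review(result)   # reviewer decides and explains
        history.append((round_no, approved, feedback))
        if approved:
            break
    return result, history
```

Passing the reviewer in as a callable keeps the loop the same whether the verdict comes from an LLM or from fixed rules.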


🏗️ Architecture

Agentic_Data_Scientist/
│
├── app.py                  # Streamlit frontend
├── api.py                  # FastAPI backend (async, background tasks)
│
├── agents/
│   ├── profiler.py         # Stage 1 — statistical profiling
│   ├── cleaner.py          # Stage 2 — data cleaning
│   ├── visualizer.py       # Stage 4 — Plotly EDA charts
│   ├── ml_engineer.py      # Stage 5 — Optuna hyperparameter tuning
│   ├── report_generator.py # Stage 6 — PDF report
│   └── tools.py            # Placeholder for future agent tools
│
└── core/
    ├── pipeline.py         # Orchestrates all six stages
    ├── manager_validator.py# LLM-powered quality gate (with rule fallback)
    ├── llm.py              # HuggingFace LLM wrapper
    └── models.py           # Pydantic request/response models

The frontend and backend are decoupled — Streamlit talks to FastAPI over HTTP, so the UI stays responsive while the pipeline runs in the background.
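
The pattern behind that decoupling (start a long job, return an identifier immediately, let the client poll for status) can be shown with the standard library alone. The sketch below is a stand-in and not the project's FastAPI code: `RUNS`, `start_run`, and `get_status` are hypothetical names, and a plain thread replaces FastAPI's background tasks.

```python
import threading
import uuid

# In-memory run registry; everything here is lost when the process restarts.
RUNS = {}
_LOCK = threading.Lock()


def start_run(job):
    """Start job() on a background thread and return (run_id, thread) at once."""
    run_id = uuid.uuid4().hex
    with _LOCK:
        RUNS[run_id] = {"status": "running", "result": None}

    def worker():
        try:
            update = {"status": "completed", "result": job()}
        except Exception as exc:  # record the failure instead of crashing the thread
            update = {"status": "failed", "result": str(exc)}
        with _LOCK:
            RUNS[run_id] = update

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return run_id, thread


def get_status(run_id):
    """Return a copy of the current state for run_id."""
    with _LOCK:
        return dict(RUNS[run_id])
```

The caller gets `run_id` back before the job finishes, which is what lets a UI stay responsive while it polls.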


⚙️ Setup

1. Clone and install

git clone <your-repo>
cd Agentic_Data_Scientist
pip install -r requirements.txt

2. Configure environment

Edit .env and add your Hugging Face API key (optional; ungated models work without one):

# .env
HF_API_KEY=hf_your_key_here          # optional, needed for gated models
HF_MODEL_NAME=Qwen/Qwen2.5-7B-Instruct  # recommended

Recommended model: Qwen/Qwen2.5-7B-Instruct — best JSON reliability for the Manager Agent. See the Model Selection section below.
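
As a rough sketch of how these two variables might be read at startup: the helper below is hypothetical (the real wrapper lives in core/llm.py and may differ), and falling back to the recommended model when HF_MODEL_NAME is unset is an assumption of this example.

```python
import os

# Assumed default for this sketch: the model this README recommends.
DEFAULT_MODEL = "Qwen/Qwen2.5-7B-Instruct"


def load_llm_settings(env=None):
    """Read HF_API_KEY and HF_MODEL_NAME from a mapping (os.environ by default).

    A missing or blank key becomes None, since the key is optional.
    A missing or blank model name falls back to DEFAULT_MODEL.
    """
    env = os.environ if env is None else env
    api_key = env.get("HF_API_KEY", "").strip() or None
    model_name = env.get("HF_MODEL_NAME", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model_name": model_name}
```

Calling `load_llm_settings()` with no arguments reads the process environment. Loading the .env file into that environment is a separate step that this sketch does not cover.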

3. Start the API

python api.py
# or
uvicorn api:app --host 0.0.0.0 --port 8000 --reload

4. Start the UI (in a second terminal)

streamlit run app.py

Open http://localhost:8501 in your browser.


🚀 Usage

  1. Upload a CSV file via the sidebar
  2. Select the target column from the dropdown
  3. Choose problem type: classification or regression
  4. Click 🚀 Run Pipeline
  5. Watch real-time logs as each stage completes
  6. Download the trained model (.pkl), cleaned data (.csv), or full report (.pdf) from the Downloads tab

🤖 Model Selection

The Manager Agent uses an LLM to evaluate cleaning and model quality. The model is configured via HF_MODEL_NAME in .env.

| Model | Size | Notes |
|-------|------|-------|
| Qwen/Qwen2.5-7B-Instruct | 7B | Recommended. Best JSON output, ungated, free |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 8B | Excellent reasoning, requires HF access approval |
| microsoft/Phi-3.5-mini-instruct | 3.8B | Fastest, good for rate-limited free tier usage |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | Original default, works but more verbose |

If the LLM call fails for any reason (rate limit, network, parse error), the Manager automatically falls back to rule-based evaluation — so the pipeline always completes.


🔬 How the Manager Agent Works

The Manager uses a two-tier system:

Tier 1 — LLM evaluation (when API is reachable)

  • Sends the data profile + cleaning log (or model metrics) to the LLM
  • Expects a JSON response: {"approved": true/false, "feedback": "..."}
  • If approved: false, the feedback is passed back to the relevant agent for a targeted retry

Tier 2 — Rule-based fallback (always available)

  • Cleaning: checks that missing values, outliers, and scaling were all addressed
  • ML: checks that the best model's held-out test score clears a minimum threshold (0.65 accuracy for classification, 0.40 R² for regression)
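
A minimal sketch of the two tiers for the ML review, assuming the JSON shape and thresholds listed above. The function names, the `ask_llm` callable, and the `source` field are hypothetical. Treating a score exactly equal to the threshold as passing is also an assumption.

```python
import json

# Minimum held-out test scores from the rule-based tier described above.
MIN_SCORE = {"classification": 0.65, "regression": 0.40}


def rule_based_ml_review(problem_type, test_score):
    """Tier 2: approve when the test score reaches the minimum threshold."""
    threshold = MIN_SCORE[problem_type]
    approved = test_score >= threshold
    feedback = "" if approved else (
        f"test score {test_score:.2f} is below the minimum {threshold:.2f}"
    )
    return {"approved": approved, "feedback": feedback, "source": "rules"}


def parse_llm_verdict(raw):
    """Validate the expected {"approved": bool, "feedback": str} JSON object."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("approved"), bool):
        raise ValueError("verdict must be a JSON object with a boolean 'approved'")
    return {
        "approved": data["approved"],
        "feedback": str(data.get("feedback", "")),
        "source": "llm",
    }


def review_ml(problem_type, test_score, ask_llm=None):
    """Tier 1 when an LLM callable is given and succeeds, otherwise Tier 2."""
    if ask_llm is not None:
        try:
            return parse_llm_verdict(ask_llm(problem_type, test_score))
        except Exception:
            pass  # rate limit, network error, or unparseable reply: use the rules
    return rule_based_ml_review(problem_type, test_score)
```

Because every failure path ends in the rule-based function, the review always returns a verdict.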

📊 What the ML Engineer Does

  • Train/test split: 80/20 stratified split is made before any tuning — the test set is never seen during hyperparameter search
  • Optuna tuning: each model gets 10–20 trials on a ≤5000-row subsample of train data for speed
  • Final fit: best params are used to fit on the full train set
  • Metrics reported: CV scores on train and final scores on the held-out test set
  • Retry logic: if the Manager rejects performance, feedback keywords (e.g. "class imbalance", "overfit", "learning rate") adjust the hyperparameter search strategy before rerunning
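
The ordering of the first two steps matters: the test rows are set aside first, and the tuning subsample is drawn only from the remaining train rows. Below is a dependency-free sketch of that ordering on row indices. The function names and the seed are assumptions, and the real agent most likely relies on scikit-learn and Optuna rather than hand-rolled helpers. The source does not say whether the subsample is stratified, so a plain random sample is used here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)  # per-class share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def tuning_subsample(train_idx, max_rows=5000, seed=42):
    """Cap the rows used during hyperparameter search; never touches test rows."""
    if len(train_idx) <= max_rows:
        return list(train_idx)
    return sorted(random.Random(seed).sample(train_idx, max_rows))
```

Hyperparameter trials would then score candidates on the rows from `tuning_subsample`, the final fit would use all of `train_idx`, and the test rows would be used once for the reported metrics.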

🧹 Cleaning Strategy

| Situation | Action |
|-----------|--------|
| Column >70% missing | Drop |
| Duplicate rows | Remove |
| Numeric missing values | Median imputation (default) or mean (if Manager requests) |
| Categorical missing values | Mode imputation |
| Numeric outliers | IQR capping (or Yeo-Johnson / Box-Cox if Manager requests) |
| Low-cardinality categoricals (≤10) | One-hot encoding |
| Medium-cardinality (11–50) | Frequency encoding |
| High-cardinality (>50) | Target encoding (via category_encoders) |
| All numeric features | RobustScaler (the target column is always excluded) |
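
The default decisions in this table amount to a small rule set per column. The sketch below encodes only the defaults (no Manager overrides, no duplicate-row removal, no target-column exclusion). The function name and step labels are hypothetical and do not come from agents/cleaner.py.

```python
def plan_column(is_numeric, missing_frac, n_unique):
    """Return the default cleaning steps for one feature column.

    is_numeric   -- True for numeric columns, False for categoricals
    missing_frac -- share of missing values, between 0.0 and 1.0
    n_unique     -- number of distinct values (used for categoricals only)
    """
    if missing_frac > 0.70:
        return ["drop"]  # too sparse to be worth imputing

    if is_numeric:
        return ["median_impute", "iqr_cap", "robust_scale"]

    steps = ["mode_impute"]
    if n_unique <= 10:
        steps.append("one_hot")            # low cardinality
    elif n_unique <= 50:
        steps.append("frequency_encode")   # medium cardinality
    else:
        steps.append("target_encode")      # high cardinality
    return steps
```

For example, `plan_column(False, 0.05, 30)` returns `["mode_impute", "frequency_encode"]`.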

📡 API Reference

The FastAPI backend exposes these endpoints:

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /health | Health check |
| POST | /upload | Upload a CSV file, returns file_id |
| POST | /pipeline/run | Start a pipeline run, returns run_id |
| GET | /pipeline/status/{run_id} | Poll status, progress, and logs |
| GET | /pipeline/result/{run_id} | Get final result once complete |
| GET | /download/model/{run_id} | Download best model .pkl |
| GET | /download/data/{run_id} | Download cleaned data .csv |
| GET | /download/report/{run_id} | Download PDF report |

Interactive docs at http://localhost:8000/docs.
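
A client that starts a run typically polls the status endpoint until the run finishes. The sketch below does that with the standard library. The endpoint path comes from the table above, but the response shape is an assumption: a `status` field ending in `"completed"` or `"failed"` is not documented here, so check the interactive docs for the real schema.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def http_get_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_run(run_id, get_json=http_get_json, poll_seconds=2.0,
                 max_polls=300, sleep=time.sleep):
    """Poll /pipeline/status/{run_id} until a terminal status or give up.

    The field name "status" and its terminal values are assumptions made
    for this sketch; adjust them to the actual response model.
    """
    url = f"{BASE_URL}/pipeline/status/{run_id}"
    for _ in range(max_polls):
        status = get_json(url)
        if status.get("status") in ("completed", "failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

The `get_json` and `sleep` parameters are there so the polling logic can be exercised without a running server.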


🧰 Tech Stack

| Category | Libraries |
|----------|-----------|
| Frontend | Streamlit, Plotly |
| Backend | FastAPI, Uvicorn |
| ML | scikit-learn, XGBoost, LightGBM, Optuna |
| Data | pandas, NumPy, SciPy |
| Encoding | category-encoders |
| LLM | LangChain, HuggingFace Hub |
| Report | fpdf2, Plotly (kaleido for image export) |
| Serving | joblib, python-multipart |

📋 Requirements

  • Python 3.10+
  • See requirements.txt for all dependencies

⚠️ Limitations

Data Constraints

  • Accepts only CSV files — no Excel, JSON, Parquet, or database connections supported yet
  • Performance degrades on datasets exceeding ~500K rows due to in-memory Pandas processing
  • Multimodal data (images, text columns, time-series) is not handled by the current pipeline

Cleaning Agent

  • Imputation defaults to median/mode and switches to mean only when the Manager requests it; there is no adaptive selection based on data distribution
  • Encoding is chosen by fixed cardinality thresholds (one-hot up to 10 unique values, frequency encoding for 11–50, target encoding above 50); the thresholds are not adapted to the dataset
  • Feature engineering (log transforms, polynomial features, interaction terms) is not performed

EDA Agent

  • Charts are static PNGs — no interactive drill-down or zoom capability
  • Anomaly detection is limited to IQR clipping; isolation-forest or DBSCAN-based outlier detection is absent

ML Engineer Agent

  • Hyperparameter search space is predefined — unusual datasets may benefit from wider grids
  • No support for neural networks, SVMs, or ensemble stacking
  • Class imbalance handling (SMOTE, class weights) is not automated when imbalance is detected
  • Models are exported only as .pkl files; there is no model registry integration

Agent Orchestration

  • LLM API costs accumulate on large datasets since every agent reasoning step consumes tokens
  • Pipeline is sequential — no parallel agent execution, meaning runtime scales linearly with complexity
  • Rework retries are limited; if the LLM response is unusable, the Manager falls back to rule-based evaluation rather than retrying the LLM

Infrastructure

  • No authentication layer — the API endpoint is publicly accessible as deployed
  • Session state is in-memory; server restart loses all uploaded files and run history
  • Not containerized yet — environment setup requires manual dependency management

📄 License

MIT
