An autonomous, multi-agent data science pipeline that takes a raw CSV and delivers cleaned data, trained models, visualisations, and a PDF report — with zero manual intervention.
YouTube link: https://youtu.be/RiIRHVYMaQU?si=y3EDQNewk33L4Hkz
Upload a CSV, pick a target column and problem type, and the pipeline runs itself end-to-end through six stages:
| Stage | Agent | What Happens |
|---|---|---|
| 1 | Profiler | Statistical analysis — missing values, skew, cardinality, correlations |
| 2 | Cleaner + Manager | Imputation, outlier capping, encoding, scaling — Manager validates and may request rework |
| 3 | Pipeline | Feature/target split, class distribution check |
| 4 | Visualiser | Correlation heatmap, distributions, box plots, pairplot, feature importances |
| 5 | ML Engineer + Manager | RandomForest, XGBoost, LightGBM tuned with Optuna — Manager reviews performance |
| 6 | Report Generator | Full PDF report with profile, cleaning log, plots, and metrics |
The Manager Agent acts as a quality gate at stages 2 and 5. It evaluates results using an LLM (with a rule-based fallback) and can trigger rework loops with specific feedback before the pipeline moves on.
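The rework loop can be pictured with a minimal sketch. The names `Verdict`, `run_with_gate`, `stage`, and `review`, and the `max_rework` cap, are hypothetical and only illustrate the control flow; the real logic lives in core/manager_validator.py and core/pipeline.py and may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical shape of a Manager decision."""
    approved: bool
    feedback: str = ""


def run_with_gate(
    stage: Callable[[Optional[str]], dict],
    review: Callable[[dict], Verdict],
    max_rework: int = 2,  # assumed cap; the README does not state the real limit
) -> dict:
    """Run a stage, let the reviewer inspect the result, and retry with feedback."""
    feedback: Optional[str] = None
    result = stage(feedback)  # first attempt runs without feedback
    for _ in range(max_rework):
        verdict = review(result)
        if verdict.approved:
            break
        # Rejected: rerun the stage with the reviewer's feedback.
        feedback = verdict.feedback
        result = stage(feedback)
    # After the last allowed rework the pipeline moves on with what it has.
    return result
```

In this sketch a cleaner or ML stage would be passed as `stage` and the Manager's evaluation as `review`; capping the number of reworks keeps a persistently rejected stage from blocking the run.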
Agentic_data_scientist/
│
├── app.py # Streamlit frontend
├── api.py # FastAPI backend (async, background tasks)
│
├── agents/
│ ├── profiler.py # Stage 1 — statistical profiling
│ ├── cleaner.py # Stage 2 — data cleaning
│ ├── visualizer.py # Stage 4 — Plotly EDA charts
│ ├── ml_engineer.py # Stage 5 — Optuna hyperparameter tuning
│ ├── report_generator.py # Stage 6 — PDF report
│ └── tools.py # Placeholder for future agent tools
│
└── core/
├── pipeline.py # Orchestrates all six stages
├── manager_validator.py # LLM-powered quality gate (with rule fallback)
├── llm.py # HuggingFace LLM wrapper
└── models.py # Pydantic request/response models
The frontend and backend are decoupled — Streamlit talks to FastAPI over HTTP, so the UI stays responsive while the pipeline runs in the background.
git clone <your-repo>
cd Agentic_data_scientist
pip install -r requirements.txt
Copy .env and add your HuggingFace API key (optional; free models work without one):
# .env
HF_API_KEY=hf_your_key_here # optional, needed for gated models
HF_MODEL_NAME=Qwen/Qwen2.5-7B-Instruct # recommended
Recommended model: Qwen/Qwen2.5-7B-Instruct (best JSON reliability for the Manager Agent). See the Model Selection section below.
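As a rough illustration of how these two settings might be read at startup, here is a small sketch. It assumes the values from .env are already present in the process environment (for example via a loader such as python-dotenv); `load_llm_config` and `DEFAULT_MODEL` are hypothetical names, and the fallback to the recommended model is an assumption rather than documented behaviour of core/llm.py.

```python
import os
from typing import Mapping, Optional

# Assumed fallback, taken from the recommendation above; the real default may differ.
DEFAULT_MODEL = "Qwen/Qwen2.5-7B-Instruct"


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read HF_API_KEY and HF_MODEL_NAME, treating blank values as unset."""
    env = os.environ if env is None else env
    api_key = env.get("HF_API_KEY", "").strip() or None  # optional
    model_name = env.get("HF_MODEL_NAME", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model_name": model_name}
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise without touching the real environment.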
python api.py
# or
uvicorn api:app --host 0.0.0.0 --port 8000 --reload
streamlit run app.py
Open http://localhost:8501 in your browser.
- Upload a CSV file via the sidebar
- Select the target column from the dropdown
- Choose problem type: classification or regression
- Click 🚀 Run Pipeline
- Watch real-time logs as each stage completes
- Download the trained model (.pkl), cleaned data (.csv), or full report (.pdf) from the Downloads tab
The Manager Agent uses an LLM to evaluate cleaning and model quality. The model is configured via HF_MODEL_NAME in .env.
| Model | Size | Notes |
|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 7B | Recommended. Best JSON output, ungated, free |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 8B | Excellent reasoning, requires HF access approval |
| microsoft/Phi-3.5-mini-instruct | 3.8B | Fastest, good for rate-limited free tier usage |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | Original default, works but more verbose |
If the LLM call fails for any reason (rate limit, network, parse error), the Manager automatically falls back to rule-based evaluation — so the pipeline always completes.
The Manager uses a two-tier system:
Tier 1 — LLM evaluation (when API is reachable)
- Sends the data profile + cleaning log (or model metrics) to the LLM
- Expects a JSON response: {"approved": true/false, "feedback": "..."}
- If approved: false, the feedback is passed back to the relevant agent for a targeted retry
Tier 2 — Rule-based fallback (always available)
- Cleaning: checks that missing values, outliers, and scaling were all addressed
- ML: checks that the best model's held-out test score clears a minimum threshold (0.65 accuracy for classification, 0.40 R² for regression)
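The two tiers for the ML review can be sketched as follows. This is an illustration under assumptions, not the code in core/manager_validator.py: `review_model`, `rule_based_ml_review`, the `ask_llm` callable, the prompt contents, and the extra `source` field are all hypothetical, and "clears the threshold" is read here as greater than or equal.

```python
import json
from typing import Callable

# Minimum held-out test scores quoted in the rule-based tier above.
MIN_SCORE = {"classification": 0.65, "regression": 0.40}


def rule_based_ml_review(problem_type: str, test_score: float) -> dict:
    """Tier 2: approve when the test score reaches the threshold (>= is assumed)."""
    threshold = MIN_SCORE[problem_type]
    approved = test_score >= threshold
    feedback = "" if approved else f"test score {test_score:.2f} is below {threshold:.2f}"
    return {"approved": approved, "feedback": feedback, "source": "rules"}


def review_model(ask_llm: Callable[[str], str], problem_type: str, test_score: float) -> dict:
    """Tier 1: ask the LLM; on any failure or malformed reply, use the rules."""
    prompt = json.dumps({"problem_type": problem_type, "test_score": test_score})
    try:
        reply = json.loads(ask_llm(prompt))
        if not isinstance(reply.get("approved"), bool):
            raise ValueError("reply is missing a boolean 'approved' field")
        return {
            "approved": reply["approved"],
            "feedback": str(reply.get("feedback", "")),
            "source": "llm",
        }
    except Exception:
        # Rate limit, network error, or parse error: fall back to the rules.
        return rule_based_ml_review(problem_type, test_score)
```

Catching every exception around the LLM call is deliberate in this sketch: it is what lets the pipeline finish even when the API is unreachable or returns text that is not valid JSON.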
- Train/test split: 80/20 stratified split is made before any tuning — the test set is never seen during hyperparameter search
- Optuna tuning: each model gets 10–20 trials on a ≤5000-row subsample of train data for speed
- Final fit: best params are used to fit on the full train set
- Metrics reported: CV scores on train and final scores on the held-out test set
- Retry logic: if the Manager rejects performance, feedback keywords (e.g. "class imbalance", "overfit", "learning rate") adjust the hyperparameter search strategy before rerunning
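The retry logic in the last bullet can be illustrated with a small keyword-driven adjustment of the search space. The parameter names, ranges, and the specific reaction to each keyword below are assumptions made for the example; only the three keywords come from the description above, and the real strategy in agents/ml_engineer.py may differ per model.

```python
from copy import deepcopy

# Hypothetical baseline search space; (low, high) ranges are illustrative only.
BASE_SPACE = {
    "max_depth": (3, 12),
    "learning_rate": (0.01, 0.3),
    "n_estimators": (100, 500),
    "class_weight": None,
}


def adjust_search_space(feedback: str, space: dict = BASE_SPACE) -> dict:
    """Return a copy of the search space adapted to Manager feedback keywords."""
    adjusted = deepcopy(space)  # never mutate the baseline
    text = feedback.lower()
    if "class imbalance" in text:
        # Assumed reaction: weight classes inversely to their frequency.
        adjusted["class_weight"] = "balanced"
    if "overfit" in text:
        # Assumed reaction: halve the upper bound on tree depth.
        low, high = adjusted["max_depth"]
        adjusted["max_depth"] = (low, max(low, high // 2))
    if "learning rate" in text:
        # Assumed reaction: restrict the search to smaller learning rates.
        low, _ = adjusted["learning_rate"]
        adjusted["learning_rate"] = (low, 0.1)
    return adjusted
```

The adjusted ranges would then be handed to the tuner for the rerun, so the second search explores a region that addresses the Manager's complaint instead of repeating the first one.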
| Situation | Action |
|---|---|
| Column >70% missing | Drop |
| Duplicate rows | Remove |
| Numeric missing values | Median imputation (default) or mean (if Manager requests) |
| Categorical missing values | Mode imputation |
| Numeric outliers | IQR capping (or Yeo-Johnson / Box-Cox if Manager requests) |
| Low-cardinality categoricals (≤10) | One-hot encoding |
| Medium-cardinality (11–50) | Frequency encoding |
| High-cardinality (>50) | Target encoding (via category_encoders) |
| All numeric features | RobustScaler — target column is always excluded |
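The default rules in the table can be read as a per-column decision function. The sketch below is a simplified reading: `plan_column`, the step names, and the order of steps are hypothetical; Manager-requested alternatives (mean imputation, Yeo-Johnson / Box-Cox) and duplicate-row removal are left out, and only originally numeric columns are marked for scaling.

```python
from typing import List


def plan_column(kind: str, missing_fraction: float, n_unique: int = 0) -> List[str]:
    """Return the default cleaning steps for one feature column.

    kind is "numeric" or "categorical"; missing_fraction is in [0, 1].
    """
    if missing_fraction > 0.70:
        return ["drop"]  # column is more than 70% missing
    steps: List[str] = []
    if kind == "numeric":
        if missing_fraction > 0:
            steps.append("median_impute")
        steps += ["iqr_cap", "robust_scale"]
    elif kind == "categorical":
        if missing_fraction > 0:
            steps.append("mode_impute")
        if n_unique <= 10:
            steps.append("one_hot")
        elif n_unique <= 50:
            steps.append("frequency_encode")
        else:
            steps.append("target_encode")
    else:
        raise ValueError(f"unknown column kind: {kind!r}")
    return steps
```

For example, a categorical column with 11 distinct values and no missing data maps to frequency encoding, while the same column with 51 distinct values maps to target encoding.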
The FastAPI backend exposes these endpoints:
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /upload | Upload a CSV file, returns file_id |
| POST | /pipeline/run | Start a pipeline run, returns run_id |
| GET | /pipeline/status/{run_id} | Poll status, progress, and logs |
| GET | /pipeline/result/{run_id} | Get final result once complete |
| GET | /download/model/{run_id} | Download best model .pkl |
| GET | /download/data/{run_id} | Download cleaned data .csv |
| GET | /download/report/{run_id} | Download PDF report |
Interactive docs at http://localhost:8000/docs.
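A client can follow a run by polling the status endpoint until it finishes. The sketch below uses only the standard library; the endpoint path and port come from the table above, while the `status` key and the terminal values `"completed"` and `"failed"` are assumptions about the response body, so check the interactive docs for the actual schema.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"


def fetch_status(run_id: str) -> dict:
    """GET /pipeline/status/{run_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/pipeline/status/{run_id}") as resp:
        return json.load(resp)


def wait_for_run(
    run_id: str,
    get_status: Callable[[str], dict] = fetch_status,
    interval: float = 2.0,
    max_polls: int = 300,
) -> dict:
    """Poll until the run reports a terminal state, then return the last reply."""
    for _ in range(max_polls):
        status = get_status(run_id)
        # Assumed field name and terminal values; adjust to the real schema.
        if status.get("status") in {"completed", "failed"}:
            return status
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

Once `wait_for_run` returns a completed status, the result and download endpoints can be called with the same run_id. Injecting `get_status` keeps the loop testable without a running server.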
| Category | Libraries |
|---|---|
| Frontend | Streamlit, Plotly |
| Backend | FastAPI, Uvicorn |
| ML | scikit-learn, XGBoost, LightGBM, Optuna |
| Data | pandas, NumPy, SciPy |
| Encoding | category-encoders |
| LLM | LangChain, HuggingFace Hub |
| Report | fpdf2, Plotly (kaleido for image export) |
| Serving | joblib, python-multipart |
- Python 3.10+
- See requirements.txt for all dependencies
Data Constraints
- Accepts only CSV files — no Excel, JSON, Parquet, or database connections supported yet
- Performance degrades on datasets exceeding ~500K rows due to in-memory Pandas processing
- Multimodal data (images, text columns, time-series) is not handled by the current pipeline
Cleaning Agent
- Imputation strategy is fixed (median/mode) — no adaptive selection based on data distribution
- High-cardinality categorical columns above 15 unique values are label-encoded, which may mislead tree models into assuming ordinality
- Feature engineering (log transforms, polynomial features, interaction terms) is not performed
EDA Agent
- Charts are static PNGs — no interactive drill-down or zoom capability
- Anomaly detection is limited to IQR clipping; isolation-forest or DBSCAN-based outlier detection is absent
ML Engineer Agent
- Hyperparameter search space is predefined — unusual datasets may benefit from wider grids
- No support for neural networks, SVMs, or ensemble stacking
- Class imbalance handling (SMOTE, class weights) is not automated when imbalance is detected
- Models are evaluated but not persisted: no .pkl export or model registry integration
Agent Orchestration
- LLM API costs accumulate on large datasets since every agent reasoning step consumes tokens
- Pipeline is sequential — no parallel agent execution, meaning runtime scales linearly with complexity
- Agent retries on tool failure are limited; a bad LLM response can stall the pipeline
Infrastructure
- No authentication layer — the API endpoint is publicly accessible as deployed
- Session state is in-memory; server restart loses all uploaded files and run history
- Not containerized yet — environment setup requires manual dependency management
MIT