49 changes: 0 additions & 49 deletions README.md

This file was deleted.

@@ -0,0 +1,30 @@
# Use Python 3.12 slim (already has Python and pip).
FROM python:3.12-slim

# Avoid interactive prompts during apt operations.
ENV DEBIAN_FRONTEND=noninteractive

# Install CA certificates (needed for HTTPS).
RUN apt-get update && apt-get install -y \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install project specific packages.
RUN mkdir -p /install
COPY requirements.txt /install/requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir jupyterlab jupyterlab_vim jupytext -r /install/requirements.txt

# Config.
COPY etc_sudoers /install/
# sudo requires the sudoers file to be root-owned with mode 0440, or it refuses to run.
# --chmod requires BuildKit, which docker_build.sh enables.
COPY --chmod=0440 etc_sudoers /etc/sudoers
COPY bashrc /root/.bashrc

# Report package versions.
COPY version.sh /install/
RUN /install/version.sh 2>&1 | tee version.log

# Jupyter.
EXPOSE 8888

CMD ["/bin/bash"]
@@ -0,0 +1,108 @@
# Benchmarking-in-Agentic-Reasoning-for-Data-Science-

## Description

This project moves beyond evaluating third-party "black box" tools to engineering a custom, stateful multi-agent system using LangGraph. While standard agents (like ChatGPT) follow linear, one-shot processes, this research builds a cyclic architecture where agents can plan, execute, critique, and self-correct. By developing an internal "Analyst-Reviewer" loop, the project explores the frontier of Agentic Reasoning—testing whether a structured graph of specialized agents can outperform monolithic AI models in reliability, code quality, and handling "adversarial" or "noisy" data science tasks.

| Type | Name | Description | Website | Strength |
| --------------------------- | ------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | ---------------------------------------- | -------------|
| Notebook agent | Data Interpreter (ChatGPT Advanced Data Analysis) | Upload data → automatic cleaning, analysis, modeling, and visualization | https://chat.openai.com | Fast exploratory analysis |
| AutoML agent | AutoGluon | Automated model selection, feature engineering, and tuning pipelines | https://auto.gluon.ai | Strong tabular ML performance |
| Multi-agent research system | Microsoft AutoGen | Agents collaborate to plan experiments, write code, and critique results | https://github.com/microsoft/autogen | Research workflows |
| Workflow agent | LangGraph | Stateful agent graphs for long-running analytical pipelines | https://langchain-ai.github.io/langgraph | Persistent reasoning loops |


## Project Objective

The primary goal is to benchmark the efficacy of stateful multi-agent orchestration against single-agent and AutoML baselines. This project aims to answer:

1. Can a cyclic multi-agent graph (LangGraph) significantly reduce hallucinations and logical errors compared to single-agent assistants?
2. Does a "Reviewer" node in an agentic workflow produce more production-ready, modular code than one-shot generation?
3. How do different agent architectures (linear vs. cyclic vs. AutoML) recover when faced with corrupted or ambiguous data?

## Dataset Suggestions
- **Heart Disease Prediction (UCI / Kaggle)**
- Source: Kaggle — UCI Heart Disease Dataset
- URL: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-uci
- Contains: 14 clinical features (age, cholesterol, chest pain type, etc.)
with a binary target indicating presence of heart disease; ~300 rows
- Access: Free Kaggle account required; download via
`kaggle datasets download` CLI or direct CSV link; no authentication token
needed for manual download

- **NYC Yellow Taxi Trip Records**
- Source: NYC Open Data / TLC Trip Record Data
- URL: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Contains: Pick-up/drop-off timestamps, GPS coordinates, trip distance, fare
amount, tip, and passenger count; monthly Parquet files (~millions of rows —
use one month's subset)
- Access: Fully public, no authentication; direct Parquet download links
available on the page; recommend sampling 50k rows for laptop use

- **Air Quality — OpenAQ**
- Source: OpenAQ public API
- URL: https://api.openaq.org/v2/measurements (REST, no key required for basic
access)
- Contains: Real-time and historical PM2.5, PM10, NO₂, O₃, CO readings from
thousands of global monitoring stations with timestamps and GPS
- Access: Free tier with no API key; query by city, parameter, and date range;
returns JSON easily loaded with `requests` + `pandas`

- **Amazon Product Reviews — HuggingFace Datasets**
- Source: HuggingFace Hub — `McAuley-Lab/Amazon-Reviews-2023`
- URL: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023
- Contains: Product ratings (1–5 stars), review text, verified purchase flag,
product category; load a small subset (e.g., "All_Beauty", ~500k rows) with
`datasets.load_dataset()`
- Access: Free, no authentication; streamed or downloaded via `datasets`
library
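The subsampling recommended for the larger datasets above can be sketched as follows. This is a minimal sketch: `subsample` is a hypothetical helper name, and the Parquet filename in the usage comment is illustrative, not an exact TLC file URL.

```python
import pandas as pd


def subsample(df: pd.DataFrame, n_rows: int = 50_000, seed: int = 0) -> pd.DataFrame:
    """Return a reproducible random subsample capped at n_rows."""
    # Cap at the DataFrame length so small datasets pass through unchanged.
    n = min(n_rows, len(df))
    return df.sample(n=n, random_state=seed)


# Usage against a monthly TLC file (filename illustrative):
# df = pd.read_parquet("yellow_tripdata_2024-01.parquet")
# df_small = subsample(df)
```

Fixing `random_state` keeps the benchmark runs comparable across the different agent systems.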

## Breakdown of the Nodes

* The Planner (Node 1): Analyzes the dataset schema and sets the strategy (e.g., "This is a classification problem with imbalanced data").
* The EDA Analyst (Node 2): Performs exploratory data analysis, cleans data, and identifies outliers.
* The ML Architect (Node 3): Selects algorithms (e.g., XGBoost, Random Forest), performs hyperparameter tuning, and trains the model.
* The Quality Reviewer (Node 4): Acts as the scientific "guardrail." It inspects the Analyst's results—if Accuracy is high but Recall is low on imbalanced data, it triggers a loop back to the Architect.
* The Report Writer (Node 5): Synthesizes the final journey, documenting both the results and the errors caught/corrected by the Reviewer.
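The Architect-Reviewer cycle at the heart of Nodes 3 and 4 can be sketched without any framework. This is a framework-free illustration of the control flow only; `review_loop` and the state dictionary are illustrative names, and the real implementation would use a LangGraph `StateGraph` with conditional edges instead.

```python
from typing import Callable, Dict

State = Dict[str, object]


def review_loop(
    architect: Callable[[State], State],
    reviewer: Callable[[State], bool],
    state: State,
    max_iters: int = 3,
) -> State:
    """Run the Architect node until the Reviewer approves or retries run out."""
    for attempt in range(max_iters):
        # The Architect proposes (or revises) a model in the shared state.
        state = architect(state)
        state["attempts"] = attempt + 1
        # The Reviewer acts as the guardrail: approve, or loop back.
        if reviewer(state):
            state["approved"] = True
            return state
    state["approved"] = False
    return state
```

The bounded `max_iters` matters: without it, a Reviewer that can never be satisfied would cycle forever, which is exactly the failure mode conditional edges guard against.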

## The Quality Reviewer Rules (Guardrails)

### Code Integrity (The "Compiler" Gate)

* Syntax & Execution: Verified execution in a containerized Python environment.
* Modularity: Checks if code follows DRY (Don't Repeat Yourself) principles and proper function definitions.
* Library Hygiene: Ensures no unauthorized or deprecated packages are used.
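The library-hygiene check can be implemented by parsing the generated code rather than executing it. A minimal sketch, assuming an allowlist policy (`find_unauthorized_imports` is an illustrative helper name):

```python
import ast
from typing import List, Set


def find_unauthorized_imports(code: str, allowed: Set[str]) -> List[str]:
    """Parse generated code and list top-level imports outside the allowlist."""
    tree = ast.parse(code)
    bad = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        bad.extend(n for n in names if n and n not in allowed)
    return sorted(set(bad))
```

Because the check uses `ast.parse`, it doubles as a syntax gate: code that fails to parse raises before execution is ever attempted.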

### Statistical Logic (The "Data Scientist" Gate)
* Leakage Detection: Scans for target variables accidentally included in the feature set.
* Imbalance Audit: Rejects models that only report "Accuracy" for imbalanced clinical datasets like Heart Disease Prediction.
* Impossible Values: Flags unrealistic data points (e.g., negative taxi fares) for re-cleaning.
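The leakage and impossible-value rules can be expressed as simple predicate scans. A sketch under simplifying assumptions: leakage is approximated by name matching (a real check would also test feature-target correlation), and the function names are illustrative.

```python
from typing import Dict, List, Sequence


def check_leakage(feature_cols: Sequence[str], target_col: str) -> List[str]:
    """Flag feature columns whose names embed the target column name."""
    return [
        c for c in feature_cols
        if target_col.lower() in c.lower()
    ]


def check_impossible_fares(rows: Sequence[Dict[str, float]]) -> List[int]:
    """Return indices of rows with physically impossible taxi values."""
    return [
        i for i, r in enumerate(rows)
        if r.get("fare_amount", 0) < 0 or r.get("trip_distance", 0) < 0
    ]
```

Flagged rows are sent back to the EDA Analyst for re-cleaning rather than silently dropped, so the final report can document what was caught.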

### Explainability (The "Researcher" Gate)

* Narrative Consistency: Verifies that the written report matches the generated SHAP/Feature Importance plots.
* Logical Grounding: Rejects generic explanations in favor of data-backed insights.

## Benchmark Comparison Framework

We benchmark the custom LangGraph system against three distinct philosophies of AI:

1. Single-Agent Baseline: ChatGPT (Advanced Data Analysis) – Testing monolithic performance.
2. Conversational Multi-Agent: Microsoft AutoGen – Testing "Group Chat" vs. "Graph-based" logic.
3. Standard AutoML: AutoGluon – Testing AI reasoning vs. mathematical automation.

## Tasks & Implementation

1. Environment Setup: Version pinning for reproducibility across all agents.
2. Graph Construction: Implementing the StateGraph and Conditional Edges in LangGraph.
3. Benchmarking Execution: Running all competitors against Amazon Reviews, NYC Taxi, and Heart Disease datasets.
4. Adversarial Reliability Test: Introducing mislabeled data and extreme outliers to test system resilience.
5. Interpretability Audit: Analyzing the "thought logs" to determine which architecture is most transparent for human researchers.
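The label corruption used in the Adversarial Reliability Test (Task 4) can be made reproducible with a seeded flip. A minimal sketch for binary labels; `corrupt_labels` is an illustrative helper name, not part of any benchmark library.

```python
import random
from typing import List


def corrupt_labels(labels: List[int], frac: float = 0.1, seed: int = 0) -> List[int]:
    """Flip a fixed fraction of binary labels to simulate mislabeled data."""
    rng = random.Random(seed)
    out = list(labels)
    # Choose k distinct indices so each selected label is flipped exactly once.
    k = int(len(out) * frac)
    for i in rng.sample(range(len(out)), k):
        out[i] = 1 - out[i]
    return out
```

Seeding the corruption means every competing system (LangGraph, AutoGen, AutoGluon, ChatGPT) faces exactly the same noise, which keeps the resilience comparison fair.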

## Useful Resources
- **AutoGluon Documentation** — Tabular prediction quickstart and benchmarks:
https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html
- **Microsoft AutoGen GitHub** — Multi-agent conversation examples including
data science workflows: https://github.com/microsoft/autogen
- **OpenML Benchmark Suite** — Curated tabular datasets and standardized
evaluation protocols for AutoML comparison studies:
https://www.openml.org/search?type=benchmark
@@ -0,0 +1 @@
set -o vi
@@ -0,0 +1,140 @@
#!/usr/bin/env python

"""
Copy Docker-related files from the source directory to a destination directory.

This script copies all Docker configuration and utility files from
class_project/project_template/ to a specified destination directory.

Usage examples:
# Copy all files to a target directory.
> ./copy_docker_files.py --dst_dir /path/to/destination

# Copy with verbose logging.
> ./copy_docker_files.py --dst_dir /path/to/destination -v DEBUG

Import as:

import class_project.project_template.copy_docker_files as cpdccodo
"""

import argparse
import logging
import os
from typing import List

import helpers.hdbg as hdbg
import helpers.hio as hio
import helpers.hparser as hparser
import helpers.hsystem as hsystem

_LOG = logging.getLogger(__name__)

# #############################################################################
# Constants
# #############################################################################

# List of files to copy from the source directory.
_FILES_TO_COPY = [
    "bashrc",
    "docker_bash.sh",
    "docker_build.sh",
    "docker_clean.sh",
    "docker_cmd.sh",
    "docker_exec.sh",
    "docker_jupyter.sh",
    "docker_name.sh",
    "docker_push.sh",
    "etc_sudoers",
    "install_jupyter_extensions.sh",
    "run_jupyter.sh",
    "version.sh",
]


# #############################################################################
# Helper functions
# #############################################################################


def _get_source_dir() -> str:
    """
    Get the absolute path to the source directory containing Docker files.

    :return: absolute path to class_project/project_template/
    """
    # Get the directory where this script is located.
    script_dir = os.path.dirname(os.path.abspath(__file__))
    _LOG.debug("Script directory='%s'", script_dir)
    return script_dir


def _copy_files(
    *,
    src_dir: str,
    dst_dir: str,
    files: List[str],
) -> None:
    """
    Copy the specified files from the source directory to the destination
    directory.

    :param src_dir: source directory path
    :param dst_dir: destination directory path
    :param files: list of filenames to copy
    """
    # Verify the source directory exists.
    hdbg.dassert_dir_exists(src_dir, "Source directory does not exist:", src_dir)
    # Create the destination directory if it doesn't exist.
    hio.create_dir(dst_dir, incremental=True)
    _LOG.info("Copying %d files from '%s' to '%s'", len(files), src_dir, dst_dir)
    # Copy each file.
    copied_count = 0
    for filename in files:
        src_path = os.path.join(src_dir, filename)
        dst_path = os.path.join(dst_dir, filename)
        # Verify the source file exists.
        hdbg.dassert_path_exists(
            src_path, "Source file does not exist:", src_path
        )
        # Copy the file using `cp -a` to preserve permissions and attributes.
        _LOG.debug("Copying '%s' -> '%s'", src_path, dst_path)
        # Quote the paths so filenames with spaces don't break the shell command.
        cmd = f'cp -a "{src_path}" "{dst_path}"'
        hsystem.system(cmd)
        copied_count += 1
    #
    _LOG.info("Successfully copied %d files", copied_count)


# #############################################################################


def _parse() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--dst_dir",
        action="store",
        required=True,
        help="Destination directory where files will be copied",
    )
    hparser.add_verbosity_arg(parser)
    return parser


def _main(parser: argparse.ArgumentParser) -> None:
    args = parser.parse_args()
    hdbg.init_logger(verbosity=args.log_level, use_exec_path=True)
    # Get the source directory.
    src_dir = _get_source_dir()
    # Copy files to the destination.
    _copy_files(
        src_dir=src_dir,
        dst_dir=args.dst_dir,
        files=_FILES_TO_COPY,
    )


if __name__ == "__main__":
    _main(_parse())
@@ -0,0 +1,40 @@
#!/bin/bash
# """
# Build a Docker container image for the project.
#
# This script sets up the build environment with error handling and command
# tracing, loads Docker configuration from docker_name.sh, and builds the
# Docker image using the build_container_image utility function. It supports
# both single-architecture and multi-architecture builds via the
# DOCKER_BUILD_MULTI_ARCH environment variable.
# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Import the utility functions.
GIT_ROOT=$(git rev-parse --show-toplevel)
source "$GIT_ROOT/class_project/project_template/utils.sh"

# Parse default args (-h, -v) and enable set -x if -v is passed.
# Shift processed option flags so remaining args are passed to the build.
parse_default_args "$@"
shift $((OPTIND - 1))

# Load Docker configuration variables (REPO_NAME, IMAGE_NAME, FULL_IMAGE_NAME).
get_docker_vars_script "${BASH_SOURCE[0]}"
source "$DOCKER_NAME"
print_docker_vars

# Configure Docker build settings.
# Enable BuildKit for improved build performance and features.
export DOCKER_BUILDKIT=1
#export DOCKER_BUILDKIT=0

# Configure single-architecture build (set to 1 for multi-arch build).
#export DOCKER_BUILD_MULTI_ARCH=1
export DOCKER_BUILD_MULTI_ARCH=0

# Build the container image.
# Pass extra arguments (e.g., --no-cache) via command line after -v.
build_container_image "$@"
@@ -0,0 +1 @@
the input device is not a TTY