30 changes: 30 additions & 0 deletions research/agentic_data_science/schema_agent/Dockerfile
@@ -0,0 +1,30 @@
FROM python:3.12-slim

ENV DEBIAN_FRONTEND=noninteractive
ENV PATH="/opt/venv/bin:$PATH"

# This allows 'import helpers' to work if helpers is inside /git_root/helpers_root
ENV PYTHONPATH="/git_root/research/agentic_data_science/schema_agent:/git_root/helpers_root:${PYTHONPATH:-}"

RUN apt-get update && apt-get install -y \
ca-certificates build-essential curl sudo gnupg git vim \
libgl1 libglib2.0-0 libgomp1 \
&& rm -rf /var/lib/apt/lists/*

RUN curl -Ls https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"

RUN uv venv /opt/venv

# Requirements installation
COPY requirements.txt /install/requirements.txt
RUN uv pip install --python /opt/venv/bin/python --no-cache -r /install/requirements.txt jupyterlab

# Create the skeleton directory structure
WORKDIR /git_root

# Address reviewer feedback: We assume schema_agent.py is in the context
# We will chmod it inside the container during build or via the mount script
EXPOSE 8888

CMD ["/bin/bash"]
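A typical build-and-run sequence for this image might look like the following. Note these commands are an illustration, not part of the PR: the `schema-agent` tag, mounting the repository root at `/git_root` (which the `PYTHONPATH` above expects), and the port mapping are all assumptions.

```bash
# Build the image from the repo root (tag name is an assumption)
docker build -t schema-agent research/agentic_data_science/schema_agent

# Run with the repo mounted at /git_root so PYTHONPATH resolves,
# publishing the JupyterLab port exposed by the Dockerfile
docker run -it --rm -p 8888:8888 -v "$(pwd):/git_root" schema-agent
```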
184 changes: 106 additions & 78 deletions research/agentic_data_science/schema_agent/README.md
@@ -1,134 +1,162 @@
# Data Profiler Agent

Automated statistical profiling and LLM-powered semantic analysis for CSV datasets. Generates column-level insights including semantic classification, data quality assessment, and testable business hypotheses.

## Key Features

- **Automatic temporal detection** — Identifies and converts date/datetime columns across multiple formats
- **Statistical profiling** — Computes numeric summaries, data quality metrics, and categorical distributions
- **LLM-powered semantic analysis** — Infers column roles (ID, Feature, Target, Timestamp), semantic meaning, and generates testable business hypotheses
- **Smart cost control** — Selectively analyze columns to optimize API usage and reduce costs
- **Flexible output formats** — Generate machine-readable JSON reports and human-friendly Markdown summaries
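The temporal-detection idea can be sketched with plain pandas. This is a minimal illustration only, not the agent's actual implementation; the function name and the 90% parse-success threshold are invented here (requires pandas 2.0+ for `format="mixed"`):

```python
import pandas as pd

def detect_datetime_columns(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Return object columns whose values mostly parse as dates."""
    detected = []
    for col in df.select_dtypes(include="object").columns:
        # format="mixed" (pandas >= 2.0) parses each value independently,
        # so ISO and US-style dates can coexist in one column
        parsed = pd.to_datetime(df[col], errors="coerce", format="mixed")
        if parsed.notna().mean() >= threshold:
            detected.append(col)
    return detected

df = pd.DataFrame({
    "order_date": ["2024-01-05", "05/02/2024", "2024-03-10"],
    "city": ["Boston", "Austin", "Denver"],
})
print(detect_datetime_columns(df))  # ['order_date']
```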

## Quick Start

### Installation

Navigate to the project directory, install the dependencies, and set your OpenAI API key:

```bash
cd research/agentic_data_science/schema_agent
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
chmod +x schema_agent.py
```

### Basic Usage

Profile a single CSV file:

```bash
./schema_agent.py data.csv
```

This generates two output files:
- **`data_profile_report.json`** — Complete statistical and semantic analysis
- **`data_profile_summary.md`** — Readable summary table with insights

### Advanced Usage

```bash
# Profile multiple files with custom labels
./schema_agent.py dataset1.csv dataset2.csv --tags sales_2024 inventory_q1

# Cost-optimized analysis (only high-null columns)
./schema_agent.py data.csv --llm-scope nulls --model gpt-4o-mini

# Custom metrics and output paths
./schema_agent.py data.csv --metrics mean std max --output-json my_report.json

# Use LangChain as the inference backend
./schema_agent.py data.csv --use-langchain
```

## Architecture

The agent consists of six focused modules working together:

| Module | Purpose |
|--------|---------|
| `schema_agent_models.py` | Type-safe Pydantic schemas for column profiles and dataset insights |
| `schema_agent_loader.py` | CSV loading, type inference, and datetime detection |
| `schema_agent_stats.py` | Numeric summaries, data quality metrics, and categorical distributions |
| `schema_agent_llm.py` | LLM integration for semantic analysis and hypothesis generation |
| `schema_agent_report.py` | Report generation in JSON and Markdown formats |
| `schema_agent.py` | Pipeline orchestration and command-line interface |

For detailed examples of individual module usage, see `schema_agent.example`. For end-to-end pipeline examples, see `schema_agent.API`.

## Command-Line Options

| Argument | Default | Description |
|----------|---------|-------------|
| `csv_paths` | Required | One or more CSV file paths to analyze |
| `--tags` | File stems | Custom labels for each CSV (must match number of files) |
| `--model` | `gpt-4o` | OpenAI model to use (`gpt-4o`, `gpt-4o-mini`, etc.) |
| `--llm-scope` | `all` | Strategy for column selection: `all`, `semantic`, or `nulls` |
| `--metrics` | Subset | Statistics to compute: `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max` |
| `--use-langchain` | `false` | Use LangChain instead of default inference client |
| `--output-json` | `data_profile_report.json` | Path for JSON report output |
| `--output-md` | `data_profile_summary.md` | Path for Markdown summary output |

## Cost Optimization with LLM Scoping

The `--llm-scope` parameter controls which columns are sent to the LLM, helping you balance analysis depth with costs:

| Scope | What Gets Analyzed | Cost Level | Best For |
|-------|-------------------|-----------|----------|
| `all` | Every column | High | Complete dataset understanding |
| `semantic` | Non-numeric columns only | Medium | Text and categorical analysis |
| `nulls` | Columns with >5% null values | Low | Data quality issues only |
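In essence, the `nulls` scope is a null-ratio filter over the columns. A minimal sketch of that selection logic, using the 5% threshold from the table above (the helper name is illustrative, not the agent's real API):

```python
import pandas as pd

def columns_for_nulls_scope(df: pd.DataFrame, threshold: float = 0.05) -> list[str]:
    """Return columns whose fraction of null values exceeds the threshold."""
    null_ratio = df.isna().mean()  # per-column fraction of nulls
    return null_ratio[null_ratio > threshold].index.tolist()

df = pd.DataFrame({
    "user_id": range(100),
    "age": [None] * 10 + list(range(90)),   # 10% nulls -> selected
    "country": ["US"] * 98 + [None, None],  # 2% nulls -> skipped
})
print(columns_for_nulls_scope(df))  # ['age']
```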

## Python API

### Run the full pipeline programmatically

```python
import research.agentic_data_science.schema_agent.schema_agent as agent

tag_to_df, stats = agent.run_pipeline(
csv_paths=["data.csv"],
model="gpt-4o-mini",
llm_scope="semantic"
)
```

### Use individual modules independently

Each module can be imported and used separately for custom workflows:

```python
import research.agentic_data_science.schema_agent.schema_agent_loader as loader
import research.agentic_data_science.schema_agent.schema_agent_stats as stats
import research.agentic_data_science.schema_agent.schema_agent_llm as llm
import research.agentic_data_science.schema_agent.schema_agent_report as report
```
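To illustrate the loader/stats division of labor with plain pandas (the function names below are stand-ins, not the real APIs of `schema_agent_loader` or `schema_agent_stats`):

```python
import io
import pandas as pd

# Stand-in for the loader: read a CSV and let pandas infer column types.
def load_csv(text: str) -> pd.DataFrame:
    return pd.read_csv(io.StringIO(text))

# Stand-in for the stats module: per-column quality metrics.
def quality_report(df: pd.DataFrame) -> dict:
    return {
        col: {
            "dtype": str(df[col].dtype),
            "null_pct": round(df[col].isna().mean() * 100, 1),
            "n_unique": int(df[col].nunique()),
        }
        for col in df.columns
    }

csv_text = "sku,price\nA1,9.99\nB2,\nC3,4.50\n"
report = quality_report(load_csv(csv_text))
print(report["price"]["null_pct"])  # 33.3
```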

## Output Details

### `data_profile_report.json`

A structured JSON report containing:
- Per-column statistical profiles
- Data quality metrics
- LLM-generated semantic insights
- Column role classifications
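The exact schema is defined by the Pydantic models in `schema_agent_models.py`. As a rough sketch only (the field names and values here are assumptions, not the real schema), one per-column record might serialize like this:

```python
import json

# Illustrative shape only; the real schema lives in schema_agent_models.py.
column_profile = {
    "name": "signup_date",
    "stats": {"dtype": "datetime64[ns]", "null_pct": 1.2, "n_unique": 480},
    "quality": {"issues": ["1.2% missing values"]},
    "llm_insight": {
        "role": "Timestamp",
        "meaning": "Date the user created an account",
        "hypotheses": ["Signups spike after marketing campaigns"],
    },
}

serialized = json.dumps(column_profile, indent=2)
round_tripped = json.loads(serialized)
```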

### `data_profile_summary.md`

A formatted Markdown table with columns:
- **Column** — Column name
- **Meaning** — Inferred semantic description
- **Role** — Classified role (ID, Feature, Target, Timestamp)
- **Quality** — Data quality assessment
- **Hypotheses** — Generated business insights
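Rendering one such table row is straightforward. A minimal sketch, where the helper and the sample profile are illustrative rather than the agent's actual report code:

```python
def summary_row(profile: dict) -> str:
    """Render one Markdown table row: Column | Meaning | Role | Quality | Hypotheses."""
    cells = [
        profile["name"],
        profile["meaning"],
        profile["role"],
        profile["quality"],
        "; ".join(profile["hypotheses"]),
    ]
    return "| " + " | ".join(cells) + " |"

row = summary_row({
    "name": "churn_flag",
    "meaning": "Whether the customer cancelled",
    "role": "Target",
    "quality": "OK",
    "hypotheses": ["Churn correlates with low usage"],
})
print(row)
```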

## Troubleshooting

### API key not configured

Set your OpenAI API key:
```bash
export OPENAI_API_KEY=sk-...
```

### Validation or parsing errors on large datasets

Reduce the number of columns analyzed by the LLM:
```bash
./schema_agent.py data.csv --llm-scope nulls
./schema_agent.py data.csv --llm-scope semantic --model gpt-4o-mini
```

### No datetime columns detected

This is normal behavior — the agent automatically skips temporal detection when no date-like columns are present in the dataset.

## Next Steps

- Check out example notebooks for detailed workflows
- Integrate into your data science pipelines
- Extend with custom metrics or export formats
- Review individual module documentation for advanced use cases