30 changes: 30 additions & 0 deletions research/agentic_data_science/schema_agent/Dockerfile
@@ -0,0 +1,30 @@
FROM python:3.12-slim

ENV DEBIAN_FRONTEND=noninteractive
ENV PATH="/opt/venv/bin:$PATH"

# This allows 'import helpers' to work if helpers is inside /git_root/helpers_root
ENV PYTHONPATH="/git_root/research/agentic_data_science/schema_agent:/git_root/helpers_root:${PYTHONPATH:-}"

RUN apt-get update && apt-get install -y \
ca-certificates build-essential curl sudo gnupg git vim \
libgl1 libglib2.0-0 libgomp1 \
&& rm -rf /var/lib/apt/lists/*

RUN curl -Ls https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"

RUN uv venv /opt/venv

# Requirements installation
COPY requirements.txt /install/requirements.txt
RUN uv pip install --python /opt/venv/bin/python --no-cache -r /install/requirements.txt jupyterlab

# Create the skeleton directory structure
WORKDIR /git_root

# Address reviewer feedback: We assume schema_agent.py is in the context
# We will chmod it inside the container during build or via the mount script
EXPOSE 8888

CMD ["/bin/bash"]
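A typical build-and-run sequence for this image might look like the following. Note these commands are an illustration, not part of the PR: the `schema-agent` tag, mounting the repository root at `/git_root` (which the `PYTHONPATH` above expects), and the port mapping are all assumptions.

```bash
# Build the image from the repo root (tag name is an assumption)
docker build -t schema-agent research/agentic_data_science/schema_agent

# Run with the repo mounted at /git_root so PYTHONPATH resolves,
# publishing the JupyterLab port exposed by the Dockerfile
docker run -it --rm -p 8888:8888 -v "$(pwd):/git_root" schema-agent
```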
184 changes: 106 additions & 78 deletions research/agentic_data_science/schema_agent/README.md
@@ -1,134 +1,162 @@
# Data Profiler Agent

Automated statistical profiling and LLM-powered semantic analysis for CSV datasets. Generates column-level insights including semantic classification, data quality assessment, and testable business hypotheses.

## Key Features

- **Automatic temporal detection** — Identifies and converts date/datetime columns across multiple formats
- **Statistical profiling** — Computes numeric summaries, data quality metrics, and categorical distributions
- **LLM-powered semantic analysis** — Infers column roles (ID, Feature, Target, Timestamp), semantic meaning, and generates testable business hypotheses
- **Smart cost control** — Selectively analyze columns to optimize API usage and reduce costs
- **Flexible output formats** — Generate machine-readable JSON reports and human-friendly Markdown summaries
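The temporal-detection idea can be sketched with plain pandas. This is a minimal illustration only, not the agent's actual implementation; the function name and the 90% parse-success threshold are invented here (requires pandas 2.0+ for `format="mixed"`):

```python
import pandas as pd

def detect_datetime_columns(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Return object columns whose values mostly parse as dates."""
    detected = []
    for col in df.select_dtypes(include="object").columns:
        # format="mixed" (pandas >= 2.0) parses each value independently,
        # so ISO and US-style dates can coexist in one column
        parsed = pd.to_datetime(df[col], errors="coerce", format="mixed")
        if parsed.notna().mean() >= threshold:
            detected.append(col)
    return detected

df = pd.DataFrame({
    "order_date": ["2024-01-05", "05/02/2024", "2024-03-10"],
    "city": ["Boston", "Austin", "Denver"],
})
print(detect_datetime_columns(df))  # ['order_date']
```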

## Quick Start

### Installation

Navigate to the project directory, install the dependencies, and set your OpenAI API key:

```bash
cd research/agentic_data_science/schema_agent
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
chmod +x schema_agent.py
```

### Basic Usage

Profile a single CSV file:

```bash
./schema_agent.py data.csv
```

This generates two output files:
- **`data_profile_report.json`** — Complete statistical and semantic analysis
- **`data_profile_summary.md`** — Readable summary table with insights

### Advanced Usage

```bash
# Profile multiple files with custom labels
./schema_agent.py dataset1.csv dataset2.csv --tags sales_2024 inventory_q1

# Cost-optimized analysis (only high-null columns)
./schema_agent.py data.csv --llm-scope nulls --model gpt-4o-mini

# Custom metrics and output paths
./schema_agent.py data.csv --metrics mean std max --output-json my_report.json

# Use LangChain as the inference backend
./schema_agent.py data.csv --use-langchain
```

## Architecture

The agent consists of six focused modules working together:

| Module | Purpose |
|--------|---------|
| `schema_agent_models.py` | Type-safe Pydantic schemas for column profiles and dataset insights |
| `schema_agent_loader.py` | CSV loading, type inference, and datetime detection |
| `schema_agent_stats.py` | Numeric summaries, data quality metrics, and categorical distributions |
| `schema_agent_llm.py` | LLM integration for semantic analysis and hypothesis generation |
| `schema_agent_report.py` | Report generation in JSON and Markdown formats |
| `schema_agent.py` | Pipeline orchestration and command-line interface |

For detailed examples of individual module usage, see `schema_agent.example`. For end-to-end pipeline examples, see `schema_agent.API`.

## Command-Line Options

| Argument | Default | Description |
|----------|---------|-------------|
| `csv_paths` | Required | One or more CSV file paths to analyze |
| `--tags` | File stems | Custom labels for each CSV (must match number of files) |
| `--model` | `gpt-4o` | OpenAI model to use (`gpt-4o`, `gpt-4o-mini`, etc.) |
| `--llm-scope` | `all` | Strategy for column selection: `all`, `semantic`, or `nulls` |
| `--metrics` | Subset | Statistics to compute: `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max` |
| `--use-langchain` | `false` | Use LangChain instead of default inference client |
| `--output-json` | `data_profile_report.json` | Path for JSON report output |
| `--output-md` | `data_profile_summary.md` | Path for Markdown summary output |

## Cost Optimization with LLM Scoping

The `--llm-scope` parameter controls which columns are sent to the LLM, helping you balance analysis depth with costs:

| Scope | What Gets Analyzed | Cost Level | Best For |
|-------|-------------------|-----------|----------|
| `all` | Every column | High | Complete dataset understanding |
| `semantic` | Non-numeric columns only | Medium | Text and categorical analysis |
| `nulls` | Columns with >5% null values | Low | Data quality issues only |
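In essence, the `nulls` scope is a null-ratio filter over the columns. A minimal sketch of that selection logic, using the 5% threshold from the table above (the helper name is illustrative, not the agent's real API):

```python
import pandas as pd

def columns_for_nulls_scope(df: pd.DataFrame, threshold: float = 0.05) -> list[str]:
    """Return columns whose fraction of null values exceeds the threshold."""
    null_ratio = df.isna().mean()  # per-column fraction of nulls
    return null_ratio[null_ratio > threshold].index.tolist()

df = pd.DataFrame({
    "user_id": range(100),
    "age": [None] * 10 + list(range(90)),   # 10% nulls -> selected
    "country": ["US"] * 98 + [None, None],  # 2% nulls -> skipped
})
print(columns_for_nulls_scope(df))  # ['age']
```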

## Python API

### Run the full pipeline programmatically

```python
import research.agentic_data_science.schema_agent.schema_agent as agent

tag_to_df, stats = agent.run_pipeline(
csv_paths=["data.csv"],
model="gpt-4o-mini",
llm_scope="semantic"
)
```

### Use individual modules independently

Each module can be imported and used separately for custom workflows:

```python
import research.agentic_data_science.schema_agent.schema_agent_loader as loader
import research.agentic_data_science.schema_agent.schema_agent_stats as stats
import research.agentic_data_science.schema_agent.schema_agent_llm as llm
import research.agentic_data_science.schema_agent.schema_agent_report as report
```
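To illustrate the loader/stats division of labor with plain pandas (the function names below are stand-ins, not the real APIs of `schema_agent_loader` or `schema_agent_stats`):

```python
import io
import pandas as pd

# Stand-in for the loader: read a CSV and let pandas infer column types.
def load_csv(text: str) -> pd.DataFrame:
    return pd.read_csv(io.StringIO(text))

# Stand-in for the stats module: per-column quality metrics.
def quality_report(df: pd.DataFrame) -> dict:
    return {
        col: {
            "dtype": str(df[col].dtype),
            "null_pct": round(df[col].isna().mean() * 100, 1),
            "n_unique": int(df[col].nunique()),
        }
        for col in df.columns
    }

csv_text = "sku,price\nA1,9.99\nB2,\nC3,4.50\n"
report = quality_report(load_csv(csv_text))
print(report["price"]["null_pct"])  # 33.3
```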

## Output Details

### `data_profile_report.json`

A structured JSON report containing:
- Per-column statistical profiles
- Data quality metrics
- LLM-generated semantic insights
- Column role classifications
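The exact schema is defined by the Pydantic models in `schema_agent_models.py`. As a rough sketch only (the field names and values here are assumptions, not the real schema), one per-column record might serialize like this:

```python
import json

# Illustrative shape only; the real schema lives in schema_agent_models.py.
column_profile = {
    "name": "signup_date",
    "stats": {"dtype": "datetime64[ns]", "null_pct": 1.2, "n_unique": 480},
    "quality": {"issues": ["1.2% missing values"]},
    "llm_insight": {
        "role": "Timestamp",
        "meaning": "Date the user created an account",
        "hypotheses": ["Signups spike after marketing campaigns"],
    },
}

serialized = json.dumps(column_profile, indent=2)
round_tripped = json.loads(serialized)
```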

### `data_profile_summary.md`

A formatted Markdown table with columns:
- **Column** — Column name
- **Meaning** — Inferred semantic description
- **Role** — Classified role (ID, Feature, Target, Timestamp)
- **Quality** — Data quality assessment
- **Hypotheses** — Generated business insights
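Rendering one such table row is straightforward. A minimal sketch, where the helper and the sample profile are illustrative rather than the agent's actual report code:

```python
def summary_row(profile: dict) -> str:
    """Render one Markdown table row: Column | Meaning | Role | Quality | Hypotheses."""
    cells = [
        profile["name"],
        profile["meaning"],
        profile["role"],
        profile["quality"],
        "; ".join(profile["hypotheses"]),
    ]
    return "| " + " | ".join(cells) + " |"

row = summary_row({
    "name": "churn_flag",
    "meaning": "Whether the customer cancelled",
    "role": "Target",
    "quality": "OK",
    "hypotheses": ["Churn correlates with low usage"],
})
print(row)
```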

## Troubleshooting

### API key not configured

Set your OpenAI API key:
```bash
export OPENAI_API_KEY=sk-...
```

### Validation or parsing errors on large datasets

Reduce the number of columns analyzed by the LLM:
```bash
./schema_agent.py data.csv --llm-scope nulls
./schema_agent.py data.csv --llm-scope semantic --model gpt-4o-mini
```

### No datetime columns detected

This is normal behavior — the agent automatically skips temporal detection when no date-like columns are present in the dataset.

## Next Steps

- Check out example notebooks for detailed workflows
- Integrate into your data science pipelines
- Extend with custom metrics or export formats
- Review individual module documentation for advanced use cases