diff --git a/research/agentic_data_science/schema_agent/Dockerfile b/research/agentic_data_science/schema_agent/Dockerfile
new file mode 100644
index 000000000..a7e060b4a
--- /dev/null
+++ b/research/agentic_data_science/schema_agent/Dockerfile
@@ -0,0 +1,30 @@
+FROM python:3.12-slim
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV PATH="/opt/venv/bin:$PATH"
+
+# This allows 'import helpers' to work if helpers is inside /git_root/helpers_root
+ENV PYTHONPATH="/git_root/research/agentic_data_science/schema_agent:/git_root/helpers_root:${PYTHONPATH:-}"
+
+RUN apt-get update && apt-get install -y \
+    ca-certificates build-essential curl sudo gnupg git vim \
+    libgl1 libglib2.0-0 libgomp1 \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN curl -Ls https://astral.sh/uv/install.sh | sh
+ENV PATH="/root/.local/bin:$PATH"
+
+RUN uv venv /opt/venv
+
+# Requirements installation
+COPY requirements.txt /install/requirements.txt
+RUN uv pip install --python /opt/venv/bin/python --no-cache -r /install/requirements.txt jupyterlab
+
+# Create the skeleton directory structure
+WORKDIR /git_root
+
+# schema_agent.py is expected to be provided via the build context; it is made
+# executable during the build or by the mount script.
+EXPOSE 8888
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/research/agentic_data_science/schema_agent/README.md b/research/agentic_data_science/schema_agent/README.md
index c3d1c3a13..7c33925bd 100644
--- a/research/agentic_data_science/schema_agent/README.md
+++ b/research/agentic_data_science/schema_agent/README.md
@@ -1,134 +1,162 @@
 # Data Profiler Agent
-Automated statistical profiling and LLM-powered semantic analysis for CSV datasets. Generates column-level insights including semantic meaning, data quality assessment, and testable business hypotheses.
+Automated statistical profiling and LLM-powered semantic analysis for CSV datasets. Generates column-level insights including semantic classification, data quality assessment, and testable business hypotheses.
-## Features
+## Key Features
-- **Temporal Detection:** Auto-detects and converts date/datetime columns across multiple formats
-- **Statistical Profiling:** Computes numeric summaries, data quality metrics, and categorical distributions
-- **LLM Semantic Analysis:** Generates column roles (ID, Feature, Target, Timestamp), semantic meaning, and hypotheses
-- **Cost Optimization:** Filter columns before LLM analysis to control token usage and API costs
-- **Multi-Format Output:** JSON reports and Markdown summaries
+- **Automatic temporal detection** — Identifies and converts date/datetime columns across multiple formats
+- **Statistical profiling** — Computes numeric summaries, data quality metrics, and categorical distributions
+- **LLM-powered semantic analysis** — Infers column roles (ID, Feature, Target, Timestamp), semantic meaning, and generates testable business hypotheses
+- **Smart cost control** — Selectively analyze columns to optimize API usage and reduce costs
+- **Flexible output formats** — Generate machine-readable JSON reports and human-friendly Markdown summaries
-## Setup
+## Quick Start
-Go into the schema folder:
-```bash
-cd research/agentic_data_science/schema_agent
-```
+### Installation
-Install the requirements:
-```bash
-pip install -r requirements.txt
-```
+Navigate to the project directory and install dependencies:
-Set the `OPENAI_API_KEY` in your environment:
 ```bash
+cd research/agentic_data_science/schema_agent
+pip install -r requirements.txt
 export OPENAI_API_KEY=sk-...
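+# Optional sanity check (illustrative, not part of the agent): confirm the
+# key is visible to Python before running the profiler.
+python -c "import os; assert os.environ.get('OPENAI_API_KEY'), 'OPENAI_API_KEY not set'"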
+chmod +x schema_agent.py
 ```
-## Module Structure
-
-The agent is split into six focused modules:
-
-| Module | Responsibility |
-|--------|---------------|
-| `schema_agent_models.py` | Pydantic schemas for type-safe column/dataset insights |
-| `schema_agent_loader.py` | CSV loading, type inference, datetime detection |
-| `schema_agent_stats.py` | Numeric summaries, quality reports, categorical distributions |
-| `schema_agent_llm.py` | Prompt building, OpenAI/LangChain calls, structured output parsing |
-| `schema_agent_report.py` | Column profiles, JSON and Markdown export |
-| `schema_agent.py` | Pipeline orchestration and CLI entry point |
+### Basic Usage
-## Usage
-
-### Basic
+Profile a single CSV file:
 ```bash
-python schema_agent.py data.csv
+./schema_agent.py data.csv
 ```
-Outputs:
-- `data_profile_report.json` — Machine-readable report
-- `data_profile_summary.md` — Human-readable summary
+This generates two output files:
+- **`data_profile_report.json`** — Complete statistical and semantic analysis
+- **`data_profile_summary.md`** — Readable summary table with insights
-### Advanced
+### Advanced Usage
 ```bash
-# Multiple files with tags
-python schema_agent.py dataset1.csv dataset2.csv --tags sales_2024 inv_q1
+# Profile multiple files with custom labels
+./schema_agent.py dataset1.csv dataset2.csv --tags sales_2024 inventory_q1
-# Cost-optimized: only high-null columns
-python schema_agent.py data.csv --llm-scope nulls --model gpt-4o-mini
+# Cost-optimized analysis (only high-null columns)
+./schema_agent.py data.csv --llm-scope nulls --model gpt-4o-mini
-# Custom metrics and output
-python schema_agent.py data.csv --metrics mean std max --output-json my_report.json
+# Custom metrics and output paths
+./schema_agent.py data.csv --metrics mean std max --output-json my_report.json
-# LangChain backend
-python schema_agent.py data.csv --use-langchain
+# Use LangChain as the inference backend
+./schema_agent.py data.csv --use-langchain
 ```
-## Command-Line Arguments
+## Architecture
+
+The agent consists of six focused modules working together:
+
+| Module | Purpose |
+|--------|---------|
+| `schema_agent_models.py` | Type-safe Pydantic schemas for column profiles and dataset insights |
+| `schema_agent_loader.py` | CSV loading, type inference, and datetime detection |
+| `schema_agent_stats.py` | Numeric summaries, data quality metrics, and categorical distributions |
+| `schema_agent_llm.py` | LLM integration for semantic analysis and hypothesis generation |
+| `schema_agent_report.py` | Report generation in JSON and Markdown formats |
+| `schema_agent.py` | Pipeline orchestration and command-line interface |
+
+For detailed examples of individual module usage, see `schema_agent.example`. For end-to-end pipeline examples, see `schema_agent.API`.
+
+## Command-Line Options
 | Argument | Default | Description |
 |----------|---------|-------------|
-| `csv_paths` | Required | One or more CSV file paths |
-| `--tags` | File stems | Tags for each CSV (must match count) |
-| `--model` | `gpt-4o` | LLM model (`gpt-4o`, `gpt-4o-mini`, etc.) |
-| `--llm-scope` | `all` | Which columns to profile: `all`, `semantic`, `nulls` |
-| `--metrics` | Subset | Numeric metrics: `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max` |
-| `--use-langchain` | False | Use LangChain instead of hllmcli |
-| `--output-json` | `data_profile_report.json` | JSON report path |
-| `--output-md` | `data_profile_summary.md` | Markdown summary path |
+| `csv_paths` | Required | One or more CSV file paths to analyze |
+| `--tags` | File stems | Custom labels for each CSV (must match number of files) |
+| `--model` | `gpt-4o` | OpenAI model to use (`gpt-4o`, `gpt-4o-mini`, etc.) |
+| `--llm-scope` | `all` | Strategy for column selection: `all`, `semantic`, or `nulls` |
+| `--metrics` | Subset | Statistics to compute: `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max` |
+| `--use-langchain` | `false` | Use LangChain instead of default inference client |
+| `--output-json` | `data_profile_report.json` | Path for JSON report output |
+| `--output-md` | `data_profile_summary.md` | Path for Markdown summary output |
-## LLM Scoping
+## Cost Optimization with LLM Scoping
-- **`all`** — Every column (highest cost, comprehensive)
-- **`semantic`** — Non-numeric columns only
-- **`nulls`** — Columns with >5% null values (cost-optimized)
+The `--llm-scope` parameter controls which columns are sent to the LLM, helping you balance analysis depth with costs:
+
+| Scope | What Gets Analyzed | Cost Level | Best For |
+|-------|-------------------|-----------|----------|
+| `all` | Every column | High | Complete dataset understanding |
+| `semantic` | Non-numeric columns only | Medium | Text and categorical analysis |
+| `nulls` | Columns with >5% null values | Low | Data quality issues only |
 ## Python API
-### Full pipeline
+### Run the full pipeline programmatically
 ```python
-import schema_agent as radsasag
-tag_to_df, stats = radsasag.run_pipeline(
+import research.agentic_data_science.schema_agent.schema_agent as agent
+
+tag_to_df, stats = agent.run_pipeline(
     csv_paths=["data.csv"],
     model="gpt-4o-mini",
     llm_scope="semantic"
 )
 ```
-### Individual modules
+### Use individual modules independently
-Each module can be imported independently for exploratory use or testing:
+Each module can be imported and used separately for custom workflows:
 ```python
-import schema_agent_loader as radsasal
-import schema_agent_stats as radsasas
-import schema_agent_llm as radsasal
-import schema_agent_report as radsasar
+import research.agentic_data_science.schema_agent.schema_agent_loader as loader
+import research.agentic_data_science.schema_agent.schema_agent_stats as stats
+import research.agentic_data_science.schema_agent.schema_agent_llm as llm
+import research.agentic_data_science.schema_agent.schema_agent_report as report
 ```
-## Output
+## Output Details
-### data_profile_report.json
-Structured report with column profiles, technical stats, and LLM insights.
+### `data_profile_report.json`
-### data_profile_summary.md
-Formatted table summary: Column | Meaning | Role | Quality | Hypotheses
+A structured JSON report containing:
+- Per-column statistical profiles
+- Data quality metrics
+- LLM-generated semantic insights
+- Column role classifications
+
+### `data_profile_summary.md`
+
+A formatted Markdown table with columns:
+- **Column** — Column name
+- **Meaning** — Inferred semantic description
+- **Role** — Classified role (ID, Feature, Target, Timestamp)
+- **Quality** — Data quality assessment
+- **Hypotheses** — Generated business insights
 ## Troubleshooting
-**API Key Error:**
+### API key not configured
+
+Set your OpenAI API key:
 ```bash
 export OPENAI_API_KEY=sk-...
 ```
-**Validation Errors:**
-- Use `--llm-scope nulls` or `--llm-scope semantic` to reduce columns
-- Try `--model gpt-4o-mini`
+### Validation or parsing errors on large datasets
+
+Reduce the number of columns analyzed by the LLM:
+```bash
+./schema_agent.py data.csv --llm-scope nulls
+./schema_agent.py data.csv --llm-scope semantic --model gpt-4o-mini
+```
+
+### No datetime columns detected
+
+This is normal behavior — the agent automatically skips temporal detection when no date-like columns are present in the dataset.
+
+## Next Steps
-**Datetime Detection:** Skipped automatically if no temporal columns detected.
\ No newline at end of file
+- Check out example notebooks for detailed workflows
+- Integrate into your data science pipelines
+- Extend with custom metrics or export formats
+- Review individual module documentation for advanced use cases
\ No newline at end of file
diff --git a/research/agentic_data_science/schema_agent/data_profile_report.json b/research/agentic_data_science/schema_agent/data_profile_report.json
new file mode 100644
index 000000000..df7adbef6
--- /dev/null
+++ b/research/agentic_data_science/schema_agent/data_profile_report.json
@@ -0,0 +1,608 @@
+{
+  "report_metadata": {
+    "version": "1.2",
+    "agent": "Data-Profiler-Agent",
+    "generated_at": "2026-04-07T16:06:18.296448Z"
+  },
+  "column_profiles": [
+    {
+      "column": "order_datetime",
+      "dtype": "datetime64[us]",
+      "null_pct": 0.0,
+      "unique_count": 14903,
+      "sample_values": [
+        "2009-12-01 07:45:00",
+        "2009-12-01 07:45:00",
+        "2009-12-01 09:06:00"
+      ]
+    },
+    {
+      "column": "year",
+      "dtype": "int64",
+      "null_pct": 0.0,
+      "unique_count": 2,
+      "sample_values": [
+        2009,
+        2009,
+        2009
+      ],
+      "mean": 2009.92928,
+      "std": 0.2563578334933183,
+      "min": 2009.0,
+      "median": 2010.0,
+      "max": 2010.0
+    },
+    {
+      "column": "month",
+      "dtype": "int64",
+      "null_pct": 0.0,
+      "unique_count": 12,
+      "sample_values": [
+        12,
+        12,
+        12
+      ],
+      "mean": 7.37759,
+      "std": 3.456656661667856,
+      "min": 1.0,
+      "median": 8.0,
+      "max": 12.0
+    },
+    {
+      "column": "week_of_year",
+      "dtype": "int64",
+      "null_pct": 0.0,
+      "unique_count": 52,
+      "sample_values": [
+        49,
+        49,
+        49
+      ],
+      "mean": 29.91514,
+      "std": 15.003268635903897,
+      "min": 1.0,
+      "median": 33.0,
+      "max": 52.0
+    },
+    {
+      "column": "day_of_week",
+      "dtype": "int64",
+      "null_pct": 0.0,
+      "unique_count": 7,
+      "sample_values": [
+        1,
+        1,
+        1
+      ],
+      "mean": 2.58328,
+      "std": 1.9231592308007859,
+      "min": 0.0,
+      "median": 2.0,
+      "max": 6.0
+    },
+    {
+      "column": "order_hour",
+      "dtype": "int64",
+      "null_pct": 0.0,
+      "unique_count": 14,
+      "sample_values": [
+        7,
+        7,
+        9
+ ], + "mean": 12.68047, + "std": 2.35158794833593, + "min": 7.0, + "median": 13.0, + "max": 20.0 + }, + { + "column": "is_weekend", + "dtype": "int64", + "null_pct": 0.0, + "unique_count": 2, + "sample_values": [ + 0, + 0, + 0 + ], + "mean": 0.15396, + "std": 0.36091220674314933, + "min": 0.0, + "median": 0.0, + "max": 1.0 + }, + { + "column": "country", + "dtype": "str", + "null_pct": 0.0, + "unique_count": 34, + "sample_values": [ + "United Kingdom", + "United Kingdom", + "United Kingdom" + ], + "top_values": { + "count": { + "United Kingdom": 64417, + "Ireland": 8507, + "Germany": 7654 + }, + "pct [%]": { + "United Kingdom": 64.417, + "Ireland": 8.507000000000001, + "Germany": 7.654 + } + }, + "semantic_meaning": "Represents the country where the transaction originated.", + "role": "Feature", + "data_quality_notes": "Data is well-distributed across several countries with a predominance in the United Kingdom.", + "hypotheses": [ + "Transactions from the United Kingdom have a higher total value than other countries.", + "Countries with a lower transaction count like Sweden have a higher average transaction value.", + "Country-specific marketing strategies positively impact sales volume." + ] + }, + { + "column": "country_code", + "dtype": "str", + "null_pct": 0.0, + "unique_count": 34, + "sample_values": [ + "GBR", + "GBR", + "GBR" + ], + "top_values": { + "count": { + "GBR": 64417, + "IRL": 8507, + "DEU": 7654 + }, + "pct [%]": { + "GBR": 64.417, + "IRL": 8.507000000000001, + "DEU": 7.654 + } + }, + "semantic_meaning": "3-letter code representing the country of each transaction.", + "role": "Feature", + "data_quality_notes": "Consistent with country, providing coded labels for countries.", + "hypotheses": [ + "Country codes correlate strongly with country-specific purchasing patterns.", + "The use of certain country codes predicts higher shipping costs.", + "Country codes are better predictors for regional discounts than country names." 
+ ] + }, + { + "column": "product_id", + "dtype": "str", + "null_pct": 0.0, + "unique_count": 3623, + "sample_values": [ + "21523", + "79323W", + "82582" + ], + "top_values": { + "count": { + "POST": 731, + "85123A": 615, + "21212": 438 + }, + "pct [%]": { + "POST": 0.731, + "85123A": 0.615, + "21212": 0.438 + } + }, + "semantic_meaning": "Unique identifier for each product sold.", + "role": "Feature", + "data_quality_notes": "Varied distribution across products indicates a potential for high product diversity.", + "hypotheses": [ + "Products with higher sale counts like 'POST' have a higher discount rate applied.", + "Products with lower counts have a higher average profit margin.", + "Rarely sold products are linked with specific promotional campaigns." + ] + }, + { + "column": "customer_id", + "dtype": "int64", + "null_pct": 0.0, + "unique_count": 4012, + "sample_values": [ + 13085, + 13085, + 13078 + ], + "mean": 14768.12664, + "std": 1799.1647503828826, + "min": 12346.0, + "median": 14646.0, + "max": 18287.0 + }, + { + "column": "unit_price_gbp", + "dtype": "float64", + "null_pct": 0.0, + "unique_count": 300, + "sample_values": [ + 5.95, + 6.75, + 2.1 + ], + "mean": 3.88915772, + "std": 59.75020429686513, + "min": 0.001, + "median": 1.95, + "max": 10953.5 + }, + { + "column": "quantity_sold", + "dtype": "int64", + "null_pct": 0.0, + "unique_count": 232, + "sample_values": [ + 10, + 12, + 12 + ], + "mean": 18.65779, + "std": 159.34650236322747, + "min": 1.0, + "median": 6.0, + "max": 19152.0 + }, + { + "column": "sales_amount_gbp", + "dtype": "float64", + "null_pct": 0.0, + "unique_count": 1541, + "sample_values": [ + 59.5, + 81.0, + 25.200000000000003 + ], + "mean": 26.948917120000004, + "std": 92.39021385230444, + "min": 0.001, + "median": 14.98, + "max": 10953.5 + }, + { + "column": "population_total", + "dtype": "float64", + "null_pct": 0.0, + "unique_count": 55, + "sample_values": [ + 62276270.0, + 62276270.0, + 62276270.0 + ], + "mean": 54098116.95651, + 
"std": 26644482.35245398, + "min": 318041.0, + "median": 62766365.0, + "max": 309378227.0 + }, + { + "column": "gdp_current_usd", + "dtype": "float64", + "null_pct": 0.0, + "unique_count": 55, + "sample_values": [ + 2412840006231.5, + 2412840006231.5, + 2412840006231.5 + ], + "mean": 2161192799869.4167, + "std": 1115049256125.8184, + "min": 9035824366.00804, + "median": 2485482596184.709, + "max": 15048971000000.0 + }, + { + "column": "gdp_growth_pct", + "dtype": "float64", + "null_pct": 0.0, + "unique_count": 55, + "sample_values": [ + -17.633975690892566, + -17.633975690892566, + -17.633975690892566 + ], + "mean": 0.46262588609441824, + "std": 6.134116051369821, + "min": -19.62987001225588, + "median": 3.010667502428644, + "max": 32.50404703691798 + }, + { + "column": "inflation_consumer_pct", + "dtype": "float64", + "null_pct": 0.0, + "unique_count": 55, + "sample_values": [ + 1.89709031895291, + 1.89709031895291, + 1.89709031895291 + ], + "mean": 1.1042501219771699, + "std": 1.6555131180385045, + "min": -15.1829798865339, + "median": 1.58908069179591, + "max": 16.5278863640702 + } + ], + "technical_stats": { + "temporal_boundaries": null, + "quality_reports": {}, + "categorical_distributions": { + "ecommerce_data": { + "country": { + "United Kingdom": { + "count": 64417, + "pct [%]": 64.417 + }, + "Ireland": { + "count": 8507, + "pct [%]": 8.507000000000001 + }, + "Germany": { + "count": 7654, + "pct [%]": 7.654 + }, + "France": { + "count": 5470, + "pct [%]": 5.47 + }, + "Netherlands": { + "count": 2729, + "pct [%]": 2.7289999999999996 + }, + "Spain": { + "count": 1235, + "pct [%]": 1.2349999999999999 + }, + "Switzerland": { + "count": 1170, + "pct [%]": 1.17 + }, + "Belgium": { + "count": 1037, + "pct [%]": 1.0370000000000001 + }, + "Portugal": { + "count": 984, + "pct [%]": 0.984 + }, + "Sweden": { + "count": 868, + "pct [%]": 0.868 + } + }, + "country_code": { + "GBR": { + "count": 64417, + "pct [%]": 64.417 + }, + "IRL": { + "count": 8507, + "pct [%]": 
8.507000000000001 + }, + "DEU": { + "count": 7654, + "pct [%]": 7.654 + }, + "FRA": { + "count": 5470, + "pct [%]": 5.47 + }, + "NLD": { + "count": 2729, + "pct [%]": 2.7289999999999996 + }, + "ESP": { + "count": 1235, + "pct [%]": 1.2349999999999999 + }, + "CHE": { + "count": 1170, + "pct [%]": 1.17 + }, + "BEL": { + "count": 1037, + "pct [%]": 1.0370000000000001 + }, + "PRT": { + "count": 984, + "pct [%]": 0.984 + }, + "SWE": { + "count": 868, + "pct [%]": 0.868 + } + }, + "product_id": { + "POST": { + "count": 731, + "pct [%]": 0.731 + }, + "85123A": { + "count": 615, + "pct [%]": 0.615 + }, + "21212": { + "count": 438, + "pct [%]": 0.438 + }, + "22423": { + "count": 437, + "pct [%]": 0.437 + }, + "85099B": { + "count": 391, + "pct [%]": 0.391 + }, + "20725": { + "count": 334, + "pct [%]": 0.334 + }, + "84991": { + "count": 298, + "pct [%]": 0.298 + }, + "20914": { + "count": 295, + "pct [%]": 0.295 + }, + "21232": { + "count": 295, + "pct [%]": 0.295 + }, + "84879": { + "count": 285, + "pct [%]": 0.28500000000000003 + } + } + } + }, + "numeric_summary": { + "ecommerce_data": { + "year": { + "mean": 2009.92928, + "std": 0.2563578334933183, + "min": 2009.0, + "median": 2010.0, + "max": 2010.0 + }, + "month": { + "mean": 7.37759, + "std": 3.456656661667856, + "min": 1.0, + "median": 8.0, + "max": 12.0 + }, + "week_of_year": { + "mean": 29.91514, + "std": 15.003268635903897, + "min": 1.0, + "median": 33.0, + "max": 52.0 + }, + "day_of_week": { + "mean": 2.58328, + "std": 1.9231592308007859, + "min": 0.0, + "median": 2.0, + "max": 6.0 + }, + "order_hour": { + "mean": 12.68047, + "std": 2.35158794833593, + "min": 7.0, + "median": 13.0, + "max": 20.0 + }, + "is_weekend": { + "mean": 0.15396, + "std": 0.36091220674314933, + "min": 0.0, + "median": 0.0, + "max": 1.0 + }, + "customer_id": { + "mean": 14768.12664, + "std": 1799.1647503828826, + "min": 12346.0, + "median": 14646.0, + "max": 18287.0 + }, + "unit_price_gbp": { + "mean": 3.88915772, + "std": 
59.75020429686513, + "min": 0.001, + "median": 1.95, + "max": 10953.5 + }, + "quantity_sold": { + "mean": 18.65779, + "std": 159.34650236322747, + "min": 1.0, + "median": 6.0, + "max": 19152.0 + }, + "sales_amount_gbp": { + "mean": 26.948917120000004, + "std": 92.39021385230444, + "min": 0.001, + "median": 14.98, + "max": 10953.5 + }, + "population_total": { + "mean": 54098116.95651, + "std": 26644482.35245398, + "min": 318041.0, + "median": 62766365.0, + "max": 309378227.0 + }, + "gdp_current_usd": { + "mean": 2161192799869.4167, + "std": 1115049256125.8184, + "min": 9035824366.00804, + "median": 2485482596184.709, + "max": 15048971000000.0 + }, + "gdp_growth_pct": { + "mean": 0.46262588609441824, + "std": 6.134116051369821, + "min": -19.62987001225588, + "median": 3.010667502428644, + "max": 32.50404703691798 + }, + "inflation_consumer_pct": { + "mean": 1.1042501219771699, + "std": 1.6555131180385045, + "min": -15.1829798865339, + "median": 1.58908069179591, + "max": 16.5278863640702 + } + } + }, + "datetime_columns": {} + }, + "semantic_insights": { + "columns": { + "country": { + "semantic_meaning": "Represents the country where the transaction originated.", + "role": "Feature", + "data_quality_notes": "Data is well-distributed across several countries with a predominance in the United Kingdom.", + "hypotheses": [ + "Transactions from the United Kingdom have a higher total value than other countries.", + "Countries with a lower transaction count like Sweden have a higher average transaction value.", + "Country-specific marketing strategies positively impact sales volume." 
+        ]
+      },
+      "country_code": {
+        "semantic_meaning": "3-letter code representing the country of each transaction.",
+        "role": "Feature",
+        "data_quality_notes": "Consistent with country, providing coded labels for countries.",
+        "hypotheses": [
+          "Country codes correlate strongly with country-specific purchasing patterns.",
+          "The use of certain country codes predicts higher shipping costs.",
+          "Country codes are better predictors for regional discounts than country names."
+        ]
+      },
+      "product_id": {
+        "semantic_meaning": "Unique identifier for each product sold.",
+        "role": "Feature",
+        "data_quality_notes": "Varied distribution across products indicates a potential for high product diversity.",
+        "hypotheses": [
+          "Products with higher sale counts like 'POST' have a higher discount rate applied.",
+          "Products with lower counts have a higher average profit margin.",
+          "Rarely sold products are linked with specific promotional campaigns."
+        ]
+      }
+    }
+  }
+}
\ No newline at end of file
diff --git a/research/agentic_data_science/schema_agent/data_profile_summary.md b/research/agentic_data_science/schema_agent/data_profile_summary.md
new file mode 100644
index 000000000..5ba7c62fe
--- /dev/null
+++ b/research/agentic_data_science/schema_agent/data_profile_summary.md
@@ -0,0 +1,102 @@
+# Data Profile Summary
+
+## Column Profiles
+
+| Column | Meaning | Role | Quality | Hypotheses |
+|--------|---------|------|---------|------------|
+| order_datetime | | | | [] |
+| year | | | | [] |
+| month | | | | [] |
+| week_of_year | | | | [] |
+| day_of_week | | | | [] |
+| order_hour | | | | [] |
+| is_weekend | | | | [] |
+| country | Represents the country where the transaction originated. | Feature | Data is well-distributed across several countries with a predominance in the United Kingdom. | 1. Transactions from the United Kingdom have a higher total value than other countries.<br>2. Countries with a lower transaction count like Sweden have a higher average transaction value.<br>3. Country-specific marketing strategies positively impact sales volume. |
+| country_code | 3-letter code representing the country of each transaction. | Feature | Consistent with country, providing coded labels for countries. | 1. Country codes correlate strongly with country-specific purchasing patterns.<br>2. The use of certain country codes predicts higher shipping costs.<br>3. Country codes are better predictors for regional discounts than country names. |
+| product_id | Unique identifier for each product sold. | Feature | Varied distribution across products indicates a potential for high product diversity. | 1. Products with higher sale counts like 'POST' have a higher discount rate applied.<br>2. Products with lower counts have a higher average profit margin.<br>3. Rarely sold products are linked with specific promotional campaigns. |
+| customer_id | | | | [] |
+| unit_price_gbp | | | | [] |
+| quantity_sold | | | | [] |
+| sales_amount_gbp | | | | [] |
+| population_total | | | | [] |
+| gdp_current_usd | | | | [] |
+| gdp_growth_pct | | | | [] |
+| inflation_consumer_pct | | | | [] |
+
+## Numeric Column Statistics
+
+### ecommerce_data
+
+| Column | Metric | Value |
+|--------|--------|-------|
+| year | mean | 2,009.93 |
+| year | std | 0.2564 |
+| year | min | 2,009.00 |
+| year | median | 2,010.00 |
+| year | max | 2,010.00 |
+| month | mean | 7.38 |
+| month | std | 3.46 |
+| month | min | 1.00 |
+| month | median | 8.00 |
+| month | max | 12.00 |
+| week_of_year | mean | 29.92 |
+| week_of_year | std | 15.00 |
+| week_of_year | min | 1.00 |
+| week_of_year | median | 33.00 |
+| week_of_year | max | 52.00 |
+| day_of_week | mean | 2.58 |
+| day_of_week | std | 1.92 |
+| day_of_week | min | 0.0000 |
+| day_of_week | median | 2.00 |
+| day_of_week | max | 6.00 |
+| order_hour | mean | 12.68 |
+| order_hour | std | 2.35 |
+| order_hour | min | 7.00 |
+| order_hour | median | 13.00 |
+| order_hour | max | 20.00 |
+| is_weekend | mean | 0.1540 |
+| is_weekend | std | 0.3609 |
+| is_weekend | min | 0.0000 |
+| is_weekend | median | 0.0000 |
+| is_weekend | max | 1.00 |
+| customer_id | mean | 14,768.13 |
+| customer_id | std | 1,799.16 |
+| customer_id | min | 12,346.00 |
+| customer_id | median | 14,646.00 |
+| customer_id | max | 18,287.00 |
+| unit_price_gbp | mean | 3.89 |
+| unit_price_gbp | std | 59.75 |
+| unit_price_gbp | min | 0.0010 |
+| unit_price_gbp | median | 1.95 |
+| unit_price_gbp | max | 10,953.50 |
+| quantity_sold | mean | 18.66 |
+| quantity_sold | std | 159.35 |
+| quantity_sold | min | 1.00 |
+| quantity_sold | median | 6.00 |
+| quantity_sold | max | 19,152.00 |
+| sales_amount_gbp | mean | 26.95 |
+| sales_amount_gbp | std | 92.39 |
+| sales_amount_gbp | min | 0.0010 |
+| sales_amount_gbp | median | 14.98 |
+| sales_amount_gbp | max | 10,953.50 |
+| population_total | mean | 54,098,116.96 |
+| population_total | std | 26,644,482.35 |
+| population_total | min | 318,041.00 |
+| population_total | median | 62,766,365.00 |
+| population_total | max | 309,378,227.00 |
+| gdp_current_usd | mean | 2,161,192,799,869.42 |
+| gdp_current_usd | std | 1,115,049,256,125.82 |
+| gdp_current_usd | min | 9,035,824,366.01 |
+| gdp_current_usd | median | 2,485,482,596,184.71 |
+| gdp_current_usd | max | 15,048,971,000,000.00 |
+| gdp_growth_pct | mean | 0.4626 |
+| gdp_growth_pct | std | 6.13 |
+| gdp_growth_pct | min | -19.63 |
+| gdp_growth_pct | median | 3.01 |
+| gdp_growth_pct | max | 32.50 |
+| inflation_consumer_pct | mean | 1.10 |
+| inflation_consumer_pct | std | 1.66 |
+| inflation_consumer_pct | min | -15.18 |
+| inflation_consumer_pct | median | 1.59 |
+| inflation_consumer_pct | max | 16.53 |
+
diff --git a/research/agentic_data_science/schema_agent/docker_bash.sh b/research/agentic_data_science/schema_agent/docker_bash.sh
new file mode 100755
index 000000000..0025e81f4
--- /dev/null
+++ b/research/agentic_data_science/schema_agent/docker_bash.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+# """
+# This script launches a Docker container with an interactive bash shell for
+# development.
+# """
+
+# Exit immediately if any command exits with a non-zero status.
+set -e
+
+# Import the utility functions from the project template.
+GIT_ROOT=$(git rev-parse --show-toplevel)
+source $GIT_ROOT/class_project/project_template/utils.sh
+
+# Parse default args (-h, -v) and enable set -x if -v is passed.
+parse_default_args "$@"
+
+# Load Docker configuration variables for this script.
+get_docker_vars_script ${BASH_SOURCE[0]}
+source $DOCKER_NAME
+print_docker_vars
+
+# List the available Docker images matching the expected image name.
+run "docker image ls $FULL_IMAGE_NAME"
+
+# Configure and run the Docker container with interactive bash shell.
+# - Container is removed automatically on exit (--rm)
+# - Interactive mode with TTY allocation (-ti)
+# - Port forwarding for Jupyter or other services
+# - Git root mounted to /git_root inside container
+CONTAINER_NAME=${IMAGE_NAME}_bash
+PORT=
+DOCKER_CMD=$(get_docker_bash_command)
+DOCKER_CMD_OPTS=$(get_docker_bash_options $CONTAINER_NAME $PORT)
+run "$DOCKER_CMD $DOCKER_CMD_OPTS $FULL_IMAGE_NAME"
diff --git a/research/agentic_data_science/schema_agent/docker_build.sh b/research/agentic_data_science/schema_agent/docker_build.sh
new file mode 100755
index 000000000..5b0957a99
--- /dev/null
+++ b/research/agentic_data_science/schema_agent/docker_build.sh
@@ -0,0 +1,40 @@
+#!/bin/bash
+# """
+# Build a Docker container image for the project.
+#
+# This script sets up the build environment with error handling and command
+# tracing, loads Docker configuration from docker_name.sh, and builds the
+# Docker image using the build_container_image utility function. It supports
+# both single-architecture and multi-architecture builds via the
+# DOCKER_BUILD_MULTI_ARCH environment variable.
+# """
+
+# Exit immediately if any command exits with a non-zero status.
+set -e
+
+# Import the utility functions.
+GIT_ROOT=$(git rev-parse --show-toplevel)
+source $GIT_ROOT/class_project/project_template/utils.sh
+
+# Parse default args (-h, -v) and enable set -x if -v is passed.
+# Shift processed option flags so remaining args are passed to the build.
+parse_default_args "$@"
+shift $((OPTIND-1))
+
+# Load Docker configuration variables (REPO_NAME, IMAGE_NAME, FULL_IMAGE_NAME).
+get_docker_vars_script ${BASH_SOURCE[0]}
+source $DOCKER_NAME
+print_docker_vars
+
+# Configure Docker build settings.
+# Enable BuildKit for improved build performance and features.
+export DOCKER_BUILDKIT=1
+#export DOCKER_BUILDKIT=0
+
+# Configure single-architecture build (set to 1 for multi-arch build).
+#export DOCKER_BUILD_MULTI_ARCH=1 +export DOCKER_BUILD_MULTI_ARCH=0 + +# Build the container image. +# Pass extra arguments (e.g., --no-cache) via command line after -v. +build_container_image "$@" diff --git a/research/agentic_data_science/schema_agent/docker_build.version.log b/research/agentic_data_science/schema_agent/docker_build.version.log new file mode 100644 index 000000000..d60536643 --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_build.version.log @@ -0,0 +1,166 @@ +# Python3 +Python 3.12.13 +# pip3 +pip 26.0.1 from /opt/venv/lib/python3.12/site-packages/pip (python 3.12) +# jupyter +Selected Jupyter core packages... +IPython : 9.12.0 +ipykernel : 7.2.0 +ipywidgets : not installed +jupyter_client : 8.8.0 +jupyter_core : 5.9.1 +jupyter_server : 2.17.0 +jupyterlab : 4.5.6 +nbclient : 0.10.4 +nbconvert : 7.17.0 +nbformat : 5.10.4 +notebook : not installed +qtconsole : not installed +traitlets : 5.14.3 +# Python packages +Package Version +------------------------- ------------ +aiohappyeyeballs 2.6.1 +aiohttp 3.13.5 +aiosignal 1.4.0 +annotated-types 0.7.0 +anthropic 0.89.0 +anyio 4.13.0 +argon2-cffi 25.1.0 +argon2-cffi-bindings 25.1.0 +arrow 1.4.0 +asttokens 3.0.1 +async-lru 2.3.0 +attrs 26.1.0 +babel 2.18.0 +beautifulsoup4 4.14.3 +bleach 6.3.0 +certifi 2026.2.25 +cffi 2.0.0 +charset-normalizer 3.4.7 +click 8.3.2 +click-default-group 1.2.4 +comm 0.2.3 +condense-json 0.1.3 +debugpy 1.8.20 +decorator 5.2.1 +defusedxml 0.7.1 +distro 1.9.0 +docstring_parser 0.17.0 +dotenv 0.9.9 +executing 2.2.1 +fastjsonschema 2.21.2 +fqdn 1.5.1 +frozenlist 1.8.0 +h11 0.16.0 +httpcore 1.0.9 +httpx 0.28.1 +idna 3.11 +ipykernel 7.2.0 +ipython 9.12.0 +ipython_pygments_lexers 1.1.1 +isoduration 20.11.0 +jedi 0.19.2 +Jinja2 3.1.6 +jiter 0.13.0 +json5 0.14.0 +jsonpatch 1.33 +jsonpointer 3.1.1 +jsonschema 4.26.0 +jsonschema-specifications 2025.9.1 +jupyter_client 8.8.0 +jupyter_core 5.9.1 +jupyter-events 0.12.0 +jupyter-lsp 2.3.1 +jupyter_server 2.17.0 
+jupyter_server_terminals 0.5.4 +jupyterlab 4.5.6 +jupyterlab_pygments 0.3.0 +jupyterlab_server 2.28.0 +langchain-core 1.2.27 +langchain-openai 1.1.12 +langgraph 1.1.6 +langgraph-checkpoint 4.0.1 +langgraph-prebuilt 1.0.9 +langgraph-sdk 0.3.12 +langsmith 0.7.26 +lark 1.3.1 +llm 0.30 +MarkupSafe 3.0.3 +matplotlib-inline 0.2.1 +mistune 3.2.0 +multidict 6.7.1 +nbclient 0.10.4 +nbconvert 7.17.0 +nbformat 5.10.4 +nest-asyncio 1.6.0 +notebook_shim 0.2.4 +numpy 2.4.4 +openai 2.30.0 +orjson 3.11.8 +ormsgpack 1.12.2 +packaging 26.0 +pandas 3.0.2 +pandocfilters 1.5.1 +parso 0.8.6 +pexpect 4.9.0 +pip 26.0.1 +platformdirs 4.9.4 +pluggy 1.6.0 +prometheus_client 0.24.1 +prompt_toolkit 3.0.52 +propcache 0.4.1 +psutil 7.2.2 +ptyprocess 0.7.0 +pure_eval 0.2.3 +puremagic 2.1.1 +pycparser 3.0 +pydantic 2.12.5 +pydantic_core 2.41.5 +Pygments 2.20.0 +python-dateutil 2.9.0.post0 +python-dotenv 1.2.2 +python-json-logger 4.1.0 +python-ulid 3.1.0 +pytz 2026.1.post1 +PyYAML 6.0.3 +pyzmq 27.1.0 +referencing 0.37.0 +regex 2026.4.4 +requests 2.33.1 +requests-toolbelt 1.0.0 +rfc3339-validator 0.1.4 +rfc3986-validator 0.1.1 +rfc3987-syntax 1.1.0 +rpds-py 0.30.0 +Send2Trash 2.1.0 +setuptools 82.0.1 +six 1.17.0 +sniffio 1.3.1 +soupsieve 2.8.3 +sqlite-fts4 1.0.3 +sqlite-migrate 0.1b0 +sqlite-utils 3.39 +stack-data 0.6.3 +tabulate 0.10.0 +tenacity 9.1.4 +terminado 0.18.1 +tiktoken 0.12.0 +tinycss2 1.4.0 +tokencost 0.1.26 +tornado 6.5.5 +tqdm 4.67.3 +traitlets 5.14.3 +typing_extensions 4.15.0 +typing-inspection 0.4.2 +tzdata 2026.1 +uri-template 1.3.0 +urllib3 2.6.3 +uuid_utils 0.14.1 +wcwidth 0.6.0 +webcolors 25.10.0 +webencodings 0.5.1 +websocket-client 1.9.0 +xxhash 3.6.0 +yarl 1.23.0 +zstandard 0.25.0 diff --git a/research/agentic_data_science/schema_agent/docker_clean.sh b/research/agentic_data_science/schema_agent/docker_clean.sh new file mode 100755 index 000000000..7e40839ae --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_clean.sh @@ -0,0 +1,26 @@ +#!/bin/bash +# """ +# 
Remove Docker container image for the project. +# +# This script cleans up Docker images by removing the container image +# matching the project configuration. Useful for freeing disk space or +# ensuring a fresh build. +# """ + +# Exit immediately if any command exits with a non-zero status. +set -e + +# Import the utility functions. +GIT_ROOT=$(git rev-parse --show-toplevel) +source $GIT_ROOT/class_project/project_template/utils.sh + +# Parse default args (-h, -v) and enable set -x if -v is passed. +parse_default_args "$@" + +# Load Docker configuration variables for this script. +get_docker_vars_script ${BASH_SOURCE[0]} +source $DOCKER_NAME +print_docker_vars + +# Remove the container image. +remove_container_image diff --git a/research/agentic_data_science/schema_agent/docker_cmd.sh b/research/agentic_data_science/schema_agent/docker_cmd.sh new file mode 100755 index 000000000..906d7a77b --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_cmd.sh @@ -0,0 +1,41 @@ +#!/bin/bash +# """ +# Execute a command in a Docker container. +# +# This script runs a specified command inside a new Docker container instance. +# The container is removed automatically after the command completes. The +# git root is mounted to /git_root inside the container. +# """ + +# Exit immediately if any command exits with a non-zero status. +set -e + +# Import the utility functions. +GIT_ROOT=$(git rev-parse --show-toplevel) +source $GIT_ROOT/class_project/project_template/utils.sh + +# Parse default args (-h, -v) and enable set -x if -v is passed. +# Shift processed option flags so remaining args form the command. +parse_default_args "$@" +shift $((OPTIND-1)) + +# Capture the command to execute from remaining arguments. +CMD="$@" +echo "Executing: '$CMD'" + +# Load Docker configuration variables for this script. +get_docker_vars_script ${BASH_SOURCE[0]} +source $DOCKER_NAME +print_docker_vars + +# List available Docker images matching the expected image name. 
+run "docker image ls $FULL_IMAGE_NAME" +#(docker manifest inspect $FULL_IMAGE_NAME | grep arch) || true + +# Configure and run the Docker container with the specified command. +CONTAINER_NAME=$IMAGE_NAME +DOCKER_CMD=$(get_docker_cmd_command) +PORT="" +DOCKER_RUN_OPTS="" +DOCKER_CMD_OPTS=$(get_docker_bash_options $CONTAINER_NAME $PORT $DOCKER_RUN_OPTS) +run "$DOCKER_CMD $DOCKER_CMD_OPTS $FULL_IMAGE_NAME bash -c '$CMD'" diff --git a/research/agentic_data_science/schema_agent/docker_exec.sh b/research/agentic_data_science/schema_agent/docker_exec.sh new file mode 100755 index 000000000..24f8e401a --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_exec.sh @@ -0,0 +1,25 @@ +#!/bin/bash +# """ +# Execute a bash shell in a running Docker container. +# +# This script connects to an already running Docker container and opens an +# interactive bash session for debugging or inspection purposes. +# """ + +# Exit immediately if any command exits with a non-zero status. +set -e + +# Import the utility functions. +GIT_ROOT=$(git rev-parse --show-toplevel) +source $GIT_ROOT/class_project/project_template/utils.sh + +# Parse default args (-h, -v) and enable set -x if -v is passed. +parse_default_args "$@" + +# Load Docker configuration variables for this script. +get_docker_vars_script ${BASH_SOURCE[0]} +source $DOCKER_NAME +print_docker_vars + +# Execute bash shell in the running container. +exec_container diff --git a/research/agentic_data_science/schema_agent/docker_jupyter.sh b/research/agentic_data_science/schema_agent/docker_jupyter.sh new file mode 100755 index 000000000..6c7d09b13 --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_jupyter.sh @@ -0,0 +1,37 @@ +#!/bin/bash +# """ +# Execute Jupyter Lab in a Docker container. +# +# This script launches a Docker container running Jupyter Lab with +# configurable port, directory mounting, and vim bindings. It passes +# command-line options to the run_jupyter.sh script inside the container. 
+#
+# Usage:
+# > docker_jupyter.sh [options]
+# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Import the utility functions.
GIT_ROOT=$(git rev-parse --show-toplevel)
source $GIT_ROOT/class_project/project_template/utils.sh

# Parse command-line options and set Jupyter configuration variables.
parse_docker_jupyter_args "$@"

# Load Docker configuration variables for this script.
get_docker_vars_script ${BASH_SOURCE[0]}
source $DOCKER_NAME
print_docker_vars

# List available Docker images and inspect architecture.
run "docker image ls $FULL_IMAGE_NAME"
(docker manifest inspect $FULL_IMAGE_NAME | grep arch) || true

# Run the Docker container with Jupyter Lab.
CMD=$(get_run_jupyter_cmd "${BASH_SOURCE[0]}" "$OLD_CMD_OPTS")
CONTAINER_NAME=$IMAGE_NAME
DOCKER_CMD=$(get_docker_jupyter_command)
DOCKER_CMD_OPTS=$(get_docker_jupyter_options $CONTAINER_NAME $JUPYTER_HOST_PORT $JUPYTER_USE_VIM)
run "$DOCKER_CMD $DOCKER_CMD_OPTS $FULL_IMAGE_NAME $CMD" diff --git a/research/agentic_data_science/schema_agent/docker_name.sh b/research/agentic_data_science/schema_agent/docker_name.sh new file mode 100644 index 000000000..1d6f8a55c --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_name.sh @@ -0,0 +1,12 @@ +#!/bin/bash +# """ +# Docker image naming configuration. +# +# This file defines the repository name, image name, and full image name +# variables used by all docker_*.sh scripts in the project template. +# """ + +REPO_NAME=gpsaggese +# The image name must be all lowercase. +IMAGE_NAME=umd_schema_agent +FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME diff --git a/research/agentic_data_science/schema_agent/docker_push.sh b/research/agentic_data_science/schema_agent/docker_push.sh new file mode 100755 index 000000000..27d752dd9 --- /dev/null +++ b/research/agentic_data_science/schema_agent/docker_push.sh @@ -0,0 +1,25 @@ +#!/bin/bash +# """ +# Push Docker container image to Docker Hub or registry.
+# +# This script authenticates with the Docker registry using credentials from +# ~/.docker/passwd.$REPO_NAME.txt and pushes the locally built container +# image to the remote repository. +# """ + +# Exit immediately if any command exits with a non-zero status. +set -e + +# Import the utility functions. +GIT_ROOT=$(git rev-parse --show-toplevel) +source $GIT_ROOT/class_project/project_template/utils.sh + +# Parse default args (-h, -v) and enable set -x if -v is passed. +parse_default_args "$@" + +# Load Docker image naming configuration. +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source $SCRIPT_DIR/docker_name.sh + +# Push the container image to the registry. +push_container_image diff --git a/research/agentic_data_science/schema_agent/requirements.in b/research/agentic_data_science/schema_agent/requirements.in new file mode 100644 index 000000000..08dbfa79b --- /dev/null +++ b/research/agentic_data_science/schema_agent/requirements.in @@ -0,0 +1,18 @@ +pandas==3.0.2 +numpy==2.4.4 + +langchain-core==1.2.27 +langchain-openai==1.1.12 + +langgraph==1.1.6 +langgraph-checkpoint==4.0.1 +langgraph-prebuilt==1.0.9 +langgraph-sdk==0.3.12 + +llm==0.30 +tokencost==0.1.26 + +pytz==2026.1.post1 +python-dotenv==1.2.2 + +setuptools>=65.0.0 \ No newline at end of file diff --git a/research/agentic_data_science/schema_agent/requirements.txt b/research/agentic_data_science/schema_agent/requirements.txt index 0ce56331c..3d14b9e25 100644 --- a/research/agentic_data_science/schema_agent/requirements.txt +++ b/research/agentic_data_science/schema_agent/requirements.txt @@ -1,6 +1,226 @@ -pandas -langchain_core -langchain_openai -langgraph -llm -tokencost \ No newline at end of file +# +# This file is autogenerated by pip-compile with Python 3.12 +# by the following command: +# +# pip-compile requirements.in +# +aiohappyeyeballs==2.6.1 + # via aiohttp +aiohttp==3.13.5 + # via tokencost +aiosignal==1.4.0 + # via aiohttp +annotated-types==0.7.0 + # via pydantic 
+anthropic==0.92.0 + # via tokencost +anyio==4.13.0 + # via + # anthropic + # httpx + # openai +attrs==26.1.0 + # via aiohttp +certifi==2026.2.25 + # via + # httpcore + # httpx + # requests +charset-normalizer==3.4.7 + # via requests +click==8.3.2 + # via + # click-default-group + # llm + # sqlite-utils +click-default-group==1.2.4 + # via + # llm + # sqlite-utils +condense-json==0.1.3 + # via llm +distro==1.9.0 + # via + # anthropic + # openai +docstring-parser==0.17.0 + # via anthropic +frozenlist==1.8.0 + # via + # aiohttp + # aiosignal +h11==0.16.0 + # via httpcore +httpcore==1.0.9 + # via httpx +httpx==0.28.1 + # via + # anthropic + # langgraph-sdk + # langsmith + # openai +idna==3.11 + # via + # anyio + # httpx + # requests + # yarl +jiter==0.13.0 + # via + # anthropic + # openai +jsonpatch==1.33 + # via langchain-core +jsonpointer==3.1.1 + # via jsonpatch +langchain-core==1.2.27 + # via + # -r requirements.in + # langchain-openai + # langgraph + # langgraph-checkpoint + # langgraph-prebuilt +langchain-openai==1.1.12 + # via -r requirements.in +langgraph==1.1.6 + # via -r requirements.in +langgraph-checkpoint==4.0.1 + # via + # -r requirements.in + # langgraph + # langgraph-prebuilt +langgraph-prebuilt==1.0.9 + # via + # -r requirements.in + # langgraph +langgraph-sdk==0.3.12 + # via + # -r requirements.in + # langgraph +langsmith==0.7.29 + # via langchain-core +llm==0.30 + # via -r requirements.in +multidict==6.7.1 + # via + # aiohttp + # yarl +numpy==2.4.4 + # via + # -r requirements.in + # pandas +openai==2.31.0 + # via + # langchain-openai + # llm +orjson==3.11.8 + # via + # langgraph-sdk + # langsmith +ormsgpack==1.12.2 + # via langgraph-checkpoint +packaging==26.0 + # via + # langchain-core + # langsmith +pandas==3.0.2 + # via -r requirements.in +pluggy==1.6.0 + # via + # llm + # sqlite-utils +propcache==0.4.1 + # via + # aiohttp + # yarl +puremagic==2.2.0 + # via llm +pydantic==2.12.5 + # via + # anthropic + # langchain-core + # langgraph + # langsmith 
+ # llm + # openai +pydantic-core==2.41.5 + # via pydantic +python-dateutil==2.9.0.post0 + # via + # pandas + # sqlite-utils +python-dotenv==1.2.2 + # via -r requirements.in +python-ulid==3.1.0 + # via llm +pytz==2026.1.post1 + # via -r requirements.in +pyyaml==6.0.3 + # via + # langchain-core + # llm +regex==2026.4.4 + # via tiktoken +requests==2.33.1 + # via + # langsmith + # requests-toolbelt + # tiktoken +requests-toolbelt==1.0.0 + # via langsmith +six==1.17.0 + # via python-dateutil +sniffio==1.3.1 + # via + # anthropic + # openai +sqlite-fts4==1.0.3 + # via sqlite-utils +sqlite-migrate==0.1b0 + # via llm +sqlite-utils==3.39 + # via + # llm + # sqlite-migrate +tabulate==0.10.0 + # via sqlite-utils +tenacity==9.1.4 + # via langchain-core +tiktoken==0.12.0 + # via + # langchain-openai + # tokencost +tokencost==0.1.26 + # via -r requirements.in +tqdm==4.67.3 + # via openai +typing-extensions==4.15.0 + # via + # aiosignal + # anthropic + # anyio + # langchain-core + # openai + # pydantic + # pydantic-core + # typing-inspection +typing-inspection==0.4.2 + # via pydantic +urllib3==2.6.3 + # via requests +uuid-utils==0.14.1 + # via + # langchain-core + # langsmith +xxhash==3.6.0 + # via + # langgraph + # langsmith +yarl==1.23.0 + # via aiohttp +zstandard==0.25.0 + # via langsmith + +# The following packages are considered to be unsafe in a requirements file: +# pip +# setuptools diff --git a/research/agentic_data_science/schema_agent/run_jupyter.sh b/research/agentic_data_science/schema_agent/run_jupyter.sh new file mode 100755 index 000000000..342a73f79 --- /dev/null +++ b/research/agentic_data_science/schema_agent/run_jupyter.sh @@ -0,0 +1,36 @@ +#!/bin/bash +# """ +# Launch Jupyter Lab server. 
+#
+# This script starts Jupyter Lab on port 8888 with the following configuration:
+# - No browser auto-launch (useful for Docker containers)
+# - Accessible from any IP address (0.0.0.0)
+# - Root user allowed (required for Docker environments)
+# - No authentication token or password (for development convenience)
+# - Vim keybindings can be enabled via JUPYTER_USE_VIM environment variable
+# """

# Exit immediately if any command exits with a non-zero status.
set -e

# Print each command to stdout before executing it.
#set -x

# Import the utility functions from /git_root.
GIT_ROOT=/git_root
source $GIT_ROOT/class_project/project_template/utils.sh

# Load Docker configuration variables for this script.
get_docker_vars_script ${BASH_SOURCE[0]}
source $DOCKER_NAME
print_docker_vars

# Configure vim keybindings and notifications.
configure_jupyter_vim_keybindings
configure_jupyter_notifications

# Initialize Jupyter Lab command with base configuration.
JUPYTER_ARGS=$(get_jupyter_args)

# Start Jupyter Lab with development-friendly settings.
run "jupyter lab $JUPYTER_ARGS" diff --git a/research/agentic_data_science/schema_agent/schema_agent.API.ipynb b/research/agentic_data_science/schema_agent/schema_agent.API.ipynb new file mode 100644 index 000000000..4845fc9e1 --- /dev/null +++ b/research/agentic_data_science/schema_agent/schema_agent.API.ipynb @@ -0,0 +1,394 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8881f77e-d668-4210-b5c5-06fad5f80608", + "metadata": {}, + "source": [ + "# API usage Notebook\n", + "- This notebook demonstrates how to use each function from the schema agent libraries."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "3d4af97d-2052-4791-8b80-f9fa973b8233", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "import dotenv\n", + "import os\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "# Load environment variables (ensure OPENAI_API_KEY is set in your .env)\n", + "dotenv.load_dotenv()\n", + "\n", + "# Import the schema agent modules\n", + "import research.agentic_data_science.schema_agent.schema_agent_loader as radsasal\n", + "import research.agentic_data_science.schema_agent.schema_agent_stats as radsasas\n", + "import research.agentic_data_science.schema_agent.schema_agent_hllmcli as radsasah\n", + "import research.agentic_data_science.schema_agent.schema_agent_report as radsasar" + ] + }, + { + "cell_type": "markdown", + "id": "20ab6884-fddf-4205-8bb7-eab2704f6f1d", + "metadata": {}, + "source": [ + "## 1. Create dummy Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "1ada4cd7-10bb-45c5-be2e-7cce4a5b4de1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Created dummy dataset at: dummy_employees.csv\n" + ] + } + ], + "source": [ + "# 1. 
Create a dummy dataset\n", + "np.random.seed(42)\n", + "num_rows = 100\n", + "\n", + "dummy_data = pd.DataFrame({\n", + " \"employee_id\": range(1000, 1000 + num_rows),\n", + " \"department\": np.random.choice([\"Engineering\", \"Sales\", \"HR\", \"Marketing\"], num_rows),\n", + " \"salary\": np.random.normal(85000, 20000, num_rows),\n", + " \"satisfaction_score\": np.random.uniform(1.0, 5.0, num_rows),\n", + " \"hire_date\": pd.date_range(start=\"2018-01-01\", periods=num_rows, freq=\"W\").astype(str),\n", + " \"notes\": [\"Good performance\"] * 50 + [None] * 50 # 50% nulls\n", + "})\n", + "\n", + "# Inject some missing values into salary\n", + "dummy_data.loc[10:20, \"salary\"] = np.nan\n", + "\n", + "# Save to CSV\n", + "csv_path = \"dummy_employees.csv\"\n", + "dummy_data.to_csv(csv_path, index=False)\n", + "print(f\"Created dummy dataset at: {csv_path}\")\n", + "dummy_data.head()\n", + "\n", + "csv_paths = [csv_path]\n", + "tags = [\"dummy_employees\"]" + ] + }, + { + "cell_type": "markdown", + "id": "25a81ada-67c5-4a5e-a22e-453f2b222b06", + "metadata": {}, + "source": [ + "## 2. 
Load and Infer datatypes from the columns" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e79d8059-c49d-438e-ad61-7a4c160370d4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- Loaded DataFrames ---\n", + "\n", + "DatetimeIndex: 100 entries, 2018-01-07 00:00:00+00:00 to 2019-12-01 00:00:00+00:00\n", + "Data columns (total 6 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 employee_id 100 non-null int64 \n", + " 1 department 100 non-null str \n", + " 2 salary 89 non-null float64 \n", + " 3 satisfaction_score 100 non-null float64 \n", + " 4 hire_date 100 non-null datetime64[us, UTC]\n", + " 5 notes 50 non-null str \n", + "dtypes: datetime64[us, UTC](1), float64(2), int64(1), str(2)\n", + "memory usage: 5.5 KB\n", + "None\n", + "\n", + "--- Datetime Inference Metadata ---\n", + "{'hire_date': {'semantic_type': 'temporal', 'granularity': 'date', 'format': 'inferred', 'confidence': 1.0}}\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/git_root/research/agentic_data_science/schema_agent/schema_agent_loader.py:75: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", + " parsed = pd.to_datetime(df[col], errors=\"coerce\", utc=True)\n" + ] + } + ], + "source": [ + "# 1. Load and prepare DataFrames - now receiving 3 variables\n", + "tag_to_df, cat_cols_map, datetime_meta = radsasal.prepare_dataframes(csv_paths, tags)\n", + "\n", + "print(\"--- Loaded DataFrames ---\")\n", + "# The index will now show as a DatetimeIndex instead of a RangeIndex\n", + "print(tag_to_df[\"dummy_employees\"].info())\n", + "\n", + "# 2. 
Combine DataFrames while preserving the index\n", + "# We do NOT use ignore_index=True here because we want to keep the DatetimeIndex \n", + "# we just created in the loader.\n", + "updated_df = pd.concat(list(tag_to_df.values()), axis=0)\n", + "\n", + "print(\"\\n--- Datetime Inference Metadata ---\")\n", + "# This will now correctly show your temporal column info\n", + "print(datetime_meta)" + ] + }, + { + "cell_type": "markdown", + "id": "dc6c4466-d602-4c82-a2f6-b7ff7ada8e19", + "metadata": {}, + "source": [ + "## 3. Statistical Profiling" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "1368b932-6299-43c1-a70c-c8e9489eb2b2", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== Temporal Boundaries ===\n", + " min_index max_index min_valid_index max_valid_index\n", + "dummy_employees 2018-01-07 00:00:00+00:00 2019-12-01 00:00:00+00:00 2018-01-07 00:00:00+00:00 2018-12-16 00:00:00+00:00\n", + " employee_id salary satisfaction_score\n", + "2018-01-07 00:00:00+00:00 1000 99769.3316 4.268889\n", + "2018-01-14 00:00:00+00:00 1001 88427.365624 3.220803\n", + "... ... ... 
...\n", + "2019-11-24 00:00:00+00:00 1098 101270.344347 3.313121\n", + "2019-12-01 00:00:00+00:00 1099 60382.713671 1.143769\n", + " num_rows num_zeros zeros [%] num_nans nans [%] num_infs infs [%] num_valid valid [%]\n", + "employee_id 100 0 0.0 0 0.0 0 0.0 100 100.0\n", + "salary 100 0 0.0 11 11.0 0 0.0 89 89.0\n", + "satisfaction_score 100 0 0.0 0 0.0 0 0.0 100 100.0\n", + "\n", + "=== Quality Report: dummy_employees ===\n", + " num_rows num_zeros zeros [%] num_nans nans [%] \\\n", + "employee_id 100 0 0.0 0 0.0 \n", + "salary 100 0 0.0 11 11.0 \n", + "satisfaction_score 100 0 0.0 0 0.0 \n", + "\n", + " num_infs infs [%] num_valid valid [%] \n", + "employee_id 0 0.0 100 100.0 \n", + "salary 0 0.0 89 89.0 \n", + "satisfaction_score 0 0.0 100 100.0 \n", + "\n", + "=== Distribution: dummy_employees / department ===\n", + " count pct [%]\n", + "department \n", + "Marketing 30 30.0\n", + "Sales 26 26.0\n", + "HR 24 24.0\n", + "Engineering 20 20.0\n", + "\n", + "=== Distribution: dummy_employees / notes ===\n", + " count pct [%]\n", + "notes \n", + "Good performance 50 50.0\n", + "\n", + "=== Numeric Summary: dummy_employees ===\n", + " mean std min max\n", + "employee_id 1049.500000 29.011492 1000.000000 1099.000000\n", + "salary 83981.174276 19304.098590 32605.097918 134264.842250\n", + "satisfaction_score 3.197062 1.163419 1.020246 4.960215\n", + "\n", + "--- Stats Computation Complete ---\n", + "Calculated stats for tags: ['dummy_employees']\n" + ] + } + ], + "source": [ + "# We pass the metadata we just generated into the stats function\n", + "stats = radsasas.compute_llm_agent_stats(\n", + " tag_to_df=tag_to_df,\n", + " categorical_cols_map=cat_cols_map,\n", + " metrics=[\"mean\", \"std\", \"min\", \"max\"]\n", + ")\n", + "\n", + "# Manually ensure the datetime_columns key is populated for the LLM\n", + "stats[\"datetime_columns\"] = datetime_meta\n", + "\n", + "print(\"\\n--- Stats Computation Complete ---\")\n", + "print(f\"Calculated stats for tags: 
{list(stats['numeric_summary'].keys())}\")" + ] + }, + { + "cell_type": "markdown", + "id": "41cf298b-2e84-4d82-8073-79f7f0c07277", + "metadata": {}, + "source": [ + "## 4. Call LLM for column type inferencing" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "db84d7ee-715d-464a-94e9-fd05261e36a4", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Cache hit for apply_llm\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Selected columns for LLM: ['employee_id', 'department', 'salary', 'satisfaction_score', 'hire_date', 'notes']\n", + "\n", + "--- LLM Prompt Snippet ---\n", + "You are a Senior Data Scientist and Domain Expert.\n", + "Analyze the provided dataset statistics and generate a profile for each column.\n", + "For each column, provide 2-3 testable hypotheses.\n", + "Example: 'Higher discount rates correlate with higher volume but lower margins.'\n", + "\n", + "--- DATASET STATISTICS ---\n", + "\n", + "Detected Datetime Columns:\n", + "{\n", + " \"hire_date\": {\n", + " \"semantic_type\": \"temporal\",\n", + " \"granularity\": \"date\",\n", + " \"format\": \"inferred\",\n", + " \"confidence\": 1.0\n", + " }\n", + "}\n", + "\n", + "Dataset [dummy_employees] Numeric Summary:\n", + " \n", + "...\n", + "\n", + "--- LLM Insights Retrieved Successfully ---\n" + ] + } + ], + "source": [ + "# 1. Select columns (e.g., let's just send everything)\n", + "columns_for_llm = radsasah._select_columns_for_llm(updated_df, scope=\"all\")\n", + "print(f\"Selected columns for LLM: {columns_for_llm}\\n\")\n", + "\n", + "# 2. Build the exact prompt string that goes to the LLM\n", + "prompt_text = radsasah.build_llm_prompt(stats, columns_to_include=columns_for_llm)\n", + "print(\"--- LLM Prompt Snippet ---\")\n", + "print(prompt_text[:500] + \"\\n...\\n\")\n", + "\n", + "# 3. 
Call the LLM to generate hypotheses (using gpt-4o as default)\n", + "# If you don't have an API key configured, you can mock this response by creating a static dict.\n", + "try:\n", + " semantic_insights = radsasah.generate_hypotheses_via_cli(\n", + " stats=stats,\n", + " model=\"gpt-4o\",\n", + " columns_to_include=columns_for_llm\n", + " )\n", + " print(\"--- LLM Insights Retrieved Successfully ---\")\n", + "except Exception as e:\n", + " print(f\"LLM call failed (Check API key): {e}\")\n", + " semantic_insights = {\"columns\": {}} # Fallback empty dict" + ] + }, + { + "cell_type": "markdown", + "id": "0b497a08-1e1f-47b3-8f7d-8a7038488c4d", + "metadata": {}, + "source": [ + "## 5. Export to JSON and Markdown" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "a3cd0c87-5951-4da3-b2ca-22abecebe626", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Pipeline complete! Check your directory for:\n", + "1. dummy_profile_report.json\n", + "2. dummy_profile_summary.md\n" + ] + } + ], + "source": [ + "# 1. Build structured column profiles\n", + "primary_df = list(tag_to_df.values())[0]\n", + "column_profiles = radsasar.build_column_profiles(\n", + " df=primary_df,\n", + " stats=stats,\n", + " insights=semantic_insights\n", + ")\n", + "\n", + "# 2. Export to JSON\n", + "json_out = \"dummy_profile_report.json\"\n", + "radsasar.merge_and_export_results(\n", + " stats=stats,\n", + " insights=semantic_insights,\n", + " column_profiles=column_profiles,\n", + " output_path=json_out\n", + ")\n", + "\n", + "# 3. Export to Markdown\n", + "md_out = \"dummy_profile_summary.md\"\n", + "radsasar.export_markdown_from_profiles(\n", + " column_profiles=column_profiles,\n", + " numeric_stats=stats.get(\"numeric_summary\", {}),\n", + " output_path=md_out\n", + ")\n", + "\n", + "print(f\"\\nPipeline complete! Check your directory for:\")\n", + "print(f\"1. {json_out}\")\n", + "print(f\"2. 
{md_out}\")\n", + "\n", + "# Clean up dummy CSV if desired\n", + "# os.remove(csv_path)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/research/agentic_data_science/schema_agent/schema_agent.API.py b/research/agentic_data_science/schema_agent/schema_agent.API.py new file mode 100644 index 000000000..939295380 --- /dev/null +++ b/research/agentic_data_science/schema_agent/schema_agent.API.py @@ -0,0 +1,163 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.19.1 +# kernelspec: +# display_name: Python 3 (ipykernel) +# language: python +# name: python3 +# --- + +# %% [markdown] +# # API usage Notebook +# - This notebook demonstrates how to use each function from the schema agent libraries. + +# %% +# %load_ext autoreload +# %autoreload 2 + +import dotenv +import os +import pandas as pd +import numpy as np + +# Load environment variables (ensure OPENAI_API_KEY is set in your .env) +dotenv.load_dotenv() + +# Import the schema agent modules +import research.agentic_data_science.schema_agent.schema_agent_loader as radsasal
import research.agentic_data_science.schema_agent.schema_agent_stats as radsasas
import research.agentic_data_science.schema_agent.schema_agent_hllmcli as radsasah
import research.agentic_data_science.schema_agent.schema_agent_report as radsasar

# %% [markdown]
# ## 1. Create dummy Dataset

# %%
# 1. 
Create a dummy dataset +np.random.seed(42) +num_rows = 100 + +dummy_data = pd.DataFrame({ + "employee_id": range(1000, 1000 + num_rows), + "department": np.random.choice(["Engineering", "Sales", "HR", "Marketing"], num_rows), + "salary": np.random.normal(85000, 20000, num_rows), + "satisfaction_score": np.random.uniform(1.0, 5.0, num_rows), + "hire_date": pd.date_range(start="2018-01-01", periods=num_rows, freq="W").astype(str), + "notes": ["Good performance"] * 50 + [None] * 50 # 50% nulls +}) + +# Inject some missing values into salary +dummy_data.loc[10:20, "salary"] = np.nan + +# Save to CSV +csv_path = "dummy_employees.csv" +dummy_data.to_csv(csv_path, index=False) +print(f"Created dummy dataset at: {csv_path}") +dummy_data.head() + +csv_paths = [csv_path] +tags = ["dummy_employees"] + +# %% [markdown] +# ## 2. Load and Infer datatypes from the columns + +# %% +# 1. Load and prepare DataFrames - now receiving 3 variables +tag_to_df, cat_cols_map, datetime_meta = radsasal.prepare_dataframes(csv_paths, tags) + +print("--- Loaded DataFrames ---") +# The index will now show as a DatetimeIndex instead of a RangeIndex +print(tag_to_df["dummy_employees"].info()) + +# 2. Combine DataFrames while preserving the index +# We do NOT use ignore_index=True here because we want to keep the DatetimeIndex +# we just created in the loader. +updated_df = pd.concat(list(tag_to_df.values()), axis=0) + +print("\n--- Datetime Inference Metadata ---") +# This will now correctly show your temporal column info +print(datetime_meta) + +# %% [markdown] +# ## 3. 
Statistical Profiling + +# %% +# We pass the metadata we just generated into the stats function +stats = radsasas.compute_llm_agent_stats( + tag_to_df=tag_to_df, + categorical_cols_map=cat_cols_map, + metrics=["mean", "std", "min", "max"] +) + +# Manually ensure the datetime_columns key is populated for the LLM +stats["datetime_columns"] = datetime_meta + +print("\n--- Stats Computation Complete ---") +print(f"Calculated stats for tags: {list(stats['numeric_summary'].keys())}") + +# %% [markdown] +# ## 4. Call LLM for column type inferencing + +# %% +# 1. Select columns (e.g., let's just send everything) +columns_for_llm = radsasah._select_columns_for_llm(updated_df, scope="all") +print(f"Selected columns for LLM: {columns_for_llm}\n") + +# 2. Build the exact prompt string that goes to the LLM +prompt_text = radsasah.build_llm_prompt(stats, columns_to_include=columns_for_llm) +print("--- LLM Prompt Snippet ---") +print(prompt_text[:500] + "\n...\n") + +# 3. Call the LLM to generate hypotheses (using gpt-4o as default) +# If you don't have an API key configured, you can mock this response by creating a static dict. +try: + semantic_insights = radsasah.generate_hypotheses_via_cli( + stats=stats, + model="gpt-4o", + columns_to_include=columns_for_llm + ) + print("--- LLM Insights Retrieved Successfully ---") +except Exception as e: + print(f"LLM call failed (Check API key): {e}") + semantic_insights = {"columns": {}} # Fallback empty dict + +# %% [markdown] +# ## 5. Export to JSON and Markdown + +# %% +# 1. Build structured column profiles +primary_df = list(tag_to_df.values())[0] +column_profiles = radsasar.build_column_profiles( + df=primary_df, + stats=stats, + insights=semantic_insights +) + +# 2. Export to JSON +json_out = "dummy_profile_report.json" +radsasar.merge_and_export_results( + stats=stats, + insights=semantic_insights, + column_profiles=column_profiles, + output_path=json_out +) + +# 3. 
Export to Markdown +md_out = "dummy_profile_summary.md" +radsasar.export_markdown_from_profiles( + column_profiles=column_profiles, + numeric_stats=stats.get("numeric_summary", {}), + output_path=md_out +) + +print(f"\nPipeline complete! Check your directory for:") +print(f"1. {json_out}") +print(f"2. {md_out}") + +# Clean up dummy CSV if desired +# os.remove(csv_path) diff --git a/research/agentic_data_science/schema_agent/schema_agent.example.ipynb b/research/agentic_data_science/schema_agent/schema_agent.example.ipynb new file mode 100644 index 000000000..0355550c0 --- /dev/null +++ b/research/agentic_data_science/schema_agent/schema_agent.example.ipynb @@ -0,0 +1,434 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b6e62e00-6cb3-45ef-8b7d-3a8ce84eb825", + "metadata": {}, + "source": [ + "# Schema Parser example \n", + "- This implementation in the notebook utilizes a suite of pre-existing functions to parse a single Excel (or CSV) file, automatically inferring data types and capturing temporal metadata for downstream analysis." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "3770d3bd-200f-4b7a-bb10-1fe76f26c4d7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING: Running in Jupyter\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/git_root/research/agentic_data_science/schema_agent/schema_agent_loader.py:75: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", + " parsed = pd.to_datetime(df[col], errors=\"coerce\", utc=True)\n", + "/git_root/research/agentic_data_science/schema_agent/schema_agent_loader.py:75: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. 
To ensure parsing is consistent and as-expected, please specify a format.\n", + " parsed = pd.to_datetime(df[col], errors=\"coerce\", utc=True)\n", + "/git_root/research/agentic_data_science/schema_agent/schema_agent_loader.py:75: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n", + " parsed = pd.to_datetime(df[col], errors=\"coerce\", utc=True)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== Temporal Boundaries ===\n", + " min_index max_index min_valid_index max_valid_index\n", + "ecommerce_data 2009-12-01 07:45:00+00:00 2010-12-09 20:01:00+00:00 2009-12-01 07:45:00+00:00 2010-12-09 20:01:00+00:00\n", + " year month week_of_year day_of_week order_hour is_weekend customer_id unit_price_gbp quantity_sold sales_amount_gbp population_total gdp_current_usd gdp_growth_pct inflation_consumer_pct\n", + "2009-12-01 07:45:00+00:00 2009 12 49 1 7 0 13085 5.95 10 59.5 62276270.0 2412840006231.5 -17.633976 1.89709\n", + "2009-12-01 07:45:00+00:00 2009 12 49 1 7 0 13085 6.75 12 81.0 62276270.0 2412840006231.5 -17.633976 1.89709\n", + "... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", + "2010-12-09 20:01:00+00:00 2010 12 49 3 20 0 17530 1.95 4 7.8 62766365.0 2485482596184.708984 3.010668 1.589081\n", + "2010-12-09 20:01:00+00:00 2010 12 49 3 20 0 17530 1.25 4 5.0 62766365.0 2485482596184.708984 3.010668 1.589081\n", + " num_rows num_zeros zeros [%] num_nans nans [%] num_infs infs [%] num_valid valid [%]\n", + "year 100000 0 0.0 0 0.0 0 0.0 100000 100.0\n", + "month 100000 0 0.0 0 0.0 0 0.0 100000 100.0\n", + "... ... ... ... ... ... ... ... ... 
...\n", + "gdp_growth_pct 100000 0 0.0 0 0.0 0 0.0 100000 100.0\n", + "inflation_consumer_pct 100000 0 0.0 0 0.0 0 0.0 100000 100.0\n", + "\n", + "=== Quality Report: ecommerce_data ===\n", + " num_rows num_zeros zeros [%] num_nans nans [%] \\\n", + "year 100000 0 0.0 0 0.0 \n", + "month 100000 0 0.0 0 0.0 \n", + "week_of_year 100000 0 0.0 0 0.0 \n", + "day_of_week 100000 16298 16.3 0 0.0 \n", + "order_hour 100000 0 0.0 0 0.0 \n", + "is_weekend 100000 84604 84.6 0 0.0 \n", + "customer_id 100000 0 0.0 0 0.0 \n", + "unit_price_gbp 100000 0 0.0 0 0.0 \n", + "quantity_sold 100000 0 0.0 0 0.0 \n", + "sales_amount_gbp 100000 0 0.0 0 0.0 \n", + "population_total 100000 0 0.0 0 0.0 \n", + "gdp_current_usd 100000 0 0.0 0 0.0 \n", + "gdp_growth_pct 100000 0 0.0 0 0.0 \n", + "inflation_consumer_pct 100000 0 0.0 0 0.0 \n", + "\n", + " num_infs infs [%] num_valid valid [%] \n", + "year 0 0.0 100000 100.0 \n", + "month 0 0.0 100000 100.0 \n", + "week_of_year 0 0.0 100000 100.0 \n", + "day_of_week 0 0.0 83702 83.7 \n", + "order_hour 0 0.0 100000 100.0 \n", + "is_weekend 0 0.0 15396 15.4 \n", + "customer_id 0 0.0 100000 100.0 \n", + "unit_price_gbp 0 0.0 100000 100.0 \n", + "quantity_sold 0 0.0 100000 100.0 \n", + "sales_amount_gbp 0 0.0 100000 100.0 \n", + "population_total 0 0.0 100000 100.0 \n", + "gdp_current_usd 0 0.0 100000 100.0 \n", + "gdp_growth_pct 0 0.0 100000 100.0 \n", + "inflation_consumer_pct 0 0.0 100000 100.0 \n", + "\n", + "=== Distribution: ecommerce_data / country ===\n", + " count pct [%]\n", + "country \n", + "United Kingdom 64417 64.417\n", + "Ireland 8507 8.507\n", + "Germany 7654 7.654\n", + "France 5470 5.470\n", + "Netherlands 2729 2.729\n", + "Spain 1235 1.235\n", + "Switzerland 1170 1.170\n", + "Belgium 1037 1.037\n", + "Portugal 984 0.984\n", + "Sweden 868 0.868\n", + "\n", + "=== Distribution: ecommerce_data / country_code ===\n", + " count pct [%]\n", + "country_code \n", + "GBR 64417 64.417\n", + "IRL 8507 8.507\n", + "DEU 7654 7.654\n", + "FRA 
5470 5.470\n", + "NLD 2729 2.729\n", + "ESP 1235 1.235\n", + "CHE 1170 1.170\n", + "BEL 1037 1.037\n", + "PRT 984 0.984\n", + "SWE 868 0.868\n", + "\n", + "=== Distribution: ecommerce_data / product_id ===\n", + " count pct [%]\n", + "product_id \n", + "POST 731 0.731\n", + "85123A 615 0.615\n", + "21212 438 0.438\n", + "22423 437 0.437\n", + "85099B 391 0.391\n", + "20725 334 0.334\n", + "84991 298 0.298\n", + "20914 295 0.295\n", + "21232 295 0.295\n", + "84879 285 0.285\n", + "\n", + "=== Numeric Summary: ecommerce_data ===\n", + " mean std min median max\n", + "year 2.009929e+03 2.563578e-01 2.009000e+03 2.010000e+03 2.010000e+03\n", + "month 7.377590e+00 3.456657e+00 1.000000e+00 8.000000e+00 1.200000e+01\n", + "week_of_year 2.991514e+01 1.500327e+01 1.000000e+00 3.300000e+01 5.200000e+01\n", + "day_of_week 2.583280e+00 1.923159e+00 0.000000e+00 2.000000e+00 6.000000e+00\n", + "order_hour 1.268047e+01 2.351588e+00 7.000000e+00 1.300000e+01 2.000000e+01\n", + "is_weekend 1.539600e-01 3.609122e-01 0.000000e+00 0.000000e+00 1.000000e+00\n", + "customer_id 1.476813e+04 1.799165e+03 1.234600e+04 1.464600e+04 1.828700e+04\n", + "unit_price_gbp 3.889158e+00 5.975020e+01 1.000000e-03 1.950000e+00 1.095350e+04\n", + "quantity_sold 1.865779e+01 1.593465e+02 1.000000e+00 6.000000e+00 1.915200e+04\n", + "sales_amount_gbp 2.694892e+01 9.239021e+01 1.000000e-03 1.498000e+01 1.095350e+04\n", + "population_total 5.409812e+07 2.664448e+07 3.180410e+05 6.276636e+07 3.093782e+08\n", + "gdp_current_usd 2.161193e+12 1.115049e+12 9.035824e+09 2.485483e+12 1.504897e+13\n", + "gdp_growth_pct 4.626259e-01 6.134116e+00 -1.962987e+01 3.010668e+00 3.250405e+01\n", + "inflation_consumer_pct 1.104250e+00 1.655513e+00 -1.518298e+01 1.589081e+00 1.652789e+01\n", + "10:24:44 rss=0.222GB vms=1.643GB mem_pct=1% cpu=100% - \u001b[36mINFO \u001b[0m Task-20 schema_agent.py run_pipeline:131 LLM will profile 3 / 18 columns (scope=semantic).\n" + ] + }, + { + "name": "stderr", + "output_type": 
"stream", + "text": [ + "Cache hit for apply_llm\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>[HTML table rendering omitted: markup was stripped during extraction; see the text/plain output below]</div>\n",
+       "
" + ], + "text/plain": [ + " order_datetime year month \\\n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 2009-12-01 07:45:00+00:00 2009 12 \n", + "2009-12-01 07:45:00+00:00 2009-12-01 07:45:00+00:00 2009 12 \n", + "2009-12-01 09:06:00+00:00 2009-12-01 09:06:00+00:00 2009 12 \n", + "2009-12-01 09:06:00+00:00 2009-12-01 09:06:00+00:00 2009 12 \n", + "2009-12-01 09:06:00+00:00 2009-12-01 09:06:00+00:00 2009 12 \n", + "\n", + " week_of_year day_of_week order_hour is_weekend \\\n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 49 1 7 0 \n", + "2009-12-01 07:45:00+00:00 49 1 7 0 \n", + "2009-12-01 09:06:00+00:00 49 1 9 0 \n", + "2009-12-01 09:06:00+00:00 49 1 9 0 \n", + "2009-12-01 09:06:00+00:00 49 1 9 0 \n", + "\n", + " country country_code product_id \\\n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 United Kingdom GBR 21523 \n", + "2009-12-01 07:45:00+00:00 United Kingdom GBR 79323W \n", + "2009-12-01 09:06:00+00:00 United Kingdom GBR 82582 \n", + "2009-12-01 09:06:00+00:00 United Kingdom GBR 22111 \n", + "2009-12-01 09:06:00+00:00 United Kingdom GBR 21756 \n", + "\n", + " customer_id unit_price_gbp quantity_sold \\\n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 13085 5.95 10 \n", + "2009-12-01 07:45:00+00:00 13085 6.75 12 \n", + "2009-12-01 09:06:00+00:00 13078 2.10 12 \n", + "2009-12-01 09:06:00+00:00 13078 4.25 24 \n", + "2009-12-01 09:06:00+00:00 13078 5.95 3 \n", + "\n", + " sales_amount_gbp population_total \\\n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 59.50 62276270.0 \n", + "2009-12-01 07:45:00+00:00 81.00 62276270.0 \n", + "2009-12-01 09:06:00+00:00 25.20 62276270.0 \n", + "2009-12-01 09:06:00+00:00 102.00 62276270.0 \n", + "2009-12-01 09:06:00+00:00 17.85 62276270.0 \n", + "\n", + " gdp_current_usd gdp_growth_pct \\\n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 2.412840e+12 -17.633976 \n", + "2009-12-01 07:45:00+00:00 2.412840e+12 -17.633976 \n", + "2009-12-01 09:06:00+00:00 2.412840e+12 -17.633976 \n", 
+ "2009-12-01 09:06:00+00:00 2.412840e+12 -17.633976 \n", + "2009-12-01 09:06:00+00:00 2.412840e+12 -17.633976 \n", + "\n", + " inflation_consumer_pct \n", + "order_datetime \n", + "2009-12-01 07:45:00+00:00 1.89709 \n", + "2009-12-01 07:45:00+00:00 1.89709 \n", + "2009-12-01 09:06:00+00:00 1.89709 \n", + "2009-12-01 09:06:00+00:00 1.89709 \n", + "2009-12-01 09:06:00+00:00 1.89709 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2\n", + "import research.agentic_data_science.schema_agent.schema_agent as radsasag\n", + "\n", + "# Now run the pipeline\n", + "csv_files = [\"global_ecommerce_forecasting.csv\"]\n", + "tags = [\"ecommerce_data\"]\n", + "\n", + "tag_to_df, stats = radsasag.run_pipeline(\n", + " csv_paths=csv_files,\n", + " tags=tags,\n", + " model=\"gpt-4o\",\n", + " llm_scope=\"semantic\"\n", + ")\n", + "\n", + "display(tag_to_df[\"ecommerce_data\"].head())" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/research/agentic_data_science/schema_agent/schema_agent.example.py b/research/agentic_data_science/schema_agent/schema_agent.example.py new file mode 100644 index 000000000..1c9f4a455 --- /dev/null +++ b/research/agentic_data_science/schema_agent/schema_agent.example.py @@ -0,0 +1,35 @@ +# --- +# jupyter: +# jupytext: +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.19.1 +# kernelspec: +# display_name: Python 3 (ipykernel) +# language: python +# name: python3 +# --- + +# %% [markdown] +# # Schema Parser example +# - 
This notebook uses a suite of pre-existing functions to parse a single Excel (or CSV) file, automatically inferring data types and capturing temporal metadata for downstream analysis.
+
+# %%
+# %load_ext autoreload
+# %autoreload 2
+import research.agentic_data_science.schema_agent.schema_agent as radsasag
+
+# Now run the pipeline
+csv_files = ["global_ecommerce_forecasting.csv"]
+tags = ["ecommerce_data"]
+
+tag_to_df, stats = radsasag.run_pipeline(
+    csv_paths=csv_files,
+    tags=tags,
+    model="gpt-4o",
+    llm_scope="semantic"
+)
+
+display(tag_to_df["ecommerce_data"].head())
diff --git a/research/agentic_data_science/schema_agent/schema_agent.py b/research/agentic_data_science/schema_agent/schema_agent.py
old mode 100644
new mode 100755
index 14491b6f6..de1777da7
--- a/research/agentic_data_science/schema_agent/schema_agent.py
+++ b/research/agentic_data_science/schema_agent/schema_agent.py
@@ -1,13 +1,14 @@
+#!/usr/bin/env python3
 """
 Data Profiler Agent — modular implementation.
 
 Main pipeline and CLI orchestration for end-to-end data profiling.
 
Usage: - python schema_agent.py data.csv - python schema_agent.py data.csv --model gpt-4o-mini --llm-scope nulls - python schema_agent.py data.csv --metrics mean std min max --output-json out.json - python schema_agent.py data.csv data2.csv --tags sales inventory + ./schema_agent.py data.csv + ./schema_agent.py data.csv --model gpt-4o-mini --llm-scope nulls + ./schema_agent.py data.csv --metrics mean std min max --output-json out.json + ./schema_agent.py data.csv data2.csv --tags sales inventory Import as: @@ -23,10 +24,11 @@ import dotenv import pandas as pd import research.agentic_data_science.schema_agent.schema_agent_hllmcli as radsasah -import schema_agent_loader as radsasal -import schema_agent_report as radsasar -import schema_agent_stats as radsasas +import research.agentic_data_science.schema_agent.schema_agent_loader as radsasal +import research.agentic_data_science.schema_agent.schema_agent_report as radsasar +import research.agentic_data_science.schema_agent.schema_agent_stats as radsasas +import helpers.hdbg as hdbg import helpers.hlogging as hloggin # ============================================================================= @@ -35,9 +37,8 @@ dotenv.load_dotenv() api_key = os.environ.get("OPENAI_API_KEY") -if not api_key: - print("Error: OPENAI_API_KEY not found in environment.") - sys.exit(1) + +hdbg.dassert(api_key, "OPENAI_API_KEY not found in environment.") _LOG = hloggin.getLogger(__name__) _LOG.setLevel(logging.DEBUG) @@ -72,58 +73,62 @@ def run_pipeline( """ Execute the full data profiling pipeline over one or more CSV files. - Parameters - ---------- - csv_paths : list of str - One or more CSV file paths to profile. - tags : list of str, optional - Human-readable tag for each CSV. Defaults to filename stems. - model : str - LLM model name passed to OpenAI / hllmcli. - metrics : list of str, optional - Numeric metrics to include. Defaults to DEFAULT_METRICS. 
- llm_scope : str - "all", "semantic", or "nulls" — controls which columns are LLM-profiled. - output_json : str - Path for the merged JSON report. - output_md : str - Path for the Markdown summary. - use_langchain : bool - Use LangChain chain instead of hllmcli for LLM calls. - - Returns - ------- - (dict of tag → df, stats dict) + :param csv_paths: One or more CSV file paths to profile. + :type csv_paths: typing.List[str] + :param tags: Human-readable tag for each CSV. Defaults to filename stems. + :type tags: typing.Optional[typing.List[str]] + :param model: LLM model name passed to OpenAI / hllmcli. + :type model: str + :param metrics: Numeric metrics to include. Defaults to DEFAULT_METRICS. + :type metrics: typing.Optional[typing.List[str]] + :param llm_scope: "all", "semantic", or "nulls" — controls which columns are LLM-profiled. + :type llm_scope: str + :param output_json: Path for the merged JSON report. + :type output_json: str + :param output_md: Path for the Markdown summary. + :type output_md: str + :param use_langchain: Use LangChain chain instead of hllmcli for LLM calls. + :type use_langchain: bool + :return: A tuple containing a dict of tag -> df mappings, and a stats dict. + :rtype: typing.Tuple[typing.Dict[str, pd.DataFrame], typing.Dict[str, typing.Any]] """ + hdbg.dassert_isinstance(csv_paths, list) + hdbg.dassert_lt(0, len(csv_paths), "csv_paths must not be empty.") + if tags is None: tags = [os.path.splitext(os.path.basename(p))[0] for p in csv_paths] - if len(tags) != len(csv_paths): - raise ValueError( - f"Length of tags ({len(tags)}) must match csv_paths ({len(csv_paths)})." 
- ) + hdbg.dassert_eq( + len(tags), + len(csv_paths), + "Length of tags (%d) must match csv_paths (%d).", + len(tags), + len(csv_paths) + ) # --- Load & type-coerce --- - tag_to_df, cat_cols_map = radsasal.prepare_dataframes(csv_paths, tags) - - # Merge datetime metadata across all DataFrames (using the last loaded tag - # as the primary df for single-dataset runs; full merge for multi). - _, datetime_meta = radsasal.infer_and_convert_datetime_columns( - pd.concat(list(tag_to_df.values()), axis=0, ignore_index=True) - ) + # UPDATED: We now capture datetime_meta during loading to ensure timezone + # consistency and avoid re-inference warnings. + tag_to_df, cat_cols_map, datetime_meta = radsasal.prepare_dataframes(csv_paths, tags) # --- Compute stats --- + # The stats module now handles DatetimeIndex and filters out timestamp columns + # from math operations to prevent 'abs()' errors. stats = radsasas.compute_llm_agent_stats( tag_to_df, categorical_cols_map=cat_cols_map, metrics=metrics, ) + + # Inject captured datetime metadata into the stats object for the LLM. stats["datetime_columns"] = datetime_meta # --- LLM scope --- - # Use the concatenated DataFrame to decide which columns to send. - combined_df = pd.concat(list(tag_to_df.values()), axis=0, ignore_index=True) + # Combine dataframes for column selection logic. + # Note: We preserve the DatetimeIndex by not using ignore_index=True. + combined_df = pd.concat(list(tag_to_df.values()), axis=0) columns_for_llm = radsasah._select_columns_for_llm(combined_df, scope=llm_scope) + _LOG.info( "LLM will profile %d / %d columns (scope=%s).", len(columns_for_llm), @@ -146,8 +151,9 @@ def run_pipeline( columns_to_include=columns_for_llm, ) - # --- Build column profiles (use first / primary df for column ordering) --- - primary_df = list(tag_to_df.values())[0] + # --- Build column profiles --- + # We use the primary dataframe (first tag) as the template for the profile. 
+ primary_df = tag_to_df[tags[0]] column_profiles = radsasar.build_column_profiles( df=primary_df, stats=stats, @@ -161,6 +167,7 @@ def run_pipeline( column_profiles=column_profiles, output_path=output_json, ) + radsasar.export_markdown_from_profiles( column_profiles, numeric_stats=stats.get("numeric_summary", {}), @@ -169,7 +176,6 @@ def run_pipeline( return tag_to_df, stats - # ============================================================================= # CLI # ============================================================================= @@ -250,9 +256,7 @@ def _build_arg_parser() -> argparse.ArgumentParser: def main() -> None: """ - CLI entry point. - - Parses arguments and delegates to run_pipeline(). + CLI entry point. Parses arguments and delegates to run_pipeline(). """ parser = _build_arg_parser() args = parser.parse_args() @@ -269,4 +273,4 @@ def main() -> None: if __name__ == "__main__": - main() + main() \ No newline at end of file diff --git a/research/agentic_data_science/schema_agent/schema_agent_hllmcli.py b/research/agentic_data_science/schema_agent/schema_agent_hllmcli.py index d2194684a..740dc6ce1 100644 --- a/research/agentic_data_science/schema_agent/schema_agent_hllmcli.py +++ b/research/agentic_data_science/schema_agent/schema_agent_hllmcli.py @@ -10,9 +10,11 @@ import langchain_core.output_parsers as lcop import langchain_core.prompts as lcpr import langchain_openai as lco +import pandas as pd import pydantic -import schema_agent_models as radsasam +import research.agentic_data_science.schema_agent.schema_agent_models as radsasam +import helpers.hdbg as hdbg import helpers.hllm_cli as hllmcli import helpers.hlogging as hloggin @@ -27,20 +29,19 @@ def _select_columns_for_llm( """ Return the list of column names that should be sent to the LLM. 
- Parameters - ---------- - df : pd.DataFrame - scope : str - "all" — every column - "semantic" — non-numeric columns only (object / category / string) - "nulls" — columns with null fraction above null_threshold - null_threshold : float - Fraction of nulls required for "nulls" scope. Default 5 %. - - Returns - ------- - list of str + :param df: Input dataframe. + :type df: pd.DataFrame + :param scope: "all" — every column, "semantic" — non-numeric columns + only, "nulls" — columns with high nulls. + :type scope: str + :param null_threshold: Fraction of nulls required for "nulls" scope. + Default 0.05. + :type null_threshold: float + :return: List of valid columns to process. + :rtype: typing.List[str] """ + hdbg.dassert_isinstance(df, pd.DataFrame) + if scope == "all": return list(df.columns) @@ -74,17 +75,16 @@ def build_llm_prompt( Serialize statistical data into a structured string prompt for LLM consumption. - Parameters - ---------- - stats : dict - Output of compute_llm_agent_stats(). - columns_to_include : list of str, optional - Subset of column names to include in the prompt. None = all. - - Returns - ------- - str + :param stats: Output of compute_llm_agent_stats(). + :type stats: typing.Dict[str, typing.Any] + :param columns_to_include: Subset of column names to include in the + prompt. None = all. + :type columns_to_include: typing.Optional[typing.List[str]] + :return: Formatted string prompt. + :rtype: str """ + hdbg.dassert_isinstance(stats, dict) + prompt_segments = [ "You are a Senior Data Scientist and Domain Expert.", "Analyze the provided dataset statistics and generate a profile for each column.", @@ -132,20 +132,20 @@ def generate_hypotheses_via_cli( Parses and Pydantic-validates the LLM response against DatasetInsights. - Parameters - ---------- - stats : dict - model : str - columns_to_include : list of str, optional - If provided, only these columns are sent to the LLM (cost control). 
- - Returns - ------- - dict — DatasetInsights-shaped dict, or {"error": ...} on failure. + :param stats: Computed dataset statistics. + :type stats: typing.Dict[str, typing.Any] + :param model: The target LLM model. + :type model: str + :param columns_to_include: Subset of column names to include. + :type columns_to_include: typing.Optional[typing.List[str]] + :return: DatasetInsights-shaped dict, or {"error": ...} on failure. + :rtype: typing.Dict[str, typing.Any] """ + hdbg.dassert_isinstance(stats, dict) + _LOG.info("Generating hypotheses via hllmcli (model=%s)...", model) - schema_json = radsasam.atasetInsights.model_json_schema() + schema_json = radsasam.DatasetInsights.model_json_schema() user_prompt = build_llm_prompt(stats, columns_to_include=columns_to_include) system_prompt = ( "You are a Senior Data Scientist. Analyze the following data statistics.\n" @@ -171,7 +171,7 @@ def generate_hypotheses_via_cli( ) raw = json.loads(cleaned) - # Pydantic validation — raises ValidationError on schema mismatch. + # Pydantic validation validated = radsasam.DatasetInsights.model_validate(raw) return validated.model_dump() @@ -194,18 +194,18 @@ def get_llm_semantic_insights_langchain( Process dataset metadata via LangChain to extract structured semantic insights. - Uses JsonOutputParser alongside the Pydantic schema. Validates output. - - Parameters - ---------- - prompt_text : str - Serialized stats from build_llm_prompt(). - model : str + Uses JsonOutputParser alongside the Pydantic schema. Validates + output. - Returns - ------- - dict + :param prompt_text: Serialized stats from build_llm_prompt(). + :type prompt_text: str + :param model: The target LLM model. + :type model: str + :return: Validated insights dictionary. 
+ :rtype: typing.Dict[str, typing.Any] """ + hdbg.dassert_isinstance(prompt_text, str) + _LOG.info("Querying LLM via LangChain (%s)...", model) llm = lco.ChatOpenAI(model=model, temperature=0) parser = lcop.JsonOutputParser(pydantic_object=radsasam.DatasetInsights) @@ -224,7 +224,6 @@ def get_llm_semantic_insights_langchain( chain = prompt | llm | parser try: result = chain.invoke({"metadata_stats": prompt_text}) - # Validate against Pydantic schema. validated = radsasam.DatasetInsights.model_validate(result) return validated.model_dump() except pydantic.ValidationError as e: diff --git a/research/agentic_data_science/schema_agent/schema_agent_loader.py b/research/agentic_data_science/schema_agent/schema_agent_loader.py index 02fae3514..84a421b11 100644 --- a/research/agentic_data_science/schema_agent/schema_agent_loader.py +++ b/research/agentic_data_science/schema_agent/schema_agent_loader.py @@ -13,6 +13,7 @@ import pandas as pd +import helpers.hdbg as hdbg import helpers.hlogging as hloggin import helpers.hpandas_conversion as hpanconv import helpers.hpandas_io as hpanio @@ -24,123 +25,75 @@ def load_csv(csv_path: str) -> pd.DataFrame: """ Load a CSV into a DataFrame with clear error handling. - Parameters - ---------- - csv_path : str - Path to the CSV file. - - Returns - ------- - pd.DataFrame + :param csv_path: Path to the CSV file. + :type csv_path: str + :return: Loaded dataframe. 
+ :rtype: pd.DataFrame """ + hdbg.dassert_isinstance(csv_path, str) try: df = hpanio.read_csv_to_df(csv_path) except FileNotFoundError: _LOG.error("CSV not found at '%s'.", csv_path) raise - if df.empty: - raise ValueError(f"CSV at '{csv_path}' loaded as an empty DataFrame.") + + hdbg.dassert_lt(0, len(df), "CSV at '%s' loaded as an empty DataFrame.", csv_path) + _LOG.info( "Loaded '%s': %d rows × %d columns.", csv_path, len(df), len(df.columns) ) return df -# keep legacy name for backwards compatibility -load_employee_data = load_csv - - def infer_and_convert_datetime_columns( df: pd.DataFrame, sample_size: int = 100, threshold: float = 0.8, ) -> typing.Tuple[pd.DataFrame, typing.Dict[str, typing.Any]]: - """ - Detect and convert date/datetime columns in a DataFrame. - - Uses sampling for performance. Returns the updated DataFrame and a - metadata dict with inference details per column. - - Parameters - ---------- - df : pd.DataFrame - sample_size : int - Number of rows to sample when testing format compliance. - threshold : float - Minimum fraction of parsed values required to accept a column as temporal. - - Returns - ------- - (pd.DataFrame, dict) - Updated DataFrame with converted columns + metadata per column. - """ - COMMON_FORMATS = [ - "%Y-%m-%d", - "%d-%m-%Y", - "%m-%d-%Y", - "%Y/%m/%d", - "%d/%m/%Y", - "%m/%d/%Y", - "%Y-%m-%d %H:%M:%S", - "%Y-%m-%d %H:%M", - "%d-%m-%Y %H:%M:%S", - "%m/%d/%Y %H:%M:%S", - ] - metadata: typing.Dict[str, typing.Any] = {} df_out = df.copy() for col in df.columns: - if not ( - pd.api.types.is_object_dtype(df[col]) - or pd.api.types.is_string_dtype(df[col]) - ): + # 1. If it's already datetime, just ensure UTC awareness + if pd.api.types.is_datetime64_any_dtype(df[col]): + df_out[col] = pd.to_datetime(df[col], utc=True) + metadata[col] = { + "semantic_type": "temporal", + "granularity": "datetime", + "format": "pre-converted", + "confidence": 1.0, + } continue - series = df[col].dropna().astype(str) - if series.empty: + # 2. 
Only attempt conversion on strings/objects + if not (pd.api.types.is_object_dtype(df[col]) or pd.api.types.is_string_dtype(df[col])): continue - sample = series.head(sample_size) - best_format: typing.Optional[str] = None - best_score = 0.0 - - for fmt in COMMON_FORMATS: - success = sum(1 for val in sample if _try_strptime(val, fmt)) - score = success / len(sample) - if score > best_score: - best_score = score - best_format = fmt - - if best_score >= threshold: - parsed = pd.to_datetime(df[col], format=best_format, errors="coerce") - used_format = best_format - else: - parsed = pd.to_datetime(df[col], errors="coerce") - used_format = None - - confidence = float(parsed.notna().mean()) - if confidence < threshold: + # Try to parse + try: + # We use errors="coerce" so non-dates become NaT + parsed = pd.to_datetime(df[col], errors="coerce", utc=True) + + valid_count = parsed.notna().sum() + if valid_count == 0: + continue + + confidence = float(valid_count / len(df[col])) + + # Only convert if it meets our confidence threshold + if confidence >= threshold: + df_out[col] = parsed + has_time = (parsed.dt.time != pd.Timestamp("00:00:00").time()).any() + metadata[col] = { + "semantic_type": "temporal", + "granularity": "datetime" if has_time else "date", + "format": "inferred", + "confidence": confidence, + } + _LOG.info("Converted column '%s' to datetime", col) + except Exception: continue - has_time = (parsed.dt.time != pd.Timestamp("00:00:00").time()).any() - col_type = "datetime" if has_time else "date" - df_out[col] = parsed - - metadata[col] = { - "semantic_type": "temporal", - "granularity": col_type, - "format": used_format, - "confidence": confidence, - } - _LOG.info( - "Column '%s' detected as %s (format=%s, confidence=%.2f)", - col, - col_type, - used_format, - confidence, - ) - return df_out, metadata @@ -155,38 +108,47 @@ def _try_strptime(val: str, fmt: str) -> bool: return False + def prepare_dataframes( csv_paths: typing.List[str], tags: 
typing.Optional[typing.List[str]] = None, ) -> typing.Tuple[ - typing.Dict[str, pd.DataFrame], typing.Dict[str, typing.List[str]] + typing.Dict[str, pd.DataFrame], + typing.Dict[str, typing.List[str]], + typing.Dict[str, typing.Any] # Added return type for metadata ]: """ Load and prepare all CSV files in one pass. - - Applies type coercion, datetime inference, and categorical detection. - - Parameters - ---------- - csv_paths : list of str - tags : list of str, optional - Human-readable tags; defaults to filename stems. - - Returns - ------- - (dict of tag → df, dict of tag → categorical_columns) """ + hdbg.dassert_isinstance(csv_paths, list) + if tags is None: + import os + tags = [os.path.splitext(os.path.basename(p))[0] for p in csv_paths] + tag_to_df: typing.Dict[str, pd.DataFrame] = {} cat_cols_map: typing.Dict[str, typing.List[str]] = {} + combined_dt_meta: typing.Dict[str, typing.Any] = {} # Store metadata here for path, tag in zip(csv_paths, tags): + # 1. Load and perform initial type conversion df = load_csv(path) df = hpanconv.convert_df(df) - df, _ = infer_and_convert_datetime_columns(df) + + # 2. Perform datetime inference and CAPTURE metadata + df, dt_meta = infer_and_convert_datetime_columns(df) + combined_dt_meta.update(dt_meta) # Merge metadata + + # 3. FIX: Automatically promote the first detected temporal column to + # the Index for Quality and Duration reports. + temporal_cols = [c for c, m in dt_meta.items() if m.get("semantic_type") == "temporal"] + if temporal_cols: + df = df.set_index(temporal_cols[0], drop=False) + tag_to_df[tag] = df + # 4. 
Identify categorical/string columns cat_cols_map[tag] = df.select_dtypes( include=["object", "category", "string"] ).columns.tolist() - return tag_to_df, cat_cols_map + return tag_to_df, cat_cols_map, combined_dt_meta \ No newline at end of file diff --git a/research/agentic_data_science/schema_agent/schema_agent_report.py b/research/agentic_data_science/schema_agent/schema_agent_report.py index 2c377322e..e46a34b07 100644 --- a/research/agentic_data_science/schema_agent/schema_agent_report.py +++ b/research/agentic_data_science/schema_agent/schema_agent_report.py @@ -14,6 +14,7 @@ import pandas as pd +import helpers.hdbg as hdbg import helpers.hlogging as hloggin _LOG = hloggin.getLogger(__name__) @@ -42,6 +43,9 @@ def build_column_profiles( """ profiles: typing.List[typing.Dict[str, typing.Any]] = [] + hdbg.dassert_isinstance(df, pd.DataFrame) + hdbg.dassert_isinstance(stats, dict) + hdbg.dassert_isinstance(insights, dict) numeric_summary = stats.get("numeric_summary", {}) categorical_stats = stats.get("categorical_distributions", {}) datetime_meta = stats.get("datetime_columns", {}) @@ -115,6 +119,11 @@ def merge_and_export_results( output_path : str """ _LOG.info("Merging results...") + hdbg.dassert_isinstance(stats, dict) + hdbg.dassert_isinstance(insights, dict) + hdbg.dassert_isinstance(column_profiles, list) + hdbg.dassert_isinstance(output_path, str) + hdbg.dassert(output_path, "output_path must be a non-empty string.") serializable_stats = _make_serializable(stats) final_report = { @@ -167,6 +176,11 @@ def _clean(val: typing.Any) -> str: return "" return str(val).replace("|", "\\|").replace("\n", " ") + hdbg.dassert_isinstance(column_profiles, list) + hdbg.dassert_lt(0, len(column_profiles), "column_profiles must be non-empty.") + hdbg.dassert_isinstance(output_path, str) + hdbg.dassert(output_path, "output_path must be a non-empty string.") + def _fmt(val: typing.Any) -> str: if isinstance(val, int): return str(val) @@ -215,4 +229,4 @@ def _fmt(val: 
typing.Any) -> str: with open(output_path, "w", encoding="utf-8") as f: f.write("\n".join(lines) + "\n") - _LOG.info("Exported Markdown report to '%s'.", output_path) + _LOG.info("Exported Markdown report to '%s'.", output_path) \ No newline at end of file diff --git a/research/agentic_data_science/schema_agent/schema_agent_stats.py b/research/agentic_data_science/schema_agent/schema_agent_stats.py index 213c3d25f..5c170fe17 100644 --- a/research/agentic_data_science/schema_agent/schema_agent_stats.py +++ b/research/agentic_data_science/schema_agent/schema_agent_stats.py @@ -13,6 +13,7 @@ import pandas as pd +import helpers.hdbg as hdbg import helpers.hlogging as hloggin import helpers.hpandas_stats as hpanstat @@ -61,6 +62,8 @@ def compute_llm_agent_stats( numeric_summary. """ metrics = _resolve_metrics(metrics) + hdbg.dassert_isinstance(tag_to_df, dict) + hdbg.dassert_lt(0, len(tag_to_df), "tag_to_df must be non-empty.") dataframe_stats: typing.Dict[str, typing.Any] = {} # 1. Temporal boundaries @@ -75,37 +78,42 @@ def compute_llm_agent_stats( # 2. 
Data quality dataframe_stats["quality_reports"] = {} for tag, df in tag_to_df.items(): - numeric_df = df.select_dtypes(include="number") + # Select ONLY actual numeric columns for the quality math + numeric_df = df.select_dtypes(include=["int64", "float64"]) + if numeric_df.empty: - _LOG.warning( - "No numeric columns in '%s'; skipping quality report", tag - ) + _LOG.warning("No numeric columns in '%s'; skipping quality report", tag) continue - df_stamped = hpanstat.add_end_download_timestamp(numeric_df.copy()) + try: + # Pass ONLY the numeric dataframe here quality = hpanstat.report_zero_nan_inf_stats( - df_stamped, + numeric_df, zero_threshold=1e-9, verbose=True, as_txt=True, ) dataframe_stats["quality_reports"][tag] = quality - print(f"\n=== Quality Report: {tag} ===\n", quality.to_string()) - except Exception as e: # pylint: disable=broad-exception-caught + print(f"\n=== Quality Report: {tag} ===\n", quality) + except Exception as e: _LOG.warning("Quality report failed for '%s': %s", tag, e) # 3. 
Categorical distributions dataframe_stats["categorical_distributions"] = {} if categorical_cols_map: for tag, cols in categorical_cols_map.items(): - if tag not in tag_to_df: - _LOG.warning("Tag '%s' not found in tag_to_df; skipping.", tag) - continue + hdbg.dassert_in( + tag, tag_to_df, "Tag '%s' not found in tag_to_df.", tag + ) dataframe_stats["categorical_distributions"][tag] = {} for col in cols: - if col not in tag_to_df[tag].columns: - _LOG.warning("Column '%s' not in '%s'; skipping.", col, tag) - continue + hdbg.dassert_in( + col, + tag_to_df[tag].columns, + "Column '%s' not found in dataset '%s'.", + col, + tag, + ) dist = hpanstat.get_value_counts_stats_df(tag_to_df[tag], col) dataframe_stats["categorical_distributions"][tag][col] = dist print( @@ -149,4 +157,4 @@ def _resolve_metrics( VALID_METRICS, ) resolved = [m for m in metrics if m in VALID_METRICS] - return resolved if resolved else DEFAULT_METRICS + return resolved if resolved else DEFAULT_METRICS \ No newline at end of file diff --git a/research/agentic_data_science/schema_agent/utils.sh b/research/agentic_data_science/schema_agent/utils.sh new file mode 100644 index 000000000..67426f5d5 --- /dev/null +++ b/research/agentic_data_science/schema_agent/utils.sh @@ -0,0 +1,504 @@ +#!/bin/bash +# """ +# Utility functions for Docker container management. +# """ + + +# ############################################################################# +# General utilities +# ############################################################################# + + +run() { + # """ + # Execute a command with echo output. + # + # :param cmd: Command string to execute + # :return: Exit status of the executed command + # """ + cmd="$*" + echo "> $cmd" + eval "$cmd" +} + + +enable_verbose_mode() { + # """ + # Enable shell command tracing (set -x) when VERBOSE is set to 1. + # + # Reads the VERBOSE variable set by parse_docker_jupyter_args. + # Call this after parsing args to activate tracing for the rest of the script. 
+ # """ + if [[ $VERBOSE == 1 ]]; then + set -x + fi +} + + +# ############################################################################# +# Argument parsing +# ############################################################################# + + +_print_default_help() { + # """ + # Print usage information and available default options for docker scripts. + # """ + echo "Usage: $(basename $0) [options]" + echo "" + echo "Options:" + echo " -h Print this help message and exit" + echo " -v Enable verbose output (set -x)" +} + + +parse_default_args() { + # """ + # Parse default command-line arguments for docker scripts. + # + # Sets VERBOSE variable in the caller's scope and enables set -x when -v + # is passed. Prints help and exits when -h is passed. + # Updates OPTIND so the caller can shift away processed arguments. + # + # :param @: command-line arguments forwarded from the calling script + # """ + VERBOSE=0 + while getopts "hv" flag; do + case "${flag}" in + h) _print_default_help; exit 0;; + v) VERBOSE=1;; + *) _print_default_help; exit 1;; + esac + done + enable_verbose_mode +} + + +_print_docker_jupyter_help() { + # """ + # Print usage information and available options for docker_jupyter.sh. + # """ + echo "Usage: $(basename $0) [options]" + echo "" + echo "Launch Jupyter Lab inside a Docker container." + echo "" + echo "Options:" + echo " -h Print this help message and exit" + echo " -p PORT Host port to forward to Jupyter Lab (default: 8888)" + echo " -u Enable vim keybindings in Jupyter Lab" + echo " -v Enable verbose output (set -x)" +} + + +parse_docker_jupyter_args() { + # """ + # Parse command-line arguments for docker_jupyter.sh. + # + # Sets JUPYTER_HOST_PORT, JUPYTER_USE_VIM, TARGET_DIR, VERBOSE, and + # OLD_CMD_OPTS in the caller's scope. Enables set -x when -v is passed. + # Prints help and exits when -h is passed. + # + # :param @: command-line arguments forwarded from the calling script + # """ + # Set defaults. 
+    JUPYTER_HOST_PORT=8888
+    JUPYTER_USE_VIM=0
+    VERBOSE=0
+    # Save original args to pass through to run_jupyter.sh.
+    OLD_CMD_OPTS="$*"
+    # Parse options.
+    while getopts "hp:uv" flag; do
+        case "${flag}" in
+            h) _print_docker_jupyter_help; exit 0;;
+            p) JUPYTER_HOST_PORT=${OPTARG};;  # Port for Jupyter Lab.
+            u) JUPYTER_USE_VIM=1;;  # Enable vim bindings.
+            v) VERBOSE=1;;  # Enable verbose output.
+            *) _print_docker_jupyter_help; exit 1;;
+        esac
+    done
+    # Enable command tracing if verbose mode is requested.
+    enable_verbose_mode
+}
+
+
+# #############################################################################
+# Docker image management
+# #############################################################################
+
+
+get_docker_vars_script() {
+    # """
+    # Load Docker variables from docker_name.sh script.
+    #
+    # :param script_path: Path to the script to determine the Docker configuration directory
+    # :return: Sources REPO_NAME, IMAGE_NAME, and FULL_IMAGE_NAME variables
+    # """
+    local script_path=$1
+    # Find the name of the container.
+    SCRIPT_DIR=$(dirname $script_path)
+    DOCKER_NAME="$SCRIPT_DIR/docker_name.sh"
+    # Check for the config file itself, not just its directory.
+    if [[ ! -e $DOCKER_NAME ]]; then
+        echo "Can't find $DOCKER_NAME"
+        exit 1
+    fi;
+    source $DOCKER_NAME
+}
+
+
+print_docker_vars() {
+    # """
+    # Print current Docker variables to stdout.
+    # """
+    echo "REPO_NAME=$REPO_NAME"
+    echo "IMAGE_NAME=$IMAGE_NAME"
+    echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME"
+}
+
+
+build_container_image() {
+    # """
+    # Build a Docker container image.
+    #
+    # Supports both single-architecture and multi-architecture builds.
+    # Creates temporary build directory, copies files, and builds the image.
+    #
+    # :param @: Additional options to pass to docker build/buildx build
+    # """
+    echo "# ${FUNCNAME[0]} ..."
+    FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME
+    echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME"
+    # Prepare build area.
+    #tar -czh . 
| docker build $OPTS -t $IMAGE_NAME - + DIR="../tmp.build" + if [[ -d $DIR ]]; then + rm -rf $DIR + fi; + cp -Lr . $DIR || true + # Build container. + echo "DOCKER_BUILDKIT=$DOCKER_BUILDKIT" + echo "DOCKER_BUILD_MULTI_ARCH=$DOCKER_BUILD_MULTI_ARCH" + if [[ $DOCKER_BUILD_MULTI_ARCH != 1 ]]; then + # Build for a single architecture. + echo "Building for current architecture..." + OPTS="--progress plain $@" + (cd $DIR; docker build $OPTS -t $FULL_IMAGE_NAME . 2>&1 | tee ../docker_build.log; exit ${PIPESTATUS[0]}) + else + # Build for multiple architectures. + echo "Building for multiple architectures..." + OPTS="$@" + export DOCKER_CLI_EXPERIMENTAL=enabled + # Create a new builder. + #docker buildx rm --all-inactive --force + #docker buildx create --name mybuilder + #docker buildx use mybuilder + # Use the default builder. + docker buildx use multiarch + docker buildx inspect --bootstrap + # Note that one needs to push to the repo since otherwise it is not + # possible to keep multiple. + (cd $DIR; docker buildx build --push --platform linux/arm64,linux/amd64 $OPTS --tag $FULL_IMAGE_NAME . 2>&1 | tee ../docker_build.log; exit ${PIPESTATUS[0]}) + # Report the status. + docker buildx imagetools inspect $FULL_IMAGE_NAME + fi; + # Report build version. + if [ -f docker_build.version.log ]; then + rm docker_build.version.log + fi + (cd $DIR; docker run --rm -it -v $(pwd):/data $FULL_IMAGE_NAME bash -c "/data/version.sh") 2>&1 | tee docker_build.version.log + # + docker image ls $REPO_NAME/$IMAGE_NAME + rm -rf $DIR + echo "*****************************" + echo "SUCCESS" + echo "*****************************" +} + + +remove_container_image() { + # """ + # Remove Docker container image(s) matching the current configuration. + # """ + echo "# ${FUNCNAME[0]} ..." 
+ FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME + echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME" + docker image ls | grep $FULL_IMAGE_NAME + docker image ls | grep $FULL_IMAGE_NAME | awk '{print $1}' | xargs -n 1 -t docker image rm -f + docker image ls + echo "${FUNCNAME[0]} ... done" +} + + +push_container_image() { + # """ + # Push Docker container image to registry. + # + # Authenticates using credentials from ~/.docker/passwd.$REPO_NAME.txt. + # """ + echo "# ${FUNCNAME[0]} ..." + FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME + echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME" + docker login --username $REPO_NAME --password-stdin <~/.docker/passwd.$REPO_NAME.txt + docker images $FULL_IMAGE_NAME + docker push $FULL_IMAGE_NAME + echo "${FUNCNAME[0]} ... done" +} + + +pull_container_image() { + # """ + # Pull Docker container image from registry. + # """ + echo "# ${FUNCNAME[0]} ..." + FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME + echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME" + docker pull $FULL_IMAGE_NAME + echo "${FUNCNAME[0]} ... done" +} + + +# ############################################################################# +# Docker container management +# ############################################################################# + + +kill_container() { + # """ + # Kill and remove Docker container(s) matching the current configuration. + # """ + echo "# ${FUNCNAME[0]} ..." + FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME + echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME" + docker container ls + # + CONTAINER_ID=$(docker container ls -a | grep $FULL_IMAGE_NAME | awk '{print $1}') + echo "CONTAINER_ID=$CONTAINER_ID" + if [[ ! -z $CONTAINER_ID ]]; then + docker container rm -f $CONTAINER_ID + docker container ls + fi; + echo "${FUNCNAME[0]} ... done" +} + + +exec_container() { + # """ + # Execute bash shell in running Docker container. + # + # Opens an interactive bash session in the first container matching the + # current configuration. + # """ + echo "# ${FUNCNAME[0]} ..." 
+ FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME + echo "FULL_IMAGE_NAME=$FULL_IMAGE_NAME" + docker container ls + # + CONTAINER_ID=$(docker container ls -a | grep $FULL_IMAGE_NAME | awk '{print $1}') + echo "CONTAINER_ID=$CONTAINER_ID" + docker exec -it $CONTAINER_ID bash + echo "${FUNCNAME[0]} ... done" +} + + +# ############################################################################# +# Docker common options +# ############################################################################# + + +get_docker_common_options() { + # """ + # Return docker run options common to all container types. + # + # Includes volume mount for the git root, plus environment variables for + # PYTHONPATH and host OS name. + # + # :return: docker run options string with volume mounts and env vars + # """ + echo "-v $GIT_ROOT:/git_root \ + -e PYTHONPATH=/git_root:/git_root/helpers_root:/git_root/msml610/tutorials \ + -e CSFY_GIT_ROOT_PATH=/git_root \ + -e CSFY_HOST_OS_NAME=$(uname -s) \ + -e CSFY_HOST_NAME=$(uname -n)" +} + + +# ############################################################################# +# Docker bash +# ############################################################################# + + +get_docker_bash_command() { + # """ + # Return the base docker run command for an interactive bash shell. + # + # :return: docker run command string with --rm and -ti flags + # """ + if [ -t 0 ]; then + echo "docker run --rm -ti" + else + echo "docker run --rm -i" + fi +} + + +get_docker_bash_options() { + # """ + # Return docker run options for a Docker container. 
+ # + # :param container_name: Name for the Docker container + # :param port: Port number to forward (optional, skipped if empty) + # :param extra_opts: Additional docker run options (optional) + # :return: docker run options string with name, volume mounts, and env vars + # """ + local container_name=$1 + local port=$2 + local extra_opts=$3 + local port_opt="" + if [[ -n $port ]]; then + port_opt="-p $port:$port" + fi + echo "--name $container_name \ + $port_opt \ + $extra_opts \ + $(get_docker_common_options)" +} + + +# ############################################################################# +# Docker cmd +# ############################################################################# + + +get_docker_cmd_command() { + # """ + # Return the base docker run command for executing a non-interactive command. + # + # :return: docker run command string with --rm and -i flags + # """ + echo "docker run --rm -i" +} + + +# ############################################################################# +# Docker Jupyter +# ############################################################################# + + +get_docker_jupyter_command() { + # """ + # Return the base docker run command for running Jupyter Lab interactively. + # + # :return: docker run command string with --rm and -ti flags + # """ + echo "docker run --rm -ti" +} + + +get_docker_jupyter_options() { + # """ + # Return docker run options for a Jupyter Lab container. + # + # :param container_name: Name for the Docker container + # :param host_port: Host port to forward to container port 8888 + # :param jupyter_use_vim: 0 or 1 to enable vim bindings + # :return: docker run options string + # """ + local container_name=$1 + local host_port=$2 + local jupyter_use_vim=$3 + # Run as the current user when user is saggese. 
+    if [[ "$(whoami)" == "saggese" ]]; then
+        echo "Overwriting jupyter_use_vim since user='saggese'"
+        jupyter_use_vim=1
+    fi
+    echo "--name $container_name \
+        -p $host_port:8888 \
+        $(get_docker_common_options) \
+        -e JUPYTER_USE_VIM=$jupyter_use_vim"
+}
+
+
+configure_jupyter_vim_keybindings() {
+    # """
+    # Configure JupyterLab vim keybindings based on JUPYTER_USE_VIM env var.
+    #
+    # Reads JUPYTER_USE_VIM; if 1, verifies jupyterlab_vim is installed and
+    # writes enabled settings; otherwise writes disabled settings.
+    # """
+    mkdir -p ~/.jupyter/lab/user-settings/@axlair/jupyterlab_vim
+    if [[ $JUPYTER_USE_VIM == 1 ]]; then
+        # Check that jupyterlab_vim is installed before trying to enable it.
+        if ! pip show jupyterlab_vim > /dev/null 2>&1; then
+            echo "ERROR: jupyterlab_vim is not installed but vim bindings were requested."
+            echo "Install it with: pip install jupyterlab_vim"
+            exit 1
+        fi
+        echo "Enabling vim."
+        # Write the settings file via a quoted heredoc (cat < would only read it).
+        cat > ~/.jupyter/lab/user-settings/\@axlair/jupyterlab_vim/plugin.jupyterlab-settings <<'EOF'
+{
+    "enabled": true,
+    "enabledInEditors": true,
+    "extraKeybindings": []
+}
+EOF
+    else
+        echo "Disabling vim."
+        cat > ~/.jupyter/lab/user-settings/\@axlair/jupyterlab_vim/plugin.jupyterlab-settings <<'EOF'
+{
+    "enabled": false,
+    "enabledInEditors": false,
+    "extraKeybindings": []
+}
+EOF
+    fi;
+}
+
+
+configure_jupyter_notifications() {
+    # """
+    # Disable JupyterLab news fetching and update checks.
+    # """
+    mkdir -p ~/.jupyter/lab/user-settings/@jupyterlab/apputils-extension
+    # Write the settings file via a quoted heredoc (cat < would only read it).
+    cat > ~/.jupyter/lab/user-settings/\@jupyterlab/apputils-extension/notification.jupyterlab-settings <<'EOF'
+{
+    // Notifications
+    // @jupyterlab/apputils-extension:notification
+    // Notifications settings.
+
+    // Fetch official Jupyter news
+    // Whether to fetch news from the Jupyter news feed. If Always (`true`), it will make a request to a website.
+ "fetchNews": "false", + "checkForUpdates": false +} +EOF +} + + +get_jupyter_args() { + # """ + # Print the standard Jupyter Lab command-line arguments. + # + # :return: space-separated Jupyter Lab args for port 8888 with no browser, + # allow root, and no authentication + # """ + echo "--port=8888 --no-browser --ip=0.0.0.0 --allow-root --ServerApp.token='' --ServerApp.password=''" +} + + +get_run_jupyter_cmd() { + # """ + # Return the command to run run_jupyter.sh inside a container. + # + # Computes the script's path relative to GIT_ROOT and builds the + # corresponding /git_root/... path used inside the container. + # + # :param script_path: path of the calling script (pass ${BASH_SOURCE[0]}) + # :param cmd_opts: options to forward to run_jupyter.sh + # :return: full command string to run run_jupyter.sh + # """ + local script_path=$1 + local cmd_opts=$2 + local script_dir + script_dir=$(cd "$(dirname "$script_path")" && pwd) + local rel_dir="${script_dir#${GIT_ROOT}/}" + echo "/git_root/${rel_dir}/run_jupyter.sh $cmd_opts" +} diff --git a/research/agentic_data_science/schema_agent/version.sh b/research/agentic_data_science/schema_agent/version.sh new file mode 100755 index 000000000..c46ed254c --- /dev/null +++ b/research/agentic_data_science/schema_agent/version.sh @@ -0,0 +1,28 @@ +#!/bin/bash +# """ +# Display versions of installed tools and packages. +# +# This script prints version information for Python, pip, Jupyter, and all +# installed Python packages. Used for debugging and documentation purposes +# to verify the Docker container environment setup. +# """ + +# Display Python 3 version. +echo "# Python3" +python3 --version + +# Display pip version. +echo "# pip3" +pip3 --version + +# Display Jupyter version. +echo "# jupyter" +jupyter --version + +# List all installed Python packages and their versions. +echo "# Python packages" +pip3 list + +# Template for adding additional tool versions. 
+# echo "# mongo"
+# mongod --version
diff --git a/website/docs/blog/posts/draft.Schema_agent.md b/website/docs/blog/posts/draft.Schema_agent.md
new file mode 100644
index 000000000..f74928d50
--- /dev/null
+++ b/website/docs/blog/posts/draft.Schema_agent.md
@@ -0,0 +1,58 @@
+---
+title: "Data Profiler Agent in 30 Minutes"
+authors:
+  - Your Name
+date: 2026-04-10
+description:
+categories:
+  - AI Research
+  - Data Science
+---
+
+TL;DR: Learn how to automatically profile CSV datasets with statistical summaries and LLM-powered semantic analysis in 30 minutes. Generate column-level insights, detect temporal patterns, and discover data quality issues.
+
+<!-- more -->
+## Tutorial in 30 Seconds
+
+The Data Profiler Agent is an automated system that combines classical statistical analysis with LLM-powered semantic understanding to comprehensively profile CSV datasets.
+
+Key capabilities:
+
+- **Automatic temporal detection**: Identifies and converts date/datetime columns across multiple formats
+- **Statistical profiling**: Computes numeric summaries, data quality metrics, and categorical distributions
+- **LLM semantic analysis**: Infers column roles (ID, Feature, Target, Timestamp), semantic meaning, and testable hypotheses
+- **Smart cost control**: Selectively analyze columns to manage API costs without sacrificing insights
+- **Flexible output**: Machine-readable JSON reports and human-friendly Markdown summaries
+
+This tutorial's goal is to show you in 30 minutes:
+
+- How the modular architecture enables both quick profiling and extensibility
+- How to profile datasets and interpret results in multiple formats
+- How to optimize costs while maintaining analysis quality
+- How to integrate profiling into existing data pipelines
+
+## Official References
+
+- [Data Profiler Agent Repository](../../../../research/agentic_data_science/schema_agent)
+- [README](../../../../research/agentic_data_science/schema_agent/README.md)
+
+## Tutorial Content
+
+This tutorial includes all code, notebooks, and documentation in
+[research/agentic_data_science/schema_agent](../../../../research/agentic_data_science/schema_agent)
+
+- [`README.md`](../../../../research/agentic_data_science/schema_agent/README.md): Installation, usage, and configuration guide
+- Six modular Python files:
+  - `schema_agent_models.py`: Type-safe schemas for insights and profiles
+  - `schema_agent_loader.py`: CSV loading and type inference
+  - `schema_agent_stats.py`: Statistical computation and quality metrics
+  - `schema_agent_llm.py`: LLM integration and semantic analysis
+  - `schema_agent_report.py`: Report generation and export
+  - `schema_agent.py`: Pipeline orchestration and CLI
+- [`schema_agent.example`](../../../../research/agentic_data_science/schema_agent/schema_agent.example.ipynb): Individual module usage examples
+- [`schema_agent.API`](../../../../research/agentic_data_science/schema_agent/schema_agent.API.ipynb): End-to-end pipeline workflows and patterns
+- Example notebooks demonstrating real-world use cases:
+  - Basic profiling and interpretation
+  - Cost-optimized multi-file analysis
+  - Extracting and validating business hypotheses
\ No newline at end of file
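
To make the statistical-profiling step concrete, here is a minimal, pandas-only sketch of the kind of column-level profile the agent computes (dtype, null fraction, numeric summary or categorical cardinality). All names here (`profile_dataframe`, the sample CSV) are illustrative assumptions, not the actual `schema_agent_stats` API; the real pipeline additionally runs multi-format datetime inference and the LLM semantic pass.

```python
import io

import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    # Build a per-column profile: dtype, null fraction, and either numeric
    # summary stats or categorical cardinality.
    profile = {}
    for col in df.columns:
        s = df[col]
        info = {
            "dtype": str(s.dtype),
            "null_frac": float(s.isna().mean()),
        }
        if pd.api.types.is_numeric_dtype(s):
            # NaNs are skipped by min/max/mean by default.
            info.update(min=float(s.min()), max=float(s.max()), mean=float(s.mean()))
        else:
            info["n_unique"] = int(s.nunique())
        profile[col] = info
    return profile


# Hypothetical sample data standing in for a real CSV file.
csv_text = (
    "order_id,amount,status,created_at\n"
    "1,10.5,paid,2024-01-01\n"
    "2,,refunded,2024-01-02\n"
    "3,7.0,paid,2024-01-03\n"
)
df = pd.read_csv(io.StringIO(csv_text))
# Mirror the agent's temporal-detection step by parsing the date column.
df["created_at"] = pd.to_datetime(df["created_at"])
report = profile_dataframe(df)
print(report["amount"])
```

A dictionary like this is what downstream steps would serialize to JSON or render as a Markdown table; the LLM pass then attaches a role and semantic meaning to each column entry.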