Data Science Agent Toolkit — AI-assisted data pipeline builder.
DSAGT connects an MCP-compatible AI agent to three servers for building scientific data pipelines:
- Pipeline Server — Runs registered tools and logs provenance. Supports general-purpose processing and American Science Cloud targets; extensible to domain-specific workflows.
- Registry Builder — Analyzes CLI tools, documentation, and APIs to generate and store tool specifications.
- Knowledge Base — Semantic search over indexed document collections (FAISS + optional cross-encoder reranking).
The servers are platform-agnostic and communicate over MCP stdio.
Prerequisites:

- Python 3.10–3.13
- uv — required for portable MCP server configs across agent platforms
- An MCP-compatible agent (see Agent Setup)
```shell
git clone <repository-url>
cd dsagt
uv sync --all-groups
```

This installs three CLI entry points:

- `dsagt-pipeline-server`
- `dsagt-registry-server`
- `dsagt-knowledge-server`
The default registry ships with general-purpose data tools. To register additional tools, use the registry server through your agent:
- Point the agent at a CLI tool, its `--help` output, or documentation
- The agent uses the registry builder to analyze the interface (it can read files, fetch URLs, and run commands)
- The agent proposes a tool spec and saves it via `save_tool_spec`
- The tool is immediately available through the pipeline server
Registry builder tools exposed to the agent:
- `read_file` — Read local files
- `http_request` — Fetch URLs
- `run_command` — Execute commands (e.g., `tool --help`)
- `save_tool_spec` — Save a tool specification to the registry
- `get_registry` — List all registered tools
- `search_registry` — Search tools by name or description
- The pipeline server copies the base `registry.yaml` to `runtime/registry.yaml` at startup
- The agent runs tools from the session registry
- Each execution is logged to a provenance file
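A provenance entry might look like the following. This is a hypothetical sketch — the field names and layout are assumptions, not the exact on-disk format; check `runtime/provenance.log` for the real one:

```
2025-06-01T10:32:07Z tool=preprocess args={"input_file": "data/raw.csv", "output_file": "data/cleaned.csv"} command="python scripts/preprocess.py data/raw.csv data/cleaned.csv"
```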
The knowledge base provides semantic search over indexed document collections using FAISS, with optional cross-encoder reranking.
To set up the core collections (NeMo Curator, AIDRIN):
```shell
export LLM_API_KEY="your-api-key"
uv run python scripts/setup_core_kb.py
```

This clones repos, downloads papers, chunks content, and builds FAISS indexes in `kb_index/`. Run `python scripts/setup_core_kb.py --help` for options.
Pipeline server:

```shell
# Default bundled registry
uv run dsagt-pipeline-server

# Custom registry from a previous session
uv run dsagt-pipeline-server --registry my_registry.yaml

# Specify registry file and runtime directory
uv run dsagt-pipeline-server --registry path/to/registry.yaml --runtime-dir ./my_session
```

Registry server:

```shell
# Default: writes to ./runtime/registry.yaml
uv run dsagt-registry-server

# Custom registry path
uv run dsagt-registry-server --registry path/to/registry.yaml
```

Knowledge server:

```shell
# Defaults: base=./kb_index, runtime=./runtime
uv run dsagt-knowledge-server

# Custom directories, with reranking
uv run dsagt-knowledge-server --base-index-dir path/to/kb_index --runtime-dir ./runtime --rerank
```

DSAGT works with any MCP-compatible agent. The servers are identical across platforms; only the agent configuration format differs.
Platform-specific configs and quickstart guides live in agents/:
- Goose: `agents/goose/README.md`
- Roo Code (VS Code): `agents/roo/README.md`
- Claude Code: `agents/claude-code/README.md`
- Cline (VS Code): `agents/cline/README.md`
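As a rough illustration, an agent's MCP server entry tends to look like the sketch below. The JSON schema and key names here are a generic assumption — the exact format varies by platform, so follow the platform README rather than copying this verbatim:

```json
{
  "mcpServers": {
    "dsagt-pipeline": {
      "command": "uv",
      "args": ["run", "dsagt-pipeline-server"],
      "cwd": "/absolute/path/to/dsagt"
    }
  }
}
```

Using `uv` as the launcher is what makes these configs portable across agent platforms.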
Servers use relative paths by default (`--runtime-dir ./runtime`, `--base-index-dir ./kb_index`), resolved from the agent's working directory. This works as long as the agent launches from the DSAGT project root.
If your agent launches from elsewhere, use absolute paths:
```shell
uv run dsagt-knowledge-server --base-index-dir /absolute/path/to/kb_index
```

Example session:

```
User: I have a script at scripts/preprocess.py that cleans CSV files.
      Register it as a pipeline tool.

Agent: [reads the file, runs --help, proposes a spec]
       Registered "preprocess" with parameters for input_file,
       output_file, and --drop-nulls. Want to try it?

User: Run it on data/raw.csv

Agent: [executes via pipeline server]
       Output written to data/cleaned.csv. 142 rows processed,
       3 null rows dropped.
```
tests/smoke_test/ contains fixtures for verifying all three servers end-to-end. Knowledge base steps require an embedding API key; skip them if you don't have one.
```
tests/smoke_test/
├── greet.py        # Simple CLI tool to register and execute
└── knowledge/      # Documents for KB ingestion
    ├── DESCRIPTION.md
    ├── installation.md
    ├── api_reference.md
    └── troubleshooting.md
```
```shell
uv run python tests/smoke_test/greet.py World
```

Expected output: `{"message": "Hello, World!", "status": "ok"}`
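For reference, a CLI with this interface can be as small as the sketch below. This is an illustration of the expected shape, not the actual contents of `tests/smoke_test/greet.py`:

```python
import argparse
import json


def greet(argv: list[str]) -> str:
    # Positional name, optional --greeting flag, JSON result as a string.
    parser = argparse.ArgumentParser(description="Emit a JSON greeting.")
    parser.add_argument("name", help="Who to greet")
    parser.add_argument("--greeting", default="Hello", help="Greeting word")
    args = parser.parse_args(argv)
    return json.dumps({"message": f"{args.greeting}, {args.name}!", "status": "ok"})


print(greet(["World"]))  # → {"message": "Hello, World!", "status": "ok"}
```

Returning JSON on stdout is what lets the agent parse tool results without scraping free-form text.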
Follow `agents/<platform>/README.md` to launch a session. The knowledge server runs without reranking by default; add `--rerank` to enable cross-encoder reranking (triggers a model download on first use).
Without an embedding API key, omit the knowledge server and skip steps 5–6.
Register `tests/smoke_test/greet.py` as a pipeline tool.
Run `python tests/smoke_test/greet.py --help` to see its interface.

The agent should run `--help` via the registry builder, then call `save_tool_spec`.

Run the greet tool with name "World" and greeting "Hi".

Expected: JSON output with `"message": "Hi, World!"`.

Ingest the folder `tests/smoke_test/knowledge` into the knowledge base.

The agent should call `kb_ingest` and report file/chunk counts.

Search the knowledge collection for "how to handle large files".

Top results should come from `troubleshooting.md` (lazy loading, OOM errors).

List all knowledge base collections.

Should show `knowledge` with the description from `DESCRIPTION.md`.
After the session, check from the project root:
Registry (should contain default tools plus greet):
```shell
cat runtime/registry.yaml
```

Provenance log (timestamped entry with tool name, args, full command):

```shell
cat runtime/provenance.log
```

Knowledge base index:

```shell
ls runtime/kb_index/knowledge/
# Expected: index.faiss, chunks.jsonl, DESCRIPTION.md
```

Chunk format (JSON objects with id, text, metadata):

```shell
head -3 runtime/kb_index/knowledge/chunks.jsonl
```

To clean up after the smoke test:

```shell
rm -rf runtime
```

Tools are defined in YAML:
```yaml
tools:
  - name: tool_name
    description: What the tool does
    executable: command to run (e.g., "python script.py")
    dependencies:        # optional
      - pandas>=2.0
      - scikit-learn
    parameters:
      param_name:
        type: string|integer|number|boolean|array|object
        required: true|false
        description: Parameter description
        default: optional_default_value
```

Required parameters are passed as positional arguments. Optional parameters use `--flag value` syntax.
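For instance, the preprocess script from the example session above might be registered with a spec like this (the tool name and parameters are illustrative, not a spec shipped with DSAGT):

```yaml
tools:
  - name: preprocess
    description: Clean a CSV file, optionally dropping null rows
    executable: python scripts/preprocess.py
    dependencies:
      - pandas>=2.0
    parameters:
      input_file:
        type: string
        required: true
        description: Path to the raw CSV
      output_file:
        type: string
        required: true
        description: Where to write the cleaned CSV
      drop_nulls:
        type: boolean
        required: false
        description: Drop rows containing nulls
        default: false
```

Under the positional/flag rules above, a run of this tool would resolve to a command along the lines of `python scripts/preprocess.py data/raw.csv data/cleaned.csv --drop-nulls`.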
When a tool spec includes dependencies, the registry server automatically installs them via `uv pip install` at registration time. Dependencies are stored in the registry YAML for reproducibility. Use `install_dependencies` to reinstall all deps from an existing registry (e.g., after setting up a fresh environment).
demo/ contains a complete microbial isolate demonstration package:
- `demo/isolate_demo.md` — End-to-end demo instructions (asset collection, setup, prompts, and validation)
- `demo/isolate_session.txt` — Prompt script used to drive the demo interaction
- `demo/genomics.md` — Pipeline context document for knowledge ingestion
- `demo/fastp_megahit_best_practices.md` — fastp/MEGAHIT best-practices reference
- `demo/demoplan.md` — Short demo plan outline
Run the isolate demo by following demo/isolate_demo.md from the DSAGT project root.
```
├── demo/
│   ├── isolate_demo.md        # Microbial isolate demo runbook
│   ├── isolate_session.txt    # Prompt sequence
│   ├── genomics.md            # Domain context doc
│   ├── fastp_megahit_best_practices.md
│   └── demoplan.md
├── src/dsagt/
│   ├── __init__.py
│   ├── mcp_utils.py           # Shared MCP server utilities
│   ├── registry.py            # Tool registry management
│   ├── registry.yaml          # Default tool registry (bundled)
│   ├── knowledge.py           # Semantic search over document collections
│   ├── pipeline_server.py     # MCP server: tool execution
│   ├── registry_server.py     # MCP server: tool registration
│   └── knowledge_server.py    # MCP server: knowledge base search
├── agents/
│   ├── goose/                 # Goose agent config and quickstart
│   ├── roo/                   # Roo Code (VS Code) config and quickstart
│   ├── claude-code/           # Claude Code config and quickstart
│   └── cline/                 # Cline (VS Code) config and quickstart
├── tests/
│   ├── test_registry.py
│   ├── test_registry_server.py
│   ├── test_knowledge_base.py
│   ├── test_knowledge_server.py
│   ├── test_knowledge_integration.py    # Requires API key
│   └── smoke_test/
│       ├── greet.py
│       └── knowledge/
├── scripts/
│   └── setup_core_kb.py
├── pyproject.toml
└── README.md
```
```shell
uv run pytest
```

Run specific tests:

```shell
uv run pytest tests/test_registry.py
uv run pytest tests/test_registry_server.py
uv run pytest tests/test_knowledge_server.py
uv run pytest tests/test_knowledge_base.py
uv run pytest tests/test_registry.py::TestCallTool::test_success -v
```

Registry tests mock `subprocess.run`. Server tests invoke MCP handlers directly (no stdio transport, no network). Knowledge base tests mock the embedding API and use FAISS on temp files. Knowledge server tests use async helpers for background ingest/append jobs.
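The subprocess-mocking pattern used by the registry tests looks roughly like this — the function and test names here are illustrative, not the actual test code:

```python
import subprocess
from unittest import mock


def run_tool(command: list[str]) -> str:
    # Stand-in for a registry code path that shells out to a tool.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def test_run_tool_mocks_subprocess() -> None:
    # Patch subprocess.run so no real process is spawned.
    fake = subprocess.CompletedProcess(
        args=["echo", "hi"], returncode=0, stdout="hi\n", stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake) as patched:
        assert run_tool(["echo", "hi"]) == "hi\n"
        patched.assert_called_once()
```

Because no subprocess is actually launched, these tests stay fast and hermetic.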
Verify the entry points are installed:

```shell
uv run which dsagt-pipeline-server
uv run which dsagt-registry-server
uv run which dsagt-knowledge-server

# Reinstall if needed
uv sync --reinstall
```

Check what command was run:

```shell
cat runtime/provenance.log
```

Verify the executable path and that any interpreter (python, Rscript, etc.) is in your PATH.
Use absolute paths in configuration:
```yaml
args:
  - --registry
  - /full/path/to/registry.yaml
```

DSAGT currently shows high run-to-run variance (even at temperature 0). The root issue appears to be limited structure in tool description, invocation, and management. This plan focuses on improving reliability, reproducibility, and observability.
- Goal: Reduce stochastic behavior with stronger execution structure.
- Problem: Routing through the MCP pipeline server adds indirection and limits direct access to Unix return codes/process signals. Registry entries are too sparse for consistent tool selection and invocation.
- Plan:
  - Extend the registry so each tool is paired with a skill (usage patterns, expected I/O, invocation examples).
  - Use an agent-driven skill builder (in progress by Jean-Luca) to generate/refine tool skills as tools are registered.
  - Move tool execution to direct shell invocation so the agent can use return codes, stderr, and standard Unix process controls.
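A minimal sketch of what direct invocation would give the agent — an illustration of the Unix semantics involved, not the planned implementation:

```python
import subprocess
import sys


def run_direct(command: list[str]) -> subprocess.CompletedProcess:
    # Direct invocation: the caller sees the real exit code, stdout, and
    # stderr, rather than an MCP-mediated result object.
    return subprocess.run(command, capture_output=True, text=True)


# A failing tool exits nonzero; the agent can branch on that directly
# and surface stderr in its reasoning.
result = run_direct([sys.executable, "-c", "import sys; sys.exit(3)"])
assert result.returncode == 3
```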
- Goal: Improve reproducibility and isolation of tool execution environments.
- Problem: Ad hoc dependency handling creates conflicts and weak cross-machine reproducibility.
- Plan:
  - Build a container-builder skill to generate Dockerized tool runtimes (dependencies, data paths, runtime config).
  - Evaluate the Docker Agent tool shared by Shreyas (assessment led by Andrew) for reuse or adaptation.
- Goal: Add structured telemetry for debugging, evaluation, and cost tracking.
- Problem: No consistent logs for agent decisions, tool calls, or token usage, which blocks diagnosis and improvement tracking.
- Plan:
  - Integrate OpenTelemetry for standards-based tracing/logging.
  - Add MLflow wrappers for experiment tracking and token-usage monitoring.
- Goal: Add skills that estimate compute needs and prevent resource overcommit during pipeline execution.
- Plan:
  - Create a resource-assessment skill that probes tool behavior, learns feasible operating ranges under current constraints, and recommends alternatives when resources are insufficient.