46 changes: 31 additions & 15 deletions README.md
@@ -14,14 +14,16 @@ Model Context Protocol (MCP).

## Features

- 🔄 **Complete Data Operations** - Load, transform, analyze, and export CSV data
- 🔄 **Complete Data Operations** - Load, transform, and analyze CSV data from
URLs and string content
- 📊 **Advanced Analytics** - Statistics, correlations, outlier detection, data
profiling
- ✅ **Data Validation** - Schema validation, quality scoring, anomaly detection
- 🎯 **Stateless Design** - Clean MCP architecture with external context
management
- ⚡ **High Performance** - Handles large datasets with streaming and chunking
- ⚡ **High Performance** - Async I/O, streaming downloads, chunked processing
- 🔒 **Session Management** - Multi-user support with isolated sessions
- 🛡️ **Web-Safe** - No file system access; designed for secure web hosting
- 🌟 **Code Quality** - Zero ruff violations, 100% mypy compliance, perfect MCP
documentation standards, comprehensive test coverage

@@ -71,8 +73,9 @@ uv run databeak --transport http --host 0.0.0.0 --port 8000
Once configured, ask your AI assistant:

```text
"Load a CSV file and show me basic statistics"
"Remove duplicate rows and export as Excel"
"Load this CSV data: name,price\nWidget,10.99\nGadget,25.50"
"Load CSV from URL: https://example.com/data.csv"
"Remove duplicate rows and show me the statistics"
"Find outliers in the price column"
```

@@ -91,34 +94,47 @@ Once configured, ask your AI assistant:

## Environment Variables

| Variable | Default | Description |
| --------------------------- | ------- | ------------------------- |
| `DATABEAK_MAX_FILE_SIZE_MB` | 1024 | Maximum file size |
| `DATABEAK_CSV_HISTORY_DIR` | "." | History storage location |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
Configure DataBeak behavior with environment variables (all use the `DATABEAK_`
prefix):

| Variable | Default | Description |
| ------------------------------------- | --------- | ---------------------------------- |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
| `DATABEAK_MAX_DOWNLOAD_SIZE_MB` | 100 | Maximum URL download size (MB) |
| `DATABEAK_MAX_MEMORY_USAGE_MB` | 1000 | Max DataFrame memory (MB) |
| `DATABEAK_MAX_ROWS` | 1,000,000 | Max DataFrame rows |
| `DATABEAK_URL_TIMEOUT_SECONDS` | 30 | URL download timeout (seconds) |
| `DATABEAK_HEALTH_MEMORY_THRESHOLD_MB` | 2048 | Health monitoring memory threshold |

See [settings.py](src/databeak/core/settings.py) for complete configuration
options.
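
As a rough illustration of how the prefixed variables in the table might be
read, here is a minimal standard-library sketch. The helper
`read_databeak_settings` and the `DEFAULTS` mapping are hypothetical names
introduced for this example; the defaults are copied from the table, and the
project's real configuration lives in `settings.py` and may behave differently.

```python
import os

# Defaults copied from the table above, keyed by variable name without the
# DATABEAK_ prefix. This mapping is illustrative, not the project's own code.
DEFAULTS = {
    "SESSION_TIMEOUT": 3600,
    "MAX_DOWNLOAD_SIZE_MB": 100,
    "MAX_MEMORY_USAGE_MB": 1000,
    "MAX_ROWS": 1_000_000,
    "URL_TIMEOUT_SECONDS": 30,
    "HEALTH_MEMORY_THRESHOLD_MB": 2048,
}


def read_databeak_settings(environ=None):
    """Return integer settings, letting DATABEAK_* variables override defaults."""
    env = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(f"DATABEAK_{name}")
        # Fall back to the documented default when the variable is unset.
        settings[name] = int(raw) if raw is not None else default
    return settings
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test,
for example `read_databeak_settings({"DATABEAK_MAX_ROWS": "500"})`.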

## Known Limitations

DataBeak is designed for interactive CSV processing with AI assistants. Be aware
of these constraints:

- **File Size**: Maximum 1024MB per file (configurable via
`DATABEAK_MAX_FILE_SIZE_MB`)
- **Data Loading**: URLs and string content only (no local file system access
for web hosting security)
- **Download Size**: Maximum 100MB per URL download (configurable via
`DATABEAK_MAX_DOWNLOAD_SIZE_MB`)
- **DataFrame Size**: Maximum 1GB memory and 1M rows per DataFrame
(configurable)
- **Session Management**: Maximum 100 concurrent sessions, 1-hour timeout
(configurable)
- **Memory**: Large datasets may require significant memory; monitor with
`system_info` tool
`health_check` tool
- **CSV Dialects**: Assumes standard CSV format; complex dialects may require
pre-processing
- **Concurrency**: Single-threaded processing per session; parallel sessions
- **Concurrency**: Async I/O for concurrent URL downloads; parallel sessions
supported
- **Data Types**: Automatic type inference; complex types may need explicit
conversion
- **URL Loading**: HTTPS only; blocks private networks (127.0.0.1, 192.168.x.x,
10.x.x.x) for security
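
The URL Loading constraint above (HTTPS only, private networks blocked) can be
approximated with the standard library. This is a sketch under stated
assumptions, not DataBeak's actual validation: `is_url_allowed` is a
hypothetical helper, it only inspects literal IP hosts and the name
`localhost`, and it does not resolve DNS names.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url: str) -> bool:
    """Return True for HTTPS URLs whose host is not a literal private or loopback IP."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    host = parsed.hostname
    if host == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A production check would also resolve the
        # hostname and test every resulting address.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

A hostname that resolves to a private address would pass this check, so a real
deployment would need to validate the resolved addresses as well.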

For production deployments with larger datasets, consider adjusting environment
variables and monitoring resource usage.
For production deployments with larger datasets, adjust environment variables
and monitor resource usage with the `health_check` and `get_server_info` tools.
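
One way a download size limit such as `DATABEAK_MAX_DOWNLOAD_SIZE_MB` could be
enforced while streaming is to read in chunks and stop once the cap is
exceeded. The sketch below assumes a generic binary file-like object;
`read_with_limit` is a hypothetical helper and is not taken from the DataBeak
source.

```python
import io


def read_with_limit(stream, max_bytes, chunk_size=64 * 1024):
    """Read a binary stream in chunks, raising ValueError once max_bytes is exceeded."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.extend(chunk)
        # Check after each chunk so an oversized body is rejected early
        # instead of being buffered in full.
        if len(buf) > max_bytes:
            raise ValueError(f"download exceeds {max_bytes} bytes")
    return bytes(buf)


# Usage with an in-memory stream standing in for a network response.
data = read_with_limit(io.BytesIO(b"name,price\nWidget,10.99\n"), max_bytes=1024)
```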

## Contributing

36 changes: 18 additions & 18 deletions docs/api/index.md
@@ -13,12 +13,10 @@ comprehensive error handling.

### 📁 I/O Operations

Tools for loading and exporting CSV data in various formats:
Tools for loading CSV data from web sources:

- **`load_csv`** - Load CSV from file path
- **`load_csv_from_url`** - Load CSV from HTTP/HTTPS URL
- **`load_csv_from_content`** - Load CSV from string content
- **`export_csv`** - Export to CSV, JSON, Excel, Parquet, HTML, Markdown
- **`get_session_info`** - Get current session details and statistics
- **`list_sessions`** - List all active sessions
- **`close_session`** - Close and cleanup a session
@@ -60,14 +58,11 @@ Tools for schema validation and quality checking:

### 🔄 Session Management

Tools for managing data sessions and workflow:
Tools for managing data sessions:

- **`configure_auto_save`** - Set up automatic saving strategies
- **`get_auto_save_status`** - Check current auto-save configuration
- **`undo`** - Undo the last operation
- **`redo`** - Redo previously undone operation
- **`get_history`** - View operation history
- **`restore_to_operation`** - Restore to specific point in history
- **`list_sessions`** - List all active sessions
- **`close_session`** - Close and cleanup a session
- **`get_session_info`** - Get session metadata and statistics

### ⚙️ System Tools

@@ -129,15 +124,20 @@ Filter operations support complex conditions:

### Environment Configuration

All tools respect these environment variables:
All tools respect these environment variables (all use the `DATABEAK_` prefix):

| Variable | Default | Purpose |
| ------------------------------------- | --------- | -------------------------------- |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
| `DATABEAK_MAX_DOWNLOAD_SIZE_MB` | 100 | Maximum URL download size (MB) |
| `DATABEAK_MAX_MEMORY_USAGE_MB` | 1000 | Max DataFrame memory (MB) |
| `DATABEAK_MAX_ROWS` | 1,000,000 | Max DataFrame rows |
| `DATABEAK_URL_TIMEOUT_SECONDS` | 30 | URL download timeout (seconds) |
| `DATABEAK_HEALTH_MEMORY_THRESHOLD_MB` | 2048 | Health monitoring threshold (MB) |

| Variable | Default | Purpose |
| --------------------------- | ------- | ------------------------- |
| `DATABEAK_MAX_FILE_SIZE_MB` | 1024 | Maximum file size |
| `DATABEAK_CSV_HISTORY_DIR` | "." | History storage location |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
| `DATABEAK_CHUNK_SIZE` | 10000 | Processing chunk size |
| `DATABEAK_AUTO_SAVE` | true | Enable auto-save |
See
[DatabeakSettings](https://github.com/jonpspri/databeak/blob/main/src/databeak/core/settings.py)
for all configuration options.

## Advanced Features

51 changes: 28 additions & 23 deletions docs/tutorials/quickstart.md
@@ -16,12 +16,16 @@ process a sample sales dataset using natural language commands.

## Step 1: Load Your Data

Ask your AI assistant:
Ask your AI assistant to load data from a URL or paste CSV content:

> "Load the sales data from my CSV file"
> "Load the sales data from this URL: <https://example.com/sales.csv>"

The AI will use the `load_csv` tool to create a new session and load your data.
You'll see a response with:
Or provide CSV content directly:

> "Load this CSV data: name,price,quantity\\nWidget,10.99,5\\nGadget,25.50,3"
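
To make the "CSV content" path concrete, here is a minimal sketch that parses
the same sample string and reports its shape. It uses only the standard
library as a stand-in; `load_csv_content` is a hypothetical helper, and the
actual `load_csv_from_content` tool may parse and type the data differently.

```python
import csv
import io


def load_csv_content(content: str):
    """Parse CSV text into a list of row dicts plus a (rows, columns) shape."""
    reader = csv.DictReader(io.StringIO(content))
    rows = list(reader)
    columns = reader.fieldnames or []
    return rows, (len(rows), len(columns))


rows, shape = load_csv_content("name,price,quantity\nWidget,10.99,5\nGadget,25.50,3")
print(shape)    # (2, 3)
print(rows[0])  # {'name': 'Widget', 'price': '10.99', 'quantity': '5'}
```

Note that the `csv` module leaves every value as a string; type inference is a
separate step.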

The AI will use the `load_csv_from_url` or `load_csv_from_content` tool to
create a new session and load your data. You'll see a response with:

- Session ID for tracking
- Data shape (rows × columns)
@@ -88,10 +92,11 @@ For detailed column analysis:

> "Check the overall data quality and give me a quality score"

## Step 6: Export Results
## Step 6: Save Results

> "Export this cleaned and analyzed data as an Excel file named
> 'sales_analysis.xlsx'"
DataBeak processes data in memory for web-based hosting security. To save
results, export them through your AI assistant, which can save files on your
behalf.

## Advanced Features

@@ -102,11 +107,11 @@ Made a mistake? No problem:
> "Undo the last operation" "Show me the operation history" "Restore to the
> state before I added the total_value column"

### Auto-Save Configuration
### Data Retrieval

Set up automatic saving:
Get processed data back as CSV content for further use:

> "Export the cleaned data to a new CSV file for further analysis"
> "Show me the cleaned data as CSV content"

### Session Management

@@ -121,40 +126,40 @@ Work with multiple datasets:

```python
# Natural language commands:
"Load the messy customer data"
"Load customer data from URL: https://example.com/customers.csv"

"Remove duplicate rows"
"Fill missing email addresses with 'no-email@domain.com'"
"Standardize the phone number format"
"Remove rows where age is negative or over 120"
"Export the cleaned data"
"Show me the cleaned data preview"
```

### Analysis Pipeline

```python
# Business intelligence workflow:
"Load quarterly sales data"
"Load quarterly sales data from URL: https://example.com/q1-sales.csv"

"Filter for completed transactions only"
"Group by product category and month"
"Calculate total revenue and average order value"
"Find the top 10 selling products"
"Create correlation matrix for price vs quantity vs revenue"
"Export summary as Excel with charts"
"Show me the summary statistics"
```

### Data Validation

```python
# Quality assurance workflow:
"Load the new data batch"
"Load data from this CSV content: [paste CSV here]"

"Validate against the expected schema"
"Check data quality score"
"Find any statistical anomalies"
"Generate a data profiling report"
"Flag any quality issues for review"
"Show me any quality issues found"
```

## Tips for Success
@@ -171,18 +176,18 @@ where status equals 'active'"

### 3. **Chain Operations**

"Load sales.csv, remove duplicates, filter for 2024 data, then calculate monthly
totals"
"Load sales data from URL, remove duplicates, filter for 2024 data, then
calculate monthly totals"

### 4. **Leverage Auto-Save**
### 4. **Work with Web Data**

DataBeak automatically saves your work, so you can focus on analysis without
worrying about losing changes
DataBeak is designed for web-based hosting, so it works with URLs and in-memory
data without accessing your local file system

### 5. **Explore History**

Use DataBeak's stateless design to experiment with different approaches - export
intermediate results as needed
Use DataBeak's stateless design to experiment with different approaches -
retrieve results when needed

## Next Steps

2 changes: 0 additions & 2 deletions pyproject.toml
@@ -60,7 +60,6 @@ dependencies = [
"pytz>=2024.2",
"pydantic-settings>=2.10.1",
"psutil>=7.0.0",
"chardet>=5.2.0",
"scipy>=1.16.1",
"simpleeval>=1.0.3",
"pandera>=0.26.1",
@@ -384,7 +383,6 @@ dev = [
"twine>=6.1.0",
"ty>=0.0.1a21",
"types-aiofiles>=24.1.0.20250822",
"types-chardet>=5.0.4.6",
"types-jsonschema>=4.25.1.20250822",
"types-psutil>=7.0.0.20250822",
"types-pytz>=2025.2.0.20250809",
40 changes: 1 addition & 39 deletions src/databeak/core/session.py
@@ -5,12 +5,11 @@
import logging
import threading
from datetime import UTC, datetime, timedelta
from pathlib import Path
from typing import TYPE_CHECKING, Any
from uuid import uuid4

from databeak.exceptions import NoDataLoadedError, SessionExpiredError
from databeak.models.data_models import ExportFormat, SessionInfo
from databeak.models.data_models import SessionInfo
from databeak.models.data_session import DataSession

if TYPE_CHECKING:
@@ -146,43 +145,6 @@ def get_info(self) -> SessionInfo:
            file_path=data_info["file_path"],
        )

    async def _save_callback(
        self,
        file_path: str,
        export_format: ExportFormat,
        encoding: str,
    ) -> dict[str, Any]:
        """Handle auto-save operations."""
        try:
            if self._data_session.df is None:
                return {"success": False, "error": "No data to save"}

            # Handle different export formats
            path_obj = Path(file_path)
            path_obj.parent.mkdir(parents=True, exist_ok=True)

            if export_format == ExportFormat.CSV:
                self._data_session.df.to_csv(path_obj, index=False, encoding=encoding)
            elif export_format == ExportFormat.TSV:
                self._data_session.df.to_csv(path_obj, sep="\t", index=False, encoding=encoding)
            elif export_format == ExportFormat.JSON:
                self._data_session.df.to_json(path_obj, orient="records", indent=2)
            elif export_format == ExportFormat.EXCEL:
                self._data_session.df.to_excel(path_obj, index=False)
            elif export_format == ExportFormat.PARQUET:
                self._data_session.df.to_parquet(path_obj, index=False)
            else:
                return {"success": False, "error": f"Unsupported format: {export_format}"}

            return {
                "success": True,
                "file_path": str(path_obj),
                "rows": len(self._data_session.df),
                "columns": len(self._data_session.df.columns),
            }
        except (OSError, PermissionError, ValueError, TypeError, UnicodeError) as e:
            return {"success": False, "error": str(e)}

    async def clear(self) -> None:
        """Clear session data to free memory."""
        # Clear data session