diff --git a/README.md b/README.md
index 9bd6931..14e67bd 100644
--- a/README.md
+++ b/README.md
@@ -14,14 +14,16 @@ Model Context Protocol (MCP).

 ## Features

-- 🔄 **Complete Data Operations** - Load, transform, analyze, and export CSV data
+- 🔄 **Complete Data Operations** - Load, transform, and analyze CSV data from
+  URLs and string content
 - 📊 **Advanced Analytics** - Statistics, correlations, outlier detection, data
   profiling
 - ✅ **Data Validation** - Schema validation, quality scoring, anomaly detection
 - 🎯 **Stateless Design** - Clean MCP architecture with external context
   management
-- ⚡ **High Performance** - Handles large datasets with streaming and chunking
+- ⚡ **High Performance** - Async I/O, streaming downloads, chunked processing
 - 🔒 **Session Management** - Multi-user support with isolated sessions
+- 🛡️ **Web-Safe** - No file system access; designed for secure web hosting
 - 🌟 **Code Quality** - Zero ruff violations, 100% mypy compliance, perfect MCP
   documentation standards, comprehensive test coverage

@@ -71,8 +73,9 @@ uv run databeak --transport http --host 0.0.0.0 --port 8000
 Once configured, ask your AI assistant:

 ```text
-"Load a CSV file and show me basic statistics"
-"Remove duplicate rows and export as Excel"
+"Load this CSV data: name,price\nWidget,10.99\nGadget,25.50"
+"Load CSV from URL: https://example.com/data.csv"
+"Remove duplicate rows and show me the statistics"
 "Find outliers in the price column"
 ```

@@ -91,34 +94,47 @@ Once configured, ask your AI assistant:

 ## Environment Variables

-| Variable                    | Default | Description               |
-| --------------------------- | ------- | ------------------------- |
-| `DATABEAK_MAX_FILE_SIZE_MB` | 1024    | Maximum file size         |
-| `DATABEAK_CSV_HISTORY_DIR`  | "."     | History storage location  |
-| `DATABEAK_SESSION_TIMEOUT`  | 3600    | Session timeout (seconds) |
+Configure DataBeak behavior with environment variables (all use `DATABEAK_`
+prefix):
+
+| Variable                              | Default   | Description                        |
+| ------------------------------------- | --------- | ---------------------------------- |
+| `DATABEAK_SESSION_TIMEOUT`            | 3600      | Session timeout (seconds)          |
+| `DATABEAK_MAX_DOWNLOAD_SIZE_MB`       | 100       | Maximum URL download size (MB)     |
+| `DATABEAK_MAX_MEMORY_USAGE_MB`        | 1000      | Max DataFrame memory (MB)          |
+| `DATABEAK_MAX_ROWS`                   | 1,000,000 | Max DataFrame rows                 |
+| `DATABEAK_URL_TIMEOUT_SECONDS`        | 30        | URL download timeout               |
+| `DATABEAK_HEALTH_MEMORY_THRESHOLD_MB` | 2048      | Health monitoring memory threshold |
+
+See [settings.py](src/databeak/core/settings.py) for complete configuration
+options.

 ## Known Limitations

 DataBeak is designed for interactive CSV processing with AI assistants. Be aware
 of these constraints:

-- **File Size**: Maximum 1024MB per file (configurable via
-  `DATABEAK_MAX_FILE_SIZE_MB`)
+- **Data Loading**: URLs and string content only (no local file system access
+  for web hosting security)
+- **Download Size**: Maximum 100MB per URL download (configurable via
+  `DATABEAK_MAX_DOWNLOAD_SIZE_MB`)
+- **DataFrame Size**: Maximum 1GB memory and 1M rows per DataFrame
+  (configurable)
 - **Session Management**: Maximum 100 concurrent sessions, 1-hour timeout
   (configurable)
 - **Memory**: Large datasets may require significant memory; monitor with
-  `system_info` tool
+  `health_check` tool
 - **CSV Dialects**: Assumes standard CSV format; complex dialects may require
   pre-processing
-- **Concurrency**: Single-threaded processing per session; parallel sessions
+- **Concurrency**: Async I/O for concurrent URL downloads; parallel sessions
   supported
 - **Data Types**: Automatic type inference; complex types may need explicit
   conversion
 - **URL Loading**: HTTPS only; blocks private networks (127.0.0.1, 192.168.x.x,
   10.x.x.x) for security

-For production deployments with larger datasets, consider adjusting environment
-variables and monitoring resource usage.
+For production deployments with larger datasets, adjust environment variables
+and monitor resource usage with `health_check` and `get_server_info` tools.

 ## Contributing

diff --git a/docs/api/index.md b/docs/api/index.md
index 896ba37..5850d4d 100644
--- a/docs/api/index.md
+++ b/docs/api/index.md
@@ -13,12 +13,10 @@ comprehensive error handling.

 ### 📁 I/O Operations

-Tools for loading and exporting CSV data in various formats:
+Tools for loading CSV data from web sources:

-- **`load_csv`** - Load CSV from file path
 - **`load_csv_from_url`** - Load CSV from HTTP/HTTPS URL
 - **`load_csv_from_content`** - Load CSV from string content
-- **`export_csv`** - Export to CSV, JSON, Excel, Parquet, HTML, Markdown
 - **`get_session_info`** - Get current session details and statistics
 - **`list_sessions`** - List all active sessions
 - **`close_session`** - Close and cleanup a session
@@ -60,14 +58,11 @@ Tools for schema validation and quality checking:

 ### 🔄 Session Management

-Tools for managing data sessions and workflow:
+Tools for managing data sessions:

-- **`configure_auto_save`** - Set up automatic saving strategies
-- **`get_auto_save_status`** - Check current auto-save configuration
-- **`undo`** - Undo the last operation
-- **`redo`** - Redo previously undone operation
-- **`get_history`** - View operation history
-- **`restore_to_operation`** - Restore to specific point in history
+- **`list_sessions`** - List all active sessions
+- **`close_session`** - Close and cleanup a session
+- **`get_session_info`** - Get session metadata and statistics

 ### ⚙️ System Tools

@@ -129,15 +124,20 @@ Filter operations support complex conditions:

 ### Environment Configuration

-All tools respect these environment variables:
+All tools respect these environment variables (all use `DATABEAK_` prefix):
+
+| Variable                              | Default   | Purpose                          |
+| ------------------------------------- | --------- | -------------------------------- |
+| `DATABEAK_SESSION_TIMEOUT`            | 3600      | Session timeout (seconds)        |
+| `DATABEAK_MAX_DOWNLOAD_SIZE_MB`       | 100       | Maximum URL download size (MB)   |
+| `DATABEAK_MAX_MEMORY_USAGE_MB`        | 1000      | Max DataFrame memory (MB)        |
+| `DATABEAK_MAX_ROWS`                   | 1,000,000 | Max DataFrame rows               |
+| `DATABEAK_URL_TIMEOUT_SECONDS`        | 30        | URL download timeout (seconds)   |
+| `DATABEAK_HEALTH_MEMORY_THRESHOLD_MB` | 2048      | Health monitoring threshold (MB) |

-| Variable                    | Default | Purpose                   |
-| --------------------------- | ------- | ------------------------- |
-| `DATABEAK_MAX_FILE_SIZE_MB` | 1024    | Maximum file size         |
-| `DATABEAK_CSV_HISTORY_DIR`  | "."     | History storage location  |
-| `DATABEAK_SESSION_TIMEOUT`  | 3600    | Session timeout (seconds) |
-| `DATABEAK_CHUNK_SIZE`       | 10000   | Processing chunk size     |
-| `DATABEAK_AUTO_SAVE`        | true    | Enable auto-save          |
+See
+[DatabeakSettings](https://github.com/jonpspri/databeak/blob/main/src/databeak/core/settings.py)
+for all configuration options.

 ## Advanced Features

diff --git a/docs/tutorials/quickstart.md b/docs/tutorials/quickstart.md
index 13b0fb7..23ad802 100644
--- a/docs/tutorials/quickstart.md
+++ b/docs/tutorials/quickstart.md
@@ -16,12 +16,16 @@ process a sample sales dataset using natural language commands.

 ## Step 1: Load Your Data

-Ask your AI assistant:
+Ask your AI assistant to load data from a URL or paste CSV content:

-> "Load the sales data from my CSV file"
+> "Load the sales data from this URL: "

-The AI will use the `load_csv` tool to create a new session and load your data.
-You'll see a response with:
+Or provide CSV content directly:
+
+> "Load this CSV data: name,price,quantity\\nWidget,10.99,5\\nGadget,25.50,3"
+
+The AI will use the `load_csv_from_url` or `load_csv_from_content` tool to
+create a new session and load your data. You'll see a response with:

 - Session ID for tracking
 - Data shape (rows × columns)
@@ -88,10 +92,11 @@ For detailed column analysis:

 > "Check the overall data quality and give me a quality score"

-## Step 6: Export Results
+## Step 6: Save Results

-> "Export this cleaned and analyzed data as an Excel file named
-> 'sales_analysis.xlsx'"
+DataBeak processes data in memory for web-based hosting security. To save
+results, export them through your AI assistant which can save files on your
+behalf.

 ## Advanced Features

@@ -102,11 +107,11 @@ Made a mistake? No problem:
 > "Undo the last operation" "Show me the operation history" "Restore to the
 > state before I added the total_value column"

-### Auto-Save Configuration
+### Data Retrieval

-Set up automatic saving:
+Get processed data back as CSV content for further use:

-> "Export the cleaned data to a new CSV file for further analysis"
+> "Show me the cleaned data as CSV content"

 ### Session Management

@@ -121,40 +126,40 @@ Work with multiple datasets:

 ```python
 # Natural language commands:
-"Load the messy customer data"
+"Load customer data from URL: https://example.com/customers.csv"
 "Remove duplicate rows"
 "Fill missing email addresses with 'no-email@domain.com'"
 "Standardize the phone number format"
 "Remove rows where age is negative or over 120"
-"Export the cleaned data"
+"Show me the cleaned data preview"
 ```

 ### Analysis Pipeline

 ```python
 # Business intelligence workflow:
-"Load quarterly sales data"
+"Load quarterly sales data from URL: https://example.com/q1-sales.csv"
 "Filter for completed transactions only"
 "Group by product category and month"
 "Calculate total revenue and average order value"
 "Find the top 10 selling products"
 "Create correlation matrix for price vs quantity vs revenue"
-"Export summary as Excel with charts"
+"Show me the summary statistics"
 ```

 ### Data Validation

 ```python
 # Quality assurance workflow:
-"Load the new data batch"
+"Load data from this CSV content: [paste CSV here]"
 "Validate against the expected schema"
 "Check data quality score"
 "Find any statistical anomalies"
 "Generate a data profiling report"
-"Flag any quality issues for review"
+"Show me any quality issues found"
 ```

 ## Tips for Success
@@ -171,18 +176,18 @@ where status equals 'active'"

 ### 3. **Chain Operations**

-"Load sales.csv, remove duplicates, filter for 2024 data, then calculate monthly
-totals"
+"Load sales data from URL, remove duplicates, filter for 2024 data, then
+calculate monthly totals"

-### 4. **Leverage Auto-Save**
+### 4. **Work with Web Data**

-DataBeak automatically saves your work, so you can focus on analysis without
-worrying about losing changes
+DataBeak is designed for web-based hosting, so it works with URLs and in-memory
+data without accessing your local file system

 ### 5. **Explore History**

-Use DataBeak's stateless design to experiment with different approaches - export
-intermediate results as needed
+Use DataBeak's stateless design to experiment with different approaches -
+retrieve results when needed

 ## Next Steps

diff --git a/pyproject.toml b/pyproject.toml
index b8886e9..e774cab 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -60,7 +60,6 @@ dependencies = [
     "pytz>=2024.2",
     "pydantic-settings>=2.10.1",
     "psutil>=7.0.0",
-    "chardet>=5.2.0",
     "scipy>=1.16.1",
     "simpleeval>=1.0.3",
     "pandera>=0.26.1",
@@ -384,7 +383,6 @@ dev = [
     "twine>=6.1.0",
     "ty>=0.0.1a21",
     "types-aiofiles>=24.1.0.20250822",
-    "types-chardet>=5.0.4.6",
     "types-jsonschema>=4.25.1.20250822",
     "types-psutil>=7.0.0.20250822",
     "types-pytz>=2025.2.0.20250809",
diff --git a/src/databeak/core/session.py b/src/databeak/core/session.py
index f24812b..cabb1e6 100644
--- a/src/databeak/core/session.py
+++ b/src/databeak/core/session.py
@@ -5,12 +5,11 @@
 import logging
 import threading
 from datetime import UTC, datetime, timedelta
-from pathlib import Path
 from typing import TYPE_CHECKING, Any
 from uuid import uuid4

 from databeak.exceptions import NoDataLoadedError, SessionExpiredError
-from databeak.models.data_models import ExportFormat, SessionInfo
+from databeak.models.data_models import SessionInfo
 from databeak.models.data_session import DataSession

 if TYPE_CHECKING:
@@ -146,43 +145,6 @@ def get_info(self) -> SessionInfo:
             file_path=data_info["file_path"],
         )

-    async def _save_callback(
-        self,
-        file_path: str,
-        export_format: ExportFormat,
-        encoding: str,
-    ) -> dict[str, Any]:
-        """Handle auto-save operations."""
-        try:
-            if self._data_session.df is None:
-                return {"success": False, "error": "No data to save"}
-
-            # Handle different export formats
-            path_obj = Path(file_path)
-            path_obj.parent.mkdir(parents=True, exist_ok=True)
-
-            if export_format == ExportFormat.CSV:
-                self._data_session.df.to_csv(path_obj, index=False, encoding=encoding)
-            elif export_format == ExportFormat.TSV:
-                self._data_session.df.to_csv(path_obj, sep="\t", index=False, encoding=encoding)
-            elif export_format == ExportFormat.JSON:
-                self._data_session.df.to_json(path_obj, orient="records", indent=2)
-            elif export_format == ExportFormat.EXCEL:
-                self._data_session.df.to_excel(path_obj, index=False)
-            elif export_format == ExportFormat.PARQUET:
-                self._data_session.df.to_parquet(path_obj, index=False)
-            else:
-                return {"success": False, "error": f"Unsupported format: {export_format}"}
-
-            return {
-                "success": True,
-                "file_path": str(path_obj),
-                "rows": len(self._data_session.df),
-                "columns": len(self._data_session.df.columns),
-            }
-        except (OSError, PermissionError, ValueError, TypeError, UnicodeError) as e:
-            return {"success": False, "error": str(e)}
-
     async def clear(self) -> None:
         """Clear session data to free memory."""
         # Clear data session
diff --git a/src/databeak/core/settings.py b/src/databeak/core/settings.py
index 18f57ef..8afa30c 100644
--- a/src/databeak/core/settings.py
+++ b/src/databeak/core/settings.py
@@ -8,27 +8,67 @@
 from pydantic_settings import BaseSettings


-class DataBeakSettings(BaseSettings):
-    """Configuration settings for session management."""
+class DatabeakSettings(BaseSettings):
+    """Configuration settings for DataBeak operations.

-    max_file_size_mb: int = Field(default=1024, description="Maximum file size limit in megabytes")
+    Settings are organized into categories:
+
+    Session Management:
+    - session_timeout: How long sessions stay alive
+    - session_capacity_warning_threshold: When to warn about session capacity
+
+    Health Monitoring (for health_check tool):
+    - health_memory_threshold_mb: Total server memory limit for health status
+    - memory_warning_threshold: Ratio that triggers "degraded" status (75%)
+    - memory_critical_threshold: Ratio that triggers "unhealthy" status (90%)
+
+    Data Loading Limits (enforced during load_csv_from_url/load_csv_from_content):
+    - max_memory_usage_mb: Hard limit for individual DataFrame memory
+    - max_rows: Hard limit for DataFrame row count
+    - url_timeout_seconds: Network timeout for URL downloads
+    - max_download_size_mb: Maximum download size from URLs
+
+    Data Validation:
+    - Various thresholds for quality checks and anomaly detection
+    """
+
+    # Session management
     session_timeout: int = Field(default=3600, description="Session timeout in seconds")
-    chunk_size: int = Field(
-        default=10000,
-        description="Default chunk size for processing large datasets",
+    session_capacity_warning_threshold: float = Field(
+        default=0.90, description="Session capacity ratio that triggers warning (0.0-1.0)"
     )
-    memory_threshold_mb: int = Field(
-        default=2048, description="Memory usage threshold in MB for health monitoring"
+
+    # Health monitoring thresholds (used by health_check for server status)
+    health_memory_threshold_mb: int = Field(
+        default=2048,
+        description="Total server memory threshold in MB for health status monitoring (not a hard limit)",
     )
     memory_warning_threshold: float = Field(
-        default=0.75, description="Memory usage ratio that triggers warning status (0.0-1.0)"
+        default=0.75,
+        description="Memory usage ratio that triggers 'degraded' health status (0.0-1.0)",
     )
     memory_critical_threshold: float = Field(
-        default=0.90, description="Memory usage ratio that triggers critical status (0.0-1.0)"
+        default=0.90,
+        description="Memory usage ratio that triggers 'unhealthy' health status (0.0-1.0)",
     )
-    session_capacity_warning_threshold: float = Field(
-        default=0.90, description="Session capacity ratio that triggers warning (0.0-1.0)"
+
+    # Data loading limits (hard limits enforced during CSV loading operations)
+    max_memory_usage_mb: int = Field(
+        default=1000,
+        description="Maximum memory in MB for individual DataFrames (hard limit, loading fails if exceeded)",
+    )
+    max_rows: int = Field(
+        default=1_000_000,
+        description="Maximum rows per DataFrame (hard limit, loading fails if exceeded)",
+    )
+    url_timeout_seconds: int = Field(
+        default=30, description="Network timeout for URL downloads in seconds"
     )
+    max_download_size_mb: int = Field(
+        default=100, description="Maximum download size for URLs in MB (hard limit)"
+    )
+
+    # Data validation and analysis
     max_validation_violations: int = Field(
         default=1000, description="Maximum number of validation violations to report"
     )
@@ -36,11 +76,6 @@ class DataBeakSettings(BaseSettings):
         default=10000, description="Maximum sample size for anomaly detection operations"
     )

-    # Encoding detection thresholds
-    encoding_confidence_threshold: float = Field(
-        default=0.7, description="Minimum confidence threshold for encoding detection"
-    )
-
     # Data validation thresholds
     data_completeness_threshold: float = Field(
         default=0.5, description="Threshold for determining if data is complete enough"
@@ -78,16 +113,16 @@ class DataBeakSettings(BaseSettings):
     model_config = {"env_prefix": "DATABEAK_", "case_sensitive": False}


-_settings: DataBeakSettings | None = None
+_settings: DatabeakSettings | None = None
 _lock = threading.Lock()


-def create_settings() -> DataBeakSettings:
+def create_settings() -> DatabeakSettings:
     """Create a new DataBeak settings instance."""
-    return DataBeakSettings()
+    return DatabeakSettings()


-def get_settings() -> DataBeakSettings:
+def get_settings() -> DatabeakSettings:
     """Create or get the global DataBeak settings instance."""
     global _settings  # noqa: PLW0603
     if _settings is None:
diff --git a/src/databeak/instructions.md b/src/databeak/instructions.md
index 7efa967..8f1a17d 100644
--- a/src/databeak/instructions.md
+++ b/src/databeak/instructions.md
@@ -13,12 +13,12 @@ comprehensive error handling.
 **Modular Architecture**: Tools are organized into logical categories:

 - **System**: Health checks and server information
-- **I/O**: Loading and exporting data with format flexibility
+- **I/O**: Loading data from web sources (URLs, string content)
 - **Data**: Filtering, sorting, transformations, and column operations
 - **Row**: Precise row-level and cell-level access and manipulation
 - **Analytics**: Statistical analysis and data profiling
 - **Validation**: Data quality checks and schema validation
-- **System**: Session management and health monitoring
+- **Session**: Session management and health monitoring

 ## 📐 Coordinate System (Critical for AI Success)

@@ -70,11 +70,11 @@ sort_data(session_id, ["name"]) # Sort by column
 insert_row(session_id, -1, {"name": "Alice", "email": null}) # Add with nulls
 ```

-### Step 3: Analyze and Export
+### Step 3: Analyze Results

 ```python
 get_statistics(session_id, ["age"]) # Statistical analysis
-export_csv(session_id, "results.csv") # Save processed data
+# Note: DataBeak processes data in memory for web-based hosting security
 ```

 ## Enhanced Resource Endpoints
diff --git a/src/databeak/models/__init__.py b/src/databeak/models/__init__.py
index 01e64b4..f51fc16 100644
--- a/src/databeak/models/__init__.py
+++ b/src/databeak/models/__init__.py
@@ -1,4 +1,4 @@
-"""Data models for CSV Editor MCP Server."""
+"""Data models for DataBeak MCP Server."""

 from __future__ import annotations

@@ -12,7 +12,6 @@
     DataPreview,
     DataStatistics,
     DataType,
-    ExportFormat,
     FilterCondition,
     LogicalOperator,
     OperationResult,
@@ -28,7 +27,6 @@
     "DataPreview",
     "DataStatistics",
     "DataType",
-    "ExportFormat",
     "FilterCondition",
     "LogicalOperator",
     "OperationResult",
diff --git a/src/databeak/models/data_models.py b/src/databeak/models/data_models.py
index edca1a7..8e4cb8d 100644
--- a/src/databeak/models/data_models.py
+++ b/src/databeak/models/data_models.py
@@ -71,18 +71,6 @@ class AggregateFunction(str, Enum):
     LAST = "last"


-class ExportFormat(str, Enum):
-    """Supported export formats."""
-
-    CSV = "csv"
-    TSV = "tsv"
-    JSON = "json"
-    EXCEL = "excel"
-    PARQUET = "parquet"
-    HTML = "html"
-    MARKDOWN = "markdown"
-
-
 class FilterCondition(BaseModel):
     """A single filter condition."""

diff --git a/src/databeak/models/tool_responses.py b/src/databeak/models/tool_responses.py
index 66ff201..496f054 100644
--- a/src/databeak/models/tool_responses.py
+++ b/src/databeak/models/tool_responses.py
@@ -82,8 +82,7 @@ class ServerInfoResult(BaseToolResponse):
     capabilities: dict[str, list[str]] = Field(
         description="Available operations organized by category",
     )
-    supported_formats: list[str] = Field(description="Supported file formats and extensions")
-    max_file_size_mb: int = Field(description="Maximum file size limit in MB")
+    max_download_size_mb: int = Field(description="Maximum download size from URLs in MB")
     session_timeout_minutes: int = Field(description="Default session timeout in minutes")


diff --git a/src/databeak/server.py b/src/databeak/server.py
index 40aef3f..3dfc49a 100644
--- a/src/databeak/server.py
+++ b/src/databeak/server.py
@@ -14,6 +14,7 @@

 # This module will tweak the JSON schema validator to accept relaxed types
 from databeak.core.json_schema_validate import initialize_relaxed_validation
+from databeak.core.settings import DatabeakSettings

 # Local imports
 from databeak.servers.column_server import column_server
@@ -93,7 +94,7 @@ def data_cleaning_prompt(session_id: str) -> str:
 # ============================================================================


-@smithery.server()
+@smithery.server(config_schema=DatabeakSettings)
 def create_server() -> FastMCP:
     """Create and return the FastMCP server instance."""
     # Initialize FastMCP server
diff --git a/src/databeak/servers/io_server.py b/src/databeak/servers/io_server.py
index c0803c1..444e7b8 100644
--- a/src/databeak/servers/io_server.py
+++ b/src/databeak/servers/io_server.py
@@ -1,35 +1,31 @@
 """Standalone I/O server for DataBeak using FastMCP server composition.

 This module provides a complete I/O server implementation following DataBeak's modular server
-architecture pattern. It includes comprehensive CSV loading, export, and session management
-capabilities with robust error handling and AI-optimized documentation.
+architecture pattern. It includes comprehensive CSV loading from web sources and session
+management capabilities with robust error handling and AI-optimized documentation.
 """

 from __future__ import annotations

 import logging
-import socket
 from abc import ABC, abstractmethod
 from io import StringIO
-from pathlib import Path
-from typing import Annotated, Any, Literal
-from urllib.error import HTTPError, URLError
-from urllib.request import urlopen
+from typing import Annotated, Literal

-import chardet
+import httpx
 import pandas as pd
 from fastmcp import Context, FastMCP
 from fastmcp.exceptions import ToolError
 from pydantic import BaseModel, Discriminator, Field, NonNegativeInt

-from databeak.core.session import get_session_data, get_session_manager, get_session_only
+from databeak.core.session import get_session_manager, get_session_only
 from databeak.core.settings import get_settings

 # Import session management and data models from the main package
-from databeak.models import DataPreview, ExportFormat
+from databeak.models import DataPreview
 from databeak.models.tool_responses import BaseToolResponse
 from databeak.services.data_operations import create_data_preview_with_indices
-from databeak.utils.validators import validate_file_path, validate_url
+from databeak.utils.validators import validate_url

 logger = logging.getLogger(__name__)

@@ -97,12 +93,11 @@ def resolve_header_param(config: HeaderConfig) -> int | None | Literal["infer"]:
     return config.get_pandas_param()


-# Configuration constants
-MAX_FILE_SIZE_MB = 500  # Maximum file size in MB
-MAX_MEMORY_USAGE_MB = 1000  # Maximum memory usage in MB for DataFrames
-MAX_ROWS = 1_000_000  # Maximum number of rows to prevent memory issues
-URL_TIMEOUT_SECONDS = 30  # Timeout for URL downloads
-MAX_URL_SIZE_MB = 100  # Maximum download size for URLs
+# Note: All configuration constants moved to DataBeakSettings for configurability
+# - max_memory_usage_mb: Maximum memory usage in MB for DataFrames
+# - max_rows: Maximum number of rows to prevent memory issues
+# - url_timeout_seconds: Timeout for URL downloads
+# - max_download_size_mb: Maximum download size from URLs

 # ============================================================================
 # PYDANTIC MODELS FOR I/O OPERATIONS
@@ -118,17 +113,6 @@ class LoadResult(BaseToolResponse):
     memory_usage_mb: float | None = Field(None, description="Memory usage in megabytes")


-class ExportResult(BaseToolResponse):
-    """Response model for data export operations."""
-
-    file_path: str = Field(description="Path to exported file")
-    format: Literal["csv", "tsv", "json", "excel", "parquet", "html", "markdown"] = Field(
-        description="Export format used"
-    )
-    rows_exported: int = Field(description="Number of rows exported")
-    file_size_mb: float | None = Field(None, description="Size of exported file in megabytes")
-
-
 class SessionInfoResult(BaseToolResponse):
     """Response model for session information."""

@@ -144,44 +128,6 @@ class SessionInfoResult(BaseToolResponse):
 # ============================================================================


-# Implementation: uses chardet for automatic detection with confidence validation
-# Falls back to prioritized common encodings if detection fails or low confidence
-# Reads 10KB sample for fast detection without loading full file
-def detect_file_encoding(file_path: str) -> str:
-    """Detect file encoding using chardet with optimized fallbacks."""
-    try:
-        # Read sample bytes for detection (first 10KB should be enough)
-        with open(file_path, "rb") as f:  # noqa: PTH123
-            raw_data = f.read(10240)  # 10KB sample
-
-        # Use chardet for automatic detection
-        detection = chardet.detect(raw_data)
-        settings = get_settings()
-
-        if detection and detection["confidence"] > settings.encoding_confidence_threshold:
-            detected_encoding = detection["encoding"]
-            if detected_encoding:
-                logger.debug(
-                    "Chardet detected encoding: %s (confidence: %.2f)",
-                    detected_encoding,
-                    detection["confidence"],
-                )
-                return detected_encoding.lower()
-            logger.debug("Chardet detected encoding is None, using fallbacks")
-
-        logger.debug(
-            "Chardet detection low confidence (%.2f), using fallbacks",
-            detection["confidence"] if detection else 0,
-        )
-
-    except (ImportError, AttributeError, UnicodeError, OSError) as e:
-        logger.debug("Chardet detection failed: %s, using fallbacks", e)
-
-    # Fallback to common encodings in priority order
-    # UTF-8 first (most common), then Windows encodings, then Latin variants
-    return "utf-8"
-
-
 # Implementation: prioritizes encoding groups by primary encoding type
 # UTF variants -> Windows encodings -> Latin variants -> Asian encodings
 # Removes duplicates while preserving priority order
@@ -236,165 +182,29 @@ def validate_dataframe_size(df: pd.DataFrame) -> None:
         ToolError: If DataFrame exceeds size limits

     """
-    if len(df) > MAX_ROWS:
-        msg = f"File too large: {len(df):,} rows exceeds limit of {MAX_ROWS:,} rows"
-        raise ToolError(msg)
+    settings = get_settings()

-    memory_usage_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
-    if memory_usage_mb > MAX_MEMORY_USAGE_MB:
-        msg = f"File too large: {memory_usage_mb:.1f} MB exceeds memory limit of {MAX_MEMORY_USAGE_MB} MB"
+    if len(df) > settings.max_rows:
+        msg = f"Data too large: {len(df):,} rows exceeds limit of {settings.max_rows:,} rows"
         raise ToolError(msg)

-
-# Implementation: RFC 4180 compliant CSV parsing with automatic encoding detection
-# Supports quoted fields, escaped quotes, mixed quoting, automatic type detection
-# Memory limits: MAX_ROWS, MAX_FILE_SIZE_MB, MAX_MEMORY_USAGE_MB validation
-# Encoding fallback strategy with chardet detection and prioritized fallbacks
-# Progress reporting and comprehensive error handling with specific error messages
-async def load_csv(
-    ctx: Annotated[Context, Field(description="FastMCP context for session access")],
-    file_path: Annotated[str, Field(description="Path to the CSV file to load")],
-    encoding: Annotated[
-        str, Field(description="Text encoding for file reading (utf-8, latin1, cp1252, etc.)")
-    ] = "utf-8",
-    delimiter: Annotated[
-        str, Field(description="Column delimiter character (comma, tab, semicolon, pipe)")
-    ] = ",",
-    header_config: Annotated[
-        HeaderConfigUnion | None,
-        Field(default=None, description="Header detection configuration"),
-    ] = None,
-    na_values: Annotated[
-        list[str] | None, Field(description="Additional strings to recognize as NA/NaN")
-    ] = None,
-    parse_dates: Annotated[list[str] | None, Field(description="Columns to parse as dates")] = None,
-) -> LoadResult:
-    """Load CSV file into DataBeak session.
-
-    Parses CSV data with encoding detection and error handling. Returns session ID and data preview
-    for further operations.
-    """
-    # Get session_id from FastMCP context
-    session_id = ctx.session_id
-
-    # Validate file path
-    is_valid, validated_path = validate_file_path(file_path)
-    if not is_valid:
-        msg = f"Invalid file path: {validated_path}"
-
-        raise ToolError(msg)
-
-    await ctx.info(f"Loading CSV file: {validated_path}")
-    await ctx.report_progress(0.1)
-
-    # Check file size before attempting to load
-    file_size_mb = Path(validated_path).stat().st_size / (1024 * 1024)
-    if file_size_mb > MAX_FILE_SIZE_MB:
-        msg = f"File size {file_size_mb:.1f}MB exceeds limit of {MAX_FILE_SIZE_MB}MB"
-
+    memory_usage_mb = df.memory_usage(deep=True).sum() / (1024 * 1024)
+    if memory_usage_mb > settings.max_memory_usage_mb:
+        msg = f"Data too large: {memory_usage_mb:.1f} MB exceeds memory limit of {settings.max_memory_usage_mb} MB"
         raise ToolError(msg)

-    await ctx.info(f"File size: {file_size_mb:.2f} MB")

-    # Get or create session
-    session_manager = get_session_manager()
-    session = session_manager.get_or_create_session(session_id)
+def create_load_result(df: pd.DataFrame) -> LoadResult:
+    """Create LoadResult from a DataFrame.

-    await ctx.report_progress(0.3)
-
-    # Handle default header configuration
-    if header_config is None:
-        header_config = AutoDetectHeader()
-
-    # Build pandas read_csv parameters
-    # Using dict[str, Any] due to pandas read_csv's complex overloaded signature
-    read_params: dict[str, Any] = {
-        "filepath_or_buffer": validated_path,
-        "encoding": encoding,
-        "delimiter": delimiter,
-        "header": resolve_header_param(header_config),
-        # Note: Temporarily disabled dtype_backend="numpy_nullable" due to serialization issues
-    }
-
-    if na_values:
-        read_params["na_values"] = na_values
-    if parse_dates:
-        read_params["parse_dates"] = parse_dates
-
-    # Load CSV with comprehensive error handling
-    try:
-        # Add memory-conscious parameters for large files
-        df = pd.read_csv(
-            **read_params, chunksize=None
-        )  # Keep as None for now but ready for streaming
-        validate_dataframe_size(df)
-    except UnicodeDecodeError as e:
-        # Use optimized encoding detection and fallbacks
-        df = None
-        last_error = e
-
-        await ctx.info("Encoding error detected, trying automatic detection...")
-
-        # First, try automatic encoding detection
-        try:
-            detected_encoding = detect_file_encoding(validated_path)
-            if detected_encoding != encoding:
-                logger.info("Auto-detected encoding: %s", detected_encoding)
-                await ctx.info(f"Auto-detected encoding: {detected_encoding}")
-
-                read_params["encoding"] = detected_encoding
-                df = pd.read_csv(**read_params)
-                validate_dataframe_size(df)
-
-                logger.info(
-                    "Successfully loaded with auto-detected encoding: %s", detected_encoding
-                )
-
-        except Exception as detection_error:
-            logger.debug("Auto-detection failed: %s, trying prioritized fallbacks", detection_error)
-
-            # Fall back to optimized encoding list
-            fallback_encodings = get_encoding_fallbacks(encoding)
-
-            for alt_encoding in fallback_encodings:
-                if alt_encoding != encoding:  # Skip the original encoding we already tried
-                    try:
-                        read_params["encoding"] = alt_encoding
-                        df = pd.read_csv(**read_params)
-                        validate_dataframe_size(df)
-
-                        logger.warning(
-                            "Used fallback encoding %s instead of %s", alt_encoding, encoding
-                        )
-                        await ctx.info(
-                            f"Used fallback encoding {alt_encoding} due to encoding error"
-                        )
-                        break
-                    except UnicodeDecodeError as fallback_error:
-                        last_error = fallback_error
-                        continue
-                    except Exception as other_error:
-                        logger.debug("Failed with encoding %s: %s", alt_encoding, other_error)
-                        continue
-            else:
-                # All encodings failed
-                msg = f"Encoding error with all attempted encodings: {last_error}. Try specifying a different encoding or check file format."
-                raise ToolError(msg) from last_error
-
-        if df is None:
-            msg = f"Failed to load CSV with any encoding: {last_error}"
-
-            raise ToolError(msg) from last_error
-
-    await ctx.report_progress(0.8)
-
-    # Load into session
-    session.load_data(df, validated_path)
+    Args:
+        df: Loaded DataFrame

-    await ctx.report_progress(1.0)
-    await ctx.info(f"Loaded {len(df)} rows and {len(df.columns)} columns")
+    Returns:
+        LoadResult with data preview and metadata

-    # Create comprehensive data preview with indices
+    """
+    # Create data preview with indices
     preview_data = create_data_preview_with_indices(df, 5)
     data_preview = DataPreview(
         rows=preview_data["records"],
@@ -414,7 +224,7 @@ async def load_csv(
 # Implementation: HTTP/HTTPS download with security validation and timeouts
 # Blocks private networks, validates content-type, enforces size limits
 # Uses same encoding fallback strategy as file loading
-# Timeout: URL_TIMEOUT_SECONDS, Max download: MAX_URL_SIZE_MB
+# Configurable via DataBeakSettings: url_timeout_seconds, max_download_size_mb
 async def load_csv_from_url(
     ctx: Annotated[Context, Field(description="FastMCP context for session access")],
     url: Annotated[str, Field(description="URL of the CSV file to download and load")],
@@ -436,6 +246,7 @@ async def load_csv_from_url(
     """
     # Get session_id from FastMCP context
     session_id = ctx.session_id
+    settings = get_settings()

     # Handle default header configuration
     if header_config is None:
@@ -456,13 +267,15 @@ async def load_csv_from_url(
     # Pre-download validation with timeout and content-type checking
     await ctx.info("Verifying URL and downloading content...")

-    # Set socket timeout for all operations
-    socket.setdefaulttimeout(URL_TIMEOUT_SECONDS)
+    # Use async HTTP client for non-blocking download
+    async with httpx.AsyncClient(timeout=settings.url_timeout_seconds) as client:
+        # HEAD request first to check content-type and size
+        head_response = await client.head(url, follow_redirects=True)
+        head_response.raise_for_status()

-    with urlopen(url, timeout=URL_TIMEOUT_SECONDS) as response:  # nosec B310 # noqa: S310, ASYNC210
         # Verify content-type
-        content_type = response.headers.get("Content-Type", "").lower()
-        content_length = response.headers.get("Content-Length")
+        content_type = head_response.headers.get("content-type", "").lower()
+        content_length = head_response.headers.get("content-length")

         # Check content type
         valid_content_types = [
@@ -479,70 +292,55 @@ async def load_csv_from_url(

         # Check content length
         if content_length:
-            size_mb = int(content_length) / (1024 * 1024)
-            if size_mb > MAX_URL_SIZE_MB:
-                msg = f"Download too large: {size_mb:.1f} MB exceeds limit of {MAX_URL_SIZE_MB} MB"
+            download_size_mb = int(content_length) / (1024 * 1024)
+            if download_size_mb > settings.max_download_size_mb:
+                msg = f"Download too large: {download_size_mb:.1f} MB exceeds limit of {settings.max_download_size_mb} MB"
                 raise ToolError(msg)

         await ctx.info(f"Download validated.
Content-type: {content_type or 'unknown'}") await ctx.report_progress(0.3) - # Download and parse CSV using pandas with timeout + # Download CSV content with size enforcement + max_bytes = settings.max_download_size_mb * 1024 * 1024 + downloaded_bytes = 0 + chunks = [] + + async with client.stream("GET", url, follow_redirects=True) as response: + response.raise_for_status() + + async for chunk in response.aiter_bytes(chunk_size=8192): + downloaded_bytes += len(chunk) + if downloaded_bytes > max_bytes: + msg = f"Download exceeded size limit of {settings.max_download_size_mb} MB during transfer" + raise ToolError(msg) + chunks.append(chunk) + + # Decode downloaded content + csv_bytes = b"".join(chunks) + csv_content = csv_bytes.decode("utf-8", errors="replace") + + # Parse CSV from downloaded content df = pd.read_csv( - url, + StringIO(csv_content), encoding=encoding, delimiter=delimiter, header=resolve_header_param(header_config), ) validate_dataframe_size(df) - except (TimeoutError, URLError, HTTPError) as e: + except (httpx.TimeoutException, httpx.HTTPError, httpx.RequestError) as e: logger.exception("Network error downloading URL") await ctx.error(f"Network error: {e}") msg = f"Network error: {e}" raise ToolError(msg) from e except UnicodeDecodeError as e: - # Use optimized encoding fallbacks for URL downloads - df = None - last_error = e - - await ctx.info("URL encoding error, trying optimized fallbacks...") - - # Use the same optimized fallback strategy - fallback_encodings = get_encoding_fallbacks(encoding) - - for alt_encoding in fallback_encodings: - if alt_encoding != encoding: # Skip the original encoding we already tried - try: - df = pd.read_csv( - url, - encoding=alt_encoding, - delimiter=delimiter, - header=resolve_header_param(header_config), - ) - validate_dataframe_size(df) - - logger.warning( - "Used fallback encoding %s instead of %s", alt_encoding, encoding - ) - await ctx.info(f"Used fallback encoding {alt_encoding} due to encoding error") - 
break - except UnicodeDecodeError as fallback_error: - last_error = fallback_error - continue - except Exception as other_error: - logger.debug("Failed with encoding %s: %s", alt_encoding, other_error) - continue - else: - msg = f"Encoding error with all attempted encodings: {last_error}. Try specifying a different encoding." - raise ToolError(msg) from last_error - - if df is None: - msg = f"Failed to download CSV with any encoding: {last_error}" - - raise ToolError(msg) from last_error + # CSV parsing succeeded but encoding specified doesn't match content + # This shouldn't happen with httpx.response.text (auto-detects encoding) + # but keeping fallback for edge cases + msg = f"Encoding error: {e}. The downloaded content encoding doesn't match '{encoding}'." + raise ToolError(msg) from e await ctx.report_progress(0.8) @@ -559,21 +357,7 @@ async def load_csv_from_url( await ctx.report_progress(1.0) await ctx.info(f"Loaded {len(df)} rows and {len(df.columns)} columns from URL") - # Create data preview with indices - preview_data = create_data_preview_with_indices(df, 5) - data_preview = DataPreview( - rows=preview_data["records"], - row_count=preview_data["total_rows"], - column_count=preview_data["total_columns"], - truncated=preview_data["preview_rows"] < preview_data["total_rows"], - ) - - return LoadResult( - rows_affected=len(df), - columns_affected=[str(col) for col in df.columns], - data=data_preview, - memory_usage_mb=df.memory_usage(deep=True).sum() / (1024 * 1024), - ) + return create_load_result(df) # Implementation: parses CSV from string using StringIO with pandas read_csv @@ -628,6 +412,9 @@ async def load_csv_from_content( msg = "Parsed CSV contains no data rows" raise ToolError(msg) + # Validate DataFrame size against limits + validate_dataframe_size(df) + # Get or create session session_manager = get_session_manager() session = session_manager.get_or_create_session(session_id) @@ -635,131 +422,7 @@ async def load_csv_from_content( await 
ctx.info(f"Loaded {len(df)} rows and {len(df.columns)} columns from content") - # Create data preview with indices - preview_data = create_data_preview_with_indices(df, 5) - data_preview = DataPreview( - rows=preview_data["records"], - row_count=preview_data["total_rows"], - column_count=preview_data["total_columns"], - truncated=preview_data["preview_rows"] < preview_data["total_rows"], - ) - - return LoadResult( - rows_affected=len(df), - columns_affected=[str(col) for col in df.columns], - data=data_preview, - memory_usage_mb=df.memory_usage(deep=True).sum() / (1024 * 1024), - ) - - -# Implementation: supports 7 export formats with auto-generated filenames using tempfile -# Format-specific parameters: CSV (RFC 4180), TSV (tab delimiter), JSON (records), Excel (XLSX) -# Parquet (columnar), HTML (web table), Markdown (GitHub format) -# Auto-cleanup on export errors, records operation in session history -async def export_csv( - ctx: Annotated[Context, Field(description="FastMCP context for session access")], - file_path: Annotated[ - str, - Field(description="Output file path - must be a valid path that can be parsed by Path()"), - ], - encoding: Annotated[ - str, Field(description="Text encoding for output file (utf-8, latin1, cp1252, etc.)") - ] = "utf-8", - *, - index: Annotated[bool, Field(description="Whether to include row index in output")] = False, -) -> ExportResult: - """Export session data to various file formats. - - Supports CSV, TSV, JSON, Excel, Parquet, HTML, and Markdown formats. Returns file path and - export statistics. 
- """ - # Get session_id from FastMCP context - session_id = ctx.session_id - - # Get session and validate data - _session, df = get_session_data(session_id) - - # Validate and parse the file path - try: - path_obj = Path(file_path) - except Exception as path_error: - msg = f"Invalid file path provided: {file_path}" - - raise ToolError(msg) from path_error - - # Infer format from file extension - suffix = path_obj.suffix.lower() - format_mapping = { - ".csv": ExportFormat.CSV, - ".tsv": ExportFormat.TSV, - ".json": ExportFormat.JSON, - ".xlsx": ExportFormat.EXCEL, - ".xls": ExportFormat.EXCEL, - ".parquet": ExportFormat.PARQUET, - ".html": ExportFormat.HTML, - ".htm": ExportFormat.HTML, - ".md": ExportFormat.MARKDOWN, - ".markdown": ExportFormat.MARKDOWN, - } - - # Default to CSV if suffix not recognized - format_enum = format_mapping.get(suffix, ExportFormat.CSV) - - await ctx.info(f"Exporting data in {format_enum.value} format to {file_path}") - await ctx.report_progress(0.1) - - # Create parent directory if it doesn't exist - path_obj.parent.mkdir(parents=True, exist_ok=True) - - await ctx.report_progress(0.5) - - # Export based on format with comprehensive options - try: - if format_enum == ExportFormat.CSV: - df.to_csv(path_obj, encoding=encoding, index=index, lineterminator="\n") - elif format_enum == ExportFormat.TSV: - df.to_csv(path_obj, sep="\t", encoding=encoding, index=index, lineterminator="\n") - elif format_enum == ExportFormat.JSON: - df.to_json(path_obj, orient="records", indent=2, force_ascii=False) - elif format_enum == ExportFormat.EXCEL: - with pd.ExcelWriter(path_obj, engine="openpyxl") as writer: - df.to_excel(writer, sheet_name="Data", index=index) - elif format_enum == ExportFormat.PARQUET: - df.to_parquet(path_obj, index=index, engine="pyarrow") - elif format_enum == ExportFormat.HTML: - df.to_html(path_obj, index=index, escape=False, table_id="data-table") - elif format_enum == ExportFormat.MARKDOWN: - df.to_markdown(path_obj, 
index=index, tablefmt="github") - else: - msg = f"Unsupported format: {format_enum}" - - raise ToolError(msg) - except (OSError, pd.errors.EmptyDataError, ValueError, ImportError) as export_error: - # Provide format-specific error guidance - if format_enum == ExportFormat.EXCEL and "openpyxl" in str(export_error): - msg = "Excel export requires openpyxl package. Install with: pip install openpyxl" - raise ToolError(msg) from export_error - if format_enum == ExportFormat.PARQUET and "pyarrow" in str(export_error): - msg = "Parquet export requires pyarrow package. Install with: pip install pyarrow" - raise ToolError(msg) from export_error - msg = f"Export failed: {export_error}" - - raise ToolError(msg) from export_error - - # No longer recording operations (simplified MCP architecture) - - await ctx.report_progress(1.0) - await ctx.info(f"Exported {len(df)} rows to {file_path}") - - # Calculate file size - file_size_mb = path_obj.stat().st_size / (1024 * 1024) if path_obj.exists() else 0 - - return ExportResult( - file_path=str(file_path), - format=format_enum.value, - rows_exported=len(df), - file_size_mb=round(file_size_mb, 3), - ) + return create_load_result(df) # Implementation: retrieves session metadata from session manager @@ -800,13 +463,11 @@ async def get_session_info( # Create I/O server io_server = FastMCP( "DataBeak-IO", - instructions="I/O operations server for DataBeak with comprehensive CSV loading and export capabilities", + instructions="I/O operations server for DataBeak with comprehensive CSV loading from web sources (URLs and string content)", ) # Register the logic functions directly as MCP tools (no wrapper functions needed) -io_server.tool(name="load_csv")(load_csv) io_server.tool(name="load_csv_from_url")(load_csv_from_url) io_server.tool(name="load_csv_from_content")(load_csv_from_content) -io_server.tool(name="export_csv")(export_csv) io_server.tool(name="get_session_info")(get_session_info) diff --git 
a/src/databeak/servers/system_server.py b/src/databeak/servers/system_server.py
index d708d03..36bb86f 100644
--- a/src/databeak/servers/system_server.py
+++ b/src/databeak/servers/system_server.py
@@ -18,7 +18,7 @@
 # Import version and session management from main package
 from databeak._version import __version__
 from databeak.core.session import get_session_manager
-from databeak.core.settings import DataBeakSettings, get_settings
+from databeak.core.settings import DatabeakSettings, get_settings
 from databeak.models.tool_responses import HealthResult, ServerInfoResult

 logger = logging.getLogger(__name__)
@@ -27,7 +27,7 @@
 # MEMORY MONITORING CONSTANTS AND UTILITIES
 # ============================================================================

-# Memory monitoring will use configurable thresholds from DataBeakSettings
+# Memory monitoring will use configurable thresholds from DatabeakSettings


 def get_memory_usage() -> float:
@@ -50,7 +50,7 @@ def get_memory_usage() -> float:


 def get_memory_status(
-    current_mb: float, threshold_mb: float, settings: DataBeakSettings | None = None
+    current_mb: float, threshold_mb: float, settings: DatabeakSettings | None = None
 ) -> str:
     """Determine memory status based on configurable thresholds.
@@ -97,8 +97,8 @@ async def health_check( # Get memory information current_memory_mb = get_memory_usage() - memory_threshold_mb = float(settings.memory_threshold_mb) - memory_status = get_memory_status(current_memory_mb, memory_threshold_mb) + health_threshold_mb = float(settings.health_memory_threshold_mb) + memory_status = get_memory_status(current_memory_mb, health_threshold_mb) # Determine overall health status status = "healthy" @@ -116,13 +116,13 @@ async def health_check( if memory_status == "critical": status = "unhealthy" await ctx.error( - f"Critical memory usage: {current_memory_mb:.1f}MB / {memory_threshold_mb:.1f}MB" + f"Critical memory usage: {current_memory_mb:.1f}MB / {health_threshold_mb:.1f}MB" ) elif memory_status == "warning": if status == "healthy": status = "degraded" await ctx.warning( - f"High memory usage: {current_memory_mb:.1f}MB / {memory_threshold_mb:.1f}MB" + f"High memory usage: {current_memory_mb:.1f}MB / {health_threshold_mb:.1f}MB" ) await ctx.info( @@ -137,7 +137,7 @@ async def health_check( max_sessions=session_manager.max_sessions, session_ttl_minutes=session_manager.ttl_minutes, memory_usage_mb=current_memory_mb, - memory_threshold_mb=memory_threshold_mb, + memory_threshold_mb=health_threshold_mb, memory_status=memory_status, history_operations_total=0, # History operations tracking removed history_limit_per_session=0, # History operations tracking removed @@ -205,11 +205,8 @@ async def get_server_info( description="A comprehensive MCP server for CSV file operations and data analysis", capabilities={ "data_io": [ - "load_csv", "load_csv_from_url", "load_csv_from_content", - "export_csv", - "multiple_export_formats", ], "data_manipulation": [ "filter_rows", @@ -249,16 +246,7 @@ async def get_server_info( "null_value_updates", ], }, - supported_formats=[ - "csv", - "tsv", - "json", - "excel", - "parquet", - "html", - "markdown", - ], - max_file_size_mb=settings.max_file_size_mb, + 
max_download_size_mb=settings.max_download_size_mb, session_timeout_minutes=settings.session_timeout // 60, ) diff --git a/tests/integration/test_csv_loading.py b/tests/integration/test_csv_loading.py deleted file mode 100644 index 7cc8c5e..0000000 --- a/tests/integration/test_csv_loading.py +++ /dev/null @@ -1,123 +0,0 @@ -"""Integration tests for CSV loading functionality.""" - -from pathlib import Path - -import pytest -from fastmcp import Client -from fastmcp.client.transports import FastMCPTransport - -from tests.integration.conftest import get_fixture_path - - -class TestCsvLoading: - """Test CSV file loading and basic operations.""" - - @pytest.mark.asyncio - async def test_load_sample_data(self, databeak_client: Client[FastMCPTransport]) -> None: - """Test loading a sample CSV file.""" - # Get the real path to the fixture - csv_path = get_fixture_path("sample_data.csv") - - # Load the CSV file - result = await databeak_client.call_tool("load_csv", {"file_path": csv_path}) - - # Should return a CallToolResult - - # Verify the result contains expected data - assert result.is_error is False - - @pytest.mark.asyncio - async def test_header_auto_detect(self, databeak_client: Client[FastMCPTransport]) -> None: - """Test auto-detection of headers.""" - csv_path = get_fixture_path("sample_data.csv") - - # Test auto-detect header mode (default) - result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "auto"}} - ) - - assert result.is_error is False - - @pytest.mark.asyncio - async def test_header_explicit_row(self, databeak_client: Client[FastMCPTransport]) -> None: - """Test explicit row number for headers.""" - csv_path = get_fixture_path("sample_data.csv") - - # Test explicit row 0 as header - result = await databeak_client.call_tool( - "load_csv", - {"file_path": csv_path, "header_config": {"mode": "row", "row_number": 0}}, - ) - - assert result.is_error is False - - @pytest.mark.asyncio - async def 
test_header_no_header(self, databeak_client: Client[FastMCPTransport]) -> None: - """Test no header mode with generated column names.""" - csv_path = get_fixture_path("sample_data.csv") - - # Test no header mode - result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "none"}} - ) - - assert result.is_error is False - - @pytest.mark.asyncio - async def test_header_modes_produce_different_results( - self, databeak_client: Client[FastMCPTransport] - ) -> None: - """Test that different header modes actually produce different column structures.""" - csv_path = get_fixture_path("sample_data.csv") - - # Load with auto-detect (should use first row as headers: name, age, city, salary) - auto_result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "auto"}} - ) - assert auto_result.is_error is False - - # Load with no headers (should generate: Column_0, Column_1, Column_2, Column_3) - none_result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "none"}} - ) - assert none_result.is_error is False - - # The results should be different (different column names) - # Note: We can't directly compare column names here since we'd need session access - # But we can verify both loaded successfully with different structures - - @pytest.mark.asyncio - async def test_load_sales_data_and_get_info( - self, databeak_client: Client[FastMCPTransport] - ) -> None: - """Test loading sales data and getting session info.""" - # Load sales data - csv_path = get_fixture_path("sales_data.csv") - load_result = await databeak_client.call_tool("load_csv", {"file_path": csv_path}) - - # Verify the load was successful - assert load_result.is_error is False - - @pytest.mark.asyncio - async def test_load_missing_values_csv(self, databeak_client: Client[FastMCPTransport]) -> None: - """Test loading CSV with missing values.""" - csv_path = 
get_fixture_path("missing_values.csv") - - result = await databeak_client.call_tool("load_csv", {"file_path": csv_path}) - - # Verify the load was successful - assert result.is_error is False - - @pytest.mark.asyncio - async def test_fixture_path_resolution(self) -> None: - """Test that the fixture path helper works correctly.""" - csv_path = get_fixture_path("sample_data.csv") - - # Should be an absolute path - assert Path(csv_path).is_absolute() - - # Should end with the fixture name - assert csv_path.endswith("sample_data.csv") - - # Should contain tests/fixtures in the path - assert "tests/fixtures" in csv_path diff --git a/tests/integration/test_direct_client.py b/tests/integration/test_direct_client.py index eef3ea3..21b6646 100644 --- a/tests/integration/test_direct_client.py +++ b/tests/integration/test_direct_client.py @@ -19,7 +19,6 @@ async def test_direct_client_tool_listing(databeak_client: Client[FastMCPTranspo # Verify some key tools are present assert "get_session_info" in tool_names assert "load_csv_from_content" in tool_names - assert "export_csv" in tool_names @pytest.mark.asyncio diff --git a/tests/integration/test_fastmcp_client_fixture.py b/tests/integration/test_fastmcp_client_fixture.py index 594430e..68b8c1a 100644 --- a/tests/integration/test_fastmcp_client_fixture.py +++ b/tests/integration/test_fastmcp_client_fixture.py @@ -18,7 +18,6 @@ async def test_list_tools(databeak_client: Client[FastMCPTransport]) -> None: tool_names = [tool.name for tool in tools] assert "get_session_info" in tool_names assert "load_csv_from_content" in tool_names - assert "export_csv" in tool_names @pytest.mark.asyncio @@ -33,7 +32,7 @@ async def test_get_session_info(databeak_client: Client[FastMCPTransport]) -> No @pytest.mark.asyncio async def test_load_csv_workflow(databeak_client: Client[FastMCPTransport]) -> None: - """Test a complete workflow: load CSV data, check session info, export.""" + """Test a complete workflow: load CSV data and check session 
info.""" # Step 1: Load some CSV data csv_content = "name,age,city\nAlice,30,New York\nBob,25,Boston\nCharlie,35,Chicago" @@ -51,19 +50,6 @@ async def test_load_csv_workflow(databeak_client: Client[FastMCPTransport]) -> N assert "row_count" in info_text or "3" in info_text assert "column_count" in info_text or "3" in info_text - # Step 3: Export the data back - export_result = await databeak_client.call_tool( - "export_csv", {"file_path": "/tmp/test_export.csv", "index": False} - ) - - assert export_result.is_error is False - assert isinstance(export_result.content[0], TextContent) - exported_content = export_result.content[0].text - # The export result contains status information about the export - assert "success" in exported_content - assert "rows_exported" in exported_content - assert "3" in exported_content # Should indicate 3 rows were exported - @pytest.mark.asyncio async def test_session_isolation(databeak_client: Client[FastMCPTransport]) -> None: diff --git a/tests/integration/test_system_server_integration.py b/tests/integration/test_system_server_integration.py index 0d111b2..239a5e4 100644 --- a/tests/integration/test_system_server_integration.py +++ b/tests/integration/test_system_server_integration.py @@ -42,7 +42,6 @@ async def test_get_server_info_via_client( assert "DataBeak" in content assert "version" in content assert "capabilities" in content - assert "supported_formats" in content @pytest.mark.asyncio async def test_get_server_info_returns_actual_version_via_client( diff --git a/tests/integration/test_unified_header_system.py b/tests/integration/test_unified_header_system.py index 3242ba4..6824088 100644 --- a/tests/integration/test_unified_header_system.py +++ b/tests/integration/test_unified_header_system.py @@ -4,38 +4,10 @@ from fastmcp import Client from fastmcp.client.transports import FastMCPTransport -from tests.integration.conftest import get_fixture_path - class TestUnifiedHeaderSystem: """Test that all CSV loading functions use 
consistent HeaderConfig system.""" - @pytest.mark.asyncio - async def test_load_csv_all_header_modes( - self, databeak_client: Client[FastMCPTransport] - ) -> None: - """Test load_csv with all header configuration modes.""" - csv_path = get_fixture_path("sample_data.csv") - - # Test auto-detect mode - auto_result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "auto"}} - ) - assert auto_result.is_error is False - - # Test explicit row mode - row_result = await databeak_client.call_tool( - "load_csv", - {"file_path": csv_path, "header_config": {"mode": "row", "row_number": 0}}, - ) - assert row_result.is_error is False - - # Test no header mode - none_result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "none"}} - ) - assert none_result.is_error is False - @pytest.mark.asyncio async def test_load_csv_from_content_all_header_modes( self, databeak_client: Client[FastMCPTransport] @@ -66,20 +38,13 @@ async def test_load_csv_from_content_all_header_modes( async def test_header_consistency_across_functions( self, databeak_client: Client[FastMCPTransport] ) -> None: - """Test that all CSV loading functions handle headers consistently.""" - csv_path = get_fixture_path("sample_data.csv") + """Test that CSV loading functions handle headers consistently.""" content = "name,age,city\nAlice,25,NYC\nBob,30,LA" - # Test that all three functions accept the same header_config format + # Test that loading functions accept the same header_config format header_configs = [{"mode": "auto"}, {"mode": "row", "row_number": 0}, {"mode": "none"}] for config in header_configs: - # load_csv - file_result = await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": config} - ) - assert file_result.is_error is False - # load_csv_from_content content_result = await databeak_client.call_tool( "load_csv_from_content", {"content": content, "header_config": config} @@ 
-90,37 +55,34 @@ async def test_header_consistency_across_functions( async def test_default_header_behavior_consistency( self, databeak_client: Client[FastMCPTransport] ) -> None: - """Test that default header behavior is consistent across all functions.""" - csv_path = get_fixture_path("sample_data.csv") + """Test that default header behavior is consistent.""" content = "name,age,city\nAlice,25,NYC\nBob,30,LA" - # Test default behavior (should all use auto-detect) - file_result = await databeak_client.call_tool("load_csv", {"file_path": csv_path}) + # Test default behavior (should use auto-detect) content_result = await databeak_client.call_tool( "load_csv_from_content", {"content": content} ) - # Both should succeed with default auto-detect behavior - assert file_result.is_error is False + # Should succeed with default auto-detect behavior assert content_result.is_error is False @pytest.mark.asyncio async def test_header_mode_validation(self, databeak_client: Client[FastMCPTransport]) -> None: """Test that invalid header configurations are properly rejected.""" - csv_path = get_fixture_path("sample_data.csv") + content = "name,age,city\nAlice,25,NYC\nBob,30,LA" # Test invalid mode with pytest.raises(Exception, match="invalid|validation|error"): await databeak_client.call_tool( - "load_csv", {"file_path": csv_path, "header_config": {"mode": "invalid"}} + "load_csv_from_content", {"content": content, "header_config": {"mode": "invalid"}} ) # Test missing row_number for explicit row mode with pytest.raises(Exception, match="row_number|required|validation"): await databeak_client.call_tool( - "load_csv", + "load_csv_from_content", { - "file_path": csv_path, + "content": content, "header_config": {"mode": "row"}, # Missing row_number }, ) diff --git a/tests/unit/models/test_config_validation.py b/tests/unit/models/test_config_validation.py index 5621511..9747621 100644 --- a/tests/unit/models/test_config_validation.py +++ b/tests/unit/models/test_config_validation.py 
@@ -5,7 +5,7 @@ import importlib.metadata from pathlib import Path -from databeak.core.settings import DataBeakSettings +from databeak.core.settings import DatabeakSettings class TestVersionLoading: @@ -38,11 +38,11 @@ def test_version_is_valid_string(self) -> None: class TestEnvironmentVariableConfiguration: - """Test environment variable configuration matches DataBeakSettings.""" + """Test environment variable configuration matches DatabeakSettings.""" def test_databeak_settings_has_correct_prefix(self) -> None: - """Test that DataBeakSettings uses DATABEAK_ prefix.""" - settings = DataBeakSettings() + """Test that DatabeakSettings uses DATABEAK_ prefix.""" + settings = DatabeakSettings() config = settings.model_config assert "env_prefix" in config @@ -51,14 +51,13 @@ def test_databeak_settings_has_correct_prefix(self) -> None: def test_environment_variables_mapping(self) -> None: """Test that documented environment variables map to settings fields.""" - settings = DataBeakSettings() + settings = DatabeakSettings() # Verify all documented environment variables have corresponding fields documented_vars = { - "DATABEAK_MAX_FILE_SIZE_MB": "max_file_size_mb", - # "csv_history_dir" removed - history functionality eliminated + "DATABEAK_MAX_DOWNLOAD_SIZE_MB": "max_download_size_mb", "DATABEAK_SESSION_TIMEOUT": "session_timeout", - "DATABEAK_CHUNK_SIZE": "chunk_size", + "DATABEAK_URL_TIMEOUT_SECONDS": "url_timeout_seconds", } for env_var, field_name in documented_vars.items(): @@ -67,31 +66,27 @@ def test_environment_variables_mapping(self) -> None: def test_settings_default_values(self) -> None: """Test that settings have sensible defaults.""" - settings = DataBeakSettings() + settings = DatabeakSettings() - assert settings.max_file_size_mb == 1024 - # csv_history_dir and auto_save removed - functionality eliminated + assert settings.max_download_size_mb == 100 assert settings.session_timeout == 3600 - assert settings.chunk_size == 10000 + assert 
settings.url_timeout_seconds == 30 assert settings.max_anomaly_sample_size == 10000 def test_environment_variable_override(self, monkeypatch) -> None: # type: ignore[no-untyped-def] """Test that environment variables properly override defaults.""" - # History functionality removed, so no temp directory needed # Set test environment variables - monkeypatch.setenv("DATABEAK_MAX_FILE_SIZE_MB", "2048") - # csv_history_dir removed - history functionality eliminated + monkeypatch.setenv("DATABEAK_MAX_DOWNLOAD_SIZE_MB", "200") monkeypatch.setenv("DATABEAK_SESSION_TIMEOUT", "7200") - monkeypatch.setenv("DATABEAK_CHUNK_SIZE", "5000") + monkeypatch.setenv("DATABEAK_URL_TIMEOUT_SECONDS", "60") monkeypatch.setenv("DATABEAK_MAX_ANOMALY_SAMPLE_SIZE", "5000") # Create new settings instance to pick up env vars - settings = DataBeakSettings() + settings = DatabeakSettings() - assert settings.max_file_size_mb == 2048 - # csv_history_dir removed - history functionality eliminated + assert settings.max_download_size_mb == 200 assert settings.session_timeout == 7200 - assert settings.chunk_size == 5000 + assert settings.url_timeout_seconds == 60 assert settings.max_anomaly_sample_size == 5000 diff --git a/tests/unit/models/test_session.py b/tests/unit/models/test_session.py index 22310b6..5d95549 100644 --- a/tests/unit/models/test_session.py +++ b/tests/unit/models/test_session.py @@ -2,7 +2,6 @@ import uuid from datetime import UTC -from pathlib import Path from unittest.mock import AsyncMock, patch import pandas as pd @@ -13,20 +12,17 @@ SessionManager, get_session_manager, ) -from databeak.core.settings import DataBeakSettings -from databeak.models.data_models import ExportFormat +from databeak.core.settings import DatabeakSettings -class TestDataBeakSettings: - """Tests for DataBeakSettings configuration.""" +class TestDatabeakSettings: + """Tests for DatabeakSettings configuration.""" def test_default_settings(self) -> None: """Test default settings initialization.""" - settings 
= DataBeakSettings() + settings = DatabeakSettings() assert settings.session_timeout == 3600 - # csv_history_dir removed - history functionality eliminated - assert settings.max_file_size_mb == 1024 - assert settings.memory_threshold_mb == 2048 + assert settings.health_memory_threshold_mb == 2048 assert settings.max_anomaly_sample_size == 10000 # Anomaly detection sample size @@ -76,118 +72,6 @@ def test_has_data_method(self) -> None: del session.df assert not session.has_data() - @pytest.mark.asyncio - async def test_save_callback_csv_format(self, tmp_path: Path) -> None: - """Test _save_callback with CSV format (lines 199-227).""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - file_path = str(tmp_path / "test.csv") - result = await session._save_callback(file_path, ExportFormat.CSV, "utf-8") - - assert result["success"] is True - assert result["file_path"] == file_path - assert result["rows"] == 2 - assert result["columns"] == 2 - assert Path(file_path).exists() - - @pytest.mark.asyncio - async def test_save_callback_tsv_format(self, tmp_path: Path) -> None: - """Test _save_callback with TSV format.""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - file_path = str(tmp_path / "test.tsv") - result = await session._save_callback(file_path, ExportFormat.TSV, "utf-8") - - assert result["success"] is True - assert Path(file_path).exists() - - # Verify TSV format (tab-separated) - content = Path(file_path).read_text() - assert "\t" in content - - @pytest.mark.asyncio - async def test_save_callback_json_format(self, tmp_path: Path) -> None: - """Test _save_callback with JSON format.""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - file_path = str(tmp_path / "test.json") - result = await session._save_callback(file_path, ExportFormat.JSON, "utf-8") - - assert 
result["success"] is True - assert Path(file_path).exists() - - @pytest.mark.asyncio - async def test_save_callback_excel_format(self, tmp_path: Path) -> None: - """Test _save_callback with Excel format.""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - file_path = str(tmp_path / "test.xlsx") - result = await session._save_callback(file_path, ExportFormat.EXCEL, "utf-8") - - assert result["success"] is True - assert Path(file_path).exists() - - @pytest.mark.asyncio - async def test_save_callback_parquet_format(self, tmp_path: Path) -> None: - """Test _save_callback with Parquet format.""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - file_path = str(tmp_path / "test.parquet") - result = await session._save_callback(file_path, ExportFormat.PARQUET, "utf-8") - - assert result["success"] is True - assert Path(file_path).exists() - - @pytest.mark.asyncio - async def test_save_callback_unsupported_format(self, tmp_path: Path) -> None: - """Test _save_callback with unsupported format.""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - file_path = str(tmp_path / "test.unknown") - # Use a string that's not in ExportFormat enum - result = await session._save_callback(file_path, "UNKNOWN", "utf-8") # type: ignore[arg-type] - - assert result["success"] is False - assert "Unsupported format" in result["error"] - - @pytest.mark.asyncio - async def test_save_callback_no_data(self, tmp_path: Path) -> None: - """Test _save_callback when no data is loaded.""" - session = DatabeakSession() - # Don't load any data - - file_path = str(tmp_path / "test.csv") - result = await session._save_callback(file_path, ExportFormat.CSV, "utf-8") - - assert result["success"] is False - assert "No data to save" in result["error"] - - @pytest.mark.asyncio - async def 
test_save_callback_exception_handling(self, tmp_path: Path) -> None: - """Test _save_callback exception handling.""" - session = DatabeakSession() - df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}) - session.df = df - - # Use invalid path to trigger exception - invalid_path = "/invalid/path/that/does/not/exist/test.csv" - result = await session._save_callback(invalid_path, ExportFormat.CSV, "utf-8") - - assert result["success"] is False - assert "error" in result - class TestSessionManager: """Tests for SessionManager functionality.""" @@ -379,8 +263,8 @@ class TestMemoryConfiguration: def test_memory_threshold_configuration(self) -> None: """Test that memory threshold is configurable via settings.""" - settings = DataBeakSettings(memory_threshold_mb=4096) - assert settings.memory_threshold_mb == 4096 + settings = DatabeakSettings(health_memory_threshold_mb=4096) + assert settings.health_memory_threshold_mb == 4096 @pytest.mark.asyncio async def test_environment_variable_configuration(self) -> None: @@ -388,19 +272,19 @@ async def test_environment_variable_configuration(self) -> None: import os # Set environment variables - old_memory = os.environ.get("DATABEAK_MEMORY_THRESHOLD_MB") + old_memory = os.environ.get("DATABEAK_HEALTH_MEMORY_THRESHOLD_MB") try: - os.environ["DATABEAK_MEMORY_THRESHOLD_MB"] = "4096" + os.environ["DATABEAK_HEALTH_MEMORY_THRESHOLD_MB"] = "4096" # Create new settings instance to pick up env vars - settings = DataBeakSettings() + settings = DatabeakSettings() - assert settings.memory_threshold_mb == 4096 + assert settings.health_memory_threshold_mb == 4096 finally: # Clean up environment variables if old_memory is not None: - os.environ["DATABEAK_MEMORY_THRESHOLD_MB"] = old_memory + os.environ["DATABEAK_HEALTH_MEMORY_THRESHOLD_MB"] = old_memory else: - os.environ.pop("DATABEAK_MEMORY_THRESHOLD_MB", None) + os.environ.pop("DATABEAK_HEALTH_MEMORY_THRESHOLD_MB", None) diff --git a/tests/unit/models/test_settings.py 
b/tests/unit/models/test_settings.py index 9dfca58..4d6d2a7 100644 --- a/tests/unit/models/test_settings.py +++ b/tests/unit/models/test_settings.py @@ -3,71 +3,67 @@ import os from unittest.mock import patch -from databeak.core.settings import DataBeakSettings +from databeak.core.settings import DatabeakSettings -class TestDataBeakSettings: +class TestDatabeakSettings: """Test DataBeak settings configuration.""" def test_default_settings(self) -> None: """Test default settings configuration.""" - settings = DataBeakSettings() + settings = DatabeakSettings() # History functionality removed - test other defaults - assert settings.max_file_size_mb == 1024 + assert settings.max_download_size_mb == 100 assert settings.session_timeout == 3600 - assert settings.chunk_size == 10000 assert settings.max_anomaly_sample_size == 10000 def test_settings_with_custom_values(self) -> None: """Test settings with custom values.""" - settings = DataBeakSettings(max_file_size_mb=2048, session_timeout=7200, chunk_size=5000) - assert settings.max_file_size_mb == 2048 + settings = DatabeakSettings(max_download_size_mb=200, session_timeout=7200) + assert settings.max_download_size_mb == 200 assert settings.session_timeout == 7200 - assert settings.chunk_size == 5000 def test_environment_variable_override(self) -> None: """Test that environment variables override defaults.""" with patch.dict( os.environ, { - "DATABEAK_MAX_FILE_SIZE_MB": "4096", + "DATABEAK_MAX_DOWNLOAD_SIZE_MB": "200", "DATABEAK_SESSION_TIMEOUT": "14400", - "DATABEAK_CHUNK_SIZE": "20000", }, ): - settings = DataBeakSettings() - assert settings.max_file_size_mb == 4096 + settings = DatabeakSettings() + assert settings.max_download_size_mb == 200 assert settings.session_timeout == 14400 - assert settings.chunk_size == 20000 def test_case_insensitive_env_var(self) -> None: """Test that environment variables are case insensitive.""" - with patch.dict(os.environ, {"DATABEAK_MAX_FILE_SIZE_MB": "512"}): - settings = 
DataBeakSettings() - assert settings.max_file_size_mb == 512 + with patch.dict(os.environ, {"DATABEAK_MAX_DOWNLOAD_SIZE_MB": "512"}): + settings = DatabeakSettings() + assert settings.max_download_size_mb == 512 -class TestDataBeakSettingsIntegration: +class TestDatabeakSettingsIntegration: """Test DataBeak settings integration with sessions.""" def test_settings_are_configurable(self) -> None: """Test that settings can be configured multiple ways.""" # Test 1: Direct instantiation - settings1 = DataBeakSettings(max_file_size_mb=512) - assert settings1.max_file_size_mb == 512 + settings1 = DatabeakSettings(max_download_size_mb=512) + assert settings1.max_download_size_mb == 512 # Test 2: Environment variable - with patch.dict(os.environ, {"DATABEAK_MAX_FILE_SIZE_MB": "2048"}): - settings2 = DataBeakSettings() - assert settings2.max_file_size_mb == 2048 + with patch.dict(os.environ, {"DATABEAK_MAX_DOWNLOAD_SIZE_MB": "2048"}): + settings2 = DatabeakSettings() + assert settings2.max_download_size_mb == 2048 # Test 3: Default with patch.dict(os.environ, {}, clear=True): # Clear any existing env vars - if "DATABEAK_MAX_FILE_SIZE_MB" in os.environ: - del os.environ["DATABEAK_MAX_FILE_SIZE_MB"] - settings3 = DataBeakSettings() - assert settings3.max_file_size_mb == 1024 + if "DATABEAK_MAX_DOWNLOAD_SIZE_MB" in os.environ: + del os.environ["DATABEAK_MAX_DOWNLOAD_SIZE_MB"] + settings3 = DatabeakSettings() + assert settings3.max_download_size_mb == 100 class TestSettingsDocumentation: @@ -75,25 +71,23 @@ class TestSettingsDocumentation: def test_env_prefix_documentation(self) -> None: """Test that DATABEAK_ prefix works as documented.""" - with patch.dict(os.environ, {"DATABEAK_CHUNK_SIZE": "15000"}): - settings = DataBeakSettings() - assert settings.chunk_size == 15000 + with patch.dict(os.environ, {"DATABEAK_URL_TIMEOUT_SECONDS": "60"}): + settings = DatabeakSettings() + assert settings.url_timeout_seconds == 60 def test_default_values_documentation(self) -> None: """Test 
that default values match documentation.""" # Clear environment and test default values with patch.dict(os.environ, {}, clear=True): for var in [ - "DATABEAK_MAX_FILE_SIZE_MB", + "DATABEAK_MAX_DOWNLOAD_SIZE_MB", "DATABEAK_SESSION_TIMEOUT", - "DATABEAK_CHUNK_SIZE", ]: if var in os.environ: del os.environ[var] - settings = DataBeakSettings() - assert settings.max_file_size_mb == 1024, "Default file size limit should be 1024 MB" + settings = DatabeakSettings() + assert settings.max_download_size_mb == 100, "Default URL size limit should be 100 MB" assert settings.session_timeout == 3600, ( "Default session timeout should be 3600 seconds" ) - assert settings.chunk_size == 10000, "Default chunk size should be 10000" diff --git a/tests/unit/models/test_tool_responses.py b/tests/unit/models/test_tool_responses.py index 34df306..8c6ae98 100644 --- a/tests/unit/models/test_tool_responses.py +++ b/tests/unit/models/test_tool_responses.py @@ -7,9 +7,7 @@ from __future__ import annotations import json -import tempfile from datetime import datetime -from pathlib import Path from typing import Any import pytest @@ -53,7 +51,6 @@ # Import IO server models that moved to modular architecture from databeak.servers.io_server import ( - ExportResult, LoadResult, SessionInfoResult, ) @@ -406,8 +403,7 @@ def test_valid_creation(self) -> None: version="1.0.0", description="CSV manipulation server", capabilities={"analytics": ["statistics", "correlation"]}, - supported_formats=["csv", "json", "excel"], - max_file_size_mb=100, + max_download_size_mb=100, session_timeout_minutes=30, ) assert server_info.name == "DataBeak" @@ -520,44 +516,6 @@ def test_optional_count_fields(self) -> None: assert result.column_count is None -class TestExportResult: - """Test ExportResult model.""" - - def test_valid_creation(self) -> None: - """Test valid ExportResult creation.""" - with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp: - result = ExportResult( - file_path=tmp.name, - 
format="csv", - rows_exported=100, - file_size_mb=0.5, - ) - assert result.format == "csv" - assert result.rows_exported == 100 - # Clean up - Path(tmp.name).unlink() - - def test_literal_format_validation(self) -> None: - """Test format field validates against literal values.""" - valid_formats = ["csv", "tsv", "json", "excel", "parquet", "html", "markdown"] - with tempfile.NamedTemporaryFile() as tmp: - for fmt in valid_formats: - result = ExportResult( - file_path=tmp.name, - format=fmt, - rows_exported=10, - ) - assert result.format == fmt - - # Test invalid format - with pytest.raises(ValidationError): - ExportResult( - file_path=tmp.name, - format="invalid_format", - rows_exported=10, - ) - - # ============================================================================= # ANALYTICS TOOL RESPONSES TESTS # ============================================================================= diff --git a/tests/unit/servers/test_io_server.py b/tests/unit/servers/test_io_server.py deleted file mode 100644 index 18dc729..0000000 --- a/tests/unit/servers/test_io_server.py +++ /dev/null @@ -1,736 +0,0 @@ -"""Comprehensive tests for io_server.py focusing on error conditions, edge cases, and -integration.""" - -import tempfile -from email.message import EmailMessage -from pathlib import Path -from typing import Any -from unittest.mock import AsyncMock, patch -from urllib.error import HTTPError - -import pandas as pd -import pytest -from fastmcp.exceptions import ToolError - -from databeak.servers.io_server import ( - MAX_FILE_SIZE_MB, - MAX_MEMORY_USAGE_MB, - MAX_ROWS, - MAX_URL_SIZE_MB, - NoHeader, - export_csv, - get_session_info, - load_csv, - load_csv_from_content, - load_csv_from_url, -) -from tests.test_mock_context import create_mock_context - - -@pytest.mark.asyncio -class TestErrorConditions: - """Test comprehensive error conditions.""" - - async def test_load_csv_file_not_found(self) -> None: - """Test loading non-existent CSV file.""" - with 
pytest.raises(ToolError): - await load_csv(create_mock_context(), "/nonexistent/path/file.csv") - - async def test_load_csv_invalid_extension(self) -> None: - """Test loading file with invalid extension.""" - with tempfile.NamedTemporaryFile(suffix=".doc", delete=False) as f: - f.write(b"name,age\nJohn,30") - temp_path = f.name - - try: - with pytest.raises(ToolError): - await load_csv(create_mock_context(), temp_path) - finally: - Path(temp_path).unlink() - - async def test_load_csv_encoding_error_all_fallbacks_fail(self) -> None: - """Test encoding error when all fallback encodings fail.""" - - # Mock pandas.read_csv to always raise UnicodeDecodeError - def mock_read_csv(*args: object, **kwargs: object) -> object: - encoding = str(kwargs.get("encoding", "utf-8")) - raise UnicodeDecodeError(encoding, b"", 0, 1, f"mock encoding error for {encoding}") - - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age\nJohn,30") - temp_path = f.name - - try: - with ( - patch("pandas.read_csv", side_effect=mock_read_csv), - pytest.raises(ToolError), - ): - await load_csv(create_mock_context(), temp_path, encoding="utf-8") - finally: - Path(temp_path).unlink() - - async def test_load_csv_memory_limit_exceeded(self) -> None: - """Test memory limit enforcement.""" - # Mock pandas read_csv to return a large DataFrame - large_data = pd.DataFrame( - {"col1": ["data"] * (MAX_ROWS + 100), "col2": list(range(MAX_ROWS + 100))}, - ) - - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("col1,col2\n") - temp_path = f.name - - try: - with ( - patch("pandas.read_csv", return_value=large_data), - pytest.raises(ToolError), - ): - await load_csv(create_mock_context(), temp_path) - finally: - Path(temp_path).unlink() - - async def test_load_csv_memory_usage_exceeded(self) -> None: - """Test memory usage limit enforcement.""" - # Create DataFrame that exceeds memory limit - with 
tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("col1,col2\n") - temp_path = f.name - - # Mock memory_usage to return value exceeding limit - def mock_memory_usage(*args: object, **kwargs: object) -> object: - class MockSeries: - def sum(self) -> int: - return (MAX_MEMORY_USAGE_MB + 100) * 1024 * 1024 - - return MockSeries() - - try: - with ( - patch("pandas.DataFrame.memory_usage", side_effect=mock_memory_usage), - pytest.raises(ToolError), - ): - await load_csv(create_mock_context(), temp_path) - finally: - Path(temp_path).unlink() - - async def test_load_csv_from_url_private_network_blocked(self) -> None: - """Test private network URL blocking.""" - private_urls = [ - "http://192.168.1.1/data.csv", - "http://10.0.0.1/data.csv", - "http://172.16.0.1/data.csv", - "http://localhost/data.csv", - "http://127.0.0.1/data.csv", - ] - - for url in private_urls: - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), url) - - async def test_load_csv_file_size_limit_exceeded(self) -> None: - """Test file size limit enforcement before loading.""" - # Create a file and mock its size to exceed limit - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age\nJohn,30") - temp_path = f.name - - try: - # Create a mock stat object - mock_stat_obj = type("MockStat", (), {})() - mock_stat_obj.st_size = (MAX_FILE_SIZE_MB + 10) * 1024 * 1024 - mock_stat_obj.st_mode = 0o100644 # Regular file mode - - with ( - patch("pathlib.Path.stat", return_value=mock_stat_obj), - pytest.raises(ToolError), - ): - await load_csv(create_mock_context(), temp_path) - finally: - Path(temp_path).unlink() - - -@pytest.mark.asyncio -class TestEdgeCases: - """Test edge cases and unusual inputs.""" - - async def test_load_csv_empty_file(self) -> None: - """Test loading empty CSV file.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("") # Empty file - temp_path = 
f.name - - try: - with pytest.raises(pd.errors.EmptyDataError, match="No columns to parse"): - await load_csv(create_mock_context(), temp_path) - finally: - Path(temp_path).unlink() - - async def test_load_csv_header_only(self) -> None: - """Test loading CSV with only headers.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age,city\n") # Header only - temp_path = f.name - - try: - # Loading CSV with only headers is valid - creates empty DataFrame with columns - result = await load_csv(create_mock_context(), temp_path) - assert result.rows_affected == 0 - finally: - Path(temp_path).unlink() - - async def test_load_csv_special_characters(self) -> None: - """Test loading CSV with special characters.""" - with tempfile.NamedTemporaryFile( - mode="w", - suffix=".csv", - delete=False, - encoding="utf-8", - ) as f: - # Unicode characters, quotes, commas in data - f.write("name,description,price\n") - f.write('"José García","Product with ""quotes"" and, commas",€25.99\n') - f.write('"李小明","测试数据",¥100.00\n') - temp_path = f.name - - try: - result = await load_csv(create_mock_context(), temp_path) - assert result.success - assert result.rows_affected == 2 - assert result.data is not None - assert "José García" in str(result.data.rows) - finally: - Path(temp_path).unlink() - - async def test_load_csv_malformed_quotes(self) -> None: - """Test CSV with malformed quotes.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,description\n") - f.write('"Unclosed quote,data\n') # Malformed - temp_path = f.name - - try: - with pytest.raises(pd.errors.ParserError, match="Error tokenizing data"): - await load_csv(create_mock_context(), temp_path) - finally: - Path(temp_path).unlink() - - async def test_load_csv_from_content_various_delimiters(self) -> None: - """Test content loading with different delimiters.""" - # Tab-separated - content = "name\tage\tcity\nJohn\t30\tNYC\nJane\t25\tLA" - 
result = await load_csv_from_content(create_mock_context(), content, delimiter="\t") - assert result.success - assert result.rows_affected == 2 - - # Semicolon-separated - content = "name;age;city\nJohn;30;NYC\nJane;25;LA" - result = await load_csv_from_content(create_mock_context(), content, delimiter=";") - assert result.success - assert result.rows_affected == 2 - - async def test_load_csv_from_content_no_header(self) -> None: - """Test content loading without headers.""" - content = "John,30,NYC\nJane,25,LA" - result = await load_csv_from_content( - create_mock_context(), content, header_config=NoHeader() - ) - assert result.success - assert result.rows_affected == 2 - # Should have auto-generated column names like "0", "1", "2" - - -@pytest.mark.asyncio -class TestIntegrationWithSessions: - """Test integration with session management.""" - - async def test_load_csv_creates_new_session(self) -> None: - """Test that loading CSV creates a new session when none provided.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age\nJohn,30\nJane,25") - temp_path = f.name - - try: - ctx = create_mock_context() - result = await load_csv(ctx, temp_path) - assert result.success - assert result.rows_affected == 2 - - # Verify session exists and has data - session_info = await get_session_info(create_mock_context(ctx.session_id)) - assert session_info.data_loaded - assert session_info.row_count == 2 - assert session_info.column_count == 2 - finally: - Path(temp_path).unlink() - - async def test_load_csv_into_existing_session(self) -> None: - """Test loading CSV into an existing session (replaces data).""" - # First load - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age\nJohn,30") - temp_path1 = f.name - - try: - ctx1 = create_mock_context() - await load_csv(ctx1, temp_path1) - session_id = ctx1.session_id - - # Second load into same session - with 
tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("product,price,category\nLaptop,1000,Electronics\nBook,15,Education") - temp_path2 = f.name - - try: - result2 = await load_csv(create_mock_context(session_id), temp_path2) - assert result2.rows_affected == 2 - - # Verify session now has the new data - session_info = await get_session_info(create_mock_context(session_id)) - assert session_info.row_count == 2 - assert session_info.column_count == 3 - finally: - Path(temp_path2).unlink() - finally: - Path(temp_path1).unlink() - - async def test_session_lifecycle_complete(self) -> None: - """Test complete session lifecycle: create, use, export, close.""" - # Load data - content = "name,age,salary\nAlice,25,50000\nBob,30,60000\nCharlie,35,70000" - ctx = create_mock_context() - await load_csv_from_content(ctx, content) - session_id = ctx.session_id - - # Get session info - info = await get_session_info(create_mock_context(session_id)) - assert info.data_loaded - assert info.row_count == 3 - - # Export the data - with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp: - temp_path = tmp.name - - export_result = await export_csv(create_mock_context(session_id), file_path=temp_path) - assert export_result.rows_exported == 3 - assert Path(export_result.file_path).exists() - - # Clean up export file - Path(export_result.file_path).unlink() - - -@pytest.mark.asyncio -class TestTempFileCleanup: - """Test temporary file cleanup scenarios.""" - - async def test_export_csv_session_error_handling(self) -> None: - """Test error handling when session manager fails.""" - with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp: - temp_path = tmp.name - - try: - with patch("databeak.servers.io_server.get_session_data") as mock_get_session_data: - mock_get_session_data.side_effect = Exception("Mock session error") - - with pytest.raises(Exception, match="Mock session error"): - await export_csv(create_mock_context(), 
file_path=temp_path) - finally: - Path(temp_path).unlink(missing_ok=True) - - -@pytest.mark.asyncio -class TestURLValidationSecurity: - """Test URL validation and security features.""" - - async def test_private_network_blocking_ipv4(self) -> None: - """Test blocking of private IPv4 networks.""" - private_networks = [ - "http://192.168.1.1/data.csv", # Class C private - "http://10.0.0.1/data.csv", # Class A private - "http://172.16.0.1/data.csv", # Class B private - "http://127.0.0.1/data.csv", # Loopback - "http://169.254.1.1/data.csv", # Link-local - ] - - for url in private_networks: - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), url) - - async def test_localhost_hostname_blocking(self) -> None: - """Test blocking of localhost hostnames.""" - localhost_urls = [ - "http://localhost/data.csv", - "https://localhost:8080/data.csv", - ] - - for url in localhost_urls: - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), url) - - async def test_invalid_url_schemes(self) -> None: - """Test rejection of non-HTTP schemes.""" - invalid_schemes = [ - "ftp://example.com/data.csv", - "file:///path/to/data.csv", - "mailto:user@example.com", - "javascript:alert('xss')", - ] - - for url in invalid_schemes: - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), url) - - async def test_url_timeout_handling(self) -> None: - """Test URL download timeout handling.""" - # Mock urlopen to raise timeout - with patch("databeak.servers.io_server.urlopen") as mock_urlopen: - mock_urlopen.side_effect = TimeoutError("Connection timed out") - - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), "https://example.com/data.csv") - - async def test_url_content_type_validation(self) -> None: - """Test content-type verification for URLs.""" - # Mock response with invalid content-type - mock_response = AsyncMock() - mock_response.headers = {"Content-Type": "text/html"} - - with 
patch("databeak.servers.io_server.urlopen") as mock_urlopen: - mock_urlopen.return_value.__enter__.return_value = mock_response - # Should proceed with warning for unexpected content-type - # The test validates the warning is logged - - async def test_url_size_limit_exceeded(self) -> None: - """Test URL download size limit enforcement.""" - # Mock response with large content-length - mock_response = AsyncMock() - mock_response.headers = { - "Content-Type": "text/csv", - "Content-Length": str((MAX_URL_SIZE_MB + 10) * 1024 * 1024), # Exceed limit - } - - with patch("databeak.servers.io_server.urlopen") as mock_urlopen: - mock_urlopen.return_value.__enter__.return_value = mock_response - - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), "https://example.com/large_file.csv") - - async def test_url_http_error_handling(self) -> None: - """Test HTTP error handling (404, 403, etc.).""" - with patch("databeak.servers.io_server.urlopen") as mock_urlopen: - mock_urlopen.side_effect = HTTPError( - url="https://example.com/notfound.csv", - code=404, - msg="Not Found", - hdrs=EmailMessage(), - fp=None, - ) - - with pytest.raises(ToolError): - await load_csv_from_url(create_mock_context(), "https://example.com/notfound.csv") - - -@pytest.mark.asyncio -class TestExportFormats: - """Test all export formats comprehensively.""" - - @pytest.fixture - async def session_with_data(self) -> str: - """Create session with test data.""" - content = "name,age,salary,active\nAlice,25,50000,true\nBob,30,60000,false" - ctx = create_mock_context() - await load_csv_from_content(ctx, content) - return ctx.session_id - - async def test_export_csv_format(self, session_with_data: str) -> None: - """Test CSV export format (inferred from .csv extension).""" - with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp: - temp_path = tmp.name - - try: - result = await export_csv(create_mock_context(session_with_data), file_path=temp_path) - assert result.format == 
"csv" - assert result.file_path == temp_path - assert result.rows_exported == 2 - - # Verify file exists and has content - assert Path(result.file_path).exists() - content = Path(result.file_path).read_text() - assert "Alice" in content - assert "Bob" in content - assert "name,age" in content # CSV headers - finally: - Path(temp_path).unlink(missing_ok=True) - - async def test_export_tsv_format(self, session_with_data: str) -> None: - """Test TSV export format (inferred from .tsv extension).""" - with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp: - temp_path = tmp.name - - try: - result = await export_csv(create_mock_context(session_with_data), file_path=temp_path) - assert result.format == "tsv" - assert result.file_path == temp_path - - # Verify tab separation - content = Path(result.file_path).read_text() - assert "\t" in content - finally: - Path(temp_path).unlink(missing_ok=True) - - async def test_export_json_format(self, session_with_data: str) -> None: - """Test JSON export format (inferred from .json extension).""" - with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp: - temp_path = tmp.name - - try: - result = await export_csv(create_mock_context(session_with_data), file_path=temp_path) - assert result.format == "json" - assert result.file_path == temp_path - - # Verify valid JSON - import json - - with Path(result.file_path).open() as f: - data = json.load(f) - assert len(data) == 2 - assert data[0]["name"] == "Alice" - finally: - Path(temp_path).unlink(missing_ok=True) - - async def test_export_with_custom_path(self, session_with_data: str) -> None: - """Test export with user-specified file path.""" - with tempfile.TemporaryDirectory() as temp_dir: - custom_path = str(Path(temp_dir) / "my_export.csv") - result = await export_csv(create_mock_context(session_with_data), file_path=custom_path) - - assert result.file_path == custom_path - assert Path(custom_path).exists() - - -@pytest.mark.asyncio -class 
TestEncodingAndFallback: - """Test encoding detection and fallback logic.""" - - async def test_encoding_fallback_success(self) -> None: - """Test successful encoding fallback.""" - # Create file with latin1 encoding - content = "name,description\nJosé,Niño años" - - with tempfile.NamedTemporaryFile( - mode="w", - suffix=".csv", - delete=False, - encoding="latin1", - ) as f: - f.write(content) - temp_path = f.name - - try: - # Try to load with UTF-8 first, should fallback to latin1 - result = await load_csv(create_mock_context(), temp_path, encoding="utf-8") - assert result.success - assert result.rows_affected == 1 - finally: - Path(temp_path).unlink() - - async def test_custom_na_values(self) -> None: - """Test custom NA values handling.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age,status\nJohn,30,MISSING\nJane,N/A,active") - temp_path = f.name - - try: - result = await load_csv(create_mock_context(), temp_path, na_values=["MISSING", "N/A"]) - assert result.success - # Verify NA values were handled properly (would need to check actual data) - finally: - Path(temp_path).unlink() - - async def test_automatic_encoding_detection_success(self) -> None: - """Test automatic encoding detection with chardet.""" - # Create file with special characters that require specific encoding - content = "name,description\nJosé García,Niño años\nFrançois,café" - - with tempfile.NamedTemporaryFile( - mode="w", - suffix=".csv", - delete=False, - encoding="latin1", - ) as f: - f.write(content) - temp_path = f.name - - try: - # Mock chardet to return high confidence detection - with patch("databeak.servers.io_server.chardet.detect") as mock_detect: - mock_detect.return_value = {"encoding": "ISO-8859-1", "confidence": 0.85} - - # Should use detected encoding instead of falling back - result = await load_csv( - create_mock_context(), - temp_path, - encoding="utf-8", - ) # Will trigger fallback - assert result.success - assert 
result.rows_affected == 2 - finally: - Path(temp_path).unlink() - - @pytest.mark.skip(reason="Complex mocking scenario - needs refactoring") - @patch("pandas.read_csv") - async def test_encoding_fallback_prioritization(self) -> None: - """Test that encoding fallbacks are tried in optimal order.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("name,age\nJohn,30") - temp_path = f.name - - try: - # Mock successful result - success_df = pd.DataFrame({"name": ["John"], "age": [30]}) - - # Mock encoding detection to fail so fallbacks are used - with ( - patch("databeak.servers.io_server.detect_file_encoding", return_value="utf-8"), - patch("pandas.read_csv") as mock_read_csv, - ): - call_count = [0] - - def mock_read_side_effect(*args: Any, **kwargs: Any) -> Any: - call_count[0] += 1 - if call_count[0] == 1: # First call (original encoding) - msg = "utf-8" - raise UnicodeDecodeError(msg, b"", 0, 1, "mock error") - if call_count[0] == 2: # Second call (auto-detection, same encoding) - msg = "utf-8" - raise UnicodeDecodeError(msg, b"", 0, 1, "auto-detect fails") - if call_count[0] == 3: # First fallback fails - msg = "utf-8-sig" - raise UnicodeDecodeError(msg, b"", 0, 1, "fallback 1 fails") - # Eventually succeed - return success_df - - mock_read_csv.side_effect = mock_read_side_effect - - result = await load_csv(create_mock_context(), temp_path, encoding="utf-8") - assert result.success - # Verify multiple attempts were made - assert mock_read_csv.call_count >= 3 - finally: - Path(temp_path).unlink() - - -@pytest.mark.asyncio -class TestMemoryAndPerformance: - """Test memory limits and performance scenarios.""" - - async def test_load_csv_row_limit_enforcement(self) -> None: - """Test that row limits are properly enforced.""" - # Create DataFrame exceeding row limit - large_df = pd.DataFrame({"id": range(MAX_ROWS + 10), "data": ["test"] * (MAX_ROWS + 10)}) - - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", 
delete=False) as f:
-            f.write("id,data\n")
-            temp_path = f.name
-
-        try:
-            with (
-                patch("pandas.read_csv", return_value=large_df),
-                pytest.raises(ToolError),
-            ):
-                await load_csv(create_mock_context(), temp_path)
-        finally:
-            Path(temp_path).unlink()
-
-    @pytest.mark.skip(reason="Complex mocking scenario - needs refactoring")
-    async def test_encoding_fallback_memory_check(self) -> None:
-        """Test that memory limits are checked even in encoding fallback."""
-        # Create large dataframe that exceeds memory limits
-        large_df = pd.DataFrame({"col": ["x" * 10000] * 1000})  # Large strings for memory usage
-
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
-            f.write("col\n")
-            temp_path = f.name
-
-        try:
-            # Mock encoding detection to fail so fallbacks are used
-            with (
-                patch("databeak.servers.io_server.detect_file_encoding", return_value="utf-8"),
-                patch("pandas.read_csv") as mock_read_csv,
-            ):
-                call_count = [0]
-
-                def mock_read_side_effect(*args: Any, **kwargs: Any) -> Any:
-                    call_count[0] += 1
-                    if call_count[0] == 1 or call_count[0] == 2:  # First call (original encoding)
-                        msg = "utf-8"
-                        raise UnicodeDecodeError(msg, b"", 0, 1, "mock error")
-                    # Fallback encoding succeeds but returns large df
-                    return large_df
-
-                mock_read_csv.side_effect = mock_read_side_effect
-
-                # Mock to make memory check fail
-                with (
-                    patch("databeak.servers.io_server.MAX_MEMORY_USAGE_MB", 0.001),
-                    pytest.raises(ToolError),
-                ):
-                    await load_csv(create_mock_context(), temp_path, encoding="utf-8")
-        finally:
-            Path(temp_path).unlink()
-
-
-@pytest.mark.asyncio
-class TestProgressReporting:
-    """Test FastMCP context integration for progress reporting."""
-
-    async def test_load_csv_with_context(self) -> None:
-        """Test loading CSV with FastMCP context for progress reporting."""
-        # Mock context
-        mock_ctx = AsyncMock()
-        mock_ctx.session_id = "test_context_session"
-
-        content = "name,age\nJohn,30"
-        with tempfile.NamedTemporaryFile(mode="w",
suffix=".csv", delete=False) as f:
-            f.write(content)
-            temp_path = f.name
-
-        try:
-            result = await load_csv(mock_ctx, temp_path)
-            assert result.success
-
-            # Verify context methods were called
-            mock_ctx.info.assert_called()
-            mock_ctx.report_progress.assert_called()
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_export_csv_with_context(self) -> None:
-        """Test export with context reporting."""
-        mock_ctx = AsyncMock()
-
-        # Load data first
-        content = "name,age\nJohn,30"
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, content)
-        session_id = ctx.session_id
-        mock_ctx.session_id = session_id
-
-        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
-            temp_path = tmp.name
-
-        result = await export_csv(mock_ctx, file_path=temp_path)
-
-        # Verify context methods were called
-        mock_ctx.info.assert_called()
-        mock_ctx.report_progress.assert_called()
-
-        # Cleanup
-        Path(result.file_path).unlink()
-
-
-# Helper function to fix nullable dtypes for test compatibility
-def create_test_dataframe() -> pd.DataFrame:
-    """Create test DataFrame compatible with session management."""
-    return pd.DataFrame(
-        {"name": ["John", "Jane", "Alice"], "age": [30, 25, 35], "city": ["NYC", "LA", "Chicago"]},
-    )
diff --git a/tests/unit/servers/test_io_server_additional.py b/tests/unit/servers/test_io_server_additional.py
deleted file mode 100644
index a854f67..0000000
--- a/tests/unit/servers/test_io_server_additional.py
+++ /dev/null
@@ -1,207 +0,0 @@
-"""Additional tests for io_server to improve coverage."""
-
-import tempfile
-from pathlib import Path
-
-import pytest
-from fastmcp.exceptions import ToolError
-
-from databeak.core.session import get_session_manager
-from databeak.exceptions import NoDataLoadedError
-from databeak.servers.io_server import (
-    export_csv,
-    get_session_info,
-    load_csv_from_content,
-)
-from tests.test_mock_context import create_mock_context
-
-
-class TestSessionManagement:
-    """Test session management
functions."""
-
-    async def test_get_session_info_valid(self) -> None:
-        """Test getting info for a valid session."""
-        # Create a session
-        csv_content = "name,value\ntest,123"
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-
-        info = await get_session_info(create_mock_context(session_id))
-
-        assert info.success is True
-        assert info.data_loaded is True
-        assert info.row_count == 1
-        assert info.column_count == 2
-        # SessionInfoResult doesn't have columns field, just counts
-
-    async def test_get_session_info_invalid(self) -> None:
-        """Test getting info for invalid session."""
-        with pytest.raises(NoDataLoadedError):
-            await get_session_info(create_mock_context("nonexistent-session-id"))
-
-
-class TestCsvLoadingEdgeCases:
-    """Test CSV loading edge cases and error handling."""
-
-    async def test_load_csv_empty_content(self) -> None:
-        """Test loading empty CSV content."""
-        with pytest.raises(ToolError):
-            await load_csv_from_content(create_mock_context(), "")
-
-    async def test_load_csv_only_whitespace(self) -> None:
-        """Test loading CSV with only whitespace."""
-        with pytest.raises(ToolError):
-            await load_csv_from_content(create_mock_context(), " \n \n ")
-
-    async def test_load_csv_single_column(self) -> None:
-        """Test loading CSV with single column."""
-        csv_content = "single_col\nvalue1\nvalue2\nvalue3"
-        result = await load_csv_from_content(create_mock_context(), csv_content)
-
-        assert result.rows_affected == 3
-        assert result.columns_affected == ["single_col"]
-
-    async def test_load_csv_with_quotes(self) -> None:
-        """Test loading CSV with quoted values."""
-        csv_content = 'name,description\n"John Doe","A person, with comma"\n"Jane","Normal"'
-        result = await load_csv_from_content(create_mock_context(), csv_content)
-
-        assert result.rows_affected == 2
-        assert result.columns_affected == ["name", "description"]
-
-    async def test_load_csv_with_different_delimiter(self) -> None:
-        """Test
loading CSV with semicolon delimiter."""
-        csv_content = "col1;col2;col3\n1;2;3\n4;5;6"
-        result = await load_csv_from_content(create_mock_context(), csv_content, delimiter=";")
-
-        assert result.rows_affected == 2
-        assert len(result.columns_affected) == 3
-
-    async def test_load_csv_with_mixed_types(self) -> None:
-        """Test loading CSV with mixed data types."""
-        csv_content = """id,name,value,is_active,date
-1,Alice,100.5,true,2024-01-01
-2,Bob,200,false,2024-01-02
-3,Charlie,,true,"""
-
-        result = await load_csv_from_content(create_mock_context(), csv_content)
-
-        assert result.rows_affected == 3
-        assert len(result.columns_affected) == 5
-
-    async def test_load_csv_duplicate_columns(self) -> None:
-        """Test loading CSV with duplicate column names."""
-        csv_content = "col,col,col\n1,2,3\n4,5,6"
-
-        # Should handle duplicate columns by renaming them
-        result = await load_csv_from_content(create_mock_context(), csv_content)
-
-        assert result.rows_affected == 2
-        # Pandas renames duplicates like col, col.1, col.2
-        assert len(result.columns_affected) == 3
-
-
-class TestExportFunctionality:
-    """Test CSV export functionality."""
-
-    async def test_export_csv_basic(self) -> None:
-        """Test basic CSV export."""
-        # Create a session with data
-        csv_content = "name,value\ntest1,100\ntest2,200"
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-
-        # Export to a temporary file
-        import tempfile
-
-        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
-            export_result = await export_csv(create_mock_context(session_id), file_path=tmp.name)
-
-            assert export_result.success is True
-            assert export_result.file_path == tmp.name
-            assert export_result.rows_exported == 2
-
-            # Verify file content
-            import pandas as pd
-
-            df = pd.read_csv(tmp.name)
-            assert len(df) == 2
-            assert list(df.columns) == ["name", "value"]
-
-            # Clean up
-            Path(tmp.name).unlink()
-
-    async def test_export_csv_with_subset(self)
-> None:
-        """Test exporting a subset of columns."""
-        csv_content = "col1,col2,col3,col4\n1,2,3,4\n5,6,7,8"
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-
-        import tempfile
-
-        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
-            export_result = await export_csv(create_mock_context(session_id), file_path=tmp.name)
-
-            assert export_result.rows_exported == 2
-
-            # Verify exported data
-            import pandas as pd
-
-            df = pd.read_csv(tmp.name)
-            assert len(df.columns) == 4  # All columns exported
-
-            # Clean up
-            Path(tmp.name).unlink()
-
-    async def test_export_invalid_session(self) -> None:
-        """Test exporting with invalid session."""
-        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
-            with pytest.raises(NoDataLoadedError):
-                await export_csv(create_mock_context("nonexistent-session-id"), file_path=tmp.name)
-            # Clean up
-            Path(tmp.name).unlink(missing_ok=True)
-
-    async def test_export_no_data_loaded(self) -> None:
-        """Test exporting when no data is loaded."""
-        session_manager = get_session_manager()
-        session_id = "empty_session_test"
-        session_manager.get_or_create_session(session_id)
-
-        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
-            with pytest.raises(NoDataLoadedError):
-                await export_csv(create_mock_context(session_id), file_path=tmp.name)
-            # Clean up
-            Path(tmp.name).unlink(missing_ok=True)
-
-
-class TestMemoryAndPerformance:
-    """Test memory and performance constraints."""
-
-    async def test_load_large_number_of_columns(self) -> None:
-        """Test loading CSV with many columns."""
-        # Create CSV with 100 columns
-        columns = [f"col_{i}" for i in range(100)]
-        header = ",".join(columns)
-        row = ",".join(str(i) for i in range(100))
-        csv_content = f"{header}\n{row}\n{row}"
-
-        result = await load_csv_from_content(create_mock_context(), csv_content)
-
-        assert len(result.columns_affected) == 100
-        assert result.rows_affected == 2
-
-
async def test_session_memory_tracking(self) -> None:
-        """Test that memory usage is tracked."""
-        csv_content = "col1,col2,col3\n" + "\n".join(f"{i},{i + 1},{i + 2}" for i in range(100))
-
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-        info = await get_session_info(create_mock_context(session_id))
-
-        # SessionInfoResult doesn't have memory_usage_mb field
-        # Just check that session has data loaded
-        assert info.data_loaded is True
diff --git a/tests/unit/servers/test_io_server_coverage.py b/tests/unit/servers/test_io_server_coverage.py
deleted file mode 100644
index ae664d8..0000000
--- a/tests/unit/servers/test_io_server_coverage.py
+++ /dev/null
@@ -1,283 +0,0 @@
-"""Comprehensive tests to improve io_server.py coverage to 80%+."""
-
-import tempfile
-from pathlib import Path
-from unittest.mock import patch
-
-import pandas as pd
-import pytest
-from fastmcp.exceptions import ToolError
-
-from databeak.servers.io_server import (
-    export_csv,
-    get_encoding_fallbacks,
-    get_session_info,
-    load_csv,
-    load_csv_from_content,
-)
-from tests.test_mock_context import create_mock_context
-
-
-class TestEncodingFallbacks:
-    """Test encoding detection and fallback mechanisms."""
-
-    def test_get_encoding_fallbacks_utf8(self) -> None:
-        """Test fallback encodings for UTF-8."""
-        fallbacks = get_encoding_fallbacks("utf-8")
-        # UTF-8 is not included when it's the primary encoding (line 245 in io_server.py)
-        assert "utf-8-sig" in fallbacks
-        assert "latin1" in fallbacks  # Note: latin1 not latin-1
-        assert "iso-8859-1" in fallbacks
-
-    def test_get_encoding_fallbacks_latin1(self) -> None:
-        """Test fallback encodings for Latin-1."""
-        fallbacks = get_encoding_fallbacks("latin1")
-        assert "latin1" in fallbacks
-        assert "utf-8" in fallbacks
-        assert "cp1252" in fallbacks
-
-    def test_get_encoding_fallbacks_windows(self) -> None:
-        """Test fallback encodings for Windows-1252."""
-        fallbacks =
get_encoding_fallbacks("cp1252")
-        assert "cp1252" in fallbacks
-        assert "windows-1252" in fallbacks
-
-    def test_get_encoding_fallbacks_unknown(self) -> None:
-        """Test fallback encodings for unknown encoding."""
-        fallbacks = get_encoding_fallbacks("unknown-encoding")
-        # Should return the primary encoding first
-        assert "unknown-encoding" in fallbacks
-        assert "utf-8" in fallbacks
-        assert "cp1252" in fallbacks
-
-
-class TestLoadCsvWithEncoding:
-    """Test CSV loading with various encodings."""
-
-    async def test_load_csv_with_encoding_fallback(self) -> None:
-        """Test loading CSV with encoding that needs fallback."""
-        # Create a file with Latin-1 encoding
-        with tempfile.NamedTemporaryFile(
-            mode="w",
-            encoding="latin-1",
-            suffix=".csv",
-            delete=False,
-        ) as f:
-            f.write("name,city\n")
-            f.write("José,São Paulo\n")  # Latin-1 characters
-            f.write("François,Montréal\n")
-            temp_path = f.name
-
-        try:
-            # Try to load with wrong encoding first (will trigger fallback)
-            result = await load_csv(
-                create_mock_context(),
-                file_path=temp_path,
-                encoding="ascii",  # This will fail and trigger fallback
-            )
-
-            assert result.rows_affected == 2
-            assert result.columns_affected == ["name", "city"]
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_load_csv_with_utf8_bom(self) -> None:
-        """Test loading CSV with UTF-8 BOM."""
-        # Create a file with UTF-8 BOM
-        with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as f:
-            # Write BOM
-            f.write(b"\xef\xbb\xbf")
-            # Write CSV content
-            f.write(b"name,value\ntest,123\n")
-            temp_path = f.name
-
-        try:
-            result = await load_csv(create_mock_context(), file_path=temp_path)
-            assert result.rows_affected == 1
-            assert result.columns_affected == ["name", "value"]
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_load_csv_encoding_error_all_fallbacks_fail(self) -> None:
-        """Test when all encoding fallbacks fail."""
-        # Create a file with mixed/corrupted encoding
-        with
tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as f:
-            # Write some invalid UTF-8 sequences
-            f.write(b"col1,col2\n")
-            f.write(b"\xff\xfe invalid bytes \xfd\xfc\n")
-            temp_path = f.name
-
-        try:
-            # This should try all fallbacks and eventually succeed with error handling
-            result = await load_csv(create_mock_context(), file_path=temp_path, encoding="utf-8")
-            # latin-1 should handle any byte sequence
-            assert result is not None
-        except ToolError:
-            # Or it might fail completely which is also acceptable
-            pass
-        finally:
-            Path(temp_path).unlink()
-
-
-class TestLoadCsvSizeConstraints:
-    """Test file size and memory constraints."""
-
-    async def test_load_csv_max_rows_exceeded(self) -> None:
-        """Test loading CSV that exceeds max rows."""
-        # Create a large CSV
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
-            f.write("col1,col2\n")
-            # Write more than MAX_ROWS (1,000,000)
-            for i in range(10):  # Small test, normally would be 1000001
-                f.write(f"{i},value{i}\n")
-            temp_path = f.name
-
-        try:
-            # Mock the MAX_ROWS constant to make test faster
-            with (
-                patch("databeak.servers.io_server.MAX_ROWS", 5),
-                pytest.raises(ToolError),
-            ):
-                await load_csv(create_mock_context(), file_path=temp_path)
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_load_csv_memory_limit_exceeded(self) -> None:
-        """Test loading CSV that exceeds memory limit."""
-        # Create a CSV with large strings
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
-            f.write("col1,col2\n")
-            # Create rows with large strings
-            large_string = "x" * 10000
-            for i in range(10):
-                f.write(f"{i},{large_string}\n")
-            temp_path = f.name
-
-        try:
-            # Mock the MAX_MEMORY_USAGE_MB to trigger the check
-            with (
-                patch("databeak.servers.io_server.MAX_MEMORY_USAGE_MB", 0.001),
-                pytest.raises(ToolError),
-            ):
-                await load_csv(create_mock_context(), file_path=temp_path)
-        finally:
-            Path(temp_path).unlink()
-
-
-class
TestExportCsvAdvanced:
-    """Test advanced export functionality."""
-
-    async def test_export_csv_with_tabs(self) -> None:
-        """Test exporting as TSV (tab-separated)."""
-        # Create session with data
-        csv_content = "name,value,category\ntest1,100,A\ntest2,200,B"
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-
-        with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as f:
-            temp_path = f.name
-
-        try:
-            result = await export_csv(create_mock_context(session_id), file_path=temp_path)
-
-            assert result.success is True
-            assert result.format == "tsv"
-
-            # Verify the file is tab-separated
-            with Path(temp_path).open() as f:
-                content = f.read()
-                assert "\t" in content
-                assert "," not in content.split("\n")[0]  # No commas in header
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_export_csv_with_quotes(self) -> None:
-        """Test exporting with quote handling."""
-        # Create session with data containing commas and quotes
-        csv_content = (
-            'name,description\n"Smith, John","He said ""Hello"""\n"Doe, Jane","Normal text"'
-        )
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-
-        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
-            temp_path = f.name
-
-        try:
-            result = await export_csv(create_mock_context(session_id), file_path=temp_path)
-
-            assert result.success is True
-
-            # Verify quotes are properly handled
-            df = pd.read_csv(temp_path)
-            assert len(df) == 2
-            assert "Smith, John" in df["name"].to_numpy()
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_export_csv_create_directory(self) -> None:
-        """Test export creates directory if it doesn't exist."""
-        csv_content = "col1,col2\n1,2"
-        ctx = create_mock_context()
-        await load_csv_from_content(ctx, csv_content)
-        session_id = ctx.session_id
-
-        # Use a directory that doesn't exist
-        with tempfile.TemporaryDirectory() as tmpdir:
-            new_dir = Path(tmpdir) / "new" /
"nested" / "dir" - file_path = new_dir / "export.csv" - - result = await export_csv(create_mock_context(session_id), file_path=str(file_path)) - - assert result.success is True - assert file_path.exists() - assert new_dir.exists() - - -# The URL loading tests are covered in the main test_io_server.py file -# No need to duplicate them here with complex mocking - - -class TestLoadCsvFromContentEdgeCases: - """Test edge cases in load_csv_from_content.""" - - async def test_load_csv_from_content_single_row(self) -> None: - """Test loading CSV with only header and one row.""" - csv_content = "col1,col2\n1,2" - result = await load_csv_from_content(create_mock_context(), csv_content) - - assert result.rows_affected == 1 - assert result.columns_affected == ["col1", "col2"] - - async def test_load_csv_from_content_special_characters(self) -> None: - """Test loading CSV with special characters.""" - csv_content = "name,symbol\nAlpha,a\nBeta,b\nGamma,y" - result = await load_csv_from_content(create_mock_context(), csv_content) - - assert result.rows_affected == 3 - assert result.columns_affected == ["name", "symbol"] - - async def test_load_csv_from_content_numeric_columns(self) -> None: - """Test loading CSV with numeric column names.""" - csv_content = "1,2,3\na,b,c\nd,e,f" - result = await load_csv_from_content(create_mock_context(), csv_content) - - assert result.rows_affected == 2 - # Pandas converts numeric column names to strings - assert len(result.columns_affected) == 3 - - async def test_load_csv_from_content_with_index(self) -> None: - """Test that data is loaded correctly.""" - csv_content = "id,name,value\n1,test1,100\n2,test2,200" - ctx = create_mock_context() - result = await load_csv_from_content(ctx, csv_content) - session_id = ctx.session_id - - assert result.rows_affected == 2 - assert result.columns_affected == ["id", "name", "value"] - # Verify the session has data - info = await get_session_info(create_mock_context(session_id)) - assert info.row_count == 
2
-        assert info.column_count == 3
diff --git a/tests/unit/servers/test_io_server_coverage_fixes.py b/tests/unit/servers/test_io_server_coverage_fixes.py
deleted file mode 100644
index c64f87e..0000000
--- a/tests/unit/servers/test_io_server_coverage_fixes.py
+++ /dev/null
@@ -1,635 +0,0 @@
-"""Tests to address specific coverage gaps and reach 80%+ coverage."""
-
-import tempfile
-from pathlib import Path
-from unittest.mock import MagicMock, patch
-from urllib.error import URLError
-
-import pandas as pd
-import pytest
-from fastmcp.exceptions import ToolError
-
-from databeak.servers.io_server import (
-    MAX_URL_SIZE_MB,
-    detect_file_encoding,
-    export_csv,
-    get_encoding_fallbacks,
-    get_session_info,
-    load_csv,
-    load_csv_from_content,
-    load_csv_from_url,
-)
-from tests.test_mock_context import create_mock_context
-
-
-class TestEncodingDetectionFallbacks:
-    """Test encoding detection and fallback edge cases."""
-
-    def test_detect_file_encoding_chardet_none_detection(self) -> None:
-        """Test when chardet returns None for encoding."""
-        with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f:
-            f.write(b"test,data\n1,2")
-            temp_path = f.name
-
-        try:
-            with patch("chardet.detect") as mock_detect:
-                mock_detect.return_value = {"encoding": None, "confidence": 0.8}
-
-                encoding = detect_file_encoding(temp_path)
-                assert encoding == "utf-8"  # Should fallback to utf-8
-
-        finally:
-            Path(temp_path).unlink()
-
-    def test_detect_file_encoding_low_confidence(self) -> None:
-        """Test when chardet has low confidence."""
-        with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f:
-            f.write(b"test,data\n1,2")
-            temp_path = f.name
-
-        try:
-            with patch("chardet.detect") as mock_detect:
-                mock_detect.return_value = {"encoding": "ISO-8859-1", "confidence": 0.3}
-
-                encoding = detect_file_encoding(temp_path)
-                assert encoding == "utf-8"  # Should fallback to utf-8 due to low confidence
-
-        finally:
-            Path(temp_path).unlink()
-
-    def
test_detect_file_encoding_import_error(self) -> None:
-        """Test when chardet import fails."""
-        with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f:
-            f.write(b"test,data\n1,2")
-            temp_path = f.name
-
-        try:
-            with patch("chardet.detect", side_effect=ImportError("chardet not available")):
-                encoding = detect_file_encoding(temp_path)
-                assert encoding == "utf-8"  # Should fallback to utf-8
-
-        finally:
-            Path(temp_path).unlink()
-
-    def test_detect_file_encoding_unicode_error(self) -> None:
-        """Test when chardet raises UnicodeError."""
-        with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f:
-            f.write(b"\xff\xfe\x00\x00")  # Invalid UTF sequence
-            temp_path = f.name
-
-        try:
-            with patch("chardet.detect", side_effect=UnicodeError("Invalid sequence")):
-                encoding = detect_file_encoding(temp_path)
-                assert encoding == "utf-8"  # Should fallback to utf-8
-
-        finally:
-            Path(temp_path).unlink()
-
-    def test_detect_file_encoding_os_error(self) -> None:
-        """Test when file reading fails."""
-        with patch("builtins.open", side_effect=OSError("File not accessible")):
-            encoding = detect_file_encoding("/nonexistent/path")
-            assert encoding == "utf-8"  # Should fallback to utf-8
-
-    def test_get_encoding_fallbacks_utf_variants(self) -> None:
-        """Test fallbacks for UTF-16 encoding."""
-        fallbacks = get_encoding_fallbacks("utf-16")
-
-        # Should include UTF variants but not the primary encoding
-        assert "utf-16" in fallbacks
-        assert "utf-8" in fallbacks
-        assert "utf-32" in fallbacks
-        assert "cp1252" in fallbacks
-
-    def test_get_encoding_fallbacks_windows_encoding(self) -> None:
-        """Test fallbacks for Windows-1251 encoding."""
-        fallbacks = get_encoding_fallbacks("windows-1251")
-
-        # Should prioritize Windows encodings
-        assert "windows-1251" in fallbacks
-        assert "cp1251" in fallbacks
-        assert "latin1" in fallbacks
-        assert "utf-8" in fallbacks
-
-    def test_get_encoding_fallbacks_deduplication(self) -> None:
-        """Test that duplicates are removed from
fallback list."""
-        fallbacks = get_encoding_fallbacks("utf-8")
-
-        # Should not have duplicates
-        assert len(fallbacks) == len(set(fallbacks))
-
-
-@pytest.mark.asyncio
-class TestLoadCsvEncodingFallbackPaths:
-    """Test specific encoding fallback paths in load_csv."""
-
-    async def test_load_csv_auto_detection_failure_with_fallback(self) -> None:
-        """Test when auto-detection fails but fallback succeeds."""
-        with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as f:
-            # Write content with special characters
-            f.write("name,city\nJosé,São Paulo".encode("latin1"))
-            temp_path = f.name
-
-        try:
-            with patch("databeak.servers.io_server.detect_file_encoding") as mock_detect:
-                # Mock detection to raise an error
-                mock_detect.side_effect = Exception("Detection failed")
-
-                # This should trigger the fallback encoding path
-                result = await load_csv(
-                    create_mock_context(),
-                    file_path=temp_path,
-                    encoding="utf-8",  # Will fail, trigger fallbacks
-                )
-
-                assert result.success
-                assert result.rows_affected == 1
-
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_load_csv_memory_check_in_fallback_encoding(self) -> None:
-        """Test memory limit check during encoding fallback."""
-        with tempfile.NamedTemporaryFile(
-            mode="w",
-            suffix=".csv",
-            delete=False,
-            encoding="latin1",
-        ) as f:
-            f.write("col1,col2\n1,2")
-            temp_path = f.name
-
-        try:
-            # Mock pandas.read_csv to succeed with fallback but return large df
-            large_df = pd.DataFrame({"col1": ["data"] * 1000, "col2": ["data"] * 1000})
-
-            with (
-                patch("pandas.read_csv") as mock_read_csv,
-                patch(
-                    "databeak.servers.io_server.MAX_MEMORY_USAGE_MB",
-                    0.001,
-                ),  # Very low limit
-            ):
-                # First call fails with encoding error, second returns large df
-                mock_read_csv.side_effect = [
-                    UnicodeDecodeError("utf-8", b"", 0, 1, "encoding error"),
-                    large_df,
-                ]
-
-                with pytest.raises(ToolError):
-                    await load_csv(create_mock_context(), file_path=temp_path, encoding="utf-8")
-
-        finally:
-
Path(temp_path).unlink()
-
-    async def test_load_csv_row_limit_in_fallback_encoding(self) -> None:
-        """Test row limit check during encoding fallback."""
-        with tempfile.NamedTemporaryFile(
-            mode="w",
-            suffix=".csv",
-            delete=False,
-            encoding="latin1",
-        ) as f:
-            f.write("col1,col2\n1,2")
-            temp_path = f.name
-
-        try:
-            # Mock pandas.read_csv to succeed with fallback but return large df
-            large_df = pd.DataFrame({"col1": range(10000), "col2": range(10000)})
-
-            with (
-                patch("pandas.read_csv") as mock_read_csv,
-                patch("databeak.servers.io_server.MAX_ROWS", 5),  # Very low limit
-            ):
-                # First call fails with encoding error, second returns large df
-                mock_read_csv.side_effect = [
-                    UnicodeDecodeError("utf-8", b"", 0, 1, "encoding error"),
-                    large_df,
-                ]
-
-                with pytest.raises(ToolError):
-                    await load_csv(create_mock_context(), file_path=temp_path, encoding="utf-8")
-
-        finally:
-            Path(temp_path).unlink()
-
-
-@pytest.mark.asyncio
-class TestLoadCsvErrorPaths:
-    """Test error handling paths in load_csv."""
-
-    async def test_load_csv_os_error(self) -> None:
-        """Test OSError handling in load_csv."""
-        with (
-            patch("pathlib.Path.stat", side_effect=OSError("File access denied")),
-            pytest.raises(ToolError),
-        ):
-            await load_csv(create_mock_context(), file_path="/some/file.csv")
-
-    async def test_load_csv_pandas_empty_data_error(self) -> None:
-        """Test pandas EmptyDataError handling."""
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
-            f.write("")  # Empty file
-            temp_path = f.name
-
-        try:
-            with (
-                patch("pandas.read_csv", side_effect=pd.errors.EmptyDataError("No data")),
-                pytest.raises(pd.errors.EmptyDataError, match="No data"),
-            ):
-                await load_csv(create_mock_context(), file_path=temp_path)
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_load_csv_pandas_parser_error(self) -> None:
-        """Test pandas ParserError handling."""
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
-
f.write("invalid,csv\ndata")
-            temp_path = f.name
-
-        try:
-            with (
-                patch("pandas.read_csv", side_effect=pd.errors.ParserError("Parse failed")),
-                pytest.raises(pd.errors.ParserError, match="Parse failed"),
-            ):
-                await load_csv(create_mock_context(), file_path=temp_path)
-        finally:
-            Path(temp_path).unlink()
-
-    async def test_load_csv_memory_error(self) -> None:
-        """Test MemoryError handling."""
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
-            f.write("col1,col2\n1,2")
-            temp_path = f.name
-
-        try:
-            with (
-                patch("pandas.read_csv", side_effect=MemoryError("Out of memory")),
-                pytest.raises(MemoryError),
-            ):
-                await load_csv(create_mock_context(), file_path=temp_path)
-        finally:
-            Path(temp_path).unlink()
-
-
-@pytest.mark.asyncio
-class TestLoadCsvFromUrlEncodingFallbacks:
-    """Test URL loading encoding fallback paths."""
-
-    @patch("databeak.servers.io_server.urlopen")
-    @patch("pandas.read_csv")
-    async def test_load_url_memory_check_in_fallback(
-        self, mock_read_csv: MagicMock, mock_urlopen: MagicMock
-    ) -> None:
-        """Test memory check during URL encoding fallback."""
-        # Mock response
-        mock_response = MagicMock()
-        mock_response.headers = {"Content-Type": "text/csv", "Content-Length": "100"}
-        mock_urlopen.return_value.__enter__.return_value = mock_response
-
-        # Large DataFrame for memory limit test
-        large_df = pd.DataFrame({"col1": ["x"] * 10000, "col2": ["y"] * 10000})
-
-        # First encoding fails, second succeeds but exceeds memory
-        mock_read_csv.side_effect = [
-            UnicodeDecodeError("utf-8", b"", 0, 1, "encoding error"),
-            large_df,
-        ]
-
-        with (
-            patch("databeak.servers.io_server.MAX_MEMORY_USAGE_MB", 0.001),
-            pytest.raises(ToolError),
-        ):
-            await load_csv_from_url(
-                create_mock_context(),
-                url="http://example.com/data.csv",
-                encoding="utf-8",
-            )
-
-    @patch("databeak.servers.io_server.urlopen")
-    @patch("pandas.read_csv")
-    async def test_load_url_row_check_in_fallback(
-        self, mock_read_csv: MagicMock,
mock_urlopen: MagicMock
-    ) -> None:
-        """Test row limit check during URL encoding fallback."""
-        # Mock response
-        mock_response = MagicMock()
-        mock_response.headers = {"Content-Type": "text/csv", "Content-Length": "100"}
-        mock_urlopen.return_value.__enter__.return_value = mock_response
-
-        # Large DataFrame for row limit test
-        large_df = pd.DataFrame({"col1": range(10000), "col2": range(10000)})
-
-        # First encoding fails, second succeeds but exceeds row limit
-        mock_read_csv.side_effect = [
-            UnicodeDecodeError("utf-8", b"", 0, 1, "encoding error"),
-            large_df,
-        ]
-
-        with patch("databeak.servers.io_server.MAX_ROWS", 5), pytest.raises(ToolError):
-            await load_csv_from_url(
-                create_mock_context(),
-                url="http://example.com/data.csv",
-                encoding="utf-8",
-            )
-
-
-@pytest.mark.asyncio
-class TestLoadCsvFromUrlErrorPaths:
-    """Test error handling paths in load_csv_from_url."""
-
-    async def test_load_url_timeout_error(self) -> None:
-        """Test timeout error handling."""
-        with (
-            patch(
-                "databeak.servers.io_server.urlopen",
-                side_effect=TimeoutError("Request timeout"),
-            ),
-            pytest.raises(ToolError),
-        ):
-            await load_csv_from_url(create_mock_context(), url="http://example.com/data.csv")
-
-    async def test_load_url_url_error(self) -> None:
-        """Test URLError handling."""
-        with (
-            patch(
-                "databeak.servers.io_server.urlopen",
-                side_effect=URLError("Connection failed"),
-            ),
-            pytest.raises(ToolError),
-        ):
-            await load_csv_from_url(create_mock_context(), url="http://example.com/data.csv")
-
-    @patch("databeak.servers.io_server.urlopen")
-    async def test_load_url_content_size_exceeded(self, mock_urlopen: MagicMock) -> None:
-        """Test content size limit exceeded."""
-        mock_response = MagicMock()
-        mock_response.headers = {
-            "Content-Type": "text/csv",
-            "Content-Length": str((MAX_URL_SIZE_MB + 10) * 1024 * 1024),  # Exceed limit
-        }
-        mock_urlopen.return_value.__enter__.return_value = mock_response
-
-        with pytest.raises(ToolError):
-            await
load_csv_from_url(create_mock_context(), url="http://example.com/large_file.csv")
-
-    @patch("databeak.servers.io_server.urlopen")
-    async def test_load_url_content_type_warning(self, mock_urlopen: MagicMock) -> None:
-        """Test content type warning path."""
-        mock_response = MagicMock()
-        mock_response.headers = {"Content-Type": "text/html", "Content-Length": "100"}
-        mock_urlopen.return_value.__enter__.return_value = mock_response
-
-        # Mock pandas to succeed
-        with patch("pandas.read_csv", return_value=pd.DataFrame({"col": [1, 2]})):
-            result = await load_csv_from_url(
-                create_mock_context(),
-                url="http://example.com/data.csv",
-            )
-            assert result.success
-
-    async def test_load_url_pandas_empty_data_error(self) -> None:
-        """Test pandas EmptyDataError in URL loading."""
-        with patch("databeak.servers.io_server.urlopen") as mock_urlopen:
-            mock_response = MagicMock()
-            mock_response.headers = {"Content-Type": "text/csv"}
-            mock_urlopen.return_value.__enter__.return_value = mock_response
-
-            with (
-                patch("pandas.read_csv", side_effect=pd.errors.EmptyDataError("No data")),
-                pytest.raises(pd.errors.EmptyDataError, match="No data"),
-            ):
-                await load_csv_from_url(create_mock_context(), url="http://example.com/empty.csv")
-
-    async def test_load_url_pandas_parser_error(self) -> None:
-        """Test pandas ParserError in URL loading."""
-        with patch("databeak.servers.io_server.urlopen") as mock_urlopen:
-            mock_response = MagicMock()
-            mock_response.headers = {"Content-Type": "text/csv"}
-            mock_urlopen.return_value.__enter__.return_value = mock_response
-
-            with (
-                patch("pandas.read_csv", side_effect=pd.errors.ParserError("Parse error")),
-                pytest.raises(pd.errors.ParserError, match="Parse error"),
-            ):
-                await load_csv_from_url(create_mock_context(), url="http://example.com/bad.csv")
-
-    async def test_load_url_memory_error(self) -> None:
-        """Test MemoryError in URL loading."""
-        with patch("databeak.servers.io_server.urlopen") as mock_urlopen:
-            mock_response =
MagicMock() - mock_response.headers = {"Content-Type": "text/csv"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - with ( - patch("pandas.read_csv", side_effect=MemoryError("Out of memory")), - pytest.raises(MemoryError), - ): - await load_csv_from_url(create_mock_context(), url="http://example.com/large.csv") - - async def test_load_url_os_error(self) -> None: - """Test OSError in URL loading.""" - with patch("databeak.servers.io_server.urlopen") as mock_urlopen: - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - with ( - patch("pandas.read_csv", side_effect=OSError("Network error")), - pytest.raises(OSError), - ): - await load_csv_from_url(create_mock_context(), url="http://example.com/file.csv") - - -@pytest.mark.asyncio -class TestLoadCsvFromContentErrorPaths: - """Test error handling paths in load_csv_from_content.""" - - async def test_load_content_empty_dataframe(self) -> None: - """Test when parsed CSV results in empty DataFrame.""" - # Mock pandas to return empty DataFrame - with ( - patch("pandas.read_csv", return_value=pd.DataFrame()), - pytest.raises(ToolError), - ): - await load_csv_from_content(create_mock_context(), content="header\n") - - -@pytest.mark.asyncio -class TestExportCsvErrorPaths: - """Test error handling paths in export_csv.""" - - async def test_export_csv_excel_dependency_error(self) -> None: - """Test Excel export with missing openpyxl dependency.""" - # Create session with data - csv_content = "name,value\ntest,123" - ctx = create_mock_context() - await load_csv_from_content(ctx, csv_content) - session_id = ctx.session_id - - with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp: - temp_path = tmp.name - - try: - with ( - patch("pandas.ExcelWriter", side_effect=ImportError("No module named 'openpyxl'")), - pytest.raises(ToolError), - ): - await export_csv(create_mock_context(session_id), 
file_path=temp_path) - finally: - Path(temp_path).unlink(missing_ok=True) - - async def test_export_csv_parquet_dependency_error(self) -> None: - """Test Parquet export with missing pyarrow dependency.""" - # Create session with data - csv_content = "name,value\ntest,123" - ctx = create_mock_context() - await load_csv_from_content(ctx, csv_content) - session_id = ctx.session_id - - with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp: - temp_path = tmp.name - - try: - with ( - patch( - "pandas.DataFrame.to_parquet", - side_effect=ImportError("No module named 'pyarrow'"), - ), - pytest.raises(ToolError), - ): - await export_csv(create_mock_context(session_id), file_path=temp_path) - finally: - Path(temp_path).unlink(missing_ok=True) - - async def test_export_csv_invalid_path_error(self) -> None: - """Test export with invalid file path.""" - # Create session with data - csv_content = "name,value\ntest,123" - ctx = create_mock_context() - await load_csv_from_content(ctx, csv_content) - session_id = ctx.session_id - - with pytest.raises(ToolError): - await export_csv(create_mock_context(session_id), file_path="\x00invalid\x00path") - - # Note: temp file cleanup test removed since export_csv no longer uses temp files - - -@pytest.mark.asyncio -class TestSessionManagementErrorPaths: - """Test error handling in session management functions.""" - - async def test_get_session_info_exception_handling(self) -> None: - """Test exception handling in get_session_info.""" - with patch("databeak.servers.io_server.get_session_only") as mock_get_session_only: - mock_get_session_only.side_effect = Exception("Session manager error") - with pytest.raises(Exception, match="Session manager error"): - await get_session_info(create_mock_context()) - - -@pytest.mark.asyncio -class TestSpecificCoveragePaths: - """Target specific uncovered lines to reach 80% coverage.""" - - async def test_load_csv_other_exception_in_fallback(self) -> None: - """Test non-UnicodeDecodeError 
exception during encoding fallback.""" - with tempfile.NamedTemporaryFile( - mode="w", - suffix=".csv", - delete=False, - encoding="latin1", - ) as f: - f.write("col1,col2\n1,2") - temp_path = f.name - - try: - with patch("pandas.read_csv") as mock_read_csv: - # First call: UnicodeDecodeError, Second call: different error, Third: success - mock_read_csv.side_effect = [ - UnicodeDecodeError("utf-8", b"", 0, 1, "encoding error"), - ValueError("Different error type"), - pd.DataFrame({"col1": [1], "col2": [2]}), - ] - - result = await load_csv( - create_mock_context(), - file_path=temp_path, - encoding="utf-8", - ) - assert result.success - assert mock_read_csv.call_count == 3 - - finally: - Path(temp_path).unlink() - - @patch("databeak.servers.io_server.urlopen") - @patch("pandas.read_csv") - async def test_load_url_other_exception_in_fallback( - self, mock_read_csv: MagicMock, mock_urlopen: MagicMock - ) -> None: - """Test non-UnicodeDecodeError exception during URL encoding fallback.""" - # Mock response - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv", "Content-Length": "100"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - # Encoding error, then different error, then success - mock_read_csv.side_effect = [ - UnicodeDecodeError("utf-8", b"", 0, 1, "encoding error"), - ValueError("Different error"), - pd.DataFrame({"col": [1, 2]}), - ] - - result = await load_csv_from_url( - create_mock_context(), - url="http://example.com/data.csv", - encoding="utf-8", - ) - - assert result.success - assert mock_read_csv.call_count == 3 - - async def test_load_csv_df_none_after_fallback_attempt(self) -> None: - """Test when df remains None after encoding fallback.""" - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("col1,col2\n1,2") - temp_path = f.name - - try: - with ( - patch( - "pandas.read_csv", - side_effect=UnicodeDecodeError("utf-8", b"", 0, 1, "always fails"), - ), - 
pytest.raises(ToolError), - ): - await load_csv(create_mock_context(), file_path=temp_path, encoding="utf-8") - finally: - Path(temp_path).unlink() - - @patch("databeak.servers.io_server.urlopen") - async def test_load_url_df_none_after_fallback(self, mock_urlopen: MagicMock) -> None: - """Test when df remains None after URL encoding fallback.""" - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - with ( - patch( - "pandas.read_csv", - side_effect=UnicodeDecodeError("utf-8", b"", 0, 1, "always fails"), - ), - pytest.raises(ToolError), - ): - await load_csv_from_url(create_mock_context(), url="http://example.com/data.csv") - - @patch("databeak.servers.io_server.urlopen") - async def test_load_url_df_none_check(self, mock_urlopen: MagicMock) -> None: - """Test URL loading df None check after successful response.""" - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - with patch("pandas.read_csv", return_value=None), pytest.raises(TypeError): - await load_csv_from_url(create_mock_context(), url="http://example.com/data.csv") diff --git a/tests/unit/servers/test_io_server_encoding.py b/tests/unit/servers/test_io_server_encoding.py deleted file mode 100644 index 0f23988..0000000 --- a/tests/unit/servers/test_io_server_encoding.py +++ /dev/null @@ -1,204 +0,0 @@ -"""Tests specifically for encoding handling in io_server to reach 80% coverage.""" - -import tempfile -from pathlib import Path -from unittest.mock import AsyncMock, MagicMock, patch - -import pandas as pd -import pytest -from fastmcp.exceptions import ToolError - -from databeak.servers.io_server import ( - detect_file_encoding, - load_csv, - load_csv_from_url, -) -from tests.test_mock_context import create_mock_context - - -class TestFileEncodingDetection: - """Test file encoding detection.""" - - 
@pytest.mark.parametrize( - ("mock_encoding", "mock_confidence", "file_content", "expected_encoding"), - [ - ("UTF-8", 0.95, b"test content", "utf-8"), - ("ISO-8859-1", 0.3, b"\xef\xbb\xbftest,data\n1,2", ["utf-8", "utf-8-sig"]), - (None, 0, b"test,data\n1,2", "utf-8"), - ], - ) - @patch("chardet.detect") - def test_encoding_detection_scenarios( - self, - mock_detect: MagicMock, - mock_encoding: str | None, - mock_confidence: float, - file_content: bytes, - expected_encoding: str | list[str], - ) -> None: - """Test encoding detection with different chardet results.""" - mock_detect.return_value = {"encoding": mock_encoding, "confidence": mock_confidence} - - with tempfile.NamedTemporaryFile(mode="wb", delete=False) as f: - f.write(file_content) - temp_path = f.name - - try: - encoding = detect_file_encoding(temp_path) - if isinstance(expected_encoding, list): - assert encoding in expected_encoding - else: - assert encoding == expected_encoding - mock_detect.assert_called_once() - finally: - Path(temp_path).unlink() - - -class TestLoadCsvEncodingFallbacks: - """Test CSV loading with encoding fallbacks.""" - - async def test_load_csv_with_context_reporting(self) -> None: - """Test load_csv with context for progress reporting.""" - # Create a test file - with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f: - f.write("col1,col2\n1,2\n3,4") - temp_path = f.name - - try: - # Mock context with proper session_id - mock_ctx = MagicMock() - mock_ctx.session_id = "test_session_id" - mock_ctx.info = AsyncMock(return_value=None) - mock_ctx.report_progress = AsyncMock(return_value=None) - - result = await load_csv(mock_ctx, file_path=temp_path) - - assert result.rows_affected == 2 - # Progress should be reported - mock_ctx.report_progress.assert_called() - mock_ctx.info.assert_called() - finally: - Path(temp_path).unlink() - - @patch("pandas.read_csv") - async def test_load_csv_all_encodings_fail(self, mock_read_csv: MagicMock) -> None: - """Test when 
all encoding attempts fail.""" - # Make all read attempts fail - mock_read_csv.side_effect = UnicodeDecodeError("utf-8", b"", 0, 1, "invalid") - - with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as f: - f.write(b"test data") - temp_path = f.name - - try: - with pytest.raises(ToolError): - await load_csv(create_mock_context(), file_path=temp_path, encoding="utf-8") - finally: - Path(temp_path).unlink() - - @pytest.mark.skip( - reason="Complex encoding fallback + memory limit edge case - needs refactoring", - ) - async def test_load_csv_memory_check_on_fallback(self) -> None: - """Test memory limit check during encoding fallback.""" - - @pytest.mark.skip(reason="Complex encoding fallback + row limit edge case - needs refactoring") - async def test_load_csv_row_limit_on_fallback(self) -> None: - """Test row limit check during encoding fallback.""" - - -class TestLoadCsvFromUrlFallbacks: - """Test URL loading with encoding fallbacks.""" - - @patch("databeak.servers.io_server.urlopen") - @patch("pandas.read_csv") - async def test_load_url_encoding_fallback_success( - self, mock_read_csv: MagicMock, mock_urlopen: MagicMock - ) -> None: - """Test URL loading with successful encoding fallback.""" - mock_df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}) - - # Mock urlopen response - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv", "Content-Length": "100"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - # First call fails with encoding error, second succeeds - mock_read_csv.side_effect = [UnicodeDecodeError("utf-8", b"", 0, 1, "invalid"), mock_df] - - # Mock context with proper session_id - mock_ctx = MagicMock() - mock_ctx.session_id = "test_session_id" - mock_ctx.info = AsyncMock(return_value=None) - mock_ctx.error = AsyncMock(return_value=None) - mock_ctx.report_progress = AsyncMock(return_value=None) - - result = await load_csv_from_url( - mock_ctx, - url="http://example.com/data.csv", - 
encoding="utf-8", - ) - - assert result.rows_affected == 2 - assert mock_read_csv.call_count == 2 - mock_ctx.info.assert_called() - - @pytest.mark.skip( - reason="Complex URL encoding fallback + memory limit edge case - needs refactoring", - ) - async def test_load_url_memory_check_fallback(self) -> None: - """Test URL loading with memory check during fallback.""" - - @pytest.mark.skip( - reason="Complex URL encoding fallback + row limit edge case - needs refactoring", - ) - async def test_load_url_row_limit_fallback(self) -> None: - """Test URL loading with row limit during fallback.""" - - @patch("databeak.servers.io_server.urlopen") - @patch("pandas.read_csv") - async def test_load_url_all_encodings_fail( - self, mock_read_csv: MagicMock, mock_urlopen: MagicMock - ) -> None: - """Test URL loading when all encodings fail.""" - # Mock urlopen response - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv", "Content-Length": "100"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - # All attempts fail - mock_read_csv.side_effect = UnicodeDecodeError("utf-8", b"", 0, 1, "invalid") - - with pytest.raises(ToolError): - await load_csv_from_url( - create_mock_context(), - url="http://example.com/data.csv", - encoding="utf-8", - ) - - @patch("databeak.servers.io_server.urlopen") - @patch("pandas.read_csv") - async def test_load_url_other_exception_during_fallback( - self, mock_read_csv: MagicMock, mock_urlopen: MagicMock - ) -> None: - """Test URL loading with non-encoding exception during fallback.""" - # Mock urlopen response - mock_response = MagicMock() - mock_response.headers = {"Content-Type": "text/csv", "Content-Length": "100"} - mock_urlopen.return_value.__enter__.return_value = mock_response - - # First encoding error, then different error - mock_read_csv.side_effect = [ - UnicodeDecodeError("utf-8", b"", 0, 1, "invalid"), - ValueError("Different error"), - pd.DataFrame({"col": [1]}), # Eventually succeeds - ] - 
-        result = await load_csv_from_url(
-            create_mock_context(),
-            url="http://example.com/data.csv",
-            encoding="utf-8",
-        )
-
-        assert result.rows_affected == 1
-        assert mock_read_csv.call_count == 3
diff --git a/tests/unit/servers/test_system_server.py b/tests/unit/servers/test_system_server.py
index 8029082..0f6c6c0 100644
--- a/tests/unit/servers/test_system_server.py
+++ b/tests/unit/servers/test_system_server.py
@@ -309,7 +309,7 @@ async def test_get_server_info_basic_structure(self) -> None:
         """Test server info returns proper structure with all required fields."""
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 500
+            mock_config.max_download_size_mb = 500
             mock_config.session_timeout = 3600  # 1 hour in seconds
             mock_settings.return_value = mock_config
 
@@ -323,7 +323,7 @@ async def test_get_server_info_basic_structure(self) -> None:
             assert "comprehensive MCP server" in result.description
 
             # Verify configuration
-            assert result.max_file_size_mb == 500
+            assert result.max_download_size_mb == 500
             assert result.session_timeout_minutes == 60  # Converted from seconds
 
     @pytest.mark.asyncio
@@ -331,7 +331,7 @@ async def test_get_server_info_capabilities_structure(self) -> None:
         """Test server info includes all expected capability categories."""
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 100
+            mock_config.max_download_size_mb = 100
             mock_config.session_timeout = 1800
             mock_settings.return_value = mock_config
 
@@ -357,7 +357,7 @@ async def test_get_server_info_data_io_capabilities(self) -> None:
         """Test server info includes expected data I/O capabilities."""
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 200
+            mock_config.max_download_size_mb = 200
             mock_config.session_timeout = 7200
             mock_settings.return_value = mock_config
 
@@ -365,44 +365,13 @@ async def test_get_server_info_data_io_capabilities(self) -> None:
             data_io_caps = result.capabilities["data_io"]
 
             expected_io_caps = [
-                "load_csv",
                 "load_csv_from_url",
                 "load_csv_from_content",
-                "export_csv",
-                "multiple_export_formats",
             ]
 
             for cap in expected_io_caps:
                 assert cap in data_io_caps
 
-    @pytest.mark.asyncio
-    async def test_get_server_info_supported_formats(self) -> None:
-        """Test server info includes expected supported formats."""
-        with patch("databeak.servers.system_server.get_settings") as mock_settings:
-            mock_config = Mock()
-            mock_config.max_file_size_mb = 300
-            mock_config.session_timeout = 1200
-            mock_settings.return_value = mock_config
-
-            result = await get_server_info(create_mock_context())
-
-            expected_formats = [
-                "csv",
-                "tsv",
-                "json",
-                "excel",
-                "parquet",
-                "html",
-                "markdown",
-            ]
-
-            for fmt in expected_formats:
-                assert fmt in result.supported_formats
-
-            # Verify it's a proper list
-            assert isinstance(result.supported_formats, list)
-            assert len(result.supported_formats) == len(expected_formats)
-
     @pytest.mark.asyncio
     async def test_get_server_info_with_context(self) -> None:
         """Test server info with FastMCP context logging."""
@@ -412,7 +381,7 @@ async def test_get_server_info_with_context(self) -> None:
 
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 150
+            mock_config.max_download_size_mb = 150
             mock_config.session_timeout = 2400
             mock_settings.return_value = mock_config
 
@@ -442,7 +411,7 @@ async def test_get_server_info_null_handling_capabilities(self) -> None:
         """Test server info includes comprehensive null handling capabilities."""
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 400
+            mock_config.max_download_size_mb = 400
             mock_config.session_timeout = 5400
             mock_settings.return_value = mock_config
 
@@ -465,7 +434,7 @@ async def test_get_server_info_data_manipulation_capabilities(self) -> None:
         """Test server info includes expected data manipulation capabilities."""
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 250
+            mock_config.max_download_size_mb = 250
             mock_config.session_timeout = 3000
             mock_settings.return_value = mock_config
 
@@ -493,7 +462,7 @@ async def test_get_server_info_response_model_validation(self) -> None:
         """Test server info response validates as proper Pydantic model."""
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 600
+            mock_config.max_download_size_mb = 600
             mock_config.session_timeout = 7200
             mock_settings.return_value = mock_config
 
@@ -507,8 +476,7 @@ async def test_get_server_info_response_model_validation(self) -> None:
             assert "name" in result_dict
             assert "version" in result_dict
             assert "capabilities" in result_dict
-            assert "supported_formats" in result_dict
-            assert "max_file_size_mb" in result_dict
+            assert "max_download_size_mb" in result_dict
             assert "session_timeout_minutes" in result_dict
 
             # Verify capabilities is a dict of lists
@@ -525,7 +493,7 @@ async def test_get_server_info_returns_actual_version(self) -> None:
 
         with patch("databeak.servers.system_server.get_settings") as mock_settings:
             mock_config = Mock()
-            mock_config.max_file_size_mb = 500
+            mock_config.max_download_size_mb = 500
             mock_config.session_timeout = 3600
             mock_settings.return_value = mock_config
 
diff --git a/uv.lock b/uv.lock
index e471d3c..3adda27 100644
--- a/uv.lock
+++ b/uv.lock
@@ -190,15 +190,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/c5/55/51844dd50c4fc7a33b653bfaba4c2456f06955289ca770a5dbd5fd267374/cfgv-3.4.0-py2.py3-none-any.whl", hash = "sha256:b7265b1f29fd3316bfcd2b330d63d024f2bfd8bcb8b0272f8e19a504856c48f9", size = 7249, upload-time = "2023-08-12T20:38:16.269Z" },
 ]
 
-[[package]]
-name = "chardet"
-version = "5.2.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/f3/0d/f7b6ab21ec75897ed80c17d79b15951a719226b9fababf1e40ea74d69079/chardet-5.2.0.tar.gz", hash = "sha256:1b3b6ff479a8c414bc3fa2c0852995695c4a026dcd6d0633b2dd092ca39c1cf7", size = 2069618, upload-time = "2023-08-01T19:23:02.662Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/38/6f/f5fbc992a329ee4e0f288c1fe0e2ad9485ed064cac731ed2fe47dcc38cbf/chardet-5.2.0-py3-none-any.whl", hash = "sha256:e1cf59446890a00105fe7b7912492ea04b6e6f06d4b742b2c788469e34c82970", size = 199385, upload-time = "2023-08-01T19:23:00.661Z" },
-]
-
 [[package]]
 name = "charset-normalizer"
 version = "3.4.3"
@@ -426,7 +417,6 @@ version = "0.1.2"
 source = { editable = "." }
 dependencies = [
     { name = "aiofiles" },
-    { name = "chardet" },
     { name = "fastapi" },
     { name = "fastmcp" },
     { name = "httpx" },
@@ -469,7 +459,6 @@ dev = [
     { name = "twine" },
     { name = "ty" },
     { name = "types-aiofiles" },
-    { name = "types-chardet" },
     { name = "types-jsonschema" },
     { name = "types-psutil" },
     { name = "types-pytz" },
@@ -479,7 +468,6 @@ dev = [
 [package.metadata]
 requires-dist = [
     { name = "aiofiles", specifier = ">=24.1.0" },
-    { name = "chardet", specifier = ">=5.2.0" },
     { name = "fastapi", specifier = ">=0.117.1" },
     { name = "fastmcp", specifier = ">=2.12.4" },
     { name = "httpx", specifier = ">=0.27.0" },
@@ -522,7 +510,6 @@ dev = [
     { name = "twine", specifier = ">=6.1.0" },
     { name = "ty", specifier = ">=0.0.1a21" },
     { name = "types-aiofiles", specifier = ">=24.1.0.20250822" },
-    { name = "types-chardet", specifier = ">=5.0.4.6" },
     { name = "types-jsonschema", specifier = ">=4.25.1.20250822" },
     { name = "types-psutil", specifier = ">=7.0.0.20250822" },
     { name = "types-pytz", specifier = ">=2025.2.0.20250809" },
@@ -2579,15 +2566,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/bc/8e/5e6d2215e1d8f7c2a94c6e9d0059ae8109ce0f5681956d11bb0a228cef04/types_aiofiles-24.1.0.20250822-py3-none-any.whl", hash = "sha256:0ec8f8909e1a85a5a79aed0573af7901f53120dd2a29771dd0b3ef48e12328b0", size = 14322, upload-time = "2025-08-22T03:02:21.918Z" },
 ]
 
-[[package]]
-name = "types-chardet"
-version = "5.0.4.6"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/dd/47/932d35ac07203e936e69102dc9570e83606d386bacb60696f0c403224e86/types-chardet-5.0.4.6.tar.gz", hash = "sha256:caf4c74cd13ccfd8b3313c314aba943b159de562a2573ed03137402b2bb37818", size = 4592, upload-time = "2023-05-10T15:22:21.325Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/10/35/2a06c5c892eb1a0a4f4f74a6aff1ade05da82444af0190cf731761f2c46c/types_chardet-5.0.4.6-py3-none-any.whl", hash = "sha256:ea832d87e798abf1e4dfc73767807c2b7fee35d0003ae90348aea4ae00fb004d", size = 5853, upload-time = "2023-05-10T15:22:19.797Z" },
-]
-
 [[package]]
 name = "types-jsonschema"
 version = "4.25.1.20250822"