v0.6.0 by jamesbconner · Pull Request #69 · jamesbconner/TableSleuth

jamesbconner · 2026-02-23T22:45:10Z

Note

High Risk
Large release that adds a new web server/API surface and changes profiling/query execution paths (Iceberg/Delta scanning and snapshot patching), increasing the chance of regressions and new security/performance issues.

Overview
Introduces an optional browser-based Web UI via a new tablesleuth web CLI command that runs a FastAPI app, serves a bundled Next.js static export, and exposes REST endpoints for Parquet/Iceberg/Delta analysis, configuration management, and GizmoSQL query + snapshot/version comparison.

Updates packaging and automation to build/bundle the frontend into the Python wheel (new [web] extras, Hatch force-include, CI/publish workflows adding Node build steps, .gitignore for web artifacts), plus cross-platform Makefile targets for web dev/release builds and Windows path/URI handling fixes for Iceberg metadata loading/scan.

Enhances performance comparison/profiling to support Delta version scans, adds MOR scan-stat breakdown fields to QueryPerformanceMetrics, and introduces an Iceberg metadata patcher to work around DuckDB delete-manifest issues; docs/changelog are updated and dependencies/version are bumped to 0.6.0.

Written by Cursor Bugbot for commit a3fb1cb.

…ontend (v0.6.0)

- Bump version 0.5.3 → 0.6.0
- Add [web] optional deps: fastapi, uvicorn[standard], python-multipart
- New FastAPI app (src/tablesleuth/api/) with 5 routers: parquet, iceberg, delta, config, gizmosql; all reuse existing services, no new business logic
- New CLI command: tablesleuth web (--host, --port, --no-browser, --log-level) auto-discovered via cli/web.py convention
- Next.js 16 frontend (web-ui/) with static export; pages for all formats, Monaco SQL editor, shadcn/ui styling
- Makefile: dev-api, dev-web, build-web, build-release, start-web targets
- CI: add --extra web to uv sync; overhaul build job to run npm build before uv build so wheel contains the bundled frontend
- Publish: same frontend-first build order before uv build / PyPI push
- 21 API smoke tests in tests/api/ (gracefully skip without web extras)
- src/tablesleuth/web/index.html placeholder committed so hatchling force-include does not fail during uv sync in dev

- Commit web-ui/package-lock.json (required for npm ci in CI/CD) - Remove v0.6.0 from ci.yml branch triggers (version branches merge via PR)

…on equivalents clean, build-release, and zip now use uv run python instead of rm -rf, cp -r, and find/grep/cut — works on Windows, macOS, and Ubuntu.

- Add overrides.minimatch: >=10.2.1 to fix GHSA-3ppc-4f35-3m26 (ReDoS) affecting ESLint plugins (dev tooling only, not shipped in build) - Add autoprefixer as explicit devDependency (was previously implicit transitive; became required after minimatch override resolved packages)

…ssets The /{full_path:path} catch-all was intercepting all requests — including /_next/static/css and /_next/static/js bundles — before StaticFiles could serve them. Browsers received index.html where they expected CSS/JS and discarded it, producing an unstyled, non-interactive page. Next.js static export with trailingSlash:true generates a dedicated index.html per route, so StaticFiles(html=True) handles all routing automatically without a custom fallback.
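The routing idea in this commit can be illustrated without FastAPI. The following is a rough, stdlib-only sketch; `resolve_static()` and `build_demo_export()` are hypothetical helpers written for this note, not Starlette's implementation. It assumes an export tree in which every route has its own `index.html`, as a `trailingSlash: true` build produces, so a directory lookup is enough and asset paths are never shadowed by a fallback.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def resolve_static(root: Path, url_path: str) -> Path | None:
    """Map a URL path to a file under root, trying index.html for directories."""
    root = root.resolve()
    candidate = (root / url_path.lstrip("/")).resolve()
    # Refuse anything that escapes the export directory.
    if candidate != root and root not in candidate.parents:
        return None
    if candidate.is_dir():
        candidate = candidate / "index.html"
    # No catch-all: a missing file is simply "not found".
    return candidate if candidate.is_file() else None


def build_demo_export(root: Path) -> None:
    """Create a tiny tree shaped like a static export (file names are made up)."""
    for rel in ("index.html", "iceberg/index.html", "_next/static/css/app.css"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(rel)


with tempfile.TemporaryDirectory() as tmp:
    export_root = Path(tmp)
    build_demo_export(export_root)
    page = resolve_static(export_root, "/iceberg/")
    stylesheet = resolve_static(export_root, "/_next/static/css/app.css")
    print(page.name, stylesheet.name)
```

Because the page route and the CSS bundle resolve to different files, the browser receives a stylesheet where it expects one, which is the behaviour the commit restores.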

- iceberg_metadata_service: Convert absolute local paths to file:// URI before passing to StaticTable.from_metadata(). On Windows, drive letters like 'D:' are parsed as URI schemes by PyIceberg, causing 'Unrecognized filesystem type in URI: d'. Path.as_uri() produces file:///D:/path/... which PyIceberg handles correctly. S3/GCS/HDFS URIs are passed through. - iceberg router: Replace dataclasses.asdict() with dataclasses.fields() + getattr() in _to_dict(). asdict() deep-copies all fields (including native_table) before the loop can skip it, triggering 'cannot pickle module object' on the PyIceberg Table object.
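The drive-letter problem and the `as_uri()` fix can be reproduced on any platform with the pure path classes. This is a minimal sketch under stated assumptions: `to_metadata_location()` and its `windows` flag are hypothetical names for illustration, and real code would more likely call `Path.as_uri()` on the native path type.

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import urlparse


def to_metadata_location(location: str, windows: bool) -> str:
    """Return a URI for a metadata file; anything that is already a URI passes through."""
    if "://" in location:
        return location  # s3://, gs://, hdfs://, file:// are left untouched
    path_cls = PureWindowsPath if windows else PurePosixPath
    # as_uri() requires an absolute path and raises ValueError otherwise.
    return path_cls(location).as_uri()


# A bare Windows path looks like a URI whose scheme is the drive letter.
print(urlparse("D:/warehouse/t/metadata/v1.metadata.json").scheme)

print(to_metadata_location(r"D:\warehouse\t\metadata\v1.metadata.json", windows=True))
print(to_metadata_location("s3://bucket/t/metadata/v1.metadata.json", windows=True))
```

The first print shows why a loader that dispatches on the URI scheme reports an unknown filesystem `d`; the second shows the `file:///D:/...` form that avoids it.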

…Windows On Windows, Python's urlparse('file:///D:/path/file.json').path returns '/D:/path/file.json' (with a leading slash before the drive letter). PyArrow's LocalFileSystem rejects this path with WinError 123 (invalid filename syntax). The patch intercepts parse_location on Windows and strips the spurious leading slash from drive-letter paths (/D:/path → D:/path). This fix applies to all PyArrow file I/O operations — metadata file read, manifest list, individual manifests, and data files — because all go through the same PyArrowFileIO path.
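The leading-slash behaviour comes from `urlparse` itself, so it can be shown on any operating system. Below is a small sketch of the stripping step only; `strip_drive_slash()` is a hypothetical helper name, and the real patch is described as applying on Windows only.

```python
import re
from urllib.parse import urlparse

# Matches "/D:/..." style paths: a slash, one drive letter, a colon, a slash.
_DRIVE_PATH = re.compile(r"^/[A-Za-z]:/")


def strip_drive_slash(path: str) -> str:
    """Turn '/D:/path' into 'D:/path'; leave every other path unchanged."""
    return path[1:] if _DRIVE_PATH.match(path) else path


parsed = urlparse("file:///D:/path/file.json")
print(parsed.path)                     # path as urlparse reports it
print(strip_drive_slash(parsed.path))  # path a local filesystem API can open
```

Anchoring the pattern at the start and requiring a single letter keeps ordinary POSIX paths such as `/data/file.json` untouched.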

…aml upload, track web-ui/src/lib/

- Fix `PyArrowFileIO.parse_location` monkey-patch: remove erroneous `.__func__` access (a staticmethod accessed via the class IS the raw function); confirmed working by direct test: `file:///D:/path` now resolves to `D:/path` correctly on Windows
- Add `POST /config/pyiceberg/upload` endpoint for uploading .pyiceberg.yaml files
- Add pyiceberg.yaml upload UI to Settings page with file input, status feedback, and live preview refresh after upload
- Add `uploadPyiceberg(file)` to API client using multipart FormData
- Fix root .gitignore: anchor `lib/` and `lib64/` with leading `/` so they only match repo-root Python venv dirs, not `web-ui/src/lib/`
- Track web-ui/src/lib/{api,types,utils}.ts (were excluded by gitignore bug)
- Update web-ui/tsconfig.json: Next.js 16 auto-updates (jsx: react-jsx, target: ES2017)

…ent JS precision loss

Iceberg snapshot IDs are Java long (int64). JavaScript's float64 can only represent integers exactly up to 2^53 - 1 (Number.MAX_SAFE_INTEGER). Snapshot IDs from AWS Glue Catalog routinely exceed this, causing silent rounding:

real:    3994147258430532637
JS sees: 3994147258430532608 ← wrong ID sent back to API → 404 Not Found

Fix: in _to_dict(), any integer with abs(value) > 2^53 is serialized as a JSON string instead of a number. Python's path-param parser and int() conversion handle these string IDs correctly since Python int is arbitrary precision.

- iceberg.py: add _JS_MAX_SAFE_INT constant; emit large ints as strings in _to_dict()
- iceberg.py: CompareRequest snapshot IDs accept str | int; cast to int() before service call
- types.ts: IcebergSnapshotInfo.snapshot_id / parent_snapshot_id → string
- types.ts: IcebergTableInfo.current_snapshot_id → string | null
- api.ts: snapshotDetails and compare accept string snapshot IDs
- iceberg/page.tsx, snapshot-list.tsx: selectedId and onSelect use string
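The rounding and the string workaround can both be checked from Python, since a Python `float` is the same float64 that JavaScript uses for numbers. This is a minimal sketch: `js_safe()` is a hypothetical helper, and the threshold of 2^53 - 1 is an assumption based on `Number.MAX_SAFE_INTEGER` (the commit text states the cutoff as 2^53).

```python
import json

_JS_MAX_SAFE_INT = 2**53 - 1  # assumed to mirror Number.MAX_SAFE_INTEGER


def js_safe(value):
    """Recursively replace integers a float64 cannot hold exactly with strings."""
    if isinstance(value, bool):
        return value  # bool is an int subclass; leave it alone
    if isinstance(value, int) and abs(value) > _JS_MAX_SAFE_INT:
        return str(value)
    if isinstance(value, dict):
        return {key: js_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [js_safe(item) for item in value]
    return value


snapshot_id = 3994147258430532637
print(int(float(snapshot_id)))  # what a float64 round trip does to the ID

payload = json.dumps(js_safe({"snapshot_id": snapshot_id, "sequence_number": 7}))
print(payload)

# On the way back in, int() restores the exact value from the string.
print(int(json.loads(payload)["snapshot_id"]) == snapshot_id)
```

Small integers stay as JSON numbers, so only fields that can actually overflow change type on the client side.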

dataclasses.fields() only returns declared fields, not @property methods. IcebergSnapshotInfo.delete_ratio, read_amplification, and has_deletes are computed properties, so they were silently absent from the API response, causing 'Cannot read properties of undefined (reading toFixed)' in the snapshot detail view. This was masked previously by the int64 precision bug (404 errors before the component rendered); the fix for that exposed this underlying issue. Fix: after serializing dataclass fields, also iterate vars(type(obj)) to pick up any @property descriptors and include their values in the dict.
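A short sketch of this serialization approach follows. `SnapshotInfo` and `to_dict()` are hypothetical stand-ins written for illustration, and the `json` module plays the role of an attribute that cannot be deep-copied, which is what makes `dataclasses.asdict()` unsuitable here.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class SnapshotInfo:
    """Hypothetical stand-in for a snapshot record."""

    snapshot_id: int
    data_rows: int
    delete_rows: int
    native_table: object = None  # a handle that must not be deep-copied

    @property
    def delete_ratio(self) -> float:
        return self.delete_rows / self.data_rows if self.data_rows else 0.0

    @property
    def has_deletes(self) -> bool:
        return self.delete_rows > 0


def to_dict(obj, skip=frozenset()):
    """Serialize declared fields, then add values of @property descriptors."""
    # fields() + getattr() reads attributes without deep-copying them.
    result = {f.name: getattr(obj, f.name) for f in fields(obj) if f.name not in skip}
    # vars(type(obj)) only sees properties defined on the class itself,
    # not ones inherited from a base class.
    for name, attr in vars(type(obj)).items():
        if isinstance(attr, property) and name not in skip:
            result[name] = getattr(obj, name)
    return result


snap = SnapshotInfo(snapshot_id=1, data_rows=100, delete_rows=25, native_table=json)
record = to_dict(snap, skip={"native_table"})
print(record)
```

Skipping by name before the attribute is read means the unpicklable handle is never touched, while the computed values arrive alongside the stored ones.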

Clicking a version in the list now renders a detail panel on the right showing: - Version number, timestamp, operation, file count, record count, total size - Commit summary key-value pairs from the Delta log - Scrollable data files table (path, size, records) - Delete files table (shown only when present) Previously setSelectedId was called on click but no detail view existed — the right side always showed "Select a version to inspect" regardless of selection. All data comes from the already-loaded versions list (no extra API call needed).

… loader Backend: - GET /api/iceberg/catalogs — reads .pyiceberg.yaml and returns catalog names - POST /api/iceberg/catalog-tables — loads a catalog and enumerates all tables across all namespaces via list_namespaces() + list_tables() Frontend (table-loader.tsx): - Default mode is now "Catalog" (was "Metadata File") - Catalog field is a <select> dropdown populated from /api/iceberg/catalogs on mount; falls back to a free-text input if no .pyiceberg.yaml is found - "List Tables" button calls /api/iceberg/catalog-tables for the selected catalog and populates a table <select> dropdown; falls back to free-text input before listing or on error - Changing the catalog resets the table selection and clears errors

…ta sample - Delta: add POST /delta/schema endpoint returning field names/types/nullability - Delta version list: show total size and row count per version - Delta version detail: add schema section - Iceberg snapshot detail: add commit summary section - Both pages: restructure right panel with Details/Forensics/Data Sample tabs - Delta Forensics tab: move "Run Forensic Analysis" button inside the tab - Iceberg Forensics tab: show schema evolution history (lazy-loaded on tab open) - Both pages: add Data Sample tab using shared DataSample component - New shared component: web-ui/src/components/shared/data-sample.tsx

…ta version field) - IcebergTableLoader now accepts an optional Snapshot ID input in both catalog and metadata modes, mirroring the optional version input on the Delta loader - After loading, the page auto-selects the requested snapshot (or falls back to the current snapshot) without requiring the user to scroll through the list

…berg MOR support

- Add ComparisonPanel component and /api/gizmosql/compare endpoint for side-by-side query performance comparison between two Iceberg snapshots or Delta Lake versions
- Iceberg MOR fix: new iceberg_manifest_patch.py rewrites delete manifest avro files (via fastavro, handles any codec) to lowercase file_format 'PARQUET' → 'parquet', working around DuckDB iceberg_scan() case bug; metadata.json and manifest-list.avro are patched in a temp dir with posix paths (not file:// URIs) so DuckDB opens them correctly on Windows
- Delta comparison: delta_scan() has no version parameter; resolve active Parquet URIs at registration time via DeltaTable.file_uris() and rewrite queries to read_parquet([...])
- Fix _clean_file_path() and _read_bytes() to use Path.from_uri() (Python 3.13+) instead of naive 7-char strip, preventing /C:/... invalid paths on Windows
- Fix re.sub replacement strings containing Windows paths (\D, \U, etc.) by using lambda callbacks instead of literal replacement strings
- Add _supplement_metrics() to fill zero-valued EXPLAIN ANALYZE metrics from file stats captured at registration time; derive rows_returned from actual query results for COUNT(*) queries
- Compare tab dropdowns sorted by timestamp descending with human-readable labels in both Iceberg and Delta views
- Add fastavro>=1.9.0 to [web] optional dependencies
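The `re.sub` pitfall and the query rewrite to `read_parquet([...])` can be sketched together. This is an illustration under assumptions: `replace_table_ref()` here is a hypothetical function, the table name and file path are invented, and a plain word-boundary match will also hit the name inside string literals or comments.

```python
import re


def replace_table_ref(sql: str, table: str, file_uris: list[str]) -> str:
    """Swap a registered table name for a read_parquet([...]) call."""
    quoted = ", ".join("'" + uri.replace("'", "''") + "'" for uri in file_uris)
    scan_call = f"read_parquet([{quoted}])"
    pattern = re.compile(rf"\b{re.escape(table)}\b")
    # A callable replacement is inserted verbatim, so backslashes in Windows
    # paths are never interpreted as escapes such as \U or \D.
    return pattern.sub(lambda _match: scan_call, sql)


query = "SELECT COUNT(*) FROM orders_v3"
uris = [r"C:\Users\demo\part-0.parquet"]
print(replace_table_ref(query, "orders_v3", uris))

# The same text used as a literal replacement template is rejected.
try:
    re.sub(r"\borders_v3\b", rf"read_parquet(['{uris[0]}'])", query)
except re.error as exc:
    print("literal replacement failed:", exc)
```

Escaping the replacement string would also work, but a callable sidesteps template parsing entirely and keeps the path byte-for-byte intact.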

- Regenerate minified HTML bundle for web UI - Update build artifacts with latest Next.js compilation

- Upgrade pyiceberg from 0.9.1 to 0.11.0
- Upgrade deltalake from 0.22.0 to 1.4.2
- Upgrade pyarrow from 22.0.0 to 23.0.0
- Upgrade pandas from 2.3.0 to 3.0.1
- Upgrade adbc-driver-flightsql from 1.7.0 to 1.10.0
- Upgrade click from 8.1.0 to 8.3.1
- Upgrade pydantic from 2.11.0 to 2.12.5
- Upgrade rich from 13.0.0 to 14.3.3
- Upgrade sqlalchemy from 2.0.0 to 2.0.46
- Upgrade duckdb from 1.1.0 to 1.4.4
- Upgrade pip from 24.0 to 25.0
- Upgrade uv from 0.5.0 to 0.10.4
- Upgrade fastapi from 0.115.0 to 0.131.0
- Update uv.lock with resolved dependencies and platform-specific markers

…ove code formatting - Add fastavro dependency to pre-commit and mypy configuration - Exclude web/ directory from Hatchling VCS-based discovery to prevent duplicate entries - Implement _iceberg_snapshot_scan_stats() to extract scan metrics from PyIceberg snapshot summaries - Extend QueryPerformanceMetrics serialization with data/delete file and row counts - Add snapshot scan statistics to comparison results for both Iceberg snapshots and Delta versions - Improve code formatting and line wrapping in delta.py and gizmosql.py routers - Add type assertion in delta.py load_table() for type safety - Update comparison panel UI to display additional scan statistics - Refactor type checking in gizmosql.py to use modern union syntax (int | float | bool | str)

tests/unit/profiling/test_gizmo_duckdb_unit.py

…handling - Update type ignore comment from `assignment` to `method-assign` for PyArrowFileIO.parse_location staticmethod reassignment - Fix Windows file URI path handling in _clean_file_path to return normalized path without leading slash - Update test expectation to match corrected Windows path behavior (file:///C:/path → C:/path) - Improve test documentation to clarify Path.from_uri() Windows path normalization behavior

…d bundle - Add try-catch fallback in _clean_file_path() for non-absolute URIs - Handle both Windows (file:///C:/path) and Unix (file:///path) style URIs - Strip file:// prefix when Path.from_uri() fails on incompatible platforms - Rebuild Next.js production bundle with updated dependencies

src/tablesleuth/api/routers/parquet.py

src/tablesleuth/cli/web.py

src/tablesleuth/api/routers/config.py

Makefile

…tility - Extract _resolve_web_dir() function from main.py and web.py into new web_utils module - Add TOML string escaping in config router to handle special characters in catalog and GizmoSQL credentials - Optimize Parquet sample reading to use batch iterator instead of loading entire file - Fix Makefile clean target to properly handle egg-info directories as directories not globs - Eliminates code duplication and improves maintainability across API and CLI modules
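For the TOML escaping item, one plausible shape is sketched below. `escape_toml_string()` is a hypothetical helper and may differ from what the config router actually does; it follows the escape rules for TOML basic (double-quoted) strings.

```python
# Short escapes defined for TOML basic strings.
_TOML_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\b": "\\b",
    "\t": "\\t",
    "\n": "\\n",
    "\f": "\\f",
    "\r": "\\r",
}


def escape_toml_string(value: str) -> str:
    """Escape a value so it can be placed between double quotes in TOML."""
    out = []
    for ch in value:
        if ch in _TOML_ESCAPES:
            out.append(_TOML_ESCAPES[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            # Remaining control characters need a \uXXXX escape.
            out.append(f"\\u{ord(ch):04X}")
        else:
            out.append(ch)
    return "".join(out)


password = 'p@ss"w\\ord'
print(f'password = "{escape_toml_string(password)}"')
```

Without this step, a credential containing a quote or backslash would end the string early or produce an invalid escape, leaving a config file that no longer parses.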

- Rebrand "Table Sleuth" to "TableSleuth" for consistency across all docs - Add Node.js 20+ and npm to prerequisites for web UI development - Document new web UI development workflow with Makefile targets (dev-api, dev-web, build-release, start-web) - Update CHANGELOG with comprehensive v0.6.0 release notes including FastAPI backend, browser-based UI, GizmoSQL snapshot comparison, and Iceberg metadata patching - Add web UI development section to DEVELOPMENT_SETUP.md with step-by-step instructions - Update QUICKSTART.md with `tablesleuth web` command and optional web extras installation - Clarify build targets and add web-specific build instructions - Update all documentation files to reflect new web UI capabilities and architecture changes

- Reformat config.py with proper spacing and line breaks for readability - Fix docstring formatting in _escape_toml_string function - Improve line wrapping for long conditional expressions - Rebuild Next.js production bundle with updated styling - Update test file formatting for consistency

src/tablesleuth/api/routers/parquet.py

- Upgrade GizmoSQL CLI from v1.12.13 to v1.18.4 in EC2 user data script - Ensures latest GizmoSQL features and bug fixes are available in deployed instances

- Create new serialization.py module with to_dict() function for consistent object-to-dict conversion - Move _to_dict() implementations from delta.py, iceberg.py, and parquet.py routers to shared utility - Add support for skipping fields, including @property values, and safe integer conversion via parameters - Update all router modules to import and use to_dict() from serialization module - Add comprehensive unit tests for serialization functionality - Reduces code duplication across routers and improves maintainability

…into v0.6.0

- Remove trailing whitespace from iceberg router docstring - Add blank line after test docstring for consistency - Improve code formatting to match project standards

src/tablesleuth/services/profiling/gizmo_duckdb.py

…sable method - Remove register_file_table() method and associated file table tracking - Extract table reference replacement pattern logic into _replace_table_ref() static method - Consolidate duplicate regex pattern matching across iceberg, delta, and file table replacements - Simplify _replace_iceberg_tables() to use new _replace_table_ref() method - Remove file table stats collection from profiling metrics gathering - Add comprehensive unit tests for table reference replacement with various quoting styles - Reduces code duplication and improves maintainability of table substitution logic

- Convert single-quoted strings to double-quoted strings for consistency - Break long function arguments across multiple lines for readability - Reformat multi-line assertions to improve code clarity - Align with project's code style standards

src/tablesleuth/api/routers/gizmosql.py

src/tablesleuth/api/routers/parquet.py

src/tablesleuth/services/profiling/gizmo_duckdb.py

…ndling - Add storage_options parameter to register_delta_table_with_version for cloud backend support - Pass storage_options through delta table registration in compare_performance endpoint - Handle empty Parquet files gracefully with StopIteration exception handling - Improve scalar value conversion with try-except for various numeric types (int, float, Decimal, str) - Add comprehensive test coverage for empty Parquet files and delta storage options - Add unit tests for scalar conversion edge cases

src/tablesleuth/services/profiling/gizmo_duckdb.py

src/tablesleuth/api/routers/parquet.py

…ction - Add serialization logic to handle non-JSON-safe types (Decimal, bytes, dates, timestamps) in Parquet sample endpoint - Enhance row count detection in DuckDB profiler to distinguish COUNT aggregates from other functions (SUM, AVG, MAX) - Add sanity check to prevent misinterpreting large aggregate results as row counts - Add comprehensive test suite for Parquet serialization with complex data types - Update scalar conversion tests to reflect improved type handling
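The non-JSON-safe types listed in this commit can be handled with a small recursive converter. The sketch below is an assumption about the general approach, not the endpoint's code: `json_safe()` is a hypothetical name, and rendering Decimal as a string and bytes as base64 are illustrative choices.

```python
import base64
import datetime as dt
import json
from decimal import Decimal


def json_safe(value):
    """Convert values that json.dumps rejects into plain JSON types."""
    if isinstance(value, Decimal):
        return str(value)  # keeps exact digits; float() could lose precision
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, (dt.datetime, dt.date, dt.time)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(item) for item in value]
    return value


row = {"price": Decimal("19.99"), "raw": b"\x00\x01", "day": dt.date(2026, 2, 23)}
print(json.dumps(json_safe(row)))
```

Applying the converter to each sampled row before the response is built keeps a single odd column from turning the whole sample request into a server error.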

- Reformat table creation to single line for improved readability - Reduce unnecessary line breaks in pa.Table.from_pydict() call - Maintain consistency with codebase formatting standards

src/tablesleuth/api/routers/gizmosql.py

src/tablesleuth/services/profiling/gizmo_duckdb.py

- Enhance profile_single_column to detect and use registered Iceberg/Delta tables - Use _replace_iceberg_tables to generate appropriate scan calls for table references - Extract scan call from replaced query using regex pattern matching - Simplify _replace_table_ref to use unified word-boundary pattern for all identifier types - Add comprehensive test suite for Iceberg table profiling scenarios - Add test suite for pattern simplification logic - Improve handling of registered table references in profiling queries

src/tablesleuth/services/profiling/gizmo_duckdb.py

- Remove artificial 1B row limit in COUNT(*) result validation - Accept any non-negative count value to support large Iceberg and Delta Lake tables - Add explicit handling for negative counts (invalid, default to 1) - Update test expectations to accept large count values (5B, 10B rows) - Add comprehensive test coverage for counts over 1B, exactly 1B, and negative values - Improve comments to clarify heuristic prevents SUM/AVG misinterpretation, not size limits
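The validation rules listed here can be sketched as a small function. Everything in it is hypothetical: `rows_from_count_query()`, the regex, and the choice to return `None` for non-COUNT queries are assumptions made for illustration, while the non-negative rule and the fallback of 1 for negative values follow the commit text.

```python
from __future__ import annotations

import re

# Only a query whose first projection is COUNT(...) is treated as a row count.
_COUNT_QUERY = re.compile(r"^\s*select\s+count\s*\(", re.IGNORECASE)


def rows_from_count_query(sql: str, first_value: object) -> int | None:
    """Return the row count reported by a COUNT query, or None if not applicable."""
    if not _COUNT_QUERY.match(sql):
        return None  # SUM/AVG/MAX results must not be read as row counts
    if isinstance(first_value, bool) or not isinstance(first_value, int):
        return None
    if first_value < 0:
        return 1  # invalid count; fall back to a single result row
    return first_value  # no upper bound, large tables are legitimate


print(rows_from_count_query("SELECT COUNT(*) FROM events", 5_000_000_000))
print(rows_from_count_query("SELECT SUM(amount) FROM events", 10**12))
print(rows_from_count_query("select count(*) from events", -3))
```

Keying the decision on the query text rather than on the magnitude of the value is what allows counts above one billion through without letting a large SUM be mistaken for a count.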

- Use a fixed internal name for the Iceberg table in the profile_table function to ensure consistent profiling. - Implement checks to ensure the Iceberg extension is loaded before executing queries in the GizmoDuckDbProfiler. - Refactor Iceberg loading logic into a dedicated method for better maintainability and clarity. - Improve handling of Iceberg tables in profiling queries to enhance overall functionality.

src/tablesleuth/services/profiling/gizmo_duckdb.py

…acking - Introduce new attributes for tracking view paths, Iceberg tables, Delta tables, and their respective statistics. - Simplify checks for existing attributes by removing redundant hasattr calls. - Improve the handling of Iceberg scan statistics and Delta table stats for better profiling accuracy. - Refactor code for clarity and maintainability in the profiling logic.

…ling tests - Update tests to verify the presence of iceberg_scan in SQL queries more accurately. - Filter executed SQL calls to focus on profiling queries, ensuring at least three iceberg_scan queries are present. - Enhance comments for clarity regarding the execution order of INSTALL/LOAD commands in relation to profiling.

… checks - Streamline the assertion for the number of iceberg_scan queries in profiling tests. - Enhance readability by consolidating the assertion statement into a single line. - Maintain clarity in the test's intent regarding the expected number of queries.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


src/tablesleuth/api/routers/gizmosql.py

codecov-commenter · 2026-02-24T03:30:54Z


Codecov Report

❌ Patch coverage is 45.97250% with 550 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/tablesleuth/services/iceberg_manifest_patch.py	0.00%	154 Missing ⚠️
src/tablesleuth/api/routers/gizmosql.py	36.80%	103 Missing ⚠️
src/tablesleuth/api/routers/iceberg.py	44.44%	75 Missing ⚠️
src/tablesleuth/api/routers/delta.py	39.79%	59 Missing ⚠️
src/tablesleuth/api/routers/config.py	46.15%	56 Missing ⚠️
src/tablesleuth/services/profiling/gizmo_duckdb.py	72.05%	38 Missing ⚠️
src/tablesleuth/api/routers/parquet.py	72.61%	23 Missing ⚠️
src/tablesleuth/cli/web.py	43.75%	18 Missing ⚠️
...c/tablesleuth/services/iceberg_metadata_service.py	30.76%	18 Missing ⚠️
src/tablesleuth/api/main.py	87.50%	5 Missing ⚠️
... and 1 more


jamesbconner and others added 21 commits February 22, 2026 13:17

chore(web): Add package-lock.json and fix ci.yml branch triggers

6eb0baa


chore(web): Add dev-web-install-npm Makefile target

78a63a3

chore(web): Restore web placeholder after make clean

21f2574

chore(makefile): Replace POSIX-only commands with cross-platform Pyth…

e654245


chore(gitignore): Ignore auto-generated web-ui/next-env.d.ts

5dc5c5e

feat(web): Rebuild Next.js production bundle with updated dependencies

38d6760


jamesbconner self-assigned this Feb 23, 2026

cursor bot reviewed Feb 23, 2026


tests/unit/profiling/test_gizmo_duckdb_unit.py (resolved)

jamesbconner and others added 2 commits February 23, 2026 17:52

jamesbconner linked an issue Feb 23, 2026 that may be closed by this pull request

Restructure Comparison mechanism #44

Closed

jamesbconner added the enhancement New feature or request label Feb 23, 2026

cursor bot reviewed Feb 23, 2026


src/tablesleuth/api/routers/parquet.py (outdated, resolved)

src/tablesleuth/cli/web.py (outdated, resolved)

src/tablesleuth/api/routers/config.py (resolved)

Makefile (outdated, resolved)

jamesbconner and others added 2 commits February 23, 2026 18:16

cursor bot reviewed Feb 23, 2026


src/tablesleuth/api/routers/parquet.py (outdated, resolved)

jamesbconner added 4 commits February 23, 2026 18:37

chore(aws-cdk): Update GizmoSQL CLI version to v1.18.4

488ca35


Merge branch 'v0.6.0' of https://github.com/jamesbconner/TableSleuth …

4dbaa9f


style(api,tests): Fix whitespace formatting in docstrings

8c9e89d
