v0.6.0#69

Merged
jamesbconner merged 41 commits into main from v0.6.0 on Feb 24, 2026

@jamesbconner jamesbconner commented Feb 23, 2026

Note

High Risk
Large release that adds a new web server/API surface and changes profiling/query execution paths (Iceberg/Delta scanning and snapshot patching), increasing the chance of regressions and new security/performance issues.

Overview
Introduces an optional browser-based Web UI via a new tablesleuth web CLI command that runs a FastAPI app, serves a bundled Next.js static export, and exposes REST endpoints for Parquet/Iceberg/Delta analysis, configuration management, and GizmoSQL query + snapshot/version comparison.

Updates packaging and automation to build/bundle the frontend into the Python wheel (new [web] extras, Hatch force-include, CI/publish workflows adding Node build steps, .gitignore for web artifacts), plus cross-platform Makefile targets for web dev/release builds and Windows path/URI handling fixes for Iceberg metadata loading/scan.

Enhances performance comparison/profiling to support Delta version scans, adds MOR scan-stat breakdown fields to QueryPerformanceMetrics, and introduces an Iceberg metadata patcher to work around DuckDB delete-manifest issues; docs/changelog are updated and dependencies/version are bumped to 0.6.0.

Written by Cursor Bugbot for commit a3fb1cb. This will update automatically on new commits.

jamesbconner and others added 21 commits February 22, 2026 13:17
…ontend (v0.6.0)

- Bump version 0.5.3 → 0.6.0
- Add [web] optional deps: fastapi, uvicorn[standard], python-multipart
- New FastAPI app (src/tablesleuth/api/) with 5 routers: parquet, iceberg,
  delta, config, gizmosql — all reusing existing services, no new business logic
- New CLI command: tablesleuth web (--host, --port, --no-browser, --log-level)
  auto-discovered via cli/web.py convention
- Next.js 16 frontend (web-ui/) with static export; pages for all formats,
  Monaco SQL editor, shadcn/ui styling
- Makefile: dev-api, dev-web, build-web, build-release, start-web targets
- CI: add --extra web to uv sync; overhaul build job to run npm build before
  uv build so wheel contains the bundled frontend
- Publish: same frontend-first build order before uv build / PyPI push
- 21 API smoke tests in tests/api/ (gracefully skip without web extras)
- src/tablesleuth/web/index.html placeholder committed so hatchling
  force-include does not fail during uv sync in dev
- Commit web-ui/package-lock.json (required for npm ci in CI/CD)
- Remove v0.6.0 from ci.yml branch triggers (version branches merge via PR)
…on equivalents

clean, build-release, and zip now use uv run python instead of rm -rf,
cp -r, and find/grep/cut — works on Windows, macOS, and Ubuntu.
- Add overrides.minimatch: >=10.2.1 to fix GHSA-3ppc-4f35-3m26 (ReDoS)
  affecting ESLint plugins (dev tooling only, not shipped in build)
- Add autoprefixer as explicit devDependency (was previously implicit
  transitive; became required after minimatch override resolved packages)
…ssets

The /{full_path:path} catch-all was intercepting all requests — including
/_next/static/css and /_next/static/js bundles — before StaticFiles could
serve them. Browsers received index.html where they expected CSS/JS and
discarded it, producing an unstyled, non-interactive page.

Next.js static export with trailingSlash:true generates a dedicated
index.html per route, so StaticFiles(html=True) handles all routing
automatically without a custom fallback.
- iceberg_metadata_service: Convert absolute local paths to file:// URI
  before passing to StaticTable.from_metadata(). On Windows, drive letters
  like 'D:' are parsed as URI schemes by PyIceberg, causing 'Unrecognized
  filesystem type in URI: d'. Path.as_uri() produces file:///D:/path/...
  which PyIceberg handles correctly. S3/GCS/HDFS URIs are passed through.

- iceberg router: Replace dataclasses.asdict() with dataclasses.fields() +
  getattr() in _to_dict(). asdict() deep-copies all fields (including
  native_table) before the loop can skip it, triggering 'cannot pickle
  module object' on the PyIceberg Table object.
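
The `asdict()` failure described in this commit can be reproduced with the standard library alone. The sketch below is illustrative only: `TableInfo`, `to_dict`, and the `skip` parameter are hypothetical names, and a module object stands in for the PyIceberg `Table` handle because it fails deep copy in the same way.

```python
import dataclasses
import json as some_module  # stand-in for a handle that cannot be deep-copied


@dataclasses.dataclass
class TableInfo:
    """Hypothetical stand-in for a service result that carries a native handle."""

    name: str
    snapshot_count: int
    native_table: object  # e.g. a PyIceberg Table in the real router


def to_dict(obj, skip=("native_table",)):
    """Shallow field-by-field conversion that never touches skipped fields."""
    return {
        f.name: getattr(obj, f.name)
        for f in dataclasses.fields(obj)
        if f.name not in skip
    }


info = TableInfo(name="db.events", snapshot_count=3, native_table=some_module)

try:
    dataclasses.asdict(info)  # deep-copies every field, including native_table
    asdict_failed = False
except TypeError:
    asdict_failed = True

result = to_dict(info)
```

Because `fields()` only yields field descriptors, the skipped attribute is never read or copied, which is why the filter works here but not when applied to the output of `asdict()`.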
…Windows

On Windows, Python's urlparse('file:///D:/path/file.json').path returns
'/D:/path/file.json' (with a leading slash before the drive letter).
PyArrow's LocalFileSystem rejects this path with WinError 123 (invalid
filename syntax).

The patch intercepts parse_location on Windows and strips the spurious
leading slash from drive-letter paths (/D:/path → D:/path). This fix
applies to all PyArrow file I/O operations — metadata file read, manifest
list, individual manifests, and data files — because all go through the
same PyArrowFileIO path.
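
The two Windows path problems described in this commit and the previous one can be traced with the standard library. This is a minimal sketch, not the project's patch: `strip_drive_slash` and the example path are assumptions, and `PureWindowsPath` is used so the conversion behaves the same on any OS.

```python
import re
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# A bare Windows path looks like a URI whose scheme is the drive letter.
bare_scheme = urlparse("D:/warehouse/metadata/v1.metadata.json").scheme

# Converting to a file URI first removes that ambiguity.
uri = PureWindowsPath("D:/warehouse/metadata/v1.metadata.json").as_uri()

# Parsing the file URI leaves a leading slash in front of the drive letter.
parsed_path = urlparse(uri).path

_DRIVE_PREFIX = re.compile(r"^/[A-Za-z]:")


def strip_drive_slash(path: str) -> str:
    """Turn '/D:/x' into 'D:/x'; leave POSIX paths untouched."""
    return path[1:] if _DRIVE_PREFIX.match(path) else path


local_path = strip_drive_slash(parsed_path)
```

The drive-letter check keeps the fix narrow: an ordinary POSIX path such as `/tmp/x.json` passes through unchanged.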
…aml upload, track web-ui/src/lib/

- Fix `PyArrowFileIO.parse_location` monkey-patch: remove erroneous `.__func__`
  access (staticmethod accessed via class IS the raw function); confirmed working
  by direct test — `file:///D:/path` now resolves to `D:/path` correctly on Windows
- Add `POST /config/pyiceberg/upload` endpoint for uploading .pyiceberg.yaml files
- Add pyiceberg.yaml upload UI to Settings page with file input, status feedback,
  and live preview refresh after upload
- Add `uploadPyiceberg(file)` to API client using multipart FormData
- Fix root .gitignore: anchor `lib/` and `lib64/` with leading `/` so they only
  match repo-root Python venv dirs, not `web-ui/src/lib/`
- Track web-ui/src/lib/{api,types,utils}.ts (were excluded by gitignore bug)
- Update web-ui/tsconfig.json: Next.js 16 auto-updates (jsx: react-jsx, target: ES2017)
…ent JS precision loss

Iceberg snapshot IDs are Java long (int64). JavaScript's float64 can only
represent integers exactly up to 2^53 - 1 (Number.MAX_SAFE_INTEGER). Snapshot
IDs from AWS Glue Catalog routinely exceed this, causing silent rounding:
  real:    3994147258430532637
  JS sees: 3994147258430532608  ← wrong ID sent back to API → 404 Not Found

Fix: in _to_dict(), any integer with abs(value) > 2^53 is serialized as a JSON
string instead of a number. Python's path-param parser and int() conversion
handle these string IDs correctly since Python int is arbitrary precision.

- iceberg.py: add _JS_MAX_SAFE_INT constant; emit large ints as strings in _to_dict()
- iceberg.py: CompareRequest snapshot IDs accept str | int; cast to int() before service call
- types.ts: IcebergSnapshotInfo.snapshot_id / parent_snapshot_id → string
- types.ts: IcebergTableInfo.current_snapshot_id → string | null
- api.ts: snapshotDetails and compare accept string snapshot IDs
- iceberg/page.tsx, snapshot-list.tsx: selectedId and onSelect use string
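
A rough illustration of the precision problem and the string workaround, using only the standard library. The names `JS_MAX_SAFE_INT` and `js_safe` are hypothetical, and the threshold here follows `Number.MAX_SAFE_INTEGER`; the router's actual constant and comparison may differ.

```python
import json

JS_MAX_SAFE_INT = 2**53 - 1  # Number.MAX_SAFE_INTEGER in JavaScript


def js_safe(value):
    """Return large ints as strings so a float64-based JSON parser keeps every digit."""
    if (
        isinstance(value, int)
        and not isinstance(value, bool)
        and abs(value) > JS_MAX_SAFE_INT
    ):
        return str(value)
    return value


snapshot_id = 3994147258430532637  # example ID from the commit message

# A float64 round trip, which is what JavaScript's Number does to this value.
lossy = int(float(snapshot_id))

payload = json.dumps({"snapshot_id": js_safe(snapshot_id), "file_count": js_safe(12)})

# The API side can still turn the string back into an exact int.
restored = int(json.loads(payload)["snapshot_id"])
```

Small integers stay numeric, so only fields that can actually overflow change type on the TypeScript side.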

dataclasses.fields() only returns declared fields, not @property methods.
IcebergSnapshotInfo.delete_ratio, read_amplification, and has_deletes are
computed properties, so they were silently absent from the API response, causing
'Cannot read properties of undefined (reading toFixed)' in the snapshot detail
view. This was masked previously by the int64 precision bug (404 errors before
the component rendered); the fix for that exposed this underlying issue.

Fix: after serializing dataclass fields, also iterate vars(type(obj)) to pick
up any @property descriptors and include their values in the dict.
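
The fix can be sketched as follows. `SnapshotInfo` is a hypothetical, trimmed-down stand-in for `IcebergSnapshotInfo`, and the property formulas are assumptions made only so the example runs.

```python
import dataclasses


@dataclasses.dataclass
class SnapshotInfo:
    """Hypothetical, trimmed-down stand-in for IcebergSnapshotInfo."""

    data_records: int
    delete_records: int

    @property
    def has_deletes(self) -> bool:
        return self.delete_records > 0

    @property
    def delete_ratio(self) -> float:
        total = self.data_records + self.delete_records
        return self.delete_records / total if total else 0.0


def to_dict(obj) -> dict:
    """Serialize declared fields first, then any @property defined on the class."""
    out = {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}
    for name, attr in vars(type(obj)).items():
        if isinstance(attr, property):
            out[name] = getattr(obj, name)
    return out


row = to_dict(SnapshotInfo(data_records=75, delete_records=25))
```

Note that `vars(type(obj))` only sees properties defined directly on the class; properties inherited from a base class would need a walk over the MRO.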
Clicking a version in the list now renders a detail panel on the right showing:
- Version number, timestamp, operation, file count, record count, total size
- Commit summary key-value pairs from the Delta log
- Scrollable data files table (path, size, records)
- Delete files table (shown only when present)

Previously setSelectedId was called on click but no detail view existed — the
right side always showed "Select a version to inspect" regardless of selection.
All data comes from the already-loaded versions list (no extra API call needed).
… loader

Backend:
- GET /api/iceberg/catalogs  — reads .pyiceberg.yaml and returns catalog names
- POST /api/iceberg/catalog-tables — loads a catalog and enumerates all tables
  across all namespaces via list_namespaces() + list_tables()

Frontend (table-loader.tsx):
- Default mode is now "Catalog" (was "Metadata File")
- Catalog field is a <select> dropdown populated from /api/iceberg/catalogs on
  mount; falls back to a free-text input if no .pyiceberg.yaml is found
- "List Tables" button calls /api/iceberg/catalog-tables for the selected catalog
  and populates a table <select> dropdown; falls back to free-text input before
  listing or on error
- Changing the catalog resets the table selection and clears errors
…ta sample

- Delta: add POST /delta/schema endpoint returning field names/types/nullability
- Delta version list: show total size and row count per version
- Delta version detail: add schema section
- Iceberg snapshot detail: add commit summary section
- Both pages: restructure right panel with Details/Forensics/Data Sample tabs
- Delta Forensics tab: move "Run Forensic Analysis" button inside the tab
- Iceberg Forensics tab: show schema evolution history (lazy-loaded on tab open)
- Both pages: add Data Sample tab using shared DataSample component
- New shared component: web-ui/src/components/shared/data-sample.tsx
…ta version field)

- IcebergTableLoader now accepts an optional Snapshot ID input in both catalog and
  metadata modes, mirroring the optional version input on the Delta loader
- After loading, the page auto-selects the requested snapshot (or falls back to the
  current snapshot) without requiring the user to scroll through the list
…berg MOR support

- Add ComparisonPanel component and /api/gizmosql/compare endpoint for
  side-by-side query performance comparison between two Iceberg snapshots
  or Delta Lake versions

- Iceberg MOR fix: new iceberg_manifest_patch.py rewrites delete manifest
  avro files (via fastavro, handles any codec) to lowercase file_format
  'PARQUET' → 'parquet', working around DuckDB iceberg_scan() case bug;
  metadata.json and manifest-list.avro are patched in a temp dir with
  posix paths (not file:// URIs) so DuckDB opens them correctly on Windows

- Delta comparison: delta_scan() has no version parameter; resolve active
  Parquet URIs at registration time via DeltaTable.file_uris() and rewrite
  queries to read_parquet([...])

- Fix _clean_file_path() and _read_bytes() to use Path.from_uri() (Python
  3.13+) instead of naive 7-char strip, preventing /C:/... invalid paths
  on Windows

- Fix re.sub replacement strings containing Windows paths (\D, \U, etc.)
  by using lambda callbacks instead of literal replacement strings

- Add _supplement_metrics() to fill zero-valued EXPLAIN ANALYZE metrics
  from file stats captured at registration time; derive rows_returned from
  actual query results for COUNT(*) queries

- Compare tab dropdowns sorted by timestamp descending with human-readable
  labels in both Iceberg and Delta views

- Add fastavro>=1.9.0 to [web] optional dependencies
- Regenerate minified HTML bundle for web UI
- Update build artifacts with latest Next.js compilation
- Upgrade pyiceberg from 0.9.1 to 0.11.0
- Upgrade deltalake from 0.22.0 to 1.4.2
- Upgrade pyarrow from 22.0.0 to 23.0.0
- Upgrade pandas from 2.3.0 to 3.0.1
- Upgrade adbc-driver-flightsql from 1.7.0 to 1.10.0
- Upgrade click from 8.1.0 to 8.3.1
- Upgrade pydantic from 2.11.0 to 2.12.5
- Upgrade rich from 13.0.0 to 14.3.3
- Upgrade sqlalchemy from 2.0.0 to 2.0.46
- Upgrade duckdb from 1.1.0 to 1.4.4
- Upgrade pip from 24.0 to 25.0
- Upgrade uv from 0.5.0 to 0.10.4
- Upgrade fastapi from 0.115.0 to 0.131.0
- Update uv.lock with resolved dependencies and platform-specific markers
…ove code formatting

- Add fastavro dependency to pre-commit and mypy configuration
- Exclude web/ directory from Hatchling VCS-based discovery to prevent duplicate entries
- Implement _iceberg_snapshot_scan_stats() to extract scan metrics from PyIceberg snapshot summaries
- Extend QueryPerformanceMetrics serialization with data/delete file and row counts
- Add snapshot scan statistics to comparison results for both Iceberg snapshots and Delta versions
- Improve code formatting and line wrapping in delta.py and gizmosql.py routers
- Add type assertion in delta.py load_table() for type safety
- Update comparison panel UI to display additional scan statistics
- Refactor type checking in gizmosql.py to use modern union syntax (int | float | bool | str)
@jamesbconner jamesbconner self-assigned this Feb 23, 2026
jamesbconner and others added 2 commits February 23, 2026 17:52
…handling

- Update type ignore comment from `assignment` to `method-assign` for PyArrowFileIO.parse_location staticmethod reassignment
- Fix Windows file URI path handling in _clean_file_path to return normalized path without leading slash
- Update test expectation to match corrected Windows path behavior (file:///C:/path → C:/path)
- Improve test documentation to clarify Path.from_uri() Windows path normalization behavior
…d bundle

- Add try/except fallback in _clean_file_path() for non-absolute URIs
- Handle both Windows (file:///C:/path) and Unix (file:///path) style URIs
- Strip file:// prefix when Path.from_uri() fails on incompatible platforms
- Rebuild Next.js production bundle with updated dependencies
@jamesbconner jamesbconner linked an issue Feb 23, 2026 that may be closed by this pull request
@jamesbconner jamesbconner added the enhancement New feature or request label Feb 23, 2026
jamesbconner and others added 2 commits February 23, 2026 18:16
…tility

- Extract _resolve_web_dir() function from main.py and web.py into new web_utils module
- Add TOML string escaping in config router to handle special characters in catalog and GizmoSQL credentials
- Optimize Parquet sample reading to use batch iterator instead of loading entire file
- Fix Makefile clean target to properly handle egg-info directories as directories not globs
- Eliminates code duplication and improves maintainability across API and CLI modules
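
For the TOML escaping item above, one possible shape of such a helper is sketched here. The function name `escape_toml_string` and the sample credential are assumptions; the escape table follows the TOML rules for basic (double-quoted) strings.

```python
def escape_toml_string(value: str) -> str:
    """Escape a value for use inside a TOML basic (double-quoted) string."""
    replacements = {
        "\\": "\\\\",
        '"': '\\"',
        "\b": "\\b",
        "\t": "\\t",
        "\n": "\\n",
        "\f": "\\f",
        "\r": "\\r",
    }
    out = []
    for ch in value:
        if ch in replacements:
            out.append(replacements[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"\\u{ord(ch):04X}")  # remaining control characters
        else:
            out.append(ch)
    return "".join(out)


password = 'p"ss\\word\n'
line = f'password = "{escape_toml_string(password)}"'
```

On Python 3.11+, parsing `line` with `tomllib.loads` and comparing the result to `password` is a quick way to check the round trip.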
- Rebrand "Table Sleuth" to "TableSleuth" for consistency across all docs
- Add Node.js 20+ and npm to prerequisites for web UI development
- Document new web UI development workflow with Makefile targets (dev-api, dev-web, build-release, start-web)
- Update CHANGELOG with comprehensive v0.6.0 release notes including FastAPI backend, browser-based UI, GizmoSQL snapshot comparison, and Iceberg metadata patching
- Add web UI development section to DEVELOPMENT_SETUP.md with step-by-step instructions
- Update QUICKSTART.md with `tablesleuth web` command and optional web extras installation
- Clarify build targets and add web-specific build instructions
- Update all documentation files to reflect new web UI capabilities and architecture changes
- Reformat config.py with proper spacing and line breaks for readability
- Fix docstring formatting in _escape_toml_string function
- Improve line wrapping for long conditional expressions
- Rebuild Next.js production bundle with updated styling
- Update test file formatting for consistency
- Upgrade GizmoSQL CLI from v1.12.13 to v1.18.4 in EC2 user data script
- Ensures latest GizmoSQL features and bug fixes are available in deployed instances
- Create new serialization.py module with to_dict() function for consistent object-to-dict conversion
- Move _to_dict() implementations from delta.py, iceberg.py, and parquet.py routers to shared utility
- Add support for skipping fields, including @property values, and safe integer conversion via parameters
- Update all router modules to import and use to_dict() from serialization module
- Add comprehensive unit tests for serialization functionality
- Reduces code duplication across routers and improves maintainability
- Remove trailing whitespace from iceberg router docstring
- Add blank line after test docstring for consistency
- Improve code formatting to match project standards
…sable method

- Remove register_file_table() method and associated file table tracking
- Extract table reference replacement pattern logic into _replace_table_ref() static method
- Consolidate duplicate regex pattern matching across iceberg, delta, and file table replacements
- Simplify _replace_iceberg_tables() to use new _replace_table_ref() method
- Remove file table stats collection from profiling metrics gathering
- Add comprehensive unit tests for table reference replacement with various quoting styles
- Reduces code duplication and improves maintainability of table substitution logic
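
A consolidated table-reference replacement might look roughly like the sketch below. It is not the project's `_replace_table_ref()`: the function name, the pattern, and the choice to leave dotted (schema-qualified) names alone are all assumptions for illustration.

```python
import re


def replace_table_ref(sql: str, table: str, scan_call: str) -> str:
    """Swap a bare, double-quoted, or backtick-quoted table name for a scan call."""
    name = re.escape(table)
    pattern = re.compile(
        # Lookarounds reject matches embedded in longer identifiers
        # (orders_archive) or dotted names (db.orders).
        rf'(?<![\w."`])(?:"{name}"|`{name}`|{name})(?![\w."`])',
        re.IGNORECASE,
    )
    # Callable replacement: the scan call may contain backslashes (Windows paths).
    return pattern.sub(lambda _m: scan_call, sql)


scan = "iceberg_scan('/tmp/patched/metadata.json')"
bare = replace_table_ref("SELECT * FROM orders WHERE id > 1", "orders", scan)
quoted = replace_table_ref('SELECT * FROM "orders"', "orders", scan)
ticked = replace_table_ref("SELECT * FROM `orders`", "orders", scan)
untouched = replace_table_ref("SELECT * FROM orders_archive", "orders", scan)
```

A regex of this kind does not understand SQL, so a table name appearing inside a string literal or comment would also be rewritten; a real implementation may need to guard against that.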
- Convert single-quoted strings to double-quoted strings for consistency
- Break long function arguments across multiple lines for readability
- Reformat multi-line assertions to improve code clarity
- Align with project's code style standards
…ndling

- Add storage_options parameter to register_delta_table_with_version for cloud backend support
- Pass storage_options through delta table registration in compare_performance endpoint
- Handle empty Parquet files gracefully with StopIteration exception handling
- Improve scalar value conversion with try-except for various numeric types (int, float, Decimal, str)
- Add comprehensive test coverage for empty Parquet files and delta storage options
- Add unit tests for scalar conversion edge cases
…ction

- Add serialization logic to handle non-JSON-safe types (Decimal, bytes, dates, timestamps) in Parquet sample endpoint
- Enhance row count detection in DuckDB profiler to distinguish COUNT aggregates from other functions (SUM, AVG, MAX)
- Add sanity check to prevent misinterpreting large aggregate results as row counts
- Add comprehensive test suite for Parquet serialization with complex data types
- Update scalar conversion tests to reflect improved type handling
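
One way to make sample rows JSON-safe is sketched below with the standard library. The helper name `json_safe` and the chosen encodings (Decimal as string, bytes as base64, dates as ISO 8601) are assumptions; the endpoint may encode these types differently.

```python
import base64
import datetime as dt
import json
from decimal import Decimal


def json_safe(value):
    """Convert one cell from a data sample into something json.dumps accepts."""
    if isinstance(value, Decimal):
        return str(value)  # keep exact digits rather than rounding to float
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, (dt.datetime, dt.date, dt.time)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(k): json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(v) for v in value]
    return value


row = {
    "price": Decimal("19.990"),
    "payload": b"\x00\x01",
    "day": dt.date(2026, 2, 24),
    "ts": dt.datetime(2026, 2, 24, 13, 17, 0),
    "tags": ["a", "b"],
}
encoded = json.dumps(json_safe(row))
```

Converting up front, rather than passing a `default=` hook to `json.dumps`, keeps the conversion usable with any JSON encoder the web framework chooses.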
- Reformat table creation to single line for improved readability
- Reduce unnecessary line breaks in pa.Table.from_pydict() call
- Maintain consistency with codebase formatting standards
@jamesbconner jamesbconner linked an issue Feb 24, 2026 that may be closed by this pull request
- Enhance profile_single_column to detect and use registered Iceberg/Delta tables
- Use _replace_iceberg_tables to generate appropriate scan calls for table references
- Extract scan call from replaced query using regex pattern matching
- Simplify _replace_table_ref to use unified word-boundary pattern for all identifier types
- Add comprehensive test suite for Iceberg table profiling scenarios
- Add test suite for pattern simplification logic
- Improve handling of registered table references in profiling queries
- Remove artificial 1B row limit in COUNT(*) result validation
- Accept any non-negative count value to support large Iceberg and Delta Lake tables
- Add explicit handling for negative counts (invalid, default to 1)
- Update test expectations to accept large count values (5B, 10B rows)
- Add comprehensive test coverage for counts over 1B, exactly 1B, and negative values
- Improve comments to clarify heuristic prevents SUM/AVG misinterpretation, not size limits
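
The COUNT heuristic described in this commit might be sketched as follows. Everything here is hypothetical (the function name, the regex, and the fallback of 1); it only illustrates the stated rules: treat COUNT results as row counts, accept any non-negative value, and default to 1 otherwise.

```python
import re

_COUNT_ONLY = re.compile(r"^\s*SELECT\s+COUNT\s*\(", re.IGNORECASE)


def rows_from_result(sql: str, first_value) -> int:
    """Guess how many rows a single-row aggregate query counted.

    Only a leading COUNT(...) is treated as a row count; SUM/AVG/MAX results
    are values, not counts, so they fall back to 1 (one result row).
    """
    if not _COUNT_ONLY.match(sql):
        return 1
    if isinstance(first_value, bool) or not isinstance(first_value, int):
        return 1
    if first_value < 0:
        return 1  # a negative count is invalid
    return first_value  # no upper bound: multi-billion-row tables are valid


big = rows_from_result("SELECT COUNT(*) FROM t", 5_000_000_000)
summed = rows_from_result("SELECT SUM(x) FROM t", 5_000_000_000)
negative = rows_from_result("SELECT COUNT(*) FROM t", -3)
```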
- Use a fixed internal name for the Iceberg table in the profile_table function to ensure consistent profiling.
- Implement checks to ensure the Iceberg extension is loaded before executing queries in the GizmoDuckDbProfiler.
- Refactor Iceberg loading logic into a dedicated method for better maintainability and clarity.
- Improve handling of Iceberg tables in profiling queries to enhance overall functionality.
…acking

- Introduce new attributes for tracking view paths, Iceberg tables, Delta tables, and their respective statistics.
- Simplify checks for existing attributes by removing redundant hasattr calls.
- Improve the handling of Iceberg scan statistics and Delta table stats for better profiling accuracy.
- Refactor code for clarity and maintainability in the profiling logic.
…ling tests

- Update tests to verify the presence of iceberg_scan in SQL queries more accurately.
- Filter executed SQL calls to focus on profiling queries, ensuring at least three iceberg_scan queries are present.
- Enhance comments for clarity regarding the execution order of INSTALL/LOAD commands in relation to profiling.
… checks

- Streamline the assertion for the number of iceberg_scan queries in profiling tests.
- Enhance readability by consolidating the assertion statement into a single line.
- Maintain clarity in the test's intent regarding the expected number of queries.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@codecov-commenter

@jamesbconner jamesbconner merged commit 08dd3eb into main Feb 24, 2026
8 checks passed

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues:

- Update GizmoSQL/DuckDB dependency
- Restructure Comparison mechanism

2 participants