
[DOM-75543] feat: dataset upload, preview, and schema endpoints #26

Open
ddl-subir-m wants to merge 19 commits into main from subir/pr16b-upload-preview

Conversation

Collaborator

@ddl-subir-m ddl-subir-m commented Mar 23, 2026

Why

Adds the ability to upload files to Domino datasets and inspect data content (preview rows, column schema). These are the mutation and inspection endpoints that build on the viewing layer in PR #25.

Summary

New endpoints:

  • POST /upload: reads file into memory, uploads to project's automl-extension dataset via v4 chunked API, polls snapshot until active, returns metadata + mount path
  • GET /{id}/preview: paginated data preview (CSV/Parquet) via read_tabular_preview
  • GET /{id}/schema: column names and dtypes via read_tabular_schema
  • POST /preview: preview by file path with pagination support
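The pagination contract shared by the preview endpoints can be sketched in plain Python. This is a minimal sketch: the field names (`page`, `page_size`, `total_rows`) are assumptions for illustration, not confirmed from the PR diff.

```python
from typing import Any

def paginate_rows(rows: list[dict[str, Any]], page: int = 1,
                  page_size: int = 50) -> dict[str, Any]:
    """Return one page of preview rows plus pagination metadata.

    Slicing past the end of the list yields an empty page rather
    than an error, so clients can page until `rows` comes back empty.
    """
    total = len(rows)
    start = (page - 1) * page_size
    return {
        "rows": rows[start:start + page_size],
        "page": page,
        "page_size": page_size,
        "total_rows": total,
    }
```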

New storage_resolver methods (consumed by the upload route):

  • upload_file, _upload_chunks — chunked upload with retries
  • ensure_dataset_exists, _resolve_or_create, _create_dataset — dataset provisioning
  • delete_snapshot_files, _get_latest_snapshot_id — snapshot file management
  • _dataset_rw_write_request — dual-host write helper
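What "chunked upload with retries" might look like, as a hedged sketch: the `send_chunk` callback, chunk size, and exponential-backoff policy are all assumptions here; the real `_upload_chunks` talks to the Domino v4 chunked API.

```python
import time
from typing import Callable

def upload_chunks(data: bytes, send_chunk: Callable[[int, bytes], None],
                  chunk_size: int = 8 * 1024 * 1024,
                  max_retries: int = 3, backoff: float = 0.5) -> int:
    """Split data into fixed-size chunks and send each one, retrying
    transient failures with exponential backoff. Returns the chunk count."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    for index, chunk in enumerate(chunks):
        for attempt in range(max_retries):
            try:
                send_chunk(index, chunk)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the final retry
                time.sleep(backoff * 2 ** attempt)
    return len(chunks)
```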

New dataset_service functions:

  • preview_dataset_response, get_dataset_schema_response, preview_file_response — preview/schema orchestration
  • save_uploaded_file — standalone fallback (no Domino project)
  • build_compat_dataset_preview_payload, coerce_preview_response — compat route support
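The standalone fallback could be as small as the following sketch of a `save_uploaded_file`-style helper; the directory layout and name handling are assumptions, not the PR's actual implementation.

```python
import os
import uuid

def save_uploaded_file(filename: str, data: bytes, upload_dir: str) -> str:
    """Fallback when no Domino project is available: write the bytes
    under upload_dir and return the destination path."""
    os.makedirs(upload_dir, exist_ok=True)
    # basename() strips any client-supplied directory components;
    # fall back to a generated name if the filename is empty.
    safe_name = os.path.basename(filename) or f"upload-{uuid.uuid4().hex}"
    dest = os.path.join(upload_dir, safe_name)
    with open(dest, "wb") as f:
        f.write(data)
    return dest
```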

Stacks on PR #25 (dataset viewing).

File → consumer mapping

All new code is consumed within this PR:

| Addition | Consumed by |
| --- | --- |
| storage_resolver.upload_file | routes/datasets.py upload endpoint |
| storage_resolver.ensure_dataset_exists | routes/datasets.py upload endpoint |
| dataset_service.preview_* functions | routes/datasets.py preview/schema endpoints |
| dataset_service.save_uploaded_file | routes/datasets.py upload fallback |

Test plan

  • Upload endpoint returns file metadata + dataset_id + snapshot_verified
  • Preview endpoint returns paginated rows for CSV and Parquet
  • Schema endpoint returns column names and dtypes
  • test_dataset_service.py passes
  • test_storage_resolver.py passes (upload/create/delete tests)

@ddl-subir-m ddl-subir-m requested a review from a team March 23, 2026 14:17
@ddl-subir-m ddl-subir-m requested a review from niole March 23, 2026 14:36
@ddl-subir-m ddl-subir-m changed the title from "feat: dataset upload, preview, and schema endpoints" to "[DOM-75543] feat: dataset upload, preview, and schema endpoints" Mar 23, 2026
ddl-subir-m added a commit that referenced this pull request Mar 23, 2026
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m added a commit that referenced this pull request Mar 24, 2026
…oad endpoint

Merge conflict between PR #25 (removed download) and PR #26 (added
upload/preview/schema). Resolution: keep all PR #26 endpoints, drop the
download_dataset_file endpoint which has no frontend caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m and others added 14 commits March 23, 2026 23:57
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
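The mtime-keyed LRU caching the tabular_data commit describes can be illustrated with a CSV-only sketch (the real module also handles Parquet; function and parameter names here are assumptions):

```python
import csv
import functools
import os

@functools.lru_cache(maxsize=32)
def _read_csv_preview(path: str, mtime: float, limit: int):
    """Cached read. Including mtime in the cache key means a rewritten
    file naturally misses the cache, so entries never go stale."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return tuple(tuple(row) for _, row in zip(range(limit), reader))

def read_tabular_preview(path: str, limit: int = 100) -> list[list[str]]:
    """Public wrapper: derive the cache key from the file's current mtime."""
    rows = _read_csv_preview(path, os.path.getmtime(path), limit)
    return [list(r) for r in rows]
```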
…leware

- domino_dataset_api.py: typed listing wrapper (v2→v1 fallback) using generated client
- domino_http.py: add params/files/headers/base_url to domino_request, streaming download
- middleware.py: debug request/response logging with header redaction
- project_resolver.py: use generated get_project_by_id endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction
  from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
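The precedence described above (X-Project-Id header, then query param, then DOMINO_PROJECT_ID) might be sketched as follows. Since a later commit in this PR drops the env-var fallback as unsafe for cross-project use, the sketch gates it behind a flag that defaults off; the signature is an assumption.

```python
import os
from typing import Mapping, Optional

def resolve_request_project_id(headers: Mapping[str, str],
                               query_params: Mapping[str, str],
                               use_env_fallback: bool = False) -> Optional[str]:
    """Resolve the target project for a request: header wins over
    query param; the env var is only consulted when explicitly enabled,
    because it names the App's own project, not the user's target."""
    project_id = headers.get("X-Project-Id") or query_params.get("project_id")
    if project_id:
        return project_id
    if use_env_fallback:
        return os.environ.get("DOMINO_PROJECT_ID")
    return None
```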
Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing,
  snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback,
  cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot,
  and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user
is working in. Falling back to it silently operates on the wrong project
(root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires
or invalidates. Stale data is worse than a redundant API call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable
on any target deployment. Ryan's PR uses v2 only and it works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses
/svcdatasetpreview which returns JSON rows, not a raw file download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the
schema minimal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the remaining dataset endpoints on top of the viewing layer:
- POST /upload: chunked upload to Domino dataset with snapshot verification
- GET /verify-snapshot: poll snapshot status (upload completion check)
- GET /{id}/preview: paginated data preview (CSV/Parquet)
- GET /{id}/schema: column names and dtypes
- POST /preview: preview by file path with pagination

dataset_service.py: upload orchestration, preview pagination,
schema resolution, compat payload builders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esolver

Extends storage_resolver with methods consumed by the upload and
snapshot verification endpoints:
- ensure_dataset_exists, _resolve_or_create, _create_dataset
- upload_file, _upload_chunks
- delete_snapshot_files, _get_latest_snapshot_id
- _dataset_rw_write_request, _format_http_error
- get_dataset_info, delete_dataset

Download and provisioning methods (download_file, download_directory,
ensure_project_storage, resolve_project_paths, check_project_storage,
invalidate) will be added in PR #22 where they are consumed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add max_upload_size_mb setting (env: AUTOML_MAX_UPLOAD_SIZE_MB,
default 550) and enforce it in the upload route before processing.
Returns HTTP 413 with a clear message when exceeded.

Validation (filename, extension, size) is now done once before
branching into the Domino vs standalone upload paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
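The pre-branch size check could look like this sketch. Only the setting name and the 550 MB default come from the commit; `UploadTooLarge` is a hypothetical stand-in for the framework's HTTP 413 error type.

```python
class UploadTooLarge(Exception):
    """Stand-in for an HTTP 413 Payload Too Large error."""
    status_code = 413

def validate_upload_size(size_bytes: int, max_upload_size_mb: int = 550) -> None:
    """Reject oversized uploads before any processing happens."""
    limit = max_upload_size_mb * 1024 * 1024
    if size_bytes > limit:
        raise UploadTooLarge(
            f"Upload is {size_bytes} bytes; limit is {max_upload_size_mb} MB")
```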
Add cleanup_dataset_cache() to utils.py — walks the dataset_cache
directory and removes files older than the configured TTL. Empty
directories are pruned afterward.

Configurable via AUTOML_DATASET_CACHE_TTL_HOURS (default: 24).
Called at app startup in main.py (PR #22).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
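A sketch of the TTL walk this commit describes, assuming mtime is the age signal (the exact cache layout and return value are assumptions):

```python
import os
import time

def cleanup_dataset_cache(cache_dir: str, ttl_hours: float = 24.0) -> int:
    """Remove cache files older than ttl_hours, then prune directories
    left empty. Returns the number of files removed. Walking bottom-up
    (topdown=False) lets us delete empty subdirectories as we go."""
    cutoff = time.time() - ttl_hours * 3600
    removed = 0
    for root, _dirs, files in os.walk(cache_dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
        if root != cache_dir and not os.listdir(root):
            os.rmdir(root)
    return removed
```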
@ddl-subir-m ddl-subir-m force-pushed the subir/pr16b-upload-preview branch from 405fd67 to 979880b on March 24, 2026 05:02