
[DOM-75543] feat: dataset upload, preview, and schema endpoints #26

Open
ddl-subir-m wants to merge 19 commits into main from subir/pr16b-upload-preview

Conversation

Collaborator

@ddl-subir-m ddl-subir-m commented Mar 23, 2026

Why

Adds the ability to upload files to Domino datasets and inspect data content (preview rows, column schema). These are the mutation and inspection endpoints that build on the viewing layer in PR #25.

Summary

New endpoints:

  • POST /upload: reads file into memory, uploads to project's automl-extension dataset via v4 chunked API, polls snapshot until active, returns metadata + mount path
  • GET /{id}/preview: paginated data preview (CSV/Parquet) via read_tabular_preview
  • GET /{id}/schema: column names and dtypes via read_tabular_schema
  • POST /preview: preview by file path with pagination support
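The pagination contract shared by the preview endpoints can be sketched in plain Python. This is a minimal sketch: the field names (`page`, `page_size`, `total_rows`) are assumptions for illustration, not confirmed from the PR diff.

```python
from typing import Any

def paginate_rows(rows: list[dict[str, Any]], page: int = 1,
                  page_size: int = 50) -> dict[str, Any]:
    """Return one page of preview rows plus pagination metadata.

    Slicing past the end of the list yields an empty page rather
    than an error, so clients can page until `rows` comes back empty.
    """
    total = len(rows)
    start = (page - 1) * page_size
    return {
        "rows": rows[start:start + page_size],
        "page": page,
        "page_size": page_size,
        "total_rows": total,
    }
```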

New storage_resolver methods (consumed by the upload route):

  • upload_file, _upload_chunks — chunked upload with retries
  • ensure_dataset_exists, _resolve_or_create, _create_dataset — dataset provisioning
  • delete_snapshot_files, _get_latest_snapshot_id — snapshot file management
  • _dataset_rw_write_request — dual-host write helper
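What "chunked upload with retries" might look like, as a hedged sketch: the `send_chunk` callback, chunk size, and exponential-backoff policy are all assumptions here; the real `_upload_chunks` talks to the Domino v4 chunked API.

```python
import time
from typing import Callable

def upload_chunks(data: bytes, send_chunk: Callable[[int, bytes], None],
                  chunk_size: int = 8 * 1024 * 1024,
                  max_retries: int = 3, backoff: float = 0.5) -> int:
    """Split data into fixed-size chunks and send each one, retrying
    transient failures with exponential backoff. Returns the chunk count."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    for index, chunk in enumerate(chunks):
        for attempt in range(max_retries):
            try:
                send_chunk(index, chunk)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the final retry
                time.sleep(backoff * 2 ** attempt)
    return len(chunks)
```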

New dataset_service functions:

  • preview_dataset_response, get_dataset_schema_response, preview_file_response — preview/schema orchestration
  • save_uploaded_file — standalone fallback (no Domino project)
  • build_compat_dataset_preview_payload, coerce_preview_response — compat route support
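The standalone fallback could be as small as the following sketch of a `save_uploaded_file`-style helper; the directory layout and name handling are assumptions, not the PR's actual implementation.

```python
import os
import uuid

def save_uploaded_file(filename: str, data: bytes, upload_dir: str) -> str:
    """Fallback when no Domino project is available: write the bytes
    under upload_dir and return the destination path."""
    os.makedirs(upload_dir, exist_ok=True)
    # basename() strips any client-supplied directory components;
    # fall back to a generated name if the filename is empty.
    safe_name = os.path.basename(filename) or f"upload-{uuid.uuid4().hex}"
    dest = os.path.join(upload_dir, safe_name)
    with open(dest, "wb") as f:
        f.write(data)
    return dest
```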

Stacks on PR #25 (dataset viewing).

File → consumer mapping

All new code is consumed within this PR:

| Addition | Consumed by |
| --- | --- |
| storage_resolver.upload_file | routes/datasets.py upload endpoint |
| storage_resolver.ensure_dataset_exists | routes/datasets.py upload endpoint |
| dataset_service.preview_* functions | routes/datasets.py preview/schema endpoints |
| dataset_service.save_uploaded_file | routes/datasets.py upload fallback |

Test plan

  • Upload endpoint returns file metadata + dataset_id + snapshot_verified
  • Preview endpoint returns paginated rows for CSV and Parquet
  • Schema endpoint returns column names and dtypes
  • test_dataset_service.py passes
  • test_storage_resolver.py passes (upload/create/delete tests)

@ddl-subir-m ddl-subir-m requested a review from a team March 23, 2026 14:17
@ddl-subir-m ddl-subir-m requested a review from niole March 23, 2026 14:36
@ddl-subir-m ddl-subir-m changed the title from "feat: dataset upload, preview, and schema endpoints" to "[DOM-75543] feat: dataset upload, preview, and schema endpoints" Mar 23, 2026
ddl-subir-m added a commit that referenced this pull request Mar 23, 2026
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m added a commit that referenced this pull request Mar 24, 2026
…oad endpoint

Merge conflict between PR #25 (removed download) and PR #26 (added
upload/preview/schema). Resolution: keep all PR #26 endpoints, drop the
download_dataset_file endpoint which has no frontend caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m and others added 14 commits March 23, 2026 23:57
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
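The mtime-keyed LRU caching the tabular_data commit describes can be illustrated with a CSV-only sketch (the real module also handles Parquet; function and parameter names here are assumptions):

```python
import csv
import functools
import os

@functools.lru_cache(maxsize=32)
def _read_csv_preview(path: str, mtime: float, limit: int):
    """Cached read. Including mtime in the cache key means a rewritten
    file naturally misses the cache, so entries never go stale."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return tuple(tuple(row) for _, row in zip(range(limit), reader))

def read_tabular_preview(path: str, limit: int = 100) -> list[list[str]]:
    """Public wrapper: derive the cache key from the file's current mtime."""
    rows = _read_csv_preview(path, os.path.getmtime(path), limit)
    return [list(r) for r in rows]
```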
…leware

- domino_dataset_api.py: typed listing wrapper (v2→v1 fallback) using generated client
- domino_http.py: add params/files/headers/base_url to domino_request, streaming download
- middleware.py: debug request/response logging with header redaction
- project_resolver.py: use generated get_project_by_id endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction
  from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
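The precedence described above (X-Project-Id header, then query param, then DOMINO_PROJECT_ID) might be sketched as follows. Since a later commit in this PR drops the env-var fallback as unsafe for cross-project use, the sketch gates it behind a flag that defaults off; the signature is an assumption.

```python
import os
from typing import Mapping, Optional

def resolve_request_project_id(headers: Mapping[str, str],
                               query_params: Mapping[str, str],
                               use_env_fallback: bool = False) -> Optional[str]:
    """Resolve the target project for a request: header wins over
    query param; the env var is only consulted when explicitly enabled,
    because it names the App's own project, not the user's target."""
    project_id = headers.get("X-Project-Id") or query_params.get("project_id")
    if project_id:
        return project_id
    if use_env_fallback:
        return os.environ.get("DOMINO_PROJECT_ID")
    return None
```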
Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing,
  snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback,
  cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot,
  and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user
is working in. Falling back to it silently operates on the wrong project
(root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires
or invalidates. Stale data is worse than a redundant API call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable
on any target deployment. Ryan's PR uses v2 only and it works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses
/svcdatasetpreview which returns JSON rows, not a raw file download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the
schema minimal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the remaining dataset endpoints on top of the viewing layer:
- POST /upload: chunked upload to Domino dataset with snapshot verification
- GET /verify-snapshot: poll snapshot status (upload completion check)
- GET /{id}/preview: paginated data preview (CSV/Parquet)
- GET /{id}/schema: column names and dtypes
- POST /preview: preview by file path with pagination

dataset_service.py: upload orchestration, preview pagination,
schema resolution, compat payload builders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esolver

Extends storage_resolver with methods consumed by the upload and
snapshot verification endpoints:
- ensure_dataset_exists, _resolve_or_create, _create_dataset
- upload_file, _upload_chunks
- delete_snapshot_files, _get_latest_snapshot_id
- _dataset_rw_write_request, _format_http_error
- get_dataset_info, delete_dataset

Download and provisioning methods (download_file, download_directory,
ensure_project_storage, resolve_project_paths, check_project_storage,
invalidate) will be added in PR #22 where they are consumed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add max_upload_size_mb setting (env: AUTOML_MAX_UPLOAD_SIZE_MB,
default 550) and enforce it in the upload route before processing.
Returns HTTP 413 with a clear message when exceeded.

Validation (filename, extension, size) is now done once before
branching into the Domino vs standalone upload paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
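The pre-branch size check could look like this sketch. Only the setting name and the 550 MB default come from the commit; `UploadTooLarge` is a hypothetical stand-in for the framework's HTTP 413 error type.

```python
class UploadTooLarge(Exception):
    """Stand-in for an HTTP 413 Payload Too Large error."""
    status_code = 413

def validate_upload_size(size_bytes: int, max_upload_size_mb: int = 550) -> None:
    """Reject oversized uploads before any processing happens."""
    limit = max_upload_size_mb * 1024 * 1024
    if size_bytes > limit:
        raise UploadTooLarge(
            f"Upload is {size_bytes} bytes; limit is {max_upload_size_mb} MB")
```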
Add cleanup_dataset_cache() to utils.py — walks the dataset_cache
directory and removes files older than the configured TTL. Empty
directories are pruned afterward.

Configurable via AUTOML_DATASET_CACHE_TTL_HOURS (default: 24).
Called at app startup in main.py (PR #22).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
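A sketch of the TTL walk this commit describes, assuming mtime is the age signal (the exact cache layout and return value are assumptions):

```python
import os
import time

def cleanup_dataset_cache(cache_dir: str, ttl_hours: float = 24.0) -> int:
    """Remove cache files older than ttl_hours, then prune directories
    left empty. Returns the number of files removed. Walking bottom-up
    (topdown=False) lets us delete empty subdirectories as we go."""
    cutoff = time.time() - ttl_hours * 3600
    removed = 0
    for root, _dirs, files in os.walk(cache_dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
        if root != cache_dir and not os.listdir(root):
            os.rmdir(root)
    return removed
```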
@ddl-subir-m ddl-subir-m force-pushed the subir/pr16b-upload-preview branch from 405fd67 to 979880b on March 24, 2026 05:02