[DOM-75543] feat: dataset upload, preview, and schema endpoints #26
Open
ddl-subir-m wants to merge 19 commits into main from
Conversation
This was referenced Mar 23, 2026
ddl-subir-m added a commit that referenced this pull request on Mar 23, 2026
Remove upload, download, provisioning, and deletion methods from storage_resolver.py — they will be re-introduced in the PRs that consume them (PR #26 for upload, PR #22 for download/provisioning). Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files, _find_existing, _probe_mount, caching infrastructure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, and row counting with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
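The commit above describes a centralized preview helper with LRU caching keyed so that file edits invalidate stale entries. A minimal sketch of that idea, using the stdlib `csv` module for brevity (the function name, cache size, and the `(path, mtime)` cache key are assumptions, not the module's actual code):

```python
import csv
import functools
import os
from typing import Tuple

# Cache key includes the file's mtime, so modifying the file
# naturally produces a cache miss instead of serving stale rows.
@functools.lru_cache(maxsize=32)
def _read_rows_cached(path: str, mtime: float) -> Tuple[Tuple[str, ...], ...]:
    with open(path, newline="") as fh:
        return tuple(tuple(row) for row in csv.reader(fh))

def read_tabular_preview(path: str, offset: int = 0, limit: int = 50) -> dict:
    """Return a paginated preview of a CSV file: columns, a page of rows, total count."""
    rows = _read_rows_cached(path, os.path.getmtime(path))
    header, body = list(rows[0]), rows[1:]
    page = [list(r) for r in body[offset : offset + limit]]
    return {"columns": header, "rows": page, "total_rows": len(body)}
```

The real module also handles Parquet and schema extraction; the caching pattern is the part this sketch illustrates.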
…leware
- domino_dataset_api.py: typed listing wrapper (v2→v1 fallback) using generated client
- domino_http.py: add params/files/headers/base_url to domino_request, streaming download
- middleware.py: debug request/response logging with header redaction
- project_resolver.py: use generated get_project_by_id endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver
- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction from the X-Project-Id header, query params, and the DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing, snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback, cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot, and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id The env var is the App's own project, not the target project the user is working in. Falling back to it silently operates on the wrong project (root cause of datasets showing empty in cross-project scenarios). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
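After this fix, only explicit request context determines the target project. A minimal sketch of the corrected precedence (the function and header names follow the commit messages, but the body is an illustrative assumption, not the service's actual code):

```python
from typing import Mapping, Optional

def resolve_request_project_id(
    headers: Mapping[str, str],
    query_params: Mapping[str, str],
) -> Optional[str]:
    """Resolve the target project from explicit request context only.

    DOMINO_PROJECT_ID is deliberately NOT consulted: it names the App's
    own project, not the project the user is working in, so falling back
    to it would silently operate on the wrong project.
    """
    for candidate in (headers.get("X-Project-Id"), query_params.get("project_id")):
        if candidate:
            return candidate
    return None  # caller must handle the missing-context case explicitly
```

Returning `None` instead of a guessed default forces callers to surface the missing project context rather than reproduce the cross-project empty-dataset bug.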
The cache is a process-lifetime dict on a singleton — it never expires or invalidates. Stale data is worse than a redundant API call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable on any target deployment. Ryan's PR uses v2 only and it works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses /svcdatasetpreview which returns JSON rows, not a raw file download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the schema minimal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the remaining dataset endpoints on top of the viewing layer:
- POST /upload: chunked upload to Domino dataset with snapshot verification
- GET /verify-snapshot: poll snapshot status (upload completion check)
- GET /{id}/preview: paginated data preview (CSV/Parquet)
- GET /{id}/schema: column names and dtypes
- POST /preview: preview by file path with pagination
dataset_service.py: upload orchestration, preview pagination,
schema resolution, compat payload builders.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esolver

Extends storage_resolver with methods consumed by the upload and snapshot verification endpoints:
- ensure_dataset_exists, _resolve_or_create, _create_dataset
- upload_file, _upload_chunks
- delete_snapshot_files, _get_latest_snapshot_id
- _dataset_rw_write_request, _format_http_error
- get_dataset_info, delete_dataset

Download and provisioning methods (download_file, download_directory, ensure_project_storage, resolve_project_paths, check_project_storage, invalidate) will be added in PR #22 where they are consumed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
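The `_upload_chunks` helper described above splits the stream into fixed-size chunks and retries each one. A self-contained sketch of that pattern, where `send_chunk` is a hypothetical stand-in for the actual Domino v4 transport call, and the chunk size and retry policy are assumptions:

```python
import time
from typing import BinaryIO, Callable

def upload_chunks(
    stream: BinaryIO,
    send_chunk: Callable[[int, bytes], None],
    chunk_size: int = 8 * 1024 * 1024,  # assumed 8 MiB; not the service's actual value
    max_retries: int = 3,
) -> int:
    """Send `stream` as fixed-size chunks, retrying transient failures per chunk."""
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            return index  # total number of chunks sent
        for attempt in range(max_retries):
            try:
                send_chunk(index, data)
                break
            except IOError:
                if attempt == max_retries - 1:
                    raise  # retries exhausted; surface the failure to the caller
                time.sleep(2 ** attempt / 10)  # simple exponential backoff
        index += 1
```

Retrying per chunk rather than per file means a transient failure late in a large upload does not force resending everything before it.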
Add max_upload_size_mb setting (env: AUTOML_MAX_UPLOAD_SIZE_MB, default 550) and enforce it in the upload route before processing. Returns HTTP 413 with a clear message when exceeded. Validation (filename, extension, size) is now done once before branching into the Domino vs standalone upload paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
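The size gate described above can be sketched as follows; the setting name and default match the commit message, but the exception class is a generic stand-in for the framework's HTTP 413 error, not the service's actual code:

```python
import os

# Env-configurable limit, defaulting to 550 MB per the commit message.
MAX_UPLOAD_SIZE_MB = int(os.environ.get("AUTOML_MAX_UPLOAD_SIZE_MB", "550"))

class UploadTooLarge(Exception):
    """Stand-in for the framework's HTTP error (413 Content Too Large)."""
    status_code = 413

def enforce_upload_limit(size_bytes: int) -> None:
    """Reject oversized uploads before any processing begins."""
    limit_bytes = MAX_UPLOAD_SIZE_MB * 1024 * 1024
    if size_bytes > limit_bytes:
        raise UploadTooLarge(
            f"File is {size_bytes} bytes; the limit is {MAX_UPLOAD_SIZE_MB} MB"
        )
```

Running the check once, before branching into the Domino vs standalone paths, keeps both paths free of duplicated validation.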
Add cleanup_dataset_cache() to utils.py — walks the dataset_cache directory and removes files older than the configured TTL. Empty directories are pruned afterward. Configurable via AUTOML_DATASET_CACHE_TTL_HOURS (default: 24). Called at app startup in main.py (PR #22). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
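The TTL sweep described above can be sketched like this; the function name follows the commit message, while the directory layout and return value are illustrative assumptions:

```python
import os
import time

def cleanup_dataset_cache(cache_dir: str, ttl_hours: float = 24.0) -> int:
    """Remove cached files older than the TTL, then prune emptied directories."""
    cutoff = time.time() - ttl_hours * 3600
    removed = 0
    # topdown=False walks bottom-up, so directories emptied by the file
    # sweep can be pruned in the same pass.
    for root, _dirs, files in os.walk(cache_dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
        if root != cache_dir and not os.listdir(root):
            os.rmdir(root)  # prune empty subdirectory, keep the cache root
    return removed
```

Calling this once at startup (as PR #22 does) bounds cache growth without needing a background scheduler.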
Force-pushed from 405fd67 to 979880b
niole approved these changes on Mar 24, 2026
…et-view

# Conflicts:
#	automl-service/app/api/utils.py
#	automl-service/tests/test_api_utils.py
Why
Adds the ability to upload files to Domino datasets and inspect data content (preview rows, column schema). These are the mutation and inspection endpoints that build on the viewing layer in PR #25.
Summary
New endpoints:
- POST /upload — uploads to the automl-extension dataset via the v4 chunked API, polls the snapshot until active, returns metadata + mount path
- GET /{id}/preview — read_tabular_preview
- GET /{id}/schema — read_tabular_schema

New storage_resolver methods (consumed by the upload route):
- upload_file, _upload_chunks — chunked upload with retries
- ensure_dataset_exists, _resolve_or_create, _create_dataset — dataset provisioning
- delete_snapshot_files, _get_latest_snapshot_id — snapshot file management
- _dataset_rw_write_request — dual-host write helper

New dataset_service functions:
- preview_dataset_response, get_dataset_schema_response, preview_file_response — preview/schema orchestration
- save_uploaded_file — standalone fallback (no Domino project)
- build_compat_dataset_preview_payload, coerce_preview_response — compat route support

Stacks on PR #25 (dataset viewing).
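The "polls snapshot until active" step of the upload flow can be sketched as a bounded polling loop; `get_status` here is a hypothetical stand-in for the resolver's snapshot-status call, and the states, timeout, and interval are assumptions:

```python
import time
from typing import Callable

def wait_for_snapshot(
    get_status: Callable[[str], str],
    snapshot_id: str,
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
) -> str:
    """Poll a snapshot until it is active, failing fast on terminal states."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(snapshot_id)
        if status == "Active":
            return status
        if status in ("Failed", "Deleted"):
            # Terminal state: keep polling would never succeed.
            raise RuntimeError(f"snapshot {snapshot_id} ended in state {status}")
        time.sleep(interval_s)
    raise TimeoutError(f"snapshot {snapshot_id} not active after {timeout_s}s")
```

Distinguishing terminal states from a timeout lets the upload endpoint return a precise error instead of a generic "took too long".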
File → consumer mapping
All new code is consumed within this PR:
- storage_resolver.upload_file → routes/datasets.py upload endpoint
- storage_resolver.ensure_dataset_exists → routes/datasets.py upload endpoint
- dataset_service.preview_* functions → routes/datasets.py preview/schema endpoints
- dataset_service.save_uploaded_file → routes/datasets.py upload fallback

Test plan
- test_dataset_service.py passes
- test_storage_resolver.py passes (upload/create/delete tests)