feat: API-first dataset listing + cross-project support + snapshot verification #16

Closed
ddl-subir-m wants to merge 30 commits into main from subir/pr8-dataset-manager

ddl-subir-m (Collaborator) commented Mar 20, 2026

Why

The dataset manager currently lists datasets by scanning local mount paths (/domino/datasets/). This approach fails for cross-project scenarios where datasets aren't mounted locally, and it can't distinguish between datasets from different projects that happen to have the same name.

To remove the shared dataset dependency, we need datasets listed via the Domino API (project-scoped), files uploaded directly to per-project datasets (not a shared local mount), and cross-project file access for preview and profiling.

This PR rewrites the dataset layer to be API-first: list via Domino API, upload via chunked API, download via streaming API, and only fall back to local filesystem when running in standalone/local mode.
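
The API-first behaviour with a local fallback can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `list_datasets`, the `api_lister` callable, and the `source` key are hypothetical names, and the real `dataset_manager.py` may structure this differently.

```python
from pathlib import Path
from typing import Callable, Optional


def list_datasets(
    project_id: Optional[str],
    api_lister: Callable[[str], list[dict]],
    local_root: Path,
) -> list[dict]:
    """Prefer the API when a project is known; otherwise scan the local mount.

    `api_lister` is a stand-in (assumption) for whatever client call
    returns the project's datasets as dictionaries.
    """
    if project_id:
        # Project-scoped listing: same-named datasets from different
        # projects stay distinguishable because the API returns them per project.
        return [{**d, "source": "api"} for d in api_lister(project_id)]
    if not local_root.is_dir():
        return []
    # Standalone/local mode: treat each subdirectory of the mount as a dataset.
    return [
        {"name": p.name, "path": str(p), "source": "local"}
        for p in sorted(local_root.iterdir())
        if p.is_dir()
    ]
```

Passing the lister in as a callable keeps the sketch testable without a live Domino instance; the filesystem scan only runs when no project ID is available.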

Depends on

Summary

  • dataset_manager.py — API-first listing via list_project_datasets(), snapshot-based file listing for unmounted datasets, cross-project mount safety (only resolves local paths when dataset belongs to App's project)
  • dataset_service.py — replaces pandas with tabular_data helpers, adds project_id and include_files params
  • datasets.py route — project-scoped listing, upload-to-Domino-dataset path, verify-snapshot endpoint, download-dataset-file endpoint, ensure_local_file for preview
  • schemas/dataset.py — mounted field on files, dataset_id/snapshot_file_path/snapshot_verified on upload response
  • custom_datasets.py — project-scoped compat routes, svcdataset GET, svcverifysnapshot
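
To illustrate the cross-project mount safety rule from the first bullet, here is a small sketch. The function name `resolve_local_path` and its parameters are assumptions made for the example; the only behaviour taken from the description is that a local path is resolved only when the dataset belongs to the App's own project.

```python
from pathlib import Path
from typing import Optional


def resolve_local_path(
    dataset_name: str,
    dataset_project_id: str,
    app_project_id: str,
    mount_root: Path,
) -> Optional[Path]:
    """Return a local mount path only for datasets owned by the App's project."""
    if dataset_project_id != app_project_id:
        # A same-named directory under the mount could belong to a different
        # dataset, so cross-project datasets are never resolved locally.
        return None
    candidate = mount_root / dataset_name
    return candidate if candidate.exists() else None
```

A `None` result would correspond to a file being reported as unmounted, at which point the caller has to go through the API instead of the filesystem.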

Files changed

  • app/core/dataset_manager.py — rewritten: API-first dataset listing
  • app/services/dataset_service.py — modified: tabular_data helpers, project_id param
  • app/api/routes/datasets.py — modified: new endpoints, cross-project upload/download
  • app/api/schemas/dataset.py — modified: new response fields
  • app/api/compat/custom_datasets.py — modified: project-scoped compat routes
  • tests/test_dataset_manager.py — dataset manager tests
  • tests/test_api_datasets.py — dataset API tests
  • tests/test_dataset_service.py — dataset service tests

Test plan

  • test_dataset_manager.py passes
  • test_api_datasets.py passes
  • test_dataset_service.py passes
  • Dataset listing returns API results when project_id provided
  • Cross-project datasets visible but files marked as unmounted
  • Upload to Domino dataset works via chunked API
  • Snapshot verification returns correct status
  • File preview works for cross-project files

ddl-subir-m and others added 7 commits March 20, 2026 09:44
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction
  from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
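
The project context extraction described in the commit above might look something like this sketch. The precedence order (header, then query parameter, then environment) and the `project_id` query parameter name are assumptions; only the `X-Project-Id` header and the `DOMINO_PROJECT_ID` variable are named in the commit message, and the real `resolve_request_project_id()` signature is not shown here.

```python
import os
from typing import Mapping, Optional


def resolve_project_id(
    headers: Mapping[str, str],
    query_params: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick a project ID from the header, query string, or environment."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    candidates = (
        lowered.get("x-project-id"),
        query_params.get("project_id"),  # assumed parameter name
        environ.get("DOMINO_PROJECT_ID"),
    )
    for candidate in candidates:
        # Skip missing and blank values so an empty header does not win.
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```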
…aming download and debug middleware

- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fication

- Rewrite dataset_manager to prefer Domino Dataset RW API over filesystem scan
- Cross-project mount safety: only resolve local paths for same-project datasets
- Upload route: direct to Domino dataset via storage_resolver chunked API
- New verify-snapshot endpoint for polling snapshot status after upload
- New download-dataset-file endpoint
- Preview route: ensure_local_file for cross-project files
- Replace pandas with tabular_data helpers
- Compat routes: project-scoped listing, svcdataset, svcverifysnapshot
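
Polling snapshot status after an upload, as the verify-snapshot bullet describes, could be driven by a loop like the one below. This is a generic bounded-polling sketch: `wait_for_snapshot`, the `check` callable, and the retry defaults are all hypothetical and do not reflect the endpoint's actual contract.

```python
import time
from typing import Callable


def wait_for_snapshot(
    check: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it reports success or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Here `check` would wrap a call to the verification endpoint and return whether the uploaded file is visible in a snapshot; injecting `sleep` keeps the loop easy to test.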
ddl-subir-m requested a review from a team March 20, 2026 16:41
- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project,
  chunked upload via v4 API, streaming download, snapshot file listing,
  mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local
  mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils
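
The idea behind ensure_local_file (use the mounted file when present, otherwise download it) can be sketched as below. The name `ensure_local_file_sketch`, the explicit `cache_path` argument, and the `download` callable are assumptions for illustration; the actual helper's signature and caching policy are not given in this PR text.

```python
from pathlib import Path
from typing import Callable


def ensure_local_file_sketch(
    mount_path: Path,
    cache_path: Path,
    download: Callable[[Path], None],
) -> Path:
    """Return a readable local path, downloading only when necessary."""
    if mount_path.is_file():
        # Same-project case: the file is already on the local mount.
        return mount_path
    if not cache_path.is_file():
        # Cross-project case: fetch once into a cache location.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        download(cache_path)
    return cache_path
```

In this sketch `download` stands in for a streaming download from the dataset API; a second call for the same file reuses the cached copy rather than downloading again.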
ddl-subir-m and others added 19 commits March 20, 2026 14:51
Stop the sidecar token from overwriting the user's forwarded JWT.
When a user token is present (from the Extension-injected Authorization
header), outbound calls to datasetrw, jobs, and model registry now run
as the visiting user instead of the App owner. The sidecar token is
only used as fallback for background tasks and health checks.
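
The token precedence this commit describes can be summarised in a few lines. This is a sketch under assumptions: `pick_outbound_auth` is a hypothetical name, and sending the sidecar token with a `Bearer` scheme is a guess rather than something stated in the commit.

```python
from typing import Optional


def pick_outbound_auth(
    user_authorization: Optional[str],
    sidecar_token: Optional[str],
) -> dict[str, str]:
    """Prefer the visiting user's forwarded header over the sidecar token."""
    if user_authorization:
        # Forward the user's Authorization header unchanged so outbound
        # calls run as the visiting user, not the App owner.
        return {"Authorization": user_authorization}
    if sidecar_token:
        # Fallback for background tasks and health checks (Bearer is assumed).
        return {"Authorization": f"Bearer {sidecar_token}"}
    return {}
```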
Addresses review comment: use the actual header name x-domino-api-key
instead of the incorrect domino-api-key.
Remove use_api_key parameter and _get_api_key() helper. All downloads
now use the standard auth chain which preserves the user's forwarded
token and falls back to sidecar when needed.
- Use generated API client for dataset listing (domino_dataset_api.py)
- Remove use_api_key from domino_download callers
- Remove utils.py additions (ensure_local_file, cleanup_dataset_cache,
  extract_dataset_relative_path) — will be added in PR #16/#22 where used
- Remove test_ensure_local_file.py (moves with the functions)
Switch project_resolver from raw domino_request(/v4/projects) to the
generated get_project_by_id endpoint (/api/projects/v1/projects).
Returns typed ProjectEnvelopeV1 instead of parsing raw JSON dicts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 4 raw domino_request calls to /api/datasetrw/v1/datasets/{id}/snapshots
with a shared _list_snapshots_typed() helper that uses the generated
get_dataset_snapshots endpoint. Returns typed SnapshotDetailsV1 objects
instead of parsing raw JSON with manual envelope unwrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace raw domino_request call with generated get_dataset endpoint
in dataset_manager.get_dataset(). Returns typed DatasetRwEnvelopeV1
instead of manually unwrapping JSON envelopes. Removes httpx and
domino_request imports (no longer needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests now mock _list_snapshots_typed instead of raw domino_request,
matching the refactor to use the generated API client for snapshot
listing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m and others added 2 commits March 23, 2026 08:52
Tests now mock the new _fetch_dataset_details static method instead of
the removed _api_request, matching the refactor to use the generated
API client.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m (Collaborator, Author) commented

Re-split for reviewability

Per Niole's feedback, this PR has been re-split so each PR contains functionality alongside its consumers:

Closing this PR in favor of the above. The new PRs stack: #24, #25, #26.
