feat: dataset listing API + storage resolver + cross-project file access #17
ddl-subir-m wants to merge 16 commits into main from subir/pr5-dataset-api-storage
Conversation
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aming download and debug middleware

- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…into subir/pr5-dataset-api-storage
- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project, chunked upload via v4 API, streaming download, snapshot file listing, mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils
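The chunked upload mentioned above can be illustrated with a minimal chunking helper. This is a hedged sketch only: `CHUNK_SIZE` and `iter_chunks` are hypothetical names, and the real storage_resolver sends chunks through Domino's v4 upload API rather than just slicing bytes.

```python
# Illustrative sketch of splitting an in-memory payload into fixed-size
# chunks, as a chunked-upload helper might do before POSTing each piece.
# Names are hypothetical; not the actual storage_resolver API.

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per chunk, matching the size discussed in review

def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (index, chunk) pairs; each slice is at most chunk_size bytes."""
    for index, start in enumerate(range(0, len(data), chunk_size)):
        yield index, data[start:start + chunk_size]
```

Each `(index, chunk)` pair would map to one request against the chunked-upload endpoint, so no single request carries the whole file.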
niole
left a comment
If I understand correctly, there are some blocking issues with this PR with respect to which auth method is used to send API requests.
# verify that correct headers are read
# get_domino_auth_headers should NOT use the forwarded token —
# it should fall through to sidecar token (which will fail in test)
# then to API key
Requests to the Domino API, which are meant to retrieve data as the viewing user/save things to their project, should use the forwarded auth headers, because then authentication/authorization for that user will apply.
We should not use the proxy token for things that should be authorized for individual users, like saving data to a project. Doesn't that send the requests using the app owner's identity?
You're right that for user-scoped operations we should ideally use the viewing user's identity. The reason I stopped forwarding the browser token is that the sidecar returns a 500 when it receives it; it only accepts its own internally managed token. I hit this during testing and documented it in main.py (PR #20).
That said, your point stands: for calls that go directly to nucleus-frontend, bypassing the sidecar, we could forward the user's browser token. The capture_auth_header middleware already stores it per-request. We'd just need to use it for direct-to-nucleus calls instead of defaulting to the API key.
Would that address your concern? I can update domino_request and domino_download to forward the captured user token when calling nucleus-frontend directly.
Updated get_domino_auth_headers() to preserve the user's forwarded token for all outbound Domino API calls. Now the sidecar token is only used as a fallback when no user token is present. This change is in PR #14 and cascaded through all downstream PRs.
Stop the sidecar token from overwriting the user's forwarded JWT. When a user token is present (from the Extension-injected Authorization header), outbound calls to datasetrw, jobs, and model registry now run as the visiting user instead of the App owner. The sidecar token is only used as fallback for background tasks and health checks.
…/pr5-dataset-api-storage
Addresses review comment: use the actual header name x-domino-api-key instead of the incorrect domino-api-key.
…/pr5-dataset-api-storage
Some functionality in this PR isn't used in this PR and should be moved to the right PR or deleted so your changes can be reviewed properly. It would help me if the PRs were split into pieces of functionality that work together: for example, one for viewing an already uploaded dataset, a separate one for uploading, another for the parquet/special file handling by itself, and another for implementing caching.
if use_api_key:
    api_key = _get_api_key()
    if api_key:
        auth_headers = {"X-Domino-Api-Key": api_key}
    else:
        auth_headers = await get_domino_auth_headers()
else:
    auth_headers = await get_domino_auth_headers()
I think you mentioned that tokens also work, so maybe we should refactor this to use the get_domino_auth_headers helper.
Suggested change:
-if use_api_key:
-    api_key = _get_api_key()
-    if api_key:
-        auth_headers = {"X-Domino-Api-Key": api_key}
-    else:
-        auth_headers = await get_domino_auth_headers()
-else:
-    auth_headers = await get_domino_auth_headers()
+auth_headers = await get_domino_auth_headers()
os.makedirs(os.path.dirname(dest_path), exist_ok=True)
We should think about what will happen if this file is too big and crashes the pod. If it makes sense, we should write these files to temporary paths so that if the pod restarts, the file will be deleted and the pod can keep running without intervention from the field
Also, do we cleanup these files after we are done using them for whatever they are used for? (haven't checked yet)
Downloads stream in 8KB chunks and uploads send in 8MB chunks (see storage_resolver.py), so neither loads the full file into memory; large files won't OOM the pod.
The upload route does hold the uploaded file in memory (await file.read()) for metadata extraction and chunked upload, but it's now capped at 550 MB (configurable via AUTOML_MAX_UPLOAD_SIZE_MB) and returns HTTP 413 if exceeded. This matches the limit shown in the UI.
For the local cache (dataset_cache/): these go under temp_path which is ephemeral storage, so a pod restart clears them. For long-running pods, cleanup_dataset_cache() runs at startup and deletes files older than 24 hours (configurable via AUTOML_DATASET_CACHE_TTL_HOURS). Both in PR #26.
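The TTL sweep behavior described above can be sketched with stdlib calls only. This is a minimal illustration assuming the 24-hour default from the discussion; the function name, walk strategy, and return value are illustrative, not the actual cleanup_dataset_cache() code.

```python
# Minimal sketch of a TTL-based cache sweep like the cleanup_dataset_cache()
# behavior described above. Files whose mtime is older than the TTL are
# deleted; names and return value are illustrative assumptions.
import os
import time

def sweep_cache(cache_dir: str, ttl_hours: float = 24.0) -> int:
    """Delete cached files older than ttl_hours; return how many were removed."""
    cutoff = time.time() - ttl_hours * 3600
    removed = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return removed
```

Running this at startup keeps long-lived pods from accumulating stale downloads even when ephemeral storage is never wiped by a restart.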
    read_tabular_schema,
    read_upload_metadata,
)
Where is this file used? Can you move it to the PR where it is used so we don't merge anything that's unused
automl-service/app/core/utils.py
    "Failed to resolve dataset '%s' in project %s",
    dataset_name, project_id, exc_info=True,
)
return None
This file also seems to contain functions that are not used. Can you move them to the PR where they are used?
while True:
    resp = await domino_request(
        "GET",
        "/api/datasetrw/v2/datasets",
Can you rebase with the PR that merges the public API and use those helpers instead? This will help us maintain this functionality going forward and eliminate the need to manually parse the response, which will help us avoid any bugs there.
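The hand-rolled loop in the hunk above is offset-based pagination. As a generic sketch (the page-fetching callable stands in for either a raw request or, per this review, a generated-client helper; all names here are illustrative):

```python
# Generic offset-based pagination, roughly what the while-True loop in the
# diff does by hand. fetch_page(offset, limit) returns one page of items;
# a short (or empty) page signals the end of the listing.
from typing import Callable, List

def list_all(fetch_page: Callable[[int, int], List[dict]],
             page_size: int = 100) -> List[dict]:
    """Accumulate pages until a page shorter than page_size is returned."""
    items: List[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return items
```

With a generated client, `fetch_page` becomes a thin wrapper over the typed endpoint, so the termination logic stays in one place instead of being re-parsed from raw JSON at each call site.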
}


class DebugLoggingMiddleware(BaseHTTPMiddleware):
Remove use_api_key parameter and _get_api_key() helper. All downloads now use the standard auth chain which preserves the user's forwarded token and falls back to sidecar when needed.
…/pr5-dataset-api-storage
- Use generated API client for dataset listing (domino_dataset_api.py)
- Remove use_api_key from domino_download callers
- Remove utils.py additions (ensure_local_file, cleanup_dataset_cache, extract_dataset_relative_path) — will be added in PR #16/#22 where used
- Remove test_ensure_local_file.py (moves with the functions)
- Simplify domino_download to use get_domino_auth_headers, remove use_api_key param and _get_api_key helper
- Rewrite domino_dataset_api.py to use generated API client
- Remove duplicate endpoint entries in storage_resolver download_file
- Update storage resolver test for proxy→nucleus fallback
Switch project_resolver from raw domino_request(/v4/projects) to the generated get_project_by_id endpoint (/api/projects/v1/projects). Returns typed ProjectEnvelopeV1 instead of parsing raw JSON dicts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 4 raw domino_request calls to /api/datasetrw/v1/datasets/{id}/snapshots
with a shared _list_snapshots_typed() helper that uses the generated
get_dataset_snapshots endpoint. Returns typed SnapshotDetailsV1 objects
instead of parsing raw JSON with manual envelope unwrapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests now mock _list_snapshots_typed instead of raw domino_request, matching the refactor to use the generated API client for snapshot listing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Re-split for reviewability

Per Niole's feedback, this PR has been re-split so each PR contains functionality alongside its consumers:
Closing this PR in favor of the above. The new PRs stack: #24 → #25 → #26.
Why
The AutoML Extension currently uses the domino-py SDK for dataset operations and relies on a single shared dataset mounted locally. This has several problems:
This PR replaces SDK dataset calls with direct REST API calls and adds a storage resolver that auto-creates a per-project
automl-extension dataset. Files are uploaded via Domino's v4 chunked API (in-memory, no temp files) and downloaded via streaming to avoid memory pressure on large files. ensure_local_file() is the key cross-project enabler: when a file path doesn't exist locally (because it's in a different project's dataset), it downloads the file from the Domino API into a local cache.

Depends on
- domino_request() params/files/headers, domino_download()

Summary
- domino_dataset_api.py — Dataset RW v2/v1 listing API with pagination and version fallback
- storage_resolver.py (~1000 lines) — ensure_dataset_exists(), upload_file() (chunked v4), download_file(), list_snapshot_files(), delete_snapshot_files(), get_rw_snapshot_id(), resolve_project_paths(), mount probing
- utils.py — ensure_local_file(), extract_dataset_relative_path(), cleanup_dataset_cache()

Files changed
- app/services/domino_dataset_api.py — new: Domino dataset listing API
- app/services/storage_resolver.py — new: per-project dataset management
- app/core/utils.py — modified: cross-project file utilities
- tests/test_domino_dataset_api.py — dataset API tests
- tests/test_storage_resolver.py — storage resolver tests
- tests/test_ensure_local_file.py — ensure_local_file tests

Test plan
- test_domino_dataset_api.py passes
- test_storage_resolver.py passes
- test_ensure_local_file.py passes
- automl-extension dataset is auto-created when it doesn't exist
- ensure_local_file downloads and caches cross-project files
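The ensure_local_file() behavior exercised by that last item can be sketched as a cache-or-download lookup. This is a simplified illustration under stated assumptions: the download step is abstracted as a callable (the real code streams from the Domino dataset API), and `ensure_local` and its cache layout are hypothetical names, not the actual utils.py signature.

```python
# Sketch of the ensure_local_file() idea from the PR description:
# return the path if it exists locally, otherwise download into a cache
# directory and return the cached copy. All names are illustrative.
import os
from typing import Callable

def ensure_local(path: str, cache_dir: str,
                 download: Callable[[str, str], None]) -> str:
    """Return a readable local path, downloading into cache_dir on a miss."""
    if os.path.exists(path):
        return path  # already on a local mount
    cached = os.path.join(cache_dir, os.path.basename(path))
    if not os.path.exists(cached):
        os.makedirs(cache_dir, exist_ok=True)
        download(path, cached)  # e.g. stream the file from the dataset API
    return cached
```

Repeated calls for the same missing file hit the cache, so the dataset API is contacted at most once per file until the cache sweep evicts it.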