[DOM-75542] feat: dataset viewing — storage resolver, listing, detail, download #25
ddl-subir-m wants to merge 13 commits into main from
Conversation
```python
@app.get("/svcverifysnapshot")
async def svc_verify_snapshot(dataset_id: str, file_path: str):
    from app.api.routes.datasets import verify_snapshot as _verify_snapshot
```
Why import things inside of the function definition? Is there a circular dependency? If not, I think it's more helpful to import at the top of the file so any import problems surface when the module is loaded.
Yes, circular dependency: these routes import from app.api.routes.datasets, which imports from app.services.dataset_service, which registers back through the compat routes. Added `# avoid circular import` comments.
```python
except FileNotFoundError:
    raise HTTPException(status_code=404, detail=f"File '{file_name}' not found in dataset {dataset_id}")
```

```python
if not os.path.isfile(file_path):
```
get_dataset_file_path should not succeed if the path is not a file. Did you need to add this if statement because there is actually a bug with get_dataset_file_path?
Not a bug in get_dataset_file_path; it can return synthetic paths for cross-project datasets that represent where the file will be mounted in a Job, not where it exists in this App. The download endpoint needs the file to actually exist on disk, so this guard returns a clear 404 instead of a FileResponse crash. Added a comment explaining this.
This sounds like there is a bug with the get_dataset_file_path functionality, where a file path is returned even though no file was found. I would move the file check into the get_dataset_file_path function.
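The reviewer's suggestion could look roughly like the sketch below. The `require_exists` flag, the exception class, and the signature are assumptions for illustration, not the PR's actual API; the idea is that the resolver itself validates existence, while callers that build synthetic Job mount paths can opt out.

```python
import os

class DatasetFileNotFoundError(FileNotFoundError):
    """Raised when a resolved dataset path does not exist on local disk."""

def get_dataset_file_path(mount_root: str, file_name: str,
                          require_exists: bool = True) -> str:
    # Hypothetical resolver: join the dataset mount root with the file name.
    # Cross-project datasets may yield "synthetic" paths that only exist
    # inside a Job container, so existence checking is optional.
    path = os.path.join(mount_root, file_name)
    if require_exists and not os.path.isfile(path):
        raise DatasetFileNotFoundError(path)
    return path
```

With this shape, the download endpoint calls it with the default (`require_exists=True`) and maps `DatasetFileNotFoundError` to a 404, while Job-launch code passes `require_exists=False`.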
```python
)
# If the v2 listing didn't include the RW snapshot ID,
# resolve it eagerly via the v1 snapshots endpoint.
if not rw_sid and ds_id:
```
Why would it not have the right snapshot ID?
The v2 listing response doesn't always populate `readWriteSnapshotId`; it was observed missing in some Domino versions and for recently created datasets. The fallback resolves it via the v1 snapshots endpoint, which reliably returns the active RW snapshot. Without this, file listing and uploads fail for those datasets.
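The fallback described in that reply could be sketched as follows. The field names (`readWriteSnapshotId`, `isReadWrite`) and the injected `fetch_v1_snapshots` callable are illustrative assumptions, not the exact Domino API shapes:

```python
from typing import Callable, Optional

def resolve_rw_snapshot_id(listing_entry: dict,
                           fetch_v1_snapshots: Callable[[str], list]) -> Optional[str]:
    """Prefer the v2 listing field; fall back to scanning v1 snapshots."""
    rw_sid = listing_entry.get("readWriteSnapshotId")
    if rw_sid:
        return rw_sid
    ds_id = listing_entry.get("id")
    if not ds_id:
        return None
    # v1 snapshots endpoint reliably reports the active RW snapshot.
    for snap in fetch_v1_snapshots(ds_id):
        if snap.get("isReadWrite"):
            return snap.get("id")
    return None
```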
I will mention this to the data team.
```python
val = os.environ.get(env_key)
if not val:
    continue
for part in val.replace(",", ":").replace(";", ":").split(":"):
```
download_file fetches both the latest snapshot ID and the RW snapshot ID because the RW snapshot is the mutable head with the most recent uploads. The latest snapshot could be an older read-only snapshot. We use the RW snapshot's v4 raw endpoint, which is the most reliable for cross-project downloads.
```python
if os.path.isdir(path) and os.access(path, os.W_OK):
    return path
# Also check env-provided mount roots
for env_key in ("DOMINO_DATASET_MOUNT_PATH", "DOMINO_MOUNT_PATHS"):
```
TBH, we should probably prefer these, since they are domino provided
Already the case. The endpoints list is built with rw_snapshot_id (domino-provided) first, so it's always tried before any fallback.
niole left a comment
Left a question about the ProjectStorageResolver. I saw some errors getting swallowed, and some code that falls back to older versions of the datasets APIs, which I left questions about.
I don't get why we keep importing things inside the functions where they are used. Python executes imports at runtime, so if there is any issue with dependency resolution we won't know until that line of code actually runs.
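The runtime-import point can be demonstrated with a tiny self-contained example (the module name is deliberately bogus): defining the function succeeds, and the broken dependency only surfaces when the function is called.

```python
def handler():
    # This import only executes when handler() is called, so a broken
    # dependency stays hidden until the first call reaches this line.
    import no_such_module_hopefully  # noqa: F401
    return "ok"

# Defining handler() above succeeded; calling it surfaces the error.
try:
    handler()
    failed_lazily = False
except ModuleNotFoundError:
    failed_lazily = True
```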
```python
@dataclass
class ProjectStorageResolver:
```
What is your intention for this class? What would change if we always hit the domino apis to get this information?
ProjectPaths is a convenience struct so callers (job launcher, training worker) get pre-computed subdirectory paths (models/, uploads/, eda_results/) without each re-deriving them from the mount root. If we always hit Domino APIs, the mount path and subdirectory layout would be the same; this class just avoids repeating os.path.join(mount, "models") in every caller. It also handles standalone mode, where paths come from settings instead of Domino.
Could we replace this class with some helper functions?
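A hedged sketch of what that replacement might look like: plain helpers keyed off the mount root. The subdirectory names (models/, uploads/, eda_results/) come from the reply above; the standalone-mode settings lookup is omitted, and these function names are made up for illustration.

```python
import os

def models_dir(mount_root: str) -> str:
    # Pre-computed subdirectory paths, no resolver instance needed.
    return os.path.join(mount_root, "models")

def uploads_dir(mount_root: str) -> str:
    return os.path.join(mount_root, "uploads")

def eda_results_dir(mount_root: str) -> str:
    return os.path.join(mount_root, "eda_results")
```

The trade-off is mostly ergonomic: the dataclass bundles the paths and the standalone/Domino mode decision in one place, while free functions avoid carrying state but push the mount-root lookup onto every caller.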
```python
# Always list files from the snapshot API — mounted filesystems use
# read-only snapshots that go stale when files are deleted in Domino.
if not rw_snapshot_id and dataset_id:
    from app.services.storage_resolver import get_storage_resolver
```
Please import at top of file
Circular import. dataset_manager.py is in app.core and storage_resolver.py is in app.services which imports from app.core. Moving it top-level would create an import cycle at module load time.
- Remove unnecessary _resolve_project_id wrapper in dataset routes
- Move `import asyncio` to top level in datasets.py
- Change HTTP 502 to 500 for internal errors (3 locations)
- Downgrade dataset detail fetch log from error to debug
- Remove dead _extract_dataset_list wrapper from storage_resolver
- Add circular import comments in compat routes
- Add comment explaining isfile guard on cross-project paths
- Update tests to import extract_dataset_list from domino_dataset_api

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```diff
- 1. Get auth headers from request from auth context
- 2. Fallback to the ephemeral token from Domino App/Run sidecar (localhost:8899)
+ 1. Forwarded user token from the incoming request (per-request context)
+ 2. Ephemeral token from the Domino App/Run sidecar (localhost:8899)
```
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
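The "preview with LRU caching" idea from that commit could be sketched as below. This is a simplified stand-in for the PR's tabular_data module (which also handles parquet and schema inference); the function name and cache size are assumptions, and stdlib `csv` stands in for pandas.

```python
import csv
from functools import lru_cache

@lru_cache(maxsize=32)
def preview_csv(path: str, n_rows: int = 5):
    """Cached CSV preview: header plus the first n_rows data rows.

    The cache key is (path, n_rows), so repeated previews of the same
    file skip the re-read entirely.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

One caveat worth noting: a path-keyed `lru_cache` never observes file modifications, so a production version would likely fold the file's mtime into the key or bound entry lifetime.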
…leware

- domino_dataset_api.py: typed listing wrapper (v2→v1 fallback) using generated client
- domino_http.py: add params/files/headers/base_url to domino_request, streaming download
- middleware.py: debug request/response logging with header redaction
- project_resolver.py: use generated get_project_by_id endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces the complete dataset viewing pipeline:

- storage_resolver.py: per-project dataset lifecycle, mount probing, snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback, cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot, and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from storage_resolver.py — they will be re-introduced in the PRs that consume them (PR #26 for upload, PR #22 for download/provisioning). Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files, _find_existing, _probe_mount, caching infrastructure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user is working in. Falling back to it silently operates on the wrong project (root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires or invalidates. Stale data is worse than a redundant API call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
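If caching were reintroduced later, an expiring cache avoids the staleness the commit message describes. A minimal sketch (the class name and API are assumptions; a real implementation would also bound the size):

```python
import time

class TTLCache:
    """Tiny expiring cache: entries older than ttl seconds are re-fetched."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._data = {}  # key -> (inserted_at, value)

    def get(self, key, fetch):
        now = time.monotonic()
        hit = self._data.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]  # still fresh
        value = fetch()    # expired or missing: re-fetch and store
        self._data[key] = (now, value)
        return value
```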
The v2→v1 fallback was speculative — no evidence that v2 is unreliable on any target deployment. Ryan's PR uses v2 only and it works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses /svcdatasetpreview which returns JSON rows, not a raw file download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the schema minimal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l Domino API calls

Sidecar token and static API key fallbacks silently escalated to the App-owner identity when no user JWT was forwarded, bypassing Domino's per-user RBAC. Now all outbound Domino API calls require the visiting user's forwarded token and raise MissingUserTokenError if absent. Background job syncs already capture the user token before detaching, so they are unaffected. Job runner workers run in separate containers with their own env and don't import these functions.

Addresses Ryan's review comment on PR #25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
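The enforcement described in that commit might look roughly like this. `MissingUserTokenError` is named in the commit message; the function name and header shape here are assumptions for illustration.

```python
from typing import Optional

class MissingUserTokenError(RuntimeError):
    """No forwarded user token; refuse to fall back to App-owner identity."""

def auth_headers(forwarded_token: Optional[str]) -> dict:
    # No sidecar-token or API-key fallback: those would silently
    # escalate the request to the App owner's identity.
    if not forwarded_token:
        raise MissingUserTokenError(
            "Domino API call requires the visiting user's forwarded token")
    return {"Authorization": f"Bearer {forwarded_token}"}
```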
Force-pushed 199624f to d26d314
```python
if limit is not None:
    kwargs["limit"] = limit
result = get_dataset_snapshots.sync(dataset_id, client=client, **kwargs)
if isinstance(result, PaginatedSnapshotEnvelopeV1):
```
hmm, I can't find it. Where is _find_existing exception handler called in this function?
…et-view

# Conflicts:
#   automl-service/app/api/utils.py
#   automl-service/tests/test_api_utils.py
Why
Provides the complete pipeline for viewing datasets through the API. Users can list datasets scoped to their project, get dataset details, browse files via snapshot, and download individual files.
Summary
- Snapshot operations (get_rw_snapshot_id, get_latest_snapshot_status, list_snapshot_files), dataset discovery (_find_existing), mount probing, caching
- Endpoints: GET / (list), GET /{id} (detail), GET /verify-snapshot, GET /{id}/files/{name}/download

Stacks on PR #24 (listing API) + PR #13 (leaderboard utils / request project ID).
Replaces the storage_resolver portion of old PR #17 + the viewing portion of old PR #16. Storage resolver is trimmed to only the methods consumed by this PR. Upload methods are added in PR #26, download/provisioning in PR #22.
File → consumer mapping
All code in this PR is consumed within the PR:
| File | Consumers |
| --- | --- |
| storage_resolver.py | dataset_manager.py (snapshot listing, RW ID), routes/datasets.py (verify-snapshot) |
| dataset_manager.py | dataset_service.py, routes/datasets.py (via get_dataset_manager dependency) |
| dataset_service.py | routes/datasets.py (list, get_or_404) |
| schemas/dataset.py | routes/datasets.py, dataset_service.py (response models) |

Test plan
- test_storage_resolver.py passes
- test_storage_resolver_extras.py passes
- test_dataset_manager.py passes
- test_api_datasets.py passes