
[DOM-75542] feat: dataset viewing — storage resolver, listing, detail, download #25

Open
ddl-subir-m wants to merge 13 commits into main from subir/pr16a-dataset-view

Conversation

Collaborator

ddl-subir-m commented Mar 23, 2026

Why

Provides the complete pipeline for viewing datasets through the API. Users can list datasets scoped to their project, get dataset details, browse files via snapshot, and download individual files.

Summary

  • storage_resolver.py (viewing methods only): snapshot management (get_rw_snapshot_id, get_latest_snapshot_status, list_snapshot_files), dataset discovery (_find_existing), mount probing, caching
  • dataset_manager.py: API-first listing with local filesystem fallback, cross-project support, snapshot-based file listing, dataset detail with incremental cache merging
  • routes/datasets.py: GET / (list), GET /{id} (detail), GET /verify-snapshot, GET /{id}/files/{name}/download
  • dataset_service.py: listing orchestration, mount-path filtering

Stacks on PR #24 (listing API) + PR #13 (leaderboard utils / request project ID).

Replaces the storage_resolver portion of old PR #17 + the viewing portion of old PR #16. Storage resolver is trimmed to only the methods consumed by this PR. Upload methods are added in PR #26, download/provisioning in PR #22.

File → consumer mapping

All code in this PR is consumed within the PR:

File → Consumed by
storage_resolver.py → dataset_manager.py (snapshot listing, RW ID), routes/datasets.py (verify-snapshot)
dataset_manager.py → dataset_service.py, routes/datasets.py (via get_dataset_manager dependency)
dataset_service.py → routes/datasets.py (list, get_or_404)
schemas/dataset.py → routes/datasets.py, dataset_service.py (response models)

Test plan

  • test_storage_resolver.py passes
  • test_storage_resolver_extras.py passes
  • test_dataset_manager.py passes
  • test_api_datasets.py passes


@app.get("/svcverifysnapshot")
async def svc_verify_snapshot(dataset_id: str, file_path: str):
from app.api.routes.datasets import verify_snapshot as _verify_snapshot
Collaborator

Why import things inside the function definition? Is there a circular dependency? If not, I think it's more helpful to import at the top of the file so any import errors surface as soon as the module is loaded.

Collaborator Author

Yes, circular dependency: these routes import from app.api.routes.datasets, which imports from app.services.dataset_service, which registers back through the compat routes. Added # avoid circular import comments.
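A self-contained toy of the deferred-import pattern being discussed. The module names mod_a and mod_b are invented for illustration; the real cycle is between app.api.routes.datasets, app.services.dataset_service, and the compat routes.

```python
import sys
import types

def make_module(name: str, source: str) -> types.ModuleType:
    """Register and execute a tiny in-memory module (test scaffolding only)."""
    mod = types.ModuleType(name)
    sys.modules[name] = mod
    exec(source, mod.__dict__)
    return mod

# mod_b needs a symbol from mod_a, but importing mod_a at module load would
# re-enter a half-initialized module, so the import is deferred to call time.
make_module("mod_b", """
def shout(text):
    # Deferred import: breaks the load-time cycle mod_a -> mod_b -> mod_a.
    from mod_a import greet  # avoid circular import
    return greet(text).upper()
""")

# mod_a can import mod_b at the top level because mod_b no longer needs
# mod_a during its own import.
make_module("mod_a", """
import mod_b

def greet(text):
    return f"hello {text}"
""")
```

Calling `sys.modules["mod_b"].shout("world")` resolves `greet` at call time, by which point mod_a is fully loaded. The cost, as the reviewer notes, is that a broken import is only discovered when that line runs.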

except FileNotFoundError:
raise HTTPException(status_code=404, detail=f"File '{file_name}' not found in dataset {dataset_id}")

if not os.path.isfile(file_path):
Collaborator

get_dataset_file_path should not succeed if the path is not a file. Did you need to add this if statement because there is actually a bug with get_dataset_file_path?

Collaborator Author

Not a bug in get_dataset_file_path; it can return synthetic paths for cross-project datasets that represent where the file will be mounted in a Job, not where it exists in this App. The download endpoint needs the file to actually exist on disk, so this guard returns a clear 404 instead of a FileResponse crash. Added a comment explaining this.
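A minimal sketch of the guard described above. HTTPException is stubbed here so the snippet runs standalone; in the PR it is FastAPI's exception, and the path comes from get_dataset_file_path.

```python
import os

class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException (sketch only)."""
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def ensure_local_file(file_path: str, file_name: str, dataset_id: str) -> str:
    # get_dataset_file_path may return a *synthetic* path for cross-project
    # datasets: where the file will be mounted in a Job, not where it exists
    # in this App. Downloads stream from local disk, so the file must be real.
    if not os.path.isfile(file_path):
        raise HTTPException(
            status_code=404,
            detail=f"File '{file_name}' not found in dataset {dataset_id}",
        )
    return file_path
```

The helper name ensure_local_file is hypothetical; the PR inlines the check in the download route.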

Collaborator

This sounds like there is a bug with the get_dataset_file_path functionality, where a file path is returned even though no file exists. I would move the file check into the get_dataset_file_path function.

)
# If the v2 listing didn't include the RW snapshot ID,
# resolve it eagerly via the v1 snapshots endpoint.
if not rw_sid and ds_id:
Collaborator

Why would it not have the right snapshot ID?

Collaborator Author

The v2 listing response doesn't always populate readWriteSnapshotId; we've observed it missing in some Domino versions and for recently created datasets. The fallback resolves it via the v1 snapshots endpoint, which reliably returns the active RW snapshot. Without this, file listing and uploads fail for those datasets.
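The fallback can be sketched roughly as below. fetch_v1_snapshots is a hypothetical stand-in for the v1 snapshots call (the PR goes through the generated client), and the isReadWrite field name is an assumption.

```python
from typing import Callable, Optional

def resolve_rw_snapshot_id(
    dataset: dict,
    fetch_v1_snapshots: Callable[[str], list],
) -> Optional[str]:
    # The v2 listing sometimes omits readWriteSnapshotId, so trust it
    # only when present.
    rw_sid = dataset.get("readWriteSnapshotId")
    if rw_sid:
        return rw_sid
    # Fall back to the v1 snapshots endpoint, which reliably reports
    # the active read-write snapshot for the dataset.
    for snap in fetch_v1_snapshots(dataset["id"]):
        if snap.get("isReadWrite"):
            return snap["id"]
    return None
```

Resolving eagerly at listing time means downstream file listing and upload paths never see a dataset without an RW snapshot ID.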

Collaborator

I will mention this to the data team.

val = os.environ.get(env_key)
if not val:
continue
for part in val.replace(",", ":").replace(";", ":").split(":"):
Collaborator

What is this logic for?

Collaborator Author

download_file fetches both the latest snapshot ID and the RW snapshot ID because the RW snapshot is the mutable head with the most recent uploads. The latest snapshot could be an older read-only snapshot. We use the RW snapshot's v4 raw endpoint, which is the most reliable for cross-project downloads.
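The quoted snippet itself is normalizing delimiter-separated mount roots from env vars. Read as a standalone helper it looks like this; the env var names come from the snippet, while the strip and de-duplication are illustrative additions.

```python
import os
from typing import Iterable, List

def env_mount_roots(
    env_keys: Iterable[str] = ("DOMINO_DATASET_MOUNT_PATH", "DOMINO_MOUNT_PATHS"),
) -> List[str]:
    """Collect candidate mount roots from env vars whose values may be
    separated by ':', ',' or ';'."""
    roots: List[str] = []
    for env_key in env_keys:
        val = os.environ.get(env_key)
        if not val:
            continue
        # Normalize every supported separator to ':' before splitting,
        # so one pass handles all delimiter conventions.
        for part in val.replace(",", ":").replace(";", ":").split(":"):
            part = part.strip()
            if part and part not in roots:
                roots.append(part)
    return roots
```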

if os.path.isdir(path) and os.access(path, os.W_OK):
return path
# Also check env-provided mount roots
for env_key in ("DOMINO_DATASET_MOUNT_PATH", "DOMINO_MOUNT_PATHS"):
Collaborator

TBH, we should probably prefer these, since they are domino provided

Collaborator Author

Already the case. The endpoints list is built with rw_snapshot_id (domino-provided) first, so it's always tried before any fallback.

Collaborator

niole left a comment

Left a question about the ProjectStorageResolver. I saw some errors getting swallowed, and some code that falls back to older versions of the datasets APIs, which I left questions about.

I don't get why we keep importing things inside the functions where they are used. Python resolves imports at runtime, so if there is any issue with dependency resolution we won't know until that line of code is executed.



@dataclass
class ProjectStorageResolver:
Collaborator

What is your intention for this class? What would change if we always hit the domino apis to get this information?

Collaborator Author

ProjectPaths is a convenience struct so callers (job launcher, training worker) get pre-computed subdirectory paths (models/, uploads/, eda_results/) without each re-deriving them from the mount root. If we always hit the Domino APIs, the mount path and subdirectory layout would be the same; this class just avoids repeating os.path.join(mount, "models") in every caller. It also handles standalone mode, where paths come from settings instead of Domino.
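A sketch of what that struct might look like. The field names follow the subdirectories mentioned above; the standalone-mode handling is elided, so this is illustrative, not the PR's actual class.

```python
import os
from dataclasses import dataclass, field

@dataclass
class ProjectPaths:
    """Pre-computed per-project paths derived once from the mount root."""
    mount_root: str
    models: str = field(init=False)
    uploads: str = field(init=False)
    eda_results: str = field(init=False)

    def __post_init__(self) -> None:
        # Derive each subdirectory once, so callers (job launcher,
        # training worker) never repeat os.path.join(mount, "models").
        self.models = os.path.join(self.mount_root, "models")
        self.uploads = os.path.join(self.mount_root, "uploads")
        self.eda_results = os.path.join(self.mount_root, "eda_results")
```

The reviewer's follow-up question (could this be plain helper functions?) would amount to replacing each field with a one-line function taking mount_root; the dataclass mainly buys a single object to pass around.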

Collaborator

Could we replace this class with some helper functions?

# Always list files from the snapshot API — mounted filesystems use
# read-only snapshots that go stale when files are deleted in Domino.
if not rw_snapshot_id and dataset_id:
from app.services.storage_resolver import get_storage_resolver
Collaborator

Please import at top of file

Collaborator Author

Circular import. dataset_manager.py is in app.core and storage_resolver.py is in app.services which imports from app.core. Moving it top-level would create an import cycle at module load time.

ddl-subir-m added a commit that referenced this pull request Mar 23, 2026
- Remove unnecessary _resolve_project_id wrapper in dataset routes
- Move `import asyncio` to top-level in datasets.py
- Change HTTP 502 to 500 for internal errors (3 locations)
- Downgrade dataset detail fetch log from error to debug
- Remove dead _extract_dataset_list wrapper from storage_resolver
- Add circular import comments in compat routes
- Add comment explaining isfile guard on cross-project paths
- Update tests to import extract_dataset_list from domino_dataset_api

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Forwarded user token from the incoming request (per-request context)
2. Ephemeral token from the Domino App/Run sidecar (localhost:8899)

should never use 2 or 3

ddl-subir-m added a commit that referenced this pull request Mar 24, 2026
…oad endpoint

Merge conflict between PR #25 (removed download) and PR #26 (added
upload/preview/schema). Resolution: keep all PR #26 endpoints, drop the
download_dataset_file endpoint which has no frontend caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m and others added 10 commits March 23, 2026 23:57
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
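The tabular_data idea above (one cached preview helper instead of scattered read calls) can be sketched as follows. The real module uses pandas for CSV/parquet; this sketch previews CSV with the stdlib so it stays dependency-free, and the function name is assumed.

```python
import csv
from functools import lru_cache
from typing import Tuple

@lru_cache(maxsize=32)
def preview_csv(path: str, n_rows: int = 5) -> Tuple[tuple, tuple]:
    """Return (header, first n_rows rows) of a CSV file, cached per
    (path, n_rows) so repeated previews of the same file are free.

    Caveat: an LRU keyed on the path never notices file *changes*; a real
    module would need invalidation if files mutate in place."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    # Tuples, not lists, so the cached value is immutable and hashable.
    return tuple(header), tuple(map(tuple, rows))
```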
…leware

- domino_dataset_api.py: typed listing wrapper (v2→v1 fallback) using generated client
- domino_http.py: add params/files/headers/base_url to domino_request, streaming download
- middleware.py: debug request/response logging with header redaction
- project_resolver.py: use generated get_project_by_id endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction
  from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
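A sketch of the precedence resolve_request_project_id centralizes. A later commit in this stack drops the DOMINO_PROJECT_ID env fallback (it names the App's own project, not the user's target), so this sketch checks only the explicit per-request sources; the query parameter name project_id is an assumption.

```python
from typing import Mapping, Optional

def resolve_request_project_id(
    headers: Mapping[str, str],
    query_params: Mapping[str, str],
) -> Optional[str]:
    """Resolve the target project for a request.

    Precedence: explicit X-Project-Id header, then query parameter.
    No silent env fallback: that would operate on the wrong project
    in cross-project scenarios."""
    return headers.get("X-Project-Id") or query_params.get("project_id") or None
```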
Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing,
  snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback,
  cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot,
  and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user
is working in. Falling back to it silently operates on the wrong project
(root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires
or invalidates. Stale data is worse than a redundant API call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable
on any target deployment. Ryan's PR uses v2 only and it works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses
/svcdatasetpreview which returns JSON rows, not a raw file download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the
schema minimal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m added a commit that referenced this pull request Mar 24, 2026
…l Domino API calls

Sidecar token and static API key fallbacks silently escalated to the
App-owner identity when no user JWT was forwarded, bypassing Domino's
per-user RBAC. Now all outbound Domino API calls require the visiting
user's forwarded token and raise MissingUserTokenError if absent.

Background job syncs already capture the user token before detaching,
so they are unaffected. Job runner workers run in separate containers
with their own env and don't import these functions.

Addresses Ryan's review comment on PR #25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
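The hard requirement described in this commit could look roughly like the helper below. The exception name MissingUserTokenError comes from the commit message; the helper's shape and the Bearer-header parsing are assumptions.

```python
from typing import Mapping

class MissingUserTokenError(RuntimeError):
    """Raised when no forwarded user token is available for a Domino API call."""

def require_user_token(headers: Mapping[str, str]) -> str:
    # Deliberately no sidecar-token or static API-key fallback: those
    # silently escalate to the App-owner identity and bypass Domino's
    # per-user RBAC, which is what this change closes off.
    token = headers.get("Authorization", "").removeprefix("Bearer ").strip()
    if not token:
        raise MissingUserTokenError("No forwarded user token on this request")
    return token
```

Background job syncs stay compatible because they capture the token up front and pass it through, rather than relying on any process-level fallback.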
ddl-subir-m force-pushed the subir/pr16a-dataset-view branch from 199624f to d26d314 (March 24, 2026 05:02)



if limit is not None:
kwargs["limit"] = limit
result = get_dataset_snapshots.sync(dataset_id, client=client, **kwargs)
if isinstance(result, PaginatedSnapshotEnvelopeV1):
Collaborator

hmm, I can't find it. Where is _find_existing exception handler called in this function?

