
[DOM-75542] feat: dataset viewing — storage resolver, listing, detail, download #25

Open
ddl-subir-m wants to merge 13 commits into main from subir/pr16a-dataset-view

Conversation

Collaborator

ddl-subir-m commented Mar 23, 2026

Why

Provides the complete pipeline for viewing datasets through the API. Users can list datasets scoped to their project, get dataset details, browse files via snapshot, and download individual files.

Summary

  • storage_resolver.py (viewing methods only): snapshot management (get_rw_snapshot_id, get_latest_snapshot_status, list_snapshot_files), dataset discovery (_find_existing), mount probing, caching
  • dataset_manager.py: API-first listing with local filesystem fallback, cross-project support, snapshot-based file listing, dataset detail with incremental cache merging
  • routes/datasets.py: GET / (list), GET /{id} (detail), GET /verify-snapshot, GET /{id}/files/{name}/download
  • dataset_service.py: listing orchestration, mount-path filtering

Stacks on PR #24 (listing API) + PR #13 (leaderboard utils / request project ID).

Replaces the storage_resolver portion of old PR #17 + the viewing portion of old PR #16. Storage resolver is trimmed to only the methods consumed by this PR. Upload methods are added in PR #26, download/provisioning in PR #22.

File → consumer mapping

All code in this PR is consumed within the PR:

File → Consumed by
storage_resolver.py → dataset_manager.py (snapshot listing, RW ID), routes/datasets.py (verify-snapshot)
dataset_manager.py → dataset_service.py, routes/datasets.py (via get_dataset_manager dependency)
dataset_service.py → routes/datasets.py (list, get_or_404)
schemas/dataset.py → routes/datasets.py, dataset_service.py (response models)

Test plan

  • test_storage_resolver.py passes
  • test_storage_resolver_extras.py passes
  • test_dataset_manager.py passes
  • test_api_datasets.py passes


@app.get("/svcverifysnapshot")
async def svc_verify_snapshot(dataset_id: str, file_path: str):
from app.api.routes.datasets import verify_snapshot as _verify_snapshot
Collaborator

Why import things inside the function definition? Is there a circular dependency? If not, I think it's more helpful to import at the top of the file so any import errors surface as soon as the module is loaded.

Collaborator Author

Yes, circular dependency: these routes import from app.api.routes.datasets, which imports from app.services.dataset_service, which registers back through the compat routes. Added # avoid circular import comments.
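A self-contained toy of the deferred-import pattern being discussed. The module names mod_a and mod_b are invented for illustration; the real cycle is between app.api.routes.datasets, app.services.dataset_service, and the compat routes.

```python
import sys
import types

def make_module(name: str, source: str) -> types.ModuleType:
    """Register and execute a tiny in-memory module (test scaffolding only)."""
    mod = types.ModuleType(name)
    sys.modules[name] = mod
    exec(source, mod.__dict__)
    return mod

# mod_b needs a symbol from mod_a, but importing mod_a at module load would
# re-enter a half-initialized module, so the import is deferred to call time.
make_module("mod_b", """
def shout(text):
    # Deferred import: breaks the load-time cycle mod_a -> mod_b -> mod_a.
    from mod_a import greet  # avoid circular import
    return greet(text).upper()
""")

# mod_a can import mod_b at the top level because mod_b no longer needs
# mod_a during its own import.
make_module("mod_a", """
import mod_b

def greet(text):
    return f"hello {text}"
""")
```

Calling `sys.modules["mod_b"].shout("world")` resolves `greet` at call time, by which point mod_a is fully loaded. The cost, as the reviewer notes, is that a broken import is only discovered when that line runs.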

except FileNotFoundError:
raise HTTPException(status_code=404, detail=f"File '{file_name}' not found in dataset {dataset_id}")

if not os.path.isfile(file_path):
Collaborator

get_dataset_file_path should not succeed if the path is not a file. Did you need to add this if statement because there is actually a bug with get_dataset_file_path?

Collaborator Author

Not a bug in get_dataset_file_path; it can return synthetic paths for cross-project datasets that represent where the file will be mounted in a Job, not where it exists in this App. The download endpoint needs the file to actually exist on disk, so this guard returns a clear 404 instead of a FileResponse crash. Added a comment explaining this.
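A minimal sketch of the guard described above. HTTPException is stubbed here so the snippet runs standalone; in the PR it is FastAPI's exception, and the path comes from get_dataset_file_path.

```python
import os

class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException (sketch only)."""
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def ensure_local_file(file_path: str, file_name: str, dataset_id: str) -> str:
    # get_dataset_file_path may return a *synthetic* path for cross-project
    # datasets: where the file will be mounted in a Job, not where it exists
    # in this App. Downloads stream from local disk, so the file must be real.
    if not os.path.isfile(file_path):
        raise HTTPException(
            status_code=404,
            detail=f"File '{file_name}' not found in dataset {dataset_id}",
        )
    return file_path
```

The helper name ensure_local_file is hypothetical; the PR inlines the check in the download route.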

Collaborator

This sounds like there is a bug with the get_dataset_file_path functionality, where a file path is returned even though no file exists. I would move the file check into the get_dataset_file_path function.

)
# If the v2 listing didn't include the RW snapshot ID,
# resolve it eagerly via the v1 snapshots endpoint.
if not rw_sid and ds_id:
Collaborator

Why would it not have the right snapshot ID?

Collaborator Author

The v2 listing response doesn't always populate readWriteSnapshotId; we've observed it missing in some Domino versions and for recently created datasets. The fallback resolves it via the v1 snapshots endpoint, which reliably returns the active RW snapshot. Without this, file listing and uploads fail for those datasets.
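The fallback can be sketched roughly as below. fetch_v1_snapshots is a hypothetical stand-in for the v1 snapshots call (the PR goes through the generated client), and the isReadWrite field name is an assumption.

```python
from typing import Callable, Optional

def resolve_rw_snapshot_id(
    dataset: dict,
    fetch_v1_snapshots: Callable[[str], list],
) -> Optional[str]:
    # The v2 listing sometimes omits readWriteSnapshotId, so trust it
    # only when present.
    rw_sid = dataset.get("readWriteSnapshotId")
    if rw_sid:
        return rw_sid
    # Fall back to the v1 snapshots endpoint, which reliably reports
    # the active read-write snapshot for the dataset.
    for snap in fetch_v1_snapshots(dataset["id"]):
        if snap.get("isReadWrite"):
            return snap["id"]
    return None
```

Resolving eagerly at listing time means downstream file listing and upload paths never see a dataset without an RW snapshot ID.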

Collaborator

I will mention this to the data team.

val = os.environ.get(env_key)
if not val:
continue
for part in val.replace(",", ":").replace(";", ":").split(":"):
Collaborator

What is this logic for?

Collaborator Author

download_file fetches both the latest snapshot ID and the RW snapshot ID because the RW snapshot is the mutable head with the most recent uploads. The latest snapshot could be an older read-only snapshot. We use the RW snapshot's v4 raw endpoint, which is the most reliable for cross-project downloads.
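The quoted snippet itself is normalizing delimiter-separated mount roots from env vars. Read as a standalone helper it looks like this; the env var names come from the snippet, while the strip and de-duplication are illustrative additions.

```python
import os
from typing import Iterable, List

def env_mount_roots(
    env_keys: Iterable[str] = ("DOMINO_DATASET_MOUNT_PATH", "DOMINO_MOUNT_PATHS"),
) -> List[str]:
    """Collect candidate mount roots from env vars whose values may be
    separated by ':', ',' or ';'."""
    roots: List[str] = []
    for env_key in env_keys:
        val = os.environ.get(env_key)
        if not val:
            continue
        # Normalize every supported separator to ':' before splitting,
        # so one pass handles all delimiter conventions.
        for part in val.replace(",", ":").replace(";", ":").split(":"):
            part = part.strip()
            if part and part not in roots:
                roots.append(part)
    return roots
```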

if os.path.isdir(path) and os.access(path, os.W_OK):
return path
# Also check env-provided mount roots
for env_key in ("DOMINO_DATASET_MOUNT_PATH", "DOMINO_MOUNT_PATHS"):
Collaborator

TBH, we should probably prefer these, since they are domino provided

Collaborator Author

Already the case. The endpoints list is built with rw_snapshot_id (domino-provided) first, so it's always tried before any fallback.

Collaborator

niole left a comment

Left a question about the ProjectStorageResolver. I saw some errors getting swallowed, and some code that falls back to older versions of the datasets APIs, which I left questions about.

I don't get why we keep importing things inside the functions where they are used. Python resolves imports at runtime, so if there is any issue with dependency resolution we won't know until that line of code is executed.



@dataclass
class ProjectStorageResolver:
Collaborator

What is your intention for this class? What would change if we always hit the domino apis to get this information?

Collaborator Author

ProjectPaths is a convenience struct so callers (job launcher, training worker) get pre-computed subdirectory paths (models/, uploads/, eda_results/) without each re-deriving them from the mount root. If we always hit the Domino APIs, the mount path and subdirectory layout would be the same; this class just avoids repeating os.path.join(mount, "models") in every caller. It also handles standalone mode, where paths come from settings instead of Domino.
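A sketch of what that struct might look like. The field names follow the subdirectories mentioned above; the standalone-mode handling is elided, so this is illustrative, not the PR's actual class.

```python
import os
from dataclasses import dataclass, field

@dataclass
class ProjectPaths:
    """Pre-computed per-project paths derived once from the mount root."""
    mount_root: str
    models: str = field(init=False)
    uploads: str = field(init=False)
    eda_results: str = field(init=False)

    def __post_init__(self) -> None:
        # Derive each subdirectory once, so callers (job launcher,
        # training worker) never repeat os.path.join(mount, "models").
        self.models = os.path.join(self.mount_root, "models")
        self.uploads = os.path.join(self.mount_root, "uploads")
        self.eda_results = os.path.join(self.mount_root, "eda_results")
```

The reviewer's follow-up question (could this be plain helper functions?) would amount to replacing each field with a one-line function taking mount_root; the dataclass mainly buys a single object to pass around.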

Collaborator

Could we replace this class with some helper functions?

# Always list files from the snapshot API — mounted filesystems use
# read-only snapshots that go stale when files are deleted in Domino.
if not rw_snapshot_id and dataset_id:
from app.services.storage_resolver import get_storage_resolver
Collaborator

Please import at top of file

Collaborator Author

Circular import. dataset_manager.py is in app.core and storage_resolver.py is in app.services which imports from app.core. Moving it top-level would create an import cycle at module load time.

ddl-subir-m added a commit that referenced this pull request Mar 23, 2026
- Remove unnecessary _resolve_project_id wrapper in dataset routes
- Move `import asyncio` to top-level in datasets.py
- Change HTTP 502 to 500 for internal errors (3 locations)
- Downgrade dataset detail fetch log from error to debug
- Remove dead _extract_dataset_list wrapper from storage_resolver
- Add circular import comments in compat routes
- Add comment explaining isfile guard on cross-project paths
- Update tests to import extract_dataset_list from domino_dataset_api

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Forwarded user token from the incoming request (per-request context)
2. Ephemeral token from the Domino App/Run sidecar (localhost:8899)

should never use 2 or 3

ddl-subir-m added a commit that referenced this pull request Mar 24, 2026
…oad endpoint

Merge conflict between PR #25 (removed download) and PR #26 (added
upload/preview/schema). Resolution: keep all PR #26 endpoints, drop the
download_dataset_file endpoint which has no frontend caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m and others added 10 commits March 23, 2026 23:57
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
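The tabular_data idea above (one cached preview helper instead of scattered read calls) can be sketched as follows. The real module uses pandas for CSV/parquet; this sketch previews CSV with the stdlib so it stays dependency-free, and the function name is assumed.

```python
import csv
from functools import lru_cache
from typing import Tuple

@lru_cache(maxsize=32)
def preview_csv(path: str, n_rows: int = 5) -> Tuple[tuple, tuple]:
    """Return (header, first n_rows rows) of a CSV file, cached per
    (path, n_rows) so repeated previews of the same file are free.

    Caveat: an LRU keyed on the path never notices file *changes*; a real
    module would need invalidation if files mutate in place."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    # Tuples, not lists, so the cached value is immutable and hashable.
    return tuple(header), tuple(map(tuple, rows))
```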
…leware

- domino_dataset_api.py: typed listing wrapper (v2→v1 fallback) using generated client
- domino_http.py: add params/files/headers/base_url to domino_request, streaming download
- middleware.py: debug request/response logging with header redaction
- project_resolver.py: use generated get_project_by_id endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction
  from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
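A sketch of the precedence resolve_request_project_id centralizes. A later commit in this stack drops the DOMINO_PROJECT_ID env fallback (it names the App's own project, not the user's target), so this sketch checks only the explicit per-request sources; the query parameter name project_id is an assumption.

```python
from typing import Mapping, Optional

def resolve_request_project_id(
    headers: Mapping[str, str],
    query_params: Mapping[str, str],
) -> Optional[str]:
    """Resolve the target project for a request.

    Precedence: explicit X-Project-Id header, then query parameter.
    No silent env fallback: that would operate on the wrong project
    in cross-project scenarios."""
    return headers.get("X-Project-Id") or query_params.get("project_id") or None
```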
Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing,
  snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback,
  cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot,
  and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user
is working in. Falling back to it silently operates on the wrong project
(root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires
or invalidates. Stale data is worse than a redundant API call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable
on any target deployment. Ryan's PR uses v2 only and it works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses
/svcdatasetpreview which returns JSON rows, not a raw file download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the
schema minimal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m added a commit that referenced this pull request Mar 24, 2026
…l Domino API calls

Sidecar token and static API key fallbacks silently escalated to the
App-owner identity when no user JWT was forwarded, bypassing Domino's
per-user RBAC. Now all outbound Domino API calls require the visiting
user's forwarded token and raise MissingUserTokenError if absent.

Background job syncs already capture the user token before detaching,
so they are unaffected. Job runner workers run in separate containers
with their own env and don't import these functions.

Addresses Ryan's review comment on PR #25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
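The hard requirement described in this commit could look roughly like the helper below. The exception name MissingUserTokenError comes from the commit message; the helper's shape and the Bearer-header parsing are assumptions.

```python
from typing import Mapping

class MissingUserTokenError(RuntimeError):
    """Raised when no forwarded user token is available for a Domino API call."""

def require_user_token(headers: Mapping[str, str]) -> str:
    # Deliberately no sidecar-token or static API-key fallback: those
    # silently escalate to the App-owner identity and bypass Domino's
    # per-user RBAC, which is what this change closes off.
    token = headers.get("Authorization", "").removeprefix("Bearer ").strip()
    if not token:
        raise MissingUserTokenError("No forwarded user token on this request")
    return token
```

Background job syncs stay compatible because they capture the token up front and pass it through, rather than relying on any process-level fallback.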
ddl-subir-m force-pushed the subir/pr16a-dataset-view branch from 199624f to d26d314 (March 24, 2026 05:02)



if limit is not None:
kwargs["limit"] = limit
result = get_dataset_snapshots.sync(dataset_id, client=client, **kwargs)
if isinstance(result, PaginatedSnapshotEnvelopeV1):
Collaborator

hmm, I can't find it. Where is _find_existing exception handler called in this function?

