
feat: dataset listing API + storage resolver + cross-project file access#17

Closed
ddl-subir-m wants to merge 16 commits into main from
subir/pr5-dataset-api-storage

Conversation

@ddl-subir-m
Collaborator

Why

The AutoML Extension currently uses the domino-py SDK for dataset operations and relies on a single shared dataset mounted locally. This has several problems:

  1. The SDK is a heavy dependency that pulls in unnecessary packages and complicates the Docker build
  2. A shared dataset creates contention between concurrent users and projects
  3. Cross-project file access is impossible — if training data lives in another project, the App can't reach it

This PR replaces SDK dataset calls with direct REST API calls and adds a storage resolver that auto-creates a per-project automl-extension dataset. Files are uploaded via Domino's v4 chunked API (in-memory, no temp files) and downloaded via streaming to avoid memory pressure on large files.

ensure_local_file() is the key cross-project enabler: when a file path doesn't exist locally (because it's in a different project's dataset), it downloads the file from the Domino API into a local cache.
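A minimal sketch of that flow (the cache root and the `download_file` callback are illustrative stand-ins, not the actual implementation):

```python
import os

DATASET_CACHE_DIR = "/tmp/dataset_cache"  # illustrative cache root

def ensure_local_file(path, download_file=None):
    """Return a local path for `path`, downloading it into the cache
    when it is not present on a local mount (cross-project case)."""
    if os.path.exists(path):
        return path  # already mounted locally
    cached = os.path.join(DATASET_CACHE_DIR, path.lstrip("/"))
    if not os.path.exists(cached):
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        download_file(path, cached)  # fetch via the Domino dataset API
    return cached
```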

Depends on

Summary

  • domino_dataset_api.py — Dataset RW v2/v1 listing API with pagination and version fallback
  • storage_resolver.py (~1000 lines) — ensure_dataset_exists(), upload_file() (chunked v4), download_file(), list_snapshot_files(), delete_snapshot_files(), get_rw_snapshot_id(), resolve_project_paths(), mount probing
  • utils.py — ensure_local_file(), extract_dataset_relative_path(), cleanup_dataset_cache()

Files changed

  • app/services/domino_dataset_api.py — new: Domino dataset listing API
  • app/services/storage_resolver.py — new: per-project dataset management
  • app/core/utils.py — modified: cross-project file utilities
  • tests/test_domino_dataset_api.py — dataset API tests
  • tests/test_storage_resolver.py — storage resolver tests
  • tests/test_ensure_local_file.py — ensure_local_file tests

Test plan

  • test_domino_dataset_api.py passes
  • test_storage_resolver.py passes
  • test_ensure_local_file.py passes
  • Dataset auto-creation works when automl-extension dataset doesn't exist
  • Chunked upload completes successfully for files > 1MB
  • ensure_local_file downloads and caches cross-project files
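The chunked-upload check above can be illustrated with a simple splitter; the 8 MB default mirrors the chunk size mentioned later in this thread, and the function is a sketch, not the actual v4 client:

```python
def iter_chunks(data: bytes, chunk_size: int = 8 * 1024 * 1024):
    """Yield (chunk_index, chunk_bytes) pairs for a chunked upload."""
    for start in range(0, len(data), chunk_size):
        yield start // chunk_size, data[start:start + chunk_size]
```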

ddl-subir-m and others added 4 commits March 20, 2026 09:44
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)
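The LRU-cached preview idea can be sketched with the stdlib (csv stands in for the pandas readers; the function names are illustrative):

```python
import csv
import os
from functools import lru_cache

@lru_cache(maxsize=64)
def _preview_cached(path: str, mtime: float, nrows: int):
    """Read the header plus the first nrows data rows of a CSV once;
    mtime is part of the cache key so file edits invalidate the entry."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return tuple(tuple(row) for row, _ in zip(reader, range(nrows + 1)))

def preview_csv(path: str, nrows: int = 5):
    return _preview_cached(path, os.path.getmtime(path), nrows)
```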

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aming download and debug middleware

- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project,
  chunked upload via v4 API, streaming download, snapshot file listing,
  mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local
  mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils
Collaborator

@niole niole left a comment


If I understand correctly, I think there are some blocking issues here with respect to which auth method is used to send API requests

# verify that correct headers are read
# get_domino_auth_headers should NOT use the forwarded token —
# it should fall through to sidecar token (which will fail in test)
# then to API key
Collaborator


Requests to the Domino API, which are meant to retrieve data as the viewing user/save things to their project, should use the forwarded auth headers, because then authentication/authorization for that user will apply.

We should not use the proxy token for things that should be authorized for individual users, like saving data to a project. Doesn't that send the requests using the app owner's identity?

Collaborator Author


You're right that for user-scoped operations we should ideally use the viewing user's identity. The reason I stopped forwarding the browser token is that the sidecar returns a 500 when it receives it; it only accepts its own internally managed token. I hit this during testing and documented it in main.py (PR #20).

That said, your point stands: for calls that go directly to nucleus-frontend, bypassing the sidecar, we could forward the user's browser token. The capture_auth_header middleware already stores it per-request. We'd just need to use it for direct-to-nucleus calls instead of defaulting to the API key.

Would that address your concern? I can update domino_request and domino_download to forward the captured user token when calling nucleus-frontend directly.
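The preference order proposed above could look roughly like this (the header names match the thread; the function and its parameters are hypothetical):

```python
def build_auth_headers(captured_user_token=None, api_key=None):
    """Prefer the user's forwarded token for direct-to-nucleus calls;
    fall back to the API key only when no user token was captured."""
    if captured_user_token:
        return {"Authorization": f"Bearer {captured_user_token}"}
    if api_key:
        return {"X-Domino-Api-Key": api_key}
    return {}
```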

Collaborator Author


Updated get_domino_auth_headers() to preserve the user's forwarded token for all outbound Domino API calls. Now the sidecar token is only used as a fallback when no user token is present. This change is in PR #14 and cascaded through all downstream PRs.

Stop the sidecar token from overwriting the user's forwarded JWT.
When a user token is present (from the Extension-injected Authorization
header), outbound calls to datasetrw, jobs, and model registry now run
as the visiting user instead of the App owner. The sidecar token is
only used as fallback for background tasks and health checks.
Addresses review comment: use the actual header name x-domino-api-key
instead of the incorrect domino-api-key.
Collaborator

@niole niole left a comment


Some functionality in this PR isn't used in this PR and should be moved to the right PR or deleted so your changes can be reviewed properly. It would help me if the PRs were split into pieces of functionality that work together: for example, the implementation for viewing an already uploaded dataset; a separate one for uploading; another for the parquet/special file handling by itself; and another PR for implementing caching.

Comment on lines +244 to +252

if use_api_key:
    api_key = _get_api_key()
    if api_key:
        auth_headers = {"X-Domino-Api-Key": api_key}
    else:
        auth_headers = await get_domino_auth_headers()
else:
    auth_headers = await get_domino_auth_headers()
Collaborator


I think you mentioned that tokens also work, so maybe we should refactor this to use the get_domino_auth_headers helper

Suggested change
- if use_api_key:
-     api_key = _get_api_key()
-     if api_key:
-         auth_headers = {"X-Domino-Api-Key": api_key}
-     else:
-         auth_headers = await get_domino_auth_headers()
- else:
-     auth_headers = await get_domino_auth_headers()
+ auth_headers = await get_domino_auth_headers()

else:
    auth_headers = await get_domino_auth_headers()

os.makedirs(os.path.dirname(dest_path), exist_ok=True)
Collaborator


We should think about what will happen if this file is too big and crashes the pod. If it makes sense, we should write these files to temporary paths so that if the pod restarts, the file will be deleted and the pod can keep running without intervention from the field

Also, do we cleanup these files after we are done using them for whatever they are used for? (haven't checked yet)
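Writing downloads to the system temp dir, as suggested, can be sketched with the stdlib (illustrative; the real download streams chunks over HTTP):

```python
import os
import tempfile

def download_to_temp(chunks, suffix=""):
    """Stream chunks into a file under the system temp dir, so a pod
    restart clears leftovers without intervention from the field."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
    return path
```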

Collaborator Author


Downloads stream in 8KB chunks and uploads send 8MB chunks (storage_resolver.py), so neither loads the full file into memory; large files won't OOM the pod.

The upload route does hold the uploaded file in memory (await file.read()) for metadata extraction and chunked upload, but it's now capped at 550 MB (configurable via AUTOML_MAX_UPLOAD_SIZE_MB) and returns HTTP 413 if exceeded. This matches the limit shown in the UI.

For the local cache (dataset_cache/): these go under temp_path which is ephemeral storage, so a pod restart clears them. For long-running pods, cleanup_dataset_cache() runs at startup and deletes files older than 24 hours (configurable via AUTOML_DATASET_CACHE_TTL_HOURS). Both in PR #26.
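A sketch of the TTL cleanup described above (the function and directory names are illustrative, not the exact implementation):

```python
import os
import time

def cleanup_dataset_cache(cache_dir, ttl_hours=24):
    """Delete cached files whose mtime is older than ttl_hours;
    returns the paths removed. Intended to run at app startup."""
    cutoff = time.time() - ttl_hours * 3600
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            if os.path.getmtime(full) < cutoff:
                os.remove(full)
                removed.append(full)
    return removed
```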

read_tabular_schema,
read_upload_metadata,
)

Collaborator


Where is this file used? Can you move it to the PR where it is used so we don't merge anything that's unused

    "Failed to resolve dataset '%s' in project %s",
    dataset_name, project_id, exc_info=True,
)
return None
Collaborator


This file also seems to contain functions that are not used. Can you move them to the PR where they are used?

while True:
    resp = await domino_request(
        "GET",
        "/api/datasetrw/v2/datasets",
Collaborator


Can you rebase with the PR that merges the public API and use those helpers instead? This will help us maintain this functionality going forward and eliminate the need to manually parse the response, which will help us avoid any bugs there

}


class DebugLoggingMiddleware(BaseHTTPMiddleware):
Collaborator


Is this middleware used?

Remove use_api_key parameter and _get_api_key() helper. All downloads
now use the standard auth chain which preserves the user's forwarded
token and falls back to sidecar when needed.
- Use generated API client for dataset listing (domino_dataset_api.py)
- Remove use_api_key from domino_download callers
- Remove utils.py additions (ensure_local_file, cleanup_dataset_cache,
  extract_dataset_relative_path) — will be added in PR #16/#22 where used
- Remove test_ensure_local_file.py (moves with the functions)
ddl-subir-m added a commit that referenced this pull request Mar 23, 2026
- Simplify domino_download to use get_domino_auth_headers, remove
  use_api_key param and _get_api_key helper
- Rewrite domino_dataset_api.py to use generated API client
- Remove duplicate endpoint entries in storage_resolver download_file
- Update storage resolver test for proxy→nucleus fallback
ddl-subir-m and others added 5 commits March 23, 2026 08:45
Switch project_resolver from raw domino_request(/v4/projects) to the
generated get_project_by_id endpoint (/api/projects/v1/projects).
Returns typed ProjectEnvelopeV1 instead of parsing raw JSON dicts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 4 raw domino_request calls to /api/datasetrw/v1/datasets/{id}/snapshots
with a shared _list_snapshots_typed() helper that uses the generated
get_dataset_snapshots endpoint. Returns typed SnapshotDetailsV1 objects
instead of parsing raw JSON with manual envelope unwrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests now mock _list_snapshots_typed instead of raw domino_request,
matching the refactor to use the generated API client for snapshot
listing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ddl-subir-m
Collaborator Author

Re-split for reviewability

Per Niole's feedback, this PR has been re-split so each PR contains functionality alongside its consumers.

Closing this PR in favor of the new stack: #24 → #25 → #26.
