
feat: dataset listing API + storage resolver + cross-project file access#17

Closed
ddl-subir-m wants to merge 16 commits into main from
subir/pr5-dataset-api-storage

Conversation

@ddl-subir-m
Collaborator

Why

The AutoML Extension currently uses the domino-py SDK for dataset operations and relies on a single shared dataset mounted locally. This has several problems:

  1. The SDK is a heavy dependency that pulls in unnecessary packages and complicates the Docker build
  2. A shared dataset creates contention between concurrent users and projects
  3. Cross-project file access is impossible — if training data lives in another project, the App can't reach it

This PR replaces SDK dataset calls with direct REST API calls and adds a storage resolver that auto-creates a per-project automl-extension dataset. Files are uploaded via Domino's v4 chunked API (in-memory, no temp files) and downloaded via streaming to avoid memory pressure on large files.

ensure_local_file() is the key cross-project enabler: when a file path doesn't exist locally (because it's in a different project's dataset), it downloads the file from the Domino API into a local cache.
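A minimal sketch of that flow (the cache root and the `download_file` callback are illustrative stand-ins, not the actual implementation):

```python
import os

DATASET_CACHE_DIR = "/tmp/dataset_cache"  # illustrative cache root

def ensure_local_file(path, download_file=None):
    """Return a local path for `path`, downloading it into the cache
    when it is not present on a local mount (cross-project case)."""
    if os.path.exists(path):
        return path  # already mounted locally
    cached = os.path.join(DATASET_CACHE_DIR, path.lstrip("/"))
    if not os.path.exists(cached):
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        download_file(path, cached)  # fetch via the Domino dataset API
    return cached
```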

Depends on

Summary

  • domino_dataset_api.py — Dataset RW v2/v1 listing API with pagination and version fallback
  • storage_resolver.py (~1000 lines) — ensure_dataset_exists(), upload_file() (chunked v4), download_file(), list_snapshot_files(), delete_snapshot_files(), get_rw_snapshot_id(), resolve_project_paths(), mount probing
  • utils.py — ensure_local_file(), extract_dataset_relative_path(), cleanup_dataset_cache()

Files changed

  • app/services/domino_dataset_api.py — new: Domino dataset listing API
  • app/services/storage_resolver.py — new: per-project dataset management
  • app/core/utils.py — modified: cross-project file utilities
  • tests/test_domino_dataset_api.py — dataset API tests
  • tests/test_storage_resolver.py — storage resolver tests
  • tests/test_ensure_local_file.py — ensure_local_file tests

Test plan

  • test_domino_dataset_api.py passes
  • test_storage_resolver.py passes
  • test_ensure_local_file.py passes
  • Dataset auto-creation works when automl-extension dataset doesn't exist
  • Chunked upload completes successfully for files > 1MB
  • ensure_local_file downloads and caches cross-project files
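The chunked-upload check above can be illustrated with a simple splitter; the 8 MB default mirrors the chunk size mentioned later in this thread, and the function is a sketch, not the actual v4 client:

```python
def iter_chunks(data: bytes, chunk_size: int = 8 * 1024 * 1024):
    """Yield (chunk_index, chunk_bytes) pairs for a chunked upload."""
    for start in range(0, len(data), chunk_size):
        yield start // chunk_size, data[start:start + chunk_size]
```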

ddl-subir-m and others added 4 commits March 20, 2026 09:44
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)
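The LRU-cached preview idea can be sketched with the stdlib (csv stands in for the pandas readers; the function names are illustrative):

```python
import csv
import os
from functools import lru_cache

@lru_cache(maxsize=64)
def _preview_cached(path: str, mtime: float, nrows: int):
    """Read the header plus the first nrows data rows of a CSV once;
    mtime is part of the cache key so file edits invalidate the entry."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return tuple(tuple(row) for row, _ in zip(reader, range(nrows + 1)))

def preview_csv(path: str, nrows: int = 5):
    return _preview_cached(path, os.path.getmtime(path), nrows)
```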

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aming download and debug middleware

- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project,
  chunked upload via v4 API, streaming download, snapshot file listing,
  mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local
  mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils
Collaborator

@niole niole left a comment


If I understand correctly, I think there are some blocking issues here with respect to which auth method is used to send API requests

# verify that correct headers are read
# get_domino_auth_headers should NOT use the forwarded token —
# it should fall through to sidecar token (which will fail in test)
# then to API key
Collaborator


Requests to the Domino API, which are meant to retrieve data as the viewing user/save things to their project, should use the forwarded auth headers, because then authentication/authorization for that user will apply.

We should not use the proxy token for things that should be authorized for individual users, like saving data to a project. Doesn't that send the requests using the app owner's identity?

Collaborator Author


You're right that for user-scoped operations we should ideally use the viewing user's identity. The reason I stopped forwarding the browser token is that the sidecar returns a 500 when it receives it; it only accepts its own internally managed token. I hit this during testing and documented it in main.py (PR #20).

That said, your point stands: for calls that go directly to nucleus-frontend, bypassing the sidecar, we could forward the user's browser token. The capture_auth_header middleware already stores it per-request. We'd just need to use it for direct-to-nucleus calls instead of defaulting to the API key.

Would that address your concern? I can update domino_request and domino_download to forward the captured user token when calling nucleus-frontend directly.
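The preference order proposed above could look roughly like this (the header names match the thread; the function and its parameters are hypothetical):

```python
def build_auth_headers(captured_user_token=None, api_key=None):
    """Prefer the user's forwarded token for direct-to-nucleus calls;
    fall back to the API key only when no user token was captured."""
    if captured_user_token:
        return {"Authorization": f"Bearer {captured_user_token}"}
    if api_key:
        return {"X-Domino-Api-Key": api_key}
    return {}
```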

Collaborator Author


Updated get_domino_auth_headers() to preserve the user's forwarded token for all outbound Domino API calls. Now the sidecar token is only used as a fallback when no user token is present. This change is in PR #14 and cascaded through all downstream PRs.

Stop the sidecar token from overwriting the user's forwarded JWT.
When a user token is present (from the Extension-injected Authorization
header), outbound calls to datasetrw, jobs, and model registry now run
as the visiting user instead of the App owner. The sidecar token is
only used as fallback for background tasks and health checks.
Addresses review comment: use the actual header name x-domino-api-key
instead of the incorrect domino-api-key.
Collaborator

@niole niole left a comment


Some functionality in this PR isn't used in this PR and should be moved to the right PR or deleted so your changes can be reviewed properly. It would help me if the PRs were split into pieces of functionality that work together: for example, the implementation for viewing an already uploaded dataset; a separate one for uploading; another for the parquet/special file handling by itself; and another PR for implementing caching.

Comment on lines +244 to +252

if use_api_key:
    api_key = _get_api_key()
    if api_key:
        auth_headers = {"X-Domino-Api-Key": api_key}
    else:
        auth_headers = await get_domino_auth_headers()
else:
    auth_headers = await get_domino_auth_headers()
Collaborator


I think you mentioned that tokens also work, so maybe we should refactor this to use the get_domino_auth_headers helper

Suggested change
- if use_api_key:
-     api_key = _get_api_key()
-     if api_key:
-         auth_headers = {"X-Domino-Api-Key": api_key}
-     else:
-         auth_headers = await get_domino_auth_headers()
- else:
-     auth_headers = await get_domino_auth_headers()
+ auth_headers = await get_domino_auth_headers()

else:
    auth_headers = await get_domino_auth_headers()

os.makedirs(os.path.dirname(dest_path), exist_ok=True)
Collaborator


We should think about what will happen if this file is too big and crashes the pod. If it makes sense, we should write these files to temporary paths so that if the pod restarts, the file will be deleted and the pod can keep running without intervention from the field

Also, do we cleanup these files after we are done using them for whatever they are used for? (haven't checked yet)
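Writing downloads to the system temp dir, as suggested, can be sketched with the stdlib (illustrative; the real download streams chunks over HTTP):

```python
import os
import tempfile

def download_to_temp(chunks, suffix=""):
    """Stream chunks into a file under the system temp dir, so a pod
    restart clears leftovers without intervention from the field."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
    return path
```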

Collaborator Author


Downloads stream in 8KB chunks and uploads send 8MB chunks (storage_resolver.py), so neither loads the full file into memory; large files won't OOM the pod.

The upload route does hold the uploaded file in memory (await file.read()) for metadata extraction and chunked upload, but it's now capped at 550 MB (configurable via AUTOML_MAX_UPLOAD_SIZE_MB) and returns HTTP 413 if exceeded. This matches the limit shown in the UI.

For the local cache (dataset_cache/): these go under temp_path which is ephemeral storage, so a pod restart clears them. For long-running pods, cleanup_dataset_cache() runs at startup and deletes files older than 24 hours (configurable via AUTOML_DATASET_CACHE_TTL_HOURS). Both in PR #26.
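A sketch of the TTL cleanup described above (the function and directory names are illustrative, not the exact implementation):

```python
import os
import time

def cleanup_dataset_cache(cache_dir, ttl_hours=24):
    """Delete cached files whose mtime is older than ttl_hours;
    returns the paths removed. Intended to run at app startup."""
    cutoff = time.time() - ttl_hours * 3600
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            if os.path.getmtime(full) < cutoff:
                os.remove(full)
                removed.append(full)
    return removed
```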

read_tabular_schema,
read_upload_metadata,
)

Collaborator


Where is this file used? Can you move it to the PR where it is used so we don't merge anything that's unused

    "Failed to resolve dataset '%s' in project %s",
    dataset_name, project_id, exc_info=True,
)
return None
Collaborator


This file also seems to contain functions that are not used. Can you move them to the PR where they are used?

while True:
    resp = await domino_request(
        "GET",
        "/api/datasetrw/v2/datasets",
Collaborator


Can you rebase with the PR that merges the public API and use those helpers instead? This will help us maintain this functionality going forward and eliminate the need to manually parse the response, which will help us avoid any bugs there

}


class DebugLoggingMiddleware(BaseHTTPMiddleware):
Collaborator


Is this middleware used?

Remove use_api_key parameter and _get_api_key() helper. All downloads
now use the standard auth chain which preserves the user's forwarded
token and falls back to sidecar when needed.
- Use generated API client for dataset listing (domino_dataset_api.py)
- Remove use_api_key from domino_download callers
- Remove utils.py additions (ensure_local_file, cleanup_dataset_cache,
  extract_dataset_relative_path) — will be added in PR #16/#22 where used
- Remove test_ensure_local_file.py (moves with the functions)
ddl-subir-m added a commit that referenced this pull request Mar 23, 2026
- Simplify domino_download to use get_domino_auth_headers, remove
  use_api_key param and _get_api_key helper
- Rewrite domino_dataset_api.py to use generated API client
- Remove duplicate endpoint entries in storage_resolver download_file
- Update storage resolver test for proxy→nucleus fallback
ddl-subir-m and others added 5 commits March 23, 2026 08:45
Switch project_resolver from raw domino_request(/v4/projects) to the
generated get_project_by_id endpoint (/api/projects/v1/projects).
Returns typed ProjectEnvelopeV1 instead of parsing raw JSON dicts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 4 raw domino_request calls to /api/datasetrw/v1/datasets/{id}/snapshots
with a shared _list_snapshots_typed() helper that uses the generated
get_dataset_snapshots endpoint. Returns typed SnapshotDetailsV1 objects
instead of parsing raw JSON with manual envelope unwrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests now mock _list_snapshots_typed instead of raw domino_request,
matching the refactor to use the generated API client for snapshot
listing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ddl-subir-m
Collaborator Author

Re-split for reviewability

Per Niole's feedback, this PR has been re-split so each PR contains functionality alongside its consumers.

Closing this PR in favor of the new stack: #24 → #25 → #26.
