DOM-75238 List files in project dataset, start domino job with the right API args #27
ddl-ryan-connor wants to merge 8 commits into main
Conversation
- add file path as required arg to training script
- disallow in-process execution of training script for now, since that should be explicitly added with its own ticket if it is a required feature
@@ -1,6 +1,7 @@
"""Custom compatibility dataset routes."""
i put all this in the PR description, but i'll also call it out here in this comment.
this PR makes changes that:
- allow listing datasets in the project when you click from sidebar mount point
- start the job in the project with the dataset file path passed as it will be available inside the domino job
further work needed to:
- remove the domino job's dependency on the sqlite database that lives in the app's dataset. i'm really uncomfortable with this design; it should wait until we have a "deploy with database" option for apps, so we are not using the extremely hacky "put a file-based db into the domino-shared NFS volume that happens to be mounted everywhere"
- deal with data "uploaded" by the user. we can either 1) remove this functionality for the first release, or 2) enforce that the file is created via API in a new dataset in the target project (so the user must select a target project; it cannot be the app's project)
- deal with "preview" for dataset files -- the /svcdatasetpreviewapi endpoint returns 404 for dataset files
working with subir on the last 3 points
    "Accept": "application/json",
}
if self.settings.effective_api_key:
    headers["X-Domino-Api-Key"] = self.settings.effective_api_key
we should not be using api key anywhere
return response.json() if response.text else {}

async def list_project_datasets(self, project_id: str) -> list[DatasetResponse]:
this is mostly claude-coded, i have not cleaned it up (code or comments). but it "works"
class DominoJobLauncher:
    """Wrapper around Domino V4 Jobs REST API."""

    _RUNNER_BASE = "/home/ubuntu/AutoML_Extension/automl-service/app/workers"
this is where we installed the stuff in the compute environment (dockerfile)
async def start_training_job(
    self,
    job_id: str,
    file_path: str,
file path required. the file at that path should be openable by a Domino Job in the job's project. jobs should not rely on a "database" within the job until we have a "deploy app with database" option -- which we don't have yet
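A minimal sketch of that constraint (the request shape and names here are assumptions, not the PR's actual code): reject any dataset-backed job that doesn't carry an explicit file path, so the runner never has to guess a path inside the job container:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobCreateRequest:
    data_source: str            # e.g. "upload" or "domino_dataset"
    file_path: Optional[str] = None


def validate_job_create_request(req: JobCreateRequest) -> None:
    # file_path is required whenever the job reads data from a file,
    # including the domino_dataset source (previously only "upload").
    if req.data_source in ("upload", "domino_dataset") and not req.file_path:
        raise ValueError(
            f"file_path is required when data_source={req.data_source!r}"
        )
```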
"""List available datasets in API response shape.

When *project_id* is provided, datasets are fetched from the Domino API for
that project. Otherwise the legacy local-mount scan is used as a fallback.
i don't really know whether to leave this local-mount scan. maybe we can leave it, but we should decide what to do here in the follow-up work about making jobs work with user-uploaded data.
if header_project_id:
    from app.services.project_resolver import resolve_project

project_id = request.query_params.get("projectId") if request else None
project context should only come from query params. we're only allowing domino jobs to execute in other projects, not the app's project, so there's no reason to get it anywhere except query param.
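That rule can be sketched roughly like this (names are assumptions; per the later commit message the real function is `get_project_context`): read only the `X-Project-Id` header and `projectId` query param, with no fallback to the `DOMINO_PROJECT_ID` env var, since that env var is the App's own project rather than the job's target project:

```python
from typing import Mapping, Optional


def get_project_context(
    headers: Mapping[str, str], query_params: Mapping[str, str]
) -> Optional[str]:
    """Resolve the target project id from the request only.

    Deliberately no DOMINO_PROJECT_ID env-var fallback: that is the App's
    own project, never the project a launched job should run in.
    """
    return headers.get("X-Project-Id") or query_params.get("projectId")
```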
if job.data_source == "domino_dataset" and job.dataset_id:
    from app.services.dataset_service import get_dataset_manager

    dataset_manager = get_dataset_manager()
    dataset_path = await dataset_manager.get_dataset_path(job.dataset_id)
this resolves the path that the dataset file is mounted at when it is mounted to a domino job.
it is passed as an argument to the "start job" run command, and the target script must use that arg.
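A sketch of that flow under stated assumptions (`build_training_command` and the runner script name are hypothetical, not the PR's code): join the dataset's job-side mount path with the file's relative path and hand the result to the runner as a plain CLI arg, so the worker never re-resolves mounts at runtime:

```python
from pathlib import PurePosixPath

# Matches the _RUNNER_BASE constant shown in the diff above.
RUNNER_BASE = "/home/ubuntu/AutoML_Extension/automl-service/app/workers"


def build_training_command(
    job_id: str, dataset_mount: str, file_path: str
) -> list[str]:
    """Build the Domino job run command for the training runner.

    The full dataset file path is resolved at job-creation time and passed
    as --file-path; the target script must read and use this arg.
    """
    full_path = str(PurePosixPath(dataset_mount) / file_path)
    return [
        "python",
        f"{RUNNER_BASE}/training_runner.py",  # hypothetical script name
        "--job-id", job_id,
        "--file-path", full_path,
    ]
```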
…move env var project fallback

Changes from PR #27 review incorporated:

1. resolve_execution_target() now rejects local (in-process) training with a 400 — it's not supported in the Domino App context.
2. validate_job_create_request() now requires file_path when data_source is 'domino_dataset' (previously only required for 'upload').
3. get_project_context() no longer falls back to the DOMINO_PROJECT_ID env var — that's the App's own project, not the target project. It reads the X-Project-Id header and projectId query param only.
4. Remove the environment_id param from start_training_job and start_eda_job callers (the environment is now read from env vars in the job launcher constructor).
5. Dockerfile: fix the git clone URL from niole's fork to the dominodatalab org.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
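The resolve_execution_target() change could look roughly like this (a sketch; the `HTTPException` shim below stands in for the web framework's own exception class, e.g. FastAPI's, so the example is self-contained):

```python
class HTTPException(Exception):
    """Stand-in for the web framework's HTTP error exception."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def resolve_execution_target(requested: str) -> str:
    # In-process ("local") training is rejected outright with a 400: it is
    # not supported in the Domino App context, and should only be re-added
    # under its own ticket if it turns out to be a required feature.
    if requested == "local":
        raise HTTPException(
            400, "In-process training is not supported in the Domino App context"
        )
    return requested
```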
Adopt PR #27's approach: resolve the full dataset file path at job creation time in job_service.py and pass it as a --file-path CLI arg to the training runner. The worker uses the path directly instead of instantiating DominoDatasetManager at runtime.

For the domino_dataset source, this constructs {datasetPath}/{file_path} using the dataset's mount location from the Domino API. Keeps _resolve_training_data_path() as a fallback for legacy/local jobs that don't receive --file-path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
const projectId = getProjectIdFromUrl()
if (projectId) {
  this.defaultHeaders['X-Project-Id'] = projectId
  this.defaultParams['projectId'] = projectId
This may break the app. A lot of functionality depends on this headers code. It would probably be best to fix this design issue in its own PR. I ran it locally in dev mode and it seems like job listing is broken.
Object.entries(mergedParams).forEach(([key, value]) => {
  if (value !== undefined) {
    searchParams.append(key, String(value))
  }
Since this changes where the datasets come from, can you change the empty-state data viewer UI message too, the one that says "Mount a dataset to this app to get started"? It should probably say "No datasets found in project". That would clue the user in that they need to mount datasets in their own project to get started.
Summary
Ticket: https://dominodatalab.atlassian.net/browse/DOM-75238
Fixes listing datasets and the files in those datasets, and forms the right job API calls, referencing the dataset file as it is mounted in a Domino Job and using the AutoML compute env.
Relies on the domino monorepo PR that puts the environment id / environment revision id in the run container as env vars (needed by the app for the API call): https://github.com/cerebrotech/domino/pull/47259
Notes:
Testing
I have a video of this working deployed. I'll post it in slack.
Further Work
The /svcdatasetpreviewapi endpoint returns 404 for dataset files.