[DOM-75523] feat: cross-project workers, pre-computed diagnostics, trainer improvements by ddl-subir-m · Pull Request #22 · dominodatalab/AutoML_Extension

ddl-subir-m · 2026-03-20T17:44:41Z

Why

Cross-project workers: When a Domino training or EDA job runs in a different project than the App, the worker can't find its database, training data, or model output paths. Workers now accept --database-url and --job-config as CLI args, resolve training data via a mount fallback chain (local mount → dataset download), and write outputs to the per-project dataset via storage_resolver.
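A minimal sketch of the worker entry point and the mount fallback chain described above. Only the two flag names come from this PR; the function names, the candidate-path list, and the `download` callback are hypothetical stand-ins, and the real `_resolve_training_data_path` may differ.

```python
import argparse
import os
from typing import Callable, Optional, Sequence


def parse_worker_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two CLI flags named in the PR description."""
    parser = argparse.ArgumentParser(description="Cross-project worker (sketch)")
    parser.add_argument("--database-url", default=None,
                        help="Database URL handed over by the App")
    parser.add_argument("--job-config", default=None,
                        help="Job configuration (assumed here to be a JSON string)")
    return parser.parse_args(argv)


def resolve_training_data_path(
    candidate_mounts: Sequence[str],
    download: Callable[[], str],
) -> str:
    """Return the first candidate path that exists locally, else download."""
    for path in candidate_mounts:
        if os.path.exists(path):
            return path
    # No local mount has the file: fall back to a dataset download.
    return download()
```

The download step is passed in as a callable so the fallback order can be exercised without touching the Domino dataset API.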

Pre-computed diagnostics: Model diagnostics (feature importance, learning curves, confusion matrix) were previously computed on demand when a user opened the diagnostics tab. For cross-project models this meant downloading the model files first. Diagnostics are now computed during training and stored in diagnostics_data; the predictions endpoint serves the stored data first and falls back to live computation only when stored data is unavailable.
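The stored-first lookup can be sketched as follows. This is an illustration under assumptions: `get_diagnostics` and its parameters are hypothetical names, and treating an empty payload as "nothing stored" is a choice made for the example, not something the PR confirms.

```python
from typing import Any, Callable, Dict, Optional


def get_diagnostics(
    stored: Optional[Dict[str, Any]],
    compute_live: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Serve pre-computed diagnostics when present, else compute them live."""
    if stored:  # None or an empty dict both count as "nothing stored"
        return stored
    # Fallback path: may require the model files to be available locally.
    return compute_live()
```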

TimeSeries validation: Training could fail midway on bad input (duplicate timestamps, prediction_length > series length, invalid covariates). Upfront validation now catches these cases before the expensive AutoGluon fit starts.
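A minimal sketch of the three checks for a single series, written in plain Python rather than against the trainer's actual data structures. The function name, its parameters, and the error messages are hypothetical; the covariate check here only verifies that each named covariate is a known column, which is one possible reading of "invalid covariates".

```python
from collections import Counter
from typing import Hashable, List, Sequence


def validate_timeseries_input(
    timestamps: Sequence[Hashable],
    prediction_length: int,
    columns: Sequence[str],
    covariates: Sequence[str],
) -> List[str]:
    """Collect every validation error instead of stopping at the first one."""
    errors: List[str] = []

    # Duplicate timestamps within one series.
    duplicates = sorted(str(t) for t, n in Counter(timestamps).items() if n > 1)
    if duplicates:
        errors.append("duplicate timestamps: " + ", ".join(duplicates))

    # The forecast horizon must not exceed the series length.
    if prediction_length > len(timestamps):
        errors.append(
            f"prediction_length ({prediction_length}) exceeds "
            f"series length ({len(timestamps)})"
        )

    # Every requested covariate must be an existing column.
    missing = [c for c in covariates if c not in columns]
    if missing:
        errors.append("unknown covariates: " + ", ".join(missing))

    return errors
```

Returning a list of messages lets a caller report all problems in one response before any fit is attempted.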

Profiling performance: CPU-bound profiling blocked the event loop. Wrapping the profiling call in asyncio.to_thread() keeps the API responsive during long-running profiles.
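The pattern can be sketched with the standard library alone. `profile_dataset` is a hypothetical blocking function (simulated with `time.sleep`), and `demo` counts how often another coroutine gets to run while the profile is in progress.

```python
import asyncio
import time


def profile_dataset(rows: int) -> dict:
    """Stand-in for a blocking profiling call (hypothetical)."""
    time.sleep(0.05)  # simulates work that would otherwise block the loop
    return {"rows": rows}


async def profile_endpoint(rows: int) -> dict:
    # Run the blocking function in a worker thread; the event loop stays free.
    return await asyncio.to_thread(profile_dataset, rows)


async def demo() -> tuple:
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    task = asyncio.create_task(ticker())
    result = await profile_endpoint(1000)
    task.cancel()
    return result, ticks
```

If `profile_dataset` were called directly inside the coroutine, `ticker` would not advance until the call returned.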

Depends on

All previous PRs:
[DOM-75514] feat: core HTTP layer enhancements + debug middleware #14
[DOM-75515] feat: project type detection, DB URL remap, tabular data helpers #12
[DOM-75516] feat: leaderboard normalization utils + request project ID resolver #13
[DOM-75517] feat: EDA results DB model + diagnostics_data + CRUD enhancements #15
feat: dataset listing API + storage resolver + cross-project file access #17
feat: API-first dataset listing + cross-project support + snapshot verification #16
[DOM-75519] feat: migrate job launcher from v4 to v1/beta Domino Jobs API #19
[DOM-75520] fix: enforce owner-based RBAC on all job endpoints #20

Summary

Workers

domino_training_runner.py — --database-url, --job-config CLI args, DB URL remap
domino_eda_runner.py — --database-url, --project-id, async, DB-backed EDA store, ensure_local_file
training_worker.py — _resolve_training_data_path with mount fallback, compute_all_diagnostics, feature column filtering, project-scoped model/temp paths

Trainers

timeseries.py — prediction_length validation, timestamp dedup, covariate validation, frequency detection (QS, YS), leaderboard normalization
autogluon_runner.py — feature_columns, models_path, temp_path params
data_profiler.py / ts_profiler.py — optimized column loading, sampling, ACF/decomposition fixes

Diagnostics/Export/Predictions

model_diagnostics.py — compute_all_diagnostics(), leaderboard normalization
model_export.py — in-memory generate_deployment_files()
prediction_service.py — _format_forecast_predictions(), serve from stored diagnostics
export.py route — export_deployment_zip, _ensure_local_model
predictions.py route — serve diagnostics from DB first
profiling.py route — asyncio.to_thread(), ensure_local_file, owner-scoped, async EDA store

Other

cleanup_service.py — dataset file deletion via storage_resolver, owner filtering
job_queue.py — fail stuck Domino jobs (>1hr) on startup
health.py — /storage endpoint
registry.py — minor
domino_model_api.py — removed dead _get_ephemeral_token
domino_registry.py — temp_path param
experiment_tracker.py — temp_dir for artifacts
READMEs updated

Files changed

25 source files, 9 test files, 2 READMEs

Test plan

- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting with LRU caching (replaces scattered pd.read_csv/parquet calls)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lver
- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…aming download and debug middleware
- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ancements
- Add EDAResult model for DB-backed EDA storage (replaces file-based store)
- Add diagnostics_data JSON column to Job model
- Add summary_only query optimization with load_only() for job listing
- Add owner filtering on get_registered_models and get_jobs_for_cleanup
- Add EDA CRUD operations (create/get/update/write result/error/delete stale)
- Rewrite eda_job_store from file-backed to DB-backed (async)
- Add WebSocket broadcast helpers for job/log updates in CRUD layer
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…into subir/pr8-dataset-manager

… into subir/pr8-dataset-manager

…fication
- Rewrite dataset_manager to prefer Domino Dataset RW API over filesystem scan
- Cross-project mount safety: only resolve local paths for same-project datasets
- Upload route: direct to Domino dataset via storage_resolver chunked API
- New verify-snapshot endpoint for polling snapshot status after upload
- New download-dataset-file endpoint
- Preview route: ensure_local_file for cross-project files
- Replace pandas with tabular_data helpers
- Compat routes: project-scoped listing, svcdataset, svcverifysnapshot

…into subir/pr5-dataset-api-storage

- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project, chunked upload via v4 API, streaming download, snapshot file listing, mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils

…into subir/pr6-job-launcher

- Switch job launch from POST /v4/jobs/start to POST /api/jobs/v1/jobs
- Switch job status from GET /v4/jobs/{id} to GET /api/jobs/beta/jobs/{id}
- Remove _resolve_hardware_tier_id (v1 API accepts tier name directly)
- Add _job_api_request with direct-host-first fallback
- Add _remap_db_url_for_target for cross-project database paths
- Pass database_url and job_config as CLI args to workers

…r/pr7-rbac-owner

Security fix: all job endpoints now enforce ownership.
- Add _enforce_job_owner(): returns 404 to non-owners (avoids leaking job existence)
- Invert get_request_owner() priority: prefer domino-username header over sidecar (sidecar returns App owner, not viewing user)
- resolve_job_list_filters() ignores client-supplied owner (always server-side)
- Add needs_request flag in compat patterns for Request forwarding
- Owner filtering on cleanup, registered models
- Add JobListItemResponse lightweight schema for list views
- Add clear_viewing_user() to prevent cross-request context leakage
- Background Domino sync throttling, zombie local job detection
- Add leaderboard_utils for normalize_leaderboard_payload
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
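The 404-for-non-owners behaviour can be sketched without a web framework. The `Job` dataclass and `JobNotFound` exception are hypothetical stand-ins; the real `_enforce_job_owner()` presumably raises the framework's own HTTP error.

```python
from dataclasses import dataclass
from typing import Optional


class JobNotFound(Exception):
    """Hypothetical error that a route would translate into HTTP 404."""
    status_code = 404


@dataclass
class Job:
    id: str
    owner: str


def enforce_job_owner(job: Optional[Job], request_owner: str) -> Job:
    """Return the job only if the caller owns it.

    A missing job and a job owned by someone else raise the same error,
    so a non-owner cannot tell whether the job exists.
    """
    if job is None or job.owner != request_owner:
        raise JobNotFound("Job not found")
    return job
```

Using 404 for both cases, rather than 403 for the ownership failure, is what avoids leaking job existence.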

Stop the sidecar token from overwriting the user's forwarded JWT. When a user token is present (from the Extension-injected Authorization header), outbound calls to datasetrw, jobs, and model registry now run as the visiting user instead of the App owner. The sidecar token is only used as fallback for background tasks and health checks.
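The token priority described above can be sketched as a small helper. The function name is hypothetical, and the `Bearer` scheme is an assumption made for the example; the commit message only states that the user's forwarded JWT takes precedence over the sidecar token.

```python
from typing import Dict, Optional


def build_auth_headers(
    user_token: Optional[str],
    sidecar_token: Optional[str],
) -> Dict[str, str]:
    """Prefer the visiting user's token; use the sidecar token only as fallback."""
    token = user_token or sidecar_token
    if not token:
        return {}
    # Assumption: tokens are sent as standard Bearer credentials.
    return {"Authorization": f"Bearer {token}"}
```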

…/pr5-dataset-api-storage

…/pr6-job-launcher

…/pr7-rbac-owner

…nto subir/pr8-dataset-manager

Addresses review comment: use the actual header name x-domino-api-key instead of the incorrect domino-api-key.

…/pr5-dataset-api-storage

…/pr6-job-launcher

…/pr7-rbac-owner

…nto subir/pr8-dataset-manager

Remove use_api_key parameter and _get_api_key() helper. All downloads now use the standard auth chain which preserves the user's forwarded token and falls back to sidecar when needed.

Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing, snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback, cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot, and file download
- dataset_service.py: listing orchestration and mount filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove upload, download, provisioning, and deletion methods from storage_resolver.py — they will be re-introduced in the PRs that consume them (PR #26 for upload, PR #22 for download/provisioning). Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files, _find_existing, _probe_mount, caching infrastructure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…roject_id The env var is the App's own project, not the target project the user is working in. Falling back to it silently operates on the wrong project (root cause of datasets showing empty in cross-project scenarios). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The cache is a process-lifetime dict on a singleton — it never expires or invalidates. Stale data is worse than a redundant API call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The v2→v1 fallback was speculative — no evidence that v2 is unreliable on any target deployment. Ryan's PR uses v2 only and it works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

No frontend caller exists for this endpoint. Dataset preview uses /svcdatasetpreview which returns JSON rows, not a raw file download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Speculative field — no frontend code checks it. Remove to keep the schema minimal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds the remaining dataset endpoints on top of the viewing layer:
- POST /upload: chunked upload to Domino dataset with snapshot verification
- GET /verify-snapshot: poll snapshot status (upload completion check)
- GET /{id}/preview: paginated data preview (CSV/Parquet)
- GET /{id}/schema: column names and dtypes
- POST /preview: preview by file path with pagination
dataset_service.py: upload orchestration, preview pagination, schema resolution, compat payload builders.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…esolver
Extends storage_resolver with methods consumed by the upload and snapshot verification endpoints:
- ensure_dataset_exists, _resolve_or_create, _create_dataset
- upload_file, _upload_chunks
- delete_snapshot_files, _get_latest_snapshot_id
- _dataset_rw_write_request, _format_http_error
- get_dataset_info, delete_dataset
Download and provisioning methods (download_file, download_directory, ensure_project_storage, resolve_project_paths, check_project_storage, invalidate) will be added in PR #22 where they are consumed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add max_upload_size_mb setting (env: AUTOML_MAX_UPLOAD_SIZE_MB, default 550) and enforce it in the upload route before processing. Returns HTTP 413 with a clear message when exceeded. Validation (filename, extension, size) is now done once before branching into the Domino vs standalone upload paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
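A minimal sketch of the size check, assuming the setting is read straight from the environment. The env var name, the default of 550, and the 413 status come from the commit message; the helper names, the `UploadTooLarge` exception, and the 1 MB = 1024 * 1024 bytes conversion are assumptions.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_UPLOAD_SIZE_MB = 550  # default stated in the commit message


class UploadTooLarge(Exception):
    """Hypothetical error that a route would translate into HTTP 413."""
    status_code = 413


def max_upload_size_mb(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read the limit from AUTOML_MAX_UPLOAD_SIZE_MB, falling back to the default."""
    env = os.environ if environ is None else environ
    return int(env.get("AUTOML_MAX_UPLOAD_SIZE_MB", DEFAULT_MAX_UPLOAD_SIZE_MB))


def check_upload_size(size_bytes: int,
                      environ: Optional[Mapping[str, str]] = None) -> None:
    """Raise before any processing when the upload exceeds the limit."""
    limit_mb = max_upload_size_mb(environ)
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    if size_bytes > limit_mb * 1024 * 1024:
        raise UploadTooLarge(f"File exceeds the {limit_mb} MB upload limit")
```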

Add cleanup_dataset_cache() to utils.py — walks the dataset_cache directory and removes files older than the configured TTL. Empty directories are pruned afterward. Configurable via AUTOML_DATASET_CACHE_TTL_HOURS (default: 24). Called at app startup in main.py (PR #22). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
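The walk-and-prune behaviour can be sketched with the standard library. This is not the code in utils.py: the name `cleanup_cache`, the use of file modification time as the age, the returned count, and keeping the cache root in place are all assumptions made for the example.

```python
import os
import time
from typing import Optional


def cleanup_cache(cache_dir: str, ttl_hours: float = 24,
                  now: Optional[float] = None) -> int:
    """Delete files older than the TTL, then prune directories left empty.

    Returns the number of files removed.
    """
    now = time.time() if now is None else now
    cutoff = now - ttl_hours * 3600
    removed = 0
    # Walk bottom-up so each directory is visited after its children.
    for root, _dirs, files in os.walk(cache_dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
        # Prune empty directories, but keep the cache root itself.
        if root != cache_dir and not os.listdir(root):
            os.rmdir(root)
    return removed
```

Walking bottom-up means a directory emptied by the file pass is removed in the same run, and its parent is then checked in turn.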


DOMINO_ENVIRONMENT_ID and DOMINO_ENVIRONMENT_REVISION_ID are set on the App container and identify the compute environment with the right dependencies. Using env vars eliminates per-caller plumbing and ensures child jobs always match the App's environment. Removes environment_id param from _job_start, start_training_job, and start_eda_job. Adds environmentRevisionId to job payload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolve the training data path at job creation time and pass it as --file-path to the Domino Job command. The worker uses the path directly instead of needing dataset API access at runtime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Query params are the canonical approach going forward. The X-Project-Id header is kept as a fallback for legacy clients only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…et-view
# Conflicts:
# automl-service/app/api/utils.py
# automl-service/tests/test_api_utils.py

…ners

The frontend sends both header and query param from the same source. No scenario where header is present but query param isn't. Query param only — simpler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…et-view

…ners

ddl-subir-m and others added 14 commits March 20, 2026 09:44

Merge remote-tracking branch 'origin/subir/pr2-project-type-tabular' …

849fb92

…into subir/pr8-dataset-manager

Merge remote-tracking branch 'origin/subir/pr3-leaderboard-api-utils'…

6594bad

… into subir/pr8-dataset-manager

include PR 5 dependencies

ae456cc

Merge remote-tracking branch 'origin/subir/pr2-project-type-tabular' …

f03be54

…into subir/pr5-dataset-api-storage

Merge remote-tracking branch 'origin/subir/pr2-project-type-tabular' …

43118d1

…into subir/pr6-job-launcher

Merge remote-tracking branch 'origin/subir/pr4-db-eda-crud' into subi…

f6a930d

…r/pr7-rbac-owner

ddl-subir-m requested review from a team and niole March 20, 2026 17:44

ddl-subir-m added 12 commits March 20, 2026 14:51

Merge remote-tracking branch 'origin/subir/pr1-http-layer' into subir…

98dae8a

…/pr5-dataset-api-storage

Merge remote-tracking branch 'origin/subir/pr1-http-layer' into subir…

fb1f017

…/pr6-job-launcher

Merge remote-tracking branch 'origin/subir/pr1-http-layer' into subir…

b81265e

…/pr7-rbac-owner

Merge remote-tracking branch 'origin/subir/pr5-dataset-api-storage' i…

eda20a4

…nto subir/pr8-dataset-manager

fix: redact x-domino-api-key header in debug middleware

0d89775

Addresses review comment: use the actual header name x-domino-api-key instead of the incorrect domino-api-key.

Merge remote-tracking branch 'origin/subir/pr1-http-layer' into subir…

64e45c5

…/pr5-dataset-api-storage

Merge remote-tracking branch 'origin/subir/pr1-http-layer' into subir…

7b4baf5

…/pr6-job-launcher

Merge remote-tracking branch 'origin/subir/pr1-http-layer' into subir…

ace7d5c

…/pr7-rbac-owner

Merge remote-tracking branch 'origin/subir/pr5-dataset-api-storage' i…

f1e94aa

…nto subir/pr8-dataset-manager

fix: update auth middleware comment to reflect user token forwarding

6bfccc2

fix: pass user auth token to background sync tasks

f21a605

ddl-subir-m mentioned this pull request Mar 20, 2026

Niole.dom 75352.disable in memory jobs #18

Merged

fix: simplify domino_download to use get_domino_auth_headers

f79aa6f

Remove use_api_key parameter and _get_api_key() helper. All downloads now use the standard auth chain which preserves the user's forwarded token and falls back to sidecar when needed.

ddl-subir-m and others added 29 commits March 23, 2026 23:57

remove in-memory dataset item cache

2a02f0a

The cache is a process-lifetime dict on a singleton — it never expires or invalidates. Stale data is worse than a redundant API call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

simplify: remove v1 fallback from dataset listing, use v2 only

a5ddc5a

The v2→v1 fallback was speculative — no evidence that v2 is unreliable on any target deployment. Ryan's PR uses v2 only and it works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

remove dead download_dataset_file endpoint

0c07d1e

No frontend caller exists for this endpoint. Dataset preview uses /svcdatasetpreview which returns JSON rows, not a raw file download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

remove mounted field from DatasetFileResponse

d26d314

Speculative field — no frontend code checks it. Remove to keep the schema minimal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: update auth middleware comment to reflect user token forwarding

4253988

fix: pass user auth token to background sync tasks

f80ca44

merge: cascade auth fix from rebased pr16b-upload-preview

adcc5f6

merge: cascade auth fix from rebased pr7-rbac-owner

8b161f2

fix: check query params before header in resolve_request_project_id

f9edab6

Query params are the canonical approach going forward. The X-Project-Id header is kept as a fallback for legacy clients only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'subir/pr3-leaderboard-api-utils' into subir/pr16a-datas…

5933d1c

…et-view
# Conflicts:
# automl-service/app/api/utils.py
# automl-service/tests/test_api_utils.py

Merge branch 'subir/pr16a-dataset-view' into subir/pr16b-upload-preview

4ae5045

Merge branch 'subir/pr16b-upload-preview' into subir/pr9-workers-trai…

a4cc12c

…ners

Merge branch 'subir/pr3-leaderboard-api-utils' into subir/pr16a-datas…

fa0bd47

…et-view

Merge branch 'subir/pr16a-dataset-view' into subir/pr16b-upload-preview

8e4a853

Merge branch 'subir/pr16b-upload-preview' into subir/pr9-workers-trai…

1aa2b82

…ners

ddl-subir-m closed this Mar 24, 2026

ddl-subir-m wants to merge 145 commits into main from subir/pr9-workers-trainers
