[DOM-75568] feat: workers, trainers, predictions, diagnostics & export#28
Open
ddl-subir-m wants to merge 250 commits into main from
Conversation
…aunch diagnostics

- Replace filesystem-only dataset listing with the Domino Dataset RW v2 API (GET /api/datasetrw/v2/datasets?projectIdsToInclude=...) when a project ID is available via the X-Project-Id header or the DOMINO_PROJECT_ID env var
- Cross-reference API results with mounted filesystem paths for file discovery (preview/training still uses local mounts)
- Fall back to the legacy filesystem scan when the API is unavailable
- Add params support to the domino_request() helper
- Thread the project ID through dataset routes and compat endpoints
- Add diagnostic logging to the job launch path to trace project ID flow
- Add scripts/diagnose_api_routing.py for testing proxy vs. direct host
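The header-then-env lookup order described above can be sketched as a small helper. This is an illustration only: the function name and the injectable `env` parameter are assumptions, not the service's actual code; only the header and env-var names come from the commit message.

```python
import os

def resolve_project_id(headers, env=None):
    """Resolve the Domino project ID: X-Project-Id header first,
    then the DOMINO_PROJECT_ID env var, else None (sketch)."""
    env = os.environ if env is None else env
    # The header wins when the frontend forwards it with the request.
    project_id = headers.get("X-Project-Id") or headers.get("x-project-id")
    if project_id:
        return project_id
    # Fall back to the env var Domino sets inside a workspace/app.
    return env.get("DOMINO_PROJECT_ID")
```

When neither source yields an ID, the routes fall back to the legacy filesystem scan.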
…nce at construction

The ApiClient singleton was reading ?projectId= from the URL once at module load time. React Router's Navigate redirect strips query params before the header could be reliably captured. Now the project ID is:
1. Cached eagerly at module evaluation time
2. Resolved dynamically on every API request
3. Synced from React Router search params via the ProjectIdSync component
…query params

Domino's app proxy strips query parameters before serving the App, so ?projectId=TARGET_ID never reaches the frontend JS. Hash fragments (#projectId=TARGET_ID) are client-side only and survive proxy stripping. The frontend now checks both ?projectId= and #projectId= at module load time, in the per-request header injection, and in the React Router sync.
Domino loads Apps in an iframe with a clean internal URL — the user's projectId (query param or hash) is on the parent frame, not the iframe. Since both are same-origin, read window.parent.location as fallback.
…austion

- Detect zombie local jobs (RUNNING in the DB but no active asyncio task) on every job-list request and auto-mark them FAILED
- Wrap Domino job sync in try/except so a flaky API call cannot crash the job-list endpoint
- Mark Domino jobs stuck in RUNNING for more than 1 hour as FAILED on startup
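The zombie check above reduces to comparing the DB's RUNNING set against the set of live task IDs. A minimal sketch, with dicts standing in for ORM rows and the function name assumed rather than taken from the codebase:

```python
def mark_zombie_jobs(jobs, active_task_ids):
    """Mark RUNNING jobs that have no backing asyncio task as FAILED.

    `jobs` is a list of job records (dicts here for illustration);
    `active_task_ids` is the set of job IDs with a live task.
    Returns the jobs that were flipped to FAILED.
    """
    zombies = []
    for job in jobs:
        if job["status"] == "RUNNING" and job["id"] not in active_task_ids:
            job["status"] = "FAILED"
            job["error"] = "worker task lost (zombie job)"
            zombies.append(job)
    return zombies
```

Running this on every job-list request keeps the list self-healing without a separate reaper process.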
Test scripts to discover and validate Domino Dataset RW v2 API capabilities:
- test_dataset_api.py: list, create, and verify datasets and probe mount paths
- test_dataset_grant.py: test cross-project dataset sharing/grant workflows
- test_dataset_upload.py: probe snapshot, direct, presigned, and mount-write upload methods

ProjectStorageResolver service for auto-creating per-project automl-extension datasets with in-memory caching and mount path probing.
- Use v1 endpoint for dataset creation (v2 POST returns 404)
- Use "name" field instead of "datasetName" for v1 create payload
- Unwrap the nested {"dataset": {...}} wrappers in the v2 list response
- Add grant API call (POST v1 grants with DatasetRwEditor role)
- Use v1 for get-by-id (v2 returns 404)
- Update test script with correct endpoints and payload shapes
- Implement upload_file() in storage_resolver using Domino's v4 chunked upload API (same workflow as the python-domino SDK)
- Support files of any size via automatic chunking (default 8 MB)
- Per-chunk retry with exponential backoff (up to 10 attempts)
- Auto-cancel the upload session on failure
- Add files and headers support to domino_request()
- Update test_dataset_upload.py with a working v4 upload probe
- Fix v2 response unwrapping and use v1 for get-by-id in the test script
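The chunking and per-chunk retry described above can be sketched generically. The 8 MB default and the up-to-10-attempts backoff come from the commit; `send_chunk` is a stand-in callable, not the real v4 API client:

```python
import time

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB default, per the commit message

def iter_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield (1-based index, bytes) pairs for a chunked upload."""
    index = 1
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield index, chunk
        index += 1

def upload_chunk_with_retry(send_chunk, index, data,
                            attempts=10, base_delay=0.5):
    """Retry one chunk with exponential backoff (0.5s, 1s, 2s, ...).

    `send_chunk(index, data)` is a caller-supplied callable standing in
    for the actual v4 upload request; re-raises after the last attempt.
    """
    for attempt in range(attempts):
        try:
            return send_chunk(index, data)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

On a non-retryable failure the real implementation also cancels the upload session so Domino does not hold a half-committed upload open.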
…-local

- Storage resolver wired into uploads, training, and health endpoints
- Prediction service probes dataset mounts to find models
- EDA job store resolves base_dir from the dataset mount when available
- Orphan cleanup scans dataset mount paths in addition to app-local dirs
- Add dataset file download endpoint and chunked upload support
Route all temp file creation (leaderboard, feature importance, MLflow model staging, model exports) through project dataset mounts so artifacts land on persistent storage. Also adds mount timing test script that confirmed dataset mounts require an app restart to appear.
Dataset mounts are resolved at boot time, so creating the dataset from inside a running job means the mount is never available. Move dataset creation to before job launch so the Domino Job boots with the mount already present. Also fix EDA async profiling to pass project_id from X-Project-Id header and pre-create the dataset.
The upload endpoint previously called resolve_project_paths() which raises 503 if the dataset mount is not yet available. On first-time upload the mount won't exist until the app restarts. Now uses check_project_storage() with graceful fallback to settings.uploads_path, and pre-creates the dataset for future Jobs/restarts.
…l mount

When a target project_id is present, read the uploaded file into memory, extract metadata from the buffer, and upload to the automl-extension dataset via the v4 chunked API. This removes the dependency on having the dataset mount available in the app container, making cross-project uploads reliable. Falls back to local disk for standalone/no-project mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the app runs outside the target project, dataset mount paths don't exist locally. This adds transparent download-on-demand: ensure_local_file() checks if a dataset mount path exists, and if not, downloads the file via the Domino Dataset RW API to a local cache before profiling or training. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
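The download-on-demand pattern this commit describes can be sketched as follows. The cache-key scheme and the `download` callable are illustrative assumptions; only the "use the mount if it exists, otherwise fetch into a local cache" behavior comes from the commit message:

```python
import os

def ensure_local_file(path, cache_dir, download):
    """Return a locally readable path for `path` (sketch).

    If the dataset mount path exists, use it directly. Otherwise
    download into `cache_dir` under a key derived from the path;
    `download(src, dest)` stands in for the Dataset RW API fetch.
    """
    if os.path.exists(path):
        return path
    cache_key = path.replace(os.sep, "_").lstrip("_")
    local = os.path.join(cache_dir, cache_key)
    if not os.path.exists(local):
        os.makedirs(cache_dir, exist_ok=True)
        download(path, local)
    return local
```

A second call with the same path hits the cache and skips the download entirely, which is what makes repeated profiling/training runs cheap.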
Add /export/deployment/zip that streams a zip directly from dataset mount without intermediate copies. Simplify the UI to a single download button (no output dir input needed). Make output_dir optional on the existing build endpoint. Remove unused diagnostic and test scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
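A minimal sketch of zipping a deployment directory straight from the mount, writing each file directly into the archive with no intermediate on-disk copies. This is an in-memory illustration, not the endpoint's actual implementation; a truly streamed response for very large exports would need a chunked zip writer instead:

```python
import io
import os
import zipfile

def zip_directory(root):
    """Zip every file under `root` into an in-memory buffer.

    Entries are read straight from the mount and written into the
    archive; arcnames are relative to `root` so the zip unpacks clean.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, arcname=os.path.relpath(full, root))
    buf.seek(0)
    return buf
```

The endpoint can then hand the buffer to a streaming response with a `Content-Disposition: attachment` header.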
Add pytest suites covering:
- domino_download() streaming, domino_request() retries, auth headers
- StorageResolver: download_file(), _find_existing(), snapshots, grants, cache
- ensure_local_file(): local/remap/download branches, caching, fallbacks
- /export/deployment/zip: zip helpers and API endpoint integration

53 tests, all passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standalone script that tests the full dataset lifecycle against a live Domino environment: auth, create, grant, upload, snapshots, list files, download (probes endpoints), and cleanup. Confirms file download via API is not supported (404 on all endpoints) — files only accessible via mounts. Run: python scripts/test_domino_api_live.py [--keep] [--project-id ID] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematically probes 60+ endpoint patterns across v1/v2/v4 and alternative paths. All return 404 — Domino Dataset RW API has no file read/download capability. Files are only accessible via mount paths in Domino Jobs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… API

Domino Dataset RW API has no file read/download endpoints (confirmed by probing 60+ URL patterns). Files uploaded to the dataset are only accessible via mount paths in Domino Jobs, not in the app.

Fix: during upload, also write the file bytes to the local dataset_cache directory using the same cache key scheme that ensure_local_file() uses. When profiling or local training calls ensure_local_file(), it finds the cached copy immediately.

Also make ensure_local_file() gracefully handle download failures instead of propagating the exception, returning the original path for a clearer downstream error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete cached dataset files older than 24 hours during app startup to prevent indefinite accumulation on disk. Profiling and training happen within minutes of upload, so 24h gives ample margin. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
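The startup sweep above can be sketched as a single mtime pass over the cache directory. The function name is an assumption; the 24-hour cutoff is from the commit message:

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # 24h, per the commit message

def prune_cache(cache_dir, max_age=MAX_AGE_SECONDS, now=None):
    """Delete cached files older than `max_age` seconds; return their names.

    `now` is injectable for testing; defaults to the current time.
    """
    now = time.time() if now is None else now
    removed = []
    if not os.path.isdir(cache_dir):
        return removed
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

Since profiling and training run within minutes of upload, a 24-hour window never evicts a file that is still in use.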
…ore jobs
After uploading a file to a Domino Dataset via the chunked API, poll the
snapshot endpoint to confirm the file is committed before allowing the user
to proceed to profiling or training. This prevents Domino Jobs from failing
because the file hasn't materialized in the mount yet.
Backend: inline polling (~15s) during upload, plus a GET /verify-snapshot
endpoint for frontend fallback polling. Frontend: useSnapshotVerification
hook gates setDataSource/setSelectedFilePath behind verification, with
"Proceed Anyway" escape hatch on timeout.
Fixes: route ordering (verify-snapshot above /{dataset_id} catch-all),
setTimeout chain instead of setInterval to prevent overlapping async polls,
snapshot_file_path for consistent path matching, reduced inline backoff to
avoid proxy timeouts, and Proceed Anyway escape hatch in EDA page.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
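The inline ~15s poll described above is a deadline loop. A sketch, with `get_status` standing in for the Domino snapshot-status call (the real endpoint and payload are not shown in the commit) and the timing numbers taken from the message:

```python
import time

def wait_for_snapshot(get_status, timeout=15.0, interval=1.5,
                      sleep=time.sleep):
    """Poll until `get_status()` reports the snapshot committed.

    Returns True on success, False on timeout; the frontend then
    falls back to its own polling (or the "Proceed Anyway" hatch).
    `sleep` is injectable for testing.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A timeout here is not fatal: the upload still completed, so the GET /verify-snapshot endpoint lets the frontend keep checking after the response returns.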
…nds automl.db

When a training/EDA job runs in the target project, DOMINO_PROJECT_NAME points to that project, causing config.py to derive the wrong DB path. Pass the app's database_url as a --database-url CLI arg to the runners, which set it as an env var before any app imports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
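The runner-side half of that fix hinges on ordering: the env var must be exported before any app module is imported, since config.py reads it at import time. A sketch (the flag name follows the commit; the env var name and helper are assumptions):

```python
# Parse --database-url and export it BEFORE importing any app modules,
# so config.py sees the app project's DB path instead of deriving one
# from DOMINO_PROJECT_NAME in the target project.
import argparse
import os

def apply_database_url(argv):
    """Consume --database-url from argv, export it, return leftover args."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--database-url")
    args, remaining = parser.parse_known_args(argv)
    if args.database_url:
        os.environ["DATABASE_URL"] = args.database_url
    return remaining

# In a runner entrypoint (illustrative):
# if __name__ == "__main__":
#     rest = apply_database_url(sys.argv[1:])
#     from app import runner  # imported only after the env var is set
```

`parse_known_args` keeps the runner's other flags intact for its own parser.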
Covers arg parsing, env var injection, command building, and config preservation to verify the cross-project DATABASE_URL fix (ae7f6e2). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l DB read

When launching a Domino training job, serialize the complete job config to JSON and pass it as a CLI arg so the runner can skip the cross-project DB read on startup. Falls back to a DB read when --job-config is absent (local execution path). The DB is still used for writes (progress, status, results, logs, cancellation checks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Child jobs receive a sqlite database URL from the app project, but the target project may mount the shared dataset at a different path depending on whether it's DFS or git-based. Add project type detection via filesystem probes and remap the database URL to the correct mount path before setting DATABASE_URL in the environment. Also adds /mnt/imported/data/ to _MOUNT_ROOTS in utils.py so remap_shared_path() works for git-based target projects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
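The remap described above amounts to retrying the same dataset-relative suffix under each known mount root. A sketch with an injectable `exists` probe; the root list is an assumption apart from /mnt/imported/data/, which the commit names explicitly:

```python
import os

# Candidate mount roots. Only /mnt/imported/data/ is confirmed by the
# commit message; the others are illustrative placeholders.
MOUNT_ROOTS = ["/domino/datasets/local/", "/mnt/data/", "/mnt/imported/data/"]

def remap_shared_path(path, exists=os.path.exists):
    """If `path` is absent locally, probe the other mount roots for the
    same dataset-relative suffix; return the first hit, else `path`."""
    if exists(path):
        return path
    for root in MOUNT_ROOTS:
        if path.startswith(root):
            suffix = path[len(root):]
            for other in MOUNT_ROOTS:
                candidate = other + suffix
                if other != root and exists(candidate):
                    return candidate
    return path
```

Applying this to the sqlite URL's filesystem component before exporting DATABASE_URL lets the same child-job command work in both DFS and git-based target projects.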
…unch flows

Tests the full API request path with mocked Domino APIs to verify:
- Training job launch command includes --database-url and --job-config
- Async EDA launch command includes --database-url
- Runner DB URL remap resolves across git and DFS mount points
- remap_shared_path covers the /mnt/imported/data/ mount root

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add AUTOML_DEBUG_LOGGING (backend) and VITE_DEBUG_LOGGING (frontend) env vars for verbose request/response logging during Domino app debugging. Both default to false for production.

Backend: new DebugLoggingMiddleware logs method, URL, headers, body, and timing for every request when enabled. Frontend: new debug logger utility wraps API client calls with grouped console output, including timing.

Update both READMEs to reflect the current codebase structure: new compat/ directory, core services, serving layer, resolvers, scripts, and all missing env vars. Remove the dead test_dataset_grant.py script.
…t logic

Replace file-listing-based snapshot verification with a snapshot status check. The Domino Dataset RW API has no file-listing endpoint, so all list_files() calls returned 404. Now checks whether the latest snapshot status is "active", which correctly indicates the upload commit completed.

Remove _grant_project_access(): the grants API takes user IDs, not project IDs, and cross-project grants aren't needed since local training uses the cached file and Domino Jobs run in the target project.
- Training worker with data-path resolution, config deserialization, cancellation checks, and step-by-step progress tracking
- Job queue manager with concurrency limiting, recovery, and graceful shutdown
- Prediction service with pre-computed diagnostics and time series support
- Model export with Docker zip packaging
- Leaderboard normalization for time series timing keys
- Cross-project DB URL remap and dataset file resolution
- 11 new test modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- EDA Analysis page with tabular/time series mode toggle and manual trigger
- Job detail page with tab navigation (results/leaderboard/diagnostics)
- Dataset hooks and data source selector for Domino datasets
- Live job updates replacing simulated progress bars
- Interactive leaderboard with normalized timing keys
- Export dialog with Docker build command display
- Time series config panel for column selection
- Debug logger utility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Table.tsx, DataOverview.tsx, DataTable.tsx, InteractiveCharts.tsx (replaced by inline SVG charts)
- TimeSeriesForecastPanel.tsx (consolidated into the EDA flow)
- useJobProgress.ts (replaced by useJobLiveUpdates)
- pathDefaults.ts, eda/index.ts barrel export (no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sdk-with-api

# Conflicts:
#	automl-service/app/api/routes/health.py
#	automl-service/app/api/routes/jobs.py
#	automl-service/app/main.py
…ed client

Per Niole's review:
- Remove _remap_db_url_for_target (eliminating the imported-data pattern)
- Remove multi-base-URL fallback; use DOMINO_API_HOST directly
- Replace raw domino_request() calls with the generated public API client for job start (start_job) and status (get_job_details)
- Keep domino_request only for the v4 stop (no public API alternative)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	automl-service/app/api/routes/jobs.py
#	automl-service/app/main.py
# Conflicts:
#	automl-service/app/api/routes/health.py
#	automl-service/app/core/domino_job_launcher.py
…h-api

# Conflicts:
#	automl-service/app/api/routes/jobs.py
#	automl-service/app/core/domino_job_launcher.py
…ainers

# Conflicts:
#	automl-service/app/api/utils.py
#	automl-service/tests/test_api_utils.py
The comment claimed the sidecar token is used as fallback, but the code never falls back to the sidecar. Clarify that missing user tokens cause MissingUserTokenError. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…h-api

# Conflicts:
#	automl-service/app/api/utils.py
Resolve merge conflicts with Niole's PR #30 (job auth via authorized actions). Adopt get_viewing_user_name(), require_job_list(), and the dual-track get_job_or_404 (local owner check / Domino API check) while keeping our unique additions: zombie local-job detection, summary_only query optimization, build_job_list_item_response, and cleanup owner scoping.

Remove get_request_owner / _enforce_job_owner in favor of the PR #30 patterns. Add from __future__ import annotations to job_service.py for Python 3.9 compat with | union syntax.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Job service functions (get_job_response, cancel_job, delete_job, etc.) accept (db, job_id) only. The compat route registrations incorrectly had needs_request=True, causing patterns.py to pass request=request which triggered TypeError on /svcjobget, /svcjobcancel, etc.
Jobs with execution_target="domino_job" but no domino_job_id (pending submission or failed before launch) were raising a hard 500. Fall back to owner-based auth check instead.
Domino API returns 'Github' (capital G) but the generated enum only accepts 'github'. Add _missing_ hook for case-insensitive lookup with fallback to UNKNOWN for unrecognized values.
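The `_missing_` hook is a standard `enum.Enum` extension point: it is called when a value lookup fails, so it can normalize case and supply a fallback member. An illustrative enum (not the generated one):

```python
from enum import Enum

class RepoProvider(Enum):
    """Illustrative stand-in for the generated enum, showing the
    _missing_ hook for case-insensitive lookup with UNKNOWN fallback."""
    GITHUB = "github"
    GITLAB = "gitlab"
    UNKNOWN = "unknown"

    @classmethod
    def _missing_(cls, value):
        # Called only when the exact-value lookup fails, e.g. 'Github'.
        if isinstance(value, str):
            lowered = value.lower()
            for member in cls:
                if member.value == lowered:
                    return member
        # Unrecognized providers degrade gracefully instead of raising.
        return cls.UNKNOWN
```

With this hook, `RepoProvider('Github')` resolves to the `github` member instead of raising ValueError, which is exactly the mismatch the Domino API response was triggering.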
The generated ProjectEnvelopeV1 parser fails on enum values like 'Github' due to stale .pyc caches ignoring the _missing_ hook fix. Since we only need name and ownerUsername, parse the raw JSON directly to avoid the generated model deserialization entirely.
onTerminal callback was an inline arrow function in the useEffect dependency array, causing the effect to re-run on every render. Use a ref for the callback instead so the WebSocket only reconnects when enabled or wsUrl actually change.
The Domino project API wraps data in {"project": {...}}. Read name/ownerUsername from the nested project object, not the top level.
job.started_at is timezone-naive from the DB but utc_now() is timezone-aware. Add tzinfo to started_at before subtraction.
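Subtracting an aware datetime from a naive one raises TypeError, which is the bug this commit fixes. A sketch of the pattern, with the function name assumed; the fix itself (attach UTC tzinfo to the naive DB value before subtracting) follows the commit message:

```python
from datetime import datetime, timezone

def elapsed_seconds(started_at, now=None):
    """Seconds since `started_at`, tolerating naive DB timestamps.

    DB rows come back timezone-naive; `now` is timezone-aware, so we
    attach UTC tzinfo to the naive value before subtracting.
    """
    now = now or datetime.now(timezone.utc)
    if started_at.tzinfo is None:
        started_at = started_at.replace(tzinfo=timezone.utc)
    return (now - started_at).total_seconds()
```

This assumes DB timestamps were written in UTC; if they were not, `replace(tzinfo=...)` would silently shift the meaning rather than convert it.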
The WebSocket only sent the initial job state, then waited for pings. Domino jobs that complete externally never pushed status updates. Now a background task polls every 5s, syncs Domino job state, and pushes updates when status changes. Stops on terminal status or client disconnect.
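The poll-and-push loop can be sketched as a small coroutine. `get_status` and `send` stand in for the Domino job sync and the WebSocket send; the 5-second interval and the terminal-status stop condition come from the commit message, while the status names are assumptions:

```python
import asyncio

async def push_status_updates(get_status, send, interval=5.0,
                              terminal=("SUCCEEDED", "FAILED", "STOPPED")):
    """Poll job status every `interval` seconds; push only on change.

    Stops (returning the final status) once a terminal status is seen.
    Client-disconnect handling would wrap this loop in the real handler.
    """
    last = None
    while True:
        status = await get_status()
        if status != last:
            await send(status)
            last = status
        if status in terminal:
            return status
        await asyncio.sleep(interval)
```

Pushing only on change keeps the socket quiet for long-running jobs while still delivering externally driven transitions within one poll interval.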
The WebSocket handler has no HTTP middleware context, so auth headers and user context are unavailable. Replace get_job_progress (which requires auth via get_job_or_404) with direct crud.get_job reads and inline status serialization.
Summary
ensure_local_file

Dependencies
Depends on PRs #24, #25, #26, #19, #20 (pr1-4, pr17, pr16a, pr16b, pr6, pr7)
Test plan
ensure_local_file downloads from the dataset API when the file is not mounted locally