[DOM-75568] feat: workers, trainers, predictions, diagnostics & export#28
Open
ddl-subir-m wants to merge 250 commits into main from
Conversation
…aunch diagnostics

- Replace filesystem-only dataset listing with the Domino Dataset RW v2 API (GET /api/datasetrw/v2/datasets?projectIdsToInclude=...) when a project ID is available via the X-Project-Id header or the DOMINO_PROJECT_ID env var
- Cross-reference API results with mounted filesystem paths for file discovery (preview/training still uses local mounts)
- Fall back to the legacy filesystem scan when the API is unavailable
- Add params support to the domino_request() helper
- Thread the project ID through dataset routes and compat endpoints
- Add diagnostic logging to the job launch path to trace project ID flow
- Add scripts/diagnose_api_routing.py for testing proxy vs. direct host
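The header-then-env lookup order described above can be sketched as a small helper. This is an illustration only: the function name and the injectable `env` parameter are assumptions, not the service's actual code; only the header and env-var names come from the commit message.

```python
import os

def resolve_project_id(headers, env=None):
    """Resolve the Domino project ID: X-Project-Id header first,
    then the DOMINO_PROJECT_ID env var, else None (sketch)."""
    env = os.environ if env is None else env
    # The header wins when the frontend forwards it with the request.
    project_id = headers.get("X-Project-Id") or headers.get("x-project-id")
    if project_id:
        return project_id
    # Fall back to the env var Domino sets inside a workspace/app.
    return env.get("DOMINO_PROJECT_ID")
```

When neither source yields an ID, the routes fall back to the legacy filesystem scan.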
…nce at construction

The ApiClient singleton was reading ?projectId= from the URL once at module load time. React Router's Navigate redirect strips query params before the header could be reliably captured. Now the project ID is:
1. Cached eagerly at module evaluation time
2. Resolved dynamically on every API request
3. Synced from React Router search params via the ProjectIdSync component
…query params

Domino's app proxy strips query parameters before serving the App, so ?projectId=TARGET_ID never reaches the frontend JS. Hash fragments (#projectId=TARGET_ID) are client-side only and survive proxy stripping. The frontend now checks both ?projectId= and #projectId= at module load time, in the per-request header injection, and in the React Router sync.
Domino loads Apps in an iframe with a clean internal URL — the user's projectId (query param or hash) is on the parent frame, not the iframe. Since both are same-origin, read window.parent.location as fallback.
…austion

- Detect zombie local jobs (RUNNING in the DB but no active asyncio task) on every job-list request and auto-mark them FAILED
- Wrap Domino job sync in try/except so a flaky API call cannot crash the job-list endpoint
- Mark Domino jobs stuck in RUNNING for more than 1 hour as FAILED on startup
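The zombie check above reduces to comparing the DB's RUNNING set against the set of live task IDs. A minimal sketch, with dicts standing in for ORM rows and the function name assumed rather than taken from the codebase:

```python
def mark_zombie_jobs(jobs, active_task_ids):
    """Mark RUNNING jobs that have no backing asyncio task as FAILED.

    `jobs` is a list of job records (dicts here for illustration);
    `active_task_ids` is the set of job IDs with a live task.
    Returns the jobs that were flipped to FAILED.
    """
    zombies = []
    for job in jobs:
        if job["status"] == "RUNNING" and job["id"] not in active_task_ids:
            job["status"] = "FAILED"
            job["error"] = "worker task lost (zombie job)"
            zombies.append(job)
    return zombies
```

Running this on every job-list request keeps the list self-healing without a separate reaper process.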
Test scripts to discover and validate Domino Dataset RW v2 API capabilities:
- test_dataset_api.py: list, create, and verify datasets and probe mount paths
- test_dataset_grant.py: test cross-project dataset sharing/grant workflows
- test_dataset_upload.py: probe snapshot, direct, presigned, and mount-write upload methods

ProjectStorageResolver service for auto-creating per-project automl-extension datasets with in-memory caching and mount path probing.
- Use v1 endpoint for dataset creation (v2 POST returns 404)
- Use "name" field instead of "datasetName" for v1 create payload
- Unwrap the nested {"dataset": {...}} wrappers in the v2 list response
- Add grant API call (POST v1 grants with DatasetRwEditor role)
- Use v1 for get-by-id (v2 returns 404)
- Update test script with correct endpoints and payload shapes
- Implement upload_file() in storage_resolver using Domino's v4 chunked upload API (same workflow as the python-domino SDK)
- Support files of any size via automatic chunking (default 8 MB)
- Per-chunk retry with exponential backoff (up to 10 attempts)
- Auto-cancel the upload session on failure
- Add files and headers support to domino_request()
- Update test_dataset_upload.py with a working v4 upload probe
- Fix v2 response unwrapping and use v1 for get-by-id in the test script
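The chunking and per-chunk retry described above can be sketched generically. The 8 MB default and the up-to-10-attempts backoff come from the commit; `send_chunk` is a stand-in callable, not the real v4 API client:

```python
import time

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB default, per the commit message

def iter_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield (1-based index, bytes) pairs for a chunked upload."""
    index = 1
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield index, chunk
        index += 1

def upload_chunk_with_retry(send_chunk, index, data,
                            attempts=10, base_delay=0.5):
    """Retry one chunk with exponential backoff (0.5s, 1s, 2s, ...).

    `send_chunk(index, data)` is a caller-supplied callable standing in
    for the actual v4 upload request; re-raises after the last attempt.
    """
    for attempt in range(attempts):
        try:
            return send_chunk(index, data)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

On a non-retryable failure the real implementation also cancels the upload session so Domino does not hold a half-committed upload open.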
…-local

- Storage resolver wired into uploads, training, and health endpoints
- Prediction service probes dataset mounts to find models
- EDA job store resolves base_dir from the dataset mount when available
- Orphan cleanup scans dataset mount paths in addition to app-local dirs
- Add dataset file download endpoint and chunked upload support
Route all temp file creation (leaderboard, feature importance, MLflow model staging, model exports) through project dataset mounts so artifacts land on persistent storage. Also adds mount timing test script that confirmed dataset mounts require an app restart to appear.
Dataset mounts are resolved at boot time, so creating the dataset from inside a running job means the mount is never available. Move dataset creation to before job launch so the Domino Job boots with the mount already present. Also fix EDA async profiling to pass project_id from X-Project-Id header and pre-create the dataset.
The upload endpoint previously called resolve_project_paths() which raises 503 if the dataset mount is not yet available. On first-time upload the mount won't exist until the app restarts. Now uses check_project_storage() with graceful fallback to settings.uploads_path, and pre-creates the dataset for future Jobs/restarts.
…l mount

When a target project_id is present, read the uploaded file into memory, extract metadata from the buffer, and upload to the automl-extension dataset via the v4 chunked API. This removes the dependency on having the dataset mount available in the app container, making cross-project uploads reliable. Falls back to local disk for standalone/no-project mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the app runs outside the target project, dataset mount paths don't exist locally. This adds transparent download-on-demand: ensure_local_file() checks if a dataset mount path exists, and if not, downloads the file via the Domino Dataset RW API to a local cache before profiling or training. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
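The download-on-demand pattern this commit describes can be sketched as follows. The cache-key scheme and the `download` callable are illustrative assumptions; only the "use the mount if it exists, otherwise fetch into a local cache" behavior comes from the commit message:

```python
import os

def ensure_local_file(path, cache_dir, download):
    """Return a locally readable path for `path` (sketch).

    If the dataset mount path exists, use it directly. Otherwise
    download into `cache_dir` under a key derived from the path;
    `download(src, dest)` stands in for the Dataset RW API fetch.
    """
    if os.path.exists(path):
        return path
    cache_key = path.replace(os.sep, "_").lstrip("_")
    local = os.path.join(cache_dir, cache_key)
    if not os.path.exists(local):
        os.makedirs(cache_dir, exist_ok=True)
        download(path, local)
    return local
```

A second call with the same path hits the cache and skips the download entirely, which is what makes repeated profiling/training runs cheap.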
Add /export/deployment/zip that streams a zip directly from dataset mount without intermediate copies. Simplify the UI to a single download button (no output dir input needed). Make output_dir optional on the existing build endpoint. Remove unused diagnostic and test scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
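A minimal sketch of zipping a deployment directory straight from the mount, writing each file directly into the archive with no intermediate on-disk copies. This is an in-memory illustration, not the endpoint's actual implementation; a truly streamed response for very large exports would need a chunked zip writer instead:

```python
import io
import os
import zipfile

def zip_directory(root):
    """Zip every file under `root` into an in-memory buffer.

    Entries are read straight from the mount and written into the
    archive; arcnames are relative to `root` so the zip unpacks clean.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, arcname=os.path.relpath(full, root))
    buf.seek(0)
    return buf
```

The endpoint can then hand the buffer to a streaming response with a `Content-Disposition: attachment` header.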
Add pytest suites covering:
- domino_download() streaming, domino_request() retries, auth headers
- StorageResolver: download_file(), _find_existing(), snapshots, grants, cache
- ensure_local_file(): local/remap/download branches, caching, fallbacks
- /export/deployment/zip: zip helpers and API endpoint integration

53 tests, all passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standalone script that tests the full dataset lifecycle against a live Domino environment: auth, create, grant, upload, snapshots, list files, download (probes endpoints), and cleanup. Confirms file download via API is not supported (404 on all endpoints) — files only accessible via mounts. Run: python scripts/test_domino_api_live.py [--keep] [--project-id ID] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematically probes 60+ endpoint patterns across v1/v2/v4 and alternative paths. All return 404 — Domino Dataset RW API has no file read/download capability. Files are only accessible via mount paths in Domino Jobs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… API

Domino Dataset RW API has no file read/download endpoints (confirmed by probing 60+ URL patterns). Files uploaded to the dataset are only accessible via mount paths in Domino Jobs, not in the app.

Fix: during upload, also write the file bytes to the local dataset_cache directory using the same cache key scheme that ensure_local_file() uses. When profiling or local training calls ensure_local_file(), it finds the cached copy immediately.

Also make ensure_local_file() gracefully handle download failures instead of propagating the exception, returning the original path for a clearer downstream error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete cached dataset files older than 24 hours during app startup to prevent indefinite accumulation on disk. Profiling and training happen within minutes of upload, so 24h gives ample margin. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
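The startup sweep above can be sketched as a single mtime pass over the cache directory. The function name is an assumption; the 24-hour cutoff is from the commit message:

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # 24h, per the commit message

def prune_cache(cache_dir, max_age=MAX_AGE_SECONDS, now=None):
    """Delete cached files older than `max_age` seconds; return their names.

    `now` is injectable for testing; defaults to the current time.
    """
    now = time.time() if now is None else now
    removed = []
    if not os.path.isdir(cache_dir):
        return removed
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(name)
    return removed
```

Since profiling and training run within minutes of upload, a 24-hour window never evicts a file that is still in use.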
…ore jobs
After uploading a file to a Domino Dataset via the chunked API, poll the
snapshot endpoint to confirm the file is committed before allowing the user
to proceed to profiling or training. This prevents Domino Jobs from failing
because the file hasn't materialized in the mount yet.
Backend: inline polling (~15s) during upload, plus a GET /verify-snapshot
endpoint for frontend fallback polling. Frontend: useSnapshotVerification
hook gates setDataSource/setSelectedFilePath behind verification, with
"Proceed Anyway" escape hatch on timeout.
Fixes: route ordering (verify-snapshot above /{dataset_id} catch-all),
setTimeout chain instead of setInterval to prevent overlapping async polls,
snapshot_file_path for consistent path matching, reduced inline backoff to
avoid proxy timeouts, and Proceed Anyway escape hatch in EDA page.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
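The inline ~15s poll described above is a deadline loop. A sketch, with `get_status` standing in for the Domino snapshot-status call (the real endpoint and payload are not shown in the commit) and the timing numbers taken from the message:

```python
import time

def wait_for_snapshot(get_status, timeout=15.0, interval=1.5,
                      sleep=time.sleep):
    """Poll until `get_status()` reports the snapshot committed.

    Returns True on success, False on timeout; the frontend then
    falls back to its own polling (or the "Proceed Anyway" hatch).
    `sleep` is injectable for testing.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A timeout here is not fatal: the upload still completed, so the GET /verify-snapshot endpoint lets the frontend keep checking after the response returns.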
…nds automl.db

When a training/EDA job runs in the target project, DOMINO_PROJECT_NAME points to that project, causing config.py to derive the wrong DB path. Pass the app's database_url as a --database-url CLI arg to the runners, which set it as an env var before any app imports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
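The runner-side half of that fix hinges on ordering: the env var must be exported before any app module is imported, since config.py reads it at import time. A sketch (the flag name follows the commit; the env var name and helper are assumptions):

```python
# Parse --database-url and export it BEFORE importing any app modules,
# so config.py sees the app project's DB path instead of deriving one
# from DOMINO_PROJECT_NAME in the target project.
import argparse
import os

def apply_database_url(argv):
    """Consume --database-url from argv, export it, return leftover args."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--database-url")
    args, remaining = parser.parse_known_args(argv)
    if args.database_url:
        os.environ["DATABASE_URL"] = args.database_url
    return remaining

# In a runner entrypoint (illustrative):
# if __name__ == "__main__":
#     rest = apply_database_url(sys.argv[1:])
#     from app import runner  # imported only after the env var is set
```

`parse_known_args` keeps the runner's other flags intact for its own parser.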
Covers arg parsing, env var injection, command building, and config preservation to verify the cross-project DATABASE_URL fix (ae7f6e2). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l DB read

When launching a Domino training job, serialize the complete job config to JSON and pass it as a CLI arg so the runner can skip the cross-project DB read on startup. Falls back to a DB read when --job-config is absent (local execution path). The DB is still used for writes (progress, status, results, logs, cancellation checks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Child jobs receive a sqlite database URL from the app project, but the target project may mount the shared dataset at a different path depending on whether it's DFS or git-based. Add project type detection via filesystem probes and remap the database URL to the correct mount path before setting DATABASE_URL in the environment. Also adds /mnt/imported/data/ to _MOUNT_ROOTS in utils.py so remap_shared_path() works for git-based target projects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
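The remap described above amounts to retrying the same dataset-relative suffix under each known mount root. A sketch with an injectable `exists` probe; the root list is an assumption apart from /mnt/imported/data/, which the commit names explicitly:

```python
import os

# Candidate mount roots. Only /mnt/imported/data/ is confirmed by the
# commit message; the others are illustrative placeholders.
MOUNT_ROOTS = ["/domino/datasets/local/", "/mnt/data/", "/mnt/imported/data/"]

def remap_shared_path(path, exists=os.path.exists):
    """If `path` is absent locally, probe the other mount roots for the
    same dataset-relative suffix; return the first hit, else `path`."""
    if exists(path):
        return path
    for root in MOUNT_ROOTS:
        if path.startswith(root):
            suffix = path[len(root):]
            for other in MOUNT_ROOTS:
                candidate = other + suffix
                if other != root and exists(candidate):
                    return candidate
    return path
```

Applying this to the sqlite URL's filesystem component before exporting DATABASE_URL lets the same child-job command work in both DFS and git-based target projects.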
…unch flows

Tests the full API request path with mocked Domino APIs to verify:
- Training job launch command includes --database-url and --job-config
- Async EDA launch command includes --database-url
- Runner DB URL remap resolves across git and DFS mount points
- remap_shared_path covers the /mnt/imported/data/ mount root

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add AUTOML_DEBUG_LOGGING (backend) and VITE_DEBUG_LOGGING (frontend) env vars for verbose request/response logging during Domino app debugging. Both default to false for production.

Backend: new DebugLoggingMiddleware logs method, URL, headers, body, and timing for every request when enabled. Frontend: new debug logger utility wraps API client calls with grouped console output, including timing.

Update both READMEs to reflect the current codebase structure: new compat/ directory, core services, serving layer, resolvers, scripts, and all missing env vars. Remove the dead test_dataset_grant.py script.
…t logic

Replace file-listing-based snapshot verification with a snapshot status check. The Domino Dataset RW API has no file-listing endpoint, so all list_files() calls returned 404. Now checks whether the latest snapshot status is "active", which correctly indicates the upload commit completed.

Remove _grant_project_access(): the grants API takes user IDs, not project IDs, and cross-project grants aren't needed since local training uses the cached file and Domino Jobs run in the target project.
- Training worker with data-path resolution, config deserialization, cancellation checks, and step-by-step progress tracking
- Job queue manager with concurrency limiting, recovery, and graceful shutdown
- Prediction service with pre-computed diagnostics and time series support
- Model export with Docker zip packaging
- Leaderboard normalization for time series timing keys
- Cross-project DB URL remap and dataset file resolution
- 11 new test modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- EDA Analysis page with tabular/time series mode toggle and manual trigger
- Job detail page with tab navigation (results/leaderboard/diagnostics)
- Dataset hooks and data source selector for Domino datasets
- Live job updates replacing simulated progress bars
- Interactive leaderboard with normalized timing keys
- Export dialog with Docker build command display
- Time series config panel for column selection
- Debug logger utility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Table.tsx, DataOverview.tsx, DataTable.tsx, InteractiveCharts.tsx (replaced by inline SVG charts)
- TimeSeriesForecastPanel.tsx (consolidated into the EDA flow)
- useJobProgress.ts (replaced by useJobLiveUpdates)
- pathDefaults.ts, eda/index.ts barrel export (no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sdk-with-api

# Conflicts:
#	automl-service/app/api/routes/health.py
#	automl-service/app/api/routes/jobs.py
#	automl-service/app/main.py
…ed client

Per Niole's review:
- Remove _remap_db_url_for_target (eliminating the imported-data pattern)
- Remove multi-base-URL fallback; use DOMINO_API_HOST directly
- Replace raw domino_request() calls with the generated public API client for job start (start_job) and status (get_job_details)
- Keep domino_request only for the v4 stop (no public API alternative)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	automl-service/app/api/routes/jobs.py
#	automl-service/app/main.py
# Conflicts:
#	automl-service/app/api/routes/health.py
#	automl-service/app/core/domino_job_launcher.py
…h-api

# Conflicts:
#	automl-service/app/api/routes/jobs.py
#	automl-service/app/core/domino_job_launcher.py
…ainers

# Conflicts:
#	automl-service/app/api/utils.py
#	automl-service/tests/test_api_utils.py
The comment claimed the sidecar token is used as fallback, but the code never falls back to the sidecar. Clarify that missing user tokens cause MissingUserTokenError. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…h-api

# Conflicts:
#	automl-service/app/api/utils.py
Resolve merge conflicts with Niole's PR #30 (job auth via authorized actions). Adopt get_viewing_user_name(), require_job_list(), and the dual-track get_job_or_404 (local owner check / Domino API check) while keeping our unique additions: zombie local-job detection, summary_only query optimization, build_job_list_item_response, and cleanup owner scoping.

Remove get_request_owner / _enforce_job_owner in favor of the PR #30 patterns. Add from __future__ import annotations to job_service.py for Python 3.9 compat with | union syntax.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Job service functions (get_job_response, cancel_job, delete_job, etc.) accept (db, job_id) only. The compat route registrations incorrectly had needs_request=True, causing patterns.py to pass request=request which triggered TypeError on /svcjobget, /svcjobcancel, etc.
Jobs with execution_target="domino_job" but no domino_job_id (pending submission or failed before launch) were raising a hard 500. Fall back to owner-based auth check instead.
Domino API returns 'Github' (capital G) but the generated enum only accepts 'github'. Add _missing_ hook for case-insensitive lookup with fallback to UNKNOWN for unrecognized values.
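The `_missing_` hook is a standard `enum.Enum` extension point: it is called when a value lookup fails, so it can normalize case and supply a fallback member. An illustrative enum (not the generated one):

```python
from enum import Enum

class RepoProvider(Enum):
    """Illustrative stand-in for the generated enum, showing the
    _missing_ hook for case-insensitive lookup with UNKNOWN fallback."""
    GITHUB = "github"
    GITLAB = "gitlab"
    UNKNOWN = "unknown"

    @classmethod
    def _missing_(cls, value):
        # Called only when the exact-value lookup fails, e.g. 'Github'.
        if isinstance(value, str):
            lowered = value.lower()
            for member in cls:
                if member.value == lowered:
                    return member
        # Unrecognized providers degrade gracefully instead of raising.
        return cls.UNKNOWN
```

With this hook, `RepoProvider('Github')` resolves to the `github` member instead of raising ValueError, which is exactly the mismatch the Domino API response was triggering.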
The generated ProjectEnvelopeV1 parser fails on enum values like 'Github' due to stale .pyc caches ignoring the _missing_ hook fix. Since we only need name and ownerUsername, parse the raw JSON directly to avoid the generated model deserialization entirely.
onTerminal callback was an inline arrow function in the useEffect dependency array, causing the effect to re-run on every render. Use a ref for the callback instead so the WebSocket only reconnects when enabled or wsUrl actually change.
The Domino project API wraps data in {"project": {...}}. Read name/ownerUsername from the nested project object, not the top level.
job.started_at is timezone-naive from the DB but utc_now() is timezone-aware. Add tzinfo to started_at before subtraction.
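Subtracting an aware datetime from a naive one raises TypeError, which is the bug this commit fixes. A sketch of the pattern, with the function name assumed; the fix itself (attach UTC tzinfo to the naive DB value before subtracting) follows the commit message:

```python
from datetime import datetime, timezone

def elapsed_seconds(started_at, now=None):
    """Seconds since `started_at`, tolerating naive DB timestamps.

    DB rows come back timezone-naive; `now` is timezone-aware, so we
    attach UTC tzinfo to the naive value before subtracting.
    """
    now = now or datetime.now(timezone.utc)
    if started_at.tzinfo is None:
        started_at = started_at.replace(tzinfo=timezone.utc)
    return (now - started_at).total_seconds()
```

This assumes DB timestamps were written in UTC; if they were not, `replace(tzinfo=...)` would silently shift the meaning rather than convert it.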
The WebSocket only sent the initial job state, then waited for pings. Domino jobs that complete externally never pushed status updates. Now a background task polls every 5s, syncs Domino job state, and pushes updates when status changes. Stops on terminal status or client disconnect.
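The poll-and-push loop can be sketched as a small coroutine. `get_status` and `send` stand in for the Domino job sync and the WebSocket send; the 5-second interval and the terminal-status stop condition come from the commit message, while the status names are assumptions:

```python
import asyncio

async def push_status_updates(get_status, send, interval=5.0,
                              terminal=("SUCCEEDED", "FAILED", "STOPPED")):
    """Poll job status every `interval` seconds; push only on change.

    Stops (returning the final status) once a terminal status is seen.
    Client-disconnect handling would wrap this loop in the real handler.
    """
    last = None
    while True:
        status = await get_status()
        if status != last:
            await send(status)
            last = status
        if status in terminal:
            return status
        await asyncio.sleep(interval)
```

Pushing only on change keeps the socket quiet for long-running jobs while still delivering externally driven transitions within one poll interval.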
The WebSocket handler has no HTTP middleware context, so auth headers and user context are unavailable. Replace get_job_progress (which requires auth via get_job_or_404) with direct crud.get_job reads and inline status serialization.
Summary
ensure_local_file

Dependencies
Depends on PRs #24, #25, #26, #19, #20 (pr1-4, pr17, pr16a, pr16b, pr6, pr7)
Test plan
ensure_local_file downloads from the dataset API when the file is not mounted locally