[DOM-75523] feat: cross-project workers, pre-computed diagnostics, trainer improvements#22
Closed
ddl-subir-m wants to merge 145 commits into main from
Conversation
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
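The project-type detection above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the marker directories checked (`mnt/code` for git-based projects, `mnt` for DFS) and the function name `detect_project_type` are assumptions.

```python
# Hypothetical sketch of filesystem-based project-type detection.
# The marker paths below are assumptions, not the service's real logic.
from enum import Enum
from pathlib import Path


class DominoProjectType(Enum):
    DFS = "dfs"
    GIT = "git"
    UNKNOWN = "unknown"


def detect_project_type(root: str) -> DominoProjectType:
    """Classify a project mount by which marker directories exist."""
    base = Path(root)
    if (base / "mnt" / "code").is_dir():   # git-based projects mount code separately
        return DominoProjectType.GIT
    if (base / "mnt").is_dir():            # classic DFS projects mount under mnt/
        return DominoProjectType.DFS
    return DominoProjectType.UNKNOWN
```

Checking the more specific layout first means a git-based project (whose `mnt/` also exists) is not misclassified as DFS.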
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aming download and debug middleware

- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ancements

- Add EDAResult model for DB-backed EDA storage (replaces file-based store)
- Add diagnostics_data JSON column to Job model
- Add summary_only query optimization with load_only() for job listing
- Add owner filtering on get_registered_models and get_jobs_for_cleanup
- Add EDA CRUD operations (create/get/update/write result/error/delete stale)
- Rewrite eda_job_store from file-backed to DB-backed (async)
- Add WebSocket broadcast helpers for job/log updates in CRUD layer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…into subir/pr8-dataset-manager
… into subir/pr8-dataset-manager
…fication

- Rewrite dataset_manager to prefer Domino Dataset RW API over filesystem scan
- Cross-project mount safety: only resolve local paths for same-project datasets
- Upload route: direct to Domino dataset via storage_resolver chunked API
- New verify-snapshot endpoint for polling snapshot status after upload
- New download-dataset-file endpoint
- Preview route: ensure_local_file for cross-project files
- Replace pandas with tabular_data helpers
- Compat routes: project-scoped listing, svcdataset, svcverifysnapshot
…into subir/pr5-dataset-api-storage
- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project, chunked upload via v4 API, streaming download, snapshot file listing, mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils
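The mount-first, download-fallback behavior of `ensure_local_file` described above could look roughly like this. It is a sketch under assumptions: the real function's signature is not shown in this PR, and `download_from_dataset_api` here is a hypothetical stand-in for the dataset API call.

```python
# Minimal sketch of mount-first file resolution with a dataset-API fallback.
# `download_from_dataset_api` is a hypothetical stand-in, injected as a callable.
from pathlib import Path
from typing import Callable


def ensure_local_file(path: str, cache_dir: str,
                      download_from_dataset_api: Callable[[str, Path], None]) -> Path:
    """Return a readable local path, downloading into the cache when the
    file is not present on a local mount (cross-project access)."""
    local = Path(path)
    if local.is_file():                 # same-project: file is on the local mount
        return local
    cached = Path(cache_dir) / local.name
    if not cached.is_file():            # cross-project: fetch via dataset API once
        cached.parent.mkdir(parents=True, exist_ok=True)
        download_from_dataset_api(path, cached)
    return cached
```

Caching by filename keeps repeat profiling/training runs from re-downloading the same file.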
…into subir/pr6-job-launcher
- Switch job launch from POST /v4/jobs/start to POST /api/jobs/v1/jobs
- Switch job status from GET /v4/jobs/{id} to GET /api/jobs/beta/jobs/{id}
- Remove _resolve_hardware_tier_id (v1 API accepts tier name directly)
- Add _job_api_request with direct-host-first fallback
- Add _remap_db_url_for_target for cross-project database paths
- Pass database_url and job_config as CLI args to workers
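The cross-project DB URL remap in the list above can be illustrated with a short sketch. The helper name, URL scheme, and mount paths below are assumptions; the actual `_remap_db_url_for_target` may handle more cases.

```python
# Hypothetical sketch of cross-project SQLite URL remapping: rewrite the
# App's mount prefix to the worker's mount prefix so the job finds the DB.
def remap_db_url_for_target(db_url: str, source_mount: str, target_mount: str) -> str:
    """Rewrite a sqlite URL so the worker resolves the DB under its own mount."""
    prefix = "sqlite+aiosqlite:///"
    if not db_url.startswith(prefix):
        return db_url                       # only sqlite file paths need remapping
    path = db_url[len(prefix):]
    if path.startswith(source_mount):
        path = target_mount + path[len(source_mount):]
    return prefix + path
```

Non-sqlite URLs (e.g. a shared Postgres) pass through untouched, since a network database is reachable from any project.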
Security fix: all job endpoints now enforce ownership.

- Add _enforce_job_owner() — returns 404 to non-owners (avoids leaking job existence)
- Invert get_request_owner() priority: prefer domino-username header over sidecar (sidecar returns App owner, not viewing user)
- resolve_job_list_filters() ignores client-supplied owner (always server-side)
- Add needs_request flag in compat patterns for Request forwarding
- Owner filtering on cleanup, registered models
- Add JobListItemResponse lightweight schema for list views
- Add clear_viewing_user() to prevent cross-request context leakage
- Background Domino sync throttling, zombie local job detection
- Add leaderboard_utils for normalize_leaderboard_payload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stop the sidecar token from overwriting the user's forwarded JWT. When a user token is present (from the Extension-injected Authorization header), outbound calls to datasetrw, jobs, and model registry now run as the visiting user instead of the App owner. The sidecar token is only used as fallback for background tasks and health checks.
…/pr5-dataset-api-storage
…/pr6-job-launcher
…nto subir/pr8-dataset-manager
Addresses review comment: use the actual header name x-domino-api-key instead of the incorrect domino-api-key.
…/pr5-dataset-api-storage
…/pr6-job-launcher
…nto subir/pr8-dataset-manager
Remove use_api_key parameter and _get_api_key() helper. All downloads now use the standard auth chain which preserves the user's forwarded token and falls back to sidecar when needed.
Introduces the complete dataset viewing pipeline:

- storage_resolver.py: per-project dataset lifecycle, mount probing, snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback, cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot, and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from storage_resolver.py — they will be re-introduced in the PRs that consume them (PR #26 for upload, PR #22 for download/provisioning). Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files, _find_existing, _probe_mount, caching infrastructure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user is working in. Falling back to it silently operates on the wrong project (root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires or invalidates. Stale data is worse than a redundant API call. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable on any target deployment. Ryan's PR uses v2 only and it works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses /svcdatasetpreview which returns JSON rows, not a raw file download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the schema minimal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the remaining dataset endpoints on top of the viewing layer:
- POST /upload: chunked upload to Domino dataset with snapshot verification
- GET /verify-snapshot: poll snapshot status (upload completion check)
- GET /{id}/preview: paginated data preview (CSV/Parquet)
- GET /{id}/schema: column names and dtypes
- POST /preview: preview by file path with pagination
dataset_service.py: upload orchestration, preview pagination,
schema resolution, compat payload builders.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
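The paginated-preview endpoints above can be illustrated with a small helper. This is a stdlib `csv` sketch; the real service reportedly routes previews through the `tabular_data` module and also handles Parquet, which is omitted here.

```python
# Illustrative paginated CSV preview: one page of rows plus the schema.
# A simplified stand-in for the tabular_data-backed preview endpoints.
import csv
import io


def preview_csv(text: str, offset: int = 0, limit: int = 50) -> dict:
    """Return column names, one page of rows, and the total row count."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])           # first line is the schema
    rows = list(reader)
    page = rows[offset:offset + limit]  # offset/limit pagination
    return {"columns": header, "rows": page, "total": len(rows)}
```

Returning the total alongside the page lets the frontend render pagination controls without a second request.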
…esolver

Extends storage_resolver with methods consumed by the upload and snapshot verification endpoints:

- ensure_dataset_exists, _resolve_or_create, _create_dataset
- upload_file, _upload_chunks
- delete_snapshot_files, _get_latest_snapshot_id
- _dataset_rw_write_request, _format_http_error
- get_dataset_info, delete_dataset

Download and provisioning methods (download_file, download_directory, ensure_project_storage, resolve_project_paths, check_project_storage, invalidate) will be added in PR #22 where they are consumed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add max_upload_size_mb setting (env: AUTOML_MAX_UPLOAD_SIZE_MB, default 550) and enforce it in the upload route before processing. Returns HTTP 413 with a clear message when exceeded. Validation (filename, extension, size) is now done once before branching into the Domino vs standalone upload paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
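The pre-flight validation described above might look like this. The setting name and the 550 MB default come from the commit message; the function shape, the allowed extensions, and the exception names are illustrative.

```python
# Illustrative upload validation done once, before branching into the
# Domino vs standalone upload paths. Exception classes stand in for
# HTTP 400/413 responses.
MAX_UPLOAD_SIZE_MB = 550   # AUTOML_MAX_UPLOAD_SIZE_MB default per the commit


class PayloadTooLarge(Exception):
    """Stand-in for an HTTP 413 response."""


def validate_upload(filename: str, size_bytes: int,
                    allowed_exts: tuple = (".csv", ".parquet")) -> None:
    """Reject bad filenames/extensions and oversize payloads up front."""
    if not filename or not filename.lower().endswith(allowed_exts):
        raise ValueError(f"unsupported file type: {filename!r}")
    if size_bytes > MAX_UPLOAD_SIZE_MB * 1024 * 1024:
        raise PayloadTooLarge(f"file exceeds {MAX_UPLOAD_SIZE_MB} MB upload limit")
```

Validating before any chunked upload begins avoids streaming hundreds of megabytes only to reject the file at the end.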
Add cleanup_dataset_cache() to utils.py — walks the dataset_cache directory and removes files older than the configured TTL. Empty directories are pruned afterward. Configurable via AUTOML_DATASET_CACHE_TTL_HOURS (default: 24). Called at app startup in main.py (PR #22). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
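The TTL sweep described above could be sketched like this. The 24-hour default comes from the commit message; the walk order and return value are assumptions.

```python
# Minimal sketch of a TTL-based cache sweep: delete files older than the
# TTL, then prune directories that the deletions left empty.
import os
import time


def cleanup_dataset_cache(cache_dir: str, ttl_hours: float = 24.0) -> int:
    """Remove cached files older than the TTL; return the number removed."""
    cutoff = time.time() - ttl_hours * 3600
    removed = 0
    for root, dirs, files in os.walk(cache_dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:   # stale: past the TTL
                os.remove(path)
                removed += 1
        if root != cache_dir and not os.listdir(root):
            os.rmdir(root)                        # prune now-empty subdirectory
    return removed
```

Walking bottom-up (`topdown=False`) matters: a subdirectory is only empty, and thus prunable, after its own files have been deleted.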
DOMINO_ENVIRONMENT_ID and DOMINO_ENVIRONMENT_REVISION_ID are set on the App container and identify the compute environment with the right dependencies. Using env vars eliminates per-caller plumbing and ensures child jobs always match the App's environment. Removes environment_id param from _job_start, start_training_job, and start_eda_job. Adds environmentRevisionId to job payload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve the training data path at job creation time and pass it as --file-path to the Domino Job command. The worker uses the path directly instead of needing dataset API access at runtime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
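Taken together with the earlier job-launcher commits, the worker's CLI surface might be parsed like this. The flag names come from the commit messages; the parser itself is a sketch.

```python
# Sketch of the worker CLI described in the commits: --database-url,
# --job-config (JSON), and the pre-resolved --file-path training data.
import argparse
import json


def parse_worker_args(argv: list) -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Domino training/EDA worker")
    p.add_argument("--database-url", required=True,
                   help="DB URL, already remapped for this project's mounts")
    p.add_argument("--job-config", type=json.loads, default={},
                   help="JSON-encoded job configuration")
    p.add_argument("--file-path",
                   help="training data path resolved at job creation time")
    return p.parse_args(argv)
```

Passing the resolved path on the command line means the worker never needs dataset API access at runtime, as the commit notes.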
Query params are the canonical approach going forward. The X-Project-Id header is kept as a fallback for legacy clients only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…et-view

# Conflicts:
#	automl-service/app/api/utils.py
#	automl-service/tests/test_api_utils.py
The frontend sends both header and query param from the same source. No scenario where header is present but query param isn't. Query param only — simpler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
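The resulting resolution order is simple enough to sketch in one function. Names are illustrative; the source only establishes that the query param is canonical and the `X-Project-Id` header survives as a legacy fallback (with no env-var fallback, per the earlier commit).

```python
# Sketch of query-param-first project resolution with the legacy
# X-Project-Id header as the only fallback; no env-var fallback.
def resolve_request_project_id(query_params: dict, headers: dict):
    """Return the target project ID, or None if the client sent neither."""
    return query_params.get("project_id") or headers.get("x-project-id")
```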
Why
Cross-project workers: When a Domino training or EDA job runs in a different project than the App, the worker can't find its database, training data, or model output paths. Workers now accept `--database-url` and `--job-config` as CLI args, resolve training data via a mount fallback chain (local mount → dataset download), and write outputs to the per-project dataset via storage_resolver.

Pre-computed diagnostics: Model diagnostics (feature importance, learning curves, confusion matrix) were computed on demand when users viewed the diagnostics tab. For cross-project models this required downloading model files first. Now diagnostics are computed during training and stored in `diagnostics_data`; the predictions endpoint serves from stored data first, falling back to live computation only when needed.

TimeSeries validation: Training could fail midway on bad input (duplicate timestamps, prediction_length > series length, invalid covariates). Upfront validation now catches these before starting the expensive AutoGluon fit.

Profiling performance: CPU-bound profiling blocked the event loop. Wrapping it in `asyncio.to_thread()` keeps the API responsive during long-running profiles.

Depends on
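The profiling change can be sketched with `asyncio.to_thread()` directly; `profile_dataframe` below is a hypothetical stand-in for the real CPU-bound profiler.

```python
# Sketch of moving CPU-bound work off the event loop with asyncio.to_thread().
import asyncio


def profile_dataframe(rows: list) -> dict:
    """Stand-in for expensive, CPU-bound profiling work."""
    return {"rows": len(rows), "columns": sorted(rows[0]) if rows else []}


async def profile_endpoint(rows: list) -> dict:
    # Runs in a worker thread, so the event loop keeps serving other requests.
    return await asyncio.to_thread(profile_dataframe, rows)
```

The endpoint's signature stays async and unchanged for callers; only the scheduling of the heavy work moves off the loop.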
Summary
Workers
- `domino_training_runner.py` — `--database-url`, `--job-config` CLI args, DB URL remap
- `domino_eda_runner.py` — `--database-url`, `--project-id`, async, DB-backed EDA store, `ensure_local_file`
- `training_worker.py` — `_resolve_training_data_path` with mount fallback, `compute_all_diagnostics`, feature column filtering, project-scoped model/temp paths

Trainers
- `timeseries.py` — prediction_length validation, timestamp dedup, covariate validation, frequency detection (QS, YS), leaderboard normalization
- `autogluon_runner.py` — `feature_columns`, `models_path`, `temp_path` params
- `data_profiler.py` / `ts_profiler.py` — optimized column loading, sampling, ACF/decomposition fixes

Diagnostics/Export/Predictions
- `model_diagnostics.py` — `compute_all_diagnostics()`, leaderboard normalization
- `model_export.py` — in-memory `generate_deployment_files()`
- `prediction_service.py` — `_format_forecast_predictions()`, serve from stored diagnostics
- `export.py` route — `export_deployment_zip`, `_ensure_local_model`
- `predictions.py` route — serve diagnostics from DB first
- `profiling.py` route — `asyncio.to_thread()`, `ensure_local_file`, owner-scoped, async EDA store

Other
- `cleanup_service.py` — dataset file deletion via storage_resolver, owner filtering
- `job_queue.py` — fail stuck Domino jobs (>1hr) on startup
- `health.py` — `/storage` endpoint
- `registry.py` — minor
- `domino_model_api.py` — removed dead `_get_ephemeral_token`
- `domino_registry.py` — `temp_path` param
- `experiment_tracker.py` — `temp_dir` for artifacts

Files changed
25 source files, 9 test files, 2 READMEs
Test plan
- `test_training_worker.py` passes
- `test_model_diagnostics.py` passes
- `test_export_zip.py` passes
- `test_prediction_service.py` passes
- `test_model_export.py` passes
- `test_job_config_passthrough.py` passes
- `test_api_predictions.py` passes
- `test_job_service_extras.py` passes