[DOM-75523] feat: cross-project workers, pre-computed diagnostics, trainer improvements#22

Closed
ddl-subir-m wants to merge 145 commits into main from subir/pr9-workers-trainers

@ddl-subir-m
Collaborator

Why

Cross-project workers: When a Domino training or EDA job runs in a different project than the App, the worker can't find its database, training data, or model output paths. Workers now accept --database-url and --job-config as CLI args, resolve training data via a mount fallback chain (local mount → dataset download), and write outputs to the per-project dataset via storage_resolver.
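
A minimal sketch of the two pieces described above: parsing the new CLI flags and walking a mount fallback chain. Only the flag names `--database-url` and `--job-config` come from this PR. The function names, the JSON encoding of the job config, and the `download` callback are assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path
from typing import Callable, Optional, Sequence


def parse_worker_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two flags named in the PR description (hypothetical entry point)."""
    parser = argparse.ArgumentParser(description="Hypothetical training worker")
    parser.add_argument("--database-url", required=True, help="DB URL passed in by the App")
    # Assumption: the job config is passed as a JSON string.
    parser.add_argument("--job-config", type=json.loads, default={}, help="Job config as JSON")
    return parser.parse_args(argv)


def resolve_training_data(
    mount_candidates: Sequence[Path],
    download: Callable[[], Path],
) -> Path:
    """Return the first candidate that exists on a local mount, else download."""
    for candidate in mount_candidates:
        if candidate.is_file():
            return candidate
    # No mount had the file, so fall back to fetching it (e.g. from the dataset API).
    return download()
```

Keeping the download step behind a callback means the mount probing can be tested without network access.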

Pre-computed diagnostics: Model diagnostics (feature importance, learning curves, confusion matrix) were computed on-demand when users viewed the diagnostics tab. For cross-project models this required downloading model files first. Now diagnostics are computed during training and stored in diagnostics_data — the predictions endpoint serves from stored data first, falling back to live computation only when needed.
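
The stored-first lookup can be sketched roughly as follows. The `diagnostics_data` field name is from this PR. Representing the job as a dict, the `compute_live` callback, and the `source` key are assumptions for illustration only.

```python
from typing import Any, Callable, Dict, Optional


def get_diagnostics(
    job: Dict[str, Any],
    compute_live: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Serve stored diagnostics when present, otherwise compute them live."""
    stored: Optional[Dict[str, Any]] = job.get("diagnostics_data")
    if stored:
        # Fast path: no model files need to be loaded or downloaded.
        return {"source": "stored", **stored}
    # Fallback for jobs trained before diagnostics were persisted.
    return {"source": "live", **compute_live(job)}
```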

TimeSeries validation: Training could fail mid-way on bad input (duplicate timestamps, prediction_length > series length, invalid covariates). Adding upfront validation catches these before starting the expensive AutoGluon fit.
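
As an illustration of the upfront checks, here is a standard-library sketch covering duplicate timestamps and `prediction_length` versus series length. It is not the code in `timeseries.py`: the function name, the `(item_id, timestamp)` row shape, and the rule that a series must be strictly longer than the horizon are assumptions, and covariate validation is left out.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def validate_timeseries(
    rows: Iterable[Tuple[str, datetime]],
    prediction_length: int,
) -> List[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    errors: List[str] = []
    if prediction_length < 1:
        errors.append("prediction_length must be at least 1")

    # Group timestamps per series so each series is checked independently.
    timestamps_by_series: Dict[str, List[datetime]] = defaultdict(list)
    for item_id, ts in rows:
        timestamps_by_series[item_id].append(ts)

    for item_id, stamps in timestamps_by_series.items():
        duplicates = [ts for ts, n in Counter(stamps).items() if n > 1]
        if duplicates:
            errors.append(f"series {item_id!r} has {len(duplicates)} duplicated timestamp(s)")
        unique_points = len(set(stamps))
        # Assumption: a series must be strictly longer than the forecast horizon.
        if unique_points <= prediction_length:
            errors.append(
                f"series {item_id!r} has {unique_points} points, "
                f"need more than {prediction_length}"
            )
    return errors
```

Returning every problem at once, rather than raising on the first, lets the caller show the user a complete list before any fit is started.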

Profiling performance: CPU-bound profiling blocked the event loop. Wrapping in asyncio.to_thread() keeps the API responsive during long-running profiles.
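
The pattern looks roughly like this. `asyncio.to_thread()` is the real standard-library call (Python 3.9+), while `profile_dataset`, `profile_endpoint`, and the heartbeat task are hypothetical stand-ins used to show that the loop keeps running.

```python
import asyncio
import time


def profile_dataset(n_rows: int) -> dict:
    """Stand-in for blocking profiling work (hypothetical)."""
    time.sleep(0.05)  # simulates work that would otherwise block the loop
    return {"rows": n_rows}


async def profile_endpoint(n_rows: int) -> dict:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(profile_dataset, n_rows)


async def main() -> tuple:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets to run while profiling is in flight.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    beat = asyncio.create_task(heartbeat())
    result = await profile_endpoint(1000)
    beat.cancel()
    return result, ticks
```

Calling `asyncio.run(main())` returns the profile result together with a tick count greater than one, showing that other coroutines ran during the profile. A thread only helps while the work releases the GIL (I/O, sleeps, and much of NumPy and pandas). Pure-Python CPU loops would still contend with the event loop.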

Depends on

Summary

Workers

  • domino_training_runner.py: --database-url, --job-config CLI args, DB URL remap
  • domino_eda_runner.py: --database-url, --project-id, async, DB-backed EDA store, ensure_local_file
  • training_worker.py: _resolve_training_data_path with mount fallback, compute_all_diagnostics, feature column filtering, project-scoped model/temp paths

Trainers

  • timeseries.py: prediction_length validation, timestamp dedup, covariate validation, frequency detection (QS, YS), leaderboard normalization
  • autogluon_runner.py: feature_columns, models_path, temp_path params
  • data_profiler.py / ts_profiler.py: optimized column loading, sampling, ACF/decomposition fixes

Diagnostics/Export/Predictions

  • model_diagnostics.py: compute_all_diagnostics(), leaderboard normalization
  • model_export.py: in-memory generate_deployment_files()
  • prediction_service.py: _format_forecast_predictions(), serve from stored diagnostics
  • export.py route: export_deployment_zip, _ensure_local_model
  • predictions.py route: serve diagnostics from DB first
  • profiling.py route: asyncio.to_thread(), ensure_local_file, owner-scoped, async EDA store

Other

  • cleanup_service.py: dataset file deletion via storage_resolver, owner filtering
  • job_queue.py: fail stuck Domino jobs (>1hr) on startup
  • health.py: /storage endpoint
  • registry.py: minor changes
  • domino_model_api.py: removed dead _get_ephemeral_token
  • domino_registry.py: temp_path param
  • experiment_tracker.py: temp_dir for artifacts
  • READMEs updated
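
For the job_queue.py item in the list above, a startup sweep for stuck jobs might look like the sketch below. Only the one-hour threshold comes from this PR. The dict-based job records, the field names, and measuring age from `started_at` are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

STUCK_AFTER = timedelta(hours=1)  # threshold named in the summary


def fail_stuck_jobs(jobs: List[Dict], now: Optional[datetime] = None) -> List[str]:
    """Mark running jobs older than STUCK_AFTER as failed and return their ids."""
    now = now or datetime.now(timezone.utc)
    failed: List[str] = []
    for job in jobs:
        # Assumption: age is measured from the job's recorded start time.
        if job["status"] == "running" and now - job["started_at"] > STUCK_AFTER:
            job["status"] = "failed"
            job["error"] = "Job exceeded 1 hour without completing"
            failed.append(job["id"])
    return failed
```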

Files changed

25 source files, 9 test files, 2 READMEs

Test plan

  • test_training_worker.py passes
  • test_model_diagnostics.py passes
  • test_export_zip.py passes
  • test_prediction_service.py passes
  • test_model_export.py passes
  • test_job_config_passthrough.py passes
  • test_api_predictions.py passes
  • test_job_service_extras.py passes
  • TimeSeries training validates prediction_length before starting
  • Diagnostics available immediately after training completes
  • Cross-project EDA profiling works via ensure_local_file

ddl-subir-m and others added 14 commits March 20, 2026 09:44
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lver

- Add normalize_leaderboard_rows/payload to fix TimeSeries fit_time display
- Add resolve_request_project_id() to centralize project context extraction
  from X-Project-Id header, query params, and DOMINO_PROJECT_ID env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aming download and debug middleware

- Add params, files, headers, base_url parameters to domino_request()
- Add domino_download() for streaming file downloads from Domino APIs
- Add resolve_domino_nucleus_host() for direct nucleus-frontend access
- Add _get_api_key() helper for X-Domino-Api-Key auth
- Add DebugLoggingMiddleware (opt-in via AUTOML_DEBUG_LOGGING=true)
- Use fresh httpx client per request to avoid proxy idle disconnects
- Add debug_logging setting to config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ancements

- Add EDAResult model for DB-backed EDA storage (replaces file-based store)
- Add diagnostics_data JSON column to Job model
- Add summary_only query optimization with load_only() for job listing
- Add owner filtering on get_registered_models and get_jobs_for_cleanup
- Add EDA CRUD operations (create/get/update/write result/error/delete stale)
- Rewrite eda_job_store from file-backed to DB-backed (async)
- Add WebSocket broadcast helpers for job/log updates in CRUD layer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fication

- Rewrite dataset_manager to prefer Domino Dataset RW API over filesystem scan
- Cross-project mount safety: only resolve local paths for same-project datasets
- Upload route: direct to Domino dataset via storage_resolver chunked API
- New verify-snapshot endpoint for polling snapshot status after upload
- New download-dataset-file endpoint
- Preview route: ensure_local_file for cross-project files
- Replace pandas with tabular_data helpers
- Compat routes: project-scoped listing, svcdataset, svcverifysnapshot
- Add domino_dataset_api: Dataset RW v2/v1 listing with pagination and fallback
- Add storage_resolver: auto-create automl-extension dataset per project,
  chunked upload via v4 API, streaming download, snapshot file listing,
  mount path probing across DFS/git layouts
- Add ensure_local_file: downloads from dataset API when file not on local
  mount (enables cross-project file access for profiling and training)
- Add cleanup_dataset_cache and extract_dataset_relative_path utils
- Switch job launch from POST /v4/jobs/start to POST /api/jobs/v1/jobs
- Switch job status from GET /v4/jobs/{id} to GET /api/jobs/beta/jobs/{id}
- Remove _resolve_hardware_tier_id (v1 API accepts tier name directly)
- Add _job_api_request with direct-host-first fallback
- Add _remap_db_url_for_target for cross-project database paths
- Pass database_url and job_config as CLI args to workers

Security fix: all job endpoints now enforce ownership.

- Add _enforce_job_owner() — returns 404 to non-owners (avoids leaking job existence)
- Invert get_request_owner() priority: prefer domino-username header over sidecar
  (sidecar returns App owner, not viewing user)
- resolve_job_list_filters() ignores client-supplied owner (always server-side)
- Add needs_request flag in compat patterns for Request forwarding
- Owner filtering on cleanup, registered models
- Add JobListItemResponse lightweight schema for list views
- Add clear_viewing_user() to prevent cross-request context leakage
- Background Domino sync throttling, zombie local job detection
- Add leaderboard_utils for normalize_leaderboard_payload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
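
A framework-free sketch of the ownership rule described in this commit. The `domino-username` header and the 404-for-non-owners behaviour are from the commit message. The `NotFound` exception, the function signatures, and the dict-based job are assumptions standing in for the real route code.

```python
from typing import Mapping, Optional


class NotFound(Exception):
    """Stand-in for an HTTP 404 response."""

    status_code = 404


def get_request_owner(headers: Mapping[str, str], sidecar_owner: Optional[str]) -> Optional[str]:
    # The forwarded header identifies the viewing user; the sidecar reports the App owner.
    return headers.get("domino-username") or sidecar_owner


def enforce_job_owner(job: Mapping[str, str], requester: Optional[str]) -> None:
    # Raise the same error as for a missing job, so non-owners cannot
    # tell whether the job exists.
    if requester is None or job.get("owner") != requester:
        raise NotFound("Job not found")
```

Returning 403 instead would confirm to a non-owner that the job id is valid, which is the leak the commit message calls out.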
ddl-subir-m requested review from a team and niole on March 20, 2026 17:44
Stop the sidecar token from overwriting the user's forwarded JWT.
When a user token is present (from the Extension-injected Authorization
header), outbound calls to datasetrw, jobs, and model registry now run
as the visiting user instead of the App owner. The sidecar token is
only used as fallback for background tasks and health checks.

Addresses review comment: use the actual header name x-domino-api-key
instead of the incorrect domino-api-key.

Remove use_api_key parameter and _get_api_key() helper. All downloads
now use the standard auth chain which preserves the user's forwarded
token and falls back to sidecar when needed.
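
The token precedence described in these commits can be sketched as below. The function name and the `Bearer` scheme are assumptions. Only the ordering (user token first, sidecar token as fallback) is taken from the commit messages.

```python
from typing import Dict, Optional


def build_auth_headers(user_token: Optional[str], sidecar_token: Optional[str]) -> Dict[str, str]:
    """Prefer the visiting user's forwarded token; fall back to the sidecar token."""
    token = user_token or sidecar_token
    if not token:
        raise RuntimeError("No credentials available for outbound request")
    # Assumption: both tokens are sent as Bearer credentials.
    return {"Authorization": f"Bearer {token}"}
```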
ddl-subir-m and others added 29 commits March 23, 2026 23:57
Introduces the complete dataset viewing pipeline:
- storage_resolver.py: per-project dataset lifecycle, mount probing,
  snapshot management, file operations, caching
- dataset_manager.py: API-first listing with local fallback,
  cross-project support, snapshot-based file listing
- routes/datasets.py: GET endpoints for list, detail, verify-snapshot,
  and file download
- dataset_service.py: listing orchestration and mount filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove upload, download, provisioning, and deletion methods from
storage_resolver.py — they will be re-introduced in the PRs that
consume them (PR #26 for upload, PR #22 for download/provisioning).

Keeps: snapshot listing/status, RW ID resolution, list_snapshot_files,
_find_existing, _probe_mount, caching infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject_id

The env var is the App's own project, not the target project the user
is working in. Falling back to it silently operates on the wrong project
(root cause of datasets showing empty in cross-project scenarios).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache is a process-lifetime dict on a singleton — it never expires
or invalidates. Stale data is worse than a redundant API call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The v2→v1 fallback was speculative — no evidence that v2 is unreliable
on any target deployment. Ryan's PR uses v2 only and it works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No frontend caller exists for this endpoint. Dataset preview uses
/svcdatasetpreview which returns JSON rows, not a raw file download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Speculative field — no frontend code checks it. Remove to keep the
schema minimal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the remaining dataset endpoints on top of the viewing layer:
- POST /upload: chunked upload to Domino dataset with snapshot verification
- GET /verify-snapshot: poll snapshot status (upload completion check)
- GET /{id}/preview: paginated data preview (CSV/Parquet)
- GET /{id}/schema: column names and dtypes
- POST /preview: preview by file path with pagination

dataset_service.py: upload orchestration, preview pagination,
schema resolution, compat payload builders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…esolver

Extends storage_resolver with methods consumed by the upload and
snapshot verification endpoints:
- ensure_dataset_exists, _resolve_or_create, _create_dataset
- upload_file, _upload_chunks
- delete_snapshot_files, _get_latest_snapshot_id
- _dataset_rw_write_request, _format_http_error
- get_dataset_info, delete_dataset

Download and provisioning methods (download_file, download_directory,
ensure_project_storage, resolve_project_paths, check_project_storage,
invalidate) will be added in PR #22 where they are consumed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add max_upload_size_mb setting (env: AUTOML_MAX_UPLOAD_SIZE_MB,
default 550) and enforce it in the upload route before processing.
Returns HTTP 413 with a clear message when exceeded.

Validation (filename, extension, size) is now done once before
branching into the Domino vs standalone upload paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
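
A sketch of the size check, assuming the limit is read from the environment at call time and that one megabyte means 1024 * 1024 bytes. The env var name, the 550 default, and the 413 status are from the commit message. `PayloadTooLarge` and the function names are illustrative.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_UPLOAD_MB = 550  # default named in the commit message


class PayloadTooLarge(Exception):
    """Stand-in for an HTTP 413 response."""

    status_code = 413


def max_upload_bytes(env: Optional[Mapping[str, str]] = None) -> int:
    env = os.environ if env is None else env
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    return int(env.get("AUTOML_MAX_UPLOAD_SIZE_MB", DEFAULT_MAX_UPLOAD_MB)) * 1024 * 1024


def check_upload_size(size_bytes: int, env: Optional[Mapping[str, str]] = None) -> None:
    """Reject oversized uploads before any processing starts."""
    limit = max_upload_bytes(env)
    if size_bytes > limit:
        raise PayloadTooLarge(f"File is {size_bytes} bytes; the limit is {limit} bytes")
```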
Add cleanup_dataset_cache() to utils.py — walks the dataset_cache
directory and removes files older than the configured TTL. Empty
directories are pruned afterward.

Configurable via AUTOML_DATASET_CACHE_TTL_HOURS (default: 24).
Called at app startup in main.py (PR #22).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
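
One way the TTL sweep could be written, as a sketch. The function name and the 24-hour default are from the commit message. The signature, the use of file modification time as the age, and the return value are assumptions.

```python
import os
import time
from pathlib import Path
from typing import Optional


def cleanup_dataset_cache(cache_dir: Path, ttl_hours: float = 24.0, now: Optional[float] = None) -> int:
    """Delete files older than the TTL, then prune empty directories.

    Returns the number of files removed.
    """
    now = time.time() if now is None else now
    cutoff = now - ttl_hours * 3600
    removed = 0
    # Walk bottom-up so child directories are handled before their parents.
    for root, _dirs, files in os.walk(cache_dir, topdown=False):
        for name in files:
            path = Path(root) / name
            # Assumption: age is judged by modification time.
            if path.stat().st_mtime < cutoff:
                path.unlink()
                removed += 1
        # Prune directories left empty, but keep the cache root itself.
        if Path(root) != cache_dir and not any(Path(root).iterdir()):
            Path(root).rmdir()
    return removed
```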
- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Switch job launch from POST /v4/jobs/start to POST /api/jobs/v1/jobs
- Switch job status from GET /v4/jobs/{id} to GET /api/jobs/beta/jobs/{id}
- Remove _resolve_hardware_tier_id (v1 API accepts tier name directly)
- Add _job_api_request with direct-host-first fallback
- Add _remap_db_url_for_target for cross-project database paths
- Pass database_url and job_config as CLI args to workers

DOMINO_ENVIRONMENT_ID and DOMINO_ENVIRONMENT_REVISION_ID are set on
the App container and identify the compute environment with the right
dependencies. Using env vars eliminates per-caller plumbing and ensures
child jobs always match the App's environment.

Removes environment_id param from _job_start, start_training_job, and
start_eda_job. Adds environmentRevisionId to job payload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
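
Reading the environment from the App container could look like this sketch. The two env var names and the `environmentRevisionId` payload key are from the commit message. The `environmentId` key, the function name, and skipping unset variables are assumptions.

```python
import os
from typing import Dict, Mapping, Optional


def add_environment_fields(
    payload: Dict[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the job payload with environment ids taken from env vars."""
    env = os.environ if env is None else env
    result = dict(payload)
    env_id = env.get("DOMINO_ENVIRONMENT_ID")
    rev_id = env.get("DOMINO_ENVIRONMENT_REVISION_ID")
    if env_id:
        result["environmentId"] = env_id  # key name is an assumption
    if rev_id:
        result["environmentRevisionId"] = rev_id  # key named in the commit message
    return result
```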
Resolve the training data path at job creation time and pass it as
--file-path to the Domino Job command. The worker uses the path
directly instead of needing dataset API access at runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ancements

- Add EDAResult model for DB-backed EDA storage (replaces file-based store)
- Add diagnostics_data JSON column to Job model
- Add summary_only query optimization with load_only() for job listing
- Add owner filtering on get_registered_models and get_jobs_for_cleanup
- Add EDA CRUD operations (create/get/update/write result/error/delete stale)
- Rewrite eda_job_store from file-backed to DB-backed (async)
- Add WebSocket broadcast helpers for job/log updates in CRUD layer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Security fix: all job endpoints now enforce ownership.

- Add _enforce_job_owner() — returns 404 to non-owners (avoids leaking job existence)
- Invert get_request_owner() priority: prefer domino-username header over sidecar
  (sidecar returns App owner, not viewing user)
- resolve_job_list_filters() ignores client-supplied owner (always server-side)
- Add needs_request flag in compat patterns for Request forwarding
- Owner filtering on cleanup, registered models
- Add JobListItemResponse lightweight schema for list views
- Add clear_viewing_user() to prevent cross-request context leakage
- Background Domino sync throttling, zombie local job detection
- Add leaderboard_utils for normalize_leaderboard_payload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query params are the canonical approach going forward. The X-Project-Id
header is kept as a fallback for legacy clients only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…et-view

# Conflicts:
#	automl-service/app/api/utils.py
#	automl-service/tests/test_api_utils.py
The frontend sends both header and query param from the same source.
No scenario where header is present but query param isn't. Query param
only — simpler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>