[DOM-75515] feat: project type detection, DB URL remap, tabular data helpers #12

Merged
ddl-subir-m merged 2 commits into main from subir/pr2-project-type-tabular
Mar 24, 2026

ddl-subir-m commented Mar 20, 2026

Why

The AutoML Extension currently assumes it runs in the same project as the data it accesses. To support cross-project training jobs (where a Domino Job runs in a different project than the App), we need to:

  1. Detect the project type (DFS vs git-based) because dataset mount paths differ between them (/domino/datasets/ vs /mnt/data/). Without this, the training worker can't find its SQLite database or training data when launched cross-project.

  2. Remap database URLs so that a SQLite path written by the App gets translated to the equivalent read path in the target project's mount layout.

  3. Centralize tabular file I/O because pd.read_csv() and pd.read_parquet() calls are scattered across dataset_manager, dataset_service, and profiling code. This causes slow cold starts (pandas imported eagerly in multiple places) and duplicated error handling. The new tabular_data module consolidates these with LRU caching keyed by file mtime.

Summary

  • Adds DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
  • Adds _db_url_remap.py to remap SQLite URLs across DFS and git-based mount points
  • Adds tabular_data.py with read_tabular_preview(), read_tabular_schema(), count_csv_rows(), get_tabular_metadata(), and estimate_tabular_memory_mb() — all with LRU caching

File → consumer mapping

| File | First consumed by |
| --- | --- |
| domino_project_type.py | PR #25 (dataset_manager.py uses detect_project_type for mount path resolution) |
| tabular_data.py | PR #25 (dataset_manager.py calls read_tabular_preview, read_tabular_schema) and PR #26 (dataset_service.py calls get_tabular_metadata, read_upload_metadata) |
| _db_url_remap.py | PR #22 (training_worker.py calls remap_database_url for cross-project job launches) |

These are shared utility modules grouped here by theme (cross-project infrastructure). Each is fully tested in this PR and consumed in downstream PRs.
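
The remapping that training_worker.py relies on can be pictured as a prefix swap on the SQLite file path. The sketch below is an assumption about the general shape, not the contents of _db_url_remap.py: `remap_sqlite_url_sketch`, `DFS_PREFIX`, and `GIT_PREFIX` are hypothetical names, and real mount layouts may contain extra path segments that a plain prefix swap does not handle.

```python
# Mount prefixes taken from the Why section; the constant names are assumed.
DFS_PREFIX = "/domino/datasets/"
GIT_PREFIX = "/mnt/data/"

SQLITE_SCHEME = "sqlite:///"


def remap_sqlite_url_sketch(url: str, target_prefix: str) -> str:
    """Rewrite a SQLite URL so its path sits under `target_prefix`.

    URLs that are not SQLite, or whose path is under neither known mount
    prefix, are returned unchanged.
    """
    if not url.startswith(SQLITE_SCHEME):
        return url  # e.g. a PostgreSQL URL passes through untouched
    path = url[len(SQLITE_SCHEME):]
    for prefix in (DFS_PREFIX, GIT_PREFIX):
        if path.startswith(prefix):
            return SQLITE_SCHEME + target_prefix + path[len(prefix):]
    return url
```

For example, a URL written by the App in a DFS project, `sqlite:////domino/datasets/automl/app.db`, would be rewritten for a git-based target as `sqlite:////mnt/data/automl/app.db`.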

Test plan

  • test_domino_project_type.py passes
  • test_db_url_remap.py passes
  • test_database_url_passthrough.py passes
  • test_tabular_data.py passes

- Add DominoProjectType enum (DFS/GIT/UNKNOWN) with filesystem-based detection
- Add _db_url_remap for cross-project SQLite URL remapping across mount types
- Add tabular_data module: centralized CSV/parquet preview, schema, row counting
  with LRU caching (replaces scattered pd.read_csv/parquet calls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ddl-subir-m merged commit 4ca91a1 into main Mar 24, 2026
ddl-subir-m deleted the subir/pr2-project-type-tabular branch March 24, 2026 16:34