issue #297 fixes — bundle template, compat, cli, demo, runtime #298

Merged
ravi-databricks merged 9 commits into databrickslabs:issue_297 from park-peter:issue_297
Jun 4, 2026

Conversation

@park-peter

Closes #297. Eight independent fixes; each section below is a separate commit.

What changed

1. DAB template — variables.yml validation warnings

Every variable in templates/dab/template/{{.bundle_name}}/resources/variables.yml.tmpl declared type: string or type: bool. The bundle CLI's variable schema only accepts type: complex, so every databricks bundle validate against a freshly-scaffolded bundle emitted 18 Warning: invalid value "string" for enum field. Valid values are [complex]. Dropped the type: line from all 18 declarations.

Also dropped the sync.exclude: block from databricks.yml.tmpl — the patterns .DS_Store, .vscode, .venv/ do not match any files in a freshly-scaffolded bundle, producing three additional "Pattern X does not match any files" warnings.

bundle-init --quickstart + bundle validate now returns Validation OK! with zero warnings.

2. compat/dlt_meta — runtime import deferral + pyspark/delta-spark bump

dataflow_pipeline.py and pipeline_writers.py do from pyspark import pipelines as dp, which is only available in pyspark >=4.1.0. setup.py DEV_REQUIREMENTS and the README pinned pyspark==3.5.5, which lacks pyspark.pipelines. The previous blanket try: ... except ImportError: pass in compat/dlt_meta/__init__.py then silently dropped every runtime re-export (DataflowPipeline, PipelineReaders, AppendFlowWriter, DLTSinkWriter, BronzeDataflowSpec, SilverDataflowSpec, OnboardDataflowspec), surfacing as cannot import name 'DataflowPipeline' from 'dlt_meta' with no hint at the pyspark version mismatch.

Bumped pyspark==3.5.5 to pyspark>=4.1.0 and delta-spark==3.0.0 to delta-spark>=4.0.0 (the old delta pin caps pyspark <3.6.0 and is incompatible with the bump). Split the compat re-exports: cli / install / config bind unconditionally; runtime symbols are wrapped in a narrow try/except ImportError that swaps in stubs raising a clear ImportError("requires pyspark>=4.1.0 ...") when the import fails. Expanded the sys.modules mock in tests/test_compat.py to cover pyspark, pyspark.sql.session, pyspark.sql.window, delta, and delta.tables so the file no longer requires a real install.
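For readers unfamiliar with the sys.modules mocking approach, here is a rough sketch of the idea. It is not the actual contents of tests/test_compat.py: the module names come from this PR, while the helper name `import_with_mocks` is hypothetical.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch

# Module names listed in the PR description.
MOCKED_MODULES = [
    "pyspark",
    "pyspark.sql.session",
    "pyspark.sql.window",
    "delta",
    "delta.tables",
]


def import_with_mocks(module_name):
    """Import module_name while the heavy dependencies are replaced by mocks.

    Hypothetical helper. patch.dict restores sys.modules on exit, so the
    mocks do not leak into other tests.
    """
    fakes = {name: MagicMock(name=name) for name in MOCKED_MODULES}
    with patch.dict(sys.modules, fakes):
        return importlib.import_module(module_name)


# The import resolves to the mock, so no real pyspark install is needed.
window_module = import_with_mocks("pyspark.sql.window")
print(type(window_module).__name__)
```

Using `patch.dict` rather than assigning into `sys.modules` directly keeps the mocks scoped to the block, which is one reasonable way to avoid polluting other test modules.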

3. CLI — malformed onboarding_file_path job parameter

SDPMeta._get_onboarding_named_parameters built the named parameter as f"{cmd.uc_volume_path}/sdp_meta_conf/{cmd.onboarding_file_path}". By that point cmd.onboarding_file_path is the local absolute path (overwritten by update_ws_onboarding_paths to point at the rewritten onboarding.json), so the concatenation produced /Volumes/<cat>/<schema>/<vol>/sdp_meta_conf//Users/.../onboarding.json, which the onboarding job then failed to open. The launchers under demo/launch_*.py construct their own named_parameters dict and bypass this path, which is why no demo caught it. Use os.path.basename(cmd.onboarding_file_path) in both the UC and DBFS branches.
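A minimal sketch of the path problem, using made-up paths. The function name `build_onboarding_param` and both path values are hypothetical; only the `sdp_meta_conf` segment and the use of `os.path.basename` come from this PR.

```python
import os


def build_onboarding_param(volume_path, onboarding_file_path):
    """Join the volume path with only the file name of the local path."""
    file_name = os.path.basename(onboarding_file_path)
    return f"{volume_path}/sdp_meta_conf/{file_name}"


# Hypothetical values, for illustration only.
volume_path = "/Volumes/cat/schema/vol"
local_path = "/Users/someone/project/onboarding.json"

# Old behavior: the absolute local path is appended verbatim,
# which yields a double slash followed by the local directories.
old_param = f"{volume_path}/sdp_meta_conf/{local_path}"
print(old_param)

# New behavior: only the base name is appended.
new_param = build_onboarding_param(volume_path, local_path)
print(new_param)
```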

4. Demo — legacy import dlt in snapshot runner

demo/SDP_META_INTERACTIVE_DEMO.py builds the snapshot runner notebook as a string. The codebase migrated from import dlt to from pyspark import pipelines as dp (issue #274), but the inline snapshot_runner_content string still carried import dlt. The symbol was never referenced in the runner body — just stale. Dropped.

5. Demo (DAB template runner) — placeholder seed strip

demo/launch_dab_template_demo.py calls _strip_example_onboarding_entry to remove the template's seeded data_flow_id: "100" row after bundle-init. The strip was gated on scenario.name in _DELTA_SCENARIO_NAMES only. For the kafka and eventhub scenarios, the seed carries <your-kafka-host>:9092 / <your-eventhub-namespace> placeholders that the launcher's STAGE 5 sanity checks reject, so every --scenario kafka / --scenario eventhub run failed at STAGE 5 with flow data_flow_id='100' field source_details.kafka.bootstrap.servers is still the placeholder. Added _STRIP_EXAMPLE_SCENARIO_NAMES = _DELTA_SCENARIO_NAMES | {"kafka", "eventhub"} and gated the strip on the broader set. cloudfiles / cloudfiles_combined keep the seed — their placeholder-free source_path_dev validates fine.
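The gating change can be sketched as follows. The contents of `_DELTA_SCENARIO_NAMES` are an assumption (the commit message only says the strip previously fired for `delta`), and `should_strip_seed` is a hypothetical helper, not a function in the launcher.

```python
# Assumed contents: the PR only states the strip previously fired for `delta`.
_DELTA_SCENARIO_NAMES = frozenset({"delta"})

# Set name and union taken from the PR description.
_STRIP_EXAMPLE_SCENARIO_NAMES = _DELTA_SCENARIO_NAMES | {"kafka", "eventhub"}


def should_strip_seed(scenario_name):
    """Return True when the seeded example row should be removed."""
    return scenario_name in _STRIP_EXAMPLE_SCENARIO_NAMES


for name in ("delta", "kafka", "eventhub", "cloudfiles", "cloudfiles_combined"):
    print(name, should_strip_seed(name))
```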

6. Demo (DAB template runner) — duplicate sanity-check printing

stage_validate printed _sdp_meta_sanity_checks errors itself and then called bundle_validate, which prints the same list again under a different header. This compounded with fix 5 above: every failing kafka/eventhub run dumped the same block twice. Dropped the launcher's local copy; bundle_validate owns the output.

7. Runtime — read_silver where-clause shadow

DataflowPipeline.read_silver had for where_clause in where_clause:, shadowing the outer list with the last clause string. Nothing downstream reads where_clause post-loop today, but the same logic was already implemented correctly in the private __apply_where_clause helper. Delegated read_silver's where-clause handling to __apply_where_clause.
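To illustrate the shadowing in isolation, here is a small standalone sketch. Both functions are hypothetical stand-ins that only collect the clauses; they are not the read_silver or __apply_where_clause implementations.

```python
def collect_shadowed(where_clause):
    """Loop that reuses the list's name as the loop variable."""
    applied = []
    for where_clause in where_clause:  # rebinds the name on each iteration
        applied.append(where_clause)
    # After a non-empty loop, the name refers to the last clause string.
    return applied, where_clause


def collect_clean(where_clauses):
    """Same loop with a distinct loop variable; the list stays intact."""
    applied = []
    for clause in where_clauses:
        applied.append(clause)
    return applied, where_clauses


clauses = ["id IS NOT NULL", "amount > 0"]
print(collect_shadowed(clauses))
print(collect_clean(clauses))
```

The loop itself still visits every clause, because the iterator is created before the name is rebound. The hazard only appears if later code reads the name expecting the full list.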

8. Runtime — unused helper methods + tests

dataflow_pipeline.py defined _build_table_name, _get_source_table_info, _get_target_table_name, _create_dataframe_reader, _read_from_source, _apply_transformations — six methods whose only call sites were within the same orphan chain. read_silver / read_bronze / _get_target_table_info all kept their inline implementations. Deleted the six methods and the eight test_dataflow_pipeline.py tests that exercised them.

…exclude

variables.yml.tmpl declared type: string / type: bool on every variable, but
`complex` is the only valid value per the bundle CLI's enum schema, so every
`databricks bundle validate` against a freshly-scaffolded bundle emitted 18
warnings. The sync.exclude block in databricks.yml.tmpl listed three patterns
that no scaffolded bundle ships (.DS_Store, .vscode, .venv), producing three
more 'Pattern X does not match any files' warnings.

Result of `bundle-init --quickstart` + `bundle validate` is now "Validation
OK!" with zero warnings, matching the behavior of the official default-python
template.
@ravi-databricks ravi-databricks self-assigned this May 28, 2026
@ravi-databricks ravi-databricks changed the base branch from feature/sdp-meta to issue_297 May 28, 2026 23:56
@ravi-databricks ravi-databricks added this to the v0.0.11 milestone May 28, 2026
@park-peter
Author

park-peter commented May 29, 2026

@ravi-databricks
cli.py: this is a bug fix. The onboarding_file_path job parameter was being built from the full local path instead of just the filename, so databricks labs sdp-meta onboard produced an unopenable /Volumes/.../sdp_meta_conf//Users/.../onboarding.json. The demos never caught it because the launchers build that parameter themselves. Switched to os.path.basename(...).

dataflow_pipeline.py: two changes here. First, the inline where-clause loop in read_silver now delegates to the existing __apply_where_clause helper instead of carrying its own copy (the inline copy reused the list's name as its loop variable, shadowing the outer list). Same behavior, one implementation; this is purely a clean-up. Second, I removed six private methods (_build_table_name, _get_source_table_info, _get_target_table_name, _create_dataframe_reader, _read_from_source, _apply_transformations) plus their tests. They had no caller in any production path; the only references were calls among themselves. The real read/write paths kept their own inline logic and never touched this cluster, so there is no behavioral change. If this is an implementation you were planning to keep building on, let me know and I can take the removal out of the PR.

…pipelines

dataflow_pipeline.py and pipeline_writers.py do `from pyspark import pipelines`,
introduced in pyspark 4.1.0. setup.py DEV_REQUIREMENTS and the README pinned
pyspark==3.5.5, which lacks pyspark.pipelines, so `from dlt_meta import
DataflowPipeline` failed silently under the previous blanket
`try: ... except ImportError: pass` in compat/dlt_meta/__init__.py and surfaced
as the unhelpful `cannot import name 'DataflowPipeline' from 'dlt_meta'`.

Split the compat re-exports: pyspark-free symbols (cli surface, install,
config) bind unconditionally; symbols whose modules transitively import pyspark
are wrapped in a narrow try/except that swaps in stubs raising a clear
ImportError naming pyspark>=4.1.0 as the requirement.
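A rough sketch of the stub pattern described in this commit, under stated assumptions: the helper names `_missing_runtime_stub` and `bind_runtime_symbols`, the module name passed in, and the choice to raise at instantiation time are all hypothetical and may differ from what compat/dlt_meta/__init__.py actually does.

```python
import importlib

_REQUIREMENT = "pyspark>=4.1.0"  # requirement named in this PR


def _missing_runtime_stub(symbol, cause):
    """Build a placeholder class that fails loudly on first use (hypothetical)."""

    class _Stub:
        def __init__(self, *args, **kwargs):
            raise ImportError(
                f"{symbol} requires {_REQUIREMENT}; original error: {cause}"
            )

    _Stub.__name__ = symbol
    return _Stub


def bind_runtime_symbols(module_name, symbols):
    """Return {symbol: object}, swapping in stubs if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:  # narrow: only import failures are handled
        return {name: _missing_runtime_stub(name, exc) for name in symbols}
    return {name: getattr(module, name) for name in symbols}


# A module name that does not exist, standing in for a failed runtime import.
bound = bind_runtime_symbols("sdp_meta_missing_runtime_module", ["DataflowPipeline"])
try:
    bound["DataflowPipeline"]()
except ImportError as exc:
    print(exc)
```

The point of the stub is that the name still binds, so the failure surfaces where the symbol is used and names the version requirement instead of a bare "cannot import name".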

Bumped pyspark==3.5.5 to pyspark>=4.1.0 and delta-spark==3.0.0 to
delta-spark>=4.0.0 in setup.py DEV_REQUIREMENTS and the README install line.
delta-spark 3.x caps pyspark<3.6.0 so the old delta pin is incompatible with
the pyspark bump.

Extended the pyspark mock in tests/test_compat.py to cover pyspark,
pyspark.sql.session, pyspark.sql.window, delta, and delta.tables so the test
file no longer requires real pyspark/delta installs.

_get_onboarding_named_parameters concatenated the full local path
cmd.onboarding_file_path into the UC volume / DBFS path, producing
'/Volumes/<cat>/<schema>/<vol>/sdp_meta_conf//Users/.../onboarding.json'.
The onboarding job then failed to open that path. The launcher demos
(launch_*.py) constructed the named_parameters dict themselves and bypassed
this code, which masked the bug.

Use os.path.basename(cmd.onboarding_file_path) so the parameter is
'<volume>/sdp_meta_conf/onboarding.json' in both the UC and DBFS branches.
Updated the existing test_get_onboarding_named_parameters assertion to match.

The codebase migrated from `import dlt` to `from pyspark import pipelines as dp`
(issue databrickslabs#274 / commit cfd66fa); the inline snapshot_runner_content string in
SDP_META_INTERACTIVE_DEMO.py was missed. The `dlt` symbol was never used in
the runner body.
… scenarios

_strip_example_onboarding_entry only fired for the `delta` scenario. The
seeded `data_flow_id: "100"` row carries unedited `<your-kafka-host>:9092` /
`<your-eventhub-namespace>` placeholders for the kafka and eventhub scenarios,
which STAGE 5's bundle-validate sanity checks reject; every kafka/eventhub
launcher run failed at STAGE 5 as a result.

Added _STRIP_EXAMPLE_SCENARIO_NAMES = _DELTA_SCENARIO_NAMES | {"kafka",
"eventhub"} and gated the strip on that set. cloudfiles / cloudfiles_combined
keep the seed (its placeholder-free source path validates fine).

stage_validate ran _sdp_meta_sanity_checks and printed every error itself
before calling bundle_validate, which then printed the same errors again
under a different header. Removed the launcher's local copy and let
bundle_validate own the output. Also dropped the now-unused
_sdp_meta_sanity_checks import.

read_silver had an inline where-clause loop `for where_clause in where_clause:`
that shadowed the outer list with the last clause string. Today nothing
reads it post-loop so no functional break, but the shadow is bug-prone and
the private __apply_where_clause helper already implements the same logic
correctly with a clause iterator. Delegated read_silver's where-clause
handling to __apply_where_clause.

_build_table_name, _get_source_table_info, _get_target_table_name,
_create_dataframe_reader, _read_from_source, _apply_transformations were
added but never called from any production code path. Each one's only
caller was another method in the same orphan chain, and read_silver /
read_bronze / _get_target_table_info etc. continued to implement the same
logic inline. Removed the methods and the eight tests that exercised them.
@CLAassistant

CLAassistant commented May 29, 2026

CLA assistant check
All committers have signed the CLA.

@ravi-databricks ravi-databricks merged commit 0a22537 into databrickslabs:issue_297 Jun 4, 2026

3 participants