issue #297 fixes: bundle template, compat, cli, demo, runtime #298
…exclude
variables.yml.tmpl declared `type: string` / `type: bool` on every variable, but `complex` is the only valid value per the bundle CLI's enum schema, so every `databricks bundle validate` against a freshly-scaffolded bundle emitted 18 warnings. The `sync.exclude` block in databricks.yml.tmpl listed three patterns that no scaffolded bundle ships (.DS_Store, .vscode, .venv), producing three more 'Pattern X does not match any files' warnings. Result of `bundle-init --quickstart` + `bundle validate` is now "Validation OK!" with zero warnings, matching the behavior of the official default-python template.
@ravi-databricks dataflow_pipeline.py two things here. First, the inline where-clause loop in
…pipelines
dataflow_pipeline.py and pipeline_writers.py do `from pyspark import pipelines`, introduced in pyspark 4.1.0. setup.py DEV_REQUIREMENTS and the README pinned pyspark==3.5.5, which lacks pyspark.pipelines, so `from dlt_meta import DataflowPipeline` failed silently under the previous blanket `try: ... except ImportError: pass` in compat/dlt_meta/__init__.py and surfaced as the unhelpful `cannot import name 'DataflowPipeline' from 'dlt_meta'`.
Split the compat re-exports: pyspark-free symbols (cli surface, install, config) bind unconditionally; symbols whose modules transitively import pyspark are wrapped in a narrow try/except that swaps in stubs raising a clear ImportError naming pyspark>=4.1.0 as the requirement.
Bumped pyspark==3.5.5 to pyspark>=4.1.0 and delta-spark==3.0.0 to delta-spark>=4.0.0 in setup.py DEV_REQUIREMENTS and the README install line. delta-spark 3.x caps pyspark<3.6.0, so the old delta pin is incompatible with the pyspark bump.
Extended the pyspark mock in tests/test_compat.py to cover pyspark, pyspark.sql.session, pyspark.sql.window, delta, and delta.tables so the test file no longer requires real pyspark/delta installs.
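The split described in this commit can be sketched roughly as follows. This is an illustration only, not the actual contents of compat/dlt_meta/__init__.py: the helpers `_missing_runtime_stub` and `load_runtime_symbol`, the placeholder module name, and the exact error wording are hypothetical. Only the idea (a narrow `try/except ImportError` that swaps in a stub raising a clear error) comes from the commit message.

```python
import importlib

# Hypothetical wording; the commit only says the error names pyspark>=4.1.0.
_PYSPARK_HINT = "requires pyspark>=4.1.0 (pyspark.pipelines is unavailable)"


def _missing_runtime_stub(name, cause):
    """Build a placeholder class that raises a descriptive ImportError on use."""

    class _Stub:
        def __init__(self, *args, **kwargs):
            raise ImportError(f"{name} {_PYSPARK_HINT}") from cause

    _Stub.__name__ = name
    return _Stub


def load_runtime_symbol(module_name, symbol):
    """Return module_name.symbol, or a stub if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:  # narrow: only import failures are swallowed
        return _missing_runtime_stub(symbol, exc)
    return getattr(module, symbol)


# A module name that cannot be imported stands in for a pyspark-dependent module.
DataflowPipeline = load_runtime_symbol(
    "no_such_runtime_module_for_demo", "DataflowPipeline"
)
```

With this shape, `from dlt_meta import DataflowPipeline` still succeeds on an old pyspark, and the version hint appears at the point where the symbol is actually used.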
_get_onboarding_named_parameters concatenated the full local path cmd.onboarding_file_path into the UC volume / DBFS path, producing '/Volumes/<cat>/<schema>/<vol>/sdp_meta_conf//Users/.../onboarding.json'. The onboarding job then failed to open that path. The launcher demos (launch_*.py) constructed the named_parameters dict themselves and bypassed this code, which masked the bug. Use os.path.basename(cmd.onboarding_file_path) so the parameter is '<volume>/sdp_meta_conf/onboarding.json' in both the UC and DBFS branches. Updated the existing test_get_onboarding_named_parameters assertion to match.
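A small sketch of the path bug and its fix, assuming hypothetical volume and local paths. The function `onboarding_param` is illustrative and is not the real `_get_onboarding_named_parameters`.

```python
import os


def onboarding_param(volume_path, onboarding_file_path):
    """Join the conf directory with only the file name of the local path."""
    file_name = os.path.basename(onboarding_file_path)
    return f"{volume_path}/sdp_meta_conf/{file_name}"


local = "/Users/someone/project/onboarding.json"  # hypothetical local path
volume = "/Volumes/cat/schema/vol"                # hypothetical UC volume path

# Old behaviour: the whole local absolute path is appended to the volume path.
buggy = f"{volume}/sdp_meta_conf/{local}"

# New behaviour: only the file name is appended.
fixed = onboarding_param(volume, local)
```

The buggy form contains a double slash followed by the local directory tree, which is the shape of the malformed path quoted in the commit message.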
The codebase migrated from `import dlt` to `from pyspark import pipelines as dp` (issue databrickslabs#274 / commit cfd66fa); the inline `snapshot_runner_content` string in SDP_META_INTERACTIVE_DEMO.py was missed and still carried `import dlt`. The `dlt` symbol was never used in the runner body, so the stale import was dropped.
… scenarios
_strip_example_onboarding_entry only fired for the `delta` scenario. The
seeded `data_flow_id: "100"` row carries unedited `<your-kafka-host>:9092` /
`<your-eventhub-namespace>` placeholders for the kafka and eventhub scenarios,
which STAGE 5's bundle-validate sanity checks reject; every kafka/eventhub
launcher run failed at STAGE 5 as a result.
Added _STRIP_EXAMPLE_SCENARIO_NAMES = _DELTA_SCENARIO_NAMES | {"kafka",
"eventhub"} and gated the strip on that set. cloudfiles / cloudfiles_combined
keep the seed (their placeholder-free source path validates fine).
stage_validate ran _sdp_meta_sanity_checks and printed every error itself before calling bundle_validate, which then printed the same errors again under a different header. Removed the launcher's local copy and let bundle_validate own the output. Also dropped the now-unused _sdp_meta_sanity_checks import.
read_silver had an inline where-clause loop `for where_clause in where_clause:` that shadowed the outer list with the last clause string. Today nothing reads it post-loop so no functional break, but the shadow is bug-prone and the private __apply_where_clause helper already implements the same logic correctly with a clause iterator. Delegated read_silver's where-clause handling to __apply_where_clause.
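The shadowing can be reproduced without Spark. The sketch below uses a hypothetical `FakeFrame` in place of a DataFrame, and the two functions are illustrative stand-ins, not the real `read_silver` or `__apply_where_clause`.

```python
class FakeFrame:
    """Stand-in for a Spark DataFrame; records the clauses passed to where()."""

    def __init__(self, applied=()):
        self.applied = tuple(applied)

    def where(self, clause):
        return FakeFrame(self.applied + (clause,))


def read_with_shadow(frame, where_clause):
    for where_clause in where_clause:  # rebinds the name that held the list
        frame = frame.where(where_clause)
    return frame, where_clause         # now the last clause string, not the list


def read_without_shadow(frame, where_clauses):
    for clause in where_clauses:       # distinct loop variable, list stays intact
        frame = frame.where(clause)
    return frame, where_clauses


clauses = ["id IS NOT NULL", "amount > 0"]
shadow_frame, leftover = read_with_shadow(FakeFrame(), clauses)
clean_frame, kept = read_without_shadow(FakeFrame(), clauses)
```

Both versions apply the same filters, which matches the note that nothing breaks today. The difference is what the name refers to after the loop: a single string in the shadowed version, the original list in the other.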
_build_table_name, _get_source_table_info, _get_target_table_name, _create_dataframe_reader, _read_from_source, _apply_transformations were added but never called from any production code path. Each one's only caller was another method in the same orphan chain, and read_silver / read_bronze / _get_target_table_info etc. continued to implement the same logic inline. Removed the methods and the eight tests that exercised them.
Closes #297. Eight independent fixes; each section below is a separate commit.
What changed
1. DAB template: `variables.yml` validation warnings
Every variable in `templates/dab/template/{{.bundle_name}}/resources/variables.yml.tmpl` declared `type: string` or `type: bool`. The bundle CLI's variable schema only accepts `type: complex`, so every `databricks bundle validate` against a freshly-scaffolded bundle emitted 18 `Warning: invalid value "string" for enum field. Valid values are [complex]`. Dropped the `type:` line from all 18 declarations.
Also dropped the `sync.exclude:` block from `databricks.yml.tmpl`: the patterns `.DS_Store`, `.vscode`, `.venv/` do not match any files in a freshly-scaffolded bundle, producing three additional "Pattern X does not match any files" warnings. `bundle-init --quickstart` + `bundle validate` now returns `Validation OK!` with zero warnings.

2. `compat/dlt_meta`: runtime import deferral + pyspark/delta-spark bump
`dataflow_pipeline.py` and `pipeline_writers.py` do `from pyspark import pipelines as dp`, which is only available in pyspark>=4.1.0. `setup.py` `DEV_REQUIREMENTS` and the README pinned `pyspark==3.5.5`, which lacks `pyspark.pipelines`. The previous blanket `try: ... except ImportError: pass` in `compat/dlt_meta/__init__.py` then silently dropped every runtime re-export (`DataflowPipeline`, `PipelineReaders`, `AppendFlowWriter`, `DLTSinkWriter`, `BronzeDataflowSpec`, `SilverDataflowSpec`, `OnboardDataflowspec`), surfacing as `cannot import name 'DataflowPipeline' from 'dlt_meta'` with no hint at the pyspark version mismatch.
Bumped `pyspark==3.5.5` → `pyspark>=4.1.0` and `delta-spark==3.0.0` → `delta-spark>=4.0.0` (the old delta pin caps pyspark `<3.6.0` and is incompatible with the bump). Split the compat re-exports: cli / install / config bind unconditionally; runtime symbols are wrapped in a narrow `try/except ImportError` that swaps in stubs raising a clear `ImportError("requires pyspark>=4.1.0 ...")` when the import fails. Expanded the `sys.modules` mock in `tests/test_compat.py` to cover `pyspark`, `pyspark.sql.session`, `pyspark.sql.window`, `delta`, and `delta.tables` so the file no longer requires a real install.

3. CLI: malformed `onboarding_file_path` job parameter
`SDPMeta._get_onboarding_named_parameters` built the named parameter as `f"{cmd.uc_volume_path}/sdp_meta_conf/{cmd.onboarding_file_path}"`. By that point `cmd.onboarding_file_path` is the local absolute path (overwritten by `update_ws_onboarding_paths` to point at the rewritten `onboarding.json`), so the concatenation produced `/Volumes/<cat>/<schema>/<vol>/sdp_meta_conf//Users/.../onboarding.json`, which the onboarding job then failed to open. The launchers under `demo/launch_*.py` construct their own `named_parameters` dict and bypass this path, which is why no demo caught it. Use `os.path.basename(cmd.onboarding_file_path)` in both the UC and DBFS branches.

4. Demo: legacy `import dlt` in snapshot runner
`demo/SDP_META_INTERACTIVE_DEMO.py` builds the snapshot runner notebook as a string. The codebase migrated from `import dlt` to `from pyspark import pipelines as dp` (issue #274), but the inline `snapshot_runner_content` string still carried `import dlt`. The symbol was never referenced in the runner body; it was just stale. Dropped.

5. Demo (DAB template runner): placeholder seed strip
`demo/launch_dab_template_demo.py` calls `_strip_example_onboarding_entry` to remove the template's seeded `data_flow_id: "100"` row after `bundle-init`. The strip was gated on `scenario.name in _DELTA_SCENARIO_NAMES` only. For the kafka and eventhub scenarios, the seed carries `<your-kafka-host>:9092` / `<your-eventhub-namespace>` placeholders that the launcher's STAGE 5 sanity checks reject, so every `--scenario kafka` / `--scenario eventhub` run failed at STAGE 5 with `flow data_flow_id='100' field source_details.kafka.bootstrap.servers is still the placeholder`. Added `_STRIP_EXAMPLE_SCENARIO_NAMES = _DELTA_SCENARIO_NAMES | {"kafka", "eventhub"}` and gated the strip on the broader set. `cloudfiles` / `cloudfiles_combined` keep the seed; their placeholder-free `source_path_dev` validates fine.

6. Demo (DAB template runner): duplicate sanity-check printing
`stage_validate` printed `_sdp_meta_sanity_checks` errors itself and then called `bundle_validate`, which prints the same list again under a different header. Compounded with #5: every failing kafka/eventhub run dumped the same block twice. Dropped the launcher's local copy; `bundle_validate` owns the output.

7. Runtime: `read_silver` where-clause shadow
`DataflowPipeline.read_silver` had `for where_clause in where_clause:`, shadowing the outer list with the last clause string. Nothing downstream reads `where_clause` post-loop today, but the same logic was already implemented correctly in the private `__apply_where_clause` helper. Delegated `read_silver`'s where-clause handling to `__apply_where_clause`.

8. Runtime: unused helper methods + tests
`dataflow_pipeline.py` defined `_build_table_name`, `_get_source_table_info`, `_get_target_table_name`, `_create_dataframe_reader`, `_read_from_source`, `_apply_transformations`, six methods whose only call sites were within the same orphan chain. `read_silver` / `read_bronze` / `_get_target_table_info` all kept their inline implementations. Deleted the six methods and the eight `test_dataflow_pipeline.py` tests that exercised them.