Skip to content

feat(reconcile): materialize intermediate DataFrames during reconcile#2535

Open
BesikiML wants to merge 5 commits into
mainfrom
2257-feature-tighten-writes-to-delta-during-reconcile
Open

feat(reconcile): materialize intermediate DataFrames during reconcile#2535
BesikiML wants to merge 5 commits into
mainfrom
2257-feature-tighten-writes-to-delta-during-reconcile

Conversation

@BesikiML

@BesikiML BesikiML commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add intermediate write_and_read_df_with_volumes checkpoints in reconcile computation paths that were still missing materialization (marked with TODOs in [FEATURE]: Tighten writes to delta during reconcile #2257).
  • Materialize threshold mismatch rows, aggregate per-rule outputs, and sampled mismatch capture DataFrames before downstream count() / sampling work.

Motivation

Large Spark plans in reconcile can retain long lineages and cause backpressure. The reconcile runtime already materializes some intermediate results (e.g. after source/target join); this PR fills the remaining gaps called out in #2257.

Changes

Phase Location Change
Threshold comparison reconciliation._compute_threshold_comparison Materialize filtered mismatched_df before count()
Aggregate per-rule compare.reconcile_agg_data_per_rule Thread persistence and materialize mismatch, missing_in_src, missing_in_tgt
Mismatch capture (sampling) compare.capture_mismatch_data_and_columns Require persistence; wired from Reconciliation._get_mismatch_data

Fixed issue: #2257

Note: These are temporary intermediate writes via ReconIntermediatePersist (UC volume / parquet), not changes to final ReconCapture MAIN/METRICS/DETAILS output tables. Temp data is still cleaned up in trigger_recon / trigger_recon_aggregates.

Test plan

  • CI integration reconcile tests (test_compare.py, test_aggregates_reconcile.py, test_execute.py) — covers aggregate per-rule, mismatch capture, and threshold paths via existing e2e coverage
  • Manual smoke on Databricks (optional): run reconcile job and confirm intermediate volume writes under metadata_config.volume

Add write_and_read_df_with_volumes checkpoints for threshold comparison,
aggregate per-rule outputs, and mismatch capture to improve backpressure.
Includes unit tests verifying persistence is invoked in each path.

Fixes #2257
@BesikiML BesikiML requested a review from a team as a code owner June 26, 2026 03:24
@BesikiML BesikiML linked an issue Jun 26, 2026 that may be closed by this pull request
1 task
@BesikiML BesikiML self-assigned this Jun 26, 2026
@BesikiML BesikiML requested a review from m-abulazm June 26, 2026 03:26
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.18%. Comparing base (7ebd945) to head (cce627c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...rc/databricks/labs/lakebridge/reconcile/compare.py 20.00% 4 Missing ⚠️
...bricks/labs/lakebridge/reconcile/reconciliation.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2535      +/-   ##
==========================================
+ Coverage   69.10%   69.18%   +0.07%     
==========================================
  Files         105      105              
  Lines        9482     9503      +21     
  Branches     1050     1052       +2     
==========================================
+ Hits         6553     6575      +22     
+ Misses       2735     2731       -4     
- Partials      194      197       +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@BesikiML BesikiML requested a review from bishwajit-db June 26, 2026 03:26
BesikiML added 2 commits June 25, 2026 23:35
Move imports to module scope, mock target.read_data via constructor,
and satisfy mypy/pylint for the threshold comparison test.
Remove the private-method threshold test instead of suppressing
protected-access; aggregate and capture tests already verify the
intermediate persist wiring pattern.
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

✅ 169/169 passed, 9 flaky, 2 skipped, 52m4s total

Flaky tests:

  • 🤪 test_installs_and_runs_local_bladebridge (13.157s)
  • 🤪 test_installs_and_runs_pypi_bladebridge (23.547s)
  • 🤪 test_recon_for_report_type_is_data (22.771s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[True] (24.53s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[False] (23.735s)
  • 🤪 test_transpiles_informatica_to_sparksql (25.919s)
  • 🤪 test_transpile_teradata_sql_non_interactive[False] (6.199s)
  • 🤪 test_transpile_teradata_sql (8.872s)
  • 🤪 test_transpile_teradata_sql_non_interactive[True] (26.578s)

Running from acceptance #4946

source: DataFrame,
target: DataFrame,
key_columns: list[str],
persistence: AbstractReconIntermediatePersist | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persistence is never None so no needs to handle that or allow that in the first place

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,106 @@
import tempfile

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont really need to unit test this. this is a performance optimization addition that testing it goes beyond simple unit tests. we have e2e tests for that.

can you also attempt a test run and post a screenshot of the result

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

BesikiML added 2 commits June 26, 2026 08:53
Make persistence a required parameter in capture_mismatch_data_and_columns
since callers always provide it. Remove the dedicated unit test file in
favor of existing integration/e2e coverage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Tighten writes to delta during reconcile

2 participants