#346 Fix Databricks spatial join: collect() instead of count() #349

Merged
jathavaan merged 2 commits into main from fix/346-broadcast-notebook-disable-aqe on May 26, 2026
@jathavaan
Collaborator

This pull request introduces improvements to the benchmark runner, configuration, and Databricks job execution to better handle out-of-memory (OOM) scenarios and ensure accurate experiment tracking. The main changes include adding the ability to skip problematic experiments, updating cluster and job settings to avoid Spark query plan rewrites, and fixing result collection in Databricks scripts.

Benchmark skipping and experiment tracking:

  • benchmark_runner.py, benchmarks.yml: Added a skip flag to experiments in benchmarks.yml and logic in benchmark_runner.py to detect and record skipped experiments. OOM-prone benchmarks no longer run, and their metadata is saved with a FAILED stop reason.
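
A minimal sketch of how the skip handling could look, assuming each experiment is loaded from benchmarks.yml as a dict. The names `ExperimentRecord`, `run_experiments`, `run_one`, and the `"completed"` stop reason are hypothetical and not taken from benchmark_runner.py; only the `skip` flag, the failed stop reason, and the 0 iterations come from this PR.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ExperimentRecord:
    """Metadata saved for one experiment (hypothetical shape)."""
    name: str
    stop_reason: str
    iterations: int


def run_experiments(
    experiments: list[dict[str, Any]],
    run_one: Callable[[dict[str, Any]], int],
) -> list[ExperimentRecord]:
    """Run every experiment unless it is marked skip: true."""
    records = []
    for experiment in experiments:
        if experiment.get("skip", False):
            # Skipped experiments are recorded as failed with 0 iterations
            # and never reach run_one, so no cluster would be provisioned.
            records.append(ExperimentRecord(experiment["name"], "failed", 0))
            continue
        # run_one stands in for the real benchmark execution and is assumed
        # to return the number of completed iterations.
        iterations = run_one(experiment)
        records.append(ExperimentRecord(experiment["name"], "completed", iterations))
    return records
```

Checking the flag before any execution step keeps the skipped entry visible in the saved metadata while costing no cluster time.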

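Databricks cluster and Spark settings:

  • Cluster-level spark_conf: Added spark.sql.adaptive.enabled=false and spark.sql.autoBroadcastJoinThreshold=-1 so that Spark does not rewrite the join plan.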

Databricks result collection fix:

  • src/presentation/databricks/national_scale_spatial_join_broadcast.py, src/presentation/databricks/national_scale_spatial_join_partitioned.py: Replaced result.count() with result.collect() followed by len(collected), so that the cardinality is measured on the join as written and the query is not replanned as a BroadcastNestedLoopJoin.
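
The change in both scripts can be sketched roughly as below. The helper name `measure_cardinality` is hypothetical; `result` is assumed to be the Spark DataFrame produced by the spatial join, and only `collect()` and `len(collected)` come from this PR.

```python
def measure_cardinality(result) -> int:
    """Return the number of rows in the join result.

    `result` is assumed to be a pyspark.sql.DataFrame. collect() brings
    every row to the driver, so the full result must fit within
    spark.driver.maxResultSize.
    """
    # Previously: return result.count()
    collected = result.collect()
    return len(collected)
```

Because `collect()` materializes the whole result on the driver, this approach only suits joins whose output is small enough to hold in driver memory.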

…er-level AQE disable, skip OOM experiments

- Replace result.count() with result.collect() in both notebooks to prevent
  Databricks ResultCacheManager from replanning the query as
  BroadcastNestedLoopJoin (brute-force cross product that exceeds
  spark.driver.maxResultSize)
- Add spark.sql.adaptive.enabled=false and
  spark.sql.autoBroadcastJoinThreshold=-1 to cluster-level spark_conf
- Add skip: true support in benchmark_runner.py — records stop_reason=failed
  with 0 iterations without provisioning a cluster
- Mark partitioned joins at medium/large and PostGIS large as skip: true
  (executor OOM from RangeJoin spatial index exceeding Standard_D4s_v3 memory)
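
The cluster-level settings listed above could be applied along these lines. This is a sketch only: the helper `with_pinned_join_plan` and the example cluster spec are hypothetical, and the actual location of the cluster definition in this repository is not shown in the PR. The two Spark keys and the Standard_D4s_v3 node type are the ones named in the commit message.

```python
# Spark settings from the commit message, as string values in spark_conf.
PINNED_JOIN_CONF = {
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}


def with_pinned_join_plan(cluster_spec: dict) -> dict:
    """Return a copy of cluster_spec with the join-related settings merged in.

    Existing spark_conf entries are kept; the pinned keys override them.
    """
    merged = dict(cluster_spec)
    merged["spark_conf"] = {**cluster_spec.get("spark_conf", {}), **PINNED_JOIN_CONF}
    return merged


# Hypothetical cluster spec used only for illustration.
example_cluster = {"node_type_id": "Standard_D4s_v3", "num_workers": 2}
pinned_cluster = with_pinned_join_plan(example_cluster)
```

Setting the values at cluster level rather than in the notebook means they are in place before any session starts.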
@jathavaan jathavaan self-assigned this May 26, 2026
Copilot AI review requested due to automatic review settings May 26, 2026 08:04
@jathavaan jathavaan enabled auto-merge May 26, 2026 08:04
@jathavaan jathavaan disabled auto-merge May 26, 2026 08:04
@jathavaan jathavaan merged commit 766fa60 into main May 26, 2026
2 checks passed
@jathavaan jathavaan deleted the fix/346-broadcast-notebook-disable-aqe branch May 26, 2026 08:04
@jathavaan jathavaan review requested due to automatic review settings May 26, 2026 08:26