#346 Fix Databricks spatial join: collect() instead of count() by jathavaan · Pull Request #349 · kartAI/doppa

jathavaan · 2026-05-26T08:04:14Z

This pull request introduces improvements to the benchmark runner, configuration, and Databricks job execution to better handle out-of-memory (OOM) scenarios and ensure accurate experiment tracking. The main changes include adding the ability to skip problematic experiments, updating cluster and job settings to avoid Spark query plan rewrites, and fixing result collection in Databricks scripts.

Benchmark skipping and experiment tracking:

benchmark_runner.py, benchmarks.yml: Added a skip flag to experiments in benchmarks.yml and logic in benchmark_runner.py to detect and record skipped experiments. OOM-prone benchmarks are not run and no cluster is provisioned for them; their metadata is saved with a FAILED stop reason and 0 iterations.
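As a rough illustration of the skip behaviour described above, the sketch below shows one way a runner could record a skipped experiment without executing it. The `Experiment` and `RunRecord` types, the `run_all` helper, and the "completed" label are hypothetical; the actual layout of benchmarks.yml and benchmark_runner.py is not shown in this description.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Experiment:
    name: str
    skip: bool = False  # hypothetical mirror of `skip: true` in benchmarks.yml


@dataclass
class RunRecord:
    name: str
    iterations: int
    stop_reason: str


def run_all(
    experiments: Iterable[Experiment],
    run_one: Callable[[Experiment], int],
) -> List[RunRecord]:
    """Run each experiment, recording skipped ones without executing them."""
    records: List[RunRecord] = []
    for exp in experiments:
        if exp.skip:
            # Skipped experiments are never started: 0 iterations, stop reason "failed".
            records.append(RunRecord(exp.name, 0, "failed"))
            continue
        # run_one stands in for provisioning a cluster and running the benchmark;
        # it returns the number of completed iterations.
        iterations = run_one(exp)
        records.append(RunRecord(exp.name, iterations, "completed"))  # assumed label
    return records
```

Checking the flag before calling `run_one` is what keeps a skipped experiment from provisioning anything, while still leaving a metadata record behind.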

Databricks cluster and Spark settings:

src/infra/infrastructure/services/databricks_service.py: Disabled Spark Adaptive Query Execution (spark.sql.adaptive.enabled=false) and automatic broadcast joins (spark.sql.autoBroadcastJoinThreshold=-1) in the cluster-level spark_conf to prevent runtime query plan rewrites that could cause instability in spatial join jobs.
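A minimal sketch of how these two settings might be merged into a cluster's `spark_conf` mapping. The `build_spark_conf` helper and the `STABLE_PLAN_CONF` name are assumptions for illustration and are not taken from databricks_service.py; only the two Spark keys and their values come from this pull request.

```python
from typing import Dict, Optional

# The two settings named in this pull request, as string values.
STABLE_PLAN_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "false",         # turn off AQE
    "spark.sql.autoBroadcastJoinThreshold": "-1",  # -1 disables automatic broadcast joins
}


def build_spark_conf(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` with the plan-stability settings applied last."""
    conf = dict(base or {})
    conf.update(STABLE_PLAN_CONF)  # these keys override any earlier values
    return conf
```

Applying the settings last means a caller-supplied value for either key cannot silently re-enable the behaviour.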

Databricks result collection fix:

src/presentation/databricks/national_scale_spatial_join_broadcast.py, src/presentation/databricks/national_scale_spatial_join_partitioned.py: Replaced result.count() with result.collect() and len(collected) for the cardinality measurement. This prevents the Databricks ResultCacheManager from replanning the query as a BroadcastNestedLoopJoin, a brute-force cross product that exceeds spark.driver.maxResultSize.
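The change can be sketched as follows. The `measure_cardinality` helper and the `Collectable` protocol are hypothetical names used for illustration; in the notebooks, `result` would be the Spark DataFrame produced by the spatial join.

```python
from typing import Any, List, Protocol


class Collectable(Protocol):
    """Anything with a DataFrame-style collect() method (assumed interface)."""

    def collect(self) -> List[Any]: ...


def measure_cardinality(result: Collectable) -> int:
    """Materialize the join result and count the rows locally."""
    collected = result.collect()  # replaces result.count()
    return len(collected)
```

Note that collect() brings every row to the driver, so this approach assumes the join output itself fits within spark.driver.maxResultSize.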

…er-level AQE disable, skip OOM experiments

- Replace result.count() with result.collect() in both notebooks to prevent Databricks ResultCacheManager from replanning the query as BroadcastNestedLoopJoin (brute-force cross product that exceeds spark.driver.maxResultSize)
- Add spark.sql.adaptive.enabled=false and spark.sql.autoBroadcastJoinThreshold=-1 to cluster-level spark_conf
- Add skip: true support in benchmark_runner.py: records stop_reason=failed with 0 iterations without provisioning a cluster
- Mark partitioned joins at medium/large and PostGIS large as skip: true (executor OOM from RangeJoin spatial index exceeding Standard_D4s_v3 memory)

jathavaan self-assigned this May 26, 2026

Copilot AI review requested due to automatic review settings May 26, 2026 08:04

Merge branch 'main' into fix/346-broadcast-notebook-disable-aqe

eab7b67

jathavaan enabled auto-merge May 26, 2026 08:04

Copilot started reviewing on behalf of jathavaan May 26, 2026 08:04 View session

jathavaan disabled auto-merge May 26, 2026 08:04

jathavaan merged commit 766fa60 into main May 26, 2026
2 checks passed

jathavaan deleted the fix/346-broadcast-notebook-disable-aqe branch May 26, 2026 08:04

jathavaan review requested due to automatic review settings May 26, 2026 08:26

jathavaan merged 2 commits into main from fix/346-broadcast-notebook-disable-aqe
