Benchmark cost is essentially num_workers × cluster_window: the per-node-hour rate is fixed at ~$0.3165, and ACI, Postgres, blob ops, and intra-region egress are negligible by comparison. Almost all spend lives in RQ2-large Sedona, so the highest-leverage cost cut is to trim worker counts on that tier rather than to touch dataset tiers.
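As a rough illustration of the cost model above, the sketch below multiplies worker count, cluster window, and the quoted per-node-hour rate. The function name is hypothetical, and it assumes that only worker nodes are billed at that rate and that the window is measured in hours; neither is confirmed by the repository.

```python
# Approximate fixed per-node-hour rate quoted in this issue (USD).
NODE_HOUR_RATE_USD = 0.3165


def estimate_cost_usd(
    num_workers: int,
    cluster_window_hours: float,
    rate: float = NODE_HOUR_RATE_USD,
) -> float:
    """Estimate spend as num_workers x cluster_window x rate.

    Hypothetical helper: ignores ACI, Postgres, blob ops, and egress,
    which the issue treats as negligible.
    """
    if num_workers < 0 or cluster_window_hours < 0:
        raise ValueError("num_workers and cluster_window_hours must be >= 0")
    return num_workers * cluster_window_hours * rate


if __name__ == "__main__":
    # Cost per cluster-hour at each large-tier worker level (1-hour window
    # is a placeholder, not a measured runtime).
    for workers in (2, 4, 8, 12, 16):
        print(workers, round(estimate_cost_usd(workers, 1.0), 4))
```

Under these assumptions, a 16-worker run that stays up for 5.4 hours costs roughly $27, which is why an unbounded iteration matters more than any of the fixed overheads.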
Decision:
Drop the 12-node point for broadcast and partitioned at the large tier; keep {2, 4, 8, 16}. This removes the second-most-expensive worker level while preserving the scaling-curve shape (steep early gains across 2→4→8, diminishing-returns/COST anchor at 16; 12 only interpolates between 8 and 16).
Keep RQ1 at small + large (single-node, near-zero cost; large is the informative tier) — no tier change.
Keep all RQ2 dataset tiers for now; if further savings are needed, thin RQ2 medium node breadth rather than dropping a tier.
The national-scale-join @monitor branch has no wall-clock ceiling (use_sequential_stopping=False skips the timeout), so a pathological iteration converts directly into billed node-hours. Add a finite ceiling to cap worst-case cost.
This issue also covers the cleanup left over now that the default strategy has been removed entirely.
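One possible shape for the finite wall-clock ceiling mentioned in the decision is sketched below. It is not the existing monitor code: `wait_with_ceiling`, `is_done`, and `cancel` are hypothetical names, and the sketch assumes the monitor can poll for completion and can cancel the running cluster job through a callback.

```python
import time
from typing import Callable


class WallClockCeilingExceeded(RuntimeError):
    """Raised when a monitored run outlives its wall-clock ceiling."""


def wait_with_ceiling(
    is_done: Callable[[], bool],
    cancel: Callable[[], None],
    ceiling_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until is_done() is true or the ceiling passes.

    On timeout, call cancel() so the cluster stops accruing node-hours,
    then raise. Returns the elapsed seconds on normal completion.
    """
    start = clock()
    deadline = start + ceiling_seconds
    while not is_done():
        if clock() >= deadline:
            cancel()  # hypothetical hook: stop the job / tear down the cluster
            raise WallClockCeilingExceeded(
                f"run exceeded ceiling of {ceiling_seconds} s"
            )
        sleep(poll_seconds)
    return clock() - start
```

The clock and sleep functions are injectable so the ceiling can be unit-tested without waiting in real time. The ceiling applies regardless of use_sequential_stopping, which is the gap described above.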
Developer tasks
Remove national-scale-spatial-join-databricks-broadcast-12-nodes-large and ...-partitioned-12-nodes-large from benchmarks.yml
Rebatch the orphaned large 4-node peers (currently broadcast-12 pairs with broadcast-4, partitioned-12 with partitioned-4); re-point related_script_ids so the 4-node pair still launches together and main.py::_assert_related_ids_resolvable passes (e.g. pair broadcast-4-large ↔ partitioned-4-large, vCPU sum 40 ≤ 200)
Remove the now-unused 12-node entrypoints, benchmark_runner.py dispatch arms, docker-compose.yml services, and matrix rows in pull-request-tests.yml / push-containers-to-acr.yml (only if 12-node is unused at every tier)
Confirm no dangling default references remain after its removal: grep benchmarks.yml, benchmark_runner.py, entrypoints/__init__.py, app_config.py, docker-compose.yml, both workflow matrices, and delete national_scale_spatial_join_default.py
Add a finite wall-clock ceiling for the national_scale_spatial_join_* branch in monitor so cost/time can't run away (the 5.4 h default-16 case)
Reconcile thesis ↔ code: update RQ2 worker levels in Table 4.1.1 (lists 2, 4, 8, 12, 16) and Table 4.2.1; fix the join iteration count (p.44 says seven, code uses NATIONAL_SCALE_SPATIAL_JOIN = 5 + 1 warmup); remove "Sedona + GeoParquet" from the RQ1 single-machine rows in Table 4.2.1 (no Sedona single-machine entrypoints exist)
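To sanity-check the rebatching task before running main.py, a check along the following lines could be used. It is only a sketch and not the actual main.py::_assert_related_ids_resolvable: the dictionary schema, the `vcpus` field, the shortened script IDs, and the 20-vCPU figure per script (an even split of the 40 quoted above) are all assumptions for illustration.

```python
from typing import Dict, List


def check_related_ids(benchmarks: Dict[str, dict], vcpu_quota: int = 200) -> List[str]:
    """Return a list of problems; an empty list means the batching is consistent.

    Checks that every related_script_ids entry resolves to a known script
    and that each script plus its related peers fits inside the vCPU quota.
    """
    problems: List[str] = []
    for script_id, entry in benchmarks.items():
        related = entry.get("related_script_ids", [])
        for related_id in related:
            if related_id not in benchmarks:
                problems.append(
                    f"{script_id}: related id {related_id!r} does not resolve"
                )
        # Sum vCPUs over the script and the peers that do resolve.
        batch = [script_id] + [r for r in related if r in benchmarks]
        total = sum(benchmarks[b].get("vcpus", 0) for b in batch)
        if total > vcpu_quota:
            problems.append(
                f"{script_id}: batch needs {total} vCPUs, quota is {vcpu_quota}"
            )
    return problems


# Hypothetical, shortened IDs and vCPU counts for the re-paired 4-node scripts.
example = {
    "broadcast-4-large": {"vcpus": 20, "related_script_ids": ["partitioned-4-large"]},
    "partitioned-4-large": {"vcpus": 20, "related_script_ids": ["broadcast-4-large"]},
}
print(check_related_ids(example))
```

If a 4-node entry still pointed at a removed 12-node ID, the function would report that ID as unresolved, which is the failure the rebatching task is meant to prevent.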