
Reduce databricks runs #341

@jathavaan

Task description

Benchmark cost is essentially proportional to num_workers × cluster_window: the per-node-hour rate is fixed at ~$0.3165, and ACI, Postgres, blob ops, and intra-region egress are negligible by comparison. Almost all spend sits in RQ2-large Sedona, so the highest-leverage cost cut is trimming worker counts on that tier rather than changing dataset tiers.
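As a rough illustration of the cost model, the sketch below multiplies worker count, cluster window, and the ~$0.3165 rate. The function name `estimate_cost_usd` is hypothetical. The example treats the 5.4 h runaway case mentioned in this issue as a 16-worker cluster window, and it assumes the driver node is not billed separately; both are assumptions, not confirmed billing details.

```python
# Fixed per-node-hour rate quoted in this issue (approximate).
NODE_HOUR_RATE_USD = 0.3165


def estimate_cost_usd(num_workers: int, cluster_window_hours: float,
                      rate: float = NODE_HOUR_RATE_USD) -> float:
    """Approximate cost of one benchmark run.

    Assumption: only worker node-hours are counted; ACI, Postgres, blob ops,
    and egress are ignored because the issue calls them negligible.
    """
    return num_workers * cluster_window_hours * rate


# Hypothetical reading of the runaway case: 16 workers held for 5.4 hours.
runaway = estimate_cost_usd(num_workers=16, cluster_window_hours=5.4)
print(f"{runaway:.2f}")  # 27.35
```

Because the rate is constant, removing a worker level saves exactly that level's node-hours, which is why the 12-node point is the natural candidate.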

Decision:

  • Drop the 12-node point for broadcast and partitioned at the large tier; keep {2, 4, 8, 16}. This removes the second-most-expensive worker level while preserving the scaling-curve shape (steep early gains across 2→4→8, diminishing-returns/COST anchor at 16; 12 only interpolates between 8 and 16).
  • Keep RQ1 at small + large (single-node, near-zero cost; large is the informative tier) — no tier change.
  • Keep all RQ2 dataset tiers for now; if further savings are needed, thin RQ2 medium node breadth rather than dropping a tier.
  • The national-scale-join @monitor branch has no wall-clock ceiling (use_sequential_stopping=False skips the timeout), so a pathological iteration converts directly into billed node-hours. Add a finite ceiling to cap worst-case cost.
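A finite ceiling of the kind proposed in the last bullet could be sketched as a decorator that abandons a call once a wall-clock budget is exceeded. This is a minimal sketch, not the existing monitor implementation: `with_wall_clock_ceiling` and `WallClockCeilingExceeded` are hypothetical names, and the real fix would also have to cancel the Databricks run or terminate the cluster, since abandoning the local call alone does not stop billing.

```python
import functools
import threading
import time


class WallClockCeilingExceeded(RuntimeError):
    """Raised when a monitored call runs longer than its ceiling."""


def with_wall_clock_ceiling(ceiling_seconds: float):
    """Run the wrapped function in a daemon thread and give up after the ceiling."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                try:
                    outcome["value"] = func(*args, **kwargs)
                except BaseException as exc:  # forwarded to the caller below
                    outcome["error"] = exc

            started = time.monotonic()
            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(timeout=ceiling_seconds)
            if worker.is_alive():
                # The thread keeps running in the background; the caller must
                # still cancel the remote job to stop accruing node-hours.
                elapsed = time.monotonic() - started
                raise WallClockCeilingExceeded(
                    f"{func.__name__} exceeded {ceiling_seconds}s "
                    f"(gave up after {elapsed:.2f}s)"
                )
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]
        return wrapper
    return decorator
```

A thread-based ceiling is used here because it works outside the main thread and on any platform; a signal-based alarm or a job-level timeout on the Databricks side may be a better fit depending on how monitor is structured.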

This also captures cleanup now that the default strategy has been removed entirely.

Developer tasks

  • Remove national-scale-spatial-join-databricks-broadcast-12-nodes-large and ...-partitioned-12-nodes-large from benchmarks.yml
  • Rebatch the orphaned large 4-node peers (currently broadcast-12 pairs with broadcast-4, and partitioned-12 with partitioned-4); re-point related_script_ids so the 4-node pair still launches together and main.py::_assert_related_ids_resolvable passes (e.g. pair broadcast-4-large with partitioned-4-large, vCPU sum 40 ≤ 200)
  • Remove the now-unused 12-node entrypoints, benchmark_runner.py dispatch arms, docker-compose.yml services, and matrix rows in pull-request-tests.yml / push-containers-to-acr.yml (only if 12-node is unused at every tier)
  • Confirm no dangling references to the default strategy remain after its removal: grep benchmarks.yml, benchmark_runner.py, entrypoints/__init__.py, app_config.py, docker-compose.yml, and both workflow matrices, then delete national_scale_spatial_join_default.py
  • Add a finite wall-clock ceiling for the national_scale_spatial_join_* branch in monitor so cost/time can't run away (the 5.4 h default-16 case)
  • Reconcile thesis ↔ code: update RQ2 worker levels in Table 4.1.1 (lists 2, 4, 8, 12, 16) and Table 4.2.1; fix the join iteration count (p.44 says seven, code uses NATIONAL_SCALE_SPATIAL_JOIN = 5 + 1 warmup); remove "Sedona + GeoParquet" from the RQ1 single-machine rows in Table 4.2.1 (no Sedona single-machine entrypoints exist)
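The rebatching task can be sanity-checked before running main.py with a small script like the one below. It is only a sketch: the entry fields `id`, `related_script_ids`, and `vcpus` are an assumed shape for benchmarks.yml entries, the short ids are the abbreviations used in this issue rather than the full script ids, and the per-entry value of 20 vCPUs is chosen only so the pair sums to the 40 cited above. The actual check lives in main.py::_assert_related_ids_resolvable and may differ.

```python
VCPU_QUOTA = 200  # upper bound cited in this issue


def check_batches(benchmarks, vcpu_quota=VCPU_QUOTA):
    """Return a list of problems: unresolved related ids or over-quota batches."""
    by_id = {entry["id"]: entry for entry in benchmarks}
    problems = []
    for entry in benchmarks:
        related = entry.get("related_script_ids", [])
        missing = [rid for rid in related if rid not in by_id]
        for rid in missing:
            problems.append(f"{entry['id']}: related id {rid!r} does not resolve")
        if missing:
            continue  # cannot sum vCPUs for a batch with unknown members
        batch_vcpus = entry["vcpus"] + sum(by_id[rid]["vcpus"] for rid in related)
        if batch_vcpus > vcpu_quota:
            problems.append(
                f"{entry['id']}: batch needs {batch_vcpus} vCPUs > {vcpu_quota}"
            )
    return problems


# Hypothetical entries after re-pointing the 4-node pair at each other.
paired = [
    {"id": "broadcast-4-large", "related_script_ids": ["partitioned-4-large"], "vcpus": 20},
    {"id": "partitioned-4-large", "related_script_ids": ["broadcast-4-large"], "vcpus": 20},
]
print(check_batches(paired))  # []
```

If the 12-node entries are deleted without re-pointing, the 4-node entries would still reference them and this check would report one unresolved id per orphaned peer.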

Labels

enhancement: New feature or request
