Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 0 additions & 6 deletions .github/workflows/pull-request-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -193,15 +193,9 @@ jobs:
- service: national-scale-spatial-join-databricks-partitioned-8-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 8 Nodes

- service: national-scale-spatial-join-databricks-broadcast-12-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 12 Nodes

- service: national-scale-spatial-join-databricks-broadcast-16-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 16 Nodes

- service: national-scale-spatial-join-databricks-partitioned-12-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 12 Nodes

- service: national-scale-spatial-join-databricks-partitioned-16-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 16 Nodes

Expand Down
8 changes: 0 additions & 8 deletions .github/workflows/push-containers-to-acr.yml
Original file line number Diff line number Diff line change
Expand Up @@ -129,18 +129,10 @@ jobs:
image: national-scale-spatial-join-databricks-partitioned-8-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 8 Nodes

- service: national-scale-spatial-join-databricks-broadcast-12-nodes
image: national-scale-spatial-join-databricks-broadcast-12-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 12 Nodes

- service: national-scale-spatial-join-databricks-broadcast-16-nodes
image: national-scale-spatial-join-databricks-broadcast-16-nodes
display_name: Sedona National Scale Spatial Join - Broadcast - 16 Nodes

- service: national-scale-spatial-join-databricks-partitioned-12-nodes
image: national-scale-spatial-join-databricks-partitioned-12-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 12 Nodes

- service: national-scale-spatial-join-databricks-partitioned-16-nodes
image: national-scale-spatial-join-databricks-partitioned-16-nodes
display_name: Sedona National Scale Spatial Join - Partitioned - 16 Nodes
Expand Down
2 changes: 1 addition & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ Python, DuckDB (spatial), PostGIS on Azure Database for PostgreSQL, Apache Sedon
- Secrets live in `.env` (gitignored) and are forwarded to ACI as `--secure-environment-variables` from `main.py`.
- Benchmark batching: members of a batch list each other bidirectionally in `related_script_ids`. The orchestrator dedupes via `completed_experiments`, so each batch runs once as one parallel `ThreadPoolExecutor` fan-out. A batch must satisfy four constraints simultaneously: (a) same query type, (b) same `dataset_size`, (c) at most one PostGIS member (shared Azure Postgres server), (d) Databricks cluster vCPU sum ≤ 200, computed as `(workers + 1) × 4` per Sedona member on `Standard_D4s_v3`. See `README.md#pairing-and-randomization` for the full batch listing. *Why: peers must execute under the same wall-clock window for fair comparison, without contending on shared infrastructure or breaching regional quota.*
- Adding a benchmark requires three edits in lockstep: file in `src/presentation/entrypoints/`, `case` arm in `benchmark_runner.py`, and an entry in `benchmarks.yml`. Missing any one silently breaks dispatch or orchestration.
- Stopping rule on `@monitor`: high-frequency single-machine queries use sequential stopping (bootstrapped CI on the mean elapsed time, floors at `BENCHMARK_MIN_ITERATIONS` and `BENCHMARK_MIN_TIMED_WINDOW_SECONDS`, ceiling at the `BenchmarkIteration` value, per-iteration hard timeout at `BENCHMARK_MAX_ITERATION_SECONDS`, cumulative hard timeout at `BENCHMARK_MAX_TIMED_WINDOW_SECONDS`). Long-running low-variance benchmarks (`national_scale_spatial_join_*` for Databricks, DuckDB, and PostGIS) opt out via `use_sequential_stopping=False` and run a small fixed iteration count; they also override `warmup_iterations=1` since one warmup is enough on long-running queries and additional warmups dominate the wall-clock budget. Transient per-iteration failures do not abort the run: failed iterations are recorded as `status="failed"` samples and the loop continues until `BENCHMARK_MAX_CONSECUTIVE_FAILURES` in a row triggers `stop_reason="failed"`; a run that finishes normally but had any failed iterations is tagged `stop_reason="partial"`. *Why: bootstrap CI on <10 samples is uninformative, and the per-iteration cost of those benchmarks (cluster time, shared Postgres) outweighs precision gains. Spark/Azure flakiness is intermittent rather than deterministic — losing 2 of 3 timed iters to one transient executor loss erases data we could have kept.* See `README.md#stopping-rule` for the full rule.
- Stopping rule on `@monitor`: high-frequency single-machine queries use sequential stopping (bootstrapped CI on the mean elapsed time, floors at `BENCHMARK_MIN_ITERATIONS` and `BENCHMARK_MIN_TIMED_WINDOW_SECONDS`, ceiling at the `BenchmarkIteration` value, per-iteration hard timeout at `BENCHMARK_MAX_ITERATION_SECONDS`, cumulative hard timeout at `BENCHMARK_MAX_TIMED_WINDOW_SECONDS`). Long-running low-variance benchmarks (`national_scale_spatial_join_*` for Databricks, DuckDB, and PostGIS) opt out via `use_sequential_stopping=False` and run a small fixed iteration count with a cumulative wall-clock ceiling at `BENCHMARK_MAX_FIXED_WINDOW_SECONDS`; they also override `warmup_iterations=1` since one warmup is enough on long-running queries and additional warmups dominate the wall-clock budget. Transient per-iteration failures do not abort the run: failed iterations are recorded as `status="failed"` samples and the loop continues until `BENCHMARK_MAX_CONSECUTIVE_FAILURES` in a row triggers `stop_reason="failed"`; a run that finishes normally but had any failed iterations is tagged `stop_reason="partial"`. *Why: bootstrap CI on <10 samples is uninformative, and the per-iteration cost of those benchmarks (cluster time, shared Postgres) outweighs precision gains. Spark/Azure flakiness is intermittent rather than deterministic — losing 2 of 3 timed iters to one transient executor loss erases data we could have kept.* See `README.md#stopping-rule` for the full rule.

## Commands

Expand Down
8 changes: 0 additions & 8 deletions benchmark_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,12 +18,10 @@
national_scale_spatial_join_databricks_broadcast_2_nodes,
national_scale_spatial_join_databricks_broadcast_4_nodes,
national_scale_spatial_join_databricks_broadcast_8_nodes,
national_scale_spatial_join_databricks_broadcast_12_nodes,
national_scale_spatial_join_databricks_broadcast_16_nodes,
national_scale_spatial_join_databricks_partitioned_2_nodes,
national_scale_spatial_join_databricks_partitioned_4_nodes,
national_scale_spatial_join_databricks_partitioned_8_nodes,
national_scale_spatial_join_databricks_partitioned_12_nodes,
national_scale_spatial_join_databricks_partitioned_16_nodes,
)

Expand Down Expand Up @@ -84,9 +82,6 @@ def benchmark_runner() -> None:
case "national-scale-spatial-join-databricks-broadcast-8-nodes":
national_scale_spatial_join_databricks_broadcast_8_nodes()
return
case "national-scale-spatial-join-databricks-broadcast-12-nodes":
national_scale_spatial_join_databricks_broadcast_12_nodes()
return
case "national-scale-spatial-join-databricks-broadcast-16-nodes":
national_scale_spatial_join_databricks_broadcast_16_nodes()
return
Expand All @@ -99,9 +94,6 @@ def benchmark_runner() -> None:
case "national-scale-spatial-join-databricks-partitioned-8-nodes":
national_scale_spatial_join_databricks_partitioned_8_nodes()
return
case "national-scale-spatial-join-databricks-partitioned-12-nodes":
national_scale_spatial_join_databricks_partitioned_12_nodes()
return
case "national-scale-spatial-join-databricks-partitioned-16-nodes":
national_scale_spatial_join_databricks_partitioned_16_nodes()
return
Expand Down
38 changes: 11 additions & 27 deletions benchmarks.yml
Original file line number Diff line number Diff line change
Expand Up @@ -143,16 +143,10 @@ experiments:
- bbox-filtering-duckdb-large

# ==================================================================
# RQ2 — National-scale spatial join (29 experiments)
# RQ2 — National-scale spatial join (25 experiments)
# Single-node: DuckDB + PostGIS at small/medium/large (paired).
# Sedona: broadcast + partitioned at variable worker counts × 3
# sizes, unpaired (each provisions its own cluster). The `default`
# strategy is retained at large tier only (2/8/16 nodes) as a
# within-Sedona baseline; small and medium default cells and
# large-tier 4-/12-node default cells are pruned (issue #309).
# The 2-node row is omitted at small for broadcast/partitioned
# (weakly differentiated from default); the freed cells fund the
# 12-/16-node extension of the scaling curve at large.
# Sedona: broadcast + partitioned at {2, 4, 8, 16} workers × 3
# sizes, unpaired (each provisions its own cluster).
# ==================================================================

# ---- single-node ----
Expand Down Expand Up @@ -227,6 +221,7 @@ experiments:
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-broadcast-4-nodes-large
- national-scale-spatial-join-databricks-broadcast-16-nodes-large

- id: national-scale-spatial-join-databricks-broadcast-4-nodes-small
Expand All @@ -251,7 +246,8 @@ experiments:
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-broadcast-12-nodes-large
- national-scale-spatial-join-databricks-broadcast-2-nodes-large
- national-scale-spatial-join-databricks-broadcast-16-nodes-large

- id: national-scale-spatial-join-databricks-broadcast-8-nodes-small
image: doppaacr.azurecr.io/national-scale-spatial-join-databricks-broadcast-8-nodes:latest
Expand Down Expand Up @@ -281,21 +277,14 @@ experiments:
related_script_ids:
- national-scale-spatial-join-databricks-partitioned-8-nodes-large

- id: national-scale-spatial-join-databricks-broadcast-12-nodes-large
image: doppaacr.azurecr.io/national-scale-spatial-join-databricks-broadcast-12-nodes:latest
cpu: 4
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-broadcast-4-nodes-large

- id: national-scale-spatial-join-databricks-broadcast-16-nodes-large
image: doppaacr.azurecr.io/national-scale-spatial-join-databricks-broadcast-16-nodes:latest
cpu: 4
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-broadcast-2-nodes-large
- national-scale-spatial-join-databricks-broadcast-4-nodes-large

# ---- Sedona partitioned strategy ----
- id: national-scale-spatial-join-databricks-partitioned-2-nodes-medium
Expand All @@ -312,6 +301,7 @@ experiments:
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-partitioned-4-nodes-large
- national-scale-spatial-join-databricks-partitioned-16-nodes-large

- id: national-scale-spatial-join-databricks-partitioned-4-nodes-small
Expand All @@ -336,7 +326,8 @@ experiments:
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-partitioned-12-nodes-large
- national-scale-spatial-join-databricks-partitioned-2-nodes-large
- national-scale-spatial-join-databricks-partitioned-16-nodes-large

- id: national-scale-spatial-join-databricks-partitioned-8-nodes-small
image: doppaacr.azurecr.io/national-scale-spatial-join-databricks-partitioned-8-nodes:latest
Expand Down Expand Up @@ -366,19 +357,12 @@ experiments:
related_script_ids:
- national-scale-spatial-join-databricks-broadcast-8-nodes-large

- id: national-scale-spatial-join-databricks-partitioned-12-nodes-large
image: doppaacr.azurecr.io/national-scale-spatial-join-databricks-partitioned-12-nodes:latest
cpu: 4
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-partitioned-4-nodes-large

- id: national-scale-spatial-join-databricks-partitioned-16-nodes-large
image: doppaacr.azurecr.io/national-scale-spatial-join-databricks-partitioned-16-nodes:latest
cpu: 4
memory_gb: 16
dataset_size: large
related_script_ids:
- national-scale-spatial-join-databricks-partitioned-2-nodes-large
- national-scale-spatial-join-databricks-partitioned-4-nodes-large

18 changes: 0 additions & 18 deletions docker-compose.yml
Original file line number Diff line number Diff line change
Expand Up @@ -182,15 +182,6 @@ services:
image: national-scale-spatial-join-databricks-partitioned-8-nodes:latest
command: python benchmark_runner.py --script-id national-scale-spatial-join-databricks-partitioned-8-nodes --benchmark-run 1 --run-id ABCDEF

national-scale-spatial-join-databricks-broadcast-12-nodes:
env_file:
- .env
build:
context: .
dockerfile: .docker/Query.Dockerfile
image: national-scale-spatial-join-databricks-broadcast-12-nodes:latest
command: python benchmark_runner.py --script-id national-scale-spatial-join-databricks-broadcast-12-nodes --benchmark-run 1 --run-id ABCDEF

national-scale-spatial-join-databricks-broadcast-16-nodes:
env_file:
- .env
Expand All @@ -200,15 +191,6 @@ services:
image: national-scale-spatial-join-databricks-broadcast-16-nodes:latest
command: python benchmark_runner.py --script-id national-scale-spatial-join-databricks-broadcast-16-nodes --benchmark-run 1 --run-id ABCDEF

national-scale-spatial-join-databricks-partitioned-12-nodes:
env_file:
- .env
build:
context: .
dockerfile: .docker/Query.Dockerfile
image: national-scale-spatial-join-databricks-partitioned-12-nodes:latest
command: python benchmark_runner.py --script-id national-scale-spatial-join-databricks-partitioned-12-nodes --benchmark-run 1 --run-id ABCDEF

national-scale-spatial-join-databricks-partitioned-16-nodes:
env_file:
- .env
Expand Down
15 changes: 14 additions & 1 deletion src/application/common/monitor.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,8 @@ def monitor(
(soft ceiling: kept open until the 60-second floor is met to keep the cost-metric
window valid), ``Config.BENCHMARK_MAX_ITERATION_SECONDS`` (per-iteration hard timeout),
and ``Config.BENCHMARK_MAX_TIMED_WINDOW_SECONDS`` (cumulative hard timeout).
Set to False for Databricks national-scale runs, which use a fixed iteration count.
Set to False for Databricks national-scale runs, which use a fixed iteration count
with a cumulative wall-clock ceiling of ``Config.BENCHMARK_MAX_FIXED_WINDOW_SECONDS``.
Default is True.
:param warmup_iterations: Override the number of warmup iterations. ``None`` falls back
to ``Config.BENCHMARK_WARMUP_ITERATIONS``. Long-running benchmarks (national-scale
Expand Down Expand Up @@ -294,6 +295,18 @@ def wrapper(*args, **kwargs):
break

if not use_sequential_stopping:
window_seconds = (
datetime.datetime.now(datetime.UTC) - timed_loop_start
).total_seconds()
if window_seconds >= Config.BENCHMARK_MAX_FIXED_WINDOW_SECONDS:
stop_reason = StopReason.TIMEOUT
logger.warning(
f"Fixed-iteration window for '{query_id}' reached "
f"BENCHMARK_MAX_FIXED_WINDOW_SECONDS="
f"{Config.BENCHMARK_MAX_FIXED_WINDOW_SECONDS}s after "
f"{iteration} iterations; stopping."
)
break
if iteration >= ceiling:
stop_reason = StopReason.FIXED
break
Expand Down
3 changes: 3 additions & 0 deletions src/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,9 @@ class Config:
BENCHMARK_CI_CONFIDENCE: float = 0.95
BENCHMARK_MAX_CONSECUTIVE_FAILURES: int = 3

# Fixed-iteration cumulative wall-clock ceiling (75 min)
BENCHMARK_MAX_FIXED_WINDOW_SECONDS: int = 75 * 60

INGESTION_DELAY_SECONDS: int = 600

# DATABRICKS
Expand Down
2 changes: 0 additions & 2 deletions src/presentation/entrypoints/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,8 @@
from .national_scale_spatial_join_databricks_broadcast_2_nodes import national_scale_spatial_join_databricks_broadcast_2_nodes
from .national_scale_spatial_join_databricks_broadcast_4_nodes import national_scale_spatial_join_databricks_broadcast_4_nodes
from .national_scale_spatial_join_databricks_broadcast_8_nodes import national_scale_spatial_join_databricks_broadcast_8_nodes
from .national_scale_spatial_join_databricks_broadcast_12_nodes import national_scale_spatial_join_databricks_broadcast_12_nodes
from .national_scale_spatial_join_databricks_broadcast_16_nodes import national_scale_spatial_join_databricks_broadcast_16_nodes
from .national_scale_spatial_join_databricks_partitioned_2_nodes import national_scale_spatial_join_databricks_partitioned_2_nodes
from .national_scale_spatial_join_databricks_partitioned_4_nodes import national_scale_spatial_join_databricks_partitioned_4_nodes
from .national_scale_spatial_join_databricks_partitioned_8_nodes import national_scale_spatial_join_databricks_partitioned_8_nodes
from .national_scale_spatial_join_databricks_partitioned_12_nodes import national_scale_spatial_join_databricks_partitioned_12_nodes
from .national_scale_spatial_join_databricks_partitioned_16_nodes import national_scale_spatial_join_databricks_partitioned_16_nodes

This file was deleted.

This file was deleted.

Loading