Fix cost model gaps: missing Databricks blob cost, driver node, cross-region egress #342

## Summary

A deep audit of the cost analytics pipeline against the thesis cost model (Equation 4.1) and the RQ1/RQ2 research questions revealed several gaps where costs are underestimated, most significantly for the Databricks/Sedona benchmarks.

Net effect: **Databricks cost is systematically underestimated** relative to DuckDB and PostGIS, biasing the RQ2 crossover analysis.

## Bugs

### 1. Databricks reads from Blob Storage but doesn't pay for it

`_databricks_benchmark_runner.py:64-65`:
```python
CostConfiguration(include_aci=True, include_databricks=True, num_workers=num_workers)
# include_blob_storage=False ← default
```

Sedona executors read the **same GeoParquet files** from the **same ADLS Gen2 account** that DuckDB reads. Those reads incur blob read transactions ($0.006/10K ops). DuckDB pays this via `include_blob_storage=True`; Databricks does not.

**Fix**: Add `include_blob_storage=True` to Databricks `CostConfiguration`.
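
For a sense of scale, here is a minimal sketch of the read-transaction charge that Databricks runs currently skip. `blob_read_cost` is a hypothetical helper, not a function from the repository; the only figure taken from this issue is the $0.006 per 10K operations rate.

```python
# Hypothetical helper for illustration; not part of the codebase.
READ_PRICE_PER_10K_OPS = 0.006  # USD, the rate quoted in this issue


def blob_read_cost(read_transactions: int,
                   price_per_10k: float = READ_PRICE_PER_10K_OPS) -> float:
    """Return the USD cost of a given number of blob read transactions."""
    if read_transactions < 0:
        raise ValueError("read_transactions must be non-negative")
    # Price is quoted per 10,000 operations, so scale the count accordingly.
    return read_transactions / 10_000 * price_per_10k
```

The absolute amount is small per run, but it is charged to DuckDB and not to Databricks, which is what matters for the comparison.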

### 2. Databricks cost omits driver node

`azure_cost_service.py:108-111`:
```python
dbu_cost = usage.num_workers * pricing.dbu_per_node_per_hour * ...
vm_cost = usage.num_workers * pricing.vm_cost_per_node_per_hour * ...
```

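The cluster has `num_workers + 1` nodes (the driver is also a `Standard_D4s_v3`), but both terms above multiply by `num_workers` only. CLAUDE.md already uses `(workers + 1) × 4` for vCPU quota calculations.

Impact by cluster size (share of the true node cost that is omitted, `1 / (num_workers + 1)`):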
- 2-worker: **33%** underestimate
- 4-worker: **20%**
- 8-worker: **11%**
- 16-worker: **6%**

This asymmetry biases scaling analysis — small clusters appear relatively cheaper than they are.

**Fix**: Use `num_workers + 1` (or expose a `total_nodes` parameter that includes the driver).
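
The percentages above can be reproduced with a short sketch. It assumes the driver is billed at the same per-node DBU and VM rates as a worker (consistent with the driver also being `Standard_D4s_v3`); `driver_underestimate` is a hypothetical name used only for illustration.

```python
def driver_underestimate(num_workers: int) -> float:
    """Fraction of the true node cost missed when the driver is not billed.

    Assumes driver and workers share the same per-node hourly rate.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    total_nodes = num_workers + 1  # workers plus one driver
    # Billed nodes = num_workers, true nodes = total_nodes,
    # so the missing share is 1 / total_nodes.
    return 1 / total_nodes


for n in (2, 4, 8, 16):
    print(f"{n}-worker: {driver_underestimate(n):.0%}")
# 2-worker: 33%
# 4-worker: 20%
# 8-worker: 11%
# 16-worker: 6%
```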

### 3. Cross-region egress for Databricks blob reads is unaccounted

The pricing service confirms that Blob Storage is in **Norway East** while the Databricks workspace is in **Sweden Central**. When Databricks executors read GeoParquet, the transfer is inter-region egress (~$0.02/GB for intra-Europe cross-region transfer). The current code sets all egress to $0.00 on the assumption that all traffic is intra-region.

For a national-scale spatial join on the large tier, executors read gigabytes of building data, so this cost is not negligible.

**Fix**: Add cross-region egress rate to blob storage pricing when accessed from Databricks (or document the decision to exclude it with justification).
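
A minimal sketch of the missing term, using the approximate $0.02/GB rate quoted above. `cross_region_egress_cost` and its parameters are hypothetical, and the decimal-GB conversion is an assumption that should be checked against how the bill is actually metered.

```python
# Hypothetical helper for illustration; not part of AzurePricingService.
EGRESS_USD_PER_GB = 0.02  # approximate intra-Europe cross-region rate from this issue


def cross_region_egress_cost(bytes_read: int,
                             usd_per_gb: float = EGRESS_USD_PER_GB) -> float:
    """Return the USD egress cost for bytes read across regions."""
    if bytes_read < 0:
        raise ValueError("bytes_read must be non-negative")
    gigabytes = bytes_read / 1e9  # assumption: decimal GB (10^9 bytes)
    return gigabytes * usd_per_gb
```

Under these assumptions, reading 5 GB of GeoParquet per run adds roughly $0.10, which accumulates across repeated benchmark iterations.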

### 4. `get_blob_storage_usage()` hardcodes `DatasetSize.SMALL`

`azure_metric_service.py:119`:
```python
dataset_size=DatasetSize.SMALL,  # ← hardcoded
```

Blob count and storage size are always measured from the small-tier path, even for medium/large DuckDB benchmarks. This means `read_transactions`, `storage_bytes`, and therefore `operations_cost` and `storage_cost` are wrong for non-small tiers.

**Fix**: Pass actual `dataset_size` from the DI container.

## Documentation gaps (Section 5.6.5 "What is not measured")

- [ ] Blob writes for monitoring data (parquet samples, cost analytics) omitted for all technologies — shared overhead, cancels in comparisons
- [ ] ACI cost is idle-wait for PostGIS/Databricks vs actual query compute for DuckDB/Shapefile
- [ ] Cluster provisioning/teardown time excluded from Databricks cost window
- [ ] Blob operation count is approximate — one read per blob, but DuckDB may skip files via bbox pushdown or issue multiple range requests
- [ ] Stale cost constants in `src/config.py` (`AZURE_ACI_VCPU_PRICE_PER_SECOND = 0.0002`) differ from `AzurePricingService` values — unused but confusing

## Files to change

- `src/presentation/entrypoints/_databricks_benchmark_runner.py` — add `include_blob_storage=True`
- `src/infra/infrastructure/services/azure_cost_service.py` — `num_workers + 1` for driver
- `src/infra/infrastructure/services/azure_pricing_service.py` — cross-region egress rate
- `src/infra/infrastructure/services/azure_metric_service.py` — pass actual dataset_size
- `src/config.py` — remove or mark stale cost constants
