Summary
A deep audit of the cost analytics pipeline against the thesis cost model (Equation 4.1) and the RQ1/RQ2 research questions revealed several gaps where costs are underestimated, most significantly for the Databricks/Sedona benchmarks.
Net effect: Databricks cost is systematically underestimated relative to DuckDB and PostGIS, biasing the RQ2 crossover analysis.
Bugs
1. Databricks reads from Blob Storage but doesn't pay for it
_databricks_benchmark_runner.py:64-65:
CostConfiguration(include_aci=True, include_databricks=True, num_workers=num_workers)
# include_blob_storage=False ← default
Sedona executors read the same GeoParquet files from the same ADLS Gen2 account that DuckDB reads. Those reads incur blob read transactions ($0.006/10K ops). DuckDB pays this via include_blob_storage=True; Databricks does not.
Fix: Add include_blob_storage=True to Databricks CostConfiguration.
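A minimal sketch of the change, assuming CostConfiguration behaves like a dataclass with the boolean flags named above. The class below is a hypothetical stand-in written for illustration, and the real class in the repository may carry other fields. The helper read_transaction_cost and the constant READ_PRICE_PER_10K_OPS are also illustrative names; they only apply the $0.006 per 10K operations rate quoted above.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the project's CostConfiguration; only the
# fields named in this audit are modelled.
@dataclass(frozen=True)
class CostConfiguration:
    include_aci: bool = False
    include_databricks: bool = False
    include_blob_storage: bool = False  # default, as noted in the audit
    num_workers: int = 0


READ_PRICE_PER_10K_OPS = 0.006  # USD, rate quoted in this audit


def read_transaction_cost(read_transactions: int) -> float:
    """Cost in USD of blob read transactions at the quoted rate."""
    return read_transactions / 10_000 * READ_PRICE_PER_10K_OPS


def databricks_cost_configuration(num_workers: int) -> CostConfiguration:
    """Databricks configuration with blob storage costs switched on."""
    return CostConfiguration(
        include_aci=True,
        include_databricks=True,
        include_blob_storage=True,  # the proposed fix
        num_workers=num_workers,
    )
```

With the flag set, Databricks runs would be charged for blob reads on the same basis as DuckDB runs.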
2. Databricks cost omits driver node
azure_cost_service.py:108-111:
dbu_cost = usage.num_workers * pricing.dbu_per_node_per_hour * ...
vm_cost = usage.num_workers * pricing.vm_cost_per_node_per_hour * ...
The cluster has num_workers + 1 nodes (the driver is also Standard_D4s_v3). CLAUDE.md already uses (workers + 1) × 4 for vCPU quota calculations.
Impact by cluster size:
- 2-worker: 33% underestimate
- 4-worker: 20%
- 8-worker: 11%
- 16-worker: 6%
This asymmetry biases scaling analysis — small clusters appear relatively cheaper than they are.
Fix: Use num_workers + 1 (or expose a total_nodes parameter that includes the driver).
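The percentages above follow from the driver being one node out of num_workers + 1. The sketch below reproduces them, assuming the driver is billed at the same per-node VM and DBU rate as a worker (the audit states only that the VM type is the same). The function name driver_underestimate is illustrative.

```python
def driver_underestimate(num_workers: int) -> float:
    """Fraction of the true per-node cost missed when the driver is ignored.

    Assumes the driver costs the same per hour as a single worker, so the
    billed share is num_workers / (num_workers + 1).
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    total_nodes = num_workers + 1  # workers plus one driver
    return 1 / total_nodes


for workers in (2, 4, 8, 16):
    print(f"{workers}-worker: {driver_underestimate(workers):.1%} underestimate")
```

The missed fraction shrinks as the cluster grows, which is why the error distorts the comparison between cluster sizes rather than shifting all of them equally.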
3. Cross-region egress for Databricks blob reads is unaccounted
The pricing service confirms that Blob Storage is in Norway East while the Databricks workspace is in Sweden Central. When Databricks executors read GeoParquet, the traffic is inter-region egress (~$0.02/GB for intra-Europe cross-region transfer). The current code sets all egress to $0.00 on the assumption that every read is intra-region.
For a national-scale spatial join on the large tier, executors read gigabytes of building data, so this cost is real.
Fix: Add cross-region egress rate to blob storage pricing when accessed from Databricks (or document the decision to exclude it with justification).
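A rough sketch of how such a rate could be applied, assuming the number of bytes read by the executors is available from metrics. The function egress_cost and both constants are hypothetical names, the rate is the approximate figure quoted above, and the choice of a binary gigabyte is an assumption that should be checked against the Azure billing unit.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: billing unit treated as binary GB
CROSS_REGION_EGRESS_PER_GB = 0.02  # USD, approximate rate quoted above


def egress_cost(
    bytes_read: int,
    same_region: bool,
    rate_per_gb: float = CROSS_REGION_EGRESS_PER_GB,
) -> float:
    """Egress cost in USD for data read from Blob Storage.

    Reads from a client in the storage account's own region are treated
    as free, matching the current behaviour; cross-region reads are
    charged per GB.
    """
    if bytes_read < 0:
        raise ValueError("bytes_read must be non-negative")
    if same_region:
        return 0.0
    return bytes_read / BYTES_PER_GB * rate_per_gb


# Example: 10 GB read by Databricks executors in another region.
databricks_egress = egress_cost(10 * BYTES_PER_GB, same_region=False)
```

Keeping same_region as an explicit argument lets the same-region engines stay at $0.00 while only the Databricks path picks up the charge.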
4. get_blob_storage_usage() hardcodes DatasetSize.SMALL
azure_metric_service.py:119:
dataset_size=DatasetSize.SMALL, # ← hardcoded
Blob count and storage size are always measured from the small-tier path, even for medium/large DuckDB benchmarks. This means read_transactions, storage_bytes, and therefore operations_cost and storage_cost are wrong for non-small tiers.
Fix: Pass actual dataset_size from the DI container.
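One possible shape for the fix, as a sketch only. The DatasetSize values, the path layout, and the list_blob_sizes callable are hypothetical stand-ins for the real enum, the real container layout, and the Azure SDK listing call. The point shown is that the tier arrives as an argument instead of being fixed inside the function.

```python
from enum import Enum
from typing import Callable, Dict, List


class DatasetSize(Enum):
    # Hypothetical values; the real enum lives in the repository.
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


def get_blob_storage_usage(
    dataset_size: DatasetSize,
    list_blob_sizes: Callable[[str], List[int]],
) -> Dict[str, int]:
    """Measure blob count and total bytes for the requested tier."""
    prefix = f"{dataset_size.value}/"  # hypothetical path layout
    sizes = list_blob_sizes(prefix)
    return {"blob_count": len(sizes), "storage_bytes": sum(sizes)}


# Fake listing used in place of a real storage client.
FAKE_CONTAINER = {"small/": [10, 20], "large/": [1_000, 2_000, 3_000]}

usage = get_blob_storage_usage(
    DatasetSize.LARGE,
    lambda prefix: FAKE_CONTAINER.get(prefix, []),
)
```

Injecting the listing function also makes the measurement testable without a live storage account.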
Documentation gaps (Section 5.6.5 "What is not measured")
- Cost constants in src/config.py (AZURE_ACI_VCPU_PRICE_PER_SECOND = 0.0002) differ from AzurePricingService values; they are unused but confusing.
Files to change
- src/presentation/entrypoints/_databricks_benchmark_runner.py: add include_blob_storage=True
- src/infra/infrastructure/services/azure_cost_service.py: num_workers + 1 for driver
- src/infra/infrastructure/services/azure_pricing_service.py: cross-region egress rate
- src/infra/infrastructure/services/azure_metric_service.py: pass actual dataset_size
- src/config.py: remove or mark stale cost constants