
Fix cost model gaps: missing Databricks blob cost, driver node, cross-region egress #342

@jathavaan


Summary

Deep audit of the cost analytics pipeline against the thesis cost model (Equation 4.1) and RQ1/RQ2 research questions revealed several gaps where costs are underestimated — most significantly for Databricks/Sedona benchmarks.

Net effect: Databricks cost is systematically underestimated relative to DuckDB and PostGIS, biasing the RQ2 crossover analysis.

Bugs

1. Databricks reads from Blob Storage but doesn't pay for it

_databricks_benchmark_runner.py:64-65:

CostConfiguration(include_aci=True, include_databricks=True, num_workers=num_workers)
# include_blob_storage=False ← default

Sedona executors read the same GeoParquet files from the same ADLS Gen2 account that DuckDB reads. Those reads incur blob read transactions ($0.006/10K ops). DuckDB pays this via include_blob_storage=True; Databricks does not.

Fix: Add include_blob_storage=True to Databricks CostConfiguration.
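A minimal sketch of the fix and of the cost it restores. The `CostConfiguration` dataclass below is a hypothetical stand-in for the project's class (only the four fields visible in the snippet above are modelled), and `blob_read_cost` is an illustrative helper that applies the $0.006 per 10K operations rate quoted above. Neither is taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostConfiguration:
    """Hypothetical stand-in; the real class may have more fields."""
    include_aci: bool = False
    include_databricks: bool = False
    include_blob_storage: bool = False  # default noted in this issue
    num_workers: int = 0


def databricks_cost_configuration(num_workers: int) -> CostConfiguration:
    # Same call as the runner, plus the missing blob storage flag.
    return CostConfiguration(
        include_aci=True,
        include_databricks=True,
        include_blob_storage=True,
        num_workers=num_workers,
    )


def blob_read_cost(read_transactions: int, price_per_10k: float = 0.006) -> float:
    # Read transactions are billed per 10,000 operations.
    return read_transactions / 10_000 * price_per_10k
```

With this change the Databricks configuration carries the same blob storage flag as the DuckDB one, so both technologies are charged for reading the same files.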

2. Databricks cost omits driver node

azure_cost_service.py:108-111:

dbu_cost = usage.num_workers * pricing.dbu_per_node_per_hour * ...
vm_cost = usage.num_workers * pricing.vm_cost_per_node_per_hour * ...

The cluster has num_workers + 1 nodes, because the driver is also a Standard_D4s_v3 VM. CLAUDE.md already uses (workers + 1) × 4 for vCPU quota calculations.

Impact by cluster size (share of the true node cost that is omitted, 1 / (num_workers + 1)):

  • 2-worker: 33% underestimate
  • 4-worker: 20%
  • 8-worker: 11%
  • 16-worker: 6%

This asymmetry biases scaling analysis — small clusters appear relatively cheaper than they are.

Fix: Use num_workers + 1 (or expose a total_nodes parameter that includes the driver).
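The percentages above follow from 1 / (num_workers + 1). The sketch below reproduces them and shows one way the corrected formula could look. The parameter names `dbu_price` and `hours` are assumptions, since the original expressions are truncated in the snippet above, so the multiplication chain is illustrative rather than the actual code in azure_cost_service.py.

```python
def databricks_compute_cost(
    num_workers: int,
    dbu_per_node_per_hour: float,
    dbu_price: float,                 # assumed: price per DBU
    vm_cost_per_node_per_hour: float,
    hours: float,                     # assumed: billed duration in hours
) -> float:
    # Count the driver as a billable node alongside the workers.
    total_nodes = num_workers + 1
    dbu_cost = total_nodes * dbu_per_node_per_hour * dbu_price * hours
    vm_cost = total_nodes * vm_cost_per_node_per_hour * hours
    return dbu_cost + vm_cost


def driver_underestimate_fraction(num_workers: int) -> float:
    # Fraction of the true node cost that is lost when the driver is ignored.
    return 1.0 / (num_workers + 1)


for n in (2, 4, 8, 16):
    print(n, f"{driver_underestimate_fraction(n):.0%}")
```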

3. Cross-region egress for Databricks blob reads is unaccounted

The pricing service confirms that Blob Storage is in Norway East and the Databricks workspace is in Sweden Central. When Databricks executors read GeoParquet, the transfer is inter-region egress (~$0.02/GB for intra-Europe cross-region transfer). The current code sets all egress to $0.00 on the assumption that traffic stays within one region.

For a national-scale spatial join on the large tier, executors read gigabytes of building data, so this cost is not negligible.

Fix: Add cross-region egress rate to blob storage pricing when accessed from Databricks (or document the decision to exclude it with justification).
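A rough sketch of how the egress term could be computed. The function name, the region strings, and the choice of decimal gigabytes (10^9 bytes) are assumptions for illustration. The $0.02/GB default is the approximate rate quoted above and should be checked against current Azure bandwidth pricing before use.

```python
def cross_region_egress_cost(
    bytes_read: int,
    storage_region: str,
    compute_region: str,
    rate_per_gb: float = 0.02,  # approximate intra-Europe rate from this issue
) -> float:
    # Reads within the same region incur no egress charge.
    if storage_region == compute_region:
        return 0.0
    # Assumption: billing uses decimal gigabytes (1 GB = 10**9 bytes).
    return bytes_read / 1e9 * rate_per_gb


# Example: 5 GB read from Norway East by executors in Sweden Central.
print(cross_region_egress_cost(5_000_000_000, "norwayeast", "swedencentral"))
```

Passing both regions explicitly keeps the DuckDB path at zero egress when its compute runs in the storage region, and only charges the Databricks path.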

4. get_blob_storage_usage() hardcodes DatasetSize.SMALL

azure_metric_service.py:119:

dataset_size=DatasetSize.SMALL,  # ← hardcoded

Blob count and storage size are always measured from the small-tier path, even for medium/large DuckDB benchmarks. This means read_transactions, storage_bytes, and therefore operations_cost and storage_cost are wrong for non-small tiers.

Fix: Pass actual dataset_size from the DI container.
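One possible shape for the fix, sketched with hypothetical types. `DatasetSize`, `BlobStorageUsage` and the injected `list_blob_sizes` callable are stand-ins, not the repository's actual signatures. The one-read-per-blob count mirrors the approximation this issue describes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class DatasetSize(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


@dataclass(frozen=True)
class BlobStorageUsage:
    dataset_size: DatasetSize
    read_transactions: int
    storage_bytes: int


def get_blob_storage_usage(
    dataset_size: DatasetSize,
    list_blob_sizes: Callable[[DatasetSize], Iterable[int]],
) -> BlobStorageUsage:
    # Measure the tier that was actually benchmarked, not a fixed SMALL.
    sizes = list(list_blob_sizes(dataset_size))
    return BlobStorageUsage(
        dataset_size=dataset_size,
        read_transactions=len(sizes),  # approximation: one read per blob
        storage_bytes=sum(sizes),
    )
```

In the real service the tier would come from the DI container rather than a function argument, but the point is the same: the value must flow in from the benchmark run.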

Documentation gaps (Section 5.6.5 "What is not measured")

  • Blob writes for monitoring data (parquet samples, cost analytics) omitted for all technologies — shared overhead, cancels in comparisons
  • ACI cost is idle-wait for PostGIS/Databricks vs actual query compute for DuckDB/Shapefile
  • Cluster provisioning/teardown time excluded from Databricks cost window
  • Blob operation count is approximate — one read per blob, but DuckDB may skip files via bbox pushdown or issue multiple range requests
  • Stale cost constants in src/config.py (AZURE_ACI_VCPU_PRICE_PER_SECOND = 0.0002) differ from AzurePricingService values — unused but confusing

Files to change

  • src/presentation/entrypoints/_databricks_benchmark_runner.py — add include_blob_storage=True
  • src/infra/infrastructure/services/azure_cost_service.py — num_workers + 1 for driver
  • src/infra/infrastructure/services/azure_pricing_service.py — cross-region egress rate
  • src/infra/infrastructure/services/azure_metric_service.py — pass actual dataset_size
  • src/config.py — remove or mark stale cost constants

Labels: enhancement