#342 Fix cost model gaps: driver node, blob cost, cross-region egress #343

Merged
jathavaan merged 3 commits into main from feature/342-fix-cost-model-gaps-missing-databricks-blob-cost-driver-node-cross-region-egress
May 24, 2026

Conversation

@jathavaan
Collaborator

Summary

  • Driver node: Databricks cost now uses num_workers + 1 (driver is same VM type). Previously underestimated by 33% on 2-worker clusters.
  • Blob storage cost for Databricks: Sedona reads the same GeoParquet from ADLS Gen2 as DuckDB but wasn't charged. Added include_blob_storage=True.
  • Cross-region egress: Blob Storage is in Norway East, Databricks in Sweden Central. Added $0.02/GB intra-Europe cross-region rate.
  • Dataset size fix: get_blob_storage_usage() hardcoded DatasetSize.SMALL — now resolves actual size from DI.
  • Cleanup: Removed 10 stale zero-valued cost constants from Config (superseded by AzurePricingService). Also resolves #302 (Cost model: only ACI compute is priced; storage / ops / DB / egress all zero).
  • Driver memory: Reduced from 14g to 9g (Standard_D4s_v3 max is 10069 MB).
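The driver-node arithmetic above can be sketched as follows. This is a minimal illustration, not the repository's cost model: `ClusterCost`, `hourly_rate_per_node`, and `worker_only_underestimate` are hypothetical names, and the hourly rate is a placeholder rather than an actual Azure price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterCost:
    """Hypothetical helper; not the repository's actual cost model."""

    num_workers: int
    hourly_rate_per_node: float  # assumed price of one VM, in USD per hour

    @property
    def billed_nodes(self) -> int:
        # The driver runs on the same VM type as the workers, so it is billed too.
        return self.num_workers + 1

    def cost(self, runtime_hours: float) -> float:
        return self.billed_nodes * self.hourly_rate_per_node * runtime_hours


def worker_only_underestimate(num_workers: int) -> float:
    """Fraction of the true compute cost missed when only workers are billed."""
    return 1 / (num_workers + 1)
```

With 2 workers the missed share is 1/3 (about 33%), and with 16 workers it is 1/17 (about 6%), which matches the range quoted in the commit message.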

Test plan

  • Ran benchmark_runner.py for Databricks broadcast 4-node small tier
  • Verified blob_cost.parquet now generated for Databricks (previously absent)
  • Verified databricks_cost.parquet reflects 5 nodes (4 workers + 1 driver)
  • Verified blob network_cost > 0 (cross-region egress applied)
  • Verified blob operations_cost > 0 (correct dataset size used for blob counting)

Closes #342
Closes #302

jathavaan added 3 commits May 24, 2026 11:56
Cluster has num_workers + 1 nodes (driver is same Standard_D4s_v3).
Previously only billed workers, underestimating by 33% on 2-worker
clusters down to 6% on 16-worker clusters.
Databricks reads GeoParquet from Blob Storage (Norway East) but was not
charged for it. Additionally, since Databricks is in Sweden Central,
those reads are inter-region egress at $0.02/GB.

- Add include_blob_storage=True and is_cross_region_blob=True to
  Databricks CostConfiguration
- Add cross_region_egress_per_gb to BlobStoragePricing
- Thread dataset_size from DI through cost pipeline so blob operation
  counts reflect actual tier (was hardcoded to SMALL)
- Update interfaces and implementations with new parameters
- Remove 10 zero-valued cost constants from Config that were superseded
  by AzurePricingService (resolves #302)
- Reduce DATABRICKS_DRIVER_MEMORY from 14g to 9g (Standard_D4s_v3 max
  is 10069 MB)
- Reduce DATABRICKS_DRIVER_MAX_RESULT_SIZE from 8g to 4g accordingly
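
As a rough sketch of how the cross-region egress charge described in these commits might be computed: the names `BlobStoragePricing`, `cross_region_egress_per_gb`, and `is_cross_region_blob` come from the commit message, but the function `blob_network_cost`, its signature, and the assumption that same-region reads cost nothing are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobStoragePricing:
    # Field name from the PR; the default is the $0.02/GB rate quoted there.
    cross_region_egress_per_gb: float = 0.02


def blob_network_cost(
    gb_read: float,
    pricing: BlobStoragePricing,
    is_cross_region_blob: bool,
) -> float:
    """Hypothetical helper: network cost of reading gb_read GB from Blob Storage."""
    if not is_cross_region_blob:
        # Assumption for this sketch: reads within one region incur no egress.
        return 0.0
    return gb_read * pricing.cross_region_egress_per_gb
```

Under these assumptions, a Databricks job in Sweden Central reading 50 GB from Norway East would be charged about $1.00 in egress, while a same-region reader of the same data would be charged nothing.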
Copilot AI review requested due to automatic review settings May 24, 2026 09:57
@jathavaan jathavaan enabled auto-merge May 24, 2026 09:58
@jathavaan jathavaan self-assigned this May 24, 2026
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jathavaan jathavaan merged commit 760eaef into main May 24, 2026
25 of 26 checks passed
@jathavaan jathavaan deleted the feature/342-fix-cost-model-gaps-missing-databricks-blob-cost-driver-node-cross-region-egress branch May 24, 2026 10:06
2 participants