A cluster has num_workers + 1 nodes (the driver is the same Standard_D4s_v3 VM as the workers). Previously only the workers were billed, which underestimated compute cost by 33% on 2-worker clusters, falling to 6% on 16-worker clusters.
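For reviewers who want to check the percentages, here is a minimal sketch of the arithmetic. The helper names `billed_nodes` and `underestimate_fraction` are hypothetical and exist only for this illustration; they are not functions from this codebase.

```python
def billed_nodes(num_workers: int) -> int:
    """Nodes that should be billed: every worker plus one driver."""
    return num_workers + 1


def underestimate_fraction(num_workers: int) -> float:
    """Share of the true compute cost that was missed when only workers were billed.

    Assumes the driver and workers are the same VM type, so each node
    costs the same and the missing share is simply 1 / total nodes.
    """
    return 1 / billed_nodes(num_workers)


for workers in (2, 4, 16):
    print(f"{workers} workers -> {billed_nodes(workers)} nodes, "
          f"missed {underestimate_fraction(workers):.0%}")
# 2 workers -> 3 nodes, missed 33%
# 4 workers -> 5 nodes, missed 20%
# 16 workers -> 17 nodes, missed 6%
```

The missed share shrinks as the cluster grows because the single driver is a smaller part of the total node count.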
Databricks reads GeoParquet from Blob Storage (Norway East) but was not charged for it. Additionally, since Databricks is in Sweden Central, those reads are inter-region egress at $0.02/GB.

- Add include_blob_storage=True and is_cross_region_blob=True to Databricks CostConfiguration
- Add cross_region_egress_per_gb to BlobStoragePricing
- Thread dataset_size from DI through cost pipeline so blob operation counts reflect actual tier (was hardcoded to SMALL)
- Update interfaces and implementations with new parameters
- Remove 10 zero-valued cost constants from Config that were superseded by AzurePricingService (resolves #302)
- Reduce DATABRICKS_DRIVER_MEMORY from 14g to 9g (Standard_D4s_v3 max is 10069 MB)
- Reduce DATABRICKS_DRIVER_MAX_RESULT_SIZE from 8g to 4g accordingly
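As a rough illustration of how the new egress rate feeds into the network cost, here is a minimal sketch. `BlobStoragePricingSketch` and `network_cost` are hypothetical stand-ins written for this example, not the actual classes or functions changed in this PR; only the field name cross_region_egress_per_gb, the flag name is_cross_region_blob, and the $0.02/GB rate come from the description above. Treating same-region reads as free of egress is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobStoragePricingSketch:
    """Hypothetical stand-in for the pricing object; only the egress rate is modeled."""
    cross_region_egress_per_gb: float = 0.02  # USD per GB, rate quoted in this PR


def network_cost(gb_read: float,
                 pricing: BlobStoragePricingSketch,
                 is_cross_region_blob: bool) -> float:
    """Egress charge for reading gb_read gigabytes from Blob Storage.

    Assumption: reads within the same region incur no egress charge.
    """
    if not is_cross_region_blob:
        return 0.0
    return gb_read * pricing.cross_region_egress_per_gb


pricing = BlobStoragePricingSketch()
# Sweden Central compute reading from Norway East storage: cross-region.
print(network_cost(50.0, pricing, is_cross_region_blob=True))
# Same-region reads are assumed to cost nothing in egress.
print(network_cost(50.0, pricing, is_cross_region_blob=False))
```

With is_cross_region_blob=True, any non-zero read volume yields network_cost > 0, which is what the test plan checks for.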
Summary
- Bill num_workers + 1 nodes (driver is same VM type). Previously underestimated by 33% on 2-worker clusters.
- Charge Databricks for Blob Storage reads with include_blob_storage=True.
- get_blob_storage_usage() hardcoded DatasetSize.SMALL; it now resolves the actual size from DI.
- Remove zero-valued cost constants from Config (superseded by AzurePricingService). Also resolves "Cost model: only ACI compute is priced; storage / ops / DB / egress all zero" (#302).

Test plan
- Run benchmark_runner.py for Databricks broadcast 4-node small tier
- blob_cost.parquet now generated for Databricks (previously absent)
- databricks_cost.parquet reflects 5 nodes (4 workers + 1 driver)
- network_cost > 0 (cross-region egress applied)
- operations_cost > 0 (correct dataset size used for blob counting)

Closes #342
Closes #302