Add multi-database profiling support for SQL Server#2536
Conversation
Adds single_db and multi_db mssql profiler variants, auto-selected at execute time by probing SERVERPROPERTY('EngineEdition')
When a database is configured for mssql, the profiler scopes to just that database (single_db) on any edition; leaving it blank profiles all databases (multi_db on on-prem / Managed Instance, or the connected database on Azure SQL Database). The configure-time database prompt becomes optional for mssql (blank = all databases); legacy_synapse, which shares the configurator, still requires it.
881335e to
21177f5
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2536 +/- ##
==========================================
+ Coverage 69.18% 69.27% +0.08%
==========================================
Files 105 106 +1
Lines 9503 9548 +45
Branches 1052 1060 +8
==========================================
+ Hits 6575 6614 +39
- Misses 2731 2734 +3
- Partials 197 200 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
❌ 168/169 passed, 1 flaky, 1 failed, 2 skipped, 1h11m7s total ❌ test_recon_for_report_type_is_data: pyspark.errors.exceptions.connect.AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/tmp/42b38c5f19834f07b916a736d4e3b073. SQLSTATE: 42K03 (50.604s)Flaky tests:
Running from acceptance #4944 |
d8575e6 to
a13b6b7
Compare
dgomez04
left a comment
There was a problem hiding this comment.
Let's align on the engine-variant vs. database-scope model (See variants.py).
| "database": self.prompts.question("Enter the database name"), | ||
| # mssql: blank profiles every accessible database (on-prem / Managed Instance); a name scopes | ||
| # to that one database. legacy_synapse (shares this configurator) needs the dedicated-pool name. | ||
| "database": ( |
There was a problem hiding this comment.
This skips a common middle case: customers want to scope the migration to a subset of databases on a shared instance. A comma-separated allowlist here would be a natural fit.
We can either address the changes in this pull request or open an issue and follow-up.
There was a problem hiding this comment.
Stepping back on the architecture before we land anything. I don't think single_db vs multi_db is the right axis.
Comparing the two pipeline configurations, they are identical except for 7 of the ~13 extracts. And those that do only differ in database scope.
What EngineEdition actually tells us is the engine variant - each variant differs in system-table surface as we had discussed. That's the thing that should select a query pool / pipeline configuration.
The curent == 5 ? single_db : multi_db check both ignores this axis and mis-fires on it.
Synapse (6), serverless (9), Fabric (12), and SQL Edge (11) all fall into the else and would run the on-prem-style sys.databases + USE + three-part-INFORMATION_SCHEMA SQL, which isn't valid there.
The shape I'd propose:
**EngineEditionmaps to engine variant, which selects a query pool.** Pick the dialect-appropriate set of extracts, with a safe default for unknown editions instead of assuming on-prem multi database.- Database scope is a user input (
allor comma-separated list), independent of the enigne. - Engine capability decides how a multi-database scope executes: cross-database dynamic SQL on on-prem/MI; per-connection looping (or "scope the connected DB only") on Azure SQL Database, which can't do cross-DB from one connection.
Note. Scoping a single query pool by all/subset needs a way to inject WHERE [name] IN (...) into the SQL, which the type: sql pipeline can't do today, so this likely comes with adding parameter substitution to the pipeline.
| return None | ||
|
|
||
| if AUTO in spec: | ||
| if variant != AUTO: |
There was a problem hiding this comment.
Every normal mssql run reaches here with variant=None, so None != AUTO is true and we emit the warning on each run. The user never passed a variant, so "ignoring" one is confusing.
Suggest guarding the warning: if variant and variant != AUTO: so it only fires when an explicit choice is discarded.
| with DatabaseManager("mssql", connect_config) as db_manager: | ||
| # SERVERPROPERTY returns sql_variant, which pyodbc cannot fetch (ODBC type -16); CAST to int. | ||
| result = db_manager.fetch("SELECT CAST(SERVERPROPERTY('EngineEdition') AS INT) AS engine_edition") | ||
| engine_edition = int(result.rows[0][0]) |
There was a problem hiding this comment.
Optional Improvement: int(result.rows[0][0]) will IndexError if the probe ever returns no rows. Low risk for SERVERPROPERTY, but a guarded failure ("could not determine SQL Server edition; defaulting to ...") reads better than a raw IndexError.
| """ | ||
| cred_manager = create_credential_manager("mssql", EnvGetter(), creds_path=cred_file_path) | ||
| connect_config = cred_manager.get_credentials("mssql") | ||
| if connect_config.get("database"): |
There was a problem hiding this comment.
Blank database on Azure SQL Database (EngineEdition == 5) returns single_db, after which MSSQLConnector connects with database=None (the login's default, often master). Is profiling the default/master database the intended result, or should blank+Azure-SQL-DB require a database name?
Changes
What does this PR do?
The mssql profiler previously assessed only a single database per run. This PR adds automatic detection between single-database and multi-database profiling based on the SQL Server edition and the configured database, and introduces a reusable variant-resolution mechanism to support it.
mssql auto-detection
single_db).EngineEdition = 5) falls back tosingle_db; on-prem SQL Server and Azure SQL Managed Instance profile every accessible database (multi_db).Multi-database extracts (
mssql/multi_db/, new)database_namecolumn.USEfor database-scoped DMVs (db_sizes,table_sizes) that can't use three-part naming.DATABASE_NAMEcolumn. The original config moves tomssql/single_db/.Linked issues
Resolves #2483
Functionality
databricks labs lakebridge execute-database-profiler --source-tech mssqlTests