
FinOps Roadmap: Cost Optimization Recommendations Engine #564

@erikdarlingdata

Description

Overview

The FinOps tab already collects solid foundational data — utilization scoring, idle databases, index cleanup, storage analysis, server properties, and resource consumption. The next phase is turning that raw data into actionable cost-saving recommendations that help DBAs and IT managers justify budget decisions.

This issue tracks the full roadmap from per-server optimizations through estate-wide consolidation analysis.


Phase 1: Per-Server Optimizations (SHIPPED)

  • Enterprise feature usage audit — query sys.dm_db_persisted_sku_features per database to identify servers paying for Enterprise where no database uses Enterprise-only features
  • CPU right-sizing score — compare licensed core count against sustained CPU utilization trends to flag servers where cores are significantly over-provisioned
  • Memory right-sizing score — compare allocated max server memory against actual buffer pool + stolen memory usage to flag reclaimable RAM
  • Unused index cost quantification — extend existing index cleanup to show wasted storage (GB), write I/O overhead, and estimated backup time impact for unused/duplicate indexes
  • Compression savings estimator — flag large uncompressed heaps/indexes with estimated space savings (page vs. row compression candidates)
  • Dormant database detection (enhanced) — extend existing idle database detection with estimated cost impact (backup storage, AG traffic, DBCC overhead, license implications)
  • Dev/test workload detection — flag servers with production-edition licenses but development usage patterns (low connection counts, sporadic activity, database naming conventions)
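The right-sizing scores above reduce to a small amount of arithmetic over collected utilization samples. A minimal sketch of the CPU check, assuming P95 utilization stands in for "sustained" demand; the function name, headroom factor, and 2-core floor are illustrative, not the engine's actual implementation:

```python
import math

def recommend_core_count(cpu_samples, licensed_cores, headroom=1.25):
    """Suggest a target core count from sustained CPU utilization.

    cpu_samples: percent-utilization samples (0-100) over the window.
    licensed_cores: cores currently licensed on the server.
    headroom: multiplier applied to peak sustained demand.
    """
    if not cpu_samples:
        return licensed_cores  # no data, no recommendation
    # P95 of the samples approximates sustained peak demand.
    ordered = sorted(cpu_samples)
    p95 = ordered[min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)]
    # Cores actually needed at P95 demand, plus headroom, rounded up.
    needed = math.ceil(licensed_cores * (p95 / 100.0) * headroom)
    return max(2, min(licensed_cores, needed))  # never recommend growth here

# Mirrors the Phase 3 example: a 32-core server at 22% P95 CPU.
print(recommend_core_count([22] * 100, 32))  # -> 9
```

The memory score follows the same shape, comparing max server memory against buffer pool plus stolen memory instead of core counts.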

Phase 2: Edition & Licensing (SHIPPED)

  • Enterprise → Standard downgrade report — per-server report showing which Enterprise features are actually in use, which databases use them, and what would need to change to move to Standard
  • License cost impact estimates — include the list-price differential between editions so recommendations show estimated dollar savings
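The dollar-savings estimate is a simple differential over billable cores. A hedged sketch; the per-core prices in the example are placeholders, not real SQL Server list prices, though the 2-core-pack and 4-core-minimum licensing rules are real:

```python
def downgrade_savings(cores, ent_price_per_core, std_price_per_core):
    """Estimated savings from an Enterprise -> Standard downgrade.

    Prices are caller-supplied; nothing here hardcodes vendor pricing.
    """
    # Core licenses are sold in 2-core packs with a 4-core minimum,
    # so odd counts round up and small servers still pay for 4.
    billable = max(4, cores + cores % 2)
    return billable * (ent_price_per_core - std_price_per_core)

# A 16-core server with placeholder per-core prices:
print(downgrade_savings(16, 7500, 2000))  # -> 88000
```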

Phase 3: Hardware & Maintenance Optimization (SHIPPED)

  • Maintenance window efficiency — flag maintenance jobs with excessive duration relative to their historical average
  • VM right-sizing recommendations — based on sustained CPU/memory trends, prescribe specific target reductions (e.g., "reduce from 32 to 16 cores based on P95 CPU of 22%")
  • Storage tier optimization — flag databases with consistently low IO latency (<5ms reads) that don't need premium storage
  • Reserved capacity candidates — servers with consistent, predictable utilization patterns (low coefficient of variation) that would benefit from reserved pricing over pay-as-you-go
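The reserved-capacity screen can be sketched directly from the coefficient of variation named above (standard deviation over mean). The 0.2 cutoff and the function name are assumptions for illustration:

```python
from statistics import mean, pstdev

def is_reserved_capacity_candidate(hourly_cpu, max_cv=0.2):
    """True when utilization is steady enough to justify reserved pricing."""
    if len(hourly_cpu) < 2 or mean(hourly_cpu) == 0:
        return False
    cv = pstdev(hourly_cpu) / mean(hourly_cpu)  # coefficient of variation
    return cv <= max_cv

steady = [48, 52, 50, 49, 51] * 24      # tight band around 50% CPU
spiky  = [5, 5, 5, 90, 5, 5, 5, 95]     # batch-style bursts
print(is_reserved_capacity_candidate(steady))  # -> True
print(is_reserved_capacity_candidate(spiky))   # -> False
```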

Phase 4: High Impact Analysis (SHIPPED)

  • High Impact Queries tab — 80/20 analysis across CPU, duration, reads, writes, memory, and executions; PERCENT_RANK scoring combined into a composite impact score
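SQL's PERCENT_RANK is (rank − 1) / (n − 1). A sketch of how per-dimension ranks might combine into one composite score; the equal-weight average and the defensive clamp (guarding the kind of >1.0 edge case Phase 5 later caught) are assumptions, not the shipped scorer:

```python
def percent_rank(values, v):
    """PERCENT_RANK of v within values: (rank - 1) / (n - 1), clamped."""
    n = len(values)
    if n <= 1:
        return 0.0  # single-row window is defined as 0
    rank = 1 + sum(1 for x in values if x < v)  # 1-based rank of v
    return min(1.0, max(0.0, (rank - 1) / (n - 1)))

def composite_impact(query, population, dims=("cpu", "duration", "reads",
                                              "writes", "memory", "executions")):
    """Equal-weight average of per-dimension percent ranks, 0..1."""
    scores = [percent_rank([q[d] for q in population], query[d]) for d in dims]
    return sum(scores) / len(scores)
```

Queries near the top of the composite distribution form the 80/20 head the tab surfaces.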

Phase 5: Test Infrastructure (SHIPPED)

  • HighImpactScorer extraction — scoring logic in testable static methods
  • FinOps test scenarios — 6 DuckDB-seeded scenarios for deterministic validation of recommendation engine output
  • FinOps xunit tests — 16 scenario-based tests validating recommendations and scoring
  • HealthCalculator + PercentRank tests — 57 unit tests with adversarial inputs, found and fixed PercentRank >1.0 bug

Phase 6: UI Polish (SHIPPED)

  • Column filters on all FinOps grids — Dashboard (10 grids) and Lite Optimization (5 grids)
  • Query preview with full-text tooltips — truncated in row, full text on hover
  • Per-server monthly cost — budget field on server config, proportional cost attribution
  • Health scores and license warnings — composite 0-100 score, Standard Edition boundary detection
  • Storage Growth query fix — fallback to oldest snapshot when <7d history
  • Empty grid scrollbar fix — disabled horizontal scroll on all FinOps DataGrids

Future (post-2.3)

These items need more design work, new collectors, or cross-server architecture changes.

  • Consolidation candidate scoring — identify groups of underutilized servers whose combined workloads could fit on fewer instances. Must account for version compatibility, edition requirements, and platform (cloud vs on-prem). Cannot consolidate across major version downgrades or edition downgrades where features are in use.
  • Workload complementarity analysis — find servers with complementary peak usage windows via hourly CPU time-series correlation. Natural pairs for consolidation. Same version/edition/platform constraints apply.
  • Postgres migration compatibility score — feature dependency scan, T-SQL compatibility analysis. Requires: stored procedure/function text scanning.
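The complementarity analysis above amounts to correlating two servers' hourly CPU series: strongly negative correlation means one is busy while the other is quiet. A sketch using Pearson correlation; the −0.5 threshold is an assumption:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def are_complementary(cpu_a, cpu_b, threshold=-0.5):
    """True when peak windows interleave (strong negative correlation)."""
    return pearson(cpu_a, cpu_b) <= threshold

day_shift   = [10, 70, 80, 75, 15, 10]   # busy during business hours
night_batch = [85, 10, 5, 10, 80, 90]    # busy overnight
print(are_complementary(day_shift, night_batch))  # -> True
```

Pairs that pass this screen would still have to clear the version, edition, and platform constraints before being proposed for consolidation.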

Design Notes

  • All recommendations are data-driven from DMVs and system catalog views we already collect
  • Recommendations include confidence levels (high/medium/low) based on observation window length
  • Where possible, savings are quantified as proportional share of the per-server monthly budget set by the user
  • Both Lite (DuckDB) and Dashboard (SQL Server) surface recommendations
  • Dollar estimates use the per-server monthly cost model — user tags each server with its budget, we show proportional attribution
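The proportional-attribution note can be made concrete. A sketch that splits a server's user-supplied monthly budget across its databases by data size; the size weighting is one plausible choice for illustration, not necessarily the shipped cost model:

```python
def attribute_monthly_cost(monthly_budget, db_sizes_gb):
    """Map database name -> dollar share of the server's monthly budget."""
    total = sum(db_sizes_gb.values())
    if total == 0:
        return {name: 0.0 for name in db_sizes_gb}  # nothing to attribute
    return {name: round(monthly_budget * size / total, 2)
            for name, size in db_sizes_gb.items()}

print(attribute_monthly_cost(1200.0, {"Sales": 600, "HR": 200, "Staging": 200}))
# -> {'Sales': 720.0, 'HR': 240.0, 'Staging': 240.0}
```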

Related

  • Existing FinOps tab infrastructure (utilization scoring, idle DBs, index cleanup, storage analysis)
  • Server properties collector (daily, 365-day retention)
  • Database size stats collector (hourly, 90-day retention)
  • Running jobs collector (per schedule, with avg/p95 duration tracking)
  • High Impact Queries tab (query_stats-based 80/20 analysis)
  • FinOps test scenarios in TestDataSeeder
  • 73 xunit tests (FinOpsTests + HealthCalculatorTests)
