
BUG (P1): Hardware diagnostics report host physical_memory_kb in containers — direct misconfiguration risk #330

@nanoDBA

Description

Summary

The hardware context diagnostic block introduced in v2.3 logs sys.dm_os_sys_info.physical_memory_kb as the system's available memory. In containerized deployments (AKS, Docker, Azure SQL on Kubernetes), this value reports host machine RAM, not the container's memory limit.

On a 512 GB host running a 32 GB SQL Server pod, sp_StatUpdate logs 512 GB as available memory. A DBA using this output to configure max server memory via the standard 90%-of-RAM rule would set a 460 GB ceiling on a 32 GB pod — a direct OOMKill trigger.

This is not a cosmetic display issue. It produces a specific, plausible misconfiguration with measurable production consequences in any containerized SQL Server deployment.


Impact

  • Direct max server memory misconfiguration. The 90%-of-RAM rule is muscle memory for DBAs. Applying it to host RAM figures in container contexts produces a ceiling far above the container limit.
  • OOMKill in Kubernetes. When SQL Server's working set exceeds the container memory limit, the pod is OOMKilled — abrupt shutdown with no SQL Server-level warning.
  • Henry Schein context. AWS deployments on EC2 with SQL Server in containers or on VMs with constrained memory allocations are affected. The diagnostic is intended to help, but produces dangerous guidance without the container annotation.
  • LOW_UPTIME false positives in AKS. LOW_UPTIME detection fires on every routine pod restart (AKS recycles pods daily or more frequently for rolling updates) — the current OS uptime gate via ms_ticks does not distinguish SQL Server restarts from node restarts.
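The misconfiguration path in the first bullet is concrete. Using the 512 GB / 32 GB scenario from the summary, a DBA trusting the diagnostic output would run something like the following (the values are hypothetical, taken from that scenario):

```sql
-- 90% of the reported 512 GB host RAM ≈ 460 GB = 471040 MB,
-- far above the 32 GB container limit: a direct OOMKill trigger
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 471040;
RECONFIGURE;
```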

Root Cause

-- CURRENT (v2.3) — hardware context block
SELECT
    @physical_memory_kb = physical_memory_kb,  -- ← host RAM in containers
    @cpu_count = cpu_count                      -- ← may overcount schedulers (SQL 2017 and earlier)
FROM sys.dm_os_sys_info;

-- Logged to RAISERROR output as-is — no container context annotation

The correct companion value for containerized deployments is sys.dm_os_process_memory.physical_memory_in_use_kb, which reports the SQL Server process RSS (resident set size) — the closest available pure-T-SQL proxy for actual container memory consumption.

There is no supported pure-T-SQL path to read the cgroup memory.limit_in_bytes from SQL Server 2019/2022, so the process RSS is the best available signal. The fix is to log both values with a divergence annotation when they differ significantly.
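The divergence is easy to observe with a read-only probe that puts the two DMV values side by side (no assumptions beyond the two documented columns):

```sql
-- Run inside the container: on a 32 GB-limited pod hosted on a 512 GB node,
-- host_reported_kb shows ~512 GB while process_rss_kb stays bounded
-- by what the SQL Server process has actually touched
SELECT
    si.physical_memory_kb        AS host_reported_kb,
    pm.physical_memory_in_use_kb AS process_rss_kb
FROM sys.dm_os_sys_info AS si
CROSS JOIN sys.dm_os_process_memory AS pm;
```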


Fix

1. Add process RSS as a companion metric

DECLARE @process_memory_kb BIGINT;

SELECT @process_memory_kb = physical_memory_in_use_kb
FROM sys.dm_os_process_memory;

No version gate required — sys.dm_os_process_memory is available in all supported SQL Server versions.

2. Log both values with divergence annotation

-- Determine if running in container context (large divergence = container signal)
DECLARE @memory_divergence_pct FLOAT = NULL,
        @container_note NVARCHAR(200) = N'';

IF @physical_memory_kb > 0 AND @process_memory_kb IS NOT NULL
BEGIN
    SET @memory_divergence_pct = 
        100.0 * ABS(@physical_memory_kb - @process_memory_kb) / @physical_memory_kb;
END

-- RAISERROR substitution arguments must be constants or variables (a CASE
-- expression is not allowed), so the annotation is computed first
IF @memory_divergence_pct > 20
    SET @container_note = N' [CONTAINER SIGNAL: process RSS << host RAM; use process_rss for max server memory sizing]';

-- In RAISERROR output:
RAISERROR(
    N'  Memory: host=%I64d KB | process_rss=%I64d KB%s',
    0, 1,
    @physical_memory_kb,
    @process_memory_kb,
    @container_note
) WITH NOWAIT;

3. Fix cpu_count overcounting (companion fix, v2.4 scope)

cpu_count overcounts schedulers in SQL Server 2017 and earlier on Linux, Windows containers, and Azure VM vCore-throttled deployments. SQL Server 2019 CU8+ and 2022 add cgroup-aware CPU quota detection. Supplement with COUNT(*) from sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE' as the affinity-effective count:

DECLARE @visible_online_schedulers INT;

SELECT @visible_online_schedulers = COUNT(*)
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';

Log both: cpu_count (system-reported) and visible_online_schedulers (affinity-effective). When they diverge, emit an advisory note.
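The advisory note can be sketched as follows (the message wording is hypothetical; both variables are assumed to be INT, as populated in the blocks above):

```sql
-- cpu_count vs. affinity-effective scheduler count; divergence suggests
-- scheduler overcounting (pre-2019 containers) or an explicit affinity mask
IF @visible_online_schedulers IS NOT NULL
   AND @visible_online_schedulers <> @cpu_count
    RAISERROR(
        N'  CPU: cpu_count=%d | visible_online_schedulers=%d [ADVISORY: counts diverge; prefer the scheduler count for parallelism sizing]',
        0, 1, @cpu_count, @visible_online_schedulers
    ) WITH NOWAIT;
```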

4. Fix LOW_UPTIME false positives in containerized environments

Current uptime detection uses sys.dm_os_sys_info.ms_ticks (node uptime), not SQL Server start time. Use sqlserver_start_time from sys.dm_os_sys_info to measure SQL Server uptime independent of OS/node uptime:

DECLARE @sql_uptime_hours INT;

-- sqlserver_start_time is reported in the server's local time zone,
-- so compare against GETDATE(), not GETUTCDATE()
SELECT @sql_uptime_hours = DATEDIFF(HOUR, sqlserver_start_time, GETDATE())
FROM sys.dm_os_sys_info;

Emit LOW_UPTIME only when SQL Server uptime is low, not when the OS was recently rebooted.
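A minimal sketch of the corrected gate, assuming a hypothetical 24-hour threshold (in practice, reuse whatever threshold v2.3 already applies to LOW_UPTIME):

```sql
-- Hypothetical threshold: flag only when the SQL Server process itself
-- started recently, regardless of node/OS uptime
IF @sql_uptime_hours IS NOT NULL AND @sql_uptime_hours < 24
    RAISERROR(
        N'  LOW_UPTIME: SQL Server started %d hour(s) ago; cached plans and DMV counters may not be representative.',
        0, 1, @sql_uptime_hours
    ) WITH NOWAIT;
```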


Version Gate

None required for any of the three fixes: sys.dm_os_process_memory, sys.dm_os_schedulers, and sqlserver_start_time in sys.dm_os_sys_info are available in all supported SQL Server versions.


Testing Requirements

  • Validate physical_memory_in_use_kb is logged alongside physical_memory_kb
  • Validate divergence annotation fires when values differ by >20%
  • Validate visible_online_schedulers logged alongside cpu_count
  • Validate LOW_UPTIME uses sqlserver_start_time not ms_ticks
  • Manual test: verify output on a development container instance (Docker Desktop + SQL Server image)

References

  • Internal analysis: Hardware context DMV accuracy in containerized/virtualized deployments
  • Claim-236: physical_memory_kb reports host RAM (❌ Bug Fix — escalated from Glenn Berry claim-189)
  • Claim-237: cpu_count overcounting is version-dependent (⚠️ Enhancement)
  • Claim-238: No pure-T-SQL path to cgroup memory.limit_in_bytes; process RSS is best proxy
  • Claim-239: LOW_UPTIME false positive in AKS via ms_ticks
  • Claim-241: Code sketch — two zero-version-gate additions
  • Related: rev-010 — Glenn Berry: hardware context initial review (the prior ⚠️ Concerns assessment now escalated)

Metadata


Labels

  • azure: Azure SQL Database / Managed Instance specific behavior
  • bug: Something isn't working
  • containers: Container/Docker/Kubernetes/AKS deployment behavior
  • p1-critical: Bug or defect that causes incorrect behavior, data loss, or silent failure
