
BUG (P1): Hardware diagnostics report host physical_memory_kb in containers — direct misconfiguration risk #330

@nanoDBA

Description

Summary

The hardware context diagnostic block introduced in v2.3 logs sys.dm_os_sys_info.physical_memory_kb as the system's available memory. In containerized deployments (AKS, Docker, Azure SQL on Kubernetes), this value reports host machine RAM, not the container's memory limit.

On a 512 GB host running a 32 GB SQL Server pod, sp_StatUpdate logs 512 GB as available memory. A DBA using this output to configure max server memory via the standard 90%-of-RAM rule would set a 460 GB ceiling on a 32 GB pod — a direct OOMKill trigger.

This is not a cosmetic display issue. It produces a specific, plausible misconfiguration with measurable production consequences in any containerized SQL Server deployment.


Impact

  • Direct max server memory misconfiguration. The 90%-of-RAM rule is muscle memory for DBAs. Applying it to host RAM figures in container contexts produces a ceiling far above the container limit.
  • OOMKill in Kubernetes. When SQL Server's working set exceeds the container memory limit, the pod is OOMKilled — abrupt shutdown with no SQL Server-level warning.
  • Henry Schein context. AWS deployments on EC2 with SQL Server in containers or on VMs with constrained memory allocations are affected. The diagnostic is intended to help, but produces dangerous guidance without the container annotation.
  • LOW_UPTIME false positives in AKS. LOW_UPTIME detection fires on every routine pod restart (AKS recycles pods daily or more frequently for rolling updates) — the current OS uptime gate via ms_ticks does not distinguish SQL Server restarts from node restarts.
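The misconfiguration path in the first bullet is concrete. Using the 512 GB / 32 GB scenario from the summary, a DBA trusting the diagnostic output would run something like the following (the values are hypothetical, taken from that scenario):

```sql
-- 90% of the reported 512 GB host RAM ≈ 460 GB = 471040 MB,
-- far above the 32 GB container limit: a direct OOMKill trigger
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 471040;
RECONFIGURE;
```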

Root Cause

-- CURRENT (v2.3) — hardware context block
SELECT
    @physical_memory_kb = physical_memory_kb,  -- ← host RAM in containers
    @cpu_count = cpu_count                      -- ← may overcount schedulers (SQL 2017 and earlier)
FROM sys.dm_os_sys_info;

-- Logged to RAISERROR output as-is — no container context annotation

The correct companion value for containerized deployments is sys.dm_os_process_memory.physical_memory_in_use_kb, which reports the SQL Server process RSS (resident set size) — the closest available pure-T-SQL proxy for actual container memory consumption.

There is no supported pure-T-SQL path to read the cgroup memory.limit_in_bytes from SQL Server 2019/2022, so the process RSS is the best available signal. The fix is to log both values with a divergence annotation when they differ significantly.
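The divergence is easy to observe with a read-only probe that puts the two DMV values side by side (no assumptions beyond the two documented columns):

```sql
-- Run inside the container: on a 32 GB-limited pod hosted on a 512 GB node,
-- host_reported_kb shows ~512 GB while process_rss_kb stays bounded
-- by what the SQL Server process has actually touched
SELECT
    si.physical_memory_kb        AS host_reported_kb,
    pm.physical_memory_in_use_kb AS process_rss_kb
FROM sys.dm_os_sys_info AS si
CROSS JOIN sys.dm_os_process_memory AS pm;
```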


Fix

1. Add process RSS as a companion metric

DECLARE @process_memory_kb BIGINT;

SELECT @process_memory_kb = physical_memory_in_use_kb
FROM sys.dm_os_process_memory;

No version gate required — sys.dm_os_process_memory is available in all supported SQL Server versions.

2. Log both values with divergence annotation

-- Determine if running in container context (large divergence = container signal)
DECLARE @memory_divergence_pct FLOAT = NULL,
        @container_note NVARCHAR(200) = N'';

IF @physical_memory_kb > 0 AND @process_memory_kb IS NOT NULL
BEGIN
    SET @memory_divergence_pct = 
        100.0 * ABS(@physical_memory_kb - @process_memory_kb) / @physical_memory_kb;
END

-- RAISERROR substitution arguments must be constants or variables (a CASE
-- expression is not allowed), so the annotation is computed first
IF @memory_divergence_pct > 20
    SET @container_note = N' [CONTAINER SIGNAL: process RSS << host RAM; use process_rss for max server memory sizing]';

-- In RAISERROR output:
RAISERROR(
    N'  Memory: host=%I64d KB | process_rss=%I64d KB%s',
    0, 1,
    @physical_memory_kb,
    @process_memory_kb,
    @container_note
) WITH NOWAIT;

3. Fix cpu_count overcounting (companion fix, v2.4 scope)

cpu_count overcounts schedulers in SQL Server 2017 and earlier on Linux, Windows containers, and Azure VM vCore-throttled deployments. SQL Server 2019 CU8+ and 2022 add cgroup-aware CPU quota detection. Supplement with COUNT(*) from sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE' as the affinity-effective count:

DECLARE @visible_online_schedulers INT;

SELECT @visible_online_schedulers = COUNT(*)
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';

Log both: cpu_count (system-reported) and visible_online_schedulers (affinity-effective). When they diverge, emit an advisory note.
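The advisory note can be sketched as follows (the message wording is hypothetical; both variables are assumed to be INT, as populated in the blocks above):

```sql
-- cpu_count vs. affinity-effective scheduler count; divergence suggests
-- scheduler overcounting (pre-2019 containers) or an explicit affinity mask
IF @visible_online_schedulers IS NOT NULL
   AND @visible_online_schedulers <> @cpu_count
    RAISERROR(
        N'  CPU: cpu_count=%d | visible_online_schedulers=%d [ADVISORY: counts diverge; prefer the scheduler count for parallelism sizing]',
        0, 1, @cpu_count, @visible_online_schedulers
    ) WITH NOWAIT;
```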

4. Fix LOW_UPTIME false positives in containerized environments

Current uptime detection uses sys.dm_os_sys_info.ms_ticks (node uptime), not SQL Server start time. Use sqlserver_start_time from sys.dm_os_sys_info to measure SQL Server uptime independent of OS/node uptime:

DECLARE @sql_uptime_hours INT;

-- sqlserver_start_time is reported in the server's local time zone,
-- so compare against GETDATE(), not GETUTCDATE()
SELECT @sql_uptime_hours = DATEDIFF(HOUR, sqlserver_start_time, GETDATE())
FROM sys.dm_os_sys_info;

Emit LOW_UPTIME only when SQL Server uptime is low, not when the OS was recently rebooted.
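A minimal sketch of the corrected gate, assuming a hypothetical 24-hour threshold (in practice, reuse whatever threshold v2.3 already applies to LOW_UPTIME):

```sql
-- Hypothetical threshold: flag only when the SQL Server process itself
-- started recently, regardless of node/OS uptime
IF @sql_uptime_hours IS NOT NULL AND @sql_uptime_hours < 24
    RAISERROR(
        N'  LOW_UPTIME: SQL Server started %d hour(s) ago; cached plans and DMV counters may not be representative.',
        0, 1, @sql_uptime_hours
    ) WITH NOWAIT;
```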


Version Gate

None required for any of the three fixes: sys.dm_os_process_memory, sys.dm_os_schedulers, and sqlserver_start_time in sys.dm_os_sys_info are available in all supported SQL Server versions.


Testing Requirements

  • Validate physical_memory_in_use_kb is logged alongside physical_memory_kb
  • Validate divergence annotation fires when values differ by >20%
  • Validate visible_online_schedulers logged alongside cpu_count
  • Validate LOW_UPTIME uses sqlserver_start_time not ms_ticks
  • Manual test: verify output on a development container instance (Docker Desktop + SQL Server image)

References

  • Internal analysis: Hardware context DMV accuracy in containerized/virtualized deployments
  • Claim-236: physical_memory_kb reports host RAM (❌ Bug Fix — escalated from Glenn Berry claim-189)
  • Claim-237: cpu_count overcounting is version-dependent (⚠️ Enhancement)
  • Claim-238: No pure-T-SQL path to cgroup memory.limit_in_bytes; process RSS is best proxy
  • Claim-239: LOW_UPTIME false positive in AKS via ms_ticks
  • Claim-241: Code sketch — two zero-version-gate additions
  • Related: rev-010 — Glenn Berry: hardware context initial review (the prior ⚠️ Concerns assessment now escalated)

Metadata


Labels

  • azure: Azure SQL Database / Managed Instance specific behavior
  • bug: Something isn't working
  • containers: Container/Docker/Kubernetes/AKS deployment behavior
  • p1-critical: Bug or defect that causes incorrect behavior, data loss, or silent failure
