Skip to content

BUG (P2): LastStatCompletedAt heartbeat write failure marks succeeded stat as FAILED #334

@nanoDBA

Description

@nanoDBA

Summary

When the LastStatCompletedAt heartbeat write to CommandLog fails (e.g., table locked, permissions issue, transient deadlock), the exception propagates into the outer TRY/CATCH block, which marks the stat update as FAILED and increments @consecutive_failures. This happens even when the UPDATE STATISTICS itself completed successfully.

The result: a healthy run on a busy system can falsely trip @MaxConsecutiveFailures and abort, due entirely to heartbeat write failures — not stat update failures.


Example Failure Scenario

UPDATE STATISTICS [dbo].[OrderLines] ([IX_OrderDate])  → SUCCESS (5.2s)
CommandLog heartbeat write                              → DEADLOCK (victim)
Outer TRY/CATCH catches the deadlock
Marks [IX_OrderDate] as FAILED in accounting
@consecutive_failures increments to 3
@MaxConsecutiveFailures = 3 → procedure aborts

All statistics were successfully updated. The abort is a false positive.


Root Cause

The heartbeat write is inside the same TRY block as the stat update, without its own exception handler. Any failure in the heartbeat write causes the entire stat's TRY block to jump to the CATCH, which cannot distinguish "stat failed" from "stat succeeded, heartbeat failed."

-- CURRENT (simplified)
BEGIN TRY
    EXECUTE sp_executesql @current_command;     -- UPDATE STATISTICS
    -- heartbeat write
    UPDATE dbo.CommandLog
    SET LastStatCompletedAt = GETDATE()
    WHERE ...;                                  -- if this throws, falls to CATCH
END TRY
BEGIN CATCH
    SET @stat_succeeded = 0;                    -- incorrectly marks as failed
    SET @consecutive_failures += 1;
END CATCH

Fix

Wrap the heartbeat write in a nested TRY/CATCH. Heartbeat failure is logged as a warning but does not affect the stat's success/failure accounting.

-- PROPOSED FIX
BEGIN TRY
    EXECUTE sp_executesql @current_command;     -- UPDATE STATISTICS

    -- Heartbeat write in its own scope
    BEGIN TRY
        UPDATE dbo.CommandLog
        SET LastStatCompletedAt = GETDATE()
        WHERE ...;
    END TRY
    BEGIN CATCH
        -- Log the heartbeat failure but do NOT rethrow
        RAISERROR(
            '[sp_StatUpdate] WARNING: Heartbeat write failed for %s.%s — stat update succeeded. Error: %s',
            0, 1, @table_name, @stat_name, ERROR_MESSAGE()
        ) WITH NOWAIT;
        -- @stat_succeeded remains 1; @consecutive_failures is NOT incremented
    END CATCH

    SET @stat_succeeded = 1;

END TRY
BEGIN CATCH
    SET @stat_succeeded = 0;                    -- only genuine stat failures reach here
    SET @consecutive_failures += 1;
END CATCH

Edge Cases

  • If the heartbeat write failure is due to a permissions problem, the warning will repeat for every stat. This is intentional — it makes the misconfiguration visible without aborting the run.
  • If the CommandLog table is dropped mid-run, this will emit repeated warnings. Acceptable behavior.
  • The nested TRY/CATCH pattern is supported on SQL Server 2005+ and does not require a version gate.

Testing Requirements

  • Simulate heartbeat write failure (revoke INSERT on CommandLog mid-run) — confirm stat is marked SUCCESS, consecutive_failures not incremented, warning emitted
  • Simulate genuine stat failure (UPDATE STATISTICS on inaccessible table) — confirm stat is marked FAILED, consecutive_failures incremented
  • Confirm @MaxConsecutiveFailures abort does NOT trigger on heartbeat-only failures

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcode-qualityRefactoring, dead code removal, maintainability, or internal correctnessp2-importantSemantic gap, missing warning, or undocumented behavior that affects correctness

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions