Incident: HTTP 5xx spike with cart memory-retention behavior (Grubify Container App) #110

@gderossilive

Incident Report: HTTP 5xx spike with cart memory-retention behavior

  • Incident ID: 4bcfd5d3-3796-4fc9-8c2c-bc4415cef000
  • Service: Azure Container Apps — ca-grubify-api (rg: rg-grubify-app)
  • Subscription: 06dbbc7b-2363-4dd4-9803-95d07f1a8d3e
  • FQDN: ca-grubify-api.politecliff-89094031.swedencentral.azurecontainerapps.io
  • Active revision: ca-grubify-api--0000002 (100% traffic)

Summary

A Sev2 alert fired for elevated HTTP 5xx responses on the Grubify backend Container App. Telemetry for the incident window confirmed a burst of 5xx responses (peaking at 61/min at ~10:21 UTC), and console logs showed repeated per-request cache growth in the cart path, in 10 MB increments. The service is currently reachable and synthetic checks succeed, but the risk of recurrence remains until code-level containment is implemented.

Impact

  • Elevated backend HTTP 5xx during the incident window.
  • Cart operations experienced increased failure risk while the burst occurred.

Timeline (UTC)

  • ~10:20: 5xx requests reached 37/min.
  • ~10:21: 5xx requests reached 61/min.
  • 10:22:28: Azure Monitor Sev2 alert fired.
  • ~10:28: Container startup observed in logs; recovery underway.
  • ~10:35: Synthetic checks on key endpoints returned HTTP 200.

Evidence

Console logs (active revision)

2026-05-07T10:28:32Z cache: Added request data. Total entries: 1
2026-05-07T10:28:32Z size: 10MB
2026-05-07T10:29:36Z cache: Added request data. Total entries: 2
2026-05-07T10:29:36Z size: 20MB
2026-05-07T10:29:36Z cache: Added request data. Total entries: 3
2026-05-07T10:29:36Z size: 30MB
2026-05-07T10:32:45Z cache: Added request data. Total entries: 4
2026-05-07T10:32:45Z size: 40MB
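
The linear growth in the excerpt above can be checked mechanically. The following is a minimal sketch, assuming the log format is exactly as shown in this report; the function names `parse_cache_growth` and `mb_per_entry` are hypothetical helpers and not part of the Grubify codebase.

```python
import re

# Log excerpt copied from this report.
LOG = """\
2026-05-07T10:28:32Z cache: Added request data. Total entries: 1
2026-05-07T10:28:32Z size: 10MB
2026-05-07T10:29:36Z cache: Added request data. Total entries: 2
2026-05-07T10:29:36Z size: 20MB
2026-05-07T10:29:36Z cache: Added request data. Total entries: 3
2026-05-07T10:29:36Z size: 30MB
2026-05-07T10:32:45Z cache: Added request data. Total entries: 4
2026-05-07T10:32:45Z size: 40MB
"""

ENTRY_RE = re.compile(r"Total entries: (\d+)")
SIZE_RE = re.compile(r"size: (\d+)MB")


def parse_cache_growth(log_text):
    """Pair each 'Total entries' line with the 'size' line that follows it."""
    pairs = []
    pending_entries = None
    for line in log_text.splitlines():
        match = ENTRY_RE.search(line)
        if match:
            pending_entries = int(match.group(1))
            continue
        match = SIZE_RE.search(line)
        if match and pending_entries is not None:
            pairs.append((pending_entries, int(match.group(1))))
            pending_entries = None
    return pairs


def mb_per_entry(pairs):
    """Distinct size/entries ratios; a single value means strictly linear growth."""
    return {size_mb / entries for entries, size_mb in pairs}


pairs = parse_cache_growth(LOG)
print(pairs)                # [(1, 10), (2, 20), (3, 30), (4, 40)]
print(mb_per_entry(pairs))  # {10.0}
```

A single ratio of 10.0 is consistent with every cached entry retaining roughly 10 MB and nothing being evicted during the sampled window.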

Traffic and Response Time

  • Evidence chart artifact: /api/files/tmp/ThreadFiles/a3bd7360-77aa-4402-a14c-eba85094b03e/grubify-5xx-incident-2026-05-07-evidence.png
  • Synthetic checks (~10:35 UTC):
    • GET /weatherforecast: 200
    • GET /api/restaurants: 200
    • GET /api/fooditems: 200
    • POST /api/cart/demo-user/items: 200

Metrics snapshot (Azure Monitor)

  • Requests (5xx, 1m bins): 10:20=37, 10:21=61, 10:24=1
  • ResponseTime avg: peak 79 ms at 10:29
  • RestartCount: 0 across sampled window
  • MemoryPercentage: one control-plane call returned a scope error, so memory usage was corroborated with WorkingSetBytes
  • WorkingSetBytes: peak 128,634,880 bytes (~128.6 MB) at 10:20
  • UsageNanoCores: peak 245,688,944 (~245.7 millicores) at 10:29
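
The rounded figures in the snapshot follow from simple unit conversions. This is a small sketch of that arithmetic, assuming decimal megabytes (10^6 bytes) for the "MB" value; the helper names are illustrative only. One core is 10^9 nanocores, so one millicore is 10^6 nanocores.

```python
def bytes_to_mb(n_bytes):
    """Decimal megabytes (1 MB = 1,000,000 bytes)."""
    return n_bytes / 1_000_000


def bytes_to_mib(n_bytes):
    """Binary mebibytes (1 MiB = 1,048,576 bytes), shown for comparison."""
    return n_bytes / (1024 * 1024)


def nanocores_to_millicores(nanocores):
    """1 core = 1e9 nanocores, so 1 millicore = 1e6 nanocores."""
    return nanocores / 1_000_000


print(round(bytes_to_mb(128_634_880), 1))              # 128.6
print(round(bytes_to_mib(128_634_880), 1))             # 122.7
print(round(nanocores_to_millicores(245_688_944), 1))  # 245.7
```

The MiB value is included because container memory limits are often expressed in binary units, so the same working set reads as about 122.7 MiB against such a limit.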

Root Cause

The likely cause is application-level memory retention in the cart path (POST /api/cart/{userId}/items), which created transient resource pressure and backend 5xx responses under burst traffic. The supporting evidence is the cumulative 10 MB growth recorded in the cache log entries.

Remediation

  • Code: Remove unbounded per-request retained allocations in cart handling.
  • Defensive: Add endpoint rate/resource guards for cart POST traffic.
  • Platform: Keep protective backend scaling/memory settings while code fix is rolled out.
  • Observability: Add targeted alerting on cache-growth signatures and exception spikes.
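
One possible shape for the code-level item above is a cache with a hard byte budget that evicts its oldest entries. The sketch below is illustrative only: it is written in Python rather than the service's own language, the class name `BoundedByteCache` is hypothetical, and payloads are scaled down to 10 bytes to stand in for the 10 MB entries seen in the logs.

```python
from collections import OrderedDict


class BoundedByteCache:
    """Retains recent payloads but evicts the oldest once a byte budget is exceeded."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._items = OrderedDict()

    def add(self, key, payload):
        # Drop any existing entry for this key so its bytes are not counted twice.
        old = self._items.pop(key, None)
        if old is not None:
            self.total_bytes -= len(old)
        # A payload larger than the whole budget can never fit; refuse to retain it.
        if len(payload) > self.max_bytes:
            return False
        self._items[key] = payload
        self.total_bytes += len(payload)
        # Evict oldest-first until the budget holds again.
        while self.total_bytes > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self.total_bytes -= len(evicted)
        return True

    def __len__(self):
        return len(self._items)


# Five 10-byte payloads against a 30-byte budget: only the newest three survive.
cache = BoundedByteCache(max_bytes=30)
for i in range(5):
    cache.add(f"req-{i}", bytes(10))
print(len(cache), cache.total_bytes)  # 3 30
```

Whether request data needs to be retained at all is a separate question for the code fix; if it does not, removing the cache outright is simpler than bounding it.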

Action Items

#  Action                                        Priority
1  Patch cart-path memory retention bug          High
2  Add repeated cart-POST memory regression test High
3  Review memory autoscale guardrails            Medium
4  Add alert for cart cache-growth signature     Medium
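
Action item 2 could take roughly the following form: call the handler repeatedly and assert that retained memory does not scale with the number of calls. This is a hedged sketch using Python's `tracemalloc`; `retained_growth`, `leaky_handler`, and `clean_handler` are hypothetical stand-ins, and the real test would exercise the actual cart POST handler in the service's own language and test framework.

```python
import gc
import tracemalloc


def retained_growth(handler, calls, payload_size):
    """Return bytes still allocated after `calls` invocations of `handler`."""
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(calls):
            handler(bytes(payload_size))
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


_leaky_store = []


def leaky_handler(payload):
    # Stand-in for the suspected bug: every request payload is kept forever.
    _leaky_store.append(payload)


def clean_handler(payload):
    # Stand-in for the fixed path: the payload is used, then released.
    return len(payload)


leaky = retained_growth(leaky_handler, calls=5, payload_size=1_000_000)
clean = retained_growth(clean_handler, calls=5, payload_size=1_000_000)

# The leaky handler retains all five 1 MB payloads; the clean one retains almost nothing.
assert leaky >= 5_000_000
assert clean < 1_000_000
```

The thresholds here are deliberately loose; a real regression test would pick a bound tied to the cache budget chosen for the fix.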

References

  • Container App: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.App/containerapps/ca-grubify-api
  • Log Analytics Workspace ID: bd41ac04-55df-4ef8-b157-4aebd5cd76d5
  • App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-sre/providers/Microsoft.Insights/components/appi-sre-grubify

This issue was created by sre-agent-cff6qws2yy4ku--163d1e9d
