Incident: HTTP 5xx due to OutOfMemory in Cart API (Grubify Container App) #101

@gderossilive

Incident Report: HTTP 5xx due to OutOfMemory in Cart API

  • Incident ID: c319f08a-bd84-4c4b-99c0-b63cff7bf000
  • Service: Azure Container Apps — ca-grubify-api (rg: rg-grubify-app)
  • Subscription: 06dbbc7b-2363-4dd4-9803-95d07f1a8d3e
  • FQDN: ca-grubify-api.politecliff-89094031.swedencentral.azurecontainerapps.io
  • Active revision: ca-grubify-api--0000001 (100% traffic)

Summary

A Sev2 Azure Monitor metric alert fired for Grubify API 5xx responses. During the alert window, platform metrics showed a sharp burst of 5xx requests, and revision logs showed repeated unhandled System.OutOfMemoryException in CartController.AddItemToCart (/app/Controllers/CartController.cs:line 30). Immediate stabilization actions were executed (revision restart, then scale-up), and endpoint verification showed recovery.

Impact

  • User-facing API failures (HTTP 5xx) during the spike window.
  • Failed cart item operations on POST /api/cart/{userId}/items for affected requests.
  • Short-lived service instability while containers recycled after memory pressure.

Timeline (UTC)

  • ~08:53: 5xx spike starts; alert condition breached (Requests{statusCodeCategory=5xx} > threshold).
  • 08:55:51: Azure Monitor alert fired (alert-http-5xx-grubify, Sev2).
  • ~08:56-08:58: Repeated OutOfMemoryException seen in active revision logs; readiness failures observed around recycle events.
  • ~09:01: Revision restart executed for active revision.
  • ~09:04-09:06: Defensive scale change applied; new revision ca-grubify-api--0000001 became active with increased resources/replicas.
  • ~09:06-09:07: Verification checks show cart endpoint returning HTTP 200 (20/20 successful test calls).

Evidence

Console logs (active revision)

fail: Microsoft.AspNetCore.Server.Kestrel[13]
Connection id "...", Request id "...": An unhandled exception was thrown by the application.
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at GrubifyApi.Controllers.CartController.AddItemToCart(String userId, AddCartItemRequest request) in /app/Controllers/CartController.cs:line 30

Additional observed lines:

readiness probe failed: connection refused
Analytics cache: Added request data. Total entries: 1/2/3/4
Cache size: 10MB/20MB/30MB/40MB
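
The cache lines above suggest roughly linear growth per request. A minimal Python sketch of how such log lines could be parsed to quantify that growth and count OOM signatures follows. It is illustrative only: the service itself is .NET, the helper names are hypothetical, and the sample text assumes that the "1/2/3/4" and "10MB/20MB/30MB/40MB" shorthand stands for four consecutive log entries.

```python
import re

# Sample lines modelled on the excerpts above. Assumption: the slash shorthand
# in the report abbreviates four consecutive pairs of log lines.
SAMPLE_LOG = """\
Analytics cache: Added request data. Total entries: 1
Cache size: 10MB
Analytics cache: Added request data. Total entries: 2
Cache size: 20MB
Analytics cache: Added request data. Total entries: 3
Cache size: 30MB
Analytics cache: Added request data. Total entries: 4
Cache size: 40MB
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
readiness probe failed: connection refused
"""

ENTRIES_RE = re.compile(r"Total entries: (\d+)")
SIZE_RE = re.compile(r"Cache size: (\d+)MB")
OOM_SIGNATURE = "System.OutOfMemoryException"


def summarize(log_text):
    """Return entry counts, cache sizes in MB, and the number of OOM lines."""
    entries = [int(m) for m in ENTRIES_RE.findall(log_text)]
    sizes_mb = [int(m) for m in SIZE_RE.findall(log_text)]
    oom_lines = sum(
        1 for line in log_text.splitlines() if line.startswith(OOM_SIGNATURE)
    )
    return entries, sizes_mb, oom_lines


def growth_per_entry_mb(entries, sizes_mb):
    """Average cache growth per added entry between the first and last sample."""
    if len(entries) < 2 or len(entries) != len(sizes_mb) or entries[-1] == entries[0]:
        return None
    return (sizes_mb[-1] - sizes_mb[0]) / (entries[-1] - entries[0])


entries, sizes_mb, oom_lines = summarize(SAMPLE_LOG)
print(entries, sizes_mb, oom_lines, growth_per_entry_mb(entries, sizes_mb))
```

On the sample text this reports about 10 MB of growth per cached entry, which is consistent with retention that scales with request count.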

Traffic and Response Time

Evidence chart generated from investigation window (Requests, 5xx Requests, and ResponseTime with alert marker):

  • Local artifact generated by investigation: grubify-incident-2026-05-03-evidence.png

Metrics snapshot (Azure Monitor)

  • Requests (5xx, 1m bins):
    • 08:53 = 31
    • 08:54 = 80 (peak)
    • 08:55 = 16
    • 08:58 = 2
    • 09:00+ = 0
  • ResponseTime avg (ms): 08:36 = 47 ms, 08:59 = 46 ms; most spike-window points are low or sparse due to failures.
  • CPU utilization: low; peak observed ~2.75% during the traffic burst.
  • Memory utilization: low at platform level (~4.5% peak), yet the app still threw OutOfMemoryException in the request path, which suggests a managed/runtime allocation issue.
  • Alert context metric value at fire: 31 for 5xx criterion.
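
For a quick sense of scale, the listed 5xx bins can be tallied. The short Python sketch below uses only the values in the snapshot above. Minutes the report does not list are left out rather than guessed, so the total covers the listed bins only and should be read as a lower bound for the window. The function name is hypothetical.

```python
# 5xx counts per 1-minute bin, copied from the snapshot above.
# Minutes the report does not list are omitted, not assumed to be zero.
FIVE_XX_BINS = {"08:53": 31, "08:54": 80, "08:55": 16, "08:58": 2}


def summarize_bins(bins):
    """Return the total over the listed bins plus the peak minute and its count."""
    peak_minute = max(bins, key=bins.get)
    return {
        "listed_total": sum(bins.values()),
        "peak_minute": peak_minute,
        "peak_count": bins[peak_minute],
    }


print(summarize_bins(FIVE_XX_BINS))
```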

Root Cause

Application-level memory exhaustion in the cart API request path. Repeated unhandled System.OutOfMemoryException occurred in CartController.AddItemToCart (/app/Controllers/CartController.cs:line 30), causing request failures (5xx) and downstream readiness instability during recycle periods.

Remediation

  • Code: Refactor cart analytics/data retention in AddItemToCart to prevent unbounded memory growth (bounded cache/eviction or external store).
  • Defensive: Add payload size limits and endpoint throttling for POST /api/cart/{userId}/items.
  • Platform: Immediate mitigation executed:
    • Restarted active revision.
    • Scaled app to higher baseline capacity (new revision ca-grubify-api--0000001, resources increased to 1 vCPU / 2Gi, replicas increased).
  • Observability: Add alerting for OutOfMemoryException signature in container logs and endpoint-specific 5xx for cart route.
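
The "bounded cache/eviction" option in the code remediation can be sketched as follows. This is a minimal Python illustration of the technique, not the service's actual code (which is .NET and is not shown in this report). The class name, the limits, and the choice of least-recently-used eviction are all assumptions made for the example.

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache capped by both entry count and total payload bytes (hypothetical)."""

    def __init__(self, max_entries, max_bytes):
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self._items = OrderedDict()  # key -> bytes payload, least recently used first
        self._total_bytes = 0

    def put(self, key, payload):
        """Store payload; return False if it alone exceeds the byte budget."""
        if len(payload) > self.max_bytes:
            return False  # oversized payloads are rejected instead of cached
        if key in self._items:
            self._total_bytes -= len(self._items.pop(key))
        self._items[key] = payload
        self._total_bytes += len(payload)
        # Evict least recently used entries until both limits hold again.
        while len(self._items) > self.max_entries or self._total_bytes > self.max_bytes:
            _, evicted = self._items.popitem(last=False)
            self._total_bytes -= len(evicted)
        return True

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def __len__(self):
        return len(self._items)

    @property
    def total_bytes(self):
        return self._total_bytes


cache = BoundedCache(max_entries=3, max_bytes=100)
for i in range(10):
    cache.put(f"req-{i}", b"x" * 40)
print(len(cache), cache.total_bytes)
```

With these example limits the cache never holds more than 100 bytes regardless of how many requests arrive, which is the property the unbounded analytics cache lacked. An external store, the other option named above, would move the data out of process memory entirely.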

Action Items

# | Action | Priority
1 | Patch CartController.AddItemToCart to remove unbounded memory retention and add bounded strategy | High
2 | Add regression/load test for sustained cart POST traffic and memory growth behavior | High
3 | Keep temporary higher baseline scaling until code fix is deployed and validated | Medium
4 | Add log-based OOM alert + cart endpoint 5xx alert dimensioning | Medium
5 | Validate rollback criteria and document safe fallback scaling profile | Medium
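
The memory-growth regression test in action item 2 could take roughly the following shape. This Python sketch is a stand-in for a real load test against the cart endpoint: it drives two hypothetical in-process handlers, one that retains every payload and one with bounded retention, and compares the net allocation with `tracemalloc`. The handler names and the payload size are assumptions for illustration.

```python
import tracemalloc
from collections import deque

PAYLOAD_BYTES = 10_000  # hypothetical per-request analytics payload


def make_leaky_handler():
    retained = []  # grows forever: every request's payload is kept

    def handle(_request_id):
        retained.append(bytearray(PAYLOAD_BYTES))

    return handle


def make_bounded_handler(max_entries=10):
    retained = deque(maxlen=max_entries)  # oldest payloads are dropped

    def handle(_request_id):
        retained.append(bytearray(PAYLOAD_BYTES))

    return handle


def retained_bytes_after(handler, n_requests):
    """Call the handler n_requests times and return the net traced allocation."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for i in range(n_requests):
            handler(i)
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


leaky = retained_bytes_after(make_leaky_handler(), 200)
bounded = retained_bytes_after(make_bounded_handler(10), 200)
print(f"leaky: {leaky} bytes, bounded: {bounded} bytes")
```

A test built this way would fail when retained memory scales with the number of requests and pass when it stays within a fixed budget. The real test would issue sustained POST traffic and sample the container's working set instead.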

References

  • Container App: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.App/containerApps/ca-grubify-api
  • Alert Rule: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-sre/providers/Microsoft.Insights/metricAlerts/alert-http-5xx-grubify
  • Alert Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-app/providers/microsoft.app/containerapps/ca-grubify-api/providers/Microsoft.AlertsManagement/alerts/c319f08a-bd84-4c4b-99c0-b63cff7bf000
  • Log Analytics Workspace ID (GUID): bd41ac04-55df-4ef8-b157-4aebd5cd76d5
  • Log Analytics Workspace ARM ID: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.OperationalInsights/workspaces/cae-grubify-logs
  • App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-lab/providers/microsoft.insights/components/appi-cff6qws2yy4ku

This issue was created by sre-agent-cff6qws2yy4ku--163d1e9d
