Incident: HTTP 5xx due to OutOfMemoryException in Cart API (Grubify Container App) #103

@gderossilive

Incident Report: HTTP 5xx due to OutOfMemoryException in Cart API

  • Incident ID: 1802ec48-6466-4df5-bee5-b0345a7df000
  • Service: Azure Container Apps — ca-grubify-api (rg: rg-grubify-app)
  • Subscription: 06dbbc7b-2363-4dd4-9803-95d07f1a8d3e
  • FQDN: ca-grubify-api.politecliff-89094031.swedencentral.azurecontainerapps.io
  • Active revision: ca-grubify-api--0000001 (100% traffic)

Summary

Azure Monitor fired alert-http-5xx-grubify at 2026-05-03T10:28:53Z for sustained 5xx responses on ca-grubify-api. Runtime logs show repeated unhandled System.OutOfMemoryException in CartController.AddItemToCart during the same window. Immediate mitigation was applied by restarting the active revision; live endpoint probes then returned HTTP 200.

Impact

  • User-facing API failures (HTTP 5xx) during the incident burst around the alert window.
  • Cart-related operations were at elevated risk of failure while OOM exceptions were being thrown.

Timeline (UTC)

  • ~10:22–10:27: Traffic ramp observed (43, 82, 81, 65, 80, 62 req/min).
  • ~10:26–10:31: Repeated System.OutOfMemoryException and unhandled request failures in app logs.
  • 10:28:53: Sev2 Azure Monitor alert fired (alert-http-5xx-grubify).
  • ~10:31–10:32: Active revision restart executed (ca-grubify-api--0000001).
  • ~10:32–10:33: Endpoint validation succeeded (/weatherforecast, /api/restaurants, /api/fooditems, /api/cart/demo-user/items all HTTP 200).

Evidence

Console logs (active revision)

fail: Microsoft.AspNetCore.Server.Kestrel[13]
Connection id "0HNL8UCA95GM9", Request id "0HNL8UCA95GM9:00000002": An unhandled exception was thrown by the application.
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at GrubifyApi.Controllers.CartController.AddItemToCart(String userId, AddCartItemRequest request) in /app/Controllers/CartController.cs:line 30

Additional repeated signals in the same window:

  • readiness probe failed: connection refused
  • Stopping container grubify-api
  • Application is shutting down...
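The stack trace above has a regular shape, so it can be matched mechanically, which is the basis for any log-based OOM signal. The following is a minimal sketch in Python, assuming the console logs are available as plain text lines; the function name `find_oom_frames` and both regular expressions are illustrative and not part of the Grubify codebase or any Azure tooling.

```python
import re
from collections import Counter

# Matches the exception line that .NET emits for an OutOfMemoryException.
OOM_RE = re.compile(r"^System\.OutOfMemoryException\b")
# Matches a stack frame such as:
#    at Namespace.Type.Method(args) in /path/File.cs:line 30
FRAME_RE = re.compile(
    r"^\s*at (?P<method>[\w.]+)\(.*\) in (?P<file>\S+):line (?P<line>\d+)"
)


def find_oom_frames(lines):
    """Count the first stack frame that follows each OutOfMemoryException line."""
    counts = Counter()
    expecting_frame = False
    for line in lines:
        if OOM_RE.match(line):
            expecting_frame = True
            continue
        if expecting_frame:
            m = FRAME_RE.match(line)
            if m:
                counts[(m["method"], m["file"], int(m["line"]))] += 1
            expecting_frame = False
    return counts


# The log excerpt from this report, one entry per line.
sample = [
    "fail: Microsoft.AspNetCore.Server.Kestrel[13]",
    'Connection id "0HNL8UCA95GM9", Request id "0HNL8UCA95GM9:00000002": '
    "An unhandled exception was thrown by the application.",
    "System.OutOfMemoryException: Exception of type "
    "'System.OutOfMemoryException' was thrown.",
    "   at GrubifyApi.Controllers.CartController.AddItemToCart(String userId, "
    "AddCartItemRequest request) in /app/Controllers/CartController.cs:line 30",
]
oom_frames = find_oom_frames(sample)
```

On the excerpt above, `oom_frames` holds a single key for `CartController.AddItemToCart` at line 30 with a count of 1. Grouping by frame rather than by message is a design choice: it keeps unrelated OOM sites from being merged into one counter.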

Traffic and Response Time

An investigation chart generated for the incident window shows a request burst, a drop to zero at alert time, and stable behavior after mitigation.

Key plotted points (UTC, req/min):

  • 10:22=43, 10:23=82, 10:24=81, 10:25=65, 10:26=80, 10:27=62, 10:28=0, 10:29=0, 10:30=0

Metrics snapshot (Azure Monitor)

  • Requests (1m bins): 43, 82, 81, 65, 80, 62 (10:22–10:27), then 0 at 10:28+
  • ResponseTime avg: elevated during the incident window; the alert itself fired on the 5xx rule, not on latency
  • RestartCount: not reliably returned via CLI in this run; runtime logs confirm restart activity around mitigation
  • MemoryPercentage: ~2% average in sampled post-window points
  • UsageNanoCores: low/near-idle in sampled post-window points
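As a quick worked check on the request series, the snippet below totals the burst, finds the peak minute, and locates the first zero bin. The numbers are copied from the snapshot above; the variable names are illustrative only.

```python
# Requests per minute from the Azure Monitor snapshot (UTC minute -> count).
requests_per_min = {
    "10:22": 43, "10:23": 82, "10:24": 81, "10:25": 65,
    "10:26": 80, "10:27": 62, "10:28": 0, "10:29": 0, "10:30": 0,
}

# Minutes that actually carried traffic.
active = {t: n for t, n in requests_per_min.items() if n > 0}
total = sum(active.values())
peak_minute = max(active, key=active.get)
mean_rate = total / len(active)
# First one-minute bin with no requests at all.
first_zero = next(t for t, n in requests_per_min.items() if n == 0)

print(
    f"total={total} peak={active[peak_minute]} at {peak_minute} "
    f"mean={mean_rate:.1f} first_zero={first_zero}"
)
# prints: total=413 peak=82 at 10:23 mean=68.8 first_zero=10:28
```

The peak of 82 req/min is roughly 1.4 requests per second, and the first zero bin (10:28) coincides with the minute the alert fired.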

Endpoint validation after mitigation:

  • GET /weatherforecast → 200
  • GET /api/restaurants → 200
  • GET /api/fooditems → 200
  • POST /api/cart/demo-user/items → 200
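The checks above can be scripted so the same validation is repeatable after a future restart. This is a minimal sketch using only the Python standard library. It assumes the app is served over HTTPS at the FQDN listed in this report, and the POST body is a hypothetical placeholder because the report does not show the real `AddCartItemRequest` schema.

```python
import urllib.error
import urllib.request

# FQDN taken from this report; the https scheme is an assumption.
BASE_URL = (
    "https://ca-grubify-api.politecliff-89094031"
    ".swedencentral.azurecontainerapps.io"
)

# (method, path, body) for each check. The POST body is hypothetical.
CHECKS = [
    ("GET", "/weatherforecast", None),
    ("GET", "/api/restaurants", None),
    ("GET", "/api/fooditems", None),
    ("POST", "/api/cart/demo-user/items", b'{"foodItemId": 1, "quantity": 1}'),
]


def http_status(method, url, body=None, timeout=10):
    """Return the HTTP status code for one request, including 4xx/5xx.

    Connection-level failures (urllib.error.URLError) are left to propagate.
    """
    req = urllib.request.Request(url, data=body, method=method)
    if body is not None:
        req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def validate(base_url, checks, fetch=http_status):
    """Run every check and return {(method, path): status}."""
    return {(m, p): fetch(m, base_url + p, body) for m, p, body in checks}


def all_ok(results):
    """True only if every probed endpoint returned HTTP 200."""
    return all(status == 200 for status in results.values())
```

Calling `all_ok(validate(BASE_URL, CHECKS))` would run the four probes against the live app. The `fetch` parameter exists so the logic can be exercised with a stub instead of real network calls.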

Root Cause

Application-level memory exhaustion in the cart code path: repeated System.OutOfMemoryException in GrubifyApi.Controllers.CartController.AddItemToCart (/app/Controllers/CartController.cs:line 30) caused unhandled exceptions and 5xx errors under active traffic.

Remediation

  • Code: Remove unbounded per-request memory retention in CartController.AddItemToCart; implement bounded cache or persistent storage.
  • Defensive: Add request throttling/rate limits specifically on cart write endpoint and enforce payload constraints.
  • Platform: Keep higher baseline capacity for resilience (current app configured at 2Gi / min replicas 2, max 4) and tune autoscale rules for burst traffic.
  • Observability: Add explicit OOM/log-based alerting and endpoint-specific SLO monitors for cart operations.
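To illustrate the "bounded cache" option in the Code item, here is a minimal sketch of an in-memory cart store with hard limits. It is written in Python for brevity and is not the service's C# code. The report does not show what `AddItemToCart` actually retains, so the class name `BoundedCartStore`, both limits, and the eviction policy (least recently used) are assumptions.

```python
from collections import OrderedDict


class BoundedCartStore:
    """In-memory cart store with hard limits on users and items per cart."""

    def __init__(self, max_users=1000, max_items_per_cart=100):
        self.max_users = max_users
        self.max_items_per_cart = max_items_per_cart
        # user_id -> list of items, ordered from least to most recently used.
        self._carts = OrderedDict()

    def add_item(self, user_id, item):
        """Add one item and return the new cart size; reject oversized carts."""
        cart = self._carts.get(user_id)
        if cart is None:
            cart = []
            self._carts[user_id] = cart
        if len(cart) >= self.max_items_per_cart:
            raise ValueError("cart is full")
        cart.append(item)
        self._carts.move_to_end(user_id)  # mark as most recently used
        while len(self._carts) > self.max_users:
            self._carts.popitem(last=False)  # evict least recently used cart
        return len(cart)

    def items(self, user_id):
        """Return a copy of the user's cart, or an empty list if unknown."""
        return list(self._carts.get(user_id, []))

    def __len__(self):
        return len(self._carts)
```

With both limits in place, worst-case memory is bounded by `max_users * max_items_per_cart` items regardless of request volume. Persistent storage, the other option named above, avoids eviction but moves the bound to the storage layer.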

Action Items

# | Action | Priority
1 | Patch CartController to eliminate unbounded memory growth and deploy fixed image | High
2 | Add regression/load test that repeatedly posts to cart endpoint and asserts stable memory | High
3 | Add autoscale rule and guardrails for cart traffic bursts | Medium
4 | Add log-based alert for OutOfMemoryException with incident correlation to 5xx alert | Medium
5 | Confirm alert auto-resolves and document closeout steps in runbook | Low
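Action item 2 could take roughly the following shape. This sketch runs in-process with Python's `tracemalloc` and compares two stand-in handlers, one that retains every request and one that is bounded. Both handlers, the 1 KiB payload, and the 1 MB threshold are hypothetical; a real test would post to the cart endpoint and watch the container's memory metric instead.

```python
import tracemalloc
from collections import deque


def measure_growth(handler, warmup=1_000, iterations=20_000):
    """Return bytes of traced memory growth over `iterations` calls after warm-up."""
    tracemalloc.start()
    try:
        for i in range(warmup):
            handler("demo-user", {"id": i, "qty": 1})
        before, _ = tracemalloc.get_traced_memory()
        for i in range(warmup, warmup + iterations):
            handler("demo-user", {"id": i, "qty": 1})
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Stand-ins for the endpoint under test (hypothetical, not the real controller).
leaky_log = []


def leaky_add(user_id, item):
    # Every request is retained forever, so memory grows without bound.
    leaky_log.append((user_id, item, bytearray(1024)))


bounded_log = deque(maxlen=500)


def bounded_add(user_id, item):
    # Oldest entries are dropped once the deque holds 500 requests.
    bounded_log.append((user_id, item, bytearray(1024)))


LIMIT_BYTES = 1_000_000  # illustrative threshold for "stable memory"
leaky_growth = measure_growth(leaky_add)
bounded_growth = measure_growth(bounded_add)
```

A regression test would assert that growth for the patched handler stays below `LIMIT_BYTES`. Measuring after a warm-up phase matters because a bounded structure still grows until it reaches its cap.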

References

  • Container App: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.App/containerApps/ca-grubify-api
  • Log Analytics Workspace ID: bd41ac04-55df-4ef8-b157-4aebd5cd76d5
  • Log Analytics Workspace ARM ID: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-lab/providers/Microsoft.OperationalInsights/workspaces/cae-grubify-logs
  • App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-lab/providers/Microsoft.Insights/components/appi-cff6qws2yy4ku
  • Alert ARM ID: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-app/providers/microsoft.app/containerapps/ca-grubify-api/providers/Microsoft.AlertsManagement/alerts/1802ec48-6466-4df5-bee5-b0345a7df000

This issue was created by sre-agent-cff6qws2yy4ku--163d1e9d
