Incident: HTTP 5xx due to OutOfMemory in Cart API (Grubify Container App) #104

@gderossilive

Incident Report: HTTP 5xx due to OutOfMemory in Cart API

  • Incident ID: 18c4993e-7340-496c-aa1c-ddd97954f000
  • Service: Azure Container Apps — ca-grubify-api (rg: rg-grubify-app)
  • Subscription: 06dbbc7b-2363-4dd4-9803-95d07f1a8d3e
  • FQDN: ca-grubify-api.politecliff-89094031.swedencentral.azurecontainerapps.io
  • Active revision: ca-grubify-api--0000001 (100% traffic)

Summary

A Sev2 Azure Monitor metric alert fired for backend HTTP 5xx at 2026-05-03T10:28:33Z. Console logs show repeated unhandled System.OutOfMemoryException in CartController.AddItemToCart(...), followed by readiness probe failures (connection refused). The active revision was restarted as immediate mitigation, and endpoint checks succeeded afterward.

Impact

  • API requests on the cart path experienced elevated failures (HTTP 5xx) during the spike window.
  • User cart operations were intermittently unavailable while replicas churned and readiness checks failed.

Timeline (UTC)

  • ~10:22: Request surge started (43/min), increasing to 82/min by 10:23.
  • ~10:26: Repeated unhandled System.OutOfMemoryException logged from CartController.cs:line 30.
  • 10:28:33: Azure Monitor alert alert-http-5xx-grubify-api fired (Sev2).
  • ~10:32: Readiness probe failures (connection refused) and container stop/start churn observed.
  • ~10:33: Active revision ca-grubify-api--0000001 restarted as immediate mitigation.
  • ~10:34: Post-mitigation endpoint checks returned HTTP 200 across core API routes.

Evidence

Console logs (active revision)

fail: Microsoft.AspNetCore.Server.Kestrel[13]
Connection id "...", Request id "...": An unhandled exception was thrown by the application.
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at GrubifyApi.Controllers.CartController.AddItemToCart(String userId, AddCartItemRequest request) in /app/Controllers/CartController.cs:line 30
...
readiness probe failed: connection refused
Stopping container grubify-api
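
The two signatures in this excerpt (the OOM exception and the readiness probe failure) can be counted mechanically when triaging a larger log export. The following is a minimal sketch in Python, used here only for illustration since the service itself is ASP.NET Core. The function name summarize_console_log and the sample list are hypothetical; the sample lines are copied from the excerpt above.

```python
import re

# Signatures taken from the console log excerpt above.
OOM_PATTERN = re.compile(r"System\.OutOfMemoryException")
PROBE_PATTERN = re.compile(r"readiness probe failed", re.IGNORECASE)


def summarize_console_log(lines):
    """Count log lines that contain the OOM or readiness-probe signature."""
    counts = {"oom": 0, "probe_failures": 0}
    for line in lines:
        # A line is counted once even if the signature appears twice in it.
        if OOM_PATTERN.search(line):
            counts["oom"] += 1
        if PROBE_PATTERN.search(line):
            counts["probe_failures"] += 1
    return counts


# Hypothetical input: a few lines copied from the excerpt in this report.
sample = [
    'fail: Microsoft.AspNetCore.Server.Kestrel[13]',
    "System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.",
    "   at GrubifyApi.Controllers.CartController.AddItemToCart(String userId, AddCartItemRequest request)",
    "readiness probe failed: connection refused",
]

print(summarize_console_log(sample))  # {'oom': 1, 'probe_failures': 1}
```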

Traffic and Response Time

[Chart: incident-window telemetry (requests, CPU, memory) with alert and mitigation markers.]

Post-mitigation synthetic checks (all HTTP 200):

  • GET /weatherforecast
  • GET /api/restaurants
  • GET /api/fooditems
  • POST /api/cart/demo-user/items
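
The checks listed above could be scripted along these lines. This is a sketch under stated assumptions, not the tooling that was actually used: the https scheme, the helper names (http_status, run_checks, all_ok), and in particular the POST payload are hypothetical, because the AddCartItemRequest schema is not shown in this report.

```python
import json
import urllib.error
import urllib.request

# FQDN from this report; the https scheme is an assumption.
BASE_URL = "https://ca-grubify-api.politecliff-89094031.swedencentral.azurecontainerapps.io"

CHECKS = [
    ("GET", "/weatherforecast", None),
    ("GET", "/api/restaurants", None),
    ("GET", "/api/fooditems", None),
    # Hypothetical payload: the real AddCartItemRequest fields are not documented here.
    ("POST", "/api/cart/demo-user/items", {"foodItemId": 1, "quantity": 1}),
]


def http_status(method, url, body=None, timeout=10):
    """Send one request and return the HTTP status code, or None if unreachable."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except urllib.error.URLError:
        return None  # e.g. connection refused while replicas are restarting


def run_checks(send=http_status, base_url=BASE_URL):
    """Run every check and return a mapping of 'METHOD path' to status."""
    return {
        f"{method} {path}": send(method, base_url + path, body)
        for method, path, body in CHECKS
    }


def all_ok(results):
    """True only if every check returned HTTP 200."""
    return all(status == 200 for status in results.values())


# Usage (performs real network calls):
#   results = run_checks()
#   print(results, all_ok(results))
```

The send parameter is injectable so the runner can be exercised against a stub without touching the live endpoint.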

Metrics snapshot (Azure Monitor)

  • Requests (1m bins, total request signal): 10:22=43, 10:23=82, 10:24=81, 10:25=65, 10:26=80, 10:27=62, then 0 from 10:28 onward.
  • ResponseTime avg: No figure is recorded here; the alert reflects a backend 5xx threshold breach, and request-path latency degraded alongside the OOM exception burst.
  • RestartCount: Replica stop/start churn observed in logs near 10:32 during instability.
  • MemoryPercentage: ~2% in the sampled platform metric (known limitation in this lab; the platform metric does not always reflect process-level OOM on the application path).
  • UsageNanoCores: Near zero during the burst, consistent with a memory-driven failure rather than CPU saturation.
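
For reference, the burst size implied by the per-minute request bins can be worked out directly. The short Python sketch below only uses the six values quoted in the Requests bullet; the variable names are illustrative.

```python
# Per-minute request counts copied from the Requests bullet (UTC).
requests_per_minute = {
    "10:22": 43,
    "10:23": 82,
    "10:24": 81,
    "10:25": 65,
    "10:26": 80,
    "10:27": 62,
}

total = sum(requests_per_minute.values())
peak_minute = max(requests_per_minute, key=requests_per_minute.get)
average = total / len(requests_per_minute)

print(f"total={total} peak={requests_per_minute[peak_minute]} at {peak_minute} "
      f"avg={average:.1f}/min")
# total=413 peak=82 at 10:23 avg=68.8/min
```

That is 413 requests over six minutes, peaking at 82/min, before the request signal dropped to zero.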

Root Cause

The incident is consistent with a code-level memory leak pattern in the cart endpoint path: CartController.AddItemToCart(...) allocates and retains large buffers, resulting in repeated OutOfMemoryException and request failures under burst traffic. This aligns with known Grubify lab failure behavior and the observed stack trace (CartController.cs:line 30).

Remediation

  • Code: Replace unbounded per-request retained allocations in CartController.AddItemToCart with bounded cache/metadata-only tracking and TTL/size limits.
  • Defensive: Add cart endpoint rate limiting and payload/operation guards to reduce burst amplification.
  • Platform: Keep a conservative minimum replica count and memory sizing while the code fix rolls out, and validate revision health before routing traffic to the new revision.
  • Observability: Add explicit alerting on System.OutOfMemoryException log signature and cart-route error rate, and capture per-endpoint error SLOs.
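
To make the "bounded cache with TTL/size limits" idea in the Code item concrete, here is a minimal sketch. It is written in Python for illustration only and is not the Grubify implementation, which is ASP.NET Core; the class name BoundedTtlCache and its parameters max_entries and ttl_seconds are hypothetical. The point is that retained state has a hard entry limit and an expiry, so repeated requests cannot grow it without bound.

```python
import time
from collections import OrderedDict


class BoundedTtlCache:
    """In-memory cache with a hard entry limit and a shared per-entry TTL."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), oldest first

    def put(self, key, value):
        now = self._clock()
        self._evict_expired(now)
        self._items.pop(key, None)  # re-inserting moves the key to the end
        self._items[key] = (now + self._ttl, value)
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)  # drop the oldest entry

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if expires_at <= self._clock():
            del self._items[key]
            return default
        return value

    def _evict_expired(self, now):
        # All entries share one TTL and are ordered by write time,
        # so expired entries are always at the front.
        while self._items:
            key, (expires_at, _) = next(iter(self._items.items()))
            if expires_at > now:
                break
            del self._items[key]

    def __len__(self):
        return len(self._items)


# Store small metadata (e.g. an item count) rather than large buffers.
cache = BoundedTtlCache(max_entries=1000, ttl_seconds=300)
cache.put("demo-user", {"item_count": 1})
print(cache.get("demo-user"))  # {'item_count': 1}
```

This sketch is not thread-safe; a real service handling concurrent requests would need locking or a concurrent cache from its own framework.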

Action Items

# | Action | Priority
1 | Patch cart controller to remove unbounded retained allocations and ship hotfix revision | High
2 | Add load/regression test for repeated POST /api/cart/{userId}/items to detect leak behavior pre-release | Medium
3 | Introduce endpoint rate limiting and validate autoscale/concurrency settings for cart path | Medium
4 | Add OOM-specific log alert and route-level 5xx dashboard/alerts | Low
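
Action item 2 could take roughly the following shape: call the handler many times and fail the test if retained memory keeps growing. The sketch below uses Python's tracemalloc purely to illustrate the technique; a real test for this service would target the .NET endpoint. The two handlers are hypothetical stand-ins, one that retains a buffer per call (the leak pattern described under Root Cause) and one that does not.

```python
import gc
import tracemalloc


def retained_growth_bytes(handler, calls):
    """Return the bytes still allocated after invoking handler `calls` times."""
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for i in range(calls):
            handler(i)
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical stand-in for the leaking path: every call keeps a 10 KB buffer.
_retained = []


def leaky_add_item(i):
    _retained.append(bytearray(10_000))


# Hypothetical stand-in for the fixed path: the buffer is released on return.
def bounded_add_item(i):
    buffer = bytearray(10_000)
    return len(buffer)


LEAK_THRESHOLD_BYTES = 1_000_000  # illustrative budget, an assumption

leaky_growth = retained_growth_bytes(leaky_add_item, 200)
bounded_growth = retained_growth_bytes(bounded_add_item, 200)
print("leaky exceeds budget:", leaky_growth > LEAK_THRESHOLD_BYTES)
print("bounded exceeds budget:", bounded_growth > LEAK_THRESHOLD_BYTES)
```

With 200 calls retaining 10 KB each, the leaky handler holds about 2 MB and exceeds the budget, while the bounded handler stays well under it.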

References

  • Container App: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.App/containerApps/ca-grubify-api
  • Alert Instance: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-app/providers/microsoft.app/containerapps/ca-grubify-api/providers/Microsoft.AlertsManagement/alerts/18c4993e-7340-496c-aa1c-ddd97954f000
  • Alert Rule: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.Insights/metricAlerts/alert-http-5xx-grubify-api
  • Log Analytics Workspace ID: bd41ac04-55df-4ef8-b157-4aebd5cd76d5
  • Log Analytics Workspace Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-app/providers/microsoft.operationalinsights/workspaces/cae-grubify-logs
  • App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-lab/providers/microsoft.insights/components/appi-cff6qws2yy4ku

This issue was created by sre-agent-cff6qws2yy4ku--163d1e9d
