Incident Report: HTTP 5xx due to OutOfMemory in Cart API
- Incident ID: 18c4993e-7340-496c-aa1c-ddd97954f000
- Service: Azure Container Apps, ca-grubify-api (rg: rg-grubify-app)
- Subscription: 06dbbc7b-2363-4dd4-9803-95d07f1a8d3e
- FQDN: ca-grubify-api.politecliff-89094031.swedencentral.azurecontainerapps.io
- Active revision: ca-grubify-api--0000001 (100% traffic)
Summary
A Sev2 Azure Monitor metric alert fired for backend HTTP 5xx at 2026-05-03T10:28:33Z. Console logs show repeated unhandled System.OutOfMemoryException in CartController.AddItemToCart(...) and subsequent readiness probe failures (connection refused). Immediate mitigation was applied by restarting the active revision, and endpoint validation succeeded afterward.
Impact
- API requests on the cart path experienced elevated failures (HTTP 5xx) during the spike window.
- User cart operations were intermittently unavailable while replicas churned and readiness checks failed.
Timeline (UTC)
- ~10:22: Request surge started (43/min), increasing to 82/min by 10:23.
- ~10:26: Repeated unhandled System.OutOfMemoryException logged from CartController.cs:line 30.
- 10:28:33: Azure Monitor alert alert-http-5xx-grubify-api fired (Sev2).
- ~10:32: Readiness probe failures (connection refused) and container stop/start churn observed.
- ~10:33: Active revision ca-grubify-api--0000001 restarted as immediate mitigation.
- ~10:34: Post-mitigation endpoint checks returned HTTP 200 across core API routes.
Evidence
Console logs (active revision)
fail: Microsoft.AspNetCore.Server.Kestrel[13]
Connection id "...", Request id "...": An unhandled exception was thrown by the application.
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at GrubifyApi.Controllers.CartController.AddItemToCart(String userId, AddCartItemRequest request) in /app/Controllers/CartController.cs:line 30
...
readiness probe failed: connection refused
Stopping container grubify-api
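The two signatures in this excerpt (the OOM exception and the readiness probe failure) are easy to count mechanically. The following is a minimal sketch, assuming the console log is available as plain text lines; the function name `summarize_log` and the `SAMPLE` list are illustrative and not part of the Grubify tooling.

```python
import re

# Signatures copied from the console log excerpt above.
OOM_SIGNATURE = re.compile(r"System\.OutOfMemoryException")
PROBE_FAILURE = re.compile(r"readiness probe failed")


def summarize_log(lines):
    """Count log lines that carry the OOM or readiness-probe signature."""
    counts = {"oom": 0, "probe_failures": 0}
    for line in lines:
        # A line is counted once even if the signature appears twice in it.
        if OOM_SIGNATURE.search(line):
            counts["oom"] += 1
        if PROBE_FAILURE.search(line):
            counts["probe_failures"] += 1
    return counts


# Hypothetical input: a few lines taken from the excerpt above.
SAMPLE = [
    "fail: Microsoft.AspNetCore.Server.Kestrel[13]",
    "System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.",
    "readiness probe failed: connection refused",
    "Stopping container grubify-api",
]

if __name__ == "__main__":
    print(summarize_log(SAMPLE))
```

Counting per line rather than per match keeps the exception message, which repeats the type name, from being double counted.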
Traffic and Response Time
Evidence chart (not reproduced here): requests, CPU, and memory over the incident window, with alert and mitigation markers.
Post-mitigation synthetic checks (all HTTP 200):
GET /weatherforecast
GET /api/restaurants
GET /api/fooditems
POST /api/cart/demo-user/items
Metrics snapshot (Azure Monitor)
- Requests (1m bins, total request signal): 10:22=43, 10:23=82, 10:24=81, 10:25=65, 10:26=80, 10:27=62, then 0 from 10:28 onward.
- ResponseTime avg: Request-path latency degraded alongside the OOM exception burst; the alert itself fired on the backend 5xx threshold.
- RestartCount: Replica stop/start churn observed in logs near 10:32 during the instability.
- MemoryPercentage: ~2% in the sampled platform metric (a known limitation in this lab; the platform metric does not always reflect process-level OOM in the app path).
- UsageNanoCores: Near zero during the burst, consistent with a memory-dominated failure rather than CPU saturation.
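The request bins above can be summarized in a few lines. This sketch uses only the per-minute values listed in the snapshot; the helper name `summarize` is illustrative.

```python
# Per-minute request counts copied from the metrics snapshot above (UTC).
REQUESTS_PER_MINUTE = {
    "10:22": 43,
    "10:23": 82,
    "10:24": 81,
    "10:25": 65,
    "10:26": 80,
    "10:27": 62,
    "10:28": 0,
}


def summarize(bins):
    """Return total requests, the peak minute, and the first minute with no traffic."""
    peak_minute = max(bins, key=bins.get)
    first_zero = next((minute for minute, count in bins.items() if count == 0), None)
    return {
        "total": sum(bins.values()),
        "peak_minute": peak_minute,
        "peak": bins[peak_minute],
        "first_zero": first_zero,
    }


if __name__ == "__main__":
    print(summarize(REQUESTS_PER_MINUTE))
```

On these values the burst totals 413 requests over six minutes, peaking at 82/min at 10:23, with the request signal dropping to zero at 10:28.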
Root Cause
The incident is consistent with a code-level memory leak pattern in the cart endpoint path: CartController.AddItemToCart(...) allocates and retains large buffers, resulting in repeated OutOfMemoryException and request failures under burst traffic. This aligns with known Grubify lab failure behavior and the observed stack trace (CartController.cs:line 30).
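To make the suspected pattern concrete, here is a language-neutral sketch in Python of a handler that allocates a buffer per request and never releases it. It is not the CartController source, which this report does not include; the class `LeakyCartHandler`, the helper `retained_growth`, and the 1 KiB buffer size are all assumptions chosen for illustration.

```python
class LeakyCartHandler:
    """Hypothetical stand-in for the cart handler; not the real controller code."""

    def __init__(self, buffer_bytes=1024):
        self._buffer_bytes = buffer_bytes  # assumed size, for illustration only
        self._retained = []  # lives as long as the handler and is never trimmed

    def add_item(self, user_id, item):
        # Allocate a per-request buffer and keep a reference to it indefinitely.
        self._retained.append(bytearray(self._buffer_bytes))
        return {"user": user_id, "item": item}

    def retained_bytes(self):
        return sum(len(buf) for buf in self._retained)


def retained_growth(handler, calls):
    """Return how many bytes the handler retained across `calls` requests."""
    before = handler.retained_bytes()
    for i in range(calls):
        handler.add_item("demo-user", f"item-{i}")
    return handler.retained_bytes() - before


if __name__ == "__main__":
    handler = LeakyCartHandler()
    print(retained_growth(handler, 100))
    print(retained_growth(handler, 100))
```

Retained memory grows linearly with request count and never falls, which is why a sustained burst can exhaust the process even when CPU stays low. A check like `retained_growth` could also serve as the core of a pre-release regression test.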
Remediation
- Code: Replace unbounded per-request retained allocations in CartController.AddItemToCart with bounded cache/metadata-only tracking and TTL/size limits.
- Defensive: Add cart endpoint rate limiting and payload/operation guards to reduce burst amplification.
- Platform: Keep conservative floor replicas and memory sizing while code fix rolls out; validate revision health before traffic stabilization.
- Observability: Add explicit alerting on System.OutOfMemoryException log signature and cart-route error rate, and capture per-endpoint error SLOs.
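The "bounded cache with TTL/size limits" idea in the Code item can be sketched as follows. This is a language-neutral illustration in Python, not the planned fix; the class name `BoundedTTLCache`, its parameters, and the injectable `clock` are assumptions made so the behaviour is easy to test.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Cache with a hard entry limit and one shared time-to-live (illustrative)."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), oldest first

    def put(self, key, value):
        now = self._clock()
        self._evict_expired(now)
        # Re-inserting moves the key to the end so order matches expiry order.
        self._items.pop(key, None)
        self._items[key] = (now + self._ttl, value)
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)  # drop the oldest entry

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at <= self._clock():
            del self._items[key]
            return None
        return value

    def _evict_expired(self, now):
        # All entries share one TTL, so the oldest entry always expires first.
        while self._items:
            oldest_key = next(iter(self._items))
            if self._items[oldest_key][0] > now:
                break
            del self._items[oldest_key]

    def __len__(self):
        return len(self._items)
```

With both limits in place, memory held by the cache is capped at `max_entries` values regardless of request volume, and idle entries age out after `ttl_seconds`.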
Action Items
| # | Action | Priority |
|---|--------|----------|
| 1 | Patch cart controller to remove unbounded retained allocations and ship hotfix revision | High |
| 2 | Add load/regression test for repeated POST /api/cart/{userId}/items to detect leak behavior pre-release | Medium |
| 3 | Introduce endpoint rate limiting and validate autoscale/concurrency settings for cart path | Medium |
| 4 | Add OOM-specific log alert and route-level 5xx dashboard/alerts | Low |
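For action item 3, one common way to limit request rate on a single endpoint is a token bucket. The sketch below is a language-neutral illustration in Python under that assumption; the report does not say which limiter will be used, and the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allows bursts up to `capacity`, refills at a fixed rate."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A rejected request would typically be answered with HTTP 429 rather than reaching the handler, which reduces the allocation pressure a burst can create. The values for rate and capacity would need to be chosen against real cart traffic.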
References
- Container App: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.App/containerApps/ca-grubify-api
- Alert Instance: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-app/providers/microsoft.app/containerapps/ca-grubify-api/providers/Microsoft.AlertsManagement/alerts/18c4993e-7340-496c-aa1c-ddd97954f000
- Alert Rule: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-grubify-app/providers/Microsoft.Insights/metricAlerts/alert-http-5xx-grubify-api
- Log Analytics Workspace ID: bd41ac04-55df-4ef8-b157-4aebd5cd76d5
- Log Analytics Workspace Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-app/providers/microsoft.operationalinsights/workspaces/cae-grubify-logs
- App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourcegroups/rg-grubify-lab/providers/microsoft.insights/components/appi-cff6qws2yy4ku
This issue was created by sre-agent-cff6qws2yy4ku--163d1e9d
Tracked by the SRE agent here