Skip to content

Detect Lambda OOM crashes and return codes.ResourceExhausted#859

Open
c1-dev-bot[bot] wants to merge 2 commits into
mainfrom
fix/lambda-oom-detection
Open

Detect Lambda OOM crashes and return codes.ResourceExhausted#859
c1-dev-bot[bot] wants to merge 2 commits into
mainfrom
fix/lambda-oom-detection

Conversation

@c1-dev-bot
Copy link
Copy Markdown
Contributor

@c1-dev-bot c1-dev-bot Bot commented May 22, 2026

Summary

  • Adds OOM detection for Lambda function errors by checking for Runtime.ExitError with signal: killed and comparing Memory Size vs Max Memory Used from the Lambda REPORT line
  • Returns codes.ResourceExhausted for OOM crashes (enabling proper retry behavior and clearer diagnostics)
  • Stops filtering Runtime.ExitError from extractMeaningfulLogLines() — this line carries crash context that was previously discarded, causing the uninformative "function returned error: Unhandled; status code: 200" message
  • Refactors inline error classification into classifyLambdaError() per the existing code comment at line 69 requesting this when a third case was added

Context

This addresses a cross-connector Lambda crash pattern where FunctionError=Unhandled errors provide no diagnostic information because extractMeaningfulLogLines() filters out the key crash indicators (Runtime.ExitError, REPORT line with memory stats). The issue has been observed in multiple connectors (baton-workday, baton-google-workspace, baton-okta).

Test plan

  • Added TestIsLambdaOOM with 8 test cases covering: empty log, normal execution, OOM via signal killed, OOM via memory match, timeout (not OOM), signal killed without ExitError, ExitError without signal killed, memory fields on separate lines
  • Added TestClassifyLambdaError with 6 test cases covering: timeout via payload, timeout via context deadline, OOM via signal killed, OOM via memory limit, generic error with logs, generic error without logs
  • Updated TestExtractMeaningfulLogLines with new case verifying Runtime.ExitError is preserved in output
  • CI must pass (cannot run tests locally — environment has Go 1.24 but SDK requires Go 1.25.2)

Fixes: CXH-1553


Automated PR Notice

This PR was automatically created by c1-dev-bot as a potential implementation.

This code requires:

  • Human review of the implementation approach
  • Manual testing to verify correctness
  • Approval from the appropriate team before merging

The Lambda error handling previously only detected timeouts, leaving
OOM crashes as uninformative "function returned error: Unhandled"
messages. This adds OOM detection by checking for Runtime.ExitError
with signal:killed and comparing Memory Size vs Max Memory Used from
the Lambda REPORT line. OOM errors now return codes.ResourceExhausted
for proper retry behavior and clearer diagnostics.

Also stops filtering Runtime.ExitError from log output — it carries
crash context that was previously discarded.

Refactors inline error classification into classifyLambdaError() per
the existing code comment requesting this when a third case was added.
@c1-dev-bot c1-dev-bot Bot requested a review from a team May 22, 2026 21:50
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 22, 2026

CXH-1553

@c1-dev-bot
Copy link
Copy Markdown
Contributor Author

c1-dev-bot Bot commented May 22, 2026

Checking lint failure - reviewing code for potential issues.

Comment on lines +134 to +145
wantIsGRPC: true,
},
{
name: "timeout via context deadline exceeded in logs",
functionError: "Unhandled",
statusCode: 200,
payload: []byte(`{}`),
rawLog: `{"level":"error","error":"context deadline exceeded","msg":"sync failed"}`,
wantCode: codes.DeadlineExceeded,
wantSubstring: "function timed out",
wantIsGRPC: true,
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Bug: This test case will fail at runtime due to two compounding issues:

  1. The rawLog starts with {, so extractMeaningfulLogLines filters it out as a JSON log line (client.go:189), making filteredLogs empty.
  2. Even if it weren't filtered, the code at client.go:222 checks for escaped quotes (`\"error\":\"context deadline exceeded\"` — literal \ + " in a raw string), but this rawLog contains unescaped JSON quotes.

The function falls through to the generic error path returning "lambda_transport: function returned error: Unhandled; status code: 200", which fails both the substring ("function timed out") and gRPC code (DeadlineExceeded) assertions. The rawLog needs to match the actual production Lambda log format that triggers the escaped-quote check.

Comment thread pkg/lambda/grpc/client.go
return status.Errorf(codes.DeadlineExceeded, "lambda_transport: function timed out: %s; logSummary: %s", functionError, filteredLogs)
}
if strings.Contains(filteredLogs, `\"error\":\"context deadline exceeded\"`) {
return status.Errorf(codes.DeadlineExceeded, "lambda_transport: function timed out: %s; logSummary: %s", functionError, filteredLogs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: This pre-existing check searches filteredLogs for escaped-quote patterns (`\"error\":\"context deadline exceeded\"`), but extractMeaningfulLogLines already strips all JSON lines starting with { (line 189). If in production this pattern only appears in JSON-formatted log lines, this check would be dead code. Consider checking rawLog instead of filteredLogs, and matching the unescaped form "error":"context deadline exceeded".

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

General PR Review: Detect Lambda OOM crashes and return codes.ResourceExhausted

Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review mode: incremental since b32788a4
View review run

Review Summary

The new commits fix the context deadline exceeded check to search rawLog with an unescaped substring match instead of searching filteredLogs with escaped-quote patterns. This directly addresses both prior findings: the correctness issue (test "timeout via context deadline exceeded in logs" would fail because JSON logs were stripped from filteredLogs) and the suggestion (dead-code risk from checking filtered logs). The remaining changes are cosmetic reformatting of long test strings into multi-line concatenation. No new issues found.

Security Issues

None found.

Correctness Issues

None found.

Suggestions

None.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see review comments.

Check rawLog instead of filteredLogs for context deadline exceeded,
since JSON log lines (where this error appears) get filtered out by
extractMeaningfulLogLines. Also simplifies the match to plain string
instead of escaped-quote pattern. Break long test strings to stay
under 200-char line limit.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants