fix: classify Cloudflare AI errors with error codes and retry guidance #63

Open
whoabuddy wants to merge 1 commit into main from fix/classify-ai-errors

Conversation

@whoabuddy
Contributor

Summary

  • Replace generic "Chat completion failed" 500 with classified errors that distinguish timeout, rate limit, model not found, and internal errors
  • Include error_code, retryable, and retry_after_seconds in error responses so agent consumers know whether to retry
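
As a rough illustration of how an agent consumer might act on these fields, here is a minimal sketch. Only error_code, retryable, and retry_after_seconds come from this PR; the ClassifiedErrorBody interface, the nextRetryDelaySeconds helper, and its maxAttempts and defaultDelaySeconds parameters are hypothetical names chosen for the example.

```typescript
// Hypothetical view of the error body; only these three field names
// are taken from the PR description.
interface ClassifiedErrorBody {
  error_code: string;
  retryable: boolean;
  retry_after_seconds?: number;
}

// Returns the number of seconds to wait before the next attempt,
// or null when the caller should stop retrying.
// maxAttempts and defaultDelaySeconds are illustrative defaults.
function nextRetryDelaySeconds(
  body: ClassifiedErrorBody,
  attempt: number,
  maxAttempts = 3,
  defaultDelaySeconds = 5,
): number | null {
  // Permanent failures and exhausted budgets are never retried.
  if (!body.retryable || attempt >= maxAttempts) {
    return null;
  }
  // Prefer the server-provided hint, fall back to a local default.
  return body.retry_after_seconds ?? defaultDelaySeconds;
}
```

With this shape, a consumer treats a missing retry_after_seconds as "retry with a local default" rather than as "do not retry", which matches the optional nature of the field.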

Context

Production log analysis (Feb 28 - Mar 2) identified that all Cloudflare AI failures returned a generic 500 with no error classification, making it impossible for AI agent consumers to distinguish transient vs permanent failures.

Closes #61

Changes

  • chat.ts: Added classifyCloudflareAIError() helper that inspects error message/name to categorize into TIMEOUT (504), RATE_LIMIT (429), MODEL_NOT_FOUND (404), or INTERNAL_ERROR (502). Catch block now uses classifier and passes error_code, retryable, and optional retry_after_seconds via the existing extra parameter on errorResponse().
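
For readers who want the classifier in one piece, the following is a self-contained sketch assembled from the mapping described in this PR. The way errorName and errorMessage are derived from the caught value is an assumption, as is the exact field list of CloudflareAIErrorClassification; the actual code in chat.ts may differ.

```typescript
// Assumed shape; the field names follow the PR description.
interface CloudflareAIErrorClassification {
  message: string;
  status: number;
  error_code: "TIMEOUT" | "RATE_LIMIT" | "MODEL_NOT_FOUND" | "INTERNAL_ERROR";
  retryable: boolean;
  retry_after_seconds?: number;
}

function classifyCloudflareAIError(error: unknown): CloudflareAIErrorClassification {
  // Assumption: name and message are read from Error instances,
  // with a string fallback for anything else that was thrown.
  const errorName = error instanceof Error ? error.name : "";
  const errorMessage = error instanceof Error ? error.message : String(error);

  // Timeout: AbortError, explicit message, or error code 3046
  if (
    errorName === "AbortError" ||
    errorMessage.includes("Request timed out") ||
    errorMessage.includes("3046")
  ) {
    return { message: "Request timed out", status: 504, error_code: "TIMEOUT", retryable: true, retry_after_seconds: 30 };
  }

  // Rate limit: explicit message or 429 in message
  if (errorMessage.includes("Rate limit exceeded") || errorMessage.includes("429")) {
    return { message: "Rate limit exceeded", status: 429, error_code: "RATE_LIMIT", retryable: true, retry_after_seconds: 60 };
  }

  // Model not found: explicit message or 404 in message
  if (errorMessage.includes("Model not found") || errorMessage.includes("404")) {
    return { message: "Model not found", status: 404, error_code: "MODEL_NOT_FOUND", retryable: false };
  }

  // Default: any other upstream failure
  return { message: "Chat completion failed", status: 502, error_code: "INTERNAL_ERROR", retryable: false };
}
```

The order of the checks matters: a message that mentions both a timeout and a status number is classified as TIMEOUT because that branch runs first.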

Test plan

  • Verify TypeScript compiles cleanly (npm run check)
  • Trigger timeout error: confirm 504 with error_code: "TIMEOUT", retryable: true, retry_after_seconds: 30
  • Trigger rate limit: confirm 429 with error_code: "RATE_LIMIT", retryable: true, retry_after_seconds: 60
  • Trigger unknown error: confirm 502 with error_code: "INTERNAL_ERROR", retryable: false

🤖 Generated with Claude Code

…try guidance

Replace generic "Chat completion failed" 500 error with classified error
responses that give AI agent consumers actionable information:

- TIMEOUT (504): AbortError, "Request timed out", or error code 3046
  retryable: true, retry_after_seconds: 30
- RATE_LIMIT (429): "Rate limit exceeded" or 429 in message
  retryable: true, retry_after_seconds: 60
- MODEL_NOT_FOUND (404): "Model not found" or 404 in message
  retryable: false
- INTERNAL_ERROR (502): all other upstream failures
  retryable: false

Each error response now includes error_code and retryable fields (and
optionally retry_after_seconds) via the existing errorResponse() extra
parameter. Log entries also include error_code and status for observability.

Resolves: #61

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 3, 2026 00:05
@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

Status: ✅ Deployment successful!
Name: x402-api-staging
Latest commit: 1b3edb4
Updated (UTC): Mar 03 2026, 12:05 AM

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

Status: ✅ Deployment successful!
Name: x402-api-production
Latest commit: 1b3edb4
Updated (UTC): Mar 03 2026, 12:05 AM

Copilot AI left a comment

Pull request overview

This PR addresses a production issue (issue #61) where all Cloudflare AI chat errors returned a generic 500 "Chat completion failed", making it impossible for AI agent consumers to distinguish transient from permanent failures. It adds a classifyCloudflareAIError() helper that maps error messages/names to four categories (timeout, rate limit, model not found, internal error) with corresponding HTTP status codes, error codes, and retry guidance. The classified error fields are included in both the structured log output and the JSON error response.

Changes:

  • Added CloudflareAIErrorClassification interface and classifyCloudflareAIError() function that uses message/name inspection to categorize errors
  • Catch block now logs the classified error code and status, and returns a classified error response with error_code, retryable, and optional retry_after_seconds fields
  • OpenAPI schema updated to document the four new error responses (404, 429, 502, 504)

Comment on lines +28 to +61
if (
  errorName === "AbortError" ||
  errorMessage.includes("Request timed out") ||
  errorMessage.includes("3046")
) {
  return {
    message: "Request timed out",
    status: 504,
    error_code: "TIMEOUT",
    retryable: true,
    retry_after_seconds: 30,
  };
}

// Rate limit: explicit message or 429 code in message
if (errorMessage.includes("Rate limit exceeded") || errorMessage.includes("429")) {
  return {
    message: "Rate limit exceeded",
    status: 429,
    error_code: "RATE_LIMIT",
    retryable: true,
    retry_after_seconds: 60,
  };
}

// Model not found: explicit message or 404 code in message
if (errorMessage.includes("Model not found") || errorMessage.includes("404")) {
  return {
    message: "Model not found",
    status: 404,
    error_code: "MODEL_NOT_FOUND",
    retryable: false,
  };
}

Copilot AI Mar 3, 2026

All string matching in classifyCloudflareAIError is case-sensitive (e.g., errorMessage.includes("Rate limit exceeded"), errorMessage.includes("Model not found"), errorMessage.includes("Request timed out")). Cloudflare AI could return the same messages with different casing (e.g., "rate limit exceeded", "model not found", "request timed out"). This would fall through to the INTERNAL_ERROR default, resulting in incorrect classification. Using .toLowerCase() on errorMessage before the comparisons (or toUpperCase()), or using case-insensitive .includes() alternatives, would make this more robust.
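
One way to apply this suggestion is to normalize both sides before comparing. The sketch below is illustrative only; messageIncludesAny is a hypothetical helper name and is not part of the PR.

```typescript
// Hypothetical helper: returns true when `message` contains any of the
// given phrases, ignoring letter case on both sides.
function messageIncludesAny(message: string, phrases: string[]): boolean {
  const normalized = message.toLowerCase();
  return phrases.some((phrase) => normalized.includes(phrase.toLowerCase()));
}
```

A call such as messageIncludesAny(errorMessage, ["Rate limit exceeded"]) would then accept "rate limit exceeded" and "RATE LIMIT EXCEEDED" alike, instead of falling through to the INTERNAL_ERROR default.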

Copilot uses AI. Check for mistakes.
if (classified.retry_after_seconds !== undefined) {
  extra.retry_after_seconds = classified.retry_after_seconds;
}

Copilot AI Mar 3, 2026

The new Cloudflare AI error handler puts retry_after_seconds in the JSON body, but does not set the standard HTTP Retry-After response header. The OpenRouter chat handler already establishes this pattern at src/endpoints/inference/openrouter/chat.ts:242: c.header("Retry-After", "5"). HTTP clients and proxies use the Retry-After header to decide when to retry — omitting it means consumers relying on the header (instead of parsing the JSON body) will not get the correct backoff. The Retry-After header should be set when classified.retryable is true, using the value from classified.retry_after_seconds.

Suggested change
if (classified.retryable && classified.retry_after_seconds !== undefined) {
  c.header("Retry-After", String(classified.retry_after_seconds));
}

Comment on lines +43 to +61
if (errorMessage.includes("Rate limit exceeded") || errorMessage.includes("429")) {
  return {
    message: "Rate limit exceeded",
    status: 429,
    error_code: "RATE_LIMIT",
    retryable: true,
    retry_after_seconds: 60,
  };
}

// Model not found: explicit message or 404 code in message
if (errorMessage.includes("Model not found") || errorMessage.includes("404")) {
  return {
    message: "Model not found",
    status: 404,
    error_code: "MODEL_NOT_FOUND",
    retryable: false,
  };
}

Copilot AI Mar 3, 2026

Matching bare numeric strings "429" and "404" anywhere in the error message is brittle and can cause false positives. For example, an error message containing a model path, URL, or response body that happens to include the substring "404" or "429" (e.g., a route path like /v1/chat/404-handler, a model with a version number, or a debug string quoting the upstream response) would be misclassified. Consider requiring the digit strings to appear as standalone tokens, at least surrounded by non-digit characters (e.g., using a regex like /\b429\b/ or /status:?\s*429/i), or checking for more specific Cloudflare AI error message patterns instead.
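
The two regex options mentioned above behave differently, which the following sketch tries to make concrete. Both helper names are hypothetical. Note that a plain word boundary only rules out digits embedded in a longer token (such as "4290" or "404ms"); a path like /v1/chat/404-handler still matches, so the status-prefixed pattern is the stricter choice.

```typescript
// Hypothetical helper: matches the code only as a standalone token,
// i.e. not directly adjacent to other letters, digits, or underscores.
function hasStandaloneCode(message: string, code: number): boolean {
  return new RegExp(`\\b${code}\\b`).test(message);
}

// Hypothetical helper: matches the code only when it follows the word
// "status" (optionally with a colon and whitespace), ignoring case.
function hasStatusPrefixedCode(message: string, code: number): boolean {
  return new RegExp(`status:?\\s*${code}\\b`, "i").test(message);
}
```

Which variant is appropriate depends on what Cloudflare AI error messages actually look like in production logs, which this sketch does not assume.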

Comment on lines +36 to +69
    error_code: "TIMEOUT",
    retryable: true,
    retry_after_seconds: 30,
  };
}

// Rate limit: explicit message or 429 code in message
if (errorMessage.includes("Rate limit exceeded") || errorMessage.includes("429")) {
  return {
    message: "Rate limit exceeded",
    status: 429,
    error_code: "RATE_LIMIT",
    retryable: true,
    retry_after_seconds: 60,
  };
}

// Model not found: explicit message or 404 code in message
if (errorMessage.includes("Model not found") || errorMessage.includes("404")) {
  return {
    message: "Model not found",
    status: 404,
    error_code: "MODEL_NOT_FOUND",
    retryable: false,
  };
}

// Default: internal error from upstream Cloudflare AI
return {
  message: "Chat completion failed",
  status: 502,
  error_code: "INTERNAL_ERROR",
  retryable: false,
};

Copilot AI Mar 3, 2026

There is a significant discrepancy between the error codes defined in the code and those specified in the linked issue #61. Issue #61 explicitly specifies prefixed codes: AI_TIMEOUT, AI_RATE_LIMITED, AI_MODEL_NOT_FOUND, and AI_INTERNAL_ERROR. The implementation uses unprefixed codes: TIMEOUT, RATE_LIMIT, MODEL_NOT_FOUND, and INTERNAL_ERROR. If AI agent consumers have already been built or documented against the issue spec (which is the source of truth for acceptance criteria), this is a breaking contract mismatch. If the unprefixed names are intentional, the issue spec and PR description should be reconciled.

Comment on lines +64 to +69
return {
  message: "Chat completion failed",
  status: 502,
  error_code: "INTERNAL_ERROR",
  retryable: false,
};

Copilot AI Mar 3, 2026

Issue #61 specifies that AI_INTERNAL_ERROR should be retryable: true (internal errors from an upstream service are generally transient and should be retried). The implementation sets retryable: false for INTERNAL_ERROR. This conflicts with the acceptance criteria in the issue. Additionally, the issue specifies HTTP 500 for unknown/internal errors, but the implementation uses 502 — while 502 ("Bad Gateway") is arguably a better semantic choice for upstream AI failures, it deviates from the issue spec without acknowledgement. Please reconcile these with the issue requirements or update the issue spec intentionally.


@arc0btc arc0btc left a comment

APPROVED

Good fix — the production log analysis driving this is exactly the right way to justify error classification work. Replacing a generic 500 with semantically correct 404/429/502/504 directly helps agent consumers (including Arc) implement intelligent retry logic.

What's solid:

  • classifyCloudflareAIError() interface is clean and well-typed via CloudflareAIErrorClassification
  • TIMEOUT classification is specific and reliable: AbortError name + "Request timed out" message + Cloudflare error code 3046 — three distinct signals that reduce false positives
  • Conditional retry_after_seconds (only included when defined) is the right pattern — avoids polluting responses where the field is irrelevant
  • OpenAPI schema update correctly documents all four new status codes with their error_code and retryable values
  • Additive response changes: error_code, retryable, retry_after_seconds don't break existing consumers reading error/message

Two things to watch:

  1. Broad string matching on "429" and "404": errorMessage.includes("429") and errorMessage.includes("404") will match any error message containing those digit sequences (e.g., "timeout after 404ms", "found 4290 items"). The TIMEOUT classification avoids this by using specific error names/codes. Worth a follow-up to tighten these to a /\b429\b/ regex or to check for Cloudflare-specific patterns like "status: 429".

  2. Hardcoded retry windows: retry_after_seconds: 60 for RATE_LIMIT is a reasonable default, but if Cloudflare's error includes an actual retry window (e.g., in the error message or a propagated header), that information is currently lost. Potential enhancement: parse the actual retry window from the error object if Cloudflare surfaces it.
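
To make the second point concrete, here is a sketch of what parsing a retry window could look like. It assumes, purely for illustration, that the upstream message contains text such as "retry after 12 seconds" or "Retry-After: 7"; nothing in this PR confirms that Cloudflare AI surfaces such a value, and retryAfterFromMessage is a hypothetical helper.

```typescript
// Hypothetical helper: extracts a retry window in seconds from an error
// message, falling back to the hardcoded default when none is found.
// The message format matched here is an assumption, not a documented one.
function retryAfterFromMessage(message: string, fallbackSeconds: number): number {
  const match = /retry[\s-]*after[:\s]*(\d+)/i.exec(message);
  if (!match) {
    return fallbackSeconds;
  }
  const seconds = Number.parseInt(match[1] ?? "", 10);
  // Ignore zero or unparsable values and keep the default instead.
  return Number.isFinite(seconds) && seconds > 0 ? seconds : fallbackSeconds;
}
```

Keeping the hardcoded value as the fallback means the current behavior is preserved whenever the message carries no usable hint.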

Neither blocks this merge — both are follow-up improvements. The fix correctly addresses the immediate production issue (opaque 500s → classified errors with retry guidance). Staging deployment already confirmed ✅.

Development

Successfully merging this pull request may close these issues.

Classify Cloudflare AI errors with error codes and retry guidance

3 participants