connectorerrors: ProvisionFailureDetail primitive for terminal Grant/Revoke#828
connectorerrors: ProvisionFailureDetail primitive for terminal Grant/Revoke#828arreyder wants to merge 2 commits into
Conversation
Connectors today return plain Go errors for upstream-permanent failures (target user SUSPENDED, group governed by a rule, license cap reached). Those resolve to gRPC code=Unknown by the time a remote caller sees them, and Unknown classifies as retryable — so callers loop on a failure that will never succeed without external state change. This change adds a baton-sdk primitive for that signal: - new proto ProvisionFailureDetail (reason + admin-facing detail) - pkg/connectorerrors.NewTerminalError builds a gRPC status with the right code (PermissionDenied / FailedPrecondition) and attaches the detail via status.WithDetails - pkg/connectorerrors.FailureDetail extracts the detail across fmt.Errorf %w and errors.Join wrap chains, via status.FromError The reason→code table picks codes the existing isClientErrorCode logic already treats as non-retryable, so callers that only look at status.Code classify these errors correctly without any changes; callers that want structured logging or admin-facing detail can read the proto. Refs OPS-1551. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
General PR Review: connectorerrors: ProvisionFailureDetail primitive for terminal Grant/RevokeBlocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0 Review SummaryThe new commits refine documentation across the proto enum comments, generated pb files, and Go source: reason boundaries between REASON_LICENSE_EXHAUSTED (capacity) and REASON_OPERATION_NOT_PERMITTED (policy) are now explicit, and the ResourceProvisionerV2 interface gets a docstring pointing to the terminal-error contract. Two code changes in terminal.go: NewTerminalError now passes the format string verbatim when no args are supplied, fixing a real bug where percent characters in upstream messages would be misinterpreted by fmt.Sprintf; and the WithDetails error branch switches from a silent fallback to a panic, which is appropriate for this genuinely unreachable path. No new issues found. Security IssuesNone found. Correctness IssuesNone found. SuggestionsNone. |
- Trim package doc; drop historical narrative - Add cross-reference on ResourceProvisionerV2 so the recommended interface points at the helper too - Fix IsTerminal docstring (FailureDetail allocates via FromError) - Panic on the unreachable WithDetails error path so a future field change can't silently strip the detail - Skip fmt.Sprintf when len(args)==0 so a literal upstream message containing % isn't reinterpreted - Document the Reason-is-authoritative contract on NewTerminalError / codeForReason; note forward-compat on FailureDetail - Sharpen LICENSE_EXHAUSTED (capacity) vs OPERATION_NOT_PERMITTED (policy) boundary in proto comments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Connectors today return plain Go errors for upstream-permanent failures (target user SUSPENDED, group governed by a rule, license cap reached). Those resolve to gRPC
code=Unknownby the time a remote caller sees them, and Unknown classifies as retryable — so callers loop on a failure that will never succeed without external state change. We hit this in production with baton-okta returning E0000038 / E0000001 errors that caused ~20-minute action delays per grant.This adds the baton-sdk primitive callers and connector authors agreed was the right place for that signal (see internal review thread).
ProvisionFailureDetailwith a reason enum (USER_NOT_PROVISIONABLE,TARGET_MANAGED_EXTERNALLY,LICENSE_EXHAUSTED,VALIDATION_FAILURE,OPERATION_NOT_PERMITTED) and a free-formdetailstring for logs / admin context.pkg/connectorerrors.NewTerminalError(reason, format, args...)builds a gRPC status with the right code (PermissionDeniedforOPERATION_NOT_PERMITTED,FailedPreconditionfor everything else) and attaches the detail viastatus.WithDetails.pkg/connectorerrors.FailureDetail(err)extracts the detail acrossfmt.Errorf%wanderrors.Joinwrap chains (viastatus.FromError, which useserrors.As).ResourceProvisionerpointing connector authors at the helper.The reason→code table picks codes the existing
isClientErrorCodelogic in c1 already treats as non-retryable, so callers that only look atstatus.Code(err)classify these errors correctly without any changes; callers that want structured logging or admin-facing detail can read the proto viaFailureDetail.Transient failures (5xx, network blips, rate limits) should continue to be returned as ordinary errors — those are retryable, and rate limits have their own annotation (
RateLimitDescription).Refs OPS-1551.
Follow-ups (separate PRs)
REASON_USER_NOT_PROVISIONABLE, E0000001 (groupMembers) →REASON_TARGET_MANAGED_EXTERNALLY.FailureDetailinpkg/temporal/activity/connector-actionsfor log enrichment (reason tag interminal_marker_reason).Test plan
make protogenregenerates Go code from new protogo test ./pkg/connectorerrors/...passesmake lintclean