Skip to content

connectorerrors: ProvisionFailureDetail primitive for terminal Grant/Revoke#828

Open
arreyder wants to merge 2 commits into
mainfrom
chris/ops-1551-provision-failure-detail
Open

connectorerrors: ProvisionFailureDetail primitive for terminal Grant/Revoke#828
arreyder wants to merge 2 commits into
mainfrom
chris/ops-1551-provision-failure-detail

Conversation

@arreyder
Copy link
Copy Markdown
Contributor

@arreyder arreyder commented May 14, 2026

Summary

Connectors today return plain Go errors for upstream-permanent failures (target user SUSPENDED, group governed by a rule, license cap reached). Those resolve to gRPC code=Unknown by the time a remote caller sees them, and Unknown classifies as retryable — so callers loop on a failure that will never succeed without external state change. We hit this in production with baton-okta returning E0000038 / E0000001 errors that caused ~20-minute action delays per grant.

This adds the baton-sdk primitive callers and connector authors agreed was the right place for that signal (see internal review thread).

  • New proto ProvisionFailureDetail with a reason enum (USER_NOT_PROVISIONABLE, TARGET_MANAGED_EXTERNALLY, LICENSE_EXHAUSTED, VALIDATION_FAILURE, OPERATION_NOT_PERMITTED) and a free-form detail string for logs / admin context.
  • pkg/connectorerrors.NewTerminalError(reason, format, args...) builds a gRPC status with the right code (PermissionDenied for OPERATION_NOT_PERMITTED, FailedPrecondition for everything else) and attaches the detail via status.WithDetails.
  • pkg/connectorerrors.FailureDetail(err) extracts the detail across fmt.Errorf %w and errors.Join wrap chains (via status.FromError, which uses errors.As).
  • Doc on ResourceProvisioner pointing connector authors at the helper.

The reason→code table picks codes the existing isClientErrorCode logic in c1 already treats as non-retryable, so callers that only look at status.Code(err) classify these errors correctly without any changes; callers that want structured logging or admin-facing detail can read the proto via FailureDetail.

Transient failures (5xx, network blips, rate limits) should continue to be returned as ordinary errors — those are retryable, and rate limits have their own annotation (RateLimitDescription).

Refs OPS-1551.

Follow-ups (separate PRs)

  • baton-okta: map E0000038 → REASON_USER_NOT_PROVISIONABLE, E0000001 (groupMembers) → REASON_TARGET_MANAGED_EXTERNALLY.
  • c1: read FailureDetail in pkg/temporal/activity/connector-actions for log enrichment (reason tag in terminal_marker_reason).

Test plan

  • make protogen regenerates Go code from new proto
  • go test ./pkg/connectorerrors/... passes
  • make lint clean
  • Reviewer to confirm reason taxonomy covers the upstream signals we know about

Connectors today return plain Go errors for upstream-permanent failures
(target user SUSPENDED, group governed by a rule, license cap reached).
Those resolve to gRPC code=Unknown by the time a remote caller sees them,
and Unknown classifies as retryable — so callers loop on a failure that
will never succeed without external state change.

This change adds a baton-sdk primitive for that signal:

- new proto ProvisionFailureDetail (reason + admin-facing detail)
- pkg/connectorerrors.NewTerminalError builds a gRPC status with the
  right code (PermissionDenied / FailedPrecondition) and attaches the
  detail via status.WithDetails
- pkg/connectorerrors.FailureDetail extracts the detail across
  fmt.Errorf %w and errors.Join wrap chains, via status.FromError

The reason→code table picks codes the existing isClientErrorCode logic
already treats as non-retryable, so callers that only look at status.Code
classify these errors correctly without any changes; callers that want
structured logging or admin-facing detail can read the proto.

Refs OPS-1551.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arreyder arreyder requested a review from a team May 14, 2026 22:31
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 14, 2026

OPS-1551

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 14, 2026

General PR Review: connectorerrors: ProvisionFailureDetail primitive for terminal Grant/Revoke

Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review mode: incremental since cf1cbdd
View review run

Review Summary

The new commits refine documentation across the proto enum comments, generated pb files, and Go source: reason boundaries between REASON_LICENSE_EXHAUSTED (capacity) and REASON_OPERATION_NOT_PERMITTED (policy) are now explicit, and the ResourceProvisionerV2 interface gets a docstring pointing to the terminal-error contract. Two code changes in terminal.go: NewTerminalError now passes the format string verbatim when no args are supplied, fixing a real bug where percent characters in upstream messages would be misinterpreted by fmt.Sprintf; and the WithDetails error branch switches from a silent fallback to a panic, which is appropriate for this genuinely unreachable path. No new issues found.

Security Issues

None found.

Correctness Issues

None found.

Suggestions

None.

Copy link
Copy Markdown
Contributor

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

- Trim package doc; drop historical narrative
- Add cross-reference on ResourceProvisionerV2 so the recommended
  interface points at the helper too
- Fix IsTerminal docstring (FailureDetail allocates via FromError)
- Panic on the unreachable WithDetails error path so a future field
  change can't silently strip the detail
- Skip fmt.Sprintf when len(args)==0 so a literal upstream message
  containing % isn't reinterpreted
- Document the Reason-is-authoritative contract on NewTerminalError /
  codeForReason; note forward-compat on FailureDetail
- Sharpen LICENSE_EXHAUSTED (capacity) vs OPERATION_NOT_PERMITTED
  (policy) boundary in proto comments

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant