Skip to content

fix: add adaptive retry for AWS SSO Admin API rate limits#110

Open
jacobaguon-blip wants to merge 1 commit into
mainfrom
fix/aws-sso-admin-rate-limit-retry
Open

fix: add adaptive retry for AWS SSO Admin API rate limits#110
jacobaguon-blip wants to merge 1 commit into
mainfrom
fix/aws-sso-admin-rate-limit-retry

Conversation

@jacobaguon-blip
Copy link
Copy Markdown

Summary

  • Adds RetryMaxAttempts: 10 and RetryMode: aws.RetryModeAdaptive to all three config return paths in getCallingConfig()
  • Fixes rate limit failures (429 ThrottlingException) on ListAccountAssignments for large-scale AWS organizations
  • The adaptive retry mode uses a token-bucket rate limiter for automatic client-side throttling, preventing cascading 429s

Context

Qualtrics (Pylon #9651) has 306 AWS accounts × 195 permission sets = ~59,670 ListAccountAssignments API calls per sync. The AWS SSO Admin API rate limit (~10 TPS) combined with the default SDK retry budget (3 attempts, standard backoff) causes ~15% of syncs to fail with ThrottlingException: Rate exceeded.

The fix increases the retry budget to 10 attempts and switches to adaptive mode, which adds client-side rate limiting via a token bucket. This is the AWS SDK's recommended approach for high-volume API consumers.

Test plan

  • Squire validation: all existing tests pass with the change
  • go test ./pkg/connector/... passes locally
  • Monitor Qualtrics sync success rate after deploy (expect improvement from ~85% to ~99%)
  • Verify no regression for smaller AWS orgs (adaptive mode is a no-op when not throttled)

🤖 Generated with Claude Code

Customers with large AWS organizations (300+ accounts, 190+ permission
sets) generate ~60k ListAccountAssignments API calls per sync, which
exceeds the AWS SSO Admin API rate limit with the default 3-retry budget.

This adds RetryMaxAttempts=10 with RetryModeAdaptive to all three config
return paths in getCallingConfig(), enabling the AWS SDK's token-bucket
rate limiter for automatic client-side throttling.

Fixes: Pylon #9651 (Qualtrics)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jacobaguon-blip jacobaguon-blip requested a review from a team April 7, 2026 20:11
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 7, 2026

Claude finished @jacobaguon-blip's task in 56s —— View job


PR Review: Adaptive Retry for AWS SSO Admin API Rate Limits

  • Gather context and read changed files
  • Analyze changes against origin/main
  • Post review feedback

Overall: Looks good. The change is well-scoped and correctly applies retry config to all three return paths in getCallingConfig(). A few observations:

Positive

  • All paths covered: Retry config is added to the non-assume-role path (line 111-112), single-hop path (line 141-142), and two-hop path (line 179-180).
  • No mutation of shared state: The non-assume-role path correctly uses o.baseConfig.Copy() (line 110) before modifying, avoiding side effects on the shared baseConfig. The other two paths create fresh awsSdk.Config{} structs, so no issue there.
  • Adaptive mode is appropriate: RetryModeAdaptive with its token-bucket rate limiter is the AWS-recommended approach for high-volume API consumers, and is a no-op when not being throttled.

Minor Observations

  1. Magic number 10 repeated 3 times — Consider extracting to a package-level constant for clarity and single-point-of-change:

    const ssoAdminRetryMaxAttempts = 10

    This is a nit, not a blocker. Fix this →

  2. Intermediate STS clients don't get retry config — The STS clients created at lines 124 and 148 use o.baseConfig directly (without adaptive retry). This means the AssumeRole calls themselves use the SDK default retry behavior. This is likely fine since STS calls are low-volume compared to SSO Admin API calls, but worth being aware of if STS throttling ever becomes an issue.

  3. Scope is broader than SSO Admin — The retry config is applied to getCallingConfig() which returns configs used by all AWS service clients (IAM, Orgs, SSO Admin, Identity Store, CloudTrail), not just SSO Admin. This is probably beneficial (all services get better retry behavior), but the PR title/description focuses only on SSO Admin. Consider updating the description to note this applies connector-wide.

Verdict

The change is correct, minimal, and well-motivated by the production data (306 accounts × 195 permission sets). No blockers — approve with optional nits above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant