feat: Allow aws-vault to safely be run in parallel #291
timvisher-dd wants to merge 5 commits into ByteNess:main
Conversation
(force-pushed b4be5e4 to 96a6c03)
Heads up: I've seen some issues since running with this where, in rare cases, aws-vault will error with something along the lines of "can't update keychain item, it already exists". I'm going to drop this back to draft and let it bake a bit more.
Hey @timvisher-dd, no worries. Let me know when it's ready for a review.
(force-pushed 96a6c03 to 72c41a4)
I haven't been able to reproduce any issues yet. I think it may have been a failure in my small build script that creates a throwaway keychain to start from scratch.
No worries, we can mark this as Draft while you investigate, if you want to look more into it. |
Ah hah! Found it. It's a legit bug where if more than one aws-vault instance is trying to refresh creds for a single profile, they fight over setting the keychain item. I can figure that out. Dropping this back to Draft.
(force-pushed 18e63b7 to be4e082)
I think I've found another area where parallelism needs to be coordinated: the keychain Get/Set paths. I've never noticed this before because I use the …
Could you talk more about your use case here and the problem you're trying to solve? Am I reading this correctly that you're getting locking issues on the keychain backend, or something else?
Sure. I run …

When you run it that way without this changeset, several pathological things happen: …

I had "solved" all of this before by wrapping aws-vault with a bash script that made it so that only one … Does that make sense?
(force-pushed be4e082 to 6eee885)
Thanks for expanding on your use case. I'm sure I'm still missing some context, but wouldn't you be able to inject credentials once and run Terraform in that session? You can run multiple TF runs in there if needed. What's the reason you're fetching credentials for every Terraform run (usually you'd do role assumption from the provider)? I wonder if you might be holding it wrong here 🤔 Also, if we were to merge this, I'd suggest putting it behind a flag to avoid breaking workflows for existing users.
Perhaps but I don't think so unless I'm missing something about what you're saying. :)
Creds/Sessions are only good for a single account/role. I have access to, and make regular use of, hundreds of accounts across dozens of SSO directories. Furthermore, a single terraform run can make use of any number of AWS accounts via profiles that allow it to retrieve creds via a … Even once I have valid creds …
Most of my root modules are designed to be operated by heterogeneous groups and so don't directly bake in a particular profile/role. Instead, operators and CI are expected to inject creds into the environment via … In some cases, though, what you're describing is exactly what's happening anyway. I have something like … which uses … We have also at times used …, although TBH I find this to be somewhat pathological, even if it's a decent hack to solve the problem of heterogeneous operators of multi-provider root modules. Does that make sense? I'm definitely open to not being aware of 'the right way' to do this, but I'm not aware of any other options. Beyond whether ↑ is right or wrong, though, I'd argue that …
I'm open to adding a flag but to repeat myself here:
Any existing user who has tried to run … Any workflows that this could break (because it still operates absolutely fine, with identical performance, when running just one) would strictly be spacebar workflows, IMO.
Appreciate the additional examples and the context on how you use it; I think I have a better understanding of your use case now 😄. Also, I didn't mean to imply there is a "right or wrong", but perhaps more that there's a nicer ergonomic in using Terraform-native role assumption rather than injecting credentials for every account/ENV you need to access, e.g.: https://developer.hashicorp.com/terraform/tutorials/aws/aws-assumerole

I realise that's not always possible or desired based on your architecture, but it might get you around the current limitations of …

Coming back to those, I'd like to reiterate that I'm not against working to improve that behaviour, and adding support for it could be beneficial, but what I was thinking is more of having a flag to enable the new parallel-safe behaviour and not using it as the default, at least yet. Does that make sense? Am I right to assume you're mostly trying to address 3-5 here, as the others might be AWS limitations or a matter of how we throttle requests?

Thanks again; this is great context for trying to figure this one out.
I wasn't trying to imply that you were being belligerent or anything! Glad we're both doing our best to be constructive here. :) To your point above: yes, I'm definitely aware of that pattern. It doesn't work for us for a number of reasons, at least in part because it spreads account IDs around in a lot of duplicate locations, and is also subject to the issue of heterogeneous operators who need to run the same root modules with their own roles. Again, you can use terraform variables for that, but I think that's even stranger than just using …
Totally makes sense. If that's a hard requirement for getting this sort of capability merged to …
Actually, in my case I'm almost entirely focused on 1 and 2. In general we use our …

But what's really broken is 1 and 2. 1 literally can't authenticate in many cases because the AWS API returns 5xx responses when I open even 2 SSO flows in parallel. And 2 is more of an inconvenience, in that if you run it repeatedly you do eventually succeed, but why not just respect the Retry-After headers? 3-5 are still worth fixing, IMO, and the behavior is definitely broken in parallel, but it doesn't affect me at all because my …
👍
Co-authored-by: Codex <codex@openai.com>

# Conflicts:
#	vault/ssorolecredentialsprovider.go
Just in case this makes it more palatable: this branch seems to work as expected. https://github.com/ByteNess/aws-vault/compare/main...timvisher-dd:aws-vault:sso-browser-lock-behind-option?expand=1 You get the old behavior by default, but you pass …
Thanks, I'd be happy to review your changes behind this flag 👍 And it should mostly address your issues from the list, 1-2, right?
Yes, this still fully addresses my needs (1-5) with a bit of added config on my end (totally acceptable). :)
(force-pushed 6eee885 to a5670db)
OK, switched to the changeset with the options. :)
Thanks, I'll have a look
Apologies about the delay here. I wanted to let you know this is still on my radar and I'll try to get back to it shortly!
I'm running it locally so getting plenty of verification that it works as expected at least in my config. :) |
mbevc1 left a comment:

Hi, apologies again for the late review. I've finally had some time to look at this and have a few comments on the PR. The major things I'd suggest looking at are: locking per SSO URL, using defer when locking, and not masking errors on unlock. Otherwise it looks good; thanks again for submitting a PR!
To configure the default flag values of `aws-vault` and its subcommands:
* `AWS_VAULT_BACKEND`: Secret backend to use (see the flag `--backend`)
* `AWS_VAULT_BIOMETRICS`: Use biometric authentication using TouchID, if supported (see the flag `--biometrics`)
* `AWS_VAULT_PARALLEL_SAFE`: Enable cross-process locking for keychain and cached credentials (see the flag `--parallel-safe`)
Could you please expand the documentation of this feature? There's no explanation of when to use --parallel-safe, what it does, what the trade-offs are (serialized keychain ops), or which backends it applies to. Given the flag is opt-in, users who need it won't know what it does. Could you please add a short section on it?
if max < min {
	max = min
}
r := rand.New(rand.NewSource(time.Now().UnixNano()))
Minor suggestion here: this creates a new rand.Source seeded with the current nanosecond on every retry. In practice this is fine since retries are seconds apart, but it's wasteful, and a better pattern is a package-level rand.Rand initialized once. In Go 1.20+ the global rand is automatically seeded, so this could just be rand.Float64().
defaultSSOLockLogEvery  = 15 * time.Second
defaultSSOLockWarnAfter = 5 * time.Second
// 0 means retry indefinitely (caller is expected to use context cancellation).
ssoMaxAttempts = 0
The context passed down from Terraform's credential_process may not have a deadline. If AWS keeps returning 429s indefinitely, this loops forever. Let's either set a reasonable default max (e.g. 10) or document the expectation that callers must pass a timeout context. This could be the case for GetRoleCredentials(), where the context is passed down to the function.
Thanks for flagging this. I agree that unbounded retries need a safety net.
I went a slightly different direction than a hard attempt cap though. A fixed max (e.g. 10) doesn't map well to variable Retry-After durations — 10 attempts with 30s retry-afters is very different from 10 attempts with 1s retry-afters. Instead, I'm planning to derive the timeout from deviceCreds.ExpiresIn (returned by StartDeviceAuthorization), which is the natural bound for how long any of this should take. Once the device code expires, retrying is pointless anyway.
Re: Terraform's credential_process — worth noting that Terraform doesn't call us via Go code. It subshells to credential_process (which can be any binary), and hardcodes a 60s timeout on that subprocess. So the context concern doesn't apply in the way you described, but the 60s wall clock is a real constraint. That said, the 429 retries here are on GetRoleCredentials, which happens after the browser flow completes, so in practice these retries are short-lived and well within Terraform's timeout. Am I misunderstanding anything about what you were saying here?
I'll also add periodic retry logging with stats (429 count, max Retry-After seen) so users have visibility into what's happening.
Correction on my earlier reply — I said:

> Instead, I'm planning to derive the timeout from `deviceCreds.ExpiresIn` (returned by `StartDeviceAuthorization`), which is the natural bound for how long any of this should take.
After implementing this I realized `deviceCreds.ExpiresIn` doesn't apply here. There are two separate retry loops:

1. The OIDC device polling loop (`CreateToken` in `newOIDCToken`) — this is where `deviceCreds.ExpiresIn` matters, but it's already naturally bounded by AWS: when the device code expires, AWS returns an error that terminates the loop. No change needed.
2. The `GetRoleCredentials` retry loop — this is the one you flagged with `ssoMaxAttempts = 0`. It runs after the OIDC token is already acquired (the device code is consumed at that point). The OIDC access token is valid for ~8 hours, so `deviceCreds.ExpiresIn` would be the wrong bound here.
What I implemented instead: a 5-minute wall-clock deadline on the GetRoleCredentials retry loop. If we're still getting 429s after 5 minutes, we give up with a descriptive error including retry stats (429 count, max Retry-After seen). The sleep duration is also capped to the remaining time before the deadline so it can't overshoot.
This is a more conservative approach than your suggestion of a hard attempt cap (e.g. 10), but achieves the same goal — preventing infinite loops while being tolerant of variable Retry-After durations.
return nil, false, ctx.Err()
}

locked, err := p.ssoTokenLock.TryLock()
The manual unlock-on-every-error-path pattern is fragile. Every new code path added in the future must remember to call Unlock(). A panic (e.g. from a nil pointer in newOIDCTokenFn) will leave the lock file held until process death.
The standard Go idiom is defer:
locked, err := p.ssoTokenLock.TryLock()
if locked {
	defer p.ssoTokenLock.Unlock()
	// ... rest of logic
}
Agreed that defer is the right approach here. One nuance though — the straightforward pattern you suggested:
locked, err := p.ssoTokenLock.TryLock()
if locked {
defer p.ssoTokenLock.Unlock()
// ... rest of logic
}

…doesn't quite work, because TryLock is called inside a for loop. defer runs at function return, not at the end of a loop iteration, so the lock would be held across all subsequent iterations (including the sleep-and-retry path where we explicitly don't want to hold it).
The plan is to extract the locked body into a helper function so defer runs at the right scope — when the helper returns, which is the same point the manual Unlock() currently fires. Lock scope stays identical, but we get panic safety and the fragility concern you raised goes away.
I'll also use errors.Join in the deferred unlock to properly surface both the operation error and the unlock error if both fail (addresses your other comment about swallowed errors).
Does that match up with your understanding and still address the spirit of your concerns?
Sure, as long as we're using the defer pattern globally to make sure we always unlock.
return nil, ctx.Err()
}

locked, err := p.sessionLock.TryLock()
Same here - the manual unlock-on-every-error-path pattern is fragile. Every new code path added in the future must remember to call Unlock(). A panic (e.g. from a nil pointer in newOIDCTokenFn) will leave the lock file held until process death.
The standard Go idiom is defer:
locked, err := p.ssoTokenLock.TryLock()
if locked {
	defer p.ssoTokenLock.Unlock()
	// ... rest of logic
}

if err == nil && cached {
	unlockErr := p.sessionLock.Unlock()
	if unlockErr != nil {
		return nil, unlockErr
If both the operation and the unlock fail, the unlock error is returned and the original (more useful) error is silently dropped. The original error should be wrapped or joined: fmt.Errorf("unlock: %w (original: %v)", unlockErr, err). This could be a debugging nightmare in production.
The withLock in locked_keyring.go has the same pattern but correctly returns unlockErr only after function succeeds, so it's fine there.
if err != nil || token != nil {
	unlockErr := p.ssoTokenLock.Unlock()
	if unlockErr != nil {
		return nil, false, unlockErr
If both the operation and the unlock fail, the unlock error is returned and the original (more useful) error is silently dropped. The original error should be wrapped or joined: fmt.Errorf("unlock: %w (original: %v)", unlockErr, err). This could be a debugging nightmare in production.
The withLock in locked_keyring.go has the same pattern but correctly returns unlockErr only after function succeeds, so it's fine there.
},
)

ctx := context.Background()
Just a note here of a limitation; the keyring.Keyring interface doesn't pass a context (it's not context-aware), so this is a constraint of the interface. But the consequence is that if Terraform sends SIGKILL after 60s, the process dies, but any process waiting to acquire the keychain lock with context.Background() will wait indefinitely.
This means keychain waiters cannot be cancelled by context deadline or cancellation. Combined with the internal sync.Mutex (k.mu.Lock() is also uncancellable), this is a latent hang risk. Not easy to fix given the interface, but perhaps we could at least document it. What do you think?
@@ -0,0 +1,17 @@
package vault

const defaultSSOLockFilename = "aws-vault.sso.lock"
This always creates aws-vault.sso.lock, a single shared lock regardless of StartURL. You mentioned running 646 profiles across 9 SSO directories. With a global lock, all concurrent OIDC flows queue behind a single mutex even when they're for completely independent SSO endpoints. This turns what could be 9 parallel flows (one per directory) into full serialization.
Suggested improvement: key the lock by StartURL, the same way the session and keychain locks use hashedLockFilename. Perhaps change NewDefaultSSOTokenLock() to NewDefaultSSOTokenLock(startURL string) and feed p.StartURL into it via ensureSSODependencies.
Great suggestion. I agree — keying the SSO lock by StartURL is the right move.
To clarify the current behavior: the global lock only serializes the initial OIDC token acquisition (the browser flow + cache write). Once the token is cached, all processes for that StartURL find it in the cache and proceed immediately — the per-role GetRoleCredentials calls already run fully in parallel. So even with the global lock, the serialization window is just the browser auth, not all credential retrieval.
That said, with 9 SSO directories the global lock forces those 9 browser flows to run one after another when they could safely run in parallel. I ran a spike test to verify this: I fired 8 profiles simultaneously (one per distinct StartURL) with no SSO lock at all. All 8 completed successfully with zero 5xx or rate-limit errors from AWS. So parallel browser flows across different directories are safe.
I'll change NewDefaultSSOTokenLock() to accept startURL and hash it for the lock filename, the same pattern the session cache and keychain locks already use. This means independent SSO directories will authenticate in parallel while profiles sharing the same StartURL still serialize to a single browser tab. Should be a nice performance win for multi-directory setups at the cost of a slightly more complex lock key.
LMK if that doesn't sound right to you in any way. :)
Should be okay, as long as we're grouping by SSO directory URL and getting some concurrency benefits. Thanks!
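The per-StartURL lock-file derivation discussed in this thread could look roughly like this. The function name and filename format are assumptions for illustration; the PR's actual helper may differ:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ssoLockFilename derives a per-directory lock file name from the SSO
// StartURL, so independent SSO directories lock independently while
// profiles sharing a StartURL still serialize to one browser flow.
// This mirrors the hashed lock-filename pattern the review mentions.
func ssoLockFilename(startURL string) string {
	sum := sha256.Sum256([]byte(startURL))
	return fmt.Sprintf("aws-vault.sso.%s.lock", hex.EncodeToString(sum[:8]))
}

func main() {
	fmt.Println(ssoLockFilename("https://example.awsapps.com/start"))
	fmt.Println(ssoLockFilename("https://other.awsapps.com/start"))
}
```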
@mbevc1 I think I've addressed all your feedback. I'll keep running it with these latest changes, and I'm happy to clean up the commits before merging, but for now LMK if you catch anything else. :)
Thanks @timvisher-dd, I'll review those in the next few days and get back to you.
Closes 99designs/aws-vault#1275 (abandoned)
I run aws-vault as a `credential_process` for SSO-backed AWS profiles. This is often in parallel contexts of dozens to hundreds of invocations at once.
In that context, when credentials needed to be refreshed, aws-vault would open an unconstrained number of browser tabs in parallel, usually triggering HTTP 500 responses on the AWS side and failing to acquire creds.
To mitigate this I developed a small wrapper around aws-vault that would take a Bash dir lock (Wooledge BashFAQ 45) whenever there was a possibility that the credentials would need to be refreshed. This worked, but it was also quite slow, since it locked the entire aws-vault execution rather than just the step that might launch a browser tab. The dir-locking strategy was also sensitive to being killed by process managers like terraform, and so had to do things like die voluntarily after a magic number of seconds to avoid being SIGKILLed and leaving a stale lock around.
This changeset introduces a cross-platform process lock that is tolerant of all forms of process death (even SIGKILL), and wires it into the SSO/OIDC browser flow so only one process refreshes at a time. This keeps parallel invocations fast while avoiding the browser storm that triggers HTTP 500s.
While stress testing I also found that the session cache could race across processes, leading to "item already exists" errors from the keychain. This branch adds a session cache refresh lock so only one process writes back to the same cache entry at once, while others wait and re-check.
Because the parallelism is now safe and fast, I also hit SSO OIDC rate limits (429). This change adds Retry-After support with jittered backoff so the SSO token exchange is resilient under heavy load.
In a stress test across 646 AWS profiles in 9 SSO Directories I'm able to retrieve creds successfully in ~36 seconds on my box. This fails on upstream HEAD because the browser storm overwhelms the IAM/SSO APIs and the keychain cache update races.
Performance implications
This change serializes keychain operations (for the macOS keychain backend only) so only one aws-vault process is touching the keychain at a time. This avoids concurrent unlock prompts and other keychain contention at the cost of making highly parallel workloads wait their turn for keychain access. In practice this has been acceptable for my workloads.
On my machine, successfully retrieving creds for 642 profiles completes in under 60 seconds, which is good enough that I did not do a formal before/after benchmark. If maintainers want more detailed measurements I can provide them.
Testing
Unit tests have been added for the SSO/OIDC lock, session cache lock, and OIDC retry logic. I also used some local integration test scripts that clear out the creds, run `aws-vault export --format=json` across different sets of my profiles, and assert that it succeeds. Finally, I've converted my local tooling to use this fork of aws-vault and have been exercising it there without issue.

Colophon

I did not write any of this code. Codex did. That said, I have read through it in some detail and it looks reasonable to me.

Co-authored-by: Codex <noreply@openai.com>