feat: Allow aws-vault to safely be run in parallel #291
timvisher-dd wants to merge 5 commits into ByteNess:main
Conversation
(force-pushed b4be5e4 to 96a6c03)
Heads up: I've seen some issues since running with this where, in rare cases, aws-vault will error with something along the lines of "can't update keychain item, it already exists". I'm going to drop this back to draft and let it bake a bit more.
Hey @timvisher-dd, no worries. Let me know when it's ready for a review.
(force-pushed 96a6c03 to 72c41a4)
I haven't been able to reproduce any issues yet. I think it may have been a failure in my small build script that creates a throwaway keychain to start from scratch.
No worries, we can mark this as Draft while you investigate, if you want to look more into it. |
Ah hah! Found it. It's a legit bug where if more than one aws-vault instance is trying to refresh creds for a single profile, they fight over setting the keychain item. I can figure that out. Dropping this back to Draft.
(force-pushed 18e63b7 to be4e082)
I think I've found another area where parallelism needs to be coordinated: the keychain Get/Set paths. I've never noticed this before because I use the …
Could you talk more about your use case here and the problem you're trying to solve? Am I reading this correctly that you're getting locking issues on the keychain backend, or something else?
Sure. I run …

When you run it that way without this changeset, several pathological things happen: …

I had "solved" all of this before by wrapping aws-vault with a bash script that made it so that only one … Does that make sense?
(force-pushed be4e082 to 6eee885)
Thanks for expanding on your use case. I'm sure I'm still missing some context, but wouldn't you be able to inject credentials once and run Terraform in that session? You can run multiple TF runs in there if needed. What's the reason you're fetching credentials for every Terraform run (usually you'd do role assumption from the provider)? I wonder if you might be holding it wrong here 🤔 Also, if we were to merge this, I'd suggest putting it behind a flag to avoid breaking workflows for existing users.
Perhaps but I don't think so unless I'm missing something about what you're saying. :)
Creds/Sessions are only good for a single account/role. I have access to, and make regular use of, hundreds of accounts across dozens of SSO directories. Furthermore, a single terraform run can make use of any number of AWS accounts via profiles that allow it to retrieve creds via a … Even once I have valid creds …
Most of my root modules are designed to be operated by heterogeneous groups and so don't directly bake in a particular profile/role. Instead, operators and CI are expected to inject creds into the environment via … In some cases, though, what you're describing is exactly what's happening anyway. I have something like … which uses … We have also at times used …, although TBH I find this to be somewhat pathological, even if it's a decent hack to solve the problem of heterogeneous operators of multi-provider root modules. Does that make sense? I'm definitely open to not being aware of 'the right way' to do this, but I'm not aware of any other options. Beyond whether ↑ is right or wrong, though, I'd argue that …
I'm open to adding a flag but to repeat myself here:
Any existing user who has tried to run … Any workflows that this could break (because it still operates absolutely fine, with identical performance, when running just one) would strictly be spacebar workflows, IMO.
Appreciate the additional examples and the context on how you use it; I think I have a better understanding of your use case now 😄. Also, I didn't mean to imply there is a "right or wrong", but perhaps more that there's a nicer ergonomic in using Terraform-native role assumption rather than injecting credentials for every account/ENV you need to access, e.g.: https://developer.hashicorp.com/terraform/tutorials/aws/aws-assumerole

I realise that's not always possible or desired based on your architecture, but it might get you around the current limitations of …

Coming back to those, I'd like to reiterate that I'm not against working to improve that behaviour, and adding support for it could be beneficial, but what I was thinking is more of having a flag to enable the new parallel-safe behaviour and not using it as the default, at least yet. Does that make sense? Am I right to assume you're mostly trying to address 3-5 here, as the others might be AWS limitations or a matter of how we throttle requests?

Thanks again; this is great context for trying to figure this one out.
I wasn't trying to imply that you were being belligerent or anything! Glad we're both doing our best to be constructive here. :) To your point above: yes, I'm definitely aware of that pattern. It doesn't work for us for a number of reasons, at least in part because it spreads account IDs around in a lot of duplicate locations, and is also subject to the issue of heterogeneous operators who need to run the same root modules with their own roles. Again, you can use terraform variables for that, but I think that's even stranger than just using …
Totally makes sense. If that's a hard requirement for getting this sort of capability merged to …
Actually, in my case I'm almost entirely focused on 1 and 2. In general we use our …

But what's really broken is 1 and 2. 1 literally can't authenticate in many cases because the AWS API returns 5xx responses when I open even 2 SSO flows in parallel. And 2 is more of an inconvenience, in that if you run it repeatedly you do eventually succeed, but why not just respect the Retry-After headers? 3-5 are still worth fixing, IMO, and the behavior is definitely broken in parallel, but it doesn't affect me at all because my …
👍
Co-authored-by: Codex <codex@openai.com>

# Conflicts:
#	vault/ssorolecredentialsprovider.go
Just in case this makes it more palatable: this branch seems to work as expected. https://github.com/ByteNess/aws-vault/compare/main...timvisher-dd:aws-vault:sso-browser-lock-behind-option?expand=1 You get the old behavior by default, but you pass …
Thanks, I'd be happy to review your changes behind this flag 👍 And it should mostly address your issues from the list, 1-2, right?
Yes, this still fully addresses my needs (1-5) with a bit of added config on my end (totally acceptable). :)
(force-pushed 6eee885 to a5670db)
OK, switched to the changeset with the options. :)
Thanks, I'll have a look
Apologies about the delay here. I wanted to let you know this is still on my radar and I'll try to get back to it shortly!
I'm running it locally so getting plenty of verification that it works as expected at least in my config. :) |
mbevc1 left a comment:

Hi, apologies again for the late review. I've finally had some time to look at this and have a few comments on the PR. The major things I'd suggest looking at are: locking per SSO URL, using defer when locking, and not masking errors on unlock. Otherwise it looks good; thanks again for submitting a PR!
To configure the default flag values of `aws-vault` and its subcommands:
* `AWS_VAULT_BACKEND`: Secret backend to use (see the flag `--backend`)
* `AWS_VAULT_BIOMETRICS`: Use biometric authentication using TouchID, if supported (see the flag `--biometrics`)
* `AWS_VAULT_PARALLEL_SAFE`: Enable cross-process locking for keychain and cached credentials (see the flag `--parallel-safe`)
Could you please expand the documentation of this feature? There's no explanation of when to use --parallel-safe, what it does, what the trade-offs are (serialized keychain ops), or which backends it applies to. Given the flag is opt-in, users who need it won't know what it does. Could you please add a short section on it?
if max < min {
	max = min
}
r := rand.New(rand.NewSource(time.Now().UnixNano()))
Minor suggestion here: this creates a new rand.Source seeded with the current nanosecond on every retry. In practice this is fine since retries are seconds apart, but it's wasteful, and a better pattern is a package-level rand.Rand initialized once. In Go 1.20+ the global rand is automatically seeded, so this could just be rand.Float64().
defaultSSOLockLogEvery  = 15 * time.Second
defaultSSOLockWarnAfter = 5 * time.Second
// 0 means retry indefinitely (caller is expected to use context cancellation).
ssoMaxAttempts = 0
The context passed down from Terraform's credential_process may not have a deadline. If AWS keeps returning 429s indefinitely, this loops forever. Let's either set a reasonable default max (e.g. 10) or document the expectation that callers must pass a timeout context. This could be the case for GetRoleCredentials(), where the context is passed down to the function.
Thanks for flagging this. I agree that unbounded retries need a safety net.
I went a slightly different direction than a hard attempt cap though. A fixed max (e.g. 10) doesn't map well to variable Retry-After durations — 10 attempts with 30s retry-afters is very different from 10 attempts with 1s retry-afters. Instead, I'm planning to derive the timeout from deviceCreds.ExpiresIn (returned by StartDeviceAuthorization), which is the natural bound for how long any of this should take. Once the device code expires, retrying is pointless anyway.
Re: Terraform's credential_process — worth noting that Terraform doesn't call us via Go code. It subshells to credential_process (which can be any binary), and hardcodes a 60s timeout on that subprocess. So the context concern doesn't apply in the way you described, but the 60s wall clock is a real constraint. That said, the 429 retries here are on GetRoleCredentials, which happens after the browser flow completes, so in practice these retries are short-lived and well within Terraform's timeout. Am I misunderstanding anything about what you were saying here?
I'll also add periodic retry logging with stats (429 count, max Retry-After seen) so users have visibility into what's happening.
Correction on my earlier reply — I said:

> Instead, I'm planning to derive the timeout from `deviceCreds.ExpiresIn` (returned by `StartDeviceAuthorization`), which is the natural bound for how long any of this should take.
After implementing this I realized `deviceCreds.ExpiresIn` doesn't apply here. There are two separate retry loops:

1. The OIDC device polling loop (`CreateToken` in `newOIDCToken`) — this is where `deviceCreds.ExpiresIn` matters, but it's already naturally bounded by AWS: when the device code expires, AWS returns an error that terminates the loop. No change needed.
2. The `GetRoleCredentials` retry loop — this is the one you flagged with `ssoMaxAttempts = 0`. It runs after the OIDC token is already acquired (the device code is consumed at that point). The OIDC access token is valid for ~8 hours, so `deviceCreds.ExpiresIn` would be the wrong bound here.
What I implemented instead: a 5-minute wall-clock deadline on the GetRoleCredentials retry loop. If we're still getting 429s after 5 minutes, we give up with a descriptive error including retry stats (429 count, max Retry-After seen). The sleep duration is also capped to the remaining time before the deadline so it can't overshoot.
This is a more conservative approach than your suggestion of a hard attempt cap (e.g. 10), but achieves the same goal — preventing infinite loops while being tolerant of variable Retry-After durations.
return nil, false, ctx.Err()
}

locked, err := p.ssoTokenLock.TryLock()
The manual unlock-on-every-error-path pattern is fragile. Every new code path added in the future must remember to call Unlock(). A panic (e.g. from a nil pointer in newOIDCTokenFn) will leave the lock file held until process death.
The standard Go idiom is defer:
locked, err := p.ssoTokenLock.TryLock()
if locked {
	defer p.ssoTokenLock.Unlock()
	// ... rest of logic
}
Agreed that defer is the right approach here. One nuance though — the straightforward pattern you suggested:
locked, err := p.ssoTokenLock.TryLock()
if locked {
defer p.ssoTokenLock.Unlock()
// ... rest of logic
}

…doesn't quite work, because TryLock is called inside a for loop. defer runs at function return, not at the end of a loop iteration, so the lock would be held across all subsequent iterations (including the sleep-and-retry path where we explicitly don't want to hold it).
The plan is to extract the locked body into a helper function so defer runs at the right scope — when the helper returns, which is the same point the manual Unlock() currently fires. Lock scope stays identical, but we get panic safety and the fragility concern you raised goes away.
I'll also use errors.Join in the deferred unlock to properly surface both the operation error and the unlock error if both fail (addresses your other comment about swallowed errors).
Does that match up with your understanding and still address the spirit of your concerns?
Sure, as long as we're using the defer pattern globally to make sure we always unlock.
return nil, ctx.Err()
}

locked, err := p.sessionLock.TryLock()
Same here - the manual unlock-on-every-error-path pattern is fragile. Every new code path added in the future must remember to call Unlock(). A panic (e.g. from a nil pointer in newOIDCTokenFn) will leave the lock file held until process death.
The standard Go idiom is defer:
locked, err := p.ssoTokenLock.TryLock()
if locked {
	defer p.ssoTokenLock.Unlock()
	// ... rest of logic
}

if err == nil && cached {
	unlockErr := p.sessionLock.Unlock()
	if unlockErr != nil {
		return nil, unlockErr
If both the operation and the unlock fail, the unlock error is returned and the original (more useful) error is silently dropped. The original error should be wrapped or joined: fmt.Errorf("unlock: %w (original: %v)", unlockErr, err). This could be a debugging nightmare in production.
The withLock in locked_keyring.go has the same pattern but correctly returns unlockErr only after function succeeds, so it's fine there.
if err != nil || token != nil {
	unlockErr := p.ssoTokenLock.Unlock()
	if unlockErr != nil {
		return nil, false, unlockErr
If both the operation and the unlock fail, the unlock error is returned and the original (more useful) error is silently dropped. The original error should be wrapped or joined: fmt.Errorf("unlock: %w (original: %v)", unlockErr, err). This could be a debugging nightmare in production.
The withLock in locked_keyring.go has the same pattern but correctly returns unlockErr only after function succeeds, so it's fine there.
},
)

ctx := context.Background()
Just a note here of a limitation; the keyring.Keyring interface doesn't pass a context (it's not context-aware), so this is a constraint of the interface. But the consequence is that if Terraform sends SIGKILL after 60s, the process dies, but any process waiting to acquire the keychain lock with context.Background() will wait indefinitely.
This means keychain waiters cannot be cancelled by context deadline or cancellation. Combined with the internal sync.Mutex (k.mu.Lock() is also uncancellable), this is a latent hang risk. Not easy to fix given the interface, but perhaps we could at least document it. What do you think?
@@ -0,0 +1,17 @@
package vault

const defaultSSOLockFilename = "aws-vault.sso.lock"
This always creates aws-vault.sso.lock, a single shared lock regardless of StartURL. You mentioned running 646 profiles across 9 SSO directories. With a global lock, all concurrent OIDC flows queue behind a single mutex even when they're for completely independent SSO endpoints. This turns what could be 9 parallel flows (one per directory) into full serialization.
Suggested improvement: key the lock by StartURL, the same way the session and keychain locks use hashedLockFilename. Perhaps change NewDefaultSSOTokenLock() to NewDefaultSSOTokenLock(startURL string) and feed p.StartURL into it via ensureSSODependencies.
Great suggestion. I agree — keying the SSO lock by StartURL is the right move.
To clarify the current behavior: the global lock only serializes the initial OIDC token acquisition (the browser flow + cache write). Once the token is cached, all processes for that StartURL find it in the cache and proceed immediately — the per-role GetRoleCredentials calls already run fully in parallel. So even with the global lock, the serialization window is just the browser auth, not all credential retrieval.
That said, with 9 SSO directories the global lock forces those 9 browser flows to run one after another when they could safely run in parallel. I ran a spike test to verify this: I fired 8 profiles simultaneously (one per distinct StartURL) with no SSO lock at all. All 8 completed successfully with zero 5xx or rate-limit errors from AWS. So parallel browser flows across different directories are safe.
I'll change NewDefaultSSOTokenLock() to accept startURL and hash it for the lock filename, the same pattern the session cache and keychain locks already use. This means independent SSO directories will authenticate in parallel while profiles sharing the same StartURL still serialize to a single browser tab. Should be a nice performance win for multi-directory setups at the cost of a slightly more complex lock key.
LMK if that doesn't sound right to you in any way. :)
Should be okay, as long as we're grouping by SSO directory URL and getting some concurrency benefits. Thanks!
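The per-StartURL lock-file derivation discussed in this thread could look roughly like this. The function name and filename format are assumptions for illustration; the PR's actual helper may differ:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ssoLockFilename derives a per-directory lock file name from the SSO
// StartURL, so independent SSO directories lock independently while
// profiles sharing a StartURL still serialize to one browser flow.
// This mirrors the hashed lock-filename pattern the review mentions.
func ssoLockFilename(startURL string) string {
	sum := sha256.Sum256([]byte(startURL))
	return fmt.Sprintf("aws-vault.sso.%s.lock", hex.EncodeToString(sum[:8]))
}

func main() {
	fmt.Println(ssoLockFilename("https://example.awsapps.com/start"))
	fmt.Println(ssoLockFilename("https://other.awsapps.com/start"))
}
```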
@mbevc1 I think I've addressed all your feedback. I'll keep running it with these latest changes, and I'm happy to clean up the commits before merging, but for now LMK if you catch anything else. :)
Thanks @timvisher-dd, I'll review those in the next few days and get back to you.
Closes 99designs/aws-vault#1275 (abandoned)
I run aws-vault as a `credential_process` for SSO-backed AWS profiles. This is often in parallel contexts of dozens to hundreds of invocations at once.
In that context, when credentials needed to be refreshed, aws-vault would open an unconstrained number of browser tabs in parallel, usually triggering HTTP 500 responses on the AWS side and failing to acquire creds.
To mitigate this I developed a small wrapper around aws-vault that would take a Bash dir lock (Wooledge BashFAQ 45) whenever there was a possibility that the credentials would need to be refreshed. This worked, but it was also quite slow, since it locked the entire aws-vault execution rather than just the step that might launch a browser tab. The dir-locking strategy was also sensitive to being killed by process managers like terraform, and so had to do things like die voluntarily after a magic number of seconds to avoid being SIGKILLed and leaving a stale lock around.
This changeset introduces a cross-platform process lock that is tolerant of all forms of process death (even SIGKILL), and wires it into the SSO/OIDC browser flow so only one process refreshes at a time. This keeps parallel invocations fast while avoiding the browser storm that triggers HTTP 500s.
While stress testing I also found that the session cache could race across processes, leading to "item already exists" errors from the keychain. This branch adds a session cache refresh lock so only one process writes back to the same cache entry at once, while others wait and re-check.
Because the parallelism is now safe and fast, I also hit SSO OIDC rate limits (429). This change adds Retry-After support with jittered backoff so the SSO token exchange is resilient under heavy load.
In a stress test across 646 AWS profiles in 9 SSO Directories I'm able to retrieve creds successfully in ~36 seconds on my box. This fails on upstream HEAD because the browser storm overwhelms the IAM/SSO APIs and the keychain cache update races.
Performance implications
This change serializes keychain operations (for the macOS keychain backend only) so only one aws-vault process is touching the keychain at a time. This avoids concurrent unlock prompts and other keychain contention at the cost of making highly parallel workloads wait their turn for keychain access. In practice this has been acceptable for my workloads.
On my machine, successfully retrieving creds for 642 profiles completes in under 60 seconds, which is good enough that I did not do a formal before/after benchmark. If maintainers want more detailed measurements I can provide them.
Testing
Unit tests have been added for the SSO/OIDC lock, session cache lock, and OIDC retry logic. I also used some local integration test scripts that clear out the creds, run `aws-vault export --format=json` across different sets of my profiles, and assert that it succeeds. Finally, I've converted my local tooling to use this fork of aws-vault and have been exercising it there without issue.

Colophon

I did not write any of this code. Codex did. That said, I have read through it in some detail and it looks reasonable to me.

Co-authored-by: Codex <noreply@openai.com>