Skip to content

feat(policies): parallelize policy evaluations with bounded concurrency#2901

Merged
migmartri merged 8 commits into
chainloop-dev:mainfrom
waveywaves:feat/parallelize-policy-evaluations
Mar 28, 2026
Merged

feat(policies): parallelize policy evaluations with bounded concurrency#2901
migmartri merged 8 commits into
chainloop-dev:mainfrom
waveywaves:feat/parallelize-policy-evaluations

Conversation

@waveywaves

@waveywaves waveywaves commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Parallelize policy attachment evaluations within VerifyStatement() and VerifyMaterial() using errgroup with bounded concurrency (max(NumCPU*2, 10))
  • Add WithMaxConcurrency(n) option to PolicyVerifier for configurable parallelism
  • Fix pre-existing concurrency safety issues:
    • Move remotePolicyCache/remoteGroupCache mutexes from instance-level to package-level (the caches are package-level maps but were protected by per-instance mutexes — a latent data race)
    • Use singleflight.Group to coalesce concurrent gRPC fetches for the same policy/group ref — ensures exactly one fetch per key, no duplicate RPCs or head-of-line blocking
    • Use context.WithoutCancel inside singleflight callbacks to prevent the winning goroutine's cancellation from propagating to waiters
    • Move extism.SetLogLevel() to sync.Once (it modifies global state; first engine's log level wins)
  • Marshal statement JSON once per evaluation batch instead of per-policy
  • Add 5 race detector tests:
    • Concurrent VerifyStatement from 10 goroutines
    • Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
    • WithMaxConcurrency option validation
    • Sequential (maxConcurrency=1) vs parallel result equivalence
    • errgroup cancellation on policy load failure

All existing + new tests pass with -race enabled. All CI checks green.

Fixes #2899

@waveywaves waveywaves marked this pull request as draft March 20, 2026 17:23

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:181">
P2: Global cache mutex is held during remote GetPolicy I/O, serializing concurrent policy loads and causing avoidable head-of-line blocking.</violation>
</file>

<file name="pkg/policies/group_loader.go">

<violation number="1" location="pkg/policies/group_loader.go:122">
P1: Global cache mutex is held across remote RPC in `Load`, causing head-of-line blocking and serializing concurrent policy loads.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread pkg/policies/group_loader.go Outdated
Comment thread pkg/policies/loader.go Outdated
@waveywaves waveywaves marked this pull request as ready for review March 20, 2026 18:47
@waveywaves waveywaves marked this pull request as draft March 20, 2026 18:48
@waveywaves

Copy link
Copy Markdown
Contributor Author

Both issues identified by cubic are resolved:

  1. loader.go (P2) and group_loader.go (P1) — mutex held during gRPC: Replaced defer mutex.Unlock() with singleflight.Group to coalesce concurrent fetches. The mutex is now only held for cache reads/writes (microseconds), never during I/O. context.WithoutCancel is used inside singleflight callbacks to prevent cross-goroutine cancellation propagation.

@waveywaves

Copy link
Copy Markdown
Contributor Author

@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.

@cubic-dev-ai

cubic-dev-ai Bot commented Mar 20, 2026

Copy link
Copy Markdown

@cubic-dev-ai please re-review — the mutex-during-RPC issues have been resolved with singleflight in the latest commits.

@waveywaves I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 7 files

@waveywaves waveywaves marked this pull request as ready for review March 20, 2026 19:27

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:200">
P2: Using context.WithoutCancel here removes the caller’s deadline/cancellation, so the GetPolicy RPC can outlive canceled/timed-out verification requests and continue running indefinitely under outage conditions.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread pkg/policies/loader.go Outdated
@waveywaves

Copy link
Copy Markdown
Contributor Author

Addressed the P2 finding identified by cubic — context.WithoutCancel now re-applies the original deadline via context.WithDeadline so gRPC calls inside singleflight still have a bounded lifetime. The cancel func is properly deferred.

Fixed in 6846021.

@waveywaves

Copy link
Copy Markdown
Contributor Author

@jiparis this is ready for review when you get a chance — parallelizes policy evaluations with errgroup + singleflight for cache coalescing. All CI green.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/policies/loader.go">

<violation number="1" location="pkg/policies/loader.go:203">
P2: Singleflight request context inherits only the winner caller’s deadline, so concurrent waiters can fail early with a timeout unrelated to their own context deadlines.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread pkg/policies/loader.go Outdated
@waveywaves

Copy link
Copy Markdown
Contributor Author

Addressed the P2 identified by cubic — replaced context.WithoutCancel + inherited deadline with a fixed 30s timeout from context.Background() inside both singleflight callbacks. This ensures all waiters get uniform timeout behavior regardless of which goroutine won the race.

Fixed in d6dd003.

Comment thread pkg/policies/concurrency_test.go Outdated
@@ -0,0 +1,233 @@
//
// Copyright 2024-2026 The Chainloop Authors.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2026 :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Updated to 2026.

Comment thread pkg/policies/policies.go Outdated
// defaultMaxConcurrency is the default number of concurrent policy evaluations.
// Set higher than NumCPU because policy evaluation is I/O-mixed (gRPC calls,
// file loads) rather than purely CPU-bound. Can be overridden via WithMaxConcurrency.
var defaultMaxConcurrency = max(runtime.NumCPU()*2, 10)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will reduce this to a smaller limit because Chainloop attestations can operate in remote state mode, which means different policy evaluations will contend to store the result using optimistic locking. So, let's ensure it’s smaller. cc/ @jiparis any thoughts?

@jiparis jiparis Mar 23, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that scenario evaluatons will happen all at once and then update the remote state with the aggregated result, but I agree, let's reduce it to something more conservative for now to test the behaviour, somethig like max(runtime.NunCPU(), 5). We also don't want to hit hard on the AI-enabled policies.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — reduced to max(runtime.NumCPU(), 5) per jiparis's suggestion. Updated the comment to document the optimistic locking concern.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied — max(runtime.NumCPU(), 5) it is. Thanks for the context on AI policies.

if err != nil {
return nil, NewPolicyError(err)
}
// Load material content once for all policies in this group

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jiparis

jiparis commented Mar 25, 2026

Copy link
Copy Markdown
Member

@waveywaves please fix the conflicts. I'm giving it a final review. Thanks!

Comment thread pkg/policies/loader.go
// This ensures exactly one gRPC call per ref regardless of concurrency.
// Use context.WithoutCancel so the winning goroutine's context cancellation
// doesn't propagate to waiters sharing the same singleflight key.
result, err, _ := remotePolicyFlight.Do(ref, func() (interface{}, error) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this

@waveywaves waveywaves force-pushed the feat/parallelize-policy-evaluations branch from aa5e8b5 to 5709c4c Compare March 27, 2026 11:52
@waveywaves

Copy link
Copy Markdown
Contributor Author

Conflicts resolved — rebased onto main. The merge conflict was in policy_groups.go where upstream added applyGroupGate() wrapping; integrated that into the parallelized errgroup loops. All tests passing.

@jiparis

jiparis commented Mar 27, 2026

Copy link
Copy Markdown
Member

@waveywaves none of the commits are signed. Please sign them and force push.

Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU().
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Move extism.SetLogLevel() from per-Verify() call to engine
  construction to avoid racing on global state

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by runtime.NumCPU().
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Also fixes pre-existing concurrency safety issues:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use check-lock-check pattern in cache loaders to avoid holding the
  mutex during gRPC calls, enabling true parallel cache-miss fetches
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10).
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same
  policy/group ref, ensuring exactly one fetch per key
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Adds race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Use errgroup to evaluate policy attachments concurrently within
VerifyStatement and VerifyMaterial, bounded by max(NumCPU*2, 10).
This reduces evaluation time from O(n*t) to O(t) where n is the
number of policies and t is the slowest single evaluation.

Concurrency safety fixes:
- Move remotePolicyCache/remoteGroupCache mutexes from instance-level
  to package-level to prevent data races under concurrent access
- Use singleflight to coalesce concurrent gRPC fetches for the same
  policy/group ref, ensuring exactly one fetch per key
- Use context.WithoutCancel inside singleflight callbacks to prevent
  the winning goroutine's cancellation from propagating to waiters
- Move extism.SetLogLevel() to sync.Once to guarantee it is called
  exactly once across all concurrent WASM engine instances

Adds 5 race detector tests covering:
- Concurrent VerifyStatement from 10 goroutines
- Concurrent VerifyMaterial from 10 goroutines (with real SPDX fixture)
- WithMaxConcurrency option validation
- Sequential (maxConcurrency=1) vs parallel result equivalence
- errgroup cancellation on policy load failure

Fixes chainloop-dev#2899

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
context.WithoutCancel detaches cancellation but also drops the deadline.
Re-apply the original deadline so gRPC calls inside singleflight still
have a bounded lifetime under server outage conditions.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
Replace context.WithoutCancel + inherited deadline with a fixed 30s
timeout from context.Background(). This ensures all singleflight
waiters get uniform timeout behavior regardless of which goroutine
won the race, preventing one caller's short deadline from failing
the shared RPC.

Addresses cubic-dev-ai review feedback.

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
- Change default concurrency from max(NumCPU*2, 10) to max(NumCPU, 5)
  to avoid contention on remote state optimistic locking and limit
  pressure on AI-enabled policies (per migmartri and jiparis feedback)
- Fix copyright year in concurrency_test.go

Signed-off-by: Vibhav Bobade <vibhav.bobde@gmail.com>
@waveywaves waveywaves force-pushed the feat/parallelize-policy-evaluations branch from 5709c4c to b79cb84 Compare March 28, 2026 04:15
@migmartri migmartri merged commit c7f0b9b into chainloop-dev:main Mar 28, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: parallelize policy evaluations

3 participants