perf(expand): prefetch descendant grants to eliminate per-principal SQL queries#863
perf(expand): prefetch descendant grants to eliminate per-principal SQL queries#863arreyder wants to merge 3 commits into
Conversation
…QL queries The grant expansion inner loop previously issued one ListGrantsForEntitlement SQL query per source grant to check whether the principal already holds a grant on the descendant entitlement. For large tenants (Eli Lilly: 180K users, 139K groups), a single expansion action issued 180K individual SQLite queries — each cheap in isolation but taking 38 minutes total due to pure-Go SQLite per-query overhead (modernc.org/sqlite). Replace the per-principal inner loop with a single bulk pre-fetch of all grants on the descendant entitlement into an in-memory map keyed by (principal_resource_type_id, principal_resource_id). The source grant loop then does O(1) map lookups instead of O(N) SQL queries. Expected impact for Lilly's GLB Entra connector: Before: 180K SQL queries per action → ~38 minutes After: 1 full scan of descendant (paginated) + map lookups → seconds The pre-fetch is per-runAction call (not cached across actions) to ensure correctness in diamond graphs where action A→C creates grants that action B→C must see as pre-existing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| descendantGrants := descendantByPrincipal[key] | ||
|
|
||
| resp, err := e.store.ListGrantsForEntitlement(ctx, req) | ||
| if len(descendantGrants) == 0 { | ||
| descendantGrant, err := newExpandedGrant(descendantEntitlement.GetEntitlement(), sourceGrant.GetPrincipal(), action.SourceEntitlementID, isSourceDirect) | ||
| if err != nil { | ||
| l.Error("runAction: error fetching descendant grants", zap.Error(err)) | ||
| return "", fmt.Errorf("runAction: error fetching descendant grants: %w", err) | ||
| l.Error("runAction: error creating new grant", zap.Error(err)) | ||
| return "", fmt.Errorf("runAction: error creating new grant: %w", err) | ||
| } | ||
| descendantGrants := resp.GetList() | ||
|
|
||
| // If we have no grants for the principal in the descendant entitlement, make one. | ||
| if pageToken == "" && resp.GetNextPageToken() == "" && len(descendantGrants) == 0 { | ||
| descendantGrant, err := newExpandedGrant(descendantEntitlement.GetEntitlement(), sourceGrant.GetPrincipal(), action.SourceEntitlementID, isSourceDirect) | ||
| if err != nil { | ||
| l.Error("runAction: error creating new grant", zap.Error(err)) | ||
| return "", fmt.Errorf("runAction: error creating new grant: %w", err) | ||
| } | ||
| newGrants = append(newGrants, descendantGrant) | ||
| newGrants, err = PutGrantsInChunks(ctx, e.store, newGrants, 10000) | ||
| if err != nil { | ||
| l.Error("runAction: error updating descendant grants", zap.Error(err)) | ||
| return "", fmt.Errorf("runAction: error updating descendant grants: %w", err) | ||
| } | ||
| break | ||
| } | ||
|
|
||
| // Add the source entitlement as a source to all descendant grants. | ||
| grantsToUpdate := make([]*v2.Grant, 0) | ||
| newGrants = append(newGrants, descendantGrant) |
There was a problem hiding this comment.
🟡 Suggestion: The descendantByPrincipal map is built once before the loop but never updated after creating new grants. If two source grants in the same page share the same principal and no pre-existing descendant grant exists, the second iteration will also see len(descendantGrants) == 0 and create a duplicate grant (same deterministic ID) instead of updating the first grant's sources. Consider inserting the newly created grant into the map:
| descendantGrants := descendantByPrincipal[key] | |
| resp, err := e.store.ListGrantsForEntitlement(ctx, req) | |
| if len(descendantGrants) == 0 { | |
| descendantGrant, err := newExpandedGrant(descendantEntitlement.GetEntitlement(), sourceGrant.GetPrincipal(), action.SourceEntitlementID, isSourceDirect) | |
| if err != nil { | |
| l.Error("runAction: error fetching descendant grants", zap.Error(err)) | |
| return "", fmt.Errorf("runAction: error fetching descendant grants: %w", err) | |
| l.Error("runAction: error creating new grant", zap.Error(err)) | |
| return "", fmt.Errorf("runAction: error creating new grant: %w", err) | |
| } | |
| descendantGrants := resp.GetList() | |
| // If we have no grants for the principal in the descendant entitlement, make one. | |
| if pageToken == "" && resp.GetNextPageToken() == "" && len(descendantGrants) == 0 { | |
| descendantGrant, err := newExpandedGrant(descendantEntitlement.GetEntitlement(), sourceGrant.GetPrincipal(), action.SourceEntitlementID, isSourceDirect) | |
| if err != nil { | |
| l.Error("runAction: error creating new grant", zap.Error(err)) | |
| return "", fmt.Errorf("runAction: error creating new grant: %w", err) | |
| } | |
| newGrants = append(newGrants, descendantGrant) | |
| newGrants, err = PutGrantsInChunks(ctx, e.store, newGrants, 10000) | |
| if err != nil { | |
| l.Error("runAction: error updating descendant grants", zap.Error(err)) | |
| return "", fmt.Errorf("runAction: error updating descendant grants: %w", err) | |
| } | |
| break | |
| } | |
| // Add the source entitlement as a source to all descendant grants. | |
| grantsToUpdate := make([]*v2.Grant, 0) | |
| newGrants = append(newGrants, descendantGrant) | |
| descendantGrant, err := newExpandedGrant(descendantEntitlement.GetEntitlement(), sourceGrant.GetPrincipal(), action.SourceEntitlementID, isSourceDirect) | |
| if err != nil { | |
| l.Error("runAction: error creating new grant", zap.Error(err)) | |
| return "", fmt.Errorf("runAction: error creating new grant: %w", err) | |
| } | |
| newGrants = append(newGrants, descendantGrant) | |
| descendantByPrincipal[key] = append(descendantByPrincipal[key], descendantGrant) |
General PR Review: perf(expand): prefetch descendant grants to eliminate per-principal SQL queriesBlocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0 Review SummaryThe new commit removes the Expander-level descendant grant cache ( Security IssuesNone found. Correctness IssuesNone found. SuggestionsNone. |
…principal guard Addresses code review findings on PR #863: 1. Cache descendant prefetch across pages of the same action (High): the same descendant entitlement is now fetched once and reused for all source-grant pages within a single action. Cache invalidated when the action completes and moves to the next edge. 2. Context cancellation check in prefetch loop (Medium): ctx.Err() is checked at the top of each page iteration so the loop exits promptly on timeout or shutdown. 3. Page-count safety valve (Critical → defense-in-depth): maxPrefetchPages (10,000) bounds the loop against infinite pagination from a buggy store. At 10K pages × 10K page size = 100M grants, well beyond any realistic entitlement. 4. Nil-principal skip (Low → defensive): source grants with nil principals are skipped rather than producing a key collision on "\x00". Matches newExpandedGrant's nil rejection. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| if e.prefetchedDescendantID == entID && e.prefetchedDescendants != nil { | ||
| return e.prefetchedDescendants, nil |
There was a problem hiding this comment.
🟡 Suggestion: Cross-page stale cache — RunSingleStep calls runAction repeatedly (once per source-grants page) for the same action, and the cache is only cleared when the action completes. Grants created by page N's PutGrantsInChunks are written to the store but invisible to page N+1 because the cached map is returned here. If a principal appears in source grants across two pages, page N+1 will see an empty slice and create a duplicate expanded grant instead of updating the one from page N. The pre-caching code re-fetched from the DB each page, so it saw prior writes. Consider invalidating the cache after PutGrantsInChunks flushes (e.g. e.prefetchedDescendants = nil), or inserting newly created grants into the cached map.
Benchmark resultsRan on the existing 70% wall-clock improvement, 57% fewer allocations, 51% less memory allocated. The benchmark fixture is a 232-node graph — much smaller than Lilly's 139K-group tenant. The improvement scales with graph density: the more grants per entitlement, the more per-principal SQL queries the old code issued. Lilly's 180K-member "all employees" group (the action that takes 38 minutes today) should see an even larger relative speedup because:
|
|
re. the benchmark above: my 🤖 pointed out that it used one Expander for the whole run, while production syncer uses a new Expander per RunSingleStep in syncer.go (line 2555), so the cache does not persist across pages. |
|
|
||
| result := make(map[string][]*v2.Grant) | ||
| pageToken := "" | ||
| for page := 0; page < maxPrefetchPages; page++ { |
There was a problem hiding this comment.
What happens if we hit maxPrefetchPages?
Addresses alan-lee-12's review: the syncer creates a new Expander per RunSingleStep (syncer.go:2555), so struct-level cache fields don't persist across source-grant pages in production. The benchmark (expander.Run) was overly optimistic because it reuses one Expander. Remove the prefetchedDescendantID/prefetchedDescendants fields and the cache-invalidation logic. The core optimization is unchanged: each runAction call does ONE bulk scan of the descendant entitlement (prefetchDescendantGrants) instead of N per-principal SQL queries. This is the source of the 70% benchmark win — the cross-page cache was bonus amortization that only helped the Run() path. Production behavior per page: Before: 10K individual point-lookup SQL queries After: 1 bulk scan + map lookups The cross-page staleness concern is also resolved: since there's no cache, each page always sees the latest store state. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
ListGrantsForEntitlementSQL queries inrunActionwith a single bulk pre-fetch into an in-memory mapProblem
Profile evidence from be-temporal-sync prod (1h aggregate):
dotc1z.listGrantsGeneric → ListGrantsForEntitlement → database/sql.Rows.Next → modernc.org/sqlite._sqlite3VdbeExecThe inner loop in
runAction(previously lines 278-350) calledListGrantsForEntitlementwith a specificPrincipalIdfor each source grant. Even though the compound index(entitlement_id, principal_resource_type_id, principal_resource_id)makes each a point lookup returning 0-1 rows, the per-query overhead in pure-Go SQLite (~0.1-1ms per call) compounds to 38 minutes at 180K source grants.Fix
Pre-fetch ALL grants on the descendant entitlement into
map[principalKey][]*v2.Grantbefore entering the source grants loop. Each source grant then does a map lookup (O(1)) instead of a SQL query.The pre-fetch is per-
runActioncall, not cached across actions. This ensures correctness in diamond graphs: if action A→C creates grants that action B→C must see, the B→C call does a fresh pre-fetch that includes the writes from A→C.Memory tradeoff
For the worst case (descendant with 180K grants × ~2KB each = ~360MB peak), this is within budget for worker pods (32GB RAM). Most entitlements are much smaller.
Test plan
go test ./pkg/sync/expand/...— all pass (including diamond graph, multi-level, mixed directness, cycle tests)go test ./pkg/sync/...— all passgo test -short ./pkg/dotc1z/...— all passgo test -bench=BenchmarkExpand ./pkg/sync/expand/...— 70% improvementBenchmark
🤖 Generated with Claude Code