Skip to content

fixes to secondary dispatch hedger#2283

Merged
josephschorr merged 18 commits intomainfrom
fix-tdigest-panic
Mar 24, 2025
Merged

fixes to secondary dispatch hedger#2283
josephschorr merged 18 commits intomainfrom
fix-tdigest-panic

Conversation

@vroldanbet
Copy link
Copy Markdown
Contributor

@vroldanbet vroldanbet commented Mar 19, 2025

  • we observed a panic caused by tdigest. Even after vendoring Fix OOB access in Quantile() influxdata/tdigest#32, the issue kept happening. Given the library also appears unmaintained, we switched to https://github.com/caio/go-tdigest.
  • we missed the cancelation of the primary if the secondary had returned a value. This is key to avoid the computational cost of running the primary if the secondary already executed
  • restructure context cancelation to be finer-grained
  • reduce the amount of time it waits before fire off the primary by using p95 without offset. This reduces the impact over calls that rely on the primary (e.g. secondary does not have an actual answer)
  • added metrics for the computed hedge wait duration, and another one to determine the ratio of cancelled / executed primary dispatches
  • added trace logging
  • optimized usage of CEL expression so it's not compiled on each dispatch
  • fixed panic when logging a DispatchLookupSubjectsResponse

@github-actions github-actions bot added area/dependencies Affects dependencies area/dispatch Affects dispatching of requests labels Mar 19, 2025
@vroldanbet vroldanbet force-pushed the fix-tdigest-panic branch 2 times, most recently from 3ff5a83 to 9745043 Compare March 19, 2025 11:24
@github-actions github-actions bot added area/cli Affects the command line area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels Mar 19, 2025
@vroldanbet vroldanbet changed the title fix panic with a fork with the fix fixes to secondary dispatch hedger Mar 19, 2025
@josephschorr josephschorr force-pushed the fix-tdigest-panic branch 2 times, most recently from 3f955ec to 38a8e94 Compare March 24, 2025 11:07
vroldanbet and others added 18 commits March 24, 2025 13:24
vendors influxdata/tdigest#32 for the purpose of
validating the functionality without the panic. We will likely follow up by replacing the library
as this tdigest does not seem to be maintained anymore.
the rationale is that we want to make the wait number
smaller and not focus on the tail latencies of secondary.
we can now see how many primary dispatches get canceled with the metric.
In our testing environment, a p90 gave us 32% cancelation for checks,
and roughly 61% for lookups.

Let's test with p95
@vroldanbet vroldanbet marked this pull request as ready for review March 24, 2025 12:06
@vroldanbet vroldanbet requested a review from a team as a code owner March 24, 2025 12:06
@josephschorr josephschorr added this pull request to the merge queue Mar 24, 2025
Merged via the queue into main with commit ff12fd3 Mar 24, 2025
41 checks passed
@josephschorr josephschorr deleted the fix-tdigest-panic branch March 24, 2025 12:27
@github-actions github-actions bot locked and limited conversation to collaborators Mar 24, 2025
@miparnisari
Copy link
Copy Markdown
Contributor

TestCheckUsesDelayByDefaultForPrimary seems flaky: https://github.com/authzed/spicedb/actions/runs/14893361910/job/41830618214?pr=2382

ok  	github.com/authzed/spicedb/internal/dispatch/keys	1.144s
panic: panic raised

goroutine 906 [running]:
github.com/authzed/spicedb/internal/dispatch/remote.(*fakeDispatchSvc).DispatchCheck(0xc0030002d0, {0x1c0b5b0, 0xc0007b6c90}, 0x30?)
	/home/runner/work/spicedb/spicedb/internal/dispatch/remote/cluster_test.go:57 +0x[46](https://github.com/authzed/spicedb/actions/runs/14893361910/job/41830618214?pr=2382#step:4:47)7
github.com/authzed/spicedb/pkg/proto/dispatch/v1._DispatchService_DispatchCheck_Handler({0x18a8160, 0xc0030002d0}, {0x1c0b5b0, 0xc0007b6c90}, 0xc002f00300, 0x0)
	/home/runner/work/spicedb/spicedb/pkg/proto/dispatch/v1/dispatch_grpc.pb.go:174 +0x28f
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003d7400, {0x1c0b5b0, 0xc0007b6c00}, 0xc0007993e0, 0xc0009df800, 0x25bf220, 0x0)
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1405 +0x1bc3
google.golang.org/grpc.(*Server).handleStream(0xc0003d7400, {0x1c0b9b0, 0xc00038bd40}, 0xc0007993e0)
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1815 +0x1373
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1035 +0x159
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 989
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.72.0/server.go:1046 +0x215
FAIL	github.com/authzed/spicedb/internal/dispatch/remote	0.933s
FAIL

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area/cli Affects the command line area/dependencies Affects dependencies area/dispatch Affects dispatching of requests area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants