
feat(bigquery): use arrow ZSTD compression for storage read sessions #13846

Draft
alvarowolfx wants to merge 1 commit into googleapis:main from alvarowolfx:feat-bq-arrow-zstd

Conversation

@alvarowolfx
Contributor

Towards #13742

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Feb 12, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables ZSTD compression for Arrow data format in BigQuery storage read sessions, which should improve performance and reduce network bandwidth usage. The implementation correctly sets the required options in the CreateReadSession request. A new test is added to verify this change. I've found a minor issue in the new test code where a nil pointer dereference could occur, and I've suggested a fix to make the test more robust.

Comment on lines +76 to +83
readOptions := session.GetReadOptions()
arrowSerializationOptions := readOptions.GetArrowSerializationOptions()
if arrowSerializationOptions == nil {
	t.Errorf("expected ReadSession.ArrowSerializationOptions != nil")
}
if arrowSerializationOptions.GetBufferCompression() != storagepb.ArrowSerializationOptions_ZSTD {
	t.Errorf("expected ReadSession.ArrowSerializationOptions.BufferCompression = %v, want %v", arrowSerializationOptions.GetBufferCompression(), storagepb.ArrowSerializationOptions_ZSTD)
}


Severity: medium

There's a potential nil pointer dereference on line 77. If session.GetReadOptions() returns nil, readOptions will be nil, and the subsequent call to readOptions.GetArrowSerializationOptions() will cause a panic. It's better to check for nil on readOptions before using it. The suggested change makes the test more robust by adding this check and using t.Fatalf to stop the test on failure, which is consistent with other checks in this test.

readOptions := session.GetReadOptions()
if readOptions == nil {
	t.Fatalf("expected ReadSession.ReadOptions to be set, but it was nil")
}
arrowSerializationOptions := readOptions.GetArrowSerializationOptions()
if arrowSerializationOptions == nil {
	t.Fatalf("expected ReadSession.ArrowSerializationOptions to be set, but it was nil")
}
if got, want := arrowSerializationOptions.GetBufferCompression(), storagepb.ArrowSerializationOptions_ZSTD; got != want {
	t.Errorf("unexpected BufferCompression: got %v, want %v", got, want)
}

@alvarowolfx
Contributor Author

In some early benchmarks, I'm seeing it actually take a bit longer with ZSTD enabled. I wonder if the gains only show up on bigger tables.

With ZSTD enabled:

➜  bigquery git:(feat-bq-arrow-zstd) ✗ go test -bench=. -timeout 30m -test.benchmem -run BenchmarkIntegration_StorageReadQuery -v
goos: linux
goarch: amd64
pkg: cloud.google.com/go/bigquery
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkIntegration_StorageReadQuery
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full-24                       1        11954763973 ns/op                0 max_streams           7.000 parallel_streams    6311504 rows 5591852240 B/op 65915049 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full-24                       1        10798308464 ns/op                1.000 max_streams               1.000 parallel_streams    6311504 rows 5619221832 B/op 65911803 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_fl
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_fl-24                1        1742100017 ns/op                 0 max_streams           1.000 parallel_streams     218192 rows 176734448 B/op   2280768 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_fl
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_fl-24                1        1781451299 ns/op                 1.000 max_streams               1.000 parallel_streams     218192 rows 181883840 B/op   2280805 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_ca
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_ca-24                1        2119265299 ns/op                 0 max_streams           1.000 parallel_streams     400762 rows 339429768 B/op   4228854 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_ca
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_ca-24                1        1678401147 ns/op                 1.000 max_streams               1.000 parallel_streams     400762 rows 343467368 B/op   4229033 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full_ordered
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full_ordered-24               1        11116730796 ns/op                0 max_streams           1.000 parallel_streams    6311504 rows 5399809016 B/op 65898096 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full_ordered
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full_ordered-24               1        10853172645 ns/op                1.000 max_streams               1.000 parallel_streams    6311504 rows 5411691232 B/op 65901589 allocs/op
PASS
ok      cloud.google.com/go/bigquery    59.810s

Without ZSTD enabled:

➜  bigquery git:(feat-bq-arrow-zstd) ✗ go test -bench=. -timeout 30m -test.benchmem -run BenchmarkIntegration_StorageReadQuery -v
goos: linux
goarch: amd64
pkg: cloud.google.com/go/bigquery
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkIntegration_StorageReadQuery
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full-24                       1        9042367424 ns/op                 0 max_streams           7.000 parallel_streams    6311504 rows 4733939632 B/op 65626933 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full-24                       1        9368957426 ns/op                 1.000 max_streams               1.000 parallel_streams    6311504 rows 4858514272 B/op 65615940 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_fl
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_fl-24                1        1735711993 ns/op                 0 max_streams           1.000 parallel_streams     218192 rows 165103600 B/op   2271140 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_fl
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_fl-24                1        1642849492 ns/op                 1.000 max_streams               1.000 parallel_streams     218192 rows 162937120 B/op   2270998 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_ca
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_state_eq_ca-24                1        1788624486 ns/op                 0 max_streams           1.000 parallel_streams     400762 rows 313665584 B/op   4211292 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_ca
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_state_eq_ca-24                1        1774708717 ns/op                 1.000 max_streams               1.000 parallel_streams     400762 rows 302950392 B/op   4210782 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full_ordered
BenchmarkIntegration_StorageReadQuery/storage_api_0_max_streams_usa_1910_current_full_ordered-24               1        9171862649 ns/op                 0 max_streams           1.000 parallel_streams    6311504 rows 4849304568 B/op 65620068 allocs/op
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full_ordered
BenchmarkIntegration_StorageReadQuery/storage_api_1_max_streams_usa_1910_current_full_ordered-24               1        9342665947 ns/op                 1.000 max_streams               1.000 parallel_streams    6311504 rows 4882694040 B/op 65625518 allocs/op
PASS
ok      cloud.google.com/go/bigquery    53.599s

@lidavidm

The dataset we tested with was 400k rows and consisted mainly of strings. It would depend on the particular latency+bandwidth to/from BigQuery as well.

Additionally, it may be worth trying LZ4 in place of ZSTD.
