feat(bigquery): use arrow ZSTD compression for storage read sessions#13846
feat(bigquery): use arrow ZSTD compression for storage read sessions#13846alvarowolfx wants to merge 1 commit intogoogleapis:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request enables ZSTD compression for Arrow data format in BigQuery storage read sessions, which should improve performance and reduce network bandwidth usage. The implementation correctly sets the required options in the CreateReadSession request. A new test is added to verify this change. I've found a minor issue in the new test code where a nil pointer dereference could occur, and I've suggested a fix to make the test more robust.
| readOptions := session.GetReadOptions() | ||
| arrowSerializationOptions := readOptions.GetArrowSerializationOptions() | ||
| if arrowSerializationOptions == nil { | ||
| t.Errorf("expected ReadSession.ArrowSerializationOptions != nil") | ||
| } | ||
| if arrowSerializationOptions.GetBufferCompression() != storagepb.ArrowSerializationOptions_ZSTD { | ||
| t.Errorf("expected ReadSession.ArrowSerializationOptions.BufferCompression = %v, want %v", arrowSerializationOptions.GetBufferCompression(), storagepb.ArrowSerializationOptions_ZSTD) | ||
| } |
There was a problem hiding this comment.
There's a potential nil pointer dereference on line 77. If session.GetReadOptions() returns nil, readOptions will be nil, and the subsequent call to readOptions.GetArrowSerializationOptions() will cause a panic. It's better to check for nil on readOptions before using it. The suggested change makes the test more robust by adding this check and using t.Fatalf to stop the test on failure, which is consistent with other checks in this test.
readOptions := session.GetReadOptions()
if readOptions == nil {
t.Fatalf("expected ReadSession.ReadOptions to be set, but it was nil")
}
arrowSerializationOptions := readOptions.GetArrowSerializationOptions()
if arrowSerializationOptions == nil {
t.Fatalf("expected ReadSession.ArrowSerializationOptions to be set, but it was nil")
}
if got, want := arrowSerializationOptions.GetBufferCompression(), storagepb.ArrowSerializationOptions_ZSTD; got != want {
t.Errorf("unexpected BufferCompression: got %v, want %v", got, want)
}|
In some early benchmarks, I'm seeing it actually taking a tad bit longer with ZSTD enabled. I wonder if gains are only perceived in bigger tables. With ZSTD enabled: Without ZSTD enabled: |
|
The dataset we tested with was 400k rows and consisted mainly of strings. It would depend on the particular latency+bandwidth to/from BigQuery as well. Additionally it may be worth trying LZ4 in place of ZSTD. |
Towards #13742