Apache Iceberg version
1.10.1
Query engine
None
Please describe the bug 🐞
#12771 added a write table property (write.parquet.stats-enabled.column.<COLUMN_NAME>) to allow statistics to be disabled on a per-column basis. However, when the property is set for more than one column, it appears to take effect for only one of them.
After adding two properties to the table to disable stats for the wire_format_message and json_format_message columns, stats were still being written to the Parquet file for the latter column (json_format_message). Here's some output from the Parquet CLI and DuckDB that I used to confirm this:
Row group 0: count: 1000 117.33 B records start: 4 total(compressed): 114.576 kB total(uncompressed):675.036 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
...
wire_format_message BINARY Z _ 1000 36.82 B
json_format_message BINARY Z _ 1000 36.02 B 0 "{"eventMetadata":{"uuid":..." / "{"eventMetadata":{"uuid":..."
...
➜ duckdb
DuckDB v1.4.4 (Andium) 6ddac802ff
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D ATTACH 'warehouse' AS iceberg_catalog (
TYPE iceberg,
ENDPOINT 'http://localhost:8181',
AUTHORIZATION_TYPE 'none'
);
D SELECT * FROM iceberg_table_properties(iceberg_catalog.events.entity_events);
┌─────────────────────────────────────────────────────────┬─────────┐
│ key │ value │
│ varchar │ varchar │
├─────────────────────────────────────────────────────────┼─────────┤
│ write.parquet.compression-codec │ zstd │
│ commit.retry.total-timeout-ms │ 120000 │
│ commit.retry.min-wait-ms │ 3000 │
│ write.parquet.stats-enabled.column.wire_format_message │ false │
│ write.distribution-mode │ hash │
│ commit.retry.num-retries │ 5 │
│ write.parquet.stats-enabled.column.json_format_message │ false │
│ commit.retry.max-wait-ms │ 60000 │
│ owner │ root │
└─────────────────────────────────────────────────────────┴─────────┘
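To make the expected behavior explicit, here is a minimal sketch of how the properties listed above should be interpreted. It is not Iceberg's implementation: the helper name `columns_with_stats_disabled` is hypothetical, and the only thing taken from the issue is the property prefix and the two column names.

```python
# Hypothetical helper, not part of Iceberg: derives which columns should have
# Parquet statistics disabled, given a table's properties.
PREFIX = "write.parquet.stats-enabled.column."


def columns_with_stats_disabled(properties: dict[str, str]) -> set[str]:
    """Return the column names whose per-column stats property is 'false'."""
    disabled = set()
    for key, value in properties.items():
        # Every key with the prefix is considered independently, so several
        # columns can be disabled at once.
        if key.startswith(PREFIX) and value.strip().lower() == "false":
            disabled.add(key[len(PREFIX):])
    return disabled


# Subset of the table properties shown in the DuckDB output above.
table_properties = {
    "write.parquet.compression-codec": "zstd",
    "write.parquet.stats-enabled.column.wire_format_message": "false",
    "write.distribution-mode": "hash",
    "write.parquet.stats-enabled.column.json_format_message": "false",
}

print(sorted(columns_with_stats_disabled(table_properties)))
# prints ['json_format_message', 'wire_format_message']
```

With both properties set, I would expect neither column to carry min/max stats in the written file, whereas the Parquet CLI output shows them still present for json_format_message.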
I was able to reproduce this bug in the tests by changing the string here to "false" and flipping the boolean in the assertion here to false. The test then failed because stats were still being written for int_field.
Willingness to contribute