Disabling statistics across multiple columns #15347

@GabrielM98

Description

Apache Iceberg version

1.10.1

Query engine

None

Please describe the bug 🐞

#12771 added a write table property (write.parquet.stats-enabled.column.<COLUMN_NAME>) to allow statistics to be disabled on a per-column basis. However, it appears to take effect for only one column when it is set on several.
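To make the configuration concrete, here is a minimal sketch of the property keys involved when stats are disabled for more than one column. The prefix comes from the property named above; the helper `stats_disabled_properties` is a hypothetical name used only for illustration and is not part of Iceberg.

```python
# Prefix taken from the property name quoted in this issue.
STATS_ENABLED_PREFIX = "write.parquet.stats-enabled.column."


def stats_disabled_properties(columns):
    """Build one 'stats-enabled = false' table property per column (hypothetical helper)."""
    return {f"{STATS_ENABLED_PREFIX}{name}": "false" for name in columns}


props = stats_disabled_properties(["wire_format_message", "json_format_message"])
for key, value in props.items():
    print(f"{key}={value}")
```

Each column gets its own key, so disabling stats for two columns means setting two separate properties on the table.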

After adding two properties to the table to disable stats for the wire_format_message and json_format_message columns, stats were still written to the Parquet file for the latter column. Here is the output from the Parquet CLI and DuckDB that I used to confirm this:

Row group 0:  count: 1000  117.33 B records  start: 4  total(compressed): 114.576 kB total(uncompressed):675.036 kB
--------------------------------------------------------------------------------
                                                                      type      encodings count     avg size   nulls   min / max
...
wire_format_message                                                   BINARY    Z   _     1000      36.82 B
json_format_message                                                   BINARY    Z   _     1000      36.02 B    0       "{"eventMetadata":{"uuid":..." / "{"eventMetadata":{"uuid":..."
...

➜  duckdb
DuckDB v1.4.4 (Andium) 6ddac802ff
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D ATTACH 'warehouse' AS iceberg_catalog (
      TYPE iceberg,
      ENDPOINT 'http://localhost:8181',
      AUTHORIZATION_TYPE 'none'
    );
D SELECT * FROM iceberg_table_properties(iceberg_catalog.events.entity_events);
┌─────────────────────────────────────────────────────────┬─────────┐
│                           key                           │  value  │
│                         varchar                         │ varchar │
├─────────────────────────────────────────────────────────┼─────────┤
│ write.parquet.compression-codec                         │ zstd    │
│ commit.retry.total-timeout-ms                           │ 120000  │
│ commit.retry.min-wait-ms                                │ 3000    │
│ write.parquet.stats-enabled.column.wire_format_message  │ false   │
│ write.distribution-mode                                 │ hash    │
│ commit.retry.num-retries                                │ 5       │
│ write.parquet.stats-enabled.column.json_format_message  │ false   │
│ commit.retry.max-wait-ms                                │ 60000   │
│ owner                                                   │ root    │
└─────────────────────────────────────────────────────────┴─────────┘
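As a rough statement of the expected behaviour, the sketch below derives the set of columns that should have no stats from the table properties listed above. It is not Iceberg's implementation; the function name `columns_with_stats_disabled` is hypothetical, and treating the value case-insensitively is an assumption.

```python
# Prefix taken from the property name quoted in this issue.
STATS_ENABLED_PREFIX = "write.parquet.stats-enabled.column."


def columns_with_stats_disabled(properties):
    """Return the column names whose stats-enabled property is set to false."""
    disabled = set()
    for key, value in properties.items():
        # Assumption: the value is compared case-insensitively.
        if key.startswith(STATS_ENABLED_PREFIX) and value.strip().lower() == "false":
            disabled.add(key[len(STATS_ENABLED_PREFIX):])
    return disabled


# Table properties copied from the DuckDB output above.
table_properties = {
    "write.parquet.compression-codec": "zstd",
    "commit.retry.total-timeout-ms": "120000",
    "commit.retry.min-wait-ms": "3000",
    "write.parquet.stats-enabled.column.wire_format_message": "false",
    "write.distribution-mode": "hash",
    "commit.retry.num-retries": "5",
    "write.parquet.stats-enabled.column.json_format_message": "false",
    "commit.retry.max-wait-ms": "60000",
    "owner": "root",
}

expected_without_stats = columns_with_stats_disabled(table_properties)
print(sorted(expected_without_stats))
```

Both columns appear in the derived set, whereas the Parquet CLI output shows min / max values still present for json_format_message.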

I was able to reproduce this bug in the tests by changing the string here to "false" and flipping the boolean in the assertion here to false. The test then failed because stats were still being written for int_field.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), good first issue (Good for newcomers)
