Column statistics extraction with Parquet footer fallback for Delta and Iceberg#760
Column statistics extraction with Parquet footer fallback for Delta and Iceberg#760dhama-shashank-meesho wants to merge 2 commits intoapache:mainfrom
Conversation
|
@dhama-shashank-meesho can you file an issue summarizing the issues you encountered and then fill out the PR template? It will make it easier for other users to search for bugs that they encountered as well and see that you have fixed it. |
| int openParenIndex = typeName.indexOf("("); | ||
| String trimmedTypeName = openParenIndex > 0 ? typeName.substring(0, openParenIndex) : typeName; | ||
| switch (trimmedTypeName) { | ||
| case "short": |
There was a problem hiding this comment.
Can you update the unit tests to cover this case?
| * @param fields the schema fields for which to extract statistics | ||
| * @return FileStats with extracted statistics, or empty stats if reading fails | ||
| */ | ||
| private FileStats readStatsFromParquetFooter( |
There was a problem hiding this comment.
Let's also make sure there is a test case to cover this flow. It can be a basic one, the detailed coverage for translating from parquet stats to our internal stats representation will be covered in #748
| "Failed to read stats from Parquet footer for file {}: {}. " | ||
| + "File will be included without column statistics.", | ||
| addFile.path(), | ||
| e.getMessage()); |
There was a problem hiding this comment.
| "Failed to read stats from Parquet footer for file {}: {}. " | |
| + "File will be included without column statistics.", | |
| addFile.path(), | |
| e.getMessage()); | |
| "Failed to read stats from Parquet footer for file {}. " | |
| + "File will be included without column statistics.", | |
| addFile.path(), | |
| e); |
This will log out the full exception stacktrace to provide more details on the failure which makes it easier to debug.
| log.debug("Table has no snapshot, skipping file scan (table is empty)"); | ||
| } | ||
|
|
||
| // Filter out files that already exist in Iceberg |
There was a problem hiding this comment.
This is not required, the applyDiff is only for incremental sync. It assumes that the table is already synced once.
| long recordCount = dataFile.getRecordCount(); | ||
| List<ColumnStat> columnStats; | ||
|
|
||
| // For Parquet files, ALWAYS read from footer to match native Iceberg behavior |
There was a problem hiding this comment.
This is avoided as it introduces overheads and the source stats are typically from the files themselves. Iceberg writers actually generate these stats while writing instead of reading from the footer to avoid this same overhead.
There was a problem hiding this comment.
As we can achive the native iceberg performance ?
There was a problem hiding this comment.
I'm not sure I understand this question in this context.
Yes I have added |
|
@dhama-shashank-meesho Can you share the GH issue? I can't seem to find it. |
Important Read
What is the purpose of the pull request
This pull request fixes critical issues with column statistics extraction and handling across Delta Lake and Iceberg table formats. The changes ensure accurate and complete statistics are extracted from Parquet files, with proper fallback mechanisms when statistics are missing from metadata checkpoints.
Brief change log
Delta Lake Statistics Extraction
DeltaStatsExtractorto read column statistics directly from Parquet file footers when Delta checkpoint statistics are NULL or emptyIceberg Statistics Extraction
IcebergDataFileUpdatesSyncto always read statistics from Parquet footers for Parquet files, matching native Iceberg behavior for accuracy and completenessParquet Statistics Conversion
ParquetStatsConverterUtilto safely handle type conversions and prevent ClassCastException errorsOther Improvements
DeltaSchemaExtractorfor better schema handlingVerify this pull request
This change added tests and can be verified as follows:
TestIcebergSyncwhich verify statistics extraction and conversion