… into the parquet table
…ds, interfacing with ConversionSource
I can do the first review for this @the-other-tim-brown @vinishjail97
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetFileConfig.java
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java
Why are we writing new parquet files again like this through the writer? I think there's some misunderstanding with the parquet incremental sync feature here.
Parquet Incremental Sync Requirements.
- You have a target table where parquet files [p1/f1.parquet, p1/f2.parquet, p2/f1.parquet] have been synced to Hudi, Iceberg and Delta, for example.
- Changes have then been made in the source: a new file was added in partition p1 and p2's file was deleted. The incremental sync should now pick up only these new changes.
@sapienza88 It's better to align on the approach first before we push PRs. Can you add the approach for parquet incremental sync to the PR description, or to a Google doc if possible?
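To make the requirement above concrete, here is a minimal sketch of the change set an incremental sync would need to derive for that example. It is not XTable code: the class `FileSetDiff` and its method are hypothetical names, and file paths are modelled as plain strings.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not part of XTable): computes which files were added
// and which were removed between the last synced state and the current listing.
final class FileSetDiff {
  final Set<String> added;
  final Set<String> removed;

  private FileSetDiff(Set<String> added, Set<String> removed) {
    this.added = added;
    this.removed = removed;
  }

  static FileSetDiff between(Set<String> previouslySynced, Set<String> currentListing) {
    // Files present now but not in the last synced state.
    Set<String> added = new TreeSet<>(currentListing);
    added.removeAll(previouslySynced);
    // Files in the last synced state that are no longer listed.
    Set<String> removed = new TreeSet<>(previouslySynced);
    removed.removeAll(currentListing);
    return new FileSetDiff(added, removed);
  }
}
```

For the example above, the previous state is [p1/f1.parquet, p1/f2.parquet, p2/f1.parquet]; if the new file is called p1/f3.parquet (an assumed name), the diff contains one added file and one removed file, and only those two entries need metadata updates in the targets.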
@sapienza88 XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference?
#550 (comment)
Fetch the parquet files that have been added since the last syncInstant to retrieve the change log. We can do this via the same list call; filtering files based on their creationTime is the simplest way, but it's expensive.
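A rough sketch of the list-and-filter idea, assuming a local directory and `java.nio.file` as a stand-in for the storage listing used in the project. The class and method names are hypothetical, and last-modified time is used in place of creationTime because creation time is not reliably available on every file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of XTable): lists every parquet file under a
// table root and keeps only those modified after the last sync instant.
final class ChangedParquetFiles {
  static List<Path> modifiedSince(Path tableRoot, Instant lastSyncInstant) throws IOException {
    List<Path> candidates;
    // One full recursive listing: simple, but the cost grows with table size.
    try (Stream<Path> paths = Files.walk(tableRoot)) {
      candidates = paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".parquet"))
          .collect(Collectors.toList());
    }
    List<Path> changed = new ArrayList<>();
    for (Path file : candidates) {
      // Only metadata is read here; no parquet data is written or rewritten.
      if (Files.getLastModifiedTime(file).toInstant().isAfter(lastSyncInstant)) {
        changed.add(file);
      }
    }
    Collections.sort(changed);
    return changed;
  }
}
```

The expense mentioned above shows up in the full walk: every file is listed on each sync even when only a few have changed.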
@vinishjail97 thanks for the suggestion, but that isn't helping. Could you elaborate on that idea and on how the metadata alone could be used to retrieve data from a particular (modification) date? At the very least, the current ConversionSource wasn't coded with that in mind.
@vinishjail97 I added some comments on the functions so that the approach is clearer. All of the suggestions above were also taken into account in my last commit.
XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference? I had written a few approaches on how to do incremental parquet sync.
@sapienza88 I'm adding a more detailed design and a class-level structure to unblock this PR.
Design Principle
Architecture
Use the file modification time as the commit identifier; with it you will be able to identify which files have been synced and which haven't. The files not yet synced need to have metadata generated. Future functionality, such as optimizing this and handling parquet files deleted from storage, can be added incrementally; I'm hoping to keep the scope low for this PR.
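As an illustration of using modification time as the commit identifier, here is a small sketch under stated assumptions: files are modelled as a map from path to modification time in epoch milliseconds, and `ModificationTimeCommits`, `pendingCommits` and `nextCheckpoint` are hypothetical names rather than classes from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of XTable): treats each distinct modification
// time as one commit and returns the commits that have not been synced yet.
final class ModificationTimeCommits {
  static TreeMap<Long, List<String>> pendingCommits(
      Map<String, Long> modificationTimeByFile, long lastSyncedCommit) {
    TreeMap<Long, List<String>> commits = new TreeMap<>();
    for (Map.Entry<String, Long> entry : modificationTimeByFile.entrySet()) {
      // Anything at or before the last synced commit is already in the targets.
      if (entry.getValue() > lastSyncedCommit) {
        commits.computeIfAbsent(entry.getValue(), t -> new ArrayList<>()).add(entry.getKey());
      }
    }
    // Sort file names so the result is deterministic.
    commits.values().forEach(files -> Collections.sort(files));
    return commits;
  }

  // The newest pending commit becomes the next checkpoint; if nothing is
  // pending, the checkpoint stays where it was.
  static long nextCheckpoint(TreeMap<Long, List<String>> commits, long lastSyncedCommit) {
    return commits.isEmpty() ? lastSyncedCommit : commits.lastKey();
  }
}
```

This only finds added or rewritten files. A file deleted from storage leaves no modification time behind, which is consistent with deferring delete handling to later work.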
…ds using the FileStatus' modifTime attribute
…ificationTime selector
…ppend and 2) filter for sync
: metadata.getBlocks().get(0).getColumns().get(0).getCodec())
.build();
});
Stream<ParquetFileInfo> getCurrentFileInfo() {
@the-other-tim-brown Would you want to rename this to the plural (getCurrentFilesInfo())?