Parquet Incremental Sync by sapienza88 · Pull Request #768 · apache/incubator-xtable

sapienza88 · 2025-12-10T19:54:49Z

Important Read

Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

(For example: This pull request implements the sync for delta format.)

Brief change log

(for example:)

Fixed JSON parsing error when persisting state
Added unit tests for schema evolution

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added TestConversionController to verify the change.
Manually verified the change by running a job locally.

rahil-c · 2025-12-15T16:19:52Z

I can do first review for this @the-other-tim-brown @vinishjail97

xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java

xtable-core/src/main/java/org/apache/xtable/parquet/ParquetFileConfig.java

xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java

vinishjail97 · 2025-12-17T19:16:16Z

xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java

Why are we writing new parquet files again like this through the writer? I think there's some misunderstanding with the parquet incremental sync feature here.

Parquet Incremental Sync Requirements.

You have a target table where parquet files [p1/f1.parquet, p1/f2.parquet, p2/f1.parquet] have been synced to hudi, iceberg and delta for example.

In the source, some changes have been made: a new file was added in partition p1 and p2's file was deleted. The incremental sync should now sync only these new changes.
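The scenario above can be sketched as a plain set difference between the file listing recorded at the last sync and the current listing. This is a minimal illustration, not XTable code: the class `ParquetListingDiff` and the file name `p1/f3.parquet` are invented for the example, and only paths are compared, so no Parquet data is read or written.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: diffs two file listings without touching any data files. */
final class ParquetListingDiff {
  final Set<String> added;
  final Set<String> removed;

  private ParquetListingDiff(Set<String> added, Set<String> removed) {
    this.added = added;
    this.removed = removed;
  }

  /** Compares the paths known at the last sync with the paths currently in storage. */
  static ParquetListingDiff between(Set<String> lastSynced, Set<String> current) {
    Set<String> added = new TreeSet<>(current);
    added.removeAll(lastSynced); // present now, unknown at the last sync
    Set<String> removed = new TreeSet<>(lastSynced);
    removed.removeAll(current); // known at the last sync, gone now
    return new ParquetListingDiff(added, removed);
  }

  public static void main(String[] args) {
    Set<String> lastSynced =
        new TreeSet<>(Arrays.asList("p1/f1.parquet", "p1/f2.parquet", "p2/f1.parquet"));
    // "p1/f3.parquet" is an invented name for the new file in partition p1.
    Set<String> current =
        new TreeSet<>(Arrays.asList("p1/f1.parquet", "p1/f2.parquet", "p1/f3.parquet"));
    ParquetListingDiff diff = between(lastSynced, current);
    System.out.println("added=" + diff.added + " removed=" + diff.removed);
  }
}
```

The two resulting sets are what an incremental sync would need to turn into "files added" and "files removed" entries in the target table formats' metadata.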

@sapienza88 It's better to align on the approach first before we push PRs. Can you add the approach for parquet incremental sync to the PR description or a Google Doc if possible?

@sapienza88 XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference?
#550 (comment)
Fetch the parquet files that have been added since the last syncInstant to retrieve the change log. We can do this via the same list call; filtering files based on their creationTime is the simplest way, but it's expensive.

@vinishjail97 thanks for the suggestion, but that isn't helping. Could you elaborate on that idea and on how you could manage the metadata only for the task of retrieving data from a particular (modification) date? At the very least, the current ConversionSource wasn't coded with that in mind.

sapienza88 · 2025-12-17T19:55:39Z

@vinishjail97 I added some comments on the functions so that the approach is clearer. All above suggestions were also taken into account in my last commit.

vinishjail97 · 2025-12-22T19:46:35Z

XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference? I had written a few approaches on how to do incremental parquet sync.
#550 (comment)

vinishjail97 · 2025-12-29T07:46:07Z

@sapienza88 I'm adding a more detailed design and a class level structure to unblock this PR.

Design Principle
XTable operates at a metadata level only. The current PR approach of writing new Parquet files with filtered data is incorrect. XTable should:

Discover existing Parquet files from storage
Generate table format metadata (Hudi, Iceberg, Delta) for those files
NEVER write new Parquet files or transform data.

Architecture

  ┌────────────────────────────────────────────────────────────┐
  │                  ParquetConversionSource                   │
  │  - Uses ParquetFileDiscovery to find files                 │
  │  - Converts file metadata to InternalDataFile              │
  │  - Returns snapshots and table changes                     │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │              ParquetFileDiscovery (new class)              │
  │  - Lists all .parquet files from filesystem                │
  │  - Filters files by modification time                      │
  │  - Returns lightweight file metadata                       │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │            FileSystem (HDFS/S3/GCS/Azure)                  │
  │  - fs.listFiles(basePath, recursive=true)                  │
  └────────────────────────────────────────────────────────────┘
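A minimal sketch of the discovery layer in the diagram is given below. It is an assumption-laden stand-in, not the proposed class itself: the diagram's `fs.listFiles(basePath, recursive=true)` presumably refers to Hadoop's `FileSystem`, whereas this example uses `java.nio.file` so that it runs without extra dependencies. The names `ParquetFileDiscoverySketch`, `FileMeta`, and `discover` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for the proposed ParquetFileDiscovery, using java.nio for illustration. */
final class ParquetFileDiscoverySketch {

  /** Lightweight file metadata: no Parquet footer or row data is read. */
  static final class FileMeta {
    final Path path;
    final long sizeBytes;
    final long modificationTimeMs;

    FileMeta(Path path, long sizeBytes, long modificationTimeMs) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.modificationTimeMs = modificationTimeMs;
    }
  }

  /** Recursively lists .parquet files under basePath modified strictly after sinceMs. */
  static List<FileMeta> discover(Path basePath, long sinceMs) throws IOException {
    List<Path> candidates;
    try (Stream<Path> paths = Files.walk(basePath)) {
      candidates =
          paths
              .filter(Files::isRegularFile)
              .filter(p -> p.getFileName().toString().endsWith(".parquet"))
              .sorted()
              .collect(Collectors.toList());
    }
    List<FileMeta> result = new ArrayList<>();
    for (Path p : candidates) {
      long modified = Files.getLastModifiedTime(p).toMillis();
      if (modified > sinceMs) {
        result.add(new FileMeta(p, Files.size(p), modified));
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // Empty placeholder files are enough here: only names and timestamps are inspected.
    Path base = Files.createTempDirectory("parquet-table");
    Path partition = Files.createDirectories(base.resolve("p1"));
    Path oldFile = Files.createFile(partition.resolve("f1.parquet"));
    Path newFile = Files.createFile(partition.resolve("f2.parquet"));
    long lastSyncMs = 1_700_000_000_000L; // arbitrary example instant
    Files.setLastModifiedTime(oldFile, FileTime.fromMillis(lastSyncMs - 60_000));
    Files.setLastModifiedTime(newFile, FileTime.fromMillis(lastSyncMs + 60_000));
    for (FileMeta meta : discover(base, lastSyncMs)) {
      System.out.println(base.relativize(meta.path) + " modified at " + meta.modificationTimeMs);
    }
  }
}
```

Passing `0` as `sinceMs` returns every Parquet file, which would correspond to a full snapshot; passing the last sync instant returns only the files a later incremental sync still has to describe.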

Use file modification time as the commit identifier; with it, you will be able to identify which files have been synced and which haven't. The files not yet synced need to have metadata generated. Future functionality, such as optimizing this or handling Parquet files deleted from storage, can be handled incrementally; I'm hoping to keep the scope low for this PR.
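One way to read "modification time as commit identifier" is sketched below: files sharing a modification time form one commit, and every commit newer than the last synced identifier is still pending. This is an interpretation for illustration only; the class `ModificationTimeCommits` and its methods are hypothetical, and the strictly-greater comparison assumes no file arrives later with a timestamp equal to or older than the last synced one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: treats each distinct file modification time as one commit id. */
final class ModificationTimeCommits {

  /** Groups file paths by modification time, ordered from oldest to newest commit. */
  static NavigableMap<Long, List<String>> toCommits(Map<String, Long> modificationTimeByPath) {
    NavigableMap<Long, List<String>> commits = new TreeMap<>();
    // Iterate in path order so each commit lists its files deterministically.
    for (Map.Entry<String, Long> e : new TreeMap<>(modificationTimeByPath).entrySet()) {
      commits.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
    }
    return commits;
  }

  /** Returns the commits strictly newer than the last synced commit id. */
  static NavigableMap<Long, List<String>> pending(
      NavigableMap<Long, List<String>> commits, long lastSyncedCommitId) {
    return commits.tailMap(lastSyncedCommitId, false);
  }

  public static void main(String[] args) {
    Map<String, Long> files = new HashMap<>();
    files.put("p1/f1.parquet", 100L); // already synced in this example
    files.put("p1/f2.parquet", 200L);
    files.put("p2/f1.parquet", 200L);
    files.put("p2/f2.parquet", 300L);
    NavigableMap<Long, List<String>> commits = toCommits(files);
    pending(commits, 100L)
        .forEach((commitId, paths) -> System.out.println(commitId + " -> " + paths));
  }
}
```

After a successful sync, the largest key of the pending map would become the new last synced identifier.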

sapienza88 · 2026-02-04T21:03:02Z

xtable-core/src/main/java/org/apache/xtable/parquet/ParquetDataManager.java

-                          : metadata.getBlocks().get(0).getColumns().get(0).getCodec())
-                  .build();
-            });
+  Stream<ParquetFileInfo> getCurrentFileInfo() {


@the-other-tim-brown Would you want to rename this to the plural (getCurrentFilesInfo())?

given a parquet file return data from a certain modification time

e541a71

sapienza88 changed the title ~~Parquet Incremental Sync: Given a parquet file return data from a certain modification time~~ Parquet Incremental Sync Dec 10, 2025

Selim Soufargi added 3 commits December 13, 2025 18:20

create the path based on the partition then inject the file to append into the parquet table

15e282a

Handle case of path construction with file partitioned over many fields, interfacing with ConversionSource

2ee71c9

test append Parquet file into table init

6032e5f

add function to test schema equivalence before appending

f6fdc72

vinishjail97 self-requested a review December 16, 2025 08:31

Selim Soufargi added 2 commits December 16, 2025 12:59

construct path to inject to based on partitions

a94c3f3

fix imports

f8bdbfe

vinishjail97 requested changes Dec 17, 2025

refactoring (lombok, logs, javadocs and function and approach commenting)

c04a983

Selim Soufargi added 15 commits January 1, 2026 18:03

use appendFile to append a file into a table while tracking the appends using the FileStatus' modifTime attribute

5f2541e

find the files that satisfy to the time condition

47e7076

treat appends as separate files to add in the target partition folder

fbb09ec

update approach: selective block compaction

fe19a60

update approach: added a basic test to check data selection using modificationTime selector

da7f300

fix append based on partition value

a8730b7

fix test with basic example where partitions are not considered

d19ccbf

fix test with basic example where partitions are not considered2

aecb204

fix test with basic example where partitions are not considered3

0ec8cbb

test with time of last append is now

9cb75df

test appendFile with Parquet: TODO test with multiple partitions 1) append and 2) filter for sync

9e125f2

merge recursively one partition files

233ca77

fix paths for files to append

b4cba5a

fix bug of appending file path

a564b29

fix bug of schema

d1ceafb

sapienza88 commented Feb 4, 2026

the-other-tim-brown and others added 29 commits February 11, 2026 20:33

add TestParquetDataManager, simplify data manager methods

64a5a2d

make IT parameterized, add incremental sync

2135ccf

fix schema extraction to fix snapshot sync

e76ac21

add optimizations

9901c1b

fix for ParquetDataManager Test (mock())

a2bb6c3

before resycing delete the first synced files metadata

2d23f31

in order to append parquet files, perform union then spark overwrite (not append)

ceb924c

solving directory issue when overwriting data using spark

5b0a9fd

solving directory issue when overwriting data using spark2

80e1927

solving directory issue when overwriting data using spark3

0aedd70

solving directory issue when overwriting data using spark4

5cbeeeb

solving directory issue when overwriting data using spark5

13e5a65

solving directory issue when overwriting data using spark6

fa77646

solving directory issue when overwriting data using spark7

e53fa4b

revert writeData changes

a6b75d5

solving directory issue when overwriting data using spark8

bdc9f40

solving directory issue when overwriting data using spark9

43d986d

solving directory issue when overwriting data using spark9

5f67a0f

revert changes

dae07b7

testing without cleaning metadata for the second sync

9e0de0a

testing without cleaning metadata for the second sync

8c1b8f6

testing without cleaning metadata for the second sync

84bae49

testing without cleaning metadata for the second sync

6ce28a3

testing without cleaning metadata for the second sync

cf2564c

revert changes

6cb87ee

revert changes

5c25093

revert changes

17a3134

use no tempDir

5b0d8dc

revert changes

8db84a0

sapienza88 wants to merge 86 commits into apache:main from sapienza88:parquet_incr_sync

4 participants
