feat: pxf fdw support parallel scan #61

Open
MisterRaindrop wants to merge 3 commits into apache:main from MisterRaindrop:liuxiaoyu/paralle_fdw_2

MisterRaindrop commented Feb 10, 2026

#58


Change logs

This PR adds parallel scan support to the PXF FDW. The implementation depends on a corresponding commit in the kernel.

The code is not ready for review yet. This commit is only an exploratory submission for FDW parallelization. More importantly, I need to make sure the core kernel part is solid first.

apache/cloudberry#1571

MisterRaindrop and others added 2 commits February 10, 2026 17:41
- fdw support pg parallel scan
- add parallel scan correctness tests for PXF
MisterRaindrop self-assigned this Feb 10, 2026
ostinru self-requested a review February 10, 2026 13:49
// Parallel mode: only process the specified fragment
Fragment specificFragment = fragmenterService.getFragmentByIndex(
context, context.getSpecificFragmentIndex());
fragments = java.util.Collections.singletonList(specificFragment);

NIT: import java.util.Collections instead of using the fully qualified name?

slock_t mutex; /* mutex for accessing shared state */
int total_fragments; /* total number of fragments */
int next_fragment; /* next fragment index to be assigned */
bool finished; /* true if all fragments have been processed */

True if all fragments have been processed... or also if the scan was cancelled?

Also, what is the purpose of a write-only variable?
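
To make the question concrete, here is a minimal sketch of a fragment counter that needs no separate `finished` flag, because "done" can be derived from `next_fragment >= total_fragments`. This is not the PR's code: `FragmentQueue` and the `fragment_queue_*` helpers are hypothetical names, and a C11 `atomic_flag` spinlock stands in for PostgreSQL's `slock_t` so that the example compiles on its own.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared state quoted above. */
typedef struct FragmentQueue
{
    atomic_flag lock;            /* stand-in for slock_t (assumption) */
    int         total_fragments; /* total number of fragments */
    int         next_fragment;   /* next fragment index to be assigned */
} FragmentQueue;

static void
fragment_queue_lock(FragmentQueue *q)
{
    /* Spin until the flag was previously clear, i.e. we own the lock. */
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;
}

static void
fragment_queue_unlock(FragmentQueue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static void
fragment_queue_init(FragmentQueue *q, int total_fragments)
{
    atomic_flag_clear(&q->lock);
    q->total_fragments = total_fragments;
    q->next_fragment = 0;
}

/* Claim the next unassigned fragment; returns -1 when none are left. */
static int
fragment_queue_claim(FragmentQueue *q)
{
    int claimed = -1;

    fragment_queue_lock(q);
    if (q->next_fragment < q->total_fragments)
        claimed = q->next_fragment++;
    fragment_queue_unlock(q);

    return claimed;
}

/* "Finished" is derived from the counters instead of stored separately. */
static bool
fragment_queue_exhausted(FragmentQueue *q)
{
    bool exhausted;

    fragment_queue_lock(q);
    exhausted = (q->next_fragment >= q->total_fragments);
    fragment_queue_unlock(q);

    return exhausted;
}
```

Each worker would call `fragment_queue_claim` in a loop and stop on `-1`. Note that "exhausted" here only means every fragment has been handed out, not that every fragment has been fully read or that the scan was cancelled; those states would need their own tracking.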

@MisterRaindrop

@ostinru
Thanks for the review. The code is not ready for review yet.

My approach to parallel processing in the kernel is still too simplistic. I may change or refactor it later.

MisterRaindrop commented Feb 12, 2026

All deployments are local.
Sizes: 100MB, 1GB, 10GB
Workers: 4
Format: CSV

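| Size | Rows | Workers | COUNT seq (ms) | COUNT par (ms) | COUNT speedup | SUM seq (ms) | SUM par (ms) | SUM speedup |
|---|---|---|---|---|---|---|---|---|
| 100MB x 20 files | 487,700 | 4 | 282 | 311 | 0.91x | 290 | 188 | 1.54x |
| 1GB x 20 files | 4,994,140 | 4 | 2352 | 1514 | 1.55x | 2448 | 1314 | 1.86x |
| 10GB x 20 files | 49,941,480 | 4 | 21524 | 11589 | 1.86x | 21954 | 11547 | 1.90x |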

The good news from this exploration is that parallelization does improve performance. For small data volumes the gain is not obvious, and the parallel run can even be slower than the sequential one (0.91x for COUNT at 100MB). The improvement only becomes noticeable at larger data volumes.

However, the improvement still falls short of what I expected. In theory, the speedup should be close to the number of workers (4 here), but the best measured result is 1.90x. The gap may be due to I/O or CPU bottlenecks. I will investigate further.
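
As a rough way to quantify that gap, the sketch below applies Amdahl's law to the measured numbers. Treating the scan as a fixed serial part plus a perfectly parallel part is an assumption, not something measured in this PR, and the function names are made up for illustration. Under that model, a speedup S with n workers implies a parallel fraction p = (1 - 1/S) / (1 - 1/n).

```c
/* Speedup from two wall-clock timings in milliseconds. */
static double
measured_speedup(double seq_ms, double par_ms)
{
    return seq_ms / par_ms;
}

/*
 * Amdahl's law: speedup = 1 / ((1 - p) + p / workers),
 * where p is the fraction of the work that runs in parallel.
 */
static double
amdahl_speedup(double parallel_fraction, int workers)
{
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / (double) workers);
}

/*
 * Invert Amdahl's law: given a measured speedup and the worker count
 * (workers must be > 1), estimate the parallel fraction p.
 */
static double
amdahl_parallel_fraction(double speedup, int workers)
{
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / (double) workers);
}
```

For the 10GB SUM row, `measured_speedup(21954, 11547)` is about 1.90, and `amdahl_parallel_fraction` with 4 workers gives roughly 0.63. If the model holds, about a third of the runtime is serial, and adding workers alone could not push the speedup much beyond roughly 2.7x.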

- Introduced virtual segment ID handling for parallel execution in Cloudberry.
- Added PxfBridgeImportStartVirtual function to manage imports with virtual segment IDs.
- Updated PxfFdwScanState structure to include fields for gang-parallel execution.
- Enhanced foreign scan functions to support gang-parallel mode, ensuring unique fragment distribution among workers.
- Implemented initialization and cleanup routines for gang-parallel state management.
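
The commit message does not spell out how virtual segment IDs map to fragments. The following is only a sketch of one scheme that would satisfy the "unique fragment distribution" property, assuming a round-robin rule over virtual IDs. All names and the ID formula are hypothetical and are not taken from the PR.

```c
#include <stdbool.h>

/*
 * Hypothetical: give every (segment, worker) pair its own virtual ID by
 * numbering the workers of segment 0 first, then segment 1, and so on.
 */
static int
virtual_segment_id(int segment_id, int worker_index, int workers_per_segment)
{
    return segment_id * workers_per_segment + worker_index;
}

/* Total number of virtual segments across the cluster (hypothetical). */
static int
virtual_segment_count(int segment_count, int workers_per_segment)
{
    return segment_count * workers_per_segment;
}

/*
 * Assumed round-robin rule: fragment i belongs to the virtual segment
 * whose ID equals i modulo the number of virtual segments, so each
 * fragment has exactly one owner.
 */
static bool
fragment_belongs_to(int fragment_index, int virtual_id, int virtual_count)
{
    return fragment_index % virtual_count == virtual_id;
}
```

With 2 segments and 2 workers each there are 4 virtual segments, and fragment 5 would go to virtual ID 1 (segment 0, worker 1) under this rule. A static mapping like this needs no shared counter, at the cost of not rebalancing when fragments differ in size.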