Add implementation status for cuDF #99

Merged: wgtmac merged 8 commits into apache:production from mhaseeb123:cudf-impl-status on Feb 5, 2025

@mhaseeb123 commented Jan 29, 2025

Contributor

This PR adds the implementation status for cuDF to the Parquet site.

@mhaseeb123 changed the title from "Add implementation status for cuDF" to "🚧 Add implementation status for cuDF" on Jan 29, 2025
| External column data (1) | | | | | (W) |
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
| Row group pruning using bloom filter | | | | | |

Contributor Author

Please correct me if I am wrong but I believe the bloom filters are used to prune row groups instead of pages.
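
To illustrate the row-group-level pruning mentioned here, below is a minimal sketch, assuming one bloom filter per row group for a single column. `ToyBloomFilter` and `prune_row_groups` are hypothetical names invented for this example, and the hashing scheme is a simple stand-in, not Parquet's split-block bloom filter and not cuDF's implementation.

```python
import hashlib


class ToyBloomFilter:
    """Toy bloom filter for illustration; not Parquet's split-block format."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive `num_hashes` bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def prune_row_groups(filters, needle):
    """Return indices of row groups that may contain `needle`."""
    return [i for i, f in enumerate(filters) if f.might_contain(needle)]


# Hypothetical column values, one list per row group.
row_groups = [["a", "b"], ["c", "d"], ["e", "a"]]
filters = []
for values in row_groups:
    bloom = ToyBloomFilter()
    for v in values:
        bloom.add(v)
    filters.append(bloom)

# Row groups 0 and 2 are always kept; row group 1 is skipped unless the
# filter reports a false positive.
print(prune_row_groups(filters, "a"))
```

A bloom filter never yields false negatives, so a row group that actually holds the value is never skipped; a false positive only costs an unnecessary read.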

Comment thread content/en/docs/File Format/implementationstatus.md Outdated
mhaseeb123 and others added 2 commits January 28, 2025 18:27
@mhaseeb123 requested a review from vuule on January 29, 2025 20:58
@mhaseeb123 marked this pull request as ready for review on January 30, 2025 23:26
@mhaseeb123 changed the title from "🚧 Add implementation status for cuDF" to "Add implementation status for cuDF" on Jan 30, 2025
The value in each box means:
* ✅: supported
* ❌: not supported
* (R/W): partial reader/writer only support

Contributor Author

I added an extra entry to the legend to allow partial (reader-only or writer-only) support. Happy to remove it and leave the corresponding boxes blank if needed.

@wgtmac left a comment

Member

Thanks for the update!

* `Java`: [parquet-java](https://github.com/apache/parquet-java)
* `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
* `CUDA C++`: [cudf](https://github.com/rapidsai/cudf)

Member

Should this be cuDF? Or is CUDA C++ a more official name for it?

@mhaseeb123 commented Feb 3, 2025

Contributor Author

cuDF is the name of the dataframe library that implements Parquet support, and CUDA C++ is the language it is implemented in. Isn't the convention here the following?

* `language`: [impl name](link)

Member

I would prefer cuDF here. I think the original intention was to include implementations governed by the Parquet community or the Apache Software Foundation. It would be better to use the library name to encourage other Parquet implementations to appear here. WDYT? @alamb

Collaborator

I also recall that the idea here was to list library names (so this would be better as cuDF) not languages.

It just so happens that we only had one example library for each language so there was (before this PR) a 1-1 correspondence.

Does that make sense, @mhaseeb123?

Contributor Author

Sounds good. I will update this.

@wgtmac commented Feb 3, 2025

Member

cc @etseidl @alamb

@etseidl left a comment

Contributor

Thanks for getting the party started @mhaseeb123!

@mhaseeb123 requested a review from etseidl on February 3, 2025 23:09

@etseidl left a comment

Contributor

Looks good now. Thanks!

@alamb commented Feb 4, 2025

Collaborator

> This PR adds the implementation status for cuDF to the Parquet site.

AMAZING! Thank you @mhaseeb123

I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

@mhaseeb123

Contributor Author

> I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

Certainly. The (R) label I used means that the cuDF Parquet reader supports the feature (decompressing a codec, decoding an encoding type, or reading and using bloom filters, depending on the sub-section it appears in) but the writer cannot produce it (compress with that codec, encode with that encoding, or write bloom filters). Conversely, a (W) label means that the writer can write a given field or feature but the reader is unable to read or use it.

Does that make sense?
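
As a rough illustration of the labelling rule described above, here is a minimal Python sketch. The function name `legend_mark`, its boolean parameters, and the feature names in the loop are hypothetical, made up for this example; they are not part of cuDF or of the Parquet site tooling, and the support values shown are placeholders rather than real results.

```python
def legend_mark(reader_supported: bool, writer_supported: bool) -> str:
    """Map reader/writer support for one feature to a legend mark."""
    if reader_supported and writer_supported:
        return "✅"  # fully supported
    if reader_supported:
        return "(R)"  # reader can consume it, writer cannot produce it
    if writer_supported:
        return "(W)"  # writer can produce it, reader cannot consume it
    return "❌"  # not supported


# Placeholder features with made-up (reader, writer) support flags.
example_support = {
    "feature A": (True, True),
    "feature B": (True, False),
    "feature C": (False, True),
    "feature D": (False, False),
}

for feature, (can_read, can_write) in example_support.items():
    print(f"| {feature} | {legend_mark(can_read, can_write)} |")
```

In practice each boolean would come from a round-trip test against the implementation, such as the gtests and pytests that a library already has for the feature.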


### Physical types

| Data type | C++ | Java | Go | Rust |

Contributor Author

I simply removed one space in the Java column so that all columns have a consistent width, purely for aesthetic purposes.

@alamb commented Feb 4, 2025

Collaborator

> Certainly. The (R) label I used means that the cuDF Parquet reader supports the feature (decompressing a codec, decoding an encoding type, or reading and using bloom filters, depending on the sub-section it appears in) but the writer cannot produce it (compress with that codec, encode with that encoding, or write bloom filters). Conversely, a (W) label means that the writer can write a given field or feature but the reader is unable to read or use it.
>
> Does that make sense?

Yes, for sure. I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

@mhaseeb123

Contributor Author

> I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

We have relevant gtests and pytests in cudf for most, if not all, of the features, but collecting them along with their input/output files wouldn't be feasible. Sorry!

@wgtmac left a comment

Member

+1

Thanks @mhaseeb123, and thanks @bdice @vuule @etseidl @alamb for the review!

@alamb commented Feb 7, 2025

Collaborator

🚀
