Add implementation status for cuDF by mhaseeb123 · Pull Request #99 · apache/parquet-site

mhaseeb123 · 2025-01-29T01:31:55Z

This PR adds the implementation status for cuDF to the Parquet site.

mhaseeb123 · 2025-01-29T01:43:14Z

+| External column data (1)                     |       |        |       |       |  (W)  |
+| Row group "Sorting column" metadata (2)      |       |        |       |       |  (W)  |
+| Row group pruning using statistics           |       |        |       |       |  ✅   |
+| Row group pruning using bloom filter         |       |        |       |       |  ✅   |


Please correct me if I am wrong, but I believe bloom filters are used to prune row groups rather than pages.
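To make the two pruning rows in the diff above concrete, here is a minimal sketch of row-group pruning for an equality predicate. It is not cuDF's implementation: `RowGroupMeta`, `may_contain`, and `prune_row_groups` are hypothetical names, and a plain Python `set` stands in for a real bloom filter (a real filter can report false positives, but never false negatives).

```python
from dataclasses import dataclass


@dataclass
class RowGroupMeta:
    """Hypothetical per-row-group metadata for a single column."""
    min_value: int
    max_value: int
    bloom: set  # stand-in for a bloom filter over the column chunk's values


def may_contain(rg: RowGroupMeta, value: int) -> bool:
    """Return False only when the row group provably lacks `value`."""
    if value < rg.min_value or value > rg.max_value:
        return False  # pruned using min/max statistics
    return value in rg.bloom  # pruned using the (stand-in) bloom filter


def prune_row_groups(row_groups, value):
    """Indices of row groups that must still be read for `column == value`."""
    return [i for i, rg in enumerate(row_groups) if may_contain(rg, value)]


row_groups = [
    RowGroupMeta(0, 9, {1, 3, 5}),
    RowGroupMeta(10, 19, {10, 12}),
    RowGroupMeta(20, 29, {21, 25}),
]
survivors = prune_row_groups(row_groups, 12)
```

Both checks operate on whole row groups, which matches the wording of the two table rows.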

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

mhaseeb123 · 2025-01-30T23:27:28Z

 The value in each box means:
 * ✅: supported
 * ❌: not supported
+* (R/W): partial reader/writer only support


Added an extra entry to the legend to allow partial reader-only or writer-only support. Happy to remove it and leave the corresponding boxes blank if needed.

wgtmac

Thanks for the update!

wgtmac · 2025-02-02T03:02:53Z

 * `Java`: [parquet-java](https://github.com/apache/parquet-java)
 * `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
 * `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
+* `CUDA C++`: [cudf](https://github.com/rapidsai/cudf)


Should this be cuDF? Or is CUDA C++ the more official name for it?

cuDF is the name of the implementing dataframes library and CUDA C++ is the language being used for implementation. Isn't the convention here like:

* `language`: [impl name](link)

I would prefer cuDF here. I think the original intention was to include implementations governed by the Parquet community or the Apache Software Foundation. It would be better to use the library name to encourage other Parquet implementations to appear here. WDYT? @alamb

I also recall that the idea here was to list library names (so this would be better as cuDF) not languages.

It just so happens that we only had one example library for each language so there was (before this PR) a 1-1 correspondence.

Does that make sense @mhaseeb123 ?

Sounds good. I will update this

wgtmac · 2025-02-03T09:32:39Z

cc @etseidl @alamb

etseidl

Thanks for getting the party started @mhaseeb123!

etseidl

Looks good now. Thanks!

alamb · 2025-02-04T14:04:51Z

> This PR adds the implementation status for cuDF to the Parquet site.

AMAZING! Thank you @mhaseeb123

I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

mhaseeb123 · 2025-02-04T19:15:39Z

> I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

Certainly. The (R) label I used means that the cuDF Parquet reader supports the feature (decompressing a codec, decoding an encoding type, or reading and using bloom filters, depending on the sub-section it's used in), but the writer can't compress, encode, or write it. Similarly, a (W) label means the opposite: the writer can write a certain field or feature, but the reader is unable to read or use it.

Does that make sense?
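The rule described above can be sketched as a small function. This is only an illustration of the legend, not a script used for the PR; `status_label` is a hypothetical name.

```python
def status_label(reader: bool, writer: bool) -> str:
    """Map reader/writer support for one feature to a table cell label."""
    if reader and writer:
        return "✅"   # supported on both sides
    if reader:
        return "(R)"  # reader-only: can read/decode/use, cannot write
    if writer:
        return "(W)"  # writer-only: can write, cannot read/use
    return "❌"       # not supported


# Example: a feature that can be read but not written.
cell = status_label(reader=True, writer=False)
```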

mhaseeb123 · 2025-02-04T19:17:33Z


 ### Physical types

-| Data type                                 | C++   | Java   | Go    | Rust  |


Simply removed one space in the Java column so all cols have a consistent width for aesthetic purposes.

alamb · 2025-02-04T20:41:02Z

> Certainly. The (R) label I used means that the cuDF Parquet reader supports the feature (decompressing a codec, decoding an encoding type, or reading and using bloom filters, depending on the sub-section it's used in), but the writer can't compress, encode, or write it. Similarly, a (W) label means the opposite: the writer can write a certain field or feature, but the reader is unable to read or use it.
>
> Does that make sense?

Yes, for sure -- I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

mhaseeb123 · 2025-02-05T02:14:38Z

> I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

We have relevant gtests and pytests in cudf for most if not all the features but collecting them along with input/output files wouldn't be feasible. Sorry!

wgtmac

+1

Thanks @mhaseeb123 and @bdice @vuule @etseidl @alamb for review!

alamb · 2025-02-07T18:16:43Z

🚀

Add implementation status for cudf

8d8c94d

mhaseeb123 changed the title ~~Add implementation status for cuDF~~ 🚧 Add implementation status for cuDF Jan 29, 2025

mhaseeb123 commented Jan 29, 2025

bdice reviewed Jan 29, 2025

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

mhaseeb123 and others added 2 commits January 28, 2025 18:27

Update content/en/docs/File Format/implementationstatus.md

2fdbd6e

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

Change CUDA to CUDA C++

ec8b66e

vuule reviewed Jan 29, 2025

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

vuule reviewed Jan 29, 2025

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

Updates from reviews

3836c22

mhaseeb123 requested a review from vuule January 29, 2025 20:58

mhaseeb123 added 2 commits January 29, 2025 15:39

Updates for BIT_PACKED reader support

b178477

Update reader-only support for bloom filters

8ee07cc

vuule approved these changes Jan 30, 2025

mhaseeb123 marked this pull request as ready for review January 30, 2025 23:26

mhaseeb123 changed the title ~~🚧 Add implementation status for cuDF~~ Add implementation status for cuDF Jan 30, 2025

mhaseeb123 commented Jan 30, 2025

wgtmac reviewed Feb 2, 2025

etseidl reviewed Feb 3, 2025

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

Apply suggestions

7348ec4

mhaseeb123 requested a review from etseidl February 3, 2025 23:09

mhaseeb123 commented Feb 3, 2025

Comment thread content/en/docs/File Format/implementationstatus.md Outdated

etseidl approved these changes Feb 3, 2025

Use impl name (cuDF) instead of language (CUDA C++)

3bf79d7

mhaseeb123 commented Feb 4, 2025

wgtmac approved these changes Feb 5, 2025

wgtmac merged commit 4557062 into apache:production Feb 5, 2025

etseidl mentioned this pull request Feb 6, 2025

Add arrow-rs column to Parquet implementation page on parquet-site apache/arrow-rs#7088

Closed

wgtmac mentioned this pull request Feb 20, 2025

Add implementation status of javascript hyparquet #102

Merged


6 participants
