PARQUET-2470: Update website with larger ecosystem emphasis#59

Merged: wgtmac merged 5 commits into apache:production from alamb:alamb/less_hadoop on May 16, 2024

alamb (Collaborator) commented May 13, 2024

Rationale

As described on https://issues.apache.org/jira/browse/PARQUET-2470, Parquet's role in the analytics ecosystem is substantial.

However, https://parquet.apache.org/ currently emphasizes Parquet's role in the Hadoop ecosystem. I think this causes confusion in several ways:

  1. It implies that Parquet is only focused on Hadoop, when I think it is a critical technology across other ecosystems that are unrelated to Hadoop (e.g. Apache Iceberg, Delta Lake, etc.)
  2. It may further the perception that the Apache Parquet project only focuses on / cares about the Hadoop / Java implementation

Changes

Update the home page content to mirror the Apache Project Description https://projects.apache.org/project.html?parquet (which does not mention Hadoop specifically)

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, and Python.

Before this PR

Screenshot 2024-05-13 at 4 13 31 PM

After the PR

Screenshot 2024-05-13 at 4 15 17 PM

Comment thread content/en/_index.md
vinooganesh (Collaborator) commented

+1!

etseidl (Contributor) left a comment

+1 Hadoop not required 😄

Comment thread content/en/_index.md Outdated
Comment thread content/en/docs/Overview/_index.md Outdated
- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+ Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
+ It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
+ Parquet is available in multiple languages including Java, C++, and Python.

Contributor

Echoing @amoeba, perhaps leave out specific languages and leave it vague.

alamb (Collaborator, Author)

I agree it is strange to have this mention of specific technologies -- maybe we can make all three locations consistent (and more general)

I think mentioning implementation (both as end-user software and as libs) is valuable but shouldn't be part of the elevator pitch. Other formats usually solve this by a dedicated sub-section or page, e.g.:

This would also allow multiple implementations for a single language, which sometimes can be valuable (e.g. if you have a backwards compatible, conservative variant and a fancy new one).

alamb (Collaborator, Author)

I agree 100% -- I believe we are beginning to create just such a list on #53

This set of examples is good. I have added it to https://issues.apache.org/jira/browse/PARQUET-2310 which tracks these examples

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

julienledem (Member) left a comment

This looks great. Thank you for taking the initiative. Hadoop is not required indeed. Perhaps at some point we should rename parquet-mr to parquet-java?

alamb (Collaborator, Author) left a comment

Per the feedback here https://github.com/apache/parquet-site/pull/59/files#r1599769911 I have updated the text in all three places to be:

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
It provides high performance data compression and encoding schemes to handle complex data in bulk.

Screenshot 2024-05-15 at 7 42 40 AM

From my perspective this PR is now ready to merge

Thanks everyone for the reviews and comments


- Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+ Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
+ It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

vinooganesh (Collaborator) commented May 15, 2024

Did we mean for this to say "high performance compression" or is it "high performance, compression"? I think it may be the latter. Or maybe "It provides performant compression and encoding schemes..." I was thinking the first versions sound too much like the compression tool rather than the format

alamb (Collaborator, Author)

I didn't mean for the comma or lack thereof to carry any additional semantic meaning. I am happy to put a comma there if you like

Collaborator

No really strong feelings, was just wondering if there was a subtextual focus intended

wgtmac (Member) commented May 16, 2024

Let me merge this. Thanks everyone!

wgtmac merged commit 5f690a3 into apache:production May 16, 2024
alamb deleted the alamb/less_hadoop branch May 16, 2024 16:43

alamb (Collaborator, Author) commented May 16, 2024

Thanks @wgtmac

julienledem (Member) commented

Thanks!

7 participants