Support Spark 4.x #450

Merged
carsonwang merged 39 commits into ray-project:master from pang-wu:pang/spark4
Jun 10, 2026

@pang-wu pang-wu commented Dec 8, 2025

Collaborator

This PR adapts RayDP to Spark 4.x but leaves the following work for future improvement:

  1. Support TensorFlow 2.16+ (see https://keras.io/getting_started/#tensorflow--keras-2-backwards-compatibility) and NumPy 2.x.
  2. Support Python 3.12. We deprecated Python 3.9 because Spark no longer supports it; the Python build system still needs to be modernized.
  3. Deprecate Ray AIR.

To make the tests pass, this PR is based on #458. Once #458 is merged, this PR should be rebased again.

@pang-wu pang-wu changed the title from "Support SPark 4.0.0" to "Support Spark 4.0.0" Dec 8, 2025
@pang-wu pang-wu force-pushed the pang/spark4 branch 7 times, most recently from 21be2c9 to 1f04b26 Compare February 16, 2026 17:47
@pang-wu pang-wu changed the title from "Support Spark 4.0.0" to "Support Spark 4.x" Feb 16, 2026

@rexminnis rexminnis left a comment

Contributor

Thanks for putting this together — the CommandLineUtilsBridge pattern and the SparkSubmit rework are clean solutions to the cross-version API drift. A few things I noticed:

  1. Bug: spark340/SparkSqlUtils.toArrowRDD has infinite recursion (see inline comment)
  2. Java target: maven.compiler.source is still 1.8 — worth bumping to 17?
  3. Spark version: spark410.version targets 4.1.0 — consider 4.1.1 (current release)

Happy to help with testing or any of the shim work. I have a working Spark 4.1.1 setup locally and have been validating the Arrow conversion paths end-to-end.

Comment thread core/pom.xml
@carsonwang

Collaborator

@pang-wu Great work! The changes to support Spark 4.x look good. Please just do a cleanup (remove some outdated code from PR #458), and then I think it is ready to merge.

@pang-wu pang-wu force-pushed the pang/spark4 branch 2 times, most recently from e61a8f6 to 6f5d6ca Compare May 6, 2026 00:49
pang-wu commented May 23, 2026

Collaborator Author

Thanks @carsonwang! Cleanup is in 539bf7d — removed the legacy non-recoverable path from ObjectStoreWriter.scala (the instance ObjectStoreWriter class with dfToId/getRandomRef/clean, the deprecated fromSparkRDD method, the ObjectRefHolder object, and the now-unused imports). PTAL when you get a chance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pang-wu pang-wu requested a review from carsonwang May 23, 2026 06:48
# Conflicts:
#	.github/workflows/pypi.yml
#	.github/workflows/pypi_release.yml
#	.github/workflows/ray_nightly_test.yml
#	.github/workflows/raydp.yml
#	python/setup.py
Comment thread python/setup.py Outdated
"pyarrow >= 4.0.1",
"ray >= 2.37.0",
"pyspark >= 3.1.1, <=3.5.7",
"pyspark >= 4.0.0",

Collaborator

Add upper bound?

Collaborator Author

Done in 6bb6ff2 — capped at < 5.0.0 (next major; Spark generally keeps API stable across minor versions, but a major bump may be breaking).

schema = schema,
timeZoneId = timeZoneId,
errorOnDuplicatedFieldNames = false,
largeVarTypes = false,

Collaborator

This should honor the session config sparkSession.sessionState.conf.arrowUseLargeVarTypes too.

Collaborator Author

Done in 6bb6ff2: toDataFrame now reads sparkSession.sessionState.conf.arrowUseLargeVarTypes (captured outside the closure, like timeZoneId).
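
For readers unfamiliar with the "captured outside the closure" point, here is a minimal Python sketch of the idea. It is not the shim code from this PR (which is Scala): `make_partition_converter`, the plain `dict` standing in for the session conf, and the returned option dictionary are all hypothetical, and the config key names are assumed to be the usual Spark SQL keys. The point is only that settings are read once on the driver into plain local values, so the per-partition function never touches the session object.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the driver-side session configuration.
SessionConf = Dict[str, str]


def make_partition_converter(
    conf: SessionConf,
) -> Callable[[List[bytes]], Dict[str, object]]:
    """Build a per-partition function that carries only plain captured values."""
    # Read every setting once, up front, into local primitives.
    # Key names are assumptions for illustration.
    time_zone_id = conf.get("spark.sql.session.timeZone", "UTC")
    large_var_types = (
        conf.get("spark.sql.execution.arrow.useLargeVarTypes", "false").lower()
        == "true"
    )

    def convert(batches: List[bytes]) -> Dict[str, object]:
        # The closure refers to the captured locals, never to `conf` itself,
        # so it stays valid even where the session is unavailable.
        return {
            "timeZoneId": time_zone_id,
            "largeVarTypes": large_var_types,
            "numBatches": len(batches),
        }

    return convert
```

Because the values are captured when the converter is built, later changes to the conf object do not affect an already-built converter, which mirrors how `timeZoneId` is handled in the snippet above.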

schema = schema,
timeZoneId = timeZoneId,
errorOnDuplicatedFieldNames = false,
largeVarTypes = false,

Collaborator

This should honor the session config sparkSession.sessionState.conf.arrowUseLargeVarTypes too.

Collaborator Author

Done in 6bb6ff2: toDataFrame now reads sparkSession.sessionState.conf.arrowUseLargeVarTypes (captured outside the closure, like timeZoneId).

pang-wu and others added 2 commits June 7, 2026 14:55
- spark400/spark410 shims: read arrowUseLargeVarTypes from session conf
  in toDataFrame instead of hardcoding to false (matches toArrowSchema)
- setup.py: cap pyspark below 5.0.0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The spark410 shim's SparkShimProvider matches patches 0..1 and
spark400 matches 0..2, so 4.1.1 is the highest version we actually
shim. Pin the upper bound there instead of an open < 5.0.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
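
The reasoning in the second commit message can be sketched as follows. This is an illustrative Python model, not the actual `SparkShimProvider` logic (which lives in the Scala shims): `SHIM_RANGES`, `select_shim`, and `highest_shimmed_version` are hypothetical names, and only the two 4.x patch ranges quoted in the commit message are modeled.

```python
from typing import Optional, Tuple

# (shim name, major, minor, highest patch matched); patches start at 0.
# Ranges are taken from the commit message: spark400 -> 0..2, spark410 -> 0..1.
SHIM_RANGES = (
    ("spark400", 4, 0, 2),
    ("spark410", 4, 1, 1),
)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain 'major.minor.patch' string into integers."""
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    return major, minor, patch


def select_shim(version: str) -> Optional[str]:
    """Return the name of the shim that matches this version, if any."""
    major, minor, patch = parse_version(version)
    for name, shim_major, shim_minor, max_patch in SHIM_RANGES:
        if (major, minor) == (shim_major, shim_minor) and 0 <= patch <= max_patch:
            return name
    return None


def highest_shimmed_version() -> str:
    """Return the newest version covered by any shim range."""
    _, major, minor, patch = max(SHIM_RANGES, key=lambda entry: entry[1:])
    return f"{major}.{minor}.{patch}"
```

Under these assumptions, `select_shim("4.1.2")` finds no shim, which is why an open `< 5.0.0` bound would admit versions nothing is built against, while `highest_shimmed_version()` gives the version the upper bound is pinned to.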

@carsonwang carsonwang left a comment

Collaborator

Thanks for all the effort!

@carsonwang carsonwang merged commit 1abfa6f into ray-project:master Jun 10, 2026
12 checks passed