Support Spark 4.x#450
Conversation
21be2c9 to
1f04b26
Compare
rexminnis
left a comment
There was a problem hiding this comment.
Thanks for putting this together — the CommandLineUtilsBridge pattern and the SparkSubmit rework are clean solutions to the cross-version API drift. A few things I noticed:
- Bug:
spark340/SparkSqlUtils.toArrowRDDhas infinite recursion (see inline comment) - Java target:
maven.compiler.sourceis still1.8— worth bumping to 17? - Spark version:
spark410.versiontargets 4.1.0 — consider 4.1.1 (current release)
Happy to help with testing or any of the shim work. I have a working Spark 4.1.1 setup locally and have been validating the Arrow conversion paths end-to-end.
7acc670 to
c40d89d
Compare
26d576d to
ac217b9
Compare
|
@pang-wu Great work! The changes to support spark 4.x look good. Please just do a cleanup (remove some outdated code from PR458) and then I think it is ready to merge. |
e61a8f6 to
6f5d6ca
Compare
|
Thanks @carsonwang! Cleanup is in 539bf7d — removed the legacy non-recoverable path from |
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts: # .github/workflows/pypi.yml # .github/workflows/pypi_release.yml # .github/workflows/ray_nightly_test.yml # .github/workflows/raydp.yml # python/setup.py
| "pyarrow >= 4.0.1", | ||
| "ray >= 2.37.0", | ||
| "pyspark >= 3.1.1, <=3.5.7", | ||
| "pyspark >= 4.0.0", |
There was a problem hiding this comment.
Done in 6bb6ff2 — capped at < 5.0.0 (next major; Spark generally keeps API stable across minor versions, but a major bump may be breaking).
| schema = schema, | ||
| timeZoneId = timeZoneId, | ||
| errorOnDuplicatedFieldNames = false, | ||
| largeVarTypes = false, |
There was a problem hiding this comment.
This should honor the session config sparkSession.sessionState.conf.arrowUseLargeVarTypes too.
There was a problem hiding this comment.
Done in 6bb6ff2 — toDataFrame now reads sparkSession.sessionState.conf.arrowUseLargeVarTypes (captured outside the closure like timeZoneId).
| schema = schema, | ||
| timeZoneId = timeZoneId, | ||
| errorOnDuplicatedFieldNames = false, | ||
| largeVarTypes = false, |
There was a problem hiding this comment.
This should honor the session config sparkSession.sessionState.conf.arrowUseLargeVarTypes too.
There was a problem hiding this comment.
Done in 6bb6ff2 — toDataFrame now reads sparkSession.sessionState.conf.arrowUseLargeVarTypes (captured outside the closure like timeZoneId).
- spark400/spark410 shims: read arrowUseLargeVarTypes from session conf in toDataFrame instead of hardcoding to false (matches toArrowSchema) - setup.py: cap pyspark below 5.0.0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The spark410 shim's SparkShimProvider matches patches 0..1 and spark400 matches 0..2, so 4.1.1 is the highest version we actually shim. Pin the upper bound there instead of an open < 5.0.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
carsonwang
left a comment
There was a problem hiding this comment.
Thanks for the all the efforts!
This PR adapt raydp with Spark 4.x but leave the following work for future improvement:
To make the tests pass, the PR is based on #458. Once PR#458 is merged this PR should rebase again.