Support Spark 4.x by pang-wu · Pull Request #450 · ray-project/raydp

pang-wu · 2025-12-08T18:10:44Z

This PR adapts RayDP to Spark 4.x but leaves the following work for future improvement:

Support TensorFlow 2.16+ (see https://keras.io/getting_started/#tensorflow--keras-2-backwards-compatibility) and NumPy 2.x.
Support Python 3.12. We deprecated Python 3.9 because Spark no longer supports it. The Python build system needs to be modernized.
Deprecate Ray AIR.

To make the tests pass, this PR is based on #458. Once PR #458 is merged, this PR should be rebased again.

rexminnis

Thanks for putting this together — the CommandLineUtilsBridge pattern and the SparkSubmit rework are clean solutions to the cross-version API drift. A few things I noticed:

Bug: spark340/SparkSqlUtils.toArrowRDD has infinite recursion (see inline comment)
Java target: maven.compiler.source is still 1.8 — worth bumping to 17?
Spark version: spark410.version targets 4.1.0 — consider 4.1.1 (current release)

Happy to help with testing or any of the shim work. I have a working Spark 4.1.1 setup locally and have been validating the Arrow conversion paths end-to-end.

carsonwang · 2026-04-02T08:15:24Z

@pang-wu Great work! The changes to support Spark 4.x look good. Please do a cleanup (remove some outdated code from PR #458), and then I think it is ready to merge.

pang-wu · 2026-05-23T05:53:39Z

Thanks @carsonwang! Cleanup is in 539bf7d — removed the legacy non-recoverable path from ObjectStoreWriter.scala (the instance ObjectStoreWriter class with dfToId/getRandomRef/clean, the deprecated fromSparkRDD method, the ObjectRefHolder object, and the now-unused imports). PTAL when you get a chance.


carsonwang · 2026-06-01T08:20:49Z

        "pyarrow >= 4.0.1",
        "ray >= 2.37.0",
-        "pyspark >= 3.1.1, <=3.5.7",
+        "pyspark >= 4.0.0",


Add upper bound?

Done in 6bb6ff2 — capped at < 5.0.0 (next major; Spark generally keeps API stable across minor versions, but a major bump may be breaking).
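As a rough illustration of what the capped range admits, here is a minimal sketch that checks plain `X.Y.Z` versions against the bounds discussed above (`>= 4.0.0` from the diff, `< 5.0.0` from the reply). The helper names `parse_version` and `pyspark_allowed` are hypothetical, and this is a simplification: pip evaluates specifiers using PEP 440 rules, which also cover pre-releases and other forms not handled here.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


LOWER = parse_version("4.0.0")  # inclusive lower bound shown in the diff
UPPER = parse_version("5.0.0")  # exclusive cap described in the reply


def pyspark_allowed(version: str) -> bool:
    """Return True if version falls in the half-open range [4.0.0, 5.0.0)."""
    return LOWER <= parse_version(version) < UPPER
```

Comparing integer tuples avoids the string-comparison trap where "4.10.0" would sort before "4.9.0".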

carsonwang · 2026-06-01T08:27:38Z

+        schema = schema,
+        timeZoneId = timeZoneId,
+        errorOnDuplicatedFieldNames = false,
+        largeVarTypes = false,


This should honor the session config sparkSession.sessionState.conf.arrowUseLargeVarTypes too.

Done in 6bb6ff2 — toDataFrame now reads sparkSession.sessionState.conf.arrowUseLargeVarTypes (captured outside the closure like timeZoneId).
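The "captured outside the closure" pattern can be sketched in plain Python. This is not the Scala shim code: `make_batch_decoder` and `ArrowReadOptions` are hypothetical names, the session conf is modeled as a plain dict, and the two conf keys are assumptions (the thread only names the `arrowUseLargeVarTypes` accessor). The idea shown is that the settings are read once before the per-partition function is defined, so that function carries only small immutable values rather than a reference to the session.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed conf keys; the thread itself only names the SQLConf accessor.
TIME_ZONE_KEY = "spark.sql.session.timeZone"
LARGE_VAR_TYPES_KEY = "spark.sql.execution.arrow.useLargeVarTypes"


@dataclass(frozen=True)
class ArrowReadOptions:
    time_zone_id: str
    large_var_types: bool


def make_batch_decoder(session_conf: Dict[str, str]) -> Callable[[List[bytes]], List[str]]:
    """Hypothetical stand-in for toDataFrame: build a per-partition function."""
    # Read the settings once, on the driver side, before defining the closure.
    options = ArrowReadOptions(
        time_zone_id=session_conf.get(TIME_ZONE_KEY, "UTC"),
        large_var_types=session_conf.get(LARGE_VAR_TYPES_KEY, "false").lower() == "true",
    )

    def decode(batches: List[bytes]) -> List[str]:
        # The closure touches only the immutable `options` value, never
        # `session_conf`, so it does not need the session when it runs.
        kind = "large" if options.large_var_types else "regular"
        return [
            f"{len(batch)} bytes, {kind} var types, tz={options.time_zone_id}"
            for batch in batches
        ]

    return decode
```

Because the values are snapshotted when the decoder is built, later changes to the conf do not affect a decoder that already exists.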

carsonwang · 2026-06-01T08:27:56Z

+        schema = schema,
+        timeZoneId = timeZoneId,
+        errorOnDuplicatedFieldNames = false,
+        largeVarTypes = false,


This should honor the session config sparkSession.sessionState.conf.arrowUseLargeVarTypes too.

Done in 6bb6ff2 — toDataFrame now reads sparkSession.sessionState.conf.arrowUseLargeVarTypes (captured outside the closure like timeZoneId).

- spark400/spark410 shims: read arrowUseLargeVarTypes from session conf in toDataFrame instead of hardcoding to false (matches toArrowSchema)
- setup.py: cap pyspark below 5.0.0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The spark410 shim's SparkShimProvider matches patches 0..1 and spark400 matches 0..2, so 4.1.1 is the highest version we actually shim. Pin the upper bound there instead of an open < 5.0.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
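The reasoning in this commit message can be sketched as a small lookup. This is a Python illustration, not the Scala `SparkShimProvider` code: the table and the functions `select_shim` and `highest_shimmed_version` are hypothetical, and only the two 4.x shims and patch ranges named in the message are modeled. Versions are assumed to be plain `X.Y.Z` strings.

```python
from typing import Dict, Optional, Tuple

# (major, minor) -> (shim name, highest patch matched); ranges start at patch 0.
# Values taken from the commit message: spark400 matches 0..2, spark410 matches 0..1.
SHIM_PATCH_RANGES: Dict[Tuple[int, int], Tuple[str, int]] = {
    (4, 0): ("spark400", 2),
    (4, 1): ("spark410", 1),
}


def select_shim(version: str) -> Optional[str]:
    """Return the shim that matches a plain 'X.Y.Z' version, or None."""
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    entry = SHIM_PATCH_RANGES.get((major, minor))
    if entry is None:
        return None
    name, max_patch = entry
    return name if 0 <= patch <= max_patch else None


def highest_shimmed_version() -> str:
    """Return the newest version covered by any modeled shim."""
    (major, minor), (_, max_patch) = max(SHIM_PATCH_RANGES.items())
    return f"{major}.{minor}.{max_patch}"
```

Under these assumptions a version such as 4.1.2 falls inside an open `< 5.0.0` bound but matches no shim, which is the gap the pinned upper bound closes.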

carsonwang

Thanks for all the effort!

pang-wu changed the title ~~Support SPark 4.0.0~~ Support Spark 4.0.0 Dec 8, 2025

pang-wu mentioned this pull request Feb 16, 2026

[RayDP 2.0] Support Spark 4.1, Java 17, and Scala 2.13 #460

Draft

6 tasks

pang-wu force-pushed the pang/spark4 branch 7 times, most recently from 21be2c9 to 1f04b26 on February 16, 2026 17:47

pang-wu changed the title ~~Support Spark 4.0.0~~ Support Spark 4.x Feb 16, 2026

pang-wu force-pushed the pang/spark4 branch from 36323e4 to e587da8 on February 16, 2026 18:56

rexminnis mentioned this pull request Feb 17, 2026

[RFC] RayDP 2.0: Migration to Spark 4.1 & Java 17 #459

Closed

rexminnis reviewed Feb 17, 2026


Comment thread core/shims/spark340/src/main/scala/org/apache/spark/sql/SparkSqlUtils.scala

Comment thread core/pom.xml

Comment thread core/pom.xml

pang-wu force-pushed the pang/spark4 branch from 9a330d8 to fd27c93 on February 17, 2026 14:24

rexminnis mentioned this pull request Feb 17, 2026

Fix TaskContext leak and resource cleanup in getRDDPartition #463

Open

3 tasks

pang-wu force-pushed the pang/spark4 branch 8 times, most recently from 7acc670 to c40d89d on February 17, 2026 18:19

rexminnis mentioned this pull request Feb 17, 2026

Fix CI: drop EOL Python 3.9 and update GitHub Actions #464

Open

1 task

pang-wu force-pushed the pang/spark4 branch 2 times, most recently from 26d576d to ac217b9 on February 18, 2026 02:53

pang-wu force-pushed the pang/spark4 branch from ac217b9 to b7d339a on March 1, 2026 02:08

pang-wu added 3 commits March 12, 2026 09:41

do one hop forward fetch if recache data change executor

a45140e

more robust executor id parse

22259a6

add test

7b50558

pang-wu and others added 14 commits March 12, 2026 09:46

Support spark 4.0.1

90a4699

tf/estimator.py: only write checkpoint in rank0

201e96a

revert tf/estimator.py

f520469

support spark 4.1.x

679eca6

deprecate python 3.9, add 3.11 to CI

6c7a1b3

update pylint

187ce96

fix pyint rules

8cfc832

fix tensorflow version

6473ccc

pin pandas<3 version

93d5d42

remove df.sqlContext reference

f83db26

extract commandlineutils to custom spark submit

7d4c4de

add new shims

e7148fe

compile against 4.0.0

1989452

use legacy keras

a86d51d

pang-wu force-pushed the pang/spark4 branch from b7d339a to a86d51d on March 12, 2026 16:48

pang-wu force-pushed the pang/spark4 branch 2 times, most recently from e61a8f6 to 6f5d6ca on May 6, 2026 00:49

clean up dead code

539bf7d

pang-wu force-pushed the pang/spark4 branch from 6f5d6ca to 539bf7d on May 19, 2026 03:23

retry ray client init in tests to absorb proxier flakes

66cf296

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

pang-wu requested a review from carsonwang May 23, 2026 06:48

Merge remote-tracking branch 'upstream/master' into pang/spark4

a55cfe4

# Conflicts:
#   .github/workflows/pypi.yml
#   .github/workflows/pypi_release.yml
#   .github/workflows/ray_nightly_test.yml
#   .github/workflows/raydp.yml
#   python/setup.py

pang-wu force-pushed the pang/spark4 branch from 4f687fb to a55cfe4 on May 23, 2026 08:37

carsonwang reviewed Jun 1, 2026


pang-wu and others added 2 commits June 7, 2026 14:55

carsonwang approved these changes Jun 10, 2026


carsonwang merged commit 1abfa6f into ray-project:master Jun 10, 2026
12 checks passed

carsonwang merged 39 commits into ray-project:master from pang-wu:pang/spark4
