Description
Hi everyone,
I’m encountering a reproducible failure when running DecisionTreeClassifierSuite in Spark 4.2.0-preview3 (Scala 2.13). The same test suite works correctly in Spark 3.5.x.
The issue seems to be related to changes introduced in this PR: #50665
Error
The failure occurs with the following exception:
org.apache.spark.SparkException: [FAILED_READ_FILE.CANNOT_READ_FILE_FOOTER]
...
Caused by: java.lang.RuntimeException:
... is not a Parquet file. Expected magic number at tail, but found [-1, -1, -1, -1]
This indicates that a Parquet file is being created but is incomplete or corrupted.
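To confirm that the failure is a truncated tail rather than a non-Parquet file, one can inspect the last bytes of the file directly. A valid Parquet file ends with the 4 ASCII bytes "PAR1"; the reported [-1, -1, -1, -1] suggests the tail was never written. A minimal sketch (the object name `CheckParquetTail` is hypothetical, plain JVM code with no Spark dependency):

```scala
import java.nio.file.{Files, Paths}

object CheckParquetTail {
  // Parquet's tail magic: the last 4 bytes of a complete file.
  val Magic: Array[Byte] = "PAR1".getBytes("US-ASCII")

  // Returns true iff the file at `path` ends with the magic bytes.
  def hasValidTail(path: String): Boolean = {
    val bytes = Files.readAllBytes(Paths.get(path))
    bytes.length >= Magic.length &&
      bytes.takeRight(Magic.length).sameElements(Magic)
  }
}
```

Running this against the file under target/tmp/... should tell whether the footer is missing entirely (truncated write) or merely unreadable.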
Environment
Spark version: 4.2.0-preview3 (custom build from source)
Scala version: 2.13.x
Java: 17
OS: Linux
Build tool: ./build/mvn (same as Spark CI)
Command used:
./build/mvn \
  -pl mllib \
  -DwildcardSuites=org.apache.spark.ml.classification.DecisionTreeClassifierSuite \
  -Phadoop-cloud \
  -Pyarn \
  -Pkubernetes \
  -Djava.version=17 \
  test -fae
Also tested with:
export SPARK_LOCAL_IP=localhost
export HADOOP_PROFILE=hadoop3
Observations
The failing file does exist under target/tmp/..., but appears truncated.
The issue seems to happen during schema inference / Parquet footer reading:
ParquetFileFormat.readParquetFootersInParallel
This suggests that the file is being written but not fully flushed or completed.
Regression
This test works correctly in Spark 3.5.x.
One notable change between versions is the introduction of .toImmutableArraySeq in test setup code, e.g.:
sc.parallelize(
OldDecisionTreeSuite.generateCategoricalDataPoints().toImmutableArraySeq
).map(_.asML)
vs previous version:
sc.parallelize(
OldDecisionTreeSuite.generateCategoricalDataPoints()
).map(_.asML)
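One relevant detail, assuming `.toImmutableArraySeq` delegates to Scala 2.13's `ArraySeq.unsafeWrapArray` (as a zero-copy conversion typically would): the resulting "immutable" seq shares the backing array, so later mutations of the original array remain visible through it. A minimal sketch of that sharing behavior (the `WrapDemo` object is hypothetical):

```scala
import scala.collection.immutable.ArraySeq

object WrapDemo {
  // Wraps the array without copying; the ArraySeq and the
  // original array share the same underlying storage.
  def wrap(a: Array[Int]): ArraySeq[Int] = ArraySeq.unsafeWrapArray(a)
}
```

If the test data array were mutated after being wrapped and handed to `sc.parallelize`, this sharing could produce the kind of evaluation-order sensitivity described below, whereas the old code path captured the array directly.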
Hypothesis
This may be related to:
Scala 2.13 collection changes (Array → ImmutableArraySeq)
Lazy evaluation / serialization differences
Possible race condition when writing temporary Parquet files during tests
Interaction with local filesystem (even with Hadoop profile enabled)
In some experiments, forcing materialization with:
rdd.cache()
rdd.count()
seems to mitigate the issue, which suggests a timing or evaluation problem.
Question
Is this a known issue with:
Scala 2.13 collection conversions in tests?
Parquet writing/reading in MLlib test suites?
Or something specific introduced in recent changes (possibly related to test data preparation)?
Additional context
This appears to be triggered consistently when running tests locally, even when mimicking the CI setup (./build/mvn, Hadoop profile, etc.).
Any guidance or pointers would be greatly appreciated.
Thanks!