
DecisionTreeClassifierSuite fails in Spark 4.2.0-preview3 (Scala 2.13) with corrupted Parquet file error #54916

@afsantamaria-stratio

Description

Hi everyone,

I’m encountering a reproducible failure when running DecisionTreeClassifierSuite in Spark 4.2.0-preview3 (Scala 2.13). The same test suite works correctly in Spark 3.5.x.
The issue seems to be related to changes introduced in this PR: #50665

Error

The failure occurs with the following exception:

org.apache.spark.SparkException: [FAILED_READ_FILE.CANNOT_READ_FILE_FOOTER]
...
Caused by: java.lang.RuntimeException:
... is not a Parquet file. Expected magic number at tail, but found [-1, -1, -1, -1]

This indicates that a Parquet file is being created but is incomplete or corrupted.

Environment

Spark version: 4.2.0-preview3 (custom build from source)

Scala version: 2.13.x

Java: 17

OS: Linux

Build tool: ./build/mvn (same as Spark CI)

Command used:

./build/mvn \
  -pl mllib \
  -DwildcardSuites=org.apache.spark.ml.classification.DecisionTreeClassifierSuite \
  -Phadoop-cloud \
  -Pyarn \
  -Pkubernetes \
  -Djava.version=17 \
  test -fae

Also tested with:

export SPARK_LOCAL_IP=localhost
export HADOOP_PROFILE=hadoop3

Observations

The failing file does exist under target/tmp/..., but appears truncated.

The issue seems to happen during schema inference / Parquet footer reading:

ParquetFileFormat.readParquetFootersInParallel

This suggests that the file is being written but not fully flushed or completed.
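For what it's worth, the footer check that fails here can be reproduced outside Spark. A valid Parquet file must end with the 4-byte magic "PAR1"; the error above reports the last four bytes as [-1, -1, -1, -1], i.e. four 0xFF bytes read as signed values. A minimal shell sketch (the path below is a placeholder, not the actual failing file under target/tmp/):

```shell
# Simulate the corruption reported above: write a fake body, then append
# four 0xFF bytes where the "PAR1" magic should be.
printf 'PAR1 fake body ' > /tmp/truncated-example.parquet
printf '\377\377\377\377' >> /tmp/truncated-example.parquet

# Inspect the last 4 bytes as unsigned decimals.
# A healthy file would end in 80 65 82 49 ("PAR1"); this one prints 255 255 255 255.
tail -c 4 /tmp/truncated-example.parquet | od -An -tu1
```

Running the same `tail -c 4 ... | od -An -tu1` against the real file under target/tmp/ confirms whether the footer was ever written.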

Regression

This test works correctly in Spark 3.5.x.

One notable change between versions is the introduction of .toImmutableArraySeq in test setup code, e.g.:

sc.parallelize(
OldDecisionTreeSuite.generateCategoricalDataPoints().toImmutableArraySeq
).map(_.asML)

vs previous version:

sc.parallelize(
OldDecisionTreeSuite.generateCategoricalDataPoints()
).map(_.asML)
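To illustrate why this conversion could matter, here is a sketch in plain Scala 2.13 without Spark. My understanding (an assumption, not verified against Spark's source) is that `ArrayImplicits.toImmutableArraySeq` wraps the array via `ArraySeq.unsafeWrapArray` rather than copying it:

```scala
import scala.collection.immutable.ArraySeq

// Sketch: ArraySeq.unsafeWrapArray shares the backing array instead of
// copying it. If toImmutableArraySeq behaves the same way (an assumption),
// any later mutation of the original array is visible through the Seq,
// which could interact with lazy evaluation or serialization in tests.
object WrapDemo {
  def main(args: Array[String]): Unit = {
    val data = Array(1, 2, 3)
    val wrapped = ArraySeq.unsafeWrapArray(data)
    data(0) = 99 // mutate the backing array after wrapping
    println(wrapped.head) // prints 99: the "immutable" Seq saw the change
  }
}
```

If the generator reuses or mutates its backing arrays, this shared-storage behavior would be one way the 4.x test data could differ from 3.5.x without any copy-based code path changing.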

Hypothesis

This may be related to:

Scala 2.13 collection changes (Array → ImmutableArraySeq)

Lazy evaluation / serialization differences

Possible race condition when writing temporary Parquet files during tests

Interaction with local filesystem (even with Hadoop profile enabled)

In some experiments, forcing materialization with:

rdd.cache()
rdd.count()

seems to mitigate the issue, which suggests a timing or evaluation problem.

Question

Is this a known issue with:

Scala 2.13 collection conversions in tests?

Parquet writing/reading in MLlib test suites?

Or something specific introduced in recent changes (possibly related to test data preparation)?

Additional context

This appears to be triggered consistently when running tests locally, even when mimicking the CI setup (./build/mvn, Hadoop profile, etc.).

Any guidance or pointers would be greatly appreciated.

Thanks!
