Description
Hi everyone,
I’m encountering a reproducible failure when running DecisionTreeClassifierSuite in Spark 4.2.0-preview3 (Scala 2.13). The same test suite works correctly in Spark 3.5.x.
The issue seems to be related to changes introduced in this PR: #50665
Error
The failure occurs with the following exception:
org.apache.spark.SparkException: [FAILED_READ_FILE.CANNOT_READ_FILE_FOOTER]
...
Caused by: java.lang.RuntimeException:
... is not a Parquet file. Expected magic number at tail, but found [-1, -1, -1, -1]
This indicates that a Parquet file is being created but is incomplete or corrupted.
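To confirm that the failure is a truncated tail rather than a non-Parquet file, one can inspect the last bytes of the file directly. A valid Parquet file ends with the 4 ASCII bytes "PAR1"; the reported [-1, -1, -1, -1] suggests the tail was never written. A minimal sketch (the object name `CheckParquetTail` is hypothetical, plain JVM code with no Spark dependency):

```scala
import java.nio.file.{Files, Paths}

object CheckParquetTail {
  // Parquet's tail magic: the last 4 bytes of a complete file.
  val Magic: Array[Byte] = "PAR1".getBytes("US-ASCII")

  // Returns true iff the file at `path` ends with the magic bytes.
  def hasValidTail(path: String): Boolean = {
    val bytes = Files.readAllBytes(Paths.get(path))
    bytes.length >= Magic.length &&
      bytes.takeRight(Magic.length).sameElements(Magic)
  }
}
```

Running this against the file under target/tmp/... should tell whether the footer is missing entirely (truncated write) or merely unreadable.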
Environment
Spark version: 4.2.0-preview3 (custom build from source)
Scala version: 2.13.x
Java: 17
OS: Linux
Build tool: ./build/mvn (same as Spark CI)
Command used:
./build/mvn \
  -pl mllib \
  -DwildcardSuites=org.apache.spark.ml.classification.DecisionTreeClassifierSuite \
  -Phadoop-cloud \
  -Pyarn \
  -Pkubernetes \
  -Djava.version=17 \
  test -fae
Also tested with:
export SPARK_LOCAL_IP=localhost
export HADOOP_PROFILE=hadoop3
Observations
The failing file does exist under target/tmp/..., but appears truncated.
The issue seems to happen during schema inference / Parquet footer reading:
ParquetFileFormat.readParquetFootersInParallel
This suggests that the file is being written but not fully flushed or completed.
Regression
This test works correctly in Spark 3.5.x.
One notable change between versions is the introduction of .toImmutableArraySeq in test setup code, e.g.:
sc.parallelize(
OldDecisionTreeSuite.generateCategoricalDataPoints().toImmutableArraySeq
).map(_.asML)
vs previous version:
sc.parallelize(
OldDecisionTreeSuite.generateCategoricalDataPoints()
).map(_.asML)
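One relevant detail, assuming `.toImmutableArraySeq` delegates to Scala 2.13's `ArraySeq.unsafeWrapArray` (as a zero-copy conversion typically would): the resulting "immutable" seq shares the backing array, so later mutations of the original array remain visible through it. A minimal sketch of that sharing behavior (the `WrapDemo` object is hypothetical):

```scala
import scala.collection.immutable.ArraySeq

object WrapDemo {
  // Wraps the array without copying; the ArraySeq and the
  // original array share the same underlying storage.
  def wrap(a: Array[Int]): ArraySeq[Int] = ArraySeq.unsafeWrapArray(a)
}
```

If the test data array were mutated after being wrapped and handed to `sc.parallelize`, this sharing could produce the kind of evaluation-order sensitivity described below, whereas the old code path captured the array directly.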
Hypothesis
This may be related to:
Scala 2.13 collection changes (Array → ImmutableArraySeq)
Lazy evaluation / serialization differences
Possible race condition when writing temporary Parquet files during tests
Interaction with local filesystem (even with Hadoop profile enabled)
In some experiments, forcing materialization with:
rdd.cache()
rdd.count()
seems to mitigate the issue, which suggests a timing or evaluation problem.
Question
Is this a known issue with:
Scala 2.13 collection conversions in tests?
Parquet writing/reading in MLlib test suites?
Or something specific introduced in recent changes (possibly related to test data preparation)?
Additional context
This appears to be triggered consistently when running tests locally, even when mimicking the CI setup (./build/mvn, Hadoop profile, etc.).
Any guidance or pointers would be greatly appreciated.
Thanks!