[GLUTEN-11550][VL][UT] Enable Variant test suites #11726
Conversation
@jinchengchenghh @zhztheplayer please review
Pull request overview
Adds Spark 4.1-specific detection and fallback behavior for Parquet Variant annotation / variant shredding so Velox avoids reading unsupported encodings, and enables previously-disabled Variant-related test suites.
Changes:
- Add Spark shim hooks to decide when Parquet Variant annotation should trigger fallback (Spark 4.1).
- Refactor Parquet metadata validation to read each footer once and include an always-on Variant annotation check when needed.
- Enable Variant-related test suites for Spark 4.0/4.1 and force UTF-8 file encoding for test JVMs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| shims/spark41/src/main/scala/org/apache/gluten/sql/shims/spark41/Spark41Shims.scala | Implements Spark 4.1 logic for checking Parquet Variant annotations via footer schema. |
| shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala | Adds default shim APIs for Variant-annotation fallback decisions. |
| backends-velox/src/main/scala/org/apache/gluten/utils/ParquetMetadataUtils.scala | Refactors footer scanning and adds Variant-annotation-driven fallback path. |
| backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxValidatorApi.scala | Detects “variant shredded” StructType via field metadata and rejects it for Velox. |
| backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala | Minor cleanup around metadata validation call chaining. |
| gluten-ut/spark41/src/test/scala/org/apache/gluten/utils/velox/VeloxTestSettings.scala | Enables Variant-related suites for Spark 4.1. |
| gluten-ut/spark40/src/test/scala/org/apache/gluten/utils/velox/VeloxTestSettings.scala | Enables Variant-related suites for Spark 4.0. |
| pom.xml | Forces UTF-8 file.encoding for test JVM execution. |
// These structs have all fields annotated with __VARIANT_METADATA_KEY metadata.
// Velox cannot read the variant shredding encoding in Parquet files.
if (
  struct.fields.nonEmpty &&
Do we need to add nonEmpty check?
Yes, nonEmpty is needed because forall returns true for an empty collection. Without it, an empty struct would be incorrectly detected as variant shredded.
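The vacuous-truth behavior of forall can be demonstrated directly. The sketch below is a minimal model of the predicate discussed above, assuming a simplified Field stand-in rather than Spark's actual StructField/Metadata types:

```scala
// Minimal sketch, assuming a simplified Field stand-in for Spark's StructField.
case class Field(name: String, metadata: Set[String])

// The detection predicate discussed above: the struct must be non-empty and
// every field must carry the __VARIANT_METADATA_KEY marker.
def isVariantShredded(fields: Seq[Field]): Boolean =
  fields.nonEmpty &&
    fields.forall(_.metadata.contains("__VARIANT_METADATA_KEY"))
```

Because `forall` is vacuously true on an empty collection, dropping the `nonEmpty` guard would misclassify an empty struct as variant shredded.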
  struct.fields.nonEmpty &&
  struct.fields.forall(_.metadata.contains("__VARIANT_METADATA_KEY"))
) {
  return Some(s"Variant shredded struct is not supported: $struct")
Might printing the full struct via $struct be too heavy?
This is consistent with the existing pattern at line 144: s"Schema / data type not supported: $schema" which also prints the full type. Since this is a validation failure message (logged once per scan fallback, not per row), the overhead is negligible.
- if (!GlutenConfig.get.parquetMetadataValidationEnabled) {
-   None
+ val enabled = GlutenConfig.get.parquetMetadataValidationEnabled
+ if (enabled || SparkShimLoader.getSparkShims.needsVariantAnnotationCheck) {
Can we reuse parquetMetadataValidationEnabled? Its default value is false.
Updated. Variant annotation check is now gated by parquetMetadataValidationEnabled (same as other checks). For Spark 4.1 GlutenParquetVariantShreddingSuite, we set parquetMetadataValidationEnabled=true via sparkConf in the test suite.
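The resulting gating can be sketched as follows; Footer here is a hypothetical stand-in for the real ParquetMetadata, and the two checks are simplified flags rather than Gluten's actual helper calls:

```scala
// Minimal sketch, assuming a simplified Footer stand-in for
// org.apache.parquet.hadoop.metadata.ParquetMetadata.
case class Footer(encrypted: Boolean, hasVariantAnnotation: Boolean)

// All footer checks, including the variant annotation check, run only when
// parquetMetadataValidationEnabled is true; otherwise validation is skipped
// and no fallback reason is produced.
def validationFailure(validationEnabled: Boolean, footer: Footer): Option[String] =
  if (!validationEnabled) None
  else if (footer.encrypted) Some("Encrypted Parquet file is not supported")
  else if (footer.hasVariantAnnotation) Some("Parquet variant annotation is not supported")
  else None
```

This keeps the variant annotation check on the same single footer read as the existing encryption check, so no extra I/O is introduced.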
def isParquetFileEncrypted(footer: ParquetMetadata): Boolean

def shouldFallbackForParquetVariantAnnotation(footer: ParquetMetadata): Boolean = false
Why would we not fall back?
Default false because Spark 3.x and 4.0 don't have Parquet variant logical type annotations — only Spark 4.1 introduced VariantLogicalTypeAnnotation. Spark41Shims overrides this to actually check the Parquet schema. For older Spark versions, there's nothing to detect, so no fallback needed.
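The shim layering described above can be modeled as below. FooterInfo and the string-based annotation check are hypothetical stand-ins for illustration; the real Spark41Shims override inspects the Parquet footer schema for VariantLogicalTypeAnnotation:

```scala
// Minimal sketch of the shim layering, assuming a simplified FooterInfo
// stand-in; the real override walks the Parquet footer schema.
case class FooterField(name: String, logicalType: Option[String])
case class FooterInfo(fields: Seq[FooterField])

trait SparkShims {
  // Spark 3.x / 4.0: variant logical type annotations do not exist in these
  // versions, so the default never triggers a fallback.
  def shouldFallbackForParquetVariantAnnotation(footer: FooterInfo): Boolean = false
}

class Spark41Shims extends SparkShims {
  // Spark 4.1: fall back when any column carries a variant logical type annotation.
  override def shouldFallbackForParquetVariantAnnotation(footer: FooterInfo): Boolean =
    footer.fields.exists(_.logicalType.contains("VARIANT"))
}
```

Older shims inherit the safe default, so only the Spark 4.1 shim pays the cost of inspecting the schema.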
Enable GlutenVariantEndToEndSuite, GlutenVariantShreddingSuite, and GlutenParquetVariantShreddingSuite for both spark40 and spark41.

Changes:
1. VeloxValidatorApi: Detect variant shredded structs produced by Spark's PushVariantIntoScan (checking __VARIANT_METADATA_KEY metadata) to trigger fallback to Spark's native Parquet reader.
2. ParquetMetadataUtils: Add variant annotation check in isUnsupportedMetadata, gated by parquetMetadataValidationEnabled. Reads the same footer, no extra I/O.
3. Spark41Shims: Add shouldFallbackForParquetVariantAnnotation to detect Parquet variant logical type annotations.
4. GlutenParquetVariantShreddingSuite (spark41): Set parquetMetadataValidationEnabled=true to enable variant annotation fallback detection.
5. pom.xml: Add -Dfile.encoding=UTF-8 to test JVM args. On JDK 17 with LANG=C (CI containers centos-8/9), the default charset is US-ASCII, causing garbled output for multi-byte characters. JDK 18+ defaults to UTF-8 via JEP 400.

See: https://github.com/apache/spark/blob/v4.0.1/common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java#L508

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What changes are proposed in this pull request?
Enable `GlutenVariantEndToEndSuite`, `GlutenVariantShreddingSuite`, and `GlutenParquetVariantShreddingSuite` for both spark40 and spark41.

Changes:
- `VeloxValidatorApi.scala`: Detect variant shredded structs (produced by Spark's `PushVariantIntoScan`) by checking for `__VARIANT_METADATA_KEY` metadata on struct fields. Triggers fallback to Spark's native Parquet reader since Velox cannot read the variant shredding encoding.
- `ParquetMetadataUtils.scala`: Add a variant annotation check in `isUnsupportedMetadata`, gated by `parquetMetadataValidationEnabled`. Reads the same footer, no extra I/O.
- `Spark41Shims.scala`: Add `shouldFallbackForParquetVariantAnnotation` to detect Parquet variant logical type annotations.
- `GlutenParquetVariantShreddingSuite` (spark41): Set `parquetMetadataValidationEnabled=true` to enable variant annotation fallback detection.
- `pom.xml`: Add `-Dfile.encoding=UTF-8` to test JVM args. On JDK 17 with `LANG=C` (CI containers centos-8/9), the default charset is US-ASCII, causing garbled output for multi-byte characters. JDK 18+ defaults to UTF-8 via JEP 400. See VariantUtil.java#L508.

How was this patch tested?
- `GlutenVariantEndToEndSuite` 14 ✅, `GlutenVariantShreddingSuite` 8 ✅, `GlutenParquetVariantShreddingSuite` 5 ✅
- `GlutenVariantEndToEndSuite` 14 ✅, `GlutenVariantShreddingSuite` 8 ✅, `GlutenParquetVariantShreddingSuite` 7 ✅

Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI
Related issue: #11550
Note: This PR subsumes #11723 (GlutenParquetVariantShreddingSuite).