Describe the bug, including details regarding any error messages, version, and platform.
The following exception was thrown while writing a column of ARRAY<STRING> in Spark 3.5.0 with Parquet 1.15.2:
Caused by: java.lang.ArithmeticException: integer overflow
at java.base/java.lang.Math.addExact(Math.java:883)
at org.apache.parquet.bytes.CapacityByteArrayOutputStream.addSlab(CapacityByteArrayOutputStream.java:198)
at org.apache.parquet.bytes.CapacityByteArrayOutputStream.write(CapacityByteArrayOutputStream.java:220)
at org.apache.parquet.bytes.LittleEndianDataOutputStream.write(LittleEndianDataOutputStream.java:76)
at java.base/java.io.OutputStream.write(OutputStream.java:127)
at org.apache.parquet.io.api.Binary$ByteArrayBackedBinary.writeTo(Binary.java:319)
at org.apache.parquet.column.values.plain.PlainValuesWriter.writeBytes(PlainValuesWriter.java:55)
at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:178)
at org.apache.parquet.column.impl.ColumnWriterBase.write(ColumnWriterBase.java:196)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:473)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$9(ParquetWriteSupport.scala:212)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$9$adapted(ParquetWriteSupport.scala:210)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeArrayWriter$5(ParquetWriteSupport.scala:354)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:490)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeArrayWriter$4(ParquetWriteSupport.scala:354)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeGroup(ParquetWriteSupport.scala:484)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeArrayWriter$3(ParquetWriteSupport.scala:352)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:490)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeArrayWriter$2(ParquetWriteSupport.scala:347)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeGroup(ParquetWriteSupport.scala:484)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeArrayWriter$1(ParquetWriteSupport.scala:346)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeArrayWriter$1$adapted(ParquetWriteSupport.scala:342)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:168)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:490)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:168)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:158)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:478)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:158)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:152)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:240)
at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:41)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:39)
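For context, the "integer overflow" comes from the overflow-checked int addition in CapacityByteArrayOutputStream.addSlab, which fails once the bytes buffered for a single column chunk approach Integer.MAX_VALUE (~2 GiB). The following is only a minimal Scala sketch of that failing arithmetic; the variable names are illustrative, not the actual parquet-java fields:

```scala
object CapacityOverflowSketch {
  def main(args: Array[String]): Unit = {
    // Bytes already buffered for the column chunk, close to the 2 GiB int limit.
    val bytesUsed: Int = Int.MaxValue - 10
    // Size of the next slab the stream tries to account for.
    val nextSlabSize: Int = 64 * 1024 * 1024
    // Same overflow-checked addition as in CapacityByteArrayOutputStream.addSlab:
    // throws java.lang.ArithmeticException: integer overflow
    val newSize = Math.addExact(bytesUsed, nextSlabSize)
    println(newSize)
  }
}
```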
The issue can be worked around by increasing spark.sql.shuffle.partitions so that the data is divided into smaller partitions.
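For reference, one way to apply that workaround is shown below; the partition count of 2000 is only an example value, and the output path is hypothetical. Pick a count large enough that each task's column chunks stay well under 2 GiB:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Increase the number of shuffle partitions before running the job that writes the table.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// Or repartition the DataFrame explicitly right before writing:
// df.repartition(2000).write.parquet("/path/to/output")
```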
Can this be fixed on the Parquet side?
Component(s)
Core