[VL] GPU failed on BHJ #11794

@marin-ma

Description

Backend

VL (Velox)

Bug description

26/03/19 15:23:14 INFO Executor: Executor interrupted and killed task 6.3 in stage 11.0 (TID 4149), reason: Stage cancelled: Job aborted due to stage failure: Task 4 in stage 11.0 failed 4 times, most recent failure: Lost task 4.3 in stage 11.0 (TID 4145) (10.167.33.215 executor 0): org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (0 vs. 0) Trying to access non-existing child in RowVector: [ROW ROW<d_date_sk:INTEGER>: 287 elements, no nulls]
Retriable: False
Expression: index < childrenSize_
Function: childAt
File: /velox/velox/vector/ComplexVector.h
Line: 116
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS0_24CompileTimeStringLiteralENS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_NS0_24CompileTimeStringLiteralE
# 3  _ZN6gluten16HashTableBuilder8addInputESt10shared_ptrIN8facebook5velox9RowVectorEE
# 4  _ZN6gluten20nativeHashTableBuildERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EES8_S0_ISt10shared_ptrIKN8facebook5velox4TypeEESaISG_EEibbblRS0_ISB_INS_13ColumnarBatchEESaISK_EESB_INSD_6memory10MemoryPoolEE
# 5  Java_org_apache_gluten_vectorized_HashJoinBuilder_nativeBuild
# 6  0x00007fa6a6b7f5da

        at org.apache.gluten.vectorized.HashJoinBuilder.nativeBuild(Native Method)
        at org.apache.spark.sql.execution.unsafe.UnsafeColumnarBuildSideRelation.buildHashTable(UnsafeColumnarBuildSideRelation.scala:188)
        at org.apache.gluten.execution.VeloxBroadcastBuildSideCache$.$anonfun$getOrBuildBroadcastHashTable$1(VeloxBroadcastBuildSideCache.scala:70)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2688)
        at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1916)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2686)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2669)
        at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:112)
        at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at org.apache.gluten.execution.VeloxBroadcastBuildSideCache$.getOrBuildBroadcastHashTable(VeloxBroadcastBuildSideCache.scala:65)
        at org.apache.gluten.execution.VeloxBroadcastBuildSideRDD.genBroadcastBuildSideIterator(VeloxBroadcastBuildSideRDD.scala:48)
        at org.apache.gluten.execution.ColumnarInputRDDsWrapper.$anonfun$getIterators$1(WholeStageTransformer.scala:502)
        at scala.collection.immutable.List.flatMap(List.scala:366)
        at org.apache.gluten.execution.ColumnarInputRDDsWrapper.getIterators(WholeStageTransformer.scala:500)
        at org.apache.gluten.execution.WholeStageZippedPartitionsRDD.$anonfun$compute$1(WholeStageZippedPartitionsRDD.scala:46)
        at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
        at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
        at org.apache.gluten.execution.WholeStageZippedPartitionsRDD.compute(WholeStageZippedPartitionsRDD.scala:44)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
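
One possible reading of the error above: the failed check is `index < childrenSize_` in `childAt`, and the reported values `(0 vs. 0)` suggest that index 0 was requested from a RowVector holding zero child columns, even though its type names one field (`d_date_sk`). The Python sketch below only illustrates that kind of bounds check. It is not Velox or Gluten code: `RowVectorSketch`, `InvalidStateError`, and the assumed scenario (type declares a column, no child attached) are hypothetical stand-ins, and the actual root cause in `HashTableBuilder::addInput` is not established here.

```python
from dataclasses import dataclass, field
from typing import List


class InvalidStateError(RuntimeError):
    """Hypothetical stand-in for the INVALID_STATE runtime error in the log."""


@dataclass
class RowVectorSketch:
    """Toy row vector: declared field names plus the child columns actually present."""

    field_names: List[str]
    size: int
    children: List[list] = field(default_factory=list)

    def child_at(self, index: int) -> list:
        # Mirrors the logged expression `index < childrenSize_`.
        if not index < len(self.children):
            raise InvalidStateError(
                f"({index} vs. {len(self.children)}) "
                "Trying to access non-existing child in RowVector"
            )
        return self.children[index]


# Assumed scenario: the type names one column, but no child column was attached.
batch = RowVectorSketch(field_names=["d_date_sk"], size=287)

try:
    batch.child_at(0)
except InvalidStateError as err:
    message = str(err)
    print(message)
```

If this reading holds, the place to look would be wherever the build-side batch is constructed before `nativeBuild` is called, to see whether the row type and the attached children can disagree.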

Gluten version

No response

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), triage
