Skip to content

[Feature Request]: Iceberg IO Hash Distribution Mode Support for Spark Based Runners #38820

@barunkumaracharya

Description

@barunkumaracharya

What would you like to happen?

In Beam's latest release 2.74.0, Hash Distribution Mode and AutoSharding for ICEBERG IO WRITES were enabled via this PR 38061.
This PR introduced a new transform WriteToPartitions (file WriteToPartitions.java, new in 2.74.0) and this file uses GroupIntoBatches Transform which is not supported in Spark Runners as Spark Runner does not support @OnWindowExpiration annotation (States and Timers) which is why i m unable to use the new feature in Spark Runner.

When i use SparkStructuredStreamingRunner, the code fails at this line

When i use SparkRunner, the code fails at this line

Is there a workaround or solve for Spark Runner Users to use the "hash" distribution mode with "autoSharding" for Iceberg Writes using Beam.

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Prism Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions