[Improvement-16754][DataX] Support DataX writer parameter batchSize #18192

Open
leocook wants to merge 1 commit into apache:dev from leocook:fix-16754-datax-batchsize

Conversation

leocook (Contributor) commented Apr 25, 2026

Was this PR generated or assisted by AI?

YES — used Claude (Sonnet 4.6 / Opus 4.7) to help review the original design, simplify the implementation (removing per-database dynamic defaults that were the wrong abstraction), and resolve rebase conflicts against dev. All design decisions, the final approach, and verification were reviewed by me.

Purpose of the pull request

DataX writer currently does not expose the batchSize parameter, so users are stuck with the DataX default of 2048. This is too small for some target databases — e.g. the ClickHouse plugin recommends 65536. This PR makes batchSize configurable from the task UI.

Closes #16754

Brief change log

Backend (dolphinscheduler-task-datax):

  • Add batchSize field to DataxParameters (Lombok @Data generates the accessors)
  • In DataxTask, write batchSize into the writer JSON only when the value is greater than 0; when the user picks 0 the field is omitted, so the DataX upstream default still applies
  • Extend DataxParametersTest to cover the new field
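
A minimal sketch of the "write only when > 0" rule described above, in plain Java with no DolphinScheduler or Jackson dependencies. The class name WriterParamSketch, the method buildWriterParameter, and the placeholder writeMode entry are hypothetical and only illustrate the idea; they are not the code in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the writer-parameter assembly in DataxTask.
// Names are illustrative only; the real task builds a JSON node instead of a Map.
class WriterParamSketch {

    // Builds the "parameter" object of a DataX writer.
    // batchSize is added only when it is positive, so a value of 0 leaves
    // the key out and DataX falls back to its own default.
    static Map<String, Object> buildWriterParameter(String writeMode, int batchSize) {
        Map<String, Object> parameter = new LinkedHashMap<>();
        parameter.put("writeMode", writeMode); // placeholder for the other writer settings
        if (batchSize > 0) {
            parameter.put("batchSize", batchSize);
        }
        return parameter;
    }

    public static void main(String[] args) {
        System.out.println(buildWriterParameter("insert", 2048)); // contains batchSize
        System.out.println(buildWriterParameter("insert", 0));    // batchSize omitted
    }
}
```

Omitting the key, rather than writing 0, keeps the generated job file identical to what existing tasks produced before this change.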

Frontend (dolphinscheduler-ui):

  • Add a batchSize select control on the DataX task form
  • Static option list: 0 / 1024 / 2048 / 4096 / 8192 / 16384 / 32768 / 65536 / 131072
  • Default 2048 (matches DataX upstream); ClickHouse / Databend users can pick 65536 / 131072 from the same list
  • Add zh_CN and en_US i18n entries (datax_writer_batch_size)

Design note: an earlier draft switched the option list and default based on target database type (see issue discussion). That was dropped after review — batchSize optima depend on row width, network, and target hardware, not just on the database, so hard-coding database-specific defaults in the UI is the wrong abstraction. A static list keeps the UI consistent with surrounding fields (jobSpeedByte / jobSpeedRecord / memoryLimit).

Verify this pull request

This change added tests and can be verified as follows:

  • Added testBatchSize in DataxParametersTest covering the new field (default 0, 2048, 65536)
  • Updated the existing toString assertion in DataxParametersTest to include batchSize=0
  • Manually verified locally:
    • Created a DataX task targeting MySQL with batchSize=2048 — generated JSON contains "batchSize": 2048 under writer.parameter
    • Set batchSize=0 — generated JSON omits the field, DataX falls back to its own default
    • Switched UI between values, persisted task, reloaded — value preserved

Screenshots

(2 screenshots, images omitted)

Pull Request Notice

Add batchSize parameter support for DataX task to control writer batch size.

Backend:
- Add batchSize field to DataxParameters (Lombok @Data generates accessors)
- Generate batchSize JSON config in DataxTask only when value > 0
- Add unit tests for batchSize parameter

Frontend:
- Add batchSize select with a static option list
  (0/1024/2048/4096/8192/16384/32768/65536/131072)
- Default to 2048 to match the DataX upstream default; ClickHouse / Databend
  users can pick 65536 / 131072 from the dropdown
- Add zh_CN / en_US i18n entries

Closes apache#16754
github-actions bot added the UI (ui and front end related), backend, and test labels Apr 25, 2026
leocook marked this pull request as ready for review April 25, 2026 15:10
Labels

backend, test, UI (ui and front end related)

Development

Successfully merging this pull request may close these issues.

[Improvement][datax] Support DataX parameter batchSize for writer

1 participant