Skip to content

feat(connectors): add opt-in CSV output format to Doris sink#3575

Draft
ryankert01 wants to merge 2 commits into
apache:masterfrom
ryankert01:feat/doris-sink-csv-output
Draft

feat(connectors): add opt-in CSV output format to Doris sink#3575
ryankert01 wants to merge 2 commits into
apache:masterfrom
ryankert01:feat/doris-sink-csv-output

Conversation

@ryankert01

Copy link
Copy Markdown
Member

Which issue does this PR address?

Relates to #3215

Stacked on #3574 (in-request retry). Until that merges, this PR's diff includes the retry commit; it reduces to the CSV-only diff once #3574 lands.

Rationale

Follow-up to the Doris sink (#3215). JSON Stream Load parses more expensively on the Doris BE than CSV; a 10k-row benchmark shows CSV loads ~21% faster server-side at roughly half the body size. This adds CSV as an opt-in output format.

What changed?

The sink emitted JSON Stream Load only. This adds an opt-in output_format = "csv" (default stays JSON). CSV is positional where JSON is name-mapped, so it requires columns to pin the column order — open() fails on output_format = "csv" without it.

The hand-rolled encoder uses control-char framing (column_separator \x01, line_delimiter \x02) with enclose/escape quoting so embedded separators/quotes/newlines round-trip. Doris escapes with a prefix char (\"), not RFC-4180 quote-doubling, so the csv crate does not fit and no new dependency is added. JSON null and missing keys map to \N, empty string to "", numbers and bools emit bare, nested values stringify to JSON. The field is output_format (not format) to avoid an env-override collision with the runtime's top-level plugin_config_format.

Local Execution

  • Passed
  • Pre-commit hooks: checks run manually (the license-headers hook cannot run on this machine's bash 3.2 — its script needs mapfile; hawkeye check passes directly and CI enforces it). cargo fmt, cargo clippy -p iggy_connector_doris_sink --all-targets -- -D warnings, cargo test -p iggy_connector_doris_sink (50 unit tests), markdownlint, and taplo all pass. Verified end-to-end against a real apache/doris:4.0.3-all-slim container: the connector-driven CSV test round-trips a value with a comma/quote/backslash byte for byte; the #[ignore]d JSON-vs-CSV benchmark reports CSV ~21% faster (server LoadTimeMs 37 vs 47, body 520KB vs 1060KB). The pre-existing 1000-row bulk test is flaky on the local constrained Doris BE (reproduces without these changes); it passes in CI.

AI Usage

  1. Claude Code (Anthropic).
  2. Implemented the Format enum + output_format config, the hand-rolled CSV encoder, the connector-driven correctness test, the #[ignore]d benchmark, and docs — after researching Doris CSV Stream Load escaping/NULL semantics and confirming them against the running container.
  3. 50 unit tests cover the encoder (escaping, null vs empty, column ordering, nested values); an integration test round-trips hazardous bytes through real Doris; the benchmark measures the load-time delta. Full crate suite, clippy, and lint pass locally.
  4. Yes.

The Doris sink classified transient Stream Load outcomes (5xx/408/429,
transport errors, Publish Timeout) as retryable but never acted on them: the
runtime commits the consumer offset at poll time before consume() runs and
discards its return value, so a transient backend blip silently dropped the
batch under at-most-once delivery.

consume() now retries a transiently-failed batch in-request, re-PUTing under
the same deterministic label so Doris dedupes a prior attempt that actually
landed (e.g. a 2xx whose body could not be read). Permanent failures are never
retried. Backoff and jitter come from iggy_connector_sdk::retry, bounded by new
max_retries/retry_delay/max_retry_delay config (defaults 3/200ms/5s). This
shrinks the at-most-once window within a single poll; cross-poll and crash
delivery remain a runtime concern, not something a sink can fix.

Relates to apache#3215.
The Doris sink only emitted JSON Stream Load, but Doris parses JSON on the BE
more expensively than CSV. A 10k-row benchmark shows CSV loads ~21% faster
server-side at roughly half the body size.

This adds an opt-in output_format = "csv" (default stays JSON). CSV is
positional where JSON is name-mapped, so it requires the columns config to pin
column order; open() rejects output_format = "csv" without it. The hand-rolled
encoder uses control-char framing (column_separator \x01, line_delimiter \x02)
with enclose/escape quoting so embedded separators, quotes, and newlines
round-trip; Doris uses escape-prefix rather than RFC-4180 doubling, so the csv
crate does not fit. JSON null and missing keys map to \N, empty string to "",
numbers and bools emit bare, nested values stringify to JSON. The field is
output_format (not format) to avoid an env-override collision with the
runtime's top-level plugin_config_format.

Verified end-to-end against a real apache/doris:4.0.3-all-slim container: a
connector-driven test round-trips a value containing a comma, a quote, and a
backslash byte for byte, alongside an ignored JSON-vs-CSV Stream Load benchmark.

Relates to apache#3215.
@github-actions

Copy link
Copy Markdown

Thanks for the PR. It is labeled S-waiting-on-review and queued for review.

Slash commands (own line, regular comment) move it around the queue:

  • /ready - back to S-waiting-on-review after addressing feedback
  • /author - flip to S-waiting-on-author while you finish changes
  • /request-review @user-or-team - request a reviewer

See CONTRIBUTING.md for details.

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 27, 2026
@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.80805% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.47%. Comparing base (307fdb1) to head (5d02391).

Files with missing lines Patch % Lines
core/connectors/sinks/doris_sink/src/lib.rs 93.80% 14 Missing and 6 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3575       +/-   ##
=============================================
- Coverage     74.07%   46.47%   -27.60%     
  Complexity      937      937               
=============================================
  Files          1249     1246        -3     
  Lines        128248   112024    -16224     
  Branches     104116    87892    -16224     
=============================================
- Hits          94994    52060    -42934     
- Misses        30219    57291    +27072     
+ Partials       3035     2673      -362     
Components Coverage Δ
Rust Core 39.25% <93.80%> (-35.46%) ⬇️
Java SDK 62.44% <ø> (ø)
C# SDK 72.06% <ø> (ø)
Python SDK 88.88% <ø> (ø)
PHP SDK 84.29% <ø> (ø)
Node SDK 91.35% <ø> (ø)
Go SDK 40.14% <ø> (ø)
Files with missing lines Coverage Δ
core/connectors/sinks/doris_sink/src/lib.rs 93.30% <93.80%> (+0.95%) ⬆️

... and 353 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ryankert01 ryankert01 marked this pull request as draft June 28, 2026 01:42
@github-actions github-actions Bot removed the S-waiting-on-review PR is waiting on a reviewer label Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant