feat(connectors): add opt-in CSV output format to Doris sink#3575
Draft
ryankert01 wants to merge 2 commits into
Draft
feat(connectors): add opt-in CSV output format to Doris sink#3575ryankert01 wants to merge 2 commits into
ryankert01 wants to merge 2 commits into
Conversation
The Doris sink classified transient Stream Load outcomes (5xx/408/429, transport errors, Publish Timeout) as retryable but never acted on them: the runtime commits the consumer offset at poll time before consume() runs and discards its return value, so a transient backend blip silently dropped the batch under at-most-once delivery. consume() now retries a transiently-failed batch in-request, re-PUTing under the same deterministic label so Doris dedupes a prior attempt that actually landed (e.g. a 2xx whose body could not be read). Permanent failures are never retried. Backoff and jitter come from iggy_connector_sdk::retry, bounded by new max_retries/retry_delay/max_retry_delay config (defaults 3/200ms/5s). This shrinks the at-most-once window within a single poll; cross-poll and crash delivery remain a runtime concern, not something a sink can fix. Relates to apache#3215.
The Doris sink only emitted JSON Stream Load, but Doris parses JSON on the BE more expensively than CSV. A 10k-row benchmark shows CSV loads ~21% faster server-side at roughly half the body size. This adds an opt-in output_format = "csv" (default stays JSON). CSV is positional where JSON is name-mapped, so it requires the columns config to pin column order; open() rejects output_format = "csv" without it. The hand-rolled encoder uses control-char framing (column_separator \x01, line_delimiter \x02) with enclose/escape quoting so embedded separators, quotes, and newlines round-trip; Doris uses escape-prefix rather than RFC-4180 doubling, so the csv crate does not fit. JSON null and missing keys map to \N, empty string to "", numbers and bools emit bare, nested values stringify to JSON. The field is output_format (not format) to avoid an env-override collision with the runtime's top-level plugin_config_format. Verified end-to-end against a real apache/doris:4.0.3-all-slim container: a connector-driven test round-trips a value containing a comma, a quote, and a backslash byte for byte, alongside an ignored JSON-vs-CSV Stream Load benchmark. Relates to apache#3215.
|
Thanks for the PR. It is labeled Slash commands (own line, regular comment) move it around the queue:
See CONTRIBUTING.md for details. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3575 +/- ##
=============================================
- Coverage 74.07% 46.47% -27.60%
Complexity 937 937
=============================================
Files 1249 1246 -3
Lines 128248 112024 -16224
Branches 104116 87892 -16224
=============================================
- Hits 94994 52060 -42934
- Misses 30219 57291 +27072
+ Partials 3035 2673 -362
🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR address?
Relates to #3215
Rationale
Follow-up to the Doris sink (#3215). JSON Stream Load parses more expensively on the Doris BE than CSV; a 10k-row benchmark shows CSV loads ~21% faster server-side at roughly half the body size. This adds CSV as an opt-in output format.
What changed?
The sink emitted JSON Stream Load only. This adds an opt-in
output_format = "csv"(default stays JSON). CSV is positional where JSON is name-mapped, so it requirescolumnsto pin the column order —open()fails onoutput_format = "csv"without it.The hand-rolled encoder uses control-char framing (
column_separator\x01,line_delimiter\x02) withenclose/escapequoting so embedded separators/quotes/newlines round-trip. Doris escapes with a prefix char (\"), not RFC-4180 quote-doubling, so thecsvcrate does not fit and no new dependency is added. JSONnulland missing keys map to\N, empty string to"", numbers and bools emit bare, nested values stringify to JSON. The field isoutput_format(notformat) to avoid an env-override collision with the runtime's top-levelplugin_config_format.Local Execution
license-headershook cannot run on this machine's bash 3.2 — its script needsmapfile;hawkeye checkpasses directly and CI enforces it).cargo fmt,cargo clippy -p iggy_connector_doris_sink --all-targets -- -D warnings,cargo test -p iggy_connector_doris_sink(50 unit tests),markdownlint, andtaploall pass. Verified end-to-end against a realapache/doris:4.0.3-all-slimcontainer: the connector-driven CSV test round-trips a value with a comma/quote/backslash byte for byte; the#[ignore]d JSON-vs-CSV benchmark reports CSV ~21% faster (server LoadTimeMs 37 vs 47, body 520KB vs 1060KB). The pre-existing 1000-row bulk test is flaky on the local constrained Doris BE (reproduces without these changes); it passes in CI.AI Usage
Formatenum +output_formatconfig, the hand-rolled CSV encoder, the connector-driven correctness test, the#[ignore]d benchmark, and docs — after researching Doris CSV Stream Load escaping/NULL semantics and confirming them against the running container.