Enable streaming transport in analytics-engine-rest QA by default#21575
Enable streaming transport in analytics-engine-rest QA by default#21575bowenlan-amzn wants to merge 3 commits intoopensearch-project:mainfrom
Conversation
|
❌ Gradle check result for aaa0c63: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
PR Reviewer Guide 🔍(Review updated until commit 30f0c0e)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 30f0c0e Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 601f5c8
Suggestions up to commit a6a4178
Suggestions up to commit 9ea752c
|
|
❌ Gradle check result for 9ea752c: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
9ea752c to
a6a4178
Compare
|
Persistent review updated to latest commit a6a4178 |
a6a4178 to
601f5c8
Compare
|
Persistent review updated to latest commit 601f5c8 |
|
❌ Gradle check result for 601f5c8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
…gTest Make streaming transport the default in the analytics-engine-rest QA suite so every IT runs with the streaming flag enabled, rather than only the one dedicated test. - Add `opensearch.experimental.feature.transport.stream.enabled=true` as a systemProperty on the default integTest testCluster. - Remove the `**/StreamingCoordinatorReduceIT.class` exclusion from integTest so the streaming test runs alongside the rest. - Drop the now-redundant `integTestStreaming` task and its dedicated testClusters.integTestStreaming config. Note: the single-shard / single-node path short-circuits local transport, so enabling the flag does not guarantee Flight RPC is exercised for every test — it just ensures no regression when it is. Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
DataFusion physical plans promote string columns to Utf8View inside aggregations (group-by keys, hash tables). Under the streaming transport, the shard ships that VSR straight to the coordinator via Arrow C Data + Flight; the reducer's partition stream is declared Utf8 (from Calcite), so a Utf8View batch reinterprets as broken Utf8 and the downstream read either OOMs the heap (multi-GB alleged length on VarCharVector.get) or throws NegativeArraySizeException. The non-streaming path doesn't see this because RowResponseCodec round-trips values through Object[] and rebuilds a fresh Utf8 VSR on the coordinator. Fix on the streaming receive path: normalize each inbound VSR so every variable-width string column is a VarCharVector (Utf8) before handing it to the reducer sink. Other vector types transfer through unchanged. Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
…ch ordering appendpipe lowers to UnionExec with CoalescePartitionsExec, which polls both partitions concurrently and emits whichever completes first. Under streaming transport, tokio scheduling can flip the winner causing branch 1 output to precede branch 0 — breaking appendpipe semantics. Fix: close senders sequentially with a semaphore-based synchronization point between each. After closing sender-i, wait for the drain thread to re-enter streamNext (proving all branch-i output has been consumed) before closing sender-(i+1). Single-sender cases degenerate to one close with no wait. Known limitation: a TOCTOU race exists where tryAcquire can grab the permit from release() before streamNext is actually entered. This needs a follow-up fix but significantly reduces the failure window. Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
601f5c8 to
30f0c0e
Compare
|
Persistent review updated to latest commit 30f0c0e |
|
❌ Gradle check result for 30f0c0e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Make streaming transport the default in the
analytics-engine-restQA suite so every IT runs with theopensearch.experimental.feature.transport.stream.enabledflag enabled, rather than only the dedicatedStreamingCoordinatorReduceIT. This guards against regressions in the Arrow Flight stream transport across the full IT surface instead of a single test.Note: the single-shard / single-node path short-circuits local transport, so flag-enabled does not guarantee Flight RPC is exercised for every test — it just ensures no regression when it is.
Changes
systemProperty 'opensearch.experimental.feature.transport.stream.enabled', 'true'to the defaulttestClusters.integTestconfiguration.exclude '**/StreamingCoordinatorReduceIT.class'line fromintegTestso the streaming test runs alongside the rest.integTestStreamingtask and itstestClusters.integTestStreamingconfiguration (its coverage is subsumed by the defaultintegTest).Known issue
Enabling the flag across the full IT suite may cause a cluster OOM mid-suite. This is being triaged in parallel under
stream:qa-oomand is not resolved by this PR. A follow-up may be needed once the root cause is identified.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.