[analytics-backend-datafusion] Implement partial/final aggregation mode for distributed execution by sandeshkr419 · Pull Request #21457 · opensearch-project/OpenSearch

sandeshkr419 · 2026-05-02T05:09:16Z

Introduces end-to-end distributed aggregate execution for the analytics plugin. Shards compute partial state, the coordinator merges to final results, and the whole path works across a multi-node multi-shard cluster. 39 of 43 ClickBench PPL queries now pass in that topology.

Architecture

The control split is layered to avoid per-function branches. Java owns the logical plan; Rust compiles each half (PARTIAL or FINAL) of whatever substrait it receives:

HEP marking: OpenSearchAggregateRule, OpenSearchProjectRule (with baseline scalar-op carve-out), and OpenSearchAggregateReduceRule (reduces AVG → SUM+COUNT+Project in a dedicated phase, before the Volcano split, using Calcite's testedmachinery).
Volcano CBO split: OpenSearchAggregateSplitRule turns SINGLE → FINAL(Exchange(PARTIAL(…))) structurally — no per-function logic.
Decomposition resolver (AggregateDecompositionResolver): single pass handling the three single-field shapes — pass-through (SUM/MIN/MAX), function-swap (COUNT → SUM at FINAL), engine-native merge (APPROX_COUNT_DISTINCT sketch). Rebuilds the exchange column row type with ArrowCalciteTypes (single-type bidirectional) and rewires parent projections.
DAG cut (DAGBuilder): structural only — cuts at OpenSearchExchangeReducer.
Backend conversion (DataFusionFragmentConvertor): mechanical Calcite → Substrait emission; APPROX_COUNT_DISTINCT routes to DataFusion's native HLL via the extension YAML + a subclassed AggregateFunctionConverter.
Instruction assembly (FragmentConversionDriver).
Execution handlers (PartialAggregateInstructionHandler, FinalAggregateInstructionHandler) → named FFI (preparePartialPlan, prepareFinalPlan, executeLocalPreparedPlan). No aggregate-mode enum on the wire.
Streaming (LocalStageScheduler): one-line arrowSchemaFromRowType(child.getFragment().getRowType()).
Physical (Rust agg_mode.rs): internal Mode {Default, Partial, Final}; force_aggregate_mode strips one half; CombinePartialFinalAggregate disabled on every SessionContext so Final(Partial(…)) survives to the strip pass.

Single source of truth for per-function decomposition

AggregateFunction enum (SPI) carries intermediateFields — the only authority for what the PARTIAL emits and how the FINAL reduces it. fromSqlAggFunction / toSqlAggFunction bridge between Calcite operators and enum constants. COUNT and APPROX_COUNT_DISTINCT declare their shapes there; SUM/MIN/MAX/AVG use null intermediates (pass-through post-reduction).

Multi-shard unblocks

Empty-batch schema preservation: shards that match zero rows still emit one zero-row VSR carrying the declared schema, so the FINAL plan's first data frame carries types downstream even when WHERE filters out everything.
Arrow IPC wire codec replaces row-oriented Object[] in FragmentExecutionResponse. Arrow's own serializer/deserializer round-trips every type natively (temporal, view, dictionary) without hand-rolled per-type dispatch. Removes the writeGenericValue(LocalDateTime) crash class that blocked all TIMESTAMP queries.
Java↔Rust type-coercion bridge at DatafusionReduceSink.feedToSender reconciles DataFusion's physical emit types (e.g. Utf8View for string group keys) with Calcite's declared exchange schema (Utf8) before pushing into the FINAL plan's StreamingTable partition.

TODOs to revisit (tracked independently)

The goal is to unblock distributed query execution framework and avoid spilling more changes in this already huge PR.

Type coercion at the right layer: the Utf8View → Utf8 shim in coerceToDeclaredSchema is a runtime bridge; the principled fix is a Rust force_output_schema cast at the PARTIAL root, a Substrait view-type extension, or verified Arrow Java ↔Rust FFI symmetry for view types.
Aggregate decomposition catalog → per-backend SPI: AggregateFunction.intermediateFields hard-codes DataFusion's state shapes into the shared enum; the catalog should move to BackendCapabilityProvider.aggregateDecompositions() (engine keeps the structural resolver, backends declare their own metadata).
MultisearchCommandIT.testMultisearchCountEvalConditionalCount and AppendPipeCommandIT.testMultisearchCountEvalConditionalCount - commented out for now - need to fix and revert this test back.

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2026-05-02T05:10:12Z

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit f87d308.

Path	Line	Severity	Description
sandbox/libs/dataformat-native/rust/Cargo.toml	31	high	New Rust dependency added: datafusion-physical-optimizer = "52.1.0". Per mandatory policy, all dependency additions must be flagged regardless of apparent legitimacy. Maintainers should verify this crate resolves to the expected artifact on crates.io and matches the datafusion ecosystem version already in use.
sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml	18	high	New Rust dependency added: datafusion-physical-optimizer = { workspace = true }. Per mandatory policy, all dependency additions must be flagged. Maintainers should confirm the workspace-level pin resolves to the correct datafusion-physical-optimizer crate and that no namespace hijacking or typosquatting is present.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 2 | Medium: 0 | Low: 0

Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.

⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

expani

Thanks for getting this going. @sandeshkr419

Dropping some early comments on the approach.

github-actions · 2026-05-03T02:11:39Z

PR Reviewer Guide 🔍

(Review updated until commit `614a8c5`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Possible Resource Leak When `preparedState != null`, the constructor catches exceptions during `executeLocalPreparedPlan` and closes senders allocated in the try block. However, if `preparedState` is non-null, the senders come from `preparedState.senders()` and are not allocated locally. The catch block at lines 125-138 closes senders only when `preparedState == null`, but if `executeLocalPreparedPlan` throws and `preparedState != null`, the `streamPtr` is closed but the state's senders are not. This could leak native resources if the state is not closed elsewhere. try { if (preparedState != null) { // Plan was already prepared by FinalAggregateInstructionHandler. The handler // registered senders in ctx.childInputs() iteration order; we re-index them // here by childStageId for lookup during feed(). int i = 0; for (Map.Entry<Integer, byte[]> child : childInputs.entrySet()) { senders.put(child.getKey(), preparedState.senders().get(i++)); } streamPtr = NativeBridge.executeLocalPreparedPlan(session.getPointer()); } else { // Legacy path (non-aggregate reduce): register partitions and execute the // fragment bytes directly. Used when no prior instruction prepared a plan. // // ctx.fragmentBytes() references each partition by its "input-<stageId>" name // (DataFusionFragmentConvertor names them this way during plan conversion). for (Map.Entry<Integer, byte[]> child : childInputs.entrySet()) { int childStageId = child.getKey(); byte[] schemaIpc = child.getValue(); long senderPtr = NativeBridge.registerPartitionStream(session.getPointer(), inputIdFor(childStageId), schemaIpc); senders.put(childStageId, new DatafusionPartitionSender(senderPtr)); } streamPtr = NativeBridge.executeLocalPlan(session.getPointer(), ctx.fragmentBytes()); } this.outStream = new StreamHandle(streamPtr, runtimeHandle); } catch (RuntimeException e) { if (streamPtr != 0) { NativeBridge.streamClose(streamPtr); } // Only close senders we allocated locally (legacy path). When preparedState // owns them, the state's close() will. if (preparedState == null) { for (DatafusionPartitionSender sender : senders.values()) { try { sender.close(); } catch (Throwable ignore) {} } session.close(); } throw e; } this.sendersByChildStageId = senders; Possible Double-Close At lines 96-105, if `provider.createSink` throws and `backendContext` is non-null, the code closes `backendContext` to avoid a leak. However, if `backendContext` is a `DataFusionReduceState` that was successfully prepared by the instruction handler, its session and senders are already registered and may be closed again by the state's own close method when the orchestrator cleans up. This could result in double-close on native handles if the state's close is idempotent but the native side is not. backendSink = provider.createSink(context, backendContext); } catch (Exception e) { // Sink creation failed — close backendContext to avoid resource leak. if (backendContext != null) { try { backendContext.close(); } catch (Exception closeFailure) { e.addSuppressed(closeFailure); } } Possible Resource Leak If `prepareFinalPlan` throws after some senders have been registered, the catch block at lines 56-63 closes all senders and the session. However, if a sender's close itself throws, the exception is silently ignored (line 59), and the session close at line 61 may not run if an earlier sender close threw an unchecked exception that bypassed the loop. This could leave the session open. for (DatafusionPartitionSender sender : senders) { try { sender.close(); } catch (Exception ignored) {} } session.close(); throw e; } Possible Issue At line 268, the resolver reads the canonical name from `aggRowType.getFieldList().get(groupCount + i).getName()` to align with DataFusion's `$f<N>` convention. However, if the aggregate call list is empty (no aggregates, only group keys), `groupCount + i` could be out of bounds. The loop condition `i < agg.getAggCallList().size()` prevents this, but if `aggRowType` has fewer fields than expected (e.g., due to a prior rewrite), the index could still be invalid. for (int i = 0; i < agg.getAggCallList().size(); i++) { AggregateCall call = agg.getAggCallList().get(i); String canonicalName = aggRowType.getFieldList().get(groupCount + i).getName(); CallRewrite rw = rewriteAggCall(call, finalColIdx, tf, canonicalName);

github-actions · 2026-05-03T02:12:23Z

PR Code Suggestions ✨

Latest suggestions up to 31e8fa2

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Validate column count before coercion Verify that `batch.getFieldVectors().size()` matches `declaredSchema.getFields().size()` before the loop to prevent `IndexOutOfBoundsException` when the batch has fewer columns than the declared schema. This mismatch can occur if the shard's partial aggregate emits a different column count than the coordinator expects. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionReduceSink.java [255-265] private static VectorSchemaRoot coerceToDeclaredSchema(VectorSchemaRoot batch, Schema declaredSchema, BufferAllocator alloc) { if (batch.getSchema().equals(declaredSchema)) { return batch; + } + if (batch.getFieldVectors().size() != declaredSchema.getFields().size()) { + batch.close(); + throw new IllegalStateException( + "Batch column count (" + batch.getFieldVectors().size() + ") does not match declared schema (" + declaredSchema.getFields().size() + ")" + ); } VectorSchemaRoot out = VectorSchemaRoot.create(declaredSchema, alloc); try { out.allocateNew(); int rows = batch.getRowCount(); for (int col = 0; col < declaredSchema.getFields().size(); col++) { FieldVector src = batch.getVector(col); FieldVector dst = out.getVector(col); ... } out.setRowCount(rows); } catch (RuntimeException e) { out.close(); throw e; } finally { batch.close(); } return out; } Suggestion importance[1-10]: 8 __ Why: The suggestion addresses a critical mismatch scenario where `batch.getFieldVectors().size()` differs from `declaredSchema.getFields().size()`, which would cause an `IndexOutOfBoundsException` in the loop. Validating upfront prevents silent failures and provides a clear diagnostic message.	Medium
	Validate prepared plan before execution The `execute_prepared` method panics if no plan was prepared, but this FFI function doesn't validate that condition before calling it. Add an explicit check and return an error code instead of allowing a panic across the FFI boundary. sandbox/plugins/analytics-backend-datafusion/rust/src/ffm.rs [822-829] pub unsafe extern "C" fn df_execute_local_prepared_plan(session_ptr: i64) -> i64 { let session = &(session_ptr as const crate::local_executor::LocalSession); - ... + if session.prepared_plan.is_none() { + return ffm_error("No plan prepared on session".to_string()); + } + let mgr = get_rt_manager()?; + let _guard = mgr.io_runtime.enter(); let df_stream = session.execute_prepared().map_err(\|e\| e.to_string())?; Suggestion importance[1-10]: 8 __ Why: This is a valid safety concern for FFI boundaries. The `execute_prepared` method explicitly panics if no plan is prepared, and allowing a panic across FFI is dangerous. Adding validation prevents undefined behavior.	Medium
	Add bounds checking for array access Add bounds checking before accessing `inputTypes.get(ref.getIndex())` in the `RexShuttle` to prevent `IndexOutOfBoundsException` when a `RexInputRef` references a column index beyond the replacement's row type field count. This can occur if the Project's expressions reference columns that were removed during aggregate rewriting. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/dag/AggregateDecompositionResolver.java [210-214] private static RelNode replaceFirstWithRefRebinding(RelNode node, RelNode target, RelNode replacement) { if (node == target) return replacement; java.util.List<RelNode> newInputs = new java.util.ArrayList<>(); boolean changed = false; for (RelNode input : node.getInputs()) { RelNode newInput; if (input == target) { newInput = replacement; if (node instanceof Project proj) { RexBuilder rexBuilder = node.getCluster().getRexBuilder(); java.util.List<RelDataType> inputTypes = new java.util.ArrayList<>(); for (var f : replacement.getRowType().getFieldList()) { inputTypes.add(f.getType()); } RexShuttle rebind = new RexShuttle() { @Override public RexNode visitInputRef(RexInputRef ref) { + if (ref.getIndex() >= inputTypes.size()) { + throw new IllegalStateException( + "RexInputRef index " + ref.getIndex() + " out of bounds for replacement row type with " + inputTypes.size() + " fields" + ); + } RelDataType actual = inputTypes.get(ref.getIndex()); if (ref.getType().equals(actual)) return ref; return new RexInputRef(ref.getIndex(), actual); } }; java.util.List<RexNode> rebound = new java.util.ArrayList<>(proj.getProjects().size()); for (int i = 0; i < proj.getProjects().size(); i++) { RexNode expr = proj.getProjects().get(i).accept(rebind); RelDataType targetType = proj.getRowType().getFieldList().get(i).getType(); if (!expr.getType().equals(targetType)) { expr = rexBuilder.makeCast(targetType, expr); } rebound.add(expr); } return proj.copy(proj.getTraitSet(), replacement, rebound, proj.getRowType()); } } else { newInput = replaceFirstWithRefRebinding(input, target, replacement); } newInputs.add(newInput); if (newInput != input) changed = true; } return changed ? node.copy(node.getTraitSet(), newInputs) : node; } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential `IndexOutOfBoundsException` when `ref.getIndex()` exceeds `inputTypes.size()`. Adding bounds checking with a clear error message improves robustness and debuggability during aggregate rewriting edge cases.	Medium
	Add bounds check for array access Direct array indexing without bounds checking can panic if `children()` returns an empty slice. Verify that `agg.children()` is non-empty before accessing index 0, or use a safer method like `first()` with error handling. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [76-81] AggregateMode::Final => { // Current node is Partial; skip it, return its child // (the Final above will keep itself) - let child = agg.children()[0]; + let child = agg.children().first().ok_or_else(\|\| { + datafusion_common::DataFusionError::Internal( + "AggregateExec has no children".to_string() + ) + })?; force_aggregate_mode(Arc::clone(child), target) } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential panic from unchecked array indexing. However, in the context of `AggregateExec`, it's expected to have at least one child, making this a defensive programming improvement rather than a critical bug fix.	Medium
General	Ensure batch cleanup on concatenation failure Close all accumulated batches in the `batches` list if an exception occurs during the multi-batch concatenation phase (after the reader loop completes). Currently, only the reader-loop failures clean up batches; a failure in the concatenation logic below would leak the already-transferred batch roots. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/stage/RowResponseCodec.java [82-110] List<VectorSchemaRoot> batches = new ArrayList<>(); Schema schema; try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(payload), allocator)) { VectorSchemaRoot readerRoot = reader.getVectorSchemaRoot(); schema = readerRoot.getSchema(); while (reader.loadNextBatch()) { VectorSchemaRoot batchRoot = VectorSchemaRoot.create(schema, allocator); int rowCount = readerRoot.getRowCount(); for (int i = 0; i < readerRoot.getFieldVectors().size(); i++) { readerRoot.getVector(i).makeTransferPair(batchRoot.getVector(i)).transfer(); } batchRoot.setRowCount(rowCount); batches.add(batchRoot); } } catch (IOException e) { for (VectorSchemaRoot b : batches) b.close(); throw new IllegalStateException("Failed to decode Arrow IPC payload from fragment response", e); } +if (batches.isEmpty()) { + return VectorSchemaRoot.create(schema, allocator); +} +if (batches.size() == 1) { + return batches.get(0); +} +VectorSchemaRoot combined = null; +try { + int totalRows = batches.stream().mapToInt(VectorSchemaRoot::getRowCount).sum(); + combined = VectorSchemaRoot.create(schema, allocator); + combined.allocateNew(); + ... + combined.setRowCount(totalRows); + return combined; +} catch (RuntimeException e) { + if (combined != null) combined.close(); + throw e; +} finally { + for (VectorSchemaRoot b : batches) + b.close(); +} + Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a resource leak when the concatenation phase (after the reader loop) throws an exception. The existing `finally` block in the concatenation section already closes `batches`, but wrapping the entire concatenation in a try-catch that also closes `combined` on failure is a defensive improvement.	Medium
General	Prevent stack overflow from deep recursion The recursive call to `force_aggregate_mode` could lead to stack overflow for deeply nested plans. Consider adding a depth limit or converting to an iterative approach with an explicit stack for production robustness. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [59-64] -let new_children: Vec<Arc<dyn ExecutionPlan>> = agg - .children() - .into_iter() - .map(\|c\| force_aggregate_mode(Arc::clone(c), target)) - .collect::<Result<_>>()?; -return plan.with_new_children(new_children); +const MAX_RECURSION_DEPTH: usize = 1000; +fn force_aggregate_mode_impl( + plan: Arc<dyn ExecutionPlan>, + target: AggregateMode, + depth: usize, +) -> Result<Arc<dyn ExecutionPlan>> { + if depth > MAX_RECURSION_DEPTH { + return Err(datafusion_common::DataFusionError::Internal( + "Maximum recursion depth exceeded".to_string() + )); + } + // ... existing logic with depth + 1 passed to recursive calls +} Suggestion importance[1-10]: 5 __ Why: While recursion depth limits are good defensive programming, deeply nested execution plans are uncommon in practice. The suggestion addresses a theoretical concern rather than a likely real-world issue, and the proposed solution would require significant refactoring.	Low

Previous suggestions

Suggestions up to commit ee0a3d7

Category	Suggestion	Impact
Possible issue	Validate column count before coercion loop Verify that `batch.getFieldVectors().size()` matches `declaredSchema.getFields().size()` before the column loop. If the batch has fewer columns than the declared schema, accessing `batch.getVector(col)` will throw `IndexOutOfBoundsException`. Add an explicit check and throw a descriptive error if column counts mismatch. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionReduceSink.java [255-269] private static VectorSchemaRoot coerceToDeclaredSchema(VectorSchemaRoot batch, Schema declaredSchema, BufferAllocator alloc) { if (batch.getSchema().equals(declaredSchema)) { return batch; + } + if (batch.getFieldVectors().size() != declaredSchema.getFields().size()) { + batch.close(); + throw new IllegalStateException( + "Column count mismatch: batch has " + batch.getFieldVectors().size() + " columns, declared schema has " + declaredSchema.getFields().size() + ); } VectorSchemaRoot out = VectorSchemaRoot.create(declaredSchema, alloc); try { out.allocateNew(); int rows = batch.getRowCount(); for (int col = 0; col < declaredSchema.getFields().size(); col++) { FieldVector src = batch.getVector(col); FieldVector dst = out.getVector(col); ... } out.setRowCount(rows); } catch (RuntimeException e) { out.close(); throw e; } finally { batch.close(); } return out; } Suggestion importance[1-10]: 8 __ Why: This is a valid safety check that prevents `IndexOutOfBoundsException` when the batch and declared schema have mismatched column counts. The suggestion correctly identifies that the existing code assumes column count equality without verification. The improved code adds an explicit check with a clear error message, which is important for debugging schema mismatches at the Java→Rust boundary.	Medium
	Prevent panic across FFI boundary The `execute_prepared` method panics if no plan was prepared, but this panic will cross the FFI boundary unsafely. Add explicit validation and return an error code instead of allowing a panic to propagate through the C ABI. sandbox/plugins/analytics-backend-datafusion/rust/src/ffm.rs [822-829] pub unsafe extern "C" fn df_execute_local_prepared_plan(session_ptr: i64) -> i64 { let session = &(session_ptr as const crate::local_executor::LocalSession); + if session.prepared_plan.is_none() { + return ffm_error("execute_local_prepared_plan called without a prepared plan"); + } ... let df_stream = session.execute_prepared().map_err(\|e\| e.to_string())?; Suggestion importance[1-10]: 8 __ Why: This is a critical safety issue. The `execute_prepared` method panics if no plan exists, and panics crossing FFI boundaries cause undefined behavior. The suggestion correctly proposes validating `prepared_plan` before calling `execute_prepared`.	Medium
	Add bounds check for input reference Add bounds checking before accessing `inputTypes.get(ref.getIndex())` in the `RexShuttle` visitor. If the replacement's row type has fewer fields than the reference index, this will throw `IndexOutOfBoundsException`. Validate that `ref.getIndex() < inputTypes.size()` and handle the mismatch case explicitly. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/dag/AggregateDecompositionResolver.java [210-214] -private static RelNode replaceFirstWithRefRebinding(RelNode node, RelNode target, RelNode replacement) { - if (node == target) return replacement; - java.util.List<RelNode> newInputs = new java.util.ArrayList<>(); - boolean changed = false; - for (RelNode input : node.getInputs()) { - RelNode newInput; - if (input == target) { - newInput = replacement; - if (node instanceof Project proj) { - RexBuilder rexBuilder = node.getCluster().getRexBuilder(); - java.util.List<RelDataType> inputTypes = new java.util.ArrayList<>(); - for (var f : replacement.getRowType().getFieldList()) { - inputTypes.add(f.getType()); - } - RexShuttle rebind = new RexShuttle() { - @Override - public RexNode visitInputRef(RexInputRef ref) { - RelDataType actual = inputTypes.get(ref.getIndex()); - if (ref.getType().equals(actual)) return ref; - return new RexInputRef(ref.getIndex(), actual); - } - }; - java.util.List<RexNode> rebound = new java.util.ArrayList<>(proj.getProjects().size()); - for (int i = 0; i < proj.getProjects().size(); i++) { - RexNode expr = proj.getProjects().get(i).accept(rebind); - RelDataType targetType = proj.getRowType().getFieldList().get(i).getType(); - if (!expr.getType().equals(targetType)) { - expr = rexBuilder.makeCast(targetType, expr); - } - rebound.add(expr); - } - return proj.copy(proj.getTraitSet(), replacement, rebound, proj.getRowType()); - } - } else { - newInput = replaceFirstWithRefRebinding(input, target, replacement); +RexShuttle rebind = new RexShuttle() { + @Override + public RexNode visitInputRef(RexInputRef ref) { + if (ref.getIndex() >= inputTypes.size()) { + throw new IllegalStateException( + "RexInputRef index " + ref.getIndex() + " out of bounds for replacement row type with " + inputTypes.size() + " fields" + ); } - newInputs.add(newInput); - if (newInput != input) changed = true; + RelDataType actual = inputTypes.get(ref.getIndex()); + if (ref.getType().equals(actual)) return ref; + return new RexInputRef(ref.getIndex(), actual); } - return changed ? node.copy(node.getTraitSet(), newInputs) : node; -} +}; Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential `IndexOutOfBoundsException` when accessing `inputTypes.get(ref.getIndex())` without bounds checking. However, the scenario is somewhat unlikely in practice since the replacement's row type should match the Project's expectations if the decomposition logic is correct. The check adds defensive programming but may not catch real bugs if the root cause is incorrect decomposition logic upstream.	Medium
	Validate children before array access Direct array indexing without bounds checking can panic if `children()` returns an empty slice. Validate that `children()` has at least one element before accessing index 0 to prevent potential runtime panics. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [76-81] AggregateMode::Final => { // Current node is Partial; skip it, return its child // (the Final above will keep itself) - let child = agg.children()[0]; - force_aggregate_mode(Arc::clone(child), target) + let children = agg.children(); + if children.is_empty() { + return Err(datafusion_common::DataFusionError::Internal( + "AggregateExec has no children".to_string() + )); + } + force_aggregate_mode(Arc::clone(children[0]), target) } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential panic from unchecked array access at `agg.children()[0]`. Adding bounds checking would prevent runtime panics, though `AggregateExec` typically has at least one child in practice.	Medium
General	Close WriteChannel to prevent resource leak Close the `WriteChannel` explicitly before returning the response. Arrow's `WriteChannel` wraps a `WritableByteChannel` and may buffer data; not closing it risks incomplete writes or resource leaks. Add a `channel.close()` call in a finally block or use try-with-resources. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/AnalyticsSearchService.java [246-247] FragmentExecutionResponse collectResponse(EngineResultStream stream, @Nullable AnalyticsShardTask task) { ByteArrayOutputStream baos = new ByteArrayOutputStream(); - WriteChannel channel = new WriteChannel(Channels.newChannel(baos)); Schema schema = null; int totalRows = 0; Iterator<EngineResultBatch> it = stream.iterator(); - try { + try (WriteChannel channel = new WriteChannel(Channels.newChannel(baos))) { while (it.hasNext()) { if (task != null && task.isCancelled()) { throw new TaskCancelledException("task cancelled: " + task.getReasonCancelled()); } EngineResultBatch batch = it.next(); VectorSchemaRoot root = batch.getArrowRoot(); try { if (schema == null) { schema = root.getSchema(); MessageSerializer.serialize(channel, schema); } try (ArrowRecordBatch recordBatch = new VectorUnloader(root).getRecordBatch()) { MessageSerializer.serialize(channel, recordBatch); } totalRows += root.getRowCount(); } finally { root.close(); } } if (schema != null) { ArrowStreamWriter.writeEndOfStream(channel, IpcOption.DEFAULT); } } catch (IOException e) { throw new IllegalStateException("Failed to serialize fragment output as Arrow IPC stream", e); } return new FragmentExecutionResponse(baos.toByteArray(), totalRows); } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that `WriteChannel` should be closed to ensure buffered data is flushed. However, `WriteChannel` wraps a `ByteArrayOutputStream` which doesn't require explicit closing for correctness (it's an in-memory buffer). The try-with-resources pattern is still good practice for consistency and future-proofing, but the impact is lower than suggested since no actual resource leak occurs with `ByteArrayOutputStream`.	Low
General	Add warning for unexpected plan structure The function silently returns `None` when encountering repartition/coalesce nodes with multiple children, which may hide unexpected plan structures. Consider logging a warning or returning an error to make debugging easier when the plan structure doesn't match expectations. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [113-123] fn find_partial_input(plan: Arc<dyn ExecutionPlan>) -> Option<Arc<dyn ExecutionPlan>> { ... if plan.as_any().downcast_ref::<RepartitionExec>().is_some() \|\| plan .as_any() .downcast_ref::<CoalescePartitionsExec>() .is_some() { let children = plan.children(); if children.len() == 1 { return find_partial_input(Arc::clone(children[0])); } + // Unexpected: repartition/coalesce with != 1 child + log::warn!("find_partial_input: unexpected child count {} for {}", children.len(), plan.name()); } None } Suggestion importance[1-10]: 4 __ Why: While adding logging for unexpected plan structures could aid debugging, this is a minor enhancement. The function correctly returns `None` for unhandled cases, and the silent return is acceptable behavior for an optional search function.	Low

Suggestions up to commit 1e81404

Suggestions up to commit e648ee7

Category	Suggestion	Impact
Possible issue	Close channel to flush buffered writes Close the `WriteChannel` explicitly after serialization completes to ensure all buffered data is flushed and resources are released. Without closing the channel, buffered writes may not be flushed to the underlying `ByteArrayOutputStream`, potentially resulting in incomplete IPC payloads. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/exec/AnalyticsSearchService.java [241-278] -FragmentExecutionResponse collectResponse(EngineResultStream stream, @Nullable AnalyticsShardTask task) { - ByteArrayOutputStream baos = new ByteArrayOutputStream(); - WriteChannel channel = new WriteChannel(Channels.newChannel(baos)); - Schema schema = null; - int totalRows = 0; - Iterator<EngineResultBatch> it = stream.iterator(); +try { + while (it.hasNext()) { + ... + } + if (schema != null) { + ArrowStreamWriter.writeEndOfStream(channel, IpcOption.DEFAULT); + } +} catch (IOException e) { + throw new IllegalStateException("Failed to serialize fragment output as Arrow IPC stream", e); +} finally { try { - while (it.hasNext()) { - if (task != null && task.isCancelled()) { - throw new TaskCancelledException("task cancelled: " + task.getReasonCancelled()); - } - EngineResultBatch batch = it.next(); - VectorSchemaRoot root = batch.getArrowRoot(); - try { - if (schema == null) { - schema = root.getSchema(); - MessageSerializer.serialize(channel, schema); - } - try (ArrowRecordBatch recordBatch = new VectorUnloader(root).getRecordBatch()) { - MessageSerializer.serialize(channel, recordBatch); - } - totalRows += root.getRowCount(); - } finally { - root.close(); - } - } - if (schema != null) { - ArrowStreamWriter.writeEndOfStream(channel, IpcOption.DEFAULT); - } + channel.close(); } catch (IOException e) { - throw new IllegalStateException("Failed to serialize fragment output as Arrow IPC stream", e); + // log or suppress } - return new FragmentExecutionResponse(baos.toByteArray(), totalRows); } Suggestion importance[1-10]: 9 __ Why: The suggestion identifies a resource leak where `WriteChannel` is not explicitly closed, potentially leaving buffered data unflushed. This could result in incomplete IPC payloads being sent, which is a critical correctness issue. Closing the channel in a finally block ensures proper resource cleanup.	High
	Validate column count and null vectors Verify that `batch.getVector(col)` does not return null before dereferencing. If the batch schema has fewer columns than `declaredSchema`, accessing `batch.getVector(col)` may return null or throw an exception. Add a null check and validate column count consistency. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionReduceSink.java [260-267] -private static VectorSchemaRoot coerceToDeclaredSchema(VectorSchemaRoot batch, Schema declaredSchema, BufferAllocator alloc) { - if (batch.getSchema().equals(declaredSchema)) { - return batch; +for (int col = 0; col < declaredSchema.getFields().size(); col++) { + if (col >= batch.getFieldVectors().size()) { + throw new IllegalStateException("Batch has fewer columns than declaredSchema: batch=" + batch.getFieldVectors().size() + ", declared=" + declaredSchema.getFields().size()); } - VectorSchemaRoot out = VectorSchemaRoot.create(declaredSchema, alloc); - try { - out.allocateNew(); - int rows = batch.getRowCount(); - for (int col = 0; col < declaredSchema.getFields().size(); col++) { - FieldVector src = batch.getVector(col); - FieldVector dst = out.getVector(col); - ... - } - out.setRowCount(rows); - } catch (RuntimeException e) { - out.close(); - throw e; - } finally { - batch.close(); + FieldVector src = batch.getVector(col); + if (src == null) { + throw new IllegalStateException("Batch vector at column " + col + " is null"); } - return out; + FieldVector dst = out.getVector(col); + ... } Suggestion importance[1-10]: 8 __ Why: The suggestion addresses a critical issue where accessing `batch.getVector(col)` without validating column count or null checks could cause `NullPointerException` or `IndexOutOfBoundsException`. This validation is essential for preventing runtime failures when batch and declared schemas mismatch.	Medium
	Prevent panic in FFI boundary The `execute_prepared` method panics if no plan was prepared, but this FFI function doesn't validate that condition before calling it. Add an explicit check to return a proper error instead of allowing a panic to cross the FFI boundary. sandbox/plugins/analytics-backend-datafusion/rust/src/ffm.rs [822-829] pub unsafe extern "C" fn df_execute_local_prepared_plan(session_ptr: i64) -> i64 { let session = &(session_ptr as const crate::local_executor::LocalSession); - ... + if session.prepared_plan.is_none() { + return Err("No plan prepared. Call df_prepare_final_plan first.".to_string()); + } + let mgr = get_rt_manager()?; + let _guard = mgr.io_runtime.enter(); let df_stream = session.execute_prepared().map_err(\|e\| e.to_string())?; Suggestion importance[1-10]: 8 __ Why: Valid concern about panics crossing FFI boundaries. The `execute_prepared` method at line 190-195 in `local_executor.rs` explicitly panics if no plan is prepared, and this FFI function should validate that condition before calling it to return a proper error instead.	Medium
	Add bounds checking for index access Add bounds checking before accessing `inputTypes` by index in the `RexShuttle` visitor. If `ref.getIndex()` exceeds the size of `inputTypes`, the code will throw an `IndexOutOfBoundsException` at runtime. Validate that the index is within bounds and handle the error case explicitly. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/dag/AggregateDecompositionResolver.java [208-214] -private static RelNode replaceFirstWithRefRebinding(RelNode node, RelNode target, RelNode replacement) { - if (node == target) return replacement; - java.util.List<RelNode> newInputs = new java.util.ArrayList<>(); - boolean changed = false; - for (RelNode input : node.getInputs()) { - RelNode newInput; - if (input == target) { - newInput = replacement; - if (node instanceof Project proj) { - RexBuilder rexBuilder = node.getCluster().getRexBuilder(); - java.util.List<RelDataType> inputTypes = new java.util.ArrayList<>(); - for (var f : replacement.getRowType().getFieldList()) { - inputTypes.add(f.getType()); - } - RexShuttle rebind = new RexShuttle() { - @Override - public RexNode visitInputRef(RexInputRef ref) { - RelDataType actual = inputTypes.get(ref.getIndex()); - if (ref.getType().equals(actual)) return ref; - return new RexInputRef(ref.getIndex(), actual); - } - }; - java.util.List<RexNode> rebound = new java.util.ArrayList<>(proj.getProjects().size()); - for (int i = 0; i < proj.getProjects().size(); i++) { - RexNode expr = proj.getProjects().get(i).accept(rebind); - RelDataType targetType = proj.getRowType().getFieldList().get(i).getType(); - if (!expr.getType().equals(targetType)) { - expr = rexBuilder.makeCast(targetType, expr); - } - rebound.add(expr); - } - return proj.copy(proj.getTraitSet(), replacement, rebound, proj.getRowType()); - } - } else { - newInput = replaceFirstWithRefRebinding(input, target, replacement); +RexShuttle rebind = new RexShuttle() { + @Override + public RexNode visitInputRef(RexInputRef ref) { + int idx = ref.getIndex(); + if (idx < 0 \|\| idx >= inputTypes.size()) { + throw new IllegalStateException("RexInputRef index " + idx + " out of bounds for inputTypes size " + inputTypes.size()); } - newInputs.add(newInput); - if (newInput != input) changed = true; + RelDataType actual = inputTypes.get(idx); + if (ref.getType().equals(actual)) return ref; + return new RexInputRef(idx, actual); } - return changed ? node.copy(node.getTraitSet(), newInputs) : node; -} +}; Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential `IndexOutOfBoundsException` when accessing `inputTypes.get(ref.getIndex())` without bounds checking. Adding validation prevents runtime crashes when the index is out of bounds, improving robustness.	Medium
	Validate array bounds before indexing Direct array indexing without bounds checking can panic if `children()` returns an empty slice. Validate that `children()` has at least one element before accessing index 0 to prevent potential runtime panics. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [76-81] AggregateMode::Final => { // Current node is Partial; skip it, return its child // (the Final above will keep itself) - let child = agg.children()[0]; - force_aggregate_mode(Arc::clone(child), target) + let children = agg.children(); + if children.is_empty() { + return Err(datafusion_common::DataFusionError::Internal( + "AggregateExec has no children".to_string() + )); + } + force_aggregate_mode(Arc::clone(children[0]), target) } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential panic from unchecked array indexing at line 79. Adding bounds validation would improve robustness, though `AggregateExec` typically has at least one child in practice.	Medium
General	Validate plan alternative count consistency Validate that `parentStage.getPlanAlternatives()` and `childStage.getPlanAlternatives()` have matching sizes before indexing into `rewriteResults`. The current code uses `Math.min(i, rewriteResults.size() - 1)` as a fallback, but if the parent has more alternatives than the child, this silently reuses the last child's rewrite result for multiple parent plans, which may not be semantically correct. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/dag/AggregateDecompositionResolver.java [96-113] -private static void resolvePartialFinalPair(Stage parentStage, Stage childStage) { - List<StagePlan> resolvedChildPlans = new ArrayList<>(childStage.getPlanAlternatives().size()); - List<StagePlan> resolvedParentPlans = new ArrayList<>(parentStage.getPlanAlternatives().size()); - List<RewriteResult> rewriteResults = new ArrayList<>(); - - for (StagePlan childPlan : childStage.getPlanAlternatives()) { - OpenSearchAggregate partialAgg = findTopAggregate(childPlan.resolvedFragment(), AggregateMode.PARTIAL); - if (partialAgg == null) { - resolvedChildPlans.add(childPlan); - rewriteResults.add(null); - continue; - } - RewriteResult result = rewriteDecomposed(partialAgg); - rewriteResults.add(result); - RelNode newChildFragment = replaceFirst(childPlan.resolvedFragment(), partialAgg, result.newPartial(partialAgg)); - resolvedChildPlans.add(new StagePlan(newChildFragment, childPlan.backendId())); - } - - boolean anyChildRewritten = rewriteResults.stream().anyMatch(r -> r != null); - if (!anyChildRewritten) return; - - childStage.setPlanAlternatives(resolvedChildPlans); - - for (int i = 0; i < parentStage.getPlanAlternatives().size(); i++) { - StagePlan parentPlan = parentStage.getPlanAlternatives().get(i); - RewriteResult result = rewriteResults.get(Math.min(i, rewriteResults.size() - 1)); - ... - } - parentStage.setPlanAlternatives(resolvedParentPlans); +if (parentStage.getPlanAlternatives().size() != childStage.getPlanAlternatives().size()) { + throw new IllegalStateException("Parent and child stages must have the same number of plan alternatives for decomposition pairing"); +} +for (int i = 0; i < parentStage.getPlanAlternatives().size(); i++) { + StagePlan parentPlan = parentStage.getPlanAlternatives().get(i); + RewriteResult result = rewriteResults.get(i); + ... } Suggestion importance[1-10]: 6 __ Why: The suggestion points out that using `Math.min(i, rewriteResults.size() - 1)` silently reuses the last child's rewrite result when parent has more alternatives than child. While the current fallback prevents crashes, validating size equality upfront would make the assumption explicit and catch potential logic errors earlier.	Low
General	Handle unexpected multi-child repartition nodes The function silently returns `None` when encountering repartition/coalesce nodes with multiple children, which may hide unexpected plan structures. Consider logging or returning an error for multi-child cases to aid debugging. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [113-123] fn find_partial_input(plan: Arc<dyn ExecutionPlan>) -> Option<Arc<dyn ExecutionPlan>> { ... if plan.as_any().downcast_ref::<RepartitionExec>().is_some() \|\| plan .as_any() .downcast_ref::<CoalescePartitionsExec>() .is_some() { let children = plan.children(); if children.len() == 1 { return find_partial_input(Arc::clone(children[0])); } + // Unexpected: repartition/coalesce with != 1 child + return None; } None } Suggestion importance[1-10]: 3 __ Why: The suggestion points out that multi-child cases return `None` silently. However, the 'improved_code' is identical to the 'existing_code' (just adds a comment), so this offers minimal practical improvement. The current behavior is acceptable for the function's purpose.	Low

Suggestions up to commit e7cf5d1

Category	Suggestion	Impact
Possible issue	Prevent double-close on fast-path return The `batch` input is closed unconditionally in the `finally` block, even when the fast-path returns early (`batch.getSchema().equals(declaredSchema)`). This causes a double-close when the caller tries to use the returned batch. Move `batch.close()` inside the `try` block after the coercion loop completes, so the early-return path leaves `batch` open for the caller. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionReduceSink.java [255-297] private static VectorSchemaRoot coerceToDeclaredSchema(VectorSchemaRoot batch, Schema declaredSchema, BufferAllocator alloc) { if (batch.getSchema().equals(declaredSchema)) { return batch; } VectorSchemaRoot out = VectorSchemaRoot.create(declaredSchema, alloc); try { out.allocateNew(); int rows = batch.getRowCount(); for (int col = 0; col < declaredSchema.getFields().size(); col++) { FieldVector src = batch.getVector(col); FieldVector dst = out.getVector(col); if (src.getField().getType().equals(dst.getField().getType())) { src.makeTransferPair(dst).transfer(); continue; } ... } out.setRowCount(rows); + batch.close(); } catch (RuntimeException e) { out.close(); + batch.close(); throw e; - } finally { - batch.close(); } return out; } Suggestion importance[1-10]: 9 __ Why: The `batch` is closed unconditionally in the `finally` block even when the fast-path returns early at line 256. This causes a double-close when the caller tries to use the returned batch. Moving `batch.close()` inside the `try` block after the coercion loop completes prevents this critical resource management bug.	High
	Validate prepared plan before execution The function calls `execute_prepared()` which panics if no plan was prepared, but this panic will cross the FFI boundary unsafely. Add explicit validation that `prepared_plan` is `Some` before calling `execute_prepared()` and return a proper error code instead of allowing a panic. sandbox/plugins/analytics-backend-datafusion/rust/src/ffm.rs [822-829] pub unsafe extern "C" fn df_execute_local_prepared_plan(session_ptr: i64) -> i64 { let session = &(session_ptr as const crate::local_executor::LocalSession); + if session.prepared_plan.is_none() { + return Err("No plan prepared. Call df_prepare_final_plan first.".to_string()); + } ... let df_stream = session.execute_prepared().map_err(\|e\| e.to_string())?; Suggestion importance[1-10]: 9 __ Why: The function calls `execute_prepared()` which panics if no plan was prepared. Since this crosses the FFI boundary, allowing a panic is unsafe and could crash the JVM. Adding validation prevents this critical issue.	High
	Add bounds check for children access The code directly accesses `agg.children()[0]` without bounds checking. If the aggregate node has no children, this will panic at runtime. Add a bounds check or use `get(0)` with proper error handling to prevent potential panics. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [76-81] AggregateMode::Final => { // Current node is Partial; skip it, return its child // (the Final above will keep itself) - let child = agg.children()[0]; - force_aggregate_mode(Arc::clone(child), target) + let children = agg.children(); + if children.is_empty() { + return Err(datafusion_common::DataFusionError::Internal( + "AggregateExec has no children".to_string() + )); + } + force_aggregate_mode(Arc::clone(children[0]), target) } Suggestion importance[1-10]: 8 __ Why: The code accesses `agg.children()[0]` without bounds checking, which could panic if the aggregate node has no children. This is a potential runtime crash that should be prevented with proper error handling.	Medium
	Return error instead of panicking Using `expect()` causes a panic when `prepared_plan` is `None`, which is problematic for FFI boundaries and error handling. Return a proper `DataFusionError` instead to allow callers to handle the error gracefully. sandbox/plugins/analytics-backend-datafusion/rust/src/local_executor.rs [190-196] pub fn execute_prepared(&self) -> Result<SendableRecordBatchStream, DataFusionError> { let plan = self .prepared_plan .as_ref() - .expect("execute_prepared called without a prepared plan"); + .ok_or_else(\|\| DataFusionError::Internal( + "execute_prepared called without a prepared plan".to_string() + ))?; datafusion::physical_plan::execute_stream(Arc::clone(plan), self.ctx.task_ctx()) } Suggestion importance[1-10]: 8 __ Why: Using `expect()` causes a panic when `prepared_plan` is `None`, which is problematic for FFI boundaries. Returning a proper `DataFusionError` allows callers to handle the error gracefully and prevents crashes.	Medium
General	Close session on constructor failure When `preparedState` is null, a new `DatafusionLocalSession` is created but never closed if an exception occurs during the constructor's loop (e.g., `ArrowSchemaIpc.toBytes` throws). Wrap the loop in a try-catch and close the session on failure to prevent native resource leaks. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/AbstractDatafusionReduceSink.java [103-124] protected AbstractDatafusionReduceSink( ExchangeSinkContext ctx, NativeRuntimeHandle runtimeHandle, DataFusionReduceState preparedState ) { this.ctx = ctx; this.runtimeHandle = runtimeHandle; this.preparedState = preparedState; - this.session = preparedState != null ? preparedState.session() : new DatafusionLocalSession(runtimeHandle.get()); + DatafusionLocalSession localSession = preparedState != null ? preparedState.session() : new DatafusionLocalSession(runtimeHandle.get()); Map<Integer, byte[]> inputs = new LinkedHashMap<>(ctx.childInputs().size()); Map<Integer, Schema> schemas = new LinkedHashMap<>(ctx.childInputs().size()); - for (ExchangeSinkContext.ChildInput child : ctx.childInputs()) { - inputs.put(child.childStageId(), ArrowSchemaIpc.toBytes(child.schema())); - schemas.put(child.childStageId(), child.schema()); + try { + for (ExchangeSinkContext.ChildInput child : ctx.childInputs()) { + inputs.put(child.childStageId(), ArrowSchemaIpc.toBytes(child.schema())); + schemas.put(child.childStageId(), child.schema()); + } + } catch (RuntimeException e) { + if (preparedState == null) { + localSession.close(); + } + throw e; } + this.session = localSession; this.childInputs = inputs; this.childSchemas = schemas; } Suggestion importance[1-10]: 7 __ Why: When `preparedState` is null, a new `DatafusionLocalSession` is created at line 115 but never closed if an exception occurs during the loop at lines 117-121. This causes a native resource leak. Wrapping the loop in try-catch and closing the session on failure is a valid resource management improvement.	Medium
General	Re-evaluate group-set flag per call The `hasEmptyGroup` flag is computed once outside the stream but should be re-evaluated per call if the FINAL aggregate's group set changes during rewrite. If a rewrite inadvertently modifies the group set, the flag becomes stale and Calcite's type inference may produce incorrect nullability. Compute `hasEmptyGroup` inside the lambda using `finalAgg.getGroupSet().isEmpty()` to ensure correctness. sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/dag/AggregateDecompositionResolver.java [144-163] -private static RelNode rewriteParentFragment(RelNode fragment, RelDataType childRowType, int childStageId, RewriteResult result) { - ... - List<AggregateCall> rebuiltFinalCalls = result.newFinalCalls.stream() - .map( - c -> AggregateCall.create( - c.getAggregation(), - c.isDistinct(), - c.isApproximate(), - c.ignoreNulls(), - c.rexList, - c.getArgList(), - c.filterArg, - c.distinctKeys, - c.collation, - hasEmptyGroup, - newFinalInput, - null, - c.name - ) +List<AggregateCall> rebuiltFinalCalls = result.newFinalCalls.stream() + .map( + c -> AggregateCall.create( + c.getAggregation(), + c.isDistinct(), + c.isApproximate(), + c.ignoreNulls(), + c.rexList, + c.getArgList(), + c.filterArg, + c.distinctKeys, + c.collation, + finalAgg.getGroupSet().isEmpty(), + newFinalInput, + null, + c.name ) - .toList(); - ... -} + ) + .toList(); Suggestion importance[1-10]: 3 __ Why: The `hasEmptyGroup` flag is computed once at line 144 and reused for all calls. While the suggestion to re-evaluate per call is technically more defensive, the current code is correct because `finalAgg.getGroupSet()` is immutable during the stream operation. The suggestion adds redundant computation without fixing an actual bug.	Low

Suggestions up to commit 832e0b8

Category	Suggestion	Impact
Possible issue	Prevent resource leak on coercion failure The `coerceToDeclaredSchema` call can throw a `RuntimeException` after the early-return check but before the try-finally block that closes `batch`. If the coercion fails, the original `batch` reference is lost without being closed, causing a resource leak. Move the coercion inside the try-finally or ensure `batch` is closed on exception. sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DatafusionReduceSink.java [198-208] private void feedToSender(DatafusionPartitionSender sender, VectorSchemaRoot batch, Schema declaredSchema) { if (closed) { batch.close(); return; } BufferAllocator alloc = ctx.allocator(); - batch = coerceToDeclaredSchema(batch, declaredSchema, alloc); - ... + ArrowArray array = ArrowArray.allocateNew(alloc); + ArrowSchema arrowSchema = ArrowSchema.allocateNew(alloc); + try { + batch = coerceToDeclaredSchema(batch, declaredSchema, alloc); + Data.exportVectorSchemaRoot(alloc, batch, dictionaryProvider, array, arrowSchema); + ... + } finally { + batch.close(); + array.close(); + arrowSchema.close(); + } } Suggestion importance[1-10]: 8 __ Why: The `coerceToDeclaredSchema` call at line 208 can throw before the try-finally block that closes `batch` (lines 209-232). If coercion fails, the original `batch` is leaked. Moving the coercion inside the try or ensuring `batch` is closed on exception is critical for resource safety.	Medium
	Prevent panic across FFI boundary The function calls `execute_prepared()` which panics if no plan was prepared, but this panic will cross the FFI boundary unsafely. Add explicit validation to return an error code instead of allowing a panic to propagate through the C boundary. sandbox/plugins/analytics-backend-datafusion/rust/src/ffm.rs [822-829] pub unsafe extern "C" fn df_execute_local_prepared_plan(session_ptr: i64) -> i64 { let session = &(session_ptr as const crate::local_executor::LocalSession); + if session.prepared_plan.is_none() { + return ffm_error("execute_local_prepared_plan called without a prepared plan"); + } let mgr = get_rt_manager()?; // DataFusion's execute_stream is sync, but kicks off RepartitionExec / stream // channels that require a Tokio reactor. Enter the IO runtime's context so those // operators can register with the reactor. let _guard = mgr.io_runtime.enter(); let df_stream = session.execute_prepared().map_err(\|e\| e.to_string())?; Suggestion importance[1-10]: 8 __ Why: This is a critical safety issue. Panics crossing FFI boundaries cause undefined behavior. The suggestion correctly identifies that `execute_prepared()` can panic and proposes checking `prepared_plan` before calling it, preventing unsafe behavior at the C boundary.	Medium
	Validate children before indexing Direct array indexing without bounds checking can panic if the aggregate has no children. Validate that `children()` is non-empty before accessing index 0 to prevent potential runtime panics. sandbox/plugins/analytics-backend-datafusion/rust/src/agg_mode.rs [76-81] AggregateMode::Final => { // Current node is Partial; skip it, return its child // (the Final above will keep itself) - let child = agg.children()[0]; + let children = agg.children(); + if children.is_empty() { + return Err(datafusion_common::DataFusionError::Internal( + "AggregateExec has no children".to_string() + )); + } + let child = children[0]; force_aggregate_mode(Arc::clone(child), target) } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies a potential panic from unchecked array indexing. However, in the context of `AggregateExec`, having no children would be a structural invariant violation, making this an edge case. The validation adds defensive programming but may not be critical in practice.	Medium
	Replace panic with error result Using `expect()` causes a panic when `prepared_plan` is `None`. Replace with proper error handling that returns a `DataFusionError` to maintain consistent error propagation and avoid unexpected panics. sandbox/plugins/analytics-backend-datafusion/rust/src/local_executor.rs [190-196] pub fn execute_prepared(&self) -> Result<SendableRecordBatchStream, DataFusionError> { let plan = self .prepared_plan .as_ref() - .expect("execute_prepared called without a prepared plan"); + .ok_or_else(\|\| DataFusionError::Internal( + "execute_prepared called without a prepared plan".to_string() + ))?; datafusion::physical_plan::execute_stream(Arc::clone(plan), self.ctx.task_ctx()) } Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that using `expect()` causes panics and proposes returning a proper `DataFusionError` instead. This improves error handling consistency, though the panic is documented and the caller should ensure a plan is prepared first. ...

github-actions · 2026-05-03T02:29:52Z

Persistent review updated to latest commit b696fc0

sandeshkr419 · 2026-05-03T02:36:27Z

[outdated]

@expani Thanks for the detailed review. Here's what I changed structurally based on your feedback:

AVG no longer has a custom decomposition. It's handled by Calcite's AggregateReduceFunctionsRule in the pre-marking phase so backends only ever see SUM and COUNT. Similarly, HLL no longer has a decomposition either; the intermediate Arrow type (Binary) is now declared directly on AggregateCapability.intermediateArrowType, which LocalStageScheduler reads, no more ad-hoc isApproximate() check in the scheduler.
The COUNT(DISTINCT) → APPROX_COUNT_DISTINCT rewrite moved from OpenSearchAggregateRule into BackendPlanAdapter via a new aggregateCallAdapters() hook on BackendCapabilityProvider, keeping backend-specific logic out of the generic planner rule. stripAnnotations was also simplified to only strip annotation wrappers.
On the Rust side, the StripFinalAggregateRule and StripPartialAggregateRule physical optimizer rules are replaced by FragmentConvertor-driven mode forcing. DataFusionFragmentConvertor now prepends a 1-byte mode prefix to plan bytes (attachPartialAggOnTop → partial, convertFinalAggFragment → final), and the Rust side calls force_aggregate_mode() directly after create_physical_plan, no plan inspection. The FragmentConvertor API is the authoritative signal for execution mode.
One open item: DataFusion's Substrait consumer ignores aggregation_phase, which is why force_aggregate_mode exists. This could be a potential upstream contribution, if DataFusion adds aggregation_phase support, force_aggregate_mode can be removed entirely.

github-actions · 2026-05-03T02:37:36Z

Persistent review updated to latest commit fdcf13a

github-actions · 2026-05-03T03:33:27Z

✅ Gradle check result for fdcf13a: SUCCESS

codecov · 2026-05-03T03:33:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.41%. Comparing base (36809cc) to head (ee0a3d7).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #21457      +/-   ##
============================================
- Coverage     73.50%   73.41%   -0.09%     
+ Complexity    74644    74563      -81     
============================================
  Files          5980     5980              
  Lines        338777   338777              
  Branches      48848    48848              
============================================
- Hits         249011   248722     -289     
- Misses        69946    70252     +306     
+ Partials      19820    19803      -17

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-05-04T04:29:29Z

Persistent review updated to latest commit f5d4771

github-actions · 2026-05-04T05:24:34Z

✅ Gradle check result for f5d4771: SUCCESS

github-actions · 2026-05-04T05:32:49Z

Persistent review updated to latest commit b89a7c8

github-actions · 2026-05-04T06:35:32Z

✅ Gradle check result for b89a7c8: SUCCESS

github-actions · 2026-05-04T21:24:08Z

Persistent review updated to latest commit c76d34d

github-actions · 2026-05-04T22:24:00Z

❌ Gradle check result for c76d34d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-05-05T09:47:52Z

Persistent review updated to latest commit ba512f3

github-actions · 2026-05-05T10:43:51Z

✅ Gradle check result for ba512f3: SUCCESS

…p mappings ArrowSchemaFromCalcite.toArrowType previously threw on any SqlTypeName it didn't enumerate, including common OpenSearch field types like DATE and TIMESTAMP. This surfaced whenever a multi-shard query carried a date-typed column through the exchange row type (e.g. 'min(EventDate)'). Added mappings: SMALLINT → Int(16, signed) TINYINT → Int(8, signed) REAL → FloatingPoint(SINGLE) (alias for FLOAT) DATE → Date(DAY) TIME → Time(MILLISECOND, 32) TIMESTAMP → Timestamp(MILLISECOND, null) TIMESTAMP_WITH_LOCAL_TIME_ZONE → Timestamp(MILLISECOND, null) Unsupported types still throw IllegalArgumentException with the SqlTypeName in the message. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

… phase OpenSearchAggregateReduceRule previously operated on OpenSearchAggregate and shared the HEP marking collection with OpenSearchAggregateRule. In that arrangement the marking rule fires first in BOTTOM_UP traversal, converting LogicalAggregate → OpenSearchAggregate with per-call AGG_CALL_ANNOTATION wrappers in aggCall.rexList. When Calcite's AggregateReduceFunctionsRule then reduces AVG(x), it reads rexList[0].getType() during type inference on the derived SUM — which carries AVG's original DOUBLE return type, not the natural BIGINT for a SUM of integer. The stamped DOUBLE type propagates through the reduced plan and fails typeMatchesInferred downstream (stripAnnotations, the split rule, or the resolver — all cascades observed). Clean fix — align with the documented design (§11.1): reduce BEFORE marking, on a plain LogicalAggregate where aggCall.rexList is empty and Calcite infers canonical primitive types. Implementation: 1. OpenSearchAggregateReduceRule: match LogicalAggregate (not OpenSearchAggregate). Remove the AggregateMode.SINGLE guard and the newAggregateRel override that re-wrapped as OpenSearchAggregate — the subsequent marking rule handles that conversion. The rule body is now a one-line configuration invoking super. 2. PlannerImpl: split the single HepPlanner into three chained phases — pre-marking (constant folding), aggregate reduction, marking. The reduction phase runs its own HepPlanner on the post-pre-marking plan so the rule order is enforced by phase boundaries rather than BOTTOM_UP rule discovery order. Result: the plan leaving the reduction phase is Calcite-canonical — clean LogicalAggregate(SUM, COUNT, ...) + LogicalProject(CAST(SUM)/CAST(COUNT)). Marking then wraps each with OpenSearch* annotations; Volcano split, the resolver, and stripAnnotations all see consistent primitive types without any type-rebuild patches. Unblocks all grouped-AVG queries (stats avg(x) by ...) on single-shard. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

… nullability Two related issues surfaced when PARTIAL/FINAL split fires on multi-shard: 1. Column-name mismatch: the resolver's fallback for unnamed aggregates (e.g. reduced AVG's auto-generated SUM/COUNT) used 'expr$<N>' while Calcite — and DataFusion on the Rust side — use '$f<N>'. The PARTIAL's Arrow output arrived with '$f2, $f3', the Substrait plan for FINAL referenced 'expr$2, expr$3', and DataFusion's schema lookup aborted with 'No field named expr$2. Valid fields are …$f2…'. Fix: derive exchange column names from the aggregate's own RelDataType (agg.getRowType().getFieldList()) — Calcite already assigned the canonical names (explicit aggCall.name where present, '$f<N>' where not), so reusing them keeps Java-side exchange schema aligned with the DataFusion output convention. Removes the hand-rolled 'expr$<N>' fallback entirely. 2. Nullability drift on function-swap: the resolver rewrites COUNT → SUM at FINAL for the function-swap case, constructing the new call via makeCall(...) with returnType = ArrowCalciteTypes.toCalcite(...), which returns NOT-NULL types. Calcite, however, infers SUM over an exchange column as nullable (SUM of empty group → null). The declared NOT-NULL-vs-inferred-nullable mismatch trips typeMatchesInferred when the FINAL OpenSearchAggregate is later copied. Fix: in rewriteParentFragment, rebuild each FINAL AggregateCall via the (hasEmptyGroup, input, type=null, name) AggregateCall.create variant so Calcite re-runs full type inference against the actual FINAL input (the rewritten StageInputScan). 3. Project-above-FINAL RexInputRef rebind: once FINAL's aggCall types change, any Project sitting directly above the FINAL (from reduced AVG, or a user-written Project) holds RexInputRefs with stale types. The plain identity-based replaceFirst copies that Project unchanged and Calcite's RexChecker rejects the mismatch. Fix: replaceFirstWithRefRebinding — when the immediate parent is a Project, walk its projection expressions with a RexShuttle that rebinds each RexInputRef to the new FINAL's row-type, CASTing the whole expression to the Project's declared field type so the outer schema (e.g. AVG's DOUBLE column) stays stable even when the inner aggregate now emits primitive BIGINT. Unblocks AVG queries on multi-shard (Q3, Q4, Q10, Q28, Q33) and stabilises the COUNT → SUM swap path for Q1, Q2, Q8, Q16 etc when split fires. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

…covered queries Turns on the previously-TODO'd 2-node integTest cluster for analytics-engine-rest and switches the ClickBench dataset to 2 shards so PARTIAL/FINAL split, Arrow Flight transport between shards and coordinator, and the aggregate decomposition resolver all exercise under realistic distributed shape — not just the single-shard no-split fast path. PplClickBenchIT: - Auto-discovers all 43 PPL queries under resources/datasets/clickbench/ppl/ instead of the Q1-only hardcoded list, so the tested surface grows as new PPL features land without touching this class. - SKIP_QUERIES (set literal) lists the 24 queries that currently fail for reasons unrelated to the partial/final aggregate feature — each with an in-file comment pointing at the root cause bucket: * Missing PPL frontend features: Q19, Q40, Q43 * Malformed query in the dataset (missing 'where' keyword): Q29 * Multi-shard binary exchange can't serialize LocalDateTime yet: Q7, Q24-Q27, Q37-Q42 * DataFusion Arrow 'project index 0 out of bounds' on WHERE + GROUP-BY + aggregate: Q11-Q15, Q20, Q22, Q23, Q31, Q32 - 19 queries pass across all three cluster variants (1-node-1-shard, 1-node-2-shards, 2-nodes-2-shards), including all AVG-bearing queries (Q3, Q4, Q10, Q28, Q33). mapping.json: number_of_shards = 2, keeps replicas at 0. build.gradle: testClusters.integTest gets numberOfNodes = 2; memtable and streaming variants already use 2 nodes and are unchanged. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

…) gap Q29 in the ClickBench dataset had `| Referer != ''` with no `where` keyword, tripping the PPL parser before any planning could happen. Added the missing `where`. Q29 then surfaces a distinct follow-up: the Substrait isthmus emitter can't find a binding for `MIN($1)` when the argument is VARCHAR (the `min(Referer)` call). That's a Substrait aggregate-catalog gap — unrelated to the partial/final work — so Q29 stays on the SKIP_QUERIES list with a comment pointing at the MIN-on-strings binding as the remaining blocker. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

…has zero batches When a shard's partial-aggregate plan produces zero record batches (typical when a WHERE predicate filters out every row on that shard), the Flight wire-protocol producer never writes a data frame. Arrow Flight puts the schema in the first data frame — zero frames means zero schema, so the coordinator's StreamingTableExec receives a stream with no schema and fails with 'Arrow error: Schema error: project index 0 out of bounds, max field 0' the first time it projects a column by index. DatafusionResultStream.BatchIterator now tracks two additional bits of state: whether the native stream has reported EOS (arrayAddr == 0) and whether we've ever returned a batch. When the native side yields EOS without having emitted a single batch, we synthesize one zero-row VectorSchemaRoot from the already-known schema and return it as the final batch. Flight carries the schema with that frame; the coordinator sees it and the downstream aggregate merges correctly over zero rows. Unblocks 10 multi-shard ClickBench queries (Q11, Q12, Q13, Q14, Q15, Q20, Q22, Q23, Q31, Q32) that were failing with this exact error. Enforces the §18 invariant #12 documented in the design revisit: "Shard emission of a PARTIAL with zero matching rows still delivers the schema message." Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

…partial case Adds two CoordinatorReduceIT methods that together cover the empty-partial contract: - testWhereGroupByCountMultiShard_reproducer — WHERE filters all rows on every shard (all docs have category='', predicate is category != ''), so both shards' partial aggregates produce zero batches. Before the Java- side schema-preservation fix, this query died with Arrow 'project index 0 out of bounds, max field 0' in the drain thread. The test asserts a non-erroring 200 response; exact shape isn't checked since the point is crash-freedom. - testGroupByCountMultiShard_noWhereControl — same query shape minus the WHERE. Every doc has category='' so the result is one group with the full count. Acts as the control that isolates the WHERE predicate (and the resulting zero-batch partial) as the previous trigger. PplClickBenchIT SKIP_QUERIES trimmed to the queries that remain failing after the schema-preservation fix: only the TIMESTAMP/DATE family (Q7, Q24-Q27, Q37-Q42) and unrelated PPL frontend gaps (Q19, Q29, Q40, Q43). Previously-skipped WHERE + GROUP-BY queries (Q11, Q12, Q13, Q14, Q15, Q20, Q22, Q23, Q31, Q32) are now included — 29 ClickBench queries pass on multi-shard (up from 19). Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

The shard-to-coordinator wire format for FragmentExecutionResponse previously carried each batch as List<Object[]> + List<String> fieldNames, serialized per-cell via OpenSearch's StreamOutput.writeGenericValue. The path had two structural problems: - Lossy round-trip for Arrow types. vector.getObject(i) returns java.time. LocalDateTime for TimeStampMilliVector (no TZ), which writeGenericValue does not support — every shard emitting a Timestamp column failed with 'can not write type [class java.time.LocalDateTime]'. Arrow's view-vector variants (Utf8View, BinaryView) and dictionary-encoded vectors likewise can't round-trip through Object[] inference. - Per-cell boxing on both sides (O(rows * cols) object allocations + dispatch on send, O(rows * cols) allocations + setSafe on receive). Heap-heavy and GC-pressure-heavy for larger batches. Replaces the wire format with Arrow's own IPC stream: MessageSerializer. serialize(channel, schema) once, then serialize(channel, recordBatch) per batch, terminated by ArrowStreamWriter.writeEndOfStream. On receive, ArrowStreamReader handles schema + batch message sequencing and VectorLoader loads buffers into VSRs. Arrow's library handles every vector type natively — zero hand-rolled dispatch. RowResponseCodec.decode copies the reader's reused root into caller-owned VSRs via makeTransferPair (works for every vector kind including views; the default VectorSchemaRootAppender rejects view vectors, so we avoid it). Multi-batch responses concatenate via per-cell copyFromSafe — the only append primitive in Arrow Java that supports view vectors. Wire format: byte[] ipcPayload + vint rowCount. rowCount is cached purely for the existing onFragmentSuccess metric so consumers don't decode just to count rows. Deletes RowBatchToArrowConverter — the row-to-Arrow bridge is now Arrow IPC end-to-end. Updates ArrowSchemaFromCalcite javadoc accordingly. Net: +224 / -340 lines across the codec. All 39 non-skipped ClickBench PPL queries pass on multi-shard (was 20 with the row-oriented codec), including the full TIMESTAMP family (Q7, Q24-Q27, Q37-Q42) that the old Object[] path structurally could not handle. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

… senderSend DataFusion's HashAggregateExec internally converts string group keys to Utf8View for performance (inline short-string optimization, non-configurable in DataFusion 49+). The coordinator's FINAL plan, by contrast, is decoded from substrait — substrait has only a generic 'string' type which DataFusion round-trips to Utf8, and the coordinator's childSchemas (computed from the Calcite row type via ArrowSchemaFromCalcite) likewise declares Utf8. With the Arrow IPC wire codec preserving exact Arrow types end-to-end, the shard's Utf8View batches now reach the coordinator as Utf8View. The Rust StreamingTable partition was registered with a Utf8 schema, so Utf8View batches trigger an 'as_primitive::<>()' downcast panic ('byte array') the first time a cross-batch operator (coalesce / repartition) handles the string column. Observed on every multi-string-column group-by from Q17 onward. Previously the Object[] row-path laundered this silently: getObject -> String -> VarCharVector.setSafe always produced Utf8, regardless of the shard's actual emit type. Removing the lossy row-path exposed the contract gap. Adds a single coercion point at feedToSender(): when batch.getSchema() differs from the declared childSchemas entry, allocate a new VSR matching the declared schema and copy per column. Same-type columns use makeTransfer Pair (zero-copy). Utf8View -> Utf8 uses per-cell byte copy: ViewVarCharVector.get(i) -> VarCharVector.setSafe(i, bytes). Unknown type pairs throw with a diagnostic naming source type, target type, and column — future mismatches surface as clear errors rather than opaque Rust panics. AbstractDatafusionReduceSink gains a childSchemas map parallel to the existing childInputs bytes map, populated once in the constructor. Sinks (single-input feed() and per-child ChildSink) look up the declared schema by childStageId. Extends only on observed mismatch — no speculative pair coverage. Zero-copy fast path when schemas match (numeric-only aggregates, which is the common case). Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

With the Arrow IPC wire codec and the Utf8View -> Utf8 coercion at the Java-to-Rust boundary, the previously-skipped query families now pass on multi-shard (2 shards, 2 nodes). Trims SKIP_QUERIES from 14 entries down to the 4 remaining PPL frontend / Substrait library gaps that are unrelated to distributed execution: Q19 - extract(minute from ...) not implemented in the PPL frontend Q29 - Substrait library can't bind MIN on VARCHAR inputs Q40 - case() else + head N from M - PPL frontend gap Q43 - date_format() + head N from M - PPL frontend gap Unblocked: Q7 + Q24-Q27 + Q37-Q42 (TIMESTAMP / DATE family, 11 queries); Q17-Q18 + Q31-Q36 (multi-string-column group-by family, 8 queries). Total 39 of 39 non-skipped ClickBench PPL queries pass on multi-shard. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

Adds a TODO on coerceToDeclaredSchema summarizing the plan-level alternatives we tried and why they're currently blocked: - declaring Utf8View up-front in childSchemas SIGSEGVs because DataFusion's optimizer emits Utf8View across more operators than HashAggregate (filter + sort + project queries also hit it), making static prediction in Java fragile and engine-version-specific - Arrow Java 18.3's FFI can import Utf8View natively (BufferImportTypeVisitor has visit(Utf8View)); the blocker is predicting the emission, not importing it - three forward paths recorded for future revisit: (a) a DataFusion schema- introspection API, (b) substrait view-type extension, (c) a Rust-side normalize pass using DataFusion's vectorized CastExpr at PARTIAL root Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

github-actions · 2026-05-10T09:46:40Z

✅ Gradle check result for 614a8c5: SUCCESS

The prior approach introduced a dummy Calcite SqlAggFunction APPROX_DISTINCT and a plan shuttle (rewriteApproxCountDistinct) that swapped every SqlStdOperatorTable.APPROX_COUNT_DISTINCT call for it right before Substrait emission. The dummy was needed because isthmus' default AGGREGATE_SIGS binds APPROX_COUNT_DISTINCT to substrait's standard approx_count_distinct URN, and the resulting entry in FunctionConverter's IdentityHashMap (keyed by Calcite SqlOperator) shadows any additional Sig entry for the same operator. Replaces the workaround with a small OpenSearchAggregateFunctionConverter subclass that filters APPROX_COUNT_DISTINCT out of the default signature list via getSigs(). With the default binding gone, a plain Sig(SqlStdOperatorTable.APPROX_COUNT_DISTINCT, "approx_distinct") in ADDITIONAL_AGGREGATE_SIGS is the sole matcher and routes directly to the YAML-declared extension — no operator rewrite, no dummy SqlAggFunction, no plan shuttle. Net: -78 lines. No runtime per-function branch in the convertor. Restores invariant #3 (no ad-hoc 'if function == X' dispatch outside AggregateFunction and the resolver's enum lookup). Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

Audits every file touched in this branch's commits and converts fully-qualified class references to imports. Preserves FQN only where there's a genuine same-name collision that forces it. Changes: - AggregateDecompositionResolver: import Project, RelCollations, RexBuilder, RexInputRef, RexNode, RexShuttle (~10 FQN sites -> short names). - ArrowSchemaFromCalcite + ArrowCalciteTypesTests: import DateUnit, TimeUnit (5 FQN sites -> short names). - AbstractDatafusionReduceSink: import org.apache.arrow.vector.types.pojo.Schema (2 FQN sites). - AnalyticsSearchService: import ArrowStreamWriter (1 FQN site). - DataFusionFragmentConvertor: import ImmutableList (2 FQN sites). - DataFusionFragmentConvertorTests: import RexNode (1 FQN site). - ShardFragmentStageExecutionTests: import ActionResponse (1 FQN site) — also fixes stale new FragmentExecutionResponse(List<String>, List<Object[]>) calls that I missed updating in the earlier IPC refactor. Preserves FQN on the two io.substrait.proto.Plan references in DataFusionFragmentConvertor — that class collides with the already-imported io.substrait.plan.Plan, so FQN is required. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

Comments in production files should not reference internal design-doc section numbers (§18 #12), branch names (pf2, pf4), or session-specific debug narrative ("we attempted", "SIGSEGV in Q26"). Those belong in design docs and commit messages, not in the code. - DatafusionResultStream: drop '§18 invariant #12' reference from the nativeStreamExhausted field comment; the surrounding explanation already states what the field is for. - DatafusionReduceSink.coerceToDeclaredSchema: rewrite the TODO block as a forward-looking technical note instead of a session history, keeping only the actionable alternatives. - CoordinatorReduceIT.testQ10ShapeAcrossShards: drop pf2/pf4 branch narrative; describe what the test covers structurally. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

CoordinatorReduceIT's multi-shard WHERE+GROUP-BY tests carried leftover debug narrative from their original reproducer role — now that the bug they caught is fixed, the javadoc and naming should describe what the tests validate, not their origin story. - Rename testWhereGroupByCountMultiShard_reproducer → testGroupByCountMultiShard_allRowsFilteredByWhere and testGroupByCountMultiShard_noWhereControl → testGroupByCountMultiShard_noWhereClause. Describes shape, not history. - Rewrite their javadocs to state what behavior is being asserted (empty-partial path reporting empty result without erroring), drop the 'project index 0 out of bounds' stack trace and '@AwaitsFix to keep green' commentary. - Rename the supporting fixtures WHERE_REPRO_INDEX, createWhereReproIndex, indexWhereReproDocs → STRING_GROUP_INDEX, createStringGroupIndex, indexStringGroupDocs. 'repro' was session-specific vocabulary. - Class-level javadoc: drop internal Rust function names (prepare_partial_plan, force_aggregate_mode, execute_local_prepared_plan) from the pipeline diagram; keep the Calcite/Java-side names that describe the layer boundaries. - Remove now-unused AwaitsFix import. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

Pre-existing test file hadn't been updated after the ExchangeSinkProvider#createSink contract went from (context) to (context, backendContext) during the handler-infrastructure work. Lambda now matches the current SPI signature. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

…pty batch DatafusionResultStream.loadNextBatch now synthesizes a zero-row VSR carrying the declared schema when the native stream produces no batches, so downstream transports see the column layout on their first data frame. testNextOnExhaustedStreamThrows was asserting the old contract (hasNext returns false immediately on empty result). Consume the synthetic batch first, then assert the iterator is exhausted. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

github-actions · 2026-05-10T12:31:44Z

✅ Gradle check result for e7cf5d1: SUCCESS

The single DSL query (Q1: sum(GoodEvent)) has been hanging on this branch with a 60s client-side socket timeout. Root cause is in the DSL aggregation path, not the distributed-aggregate machinery — orthogonal to the coercer and partial/final wiring. Skipping the query lets CI run the structural test (provisioning, discovery) without blocking on the unrelated regression. Restore 'List.of(1)' once the DSL hang is diagnosed and fixed. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

github-actions · 2026-05-10T16:18:01Z

✅ Gradle check result for e648ee7: SUCCESS

github-actions · 2026-05-10T17:59:37Z

✅ Gradle check result for ee0a3d7: SUCCESS

BASELINE_SCALAR_OPS (COALESCE, CASE, CAST, arithmetic, IS_NULL, …) bypass the AnnotatedProjectExpression wrap at the call site but their operands still go through capability-driven annotation. A project expression like COALESCE(num0, CEIL(num1)) ends up as COALESCE(num0, AnnotatedProjectExpression(CEIL(num1))) with the outer COALESCE unwrapped and the inner CEIL wrapped. OpenSearchProject.stripAnnotations only ran the nested-annotation shuttle when the top-level expression was itself an AnnotatedProjectExpression; the plain-top-level branch passed the expression through untouched, so the inner wrapper survived into the substrait converter. Isthmus then rejected the plan with 'Unable to convert call ANNOTATED_PROJECT_EXPR(…)'. The strip logic predates the baseline carve-out (commit 196fd42) and was never updated to account for inner annotations under a baseline root. The carve-out intentionally recurses into operands precisely so those inner calls still go through capability resolution — strip must mirror that recursion. Run the RexShuttle unconditionally; it already no-ops for subtrees without annotations. Same behaviour on top-level annotations, now correctly scrubs nested ones regardless of what root wraps them. Resolves 'ANNOTATED_PROJECT_EXPR(…)' failures in FillNullCommandIT.testFillNullWithFunctionOnOtherField (fillnull with ceil(num1) in num0 → COALESCE root) and MultisearchCommandIT.testMultisearchEvalCaseProjection (eval case → CASE root). Also tags MultisearchCommandIT.testMultisearchCountEvalConditionalCount with @AwaitsFix pending a separate DataFusion count-accumulator state-type mismatch (Int32 vs Int64) that only surfaces after the strip fix lets the plan reach execution. The test body is preserved so restoration is just removing the annotation once the underlying issue is fixed. Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

- DatafusionReduceSink.feedToSender now calls sender.send() instead of NativeBridge.senderSend() directly, so the read-write lock added on DatafusionPartitionSender actually protects the hot path. - Drop @AwaitsFix on 7 Append/AppendPipe IT methods. The Utf8View schema-lie they hit is fixed by coerceToDeclaredSchema in upstream opensearch-project#21457 (merged here); all 7 now pass. Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

…#21594) * refactor: streaming-only fragment dispatch in analytics-engine - delete RowResponseCodec, ResponseCodec, FragmentExecutionResponse, RowBatchToArrowConverter - delete AnalyticsSearchService.executeFragment / collectResponse - require StreamTransportService at injection (fail-fast); force streaming feature flag in sandbox QA clusters - ShardFragmentStageExecution: catch outputSink.feed() exceptions so a feed failure fails the stage instead of hanging to QUERY_TIMEOUT - DatafusionResultStream: synthesise one zero-row schema-bearing batch for empty native streams (Flight requires ≥1 schema frame) - DatafusionPartitionSender: read-write lock around send/close to prevent sender_close UAF while sender_send is mid-await - @AwaitsFix 7 Append/AppendPipe IT methods — Utf8View FFI schema-lie bug, tracked separately Signed-off-by: Marc Handalian <marc.handalian@gmail.com> * fix: wire UAF lock through feedToSender; unmute Append ITs - DatafusionReduceSink.feedToSender now calls sender.send() instead of NativeBridge.senderSend() directly, so the read-write lock added on DatafusionPartitionSender actually protects the hot path. - Drop @AwaitsFix on 7 Append/AppendPipe IT methods. The Utf8View schema-lie they hit is fixed by coerceToDeclaredSchema in upstream #21457 (merged here); all 7 now pass. Signed-off-by: Marc Handalian <marc.handalian@gmail.com> * fix merge conflicts Signed-off-by: Marc Handalian <marc.handalian@gmail.com> --------- Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

expani reviewed May 2, 2026

View reviewed changes

sandeshkr419 added the skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. label May 3, 2026

sandeshkr419 force-pushed the pf2 branch from 4804364 to b696fc0 Compare May 3, 2026 02:28

sandeshkr419 force-pushed the pf2 branch from b696fc0 to fdcf13a Compare May 3, 2026 02:35

mch2 reviewed May 3, 2026

View reviewed changes

Comment thread ...ckend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionFragmentConvertor.java Outdated

Comment thread ...box/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/PlannerImpl.java Outdated

sandeshkr419 marked this pull request as ready for review May 4, 2026 04:28

sandeshkr419 requested a review from a team as a code owner May 4, 2026 04:28

sandeshkr419 force-pushed the pf2 branch from f5d4771 to b89a7c8 Compare May 4, 2026 05:31

expani reviewed May 4, 2026

View reviewed changes

Comment thread ...ckend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionFragmentConvertor.java Outdated

expani reviewed May 4, 2026

View reviewed changes

Comment thread sandbox/plugins/analytics-backend-datafusion/rust/src/local_executor.rs Outdated

expani reviewed May 4, 2026

View reviewed changes

sandeshkr419 force-pushed the pf2 branch from b89a7c8 to c76d34d Compare May 4, 2026 21:22

sandeshkr419 mentioned this pull request May 5, 2026

Introduce Instructions for the Backend to use during execution #21479

Merged

sandeshkr419 force-pushed the pf2 branch from ba512f3 to 969a35f Compare May 6, 2026 07:47

sandeshkr419 added 11 commits May 10, 2026 02:39

sandeshkr419 added 4 commits May 10, 2026 02:55

sandeshkr419 force-pushed the pf2 branch from 614a8c5 to ed4e558 Compare May 10, 2026 10:16

sandeshkr419 added 2 commits May 10, 2026 03:47

sandeshkr419 force-pushed the pf2 branch from 1e81404 to ee0a3d7 Compare May 10, 2026 16:57

sandeshkr419 force-pushed the pf2 branch from ee0a3d7 to 31e8fa2 Compare May 10, 2026 18:08

mch2 merged commit b7f49bd into opensearch-project:main May 10, 2026
23 of 24 checks passed

sandeshkr419 deleted the pf2 branch May 10, 2026 18:53

Conversation

sandeshkr419 commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Architecture

Single source of truth for per-function decomposition

Multi-shard unblocks

TODOs to revisit (tracked independently)

Check List

Uh oh!

github-actions Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Analyzer ❗

Uh oh!

expani left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 614a8c5)

Uh oh!

github-actions Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

sandeshkr419 commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

codecov Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

github-actions Bot commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

sandeshkr419 commented May 2, 2026 •

edited

Loading

github-actions Bot commented May 2, 2026 •

edited

Loading

github-actions Bot commented May 3, 2026 •

edited

Loading

(Review updated until commit `614a8c5`)

github-actions Bot commented May 3, 2026 •

edited

Loading

sandeshkr419 commented May 3, 2026 •

edited

Loading

codecov Bot commented May 3, 2026 •

edited

Loading