[Analytics Backend / DataFusion] Onboard PPL rex to DataFusion #21550
RyanL1997 wants to merge 8 commits into opensearch-project:main
PR Code Analyzer ❗ AI-powered 'Code-Diff-Analyzer' found issues on commit 0dafa26.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks.
… Rust UDFs + array result type

Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms were already covered by bridges to existing Calcite/DataFusion operators. The extract-mode form has no native DataFusion equivalent and needs three custom Rust UDFs, three Java SqlOperator adapters, and a handful of analytics-framework / engine plumbing changes to model array result types end-to-end.

## What's new

### Rust: three UDFs in opensearch-native-lib

* `rex_extract(input, pattern_lit, group_lit) -> varchar`: single named or numbered group capture. Compiles the regex once at plan time, runs per row.
* `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>`: multi-match. `max_match=0` means unbounded; otherwise caps the result at the requested element count. Returns NULL (not an empty list) when there are no matches, matching the SQL plugin's Java implementation.
* `rex_offset(input, pattern_lit) -> varchar`: emits the named-group offsets formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.

Each UDF has 5 unit tests covering the contract above.

### Java: three SqlOperator adapters

* `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter`: keyed on the SQL plugin's PPL builtin operator names (`REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait extension declared in `opensearch_scalar_functions.yaml`.
* Pattern operands (and the group operand for the extract variants) are validated as RexLiterals at plan time. Column-valued patterns would force per-row regex compilation on the Rust side and are rejected with an IllegalArgumentException, the same contract as the precedent set by RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.

### Framework: array return-type support

* `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY` in analytics-framework. Without this, `OpenSearchProjectRule.resolveScalarViableBackends` returns `null` for any scalar with an array return type and the planner emits "No backend supports scalar function [REX_EXTRACT_MULTI] among [datafusion]". `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar declaration is now `Set.of(FieldType.ARRAY)` rather than the broad scalar set used by every other op (UPPER, ABS, ...), since those genuinely don't return arrays.
* `ListVector` handling in three call sites that previously triggered Arrow's `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's classloader). These sites now bypass `getObject()` and read the offset buffer + inner data vector directly:
  - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
  - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
  - `RowResponseCodec` (`inferArrowField` + `setVectorValue`): the Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit list<utf8> Field with a proper child Field, plus a ListVector setter using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.

### IT coverage

* `RexCommandIT` extended from 9 sed tests to 16. Adds 7 extract-mode cases: single named group, multiple named groups in one row, missing-group NULL handling, multi-match capturing all, `max_match` cap, offset_field output, and no-match passthrough as NULL.

## Required runtime dependency

* `commons-text:1.11.0` is now a `runtimeOnly` dep of analytics-engine. Calcite's `SqlFunctions.<clinit>` references LevenshteinDistance from commons-text; without bundling the jar the cluster crashes the first time a query reaches Calcite. The matching SHA1 + LICENSE + NOTICE artifacts are added under `licenses/` per the repo dependency-license check.

## Test results

* Rust UDFs: 15/15 unit tests (5 per UDF).
* `RexExtractAdapterTests`: 4/4.
* `RexCommandIT`: 16/16 (9 sed from Part 1 + 7 new extract).
* `./gradlew check -p sandbox -Dsandbox.enabled=true`: green (678 tasks, all sandbox module unit tests + spotless + license + forbidden API).

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
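The three UDF contracts in the commit message above can be illustrated in plain Java with `java.util.regex`. This is only a sketch of the described behavior, not the PR's Rust code: the class and method names are hypothetical, only named groups are handled (the real `rex_extract` also accepts numbered groups), and returning null from `rexOffset` on no match or skipping groups that did not participate are assumptions the commit message does not spell out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the extract-mode contracts; not the PR's implementation.
final class RexContractSketch {

    // Finds "(?<name>" openers so named groups can be listed on any Java version.
    // Simplification: ignores escaped parentheses in the pattern text.
    private static final Pattern GROUP_NAME = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Single capture: value of the named group in the first match, or null.
    // The Pattern is passed in precompiled, mirroring "compile once, run per row".
    static String rexExtract(String input, Pattern pattern, String group) {
        if (input == null) return null;
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // Multi-match: maxMatch == 0 means unbounded, otherwise the list is capped.
    // Returns null (not an empty list) when nothing matched.
    static List<String> rexExtractMulti(String input, Pattern pattern, String group, int maxMatch) {
        if (input == null) return null;
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while ((maxMatch == 0 || out.size() < maxMatch) && m.find()) {
            String value = m.group(group);
            if (value != null) out.add(value);
        }
        return out.isEmpty() ? null : out;
    }

    // Offsets: "name1=s1-e1&name2=s2-e2", names sorted alphabetically, end inclusive.
    static String rexOffset(String input, Pattern pattern) {
        if (input == null) return null;
        Matcher m = pattern.matcher(input);
        if (!m.find()) return null; // assumption: no match yields null
        Map<String, String> spans = new TreeMap<>(); // TreeMap keeps names sorted
        Matcher names = GROUP_NAME.matcher(pattern.pattern());
        while (names.find()) {
            String name = names.group(1);
            int start = m.start(name);
            if (start >= 0) { // -1 means the group did not participate in the match
                spans.put(name, name + "=" + start + "-" + (m.end(name) - 1));
            }
        }
        return String.join("&", spans.values());
    }

    public static void main(String[] args) {
        Pattern kv = Pattern.compile("user=(?<user>\\w+) id=(?<id>\\d+)");
        System.out.println(rexExtract("user=alice id=42", kv, "user"));
        System.out.println(rexOffset("user=alice id=42", kv));
        System.out.println(rexExtractMulti("a1 b2 c3", Pattern.compile("(?<d>\\d)"), "d", 2));
    }
}
```

Subtracting one from `Matcher.end` is what turns Java's exclusive end index into the inclusive end the commit message attributes to the SQL plugin's convention.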
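The `ListVector` workaround described in the commit message (reading the offset buffer and inner data vector instead of calling `getObject()`) can be pictured with plain arrays. The sketch below is an assumption-laden stand-in: it uses no Arrow classes, and `ListColumnSketch`, its fields, and `rowAt` are hypothetical names. It only mirrors the general Arrow list layout, where row `i` spans child positions `offsets[i]` up to but excluding `offsets[i + 1]`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-array model of a list<utf8> column; not Arrow's API.
final class ListColumnSketch {
    private final int[] offsets;    // length = rowCount + 1
    private final boolean[] isNull; // per-row null flag (stand-in for the validity bitmap)
    private final String[] values;  // flattened child values for all rows

    ListColumnSketch(int[] offsets, boolean[] isNull, String[] values) {
        if (offsets.length != isNull.length + 1) {
            throw new IllegalArgumentException("offsets must have rowCount + 1 entries");
        }
        this.offsets = offsets;
        this.isNull = isNull;
        this.values = values;
    }

    // Materializes one row by walking the child values between two offsets.
    // A NULL row stays null rather than becoming an empty list.
    List<String> rowAt(int row) {
        if (isNull[row]) return null;
        int start = offsets[row];
        int end = offsets[row + 1];
        List<String> out = new ArrayList<>(end - start);
        for (int i = start; i < end; i++) {
            out.add(values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        ListColumnSketch col = new ListColumnSketch(
            new int[] {0, 2, 2, 3},
            new boolean[] {false, true, false},
            new String[] {"a", "b", "c"});
        for (int row = 0; row < 3; row++) {
            System.out.println(col.rowAt(row));
        }
    }
}
```

Keeping NULL rows distinct from empty lists matters here because `rex_extract_multi` is described as returning NULL, not an empty list, when nothing matches.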
Force-pushed: 7b7e72c to b1d2240
Force-pushed: b1d2240 to aa25d15
…_MATCH_LIMIT

The previous commit defaulted `PPL_REX_MAX_MATCH_LIMIT=10` in `UnifiedQueryContext.Builder.settings` to fix the NPE in `AstBuilder.visitRexCommand` on the unified path. The default is correct, but it doesn't respect mid-run cluster overrides: every key in the static map returns its hardcoded value regardless of `_cluster/settings` updates. This breaks `CalciteRexCommandIT.testRexMaxMatchConfigurableLimit`, which explicitly sets the cluster-side limit to 5 and asserts that `max_match=0` caps at 5; on the unified path it stayed at 10.

This change introduces a `Builder.liveSettings(Settings)` hook that the REST handler can use to inject the cluster's live `OpenSearchSettings` instance. At `build()` time the Builder snapshots the live value of `PPL_REX_MAX_MATCH_LIMIT` (only that key; see the scoping note below) into the static map, overriding the hardcoded default when the operator has set a cluster value. Snapshot-at-build matches the per-HTTP-request lifecycle of `UnifiedQueryContext` and avoids per-call lookup overhead.

## Why scoped to PPL_REX_MAX_MATCH_LIMIT only

The same architectural gap exists for every key in the static map (`QUERY_SIZE_LIMIT`, `PPL_SUBSEARCH_MAXOUT`, `PPL_JOIN_SUBSEARCH_MAXOUT`, `CALCITE_ENGINE_ENABLED`). For three of those, the static defaults are fine in practice (no test overrides them mid-run; `head N` covers `QUERY_SIZE_LIMIT` per-query). `CALCITE_ENGINE_ENABLED` is intentionally pinned to `true` for the unified path, because a cluster override toggling it off would defeat the point of routing here. So this PR widens only the one key that demonstrably needs it; widening the snapshot to the rest is a future scope decision tied to whichever new IT first depends on it.

## Wire-up

`RestUnifiedQueryAction` gains a `pluginSettings` field (the same `OpenSearchSettings` instance bound in the Guice module) and forwards it to the Builder in both `buildContext` (per-request execution path) and `buildParsingContext` (analytics-routing index name probe). Both construction sites, `SQLPlugin.createSqlAnalyticsRouter` and `TransportPPLQueryAction.<init>`, are updated to pass the existing plugin-side `Settings` instance. `buildParsingContext` had been `static` because it didn't need any instance state; it's now an instance method since it reads `pluginSettings`.

## Test results

CalciteRexCommandIT through the analytics-engine route (every PPL query forced through `/_analytics/ppl` via `tests.analytics.force_routing=true`):

* Before this change: 17/18. `testRexMaxMatchConfigurableLimit` fails with `expected:<5> but was:<10>` (cluster override doesn't reach the unified path).
* After this change: 18/18. All `testRexMaxMatch*` variants honor the cluster setting.

## Companion PR

opensearch-project/OpenSearch#21550 onboards PPL `rex` to DataFusion via the analytics-engine path. The 17/18 baseline reported in that PR's description was measured against the previous commit on this branch; with this change the route hits 18/18.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
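The snapshot-at-build behavior described in this commit message can be sketched as follows. Everything here is a hypothetical stand-in: `QueryContextSketch`, the `Function`-based live lookup, and the string keys are illustration only, not the real `UnifiedQueryContext`, `Settings`, or `OpenSearchSettings` types. The sketch assumes a live lookup that returns null when the operator has set no cluster value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of a per-request context that snapshots one live setting at build().
final class QueryContextSketch {
    static final String REX_LIMIT = "PPL_REX_MAX_MATCH_LIMIT";
    static final String CALCITE_ENABLED = "CALCITE_ENGINE_ENABLED";

    private final Map<String, Object> settings;

    private QueryContextSketch(Map<String, Object> settings) {
        this.settings = settings;
    }

    Object get(String key) {
        return settings.get(key);
    }

    static final class Builder {
        private final Map<String, Object> defaults = new HashMap<>();
        private Function<String, Object> live; // stand-in for the live cluster settings

        Builder() {
            defaults.put(REX_LIMIT, 10);
            defaults.put(CALCITE_ENABLED, true); // pinned: never read from live settings
        }

        Builder liveSettings(Function<String, Object> live) {
            this.live = live;
            return this;
        }

        QueryContextSketch build() {
            // Copy the defaults, then override only the one key in scope.
            Map<String, Object> snapshot = new HashMap<>(defaults);
            if (live != null) {
                Object value = live.apply(REX_LIMIT);
                if (value != null) snapshot.put(REX_LIMIT, value);
            }
            return new QueryContextSketch(snapshot);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> cluster = new HashMap<>();
        cluster.put(REX_LIMIT, 5);
        QueryContextSketch ctx = new Builder().liveSettings(cluster::get).build();
        cluster.put(REX_LIMIT, 7); // a later cluster update does not affect this context
        System.out.println(ctx.get(REX_LIMIT));
    }
}
```

Because the value is copied once in `build()`, a context created for one request keeps the limit it started with, while the next request's builder picks up the newer cluster value.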
CI status check — failures are expected and explained below:
Force-pushed: ad59c57 to cc57ed1
Force-pushed: cc57ed1 to 5c33102
CI is correctly pulling the canonical published snapshot. The published snapshot is from main, not from feature/mustang-ppl-integration. opensearch-project/sql#5418 is in the feature branch but never made it into a published artifact. So this PR's CI will stay red until either the feature branch gets merged to main (auto-publish) or someone with write access manually triggers
Force-pushed: 5c33102 to e772008
… Rust UDFs + array result type
Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms were already covered by bridges to existing Calcite/DataFusion operators. The extract-mode form has no native DataFusion equivalent and needs three custom Rust UDFs, three Java SqlOperator adapters, and a handful of analytics-framework / engine plumbing changes to model array result types end-to-end.
* `rex_extract(input, pattern_lit, group_lit) -> varchar`: single named or numbered group capture. Compiles the regex once at plan time, runs per row.
* `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>`: multi-match. `max_match=0` means unbounded; otherwise caps the result at the requested element count. Returns NULL (not an empty list) when there are no matches, matching the SQL plugin's Java implementation.
* `rex_offset(input, pattern_lit) -> varchar`: emits the named-group offsets formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.
Each UDF has 5 unit tests covering the contract above.
* `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter`: keyed on the SQL plugin's PPL builtin operator names (`REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait extension declared in `opensearch_scalar_functions.yaml`.
* Pattern operands (and the group operand for the extract variants) are validated as RexLiterals at plan time. Column-valued patterns would force per-row regex compilation on the Rust side and are rejected with an IllegalArgumentException, the same contract as the precedent set by RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.
* `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY` in analytics-framework. Without this, `OpenSearchProjectRule.resolveScalarViableBackends` returns `null` for any scalar with an array return type and the planner emits "No backend supports scalar function [REX_EXTRACT_MULTI] among [datafusion]". `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar declaration is now `Set.of(FieldType.ARRAY)` rather than the broad scalar set used by every other op (UPPER, ABS, ...), since those genuinely don't return arrays.
* `ListVector` handling in three call sites that previously triggered Arrow's `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's classloader). These now bypass `getObject()` and read the offset buffer + inner data vector directly:
  - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
  - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
  - `RowResponseCodec` (`inferArrowField` + `setVectorValue`): the Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit list<utf8> Field with proper child Field, plus a ListVector setter using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.
* `RexCommandIT` extended from 9 sed tests to 16. Adds 7 extract-mode cases: single named group, multiple named groups in one row, missing-group NULL handling, multi-match capturing all, `max_match` cap, offset_field output, and no-match passthrough as NULL.
* Rust UDFs: 15/15 unit tests (5 per UDF).
* `RexExtractAdapterTests`: 4/4.
* `RexCommandIT`: 16/16 (9 sed from Part 1 + 7 new extract).
* `./gradlew check -p sandbox -Dsandbox.enabled=true`: green (678 tasks, all sandbox module unit tests + spotless + license + forbidden API).
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
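
The `rex_extract_multi` and `rex_offset` contracts stated in this commit message (unbounded when `max_match=0`, NULL rather than an empty list on no match, inclusive end offsets, alphabetically sorted names) can be modelled with `java.util.regex`. This is only a rough reference sketch: `RexContractSketch`, `extractMulti`, and `offsets` are hypothetical names, the group names are passed in explicitly, and the handling of non-participating groups is an assumption. It is neither the Rust UDF code nor the SQL plugin's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reference model of the contracts described above.
final class RexContractSketch {

    // Models rex_extract_multi: maxMatch == 0 means unbounded; no matches yields null.
    static List<String> extractMulti(String input, String pattern, String group, int maxMatch) {
        if (input == null) {
            return null;
        }
        Matcher m = Pattern.compile(pattern).matcher(input);
        List<String> out = new ArrayList<>();
        while ((maxMatch == 0 || out.size() < maxMatch) && m.find()) {
            String value = m.group(group);
            if (value != null) { // assumption: a group that did not participate is skipped
                out.add(value);
            }
        }
        return out.isEmpty() ? null : out;
    }

    // Models rex_offset: "name=start-end" pairs joined by '&', names sorted, end inclusive.
    static String offsets(String input, String pattern, List<String> groupNames) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null; // assumption: no match yields null
        }
        List<String> parts = new ArrayList<>();
        for (String name : new TreeSet<>(groupNames)) { // TreeSet gives alphabetical order
            int start = m.start(name);
            if (start >= 0) { // -1 means the group did not participate in the match
                parts.add(name + "=" + start + "-" + (m.end(name) - 1));
            }
        }
        return String.join("&", parts);
    }

    public static void main(String[] args) {
        System.out.println(extractMulti("a1 b2 c3", "(?<d>\\d)", "d", 2));
        System.out.println(offsets("abc123", "(?<word>[a-z]+)(?<num>\\d+)", List.of("word", "num")));
    }
}
```

For the input `abc123`, the `word` group spans indices 0 to 2 and `num` spans 3 to 5 under the inclusive-end convention, so `num` sorts ahead of `word` in the output string.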
e772008 to 697bd2e
…dge-only
Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:
* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
(already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
(4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
New bridge in this PR. Resolves to DataFusion's `translate` UDF
(datafusion-functions/src/unicode/translate.rs).
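
The mapping from the three sed sub-variants to their Calcite operators can be pictured as a small classifier. This is a loose sketch only: `SedVariantSketch` and `targetOperator` are hypothetical names, the operator names are returned as plain strings, and the naive split on `/` ignores escaped delimiters. The actual lowering lives in the SQL plugin and is not shown in this PR.

```java
// Hypothetical classifier for the sed sub-variants listed above.
final class SedVariantSketch {

    static String targetOperator(String sed) {
        // "s/old/new/g" splits into ["s", "old", "new", "g"]; the -1 keeps a trailing empty flags part.
        String[] parts = sed.split("/", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("unsupported sed expression: " + sed);
        }
        String command = parts[0];
        String flags = parts[3];
        if (command.equals("y") && flags.isEmpty()) {
            return "TRANSLATE3"; // transliteration
        }
        if (command.equals("s") && flags.isEmpty()) {
            return "REGEXP_REPLACE_3"; // 3-arg form, no flags
        }
        if (command.equals("s") && flags.matches("[gi]+")) {
            return "REGEXP_REPLACE_PG_4"; // 4-arg form with a flags string
        }
        throw new IllegalArgumentException("unsupported sed expression: " + sed);
    }

    public static void main(String[] args) {
        System.out.println(targetOperator("s/old/new/gi"));
    }
}
```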
The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
and replacement at position 2 in both signatures — the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.
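
The two Java-regex constructs named above could be rewritten roughly as follows. This is a minimal sketch under stated assumptions, not the code of `RegexpReplaceAdapter`: the class and method names are hypothetical, `\Q…\E` blocks are expanded by backslash-escaping metacharacters, and bare `$N` references are wrapped as `${N}`, the brace form Rust's `regex` crate accepts for disambiguation. It does not handle a pattern in which an escaped backslash happens to precede `Q`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java-to-Rust regex syntax rewrites for the pattern and replacement operands.
final class JavaToRustRegexSketch {

    // Matches \Q ... \E, or \Q ... end of input when \E is omitted.
    private static final Pattern QUOTED = Pattern.compile("\\\\Q(.*?)(?:\\\\E|\\z)", Pattern.DOTALL);
    // Regex metacharacters that must be escaped to keep the quoted text literal.
    private static final Pattern META = Pattern.compile("[\\\\.^$|?*+()\\[\\]{}]");
    // A bare numbered group reference such as $1 or $12.
    private static final Pattern BARE_GROUP_REF = Pattern.compile("\\$(\\d+)");

    // Expands each quoted-literal block into an escaped literal.
    static String unquotePattern(String javaPattern) {
        Matcher m = QUOTED.matcher(javaPattern);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String escaped = META.matcher(m.group(1)).replaceAll("\\\\$0");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Rewrites $N to ${N} so a following identifier character is not absorbed into the group name.
    static String braceGroupRefs(String replacement) {
        return BARE_GROUP_REF.matcher(replacement).replaceAll("\\${$1}");
    }

    public static void main(String[] args) {
        System.out.println(unquotePattern("\\Qa.b\\E+"));
        System.out.println(braceGroupRefs("$1_suffix"));
    }
}
```

Operands after the replacement, such as the flags string in the 4-arg form, would pass through such a rewrite untouched, in line with the description above.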
* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
`REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
support (max 4 args). Also Part 2.
* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
combined, backreferences via `$N`, transliteration `y/from/to/` and
no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.
The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.
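
The failure mode described here is the usual auto-unboxing trap: a missing setting comes back as a null `Integer`, and converting it to `int` throws. A minimal sketch, with a plain `Map` standing in for the settings object and hypothetical names throughout; the default of 10 is the value quoted in this PR's description, and none of this is the SQL plugin's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for reading PPL_REX_MAX_MATCH_LIMIT from a settings lookup.
final class MaxMatchDefaultSketch {

    static final String KEY = "PPL_REX_MAX_MATCH_LIMIT";

    // Pre-fix shape: unboxing a null Integer throws NullPointerException when the key is absent.
    static int readUnsafe(Map<String, Integer> settings) {
        return settings.get(KEY);
    }

    // Post-fix shape: fall back to a default when the key is absent.
    static int readWithDefault(Map<String, Integer> settings) {
        return settings.getOrDefault(KEY, 10);
    }

    public static void main(String[] args) {
        Map<String, Integer> settings = new HashMap<>();
        System.out.println(readWithDefault(settings));
    }
}
```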
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
… (Part 3)
Wires Calcite's `SqlLibraryOperators.ARRAY_LENGTH` to DataFusion's native
`array_length`, completing the end-to-end story for PPL `rex` extract-mode
multi-match: queries can now size the list returned by `rex_extract_multi`
(`eval count = array_length(g)`).
* `ScalarFunction.ARRAY_LENGTH` enum value (resolves via the `valueOf()`
fallback on the Calcite operator name).
* Registered in `STANDARD_PROJECT_OPS`. Returns `bigint`, so the existing
`SUPPORTED_FIELD_TYPES` (numeric ∪ keyword ∪ date ∪ {BOOLEAN, TEXT})
covers the capability lookup — no special-case needed.
* `FunctionMappings.s(SqlLibraryOperators.ARRAY_LENGTH, "array_length")` in
`DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS`. Library operators
don't auto-resolve through the substrait default catalog — the same
explicit pinning pattern used for `ILIKE`, `DATE_PART`, and the
`REGEXP_REPLACE_*` family.
* `array_length` extension declaration in `opensearch_scalar_functions.yaml`
with `list<varchar<L1>>` → `i64` and `list<string>` → `i64` impls. Without
a custom YAML extension that matches the actual list type, isthmus emits
"Unable to convert call ARRAY_LENGTH(list<varchar<...>>)" for the
`rex_extract_multi` output.
Lifts CalciteRexCommandIT (SQL plugin's standard rex IT class) through the
analytics-engine route from 14/18 → 17/18. The remaining failure
(testRexMaxMatchConfigurableLimit) is a unified-query architectural gap —
`UnifiedQueryContext` ignores cluster-setting overrides and uses the static
default — unrelated to rex or array_length.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
The remote OpenSearch Snapshots maven repo (ci.opensearch.org/ci/dbc/snapshots) only republishes from sql/main, not from sql/feature/mustang-ppl-integration, so its 3.7.0.0-SNAPSHOT jars trail the feature branch by however many merges (currently missing PPL_REX_MAX_MATCH_LIMIT, CALCITE_ENGINE_ENABLED, …).
The sandbox-check workflow's pre-step (opensearch-project#21569) publishes feature-branch unified-query jars to mavenLocal, but Gradle's default SNAPSHOT resolution weighs the remote's explicit <buildNumber>/<timestamp> metadata higher than mavenLocal's <localCopy>true</localCopy>, so the stale remote wins even when mavenLocal has a newer <lastUpdated>.
Confirmed via dependencyInsight: every consumer was binding unified-query-api:3.7.0.0-SNAPSHOT:20260507.224009-12 (60kB, 42 classes, no PPL_REX_MAX_MATCH_LIMIT field reference) instead of the locally-published 3.7.0.0-SNAPSHOT (29kB, 21 classes, has the field). The runtime cluster inherited that stale class via the test-ppl-frontend plugin bundle, which is why every IT touching `rex` failed at plan time with `NullPointerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "Settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT)" is null` once the unified path tried to read the setting.
Fix: tell the OpenSearch Snapshots remote to refuse `org.opensearch.query` artifacts via mavenContent { excludeGroup }. Three sites declare the remote:
* sandbox/build.gradle subprojects { repositories }: applies to every sandbox subproject including qa.
* sandbox/plugins/analytics-backend-datafusion/build.gradle: own declaration; left in place for module isolation, filtered identically.
* sandbox/plugins/test-ppl-frontend/build.gradle: also pin mavenLocal as the only source for org.opensearch.query so the bundlePlugin task bundles the freshly-published feature-branch jar rather than the stale timestamped one Gradle would otherwise pick.
Verified locally: bundled unified-query-api drops 60kB → 29kB, the UnifiedQueryContext$Builder constant pool now references PPL_REX_MAX_MATCH_LIMIT, and RexCommandIT goes 0/16 → 16/16 against the same locally-published jars the CI workflow already produces.
Drop this filter once the SQL feature branch merges to sql/main and the remote OpenSearch Snapshots repo catches up. At that point every 3.7.0.0-SNAPSHOT publish will carry the rex max-match default and the mavenLocal preference becomes redundant.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
49c6baf to d37ee44
CI fallout from the prior commit's `excludeGroup 'org.opensearch.query'`
filter on the OpenSearch Snapshots remote: the parent subprojects block
no longer carried mavenLocal, so analytics-engine's testImplementation /
internalClusterTest configurations had no repository at all serving
org.opensearch.query, failing with `Could not find
org.opensearch.query:unified-query-api:3.6.0.0-SNAPSHOT` (and -core / -ppl).
Two pieces:
1. sandbox/build.gradle subprojects { repositories } — also declare
mavenLocal scoped to the org.opensearch.query group via mavenContent
{ includeGroup }. mavenLocal becomes the authoritative source for
unified-query SNAPSHOTs (populated by the sandbox-check workflow's
publishUnifiedQueryPublicationToMavenLocal pre-step) without leaking
into resolution for any other group.
2. sandbox/plugins/analytics-engine/build.gradle — bump
sqlUnifiedQueryVersion from 3.6.0.0-SNAPSHOT → 3.7.0.0-SNAPSHOT.
The 3.6 jars don't exist in mavenLocal (only the 3.7 feature-branch
build does), so the older pin was the proximate cause of the CI
resolution failure. Aligning with test-ppl-frontend's already-3.7
declaration also keeps the unified-query consumers consistent.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
CI surfaced this on the post-rebase rex run:
Duplicate key FunctionAnchor{urn=extension:org.opensearch:scalar_functions,
key=array_length:list} (attempted merging values
array_length:list and array_length:list)
The Part 3 commit declared two impls — `list<varchar<L1>>` and `list<string>`
— with the intent of covering both element-type families produced by
`rex_extract_multi`'s pair of impls. But substrait's compound function key
drops the inner parametric element type at the key level, so both impls
collapse to the same key `array_length:list`. The YAML loader rejects the
collision when the analytics-backend-datafusion plugin's
`SimpleExtension.ExtensionCollection` merges the file in.
Replace the two impls with a single `list<any1>` polymorphic impl. The
`any1` type variable matches any element type at planning, so a call site
that produces `list<varchar<L1>>` (rex_extract_multi varchar overload) and
a call site that produces `list<string>` (rex_extract_multi string
overload) both bind to the one impl. Net effect on planning is equivalent
and the duplicate-key collision goes away.
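
The collision and its fix can be reproduced in miniature. In this sketch the key derivation (keep the outer type name, drop everything inside `<...>`) is a hypothetical stand-in for substrait's compound function key, and indexing through `Collectors.toMap` is an assumption about the mechanism, chosen because it produces an error of the same shape as the one CI reported.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical model of function-key derivation for one-argument impls of array_length.
final class FunctionKeySketch {

    // Keeps the outer type name and drops the parametric element type.
    static String key(String functionName, String argType) {
        int lt = argType.indexOf('<');
        String outer = lt < 0 ? argType : argType.substring(0, lt);
        return functionName + ":" + outer;
    }

    // Indexes impls by key; toMap without a merge function throws IllegalStateException on duplicates.
    static Map<String, String> index(List<String> argTypes) {
        return argTypes.stream()
            .collect(Collectors.toMap(t -> key("array_length", t), t -> t));
    }

    public static void main(String[] args) {
        // Single polymorphic impl: one key, no collision.
        System.out.println(index(List.of("list<any1>")));
        try {
            // Two impls that differ only in element type collapse to the same key.
            index(List.of("list<varchar<L1>>", "list<string>"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```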
The duplicate didn't surface on the original rex CI run because the prior
PPL_REX_MAX_MATCH_LIMIT NPE failed every query at plan time before the
function-extension merge was reached. Once the mavenLocal pin fix landed
the prior commit and queries actually reached the planner, this older
latent collision was unmasked.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
0d812e8 to 0dafa26
Description
Onboards the PPL `rex` command to the analytics-engine path end-to-end.
* sed mode (`rex field=f mode=sed "s/.../.../<flags>"`, `y/from/to/`): bridges to existing Calcite operators (`REGEXP_REPLACE_3`, `REGEXP_REPLACE_PG_4`, `TRANSLATE3`) → DataFusion's native `regexp_replace` / `translate` UDFs. The replace adapter handles Java→Rust regex syntax differences (`\Q…\E`, `\$N`).
* extract mode (`rex field=f "(?<g>...)"`, with `max_match=` / `offset_field=`): three Rust UDFs (`rex_extract`, `rex_extract_multi`, `rex_offset`), since the SQL plugin's Java UDFs have no DataFusion equivalent. Mirrors the `convert_tz` precedent (Add 3 different types of PPL scalar functions to analytics-engine - prove wiring based on DataFusion capabilities #21476).
Three small framework additions land alongside:
* `FieldType.ARRAY` + a switch arm so the planner's capability lookup matches `rex_extract_multi`'s `array<varchar>` return type.
* `ListVector.getObject()` bypass at three call sites: Arrow's `JsonStringArrayList.<clinit>` references `jackson-datatype-jsr310`'s `JavaTimeModule`, which isn't on the `arrow-flight-rpc` parent plugin's classloader.
* `array_length` onboarded as a scalar function (Calcite `SqlLibraryOperators.ARRAY_LENGTH` → DataFusion's native `array_length`). Required end-to-end so PPL queries can size the list returned by extract-mode (`eval count = array_length(g)`).
Companion PR
opensearch-project/sql#5418 covers two related SQL-side changes required for full end-to-end behavior on `/_analytics/ppl`:
* `PPL_REX_MAX_MATCH_LIMIT=10` in `UnifiedQueryContext`: required for any rex query to reach the planner (without it, `AstBuilder.visitRexCommand` NPEs unboxing a null setting value).
* `Settings` instance into `UnifiedQueryContext` for `PPL_REX_MAX_MATCH_LIMIT`, so mid-run `_cluster/settings` updates reach the unified path. Scoped to this one key for now (the other keys in `UnifiedQueryContext` keep their static defaults).
Tests
* `RegexpReplaceAdapterTests`: 21/21
* `RexExtractAdapterTests`: 4/4
* `RexCommandIT` (sandbox QA): 16/16 (9 sed + 7 extract)
* `./gradlew check -p sandbox -Dsandbox.enabled=true`: green
SQL plugin's `CalciteRexCommandIT` via the analytics-engine route
Run against a cluster with the bundle-side test infrastructure (PPL coverage bundle) + locally-published SQL plugin including #5418.
All extract-mode cases (single group, multiple groups, nested groups, complex patterns), all error-path cases (invalid group names, no named groups), `rex` chained with `where` / `stats` / `head` / filtering, and every `max_match` variant including `testRexMaxMatchConfigurableLimit` (which mid-run updates the cluster setting and asserts it takes effect).