[Analytics Backend / DataFusion] Onboard PPL rex to DataFusion by RyanL1997 · Pull Request #21550 · opensearch-project/OpenSearch

RyanL1997 · 2026-05-07T21:07:07Z

Description

Onboards the PPL rex command to the analytics-engine path end-to-end.

sed mode (rex field=f mode=sed "s/.../.../<flags>", y/from/to/) — bridges to existing Calcite operators (REGEXP_REPLACE_3, REGEXP_REPLACE_PG_4, TRANSLATE3) → DataFusion's native regexp_replace / translate UDFs. The replace adapter handles Java→Rust regex syntax differences (\Q…\E, \$N).
extract mode (rex field=f "(?<g>...)", with max_match= / offset_field=) — three Rust UDFs (rex_extract, rex_extract_multi, rex_offset), since the SQL plugin's Java UDFs have no DataFusion equivalent. Mirrors the convert_tz precedent (Add 3 different types of PPL scalar functions to analytics-engine - prove wiring based on DataFusion capabilities #21476).
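
The Java-to-Rust syntax gap named in the sed-mode item can be illustrated with a small sketch. It is not the code of `RegexpReplaceAdapter`; the class name, method names, and the metacharacter set below are assumptions made for illustration. It expands `\Q…\E` blocks into escaped literals and wraps bare `$N` references in braces so that a parser that reads identifiers greedily cannot misread them.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual RegexpReplaceAdapter.
final class RegexRewriteSketch {
    // Assumed conservative set of characters to escape inside a quoted literal.
    private static final String META = "\\.[]{}()*+-?^$|#&~";
    private static final Pattern BARE_GROUP_REF = Pattern.compile("\\$(\\d+)");

    /** Replaces each \Q...\E block with the same text, metacharacters escaped. */
    static String expandQuotedLiterals(String pattern) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            // Limitation of this sketch: an escaped backslash followed by Q is not detected.
            int start = pattern.indexOf("\\Q", i);
            if (start < 0) {
                out.append(pattern, i, pattern.length());
                break;
            }
            out.append(pattern, i, start);
            int end = pattern.indexOf("\\E", start + 2);
            // An unterminated \Q quotes to the end of the pattern, as in Java.
            String literal = end < 0 ? pattern.substring(start + 2) : pattern.substring(start + 2, end);
            for (char c : literal.toCharArray()) {
                if (META.indexOf(c) >= 0) {
                    out.append('\\');
                }
                out.append(c);
            }
            i = end < 0 ? pattern.length() : end + 2;
        }
        return out.toString();
    }

    /** Rewrites bare $N references to ${N}; escaped dollar signs are not handled here. */
    static String braceGroupRefs(String replacement) {
        return BARE_GROUP_REF.matcher(replacement).replaceAll("\\${$1}");
    }

    public static void main(String[] args) {
        System.out.println(expandQuotedLiterals("a\\Q.b*\\Ec")); // prints a\.b\*c
        System.out.println(braceGroupRefs("$1-$2x"));            // prints ${1}-${2}x
    }
}
```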

Three small framework additions land alongside:

FieldType.ARRAY + a switch arm so the planner's capability lookup matches rex_extract_multi's array<varchar> return type.
A ListVector.getObject() bypass at three call sites — Arrow's JsonStringArrayList.<clinit> references jackson-datatype-jsr310's JavaTimeModule, which isn't on the arrow-flight-rpc parent plugin's classloader.
array_length onboarded as a scalar function (Calcite SqlLibraryOperators.ARRAY_LENGTH → DataFusion's native array_length). Required end-to-end so PPL queries can size the list returned by extract-mode (eval count = array_length(g)).
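
To make the `getObject()` bypass concrete, here is a minimal sketch of reading a list column from its offsets and a flat child array. It models the Arrow list layout with plain Java arrays and does not call the Arrow API; the class and parameter names are hypothetical. The PR's actual call sites read the same information from the `ListVector` itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a list<varchar> column: plain arrays stand in for Arrow buffers.
final class ListLayoutSketch {
    /**
     * Row i spans values[offsets[i]] (inclusive) to values[offsets[i + 1]] (exclusive).
     * A row whose validity flag is false is NULL, which is distinct from an empty list.
     */
    static List<String> readRow(int[] offsets, boolean[] valid, String[] values, int row) {
        if (!valid[row]) {
            return null;
        }
        int start = offsets[row];
        int end = offsets[row + 1];
        List<String> out = new ArrayList<>(end - start);
        for (int i = start; i < end; i++) {
            out.add(values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 2, 2, 3};          // three rows, n + 1 offsets
        boolean[] valid = {true, false, true}; // row 1 is NULL
        String[] values = {"a", "b", "c"};
        for (int row = 0; row < valid.length; row++) {
            System.out.println(readRow(offsets, valid, values, row));
        }
    }
}
```

Building a plain `ArrayList` this way never touches `JsonStringArrayList`, which is the point of the bypass described above.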

Companion PR

opensearch-project/sql#5418 covers two related SQL-side changes required for full end-to-end behavior on /_analytics/ppl:

Defaults PPL_REX_MAX_MATCH_LIMIT=10 in UnifiedQueryContext — required for any rex query to reach the planner (without it, AstBuilder.visitRexCommand NPEs unboxing a null setting value).
Bridges the live cluster Settings instance into UnifiedQueryContext for PPL_REX_MAX_MATCH_LIMIT, so mid-run _cluster/settings updates reach the unified path. Scoped to this one key for now (the other keys in UnifiedQueryContext keep their static defaults).

Tests

Rust UDFs — 15/15 unit tests
RegexpReplaceAdapterTests — 21/21
RexExtractAdapterTests — 4/4
RexCommandIT (sandbox QA) — 16/16 (9 sed + 7 extract)
./gradlew check -p sandbox -Dsandbox.enabled=true — green

SQL plugin's `CalciteRexCommandIT` via the analytics-engine route

Run against a cluster with the bundle-side test infrastructure (PPL coverage bundle) + locally-published SQL plugin including #5418.

Tests executed	Passed	Failed	Pass rate
18	18	0	100.0%

The run covers all extract-mode cases (single group, multiple groups, nested groups, complex patterns), all error-path cases (invalid group names, no named groups), rex chained with where / stats / head / filtering, and every max_match variant, including testRexMaxMatchConfigurableLimit (which updates the cluster setting mid-run and asserts that the change takes effect).

github-actions · 2026-05-07T21:08:11Z

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 0dafa26.

Path	Line	Severity	Description
sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml	75	high	New dependency added: `regex = "1.10"`. Per mandatory supply chain policy, all dependency additions must be flagged regardless of apparent legitimacy. Maintainers should verify the crate name, version, and source resolve to the expected artifact (crates.io `regex` crate by Andrew Gallant) and that the pinned version has no known CVEs.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0

Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.

⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

… Rust UDFs + array result type

Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms were already covered by bridges to existing Calcite/DataFusion operators. The extract-mode form has no native DataFusion equivalent and needs three custom Rust UDFs, three Java SqlOperator adapters, and a small handful of analytics-framework / engine plumbing changes to model array result types end-to-end.

## What's new

### Rust — three UDFs in opensearch-native-lib

* `rex_extract(input, pattern_lit, group_lit) -> varchar` — single named or numbered group capture. Compiles the regex once at plan time, runs per row.
* `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>` — multi-match. `max_match=0` means unbounded; otherwise caps the result at the requested element count. Returns NULL (not an empty list) when there are no matches, matching the SQL plugin's Java implementation.
* `rex_offset(input, pattern_lit) -> varchar` — emits the named-group offsets formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.

Each UDF has 5 unit tests covering the contract above.

### Java — three SqlOperator adapters

* `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter` — keyed on the SQL plugin's PPL builtin operator names (`REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait extension declared in `opensearch_scalar_functions.yaml`.
* Pattern operands (and the group operand for the extract variants) are validated as RexLiterals at plan time. Column-valued patterns would force per-row regex compilation on the Rust side and are rejected with an IllegalArgumentException — same contract as the precedent set by RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.

### Framework — array return-type support

* `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY` in analytics-framework. Without this, `OpenSearchProjectRule.resolveScalarViableBackends` returns `null` for any scalar with an array return type and the planner emits "No backend supports scalar function [REX_EXTRACT_MULTI] among [datafusion]". `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar declaration is now `Set.of(FieldType.ARRAY)` rather than the broad scalar set used by every other op (UPPER, ABS, ...) — those genuinely don't return arrays.
* `ListVector` handling in three call sites that previously triggered Arrow's `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's classloader). Bypassing `getObject()` and reading offset buffer + inner data vector directly:
  - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
  - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
  - `RowResponseCodec` (`inferArrowField` + `setVectorValue`) — the Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit list<utf8> Field with proper child Field, plus a ListVector setter using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.

### IT coverage

* `RexCommandIT` extended from 9 sed tests to 16 — adds 7 extract-mode cases: single named group, multiple named groups in one row, missing-group NULL handling, multi-match capturing all, `max_match` cap, offset_field output, and no-match passthrough as NULL.

## Required runtime dependency

* `commons-text:1.11.0` is now a `runtimeOnly` dep of analytics-engine. Calcite's `SqlFunctions.<clinit>` references LevenshteinDistance from commons-text; without bundling the jar the cluster crashes the first time a query reaches Calcite. The matching SHA1 + LICENSE + NOTICE artifacts are added under `licenses/` per the repo dependency-license check.

## Test results

* Rust UDFs — 15/15 unit tests (5 per UDF).
* `RexExtractAdapterTests` — 4/4.
* `RexCommandIT` — 16/16 (9 sed from Part 1 + 7 new extract).
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green (678 tasks, all sandbox module unit tests + spotless + license + forbidden API).

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
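
The contracts stated for `rex_extract_multi` and `rex_offset` in this commit message can be sketched in Java with `java.util.regex`. This is an illustration under assumptions, not the Rust UDF code: the class name is hypothetical, group names are passed in explicitly rather than derived from the pattern, non-participating groups are skipped, and offsets are taken from the first match only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the stated contracts; not the Rust implementation.
final class RexSemanticsSketch {
    /** maxMatch == 0 means unbounded; returns null (not an empty list) when nothing matches. */
    static List<String> extractMulti(String input, String pattern, String group, int maxMatch) {
        if (input == null) {
            return null;
        }
        Matcher m = Pattern.compile(pattern).matcher(input);
        List<String> out = new ArrayList<>();
        while ((maxMatch == 0 || out.size() < maxMatch) && m.find()) {
            String captured = m.group(group);
            if (captured != null) { // assumption: skip groups that did not participate
                out.add(captured);
            }
        }
        return out.isEmpty() ? null : out;
    }

    /** Formats "name1=s1-e1&name2=s2-e2", names sorted alphabetically, end offset inclusive. */
    static String offsets(String input, String pattern, List<String> groupNames) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null;
        }
        List<String> sorted = new ArrayList<>(groupNames);
        Collections.sort(sorted);
        StringBuilder sb = new StringBuilder();
        for (String name : sorted) {
            if (m.start(name) < 0) {
                continue; // group did not participate in the match
            }
            if (sb.length() > 0) {
                sb.append('&');
            }
            // Matcher.end is exclusive, so subtract 1 for the inclusive convention.
            sb.append(name).append('=').append(m.start(name)).append('-').append(m.end(name) - 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractMulti("a1 b22 c333", "(?<d>\\d+)", "d", 2));
        System.out.println(offsets("user=bob", "(?<name>\\w+)=(?<val>\\w+)", List.of("val", "name")));
    }
}
```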

…_MATCH_LIMIT

The previous commit defaulted `PPL_REX_MAX_MATCH_LIMIT=10` in `UnifiedQueryContext.Builder.settings` to fix the NPE in `AstBuilder.visitRexCommand` on the unified path. The default is correct, but it doesn't respect mid-run cluster overrides — every key in the static map returns its hardcoded value regardless of `_cluster/settings` updates. This breaks `CalciteRexCommandIT.testRexMaxMatchConfigurableLimit`, which explicitly sets the cluster-side limit to 5 and asserts that `max_match=0` caps at 5; on the unified path it stayed at 10.

This change introduces a `Builder.liveSettings(Settings)` hook that the REST handler can use to inject the cluster's live `OpenSearchSettings` instance. At `build()` time the Builder snapshots the live value of `PPL_REX_MAX_MATCH_LIMIT` (only — see scoping note below) into the static map, overriding the hardcoded default when the operator has set a cluster value. Snapshot-at-build matches the per-HTTP-request lifecycle of `UnifiedQueryContext` and avoids per-call lookup overhead.

## Why scoped to PPL_REX_MAX_MATCH_LIMIT only

The same architectural gap exists for every key in the static map (`QUERY_SIZE_LIMIT`, `PPL_SUBSEARCH_MAXOUT`, `PPL_JOIN_SUBSEARCH_MAXOUT`, `CALCITE_ENGINE_ENABLED`). For three of those, the static defaults are fine in practice (no test overrides them mid-run; `head N` covers `QUERY_SIZE_LIMIT` per-query). `CALCITE_ENGINE_ENABLED` is intentionally pinned to `true` for the unified path — a cluster override toggling it off would defeat the point of routing here. So this PR widens only the one key that demonstrably needs it; widening the snapshot to the rest is a future scope decision tied to whichever new IT first depends on it.

## Wire-up

`RestUnifiedQueryAction` gains a `pluginSettings` field (the same `OpenSearchSettings` instance bound in the Guice module) and forwards it to the Builder in both `buildContext` (per-request execution path) and `buildParsingContext` (analytics-routing index name probe). Both construction sites — `SQLPlugin.createSqlAnalyticsRouter` and `TransportPPLQueryAction.<init>` — are updated to pass the existing plugin-side `Settings` instance. `buildParsingContext` had been `static` because it didn't need any instance state; it's now an instance method since it reads `pluginSettings`.

## Test results

CalciteRexCommandIT through the analytics-engine route (every PPL query forced through `/_analytics/ppl` via `tests.analytics.force_routing=true`):

* Before this change: 17/18 — `testRexMaxMatchConfigurableLimit` fails with `expected:<5> but was:<10>` (cluster override doesn't reach the unified path).
* After this change: 18/18 — all `testRexMaxMatch*` variants honor the cluster setting.

## Companion PR

opensearch-project/OpenSearch#21550 — onboards PPL `rex` to DataFusion via the analytics-engine path. The 17/18 baseline reported in that PR's description was measured against the previous commit on this branch; with this change the route hits 18/18.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
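
The snapshot-at-build behaviour described in this commit message can be sketched as follows. This is a simplified model, not the `UnifiedQueryContext` code: the class names are hypothetical, and a `Function<String, Object>` stands in for the live cluster settings instance.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of "static defaults plus a one-key live snapshot taken at build()".
final class SnapshotSettingsSketch {
    static final String REX_LIMIT = "PPL_REX_MAX_MATCH_LIMIT";
    static final String CALCITE_ENABLED = "CALCITE_ENGINE_ENABLED";

    static final class Builder {
        private final Map<String, Object> defaults = new HashMap<>();
        private Function<String, Object> live; // stand-in for the live settings lookup

        Builder() {
            defaults.put(REX_LIMIT, 10);
            defaults.put(CALCITE_ENABLED, true); // pinned; never read from live settings
        }

        Builder liveSettings(Function<String, Object> live) {
            this.live = live;
            return this;
        }

        /** Copies the defaults, then overrides only REX_LIMIT from the live source if it is set. */
        Map<String, Object> build() {
            Map<String, Object> snapshot = new HashMap<>(defaults);
            if (live != null) {
                Object value = live.apply(REX_LIMIT);
                if (value != null) {
                    snapshot.put(REX_LIMIT, value);
                }
            }
            return Collections.unmodifiableMap(snapshot);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> cluster = new HashMap<>();
        cluster.put(REX_LIMIT, 5);
        Map<String, Object> ctx = new Builder().liveSettings(cluster::get).build();
        cluster.put(REX_LIMIT, 7); // later updates do not reach an already-built context
        System.out.println(ctx.get(REX_LIMIT));
    }
}
```

Because the value is copied once per built context, a later cluster update is picked up by the next request's context rather than by one already in flight.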

RyanL1997 · 2026-05-08T02:31:12Z

CI status check — failures are expected and explained below:

sandbox-check — All 16 RexCommandIT tests NPE on Settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT).intValue(). This is the same NPE that companion SQL PR opensearch-project/sql#5418 fixes. CI resolves opensearch-sql-plugin:3.7.0.0-SNAPSHOT from the published snapshots URL, which doesn't yet include opensearch-project/sql#5418 (that PR is still open). Locally this is reproduced by publishToMavenLocal from a worktree carrying opensearch-project/sql#5418, where every test passes (16/16 on the sandbox QA RexCommandIT, 18/18 on the SQL plugin's CalciteRexCommandIT via the analytics-engine route). The dependency is called out in the PR description's Companion PR section. Once the above PR lands and the SQL snapshot refreshes, this check will go green.

RyanL1997 · 2026-05-08T17:59:16Z

CI is correctly pulling the canonical published snapshot. The published snapshot is from main, not from feature/mustang-ppl-integration. opensearch-project/sql#5418 is in the feature branch but never made it into a published artifact. So this PR's CI will stay red until either the feature branch gets merged to main (auto-publish) or someone with write access manually triggers maven-publish-modules.yml against feature/mustang-ppl-integration.

…dge-only

Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to standard Calcite library operators and bridges through Substrait to DataFusion's native UDFs. Three sed sub-variants covered:

* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3 (already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4 (4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3. New bridge in this PR. Resolves to DataFusion's `translate` UDF (datafusion-functions/src/unicode/translate.rs).

The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the 3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N` backreferences in the replacement (Rust's identifier-greedy parser mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in and replacement at position 2 in both signatures — the rewrite logic doesn't change. Operands beyond position 2 (the flags string in the 4-arg form) pass through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.

`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not regex syntax.

* Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg `REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not support (max 4 args). Also Part 2.

* `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi` combined, backreferences via `$N`, transliteration `y/from/to/` and no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.

The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed in opensearch-project/sql#5418 — required for any rex query (sed or extract) to reach the planner via /_analytics/ppl. This PR's Test results assume opensearch-project/sql#5418 is applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix: 9/9 RexCommandIT pass.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
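
The mapping from sed sub-variants to Calcite operators listed in this commit message can be summarised with a small routing sketch. It is not the SQL plugin's parser: the class and method names are hypothetical, operators are returned as plain strings, and the naive split on `/` does not handle escaped slashes.

```java
// Hypothetical routing table for the sed forms named above; operators are plain strings here.
final class SedRoutingSketch {
    /** Returns the operator name a sed expression such as "s/old/new/g" would lower to. */
    static String targetOperator(String sedExpr) {
        // Limit -1 keeps the trailing empty flags field of "s/old/new/".
        String[] parts = sedExpr.split("/", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected cmd/a/b/flags: " + sedExpr);
        }
        String cmd = parts[0];
        String flags = parts[3];
        if (cmd.equals("y")) {
            return "TRANSLATE3"; // transliteration
        }
        if (!cmd.equals("s")) {
            throw new IllegalArgumentException("unknown sed command: " + cmd);
        }
        if (flags.isEmpty()) {
            return "REGEXP_REPLACE_3"; // no flags
        }
        if (flags.matches("[gi]+")) {
            return "REGEXP_REPLACE_PG_4"; // flags string passed as the fourth argument
        }
        if (flags.matches("\\d+")) {
            return "REGEXP_REPLACE_5"; // occurrence flag
        }
        throw new IllegalArgumentException("unsupported flags: " + flags);
    }

    public static void main(String[] args) {
        for (String expr : new String[] {"s/old/new/", "s/old/new/gi", "y/abc/xyz/", "s/old/new/2"}) {
            System.out.println(expr + " -> " + targetOperator(expr));
        }
    }
}
```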

… Rust UDFs + array result type

Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms were already covered by bridges to existing Calcite/DataFusion operators. The extract-mode form has no native DataFusion equivalent and needs three custom Rust UDFs, three Java SqlOperator adapters, and a small handful of analytics-framework / engine plumbing changes to model array result types end-to-end.

* `rex_extract(input, pattern_lit, group_lit) -> varchar`: single named or numbered group capture. Compiles the regex once at plan time, runs per row.
* `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>`: multi-match. `max_match=0` means unbounded; otherwise caps the result at the requested element count. Returns NULL (not an empty list) when there are no matches, matching the SQL plugin's Java implementation.
* `rex_offset(input, pattern_lit) -> varchar`: emits the named-group offsets formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.

Each UDF has 5 unit tests covering the contract above.

* `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter`: keyed on the SQL plugin's PPL builtin operator names (`REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait extension declared in `opensearch_scalar_functions.yaml`.
* Pattern operands (and the group operand for the extract variants) are validated as RexLiterals at plan time. Column-valued patterns would force per-row regex compilation on the Rust side and are rejected with an IllegalArgumentException, the same contract as the precedent set by RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.
* `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY` in analytics-framework. Without this, `OpenSearchProjectRule.resolveScalarViableBackends` returns `null` for any scalar with an array return type and the planner emits "No backend supports scalar function [REX_EXTRACT_MULTI] among [datafusion]". `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar declaration is now `Set.of(FieldType.ARRAY)` rather than the broad scalar set used by every other op (UPPER, ABS, ...); those genuinely don't return arrays.
* `ListVector` handling in three call sites that previously triggered Arrow's `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's classloader). Bypassing `getObject()` and reading offset buffer + inner data vector directly:
  - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
  - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
  - `RowResponseCodec` (`inferArrowField` + `setVectorValue`): the Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit list<utf8> Field with proper child Field, plus a ListVector setter using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.
* `RexCommandIT` extended from 9 sed tests to 16, adding 7 extract-mode cases: single named group, multiple named groups in one row, missing-group NULL handling, multi-match capturing all, `max_match` cap, offset_field output, and no-match passthrough as NULL.

* Rust UDFs: 15/15 unit tests (5 per UDF).
* `RexExtractAdapterTests`: 4/4.
* `RexCommandIT`: 16/16 (9 sed from Part 1 + 7 new extract).
* `./gradlew check -p sandbox -Dsandbox.enabled=true`: green (678 tasks, all sandbox module unit tests + spotless + license + forbidden API).

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
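The contract of the three extract-mode functions can be sketched in plain Java. This is an illustration, not the Rust UDFs or the SQL plugin's classes: `RexExtractSketch` and its method names are hypothetical, and named groups are discovered by scanning the pattern text with a simplified regex. Skipping non-participating groups and returning `null` from `offsets` on no match are assumptions of the sketch. The caller passes an already compiled `Pattern`, mirroring the "compile once, run per row" rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of the extract-mode contract; not the PR's code.
final class RexExtractSketch {
    // Simplified named-group finder; ignores edge cases such as escaped parentheses.
    private static final Pattern GROUP_NAME = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /** Value of the named group in the first match, or null when nothing matches. */
    static String extract(String input, Pattern pattern, String group) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    /** All values of the named group; maxMatch == 0 means unbounded. Null when empty. */
    static List<String> extractMulti(String input, Pattern pattern, String group, int maxMatch) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while ((maxMatch == 0 || values.size() < maxMatch) && m.find()) {
            String value = m.group(group);
            if (value != null) { // assumption: skip groups that did not participate
                values.add(value);
            }
        }
        return values.isEmpty() ? null : values; // NULL rather than an empty list
    }

    /** "name1=s1-e1&name2=s2-e2", names sorted alphabetically, end offset inclusive. */
    static String offsets(String input, Pattern pattern) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return null; // assumption of this sketch
        }
        Map<String, String> sorted = new TreeMap<>();
        Matcher names = GROUP_NAME.matcher(pattern.pattern());
        while (names.find()) {
            String name = names.group(1);
            if (m.start(name) >= 0) {
                // Matcher.end is exclusive, so subtract 1 for the inclusive convention.
                sorted.put(name, m.start(name) + "-" + (m.end(name) - 1));
            }
        }
        return sorted.entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining("&"));
    }
}
```

With the pattern `(?<user>[^@]+)@(?<host>.+)` and the input `alice@example.com`, `offsets` yields `host=6-16&user=0-4`: `host` sorts before `user`, and each end offset is one less than Java's exclusive `end`.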

… (Part 3)

Wires Calcite's `SqlLibraryOperators.ARRAY_LENGTH` to DataFusion's native `array_length`, completing the end-to-end story for PPL `rex` extract-mode multi-match: queries can now size the list returned by `rex_extract_multi` (`eval count = array_length(g)`).

* `ScalarFunction.ARRAY_LENGTH` enum value (resolves via the `valueOf()` fallback on the Calcite operator name).
* Registered in `STANDARD_PROJECT_OPS`. Returns `bigint`, so the existing `SUPPORTED_FIELD_TYPES` (numeric ∪ keyword ∪ date ∪ {BOOLEAN, TEXT}) covers the capability lookup; no special case is needed.
* `FunctionMappings.s(SqlLibraryOperators.ARRAY_LENGTH, "array_length")` in `DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS`. Library operators don't auto-resolve through the substrait default catalog; this is the same explicit pinning pattern used for `ILIKE`, `DATE_PART`, and the `REGEXP_REPLACE_*` family.
* `array_length` extension declaration in `opensearch_scalar_functions.yaml` with `list<varchar<L1>>` → `i64` and `list<string>` → `i64` impls. Without a custom YAML extension that matches the actual list type, isthmus emits "Unable to convert call ARRAY_LENGTH(list<varchar<...>>)" for the `rex_extract_multi` output.

Lifts CalciteRexCommandIT (SQL plugin's standard rex IT class) through the analytics-engine route from 14/18 → 17/18. The remaining failure (testRexMaxMatchConfigurableLimit) is a unified-query architectural gap (`UnifiedQueryContext` ignores cluster-setting overrides and uses the static default) and is unrelated to rex or array_length.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
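A rough Java sketch of the capability lookup described in this series, under heavy assumptions: `CapabilitySketch`, its reduced `FieldType` enum, the `arrayArmPresent` toggle, and the `DECLARED` table are hypothetical stand-ins, not the analytics-framework types. It only shows why an unmapped `ARRAY` return type leaves no viable backend, while `array_length`, which returns `bigint`, resolves through an arm that already exists.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the planner's capability lookup; names are illustrative.
final class CapabilitySketch {
    enum FieldType { LONG, KEYWORD, BOOLEAN, ARRAY }

    /** Maps a SQL type name to a FieldType; returns null for unmapped types. */
    static FieldType fromSqlTypeName(String sqlType, boolean arrayArmPresent) {
        switch (sqlType) {
            case "BIGINT":
                return FieldType.LONG;
            case "VARCHAR":
                return FieldType.KEYWORD;
            case "BOOLEAN":
                return FieldType.BOOLEAN;
            case "ARRAY":
                // Without this arm, array-returning scalars cannot be matched at all.
                return arrayArmPresent ? FieldType.ARRAY : null;
            default:
                return null;
        }
    }

    // Declared return-type capabilities per function (illustrative subset).
    static final Map<String, Set<FieldType>> DECLARED = Map.of(
        "REX_EXTRACT_MULTI", EnumSet.of(FieldType.ARRAY),
        "ARRAY_LENGTH", EnumSet.of(FieldType.LONG, FieldType.KEYWORD, FieldType.BOOLEAN));

    /** True when the function's return type maps to a declared capability. */
    static boolean supported(String function, String returnSqlType, boolean arrayArmPresent) {
        FieldType type = fromSqlTypeName(returnSqlType, arrayArmPresent);
        Set<FieldType> declared = DECLARED.get(function);
        return type != null && declared != null && declared.contains(type);
    }
}
```

In this sketch `supported("REX_EXTRACT_MULTI", "ARRAY", false)` is false, which corresponds to the "No backend supports scalar function" outcome, and flipping the toggle to `true` makes it pass.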

The remote OpenSearch Snapshots maven repo (ci.opensearch.org/ci/dbc/snapshots) only republishes from sql/main, not from sql/feature/mustang-ppl-integration, so its 3.7.0.0-SNAPSHOT jars trail the feature branch by however many merges (currently missing PPL_REX_MAX_MATCH_LIMIT, CALCITE_ENGINE_ENABLED, …).

The sandbox-check workflow's pre-step opensearch-project#21569 publishes feature-branch unified-query jars to mavenLocal, but Gradle's default SNAPSHOT resolution weighs the remote's explicit `<buildNumber>`/`<timestamp>` metadata higher than mavenLocal's `<localCopy>true</localCopy>`, so the stale remote wins even when mavenLocal has a newer `<lastUpdated>`.

Confirmed via dependencyInsight: every consumer was binding unified-query-api:3.7.0.0-SNAPSHOT:20260507.224009-12 (60kB, 42 classes, no PPL_REX_MAX_MATCH_LIMIT field reference) instead of the locally-published 3.7.0.0-SNAPSHOT (29kB, 21 classes, has the field). The runtime cluster inherited that stale class via the test-ppl-frontend plugin bundle, which is why every IT touching `rex` failed at plan time with `NullPointerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "Settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT)" is null` once the unified path tried to read the setting.

Fix: tell the OpenSearch Snapshots remote to refuse `org.opensearch.query` artifacts via mavenContent { excludeGroup }. Three sites declare the remote:

* sandbox/build.gradle subprojects { repositories }: applies to every sandbox subproject including qa.
* sandbox/plugins/analytics-backend-datafusion/build.gradle: own declaration; left in place for module isolation, filtered identically.
* sandbox/plugins/test-ppl-frontend/build.gradle: also pin mavenLocal as the only source for org.opensearch.query so the bundlePlugin task bundles the freshly-published feature-branch jar rather than the stale timestamped one Gradle would otherwise pick.

Verified locally: bundled unified-query-api drops 60kB → 29kB, the UnifiedQueryContext$Builder constant pool now references PPL_REX_MAX_MATCH_LIMIT, and RexCommandIT goes 0/16 → 16/16 against the same locally-published jars the CI workflow already produces.

Drop this filter once the SQL feature branch merges to sql/main and the remote OpenSearch Snapshots repo catches up. At that point every 3.7.0.0-SNAPSHOT publish will carry the rex max-match default and the mavenLocal preference becomes redundant.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>

CI fallout from the prior commit's `excludeGroup 'org.opensearch.query'` filter on the OpenSearch Snapshots remote: the parent subprojects block no longer carried mavenLocal, so analytics-engine's testImplementation / internalClusterTest configurations had no repository at all serving org.opensearch.query, failing with `Could not find org.opensearch.query:unified-query-api:3.6.0.0-SNAPSHOT` (and -core / -ppl).

Two pieces:

1. sandbox/build.gradle subprojects { repositories }: also declare mavenLocal scoped to the org.opensearch.query group via mavenContent { includeGroup }. mavenLocal becomes the authoritative source for unified-query SNAPSHOTs (populated by the sandbox-check workflow's publishUnifiedQueryPublicationToMavenLocal pre-step) without leaking into resolution for any other group.
2. sandbox/plugins/analytics-engine/build.gradle: bump sqlUnifiedQueryVersion from 3.6.0.0-SNAPSHOT → 3.7.0.0-SNAPSHOT. The 3.6 jars don't exist in mavenLocal (only the 3.7 feature-branch build does), so the older pin was the proximate cause of the CI resolution failure. Aligning with test-ppl-frontend's already-3.7 declaration also keeps the unified-query consumers consistent.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>

CI surfaced this on the post-rebase rex run:

`Duplicate key FunctionAnchor{urn=extension:org.opensearch:scalar_functions, key=array_length:list} (attempted merging values array_length:list and array_length:list)`

The Part 3 commit declared two impls, `list<varchar<L1>>` and `list<string>`, with the intent of covering both element-type families produced by `rex_extract_multi`'s pair of impls. But substrait's compound function key drops the inner parametric element type at the key level, so both impls collapse to the same key `array_length:list`. The YAML loader rejects the collision when the analytics-backend-datafusion plugin's `SimpleExtension.ExtensionCollection` merges the file in.

Replace the two impls with a single `list<any1>` polymorphic impl. The `any1` type variable matches any element type at planning, so a call site that produces `list<varchar<L1>>` (rex_extract_multi varchar overload) and a call site that produces `list<string>` (rex_extract_multi string overload) both bind to the one impl. The net effect on planning is equivalent and the duplicate-key collision goes away.

The duplicate didn't surface on the original rex CI run because the prior PPL_REX_MAX_MATCH_LIMIT NPE failed every query at plan time, before the function-extension merge was reached. Once the mavenLocal pin fix landed in the prior commit and queries actually reached the planner, this older latent collision was unmasked.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
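The collision can be reproduced in miniature with a Java sketch. It is a simplification, not substrait-java's code: `FunctionKeySketch` and its `key` rule (function name plus the argument type with everything from the first `<` dropped) are hypothetical. The shape of the error in the CI log matches what `Collectors.toMap` reports on recent JDKs when two entries produce the same key, which is what the sketch relies on.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical simplification of a compound function key; not substrait-java's code.
final class FunctionKeySketch {

    /** Builds "name:baseType", dropping type parameters such as <varchar<L1>>. */
    static String key(String name, String argType) {
        int lt = argType.indexOf('<');
        String base = lt < 0 ? argType : argType.substring(0, lt);
        return name + ":" + base;
    }

    /** Indexes impls by key; toMap throws IllegalStateException on a duplicate key. */
    static Map<String, String> index(String name, List<String> argTypes) {
        return argTypes.stream()
            .collect(Collectors.toMap(type -> key(name, type), type -> type));
    }
}
```

Indexing `list<varchar<L1>>` and `list<string>` under `array_length` throws because both reduce to `array_length:list`, while a single `list<any1>` entry indexes cleanly under the same key.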

…ine to 3.7" This reverts commit cae2cb0. Signed-off-by: Jialiang Liang <jiallian@amazon.com>

…enLocal" This reverts commit d37ee44. Signed-off-by: Jialiang Liang <jiallian@amazon.com>

RyanL1997 force-pushed the mustang-rex-part1 branch 2 times, most recently from 7b7e72c to b1d2240 Compare May 7, 2026 23:56

RyanL1997 force-pushed the mustang-rex-part1 branch from b1d2240 to aa25d15 Compare May 8, 2026 00:19

RyanL1997 changed the title ~~[Analytics Backend / DataFusion] Wire PPL rex sed-mode (Part 1) — bridge-only~~ [Analytics Backend / DataFusion] Onboard PPL rex to DataFusion May 8, 2026

RyanL1997 mentioned this pull request May 8, 2026

Bridge PPL_REX_MAX_MATCH_LIMIT into UnifiedQueryContext on the unified query path opensearch-project/sql#5418

Merged

RyanL1997 marked this pull request as ready for review May 8, 2026 02:39

RyanL1997 requested a review from a team as a code owner May 8, 2026 02:39

RyanL1997 force-pushed the mustang-rex-part1 branch from ad59c57 to cc57ed1 Compare May 8, 2026 07:30

RyanL1997 force-pushed the mustang-rex-part1 branch from cc57ed1 to 5c33102 Compare May 8, 2026 16:46

RyanL1997 force-pushed the mustang-rex-part1 branch from 5c33102 to e772008 Compare May 8, 2026 19:58

RyanL1997 force-pushed the mustang-rex-part1 branch from e772008 to 697bd2e Compare May 8, 2026 23:57

RyanL1997 added 4 commits May 8, 2026 22:28

RyanL1997 force-pushed the mustang-rex-part1 branch from 49c6baf to d37ee44 Compare May 9, 2026 05:36

RyanL1997 mentioned this pull request May 9, 2026

[Sandbox SQL snapshot] Pin org.opensearch.query:* (unified-query-*) snapshots to mavenLocal #21578

Merged

RyanL1997 added 2 commits May 9, 2026 11:45

Revert "Wire mavenLocal into sandbox subprojects + bump analytics-eng…

686a37d

…ine to 3.7" This reverts commit cae2cb0. Signed-off-by: Jialiang Liang <jiallian@amazon.com>

Revert "Pin org.opensearch.query:* (unified-query-*) artifacts to mav…

0dafa26

…enLocal" This reverts commit d37ee44. Signed-off-by: Jialiang Liang <jiallian@amazon.com>

RyanL1997 force-pushed the mustang-rex-part1 branch from 0d812e8 to 0dafa26 Compare May 9, 2026 18:46

RyanL1997 wants to merge 8 commits into opensearch-project:main from RyanL1997:mustang-rex-part1
