
[Analytics Backend / DataFusion] Onboard PPL rex to DataFusion#21550

Open
RyanL1997 wants to merge 8 commits into opensearch-project:main from RyanL1997:mustang-rex-part1

@RyanL1997 RyanL1997 commented May 7, 2026

Description

Onboards the PPL rex command to the analytics-engine path end-to-end.

  • sed mode (rex field=f mode=sed "s/.../.../<flags>", y/from/to/) — bridges to existing Calcite operators (REGEXP_REPLACE_3, REGEXP_REPLACE_PG_4, TRANSLATE3) → DataFusion's native regexp_replace / translate UDFs. The replace adapter handles Java→Rust regex syntax differences (\Q…\E, \$N).

  • extract mode (rex field=f "(?<g>...)", with max_match= / offset_field=) — three Rust UDFs (rex_extract, rex_extract_multi, rex_offset), since the SQL plugin's Java UDFs have no DataFusion equivalent. Mirrors the convert_tz precedent (Add 3 different types of PPL scalar functions to analytics-engine - prove wiring based on DataFusion capabilities #21476).
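The `\Q…\E` difference mentioned for sed mode can be illustrated with a small sketch. This is not the PR's `RegexpReplaceAdapter`; the class and method names are hypothetical, and it assumes the rewrite is done by expanding each quoted span into backslash-escaped literals so the pattern no longer depends on `\Q…\E` support. It does not cover the `\$N` replacement-string handling.

```java
final class QuoteExpansionSketch {
    // Punctuation that is backslash-escaped when it appears inside a quoted span.
    private static final String META = "\\.+*?()|[]{}^$#&-~";

    // Escapes every metacharacter so the text matches literally.
    static String escapeLiteral(String literal) {
        StringBuilder out = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Rewrites Java-style \Q...\E spans into escaped literals.
    // An unterminated \Q quotes to the end of the pattern, as in java.util.regex.
    static String expandQuotes(String javaPattern) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < javaPattern.length()) {
            char c = javaPattern.charAt(i);
            if (c != '\\' || i + 1 == javaPattern.length()) {
                out.append(c);
                i++;
            } else if (javaPattern.charAt(i + 1) == 'Q') {
                int end = javaPattern.indexOf("\\E", i + 2);
                int literalEnd = end < 0 ? javaPattern.length() : end;
                out.append(escapeLiteral(javaPattern.substring(i + 2, literalEnd)));
                i = end < 0 ? javaPattern.length() : end + 2;
            } else {
                // Any other escape pair (including an escaped backslash) is copied as-is.
                out.append(c).append(javaPattern.charAt(i + 1));
                i += 2;
            }
        }
        return out.toString();
    }
}
```

For example, `expandQuotes("a\\Q.*\\Eb")` yields the pattern `a\.\*b`, while a pattern with no quoted span is returned unchanged.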

Three small framework additions land alongside:

  • FieldType.ARRAY + a switch arm so the planner's capability lookup matches rex_extract_multi's array<varchar> return type.
  • A ListVector.getObject() bypass at three call sites — Arrow's JsonStringArrayList.<clinit> references jackson-datatype-jsr310's JavaTimeModule, which isn't on the arrow-flight-rpc parent plugin's classloader.
  • array_length onboarded as a scalar function (Calcite SqlLibraryOperators.ARRAY_LENGTH → DataFusion's native array_length). Required end-to-end so PPL queries can size the list returned by extract-mode (eval count = array_length(g)).

Companion PR

opensearch-project/sql#5418 covers two related SQL-side changes required for full end-to-end behavior on /_analytics/ppl:

  1. Defaults PPL_REX_MAX_MATCH_LIMIT=10 in UnifiedQueryContext — required for any rex query to reach the planner (without it, AstBuilder.visitRexCommand NPEs unboxing a null setting value).
  2. Bridges the live cluster Settings instance into UnifiedQueryContext for PPL_REX_MAX_MATCH_LIMIT, so mid-run _cluster/settings updates reach the unified path. Scoped to this one key for now (the other keys in UnifiedQueryContext keep their static defaults).
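The NPE in item 1 is the usual null-unboxing failure. A minimal sketch, assuming a plain map stands in for the settings lookup (the class, the map, and both helper methods are hypothetical; only the key name and the default of 10 come from this PR):

```java
import java.util.HashMap;
import java.util.Map;

final class MaxMatchDefaultSketch {
    // Key name taken from the PR description; the map-based lookup is a stand-in.
    static final String KEY = "PPL_REX_MAX_MATCH_LIMIT";

    // Mirrors the failure mode: the lookup returns null and auto-unboxing throws.
    static int readLimit(Map<String, Object> settings) {
        Integer boxed = (Integer) settings.get(KEY); // null when the key is absent
        return boxed; // NullPointerException here if boxed is null
    }

    // Mirrors the fix: seed a default so the lookup can never return null.
    static Map<String, Object> withDefault(Map<String, Object> settings) {
        Map<String, Object> copy = new HashMap<>(settings);
        copy.putIfAbsent(KEY, 10);
        return copy;
    }
}
```

Calling `readLimit` on an empty map throws, while `readLimit(withDefault(emptyMap))` returns 10.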

Tests

  • Rust UDFs — 15/15 unit tests
  • RegexpReplaceAdapterTests — 21/21
  • RexExtractAdapterTests — 4/4
  • RexCommandIT (sandbox QA) — 16/16 (9 sed + 7 extract)
  • ./gradlew check -p sandbox -Dsandbox.enabled=true — green

SQL plugin's CalciteRexCommandIT via the analytics-engine route

Run against a cluster with the bundle-side test infrastructure (PPL coverage bundle) + locally-published SQL plugin including #5418.

| Tests executed | Passed | Failed | Pass rate |
| --- | --- | --- | --- |
| 18 | 18 | 0 | 100.0% |

Coverage: all extract-mode cases (single group, multiple groups, nested groups, complex patterns), all error-path cases (invalid group names, no named groups), rex chained with where / stats / head / filtering, and every max_match variant, including testRexMaxMatchConfigurableLimit (which updates the cluster setting mid-run and asserts that it takes effect).

github-actions Bot commented May 7, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 0dafa26.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml | 75 | high | New dependency added: `regex = "1.10"`. Per mandatory supply chain policy, all dependency additions must be flagged regardless of apparent legitimacy. Maintainers should verify the crate name, version, and source resolve to the expected artifact (crates.io `regex` crate by Andrew Gallant) and that the pinned version has no known CVEs. |

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 1 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 7, 2026
… Rust UDFs + array result type

Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms
were already covered by bridges to existing Calcite/DataFusion operators. The
extract-mode form has no native DataFusion equivalent and needs three custom
Rust UDFs, three Java SqlOperator adapters, and a small handful of analytics-
framework / engine plumbing changes to model array result types end-to-end.

## What's new

### Rust — three UDFs in opensearch-native-lib

  * `rex_extract(input, pattern_lit, group_lit) -> varchar` — single named or
    numbered group capture. Compiles the regex once at plan time, runs per row.
  * `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>`
    — multi-match. `max_match=0` means unbounded; otherwise caps the result at
    the requested element count. Returns NULL (not an empty list) when there
    are no matches, matching the SQL plugin's Java implementation.
  * `rex_offset(input, pattern_lit) -> varchar` — emits the named-group offsets
    formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is
    inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.

Each UDF has 5 unit tests covering the contract above.
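The `rex_extract_multi` and `rex_offset` contracts above can be modelled with `java.util.regex`. This is a sketch, not the Rust UDFs and not the SQL plugin's Java code: the class and method names are hypothetical, offsets are Java `char` indices, and returning `null` from `offsets` when nothing matches is an assumption the description does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RexContractSketch {
    // Finds "(?<name>" openers so group names can be listed on any Java version.
    // Limitation: an escaped literal such as "\(\?<x>" would be misread as a group.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Multi-match: maxMatch == 0 means unbounded; null (not an empty list) on no match.
    static List<String> extractMulti(String input, Pattern pattern, String group, int maxMatch) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while ((maxMatch == 0 || hits.size() < maxMatch) && m.find()) {
            String value = m.group(group);
            if (value != null) {
                hits.add(value);
            }
        }
        return hits.isEmpty() ? null : hits;
    }

    // Offsets of the first match as "name1=s1-e1&name2=s2-e2", names sorted, end inclusive.
    static String offsets(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null; // assumption: no match yields NULL
        }
        TreeSet<String> names = new TreeSet<>();
        Matcher n = GROUP_NAME.matcher(regex);
        while (n.find()) {
            names.add(n.group(1));
        }
        List<String> parts = new ArrayList<>();
        for (String name : names) {
            if (m.start(name) >= 0) { // skip groups that did not participate
                parts.add(name + "=" + m.start(name) + "-" + (m.end(name) - 1));
            }
        }
        return String.join("&", parts);
    }
}
```

With the pattern `(?<user>\w+)@(?<host>\w+)` and the input `bob@example`, `offsets` produces `host=4-10&user=0-2`: `host` sorts before `user`, and each end index is the last matched character rather than one past it.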

### Java — three SqlOperator adapters

  * `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter` — keyed on
    the SQL plugin's PPL builtin operator names (`REX_EXTRACT`,
    `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework
    ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local
    target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that
    `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the
    corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait
    extension declared in `opensearch_scalar_functions.yaml`.

  * Pattern operands (and the group operand for the extract variants) are
    validated as RexLiterals at plan time. Column-valued patterns would force
    per-row regex compilation on the Rust side and are rejected with an
    IllegalArgumentException — same contract as the precedent set by
    RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.
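The literal-only rule can be sketched without Calcite on the classpath. The types below are hypothetical stand-ins for `RexLiteral` and a column reference, and the error message wording is invented for illustration; only the use of `IllegalArgumentException` comes from the text above.

```java
import java.util.List;

final class LiteralOperandSketch {
    // Hypothetical stand-ins for planner operand nodes.
    interface Operand {}

    static final class Literal implements Operand {
        final String value;
        Literal(String value) { this.value = value; }
    }

    static final class ColumnRef implements Operand {
        final String name;
        ColumnRef(String name) { this.name = name; }
    }

    // Returns the literal's value, or rejects the call at plan time so the
    // native side never has to compile a regex per row.
    static String requireLiteral(List<Operand> operands, int index, String role) {
        Operand operand = operands.get(index);
        if (!(operand instanceof Literal)) {
            throw new IllegalArgumentException(role + " must be a literal known at plan time");
        }
        return ((Literal) operand).value;
    }
}
```

Rejecting at plan time keeps the failure close to the query text, instead of surfacing as a per-row cost or error during execution.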

### Framework — array return-type support

  * `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY`
    in analytics-framework. Without this, `OpenSearchProjectRule.resolveScalar
    ViableBackends` returns `null` for any scalar with an array return type
    and the planner emits "No backend supports scalar function [REX_EXTRACT_
    MULTI] among [datafusion]". `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar
    declaration is now `Set.of(FieldType.ARRAY)` rather than the broad scalar
    set used by every other op (UPPER, ABS, ...) — those genuinely don't return
    arrays.

  * `ListVector` handling in three call sites that previously triggered Arrow's
    `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from
    `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's
    classloader). Bypassing `getObject()` and reading offset buffer + inner
    data vector directly:
      - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
      - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
      - `RowResponseCodec` (`inferArrowField` + `setVectorValue`) — the
        Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit
        list<utf8> Field with proper child Field, plus a ListVector setter
        using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.
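Reading a list column through its offset buffer and inner data vector, rather than through `getObject()`, amounts to the following. This sketch models the Arrow list layout with plain arrays instead of Arrow classes, so the class, method, and parameter names are all hypothetical stand-ins for the offset buffer, validity bitmap, and child vector.

```java
import java.util.ArrayList;
import java.util.List;

final class ListLayoutSketch {
    // offsets has rowCount + 1 entries; row i spans values[offsets[i] .. offsets[i + 1]).
    // valid[i] == false marks a NULL list, which is distinct from an empty list.
    static List<String> readList(int row, int[] offsets, boolean[] valid, String[] values) {
        if (!valid[row]) {
            return null;
        }
        int start = offsets[row];
        int end = offsets[row + 1];
        List<String> out = new ArrayList<>(end - start);
        for (int i = start; i < end; i++) {
            out.add(values[i]);
        }
        return out;
    }
}
```

Given offsets `{0, 2, 2, 3}`, validity `{true, false, true}`, and values `{"a", "b", "c"}`, row 0 reads as `[a, b]`, row 1 as NULL, and row 2 as `[c]`. Building a plain `ArrayList` this way never touches `JsonStringArrayList`, which is the point of the bypass.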

### IT coverage

  * `RexCommandIT` extended from 9 sed tests to 16 — adds 7 extract-mode cases:
    single named group, multiple named groups in one row, missing-group
    NULL handling, multi-match capturing all, `max_match` cap, offset_field
    output, and no-match passthrough as NULL.

## Required runtime dependency

  * `commons-text:1.11.0` is now a `runtimeOnly` dep of analytics-engine.
    Calcite's `SqlFunctions.<clinit>` references LevenshteinDistance from
    commons-text; without bundling the jar the cluster crashes the first time
    a query reaches Calcite. The matching SHA1 + LICENSE + NOTICE artifacts are
    added under `licenses/` per the repo dependency-license check.

## Test results

  * Rust UDFs — 15/15 unit tests (5 per UDF).
  * `RexExtractAdapterTests` — 4/4.
  * `RexCommandIT` — 16/16 (9 sed from Part 1 + 7 new extract).
  * `./gradlew check -p sandbox -Dsandbox.enabled=true` — green (678 tasks,
    all sandbox module unit tests + spotless + license + forbidden API).

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 7, 2026
… Rust UDFs + array result type

@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch 2 times, most recently from 7b7e72c to b1d2240 Compare May 7, 2026 23:56
RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 7, 2026
… Rust UDFs + array result type

RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 8, 2026
… Rust UDFs + array result type

@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from b1d2240 to aa25d15 Compare May 8, 2026 00:19
@RyanL1997 RyanL1997 changed the title from "[Analytics Backend / DataFusion] Wire PPL rex sed-mode (Part 1) — bridge-only" to "[Analytics Backend / DataFusion] Onboard PPL rex to DataFusion" May 8, 2026
RyanL1997 added a commit to RyanL1997/sql that referenced this pull request May 8, 2026
…_MATCH_LIMIT

The previous commit defaulted `PPL_REX_MAX_MATCH_LIMIT=10` in
`UnifiedQueryContext.Builder.settings` to fix the NPE in
`AstBuilder.visitRexCommand` on the unified path. The default is correct, but
it doesn't respect mid-run cluster overrides — every key in the static map
returns its hardcoded value regardless of `_cluster/settings` updates. This
breaks `CalciteRexCommandIT.testRexMaxMatchConfigurableLimit`, which
explicitly sets the cluster-side limit to 5 and asserts that `max_match=0`
caps at 5; on the unified path it stayed at 10.

This change introduces a `Builder.liveSettings(Settings)` hook that the REST
handler can use to inject the cluster's live `OpenSearchSettings` instance.
At `build()` time the Builder snapshots the live value of
`PPL_REX_MAX_MATCH_LIMIT` (only — see scoping note below) into the static
map, overriding the hardcoded default when the operator has set a cluster
value. Snapshot-at-build matches the per-HTTP-request lifecycle of
`UnifiedQueryContext` and avoids per-call lookup overhead.
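The snapshot-at-build behavior can be sketched as follows. This is not the real `UnifiedQueryContext`: the class is hypothetical, and a `Function<String, Object>` stands in for the live `Settings` instance. Only the key name, the default of 10, and the `liveSettings` hook name come from the commit message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class SnapshotContextSketch {
    static final String REX_LIMIT = "PPL_REX_MAX_MATCH_LIMIT";

    private final Map<String, Object> settings;

    private SnapshotContextSketch(Map<String, Object> settings) {
        this.settings = settings;
    }

    Object get(String key) {
        return settings.get(key);
    }

    static final class Builder {
        private final Map<String, Object> defaults = new HashMap<>();
        private Function<String, Object> live; // stand-in for the live Settings instance

        Builder() {
            defaults.put(REX_LIMIT, 10); // hardcoded default
        }

        Builder liveSettings(Function<String, Object> live) {
            this.live = live;
            return this;
        }

        SnapshotContextSketch build() {
            Map<String, Object> snapshot = new HashMap<>(defaults);
            if (live != null) {
                // Read the live value once; only this key is bridged.
                Object value = live.apply(REX_LIMIT);
                if (value != null) {
                    snapshot.put(REX_LIMIT, value);
                }
            }
            return new SnapshotContextSketch(snapshot);
        }
    }
}
```

A context built while the live value is 5 keeps returning 5 even if the live source later changes; the next `build()` picks up the new value. That is the trade-off named above: one lookup per request, at the cost of not seeing updates within a request.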

## Why scoped to PPL_REX_MAX_MATCH_LIMIT only

The same architectural gap exists for every key in the static map
(`QUERY_SIZE_LIMIT`, `PPL_SUBSEARCH_MAXOUT`, `PPL_JOIN_SUBSEARCH_MAXOUT`,
`CALCITE_ENGINE_ENABLED`). For three of those, the static defaults are fine
in practice (no test overrides them mid-run; `head N` covers `QUERY_SIZE_LIMIT`
per-query). `CALCITE_ENGINE_ENABLED` is intentionally pinned to `true` for
the unified path — a cluster override toggling it off would defeat the point
of routing here. So this PR widens only the one key that demonstrably needs
it; widening the snapshot to the rest is a future scope decision tied to
whichever new IT first depends on it.

## Wire-up

`RestUnifiedQueryAction` gains a `pluginSettings` field (the same
`OpenSearchSettings` instance bound in the Guice module) and forwards it to
the Builder in both `buildContext` (per-request execution path) and
`buildParsingContext` (analytics-routing index name probe). Both
construction sites — `SQLPlugin.createSqlAnalyticsRouter` and
`TransportPPLQueryAction.<init>` — are updated to pass the existing
plugin-side `Settings` instance.

`buildParsingContext` had been `static` because it didn't need any instance
state; it's now an instance method since it reads `pluginSettings`.

## Test results

CalciteRexCommandIT through the analytics-engine route (every PPL query
forced through `/_analytics/ppl` via `tests.analytics.force_routing=true`):

* Before this change: 17/18 — `testRexMaxMatchConfigurableLimit` fails with
  `expected:<5> but was:<10>` (cluster override doesn't reach the unified
  path).
* After this change: 18/18 — all `testRexMaxMatch*` variants honor the
  cluster setting.

## Companion PR

opensearch-project/OpenSearch#21550 — onboards PPL `rex` to DataFusion via
the analytics-engine path. The 17/18 baseline reported in that PR's
description was measured against the previous commit on this branch; with
this change the route hits 18/18.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
RyanL1997 commented May 8, 2026

CI status check — failures are expected and explained below:

sandbox-check: all 16 RexCommandIT tests NPE on Settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT).intValue(). This is the same NPE that companion SQL PR opensearch-project/sql#5418 fixes. CI resolves opensearch-sql-plugin:3.7.0.0-SNAPSHOT from the published snapshots URL, which doesn't yet include opensearch-project/sql#5418 (still open). Locally, running publishToMavenLocal from a worktree carrying opensearch-project/sql#5418 makes every test pass (16/16 on the sandbox QA RexCommandIT, 18/18 on the SQL plugin's CalciteRexCommandIT via the analytics-engine route). The dependency is called out in the PR description's Companion PR section. Once that PR lands and the SQL snapshot refreshes, this check should go green.

@RyanL1997 RyanL1997 marked this pull request as ready for review May 8, 2026 02:39
@RyanL1997 RyanL1997 requested a review from a team as a code owner May 8, 2026 02:39
RyanL1997 added a commit to RyanL1997/sql that referenced this pull request May 8, 2026
…_MATCH_LIMIT

RyanL1997 added a commit to RyanL1997/OpenSearch that referenced this pull request May 8, 2026
… Rust UDFs + array result type

Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms
were already covered by bridges to existing Calcite/DataFusion operators. The
extract-mode form has no native DataFusion equivalent and needs three custom
Rust UDFs, three Java SqlOperator adapters, and a small handful of analytics-
framework / engine plumbing changes to model array result types end-to-end.

  * `rex_extract(input, pattern_lit, group_lit) -> varchar` — single named or
    numbered group capture. Compiles the regex once at plan time, runs per row.
  * `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>`
    — multi-match. `max_match=0` means unbounded; otherwise caps the result at
    the requested element count. Returns NULL (not an empty list) when there
    are no matches, matching the SQL plugin's Java implementation.
  * `rex_offset(input, pattern_lit) -> varchar` — emits the named-group offsets
    formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is
    inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.

Each UDF has 5 unit tests covering the contract above.

  * `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter` — keyed on
    the SQL plugin's PPL builtin operator names (`REX_EXTRACT`,
    `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework
    ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local
    target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that
    `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the
    corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait
    extension declared in `opensearch_scalar_functions.yaml`.

  * Pattern operands (and the group operand for the extract variants) are
    validated as RexLiterals at plan time. Column-valued patterns would force
    per-row regex compilation on the Rust side and are rejected with an
    IllegalArgumentException — same contract as the precedent set by
    RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.

  * `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY`
    in analytics-framework. Without this,
    `OpenSearchProjectRule.resolveScalarViableBackends` returns `null` for any
    scalar with an array return type and the planner emits "No backend supports
    scalar function [REX_EXTRACT_MULTI] among [datafusion]".
    `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar declaration is now
    `Set.of(FieldType.ARRAY)` rather than the broad scalar set used by every
    other op (UPPER, ABS, ...); those genuinely don't return arrays.

  * `ListVector` handling in three call sites that previously triggered Arrow's
    `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from
    `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's
    classloader). Bypassing `getObject()` and reading offset buffer + inner
    data vector directly:
      - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
      - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
      - `RowResponseCodec` (`inferArrowField` + `setVectorValue`) — the
        Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit
        list<utf8> Field with proper child Field, plus a ListVector setter
        using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.

  * `RexCommandIT` extended from 9 sed tests to 16 — adds 7 extract-mode cases:
    single named group, multiple named groups in one row, missing-group
    NULL handling, multi-match capturing all, `max_match` cap, offset_field
    output, and no-match passthrough as NULL.

  * Rust UDFs — 15/15 unit tests (5 per UDF).
  * `RexExtractAdapterTests` — 4/4.
  * `RexCommandIT` — 16/16 (9 sed from Part 1 + 7 new extract).
  * `./gradlew check -p sandbox -Dsandbox.enabled=true` — green (678 tasks,
    all sandbox module unit tests + spotless + license + forbidden API).
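
The `ListVector` bypass described above relies on the Arrow list layout, where row `i` spans child entries `offsets[i]` up to (but excluding) `offsets[i + 1]`. A minimal sketch of that read path follows. It deliberately uses plain Java arrays as stand-ins for the offset buffer, validity bitmap, and inner `VarCharVector`, so no Arrow API is called; `ListOffsetsSketch` and `readRow` are hypothetical names, not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class ListOffsetsSketch {
    /**
     * Reads one list-typed row without going through a getObject()-style helper.
     * offsets has rowCount + 1 entries; valid marks non-null rows;
     * child holds the flattened string elements of all rows.
     */
    public static List<String> readRow(int[] offsets, boolean[] valid, String[] child, int row) {
        if (!valid[row]) {
            return null; // a null list, as opposed to an empty one
        }
        List<String> out = new ArrayList<>();
        for (int i = offsets[row]; i < offsets[row + 1]; i++) {
            out.add(child[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 2, 2, 3};
        boolean[] valid = {true, false, true};
        String[] child = {"a", "b", "c"};
        System.out.println(readRow(offsets, valid, child, 0)); // [a, b]
        System.out.println(readRow(offsets, valid, child, 1)); // null
        System.out.println(readRow(offsets, valid, child, 2)); // [c]
    }
}
```

Returning a plain `ArrayList` is the point of the design: it avoids constructing the JSON-aware list type whose static initializer needs the missing Jackson module.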

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from ad59c57 to cc57ed1 Compare May 8, 2026 07:30
@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from cc57ed1 to 5c33102 Compare May 8, 2026 16:46
@RyanL1997
Contributor Author

RyanL1997 commented May 8, 2026

CI is correctly pulling the canonical published snapshot. The published snapshot is from main, not from feature/mustang-ppl-integration. opensearch-project/sql#5418 is in the feature branch but never made it into a published artifact. So this PR's CI will stay red until either the feature branch gets merged to main (auto-publish) or someone with write access manually triggers maven-publish-modules.yml against feature/mustang-ppl-integration.

@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from 5c33102 to e772008 Compare May 8, 2026 19:58
@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from e772008 to 697bd2e Compare May 8, 2026 23:57
RyanL1997 added 4 commits May 8, 2026 22:28
…dge-only

Onboards the PPL `rex` command's `mode=sed` surface — the part that lowers to
standard Calcite library operators and bridges through Substrait to DataFusion's
native UDFs. Three sed sub-variants covered:

  * `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3
    (already mapped via the PPL `replace` onboarding from opensearch-project#21527 — no-op here).

  * `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4
    (4-arg with flags string). New bridge in this PR. DataFusion's regexp_replace
    natively accepts 4-arg `(str, pat, repl, flags)` per its substrait UDF binding.

  * `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3.
    New bridge in this PR. Resolves to DataFusion's `translate` UDF
    (datafusion-functions/src/unicode/translate.rs).

The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the
3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N`
backreferences in the replacement (Rust's identifier-greedy parser
mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form in
opensearch-project#21527, applies to both forms: the pattern sits at position 1
and the replacement at position 2 in both signatures, so the rewrite logic doesn't
change. Operands beyond position 2 (the flags string in the 4-arg form) pass
through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.
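
The two rewrites named above can be sketched in plain Java. This is an illustration of the idea, not the `RegexpReplaceAdapter` source: the class and method names are hypothetical, the set of escaped metacharacters is an assumption, and edge cases such as an escaped backslash directly before `Q` or an already-escaped `\$` are not handled.

```java
import java.util.regex.Pattern;

public class RegexRewriteSketch {
    // Characters escaped when a quoted block is expanded (assumed set).
    private static final String META = "\\.[]{}()*+-?^$|#&~";
    private static final Pattern BARE_REF = Pattern.compile("\\$(\\d+)");

    /** Replaces each \Q...\E block with its contents, escaping metacharacters one by one. */
    public static String expandQuotedLiterals(String pattern) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            int q = pattern.indexOf("\\Q", i);
            if (q < 0) {
                out.append(pattern, i, pattern.length());
                break;
            }
            out.append(pattern, i, q);
            int e = pattern.indexOf("\\E", q + 2);
            // As in Java, an unterminated \Q quotes through the end of the pattern.
            int end = e < 0 ? pattern.length() : e;
            for (int k = q + 2; k < end; k++) {
                char c = pattern.charAt(k);
                if (META.indexOf(c) >= 0) {
                    out.append('\\');
                }
                out.append(c);
            }
            i = e < 0 ? pattern.length() : e + 2;
        }
        return out.toString();
    }

    /** Rewrites bare $N references as ${N} so trailing identifier characters are not absorbed. */
    public static String braceBackrefs(String replacement) {
        return BARE_REF.matcher(replacement).replaceAll("\\${$1}");
    }

    public static void main(String[] args) {
        System.out.println(expandQuotedLiterals("a\\Q.+\\Eb")); // a\.\+b
        System.out.println(braceBackrefs("$1_suffix"));         // ${1}_suffix
    }
}
```

The braces matter for a replacement like `$1_suffix`: a parser that reads the longest identifier after `$` would look for a group named `1_suffix` instead of group 1.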

`TRANSLATE3` doesn't need an adapter — its arguments are character classes, not
regex syntax.

  * Rex extract mode (`rex field=f "(?<g>...)"`) — uses the SQL plugin's custom
    Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no
    native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side
    UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).

  * Sed with occurrence flag (`s/.../.../<N>`) — emits 5-arg
    `REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not
    support (max 4 args). Also Part 2.

  * `RegexpReplaceAdapterTests` — 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
  * `RexCommandIT` (new self-contained QA IT, calcs dataset) — 9/9. Covers all sed
    sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi`
    combined, backreferences via `$N`, transliteration `y/from/to/`, and
    no-match passthrough.
  * `./gradlew check -p sandbox -Dsandbox.enabled=true` — green.

The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed
in opensearch-project/sql#5418 — required for any rex query (sed or extract) to
reach the planner via /_analytics/ppl. This PR's test results assume opensearch-project/sql#5418 is
applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix:
9/9 RexCommandIT pass.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
… Rust UDFs + array result type

Completes the PPL `rex` onboarding started in Part 1 (opensearch-project#21550). The sed-mode forms
were already covered by bridges to existing Calcite/DataFusion operators. The
extract-mode form has no native DataFusion equivalent and needs three custom
Rust UDFs, three Java SqlOperator adapters, and a small handful of analytics-
framework / engine plumbing changes to model array result types end-to-end.

  * `rex_extract(input, pattern_lit, group_lit) -> varchar` — single named or
    numbered group capture. Compiles the regex once at plan time, runs per row.
  * `rex_extract_multi(input, pattern_lit, group_lit, max_match) -> list<varchar>`
    — multi-match. `max_match=0` means unbounded; otherwise caps the result at
    the requested element count. Returns NULL (not an empty list) when there
    are no matches, matching the SQL plugin's Java implementation.
  * `rex_offset(input, pattern_lit) -> varchar` — emits the named-group offsets
    formatted as `"name1=s1-e1&name2=s2-e2"`, alphabetically sorted; end is
    inclusive, matching the SQL plugin's `RexOffsetFunction.end - 1` convention.

Each UDF has 5 unit tests covering the contract above.

  * `RexExtractAdapter`, `RexExtractMultiAdapter`, `RexOffsetAdapter` — keyed on
    the SQL plugin's PPL builtin operator names (`REX_EXTRACT`,
    `REX_EXTRACT_MULTI`, `REX_OFFSET`) via the analytics-framework
    ScalarFunction enum. Each adapter rewrites the incoming RexCall to a local
    target SqlOperator (`LOCAL_REX_EXTRACT_OP`, etc.) that
    `DataFusionFragmentConvertor`'s `ADDITIONAL_SCALAR_SIGS` maps to the
    corresponding `rex_extract` / `rex_extract_multi` / `rex_offset` Substrait
    extension declared in `opensearch_scalar_functions.yaml`.

  * Pattern operands (and the group operand for the extract variants) are
    validated as RexLiterals at plan time. Column-valued patterns would force
    per-row regex compilation on the Rust side and are rejected with an
    IllegalArgumentException — same contract as the precedent set by
    RegexpReplaceAdapter in Part 1. `RexExtractAdapterTests` covers this.

  * `FieldType.ARRAY` enum value + `fromSqlTypeName(ARRAY) -> FieldType.ARRAY`
    in analytics-framework. Without this,
    `OpenSearchProjectRule.resolveScalarViableBackends` returns `null` for any
    scalar with an array return type and the planner emits "No backend supports
    scalar function [REX_EXTRACT_MULTI] among [datafusion]".
    `REX_EXTRACT_MULTI`'s ProjectCapability.Scalar declaration is now
    `Set.of(FieldType.ARRAY)` rather than the broad scalar set used by every
    other op (UPPER, ABS, ...); those genuinely don't return arrays.

  * `ListVector` handling in three call sites that previously triggered Arrow's
    `JsonStringArrayList.<clinit>`, which references `JavaTimeModule` from
    `jackson-datatype-jsr310` (not on the `arrow-flight-rpc` parent plugin's
    classloader). Bypassing `getObject()` and reading offset buffer + inner
    data vector directly:
      - `DatafusionResultStream.getFieldValue` (shard-side row materialization)
      - `ArrowValues.toJavaValue` (coordinator post-execution row reading)
      - `RowResponseCodec` (`inferArrowField` + `setVectorValue`) — the
        Object[]-row → Arrow VectorSchemaRoot wire codec needed an explicit
        list<utf8> Field with proper child Field, plus a ListVector setter
        using `startNewValue`/`endValue` + the inner VarCharVector's `setSafe`.

  * `RexCommandIT` extended from 9 sed tests to 16 — adds 7 extract-mode cases:
    single named group, multiple named groups in one row, missing-group
    NULL handling, multi-match capturing all, `max_match` cap, offset_field
    output, and no-match passthrough as NULL.

  * Rust UDFs — 15/15 unit tests (5 per UDF).
  * `RexExtractAdapterTests` — 4/4.
  * `RexCommandIT` — 16/16 (9 sed from Part 1 + 7 new extract).
  * `./gradlew check -p sandbox -Dsandbox.enabled=true` — green (678 tasks,
    all sandbox module unit tests + spotless + license + forbidden API).

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
… (Part 3)

Wires Calcite's `SqlLibraryOperators.ARRAY_LENGTH` to DataFusion's native
`array_length`, completing the end-to-end story for PPL `rex` extract-mode
multi-match: queries can now size the list returned by `rex_extract_multi`
(`eval count = array_length(g)`).

  * `ScalarFunction.ARRAY_LENGTH` enum value (resolves via the `valueOf()`
    fallback on the Calcite operator name).
  * Registered in `STANDARD_PROJECT_OPS`. Returns `bigint`, so the existing
    `SUPPORTED_FIELD_TYPES` (numeric ∪ keyword ∪ date ∪ {BOOLEAN, TEXT})
    covers the capability lookup — no special-case needed.
  * `FunctionMappings.s(SqlLibraryOperators.ARRAY_LENGTH, "array_length")` in
    `DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS`. Library operators
    don't auto-resolve through the substrait default catalog — the same
    explicit pinning pattern used for `ILIKE`, `DATE_PART`, and the
    `REGEXP_REPLACE_*` family.
  * `array_length` extension declaration in `opensearch_scalar_functions.yaml`
    with `list<varchar<L1>>` → `i64` and `list<string>` → `i64` impls. Without
    a custom YAML extension that matches the actual list type, isthmus emits
    "Unable to convert call ARRAY_LENGTH(list<varchar<...>>)" for the
    `rex_extract_multi` output.

Lifts CalciteRexCommandIT (SQL plugin's standard rex IT class) through the
analytics-engine route from 14/18 → 17/18. The remaining failure
(testRexMaxMatchConfigurableLimit) is a unified-query architectural gap —
`UnifiedQueryContext` ignores cluster-setting overrides and uses the static
default — unrelated to rex or array_length.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
The remote OpenSearch Snapshots maven repo (ci.opensearch.org/ci/dbc/snapshots)
only republishes from sql/main, not from sql/feature/mustang-ppl-integration,
so its 3.7.0.0-SNAPSHOT jars trail the feature branch by however many merges
(currently missing PPL_REX_MAX_MATCH_LIMIT, CALCITE_ENGINE_ENABLED, …). The
sandbox-check workflow's pre-step (opensearch-project#21569) publishes feature-branch unified-query
jars to mavenLocal, but Gradle's default SNAPSHOT resolution weighs the remote's
explicit <buildNumber>/<timestamp> metadata higher than mavenLocal's
<localCopy>true</localCopy>, so the stale remote wins even when mavenLocal has a newer
<lastUpdated>.

Confirmed via dependencyInsight: every consumer was binding
unified-query-api:3.7.0.0-SNAPSHOT:20260507.224009-12 (60kB, 42 classes, no
PPL_REX_MAX_MATCH_LIMIT field reference) instead of the locally-published
3.7.0.0-SNAPSHOT (29kB, 21 classes, has the field). The runtime cluster
inherited that stale class via the test-ppl-frontend plugin bundle, which
is why every IT touching `rex` failed plan-time with `NullPointerException:
Cannot invoke "java.lang.Integer.intValue()" because the return value of
"Settings.getSettingValue(PPL_REX_MAX_MATCH_LIMIT)" is null` once the
unified path tried to read the setting.

Fix: tell the OpenSearch Snapshots remote to refuse `org.opensearch.query`
artifacts via mavenContent { excludeGroup }. Three sites declare the remote:

  * sandbox/build.gradle subprojects { repositories } — applies to every
    sandbox subproject including qa.
  * sandbox/plugins/analytics-backend-datafusion/build.gradle — own
    declaration; left in place for module isolation, filtered identically.
  * sandbox/plugins/test-ppl-frontend/build.gradle — also pin mavenLocal as
    the only source for org.opensearch.query so the bundlePlugin task
    bundles the freshly-published feature-branch jar rather than the stale
    timestamped one Gradle would otherwise pick.

Verified locally: bundled unified-query-api drops 60kB → 29kB, the
UnifiedQueryContext$Builder constant pool now references PPL_REX_MAX_MATCH_LIMIT,
and RexCommandIT goes 0/16 → 16/16 against the same locally-published jars
the CI workflow already produces.

Drop this filter once the SQL feature branch merges to sql/main and the
remote OpenSearch Snapshots repo catches up — at that point every
3.7.0.0-SNAPSHOT publish will carry the rex max-match default and the
mavenLocal preference becomes redundant.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from 49c6baf to d37ee44 Compare May 9, 2026 05:36
CI fallout from the prior commit's `excludeGroup 'org.opensearch.query'`
filter on the OpenSearch Snapshots remote: the parent subprojects block
no longer carried mavenLocal, so analytics-engine's testImplementation /
internalClusterTest configurations had no repository at all serving
org.opensearch.query, failing with `Could not find
org.opensearch.query:unified-query-api:3.6.0.0-SNAPSHOT` (and -core / -ppl).

Two pieces:

1. sandbox/build.gradle subprojects { repositories } — also declare
   mavenLocal scoped to the org.opensearch.query group via mavenContent
   { includeGroup }. mavenLocal becomes the authoritative source for
   unified-query SNAPSHOTs (populated by the sandbox-check workflow's
   publishUnifiedQueryPublicationToMavenLocal pre-step) without leaking
   into resolution for any other group.

2. sandbox/plugins/analytics-engine/build.gradle — bump
   sqlUnifiedQueryVersion from 3.6.0.0-SNAPSHOT → 3.7.0.0-SNAPSHOT.
   The 3.6 jars don't exist in mavenLocal (only the 3.7 feature-branch
   build does), so the older pin was the proximate cause of the CI
   resolution failure. Aligning with test-ppl-frontend's already-3.7
   declaration also keeps the unified-query consumers consistent.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
CI surfaced this on the post-rebase rex run:

  Duplicate key FunctionAnchor{urn=extension:org.opensearch:scalar_functions,
    key=array_length:list} (attempted merging values
    array_length:list and array_length:list)

The Part 3 commit declared two impls — `list<varchar<L1>>` and `list<string>`
— with the intent of covering both element-type families produced by
`rex_extract_multi`'s pair of impls. But substrait's compound function key
drops the inner parametric element type at the key level, so both impls
collapse to the same key `array_length:list`. The YAML loader rejects the
collision when the analytics-backend-datafusion plugin's
`SimpleExtension.ExtensionCollection` merges the file in.

Replace the two impls with a single `list<any1>` polymorphic impl. The
`any1` type variable matches any element type at planning, so a call site
that produces `list<varchar<L1>>` (rex_extract_multi varchar overload) and
a call site that produces `list<string>` (rex_extract_multi string
overload) both bind to the one impl. Net effect on planning is equivalent
and the duplicate-key collision goes away.
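
A small Java sketch of why the two impls collide. The `key` function here is a hypothetical simplification that keeps only each argument's outer type name; it is not substrait's actual key derivation, only a model of the "inner element type is dropped" behavior described above. On recent JDKs, `Collectors.toMap` without a merge function reports collisions with the same "Duplicate key ... (attempted merging values ... and ...)" wording seen in the CI error.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AnchorKeySketch {
    /** Hypothetical key: function name plus each argument's outer type name only. */
    public static String key(String name, List<String> argTypes) {
        String args = argTypes.stream()
            .map(t -> t.contains("<") ? t.substring(0, t.indexOf('<')) : t)
            .collect(Collectors.joining("_"));
        return name + ":" + args;
    }

    /** Indexes impls by key; toMap throws IllegalStateException on a duplicate key. */
    public static Map<String, String> index(List<List<String>> impls) {
        return impls.stream()
            .collect(Collectors.toMap(a -> key("array_length", a), a -> String.join(",", a)));
    }

    public static void main(String[] args) {
        // Both impls reduce to the same key once the element type is dropped.
        System.out.println(key("array_length", Arrays.asList("list<varchar<L1>>"))); // array_length:list
        System.out.println(key("array_length", Arrays.asList("list<string>")));      // array_length:list
        try {
            index(Arrays.asList(Arrays.asList("list<varchar<L1>>"), Arrays.asList("list<string>")));
        } catch (IllegalStateException e) {
            System.out.println("collision: " + e.getMessage());
        }
        // A single polymorphic impl registers cleanly.
        System.out.println(index(Arrays.asList(Arrays.asList("list<any1>"))).size()); // 1
    }
}
```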

The duplicate didn't surface on the original rex CI run because the prior
PPL_REX_MAX_MATCH_LIMIT NPE failed every query at plan time before the
function-extension merge was reached. Once the mavenLocal pin fix landed in
the prior commit and queries actually reached the planner, this older
latent collision was unmasked.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
RyanL1997 added 2 commits May 9, 2026 11:45
…ine to 3.7"

This reverts commit cae2cb0.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
…enLocal"

This reverts commit d37ee44.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997 RyanL1997 force-pushed the mustang-rex-part1 branch from 0d812e8 to 0dafa26 Compare May 9, 2026 18:46