
[Analytics Backend / DataFusion] Onboard PPL parse to DataFusion #21573

Open
RyanL1997 wants to merge 1 commit into opensearch-project:main from RyanL1997:mustang-parse

Conversation


@RyanL1997 RyanL1997 commented May 8, 2026

Description

Onboards the PPL parse command to the analytics-engine path end-to-end.

`parse <field> '<regex>'` lowers to one `ITEM(PARSE(input, regex, "regex"), '<group>')` per named group. Two new Rust UDFs back the chain:

  • parse — compiles the user pattern once, anchors it with `^(?:…)$` to honour Java's `Matcher.matches()` semantic that the legacy RegexExpression relies on, and emits a uniform-schema `map<utf8, utf8>` per row. A row that doesn't match (or a group that didn't participate) yields `""` — the same observable behaviour as the legacy path, where extractNamedGroup returns null and ParseFunction wraps it into ExprStringValue("").
  • item — extracts a value from a `map<utf8, utf8>` by key. Lowering target for Calcite's SqlStdOperatorTable.ITEM on map operands.
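The anchoring and empty-string semantics above can be sketched in plain Java (class and method names here are illustrative, not the plugin's API): wrapping the user pattern in `^(?:…)$` makes a `find()`-style engine behave like `Matcher.matches()`, and every named group of a non-matching row collapses to `""`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseSemantics {
    // Wrap the user pattern so that a find()-style engine only accepts
    // full-string matches, mirroring Java's Matcher.matches().
    static String anchor(String userPattern) {
        return "^(?:" + userPattern + ")$";
    }

    // Emit a uniform-schema map of named groups for one row. A row that
    // doesn't match, or a group that didn't participate, yields "".
    static Map<String, String> extract(String input, String userPattern, List<String> groups) {
        Matcher m = Pattern.compile(anchor(userPattern)).matcher(input);
        boolean matched = m.find();  // anchored, so equivalent to matches()
        Map<String, String> out = new LinkedHashMap<>();
        for (String g : groups) {
            String v = matched ? m.group(g) : null;
            out.put(g, v == null ? "" : v);  // null -> "" like the legacy ExprStringValue("")
        }
        return out;
    }
}
```

For example, `extract("alice@example.com", ".+@(?<host>.+)", List.of("host"))` yields `{host=example.com}`, while a row with no `@` yields `{host=}` rather than a null or a missing key, keeping the per-row schema uniform.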

ParseAdapter validates the pattern + method operands as non-null string literals at plan time and gates the method to regex; grok and patterns are rejected up front with a message pointing at the legacy engine. Mirrors the convert_tz precedent (#21476).
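A hypothetical sketch of the method gate described above (the real ParseAdapter inspects Calcite RexNode operands at plan time, not raw strings):

```java
// Illustrative only: mirrors ParseAdapter's gating of the parse method
// to "regex", with grok/patterns rejected up front.
public class ParseValidation {
    static String validateMethod(String methodLiteral) {
        if (methodLiteral == null) {
            throw new IllegalArgumentException("parse method must be a non-null string literal");
        }
        switch (methodLiteral) {
            case "regex":
                return methodLiteral;  // the only method onboarded to the analytics engine
            case "grok":
            case "patterns":
                throw new UnsupportedOperationException(
                    "parse method '" + methodLiteral + "' is not supported on the analytics "
                    + "engine; use the legacy engine");
            default:
                throw new IllegalArgumentException("unknown parse method: " + methodLiteral);
        }
    }
}
```

Failing at plan time with an explicit message keeps the grok/patterns rejection a deliberate, visible boundary rather than a silent fallthrough at execution time.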

Three small framework additions land alongside:

  • FieldType.MAP + a switch arm so the planner's capability lookup matches parse's map<varchar, varchar> return type. PARSE is registered against FieldType.MAP separately rather than added to STANDARD_PROJECT_OPS, so we don't pollute the standard scalar bucket with a map return type.
  • ScalarFunction.PARSE + ScalarFunction.ITEM. ITEM resolves through SqlKind.ITEM; PARSE resolves by identifier-name through fromSqlFunction.
  • A MapVector.getObject() bypass at three call sites — Arrow's JsonStringHashMap.<clinit> references jackson-datatype-jsr310's JavaTimeModule, which isn't on the arrow-flight-rpc parent plugin's classloader. Each site reads the offset buffer + key/value sub-vectors directly into a LinkedHashMap (preserves DataFusion's emission order).
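The MapVector bypass reads Arrow's list-of-entries layout by hand. As a plain-Java analogue (arrays stand in for Arrow's offset buffer and key/value sub-vectors, so this is a sketch of the layout, not the Arrow API): row i owns the flattened entries between `offsets[i]` and `offsets[i+1]`, and inserting them into a LinkedHashMap preserves emission order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapRowReader {
    // Plain-array analogue of MapVector's layout: an offsets buffer plus
    // flattened key/value child vectors. Row i owns entries in the
    // half-open range [offsets[row], offsets[row + 1]).
    static Map<String, String> readRow(int[] offsets, String[] keys, String[] values, int row) {
        Map<String, String> out = new LinkedHashMap<>();  // preserves DataFusion's emission order
        for (int j = offsets[row]; j < offsets[row + 1]; j++) {
            out.put(keys[j], values[j]);
        }
        return out;
    }
}
```

With `offsets = {0, 2, 3}`, `keys = {"host", "user", "host"}`, `values = {"a", "b", "c"}`, row 0 reads entries 0–1 into `{host=a, user=b}` and row 1 reads entry 2 into `{host=c}` — the same slicing the three call sites perform against the real offset buffer.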

One incidental fix: session_context::create_session_context now calls udf::register_all. The executeWithContextAsync fragment path was the only SessionContext creator not registering OpenSearch UDFs, so any analytics query through the production fragment route failed with "Unsupported function name". Pre-existing UDFs (convert_tz, to_unixtime) shared this gap silently because no IT exercised them through the same path.

Out of scope

grok and patterns parse methods. grok requires a Grok pattern library on the Rust side; patterns has its own semantic surface and currently lives in the SQL plugin's PatternsExpression. The Rust UDF rejects both with an explicit message so the next onboarding is a deliberate flip rather than a silent semantics change.

Tests

  • Rust UDFs — 14/14 unit tests (parse 7/7, item 7/7)
  • ParseAdapterTests — 9/9
  • ParseCommandIT (sandbox QA) — 8/8
  • ./gradlew check -p sandbox -Dsandbox.enabled=true — green

SQL plugin's CalciteParseCommandIT via the analytics-engine route

Run against a cluster with the bundle-side test infrastructure (PPL coverage bundle) and a locally-published SQL plugin.

| Tests executed | Passed | Failed | Pass rate |
|---|---|---|---|
| 7 | 7 | 0 | 100.0% |

Before this PR: 4/7 (only the testParseError* set passed, which throws at AST builder time before reaching the analytics planner). The +3 delta covers testParseCommand, testParseCommandReplaceOriginalField, and testParseCommandWithOtherRunTimeFields — all three exercise extraction, original-field overwrite, and chaining with eval/fields.


github-actions Bot commented May 8, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 562734b.

| Path | Line | Severity | Description |
|---|---|---|---|
| sandbox/libs/dataformat-native/rust/Cargo.toml | 72 | high | New Rust crate dependency added: regex = "1.11". Per mandatory flagging rules, all dependency additions must be flagged regardless of apparent legitimacy. Maintainers should verify the artifact resolves to the canonical crates.io regex crate and that the pinned version is correct. |
| sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml | 50 | high | New Rust crate dependency added: regex = { workspace = true }. Per mandatory flagging rules, all dependency additions must be flagged. The workspace reference should be verified to resolve to the same canonical crates.io regex crate pinned in the root workspace. |
| sandbox/plugins/analytics-backend-datafusion/rust/src/session_context.rs | 116 | medium | Duplicate call to crate::udf::register_all(&ctx) was added immediately after the existing call that was already present. The accompanying comment claims the second call is required for substrait function resolution, which is misleading — the first call already performs full UDF registration. Double-registration is anomalous, and the fabricated justification warrants review to determine whether this is an accidental leftover or intentional behavior with unexpected side effects in DataFusion's UDF registry. |

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 2 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@RyanL1997 RyanL1997 changed the title Enable PPL parse command on the analytics-engine route [Analytics Backend / DataFusion] Onboard PPL parse to DataFusion May 8, 2026
@RyanL1997 RyanL1997 force-pushed the mustang-parse branch 3 times, most recently from 0d3984f to 562734b on May 9, 2026 00:17
@RyanL1997 RyanL1997 marked this pull request as ready for review May 9, 2026 00:22
@RyanL1997 RyanL1997 requested a review from a team as a code owner May 9, 2026 00:22
@mch2 mch2 added skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. labels May 9, 2026

github-actions Bot commented May 9, 2026

❌ Gradle check result for b53c079: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions Bot commented May 9, 2026

❌ Gradle check result for ddfc0e7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Wire PPL `parse <field> '<regex>'` through PPL → Calcite → Substrait → DataFusion.
The command lowers to one `ITEM(PARSE(input, regex, "regex"), '<group>')` per named
group; PARSE returns `map<utf8, utf8>` of all named groups, ITEM extracts each
value. Both UDFs sit on the Rust side of the analytics backend.

Highlights:
* New Rust UDFs `parse` and `item` (`sandbox/plugins/analytics-backend-datafusion/
  rust/src/udf/{parse,item}.rs`). `parse` anchors the user pattern with `^(?:…)$`
  to honour Java's `Matcher.matches()` semantic that legacy `RegexExpression`
  relies on, so a row that doesn't match consumes nothing and every named group
  yields `""` — same observable behaviour as the legacy path.
* `ParseAdapter` validates the pattern + method operands as non-null string
  literals at plan time and gates the method to `regex` (rejects `grok` /
  `patterns` with a clear error pointing users at the legacy engine until those
  methods land).
* `FieldType.MAP`, `ScalarFunction.PARSE`, `ScalarFunction.ITEM`,
  `STANDARD_PROJECT_OPS` (ITEM is added; PARSE is registered separately for
  `FieldType.MAP` because no real OS mapping is a map and we don't want every
  scalar registering against the MAP bucket), `FunctionMappings.s` entries for
  `parse` and `item`, and YAML extension declarations.
* Codec MapVector handling at three sites (`ArrowValues`, `DatafusionResultStream`,
  `RowResponseCodec`) — `MapVector.getObject()` builds a `JsonStringHashMap`
  whose `<clinit>` references jackson-datatype-jsr310 not on the
  arrow-flight-rpc parent plugin's classloader, so each site reads the
  offset buffer + key/value sub-vectors directly.
* `session_context::create_session_context` now calls `udf::register_all`. The
  `executeWithContextAsync` fragment path was the only SessionContext creator
  that wasn't registering OpenSearch UDFs, so any analytics query through that
  path (the production fragment route) failed with "Unsupported function name".
  Pre-existing UDFs (`convert_tz`, `to_unixtime`) shared this gap silently
  because no IT exercised them through the same path.

`grok` and `patterns` parse methods are deliberately left on the legacy engine.
The Rust UDF rejects them with an explicit message; future onboardings will be
deliberate flips rather than silent semantics changes.

Verified end-to-end via `CalciteParseCommandIT` under
`tests.analytics.force_routing=true`: 7/7 passing (was 4/7 before — only the
testParseError* set passed, which throws at AST builder time before reaching
the analytics planner). The +3 delta covers `testParseCommand`,
`testParseCommandReplaceOriginalField`, and `testParseCommandWithOtherRunTimeFields`.

Sandbox QA `ParseCommandIT` (8/8) covers the same code paths against the
analytics path directly without depending on the SQL plugin worktree.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions

❌ Gradle check result for 9fa5c99: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
