
[Analytics Backend / DataFusion] Onboard PPL parse to DataFusion #21573

Open
RyanL1997 wants to merge 1 commit into opensearch-project:main from RyanL1997:mustang-parse

Conversation


@RyanL1997 RyanL1997 commented May 8, 2026

Description

Onboards the PPL parse command to the analytics-engine path end-to-end.

`parse <field> '<regex>'` lowers to one `ITEM(PARSE(input, regex, "regex"), '<group>')` per named group. Two new Rust UDFs back the chain:

  • parse — compiles the user pattern once, anchors it with `^(?:…)$` to honour Java's `Matcher.matches()` semantic that the legacy RegexExpression relies on, and emits a uniform-schema `map<utf8, utf8>` per row. A row that doesn't match (or a group that didn't participate) yields `""` — the same observable behaviour as the legacy path, where extractNamedGroup returns null and ParseFunction wraps it into ExprStringValue("").
  • item — extracts a value from a `map<utf8, utf8>` by key. Lowering target for Calcite's SqlStdOperatorTable.ITEM on map operands.
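The anchoring and empty-string semantics above can be sketched in plain Java (class and method names here are illustrative, not the plugin's API): wrapping the user pattern in `^(?:…)$` makes a `find()`-style engine behave like `Matcher.matches()`, and every named group of a non-matching row collapses to `""`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseSemantics {
    // Wrap the user pattern so that a find()-style engine only accepts
    // full-string matches, mirroring Java's Matcher.matches().
    static String anchor(String userPattern) {
        return "^(?:" + userPattern + ")$";
    }

    // Emit a uniform-schema map of named groups for one row. A row that
    // doesn't match, or a group that didn't participate, yields "".
    static Map<String, String> extract(String input, String userPattern, List<String> groups) {
        Matcher m = Pattern.compile(anchor(userPattern)).matcher(input);
        boolean matched = m.find();  // anchored, so equivalent to matches()
        Map<String, String> out = new LinkedHashMap<>();
        for (String g : groups) {
            String v = matched ? m.group(g) : null;
            out.put(g, v == null ? "" : v);  // null -> "" like the legacy ExprStringValue("")
        }
        return out;
    }
}
```

For example, `extract("alice@example.com", ".+@(?<host>.+)", List.of("host"))` yields `{host=example.com}`, while a row with no `@` yields `{host=}` rather than a null or a missing key, keeping the per-row schema uniform.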

ParseAdapter validates the pattern + method operands as non-null string literals at plan time and gates the method to regex; grok and patterns are rejected up front with a message pointing at the legacy engine. Mirrors the convert_tz precedent (#21476).
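A hypothetical sketch of the method gate described above (the real ParseAdapter inspects Calcite RexNode operands at plan time, not raw strings):

```java
// Illustrative only: mirrors ParseAdapter's gating of the parse method
// to "regex", with grok/patterns rejected up front.
public class ParseValidation {
    static String validateMethod(String methodLiteral) {
        if (methodLiteral == null) {
            throw new IllegalArgumentException("parse method must be a non-null string literal");
        }
        switch (methodLiteral) {
            case "regex":
                return methodLiteral;  // the only method onboarded to the analytics engine
            case "grok":
            case "patterns":
                throw new UnsupportedOperationException(
                    "parse method '" + methodLiteral + "' is not supported on the analytics "
                    + "engine; use the legacy engine");
            default:
                throw new IllegalArgumentException("unknown parse method: " + methodLiteral);
        }
    }
}
```

Failing at plan time with an explicit message keeps the grok/patterns rejection a deliberate, visible boundary rather than a silent fallthrough at execution time.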

Three small framework additions land alongside:

  • FieldType.MAP + a switch arm so the planner's capability lookup matches parse's map<varchar, varchar> return type. PARSE is registered against FieldType.MAP separately rather than added to STANDARD_PROJECT_OPS, so we don't pollute the standard scalar bucket with a map return type.
  • ScalarFunction.PARSE + ScalarFunction.ITEM. ITEM resolves through SqlKind.ITEM; PARSE resolves by identifier-name through fromSqlFunction.
  • A MapVector.getObject() bypass at three call sites — Arrow's JsonStringHashMap.<clinit> references jackson-datatype-jsr310's JavaTimeModule, which isn't on the arrow-flight-rpc parent plugin's classloader. Each site reads the offset buffer + key/value sub-vectors directly into a LinkedHashMap (preserves DataFusion's emission order).
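The MapVector bypass reads Arrow's list-of-entries layout by hand. As a plain-Java analogue (arrays stand in for Arrow's offset buffer and key/value sub-vectors, so this is a sketch of the layout, not the Arrow API): row i owns the flattened entries between `offsets[i]` and `offsets[i+1]`, and inserting them into a LinkedHashMap preserves emission order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapRowReader {
    // Plain-array analogue of MapVector's layout: an offsets buffer plus
    // flattened key/value child vectors. Row i owns entries in the
    // half-open range [offsets[row], offsets[row + 1]).
    static Map<String, String> readRow(int[] offsets, String[] keys, String[] values, int row) {
        Map<String, String> out = new LinkedHashMap<>();  // preserves DataFusion's emission order
        for (int j = offsets[row]; j < offsets[row + 1]; j++) {
            out.put(keys[j], values[j]);
        }
        return out;
    }
}
```

With `offsets = {0, 2, 3}`, `keys = {"host", "user", "host"}`, `values = {"a", "b", "c"}`, row 0 reads entries 0–1 into `{host=a, user=b}` and row 1 reads entry 2 into `{host=c}` — the same slicing the three call sites perform against the real offset buffer.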

One incidental fix: session_context::create_session_context now calls udf::register_all. The executeWithContextAsync fragment path was the only SessionContext creator not registering OpenSearch UDFs, so any analytics query through the production fragment route failed with "Unsupported function name". Pre-existing UDFs (convert_tz, to_unixtime) shared this gap silently because no IT exercised them through the same path.

Out of scope

grok and patterns parse methods. grok requires a Grok pattern library on the Rust side; patterns has its own semantic surface and currently lives in the SQL plugin's PatternsExpression. The Rust UDF rejects both with an explicit message so the next onboarding is a deliberate flip rather than a silent semantics change.

Tests

  • Rust UDFs — 14/14 unit tests (parse 7/7, item 7/7)
  • ParseAdapterTests — 9/9
  • ParseCommandIT (sandbox QA) — 8/8
  • ./gradlew check -p sandbox -Dsandbox.enabled=true — green

SQL plugin's CalciteParseCommandIT via the analytics-engine route

Run against a cluster with the bundle-side test infrastructure (PPL coverage bundle) and a locally-published SQL plugin.

| Tests executed | Passed | Failed | Pass rate |
|---|---|---|---|
| 7 | 7 | 0 | 100.0% |

Before this PR: 4/7 (only the testParseError* set passed, which throws at AST builder time before reaching the analytics planner). The +3 delta covers testParseCommand, testParseCommandReplaceOriginalField, and testParseCommandWithOtherRunTimeFields — all three exercise extraction, original-field overwrite, and chaining with eval/fields.


github-actions Bot commented May 8, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 562734b.

| Path | Line | Severity | Description |
|---|---|---|---|
| sandbox/libs/dataformat-native/rust/Cargo.toml | 72 | high | New Rust crate dependency added: regex = "1.11". Per mandatory flagging rules, all dependency additions must be flagged regardless of apparent legitimacy. Maintainers should verify the artifact resolves to the canonical crates.io regex crate and that the pinned version is correct. |
| sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml | 50 | high | New Rust crate dependency added: regex = { workspace = true }. Per mandatory flagging rules, all dependency additions must be flagged. The workspace reference should be verified to resolve to the same canonical crates.io regex crate pinned in the root workspace. |
| sandbox/plugins/analytics-backend-datafusion/rust/src/session_context.rs | 116 | medium | Duplicate call to crate::udf::register_all(&ctx) was added immediately after the existing call that was already present. The accompanying comment claims the second call is required for substrait function resolution, which is misleading — the first call already performs full UDF registration. Double-registration is anomalous, and the fabricated justification warrants review to determine whether this is an accidental leftover or intentional behavior with unexpected side effects in DataFusion's UDF registry. |

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 2 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@RyanL1997 RyanL1997 changed the title Enable PPL parse command on the analytics-engine route [Analytics Backend / DataFusion] Onboard PPL parse to DataFusion May 8, 2026
@RyanL1997 RyanL1997 force-pushed the mustang-parse branch 3 times, most recently from 0d3984f to 562734b on May 9, 2026 00:17
@RyanL1997 RyanL1997 marked this pull request as ready for review May 9, 2026 00:22
@RyanL1997 RyanL1997 requested a review from a team as a code owner May 9, 2026 00:22
@mch2 mch2 added skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. labels May 9, 2026

github-actions Bot commented May 9, 2026

❌ Gradle check result for b53c079: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions Bot commented May 9, 2026

❌ Gradle check result for ddfc0e7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Wire PPL `parse <field> '<regex>'` through PPL → Calcite → Substrait → DataFusion.
The command lowers to one `ITEM(PARSE(input, regex, "regex"), '<group>')` per named
group; PARSE returns `map<utf8, utf8>` of all named groups, ITEM extracts each
value. Both UDFs sit on the Rust side of the analytics backend.

Highlights:
* New Rust UDFs `parse` and `item` (`sandbox/plugins/analytics-backend-datafusion/
  rust/src/udf/{parse,item}.rs`). `parse` anchors the user pattern with `^(?:…)$`
  to honour Java's `Matcher.matches()` semantic that legacy `RegexExpression`
  relies on, so a row that doesn't match consumes nothing and every named group
  yields `""` — same observable behaviour as the legacy path.
* `ParseAdapter` validates the pattern + method operands as non-null string
  literals at plan time and gates the method to `regex` (rejects `grok` /
  `patterns` with a clear error pointing users at the legacy engine until those
  methods land).
* `FieldType.MAP`, `ScalarFunction.PARSE`, `ScalarFunction.ITEM`,
  `STANDARD_PROJECT_OPS` (ITEM is added; PARSE is registered separately for
  `FieldType.MAP` because no real OS mapping is a map and we don't want every
  scalar registering against the MAP bucket), `FunctionMappings.s` entries for
  `parse` and `item`, and YAML extension declarations.
* Codec MapVector handling at three sites (`ArrowValues`, `DatafusionResultStream`,
  `RowResponseCodec`) — `MapVector.getObject()` builds a `JsonStringHashMap`
  whose `<clinit>` references jackson-datatype-jsr310 not on the
  arrow-flight-rpc parent plugin's classloader, so each site reads the
  offset buffer + key/value sub-vectors directly.
* `session_context::create_session_context` now calls `udf::register_all`. The
  `executeWithContextAsync` fragment path was the only SessionContext creator
  that wasn't registering OpenSearch UDFs, so any analytics query through that
  path (the production fragment route) failed with "Unsupported function name".
  Pre-existing UDFs (`convert_tz`, `to_unixtime`) shared this gap silently
  because no IT exercised them through the same path.

`grok` and `patterns` parse methods are deliberately left on the legacy engine.
The Rust UDF rejects them with an explicit message; future onboardings will be
deliberate flips rather than silent semantics changes.

Verified end-to-end via `CalciteParseCommandIT` under
`tests.analytics.force_routing=true`: 7/7 passing (was 4/7 before — only the
testParseError* set passed, which throws at AST builder time before reaching
the analytics planner). The +3 delta covers `testParseCommand`,
`testParseCommandReplaceOriginalField`, and `testParseCommandWithOtherRunTimeFields`.

Sandbox QA `ParseCommandIT` (8/8) covers the same code paths against the
analytics path directly without depending on the SQL plugin worktree.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions

❌ Gradle check result for 9fa5c99: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
