[Analytics Backend / DataFusion] Onboard PPL parse to DataFusion #21573
RyanL1997 wants to merge 1 commit into opensearch-project:main
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 562734b.
❌ Gradle check result for b53c079: null. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for ddfc0e7: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Wire PPL `parse <field> '<regex>'` through PPL → Calcite → Substrait → DataFusion.
The command lowers to one `ITEM(PARSE(input, regex, "regex"), '<group>')` per named
group; PARSE returns `map<utf8, utf8>` of all named groups, ITEM extracts each
value. Both UDFs sit on the Rust side of the analytics backend.
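
As a rough illustration of the lowering described above, the following Java sketch models PARSE as "return every named group as a string map" and ITEM as a key lookup. It is not the Rust UDF code: the class name `ParseItemSketch`, the `GROUP_NAME` helper regex, and the assumption that group names are plain alphanumerics are all hypothetical. Anchoring with `^(?:…)$` plus `find()` only approximates `Matcher.matches()` (for example, Java's `$` also matches before a trailing line terminator).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative model of the PARSE -> ITEM chain; names here are hypothetical. */
final class ParseItemSketch {
    // Assumption: named groups are written (?<name>...) with alphanumeric names.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    /** PARSE: one entry per named group; "" when the row or the group did not match. */
    static Map<String, String> parse(String input, String userPattern) {
        // Wrap the user pattern so that a search behaves like a whole-input match.
        Pattern anchored = Pattern.compile("^(?:" + userPattern + ")$");
        Matcher m = anchored.matcher(input);
        boolean matched = m.find();

        Map<String, String> out = new LinkedHashMap<>();
        Matcher names = GROUP_NAME.matcher(userPattern);
        while (names.find()) {
            String name = names.group(1);
            // group(name) is null when the group did not participate in the match.
            String value = matched ? m.group(name) : null;
            out.put(name, value == null ? "" : value);
        }
        return out;
    }

    /** ITEM: extract a single value from the map by key. */
    static String item(Map<String, String> map, String key) {
        return map.get(key);
    }
}
```

With this model, `item(parse("alice@example.com", "(?<user>[^@]+)@(?<host>.+)"), "host")` yields `example.com`, and an input without `@` yields `""` for both groups.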
Highlights:
* New Rust UDFs `parse` and `item` (`sandbox/plugins/analytics-backend-datafusion/
rust/src/udf/{parse,item}.rs`). `parse` anchors the user pattern with `^(?:…)$`
to honour Java's `Matcher.matches()` semantic that legacy `RegexExpression`
relies on, so a row that doesn't match consumes nothing and every named group
yields `""` — same observable behaviour as the legacy path.
* `ParseAdapter` validates the pattern + method operands as non-null string
literals at plan time and gates the method to `regex` (rejects `grok` /
`patterns` with a clear error pointing users at the legacy engine until those
methods land).
* `FieldType.MAP`, `ScalarFunction.PARSE`, `ScalarFunction.ITEM`,
`STANDARD_PROJECT_OPS` (ITEM is added; PARSE is registered separately for
`FieldType.MAP` because no real OS mapping is a map and we don't want every
scalar registering against the MAP bucket), `FunctionMappings.s` entries for
`parse` and `item`, and YAML extension declarations.
* Codec MapVector handling at three sites (`ArrowValues`, `DatafusionResultStream`,
`RowResponseCodec`) — `MapVector.getObject()` builds a `JsonStringHashMap`
whose `<clinit>` references jackson-datatype-jsr310 not on the
arrow-flight-rpc parent plugin's classloader, so each site reads the
offset buffer + key/value sub-vectors directly.
* `session_context::create_session_context` now calls `udf::register_all`. The
`executeWithContextAsync` fragment path was the only SessionContext creator
that wasn't registering OpenSearch UDFs, so any analytics query through that
path (the production fragment route) failed with "Unsupported function name".
Pre-existing UDFs (`convert_tz`, `to_unixtime`) shared this gap silently
because no IT exercised them through the same path.
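
The direct offset-buffer read mentioned in the codec bullet can be pictured with plain arrays. The sketch below is a stand-in, not Arrow code: `MapColumnSketch` and its parameters are hypothetical, and the only thing borrowed from Arrow is the layout idea that row `i` owns the key/value entries in the range `[offsets[i], offsets[i + 1])`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-array stand-in for a map column; does not use any Arrow classes. */
final class MapColumnSketch {
    /**
     * Rebuilds the map for one row from the flattened key/value arrays.
     * Row i owns the entries in [offsets[i], offsets[i + 1]).
     */
    static Map<String, String> readRow(int[] offsets, String[] keys, String[] values, int row) {
        int start = offsets[row];
        int end = offsets[row + 1];
        // LinkedHashMap keeps the order in which entries were emitted.
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = start; i < end; i++) {
            out.put(keys[i], values[i]);
        }
        return out;
    }
}
```

Reading entries this way never constructs the vector's own map object, which is the point of the bypass described above.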
`grok` and `patterns` parse methods are deliberately left on the legacy engine.
The Rust UDF rejects them with an explicit message; future onboardings will be
deliberate flips rather than silent semantics changes.
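
A minimal sketch of that kind of plan-time gate, assuming the method operand arrives as an already-extracted string literal. The class name, method name, exception types, and message wording are hypothetical and are not taken from `ParseAdapter`.

```java
/** Hypothetical plan-time gate for the parse method operand. */
final class ParseMethodGateSketch {
    /** Accepts only "regex"; anything else is rejected before execution. */
    static String requireRegexMethod(String methodLiteral) {
        if (methodLiteral == null) {
            throw new IllegalArgumentException(
                "parse method must be a non-null string literal");
        }
        if (!methodLiteral.equals("regex")) {
            // grok / patterns land here and are pointed at the legacy engine.
            throw new UnsupportedOperationException(
                "parse method '" + methodLiteral
                    + "' is not supported on the analytics engine; use the legacy engine");
        }
        return methodLiteral;
    }
}
```

Failing at plan time keeps an unsupported method from reaching the UDF with partially matching semantics.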
Verified end-to-end via `CalciteParseCommandIT` under
`tests.analytics.force_routing=true`: 7/7 passing (was 4/7 before — only the
testParseError* set passed, which throws at AST builder time before reaching
the analytics planner). The +3 delta covers `testParseCommand`,
`testParseCommandReplaceOriginalField`, and `testParseCommandWithOtherRunTimeFields`.
Sandbox QA `ParseCommandIT` (8/8) covers the same code paths against the
analytics path directly without depending on the SQL plugin worktree.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
❌ Gradle check result for 9fa5c99: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Onboards the PPL `parse` command to the analytics-engine path end-to-end. `parse <field> '<regex>'` lowers to one `ITEM(PARSE(input, regex, "regex"), '<group>')` per named group. Two new Rust UDFs back the chain:
* `parse`: compiles the user pattern once, anchors it with `^(?:…)$` to honour Java's `Matcher.matches()` semantic that the legacy `RegexExpression` relies on, and emits a uniform-schema `map<utf8, utf8>` per row. A row that doesn't match (or a group that didn't participate) yields `""`, the same observable behaviour as the legacy path's `extractNamedGroup` returning null and `ParseFunction` wrapping it to `ExprStringValue("")`.
* `item`: extracts a value from a `map<utf8, utf8>` by key. Lowering target for Calcite's `SqlStdOperatorTable.ITEM` on map operands.
`ParseAdapter` validates the pattern + method operands as non-null string literals at plan time and gates the method to `regex`; `grok` and `patterns` are rejected up front with a message pointing at the legacy engine. Mirrors the `convert_tz` precedent (#21476).
Three small framework additions land alongside:
* `FieldType.MAP` + a switch arm so the planner's capability lookup matches `parse`'s `map<varchar, varchar>` return type. PARSE is registered against `FieldType.MAP` separately rather than added to `STANDARD_PROJECT_OPS`, so we don't pollute the standard scalar bucket with a map return type.
* `ScalarFunction.PARSE` + `ScalarFunction.ITEM`. ITEM resolves through `SqlKind.ITEM`; PARSE resolves by identifier-name through `fromSqlFunction`.
* `MapVector.getObject()` bypass at three call sites: Arrow's `JsonStringHashMap.<clinit>` references `jackson-datatype-jsr310`'s `JavaTimeModule`, which isn't on the `arrow-flight-rpc` parent plugin's classloader. Each site reads the offset buffer + key/value sub-vectors directly into a `LinkedHashMap` (preserves DataFusion's emission order).
One incidental fix:
* `session_context::create_session_context` now calls `udf::register_all`. The `executeWithContextAsync` fragment path was the only `SessionContext` creator not registering OpenSearch UDFs, so any analytics query through the production fragment route failed with "Unsupported function name". Pre-existing UDFs (`convert_tz`, `to_unixtime`) shared this gap silently because no IT exercised them through the same path.
Out of scope
* `grok` and `patterns` parse methods. `grok` requires a Grok pattern library on the Rust side; `patterns` has its own semantic surface and currently lives in the SQL plugin's `PatternsExpression`. The Rust UDF rejects both with an explicit message so the next onboarding is a deliberate flip rather than a silent semantics change.
Tests
* (`parse` 7/7, `item` 7/7)
* `ParseAdapterTests`: 9/9
* `ParseCommandIT` (sandbox QA): 8/8
* `./gradlew check -p sandbox -Dsandbox.enabled=true`: green
SQL plugin's `CalciteParseCommandIT` via the analytics-engine route
Run against a cluster with the bundle-side test infrastructure (PPL coverage bundle) + locally-published SQL plugin.
Before this PR: 4/7 (only the `testParseError*` set passed, which throws at AST builder time before reaching the analytics planner). The +3 delta covers `testParseCommand`, `testParseCommandReplaceOriginalField`, and `testParseCommandWithOtherRunTimeFields`, which together exercise extraction, original-field overwrite, and chaining with `eval`/`fields`.