Add PPL list-style aggregates: first, last, list, take, values#21512
Closed
mch2 wants to merge 1 commit into
Closed
Add PPL list-style aggregates: first, last, list, take, values#21512mch2 wants to merge 1 commit into
mch2 wants to merge 1 commit into
Conversation
Wires five PPL stats aggregates onto the DataFusion backend. Analytics
backends must declare per-backend capability: the engine will not offer
an aggregate to a backend unless it appears in the backend's
AggregateFunction enum + capability set. Missing from that set, the
frontend silently falls back to coordinator-side evaluation (or errors
if the type isn't representable), so these aggregates were unreachable
even though DataFusion itself has native equivalents for most.
How each is wired:
first(x), last(x) -> DataFusion first_value / last_value UDAFs.
NameBasedAggregateFunctionConverter rewrites the
PPL name to the DF-native target. Without an
ORDER BY the DF UDAFs return an arbitrary group
element (documented PPL-vs-DF divergence).
list(x) -> DataFusion array_agg via pure alias rewrite.
values(x) -> array_agg with forced DISTINCT + ORDER BY on the
operand itself (PPL values() is distinct +
sorted). Uses AliasConfig.renameDistinctSortedBySelf.
take(x [, n]) -> custom Rust UDAF (analytics-backend-datafusion/
rust/src/udaf/take.rs) — DataFusion 52.x has no
equivalent. Collects up to N values; when n is
absent defaults to 10, matching PPL docs.
Also includes:
- ARG_MIN / ARG_MAX rewrite (for PPL earliest / latest) to
first_value / last_value with a synthesized ASC_NULLS_LAST sort
field on the key arg. DataFusion 52.x has no native min_by/max_by.
- approx_count_distinct alias UDAF (Rust side) that routes to DF's
native approx_distinct — the yaml emits the PPL-canonical name so
the substrait consumer can find it without a name-based fallback.
- session_context::create_session_context registers the OpenSearch
UDAFs alongside the default ListingTable so the instruction-based
execution path (upstream PR opensearch-project#21479) can resolve `take` and
`approx_count_distinct` at shard-scan time. Without this, substrait
decoding fails with "Aggregate function take:any_i32 is not
supported" because the plain register-at-session-new path isn't
taken on the new instruction-handler flow.
- NameBasedAggregateFunctionConverter: central matcher shim that
(a) falls back to name-based operator lookup when the stock isthmus
matcher's identity map misses PPL's own SqlAggFunction instances,
and (b) assembles an AggregateFunctionInvocation directly when the
YAML signature uses wildcards the directMap lookup can't expand
against runtime types. Alias rewrites live in NAME_ALIASES.
- opensearch_aggregate.yaml: substrait extension declaring the
first_value / last_value / array_agg / take variants with the
LIST?<...> / any1 conventions the substrait-java 0.89.1 parser
accepts.
- Five CalcitePPL*CommandIT suites (19 tests total) covering the
PPL-frontend -> substrait -> DataFusion path end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 6de61ab.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wires five PPL stats aggregates onto the DataFusion analytics backend:
first_value/last_valueUDAFsarray_agg(pure alias rewrite)array_aggwith forced DISTINCT + ORDER BY on the operand itself (PPLvalues()is distinct + sorted)sandbox/plugins/analytics-backend-datafusion/rust/src/udaf/take.rs); DataFusion 52.x has no equivalentThe analytics engine will not route an aggregate to a backend unless it appears in that backend's
AggregateFunctioncapability set. Without this declaration the PPL frontend silently falls back to coordinator-side evaluation (or errors if the type isn't representable), so these five aggregates were unreachable even though DataFusion has native equivalents for most.Also included
earliest/latest→first_value/last_valuewith a synthesizedASC_NULLS_LASTsort field on the key arg (DataFusion 52.x has no nativemin_by/max_by).approx_distinct— the yaml emits the PPL-canonical name so the substrait consumer resolves it without a name-based fallback.session_context::create_session_contextUDAF registration — registers OpenSearch UDAFs alongside the default ListingTable so the instruction-based execution path (upstream PR Introduce Instructions for the Backend to use during execution #21479) can resolvetakeandapprox_count_distinctat shard-scan time. Without this, substrait decoding fails with "Aggregate function take:any_i32 is not supported" because the plain register-at-session-new path isn't taken on the new instruction-handler flow. Benefits any plugin UDAF routed through the instruction path.SqlAggFunctioninstances, and (b) assembles anAggregateFunctionInvocationdirectly when the YAML signature uses wildcards the directMap lookup can't expand against runtime types. Alias rewrites live inNAME_ALIASES.opensearch_aggregate.yaml— substrait extension declaring thefirst_value/last_value/array_agg/takevariants with theLIST?<...>/any1conventions the substrait-java 0.89.1 parser accepts.Tests
5
CalcitePPL*CommandITsuites covering the PPL-frontend → substrait → DataFusion path end-to-end (19 tests total, all green):CalcitePPLFirstCommandIT(4 tests)CalcitePPLLastCommandIT(4 tests)CalcitePPLListCommandIT(3 tests)CalcitePPLTakeCommandIT(5 tests)CalcitePPLValuesCommandIT(3 tests)Plus
NameBasedAggregateFunctionConverterTestsunit coverage (8 tests).Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.