Skip to content

Add PPL list-style aggregates: first, last, list, take, values#21512

Closed
mch2 wants to merge 1 commit into
opensearch-project:mainfrom
mch2:ppl/list-aggs
Closed

Add PPL list-style aggregates: first, last, list, take, values#21512
mch2 wants to merge 1 commit into
opensearch-project:mainfrom
mch2:ppl/list-aggs

Conversation

@mch2
Copy link
Copy Markdown
Member

@mch2 mch2 commented May 6, 2026

Summary

Wires five PPL stats aggregates onto the DataFusion analytics backend:

  • first / last → DataFusion first_value / last_value UDAFs
  • list → DataFusion array_agg (pure alias rewrite)
  • valuesarray_agg with forced DISTINCT + ORDER BY on the operand itself (PPL values() is distinct + sorted)
  • take → custom Rust UDAF (sandbox/plugins/analytics-backend-datafusion/rust/src/udaf/take.rs); DataFusion 52.x has no equivalent

The analytics engine will not route an aggregate to a backend unless it appears in that backend's AggregateFunction capability set. Without this declaration the PPL frontend silently falls back to coordinator-side evaluation (or errors if the type isn't representable), so these five aggregates were unreachable even though DataFusion has native equivalents for most.

Also included

  • ARG_MIN / ARG_MAX rewrite for PPL earliest / latestfirst_value / last_value with a synthesized ASC_NULLS_LAST sort field on the key arg (DataFusion 52.x has no native min_by / max_by).
  • approx_count_distinct alias UDAF (Rust side) routing to DF's native approx_distinct — the yaml emits the PPL-canonical name so the substrait consumer resolves it without a name-based fallback.
  • session_context::create_session_context UDAF registration — registers OpenSearch UDAFs alongside the default ListingTable so the instruction-based execution path (upstream PR Introduce Instructions for the Backend to use during execution #21479) can resolve take and approx_count_distinct at shard-scan time. Without this, substrait decoding fails with "Aggregate function take:any_i32 is not supported" because the plain register-at-session-new path isn't taken on the new instruction-handler flow. Benefits any plugin UDAF routed through the instruction path.
  • NameBasedAggregateFunctionConverter — central matcher shim that (a) falls back to name-based operator lookup when the stock isthmus matcher's identity map misses PPL's own SqlAggFunction instances, and (b) assembles an AggregateFunctionInvocation directly when the YAML signature uses wildcards the directMap lookup can't expand against runtime types. Alias rewrites live in NAME_ALIASES.
  • opensearch_aggregate.yaml — substrait extension declaring the first_value / last_value / array_agg / take variants with the LIST?<...> / any1 conventions the substrait-java 0.89.1 parser accepts.

Tests

5 CalcitePPL*CommandIT suites covering the PPL-frontend → substrait → DataFusion path end-to-end (19 tests total, all green):

  • CalcitePPLFirstCommandIT (4 tests)
  • CalcitePPLLastCommandIT (4 tests)
  • CalcitePPLListCommandIT (3 tests)
  • CalcitePPLTakeCommandIT (5 tests)
  • CalcitePPLValuesCommandIT (3 tests)

Plus NameBasedAggregateFunctionConverterTests unit coverage (8 tests).

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Wires five PPL stats aggregates onto the DataFusion backend. Analytics
backends must declare per-backend capability: the engine will not offer
an aggregate to a backend unless it appears in the backend's
AggregateFunction enum + capability set. Missing from that set, the
frontend silently falls back to coordinator-side evaluation (or errors
if the type isn't representable), so these aggregates were unreachable
even though DataFusion itself has native equivalents for most.

How each is wired:

  first(x), last(x)  -> DataFusion first_value / last_value UDAFs.
                        NameBasedAggregateFunctionConverter rewrites the
                        PPL name to the DF-native target. Without an
                        ORDER BY the DF UDAFs return an arbitrary group
                        element (documented PPL-vs-DF divergence).

  list(x)            -> DataFusion array_agg via pure alias rewrite.

  values(x)          -> array_agg with forced DISTINCT + ORDER BY on the
                        operand itself (PPL values() is distinct +
                        sorted). Uses AliasConfig.renameDistinctSortedBySelf.

  take(x [, n])      -> custom Rust UDAF (analytics-backend-datafusion/
                        rust/src/udaf/take.rs) — DataFusion 52.x has no
                        equivalent. Collects up to N values; when n is
                        absent defaults to 10, matching PPL docs.

Also includes:

  - ARG_MIN / ARG_MAX rewrite (for PPL earliest / latest) to
    first_value / last_value with a synthesized ASC_NULLS_LAST sort
    field on the key arg. DataFusion 52.x has no native min_by/max_by.

  - approx_count_distinct alias UDAF (Rust side) that routes to DF's
    native approx_distinct — the yaml emits the PPL-canonical name so
    the substrait consumer can find it without a name-based fallback.

  - session_context::create_session_context registers the OpenSearch
    UDAFs alongside the default ListingTable so the instruction-based
    execution path (upstream PR opensearch-project#21479) can resolve `take` and
    `approx_count_distinct` at shard-scan time. Without this, substrait
    decoding fails with "Aggregate function take:any_i32 is not
    supported" because the plain register-at-session-new path isn't
    taken on the new instruction-handler flow.

  - NameBasedAggregateFunctionConverter: central matcher shim that
    (a) falls back to name-based operator lookup when the stock isthmus
    matcher's identity map misses PPL's own SqlAggFunction instances,
    and (b) assembles an AggregateFunctionInvocation directly when the
    YAML signature uses wildcards the directMap lookup can't expand
    against runtime types. Alias rewrites live in NAME_ALIASES.

  - opensearch_aggregate.yaml: substrait extension declaring the
    first_value / last_value / array_agg / take variants with the
    LIST?<...> / any1 conventions the substrait-java 0.89.1 parser
    accepts.

  - Five CalcitePPL*CommandIT suites (19 tests total) covering the
    PPL-frontend -> substrait -> DataFusion path end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mch2 mch2 requested a review from a team as a code owner May 6, 2026 15:07
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 6de61ab.

PathLineSeverityDescription
sandbox/plugins/analytics-engine/build.gradle79highNew runtime dependency added: 'com.fasterxml.jackson.datatype:jackson-datatype-jsr310:${versions.jackson}' declared as 'implementation' (bundled into the plugin). Per mandatory supply chain rule, any new dependency must be flagged for maintainer verification regardless of how legitimate the artifact name appears.
sandbox/plugins/analytics-engine/licenses/jackson-databind-2.21.3.jar.sha11mediumSHA1 integrity verification file for jackson-databind-2.21.3.jar is deleted. jackson-databind remains declared as 'compileOnly' in build.gradle, so removal may be intentional (not bundled = no license check needed), but deleting an artifact integrity file warrants explicit maintainer review to confirm the dependency is truly no longer bundled.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 1 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@mch2 mch2 closed this May 6, 2026
@mch2 mch2 deleted the ppl/list-aggs branch May 6, 2026 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant