Add PERCENTILE aggregation function and related validation#16545
Add PERCENTILE aggregation function and related validation#16545FearfulTomcat27 wants to merge 1 commit intoapache:masterfrom
Conversation
d18b6dd to
f63a796
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds an exact percentile aggregation to the IoTDB relational query engine and wires it through parsing/semantic validation, aggregation execution (grouped + non-grouped), RPC aggregation type enums, and integration tests.
Changes:
- Introduces a new exact percentile state container (
Percentile) plus grouped/non-grouped accumulators. - Registers
PERCENTILEacross SQL constants, builtin function enums, semantic validation, and accumulator factory creation. - Adds integration tests for correct percentile results and invalid-usage error handling.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
iotdb-protocol/thrift-commons/src/main/thrift/common.thrift |
Adds PERCENTILE to TAggregationType for RPC/planner/executor integration. |
iotdb-core/node-commons/src/main/java/org/apache/iotdb/commons/udf/builtin/relational/TableBuiltinAggregationFunction.java |
Registers percentile as a builtin aggregation and sets intermediate type handling. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/utils/constant/SqlConstant.java |
Adds the PERCENTILE SQL function name constant. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/plan/relational/sql/parser/AstBuilder.java |
Adds parser-time validation for percentile argument literal type. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/plan/relational/metadata/TableMetadataImpl.java |
Adds semantic validation and return-type inference for percentile. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/grouped/array/PercentileBigArray.java |
Adds grouped state container for percentile states with retained-size tracking. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/grouped/GroupedPercentileAccumulator.java |
Implements grouped exact percentile aggregation accumulation/merge/finalization. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/PercentileAccumulator.java |
Implements non-grouped exact percentile aggregation accumulation/merge/finalization. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/AccumulatorFactory.java |
Wires TAggregationType.PERCENTILE to the new accumulators. |
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/Percentile.java |
Adds the in-memory exact percentile computation + (de)serialization logic. |
integration-test/src/test/java/org/apache/iotdb/relational/it/query/recent/IoTDBTableAggregationIT.java |
Adds integration tests for percentile and invalid-usage error cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (int i = 0; i < positionCount; i++) { | ||
| position = selectedPositions[i]; | ||
| groupId = groupIds[position]; |
There was a problem hiding this comment.
When a mask is applied, selectedPositions may contain fewer valid entries than mask.getPositionCount(). The loop over selectedPositions should be bounded by mask.getSelectedPositionCount() to avoid aggregating masked-out rows (or stale positions).
| for (int i = 0; i < positionCount; i++) { | ||
| position = selectedPositions[i]; | ||
| groupId = groupIds[position]; |
There was a problem hiding this comment.
In the masked (non-selectAll) branch, iterate selectedPositions using mask.getSelectedPositionCount() as the loop bound to avoid aggregating stale/unselected positions.
| Percentile percentile = new Percentile(); | ||
| if (size > percentile.capacity) { | ||
| percentile.capacity = size; | ||
| percentile.values = new double[size]; | ||
| } |
There was a problem hiding this comment.
Percentile.deserialize() does not restore the 'sorted' state. Since serialize() writes values in their current internal order (which may be unsorted), leaving the deserialized instance with sorted=true can lead to incorrect getPercentile() results if the deserialized object is queried directly. Set sorted=false on deserialization (or serialize in sorted order) to preserve correctness.
| public void updateRetainedSize(long index, Percentile value) { | ||
| Percentile percentile = array.get(index); | ||
| if (percentile != null) { | ||
| sizeOfPercentile -= percentile.getEstimatedSize(); | ||
| } |
There was a problem hiding this comment.
PercentileBigArray tracks retained size only when an element is set, but Percentile instances can grow their internal arrays as values are added. Since addValue()/merge() mutate the object after retrieval without calling updateRetainedSize, sizeOfPercentile can become arbitrarily stale and under-report memory. Consider updating the retained-size accounting when a Percentile grows (or redesign size tracking to avoid cached sizes for growable objects).
| array.forEach( | ||
| item -> { | ||
| if (item != null) { | ||
| item.clear(); | ||
| } | ||
| }); | ||
| sizeOfPercentile = 0; |
There was a problem hiding this comment.
reset() sets sizeOfPercentile = 0 even though the Percentile objects (and their backing arrays) remain allocated. This makes sizeOf()/isEmpty() inaccurate and can cause severe underestimation of operator memory usage after reset. Keep sizeOfPercentile consistent with retained objects (e.g., don't zero it, or recompute it based on current capacities).
| array.forEach( | |
| item -> { | |
| if (item != null) { | |
| item.clear(); | |
| } | |
| }); | |
| sizeOfPercentile = 0; | |
| sizeOfPercentile = 0; | |
| array.forEach( | |
| item -> { | |
| if (item != null) { | |
| item.clear(); | |
| sizeOfPercentile += item.getEstimatedSize(); | |
| } | |
| }); |
| if (mask.isSelectAll()) { | ||
| for (int i = 0; i < positionCount; i++) { | ||
| if (!column.isNull(i)) { | ||
| percentile.addValue(column.getLong(i)); | ||
| } |
There was a problem hiding this comment.
INT64/TIMESTAMP inputs are being implicitly converted to double when stored in Percentile, which silently loses precision for large INT64 values (>|2^53|) and can return incorrect results. Consider implementing a long-backed path for INT64/TIMESTAMP or using an exact conversion check (and throwing) when the long value cannot be represented exactly as a double.
| for (int i = 0; i < positionCount; i++) { | ||
| position = selectedPositions[i]; | ||
| groupId = groupIds[position]; |
There was a problem hiding this comment.
The non-selectAll path should iterate up to mask.getSelectedPositionCount() (not mask.getPositionCount()) when traversing selectedPositions; otherwise masked-out rows can be included.
| for (int i = 0; i < positionCount; i++) { | ||
| position = selectedPositions[i]; | ||
| groupId = groupIds[position]; |
There was a problem hiding this comment.
Looping over selectedPositions should use mask.getSelectedPositionCount() as the bound; using mask.getPositionCount() can include masked-out positions and corrupt results.
| int groupId = groupIds[i]; | ||
| Percentile percentile = array.get(groupId); | ||
| if (!valueColumn.isNull(i)) { | ||
| percentile.addValue(valueColumn.getLong(i)); | ||
| } |
There was a problem hiding this comment.
INT64/TIMESTAMP values are implicitly converted to double when stored (via Percentile.addValue(double)). This silently loses precision for values outside the exact double integer range (>|2^53|), producing incorrect percentiles. Consider storing longs separately for INT64/TIMESTAMP or, at minimum, enforce exact conversion (similar to AbstractApproxPercentileAccumulator.toDoubleExact) and fail fast when precision would be lost.
This pull request introduces the new
percentileaggregation function to the IoTDB relational query engine, providing support for calculating percentiles over numeric columns. The changes include the implementation of the core percentile calculation logic, integration into the accumulator factory, and comprehensive tests for both correct behavior and error handling.Percentile Aggregation Support
PercentileAccumulatorand supportingPercentileclass to implement percentile calculations for numeric types (INT32,INT64,FLOAT,DOUBLE,TIMESTAMP). [1] [2]AccumulatorFactory, enabling both grouped and non-grouped percentile aggregations. [1] [2] [3]Testing and Validation
IoTDBTableAggregationIT.javafor verifying percentile queries and their results, including grouped queries.percentilefunction, such as wrong argument count, invalid percentage values, and unsupported data types.