Skip to content

Add PERCENTILE aggregation function and related validation#16545

Open
FearfulTomcat27 wants to merge 1 commit intoapache:masterfrom
FearfulTomcat27:feat/percentile
Open

Add PERCENTILE aggregation function and related validation#16545
FearfulTomcat27 wants to merge 1 commit intoapache:masterfrom
FearfulTomcat27:feat/percentile

Conversation

@FearfulTomcat27
Copy link
Copy Markdown
Contributor

This pull request introduces the new percentile aggregation function to the IoTDB relational query engine, providing support for calculating percentiles over numeric columns. The changes include the implementation of the core percentile calculation logic, integration into the accumulator factory, and comprehensive tests for both correct behavior and error handling.

Percentile Aggregation Support

  • Added PercentileAccumulator and supporting Percentile class to implement percentile calculations for numeric types (INT32, INT64, FLOAT, DOUBLE, TIMESTAMP). [1] [2]
  • Integrated the new percentile accumulator into the AccumulatorFactory, enabling both grouped and non-grouped percentile aggregations. [1] [2] [3]

Testing and Validation

  • Added new integration tests to IoTDBTableAggregationIT.java for verifying percentile queries and their results, including grouped queries.
  • Extended exception tests to cover invalid usages of the percentile function, such as wrong argument count, invalid percentage values, and unsupported data types.

Copilot AI review requested due to automatic review settings April 23, 2026 17:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an exact percentile aggregation to the IoTDB relational query engine and wires it through parsing/semantic validation, aggregation execution (grouped + non-grouped), RPC aggregation type enums, and integration tests.

Changes:

  • Introduces a new exact percentile state container (Percentile) plus grouped/non-grouped accumulators.
  • Registers PERCENTILE across SQL constants, builtin function enums, semantic validation, and accumulator factory creation.
  • Adds integration tests for correct percentile results and invalid-usage error handling.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
iotdb-protocol/thrift-commons/src/main/thrift/common.thrift Adds PERCENTILE to TAggregationType for RPC/planner/executor integration.
iotdb-core/node-commons/src/main/java/org/apache/iotdb/commons/udf/builtin/relational/TableBuiltinAggregationFunction.java Registers percentile as a builtin aggregation and sets intermediate type handling.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/utils/constant/SqlConstant.java Adds the PERCENTILE SQL function name constant.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/plan/relational/sql/parser/AstBuilder.java Adds parser-time validation for percentile argument literal type.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/plan/relational/metadata/TableMetadataImpl.java Adds semantic validation and return-type inference for percentile.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/grouped/array/PercentileBigArray.java Adds grouped state container for percentile states with retained-size tracking.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/grouped/GroupedPercentileAccumulator.java Implements grouped exact percentile aggregation accumulation/merge/finalization.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/PercentileAccumulator.java Implements non-grouped exact percentile aggregation accumulation/merge/finalization.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/aggregation/AccumulatorFactory.java Wires TAggregationType.PERCENTILE to the new accumulators.
iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/execution/operator/source/relational/Percentile.java Adds the in-memory exact percentile computation + (de)serialization logic.
integration-test/src/test/java/org/apache/iotdb/relational/it/query/recent/IoTDBTableAggregationIT.java Adds integration tests for percentile and invalid-usage error cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +158 to +160
for (int i = 0; i < positionCount; i++) {
position = selectedPositions[i];
groupId = groupIds[position];
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a mask is applied, selectedPositions may contain fewer valid entries than mask.getPositionCount(). The loop over selectedPositions should be bounded by mask.getSelectedPositionCount() to avoid aggregating masked-out rows (or stale positions).

Copilot uses AI. Check for mistakes.
Comment on lines +242 to +244
for (int i = 0; i < positionCount; i++) {
position = selectedPositions[i];
groupId = groupIds[position];
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the masked (non-selectAll) branch, iterate selectedPositions using mask.getSelectedPositionCount() as the loop bound to avoid aggregating stale/unselected positions.

Copilot uses AI. Check for mistakes.
Comment on lines +139 to +143
Percentile percentile = new Percentile();
if (size > percentile.capacity) {
percentile.capacity = size;
percentile.values = new double[size];
}
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Percentile.deserialize() does not restore the 'sorted' state. Since serialize() writes values in their current internal order (which may be unsorted), leaving the deserialized instance with sorted=true can lead to incorrect getPercentile() results if the deserialized object is queried directly. Set sorted=false on deserialization (or serialize in sorted order) to preserve correctness.

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +61
public void updateRetainedSize(long index, Percentile value) {
Percentile percentile = array.get(index);
if (percentile != null) {
sizeOfPercentile -= percentile.getEstimatedSize();
}
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PercentileBigArray tracks retained size only when an element is set, but Percentile instances can grow their internal arrays as values are added. Since addValue()/merge() mutate the object after retrieval without calling updateRetainedSize, sizeOfPercentile can become arbitrarily stale and under-report memory. Consider updating the retained-size accounting when a Percentile grows (or redesign size tracking to avoid cached sizes for growable objects).

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +74
array.forEach(
item -> {
if (item != null) {
item.clear();
}
});
sizeOfPercentile = 0;
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reset() sets sizeOfPercentile = 0 even though the Percentile objects (and their backing arrays) remain allocated. This makes sizeOf()/isEmpty() inaccurate and can cause severe underestimation of operator memory usage after reset. Keep sizeOfPercentile consistent with retained objects (e.g., don't zero it, or recompute it based on current capacities).

Suggested change
array.forEach(
item -> {
if (item != null) {
item.clear();
}
});
sizeOfPercentile = 0;
sizeOfPercentile = 0;
array.forEach(
item -> {
if (item != null) {
item.clear();
sizeOfPercentile += item.getEstimatedSize();
}
});

Copilot uses AI. Check for mistakes.
Comment on lines +167 to +171
if (mask.isSelectAll()) {
for (int i = 0; i < positionCount; i++) {
if (!column.isNull(i)) {
percentile.addValue(column.getLong(i));
}
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INT64/TIMESTAMP inputs are being implicitly converted to double when stored in Percentile, which silently loses precision for large INT64 values (>|2^53|) and can return incorrect results. Consider implementing a long-backed path for INT64/TIMESTAMP or using an exact conversion check (and throwing) when the long value cannot be represented exactly as a double.

Copilot uses AI. Check for mistakes.
Comment on lines +186 to +188
for (int i = 0; i < positionCount; i++) {
position = selectedPositions[i];
groupId = groupIds[position];
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-selectAll path should iterate up to mask.getSelectedPositionCount() (not mask.getPositionCount()) when traversing selectedPositions; otherwise masked-out rows can be included.

Copilot uses AI. Check for mistakes.
Comment on lines +214 to +216
for (int i = 0; i < positionCount; i++) {
position = selectedPositions[i];
groupId = groupIds[position];
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looping over selectedPositions should use mask.getSelectedPositionCount() as the bound; using mask.getPositionCount() can include masked-out positions and corrupt results.

Copilot uses AI. Check for mistakes.
Comment on lines +176 to +180
int groupId = groupIds[i];
Percentile percentile = array.get(groupId);
if (!valueColumn.isNull(i)) {
percentile.addValue(valueColumn.getLong(i));
}
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INT64/TIMESTAMP values are implicitly converted to double when stored (via Percentile.addValue(double)). This silently loses precision for values outside the exact double integer range (>|2^53|), producing incorrect percentiles. Consider storing longs separately for INT64/TIMESTAMP or, at minimum, enforce exact conversion (similar to AbstractApproxPercentileAccumulator.toDoubleExact) and fail fast when precision would be lost.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants