diff --git a/LogicalTypes.md b/LogicalTypes.md
index 78fdf293..820320dc 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+Like `FLOAT` and `DOUBLE`, the sort order for `FLOAT16` is signed with special
+handling for NaNs and signed zeros. Writers should use IEEE754TotalOrder for
+consistent handling of these edge cases. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 
diff --git a/README.md b/README.md
index d3482093..d398ac4f 100644
--- a/README.md
+++ b/README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitive types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+with potentially ambiguous semantics (e.g., NaN ordering for floating point
+types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding
 
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 883264c3..225f85f9 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -309,6 +309,13 @@ struct Statistics {
    7: optional bool is_max_value_exact;
    /** If true, min_value is the actual minimum value for a column */
    8: optional bool is_min_value_exact;
+   /**
+    * Count of NaN values in the column; only present if physical type is FLOAT
+    * or DOUBLE, or logical type is FLOAT16.
+    * If this field is not present, readers MUST assume NaNs may be present
+    * (i.e. MUST NOT assume nan_count == 0).
+    */
+   9: optional i64 nan_count;
 }
 
 /** Empty structs to use as logical type annotations */
@@ -1050,6 +1057,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1058,6 +1068,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1111,23 +1122,78 @@ union ColumnOrder {
    *     64-bit signed integer (nanos)
    *     See https://github.com/apache/parquet-format/issues/502 for more details
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because TYPE_ORDER is ambiguous for floating point types due to
+   *     underspecified handling of NaN and -0/+0, it is recommended that writers
+   *     use IEEE_754_TOTAL_ORDER for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *
-   *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     When writing page or column chunk statistics for columns with
+   *     TYPE_ORDER order, the following rules must be followed:
+   *     - The nan_count field must be set for floating point types, even if
+   *       it is zero.
+   *     - If the nan_count field is set, min and max statistics fields, when
+   *       present, must not contain NaN values and must be computed from
+   *       non-NaN values only. This signals to readers that the min and max
+   *       statistics are reliable for non-NaN values.
+   *     - If all non-null values are NaN, min and max statistics must not be
+   *       written.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
    *       `-0.0` should be written into the min statistics field.
+   *
+   *     When writing column indexes for columns with TYPE_ORDER order, the
+   *     following rules must be followed:
+   *     - NaNs must not be written to min_values or max_values.
+   *     - If all non-null values of a page are NaN, a column index must not
+   *       be written for this column chunk because min_values and max_values
+   *       are required.
+   *     - If the computed max value is zero (whether negative or positive),
+   *       `+0.0` should be written into the corresponding max_values entry.
+   *     - If the computed min value is zero (whether negative or positive),
+   *       `-0.0` should be written into the corresponding min_values entry.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * When writing statistics for columns with IEEE_754_TOTAL_ORDER order, the
+   * following rules must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering.
+   * - Min and max statistics must contain the smallest and largest non-NaN
+   *   values respectively, or if all non-null values are NaN, the smallest and
+   *   largest NaN values as defined by IEEE 754 total order.
+   *
+   * When reading statistics for columns with this order, the following rules
+   * should be followed:
+   * - Readers should consult the nan_count field to determine whether NaNs
+   *   are present.
+   * - A reader can compute nan_count + null_count == num_values to deduce
+   *   whether all non-null values are NaN. In the page index, which does not
+   *   have a num_values field, the presence of a NaN value in min_values
+   *   or max_values indicates that all non-null values are NaN.
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
@@ -1199,6 +1265,18 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated before
    * using them by inspecting null_pages.
+   *
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TYPE_ORDER, then a column index must
+   *   not be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE_754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE_754_TOTAL_ORDER, then min_values[i]
+   *   and max_values[i] of that page must be set to the smallest and largest
+   *   NaN values as defined by IEEE 754 total order.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -1240,6 +1318,15 @@ struct ColumnIndex {
    * Same as repetition_level_histograms except for definitions levels.
    **/
   7: optional list<i64> definition_level_histograms;
+
+  /**
+   * A list containing the number of NaN values for each page. Only present
+   * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+   * If this field is not present, readers MUST assume that there might be
+   * NaN values in any page.
+   */
+  8: optional list<i64> nan_counts
+
 }
 
 struct AesGcmV1 {
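
Reviewer note: the ordering and the reader rules described in the Thrift comments above can be illustrated with a small Python sketch. It is not part of this patch or of any Parquet implementation. The function names (`total_order_key`, `may_contain_nan`, `all_non_null_are_nan`) are hypothetical and exist only for this example, and the sketch assumes 64-bit doubles and that `nan_count`, `null_count`, and `num_values` have already been read from the statistics.

```python
import struct
from typing import Optional

# Mask covering every bit of a 64-bit value except the sign bit.
_LOW_63_BITS = 0x7FFF_FFFF_FFFF_FFFF


def total_order_key(value: float) -> int:
    """Map a double to an int whose natural order matches IEEE 754 totalOrder.

    Hypothetical helper: reinterpret the double as a signed 64-bit integer and,
    for values with the sign bit set, flip the remaining 63 bits so that larger
    magnitudes sort lower.
    """
    (bits,) = struct.unpack("<q", struct.pack("<d", value))
    return bits ^ _LOW_63_BITS if bits < 0 else bits


def may_contain_nan(nan_count: Optional[int]) -> bool:
    """A missing nan_count means NaNs must be assumed to be possibly present."""
    return nan_count is None or nan_count > 0


def all_non_null_are_nan(
    nan_count: Optional[int], null_count: int, num_values: int
) -> bool:
    """Return True only when the statistics prove every non-null value is NaN.

    Without nan_count nothing can be deduced, so the answer is False.
    """
    if nan_count is None:
        return False
    return nan_count + null_count == num_values


if __name__ == "__main__":
    values = [1.5, 0.0, float("inf"), -0.0, float("-inf"), -2.0]
    print(sorted(values, key=total_order_key))
    print(may_contain_nan(None), all_non_null_are_nan(3, 2, 5))
```

Sorting with `key=total_order_key` places `-0.0` before `+0.0`, which an ordinary float comparison cannot do because it treats the two zeros as equal. A reader that prunes row groups on a NaN predicate would consult `may_contain_nan` rather than min and max.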