Describe the bug, including details regarding any error messages, version, and platform.
I am not 100% sure whether this is a bug or just unexpected behaviour.
When an R Date vector contains Inf or -Inf, converting to an Arrow table (e.g. via as_arrow_table() or write_parquet()) silently converts these to extreme, but finite, dates instead of null or raising an error.
Reprex:
library(arrow)
library(tibble)
library(dplyr)
tibble(x = as.Date(c(Inf, -Inf))) |>
  as_arrow_table() |>
  collect()
#> # A tibble: 2 × 1
#> x
#> <date>
#> 1 5881580-07-11
#> 2 -5877641-06-23
Second example (includes NaN):
library(arrow)
library(tibble)
library(dplyr)
chunks <- tibble(x = as.Date(c(Inf, -Inf, NaN))) |>
  as_arrow_table()
chunks$columns
#> [[1]]
#> ChunkedArray
#> <date32[day]>
#> [
#> [
#> <value out of range: 2147483647>,
#> <value out of range: -2147483648>,
#> 1970-01-01
#> ]
#> ]
A final reprex highlights a further problem with the cast to int32: INT_MIN collides with R's NA_integer_, so the -Inf date becomes NA when converted back to integer:
library(arrow)
library(tibble)
library(dplyr)
tibble(x = as.Date(c(Inf, -Inf, NaN))) |>
  as_arrow_table() |>
  collect() |>
  mutate(y = as.integer(x))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `y = as.integer(x)`.
#> Caused by warning:
#> ! NAs introduced by coercion to integer range
#> # A tibble: 3 × 2
#> x y
#> <date> <int>
#> 1 5881580-07-11 2147483647
#> 2 -5877641-06-23 NA
#> 3 1970-01-01 0
Expected behaviour:
I think it would be better if Inf/-Inf dates were either converted to null in the Arrow array, or if the conversion raised an error or warning indicating that infinite date values are not representable in date32.
Root cause:
In https://github.com/apache/arrow/blob/main/r/src/r_to_arrow.cpp#L600, FromRDate for Date32Type does:
static int FromRDate(const Date32Type*, double from) {
  return static_cast<int>(std::floor(from));
}
As far as I understand, static_cast<int>(std::floor(Inf)) is undefined behaviour in C++, because the value cannot be represented in the destination type. In practice, as the output above shows, this produced INT_MAX/INT_MIN, which Arrow then interprets as concrete dates ~5.8 million years from epoch.
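As a rough sanity check on the "~5.8 million years" figure, the sketch below estimates the calendar year for a date32 day count. ApproxYearFromDays and CheckExtremeYears are hypothetical names used only for illustration, not Arrow functions, and the mean Gregorian year length of 365.2425 days is an approximation that is good enough for the year but not for the month or day.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Illustrative helper, not Arrow code: approximate calendar year for a
// date32 value (days since 1970-01-01), using the mean Gregorian year
// length of 365.2425 days. Only the year is estimated, not month or day.
inline long long ApproxYearFromDays(std::int32_t days) {
  constexpr double kMeanGregorianYearDays = 365.2425;
  return static_cast<long long>(
      std::floor(1970.0 + static_cast<double>(days) / kMeanGregorianYearDays));
}

// Usage: the int32 extremes land on the years shown in the reprex output.
inline void CheckExtremeYears() {
  assert(ApproxYearFromDays(std::numeric_limits<std::int32_t>::max()) == 5881580);
  assert(ApproxYearFromDays(std::numeric_limits<std::int32_t>::min()) == -5877641);
}
```

Both results match the years printed in the first reprex (5881580 and -5877641), which is consistent with the cast having produced INT_MAX and INT_MIN.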
A possible fix would be to check for non-finite values before the cast:
static int FromRDate(const Date32Type*, double from) {
  // std::isfinite (from <cmath>) is false for Inf, -Inf and NaN
  if (!std::isfinite(from)) {
    // handle as null or error/warn
  }
  return static_cast<int>(std::floor(from));
}
NaN dates are also affected by the same behaviour. This is arguably more problematic, as the round trip transforms them into 0 (i.e. the epoch, 1970-01-01).
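To make the idea concrete, here is a minimal standalone sketch of a checked conversion that covers Inf, -Inf, NaN and finite values outside the int32 range. It is not a patch against r_to_arrow.cpp: CheckedDaysFromRDate and CheckNonFiniteDates are hypothetical names, and returning std::optional (C++17) is an assumption about how the "null or error" decision could be passed back to the caller.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not part of Arrow: returns the date32 day count for
// a finite, in-range R Date value, and std::nullopt otherwise so that the
// caller can emit a null or raise an error.
inline std::optional<std::int32_t> CheckedDaysFromRDate(double from) {
  // Inf, -Inf and NaN have no date32 representation.
  if (!std::isfinite(from)) return std::nullopt;

  const double days = std::floor(from);

  // Both int32 bounds are exactly representable as double, so the range
  // check can be done before the cast, which keeps the cast well defined.
  constexpr double kMinDays =
      static_cast<double>(std::numeric_limits<std::int32_t>::min());
  constexpr double kMaxDays =
      static_cast<double>(std::numeric_limits<std::int32_t>::max());
  if (days < kMinDays || days > kMaxDays) return std::nullopt;

  return static_cast<std::int32_t>(days);
}

// Usage: the three values from the reprex are all rejected.
inline void CheckNonFiniteDates() {
  const double inf = std::numeric_limits<double>::infinity();
  const double nan = std::numeric_limits<double>::quiet_NaN();
  assert(!CheckedDaysFromRDate(inf).has_value());
  assert(!CheckedDaysFromRDate(-inf).has_value());
  assert(!CheckedDaysFromRDate(nan).has_value());
  assert(CheckedDaysFromRDate(0.5).value() == 0);
}
```

Note that this sketch still accepts a day count of exactly INT_MIN, which is a valid date32 value but reads back as NA_integer_ in R; whether that value should also be rejected is a separate design decision.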
Component(s)
R