R Date Inf/ -Inf silently corrupted when converting to Arrow date32 #49836

@dragosmg

Description

Describe the bug, including details regarding any error messages, version, and platform.

I am not 100% sure whether this is a bug or just unexpected behaviour.

When an R Date vector contains Inf or -Inf, converting it to an Arrow table (e.g. via as_arrow_table() or write_parquet()) silently turns these values into extreme, but finite, dates instead of producing nulls or raising an error.

Reprex:

library(arrow)
library(tibble)
library(dplyr)

tibble(x = as.Date(c(Inf, -Inf))) |> 
    as_arrow_table() |> 
    collect()
#> # A tibble: 2 × 1
#>   x             
#>   <date>        
#> 1 5881580-07-11 
#> 2 -5877641-06-23

Second example (includes NaN):

library(arrow)
library(tibble)
library(dplyr)

chunks <- tibble(x = as.Date(c(Inf, -Inf, NaN))) |> 
  as_arrow_table()

chunks$columns
#> [[1]]
#> ChunkedArray
#> <date32[day]>
#> [
#>   [
#>     <value out of range: 2147483647>,
#>     <value out of range: -2147483648>,
#>     1970-01-01
#>   ]
#> ]

A final reprex to highlight a problematic side effect of the cast to int32: the -Inf date ends up as INT_MIN, which clashes with R's NA_integer_:

library(arrow)
library(tibble)
library(dplyr)

tibble(x = as.Date(c(Inf, -Inf, NaN))) |> 
  as_arrow_table() |> 
  collect() |> 
  mutate(y = as.integer(x))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `y = as.integer(x)`.
#> Caused by warning:
#> ! NAs introduced by coercion to integer range
#> # A tibble: 3 × 2
#>   x                       y
#>   <date>              <int>
#> 1 5881580-07-11  2147483647
#> 2 -5877641-06-23         NA
#> 3 1970-01-01              0
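
To make the clash above concrete: R represents NA_integer_ as the smallest 32-bit integer, so a date32 value of INT_MIN cannot be told apart from a missing value once it is viewed as an R integer. The sketch below only illustrates that; `kRNaInteger` and `ReadsAsNaIntegerInR` are hypothetical names, not part of R or Arrow.

```cpp
#include <cstdint>
#include <limits>

// R stores NA_integer_ as the smallest 32-bit integer (INT_MIN).
constexpr int32_t kRNaInteger = std::numeric_limits<int32_t>::min();

// Hypothetical helper for illustration: true when a date32 day count would
// be read as NA if it were handed to R as an integer.
inline bool ReadsAsNaIntegerInR(int32_t date32_days) {
  return date32_days == kRNaInteger;
}
```

This matches the output above: the Inf row (INT_MAX) survives as.integer(), while the -Inf row (INT_MIN) comes back as NA.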

Expected behaviour:
I think it would be better if Inf/-Inf dates were either converted to null in the Arrow array, or the conversion raised an error or warning indicating that infinite date values are not representable in date32.

Root cause:
In https://github.com/apache/arrow/blob/main/r/src/r_to_arrow.cpp#L600, FromRDate for Date32Type does:

  static int FromRDate(const Date32Type*, double from) {
    return static_cast<int>(std::floor(from));
  }

As far as I understand, static_cast<int>(std::floor(Inf)) is undefined behaviour in C++, because the value lies outside the range of int. On most platforms this would produce INT_MAX/INT_MIN, which Arrow then interprets as concrete dates ~5.88 million years from the epoch.
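
As a rough check on where the years in the first reprex come from, the day counts can be divided by the mean Gregorian year length. This is only a back-of-the-envelope sketch; `ApproxYearFromDate32` is a hypothetical helper, not an Arrow function, and it can be off by one for values within a day or two of a year boundary.

```cpp
#include <cmath>
#include <cstdint>

// Mean length of a Gregorian year in days (146097 days per 400-year cycle).
constexpr double kDaysPerGregorianYear = 365.2425;

// Hypothetical helper for illustration: estimates the calendar year of a
// date32 value (days since 1970-01-01).
inline int64_t ApproxYearFromDate32(int32_t days) {
  return 1970 + static_cast<int64_t>(std::floor(days / kDaysPerGregorianYear));
}
```

With INT_MAX this gives year 5881580 and with INT_MIN year -5877641, the same years printed in the first reprex.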

A possible fix would be to check for non-finite values before the cast:

  static int FromRDate(const Date32Type*, double from) {
    // std::isfinite (from <cmath>) is false for Inf, -Inf and NaN
    if (!std::isfinite(from)) {
      // handle as null or error/warn
    }
    return static_cast<int>(std::floor(from));
  }
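
A slightly fuller sketch of the same idea, written as a standalone function rather than a patch to r_to_arrow.cpp. It assumes C++17 for std::optional, and `CheckedDate32FromRDate` is a hypothetical name. Besides non-finite inputs it also rejects finite values outside the int32 range, since casting those is undefined behaviour as well.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not part of the Arrow code base: converts an R Date
// (days since 1970-01-01 stored as a double) to a date32 day count.
// Returns std::nullopt when the value cannot be represented, leaving the
// caller to decide between emitting a null, a warning, or an error.
inline std::optional<int32_t> CheckedDate32FromRDate(double from) {
  // Reject NaN, Inf and -Inf before any integer cast happens.
  if (!std::isfinite(from)) {
    return std::nullopt;
  }
  const double days = std::floor(from);
  // Both bounds are exactly representable as doubles, so the comparison
  // below is exact.
  constexpr double kMin =
      static_cast<double>(std::numeric_limits<int32_t>::min());
  constexpr double kMax =
      static_cast<double>(std::numeric_limits<int32_t>::max());
  if (days < kMin || days > kMax) {
    return std::nullopt;
  }
  return static_cast<int32_t>(days);
}
```

Whether an empty result should become a null, a warning, or an error is left open here, as in the snippet above. Note that INT_MIN itself is still accepted, so the NA_integer_ clash would need separate handling.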

NaN dates are affected by the same behaviour. This is arguably more problematic, as the round trip turns them into 0 (i.e. the epoch, 1970-01-01).

Component(s)

R
