Skip to content

Incorrect minimum date conversion #94

@yruslan

Description

@yruslan

Describe the bug

We have data with a numeric column that needs to be converted to date.

When settings are:

pattern = yyyyMMdd
plus_infinity_symbol = 99999
minus_infinity_symbol = 0
plus_infinity_value =  9999-12-31
minus_infinity_value = 1600-01-01

We get 0 converted to 1600-01-01`

When settings are:

pattern = yyyyMMdd
plus_infinity_symbol = 99999
minus_infinity_symbol = 0
plus_infinity_value =  99991231
minus_infinity_value = 16000101

We get 0 converted to 1599-12-31`.

There could be other factors. We are not sure that minus_infinity_value is the only thing that affects the outcome.

Need to confirm:

  • Date pattern yyyyMMdd is compatible with minus_infinity_value = 1600-01-01
  • What causes the incorrect date to be converted and fix it.

To Reproduce

See he description

Expected behavior

Both settings should produce the same date 1600-01-01.

Business Value

--

Screenshots

--

Additional context

Found the culprit. The issue is when the timezone property is defined for a field that has Date data type. Solution: timezone should be ignored when converting from a number to date when time component is not defined.

A test that reproduces the issue (can be added to StandardizationInterpreter_DateSuite.scala:

  test("date pattern from numeric values with infinity") {
    val seq: Seq[Int] = Seq(
      0,
      20260423,
      9999
    )
    val desiredSchema = StructType(Seq(
      StructField(fieldName, DateType, nullable = false,
        new MetadataBuilder()
          .putString(MetadataKeys.Pattern, "yyyyMMdd")
          .putString(MetadataKeys.IsNonStandard, "false")
          .putString(MetadataKeys.MinusInfinitySymbol, "0")
          .putString(MetadataKeys.MinusInfinityValue, "16000101")
          .putString(MetadataKeys.PlusInfinitySymbol, "9999")
          .putString(MetadataKeys.PlusInfinityValue, "99991231")
          .putString(MetadataKeys.DefaultTimeZone, "Africa/Johannesburg")
          .build)
    ))
    val exp: Seq[DateRow] = Seq(
      DateRow(Date.valueOf("1600-01-01")),
      DateRow(Date.valueOf("2026-04-23")),
      DateRow(Date.valueOf("9999-12-31"))
    )

    val src = seq.toDF(fieldName)

    val std = Standardization.standardize(src, desiredSchema).cacheIfNotCachedYet()
    logDataFrameContent(std)

    std.as[DateRow].select(fieldName).collect().toList.foreach(println)

    assertResult(exp)(std.as[DateRow].collect().toList)
  }

It outputs:

[1599-12-31]
[2026-04-22]
[9999-12-30]

Expected:

[1601-01-01]
[2026-04-23]
[9999-12-31]

The conversion query executed when timezone is defined:

s"to_date(to_utc_timestamp(to_timestamp(CAST(`%s` AS STRING), '$pattern'), '$tz'))"

The conversion query executed when timezone is not defined:

s"to_date(CAST(`%s` AS STRING), '$pattern')"

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions