Describe the bug
We have data with a numeric column that needs to be converted to date.
When settings are:
pattern = yyyyMMdd
plus_infinity_symbol = 99999
minus_infinity_symbol = 0
plus_infinity_value = 9999-12-31
minus_infinity_value = 1600-01-01
We get 0 converted to 1600-01-01`
When settings are:
pattern = yyyyMMdd
plus_infinity_symbol = 99999
minus_infinity_symbol = 0
plus_infinity_value = 99991231
minus_infinity_value = 16000101
We get 0 converted to 1599-12-31`.
There could be other factors. We are not sure that minus_infinity_value is the only thing that affects the outcome.
Need to confirm:
- Date pattern
yyyyMMdd is compatible with minus_infinity_value = 1600-01-01
- What causes the incorrect date to be converted and fix it.
To Reproduce
See he description
Expected behavior
Both settings should produce the same date 1600-01-01.
Business Value
--
Screenshots
--
Additional context
Found the culprit. The issue is when the timezone property is defined for a field that has Date data type. Solution: timezone should be ignored when converting from a number to date when time component is not defined.
A test that reproduces the issue (can be added to StandardizationInterpreter_DateSuite.scala:
test("date pattern from numeric values with infinity") {
val seq: Seq[Int] = Seq(
0,
20260423,
9999
)
val desiredSchema = StructType(Seq(
StructField(fieldName, DateType, nullable = false,
new MetadataBuilder()
.putString(MetadataKeys.Pattern, "yyyyMMdd")
.putString(MetadataKeys.IsNonStandard, "false")
.putString(MetadataKeys.MinusInfinitySymbol, "0")
.putString(MetadataKeys.MinusInfinityValue, "16000101")
.putString(MetadataKeys.PlusInfinitySymbol, "9999")
.putString(MetadataKeys.PlusInfinityValue, "99991231")
.putString(MetadataKeys.DefaultTimeZone, "Africa/Johannesburg")
.build)
))
val exp: Seq[DateRow] = Seq(
DateRow(Date.valueOf("1600-01-01")),
DateRow(Date.valueOf("2026-04-23")),
DateRow(Date.valueOf("9999-12-31"))
)
val src = seq.toDF(fieldName)
val std = Standardization.standardize(src, desiredSchema).cacheIfNotCachedYet()
logDataFrameContent(std)
std.as[DateRow].select(fieldName).collect().toList.foreach(println)
assertResult(exp)(std.as[DateRow].collect().toList)
}
It outputs:
[1599-12-31]
[2026-04-22]
[9999-12-30]
Expected:
[1601-01-01]
[2026-04-23]
[9999-12-31]
The conversion query executed when timezone is defined:
s"to_date(to_utc_timestamp(to_timestamp(CAST(`%s` AS STRING), '$pattern'), '$tz'))"
The conversion query executed when timezone is not defined:
s"to_date(CAST(`%s` AS STRING), '$pattern')"
Describe the bug
We have data with a numeric column that needs to be converted to
date.When settings are:
We get
0converted to 1600-01-01`When settings are:
We get
0converted to 1599-12-31`.There could be other factors. We are not sure that
minus_infinity_valueis the only thing that affects the outcome.Need to confirm:
yyyyMMddis compatible withminus_infinity_value = 1600-01-01To Reproduce
See he description
Expected behavior
Both settings should produce the same date
1600-01-01.Business Value
--
Screenshots
--
Additional context
Found the culprit. The issue is when the timezone property is defined for a field that has
Datedata type. Solution:timezoneshould be ignored when converting from a number to date when time component is not defined.A test that reproduces the issue (can be added to
StandardizationInterpreter_DateSuite.scala:It outputs:
Expected:
The conversion query executed when timezone is defined:
The conversion query executed when timezone is not defined: